linux-fsdevel.vger.kernel.org archive mirror
* [patch 00/10] first set of vfs scale patches
From: Nick Piggin @ 2010-08-17 18:37 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel

This set does not contain the inode lock patches yet; I've not quite finished
porting them. Even when I do, I think they should get some time in linux-next
and some more time for people to review them before going upstream. So let's
wait for the next release on those?

This patchset contains:
* some misc bits
* rwlock->spinlock for fs_struct lock
* files_lock cleanup
* tty files list bugfix
* files_lock scaling
* vfsmount_lock scaling

These should all be in good shape for review and hopefully merging, so
please let me know if I need to fix anything.

Thanks,
Nick




* [patch 01/10] fs: fix do_lookup false negative
From: Nick Piggin @ 2010-08-17 18:37 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel


fs: fix do_lookup false negative

In do_lookup, if we initially find no dentry, we take the directory i_mutex and
re-check the lookup. If we find a dentry there, then we revalidate it if
needed. However, if that revalidation asks for the dentry to be invalidated, we
return -ENOENT from do_lookup. What should happen instead is an attempt to
allocate and look up a new dentry.

This is probably not noticed because it is rare. It is only reached if a
concurrent create races in first (in which case, the dentry probably won't be
invalidated anyway), or if the racy __d_lookup has failed due to a
false-negative (which is very rare).

Fix this by removing the special-case code and having it use the normal
revalidation path.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/namei.c |   10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c	2010-08-18 04:04:18.000000000 +1000
+++ linux-2.6/fs/namei.c	2010-08-18 04:05:15.000000000 +1000
@@ -709,6 +709,7 @@ static int do_lookup(struct nameidata *n
 	dentry = __d_lookup(nd->path.dentry, name);
 	if (!dentry)
 		goto need_lookup;
+found:
 	if (dentry->d_op && dentry->d_op->d_revalidate)
 		goto need_revalidate;
 done:
@@ -766,14 +767,7 @@ out_unlock:
 	 * we waited on the semaphore. Need to revalidate.
 	 */
 	mutex_unlock(&dir->i_mutex);
-	if (dentry->d_op && dentry->d_op->d_revalidate) {
-		dentry = do_revalidate(dentry, nd);
-		if (!dentry)
-			dentry = ERR_PTR(-ENOENT);
-	}
-	if (IS_ERR(dentry))
-		goto fail;
-	goto done;
+	goto found;
 
 need_revalidate:
 	dentry = do_revalidate(dentry, nd);




* [patch 02/10] fs: dentry allocation consolidation
From: Nick Piggin @ 2010-08-17 18:37 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel


fs: dentry allocation consolidation

There are two duplicate copies of the dentry allocation code in path lookup.
Consolidate them into a single function.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/namei.c |   70 ++++++++++++++++++++++++++++---------------------------------
 1 file changed, 33 insertions(+), 37 deletions(-)

Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c	2010-08-18 04:04:29.000000000 +1000
+++ linux-2.6/fs/namei.c	2010-08-18 04:05:12.000000000 +1000
@@ -686,6 +686,35 @@ static __always_inline void follow_dotdo
 }
 
 /*
+ * Allocate a dentry with name and parent, and perform a parent
+ * directory ->lookup on it. Returns the new dentry, or ERR_PTR
+ * on error. parent->d_inode->i_mutex must be held. d_lookup must
+ * have verified that no child exists while under i_mutex.
+ */
+static struct dentry *d_alloc_and_lookup(struct dentry *parent,
+				struct qstr *name, struct nameidata *nd)
+{
+	struct inode *inode = parent->d_inode;
+	struct dentry *dentry;
+	struct dentry *old;
+
+	/* Don't create child dentry for a dead directory. */
+	if (unlikely(IS_DEADDIR(inode)))
+		return ERR_PTR(-ENOENT);
+
+	dentry = d_alloc(parent, name);
+	if (unlikely(!dentry))
+		return ERR_PTR(-ENOMEM);
+
+	old = inode->i_op->lookup(inode, dentry, nd);
+	if (unlikely(old)) {
+		dput(dentry);
+		dentry = old;
+	}
+	return dentry;
+}
+
+/*
  *  It's more convoluted than I'd like it to be, but... it's still fairly
  *  small and for now I'd prefer to have fast path as straight as possible.
  *  It _is_ time-critical.
@@ -738,30 +767,13 @@ need_lookup:
 	 * so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
 	 */
 	dentry = d_lookup(parent, name);
-	if (!dentry) {
-		struct dentry *new;
-
-		/* Don't create child dentry for a dead directory. */
-		dentry = ERR_PTR(-ENOENT);
-		if (IS_DEADDIR(dir))
-			goto out_unlock;
-
-		new = d_alloc(parent, name);
-		dentry = ERR_PTR(-ENOMEM);
-		if (new) {
-			dentry = dir->i_op->lookup(dir, new, nd);
-			if (dentry)
-				dput(new);
-			else
-				dentry = new;
-		}
-out_unlock:
+	if (likely(!dentry)) {
+		dentry = d_alloc_and_lookup(parent, name, nd);
 		mutex_unlock(&dir->i_mutex);
 		if (IS_ERR(dentry))
 			goto fail;
 		goto done;
 	}
-
 	/*
 	 * Uhhuh! Nasty case: the cache was re-populated while
 	 * we waited on the semaphore. Need to revalidate.
@@ -1135,24 +1147,8 @@ static struct dentry *__lookup_hash(stru
 	if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
 		dentry = do_revalidate(dentry, nd);
 
-	if (!dentry) {
-		struct dentry *new;
-
-		/* Don't create child dentry for a dead directory. */
-		dentry = ERR_PTR(-ENOENT);
-		if (IS_DEADDIR(inode))
-			goto out;
-
-		new = d_alloc(base, name);
-		dentry = ERR_PTR(-ENOMEM);
-		if (!new)
-			goto out;
-		dentry = inode->i_op->lookup(inode, new, nd);
-		if (!dentry)
-			dentry = new;
-		else
-			dput(new);
-	}
+	if (!dentry)
+		dentry = d_alloc_and_lookup(base, name, nd);
 out:
 	return dentry;
 }




* [patch 03/10] apparmor: use task path helpers
From: Nick Piggin @ 2010-08-17 18:37 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel


apparmor: use task path helpers

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
BTW, argh! dcache_lock! Why does everything have to invent its own reverse
path lookup crud and hide it away in its own code? Also, admire the rest of
this beautiful function.

---
 security/apparmor/path.c |    9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

Index: linux-2.6/security/apparmor/path.c
===================================================================
--- linux-2.6.orig/security/apparmor/path.c	2010-08-18 04:04:02.000000000 +1000
+++ linux-2.6/security/apparmor/path.c	2010-08-18 04:04:29.000000000 +1000
@@ -62,19 +62,14 @@ static int d_namespace_path(struct path
 	int deleted, connected;
 	int error = 0;
 
-	/* Get the root we want to resolve too */
+	/* Get the root we want to resolve too, released below */
 	if (flags & PATH_CHROOT_REL) {
 		/* resolve paths relative to chroot */
-		read_lock(&current->fs->lock);
-		root = current->fs->root;
-		/* released below */
-		path_get(&root);
-		read_unlock(&current->fs->lock);
+		get_fs_root(current->fs, &root);
 	} else {
 		/* resolve paths relative to namespace */
 		root.mnt = current->nsproxy->mnt_ns->root;
 		root.dentry = root.mnt->mnt_root;
-		/* released below */
 		path_get(&root);
 	}
 




* [patch 04/10] fs: fs_struct rwlock to spinlock
From: Nick Piggin @ 2010-08-17 18:37 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel


fs: fs_struct rwlock to spinlock

struct fs_struct.lock is an rwlock with the read-side used to protect root and
pwd members while taking references to them. Taking a reference to a path
typically requires just 2 atomic ops, so the critical section is very small.
Parallel read-side operations would have cacheline contention on the lock, the
dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a
real parallelism increase.

Replace it with a spinlock to avoid one or two atomic operations in the
typical path lookup fastpath.
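
[ A minimal usage sketch, not part of the patch: taking a stable
reference to the task's working directory with one of the helpers
changed here. The function is made up for illustration; get_fs_pwd()
and path_put() are real. ]

static void example_use_pwd(void)
{
	struct path pwd;

	get_fs_pwd(current->fs, &pwd);	/* lock, copy, path_get, unlock */
	/* pwd.mnt and pwd.dentry are pinned; fs->lock is not held here */
	path_put(&pwd);			/* drop the references when done */
}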

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 drivers/staging/pohmelfs/path_entry.c |    8 ++++----
 fs/exec.c                             |    4 ++--
 fs/fs_struct.c                        |   32 ++++++++++++++++----------------
 include/linux/fs_struct.h             |   14 +++++++-------
 kernel/fork.c                         |   10 +++++-----
 5 files changed, 34 insertions(+), 34 deletions(-)

Index: linux-2.6/fs/fs_struct.c
===================================================================
--- linux-2.6.orig/fs/fs_struct.c	2010-08-18 04:04:01.000000000 +1000
+++ linux-2.6/fs/fs_struct.c	2010-08-18 04:04:29.000000000 +1000
@@ -13,11 +13,11 @@ void set_fs_root(struct fs_struct *fs, s
 {
 	struct path old_root;
 
-	write_lock(&fs->lock);
+	spin_lock(&fs->lock);
 	old_root = fs->root;
 	fs->root = *path;
 	path_get(path);
-	write_unlock(&fs->lock);
+	spin_unlock(&fs->lock);
 	if (old_root.dentry)
 		path_put(&old_root);
 }
@@ -30,11 +30,11 @@ void set_fs_pwd(struct fs_struct *fs, st
 {
 	struct path old_pwd;
 
-	write_lock(&fs->lock);
+	spin_lock(&fs->lock);
 	old_pwd = fs->pwd;
 	fs->pwd = *path;
 	path_get(path);
-	write_unlock(&fs->lock);
+	spin_unlock(&fs->lock);
 
 	if (old_pwd.dentry)
 		path_put(&old_pwd);
@@ -51,7 +51,7 @@ void chroot_fs_refs(struct path *old_roo
 		task_lock(p);
 		fs = p->fs;
 		if (fs) {
-			write_lock(&fs->lock);
+			spin_lock(&fs->lock);
 			if (fs->root.dentry == old_root->dentry
 			    && fs->root.mnt == old_root->mnt) {
 				path_get(new_root);
@@ -64,7 +64,7 @@ void chroot_fs_refs(struct path *old_roo
 				fs->pwd = *new_root;
 				count++;
 			}
-			write_unlock(&fs->lock);
+			spin_unlock(&fs->lock);
 		}
 		task_unlock(p);
 	} while_each_thread(g, p);
@@ -87,10 +87,10 @@ void exit_fs(struct task_struct *tsk)
 	if (fs) {
 		int kill;
 		task_lock(tsk);
-		write_lock(&fs->lock);
+		spin_lock(&fs->lock);
 		tsk->fs = NULL;
 		kill = !--fs->users;
-		write_unlock(&fs->lock);
+		spin_unlock(&fs->lock);
 		task_unlock(tsk);
 		if (kill)
 			free_fs_struct(fs);
@@ -104,7 +104,7 @@ struct fs_struct *copy_fs_struct(struct
 	if (fs) {
 		fs->users = 1;
 		fs->in_exec = 0;
-		rwlock_init(&fs->lock);
+		spin_lock_init(&fs->lock);
 		fs->umask = old->umask;
 		get_fs_root_and_pwd(old, &fs->root, &fs->pwd);
 	}
@@ -121,10 +121,10 @@ int unshare_fs_struct(void)
 		return -ENOMEM;
 
 	task_lock(current);
-	write_lock(&fs->lock);
+	spin_lock(&fs->lock);
 	kill = !--fs->users;
 	current->fs = new_fs;
-	write_unlock(&fs->lock);
+	spin_unlock(&fs->lock);
 	task_unlock(current);
 
 	if (kill)
@@ -143,7 +143,7 @@ EXPORT_SYMBOL(current_umask);
 /* to be mentioned only in INIT_TASK */
 struct fs_struct init_fs = {
 	.users		= 1,
-	.lock		= __RW_LOCK_UNLOCKED(init_fs.lock),
+	.lock		= __SPIN_LOCK_UNLOCKED(init_fs.lock),
 	.umask		= 0022,
 };
 
@@ -156,14 +156,14 @@ void daemonize_fs_struct(void)
 
 		task_lock(current);
 
-		write_lock(&init_fs.lock);
+		spin_lock(&init_fs.lock);
 		init_fs.users++;
-		write_unlock(&init_fs.lock);
+		spin_unlock(&init_fs.lock);
 
-		write_lock(&fs->lock);
+		spin_lock(&fs->lock);
 		current->fs = &init_fs;
 		kill = !--fs->users;
-		write_unlock(&fs->lock);
+		spin_unlock(&fs->lock);
 
 		task_unlock(current);
 		if (kill)
Index: linux-2.6/include/linux/fs_struct.h
===================================================================
--- linux-2.6.orig/include/linux/fs_struct.h	2010-08-18 04:04:01.000000000 +1000
+++ linux-2.6/include/linux/fs_struct.h	2010-08-18 04:04:29.000000000 +1000
@@ -5,7 +5,7 @@
 
 struct fs_struct {
 	int users;
-	rwlock_t lock;
+	spinlock_t lock;
 	int umask;
 	int in_exec;
 	struct path root, pwd;
@@ -23,29 +23,29 @@ extern int unshare_fs_struct(void);
 
 static inline void get_fs_root(struct fs_struct *fs, struct path *root)
 {
-	read_lock(&fs->lock);
+	spin_lock(&fs->lock);
 	*root = fs->root;
 	path_get(root);
-	read_unlock(&fs->lock);
+	spin_unlock(&fs->lock);
 }
 
 static inline void get_fs_pwd(struct fs_struct *fs, struct path *pwd)
 {
-	read_lock(&fs->lock);
+	spin_lock(&fs->lock);
 	*pwd = fs->pwd;
 	path_get(pwd);
-	read_unlock(&fs->lock);
+	spin_unlock(&fs->lock);
 }
 
 static inline void get_fs_root_and_pwd(struct fs_struct *fs, struct path *root,
 				       struct path *pwd)
 {
-	read_lock(&fs->lock);
+	spin_lock(&fs->lock);
 	*root = fs->root;
 	path_get(root);
 	*pwd = fs->pwd;
 	path_get(pwd);
-	read_unlock(&fs->lock);
+	spin_unlock(&fs->lock);
 }
 
 #endif /* _LINUX_FS_STRUCT_H */
Index: linux-2.6/fs/exec.c
===================================================================
--- linux-2.6.orig/fs/exec.c	2010-08-18 04:04:01.000000000 +1000
+++ linux-2.6/fs/exec.c	2010-08-18 04:04:29.000000000 +1000
@@ -1117,7 +1117,7 @@ int check_unsafe_exec(struct linux_binpr
 	bprm->unsafe = tracehook_unsafe_exec(p);
 
 	n_fs = 1;
-	write_lock(&p->fs->lock);
+	spin_lock(&p->fs->lock);
 	rcu_read_lock();
 	for (t = next_thread(p); t != p; t = next_thread(t)) {
 		if (t->fs == p->fs)
@@ -1134,7 +1134,7 @@ int check_unsafe_exec(struct linux_binpr
 			res = 1;
 		}
 	}
-	write_unlock(&p->fs->lock);
+	spin_unlock(&p->fs->lock);
 
 	return res;
 }
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2010-08-18 04:04:01.000000000 +1000
+++ linux-2.6/kernel/fork.c	2010-08-18 04:04:29.000000000 +1000
@@ -752,13 +752,13 @@ static int copy_fs(unsigned long clone_f
 	struct fs_struct *fs = current->fs;
 	if (clone_flags & CLONE_FS) {
 		/* tsk->fs is already what we want */
-		write_lock(&fs->lock);
+		spin_lock(&fs->lock);
 		if (fs->in_exec) {
-			write_unlock(&fs->lock);
+			spin_unlock(&fs->lock);
 			return -EAGAIN;
 		}
 		fs->users++;
-		write_unlock(&fs->lock);
+		spin_unlock(&fs->lock);
 		return 0;
 	}
 	tsk->fs = copy_fs_struct(fs);
@@ -1676,13 +1676,13 @@ SYSCALL_DEFINE1(unshare, unsigned long,
 
 		if (new_fs) {
 			fs = current->fs;
-			write_lock(&fs->lock);
+			spin_lock(&fs->lock);
 			current->fs = new_fs;
 			if (--fs->users)
 				new_fs = NULL;
 			else
 				new_fs = fs;
-			write_unlock(&fs->lock);
+			spin_unlock(&fs->lock);
 		}
 
 		if (new_mm) {
Index: linux-2.6/drivers/staging/pohmelfs/path_entry.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/path_entry.c	2010-08-18 04:04:01.000000000 +1000
+++ linux-2.6/drivers/staging/pohmelfs/path_entry.c	2010-08-18 04:04:29.000000000 +1000
@@ -44,9 +44,9 @@ int pohmelfs_construct_path_string(struc
 		return -ENOENT;
 	}
 
-	read_lock(&current->fs->lock);
+	spin_lock(&current->fs->lock);
 	path.mnt = mntget(current->fs->root.mnt);
-	read_unlock(&current->fs->lock);
+	spin_unlock(&current->fs->lock);
 
 	path.dentry = d;
 
@@ -91,9 +91,9 @@ int pohmelfs_path_length(struct pohmelfs
 		return -ENOENT;
 	}
 
-	read_lock(&current->fs->lock);
+	spin_lock(&current->fs->lock);
 	root = dget(current->fs->root.dentry);
-	read_unlock(&current->fs->lock);
+	spin_unlock(&current->fs->lock);
 
 	spin_lock(&dcache_lock);
 




* [patch 05/10] fs: remove extra lookup in __lookup_hash
From: Nick Piggin @ 2010-08-17 18:37 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel


fs: remove extra lookup in __lookup_hash

Optimize lookup for create operations, where finding no dentry is often the
common case. In cases where it is not, such as unlink, the added overhead is
much smaller than what is removed.

Also, move comments about __d_lookup raciness to the __d_lookup call site:
d_lookup is intuitive; __d_lookup is what needs commenting. In that same vein,
add kerneldoc comments to __d_lookup and clean up some of the comments (a
sketch of the intended calling pattern follows the list below):

- We are interested in how the RCU lookup works here, particularly with
  renames. Make that explicit, and point to the document where it is explained
  in more detail.
- RCU is pretty standard now, and macros make implementations pretty mindless.
  If we want to know about RCU barrier details, we look in RCU code.
- Delete some boring legacy comments because we don't care much about how the
  code used to work, more about the interesting parts of how it works now. So
  comments about lazy LRU may be interesting, but would better be done in the
  LRU or refcount management code.
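
[ For reference, a condensed sketch of the intended calling pattern for
the racy __d_lookup(), as do_lookup() uses it after this series (parent
is nd->path.dentry, dir is parent->d_inode): ]

	/* Lockless first try; may return a false negative if an
	 * unrelated rename is in progress, so recheck on failure. */
	dentry = __d_lookup(parent, name);
	if (!dentry) {
		mutex_lock(&dir->i_mutex);
		/* Non-racy recheck: d_lookup() takes the rename_lock
		 * seqlock and cannot miss an existing dentry. */
		dentry = d_lookup(parent, name);
		if (!dentry)
			dentry = d_alloc_and_lookup(parent, name, nd);
		mutex_unlock(&dir->i_mutex);
	}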

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/dcache.c |   60 +++++++++++++++++++++++++++++++++++-------------------------
 fs/namei.c  |   32 ++++++++++++++++----------------
 2 files changed, 51 insertions(+), 41 deletions(-)

Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c	2010-08-18 04:04:29.000000000 +1000
+++ linux-2.6/fs/namei.c	2010-08-18 04:05:07.000000000 +1000
@@ -735,6 +735,11 @@ static int do_lookup(struct nameidata *n
 			return err;
 	}
 
+	/*
+	 * Rename seqlock is not required here because in the off chance
+	 * of a false negative due to a concurrent rename, we're going to
+	 * do the non-racy lookup, below.
+	 */
 	dentry = __d_lookup(nd->path.dentry, name);
 	if (!dentry)
 		goto need_lookup;
@@ -754,17 +759,13 @@ need_lookup:
 	mutex_lock(&dir->i_mutex);
 	/*
 	 * First re-do the cached lookup just in case it was created
-	 * while we waited for the directory semaphore..
-	 *
-	 * FIXME! This could use version numbering or similar to
-	 * avoid unnecessary cache lookups.
-	 *
-	 * The "dcache_lock" is purely to protect the RCU list walker
-	 * from concurrent renames at this point (we mustn't get false
-	 * negatives from the RCU list walk here, unlike the optimistic
-	 * fast walk).
+	 * while we waited for the directory semaphore, or the first
+	 * lookup failed due to an unrelated rename.
 	 *
-	 * so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
+	 * This could use version numbering or similar to avoid unnecessary
+	 * cache lookups, but then we'd have to do the first lookup in the
+	 * non-racy way. However in the common case here, everything should
+	 * be hot in cache, so would it be a big win?
 	 */
 	dentry = d_lookup(parent, name);
 	if (likely(!dentry)) {
@@ -1136,13 +1137,12 @@ static struct dentry *__lookup_hash(stru
 			goto out;
 	}
 
-	dentry = __d_lookup(base, name);
-
-	/* lockess __d_lookup may fail due to concurrent d_move()
-	 * in some unrelated directory, so try with d_lookup
+	/*
+	 * Don't bother with __d_lookup: callers are for creat as
+	 * well as unlink, so a lot of the time it would cost
+	 * a double lookup.
 	 */
-	if (!dentry)
-		dentry = d_lookup(base, name);
+	dentry = d_lookup(base, name);
 
 	if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
 		dentry = do_revalidate(dentry, nd);
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c	2010-08-18 04:04:01.000000000 +1000
+++ linux-2.6/fs/dcache.c	2010-08-18 04:05:07.000000000 +1000
@@ -1332,31 +1332,13 @@ EXPORT_SYMBOL(d_add_ci);
  * d_lookup - search for a dentry
  * @parent: parent dentry
  * @name: qstr of name we wish to find
+ * Returns: dentry, or NULL
  *
- * Searches the children of the parent dentry for the name in question. If
- * the dentry is found its reference count is incremented and the dentry
- * is returned. The caller must use dput to free the entry when it has
- * finished using it. %NULL is returned on failure.
- *
- * __d_lookup is dcache_lock free. The hash list is protected using RCU.
- * Memory barriers are used while updating and doing lockless traversal. 
- * To avoid races with d_move while rename is happening, d_lock is used.
- *
- * Overflows in memcmp(), while d_move, are avoided by keeping the length
- * and name pointer in one structure pointed by d_qstr.
- *
- * rcu_read_lock() and rcu_read_unlock() are used to disable preemption while
- * lookup is going on.
- *
- * The dentry unused LRU is not updated even if lookup finds the required dentry
- * in there. It is updated in places such as prune_dcache, shrink_dcache_sb,
- * select_parent and __dget_locked. This laziness saves lookup from dcache_lock
- * acquisition.
- *
- * d_lookup() is protected against the concurrent renames in some unrelated
- * directory using the seqlockt_t rename_lock.
+ * d_lookup searches the children of the parent dentry for the name in
+ * question. If the dentry is found its reference count is incremented and the
+ * dentry is returned. The caller must use dput to free the entry when it has
+ * finished using it. %NULL is returned if the dentry does not exist.
  */
-
 struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
 {
 	struct dentry * dentry = NULL;
@@ -1372,6 +1354,21 @@ struct dentry * d_lookup(struct dentry *
 }
 EXPORT_SYMBOL(d_lookup);
 
+/*
+ * __d_lookup - search for a dentry (racy)
+ * @parent: parent dentry
+ * @name: qstr of name we wish to find
+ * Returns: dentry, or NULL
+ *
+ * __d_lookup is like d_lookup, however it may (rarely) return a
+ * false-negative result due to unrelated rename activity.
+ *
+ * __d_lookup is slightly faster by avoiding rename_lock read seqlock,
+ * however it must be used carefully, eg. with a following d_lookup in
+ * the case of failure.
+ *
+ * __d_lookup callers must be commented.
+ */
 struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 {
 	unsigned int len = name->len;
@@ -1382,6 +1379,19 @@ struct dentry * __d_lookup(struct dentry
 	struct hlist_node *node;
 	struct dentry *dentry;
 
+	/*
+	 * The hash list is protected using RCU.
+	 *
+	 * Take d_lock when comparing a candidate dentry, to avoid races
+	 * with d_move().
+	 *
+	 * It is possible that concurrent renames can mess up our list
+	 * walk here and result in missing our dentry, resulting in the
+	 * false-negative result. d_lookup() protects against concurrent
+	 * renames using rename_lock seqlock.
+	 *
+	 * See Documentation/vfs/dcache-locking.txt for more details.
+	 */
 	rcu_read_lock();
 	
 	hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
@@ -1396,8 +1406,8 @@ struct dentry * __d_lookup(struct dentry
 
 		/*
 		 * Recheck the dentry after taking the lock - d_move may have
-		 * changed things.  Don't bother checking the hash because we're
-		 * about to compare the whole name anyway.
+		 * changed things. Don't bother checking the hash because
+		 * we're about to compare the whole name anyway.
 		 */
 		if (dentry->d_parent != parent)
 			goto next;




* [patch 06/10] fs: cleanup files_lock locking
From: Nick Piggin @ 2010-08-17 18:37 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Alan Cox,
	Andi Kleen, Greg Kroah-Hartman


fs: cleanup files_lock locking

Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
manipulate the per-sb files list; unexport the files_lock spinlock.

Cc: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 drivers/char/pty.c       |    6 +++++-
 drivers/char/tty_io.c    |   26 ++++++++++++++++++--------
 fs/file_table.c          |   42 ++++++++++++++++++------------------------
 fs/open.c                |    4 ++--
 include/linux/fs.h       |    7 ++-----
 include/linux/tty.h      |    1 +
 security/selinux/hooks.c |    4 ++--
 7 files changed, 48 insertions(+), 42 deletions(-)

Index: linux-2.6/security/selinux/hooks.c
===================================================================
--- linux-2.6.orig/security/selinux/hooks.c	2010-08-18 04:04:01.000000000 +1000
+++ linux-2.6/security/selinux/hooks.c	2010-08-18 04:05:10.000000000 +1000
@@ -2170,7 +2170,7 @@ static inline void flush_unauthorized_fi
 
 	tty = get_current_tty();
 	if (tty) {
-		file_list_lock();
+		spin_lock(&tty_files_lock);
 		if (!list_empty(&tty->tty_files)) {
 			struct inode *inode;
 
@@ -2186,7 +2186,7 @@ static inline void flush_unauthorized_fi
 				drop_tty = 1;
 			}
 		}
-		file_list_unlock();
+		spin_unlock(&tty_files_lock);
 		tty_kref_put(tty);
 	}
 	/* Reset controlling tty. */
Index: linux-2.6/drivers/char/pty.c
===================================================================
--- linux-2.6.orig/drivers/char/pty.c	2010-08-18 04:04:01.000000000 +1000
+++ linux-2.6/drivers/char/pty.c	2010-08-18 04:05:10.000000000 +1000
@@ -676,7 +676,11 @@ static int ptmx_open(struct inode *inode
 
 	set_bit(TTY_PTY_LOCK, &tty->flags); /* LOCK THE SLAVE */
 	filp->private_data = tty;
-	file_move(filp, &tty->tty_files);
+
+	file_sb_list_del(filp); /* __dentry_open has put it on the sb list */
+	spin_lock(&tty_files_lock);
+	list_add(&filp->f_u.fu_list, &tty->tty_files);
+	spin_unlock(&tty_files_lock);
 
 	retval = devpts_pty_new(inode, tty->link);
 	if (retval)
Index: linux-2.6/drivers/char/tty_io.c
===================================================================
--- linux-2.6.orig/drivers/char/tty_io.c	2010-08-18 04:04:01.000000000 +1000
+++ linux-2.6/drivers/char/tty_io.c	2010-08-18 04:05:10.000000000 +1000
@@ -136,6 +136,9 @@ LIST_HEAD(tty_drivers);			/* linked list
 DEFINE_MUTEX(tty_mutex);
 EXPORT_SYMBOL(tty_mutex);
 
+/* Spinlock to protect the tty->tty_files list */
+DEFINE_SPINLOCK(tty_files_lock);
+
 static ssize_t tty_read(struct file *, char __user *, size_t, loff_t *);
 static ssize_t tty_write(struct file *, const char __user *, size_t, loff_t *);
 ssize_t redirected_tty_write(struct file *, const char __user *,
@@ -235,11 +238,11 @@ static int check_tty_count(struct tty_st
 	struct list_head *p;
 	int count = 0;
 
-	file_list_lock();
+	spin_lock(&tty_files_lock);
 	list_for_each(p, &tty->tty_files) {
 		count++;
 	}
-	file_list_unlock();
+	spin_unlock(&tty_files_lock);
 	if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
 	    tty->driver->subtype == PTY_TYPE_SLAVE &&
 	    tty->link && tty->link->count)
@@ -519,7 +522,7 @@ void __tty_hangup(struct tty_struct *tty
 	   workqueue with the lock held */
 	check_tty_count(tty, "tty_hangup");
 
-	file_list_lock();
+	spin_lock(&tty_files_lock);
 	/* This breaks for file handles being sent over AF_UNIX sockets ? */
 	list_for_each_entry(filp, &tty->tty_files, f_u.fu_list) {
 		if (filp->f_op->write == redirected_tty_write)
@@ -530,7 +533,7 @@ void __tty_hangup(struct tty_struct *tty
 		__tty_fasync(-1, filp, 0);	/* can't block */
 		filp->f_op = &hung_up_tty_fops;
 	}
-	file_list_unlock();
+	spin_unlock(&tty_files_lock);
 
 	tty_ldisc_hangup(tty);
 
@@ -1424,9 +1427,9 @@ static void release_one_tty(struct work_
 	tty_driver_kref_put(driver);
 	module_put(driver->owner);
 
-	file_list_lock();
+	spin_lock(&tty_files_lock);
 	list_del_init(&tty->tty_files);
-	file_list_unlock();
+	spin_unlock(&tty_files_lock);
 
 	put_pid(tty->pgrp);
 	put_pid(tty->session);
@@ -1671,7 +1674,10 @@ int tty_release(struct inode *inode, str
 	 *  - do_tty_hangup no longer sees this file descriptor as
 	 *    something that needs to be handled for hangups.
 	 */
-	file_kill(filp);
+	spin_lock(&tty_files_lock);
+	BUG_ON(list_empty(&filp->f_u.fu_list));
+	list_del_init(&filp->f_u.fu_list);
+	spin_unlock(&tty_files_lock);
 	filp->private_data = NULL;
 
 	/*
@@ -1840,7 +1846,11 @@ got_driver:
 	}
 
 	filp->private_data = tty;
-	file_move(filp, &tty->tty_files);
+	BUG_ON(list_empty(&filp->f_u.fu_list));
+	file_sb_list_del(filp); /* __dentry_open has put it on the sb list */
+	spin_lock(&tty_files_lock);
+	list_add(&filp->f_u.fu_list, &tty->tty_files);
+	spin_unlock(&tty_files_lock);
 	check_tty_count(tty, "tty_open");
 	if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
 	    tty->driver->subtype == PTY_TYPE_MASTER)
Index: linux-2.6/fs/file_table.c
===================================================================
--- linux-2.6.orig/fs/file_table.c	2010-08-18 04:04:01.000000000 +1000
+++ linux-2.6/fs/file_table.c	2010-08-18 04:05:09.000000000 +1000
@@ -32,8 +32,7 @@ struct files_stat_struct files_stat = {
 	.max_files = NR_FILE
 };
 
-/* public. Not pretty! */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
+static __cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
 
 /* SLAB cache for file structures */
 static struct kmem_cache *filp_cachep __read_mostly;
@@ -249,7 +248,7 @@ static void __fput(struct file *file)
 		cdev_put(inode->i_cdev);
 	fops_put(file->f_op);
 	put_pid(file->f_owner.pid);
-	file_kill(file);
+	file_sb_list_del(file);
 	if (file->f_mode & FMODE_WRITE)
 		drop_file_write_access(file);
 	file->f_path.dentry = NULL;
@@ -328,31 +327,29 @@ struct file *fget_light(unsigned int fd,
 	return file;
 }
 
-
 void put_filp(struct file *file)
 {
 	if (atomic_long_dec_and_test(&file->f_count)) {
 		security_file_free(file);
-		file_kill(file);
+		file_sb_list_del(file);
 		file_free(file);
 	}
 }
 
-void file_move(struct file *file, struct list_head *list)
+void file_sb_list_add(struct file *file, struct super_block *sb)
 {
-	if (!list)
-		return;
-	file_list_lock();
-	list_move(&file->f_u.fu_list, list);
-	file_list_unlock();
+	spin_lock(&files_lock);
+	BUG_ON(!list_empty(&file->f_u.fu_list));
+	list_add(&file->f_u.fu_list, &sb->s_files);
+	spin_unlock(&files_lock);
 }
 
-void file_kill(struct file *file)
+void file_sb_list_del(struct file *file)
 {
 	if (!list_empty(&file->f_u.fu_list)) {
-		file_list_lock();
+		spin_lock(&files_lock);
 		list_del_init(&file->f_u.fu_list);
-		file_list_unlock();
+		spin_unlock(&files_lock);
 	}
 }
 
@@ -361,7 +358,7 @@ int fs_may_remount_ro(struct super_block
 	struct file *file;
 
 	/* Check that no files are currently opened for writing. */
-	file_list_lock();
+	spin_lock(&files_lock);
 	list_for_each_entry(file, &sb->s_files, f_u.fu_list) {
 		struct inode *inode = file->f_path.dentry->d_inode;
 
@@ -373,10 +370,10 @@ int fs_may_remount_ro(struct super_block
 		if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
 			goto too_bad;
 	}
-	file_list_unlock();
+	spin_unlock(&files_lock);
 	return 1; /* Tis' cool bro. */
 too_bad:
-	file_list_unlock();
+	spin_unlock(&files_lock);
 	return 0;
 }
 
@@ -392,7 +389,7 @@ void mark_files_ro(struct super_block *s
 	struct file *f;
 
 retry:
-	file_list_lock();
+	spin_lock(&files_lock);
 	list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
 		struct vfsmount *mnt;
 		if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
@@ -408,16 +405,13 @@ retry:
 			continue;
 		file_release_write(f);
 		mnt = mntget(f->f_path.mnt);
-		file_list_unlock();
-		/*
-		 * This can sleep, so we can't hold
-		 * the file_list_lock() spinlock.
-		 */
+		/* This can sleep, so we can't hold the spinlock. */
+		spin_unlock(&files_lock);
 		mnt_drop_write(mnt);
 		mntput(mnt);
 		goto retry;
 	}
-	file_list_unlock();
+	spin_unlock(&files_lock);
 }
 
 void __init files_init(unsigned long mempages)
Index: linux-2.6/fs/open.c
===================================================================
--- linux-2.6.orig/fs/open.c	2010-08-18 04:04:01.000000000 +1000
+++ linux-2.6/fs/open.c	2010-08-18 04:04:29.000000000 +1000
@@ -675,7 +675,7 @@ static struct file *__dentry_open(struct
 	f->f_path.mnt = mnt;
 	f->f_pos = 0;
 	f->f_op = fops_get(inode->i_fop);
-	file_move(f, &inode->i_sb->s_files);
+	file_sb_list_add(f, inode->i_sb);
 
 	error = security_dentry_open(f, cred);
 	if (error)
@@ -721,7 +721,7 @@ cleanup_all:
 			mnt_drop_write(mnt);
 		}
 	}
-	file_kill(f);
+	file_sb_list_del(f);
 	f->f_path.dentry = NULL;
 	f->f_path.mnt = NULL;
 cleanup_file:
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-08-18 04:04:01.000000000 +1000
+++ linux-2.6/include/linux/fs.h	2010-08-18 04:05:10.000000000 +1000
@@ -953,9 +953,6 @@ struct file {
 	unsigned long f_mnt_write_state;
 #endif
 };
-extern spinlock_t files_lock;
-#define file_list_lock() spin_lock(&files_lock);
-#define file_list_unlock() spin_unlock(&files_lock);
 
 #define get_file(x)	atomic_long_inc(&(x)->f_count)
 #define fput_atomic(x)	atomic_long_add_unless(&(x)->f_count, -1, 1)
@@ -2197,8 +2194,8 @@ static inline void insert_inode_hash(str
 	__insert_inode_hash(inode, inode->i_ino);
 }
 
-extern void file_move(struct file *f, struct list_head *list);
-extern void file_kill(struct file *f);
+extern void file_sb_list_add(struct file *f, struct super_block *sb);
+extern void file_sb_list_del(struct file *f);
 #ifdef CONFIG_BLOCK
 extern void submit_bio(int, struct bio *);
 extern int bdev_read_only(struct block_device *);
Index: linux-2.6/include/linux/tty.h
===================================================================
--- linux-2.6.orig/include/linux/tty.h	2010-08-18 04:04:01.000000000 +1000
+++ linux-2.6/include/linux/tty.h	2010-08-18 04:05:10.000000000 +1000
@@ -470,6 +470,7 @@ extern struct tty_struct *tty_pair_get_t
 extern struct tty_struct *tty_pair_get_pty(struct tty_struct *tty);
 
 extern struct mutex tty_mutex;
+extern spinlock_t tty_files_lock;
 
 extern void tty_write_unlock(struct tty_struct *tty);
 extern int tty_write_lock(struct tty_struct *tty, int ndelay);


* [patch 07/10] tty: fix fu_list abuse
From: Nick Piggin @ 2010-08-17 18:37 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Alan Cox,
	Greg Kroah-Hartman


tty: fix fu_list abuse

tty code abuses fu_list, which causes a bug in remount,ro handling.

If a tty device node is opened on a filesystem and the last link to the inode
is then removed, the filesystem will be allowed to be remounted read-only. This
is because fs_may_remount_ro does not find the zero-link tty inode on the file
sb list (the tty code incorrectly removed it from that list to use for its own
purposes). This can result in a filesystem with errors after it is marked
"clean".

Taking the idea from Christoph's initial patch, allocate a tty private struct
at file->private_data and put the required list fields in there, linking
file and tty. This makes tty nodes behave the same way as other device nodes,
avoids meddling with the vfs, and fixes this bug.

The error handling is not trivial in the tty code, so for this bugfix, I take
the simple approach of using __GFP_NOFAIL and don't worry about memory errors.
This is not a problem because our allocator doesn't fail small allocs as a rule
anyway. So proper error handling is left as an exercise for tty hackers.

[ Arguably the filesystem's device inode would ideally be divorced from the
driver's pseudo inode when it is opened, but in practice it's not clear whether
that will ever be worth implementing. ]

Cc: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 drivers/char/pty.c       |    6 ---
 drivers/char/tty_io.c    |   84 ++++++++++++++++++++++++++++++-----------------
 fs/internal.h            |    2 +
 include/linux/fs.h       |    2 -
 include/linux/tty.h      |    8 ++++
 security/selinux/hooks.c |    5 ++
 6 files changed, 69 insertions(+), 38 deletions(-)

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-08-18 04:04:29.000000000 +1000
+++ linux-2.6/include/linux/fs.h	2010-08-18 04:05:09.000000000 +1000
@@ -2194,8 +2194,6 @@ static inline void insert_inode_hash(str
 	__insert_inode_hash(inode, inode->i_ino);
 }
 
-extern void file_sb_list_add(struct file *f, struct super_block *sb);
-extern void file_sb_list_del(struct file *f);
 #ifdef CONFIG_BLOCK
 extern void submit_bio(int, struct bio *);
 extern int bdev_read_only(struct block_device *);
Index: linux-2.6/drivers/char/pty.c
===================================================================
--- linux-2.6.orig/drivers/char/pty.c	2010-08-18 04:04:29.000000000 +1000
+++ linux-2.6/drivers/char/pty.c	2010-08-18 04:04:29.000000000 +1000
@@ -675,12 +675,8 @@ static int ptmx_open(struct inode *inode
 	}
 
 	set_bit(TTY_PTY_LOCK, &tty->flags); /* LOCK THE SLAVE */
-	filp->private_data = tty;
 
-	file_sb_list_del(filp); /* __dentry_open has put it on the sb list */
-	spin_lock(&tty_files_lock);
-	list_add(&filp->f_u.fu_list, &tty->tty_files);
-	spin_unlock(&tty_files_lock);
+	tty_add_file(tty, filp);
 
 	retval = devpts_pty_new(inode, tty->link);
 	if (retval)
Index: linux-2.6/drivers/char/tty_io.c
===================================================================
--- linux-2.6.orig/drivers/char/tty_io.c	2010-08-18 04:04:29.000000000 +1000
+++ linux-2.6/drivers/char/tty_io.c	2010-08-18 04:04:29.000000000 +1000
@@ -188,6 +188,41 @@ void free_tty_struct(struct tty_struct *
 	kfree(tty);
 }
 
+static inline struct tty_struct *file_tty(struct file *file)
+{
+	return ((struct tty_file_private *)file->private_data)->tty;
+}
+
+/* Associate a new file with the tty structure */
+void tty_add_file(struct tty_struct *tty, struct file *file)
+{
+	struct tty_file_private *priv;
+
+	/* XXX: must implement proper error handling in callers */
+	priv = kmalloc(sizeof(*priv), GFP_KERNEL|__GFP_NOFAIL);
+
+	priv->tty = tty;
+	priv->file = file;
+	file->private_data = priv;
+
+	spin_lock(&tty_files_lock);
+	list_add(&priv->list, &tty->tty_files);
+	spin_unlock(&tty_files_lock);
+}
+
+/* Delete file from its tty */
+void tty_del_file(struct file *file)
+{
+	struct tty_file_private *priv = file->private_data;
+
+	spin_lock(&tty_files_lock);
+	list_del(&priv->list);
+	spin_unlock(&tty_files_lock);
+	file->private_data = NULL;
+	kfree(priv);
+}
+
+
 #define TTY_NUMBER(tty) ((tty)->index + (tty)->driver->name_base)
 
 /**
@@ -500,6 +535,7 @@ void __tty_hangup(struct tty_struct *tty
 	struct file *cons_filp = NULL;
 	struct file *filp, *f = NULL;
 	struct task_struct *p;
+	struct tty_file_private *priv;
 	int    closecount = 0, n;
 	unsigned long flags;
 	int refs = 0;
@@ -509,7 +545,7 @@ void __tty_hangup(struct tty_struct *tty
 
 
 	spin_lock(&redirect_lock);
-	if (redirect && redirect->private_data == tty) {
+	if (redirect && file_tty(redirect) == tty) {
 		f = redirect;
 		redirect = NULL;
 	}
@@ -524,7 +560,8 @@ void __tty_hangup(struct tty_struct *tty
 
 	spin_lock(&tty_files_lock);
 	/* This breaks for file handles being sent over AF_UNIX sockets ? */
-	list_for_each_entry(filp, &tty->tty_files, f_u.fu_list) {
+	list_for_each_entry(priv, &tty->tty_files, list) {
+		filp = priv->file;
 		if (filp->f_op->write == redirected_tty_write)
 			cons_filp = filp;
 		if (filp->f_op->write != tty_write)
@@ -892,12 +929,10 @@ static ssize_t tty_read(struct file *fil
 			loff_t *ppos)
 {
 	int i;
-	struct tty_struct *tty;
-	struct inode *inode;
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct tty_struct *tty = file_tty(file);
 	struct tty_ldisc *ld;
 
-	tty = file->private_data;
-	inode = file->f_path.dentry->d_inode;
 	if (tty_paranoia_check(tty, inode, "tty_read"))
 		return -EIO;
 	if (!tty || (test_bit(TTY_IO_ERROR, &tty->flags)))
@@ -1068,12 +1103,11 @@ void tty_write_message(struct tty_struct
 static ssize_t tty_write(struct file *file, const char __user *buf,
 						size_t count, loff_t *ppos)
 {
-	struct tty_struct *tty;
 	struct inode *inode = file->f_path.dentry->d_inode;
+	struct tty_struct *tty = file_tty(file);
+ 	struct tty_ldisc *ld;
 	ssize_t ret;
-	struct tty_ldisc *ld;
 
-	tty = file->private_data;
 	if (tty_paranoia_check(tty, inode, "tty_write"))
 		return -EIO;
 	if (!tty || !tty->ops->write ||
@@ -1510,13 +1544,13 @@ static void release_tty(struct tty_struc
 
 int tty_release(struct inode *inode, struct file *filp)
 {
-	struct tty_struct *tty, *o_tty;
+	struct tty_struct *tty = file_tty(filp);
+	struct tty_struct *o_tty;
 	int	pty_master, tty_closing, o_tty_closing, do_sleep;
 	int	devpts;
 	int	idx;
 	char	buf[64];
 
-	tty = filp->private_data;
 	if (tty_paranoia_check(tty, inode, "tty_release_dev"))
 		return 0;
 
@@ -1674,11 +1708,7 @@ int tty_release(struct inode *inode, str
 	 *  - do_tty_hangup no longer sees this file descriptor as
 	 *    something that needs to be handled for hangups.
 	 */
-	spin_lock(&tty_files_lock);
-	BUG_ON(list_empty(&filp->f_u.fu_list));
-	list_del_init(&filp->f_u.fu_list);
-	spin_unlock(&tty_files_lock);
-	filp->private_data = NULL;
+	tty_del_file(filp);
 
 	/*
 	 * Perform some housekeeping before deciding whether to return.
@@ -1845,12 +1875,8 @@ got_driver:
 		return PTR_ERR(tty);
 	}
 
-	filp->private_data = tty;
-	BUG_ON(list_empty(&filp->f_u.fu_list));
-	file_sb_list_del(filp); /* __dentry_open has put it on the sb list */
-	spin_lock(&tty_files_lock);
-	list_add(&filp->f_u.fu_list, &tty->tty_files);
-	spin_unlock(&tty_files_lock);
+	tty_add_file(tty, filp);
+
 	check_tty_count(tty, "tty_open");
 	if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
 	    tty->driver->subtype == PTY_TYPE_MASTER)
@@ -1926,11 +1952,10 @@ got_driver:
 
 static unsigned int tty_poll(struct file *filp, poll_table *wait)
 {
-	struct tty_struct *tty;
+	struct tty_struct *tty = file_tty(filp);
 	struct tty_ldisc *ld;
 	int ret = 0;
 
-	tty = filp->private_data;
 	if (tty_paranoia_check(tty, filp->f_path.dentry->d_inode, "tty_poll"))
 		return 0;
 
@@ -1943,11 +1968,10 @@ static unsigned int tty_poll(struct file
 
 static int __tty_fasync(int fd, struct file *filp, int on)
 {
-	struct tty_struct *tty;
+	struct tty_struct *tty = file_tty(filp);
 	unsigned long flags;
 	int retval = 0;
 
-	tty = filp->private_data;
 	if (tty_paranoia_check(tty, filp->f_path.dentry->d_inode, "tty_fasync"))
 		goto out;
 
@@ -2501,13 +2525,13 @@ EXPORT_SYMBOL(tty_pair_get_pty);
  */
 long tty_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
-	struct tty_struct *tty, *real_tty;
+	struct tty_struct *tty = file_tty(file);
+	struct tty_struct *real_tty;
 	void __user *p = (void __user *)arg;
 	int retval;
 	struct tty_ldisc *ld;
 	struct inode *inode = file->f_dentry->d_inode;
 
-	tty = file->private_data;
 	if (tty_paranoia_check(tty, inode, "tty_ioctl"))
 		return -EINVAL;
 
@@ -2629,7 +2653,7 @@ static long tty_compat_ioctl(struct file
 				unsigned long arg)
 {
 	struct inode *inode = file->f_dentry->d_inode;
-	struct tty_struct *tty = file->private_data;
+	struct tty_struct *tty = file_tty(file);
 	struct tty_ldisc *ld;
 	int retval = -ENOIOCTLCMD;
 
@@ -2721,7 +2745,7 @@ void __do_SAK(struct tty_struct *tty)
 				if (!filp)
 					continue;
 				if (filp->f_op->read == tty_read &&
-				    filp->private_data == tty) {
+				    file_tty(filp) == tty) {
 					printk(KERN_NOTICE "SAK: killed process %d"
 					    " (%s): fd#%d opened to the tty\n",
 					    task_pid_nr(p), p->comm, i);
Index: linux-2.6/security/selinux/hooks.c
===================================================================
--- linux-2.6.orig/security/selinux/hooks.c	2010-08-18 04:04:29.000000000 +1000
+++ linux-2.6/security/selinux/hooks.c	2010-08-18 04:04:29.000000000 +1000
@@ -2172,6 +2172,7 @@ static inline void flush_unauthorized_fi
 	if (tty) {
 		spin_lock(&tty_files_lock);
 		if (!list_empty(&tty->tty_files)) {
+			struct tty_file_private *file_priv;
 			struct inode *inode;
 
 			/* Revalidate access to controlling tty.
@@ -2179,7 +2180,9 @@ static inline void flush_unauthorized_fi
 			   than using file_has_perm, as this particular open
 			   file may belong to another process and we are only
 			   interested in the inode-based check here. */
-			file = list_first_entry(&tty->tty_files, struct file, f_u.fu_list);
+			file_priv = list_first_entry(&tty->tty_files,
+						struct tty_file_private, list);
+			file = file_priv->file;
 			inode = file->f_path.dentry->d_inode;
 			if (inode_has_perm(cred, inode,
 					   FILE__READ | FILE__WRITE, NULL)) {
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h	2010-08-18 04:04:01.000000000 +1000
+++ linux-2.6/fs/internal.h	2010-08-18 04:05:07.000000000 +1000
@@ -80,6 +80,8 @@ extern void chroot_fs_refs(struct path *
 /*
  * file_table.c
  */
+extern void file_sb_list_add(struct file *f, struct super_block *sb);
+extern void file_sb_list_del(struct file *f);
 extern void mark_files_ro(struct super_block *);
 extern struct file *get_empty_filp(void);
 
Index: linux-2.6/include/linux/tty.h
===================================================================
--- linux-2.6.orig/include/linux/tty.h	2010-08-18 04:04:29.000000000 +1000
+++ linux-2.6/include/linux/tty.h	2010-08-18 04:04:29.000000000 +1000
@@ -329,6 +329,13 @@ struct tty_struct {
 	struct tty_port *port;
 };
 
+/* Each of a tty's open files has private_data pointing to tty_file_private */
+struct tty_file_private {
+	struct tty_struct *tty;
+	struct file *file;
+	struct list_head list;
+};
+
 /* tty magic number */
 #define TTY_MAGIC		0x5401
 
@@ -458,6 +465,7 @@ extern void proc_clear_tty(struct task_s
 extern struct tty_struct *get_current_tty(void);
 extern void tty_default_fops(struct file_operations *fops);
 extern struct tty_struct *alloc_tty_struct(void);
+extern void tty_add_file(struct tty_struct *tty, struct file *file);
 extern void free_tty_struct(struct tty_struct *tty);
 extern void initialize_tty_struct(struct tty_struct *tty,
 		struct tty_driver *driver, int idx);


* [patch 08/10] lglock: introduce special lglock and brlock spin locks
From: Nick Piggin @ 2010-08-17 18:37 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Al Viro, Paul E. McKenney


lglock: introduce special lglock and brlock spin locks

This patch introduces "local-global" locks (lglocks). These can be used to:

- Provide fast exclusive access to per-CPU data, with exclusive access to
  another CPU's data allowed but possibly subject to contention, and to provide
  very slow exclusive access to all per-CPU data.
- Or to provide very fast and scalable read serialisation, and to provide
  very slow exclusive serialisation of data (not necessarily per-CPU data).

Brlocks are also implemented as a short-hand notation for the latter use
case.

Thanks to Paul for the local/global naming convention.
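
[ A minimal usage sketch (not part of the patch), for a hypothetical
brlock named example_lock. The read side takes only the calling CPU's
spinlock; the write side takes every online CPU's spinlock in turn. ]

/* in a header: */
DECLARE_BRLOCK(example_lock);

/* in one .c file: */
DEFINE_BRLOCK(example_lock);

static int __init example_init(void)
{
	br_lock_init(example_lock);
	return 0;
}

static void example_read_side(void)
{
	br_read_lock(example_lock);	/* this CPU's lock only: fast */
	/* ... read the protected data ... */
	br_read_unlock(example_lock);
}

static void example_write_side(void)
{
	br_write_lock(example_lock);	/* all online CPUs' locks: slow */
	/* ... modify the protected data ... */
	br_write_unlock(example_lock);
}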

Cc: linux-kernel@vger.kernel.org
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 include/linux/lglock.h |  172 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 172 insertions(+)

Index: linux-2.6/include/linux/lglock.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/lglock.h	2010-08-18 04:04:29.000000000 +1000
@@ -0,0 +1,172 @@
+/*
+ * Specialised local-global spinlock. Can only be declared as global variables
+ * to avoid overhead and keep things simple (and we don't want to start using
+ * these inside dynamically allocated structures).
+ *
+ * "local/global locks" (lglocks) can be used to:
+ *
+ * - Provide fast exclusive access to per-CPU data, with exclusive access to
+ *   another CPU's data allowed but possibly subject to contention, and to
+ *   provide very slow exclusive access to all per-CPU data.
+ * - Or to provide very fast and scalable read serialisation, and to provide
+ *   very slow exclusive serialisation of data (not necessarily per-CPU data).
+ *
+ * Brlocks are also implemented as a short-hand notation for the latter use
+ * case.
+ *
+ * Copyright 2009, 2010, Nick Piggin, Novell Inc.
+ */
+#ifndef __LINUX_LGLOCK_H
+#define __LINUX_LGLOCK_H
+
+#include <linux/spinlock.h>
+#include <linux/lockdep.h>
+#include <linux/percpu.h>
+
+/* can make br locks by using local lock for read side, global lock for write */
+#define br_lock_init(name)	name##_lock_init()
+#define br_read_lock(name)	name##_local_lock()
+#define br_read_unlock(name)	name##_local_unlock()
+#define br_write_lock(name)	name##_global_lock_online()
+#define br_write_unlock(name)	name##_global_unlock_online()
+
+#define DECLARE_BRLOCK(name)	DECLARE_LGLOCK(name)
+#define DEFINE_BRLOCK(name)	DEFINE_LGLOCK(name)
+
+
+#define lg_lock_init(name)	name##_lock_init()
+#define lg_local_lock(name)	name##_local_lock()
+#define lg_local_unlock(name)	name##_local_unlock()
+#define lg_local_lock_cpu(name, cpu)	name##_local_lock_cpu(cpu)
+#define lg_local_unlock_cpu(name, cpu)	name##_local_unlock_cpu(cpu)
+#define lg_global_lock(name)	name##_global_lock()
+#define lg_global_unlock(name)	name##_global_unlock()
+#define lg_global_lock_online(name) name##_global_lock_online()
+#define lg_global_unlock_online(name) name##_global_unlock_online()
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+#define LOCKDEP_INIT_MAP lockdep_init_map
+
+#define DEFINE_LGLOCK_LOCKDEP(name)					\
+ struct lock_class_key name##_lock_key;					\
+ struct lockdep_map name##_lock_dep_map;				\
+ EXPORT_SYMBOL(name##_lock_dep_map)
+
+#else
+#define LOCKDEP_INIT_MAP(a, b, c, d)
+
+#define DEFINE_LGLOCK_LOCKDEP(name)
+#endif
+
+
+#define DECLARE_LGLOCK(name)						\
+ extern void name##_lock_init(void);					\
+ extern void name##_local_lock(void);					\
+ extern void name##_local_unlock(void);					\
+ extern void name##_local_lock_cpu(int cpu);				\
+ extern void name##_local_unlock_cpu(int cpu);				\
+ extern void name##_global_lock(void);					\
+ extern void name##_global_unlock(void);				\
+ extern void name##_global_lock_online(void);				\
+ extern void name##_global_unlock_online(void);				\
+
+#define DEFINE_LGLOCK(name)						\
+									\
+ DEFINE_PER_CPU(arch_spinlock_t, name##_lock);				\
+ DEFINE_LGLOCK_LOCKDEP(name);						\
+									\
+ void name##_lock_init(void) {						\
+	int i;								\
+	LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \
+	for_each_possible_cpu(i) {					\
+		arch_spinlock_t *lock;					\
+		lock = &per_cpu(name##_lock, i);			\
+		*lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;	\
+	}								\
+ }									\
+ EXPORT_SYMBOL(name##_lock_init);					\
+									\
+ void name##_local_lock(void) {						\
+	arch_spinlock_t *lock;						\
+	preempt_disable();						\
+	rwlock_acquire_read(&name##_lock_dep_map, 0, 0, _THIS_IP_);	\
+	lock = &__get_cpu_var(name##_lock);				\
+	arch_spin_lock(lock);						\
+ }									\
+ EXPORT_SYMBOL(name##_local_lock);					\
+									\
+ void name##_local_unlock(void) {					\
+	arch_spinlock_t *lock;						\
+	rwlock_release(&name##_lock_dep_map, 1, _THIS_IP_);		\
+	lock = &__get_cpu_var(name##_lock);				\
+	arch_spin_unlock(lock);						\
+	preempt_enable();						\
+ }									\
+ EXPORT_SYMBOL(name##_local_unlock);					\
+									\
+ void name##_local_lock_cpu(int cpu) {					\
+	arch_spinlock_t *lock;						\
+	preempt_disable();						\
+	rwlock_acquire_read(&name##_lock_dep_map, 0, 0, _THIS_IP_);	\
+	lock = &per_cpu(name##_lock, cpu);				\
+	arch_spin_lock(lock);						\
+ }									\
+ EXPORT_SYMBOL(name##_local_lock_cpu);					\
+									\
+ void name##_local_unlock_cpu(int cpu) {				\
+	arch_spinlock_t *lock;						\
+	rwlock_release(&name##_lock_dep_map, 1, _THIS_IP_);		\
+	lock = &per_cpu(name##_lock, cpu);				\
+	arch_spin_unlock(lock);						\
+	preempt_enable();						\
+ }									\
+ EXPORT_SYMBOL(name##_local_unlock_cpu);				\
+									\
+ void name##_global_lock_online(void) {					\
+	int i;								\
+	preempt_disable();						\
+	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
+	for_each_online_cpu(i) {					\
+		arch_spinlock_t *lock;					\
+		lock = &per_cpu(name##_lock, i);			\
+		arch_spin_lock(lock);					\
+	}								\
+ }									\
+ EXPORT_SYMBOL(name##_global_lock_online);				\
+									\
+ void name##_global_unlock_online(void) {				\
+	int i;								\
+	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
+	for_each_online_cpu(i) {					\
+		arch_spinlock_t *lock;					\
+		lock = &per_cpu(name##_lock, i);			\
+		arch_spin_unlock(lock);					\
+	}								\
+	preempt_enable();						\
+ }									\
+ EXPORT_SYMBOL(name##_global_unlock_online);				\
+									\
+ void name##_global_lock(void) {					\
+	int i;								\
+	preempt_disable();						\
+	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
+	for_each_online_cpu(i) {					\
+		arch_spinlock_t *lock;					\
+		lock = &per_cpu(name##_lock, i);			\
+		arch_spin_lock(lock);					\
+	}								\
+ }									\
+ EXPORT_SYMBOL(name##_global_lock);					\
+									\
+ void name##_global_unlock(void) {					\
+	int i;								\
+	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
+	for_each_online_cpu(i) {					\
+		arch_spinlock_t *lock;					\
+		lock = &per_cpu(name##_lock, i);			\
+		arch_spin_unlock(lock);					\
+	}								\
+	preempt_enable();						\
+ }									\
+ EXPORT_SYMBOL(name##_global_unlock);
+#endif
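
As a usage sketch (illustrative only, not part of the patch -- demo_lglock
is a made-up name): a client of these macros follows the pattern below,
which is exactly how patch 09 uses files_lglock.

DECLARE_LGLOCK(demo_lglock);	/* in a header */
DEFINE_LGLOCK(demo_lglock);	/* in one .c file */

static void demo_init(void)
{
	lg_lock_init(demo_lglock);	/* once, before first use */
}

static void demo_fast_path(void)
{
	/* common case: take only this CPU's spinlock */
	lg_local_lock(demo_lglock);
	/* ... modify this CPU's part of the protected data ... */
	lg_local_unlock(demo_lglock);
}

static void demo_snapshot(void)
{
	/* rare case: take every CPU's spinlock for a global view */
	lg_global_lock(demo_lglock);
	/* ... iterate over all CPUs' data ... */
	lg_global_unlock(demo_lglock);
}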

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch 09/10] fs: scale files_lock
  2010-08-17 18:37 [patch 00/10] first set of vfs scale patches Nick Piggin
                   ` (7 preceding siblings ...)
  2010-08-17 18:37 ` [patch 08/10] lglock: introduce special lglock and brlock spin locks Nick Piggin
@ 2010-08-17 18:37 ` Nick Piggin
  2010-08-17 18:37 ` [patch 10/10] fs: brlock vfsmount_lock Nick Piggin
  2010-08-17 21:14 ` [patch 00/10] first set of vfs scale patches Al Viro
  10 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2010-08-17 18:37 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Tim Chen, Andi Kleen

[-- Attachment #1: fs-files_lock-scale.patch --]
[-- Type: text/plain, Size: 9574 bytes --]

fs: scale files_lock

Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu lists
to add and remove files. It also provides a snapshot of all the per-cpu lists
(although this is very slow).

One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on, with a new
variable in the file struct (packed into a hole on 64-bit archs). Scalability
could suffer if files are frequently removed from a different CPU's list than
the one they were added to.

However, loads that remove files frequently imply a short interval between
adding and removing the files, and the scheduler attempts to avoid migrating
processes far in that time. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.

A worst-case test of one CPU allocating files that are subsequently freed by N
CPUs degenerates to contention on a single lock, which is no worse than before.
When more than one CPU is allocating files, even if they are always freed by
different CPUs, there will be more parallelism than in the single-lock case.
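
Condensed, the add/remove pattern is the following (a sketch; the real code
is in the fs/file_table.c hunks below):

	/* add: put the file on this CPU's list and remember which one */
	lg_local_lock(files_lglock);
	file->f_sb_list_cpu = smp_processor_id();
	list_add(&file->f_u.fu_list,
		 per_cpu_ptr(sb->s_files, file->f_sb_list_cpu));
	lg_local_unlock(files_lglock);

	/* remove: lock the list the file is actually on (possibly remote) */
	lg_local_lock_cpu(files_lglock, file_list_cpu(file));
	list_del_init(&file->f_u.fu_list);
	lg_local_unlock_cpu(files_lglock, file_list_cpu(file));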


Testing results:

On a 2 socket, 8 core Opteron, I measured the number of times the lock is
taken to remove a file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.

Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)

So a file is removed by the same CPU that added it over 90% of the time, and
it remains within the same node over 95% of the time.


Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.

                throughput
2.6.34-rc2      24.5
+patch          24.9

                us      sys     idle    IO wait (in %)
2.6.34-rc2      51.25   28.25   17.25   3.25
+patch          53.75   18.5    19      8.75

So significantly less CPU time spent in kernel code, higher idle time and
slightly higher throughput.


The single threaded performance difference was within the noise of
microbenchmarks. That is not to say no penalty exists: the code is larger and
requires more memory accesses, so it will be slightly slower.

Cc: linux-kernel@vger.kernel.org
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/file_table.c    |  108 ++++++++++++++++++++++++++++++++++++++++++++---------
 fs/super.c         |   18 ++++++++
 include/linux/fs.h |    7 +++
 3 files changed, 115 insertions(+), 18 deletions(-)

Index: linux-2.6/fs/file_table.c
===================================================================
--- linux-2.6.orig/fs/file_table.c	2010-08-18 04:04:29.000000000 +1000
+++ linux-2.6/fs/file_table.c	2010-08-18 04:04:29.000000000 +1000
@@ -20,7 +20,9 @@
 #include <linux/cdev.h>
 #include <linux/fsnotify.h>
 #include <linux/sysctl.h>
+#include <linux/lglock.h>
 #include <linux/percpu_counter.h>
+#include <linux/percpu.h>
 #include <linux/ima.h>
 
 #include <asm/atomic.h>
@@ -32,7 +34,8 @@ struct files_stat_struct files_stat = {
 	.max_files = NR_FILE
 };
 
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
+DECLARE_LGLOCK(files_lglock);
+DEFINE_LGLOCK(files_lglock);
 
 /* SLAB cache for file structures */
 static struct kmem_cache *filp_cachep __read_mostly;
@@ -336,30 +339,98 @@ void put_filp(struct file *file)
 	}
 }
 
+static inline int file_list_cpu(struct file *file)
+{
+#ifdef CONFIG_SMP
+	return file->f_sb_list_cpu;
+#else
+	return smp_processor_id();
+#endif
+}
+
+/* helper for file_sb_list_add to reduce ifdefs */
+static inline void __file_sb_list_add(struct file *file, struct super_block *sb)
+{
+	struct list_head *list;
+#ifdef CONFIG_SMP
+	int cpu;
+	cpu = smp_processor_id();
+	file->f_sb_list_cpu = cpu;
+	list = per_cpu_ptr(sb->s_files, cpu);
+#else
+	list = &sb->s_files;
+#endif
+	list_add(&file->f_u.fu_list, list);
+}
+
+/**
+ * file_sb_list_add - add a file to the sb's file list
+ * @file: file to add
+ * @sb: sb to add it to
+ *
+ * Use this function to associate a file with the superblock of the inode it
+ * refers to.
+ */
 void file_sb_list_add(struct file *file, struct super_block *sb)
 {
-	spin_lock(&files_lock);
-	BUG_ON(!list_empty(&file->f_u.fu_list));
-	list_add(&file->f_u.fu_list, &sb->s_files);
-	spin_unlock(&files_lock);
+	lg_local_lock(files_lglock);
+	__file_sb_list_add(file, sb);
+	lg_local_unlock(files_lglock);
 }
 
+/**
+ * file_sb_list_del - remove a file from the sb's file list
+ * @file: file to remove
+ *
+ * Use this function to remove a file from its superblock.
+ */
 void file_sb_list_del(struct file *file)
 {
 	if (!list_empty(&file->f_u.fu_list)) {
-		spin_lock(&files_lock);
+		lg_local_lock_cpu(files_lglock, file_list_cpu(file));
 		list_del_init(&file->f_u.fu_list);
-		spin_unlock(&files_lock);
+		lg_local_unlock_cpu(files_lglock, file_list_cpu(file));
 	}
 }
 
+#ifdef CONFIG_SMP
+
+/*
+ * These macros iterate all files on all CPUs for a given superblock.
+ * files_lglock must be held globally.
+ */
+#define do_file_list_for_each_entry(__sb, __file)		\
+{								\
+	int i;							\
+	for_each_possible_cpu(i) {				\
+		struct list_head *list;				\
+		list = per_cpu_ptr((__sb)->s_files, i);		\
+		list_for_each_entry((__file), list, f_u.fu_list)
+
+#define while_file_list_for_each_entry				\
+	}							\
+}
+
+#else
+
+#define do_file_list_for_each_entry(__sb, __file)		\
+{								\
+	struct list_head *list;					\
+	list = &(__sb)->s_files;				\
+	list_for_each_entry((__file), list, f_u.fu_list)
+
+#define while_file_list_for_each_entry				\
+}
+
+#endif
+
 int fs_may_remount_ro(struct super_block *sb)
 {
 	struct file *file;
-
 	/* Check that no files are currently opened for writing. */
-	spin_lock(&files_lock);
-	list_for_each_entry(file, &sb->s_files, f_u.fu_list) {
+	lg_global_lock(files_lglock);
+	do_file_list_for_each_entry(sb, file) {
 		struct inode *inode = file->f_path.dentry->d_inode;
 
 		/* File with pending delete? */
@@ -369,11 +440,11 @@ int fs_may_remount_ro(struct super_block
 		/* Writeable file? */
 		if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
 			goto too_bad;
-	}
-	spin_unlock(&files_lock);
+	} while_file_list_for_each_entry;
+	lg_global_unlock(files_lglock);
 	return 1; /* Tis' cool bro. */
 too_bad:
-	spin_unlock(&files_lock);
+	lg_global_unlock(files_lglock);
 	return 0;
 }
 
@@ -389,8 +460,8 @@ void mark_files_ro(struct super_block *s
 	struct file *f;
 
 retry:
-	spin_lock(&files_lock);
-	list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
+	lg_global_lock(files_lglock);
+	do_file_list_for_each_entry(sb, f) {
 		struct vfsmount *mnt;
 		if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
 		       continue;
@@ -406,12 +477,12 @@ retry:
 		file_release_write(f);
 		mnt = mntget(f->f_path.mnt);
 		/* This can sleep, so we can't hold the spinlock. */
-		spin_unlock(&files_lock);
+		lg_global_unlock(files_lglock);
 		mnt_drop_write(mnt);
 		mntput(mnt);
 		goto retry;
-	}
-	spin_unlock(&files_lock);
+	} while_file_list_for_each_entry;
+	lg_global_unlock(files_lglock);
 }
 
 void __init files_init(unsigned long mempages)
@@ -431,5 +502,6 @@ void __init files_init(unsigned long mem
 	if (files_stat.max_files < NR_FILE)
 		files_stat.max_files = NR_FILE;
 	files_defer_init();
+	lg_lock_init(files_lglock);
 	percpu_counter_init(&nr_files, 0);
 } 
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c	2010-08-18 04:04:00.000000000 +1000
+++ linux-2.6/fs/super.c	2010-08-18 04:04:29.000000000 +1000
@@ -54,7 +54,22 @@ static struct super_block *alloc_super(s
 			s = NULL;
 			goto out;
 		}
+#ifdef CONFIG_SMP
+		s->s_files = alloc_percpu(struct list_head);
+		if (!s->s_files) {
+			security_sb_free(s);
+			kfree(s);
+			s = NULL;
+			goto out;
+		} else {
+			int i;
+
+			for_each_possible_cpu(i)
+				INIT_LIST_HEAD(per_cpu_ptr(s->s_files, i));
+		}
+#else
 		INIT_LIST_HEAD(&s->s_files);
+#endif
 		INIT_LIST_HEAD(&s->s_instances);
 		INIT_HLIST_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
@@ -108,6 +123,9 @@ out:
  */
 static inline void destroy_super(struct super_block *s)
 {
+#ifdef CONFIG_SMP
+	free_percpu(s->s_files);
+#endif
 	security_sb_free(s);
 	kfree(s->s_subtype);
 	kfree(s->s_options);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-08-18 04:04:29.000000000 +1000
+++ linux-2.6/include/linux/fs.h	2010-08-18 04:04:29.000000000 +1000
@@ -929,6 +929,9 @@ struct file {
 #define f_vfsmnt	f_path.mnt
 	const struct file_operations	*f_op;
 	spinlock_t		f_lock;  /* f_ep_links, f_flags, no IRQ */
+#ifdef CONFIG_SMP
+	int			f_sb_list_cpu;
+#endif
 	atomic_long_t		f_count;
 	unsigned int 		f_flags;
 	fmode_t			f_mode;
@@ -1343,7 +1346,11 @@ struct super_block {
 
 	struct list_head	s_inodes;	/* all inodes */
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
+#ifdef CONFIG_SMP
+	struct list_head __percpu *s_files;
+#else
 	struct list_head	s_files;
+#endif
 	/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	int			s_nr_dentry_unused;	/* # of dentry on lru */



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch 10/10] fs: brlock vfsmount_lock
  2010-08-17 18:37 [patch 00/10] first set of vfs scale patches Nick Piggin
                   ` (8 preceding siblings ...)
  2010-08-17 18:37 ` [patch 09/10] fs: scale files_lock Nick Piggin
@ 2010-08-17 18:37 ` Nick Piggin
  2010-08-18 14:05   ` Andi Kleen
  2010-08-17 21:14 ` [patch 00/10] first set of vfs scale patches Al Viro
  10 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2010-08-17 18:37 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Al Viro

[-- Attachment #1: fs-vfsmount_lock-scale-2.patch --]
[-- Type: text/plain, Size: 19913 bytes --]

fs: brlock vfsmount_lock

Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read when
performing mount hash lookups.

A new lock is added for the mnt-id allocator, so it doesn't need to take
the heavy vfsmount write-lock.

The number of atomics should remain the same for fastpath rlock cases, though
the code will be slightly slower due to per-cpu access. Scalability is not
expected to improve much in common cases yet, due to other locks (ie.
dcache_lock) getting in the way. However, path lookups crossing mountpoints
should be one case where scalability is improved (they currently require the
global lock).

The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
Altix system (high latency to remote nodes), a simple umount microbenchmark
(mount --bind mnt mnt2 ; umount mnt2, looped 1000 times) took 6.8s before this
patch and 7.1s after it, about 5% slower.
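
Condensed, the locking rules become (a sketch; vfsmount_lock is defined
with DEFINE_BRLOCK in the hunks below):

	/* read side: mount hash lookups, walking up the mount tree */
	br_read_lock(vfsmount_lock);
	mnt = __lookup_mnt(path->mnt, path->dentry, 1);
	br_read_unlock(vfsmount_lock);

	/* write side: any change to the mount hash or a vfsmount itself */
	br_write_lock(vfsmount_lock);
	/* ... attach/detach mounts, change mnt_flags, umount_tree() ... */
	br_write_unlock(vfsmount_lock);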

Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/dcache.c    |   11 +--
 fs/internal.h  |    5 +
 fs/namei.c     |    7 +-
 fs/namespace.c |  177 +++++++++++++++++++++++++++++++++++----------------------
 fs/pnode.c     |   11 ++-
 5 files changed, 134 insertions(+), 77 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c	2010-08-18 04:04:29.000000000 +1000
+++ linux-2.6/fs/dcache.c	2010-08-18 04:04:29.000000000 +1000
@@ -1935,7 +1935,7 @@ static int prepend_path(const struct pat
 	bool slash = false;
 	int error = 0;
 
-	spin_lock(&vfsmount_lock);
+	br_read_lock(vfsmount_lock);
 	while (dentry != root->dentry || vfsmnt != root->mnt) {
 		struct dentry * parent;
 
@@ -1964,7 +1964,7 @@ out:
 	if (!error && !slash)
 		error = prepend(buffer, buflen, "/", 1);
 
-	spin_unlock(&vfsmount_lock);
+	br_read_unlock(vfsmount_lock);
 	return error;
 
 global_root:
@@ -2302,11 +2302,12 @@ int path_is_under(struct path *path1, st
 	struct vfsmount *mnt = path1->mnt;
 	struct dentry *dentry = path1->dentry;
 	int res;
-	spin_lock(&vfsmount_lock);
+
+	br_read_lock(vfsmount_lock);
 	if (mnt != path2->mnt) {
 		for (;;) {
 			if (mnt->mnt_parent == mnt) {
-				spin_unlock(&vfsmount_lock);
+				br_read_unlock(vfsmount_lock);
 				return 0;
 			}
 			if (mnt->mnt_parent == path2->mnt)
@@ -2316,7 +2317,7 @@ int path_is_under(struct path *path1, st
 		dentry = mnt->mnt_mountpoint;
 	}
 	res = is_subdir(dentry, path2->dentry);
-	spin_unlock(&vfsmount_lock);
+	br_read_unlock(vfsmount_lock);
 	return res;
 }
 EXPORT_SYMBOL(path_is_under);
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c	2010-08-18 04:04:29.000000000 +1000
+++ linux-2.6/fs/namei.c	2010-08-18 04:04:29.000000000 +1000
@@ -595,15 +595,16 @@ int follow_up(struct path *path)
 {
 	struct vfsmount *parent;
 	struct dentry *mountpoint;
-	spin_lock(&vfsmount_lock);
+
+	br_read_lock(vfsmount_lock);
 	parent = path->mnt->mnt_parent;
 	if (parent == path->mnt) {
-		spin_unlock(&vfsmount_lock);
+		br_read_unlock(vfsmount_lock);
 		return 0;
 	}
 	mntget(parent);
 	mountpoint = dget(path->mnt->mnt_mountpoint);
-	spin_unlock(&vfsmount_lock);
+	br_read_unlock(vfsmount_lock);
 	dput(path->dentry);
 	path->dentry = mountpoint;
 	mntput(path->mnt);
Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c	2010-08-18 04:04:00.000000000 +1000
+++ linux-2.6/fs/namespace.c	2010-08-18 04:04:29.000000000 +1000
@@ -11,6 +11,8 @@
 #include <linux/syscalls.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
 #include <linux/smp_lock.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
@@ -38,12 +40,10 @@
 #define HASH_SHIFT ilog2(PAGE_SIZE / sizeof(struct list_head))
 #define HASH_SIZE (1UL << HASH_SHIFT)
 
-/* spinlock for vfsmount related operations, inplace of dcache_lock */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(vfsmount_lock);
-
 static int event;
 static DEFINE_IDA(mnt_id_ida);
 static DEFINE_IDA(mnt_group_ida);
+static DEFINE_SPINLOCK(mnt_id_lock);
 static int mnt_id_start = 0;
 static int mnt_group_start = 1;
 
@@ -55,6 +55,16 @@ static struct rw_semaphore namespace_sem
 struct kobject *fs_kobj;
 EXPORT_SYMBOL_GPL(fs_kobj);
 
+/*
+ * vfsmount lock may be taken for read to prevent changes to the
+ * vfsmount hash, ie. during mountpoint lookups or walking back
+ * up the tree.
+ *
+ * It should be taken for write in all cases where the vfsmount
+ * tree or hash is modified or when a vfsmount structure is modified.
+ */
+DEFINE_BRLOCK(vfsmount_lock);
+
 static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry)
 {
 	unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
@@ -65,18 +75,21 @@ static inline unsigned long hash(struct
 
 #define MNT_WRITER_UNDERFLOW_LIMIT -(1<<16)
 
-/* allocation is serialized by namespace_sem */
+/*
+ * allocation is serialized by namespace_sem, but we need the spinlock to
+ * serialize with freeing.
+ */
 static int mnt_alloc_id(struct vfsmount *mnt)
 {
 	int res;
 
 retry:
 	ida_pre_get(&mnt_id_ida, GFP_KERNEL);
-	spin_lock(&vfsmount_lock);
+	spin_lock(&mnt_id_lock);
 	res = ida_get_new_above(&mnt_id_ida, mnt_id_start, &mnt->mnt_id);
 	if (!res)
 		mnt_id_start = mnt->mnt_id + 1;
-	spin_unlock(&vfsmount_lock);
+	spin_unlock(&mnt_id_lock);
 	if (res == -EAGAIN)
 		goto retry;
 
@@ -86,11 +99,11 @@ retry:
 static void mnt_free_id(struct vfsmount *mnt)
 {
 	int id = mnt->mnt_id;
-	spin_lock(&vfsmount_lock);
+	spin_lock(&mnt_id_lock);
 	ida_remove(&mnt_id_ida, id);
 	if (mnt_id_start > id)
 		mnt_id_start = id;
-	spin_unlock(&vfsmount_lock);
+	spin_unlock(&mnt_id_lock);
 }
 
 /*
@@ -348,7 +361,7 @@ static int mnt_make_readonly(struct vfsm
 {
 	int ret = 0;
 
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	mnt->mnt_flags |= MNT_WRITE_HOLD;
 	/*
 	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -382,15 +395,15 @@ static int mnt_make_readonly(struct vfsm
 	 */
 	smp_wmb();
 	mnt->mnt_flags &= ~MNT_WRITE_HOLD;
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	return ret;
 }
 
 static void __mnt_unmake_readonly(struct vfsmount *mnt)
 {
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	mnt->mnt_flags &= ~MNT_READONLY;
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 }
 
 void simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
@@ -414,6 +427,7 @@ void free_vfsmnt(struct vfsmount *mnt)
 /*
  * find the first or last mount at @dentry on vfsmount @mnt depending on
  * @dir. If @dir is set return the first mount else return the last mount.
+ * vfsmount_lock must be held for read or write.
  */
 struct vfsmount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry,
 			      int dir)
@@ -443,10 +457,11 @@ struct vfsmount *__lookup_mnt(struct vfs
 struct vfsmount *lookup_mnt(struct path *path)
 {
 	struct vfsmount *child_mnt;
-	spin_lock(&vfsmount_lock);
+
+	br_read_lock(vfsmount_lock);
 	if ((child_mnt = __lookup_mnt(path->mnt, path->dentry, 1)))
 		mntget(child_mnt);
-	spin_unlock(&vfsmount_lock);
+	br_read_unlock(vfsmount_lock);
 	return child_mnt;
 }
 
@@ -455,6 +470,9 @@ static inline int check_mnt(struct vfsmo
 	return mnt->mnt_ns == current->nsproxy->mnt_ns;
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 static void touch_mnt_namespace(struct mnt_namespace *ns)
 {
 	if (ns) {
@@ -463,6 +481,9 @@ static void touch_mnt_namespace(struct m
 	}
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 static void __touch_mnt_namespace(struct mnt_namespace *ns)
 {
 	if (ns && ns->event != event) {
@@ -471,6 +492,9 @@ static void __touch_mnt_namespace(struct
 	}
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 static void detach_mnt(struct vfsmount *mnt, struct path *old_path)
 {
 	old_path->dentry = mnt->mnt_mountpoint;
@@ -482,6 +506,9 @@ static void detach_mnt(struct vfsmount *
 	old_path->dentry->d_mounted--;
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 void mnt_set_mountpoint(struct vfsmount *mnt, struct dentry *dentry,
 			struct vfsmount *child_mnt)
 {
@@ -490,6 +517,9 @@ void mnt_set_mountpoint(struct vfsmount
 	dentry->d_mounted++;
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 static void attach_mnt(struct vfsmount *mnt, struct path *path)
 {
 	mnt_set_mountpoint(path->mnt, path->dentry, mnt);
@@ -499,7 +529,7 @@ static void attach_mnt(struct vfsmount *
 }
 
 /*
- * the caller must hold vfsmount_lock
+ * vfsmount lock must be held for write
  */
 static void commit_tree(struct vfsmount *mnt)
 {
@@ -623,39 +653,43 @@ static inline void __mntput(struct vfsmo
 void mntput_no_expire(struct vfsmount *mnt)
 {
 repeat:
-	if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
-		if (likely(!mnt->mnt_pinned)) {
-			spin_unlock(&vfsmount_lock);
-			__mntput(mnt);
-			return;
-		}
-		atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
-		mnt->mnt_pinned = 0;
-		spin_unlock(&vfsmount_lock);
-		acct_auto_close_mnt(mnt);
-		goto repeat;
+	if (atomic_add_unless(&mnt->mnt_count, -1, 1))
+		return;
+	br_write_lock(vfsmount_lock);
+	if (!atomic_dec_and_test(&mnt->mnt_count)) {
+		br_write_unlock(vfsmount_lock);
+		return;
+	}
+	if (likely(!mnt->mnt_pinned)) {
+		br_write_unlock(vfsmount_lock);
+		__mntput(mnt);
+		return;
 	}
+	atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
+	mnt->mnt_pinned = 0;
+	br_write_unlock(vfsmount_lock);
+	acct_auto_close_mnt(mnt);
+	goto repeat;
 }
-
 EXPORT_SYMBOL(mntput_no_expire);
 
 void mnt_pin(struct vfsmount *mnt)
 {
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	mnt->mnt_pinned++;
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 }
 
 EXPORT_SYMBOL(mnt_pin);
 
 void mnt_unpin(struct vfsmount *mnt)
 {
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	if (mnt->mnt_pinned) {
 		atomic_inc(&mnt->mnt_count);
 		mnt->mnt_pinned--;
 	}
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 }
 
 EXPORT_SYMBOL(mnt_unpin);
@@ -746,12 +780,12 @@ int mnt_had_events(struct proc_mounts *p
 	struct mnt_namespace *ns = p->ns;
 	int res = 0;
 
-	spin_lock(&vfsmount_lock);
+	br_read_lock(vfsmount_lock);
 	if (p->event != ns->event) {
 		p->event = ns->event;
 		res = 1;
 	}
-	spin_unlock(&vfsmount_lock);
+	br_read_unlock(vfsmount_lock);
 
 	return res;
 }
@@ -952,12 +986,12 @@ int may_umount_tree(struct vfsmount *mnt
 	int minimum_refs = 0;
 	struct vfsmount *p;
 
-	spin_lock(&vfsmount_lock);
+	br_read_lock(vfsmount_lock);
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
 		actual_refs += atomic_read(&p->mnt_count);
 		minimum_refs += 2;
 	}
-	spin_unlock(&vfsmount_lock);
+	br_read_unlock(vfsmount_lock);
 
 	if (actual_refs > minimum_refs)
 		return 0;
@@ -984,10 +1018,10 @@ int may_umount(struct vfsmount *mnt)
 {
 	int ret = 1;
 	down_read(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	br_read_lock(vfsmount_lock);
 	if (propagate_mount_busy(mnt, 2))
 		ret = 0;
-	spin_unlock(&vfsmount_lock);
+	br_read_unlock(vfsmount_lock);
 	up_read(&namespace_sem);
 	return ret;
 }
@@ -1003,13 +1037,14 @@ void release_mounts(struct list_head *he
 		if (mnt->mnt_parent != mnt) {
 			struct dentry *dentry;
 			struct vfsmount *m;
-			spin_lock(&vfsmount_lock);
+
+			br_write_lock(vfsmount_lock);
 			dentry = mnt->mnt_mountpoint;
 			m = mnt->mnt_parent;
 			mnt->mnt_mountpoint = mnt->mnt_root;
 			mnt->mnt_parent = mnt;
 			m->mnt_ghosts--;
-			spin_unlock(&vfsmount_lock);
+			br_write_unlock(vfsmount_lock);
 			dput(dentry);
 			mntput(m);
 		}
@@ -1017,6 +1052,10 @@ void release_mounts(struct list_head *he
 	}
 }
 
+/*
+ * vfsmount lock must be held for write
+ * namespace_sem must be held for write
+ */
 void umount_tree(struct vfsmount *mnt, int propagate, struct list_head *kill)
 {
 	struct vfsmount *p;
@@ -1107,7 +1146,7 @@ static int do_umount(struct vfsmount *mn
 	}
 
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	event++;
 
 	if (!(flags & MNT_DETACH))
@@ -1119,7 +1158,7 @@ static int do_umount(struct vfsmount *mn
 			umount_tree(mnt, 1, &umount_list);
 		retval = 0;
 	}
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
 	return retval;
@@ -1231,19 +1270,19 @@ struct vfsmount *copy_tree(struct vfsmou
 			q = clone_mnt(p, p->mnt_root, flag);
 			if (!q)
 				goto Enomem;
-			spin_lock(&vfsmount_lock);
+			br_write_lock(vfsmount_lock);
 			list_add_tail(&q->mnt_list, &res->mnt_list);
 			attach_mnt(q, &path);
-			spin_unlock(&vfsmount_lock);
+			br_write_unlock(vfsmount_lock);
 		}
 	}
 	return res;
 Enomem:
 	if (res) {
 		LIST_HEAD(umount_list);
-		spin_lock(&vfsmount_lock);
+		br_write_lock(vfsmount_lock);
 		umount_tree(res, 0, &umount_list);
-		spin_unlock(&vfsmount_lock);
+		br_write_unlock(vfsmount_lock);
 		release_mounts(&umount_list);
 	}
 	return NULL;
@@ -1262,9 +1301,9 @@ void drop_collected_mounts(struct vfsmou
 {
 	LIST_HEAD(umount_list);
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	umount_tree(mnt, 0, &umount_list);
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
 }
@@ -1392,7 +1431,7 @@ static int attach_recursive_mnt(struct v
 	if (err)
 		goto out_cleanup_ids;
 
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 
 	if (IS_MNT_SHARED(dest_mnt)) {
 		for (p = source_mnt; p; p = next_mnt(p, source_mnt))
@@ -1411,7 +1450,8 @@ static int attach_recursive_mnt(struct v
 		list_del_init(&child->mnt_hash);
 		commit_tree(child);
 	}
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
+
 	return 0;
 
  out_cleanup_ids:
@@ -1466,10 +1506,10 @@ static int do_change_type(struct path *p
 			goto out_unlock;
 	}
 
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL))
 		change_mnt_propagation(m, type);
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 
  out_unlock:
 	up_write(&namespace_sem);
@@ -1513,9 +1553,10 @@ static int do_loopback(struct path *path
 	err = graft_tree(mnt, path);
 	if (err) {
 		LIST_HEAD(umount_list);
-		spin_lock(&vfsmount_lock);
+
+		br_write_lock(vfsmount_lock);
 		umount_tree(mnt, 0, &umount_list);
-		spin_unlock(&vfsmount_lock);
+		br_write_unlock(vfsmount_lock);
 		release_mounts(&umount_list);
 	}
 
@@ -1568,16 +1609,16 @@ static int do_remount(struct path *path,
 	else
 		err = do_remount_sb(sb, flags, data, 0);
 	if (!err) {
-		spin_lock(&vfsmount_lock);
+		br_write_lock(vfsmount_lock);
 		mnt_flags |= path->mnt->mnt_flags & MNT_PROPAGATION_MASK;
 		path->mnt->mnt_flags = mnt_flags;
-		spin_unlock(&vfsmount_lock);
+		br_write_unlock(vfsmount_lock);
 	}
 	up_write(&sb->s_umount);
 	if (!err) {
-		spin_lock(&vfsmount_lock);
+		br_write_lock(vfsmount_lock);
 		touch_mnt_namespace(path->mnt->mnt_ns);
-		spin_unlock(&vfsmount_lock);
+		br_write_unlock(vfsmount_lock);
 	}
 	return err;
 }
@@ -1754,7 +1795,7 @@ void mark_mounts_for_expiry(struct list_
 		return;
 
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 
 	/* extract from the expiration list every vfsmount that matches the
 	 * following criteria:
@@ -1773,7 +1814,7 @@ void mark_mounts_for_expiry(struct list_
 		touch_mnt_namespace(mnt->mnt_ns);
 		umount_tree(mnt, 1, &umounts);
 	}
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	up_write(&namespace_sem);
 
 	release_mounts(&umounts);
@@ -1830,6 +1871,8 @@ resume:
 /*
  * process a list of expirable mountpoints with the intent of discarding any
  * submounts of a specific parent mountpoint
+ *
+ * vfsmount_lock must be held for write
  */
 static void shrink_submounts(struct vfsmount *mnt, struct list_head *umounts)
 {
@@ -2048,9 +2091,9 @@ static struct mnt_namespace *dup_mnt_ns(
 		kfree(new_ns);
 		return ERR_PTR(-ENOMEM);
 	}
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	list_add_tail(&new_ns->list, &new_ns->root->mnt_list);
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 
 	/*
 	 * Second pass: switch the tsk->fs->* elements and mark new vfsmounts
@@ -2244,7 +2287,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
 		goto out2; /* not attached */
 	/* make sure we can reach put_old from new_root */
 	tmp = old.mnt;
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	if (tmp != new.mnt) {
 		for (;;) {
 			if (tmp->mnt_parent == tmp)
@@ -2264,7 +2307,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
 	/* mount new_root on / */
 	attach_mnt(new.mnt, &root_parent);
 	touch_mnt_namespace(current->nsproxy->mnt_ns);
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	chroot_fs_refs(&root, &new);
 	error = 0;
 	path_put(&root_parent);
@@ -2279,7 +2322,7 @@ out1:
 out0:
 	return error;
 out3:
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	goto out2;
 }
 
@@ -2326,6 +2369,8 @@ void __init mnt_init(void)
 	for (u = 0; u < HASH_SIZE; u++)
 		INIT_LIST_HEAD(&mount_hashtable[u]);
 
+	br_lock_init(vfsmount_lock);
+
 	err = sysfs_init();
 	if (err)
 		printk(KERN_WARNING "%s: sysfs_init error: %d\n",
@@ -2344,9 +2389,9 @@ void put_mnt_ns(struct mnt_namespace *ns
 	if (!atomic_dec_and_test(&ns->count))
 		return;
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	umount_tree(ns->root, 0, &umount_list);
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
 	kfree(ns);
Index: linux-2.6/fs/pnode.c
===================================================================
--- linux-2.6.orig/fs/pnode.c	2010-08-18 04:04:00.000000000 +1000
+++ linux-2.6/fs/pnode.c	2010-08-18 04:04:29.000000000 +1000
@@ -126,6 +126,9 @@ static int do_make_slave(struct vfsmount
 	return 0;
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 void change_mnt_propagation(struct vfsmount *mnt, int type)
 {
 	if (type == MS_SHARED) {
@@ -270,12 +273,12 @@ int propagate_mnt(struct vfsmount *dest_
 		prev_src_mnt  = child;
 	}
 out:
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	while (!list_empty(&tmp_list)) {
 		child = list_first_entry(&tmp_list, struct vfsmount, mnt_hash);
 		umount_tree(child, 0, &umount_list);
 	}
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	release_mounts(&umount_list);
 	return ret;
 }
@@ -296,6 +299,8 @@ static inline int do_refcount_check(stru
  * other mounts its parent propagates to.
  * Check if any of these mounts that **do not have submounts**
  * have more references than 'refcnt'. If so return busy.
+ *
+ * vfsmount lock must be held for read or write
  */
 int propagate_mount_busy(struct vfsmount *mnt, int refcnt)
 {
@@ -353,6 +358,8 @@ static void __propagate_umount(struct vf
  * collect all mounts that receive propagation from the mount in @list,
  * and return these additional mounts in the same list.
  * @list: the list of mounts to be unmounted.
+ *
+ * vfsmount lock must be held for write
  */
 int propagate_umount(struct list_head *list)
 {
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h	2010-08-18 04:04:29.000000000 +1000
+++ linux-2.6/fs/internal.h	2010-08-18 04:04:29.000000000 +1000
@@ -9,6 +9,8 @@
  * 2 of the License, or (at your option) any later version.
  */
 
+#include <linux/lglock.h>
+
 struct super_block;
 struct linux_binprm;
 struct path;
@@ -70,7 +72,8 @@ extern struct vfsmount *copy_tree(struct
 
 extern void __init mnt_init(void);
 
-extern spinlock_t vfsmount_lock;
+DECLARE_BRLOCK(vfsmount_lock);
+
 
 /*
  * fs_struct.c



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 00/10] first set of vfs scale patches
  2010-08-17 18:37 [patch 00/10] first set of vfs scale patches Nick Piggin
                   ` (9 preceding siblings ...)
  2010-08-17 18:37 ` [patch 10/10] fs: brlock vfsmount_lock Nick Piggin
@ 2010-08-17 21:14 ` Al Viro
  10 siblings, 0 replies; 25+ messages in thread
From: Al Viro @ 2010-08-17 21:14 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel

On Wed, Aug 18, 2010 at 04:37:29AM +1000, Nick Piggin wrote:
> Does not contain inode lock patches yet, I've not
> quite finished porting them yet, but even when I do I think they should
> get some time in linux-next and some more time for people to review them
> before going upstream. So let's wait for next release on those?
> 
> This patchset contains:
> * some misc bits
> * rwlock->spinlock for fs_struct lock
> * files_lock cleanup
> * tty files list bugfix
> * files_lock scaling
> * vfsmount_lock scaling
> 
> These should all be in good shape for review and hopefully merging, so
> please let me know if I need to fix anything.

OK, I'll review that stuff and hopefully push to Linus tonight

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 01/10] fs: fix do_lookup false negative
  2010-08-17 18:37 ` [patch 01/10] fs: fix do_lookup false negative Nick Piggin
@ 2010-08-17 22:45   ` Valerie Aurora
  2010-08-17 23:04   ` Sage Weil
  2010-08-18 13:41   ` Andi Kleen
  2 siblings, 0 replies; 25+ messages in thread
From: Valerie Aurora @ 2010-08-17 22:45 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Al Viro, linux-fsdevel

On Wed, Aug 18, 2010 at 04:37:30AM +1000, Nick Piggin wrote:
> fs: fix do_lookup false negative
> 
> In do_lookup, if we initially find no dentry, we take the directory i_mutex and
> re-check the lookup. If we find a dentry there, then we revalidate it if
> needed. However if that revalidate asks for the dentry to be invalidated, we
> return -ENOENT from do_lookup. What should happen instead is an attempt to
> allocate and lookup a new dentry.
> 
> This is probably not noticed because it is rare. It is only reached if a
> concurrent create races in first (in which case, the dentry probably won't be
> invalidated anyway), or if the racy __d_lookup has failed due to a
> false-negative (which is very rare).
> 
> Fix this by removing code and have it use the normal reval path.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> ---
>  fs/namei.c |   10 ++--------
>  1 file changed, 2 insertions(+), 8 deletions(-)
> 
> Index: linux-2.6/fs/namei.c
> ===================================================================
> --- linux-2.6.orig/fs/namei.c	2010-08-18 04:04:18.000000000 +1000
> +++ linux-2.6/fs/namei.c	2010-08-18 04:05:15.000000000 +1000
> @@ -709,6 +709,7 @@ static int do_lookup(struct nameidata *n
>  	dentry = __d_lookup(nd->path.dentry, name);
>  	if (!dentry)
>  		goto need_lookup;
> +found:
>  	if (dentry->d_op && dentry->d_op->d_revalidate)
>  		goto need_revalidate;
>  done:
> @@ -766,14 +767,7 @@ out_unlock:
>  	 * we waited on the semaphore. Need to revalidate.
>  	 */
>  	mutex_unlock(&dir->i_mutex);
> -	if (dentry->d_op && dentry->d_op->d_revalidate) {
> -		dentry = do_revalidate(dentry, nd);
> -		if (!dentry)
> -			dentry = ERR_PTR(-ENOENT);
> -	}
> -	if (IS_ERR(dentry))
> -		goto fail;
> -	goto done;
> +	goto found;
>  
>  need_revalidate:
>  	dentry = do_revalidate(dentry, nd);
> 
> 

Looks good.

Reviewed-by: Valerie Aurora <vaurora@redhat.com>

-VAL

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 02/10] fs: dentry allocation consolidation
  2010-08-17 18:37 ` [patch 02/10] fs: dentry allocation consolidation Nick Piggin
@ 2010-08-17 22:45   ` Valerie Aurora
  0 siblings, 0 replies; 25+ messages in thread
From: Valerie Aurora @ 2010-08-17 22:45 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Al Viro, linux-fsdevel

On Wed, Aug 18, 2010 at 04:37:31AM +1000, Nick Piggin wrote:
> fs: dentry allocation consolidation
> 
> There are 2 duplicate copies of code in dentry allocation in path lookup.
> Consolidate them into a single function.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> ---
>  fs/namei.c |   70 ++++++++++++++++++++++++++++---------------------------------
>  1 file changed, 33 insertions(+), 37 deletions(-)
> 
> Index: linux-2.6/fs/namei.c
> ===================================================================
> --- linux-2.6.orig/fs/namei.c	2010-08-18 04:04:29.000000000 +1000
> +++ linux-2.6/fs/namei.c	2010-08-18 04:05:12.000000000 +1000
> @@ -686,6 +686,35 @@ static __always_inline void follow_dotdo
>  }
>  
>  /*
> + * Allocate a dentry with name and parent, and perform a parent
> + * directory ->lookup on it. Returns the new dentry, or ERR_PTR
> + * on error. parent->d_inode->i_mutex must be held. d_lookup must
> + * have verified that no child exists while under i_mutex.
> + */
> +static struct dentry *d_alloc_and_lookup(struct dentry *parent,
> +				struct qstr *name, struct nameidata *nd)
> +{
> +	struct inode *inode = parent->d_inode;
> +	struct dentry *dentry;
> +	struct dentry *old;
> +
> +	/* Don't create child dentry for a dead directory. */
> +	if (unlikely(IS_DEADDIR(inode)))
> +		return ERR_PTR(-ENOENT);
> +
> +	dentry = d_alloc(parent, name);
> +	if (unlikely(!dentry))
> +		return ERR_PTR(-ENOMEM);
> +
> +	old = inode->i_op->lookup(inode, dentry, nd);
> +	if (unlikely(old)) {
> +		dput(dentry);
> +		dentry = old;
> +	}
> +	return dentry;
> +}
> +
> +/*
>   *  It's more convoluted than I'd like it to be, but... it's still fairly
>   *  small and for now I'd prefer to have fast path as straight as possible.
>   *  It _is_ time-critical.
> @@ -738,30 +767,13 @@ need_lookup:
>  	 * so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
>  	 */
>  	dentry = d_lookup(parent, name);
> -	if (!dentry) {
> -		struct dentry *new;
> -
> -		/* Don't create child dentry for a dead directory. */
> -		dentry = ERR_PTR(-ENOENT);
> -		if (IS_DEADDIR(dir))
> -			goto out_unlock;
> -
> -		new = d_alloc(parent, name);
> -		dentry = ERR_PTR(-ENOMEM);
> -		if (new) {
> -			dentry = dir->i_op->lookup(dir, new, nd);
> -			if (dentry)
> -				dput(new);
> -			else
> -				dentry = new;
> -		}
> -out_unlock:
> +	if (likely(!dentry)) {
> +		dentry = d_alloc_and_lookup(parent, name, nd);
>  		mutex_unlock(&dir->i_mutex);
>  		if (IS_ERR(dentry))
>  			goto fail;
>  		goto done;
>  	}
> -
>  	/*
>  	 * Uhhuh! Nasty case: the cache was re-populated while
>  	 * we waited on the semaphore. Need to revalidate.
> @@ -1135,24 +1147,8 @@ static struct dentry *__lookup_hash(stru
>  	if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
>  		dentry = do_revalidate(dentry, nd);
>  
> -	if (!dentry) {
> -		struct dentry *new;
> -
> -		/* Don't create child dentry for a dead directory. */
> -		dentry = ERR_PTR(-ENOENT);
> -		if (IS_DEADDIR(inode))
> -			goto out;
> -
> -		new = d_alloc(base, name);
> -		dentry = ERR_PTR(-ENOMEM);
> -		if (!new)
> -			goto out;
> -		dentry = inode->i_op->lookup(inode, new, nd);
> -		if (!dentry)
> -			dentry = new;
> -		else
> -			dput(new);
> -	}
> +	if (!dentry)
> +		dentry = d_alloc_and_lookup(base, name, nd);
>  out:
>  	return dentry;
>  }

Looks good.

Reviewed-by: Valerie Aurora <vaurora@redhat.com>

-VAL

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 03/10] apparmor: use task path helpers
  2010-08-17 18:37 ` [patch 03/10] apparmor: use task path helpers Nick Piggin
@ 2010-08-17 22:59   ` Valerie Aurora
  0 siblings, 0 replies; 25+ messages in thread
From: Valerie Aurora @ 2010-08-17 22:59 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Al Viro, linux-fsdevel

On Wed, Aug 18, 2010 at 04:37:32AM +1000, Nick Piggin wrote:
> apparmor: use task path helpers
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> ---
> BTW, argh! dcache_lock! Why does everything have to invent its own reverse
> path lookup crud and hide it away in its own code? Also, admire the rest of
> this beautiful function.
> 
> ---
>  security/apparmor/path.c |    9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> Index: linux-2.6/security/apparmor/path.c
> ===================================================================
> --- linux-2.6.orig/security/apparmor/path.c	2010-08-18 04:04:02.000000000 +1000
> +++ linux-2.6/security/apparmor/path.c	2010-08-18 04:04:29.000000000 +1000
> @@ -62,19 +62,14 @@ static int d_namespace_path(struct path
>  	int deleted, connected;
>  	int error = 0;
>  
> -	/* Get the root we want to resolve too */
> +	/* Get the root we want to resolve too, released below */
>  	if (flags & PATH_CHROOT_REL) {
>  		/* resolve paths relative to chroot */
> -		read_lock(&current->fs->lock);
> -		root = current->fs->root;
> -		/* released below */
> -		path_get(&root);
> -		read_unlock(&current->fs->lock);
> +		get_fs_root(current->fs, &root);
>  	} else {
>  		/* resolve paths relative to namespace */
>  		root.mnt = current->nsproxy->mnt_ns->root;
>  		root.dentry = root.mnt->mnt_root;
> -		/* released below */
>  		path_get(&root);
>  	}

Gosh, maybe this VFS stuff isn't as hard as I thought!  This one's
easy to review too. :)

Looks good.

Reviewed-by: Valerie Aurora <vaurora@redhat.com>

-VAL


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 01/10] fs: fix do_lookup false negative
  2010-08-17 18:37 ` [patch 01/10] fs: fix do_lookup false negative Nick Piggin
  2010-08-17 22:45   ` Valerie Aurora
@ 2010-08-17 23:04   ` Sage Weil
  2010-08-18 13:41   ` Andi Kleen
  2 siblings, 0 replies; 25+ messages in thread
From: Sage Weil @ 2010-08-17 23:04 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Al Viro, linux-fsdevel

On Wed, 18 Aug 2010, Nick Piggin wrote:

> fs: fix do_lookup false negative
> 
> In do_lookup, if we initially find no dentry, we take the directory i_mutex and
> re-check the lookup. If we find a dentry there, then we revalidate it if
> needed. However if that revalidate asks for the dentry to be invalidated, we
> return -ENOENT from do_lookup. What should happen instead is an attempt to
> allocate and lookup a new dentry.
> 
> This is probably not noticed because it is rare. It is only reached if a
> concurrent create races in first (in which case, the dentry probably won't be
> invalidated anyway), or if the racy __d_lookup has failed due to a
> false-negative (which is very rare).
> 
> Fix this by removing code and have it use the normal reval path.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>

FWIW, I was hitting this bug with Ceph a while back and proposed a 
different fix for it: instead of dropping the lock, my patch did the 
revalidation under i_mutex.  Unfortunately that conflicts with autofs 
shenanigans and is still sitting in Al's tree awaiting some resolution 
there, see

 http://git.kernel.org/?p=linux/kernel/git/viro/vfs-2.6.git;a=commit;h=1f24668c6c673e2e74fb77ff6ef5a07651c3bd10 
 
This approach is simpler and looks okay to me.

Reviewed-by: Sage Weil <sage@newdream.net>

sage


> 
> ---
>  fs/namei.c |   10 ++--------
>  1 file changed, 2 insertions(+), 8 deletions(-)
> 
> Index: linux-2.6/fs/namei.c
> ===================================================================
> --- linux-2.6.orig/fs/namei.c	2010-08-18 04:04:18.000000000 +1000
> +++ linux-2.6/fs/namei.c	2010-08-18 04:05:15.000000000 +1000
> @@ -709,6 +709,7 @@ static int do_lookup(struct nameidata *n
>  	dentry = __d_lookup(nd->path.dentry, name);
>  	if (!dentry)
>  		goto need_lookup;
> +found:
>  	if (dentry->d_op && dentry->d_op->d_revalidate)
>  		goto need_revalidate;
>  done:
> @@ -766,14 +767,7 @@ out_unlock:
>  	 * we waited on the semaphore. Need to revalidate.
>  	 */
>  	mutex_unlock(&dir->i_mutex);
> -	if (dentry->d_op && dentry->d_op->d_revalidate) {
> -		dentry = do_revalidate(dentry, nd);
> -		if (!dentry)
> -			dentry = ERR_PTR(-ENOENT);
> -	}
> -	if (IS_ERR(dentry))
> -		goto fail;
> -	goto done;
> +	goto found;
>  
>  need_revalidate:
>  	dentry = do_revalidate(dentry, nd);
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 04/10] fs: fs_struct rwlock to spinlock
  2010-08-17 18:37 ` [patch 04/10] fs: fs_struct rwlock to spinlock Nick Piggin
@ 2010-08-17 23:14   ` Valerie Aurora
  2010-08-20 10:05     ` Nick Piggin
  0 siblings, 1 reply; 25+ messages in thread
From: Valerie Aurora @ 2010-08-17 23:14 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Al Viro, linux-fsdevel

On Wed, Aug 18, 2010 at 04:37:33AM +1000, Nick Piggin wrote:
> fs: fs_struct rwlock to spinlock
> 
> struct fs_struct.lock is an rwlock with the read-side used to protect root and
> pwd members while taking references to them. Taking a reference to a path
> typically requires just 2 atomic ops, so the critical section is very small.
> Parallel read-side operations would have cacheline contention on the lock, the
> dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a
> real parallelism increase.
> 
> Replace it with a spinlock to avoid one or two atomic operations in typical
> path lookup fastpath.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> ---
>  drivers/staging/pohmelfs/path_entry.c |    8 ++++----
>  fs/exec.c                             |    4 ++--
>  fs/fs_struct.c                        |   32 ++++++++++++++++----------------
>  include/linux/fs_struct.h             |   14 +++++++-------
>  kernel/fork.c                         |   10 +++++-----
>  5 files changed, 34 insertions(+), 34 deletions(-)
> 
> Index: linux-2.6/fs/fs_struct.c
> ===================================================================
> --- linux-2.6.orig/fs/fs_struct.c	2010-08-18 04:04:01.000000000 +1000
> +++ linux-2.6/fs/fs_struct.c	2010-08-18 04:04:29.000000000 +1000
> @@ -13,11 +13,11 @@ void set_fs_root(struct fs_struct *fs, s
>  {
>  	struct path old_root;
>  
> -	write_lock(&fs->lock);
> +	spin_lock(&fs->lock);
>  	old_root = fs->root;
>  	fs->root = *path;
>  	path_get(path);
> -	write_unlock(&fs->lock);
> +	spin_unlock(&fs->lock);
>  	if (old_root.dentry)
>  		path_put(&old_root);
>  }
> @@ -30,11 +30,11 @@ void set_fs_pwd(struct fs_struct *fs, st
>  {
>  	struct path old_pwd;
>  
> -	write_lock(&fs->lock);
> +	spin_lock(&fs->lock);
>  	old_pwd = fs->pwd;
>  	fs->pwd = *path;
>  	path_get(path);
> -	write_unlock(&fs->lock);
> +	spin_unlock(&fs->lock);
>  
>  	if (old_pwd.dentry)
>  		path_put(&old_pwd);
> @@ -51,7 +51,7 @@ void chroot_fs_refs(struct path *old_roo
>  		task_lock(p);
>  		fs = p->fs;
>  		if (fs) {
> -			write_lock(&fs->lock);
> +			spin_lock(&fs->lock);
>  			if (fs->root.dentry == old_root->dentry
>  			    && fs->root.mnt == old_root->mnt) {
>  				path_get(new_root);
> @@ -64,7 +64,7 @@ void chroot_fs_refs(struct path *old_roo
>  				fs->pwd = *new_root;
>  				count++;
>  			}
> -			write_unlock(&fs->lock);
> +			spin_unlock(&fs->lock);
>  		}
>  		task_unlock(p);
>  	} while_each_thread(g, p);
> @@ -87,10 +87,10 @@ void exit_fs(struct task_struct *tsk)
>  	if (fs) {
>  		int kill;
>  		task_lock(tsk);
> -		write_lock(&fs->lock);
> +		spin_lock(&fs->lock);
>  		tsk->fs = NULL;
>  		kill = !--fs->users;
> -		write_unlock(&fs->lock);
> +		spin_unlock(&fs->lock);
>  		task_unlock(tsk);
>  		if (kill)
>  			free_fs_struct(fs);
> @@ -104,7 +104,7 @@ struct fs_struct *copy_fs_struct(struct
>  	if (fs) {
>  		fs->users = 1;
>  		fs->in_exec = 0;
> -		rwlock_init(&fs->lock);
> +		spin_lock_init(&fs->lock);
>  		fs->umask = old->umask;
>  		get_fs_root_and_pwd(old, &fs->root, &fs->pwd);
>  	}
> @@ -121,10 +121,10 @@ int unshare_fs_struct(void)
>  		return -ENOMEM;
>  
>  	task_lock(current);
> -	write_lock(&fs->lock);
> +	spin_lock(&fs->lock);
>  	kill = !--fs->users;
>  	current->fs = new_fs;
> -	write_unlock(&fs->lock);
> +	spin_unlock(&fs->lock);
>  	task_unlock(current);
>  
>  	if (kill)
> @@ -143,7 +143,7 @@ EXPORT_SYMBOL(current_umask);
>  /* to be mentioned only in INIT_TASK */
>  struct fs_struct init_fs = {
>  	.users		= 1,
> -	.lock		= __RW_LOCK_UNLOCKED(init_fs.lock),
> +	.lock		= __SPIN_LOCK_UNLOCKED(init_fs.lock),
>  	.umask		= 0022,
>  };
>  
> @@ -156,14 +156,14 @@ void daemonize_fs_struct(void)
>  
>  		task_lock(current);
>  
> -		write_lock(&init_fs.lock);
> +		spin_lock(&init_fs.lock);
>  		init_fs.users++;
> -		write_unlock(&init_fs.lock);
> +		spin_unlock(&init_fs.lock);
>  
> -		write_lock(&fs->lock);
> +		spin_lock(&fs->lock);
>  		current->fs = &init_fs;
>  		kill = !--fs->users;
> -		write_unlock(&fs->lock);
> +		spin_unlock(&fs->lock);
>  
>  		task_unlock(current);
>  		if (kill)
> Index: linux-2.6/include/linux/fs_struct.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs_struct.h	2010-08-18 04:04:01.000000000 +1000
> +++ linux-2.6/include/linux/fs_struct.h	2010-08-18 04:04:29.000000000 +1000
> @@ -5,7 +5,7 @@
>  
>  struct fs_struct {
>  	int users;
> -	rwlock_t lock;
> +	spinlock_t lock;
>  	int umask;
>  	int in_exec;
>  	struct path root, pwd;
> @@ -23,29 +23,29 @@ extern int unshare_fs_struct(void);
>  
>  static inline void get_fs_root(struct fs_struct *fs, struct path *root)
>  {
> -	read_lock(&fs->lock);
> +	spin_lock(&fs->lock);
>  	*root = fs->root;
>  	path_get(root);
> -	read_unlock(&fs->lock);
> +	spin_unlock(&fs->lock);
>  }
>  
>  static inline void get_fs_pwd(struct fs_struct *fs, struct path *pwd)
>  {
> -	read_lock(&fs->lock);
> +	spin_lock(&fs->lock);
>  	*pwd = fs->pwd;
>  	path_get(pwd);
> -	read_unlock(&fs->lock);
> +	spin_unlock(&fs->lock);
>  }
>  
>  static inline void get_fs_root_and_pwd(struct fs_struct *fs, struct path *root,
>  				       struct path *pwd)
>  {
> -	read_lock(&fs->lock);
> +	spin_lock(&fs->lock);
>  	*root = fs->root;
>  	path_get(root);
>  	*pwd = fs->pwd;
>  	path_get(pwd);
> -	read_unlock(&fs->lock);
> +	spin_unlock(&fs->lock);
>  }
>  
>  #endif /* _LINUX_FS_STRUCT_H */
> Index: linux-2.6/fs/exec.c
> ===================================================================
> --- linux-2.6.orig/fs/exec.c	2010-08-18 04:04:01.000000000 +1000
> +++ linux-2.6/fs/exec.c	2010-08-18 04:04:29.000000000 +1000
> @@ -1117,7 +1117,7 @@ int check_unsafe_exec(struct linux_binpr
>  	bprm->unsafe = tracehook_unsafe_exec(p);
>  
>  	n_fs = 1;
> -	write_lock(&p->fs->lock);
> +	spin_lock(&p->fs->lock);
>  	rcu_read_lock();
>  	for (t = next_thread(p); t != p; t = next_thread(t)) {
>  		if (t->fs == p->fs)
> @@ -1134,7 +1134,7 @@ int check_unsafe_exec(struct linux_binpr
>  			res = 1;
>  		}
>  	}
> -	write_unlock(&p->fs->lock);
> +	spin_unlock(&p->fs->lock);
>  
>  	return res;
>  }
> Index: linux-2.6/kernel/fork.c
> ===================================================================
> --- linux-2.6.orig/kernel/fork.c	2010-08-18 04:04:01.000000000 +1000
> +++ linux-2.6/kernel/fork.c	2010-08-18 04:04:29.000000000 +1000
> @@ -752,13 +752,13 @@ static int copy_fs(unsigned long clone_f
>  	struct fs_struct *fs = current->fs;
>  	if (clone_flags & CLONE_FS) {
>  		/* tsk->fs is already what we want */
> -		write_lock(&fs->lock);
> +		spin_lock(&fs->lock);
>  		if (fs->in_exec) {
> -			write_unlock(&fs->lock);
> +			spin_unlock(&fs->lock);
>  			return -EAGAIN;
>  		}
>  		fs->users++;
> -		write_unlock(&fs->lock);
> +		spin_unlock(&fs->lock);
>  		return 0;
>  	}
>  	tsk->fs = copy_fs_struct(fs);
> @@ -1676,13 +1676,13 @@ SYSCALL_DEFINE1(unshare, unsigned long,
>  
>  		if (new_fs) {
>  			fs = current->fs;
> -			write_lock(&fs->lock);
> +			spin_lock(&fs->lock);
>  			current->fs = new_fs;
>  			if (--fs->users)
>  				new_fs = NULL;
>  			else
>  				new_fs = fs;
> -			write_unlock(&fs->lock);
> +			spin_unlock(&fs->lock);
>  		}
>  
>  		if (new_mm) {
> Index: linux-2.6/drivers/staging/pohmelfs/path_entry.c
> ===================================================================
> --- linux-2.6.orig/drivers/staging/pohmelfs/path_entry.c	2010-08-18 04:04:01.000000000 +1000
> +++ linux-2.6/drivers/staging/pohmelfs/path_entry.c	2010-08-18 04:04:29.000000000 +1000
> @@ -44,9 +44,9 @@ int pohmelfs_construct_path_string(struc
>  		return -ENOENT;
>  	}
>  
> -	read_lock(&current->fs->lock);
> +	spin_lock(&current->fs->lock);
>  	path.mnt = mntget(current->fs->root.mnt);
> -	read_unlock(&current->fs->lock);
> +	spin_unlock(&current->fs->lock);
>  
>  	path.dentry = d;
>  
> @@ -91,9 +91,9 @@ int pohmelfs_path_length(struct pohmelfs
>  		return -ENOENT;
>  	}
>  
> -	read_lock(&current->fs->lock);
> +	spin_lock(&current->fs->lock);
>  	root = dget(current->fs->root.dentry);
> -	read_unlock(&current->fs->lock);
> +	spin_unlock(&current->fs->lock);
>  
>  	spin_lock(&dcache_lock);

Your reasoning makes sense to me.  Shared reader access seems very
unlikely whereas the cost of taking the lock is certain.

Reviewed-by: Valerie Aurora <vaurora@redhat.com>

-VAL


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 01/10] fs: fix do_lookup false negative
  2010-08-17 18:37 ` [patch 01/10] fs: fix do_lookup false negative Nick Piggin
  2010-08-17 22:45   ` Valerie Aurora
  2010-08-17 23:04   ` Sage Weil
@ 2010-08-18 13:41   ` Andi Kleen
  2 siblings, 0 replies; 25+ messages in thread
From: Andi Kleen @ 2010-08-18 13:41 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Al Viro, linux-fsdevel

Nick Piggin <npiggin@kernel.dk> writes:

> fs: fix do_lookup false negative
>
> In do_lookup, if we initially find no dentry, we take the directory i_mutex and
> re-check the lookup. If we find a dentry there, then we revalidate it if
> needed. However if that revalidate asks for the dentry to be invalidated, we
> return -ENOENT from do_lookup. What should happen instead is an attempt to
> allocate and lookup a new dentry.
>
> This is probably not noticed because it is rare. It is only reached if a
> concurrent create races in first (in which case, the dentry probably won't be
> invalidated anyway), or if the racy __d_lookup has failed due to a
> false-negative (which is very rare).
>
> Fix this by removing code and have it use the normal reval path.

Looks good, but a comment would be good.

Reviewed-by: Andi Kleen <ak@linux.intel.com>

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 05/10] fs: remove extra lookup in __lookup_hash
  2010-08-17 18:37 ` [patch 05/10] fs: remove extra lookup in __lookup_hash Nick Piggin
@ 2010-08-18 13:57   ` Andi Kleen
  2010-08-18 21:13     ` Andi Kleen
  2010-08-18 19:34   ` Valerie Aurora
  1 sibling, 1 reply; 25+ messages in thread
From: Andi Kleen @ 2010-08-18 13:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Al Viro, linux-fsdevel

Nick Piggin <npiggin@kernel.dk> writes:

> - Delete some boring legacy comments because we don't care much about how the
>   code used to work, more about the interesting parts of how it works now. So
>   comments about lazy LRU may be interesting, but would better be done in the
>   LRU or refcount management code.

It would have been nice if you had done all the comment changes in
another patch.

As far as I can see this is only a two-liner and it looks obviously
correct.

As a quick experiment I set a systemtap probe for this on my workstation

global first, second
probe kernel.function("*@fs/namei.c:1143") { first++ } 
probe kernel.function("*@fs/namei.c:1149") { second++ }
probe end { printf("first %d, second %d\n", first, second) } 

and did a quick kernel build, resulting in:

first 22753, second 22753

So yes it looks like the hit rate is about zero for the first case
and the change is good.

Reviewed-by: Andi Kleen <ak@linux.intel.com>

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 10/10] fs: brlock vfsmount_lock
  2010-08-17 18:37 ` [patch 10/10] fs: brlock vfsmount_lock Nick Piggin
@ 2010-08-18 14:05   ` Andi Kleen
  2010-08-20 10:09     ` Nick Piggin
  0 siblings, 1 reply; 25+ messages in thread
From: Andi Kleen @ 2010-08-18 14:05 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Al Viro, linux-fsdevel

Nick Piggin <npiggin@kernel.dk> writes:

BTW one way to make the slow path faster would be to start sharing per cpu
locks inside a core, on SMT at least. SMT siblings in the same core share the
same caches, so sharing cache lines between them is free. That would cut the
number of locks to take in half on a 2x HT system.
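
Roughly, and untested -- assuming topology_thread_cpumask() is usable in
this context, with demo_lock standing in for the per-cpu lock array:

static inline arch_spinlock_t *demo_lock_for_cpu(int cpu)
{
	/* hypothetical: all SMT siblings of a core share one lock */
	int owner = cpumask_first(topology_thread_cpumask(cpu));

	return &per_cpu(demo_lock, owner);
}

The global lock/unlock paths would then take each shared lock only once,
e.g. by only visiting cpus that are the first sibling of their core.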

> -
>  static int event;
>  static DEFINE_IDA(mnt_id_ida);
>  static DEFINE_IDA(mnt_group_ida);
> +static DEFINE_SPINLOCK(mnt_id_lock);

Can you add a scope comment to that lock? 

> @@ -623,39 +653,43 @@ static inline void __mntput(struct vfsmo
>  void mntput_no_expire(struct vfsmount *mnt)
>  {
>  repeat:
> -	if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
> -		if (likely(!mnt->mnt_pinned)) {
> -			spin_unlock(&vfsmount_lock);
> -			__mntput(mnt);
> -			return;
> -		}
> -		atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
> -		mnt->mnt_pinned = 0;
> -		spin_unlock(&vfsmount_lock);
> -		acct_auto_close_mnt(mnt);
> -		goto repeat;
> +	if (atomic_add_unless(&mnt->mnt_count, -1, 1))
> +		return;

Hmm that's a unrelated change?

The rest looks all good and quite straight forward

Reviewed-by: Andi Kleen <ak@linux.intel.com>
-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 05/10] fs: remove extra lookup in __lookup_hash
  2010-08-17 18:37 ` [patch 05/10] fs: remove extra lookup in __lookup_hash Nick Piggin
  2010-08-18 13:57   ` Andi Kleen
@ 2010-08-18 19:34   ` Valerie Aurora
  1 sibling, 0 replies; 25+ messages in thread
From: Valerie Aurora @ 2010-08-18 19:34 UTC (permalink / raw)
  To: Nick Piggin, Jan Blunck; +Cc: Al Viro, linux-fsdevel

On Wed, Aug 18, 2010 at 04:37:34AM +1000, Nick Piggin wrote:
> fs: remove extra lookup in __lookup_hash
> 
> Optimize lookup for create operations, where finding no dentry is often the
> common case. In cases where it is not, such as unlink, the added overhead
> is much smaller than the overhead removed.
> 
> Also, move comments about __d_lookup raciness to the __d_lookup call site.
> d_lookup is intuitive; __d_lookup is what needs commenting. So in that same
> vein, add kerneldoc comments to __d_lookup and clean up some of the comments:
> 
> - We are interested in how the RCU lookup works here, particularly with
>   renames. Make that explicit, and point to the document where it is explained
>   in more detail.
> - RCU is pretty standard now, and macros make implementations pretty mindless.
>   If we want to know about RCU barrier details, we look in RCU code.
> - Delete some boring legacy comments because we don't care much about how the
>   code used to work, more about the interesting parts of how it works now. So
>   comments about lazy LRU may be interesting, but would better be done in the
>   LRU or refcount management code.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> ---
>  fs/dcache.c |   60 +++++++++++++++++++++++++++++++++++-------------------------
>  fs/namei.c  |   32 ++++++++++++++++----------------
>  2 files changed, 51 insertions(+), 41 deletions(-)
> 
> Index: linux-2.6/fs/namei.c
> ===================================================================
> --- linux-2.6.orig/fs/namei.c	2010-08-18 04:04:29.000000000 +1000
> +++ linux-2.6/fs/namei.c	2010-08-18 04:05:07.000000000 +1000
> @@ -735,6 +735,11 @@ static int do_lookup(struct nameidata *n
>  			return err;
>  	}
>  
> +	/*
> +	 * Rename seqlock is not required here because in the off chance
> +	 * of a false negative due to a concurrent rename, we're going to
> +	 * do the non-racy lookup, below.
> +	 */
>  	dentry = __d_lookup(nd->path.dentry, name);
>  	if (!dentry)
>  		goto need_lookup;
> @@ -754,17 +759,13 @@ need_lookup:
>  	mutex_lock(&dir->i_mutex);
>  	/*
>  	 * First re-do the cached lookup just in case it was created
> -	 * while we waited for the directory semaphore..
> -	 *
> -	 * FIXME! This could use version numbering or similar to
> -	 * avoid unnecessary cache lookups.
> -	 *
> -	 * The "dcache_lock" is purely to protect the RCU list walker
> -	 * from concurrent renames at this point (we mustn't get false
> -	 * negatives from the RCU list walk here, unlike the optimistic
> -	 * fast walk).
> +	 * while we waited for the directory semaphore, or the first
> +	 * lookup failed due to an unrelated rename.
>  	 *
> -	 * so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
> +	 * This could use version numbering or similar to avoid unnecessary
> +	 * cache lookups, but then we'd have to do the first lookup in the
> +	 * non-racy way. However in the common case here, everything should
> +	 * be hot in cache, so would it be a big win?
>  	 */
>  	dentry = d_lookup(parent, name);
>  	if (likely(!dentry)) {
> @@ -1136,13 +1137,12 @@ static struct dentry *__lookup_hash(stru
>  			goto out;
>  	}
>  
> -	dentry = __d_lookup(base, name);
> -
> -	/* lockess __d_lookup may fail due to concurrent d_move()
> -	 * in some unrelated directory, so try with d_lookup
> +	/*
> +	 * Don't bother with __d_lookup: callers are for creat as
> +	 * well as unlink, so a lot of the time it would cost
> +	 * a double lookup.
>  	 */
> -	if (!dentry)
> -		dentry = d_lookup(base, name);
> +	dentry = d_lookup(base, name);
>  
>  	if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
>  		dentry = do_revalidate(dentry, nd);
> Index: linux-2.6/fs/dcache.c
> ===================================================================
> --- linux-2.6.orig/fs/dcache.c	2010-08-18 04:04:01.000000000 +1000
> +++ linux-2.6/fs/dcache.c	2010-08-18 04:05:07.000000000 +1000
> @@ -1332,31 +1332,13 @@ EXPORT_SYMBOL(d_add_ci);
>   * d_lookup - search for a dentry
>   * @parent: parent dentry
>   * @name: qstr of name we wish to find
> + * Returns: dentry, or NULL
>   *
> - * Searches the children of the parent dentry for the name in question. If
> - * the dentry is found its reference count is incremented and the dentry
> - * is returned. The caller must use dput to free the entry when it has
> - * finished using it. %NULL is returned on failure.
> - *
> - * __d_lookup is dcache_lock free. The hash list is protected using RCU.
> - * Memory barriers are used while updating and doing lockless traversal. 
> - * To avoid races with d_move while rename is happening, d_lock is used.
> - *
> - * Overflows in memcmp(), while d_move, are avoided by keeping the length
> - * and name pointer in one structure pointed by d_qstr.
> - *
> - * rcu_read_lock() and rcu_read_unlock() are used to disable preemption while
> - * lookup is going on.
> - *
> - * The dentry unused LRU is not updated even if lookup finds the required dentry
> - * in there. It is updated in places such as prune_dcache, shrink_dcache_sb,
> - * select_parent and __dget_locked. This laziness saves lookup from dcache_lock
> - * acquisition.
> - *
> - * d_lookup() is protected against the concurrent renames in some unrelated
> - * directory using the seqlockt_t rename_lock.
> + * d_lookup searches the children of the parent dentry for the name in
> + * question. If the dentry is found its reference count is incremented and the
> + * dentry is returned. The caller must use dput to free the entry when it has
> + * finished using it. %NULL is returned if the dentry does not exist.
>   */
> -
>  struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
>  {
>  	struct dentry * dentry = NULL;
> @@ -1372,6 +1354,21 @@ struct dentry * d_lookup(struct dentry *
>  }
>  EXPORT_SYMBOL(d_lookup);
>  
> +/*
> + * __d_lookup - search for a dentry (racy)
> + * @parent: parent dentry
> + * @name: qstr of name we wish to find
> + * Returns: dentry, or NULL
> + *
> + * __d_lookup is like d_lookup, however it may (rarely) return a
> + * false-negative result due to unrelated rename activity.
> + *
> + * __d_lookup is slightly faster by avoiding rename_lock read seqlock,
> + * however it must be used carefully, eg. with a following d_lookup in
> + * the case of failure.
> + *
> + * __d_lookup callers must be commented.
> + */
>  struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
>  {
>  	unsigned int len = name->len;
> @@ -1382,6 +1379,19 @@ struct dentry * __d_lookup(struct dentry
>  	struct hlist_node *node;
>  	struct dentry *dentry;
>  
> +	/*
> +	 * The hash list is protected using RCU.
> +	 *
> +	 * Take d_lock when comparing a candidate dentry, to avoid races
> +	 * with d_move().
> +	 *
> +	 * It is possible that concurrent renames can mess up our list
> +	 * walk here and result in missing our dentry, resulting in the
> +	 * false-negative result. d_lookup() protects against concurrent
> +	 * renames using rename_lock seqlock.
> +	 *
> +	 * See Documentation/vfs/dcache-locking.txt for more details.
> +	 */
>  	rcu_read_lock();
>  	
>  	hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
> @@ -1396,8 +1406,8 @@ struct dentry * __d_lookup(struct dentry
>  
>  		/*
>  		 * Recheck the dentry after taking the lock - d_move may have
> -		 * changed things.  Don't bother checking the hash because we're
> -		 * about to compare the whole name anyway.
> +		 * changed things. Don't bother checking the hash because
> +		 * we're about to compare the whole name anyway.
>  		 */
>  		if (dentry->d_parent != parent)
>  			goto next;

Jan Blunck (cc'd) wrote a similar patch that I ended up dropping from
the union mount queue just to make my life easier.

This makes sense; it's far more likely that the dentry doesn't exist
than that we collided with a d_move(), in which case the second locked
lookup is totally wasted effort.  And the seqlock is low cost anyway.
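
For reference, d_lookup() is essentially just __d_lookup() wrapped in
the rename_lock read seqlock, so the non-racy version costs one extra
sequence-counter read, with retries only while a rename is actually in
flight. Simplified from fs/dcache.c:

struct dentry *d_lookup(struct dentry *parent, struct qstr *name)
{
	struct dentry *dentry = NULL;
	unsigned seq;

	do {
		seq = read_seqbegin(&rename_lock);
		dentry = __d_lookup(parent, name);
		if (dentry)
			break;
	} while (read_seqretry(&rename_lock, seq));
	return dentry;
}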

Reviewed-by: Valerie Aurora <vaurora@redhat.com>

-VAL


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 06/10] fs: cleanup files_lock locking
  2010-08-17 18:37 ` [patch 06/10] fs: cleanup files_lock locking Nick Piggin
@ 2010-08-18 19:46   ` Valerie Aurora
  0 siblings, 0 replies; 25+ messages in thread
From: Valerie Aurora @ 2010-08-18 19:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Al Viro, linux-fsdevel, linux-kernel, Christoph Hellwig, Alan Cox,
	Andi Kleen, Greg Kroah-Hartman

On Wed, Aug 18, 2010 at 04:37:35AM +1000, Nick Piggin wrote:
> fs: cleanup files_lock locking
> 
> Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
> manipulate the per-sb files list; unexport the files_lock spinlock.
> 
> Cc: linux-kernel@vger.kernel.org
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
> Acked-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> ---
>  drivers/char/pty.c       |    6 +++++-
>  drivers/char/tty_io.c    |   26 ++++++++++++++++++--------
>  fs/file_table.c          |   42 ++++++++++++++++++------------------------
>  fs/open.c                |    4 ++--
>  include/linux/fs.h       |    7 ++-----
>  include/linux/tty.h      |    1 +
>  security/selinux/hooks.c |    4 ++--
>  7 files changed, 48 insertions(+), 42 deletions(-)
> 
> Index: linux-2.6/security/selinux/hooks.c
> ===================================================================
> --- linux-2.6.orig/security/selinux/hooks.c	2010-08-18 04:04:01.000000000 +1000
> +++ linux-2.6/security/selinux/hooks.c	2010-08-18 04:05:10.000000000 +1000
> @@ -2170,7 +2170,7 @@ static inline void flush_unauthorized_fi
>  
>  	tty = get_current_tty();
>  	if (tty) {
> -		file_list_lock();
> +		spin_lock(&tty_files_lock);
>  		if (!list_empty(&tty->tty_files)) {
>  			struct inode *inode;
>  
> @@ -2186,7 +2186,7 @@ static inline void flush_unauthorized_fi
>  				drop_tty = 1;
>  			}
>  		}
> -		file_list_unlock();
> +		spin_unlock(&tty_files_lock);
>  		tty_kref_put(tty);
>  	}
>  	/* Reset controlling tty. */
> Index: linux-2.6/drivers/char/pty.c
> ===================================================================
> --- linux-2.6.orig/drivers/char/pty.c	2010-08-18 04:04:01.000000000 +1000
> +++ linux-2.6/drivers/char/pty.c	2010-08-18 04:05:10.000000000 +1000
> @@ -676,7 +676,11 @@ static int ptmx_open(struct inode *inode
>  
>  	set_bit(TTY_PTY_LOCK, &tty->flags); /* LOCK THE SLAVE */
>  	filp->private_data = tty;
> -	file_move(filp, &tty->tty_files);
> +
> +	file_sb_list_del(filp); /* __dentry_open has put it on the sb list */
> +	spin_lock(&tty_files_lock);
> +	list_add(&filp->f_u.fu_list, &tty->tty_files);
> +	spin_unlock(&tty_files_lock);
>  
>  	retval = devpts_pty_new(inode, tty->link);
>  	if (retval)
> Index: linux-2.6/drivers/char/tty_io.c
> ===================================================================
> --- linux-2.6.orig/drivers/char/tty_io.c	2010-08-18 04:04:01.000000000 +1000
> +++ linux-2.6/drivers/char/tty_io.c	2010-08-18 04:05:10.000000000 +1000
> @@ -136,6 +136,9 @@ LIST_HEAD(tty_drivers);			/* linked list
>  DEFINE_MUTEX(tty_mutex);
>  EXPORT_SYMBOL(tty_mutex);
>  
> +/* Spinlock to protect the tty->tty_files list */
> +DEFINE_SPINLOCK(tty_files_lock);
> +
>  static ssize_t tty_read(struct file *, char __user *, size_t, loff_t *);
>  static ssize_t tty_write(struct file *, const char __user *, size_t, loff_t *);
>  ssize_t redirected_tty_write(struct file *, const char __user *,
> @@ -235,11 +238,11 @@ static int check_tty_count(struct tty_st
>  	struct list_head *p;
>  	int count = 0;
>  
> -	file_list_lock();
> +	spin_lock(&tty_files_lock);
>  	list_for_each(p, &tty->tty_files) {
>  		count++;
>  	}
> -	file_list_unlock();
> +	spin_unlock(&tty_files_lock);
>  	if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
>  	    tty->driver->subtype == PTY_TYPE_SLAVE &&
>  	    tty->link && tty->link->count)
> @@ -519,7 +522,7 @@ void __tty_hangup(struct tty_struct *tty
>  	   workqueue with the lock held */
>  	check_tty_count(tty, "tty_hangup");
>  
> -	file_list_lock();
> +	spin_lock(&tty_files_lock);
>  	/* This breaks for file handles being sent over AF_UNIX sockets ? */
>  	list_for_each_entry(filp, &tty->tty_files, f_u.fu_list) {
>  		if (filp->f_op->write == redirected_tty_write)
> @@ -530,7 +533,7 @@ void __tty_hangup(struct tty_struct *tty
>  		__tty_fasync(-1, filp, 0);	/* can't block */
>  		filp->f_op = &hung_up_tty_fops;
>  	}
> -	file_list_unlock();
> +	spin_unlock(&tty_files_lock);
>  
>  	tty_ldisc_hangup(tty);
>  
> @@ -1424,9 +1427,9 @@ static void release_one_tty(struct work_
>  	tty_driver_kref_put(driver);
>  	module_put(driver->owner);
>  
> -	file_list_lock();
> +	spin_lock(&tty_files_lock);
>  	list_del_init(&tty->tty_files);
> -	file_list_unlock();
> +	spin_unlock(&tty_files_lock);
>  
>  	put_pid(tty->pgrp);
>  	put_pid(tty->session);
> @@ -1671,7 +1674,10 @@ int tty_release(struct inode *inode, str
>  	 *  - do_tty_hangup no longer sees this file descriptor as
>  	 *    something that needs to be handled for hangups.
>  	 */
> -	file_kill(filp);
> +	spin_lock(&tty_files_lock);
> +	BUG_ON(list_empty(&filp->f_u.fu_list));
> +	list_del_init(&filp->f_u.fu_list);
> +	spin_unlock(&tty_files_lock);
>  	filp->private_data = NULL;
>  
>  	/*
> @@ -1840,7 +1846,11 @@ got_driver:
>  	}
>  
>  	filp->private_data = tty;
> -	file_move(filp, &tty->tty_files);
> +	BUG_ON(list_empty(&filp->f_u.fu_list));
> +	file_sb_list_del(filp); /* __dentry_open has put it on the sb list */
> +	spin_lock(&tty_files_lock);
> +	list_add(&filp->f_u.fu_list, &tty->tty_files);
> +	spin_unlock(&tty_files_lock);
>  	check_tty_count(tty, "tty_open");
>  	if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
>  	    tty->driver->subtype == PTY_TYPE_MASTER)
> Index: linux-2.6/fs/file_table.c
> ===================================================================
> --- linux-2.6.orig/fs/file_table.c	2010-08-18 04:04:01.000000000 +1000
> +++ linux-2.6/fs/file_table.c	2010-08-18 04:05:09.000000000 +1000
> @@ -32,8 +32,7 @@ struct files_stat_struct files_stat = {
>  	.max_files = NR_FILE
>  };
>  
> -/* public. Not pretty! */
> -__cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
> +static __cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
>  
>  /* SLAB cache for file structures */
>  static struct kmem_cache *filp_cachep __read_mostly;
> @@ -249,7 +248,7 @@ static void __fput(struct file *file)
>  		cdev_put(inode->i_cdev);
>  	fops_put(file->f_op);
>  	put_pid(file->f_owner.pid);
> -	file_kill(file);
> +	file_sb_list_del(file);
>  	if (file->f_mode & FMODE_WRITE)
>  		drop_file_write_access(file);
>  	file->f_path.dentry = NULL;
> @@ -328,31 +327,29 @@ struct file *fget_light(unsigned int fd,
>  	return file;
>  }
>  
> -
>  void put_filp(struct file *file)
>  {
>  	if (atomic_long_dec_and_test(&file->f_count)) {
>  		security_file_free(file);
> -		file_kill(file);
> +		file_sb_list_del(file);
>  		file_free(file);
>  	}
>  }
>  
> -void file_move(struct file *file, struct list_head *list)
> +void file_sb_list_add(struct file *file, struct super_block *sb)
>  {
> -	if (!list)
> -		return;
> -	file_list_lock();
> -	list_move(&file->f_u.fu_list, list);
> -	file_list_unlock();
> +	spin_lock(&files_lock);
> +	BUG_ON(!list_empty(&file->f_u.fu_list));
> +	list_add(&file->f_u.fu_list, &sb->s_files);
> +	spin_unlock(&files_lock);
>  }
>  
> -void file_kill(struct file *file)
> +void file_sb_list_del(struct file *file)
>  {
>  	if (!list_empty(&file->f_u.fu_list)) {
> -		file_list_lock();
> +		spin_lock(&files_lock);
>  		list_del_init(&file->f_u.fu_list);
> -		file_list_unlock();
> +		spin_unlock(&files_lock);
>  	}
>  }
>  
> @@ -361,7 +358,7 @@ int fs_may_remount_ro(struct super_block
>  	struct file *file;
>  
>  	/* Check that no files are currently opened for writing. */
> -	file_list_lock();
> +	spin_lock(&files_lock);
>  	list_for_each_entry(file, &sb->s_files, f_u.fu_list) {
>  		struct inode *inode = file->f_path.dentry->d_inode;
>  
> @@ -373,10 +370,10 @@ int fs_may_remount_ro(struct super_block
>  		if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
>  			goto too_bad;
>  	}
> -	file_list_unlock();
> +	spin_unlock(&files_lock);
>  	return 1; /* Tis' cool bro. */
>  too_bad:
> -	file_list_unlock();
> +	spin_unlock(&files_lock);
>  	return 0;
>  }
>  
> @@ -392,7 +389,7 @@ void mark_files_ro(struct super_block *s
>  	struct file *f;
>  
>  retry:
> -	file_list_lock();
> +	spin_lock(&files_lock);
>  	list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
>  		struct vfsmount *mnt;
>  		if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
> @@ -408,16 +405,13 @@ retry:
>  			continue;
>  		file_release_write(f);
>  		mnt = mntget(f->f_path.mnt);
> -		file_list_unlock();
> -		/*
> -		 * This can sleep, so we can't hold
> -		 * the file_list_lock() spinlock.
> -		 */
> +		/* This can sleep, so we can't hold the spinlock. */
> +		spin_unlock(&files_lock);
>  		mnt_drop_write(mnt);
>  		mntput(mnt);
>  		goto retry;
>  	}
> -	file_list_unlock();
> +	spin_unlock(&files_lock);
>  }
>  
>  void __init files_init(unsigned long mempages)
> Index: linux-2.6/fs/open.c
> ===================================================================
> --- linux-2.6.orig/fs/open.c	2010-08-18 04:04:01.000000000 +1000
> +++ linux-2.6/fs/open.c	2010-08-18 04:04:29.000000000 +1000
> @@ -675,7 +675,7 @@ static struct file *__dentry_open(struct
>  	f->f_path.mnt = mnt;
>  	f->f_pos = 0;
>  	f->f_op = fops_get(inode->i_fop);
> -	file_move(f, &inode->i_sb->s_files);
> +	file_sb_list_add(f, inode->i_sb);
>  
>  	error = security_dentry_open(f, cred);
>  	if (error)
> @@ -721,7 +721,7 @@ cleanup_all:
>  			mnt_drop_write(mnt);
>  		}
>  	}
> -	file_kill(f);
> +	file_sb_list_del(f);
>  	f->f_path.dentry = NULL;
>  	f->f_path.mnt = NULL;
>  cleanup_file:
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h	2010-08-18 04:04:01.000000000 +1000
> +++ linux-2.6/include/linux/fs.h	2010-08-18 04:05:10.000000000 +1000
> @@ -953,9 +953,6 @@ struct file {
>  	unsigned long f_mnt_write_state;
>  #endif
>  };
> -extern spinlock_t files_lock;
> -#define file_list_lock() spin_lock(&files_lock);
> -#define file_list_unlock() spin_unlock(&files_lock);
>  
>  #define get_file(x)	atomic_long_inc(&(x)->f_count)
>  #define fput_atomic(x)	atomic_long_add_unless(&(x)->f_count, -1, 1)
> @@ -2197,8 +2194,8 @@ static inline void insert_inode_hash(str
>  	__insert_inode_hash(inode, inode->i_ino);
>  }
>  
> -extern void file_move(struct file *f, struct list_head *list);
> -extern void file_kill(struct file *f);
> +extern void file_sb_list_add(struct file *f, struct super_block *sb);
> +extern void file_sb_list_del(struct file *f);
>  #ifdef CONFIG_BLOCK
>  extern void submit_bio(int, struct bio *);
>  extern int bdev_read_only(struct block_device *);
> Index: linux-2.6/include/linux/tty.h
> ===================================================================
> --- linux-2.6.orig/include/linux/tty.h	2010-08-18 04:04:01.000000000 +1000
> +++ linux-2.6/include/linux/tty.h	2010-08-18 04:05:10.000000000 +1000
> @@ -470,6 +470,7 @@ extern struct tty_struct *tty_pair_get_t
>  extern struct tty_struct *tty_pair_get_pty(struct tty_struct *tty);
>  
>  extern struct mutex tty_mutex;
> +extern spinlock_t tty_files_lock;
>  
>  extern void tty_write_unlock(struct tty_struct *tty);
>  extern int tty_write_lock(struct tty_struct *tty, int ndelay);

Looks good.

Reviewed-by: Valerie Aurora <vaurora@redhat.com>

-VAL

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 05/10] fs: remove extra lookup in __lookup_hash
  2010-08-18 13:57   ` Andi Kleen
@ 2010-08-18 21:13     ` Andi Kleen
  0 siblings, 0 replies; 25+ messages in thread
From: Andi Kleen @ 2010-08-18 21:13 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Al Viro, linux-fsdevel

Andi Kleen <andi@firstfloor.org> writes:
>
> and did a quick kernel build, resulting in:
>
> first 22753, second 22753
>
> So yes it looks like the hit rate is about zero for the first case
> and the change is good.

I was informed that my script was buggy; here are updated numbers.

hits: 22899, cycles: 59min/1771avg/223268max
hits: 2784, cycles: 88min/421avg/1110max

So it seems to be still rather uncommon, but not as uncommon as the 
first output suggested.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 04/10] fs: fs_struct rwlock to spinlock
  2010-08-17 23:14   ` Valerie Aurora
@ 2010-08-20 10:05     ` Nick Piggin
  0 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2010-08-20 10:05 UTC (permalink / raw)
  To: Valerie Aurora; +Cc: Nick Piggin, Al Viro, linux-fsdevel

On Tue, Aug 17, 2010 at 07:14:42PM -0400, Valerie Aurora wrote:
> On Wed, Aug 18, 2010 at 04:37:33AM +1000, Nick Piggin wrote:
> > -	read_lock(&current->fs->lock);
> > +	spin_lock(&current->fs->lock);
> >  	root = dget(current->fs->root.dentry);
> > -	read_unlock(&current->fs->lock);
> > +	spin_unlock(&current->fs->lock);
> >  
> >  	spin_lock(&dcache_lock);
> 
> Your reasoning makes sense to me.  Shared reader access seems very
> unlikely whereas the cost of taking the lock is certain.

Yes, a shared reader lock will only help if we have multiple threads
inside the critical section at the same time, and if they can actually
get any extra parallelism (which they can't, because all they're doing
here is hitting another contended cacheline).

I doubt this will change scalability at all, but it will improve single
threaded performance a little bit.

With the store-free path walk, I can actually do some tricks to entirely
remove the fs->lock spinlock in the common case here (using a seqlock
instead). Combined with avoiding refcounts on the cwd, this avoids all
the contention between threads.
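
A hypothetical sketch of that idea (none of these names are in this
series; it assumes fs_struct grows a seqcount that the writers bump):

static struct path get_fs_root_lockless(struct fs_struct *fs)
{
	struct path root;
	unsigned seq;

	do {
		seq = read_seqcount_begin(&fs->seq);
		root = fs->root;	/* no dget: caller holds rcu_read_lock() */
	} while (read_seqcount_retry(&fs->seq, seq));
	return root;
}

Readers would never touch fs->lock or the root dentry's refcount, so
concurrent path walks stop bouncing those cachelines entirely.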


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 10/10] fs: brlock vfsmount_lock
  2010-08-18 14:05   ` Andi Kleen
@ 2010-08-20 10:09     ` Nick Piggin
  0 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2010-08-20 10:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Nick Piggin, Al Viro, linux-fsdevel

On Wed, Aug 18, 2010 at 04:05:39PM +0200, Andi Kleen wrote:
> Nick Piggin <npiggin@kernel.dk> writes:
> 
> BTW, one way to make the slow path faster would be to start sharing the
> per-cpu locks between SMT siblings within a core. Threads on the same
> core share the same caches, so sharing a cache line between them is
> free. That would cut the number of locks the slow path has to take in
> half on a 2x HT system.

Yes, it's possible; the brlock code is encapsulated, so you could
experiment. One problem is that the vfsmount lock gets held for read for
a relatively long time in the store-free path walk patches, so you could
get multiple threads contending on the now-shared lock.
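
Purely as a sketch of what that experiment might look like (nothing
like this is in the series; per_core_locks and MAX_CORES are made up):

static spinlock_t per_core_locks[MAX_CORES];	/* MAX_CORES is made up */

static inline spinlock_t *brlock_slot(int cpu)
{
	/* SMT siblings report the same core id, so they share a lock */
	return &per_core_locks[topology_core_id(cpu)];
}

The write side would then take one lock per core rather than one per
logical CPU, at the price of read-side contention between siblings.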

> 
> > -
> >  static int event;
> >  static DEFINE_IDA(mnt_id_ida);
> >  static DEFINE_IDA(mnt_group_ida);
> > +static DEFINE_SPINLOCK(mnt_id_lock);
> 
> Can you add a scope comment to that lock? 

It protects mnt_id_ida; I should have explicitly commented that.
I'll put a patch to do that at the head of the next queue I submit.
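
Something along these lines, presumably (illustrative only, not the
actual follow-up patch):

static DEFINE_IDA(mnt_id_ida);
static DEFINE_IDA(mnt_group_ida);
static DEFINE_SPINLOCK(mnt_id_lock);	/* protects mnt_id_ida */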

Thanks for reviewing.

> 
> > @@ -623,39 +653,43 @@ static inline void __mntput(struct vfsmo
> >  void mntput_no_expire(struct vfsmount *mnt)
> >  {
> >  repeat:
> > -	if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
> > -		if (likely(!mnt->mnt_pinned)) {
> > -			spin_unlock(&vfsmount_lock);
> > -			__mntput(mnt);
> > -			return;
> > -		}
> > -		atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
> > -		mnt->mnt_pinned = 0;
> > -		spin_unlock(&vfsmount_lock);
> > -		acct_auto_close_mnt(mnt);
> > -		goto repeat;
> > +	if (atomic_add_unless(&mnt->mnt_count, -1, 1))
> > +		return;
> 
> Hmm, that's an unrelated change?

It's because we don't have atomic_dec_and_br_lock()...
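
In other words, atomic_dec_and_lock() has no brlock equivalent, so the
lock-free decrement is open-coded. A minimal userspace sketch of the
atomic_add_unless() semantics and why it works as the fast path here
(the helper name mirrors the kernel's; this implementation does not):

#include <stdatomic.h>
#include <stdio.h>

/* Add a to *v unless *v == u; return 1 if the add was performed. */
static int atomic_add_unless(atomic_int *v, int a, int u)
{
	int c = atomic_load(v);

	while (c != u)
		if (atomic_compare_exchange_weak(v, &c, c + a))
			return 1;
	return 0;
}

int main(void)
{
	atomic_int count = 2;

	/* count > 1: drop a reference without ever touching the brlock */
	if (atomic_add_unless(&count, -1, 1))
		printf("fast path, count now %d\n", atomic_load(&count));

	/* count == 1: the caller falls through to the locked slow path */
	if (!atomic_add_unless(&count, -1, 1))
		printf("slow path: take vfsmount_lock for write, recheck\n");
	return 0;
}

So the common mntput_no_expire() never takes the brlock at all; only
dropping the final reference goes through the locked path.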

> 
> The rest all looks good and is quite straightforward.
> 
> Reviewed-by: Andi Kleen <ak@linux.intel.com>

Thanks,
Nick


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2010-08-20 10:09 UTC | newest]

Thread overview: 25+ messages
2010-08-17 18:37 [patch 00/10] first set of vfs scale patches Nick Piggin
2010-08-17 18:37 ` [patch 01/10] fs: fix do_lookup false negative Nick Piggin
2010-08-17 22:45   ` Valerie Aurora
2010-08-17 23:04   ` Sage Weil
2010-08-18 13:41   ` Andi Kleen
2010-08-17 18:37 ` [patch 02/10] fs: dentry allocation consolidation Nick Piggin
2010-08-17 22:45   ` Valerie Aurora
2010-08-17 18:37 ` [patch 03/10] apparmor: use task path helpers Nick Piggin
2010-08-17 22:59   ` Valerie Aurora
2010-08-17 18:37 ` [patch 04/10] fs: fs_struct rwlock to spinlock Nick Piggin
2010-08-17 23:14   ` Valerie Aurora
2010-08-20 10:05     ` Nick Piggin
2010-08-17 18:37 ` [patch 05/10] fs: remove extra lookup in __lookup_hash Nick Piggin
2010-08-18 13:57   ` Andi Kleen
2010-08-18 21:13     ` Andi Kleen
2010-08-18 19:34   ` Valerie Aurora
2010-08-17 18:37 ` [patch 06/10] fs: cleanup files_lock locking Nick Piggin
2010-08-18 19:46   ` Valerie Aurora
2010-08-17 18:37 ` [patch 07/10] tty: fix fu_list abuse Nick Piggin
2010-08-17 18:37 ` [patch 08/10] lglock: introduce special lglock and brlock spin locks Nick Piggin
2010-08-17 18:37 ` [patch 09/10] fs: scale files_lock Nick Piggin
2010-08-17 18:37 ` [patch 10/10] fs: brlock vfsmount_lock Nick Piggin
2010-08-18 14:05   ` Andi Kleen
2010-08-20 10:09     ` Nick Piggin
2010-08-17 21:14 ` [patch 00/10] first set of vfs scale patches Al Viro
