* [patch 00/27] [rfc] vfs scalability patchset
@ 2009-04-25  1:20 npiggin
  2009-04-25  1:20 ` [patch 01/27] fs: cleanup files_lock npiggin
                   ` (27 more replies)
  0 siblings, 28 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

Here is my current patchset for improving vfs locking scalability. Since the
last posting, I have fixed several bugs, solved several more problems, and
done an initial sweep of filesystems (autofs4 is probably the trickiest, and
unfortunately I don't have a good test setup for it here yet, but at least
I've looked through it).

Also started to tackle files_lock, vfsmount_lock, and inode_lock.
(I included my mnt_want_write patches before the vfsmount_lock scalability
stuff because that just made it a bit easier...). These appear to be the
problematic global locks in the vfs.

It's running stably so far under basic stress testing here on several
filesystems (xfs, tmpfs, ext?). But it still might eat your data, of course.

Would be very interested in any feedback.

Thanks,
Nick




* [patch 01/27] fs: cleanup files_lock
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  3:20   ` Al Viro
                     ` (2 more replies)
  2009-04-25  1:20 ` [patch 02/27] fs: scale files_lock npiggin
                   ` (26 subsequent siblings)
  27 siblings, 3 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-files_list-improve.patch --]
[-- Type: text/plain, Size: 11095 bytes --]

Lock tty_files with tty_mutex, provide helpers to manipulate the per-sb
files list, and unexport the files_lock spinlock.
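
To make the intended pairing of the new helpers concrete, here is a minimal
sketch (not part of the patch; the real call sites are in the hunks below --
__dentry_open() adds the file, __fput() and put_filp() remove it):

#include <linux/fs.h>

/* Sketch only: how the new helpers pair up over a file's lifetime. */
static void example_open(struct file *f, struct inode *inode)
{
	/* put the new file on its superblock's s_files list */
	file_sb_list_add(f, inode->i_sb);	/* takes files_lock internally */
}

static void example_release(struct file *f)
{
	/* take it off again; harmless if it is not on any list */
	file_list_del(f);
}
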
---
 drivers/char/pty.c       |    6 +++-
 drivers/char/tty_io.c    |   39 +++++++++++++++++++--------
 fs/file_table.c          |   66 ++++++++++++++++++++++++++++++++++-------------
 fs/open.c                |    4 +-
 fs/super.c               |   39 ---------------------------
 include/linux/fs.h       |    8 ++---
 security/selinux/hooks.c |    4 +-
 7 files changed, 89 insertions(+), 77 deletions(-)

Index: linux-2.6/drivers/char/pty.c
===================================================================
--- linux-2.6.orig/drivers/char/pty.c
+++ linux-2.6/drivers/char/pty.c
@@ -662,7 +662,11 @@ static int __ptmx_open(struct inode *ino
 
 	set_bit(TTY_PTY_LOCK, &tty->flags); /* LOCK THE SLAVE */
 	filp->private_data = tty;
-	file_move(filp, &tty->tty_files);
+
+	mutex_lock(&tty_mutex);
+	file_list_del(filp);
+	list_add(&filp->f_u.fu_list, &tty->tty_files);
+	mutex_unlock(&tty_mutex);
 
 	retval = devpts_pty_new(inode, tty->link);
 	if (retval)
Index: linux-2.6/drivers/char/tty_io.c
===================================================================
--- linux-2.6.orig/drivers/char/tty_io.c
+++ linux-2.6/drivers/char/tty_io.c
@@ -229,17 +229,15 @@ int tty_paranoia_check(struct tty_struct
 	return 0;
 }
 
-static int check_tty_count(struct tty_struct *tty, const char *routine)
+static int __check_tty_count(struct tty_struct *tty, const char *routine)
 {
 #ifdef CHECK_TTY_COUNT
 	struct list_head *p;
 	int count = 0;
 
-	file_list_lock();
 	list_for_each(p, &tty->tty_files) {
 		count++;
 	}
-	file_list_unlock();
 	if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
 	    tty->driver->subtype == PTY_TYPE_SLAVE &&
 	    tty->link && tty->link->count)
@@ -254,6 +252,19 @@ static int check_tty_count(struct tty_st
 	return 0;
 }
 
+static int check_tty_count(struct tty_struct *tty, const char *routine)
+{
+	int ret = 0;
+
+#ifdef CHECK_TTY_COUNT
+	mutex_lock(&tty_mutex);
+	ret = __check_tty_count(tty, routine);
+	mutex_unlock(&tty_mutex);
+#endif
+
+	return ret;
+}
+
 /**
  *	get_tty_driver		-	find device of a tty
  *	@dev_t: device identifier
@@ -543,6 +554,8 @@ static void do_tty_hangup(struct work_st
 	if (!tty)
 		return;
 
+	mutex_lock(&tty_mutex);
+
 	/* inuse_filps is protected by the single kernel lock */
 	lock_kernel();
 
@@ -553,8 +566,7 @@ static void do_tty_hangup(struct work_st
 	}
 	spin_unlock(&redirect_lock);
 
-	check_tty_count(tty, "do_tty_hangup");
-	file_list_lock();
+	__check_tty_count(tty, "do_tty_hangup");
 	/* This breaks for file handles being sent over AF_UNIX sockets ? */
 	list_for_each_entry(filp, &tty->tty_files, f_u.fu_list) {
 		if (filp->f_op->write == redirected_tty_write)
@@ -565,7 +577,7 @@ static void do_tty_hangup(struct work_st
 		tty_fasync(-1, filp, 0);	/* can't block */
 		filp->f_op = &hung_up_tty_fops;
 	}
-	file_list_unlock();
+	mutex_unlock(&tty_mutex);
 	/*
 	 * FIXME! What are the locking issues here? This may me overdoing
 	 * things... This question is especially important now that we've
@@ -1467,9 +1479,9 @@ static void release_one_tty(struct kref
 	tty_driver_kref_put(driver);
 	module_put(driver->owner);
 
-	file_list_lock();
+	mutex_lock(&tty_mutex);
 	list_del_init(&tty->tty_files);
-	file_list_unlock();
+	mutex_unlock(&tty_mutex);
 
 	free_tty_struct(tty);
 }
@@ -1678,7 +1690,8 @@ void tty_release_dev(struct file *filp)
 	 *  - do_tty_hangup no longer sees this file descriptor as
 	 *    something that needs to be handled for hangups.
 	 */
-	file_kill(filp);
+	BUG_ON(list_empty(&filp->f_u.fu_list));
+	list_del_init(&filp->f_u.fu_list);
 	filp->private_data = NULL;
 
 	/*
@@ -1836,8 +1849,12 @@ got_driver:
 		return PTR_ERR(tty);
 
 	filp->private_data = tty;
-	file_move(filp, &tty->tty_files);
-	check_tty_count(tty, "tty_open");
+	mutex_lock(&tty_mutex);
+	BUG_ON(list_empty(&filp->f_u.fu_list));
+	file_list_del(filp); /* __dentry_open has put it on the sb list */
+	list_add(&filp->f_u.fu_list, &tty->tty_files);
+	__check_tty_count(tty, "tty_open");
+	mutex_unlock(&tty_mutex);
 	if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
 	    tty->driver->subtype == PTY_TYPE_MASTER)
 		noctty = 1;
Index: linux-2.6/fs/file_table.c
===================================================================
--- linux-2.6.orig/fs/file_table.c
+++ linux-2.6/fs/file_table.c
@@ -30,8 +30,7 @@ struct files_stat_struct files_stat = {
 	.max_files = NR_FILE
 };
 
-/* public. Not pretty! */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
+static __cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
 
 /* SLAB cache for file structures */
 static struct kmem_cache *filp_cachep __read_mostly;
@@ -285,7 +284,7 @@ void __fput(struct file *file)
 		cdev_put(inode->i_cdev);
 	fops_put(file->f_op);
 	put_pid(file->f_owner.pid);
-	file_kill(file);
+	file_list_del(file);
 	if (file->f_mode & FMODE_WRITE)
 		drop_file_write_access(file);
 	file->f_path.dentry = NULL;
@@ -347,31 +346,29 @@ struct file *fget_light(unsigned int fd,
 	return file;
 }
 
-
 void put_filp(struct file *file)
 {
 	if (atomic_long_dec_and_test(&file->f_count)) {
 		security_file_free(file);
-		file_kill(file);
+		file_list_del(file);
 		file_free(file);
 	}
 }
 
-void file_move(struct file *file, struct list_head *list)
+void file_sb_list_add(struct file *file, struct super_block *sb)
 {
-	if (!list)
-		return;
-	file_list_lock();
-	list_move(&file->f_u.fu_list, list);
-	file_list_unlock();
+	spin_lock(&files_lock);
+	BUG_ON(!list_empty(&file->f_u.fu_list));
+	list_add(&file->f_u.fu_list, &sb->s_files);
+	spin_unlock(&files_lock);
 }
 
-void file_kill(struct file *file)
+void file_list_del(struct file *file)
 {
 	if (!list_empty(&file->f_u.fu_list)) {
-		file_list_lock();
+		spin_lock(&files_lock);
 		list_del_init(&file->f_u.fu_list);
-		file_list_unlock();
+		spin_unlock(&files_lock);
 	}
 }
 
@@ -380,7 +377,7 @@ int fs_may_remount_ro(struct super_block
 	struct file *file;
 
 	/* Check that no files are currently opened for writing. */
-	file_list_lock();
+	spin_lock(&files_lock);
 	list_for_each_entry(file, &sb->s_files, f_u.fu_list) {
 		struct inode *inode = file->f_path.dentry->d_inode;
 
@@ -392,13 +389,48 @@ int fs_may_remount_ro(struct super_block
 		if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
 			goto too_bad;
 	}
-	file_list_unlock();
+	spin_unlock(&files_lock);
 	return 1; /* Tis' cool bro. */
 too_bad:
-	file_list_unlock();
+	spin_unlock(&files_lock);
 	return 0;
 }
 
+/**
+ *	mark_files_ro - mark all files read-only
+ *	@sb: superblock in question
+ *
+ *	All files are marked read-only.  We don't care about pending
+ *	delete files so this should be used in 'force' mode only.
+ */
+void mark_files_ro(struct super_block *sb)
+{
+	struct file *f;
+
+retry:
+	spin_lock(&files_lock);
+	list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
+		struct vfsmount *mnt;
+		if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
+		       continue;
+		if (!file_count(f))
+			continue;
+		if (!(f->f_mode & FMODE_WRITE))
+			continue;
+		f->f_mode &= ~FMODE_WRITE;
+		if (file_check_writeable(f) != 0)
+			continue;
+		file_release_write(f);
+		mnt = mntget(f->f_path.mnt);
+		/* This can sleep, so we can't hold the spinlock. */
+		spin_unlock(&files_lock);
+		mnt_drop_write(mnt);
+		mntput(mnt);
+		goto retry;
+	}
+	spin_unlock(&files_lock);
+}
+
 void __init files_init(unsigned long mempages)
 { 
 	int n; 
Index: linux-2.6/fs/open.c
===================================================================
--- linux-2.6.orig/fs/open.c
+++ linux-2.6/fs/open.c
@@ -828,7 +828,7 @@ static struct file *__dentry_open(struct
 	f->f_path.mnt = mnt;
 	f->f_pos = 0;
 	f->f_op = fops_get(inode->i_fop);
-	file_move(f, &inode->i_sb->s_files);
+	file_sb_list_add(f, inode->i_sb);
 
 	error = security_dentry_open(f, cred);
 	if (error)
@@ -873,7 +873,7 @@ cleanup_all:
 			mnt_drop_write(mnt);
 		}
 	}
-	file_kill(f);
+	file_list_del(f);
 	f->f_path.dentry = NULL;
 	f->f_path.mnt = NULL;
 cleanup_file:
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -934,9 +934,6 @@ struct file {
 	unsigned long f_mnt_write_state;
 #endif
 };
-extern spinlock_t files_lock;
-#define file_list_lock() spin_lock(&files_lock);
-#define file_list_unlock() spin_unlock(&files_lock);
 
 #define get_file(x)	atomic_long_inc(&(x)->f_count)
 #define file_count(x)	atomic_long_read(&(x)->f_count)
@@ -2021,6 +2018,7 @@ extern const struct file_operations read
 extern const struct file_operations write_pipefifo_fops;
 extern const struct file_operations rdwr_pipefifo_fops;
 
+extern void mark_files_ro(struct super_block *sb);
 extern int fs_may_remount_ro(struct super_block *);
 
 #ifdef CONFIG_BLOCK
@@ -2172,8 +2170,8 @@ static inline void insert_inode_hash(str
 }
 
 extern struct file * get_empty_filp(void);
-extern void file_move(struct file *f, struct list_head *list);
-extern void file_kill(struct file *f);
+extern void file_sb_list_add(struct file *f, struct super_block *sb);
+extern void file_list_del(struct file *f);
 #ifdef CONFIG_BLOCK
 struct bio;
 extern void submit_bio(int, struct bio *);
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -588,45 +588,6 @@ out:
 }
 
 /**
- *	mark_files_ro - mark all files read-only
- *	@sb: superblock in question
- *
- *	All files are marked read-only.  We don't care about pending
- *	delete files so this should be used in 'force' mode only.
- */
-
-static void mark_files_ro(struct super_block *sb)
-{
-	struct file *f;
-
-retry:
-	file_list_lock();
-	list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
-		struct vfsmount *mnt;
-		if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
-		       continue;
-		if (!file_count(f))
-			continue;
-		if (!(f->f_mode & FMODE_WRITE))
-			continue;
-		f->f_mode &= ~FMODE_WRITE;
-		if (file_check_writeable(f) != 0)
-			continue;
-		file_release_write(f);
-		mnt = mntget(f->f_path.mnt);
-		file_list_unlock();
-		/*
-		 * This can sleep, so we can't hold
-		 * the file_list_lock() spinlock.
-		 */
-		mnt_drop_write(mnt);
-		mntput(mnt);
-		goto retry;
-	}
-	file_list_unlock();
-}
-
-/**
  *	do_remount_sb - asks filesystem to change mount options.
  *	@sb:	superblock in question
  *	@flags:	numeric part of options
Index: linux-2.6/security/selinux/hooks.c
===================================================================
--- linux-2.6.orig/security/selinux/hooks.c
+++ linux-2.6/security/selinux/hooks.c
@@ -2244,7 +2244,7 @@ static inline void flush_unauthorized_fi
 
 	tty = get_current_tty();
 	if (tty) {
-		file_list_lock();
+		mutex_lock(&tty_mutex);
 		if (!list_empty(&tty->tty_files)) {
 			struct inode *inode;
 
@@ -2260,7 +2260,7 @@ static inline void flush_unauthorized_fi
 				drop_tty = 1;
 			}
 		}
-		file_list_unlock();
+		mutex_unlock(&tty_mutex);
 		tty_kref_put(tty);
 	}
 	/* Reset controlling tty. */




* [patch 02/27] fs: scale files_lock
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
  2009-04-25  1:20 ` [patch 01/27] fs: cleanup files_lock npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  3:32   ` Al Viro
  2009-04-25  1:20 ` [patch 03/27] fs: mnt_want_write speedup npiggin
                   ` (25 subsequent siblings)
  27 siblings, 1 reply; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-files_lock-scale.patch --]
[-- Type: text/plain, Size: 7536 bytes --]

Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with per-cpu locking. This effectively turns it into a big-writer
lock: the common single-file operations take only one CPU's lock, while
operations that must see the whole per-sb list take every CPU's lock.
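
In a condensed sketch (assuming CONFIG_SMP, names as in the patch): the
per-file fastpath touches only the local CPU's lock and list, and the file
records which CPU's list it is on (f_sb_list_cpu), since it may well be
closed on a different CPU than it was opened on. Whole-list walkers such as
fs_may_remount_ro() must take every CPU's lock:

/* fastpath, condensed from file_sb_list_add() below */
static void sketch_add(struct file *file, struct super_block *sb)
{
	spinlock_t *lock = &get_cpu_var(files_cpulock);
	int cpu = smp_processor_id();

	file->f_sb_list_cpu = cpu;	/* remembered for file_list_del() */
	spin_lock(lock);
	list_add(&file->f_u.fu_list, per_cpu_ptr(sb->s_files, cpu));
	spin_unlock(lock);
	put_cpu_var(files_cpulock);
}

/* slowpath, condensed from fs_may_remount_ro() below */
static void sketch_walk(struct super_block *sb)
{
	int i;

	file_list_lock_all();		/* every CPU's lock, in cpu order */
	for_each_possible_cpu(i) {
		/* ... walk per_cpu_ptr(sb->s_files, i) ... */
	}
	file_list_unlock_all();
}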

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/file_table.c    |  159 +++++++++++++++++++++++++++++++++++++++--------------
 fs/super.c         |   16 +++++
 include/linux/fs.h |    7 ++
 3 files changed, 141 insertions(+), 41 deletions(-)

Index: linux-2.6/fs/file_table.c
===================================================================
--- linux-2.6.orig/fs/file_table.c
+++ linux-2.6/fs/file_table.c
@@ -22,6 +22,7 @@
 #include <linux/fsnotify.h>
 #include <linux/sysctl.h>
 #include <linux/percpu_counter.h>
+#include <linux/percpu.h>
 
 #include <asm/atomic.h>
 
@@ -30,7 +31,7 @@ struct files_stat_struct files_stat = {
 	.max_files = NR_FILE
 };
 
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
+static DEFINE_PER_CPU(spinlock_t, files_cpulock);
 
 /* SLAB cache for file structures */
 static struct kmem_cache *filp_cachep __read_mostly;
@@ -124,6 +125,9 @@ struct file *get_empty_filp(void)
 		goto fail_sec;
 
 	INIT_LIST_HEAD(&f->f_u.fu_list);
+#ifdef CONFIG_SMP
+	f->f_sb_list_cpu = -1;
+#endif
 	atomic_long_set(&f->f_count, 1);
 	rwlock_init(&f->f_owner.lock);
 	f->f_cred = get_cred(cred);
@@ -357,42 +361,102 @@ void put_filp(struct file *file)
 
 void file_sb_list_add(struct file *file, struct super_block *sb)
 {
-	spin_lock(&files_lock);
+	spinlock_t *lock;
+	struct list_head *list;
+	int cpu;
+
+	lock = &get_cpu_var(files_cpulock);
+#ifdef CONFIG_SMP
+	BUG_ON(file->f_sb_list_cpu != -1);
+	cpu = smp_processor_id();
+	list = per_cpu_ptr(sb->s_files, cpu);
+	file->f_sb_list_cpu = cpu;
+#else
+	list = &sb->s_files;
+#endif
+	spin_lock(lock);
 	BUG_ON(!list_empty(&file->f_u.fu_list));
-	list_add(&file->f_u.fu_list, &sb->s_files);
-	spin_unlock(&files_lock);
+	list_add(&file->f_u.fu_list, list);
+	spin_unlock(lock);
+	put_cpu_var(files_cpulock);
 }
 
 void file_list_del(struct file *file)
 {
 	if (!list_empty(&file->f_u.fu_list)) {
-		spin_lock(&files_lock);
+		spinlock_t *lock;
+
+#ifdef CONFIG_SMP
+		BUG_ON(file->f_sb_list_cpu == -1);
+		lock = &per_cpu(files_cpulock, file->f_sb_list_cpu);
+		file->f_sb_list_cpu = -1;
+#else
+		lock = &__get_cpu_var(files_cpulock);
+#endif
+		spin_lock(lock);
 		list_del_init(&file->f_u.fu_list);
-		spin_unlock(&files_lock);
+		spin_unlock(lock);
+	}
+}
+
+static void file_list_lock_all(void)
+{
+	int i;
+	int nr = 0;
+
+	for_each_possible_cpu(i) {
+		spinlock_t *lock;
+
+		lock = &per_cpu(files_cpulock, i);
+		spin_lock_nested(lock, nr);
+		nr++;
+	}
+}
+
+static void file_list_unlock_all(void)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		spinlock_t *lock;
+
+		lock = &per_cpu(files_cpulock, i);
+		spin_unlock(lock);
 	}
 }
 
 int fs_may_remount_ro(struct super_block *sb)
 {
-	struct file *file;
+	int i;
 
 	/* Check that no files are currently opened for writing. */
-	spin_lock(&files_lock);
-	list_for_each_entry(file, &sb->s_files, f_u.fu_list) {
-		struct inode *inode = file->f_path.dentry->d_inode;
-
-		/* File with pending delete? */
-		if (inode->i_nlink == 0)
-			goto too_bad;
-
-		/* Writeable file? */
-		if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
-			goto too_bad;
+	file_list_lock_all();
+	for_each_possible_cpu(i) {
+		struct file *file;
+		struct list_head *list;
+
+#ifdef CONFIG_SMP
+		list = per_cpu_ptr(sb->s_files, i);
+#else
+		list = &sb->s_files;
+#endif
+		list_for_each_entry(file, list, f_u.fu_list) {
+			struct inode *inode = file->f_path.dentry->d_inode;
+
+			/* File with pending delete? */
+			if (inode->i_nlink == 0)
+				goto too_bad;
+
+			/* Writeable file? */
+			if (S_ISREG(inode->i_mode) &&
+					(file->f_mode & FMODE_WRITE))
+				goto too_bad;
+		}
 	}
-	spin_unlock(&files_lock);
+	file_list_unlock_all();
 	return 1; /* Tis' cool bro. */
 too_bad:
-	spin_unlock(&files_lock);
+	file_list_unlock_all();
 	return 0;
 }
 
@@ -405,35 +469,46 @@ too_bad:
  */
 void mark_files_ro(struct super_block *sb)
 {
-	struct file *f;
+	int i;
 
 retry:
-	spin_lock(&files_lock);
-	list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
-		struct vfsmount *mnt;
-		if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
-		       continue;
-		if (!file_count(f))
-			continue;
-		if (!(f->f_mode & FMODE_WRITE))
-			continue;
-		f->f_mode &= ~FMODE_WRITE;
-		if (file_check_writeable(f) != 0)
-			continue;
-		file_release_write(f);
-		mnt = mntget(f->f_path.mnt);
-		/* This can sleep, so we can't hold the spinlock. */
-		spin_unlock(&files_lock);
-		mnt_drop_write(mnt);
-		mntput(mnt);
-		goto retry;
+	file_list_lock_all();
+	for_each_possible_cpu(i) {
+		struct file *f;
+		struct list_head *list;
+
+#ifdef CONFIG_SMP
+		list = per_cpu_ptr(sb->s_files, i);
+#else
+		list = &sb->s_files;
+#endif
+		list_for_each_entry(f, list, f_u.fu_list) {
+			struct vfsmount *mnt;
+			if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
+			       continue;
+			if (!file_count(f))
+				continue;
+			if (!(f->f_mode & FMODE_WRITE))
+				continue;
+			f->f_mode &= ~FMODE_WRITE;
+			if (file_check_writeable(f) != 0)
+				continue;
+			file_release_write(f);
+			mnt = mntget(f->f_path.mnt);
+			/* This can sleep, so we can't hold the spinlock. */
+			file_list_unlock_all();
+			mnt_drop_write(mnt);
+			mntput(mnt);
+			goto retry;
+		}
 	}
-	spin_unlock(&files_lock);
+	file_list_unlock_all();
 }
 
 void __init files_init(unsigned long mempages)
 { 
 	int n; 
+	int i;
 
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
 			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
@@ -448,5 +523,7 @@ void __init files_init(unsigned long mem
 	if (files_stat.max_files < NR_FILE)
 		files_stat.max_files = NR_FILE;
 	files_defer_init();
+	for_each_possible_cpu(i)
+		spin_lock_init(&per_cpu(files_cpulock, i));
 	percpu_counter_init(&nr_files, 0);
 } 
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -67,7 +67,23 @@ static struct super_block *alloc_super(s
 		INIT_LIST_HEAD(&s->s_dirty);
 		INIT_LIST_HEAD(&s->s_io);
 		INIT_LIST_HEAD(&s->s_more_io);
+#ifdef CONFIG_SMP
+		s->s_files = alloc_percpu(struct list_head);
+		if (!s->s_files) {
+			security_sb_free(s);
+			kfree(s);
+			s = NULL;
+			goto out;
+		} else {
+			int i;
+
+			for_each_possible_cpu(i)
+				INIT_LIST_HEAD(per_cpu_ptr(s->s_files, i));
+		}
+#else
 		INIT_LIST_HEAD(&s->s_files);
+#endif
+
 		INIT_LIST_HEAD(&s->s_instances);
 		INIT_HLIST_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -910,6 +910,9 @@ struct file {
 #define f_vfsmnt	f_path.mnt
 	const struct file_operations	*f_op;
 	spinlock_t		f_lock;  /* f_ep_links, f_flags, no IRQ */
+#ifdef CONFIG_SMP
+	int			f_sb_list_cpu;
+#endif
 	atomic_long_t		f_count;
 	unsigned int 		f_flags;
 	fmode_t			f_mode;
@@ -1330,7 +1333,11 @@ struct super_block {
 	struct list_head	s_io;		/* parked for writeback */
 	struct list_head	s_more_io;	/* parked for more writeback */
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
+#ifdef CONFIG_SMP
+	struct list_head	*s_files;
+#else
 	struct list_head	s_files;
+#endif
 	/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	int			s_nr_dentry_unused;	/* # of dentry on lru */




* [patch 03/27] fs: mnt_want_write speedup
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
  2009-04-25  1:20 ` [patch 01/27] fs: cleanup files_lock npiggin
  2009-04-25  1:20 ` [patch 02/27] fs: scale files_lock npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 04/27] fs: introduce mnt_clone_write npiggin
                   ` (24 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Dave Hansen

[-- Attachment #1: mnt-want-write-speedup.patch --]
[-- Type: text/plain, Size: 12159 bytes --]

This patch speeds up lmbench lat_mmap test by about 8%. lat_mmap is set up
basically to mmap a 64MB file on tmpfs, fault in its pages, then unmap it.
A microbenchmark yes, but it exercises some important paths in the mm.

Before:
 avg = 501.9
 std = 14.7773

After:
 avg = 462.286
 std = 5.46106

(50 runs of each, stddev gives a reasonable confidence, but there is quite
a bit of variation there still)

It does this by removing the complex per-cpu locking and counter-cache and
replacing it with a percpu counter in struct vfsmount. This makes the code
much simpler and avoids spinlocks (although the msync is still pretty
costly, unfortunately). It also results in about 900 bytes less code. It
does increase the size of a vfsmount, however.

It should also give a speedup on large systems if CPUs are frequently operating
on different mounts (because the existing scheme has to operate on an atomic in
the struct vfsmount when switching between mounts). But I'm most interested in
the single-threaded path performance for the moment.
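
Note that the individual per-cpu counters are not meaningful on their own --
a write reference can be taken on one CPU and dropped on another -- only
their sum is. A hypothetical two-CPU trace, with count_mnt_writers() shown
as in the patch (SMP variant):

/*
 * CPU0: mnt_want_write(mnt)  ->  mnt_writers[0] = +1
 *       ... task migrates ...
 * CPU1: mnt_drop_write(mnt)  ->  mnt_writers[1] = -1
 *
 * Each per-cpu value is "wrong", but the sum is 0, so a remount
 * read-only may proceed.  MNT_WRITE_HOLD makes would-be writers spin
 * after incrementing their counter, so the sum cannot be torn while
 * mnt_make_readonly() is reading it.
 */
static unsigned int count_mnt_writers(struct vfsmount *mnt)
{
	unsigned int count = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		count += *per_cpu_ptr(mnt->mnt_writers, cpu);
	return count;
}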

Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/namespace.c        |  263 ++++++++++++++++----------------------------------
 include/linux/mount.h |   21 ++-
 2 files changed, 103 insertions(+), 181 deletions(-)

Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c
+++ linux-2.6/fs/namespace.c
@@ -131,10 +131,20 @@ struct vfsmount *alloc_vfsmnt(const char
 		INIT_LIST_HEAD(&mnt->mnt_share);
 		INIT_LIST_HEAD(&mnt->mnt_slave_list);
 		INIT_LIST_HEAD(&mnt->mnt_slave);
-		atomic_set(&mnt->__mnt_writers, 0);
+#ifdef CONFIG_SMP
+		mnt->mnt_writers = alloc_percpu(int);
+		if (!mnt->mnt_writers)
+			goto out_free_devname;
+#else
+		mnt->mnt_writers = 0;
+#endif
 	}
 	return mnt;
 
+#ifdef CONFIG_SMP
+out_free_devname:
+	kfree(mnt->mnt_devname);
+#endif
 out_free_id:
 	mnt_free_id(mnt);
 out_free_cache:
@@ -171,65 +181,38 @@ int __mnt_is_readonly(struct vfsmount *m
 }
 EXPORT_SYMBOL_GPL(__mnt_is_readonly);
 
-struct mnt_writer {
-	/*
-	 * If holding multiple instances of this lock, they
-	 * must be ordered by cpu number.
-	 */
-	spinlock_t lock;
-	struct lock_class_key lock_class; /* compiles out with !lockdep */
-	unsigned long count;
-	struct vfsmount *mnt;
-} ____cacheline_aligned_in_smp;
-static DEFINE_PER_CPU(struct mnt_writer, mnt_writers);
+static inline void inc_mnt_writers(struct vfsmount *mnt)
+{
+#ifdef CONFIG_SMP
+	(*per_cpu_ptr(mnt->mnt_writers, smp_processor_id()))++;
+#else
+	mnt->mnt_writers++;
+#endif
+}
 
-static int __init init_mnt_writers(void)
+static inline void dec_mnt_writers(struct vfsmount *mnt)
 {
-	int cpu;
-	for_each_possible_cpu(cpu) {
-		struct mnt_writer *writer = &per_cpu(mnt_writers, cpu);
-		spin_lock_init(&writer->lock);
-		lockdep_set_class(&writer->lock, &writer->lock_class);
-		writer->count = 0;
-	}
-	return 0;
+#ifdef CONFIG_SMP
+	(*per_cpu_ptr(mnt->mnt_writers, smp_processor_id()))--;
+#else
+	mnt->mnt_writers--;
+#endif
 }
-fs_initcall(init_mnt_writers);
 
-static void unlock_mnt_writers(void)
+static unsigned int count_mnt_writers(struct vfsmount *mnt)
 {
+#ifdef CONFIG_SMP
+	unsigned int count = 0;
 	int cpu;
-	struct mnt_writer *cpu_writer;
 
 	for_each_possible_cpu(cpu) {
-		cpu_writer = &per_cpu(mnt_writers, cpu);
-		spin_unlock(&cpu_writer->lock);
+		count += *per_cpu_ptr(mnt->mnt_writers, cpu);
 	}
-}
 
-static inline void __clear_mnt_count(struct mnt_writer *cpu_writer)
-{
-	if (!cpu_writer->mnt)
-		return;
-	/*
-	 * This is in case anyone ever leaves an invalid,
-	 * old ->mnt and a count of 0.
-	 */
-	if (!cpu_writer->count)
-		return;
-	atomic_add(cpu_writer->count, &cpu_writer->mnt->__mnt_writers);
-	cpu_writer->count = 0;
-}
- /*
- * must hold cpu_writer->lock
- */
-static inline void use_cpu_writer_for_mount(struct mnt_writer *cpu_writer,
-					  struct vfsmount *mnt)
-{
-	if (cpu_writer->mnt == mnt)
-		return;
-	__clear_mnt_count(cpu_writer);
-	cpu_writer->mnt = mnt;
+	return count;
+#else
+	return mnt->mnt_writers;
+#endif
 }
 
 /*
@@ -253,75 +236,34 @@ static inline void use_cpu_writer_for_mo
 int mnt_want_write(struct vfsmount *mnt)
 {
 	int ret = 0;
-	struct mnt_writer *cpu_writer;
 
-	cpu_writer = &get_cpu_var(mnt_writers);
-	spin_lock(&cpu_writer->lock);
+	preempt_disable();
+	inc_mnt_writers(mnt);
+	/*
+	 * The store to inc_mnt_writers must be visible before we pass
+	 * MNT_WRITE_HOLD loop below, so that the slowpath can see our
+	 * incremented count after it has set MNT_WRITE_HOLD.
+	 */
+	smp_mb();
+	while (mnt->mnt_flags & MNT_WRITE_HOLD)
+		cpu_relax();
+	/*
+	 * After the slowpath clears MNT_WRITE_HOLD, mnt_is_readonly will
+	 * be set to match its requirements. So we must not load that until
+	 * MNT_WRITE_HOLD is cleared.
+	 */
+	smp_rmb();
 	if (__mnt_is_readonly(mnt)) {
+		dec_mnt_writers(mnt);
 		ret = -EROFS;
 		goto out;
 	}
-	use_cpu_writer_for_mount(cpu_writer, mnt);
-	cpu_writer->count++;
 out:
-	spin_unlock(&cpu_writer->lock);
-	put_cpu_var(mnt_writers);
+	preempt_enable();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(mnt_want_write);
 
-static void lock_mnt_writers(void)
-{
-	int cpu;
-	struct mnt_writer *cpu_writer;
-
-	for_each_possible_cpu(cpu) {
-		cpu_writer = &per_cpu(mnt_writers, cpu);
-		spin_lock(&cpu_writer->lock);
-		__clear_mnt_count(cpu_writer);
-		cpu_writer->mnt = NULL;
-	}
-}
-
-/*
- * These per-cpu write counts are not guaranteed to have
- * matched increments and decrements on any given cpu.
- * A file open()ed for write on one cpu and close()d on
- * another cpu will imbalance this count.  Make sure it
- * does not get too far out of whack.
- */
-static void handle_write_count_underflow(struct vfsmount *mnt)
-{
-	if (atomic_read(&mnt->__mnt_writers) >=
-	    MNT_WRITER_UNDERFLOW_LIMIT)
-		return;
-	/*
-	 * It isn't necessary to hold all of the locks
-	 * at the same time, but doing it this way makes
-	 * us share a lot more code.
-	 */
-	lock_mnt_writers();
-	/*
-	 * vfsmount_lock is for mnt_flags.
-	 */
-	spin_lock(&vfsmount_lock);
-	/*
-	 * If coalescing the per-cpu writer counts did not
-	 * get us back to a positive writer count, we have
-	 * a bug.
-	 */
-	if ((atomic_read(&mnt->__mnt_writers) < 0) &&
-	    !(mnt->mnt_flags & MNT_IMBALANCED_WRITE_COUNT)) {
-		WARN(1, KERN_DEBUG "leak detected on mount(%p) writers "
-				"count: %d\n",
-			mnt, atomic_read(&mnt->__mnt_writers));
-		/* use the flag to keep the dmesg spam down */
-		mnt->mnt_flags |= MNT_IMBALANCED_WRITE_COUNT;
-	}
-	spin_unlock(&vfsmount_lock);
-	unlock_mnt_writers();
-}
-
 /**
  * mnt_drop_write - give up write access to a mount
  * @mnt: the mount on which to give up write access
@@ -332,37 +274,9 @@ static void handle_write_count_underflow
  */
 void mnt_drop_write(struct vfsmount *mnt)
 {
-	int must_check_underflow = 0;
-	struct mnt_writer *cpu_writer;
-
-	cpu_writer = &get_cpu_var(mnt_writers);
-	spin_lock(&cpu_writer->lock);
-
-	use_cpu_writer_for_mount(cpu_writer, mnt);
-	if (cpu_writer->count > 0) {
-		cpu_writer->count--;
-	} else {
-		must_check_underflow = 1;
-		atomic_dec(&mnt->__mnt_writers);
-	}
-
-	spin_unlock(&cpu_writer->lock);
-	/*
-	 * Logically, we could call this each time,
-	 * but the __mnt_writers cacheline tends to
-	 * be cold, and makes this expensive.
-	 */
-	if (must_check_underflow)
-		handle_write_count_underflow(mnt);
-	/*
-	 * This could be done right after the spinlock
-	 * is taken because the spinlock keeps us on
-	 * the cpu, and disables preemption.  However,
-	 * putting it here bounds the amount that
-	 * __mnt_writers can underflow.  Without it,
-	 * we could theoretically wrap __mnt_writers.
-	 */
-	put_cpu_var(mnt_writers);
+	preempt_disable();
+	dec_mnt_writers(mnt);
+	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(mnt_drop_write);
 
@@ -370,24 +284,44 @@ static int mnt_make_readonly(struct vfsm
 {
 	int ret = 0;
 
-	lock_mnt_writers();
+	spin_lock(&vfsmount_lock);
+	mnt->mnt_flags |= MNT_WRITE_HOLD;
+	/*
+	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
+	 * should be visible before we do.
+	 */
+	smp_mb();
+
 	/*
-	 * With all the locks held, this value is stable
+	 * With writers on hold, if this value is zero, then there are
+	 * definitely no active writers (although held writers may subsequently
+	 * increment the count, they'll have to wait, and decrement it after
+	 * seeing MNT_READONLY).
+	 *
+	 * It is OK to have counter incremented on one CPU and decremented on
+	 * another: the sum will add up correctly. The danger would be when we
+	 * sum up each counter, if we read a counter before it is incremented,
+	 * but then read another CPU's count which it has been subsequently
+	 * decremented from -- we would see more decrements than we should.
+	 * MNT_WRITE_HOLD protects against this scenario, because
+	 * mnt_want_write first increments count, then smp_mb, then spins on
+	 * MNT_WRITE_HOLD, so it can't be decremented by another CPU while
+	 * we're counting up here.
 	 */
-	if (atomic_read(&mnt->__mnt_writers) > 0) {
+	if (count_mnt_writers(mnt) > 0) {
 		ret = -EBUSY;
 		goto out;
 	}
-	/*
-	 * nobody can do a successful mnt_want_write() with all
-	 * of the counts in MNT_DENIED_WRITE and the locks held.
-	 */
-	spin_lock(&vfsmount_lock);
 	if (!ret)
 		mnt->mnt_flags |= MNT_READONLY;
-	spin_unlock(&vfsmount_lock);
 out:
-	unlock_mnt_writers();
+	/*
+	 * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
+	 * that become unheld will see MNT_READONLY.
+	 */
+	smp_wmb();
+	mnt->mnt_flags &= ~MNT_WRITE_HOLD;
+	spin_unlock(&vfsmount_lock);
 	return ret;
 }
 
@@ -410,6 +344,9 @@ void free_vfsmnt(struct vfsmount *mnt)
 {
 	kfree(mnt->mnt_devname);
 	mnt_free_id(mnt);
+#ifdef CONFIG_SMP
+	free_percpu(mnt->mnt_writers);
+#endif
 	kmem_cache_free(mnt_cache, mnt);
 }
 
@@ -604,38 +541,14 @@ static struct vfsmount *clone_mnt(struct
 
 static inline void __mntput(struct vfsmount *mnt)
 {
-	int cpu;
 	struct super_block *sb = mnt->mnt_sb;
 	/*
-	 * We don't have to hold all of the locks at the
-	 * same time here because we know that we're the
-	 * last reference to mnt and that no new writers
-	 * can come in.
-	 */
-	for_each_possible_cpu(cpu) {
-		struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu);
-		spin_lock(&cpu_writer->lock);
-		if (cpu_writer->mnt != mnt) {
-			spin_unlock(&cpu_writer->lock);
-			continue;
-		}
-		atomic_add(cpu_writer->count, &mnt->__mnt_writers);
-		cpu_writer->count = 0;
-		/*
-		 * Might as well do this so that no one
-		 * ever sees the pointer and expects
-		 * it to be valid.
-		 */
-		cpu_writer->mnt = NULL;
-		spin_unlock(&cpu_writer->lock);
-	}
-	/*
 	 * This probably indicates that somebody messed
 	 * up a mnt_want/drop_write() pair.  If this
 	 * happens, the filesystem was probably unable
 	 * to make r/w->r/o transitions.
 	 */
-	WARN_ON(atomic_read(&mnt->__mnt_writers));
+	WARN_ON(count_mnt_writers(mnt));
 	dput(mnt->mnt_root);
 	free_vfsmnt(mnt);
 	deactivate_super(sb);
Index: linux-2.6/include/linux/mount.h
===================================================================
--- linux-2.6.orig/include/linux/mount.h
+++ linux-2.6/include/linux/mount.h
@@ -30,7 +30,7 @@ struct mnt_namespace;
 #define MNT_STRICTATIME 0x80
 
 #define MNT_SHRINKABLE	0x100
-#define MNT_IMBALANCED_WRITE_COUNT	0x200 /* just for debugging */
+#define MNT_WRITE_HOLD	0x200
 
 #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
 #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
@@ -65,13 +65,22 @@ struct vfsmount {
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	int mnt_pinned;
 	int mnt_ghosts;
-	/*
-	 * This value is not stable unless all of the mnt_writers[] spinlocks
-	 * are held, and all mnt_writer[]s on this mount have 0 as their ->count
-	 */
-	atomic_t __mnt_writers;
+#ifdef CONFIG_SMP
+	int *mnt_writers;
+#else
+	int mnt_writers;
+#endif
 };
 
+static inline int *get_mnt_writers_ptr(struct vfsmount *mnt)
+{
+#ifdef CONFIG_SMP
+	return mnt->mnt_writers;
+#else
+	return &mnt->mnt_writers;
+#endif
+}
+
 static inline struct vfsmount *mntget(struct vfsmount *mnt)
 {
 	if (mnt)




* [patch 04/27] fs: introduce mnt_clone_write
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (2 preceding siblings ...)
  2009-04-25  1:20 ` [patch 03/27] fs: mnt_want_write speedup npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  3:35   ` Al Viro
  2009-04-25  1:20 ` [patch 05/27] fs: brlock vfsmount_lock npiggin
                   ` (23 subsequent siblings)
  27 siblings, 1 reply; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Dave Hansen

[-- Attachment #1: mnt_clone_write.patch --]
[-- Type: text/plain, Size: 5254 bytes --]

This patch speeds up lmbench lat_mmap test by about another 2% after the
first patch.

Before:
 avg = 462.286
 std = 5.46106

After:
 avg = 453.12
 std = 9.58257

(50 runs of each, stddev gives a reasonable confidence)

It does this by introducing mnt_clone_write, which avoids some heavyweight
operations of mnt_want_write if called on a vfsmount which we know already
has a write count; and mnt_want_write_file, which can call mnt_clone_write
if the file is open for write.

After these two patches, mnt_want_write and mnt_drop_write go from 7% on
the profile down to 1.3% (including mnt_clone_write).
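
As a caller-side sketch (hypothetical function, mirroring the
file_update_time() conversion below): a file that was opened for write
holds a write reference on its mount for its whole lifetime, so taking a
nested reference can be a simple per-cpu increment:

static int sketch_touch(struct file *file)
{
	int err;

	/* takes the cheap mnt_clone_write() path when FMODE_WRITE is set */
	err = mnt_want_write_file(file->f_path.mnt, file);
	if (err)
		return err;

	/* ... update the inode ... */

	mnt_drop_write(file->f_path.mnt);
	return 0;
}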

Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/file_table.c       |    3 +--
 fs/inode.c            |    2 +-
 fs/namespace.c        |   38 ++++++++++++++++++++++++++++++++++++++
 fs/open.c             |    4 ++--
 fs/xattr.c            |    4 ++--
 include/linux/mount.h |    4 ++++
 6 files changed, 48 insertions(+), 7 deletions(-)

Index: linux-2.6/fs/file_table.c
===================================================================
--- linux-2.6.orig/fs/file_table.c
+++ linux-2.6/fs/file_table.c
@@ -217,8 +217,7 @@ int init_file(struct file *file, struct
 	 */
 	if ((mode & FMODE_WRITE) && !special_file(dentry->d_inode->i_mode)) {
 		file_take_write(file);
-		error = mnt_want_write(mnt);
-		WARN_ON(error);
+		mnt_clone_write(mnt);
 	}
 	return error;
 }
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -1401,7 +1401,7 @@ void file_update_time(struct file *file)
 	if (IS_NOCMTIME(inode))
 		return;
 
-	err = mnt_want_write(file->f_path.mnt);
+	err = mnt_want_write_file(file->f_path.mnt, file);
 	if (err)
 		return;
 
Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c
+++ linux-2.6/fs/namespace.c
@@ -265,6 +265,44 @@ out:
 EXPORT_SYMBOL_GPL(mnt_want_write);
 
 /**
+ * mnt_clone_write - get write access to a mount
+ * @mnt: the mount on which to take a write
+ *
+ * This is effectively like mnt_want_write, except
+ * it must only be used to take an extra write reference
+ * on a mountpoint that we already know has a write reference
+ * on it. This allows some optimisation.
+ *
+ * After finished, mnt_drop_write must be called as usual to
+ * drop the reference.
+ */
+void mnt_clone_write(struct vfsmount *mnt)
+{
+	preempt_disable();
+	inc_mnt_writers(mnt);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(mnt_clone_write);
+
+/**
+ * mnt_want_write_file - get write access to a file's mount
+ * @file: the file whose mount on which to take a write
+ *
+ * This is like mnt_want_write, but it takes a file and can
+ * do some optimisations if the file is open for write already
+ */
+int mnt_want_write_file(struct vfsmount *mnt, struct file *file)
+{
+	if (!(file->f_mode & FMODE_WRITE))
+		return mnt_want_write(mnt);
+	else {
+		mnt_clone_write(mnt);
+		return 0;
+	}
+}
+EXPORT_SYMBOL_GPL(mnt_want_write_file);
+
+/**
  * mnt_drop_write - give up write access to a mount
  * @mnt: the mount on which to give up write access
  *
Index: linux-2.6/fs/open.c
===================================================================
--- linux-2.6.orig/fs/open.c
+++ linux-2.6/fs/open.c
@@ -612,7 +612,7 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd
 
 	audit_inode(NULL, dentry);
 
-	err = mnt_want_write(file->f_path.mnt);
+	err = mnt_want_write_file(file->f_path.mnt, file);
 	if (err)
 		goto out_putf;
 	mutex_lock(&inode->i_mutex);
@@ -761,7 +761,7 @@ SYSCALL_DEFINE3(fchown, unsigned int, fd
 	if (!file)
 		goto out;
 
-	error = mnt_want_write(file->f_path.mnt);
+	error = mnt_want_write_file(file->f_path.mnt, file);
 	if (error)
 		goto out_fput;
 	dentry = file->f_path.dentry;
Index: linux-2.6/fs/xattr.c
===================================================================
--- linux-2.6.orig/fs/xattr.c
+++ linux-2.6/fs/xattr.c
@@ -297,7 +297,7 @@ SYSCALL_DEFINE5(fsetxattr, int, fd, cons
 		return error;
 	dentry = f->f_path.dentry;
 	audit_inode(NULL, dentry);
-	error = mnt_want_write(f->f_path.mnt);
+	error = mnt_want_write_file(f->f_path.mnt, f);
 	if (!error) {
 		error = setxattr(dentry, name, value, size, flags);
 		mnt_drop_write(f->f_path.mnt);
@@ -524,7 +524,7 @@ SYSCALL_DEFINE2(fremovexattr, int, fd, c
 		return error;
 	dentry = f->f_path.dentry;
 	audit_inode(NULL, dentry);
-	error = mnt_want_write(f->f_path.mnt);
+	error = mnt_want_write_file(f->f_path.mnt, f);
 	if (!error) {
 		error = removexattr(dentry, name);
 		mnt_drop_write(f->f_path.mnt);
Index: linux-2.6/include/linux/mount.h
===================================================================
--- linux-2.6.orig/include/linux/mount.h
+++ linux-2.6/include/linux/mount.h
@@ -88,7 +88,11 @@ static inline struct vfsmount *mntget(st
 	return mnt;
 }
 
+struct file; /* forward dec */
+
 extern int mnt_want_write(struct vfsmount *mnt);
+extern int mnt_want_write_file(struct vfsmount *mnt, struct file *file);
+extern void mnt_clone_write(struct vfsmount *mnt);
 extern void mnt_drop_write(struct vfsmount *mnt);
 extern void mntput_no_expire(struct vfsmount *mnt);
 extern void mnt_pin(struct vfsmount *mnt);




* [patch 05/27] fs: brlock vfsmount_lock
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (3 preceding siblings ...)
  2009-04-25  1:20 ` [patch 04/27] fs: introduce mnt_clone_write npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  3:50   ` Al Viro
  2009-04-25  1:20 ` [patch 06/27] fs: dcache fix LRU ordering npiggin
                   ` (22 subsequent siblings)
  27 siblings, 1 reply; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-vfsmount_lock-scale.patch --]
[-- Type: text/plain, Size: 17770 bytes --]

Use a brlock for the vfsmount lock: replace the global vfsmount_lock
spinlock with per-cpu spinlocks, so that read-side users take only their
own CPU's lock while write-side users (modifications to the mount tree)
take all of them.
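
In a condensed sketch (hypothetical callers, lock helpers as in the patch),
the read side costs one per-cpu spinlock while the write side must take all
of them, nested in cpu order to keep lockdep happy:

static void sketch_reader(void)
{
	vfsmount_read_lock();	/* this CPU's spinlock only: cheap */
	/* ... look at the mount tree, e.g. __lookup_mnt() ... */
	vfsmount_read_unlock();
}

static void sketch_writer(void)
{
	vfsmount_write_lock();	/* every CPU's spinlock: rare and slow */
	/* ... modify the mount tree, e.g. umount_tree() ... */
	vfsmount_write_unlock();
}
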
---
 fs/dcache.c                   |    4 
 fs/namei.c                    |   16 +--
 fs/namespace.c                |  194 +++++++++++++++++++++++++++++-------------
 fs/pnode.c                    |    4 
 fs/proc/base.c                |    4 
 include/linux/mnt_namespace.h |    8 -
 include/linux/mount.h         |    6 +
 kernel/audit_tree.c           |    6 -
 security/tomoyo/realpath.c    |    4 
 9 files changed, 159 insertions(+), 87 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1908,7 +1908,7 @@ char *__d_path(const struct path *path,
 	char *end = buffer + buflen;
 	char *retval;
 
-	spin_lock(&vfsmount_lock);
+	vfsmount_read_lock();
 	prepend(&end, &buflen, "\0", 1);
 	if (!IS_ROOT(dentry) && d_unhashed(dentry) &&
 		(prepend(&end, &buflen, " (deleted)", 10) != 0))
@@ -1944,7 +1944,7 @@ char *__d_path(const struct path *path,
 	}
 
 out:
-	spin_unlock(&vfsmount_lock);
+	vfsmount_read_unlock();
 	return retval;
 
 global_root:
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -672,15 +672,15 @@ int follow_up(struct vfsmount **mnt, str
 {
 	struct vfsmount *parent;
 	struct dentry *mountpoint;
-	spin_lock(&vfsmount_lock);
+	vfsmount_read_lock();
 	parent=(*mnt)->mnt_parent;
 	if (parent == *mnt) {
-		spin_unlock(&vfsmount_lock);
+		vfsmount_read_unlock();
 		return 0;
 	}
 	mntget(parent);
-	mountpoint=dget((*mnt)->mnt_mountpoint);
-	spin_unlock(&vfsmount_lock);
+	mountpoint = dget((*mnt)->mnt_mountpoint);
+	vfsmount_read_unlock();
 	dput(*dentry);
 	*dentry = mountpoint;
 	mntput(*mnt);
@@ -762,15 +762,15 @@ static __always_inline void follow_dotdo
 			break;
 		}
 		spin_unlock(&dcache_lock);
-		spin_lock(&vfsmount_lock);
+		vfsmount_read_lock();
 		parent = nd->path.mnt->mnt_parent;
 		if (parent == nd->path.mnt) {
-			spin_unlock(&vfsmount_lock);
+			vfsmount_read_unlock();
 			break;
 		}
 		mntget(parent);
 		nd->path.dentry = dget(nd->path.mnt->mnt_mountpoint);
-		spin_unlock(&vfsmount_lock);
+		vfsmount_read_unlock();
 		dput(old);
 		mntput(nd->path.mnt);
 		nd->path.mnt = parent;
Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c
+++ linux-2.6/fs/namespace.c
@@ -11,6 +11,8 @@
 #include <linux/syscalls.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
 #include <linux/smp_lock.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
@@ -37,7 +39,7 @@
 #define HASH_SIZE (1UL << HASH_SHIFT)
 
 /* spinlock for vfsmount related operations, inplace of dcache_lock */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(vfsmount_lock);
+static DEFINE_PER_CPU(spinlock_t, vfsmount_lock);
 
 static int event;
 static DEFINE_IDA(mnt_id_ida);
@@ -51,6 +53,49 @@ static struct rw_semaphore namespace_sem
 struct kobject *fs_kobj;
 EXPORT_SYMBOL_GPL(fs_kobj);
 
+void vfsmount_read_lock(void)
+{
+	spinlock_t *lock;
+
+	lock = &get_cpu_var(vfsmount_lock);
+	spin_lock(lock);
+}
+
+void vfsmount_read_unlock(void)
+{
+	spinlock_t *lock;
+
+	lock = &__get_cpu_var(vfsmount_lock);
+	spin_unlock(lock);
+	put_cpu_var(vfsmount_lock);
+}
+
+void vfsmount_write_lock(void)
+{
+	int i;
+	int nr = 0;
+
+	for_each_possible_cpu(i) {
+		spinlock_t *lock;
+
+		lock = &per_cpu(vfsmount_lock, i);
+		spin_lock_nested(lock, nr);
+		nr++;
+	}
+}
+
+void vfsmount_write_unlock(void)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		spinlock_t *lock;
+
+		lock = &per_cpu(vfsmount_lock, i);
+		spin_unlock(lock);
+	}
+}
+
 static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry)
 {
 	unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
@@ -68,9 +113,9 @@ static int mnt_alloc_id(struct vfsmount
 
 retry:
 	ida_pre_get(&mnt_id_ida, GFP_KERNEL);
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 	res = ida_get_new(&mnt_id_ida, &mnt->mnt_id);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 	if (res == -EAGAIN)
 		goto retry;
 
@@ -79,9 +124,9 @@ retry:
 
 static void mnt_free_id(struct vfsmount *mnt)
 {
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 	ida_remove(&mnt_id_ida, mnt->mnt_id);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 }
 
 /*
@@ -322,7 +367,7 @@ static int mnt_make_readonly(struct vfsm
 {
 	int ret = 0;
 
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 	mnt->mnt_flags |= MNT_WRITE_HOLD;
 	/*
 	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -359,15 +404,15 @@ out:
 	 */
 	smp_wmb();
 	mnt->mnt_flags &= ~MNT_WRITE_HOLD;
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 	return ret;
 }
 
 static void __mnt_unmake_readonly(struct vfsmount *mnt)
 {
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 	mnt->mnt_flags &= ~MNT_READONLY;
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 }
 
 void simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
@@ -420,10 +465,10 @@ struct vfsmount *__lookup_mnt(struct vfs
 struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
 {
 	struct vfsmount *child_mnt;
-	spin_lock(&vfsmount_lock);
+	vfsmount_read_lock();
 	if ((child_mnt = __lookup_mnt(mnt, dentry, 1)))
 		mntget(child_mnt);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_read_unlock();
 	return child_mnt;
 }
 
@@ -595,40 +640,46 @@ static inline void __mntput(struct vfsmo
 void mntput_no_expire(struct vfsmount *mnt)
 {
 repeat:
-	if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
-		if (likely(!mnt->mnt_pinned)) {
-			spin_unlock(&vfsmount_lock);
-			__mntput(mnt);
-			return;
-		}
-		atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
-		mnt->mnt_pinned = 0;
-		spin_unlock(&vfsmount_lock);
-		acct_auto_close_mnt(mnt);
-		security_sb_umount_close(mnt);
-		goto repeat;
+	if (atomic_add_unless(&mnt->mnt_count, -1, 1))
+		return;
+	vfsmount_write_lock();
+	if (atomic_add_unless(&mnt->mnt_count, -1, 1)) {
+		vfsmount_write_unlock();
+		return;
+	}
+
+	if (likely(!mnt->mnt_pinned)) {
+		vfsmount_write_unlock();
+		__mntput(mnt);
+		return;
 	}
+	atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
+	mnt->mnt_pinned = 0;
+	vfsmount_write_unlock();
+	acct_auto_close_mnt(mnt);
+	security_sb_umount_close(mnt);
+	goto repeat;
 }
 
 EXPORT_SYMBOL(mntput_no_expire);
 
 void mnt_pin(struct vfsmount *mnt)
 {
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 	mnt->mnt_pinned++;
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 }
 
 EXPORT_SYMBOL(mnt_pin);
 
 void mnt_unpin(struct vfsmount *mnt)
 {
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 	if (mnt->mnt_pinned) {
 		atomic_inc(&mnt->mnt_count);
 		mnt->mnt_pinned--;
 	}
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 }
 
 EXPORT_SYMBOL(mnt_unpin);
@@ -896,12 +947,12 @@ int may_umount_tree(struct vfsmount *mnt
 	int minimum_refs = 0;
 	struct vfsmount *p;
 
-	spin_lock(&vfsmount_lock);
+	vfsmount_read_lock();
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
 		actual_refs += atomic_read(&p->mnt_count);
 		minimum_refs += 2;
 	}
-	spin_unlock(&vfsmount_lock);
+	vfsmount_read_unlock();
 
 	if (actual_refs > minimum_refs)
 		return 0;
@@ -927,10 +978,12 @@ EXPORT_SYMBOL(may_umount_tree);
 int may_umount(struct vfsmount *mnt)
 {
 	int ret = 1;
-	spin_lock(&vfsmount_lock);
+
+	vfsmount_read_lock();
 	if (propagate_mount_busy(mnt, 2))
 		ret = 0;
-	spin_unlock(&vfsmount_lock);
+	vfsmount_read_unlock();
+
 	return ret;
 }
 
@@ -945,13 +998,14 @@ void release_mounts(struct list_head *he
 		if (mnt->mnt_parent != mnt) {
 			struct dentry *dentry;
 			struct vfsmount *m;
-			spin_lock(&vfsmount_lock);
+
+			vfsmount_write_lock();
 			dentry = mnt->mnt_mountpoint;
 			m = mnt->mnt_parent;
 			mnt->mnt_mountpoint = mnt->mnt_root;
 			mnt->mnt_parent = mnt;
 			m->mnt_ghosts--;
-			spin_unlock(&vfsmount_lock);
+			vfsmount_write_unlock();
 			dput(dentry);
 			mntput(m);
 		}
@@ -1054,7 +1108,7 @@ static int do_umount(struct vfsmount *mn
 	}
 
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 	event++;
 
 	if (!(flags & MNT_DETACH))
@@ -1066,7 +1120,7 @@ static int do_umount(struct vfsmount *mn
 			umount_tree(mnt, 1, &umount_list);
 		retval = 0;
 	}
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 	if (retval)
 		security_sb_umount_busy(mnt);
 	up_write(&namespace_sem);
@@ -1173,19 +1227,19 @@ struct vfsmount *copy_tree(struct vfsmou
 			q = clone_mnt(p, p->mnt_root, flag);
 			if (!q)
 				goto Enomem;
-			spin_lock(&vfsmount_lock);
+			vfsmount_write_lock();
 			list_add_tail(&q->mnt_list, &res->mnt_list);
 			attach_mnt(q, &path);
-			spin_unlock(&vfsmount_lock);
+			vfsmount_write_unlock();
 		}
 	}
 	return res;
 Enomem:
 	if (res) {
 		LIST_HEAD(umount_list);
-		spin_lock(&vfsmount_lock);
+		vfsmount_write_lock();
 		umount_tree(res, 0, &umount_list);
-		spin_unlock(&vfsmount_lock);
+		vfsmount_write_unlock();
 		release_mounts(&umount_list);
 	}
 	return NULL;
@@ -1204,9 +1258,9 @@ void drop_collected_mounts(struct vfsmou
 {
 	LIST_HEAD(umount_list);
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 	umount_tree(mnt, 0, &umount_list);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
 }
@@ -1324,7 +1378,7 @@ static int attach_recursive_mnt(struct v
 			set_mnt_shared(p);
 	}
 
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 	if (parent_path) {
 		detach_mnt(source_mnt, parent_path);
 		attach_mnt(source_mnt, path);
@@ -1338,7 +1392,8 @@ static int attach_recursive_mnt(struct v
 		list_del_init(&child->mnt_hash);
 		commit_tree(child);
 	}
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
+
 	return 0;
 
  out_cleanup_ids:
@@ -1400,10 +1455,10 @@ static int do_change_type(struct path *p
 			goto out_unlock;
 	}
 
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 	for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL))
 		change_mnt_propagation(m, type);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 
  out_unlock:
 	up_write(&namespace_sem);
@@ -1447,9 +1502,10 @@ static int do_loopback(struct path *path
 	err = graft_tree(mnt, path);
 	if (err) {
 		LIST_HEAD(umount_list);
-		spin_lock(&vfsmount_lock);
+
+		vfsmount_write_lock();
 		umount_tree(mnt, 0, &umount_list);
-		spin_unlock(&vfsmount_lock);
+		vfsmount_write_unlock();
 		release_mounts(&umount_list);
 	}
 
@@ -1507,9 +1563,9 @@ static int do_remount(struct path *path,
 	if (!err) {
 		security_sb_post_remount(path->mnt, flags, data);
 
-		spin_lock(&vfsmount_lock);
+		vfsmount_write_lock();
 		touch_mnt_namespace(path->mnt->mnt_ns);
-		spin_unlock(&vfsmount_lock);
+		vfsmount_write_unlock();
 	}
 	return err;
 }
@@ -1682,7 +1738,7 @@ void mark_mounts_for_expiry(struct list_
 		return;
 
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 
 	/* extract from the expiration list every vfsmount that matches the
 	 * following criteria:
@@ -1701,7 +1757,7 @@ void mark_mounts_for_expiry(struct list_
 		touch_mnt_namespace(mnt->mnt_ns);
 		umount_tree(mnt, 1, &umounts);
 	}
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 	up_write(&namespace_sem);
 
 	release_mounts(&umounts);
@@ -1951,9 +2007,9 @@ static struct mnt_namespace *dup_mnt_ns(
 		kfree(new_ns);
 		return ERR_PTR(-ENOMEM);
 	}
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 	list_add_tail(&new_ns->list, &new_ns->root->mnt_list);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 
 	/*
 	 * Second pass: switch the tsk->fs->* elements and mark new vfsmounts
@@ -2132,7 +2188,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
 		goto out2; /* not attached */
 	/* make sure we can reach put_old from new_root */
 	tmp = old.mnt;
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 	if (tmp != new.mnt) {
 		for (;;) {
 			if (tmp->mnt_parent == tmp)
@@ -2152,7 +2208,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
 	/* mount new_root on / */
 	attach_mnt(new.mnt, &root_parent);
 	touch_mnt_namespace(current->nsproxy->mnt_ns);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 	chroot_fs_refs(&root, &new);
 	security_sb_post_pivotroot(&root, &new);
 	error = 0;
@@ -2168,7 +2224,7 @@ out1:
 out0:
 	return error;
 out3:
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 	goto out2;
 }
 
@@ -2205,6 +2261,7 @@ static void __init init_mount_tree(void)
 void __init mnt_init(void)
 {
 	unsigned u;
+	int i;
 	int err;
 
 	init_rwsem(&namespace_sem);
@@ -2222,6 +2279,9 @@ void __init mnt_init(void)
 	for (u = 0; u < HASH_SIZE; u++)
 		INIT_LIST_HEAD(&mount_hashtable[u]);
 
+	for_each_possible_cpu(i)
+		spin_lock_init(&per_cpu(vfsmount_lock, i));
+
 	err = sysfs_init();
 	if (err)
 		printk(KERN_WARNING "%s: sysfs_init error: %d\n",
@@ -2233,17 +2293,31 @@ void __init mnt_init(void)
 	init_mount_tree();
 }
 
-void __put_mnt_ns(struct mnt_namespace *ns)
+static void __put_mnt_ns(struct mnt_namespace *ns)
 {
 	struct vfsmount *root = ns->root;
 	LIST_HEAD(umount_list);
 	ns->root = NULL;
-	spin_unlock(&vfsmount_lock);
+	vfsmount_read_unlock();
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 	umount_tree(root, 0, &umount_list);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
 	kfree(ns);
 }
+
+void put_mnt_ns(struct mnt_namespace *ns)
+{
+	spinlock_t *lock;
+
+	lock = &get_cpu_var(vfsmount_lock);
+	if (atomic_dec_and_lock(&ns->count, lock)) {
+		/* releases vfsmount_lock */
+		put_cpu_var(vfsmount_lock);
+		__put_mnt_ns(ns);
+	} else
+		put_cpu_var(vfsmount_lock);
+}
+
Index: linux-2.6/fs/pnode.c
===================================================================
--- linux-2.6.orig/fs/pnode.c
+++ linux-2.6/fs/pnode.c
@@ -264,12 +264,12 @@ int propagate_mnt(struct vfsmount *dest_
 		prev_src_mnt  = child;
 	}
 out:
-	spin_lock(&vfsmount_lock);
+	vfsmount_write_lock();
 	while (!list_empty(&tmp_list)) {
 		child = list_first_entry(&tmp_list, struct vfsmount, mnt_hash);
 		umount_tree(child, 0, &umount_list);
 	}
-	spin_unlock(&vfsmount_lock);
+	vfsmount_write_unlock();
 	release_mounts(&umount_list);
 	return ret;
 }
Index: linux-2.6/fs/proc/base.c
===================================================================
--- linux-2.6.orig/fs/proc/base.c
+++ linux-2.6/fs/proc/base.c
@@ -652,12 +652,12 @@ static unsigned mounts_poll(struct file
 
 	poll_wait(file, &ns->poll, wait);
 
-	spin_lock(&vfsmount_lock);
+	vfsmount_read_lock();
 	if (p->event != ns->event) {
 		p->event = ns->event;
 		res |= POLLERR | POLLPRI;
 	}
-	spin_unlock(&vfsmount_lock);
+	vfsmount_read_unlock();
 
 	return res;
 }
Index: linux-2.6/include/linux/mnt_namespace.h
===================================================================
--- linux-2.6.orig/include/linux/mnt_namespace.h
+++ linux-2.6/include/linux/mnt_namespace.h
@@ -26,14 +26,8 @@ struct fs_struct;
 
 extern struct mnt_namespace *copy_mnt_ns(unsigned long, struct mnt_namespace *,
 		struct fs_struct *);
-extern void __put_mnt_ns(struct mnt_namespace *ns);
 
-static inline void put_mnt_ns(struct mnt_namespace *ns)
-{
-	if (atomic_dec_and_lock(&ns->count, &vfsmount_lock))
-		/* releases vfsmount_lock */
-		__put_mnt_ns(ns);
-}
+extern void put_mnt_ns(struct mnt_namespace *ns);
 
 static inline void exit_mnt_ns(struct task_struct *p)
 {
Index: linux-2.6/include/linux/mount.h
===================================================================
--- linux-2.6.orig/include/linux/mount.h
+++ linux-2.6/include/linux/mount.h
@@ -90,6 +90,11 @@ static inline struct vfsmount *mntget(st
 
 struct file; /* forward dec */
 
+extern void vfsmount_read_lock(void);
+extern void vfsmount_read_unlock(void);
+extern void vfsmount_write_lock(void);
+extern void vfsmount_write_unlock(void);
+
 extern int mnt_want_write(struct vfsmount *mnt);
 extern int mnt_want_write_file(struct vfsmount *mnt, struct file *file);
 extern void mnt_clone_write(struct vfsmount *mnt);
@@ -123,7 +128,6 @@ extern int do_add_mount(struct vfsmount
 
 extern void mark_mounts_for_expiry(struct list_head *mounts);
 
-extern spinlock_t vfsmount_lock;
 extern dev_t name_to_dev_t(char *name);
 
 #endif /* _LINUX_MOUNT_H */
Index: linux-2.6/kernel/audit_tree.c
===================================================================
--- linux-2.6.orig/kernel/audit_tree.c
+++ linux-2.6/kernel/audit_tree.c
@@ -757,15 +757,15 @@ int audit_tag_tree(char *old, char *new)
 			continue;
 		}
 
-		spin_lock(&vfsmount_lock);
+		vfsmount_read_lock();
 		if (!is_under(mnt, dentry, &path)) {
-			spin_unlock(&vfsmount_lock);
+			vfsmount_read_unlock();
 			path_put(&path);
 			put_tree(tree);
 			mutex_lock(&audit_filter_mutex);
 			continue;
 		}
-		spin_unlock(&vfsmount_lock);
+		vfsmount_read_unlock();
 		path_put(&path);
 
 		list_for_each_entry(p, &list, mnt_list) {
Index: linux-2.6/security/tomoyo/realpath.c
===================================================================
--- linux-2.6.orig/security/tomoyo/realpath.c
+++ linux-2.6/security/tomoyo/realpath.c
@@ -96,12 +96,12 @@ int tomoyo_realpath_from_path2(struct pa
 		root = current->fs->root;
 		path_get(&root);
 		read_unlock(&current->fs->lock);
-		spin_lock(&vfsmount_lock);
+		vfsmount_read_lock();
 		if (root.mnt && root.mnt->mnt_ns)
 			ns_root.mnt = mntget(root.mnt->mnt_ns->root);
 		if (ns_root.mnt)
 			ns_root.dentry = dget(ns_root.mnt->mnt_root);
-		spin_unlock(&vfsmount_lock);
+		vfsmount_read_unlock();
 		spin_lock(&dcache_lock);
 		tmp = ns_root;
 		sp = __d_path(path, &tmp, newname, newname_len);
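
The vfsmount_read_lock()/vfsmount_write_lock() pair used above is a classic
brlock: one spinlock per CPU, readers take only their own CPU's lock, writers
take all of them, so the read side never bounces a shared cache line. A
minimal userspace sketch of the idea (pthread mutexes standing in for the
per-CPU spinlocks and a caller-supplied index for the CPU id; the names here
are illustrative, not from the patch):

#include <pthread.h>
#include <stdio.h>

#define NR_CPUS 4

static pthread_mutex_t brlock[NR_CPUS];

static void br_read_lock(int cpu)   { pthread_mutex_lock(&brlock[cpu]); }
static void br_read_unlock(int cpu) { pthread_mutex_unlock(&brlock[cpu]); }

static void br_write_lock(void)
{
	int i;

	for (i = 0; i < NR_CPUS; i++)	/* fixed order: no ABBA deadlock */
		pthread_mutex_lock(&brlock[i]);
}

static void br_write_unlock(void)
{
	int i;

	for (i = NR_CPUS - 1; i >= 0; i--)
		pthread_mutex_unlock(&brlock[i]);
}

int main(void)
{
	int i;

	for (i = 0; i < NR_CPUS; i++)
		pthread_mutex_init(&brlock[i], NULL);
	br_read_lock(0);	/* a reader touches only its own slot */
	br_read_unlock(0);
	br_write_lock();	/* a writer waits for every reader */
	br_write_unlock();
	printf("ok\n");
	return 0;
}

Reads stay uncontended; writes pay O(NR_CPUS), which suits a read-mostly lock
like vfsmount_lock where mount/umount is the rare operation.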




* [patch 06/27] fs: dcache fix LRU ordering
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (4 preceding siblings ...)
  2009-04-25  1:20 ` [patch 05/27] fs: brlock vfsmount_lock npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 07/27] fs: dcache scale hash npiggin
                   ` (21 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-dcache-order-lru.patch --]
[-- Type: text/plain, Size: 771 bytes --]

Fix ordering of LRU when moving referenced dentries to the head of the list
(they should go to the head of the list in the same order as they were found
from the tail, rather than reverse order).
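
To see why list_move() is the right primitive, here is a self-contained
userspace model of the list semantics (a toy re-implementation, not
<linux/list.h>): the scan drains entries from the tail of the LRU (oldest
first) onto a temporary list that is later spliced back at the head, and
list_move() is what keeps their relative age intact.

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name) { &(name), &(name) }

static void __list_add(struct list_head *n, struct list_head *prev,
		       struct list_head *next)
{
	next->prev = n;
	n->next = next;
	n->prev = prev;
	prev->next = n;
}

static void list_del(struct list_head *e)
{
	e->next->prev = e->prev;
	e->prev->next = e->next;
}

static void list_move(struct list_head *e, struct list_head *head)
{
	list_del(e);
	__list_add(e, head, head->next);	/* re-add at the head */
}

static void list_move_tail(struct list_head *e, struct list_head *head)
{
	list_del(e);
	__list_add(e, head->prev, head);	/* re-add at the tail */
}

struct ent { struct list_head lru; char name; };  /* lru first: cast is safe */

int main(void)
{
	struct list_head lru = LIST_HEAD_INIT(lru);
	struct list_head referenced = LIST_HEAD_INIT(referenced);
	struct ent a = { .name = 'A' }, b = { .name = 'B' }, c = { .name = 'C' };
	struct list_head *p;

	/* build head -> C, B, A <- tail; A is the oldest entry */
	__list_add(&a.lru, &lru, lru.next);
	__list_add(&b.lru, &lru, lru.next);
	__list_add(&c.lru, &lru, lru.next);

	/* prune scan: take from the tail, pretend all are DCACHE_REFERENCED */
	while (lru.prev != &lru)
		list_move(lru.prev, &referenced);	/* the fix */

	/* list_move_tail() above would print "A B C": age order reversed */
	for (p = referenced.next; p != &referenced; p = p->next)
		printf("%c ", ((struct ent *)p)->name);
	printf("\n");	/* prints "C B A": same relative order as the LRU had */
	return 0;
}
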
---
 fs/dcache.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -481,7 +481,7 @@ restart:
 			if ((flags & DCACHE_REFERENCED)
 				&& (dentry->d_flags & DCACHE_REFERENCED)) {
 				dentry->d_flags &= ~DCACHE_REFERENCED;
-				list_move_tail(&dentry->d_lru, &referenced);
+				list_move(&dentry->d_lru, &referenced);
 				spin_unlock(&dentry->d_lock);
 			} else {
 				list_move_tail(&dentry->d_lru, &tmp);




* [patch 07/27] fs: dcache scale hash
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (5 preceding siblings ...)
  2009-04-25  1:20 ` [patch 06/27] fs: dcache fix LRU ordering npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 08/27] fs: dcache scale lru npiggin
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-dcache-scale-d_hash.patch --]
[-- Type: text/plain, Size: 3803 bytes --]

Add a new lock, dcache_hash_lock, to protect the dcache hash table from
concurrent modification. d_hash is also protected by d_lock.
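
The shape of the change, modelled in userspace (a pthread mutex standing in
for the new spinlock, a toy hlist with the kernel's next/pprev layout; the
names are illustrative): only the hash-table links need dcache_hash_lock, so
hashing and unhashing stop serializing on anything wider.

#include <pthread.h>
#include <stdio.h>

#define HASH_SIZE 16

struct node { struct node *next, **pprev; int key; };

static struct node *hash_tbl[HASH_SIZE];
static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;

static void hash_insert(struct node *n)	/* cf. __d_rehash() */
{
	struct node **head = &hash_tbl[n->key % HASH_SIZE];

	pthread_mutex_lock(&hash_lock);
	n->next = *head;
	if (n->next)
		n->next->pprev = &n->next;
	n->pprev = head;
	*head = n;
	pthread_mutex_unlock(&hash_lock);
}

static void hash_remove(struct node *n)	/* cf. __d_drop() */
{
	pthread_mutex_lock(&hash_lock);
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	pthread_mutex_unlock(&hash_lock);
}

int main(void)
{
	struct node n = { .key = 42 };

	hash_insert(&n);
	hash_remove(&n);
	printf("bucket empty: %s\n", hash_tbl[42 % HASH_SIZE] ? "no" : "yes");
	return 0;
}
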
---
 fs/dcache.c            |   35 ++++++++++++++++++++++++-----------
 include/linux/dcache.h |    3 +++
 2 files changed, 27 insertions(+), 11 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -34,12 +34,23 @@
 #include <linux/fs_struct.h>
 #include "internal.h"
 
+/*
+ * Usage:
+ * dcache_hash_lock protects dcache hash table
+ *
+ * Ordering:
+ * dcache_lock
+ *   dentry->d_lock
+ *     dcache_hash_lock
+ */
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
- __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
+EXPORT_SYMBOL(dcache_hash_lock);
 EXPORT_SYMBOL(dcache_lock);
 
 static struct kmem_cache *dentry_cache __read_mostly;
@@ -1466,17 +1477,20 @@ int d_validate(struct dentry *dentry, st
 		goto out;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_hash_lock);
 	base = d_hash(dparent, dentry->d_name.hash);
 	hlist_for_each(lhp,base) { 
 		/* hlist_for_each_entry_rcu() not required for d_hash list
 		 * as it is parsed under dcache_lock
 		 */
 		if (dentry == hlist_entry(lhp, struct dentry, d_hash)) {
+			spin_unlock(&dcache_hash_lock);
 			__dget_locked(dentry);
 			spin_unlock(&dcache_lock);
 			return 1;
 		}
 	}
+	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&dcache_lock);
 out:
 	return 0;
@@ -1550,7 +1564,9 @@ void d_rehash(struct dentry * entry)
 {
 	spin_lock(&dcache_lock);
 	spin_lock(&entry->d_lock);
+	spin_lock(&dcache_hash_lock);
 	_d_rehash(entry);
+	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&entry->d_lock);
 	spin_unlock(&dcache_lock);
 }
@@ -1629,8 +1645,6 @@ static void switch_names(struct dentry *
  */
 static void d_move_locked(struct dentry * dentry, struct dentry * target)
 {
-	struct hlist_head *list;
-
 	if (!dentry->d_inode)
 		printk(KERN_WARNING "VFS: moving negative dcache entry\n");
 
@@ -1647,14 +1661,11 @@ static void d_move_locked(struct dentry
 	}
 
 	/* Move the dentry to the target hash queue, if on different bucket */
-	if (d_unhashed(dentry))
-		goto already_unhashed;
-
-	hlist_del_rcu(&dentry->d_hash);
-
-already_unhashed:
-	list = d_hash(target->d_parent, target->d_name.hash);
-	__d_rehash(dentry, list);
+	spin_lock(&dcache_hash_lock);
+	if (!d_unhashed(dentry))
+		hlist_del_rcu(&dentry->d_hash);
+	__d_rehash(dentry, d_hash(target->d_parent, target->d_name.hash));
+	spin_unlock(&dcache_hash_lock);
 
 	/* Unhash the target: dput() will then get rid of it */
 	__d_drop(target);
@@ -1850,7 +1861,9 @@ struct dentry *d_materialise_unique(stru
 found_lock:
 	spin_lock(&actual->d_lock);
 found:
+	spin_lock(&dcache_hash_lock);
 	_d_rehash(actual);
+	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&actual->d_lock);
 	spin_unlock(&dcache_lock);
 out_nolock:
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -184,6 +184,7 @@ d_iput:		no		no		no       yes
 
 #define DCACHE_COOKIE		0x0040	/* For use by dcookie subsystem */
 
+extern spinlock_t dcache_hash_lock;
 extern spinlock_t dcache_lock;
 extern seqlock_t rename_lock;
 
@@ -207,7 +208,9 @@ static inline void __d_drop(struct dentr
 {
 	if (!(dentry->d_flags & DCACHE_UNHASHED)) {
 		dentry->d_flags |= DCACHE_UNHASHED;
+		spin_lock(&dcache_hash_lock);
 		hlist_del_rcu(&dentry->d_hash);
+		spin_unlock(&dcache_hash_lock);
 	}
 }
 




* [patch 08/27] fs: dcache scale lru
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (6 preceding siblings ...)
  2009-04-25  1:20 ` [patch 07/27] fs: dcache scale hash npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 09/27] fs: dcache scale nr_dentry npiggin
                   ` (19 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-dcache-scale-d_lru.patch --]
[-- Type: text/plain, Size: 7877 bytes --]

Add a new lock, dcache_lru_lock, to protect the dcache LRU lists and their
counters from concurrent modification. d_lru is also protected by d_lock.

Move lru scanning out from underneath dcache_lock.
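
Because the scan now holds dcache_lru_lock while walking, but the documented
order puts d_lock outside dcache_lru_lock, the scan has to take d_lock with a
trylock and back off completely on failure. That idiom in miniature
(userspace sketch, mutexes standing in for the spinlocks; names are
illustrative):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t d_lock = PTHREAD_MUTEX_INITIALIZER;	/* outer */
static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;	/* inner */

static void lru_del_one(void)		/* normal path: outer, then inner */
{
	pthread_mutex_lock(&d_lock);
	pthread_mutex_lock(&lru_lock);
	/* ... unlink the entry from the LRU ... */
	pthread_mutex_unlock(&lru_lock);
	pthread_mutex_unlock(&d_lock);
}

static void scan_one(void)		/* scan path: inner already held */
{
relock:
	pthread_mutex_lock(&lru_lock);
	if (pthread_mutex_trylock(&d_lock)) {	/* nonzero: contended */
		pthread_mutex_unlock(&lru_lock);
		goto relock;		/* drop everything and retry */
	}
	/* ... examine the entry with both locks held ... */
	pthread_mutex_unlock(&d_lock);
	pthread_mutex_unlock(&lru_lock);
}

int main(void)
{
	lru_del_one();
	scan_one();
	printf("ok\n");
	return 0;
}

Blocking on d_lock there would be an ABBA deadlock against anyone using the
normal d_lock -> dcache_lru_lock order, which is why __shrink_dcache_sb()
below uses spin_trylock() and "goto relock".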

---
 fs/dcache.c |  105 ++++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 85 insertions(+), 20 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -36,17 +36,26 @@
 
 /*
  * Usage:
- * dcache_hash_lock protects dcache hash table
+ * dcache_hash_lock protects:
+ *   - the dcache hash table
+ * dcache_lru_lock protects:
+ *   - the dcache lru lists and counters
+ * d_lock protects:
+ *   - d_flags
+ *   - d_name
+ *   - d_lru
  *
  * Ordering:
  * dcache_lock
  *   dentry->d_lock
+ *     dcache_lru_lock
  *     dcache_hash_lock
  */
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
@@ -133,37 +142,56 @@ static void dentry_iput(struct dentry *
 }
 
 /*
- * dentry_lru_(add|add_tail|del|del_init) must be called with dcache_lock held.
+ * dentry_lru_(add|add_tail|del|del_init) must be called with d_lock held
+ * to protect list_empty(d_lru) condition.
  */
 static void dentry_lru_add(struct dentry *dentry)
 {
+	spin_lock(&dcache_lru_lock);
 	list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
 	dentry->d_sb->s_nr_dentry_unused++;
 	dentry_stat.nr_unused++;
+	spin_unlock(&dcache_lru_lock);
 }
 
 static void dentry_lru_add_tail(struct dentry *dentry)
 {
+	spin_lock(&dcache_lru_lock);
 	list_add_tail(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
 	dentry->d_sb->s_nr_dentry_unused++;
 	dentry_stat.nr_unused++;
+	spin_unlock(&dcache_lru_lock);
+}
+
+static void __dentry_lru_del(struct dentry *dentry)
+{
+	list_del(&dentry->d_lru);
+	dentry->d_sb->s_nr_dentry_unused--;
+	dentry_stat.nr_unused--;
+}
+
+static void __dentry_lru_del_init(struct dentry *dentry)
+{
+	list_del_init(&dentry->d_lru);
+	dentry->d_sb->s_nr_dentry_unused--;
+	dentry_stat.nr_unused--;
 }
 
 static void dentry_lru_del(struct dentry *dentry)
 {
 	if (!list_empty(&dentry->d_lru)) {
-		list_del(&dentry->d_lru);
-		dentry->d_sb->s_nr_dentry_unused--;
-		dentry_stat.nr_unused--;
+		spin_lock(&dcache_lru_lock);
+		__dentry_lru_del(dentry);
+		spin_unlock(&dcache_lru_lock);
 	}
 }
 
 static void dentry_lru_del_init(struct dentry *dentry)
 {
 	if (likely(!list_empty(&dentry->d_lru))) {
-		list_del_init(&dentry->d_lru);
-		dentry->d_sb->s_nr_dentry_unused--;
-		dentry_stat.nr_unused--;
+		spin_lock(&dcache_lru_lock);
+		__dentry_lru_del_init(dentry);
+		spin_unlock(&dcache_lru_lock);
 	}
 }
 
@@ -174,6 +202,8 @@ static void dentry_lru_del_init(struct d
  * The dentry must already be unhashed and removed from the LRU.
  *
  * If this is the root of the dentry tree, return NULL.
+ *
+ * dcache_lock and d_lock must be held by caller, are dropped by d_kill.
  */
 static struct dentry *d_kill(struct dentry *dentry)
 	__releases(dentry->d_lock)
@@ -326,11 +356,19 @@ int d_invalidate(struct dentry * dentry)
 }
 
 /* This should be called _only_ with dcache_lock held */
+static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
+{
+	atomic_inc(&dentry->d_count);
+	dentry_lru_del_init(dentry);
+	return dentry;
+}
 
 static inline struct dentry * __dget_locked(struct dentry *dentry)
 {
 	atomic_inc(&dentry->d_count);
+	spin_lock(&dentry->d_lock);
 	dentry_lru_del_init(dentry);
+	spin_unlock(&dentry->d_lock);
 	return dentry;
 }
 
@@ -407,7 +445,7 @@ restart:
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
 		if (!atomic_read(&dentry->d_count)) {
-			__dget_locked(dentry);
+			__dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
@@ -439,17 +477,18 @@ static void prune_one_dentry(struct dent
 	 * Prune ancestors.  Locking is simpler than in dput(),
 	 * because dcache_lock needs to be taken anyway.
 	 */
-	spin_lock(&dcache_lock);
 	while (dentry) {
-		if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_lock))
+		spin_lock(&dcache_lock);
+		if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_lock)) {
+			spin_unlock(&dcache_lock);
 			return;
+		}
 
 		if (dentry->d_op && dentry->d_op->d_delete)
 			dentry->d_op->d_delete(dentry);
 		dentry_lru_del_init(dentry);
 		__d_drop(dentry);
 		dentry = d_kill(dentry);
-		spin_lock(&dcache_lock);
 	}
 }
 
@@ -470,10 +509,11 @@ static void __shrink_dcache_sb(struct su
 
 	BUG_ON(!sb);
 	BUG_ON((flags & DCACHE_REFERENCED) && count == NULL);
-	spin_lock(&dcache_lock);
 	if (count != NULL)
 		/* called from prune_dcache() and shrink_dcache_parent() */
 		cnt = *count;
+relock:
+	spin_lock(&dcache_lru_lock);
 restart:
 	if (count == NULL)
 		list_splice_init(&sb->s_dentry_lru, &tmp);
@@ -483,7 +523,10 @@ restart:
 					struct dentry, d_lru);
 			BUG_ON(dentry->d_sb != sb);
 
-			spin_lock(&dentry->d_lock);
+			if (!spin_trylock(&dentry->d_lock)) {
+				spin_unlock(&dcache_lru_lock);
+				goto relock;
+			}
 			/*
 			 * If we are honouring the DCACHE_REFERENCED flag and
 			 * the dentry has this flag set, don't free it. Clear
@@ -501,13 +544,22 @@ restart:
 				if (!cnt)
 					break;
 			}
-			cond_resched_lock(&dcache_lock);
+			cond_resched_lock(&dcache_lru_lock);
 		}
 	}
+	spin_unlock(&dcache_lru_lock);
+
+	spin_lock(&dcache_lock);
+again:
+	spin_lock(&dcache_lru_lock); /* lru_lock also protects tmp list */
 	while (!list_empty(&tmp)) {
 		dentry = list_entry(tmp.prev, struct dentry, d_lru);
-		dentry_lru_del_init(dentry);
-		spin_lock(&dentry->d_lock);
+
+		if (!spin_trylock(&dentry->d_lock)) {
+			spin_unlock(&dcache_lru_lock);
+			goto again;
+		}
+		__dentry_lru_del_init(dentry);
 		/*
 		 * We found an inuse dentry which was not removed from
 		 * the LRU because of laziness during lookup.  Do not free
@@ -517,17 +569,22 @@ restart:
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
+
+		spin_unlock(&dcache_lru_lock);
 		prune_one_dentry(dentry);
-		/* dentry->d_lock was dropped in prune_one_dentry() */
-		cond_resched_lock(&dcache_lock);
+		/* dcache_lock and dentry->d_lock dropped */
+		spin_lock(&dcache_lock);
+		spin_lock(&dcache_lru_lock);
 	}
+	spin_unlock(&dcache_lock);
+
 	if (count == NULL && !list_empty(&sb->s_dentry_lru))
 		goto restart;
 	if (count != NULL)
 		*count = cnt;
 	if (!list_empty(&referenced))
 		list_splice(&referenced, &sb->s_dentry_lru);
-	spin_unlock(&dcache_lock);
+	spin_unlock(&dcache_lru_lock);
 }
 
 /**
@@ -635,7 +692,9 @@ static void shrink_dcache_for_umount_sub
 
 	/* detach this root from the system */
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	dentry_lru_del_init(dentry);
+	spin_unlock(&dentry->d_lock);
 	__d_drop(dentry);
 	spin_unlock(&dcache_lock);
 
@@ -649,7 +708,9 @@ static void shrink_dcache_for_umount_sub
 			spin_lock(&dcache_lock);
 			list_for_each_entry(loop, &dentry->d_subdirs,
 					    d_u.d_child) {
+				spin_lock(&loop->d_lock);
 				dentry_lru_del_init(loop);
+				spin_unlock(&loop->d_lock);
 				__d_drop(loop);
 				cond_resched_lock(&dcache_lock);
 			}
@@ -832,13 +893,17 @@ resume:
 		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
 
+		spin_lock(&dentry->d_lock);
 		dentry_lru_del_init(dentry);
+		spin_unlock(&dentry->d_lock);
 		/* 
 		 * move only zero ref count dentries to the end 
 		 * of the unused list for prune_dcache
 		 */
 		if (!atomic_read(&dentry->d_count)) {
+			spin_lock(&dentry->d_lock);
 			dentry_lru_add_tail(dentry);
+			spin_unlock(&dentry->d_lock);
 			found++;
 		}
 




* [patch 09/27] fs: dcache scale nr_dentry
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (7 preceding siblings ...)
  2009-04-25  1:20 ` [patch 08/27] fs: dcache scale lru npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 10/27] fs: dcache scale dentry refcount npiggin
                   ` (18 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-dcache-scale-nr_dentry.patch --]
[-- Type: text/plain, Size: 3093 bytes --]

Make dentry_stat_t.nr_dentry an atomic_t type, and move it from under
dcache_lock.
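
The change in miniature, as userspace C11 (illustrative, not kernel code): a
statistics counter that used to need the global lock becomes an atomic, so
the alloc/free paths stop serializing on dcache_lock just for accounting.
Relaxed ordering is enough for a pure statistic:

#include <stdatomic.h>
#include <stdio.h>

static atomic_int nr_dentry;	/* was: plain int under the global lock */

static void alloc_one(void)
{
	atomic_fetch_add_explicit(&nr_dentry, 1, memory_order_relaxed);
}

static void free_one(void)
{
	atomic_fetch_sub_explicit(&nr_dentry, 1, memory_order_relaxed);
}

int main(void)
{
	alloc_one();
	alloc_one();
	free_one();
	printf("nr_dentry = %d\n", atomic_load(&nr_dentry));	/* 1 */
	return 0;
}
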
---
 fs/dcache.c            |   20 +++++++++-----------
 include/linux/dcache.h |    4 ++--
 kernel/sysctl.c        |    6 ++++++
 3 files changed, 17 insertions(+), 13 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -83,6 +83,7 @@ static struct hlist_head *dentry_hashtab
 
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {
+	.nr_dentry = ATOMIC_INIT(0),
 	.age_limit = 45,
 };
 
@@ -101,11 +102,11 @@ static void d_callback(struct rcu_head *
 }
 
 /*
- * no dcache_lock, please.  The caller must decrement dentry_stat.nr_dentry
- * inside dcache_lock.
+ * no dcache_lock, please.
  */
 static void d_free(struct dentry *dentry)
 {
+	atomic_dec(&dentry_stat.nr_dentry);
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 	/* if dentry was never inserted into hash, immediate free is OK */
@@ -212,7 +213,6 @@ static struct dentry *d_kill(struct dent
 	struct dentry *parent;
 
 	list_del(&dentry->d_u.d_child);
-	dentry_stat.nr_dentry--;	/* For d_free, below */
 	/*drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
 	if (IS_ROOT(dentry))
@@ -777,10 +777,7 @@ static void shrink_dcache_for_umount_sub
 				    struct dentry, d_u.d_child);
 	}
 out:
-	/* several dentries were freed, need to correct nr_dentry */
-	spin_lock(&dcache_lock);
-	dentry_stat.nr_dentry -= detached;
-	spin_unlock(&dcache_lock);
+	return;
 }
 
 /*
@@ -1035,11 +1032,12 @@ struct dentry *d_alloc(struct dentry * p
 		INIT_LIST_HEAD(&dentry->d_u.d_child);
 	}
 
-	spin_lock(&dcache_lock);
-	if (parent)
+	if (parent) {
+		spin_lock(&dcache_lock);
 		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
-	dentry_stat.nr_dentry++;
-	spin_unlock(&dcache_lock);
+		spin_unlock(&dcache_lock);
+	}
+	atomic_inc(&dentry_stat.nr_dentry);
 
 	return dentry;
 }
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -37,8 +37,8 @@ struct qstr {
 };
 
 struct dentry_stat_t {
-	int nr_dentry;
-	int nr_unused;
+	atomic_t nr_dentry;
+	int nr_unused;		/* protected by dcache_lru_lock */
 	int age_limit;          /* age in seconds */
 	int want_pages;         /* pages requested by system */
 	int dummy[2];
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -1358,6 +1358,12 @@ static struct ctl_table fs_table[] = {
 		.extra2		= &sysctl_nr_open_max,
 	},
 	{
+		/*
+		 * dentry_stat has an atomic_t member, so this is a bit of
+		 * a hack, but it works for the moment, and I won't bother
+		 * changing it now because we'll probably want to change to
+		 * a more scalable counter anyway.
+		 */
 		.ctl_name	= FS_DENTRY,
 		.procname	= "dentry-state",
 		.data		= &dentry_stat,




* [patch 10/27] fs: dcache scale dentry refcount
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (8 preceding siblings ...)
  2009-04-25  1:20 ` [patch 09/27] fs: dcache scale nr_dentry npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 11/27] fs: dcache scale d_unhashed npiggin
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-dcache-scale-d_count.patch --]
[-- Type: text/plain, Size: 25048 bytes --]

Make d_count non-atomic and protect it with d_lock. This allows us to
ensure a 0 refcount dentry remains 0 without dcache_lock. It is also
fairly natural when we start protecting many other dentry members with
d_lock.
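
The resulting refcount discipline, sketched in userspace (a mutex standing in
for d_lock; the names are illustrative): because every modification happens
under the object's own lock, "the count hit zero" and "tear the object down"
become a single decision under that lock, with no atomic_dec_and_lock() dance
against a global lock.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct obj {
	pthread_mutex_t lock;
	int count;			/* protected by ->lock, like d_count */
};

static void obj_get(struct obj *o)
{
	pthread_mutex_lock(&o->lock);
	o->count++;
	pthread_mutex_unlock(&o->lock);
}

static void obj_put(struct obj *o)
{
	int dead;

	pthread_mutex_lock(&o->lock);
	dead = (--o->count == 0);	/* decided under the lock */
	pthread_mutex_unlock(&o->lock);
	if (dead) {			/* no references left: safe to free */
		pthread_mutex_destroy(&o->lock);
		free(o);
	}
}

int main(void)
{
	struct obj *o = malloc(sizeof(*o));

	pthread_mutex_init(&o->lock, NULL);
	o->count = 1;
	obj_get(o);
	obj_put(o);
	obj_put(o);			/* last put frees o */
	printf("ok\n");
	return 0;
}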

---
 arch/powerpc/platforms/cell/spufs/inode.c |    2 
 drivers/infiniband/hw/ipath/ipath_fs.c    |    2 
 fs/autofs4/expire.c                       |    8 +-
 fs/autofs4/root.c                         |    6 -
 fs/coda/dir.c                             |    2 
 fs/configfs/dir.c                         |    3 
 fs/configfs/inode.c                       |    2 
 fs/dcache.c                               |  103 ++++++++++++++++++++++--------
 fs/ecryptfs/inode.c                       |    2 
 fs/exportfs/expfs.c                       |    8 ++
 fs/hpfs/namei.c                           |    2 
 fs/locks.c                                |    2 
 fs/namei.c                                |    2 
 fs/nfs/dir.c                              |   12 +--
 fs/nfsd/vfs.c                             |    5 -
 fs/notify/dnotify/dnotify.c               |   11 ++-
 fs/notify/inotify/inotify.c               |   12 ++-
 fs/smbfs/dir.c                            |    8 ++
 fs/smbfs/proc.c                           |    8 ++
 include/linux/dcache.h                    |   29 ++++----
 kernel/cgroup.c                           |    2 
 net/sunrpc/rpc_pipe.c                     |    2 
 22 files changed, 157 insertions(+), 76 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -107,6 +107,7 @@ static void d_callback(struct rcu_head *
 static void d_free(struct dentry *dentry)
 {
 	atomic_dec(&dentry_stat.nr_dentry);
+	BUG_ON(dentry->d_count);
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 	/* if dentry was never inserted into hash, immediate free is OK */
@@ -258,13 +259,23 @@ void dput(struct dentry *dentry)
 		return;
 
 repeat:
-	if (atomic_read(&dentry->d_count) == 1)
+	if (dentry->d_count == 1)
 		might_sleep();
-	if (!atomic_dec_and_lock(&dentry->d_count, &dcache_lock))
-		return;
-
 	spin_lock(&dentry->d_lock);
-	if (atomic_read(&dentry->d_count)) {
+	if (dentry->d_count == 1) {
+		if (!spin_trylock(&dcache_lock)) {
+			/*
+			 * Something of a livelock possibility we could avoid
+			 * by taking dcache_lock and trying again, but we
+			 * want to reduce dcache_lock anyway so this will
+			 * get improved.
+			 */
+			spin_unlock(&dentry->d_lock);
+			goto repeat;
+		}
+	}
+	dentry->d_count--;
+	if (dentry->d_count) {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
 		return;
@@ -341,7 +352,7 @@ int d_invalidate(struct dentry * dentry)
 	 * working directory or similar).
 	 */
 	spin_lock(&dentry->d_lock);
-	if (atomic_read(&dentry->d_count) > 1) {
+	if (dentry->d_count > 1) {
 		if (dentry->d_inode && S_ISDIR(dentry->d_inode->i_mode)) {
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
@@ -355,28 +366,54 @@ int d_invalidate(struct dentry * dentry)
 	return 0;
 }
 
-/* This should be called _only_ with dcache_lock held */
+/* This should be called _only_ with a lock pinning the dentry */
 static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
 {
-	atomic_inc(&dentry->d_count);
+	dentry->d_count++;
 	dentry_lru_del_init(dentry);
 	return dentry;
 }
 
 static inline struct dentry * __dget_locked(struct dentry *dentry)
 {
-	atomic_inc(&dentry->d_count);
 	spin_lock(&dentry->d_lock);
-	dentry_lru_del_init(dentry);
+	__dget_locked_dlock(dentry);
 	spin_unlock(&dentry->d_lock);
 	return dentry;
 }
 
+struct dentry * dget_locked_dlock(struct dentry *dentry)
+{
+	return __dget_locked_dlock(dentry);
+}
+
 struct dentry * dget_locked(struct dentry *dentry)
 {
 	return __dget_locked(dentry);
 }
 
+struct dentry *dget_parent(struct dentry *dentry)
+{
+	struct dentry *ret;
+
+repeat:
+	spin_lock(&dentry->d_lock);
+	ret = dentry->d_parent;
+	if (!ret)
+		goto out;
+	if (ret == dentry) {
+		/* IS_ROOT: d_parent is the dentry itself, d_lock already held */
+		ret->d_count++;
+		goto out;
+	}
+	if (!spin_trylock(&ret->d_lock)) {
+		spin_unlock(&dentry->d_lock);
+		goto repeat;
+	}
+	BUG_ON(!ret->d_count);
+	ret->d_count++;
+	spin_unlock(&ret->d_lock);
+out:
+	spin_unlock(&dentry->d_lock);
+	return ret;
+}
+EXPORT_SYMBOL(dget_parent);
+
 /**
  * d_find_alias - grab a hashed alias of inode
  * @inode: inode in question
@@ -444,7 +481,7 @@ restart:
 	spin_lock(&dcache_lock);
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
-		if (!atomic_read(&dentry->d_count)) {
+		if (!dentry->d_count) {
 			__dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
@@ -479,7 +516,10 @@ static void prune_one_dentry(struct dent
 	 */
 	while (dentry) {
 		spin_lock(&dcache_lock);
-		if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_lock)) {
+		spin_lock(&dentry->d_lock);
+		dentry->d_count--;
+		if (dentry->d_count) {
+			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
 			return;
 		}
@@ -565,7 +605,7 @@ again:
 		 * the LRU because of laziness during lookup.  Do not free
 		 * it - just keep it off the LRU list.
 		 */
-		if (atomic_read(&dentry->d_count)) {
+		if (dentry->d_count) {
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
@@ -726,7 +766,7 @@ static void shrink_dcache_for_umount_sub
 		do {
 			struct inode *inode;
 
-			if (atomic_read(&dentry->d_count) != 0) {
+			if (dentry->d_count != 0) {
 				printk(KERN_ERR
 				       "BUG: Dentry %p{i=%lx,n=%s}"
 				       " still in use (%d)"
@@ -735,7 +775,7 @@ static void shrink_dcache_for_umount_sub
 				       dentry->d_inode ?
 				       dentry->d_inode->i_ino : 0UL,
 				       dentry->d_name.name,
-				       atomic_read(&dentry->d_count),
+				       dentry->d_count,
 				       dentry->d_sb->s_type->name,
 				       dentry->d_sb->s_id);
 				BUG();
@@ -745,7 +785,9 @@ static void shrink_dcache_for_umount_sub
 				parent = NULL;
 			else {
 				parent = dentry->d_parent;
-				atomic_dec(&parent->d_count);
+				spin_lock(&parent->d_lock);
+				parent->d_count--;
+				spin_unlock(&parent->d_lock);
 			}
 
 			list_del(&dentry->d_u.d_child);
@@ -800,7 +842,9 @@ void shrink_dcache_for_umount(struct sup
 
 	dentry = sb->s_root;
 	sb->s_root = NULL;
-	atomic_dec(&dentry->d_count);
+	spin_lock(&dentry->d_lock);
+	dentry->d_count--;
+	spin_unlock(&dentry->d_lock);
 	shrink_dcache_for_umount_subtree(dentry);
 
 	while (!hlist_empty(&sb->s_anon)) {
@@ -892,17 +936,15 @@ resume:
 
 		spin_lock(&dentry->d_lock);
 		dentry_lru_del_init(dentry);
-		spin_unlock(&dentry->d_lock);
 		/* 
 		 * move only zero ref count dentries to the end 
 		 * of the unused list for prune_dcache
 		 */
-		if (!atomic_read(&dentry->d_count)) {
-			spin_lock(&dentry->d_lock);
+		if (!dentry->d_count) {
 			dentry_lru_add_tail(dentry);
-			spin_unlock(&dentry->d_lock);
 			found++;
 		}
+		spin_unlock(&dentry->d_lock);
 
 		/*
 		 * We can return to the caller if we have found some (this
@@ -1011,7 +1053,7 @@ struct dentry *d_alloc(struct dentry * p
 	memcpy(dname, name->name, name->len);
 	dname[name->len] = 0;
 
-	atomic_set(&dentry->d_count, 1);
+	dentry->d_count = 1;
 	dentry->d_flags = DCACHE_UNHASHED;
 	spin_lock_init(&dentry->d_lock);
 	dentry->d_inode = NULL;
@@ -1479,7 +1521,7 @@ struct dentry * __d_lookup(struct dentry
 				goto next;
 		}
 
-		atomic_inc(&dentry->d_count);
+		dentry->d_count++;
 		found = dentry;
 		spin_unlock(&dentry->d_lock);
 		break;
@@ -1540,6 +1582,7 @@ int d_validate(struct dentry *dentry, st
 		goto out;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	spin_lock(&dcache_hash_lock);
 	base = d_hash(dparent, dentry->d_name.hash);
 	hlist_for_each(lhp,base) { 
@@ -1548,12 +1591,14 @@ int d_validate(struct dentry *dentry, st
 		 */
 		if (dentry == hlist_entry(lhp, struct dentry, d_hash)) {
 			spin_unlock(&dcache_hash_lock);
-			__dget_locked(dentry);
+			__dget_locked_dlock(dentry);
+			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
 			return 1;
 		}
 	}
 	spin_unlock(&dcache_hash_lock);
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 out:
 	return 0;
@@ -1589,7 +1634,7 @@ void d_delete(struct dentry * dentry)
 	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	isdir = S_ISDIR(dentry->d_inode->i_mode);
-	if (atomic_read(&dentry->d_count) == 1) {
+	if (dentry->d_count == 1) {
 		dentry_iput(dentry);
 		fsnotify_nameremove(dentry, isdir);
 		return;
@@ -2265,11 +2310,15 @@ resume:
 			this_parent = dentry;
 			goto repeat;
 		}
-		atomic_dec(&dentry->d_count);
+		spin_lock(&dentry->d_lock);
+		dentry->d_count--;
+		spin_unlock(&dentry->d_lock);
 	}
 	if (this_parent != root) {
 		next = this_parent->d_u.d_child.next;
-		atomic_dec(&this_parent->d_count);
+		spin_lock(&this_parent->d_lock);
+		this_parent->d_count--;
+		spin_unlock(&this_parent->d_lock);
 		this_parent = this_parent->d_parent;
 		goto resume;
 	}
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -87,7 +87,7 @@ full_name_hash(const unsigned char *name
 #endif
 
 struct dentry {
-	atomic_t d_count;
+	unsigned int d_count;		/* protected by d_lock */
 	unsigned int d_flags;		/* protected by d_lock */
 	spinlock_t d_lock;		/* per dentry lock */
 	int d_mounted;
@@ -330,17 +330,28 @@ extern char *dentry_path(struct dentry *
  *	needs and they take necessary precautions) you should hold dcache_lock
  *	and call dget_locked() instead of dget().
  */
- 
+static inline struct dentry *dget_dlock(struct dentry *dentry)
+{
+	if (dentry) {
+		BUG_ON(!dentry->d_count);
+		dentry->d_count++;
+	}
+	return dentry;
+}
 static inline struct dentry *dget(struct dentry *dentry)
 {
 	if (dentry) {
-		BUG_ON(!atomic_read(&dentry->d_count));
-		atomic_inc(&dentry->d_count);
+		spin_lock(&dentry->d_lock);
+		dget_dlock(dentry);
+		spin_unlock(&dentry->d_lock);
 	}
 	return dentry;
 }
 
 extern struct dentry * dget_locked(struct dentry *);
+extern struct dentry * dget_locked_dlock(struct dentry *);
+
+extern struct dentry *dget_parent(struct dentry *dentry);
 
 /**
  *	d_unhashed -	is dentry hashed
@@ -354,16 +365,6 @@ static inline int d_unhashed(struct dent
 	return (dentry->d_flags & DCACHE_UNHASHED);
 }
 
-static inline struct dentry *dget_parent(struct dentry *dentry)
-{
-	struct dentry *ret;
-
-	spin_lock(&dentry->d_lock);
-	ret = dget(dentry->d_parent);
-	spin_unlock(&dentry->d_lock);
-	return ret;
-}
-
 extern void dput(struct dentry *);
 
 static inline int d_mountpoint(struct dentry *dentry)
Index: linux-2.6/fs/configfs/dir.c
===================================================================
--- linux-2.6.orig/fs/configfs/dir.c
+++ linux-2.6/fs/configfs/dir.c
@@ -311,8 +311,7 @@ static void remove_dir(struct dentry * d
 	if (d->d_inode)
 		simple_rmdir(parent->d_inode,d);
 
-	pr_debug(" o %s removing done (%d)\n",d->d_name.name,
-		 atomic_read(&d->d_count));
+	pr_debug(" o %s removing done (%d)\n",d->d_name.name, d->d_count);
 
 	dput(parent);
 }
Index: linux-2.6/fs/locks.c
===================================================================
--- linux-2.6.orig/fs/locks.c
+++ linux-2.6/fs/locks.c
@@ -1373,7 +1373,7 @@ int generic_setlease(struct file *filp,
 		if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
 			goto out;
 		if ((arg == F_WRLCK)
-		    && ((atomic_read(&dentry->d_count) > 1)
+		    && (dentry->d_count > 1
 			|| (atomic_read(&inode->i_count) > 1)))
 			goto out;
 	}
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -2116,7 +2116,7 @@ void dentry_unhash(struct dentry *dentry
 	shrink_dcache_parent(dentry);
 	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
-	if (atomic_read(&dentry->d_count) == 2)
+	if (dentry->d_count == 2)
 		__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
Index: linux-2.6/fs/autofs4/expire.c
===================================================================
--- linux-2.6.orig/fs/autofs4/expire.c
+++ linux-2.6/fs/autofs4/expire.c
@@ -197,7 +197,7 @@ static int autofs4_tree_busy(struct vfsm
 			else
 				ino_count++;
 
-			if (atomic_read(&p->d_count) > ino_count) {
+			if (p->d_count > ino_count) {
 				top_ino->last_used = jiffies;
 				dput(p);
 				return 1;
@@ -346,7 +346,7 @@ struct dentry *autofs4_expire_indirect(s
 
 			/* Path walk currently on this dentry? */
 			ino_count = atomic_read(&ino->count) + 2;
-			if (atomic_read(&dentry->d_count) > ino_count)
+			if (dentry->d_count > ino_count)
 				goto next;
 
 			/* Can we umount this guy */
@@ -368,7 +368,7 @@ struct dentry *autofs4_expire_indirect(s
 		if (!exp_leaves) {
 			/* Path walk currently on this dentry? */
 			ino_count = atomic_read(&ino->count) + 1;
-			if (atomic_read(&dentry->d_count) > ino_count)
+			if (dentry->d_count > ino_count)
 				goto next;
 
 			if (!autofs4_tree_busy(mnt, dentry, timeout, do_now)) {
@@ -382,7 +382,7 @@ struct dentry *autofs4_expire_indirect(s
 		} else {
 			/* Path walk currently on this dentry? */
 			ino_count = atomic_read(&ino->count) + 1;
-			if (atomic_read(&dentry->d_count) > ino_count)
+			if (dentry->d_count > ino_count)
 				goto next;
 
 			expired = autofs4_check_leaves(mnt, dentry, timeout, do_now);
Index: linux-2.6/fs/autofs4/root.c
===================================================================
--- linux-2.6.orig/fs/autofs4/root.c
+++ linux-2.6/fs/autofs4/root.c
@@ -381,7 +381,7 @@ static struct dentry *autofs4_lookup_act
 		spin_lock(&dentry->d_lock);
 
 		/* Already gone? */
-		if (atomic_read(&dentry->d_count) == 0)
+		if (dentry->d_count == 0)
 			goto next;
 
 		qstr = &dentry->d_name;
@@ -397,7 +397,7 @@ static struct dentry *autofs4_lookup_act
 			goto next;
 
 		if (d_unhashed(dentry)) {
-			dget(dentry);
+			dget_locked_dlock(dentry);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&sbi->lookup_lock);
 			spin_unlock(&dcache_lock);
@@ -449,7 +449,7 @@ static struct dentry *autofs4_lookup_exp
 			goto next;
 
 		if (d_unhashed(dentry)) {
-			dget(dentry);
+			dget_locked_dlock(dentry);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&sbi->lookup_lock);
 			spin_unlock(&dcache_lock);
Index: linux-2.6/fs/coda/dir.c
===================================================================
--- linux-2.6.orig/fs/coda/dir.c
+++ linux-2.6/fs/coda/dir.c
@@ -611,7 +611,7 @@ static int coda_dentry_revalidate(struct
 	if (cii->c_flags & C_FLUSH) 
 		coda_flag_inode_children(inode, C_FLUSH);
 
-	if (atomic_read(&de->d_count) > 1)
+	if (de->d_count > 1)
 		/* pretend it's valid, but don't change the flags */
 		goto out;
 
Index: linux-2.6/fs/ecryptfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ecryptfs/inode.c
+++ linux-2.6/fs/ecryptfs/inode.c
@@ -263,7 +263,7 @@ int ecryptfs_lookup_and_interpose_lower(
 				   ecryptfs_dentry->d_parent));
 	lower_inode = lower_dentry->d_inode;
 	fsstack_copy_attr_atime(ecryptfs_dir_inode, lower_dir_dentry->d_inode);
-	BUG_ON(!atomic_read(&lower_dentry->d_count));
+	BUG_ON(!lower_dentry->d_count);
 	ecryptfs_set_dentry_private(ecryptfs_dentry,
 				    kmem_cache_alloc(ecryptfs_dentry_info_cache,
 						     GFP_KERNEL));
Index: linux-2.6/fs/hpfs/namei.c
===================================================================
--- linux-2.6.orig/fs/hpfs/namei.c
+++ linux-2.6/fs/hpfs/namei.c
@@ -414,7 +414,7 @@ again:
 		mutex_unlock(&hpfs_i(inode)->i_parent_mutex);
 		d_drop(dentry);
 		spin_lock(&dentry->d_lock);
-		if (atomic_read(&dentry->d_count) > 1 ||
+		if (dentry->d_count > 1 ||
 		    generic_permission(inode, MAY_WRITE, NULL) ||
 		    !S_ISREG(inode->i_mode) ||
 		    get_write_access(inode)) {
Index: linux-2.6/fs/nfs/dir.c
===================================================================
--- linux-2.6.orig/fs/nfs/dir.c
+++ linux-2.6/fs/nfs/dir.c
@@ -1326,7 +1326,7 @@ static int nfs_sillyrename(struct inode
 
 	dfprintk(VFS, "NFS: silly-rename(%s/%s, ct=%d)\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name, 
-		atomic_read(&dentry->d_count));
+		dentry->d_count);
 	nfs_inc_stats(dir, NFSIOS_SILLYRENAME);
 
 	/*
@@ -1435,7 +1435,7 @@ static int nfs_unlink(struct inode *dir,
 
 	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
-	if (atomic_read(&dentry->d_count) > 1) {
+	if (dentry->d_count > 1) {
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 		/* Start asynchronous writeout of the inode */
@@ -1590,7 +1590,7 @@ static int nfs_rename(struct inode *old_
 	dfprintk(VFS, "NFS: rename(%s/%s -> %s/%s, ct=%d)\n",
 		 old_dentry->d_parent->d_name.name, old_dentry->d_name.name,
 		 new_dentry->d_parent->d_name.name, new_dentry->d_name.name,
-		 atomic_read(&new_dentry->d_count));
+		 new_dentry->d_count);
 
 	/*
 	 * First check whether the target is busy ... we can't
@@ -1606,7 +1606,7 @@ static int nfs_rename(struct inode *old_
 		error = -EISDIR;
 		if (!S_ISDIR(old_inode->i_mode))
 			goto out;
-	} else if (atomic_read(&new_dentry->d_count) > 2) {
+	} else if (new_dentry->d_count > 2) {
 		int err;
 		/* copy the target dentry's name */
 		dentry = d_alloc(new_dentry->d_parent,
@@ -1621,7 +1621,7 @@ static int nfs_rename(struct inode *old_
 			new_inode = NULL;
 			/* instantiate the replacement target */
 			d_instantiate(new_dentry, NULL);
-		} else if (atomic_read(&new_dentry->d_count) > 1)
+		} else if (new_dentry->d_count > 1)
 			/* dentry still busy? */
 			goto out;
 	}
@@ -1630,7 +1630,7 @@ go_ahead:
 	/*
 	 * ... prune child dentries and writebacks if needed.
 	 */
-	if (atomic_read(&old_dentry->d_count) > 1) {
+	if (old_dentry->d_count > 1) {
 		if (S_ISREG(old_inode->i_mode))
 			nfs_wb_all(old_inode);
 		shrink_dcache_parent(old_dentry);
Index: linux-2.6/fs/nfsd/vfs.c
===================================================================
--- linux-2.6.orig/fs/nfsd/vfs.c
+++ linux-2.6/fs/nfsd/vfs.c
@@ -1735,8 +1735,7 @@ nfsd_rename(struct svc_rqst *rqstp, stru
 		goto out_dput_new;
 
 	if (svc_msnfs(ffhp) &&
-		((atomic_read(&odentry->d_count) > 1)
-		 || (atomic_read(&ndentry->d_count) > 1))) {
+		((odentry->d_count > 1) || (ndentry->d_count > 1))) {
 			host_err = -EPERM;
 			goto out_dput_new;
 	}
@@ -1822,7 +1821,7 @@ nfsd_unlink(struct svc_rqst *rqstp, stru
 	if (type != S_IFDIR) { /* It's UNLINK */
 #ifdef MSNFS
 		if ((fhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
-			(atomic_read(&rdentry->d_count) > 1)) {
+			(rdentry->d_count > 1)) {
 			host_err = -EPERM;
 		} else
 #endif
Index: linux-2.6/fs/exportfs/expfs.c
===================================================================
--- linux-2.6.orig/fs/exportfs/expfs.c
+++ linux-2.6/fs/exportfs/expfs.c
@@ -74,11 +74,17 @@ static struct dentry *
 find_disconnected_root(struct dentry *dentry)
 {
 	dget(dentry);
+again:
 	spin_lock(&dentry->d_lock);
 	while (!IS_ROOT(dentry) &&
 	       (dentry->d_parent->d_flags & DCACHE_DISCONNECTED)) {
 		struct dentry *parent = dentry->d_parent;
-		dget(parent);
+
+		if (!spin_trylock(&parent->d_lock)) {
+			spin_unlock(&dentry->d_lock);
+			goto again;
+		}
+		dget_dlock(parent);
 		spin_unlock(&dentry->d_lock);
 		dput(dentry);
 		dentry = parent;
Index: linux-2.6/fs/notify/dnotify/dnotify.c
===================================================================
--- linux-2.6.orig/fs/notify/dnotify/dnotify.c
+++ linux-2.6/fs/notify/dnotify/dnotify.c
@@ -168,15 +168,24 @@ void dnotify_parent(struct dentry *dentr
 	if (!dir_notify_enable)
 		return;
 
+again:
 	spin_lock(&dentry->d_lock);
 	parent = dentry->d_parent;
+	if (parent != dentry && !spin_trylock(&parent->d_lock)) {
+		spin_unlock(&dentry->d_lock);
+		goto again;
+	}
 	if (parent->d_inode->i_dnotify_mask & event) {
-		dget(parent);
+		dget_dlock(parent);
 		spin_unlock(&dentry->d_lock);
+		if (parent != dentry)
+			spin_unlock(&parent->d_lock);
 		__inode_dir_notify(parent->d_inode, event);
 		dput(parent);
 	} else {
 		spin_unlock(&dentry->d_lock);
+		if (parent != dentry)
+			spin_unlock(&parent->d_lock);
 	}
 }
 EXPORT_SYMBOL_GPL(dnotify_parent);
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -338,18 +338,26 @@ void inotify_dentry_parent_queue_event(s
 	if (!(dentry->d_flags & DCACHE_INOTIFY_PARENT_WATCHED))
 		return;
 
+again:
 	spin_lock(&dentry->d_lock);
 	parent = dentry->d_parent;
+	if (!spin_trylock(&parent->d_lock)) {
+		spin_unlock(&dentry->d_lock);
+		goto again;
+	}
 	inode = parent->d_inode;
 
 	if (inotify_inode_watched(inode)) {
-		dget(parent);
+		dget_dlock(parent);
 		spin_unlock(&dentry->d_lock);
+		spin_unlock(&parent->d_lock);
 		inotify_inode_queue_event(inode, mask, cookie, name,
 					  dentry->d_inode);
 		dput(parent);
-	} else
+	} else {
 		spin_unlock(&dentry->d_lock);
+		spin_unlock(&parent->d_lock);
+	}
 }
 EXPORT_SYMBOL_GPL(inotify_dentry_parent_queue_event);
 
Index: linux-2.6/fs/smbfs/dir.c
===================================================================
--- linux-2.6.orig/fs/smbfs/dir.c
+++ linux-2.6/fs/smbfs/dir.c
@@ -405,6 +405,7 @@ void
 smb_renew_times(struct dentry * dentry)
 {
 	dget(dentry);
+again:
 	spin_lock(&dentry->d_lock);
 	for (;;) {
 		struct dentry *parent;
@@ -413,8 +414,13 @@ smb_renew_times(struct dentry * dentry)
 		if (IS_ROOT(dentry))
 			break;
 		parent = dentry->d_parent;
-		dget(parent);
+		if (!spin_trylock(&parent->d_lock)) {
+			spin_unlock(&dentry->d_lock);
+			goto again;
+		}
+		dget_dlock(parent);
 		spin_unlock(&dentry->d_lock);
+		spin_unlock(&parent->d_lock);
 		dput(dentry);
 		dentry = parent;
 		spin_lock(&dentry->d_lock);
Index: linux-2.6/fs/smbfs/proc.c
===================================================================
--- linux-2.6.orig/fs/smbfs/proc.c
+++ linux-2.6/fs/smbfs/proc.c
@@ -332,6 +332,7 @@ static int smb_build_path(struct smb_sb_
 	 * and store it in reversed order [see reverse_string()]
 	 */
 	dget(entry);
+again:
 	spin_lock(&entry->d_lock);
 	while (!IS_ROOT(entry)) {
 		struct dentry *parent;
@@ -350,6 +351,7 @@ static int smb_build_path(struct smb_sb_
 			dput(entry);
 			return len;
 		}
+
 		reverse_string(path, len);
 		path += len;
 		if (unicode) {
@@ -361,7 +363,11 @@ static int smb_build_path(struct smb_sb_
 		maxlen -= len+1;
 
 		parent = entry->d_parent;
-		dget(parent);
+		if (!spin_trylock(&parent->d_lock)) {
+			spin_unlock(&entry->d_lock);
+			goto again;
+		}
+		dget_dlock(parent);
 		spin_unlock(&entry->d_lock);
 		dput(entry);
 		entry = parent;
Index: linux-2.6/kernel/cgroup.c
===================================================================
--- linux-2.6.orig/kernel/cgroup.c
+++ linux-2.6/kernel/cgroup.c
@@ -2728,9 +2728,7 @@ again:
 	list_del(&cgrp->sibling);
 	cgroup_unlock_hierarchy(cgrp->root);
 
-	spin_lock(&cgrp->dentry->d_lock);
 	d = dget(cgrp->dentry);
-	spin_unlock(&d->d_lock);
 
 	cgroup_d_remove_dir(d);
 	dput(d);
Index: linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/inode.c
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
@@ -161,7 +161,7 @@ static void spufs_prune_dir(struct dentr
 		spin_lock(&dcache_lock);
 		spin_lock(&dentry->d_lock);
 		if (!(d_unhashed(dentry)) && dentry->d_inode) {
-			dget_locked(dentry);
+			dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			simple_unlink(dir->d_inode, dentry);
Index: linux-2.6/drivers/infiniband/hw/ipath/ipath_fs.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ipath/ipath_fs.c
+++ linux-2.6/drivers/infiniband/hw/ipath/ipath_fs.c
@@ -275,7 +275,7 @@ static int remove_file(struct dentry *pa
 	spin_lock(&dcache_lock);
 	spin_lock(&tmp->d_lock);
 	if (!(d_unhashed(tmp) && tmp->d_inode)) {
-		dget_locked(tmp);
+		dget_locked_dlock(tmp);
 		__d_drop(tmp);
 		spin_unlock(&tmp->d_lock);
 		spin_unlock(&dcache_lock);
Index: linux-2.6/fs/configfs/inode.c
===================================================================
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -218,7 +218,7 @@ void configfs_drop_dentry(struct configf
 		spin_lock(&dcache_lock);
 		spin_lock(&dentry->d_lock);
 		if (!(d_unhashed(dentry) && dentry->d_inode)) {
-			dget_locked(dentry);
+			dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
Index: linux-2.6/net/sunrpc/rpc_pipe.c
===================================================================
--- linux-2.6.orig/net/sunrpc/rpc_pipe.c
+++ linux-2.6/net/sunrpc/rpc_pipe.c
@@ -556,7 +556,7 @@ repeat:
 			continue;
 		spin_lock(&dentry->d_lock);
 		if (!d_unhashed(dentry)) {
-			dget_locked(dentry);
+			dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			dvec[n++] = dentry;




* [patch 11/27] fs: dcache scale d_unhashed
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (9 preceding siblings ...)
  2009-04-25  1:20 ` [patch 10/27] fs: dcache scale dentry refcount npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 12/27] fs: dcache scale subdirs npiggin
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-dcache-scale-d_unhashed.patch --]
[-- Type: text/plain, Size: 10789 bytes --]

Protect d_unhashed(dentry) condition with d_lock.
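
The reason is the usual check-then-act race: d_unhashed() reads a bit in
d_flags that another CPU can flip under d_lock, so an unlocked test may be
stale by the time the caller acts on it. In miniature (userspace sketch, a
mutex standing in for d_lock; names are illustrative):

#include <pthread.h>
#include <stdio.h>

struct entry {
	pthread_mutex_t lock;
	int unhashed;			/* protected by ->lock */
};

static int get_if_hashed(struct entry *e)
{
	int ok;

	pthread_mutex_lock(&e->lock);
	ok = !e->unhashed;		/* test... */
	if (ok) {
		/* ...and act (take a reference) under the same lock */
	}
	pthread_mutex_unlock(&e->lock);
	return ok;
}

static void unhash(struct entry *e)	/* cf. __d_drop() */
{
	pthread_mutex_lock(&e->lock);
	e->unhashed = 1;
	pthread_mutex_unlock(&e->lock);
}

int main(void)
{
	struct entry e = { PTHREAD_MUTEX_INITIALIZER, 0 };

	printf("before: %d\n", get_if_hashed(&e));	/* 1 */
	unhash(&e);
	printf("after:  %d\n", get_if_hashed(&e));	/* 0 */
	return 0;
}
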
---
 arch/powerpc/platforms/cell/spufs/inode.c |    3 ++
 fs/configfs/configfs_internal.h           |    2 +
 fs/dcache.c                               |   40 +++++++++++++++++++++++-------
 fs/libfs.c                                |   29 +++++++++++++++------
 fs/ocfs2/dcache.c                         |    5 +++
 fs/seq_file.c                             |    3 ++
 fs/sysfs/dir.c                            |    8 +++---
 7 files changed, 68 insertions(+), 22 deletions(-)

Index: linux-2.6/fs/sysfs/dir.c
===================================================================
--- linux-2.6.orig/fs/sysfs/dir.c
+++ linux-2.6/fs/sysfs/dir.c
@@ -549,10 +549,12 @@ static void sysfs_drop_dentry(struct sys
 repeat:
 	spin_lock(&dcache_lock);
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
-		if (d_unhashed(dentry))
-			continue;
-		dget_locked(dentry);
 		spin_lock(&dentry->d_lock);
+		if (d_unhashed(dentry)) {
+			spin_unlock(&dentry->d_lock);
+			continue;
+		}
+		dget_locked_dlock(dentry);
 		__d_drop(dentry);
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -12,6 +12,11 @@
 
 #include <asm/uaccess.h>
 
+static inline int simple_positive(struct dentry *dentry)
+{
+	return dentry->d_inode && !d_unhashed(dentry);
+}
+
 int simple_getattr(struct vfsmount *mnt, struct dentry *dentry,
 		   struct kstat *stat)
 {
@@ -101,8 +106,10 @@ loff_t dcache_dir_lseek(struct file *fil
 			while (n && p != &file->f_path.dentry->d_subdirs) {
 				struct dentry *next;
 				next = list_entry(p, struct dentry, d_u.d_child);
-				if (!d_unhashed(next) && next->d_inode)
+				spin_lock(&next->d_lock);
+				if (simple_positive(next))
 					n--;
+				spin_unlock(&next->d_lock);
 				p = p->next;
 			}
 			list_add_tail(&cursor->d_u.d_child, p);
@@ -156,9 +163,13 @@ int dcache_readdir(struct file * filp, v
 			for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
 				struct dentry *next;
 				next = list_entry(p, struct dentry, d_u.d_child);
-				if (d_unhashed(next) || !next->d_inode)
+				spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
+				if (!simple_positive(next)) {
+					spin_unlock(&next->d_lock);
 					continue;
+				}
 
+				spin_unlock(&next->d_lock);
 				spin_unlock(&dcache_lock);
 				if (filldir(dirent, next->d_name.name, 
 					    next->d_name.len, filp->f_pos, 
@@ -263,20 +274,20 @@ int simple_link(struct dentry *old_dentr
 	return 0;
 }
 
-static inline int simple_positive(struct dentry *dentry)
-{
-	return dentry->d_inode && !d_unhashed(dentry);
-}
-
 int simple_empty(struct dentry *dentry)
 {
 	struct dentry *child;
 	int ret = 0;
 
 	spin_lock(&dcache_lock);
-	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
-		if (simple_positive(child))
+	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child) {
+		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+		if (simple_positive(child)) {
+			spin_unlock(&child->d_lock);
 			goto out;
+		}
+		spin_unlock(&child->d_lock);
+	}
 	ret = 1;
 out:
 	spin_unlock(&dcache_lock);
Index: linux-2.6/fs/seq_file.c
===================================================================
--- linux-2.6.orig/fs/seq_file.c
+++ linux-2.6/fs/seq_file.c
@@ -6,6 +6,7 @@
  */
 
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/module.h>
 #include <linux/seq_file.h>
 #include <linux/slab.h>
@@ -460,7 +461,9 @@ int seq_path_root(struct seq_file *m, st
 		char *p;
 
 		spin_lock(&dcache_lock);
+		vfsmount_read_lock();
 		p = __d_path(path, root, s, m->size - m->count);
+		vfsmount_read_unlock();
 		spin_unlock(&dcache_lock);
 		err = PTR_ERR(p);
 		if (!IS_ERR(p)) {
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -327,7 +327,9 @@ int d_invalidate(struct dentry * dentry)
 	 * If it's already been dropped, return OK.
 	 */
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	if (d_unhashed(dentry)) {
+		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 		return 0;
 	}
@@ -336,6 +338,7 @@ int d_invalidate(struct dentry * dentry)
 	 * to get rid of unused child entries.
 	 */
 	if (!list_empty(&dentry->d_subdirs)) {
+		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 		shrink_dcache_parent(dentry);
 		spin_lock(&dcache_lock);
@@ -443,15 +446,18 @@ static struct dentry * __d_find_alias(st
 		next = tmp->next;
 		prefetch(next);
 		alias = list_entry(tmp, struct dentry, d_alias);
+		spin_lock(&alias->d_lock);
  		if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
 			if (IS_ROOT(alias) &&
 			    (alias->d_flags & DCACHE_DISCONNECTED))
 				discon_alias = alias;
 			else if (!want_discon) {
-				__dget_locked(alias);
+				__dget_locked_dlock(alias);
+				spin_unlock(&alias->d_lock);
 				return alias;
 			}
 		}
+		spin_unlock(&alias->d_lock);
 	}
 	if (discon_alias)
 		__dget_locked(discon_alias);
@@ -734,8 +740,8 @@ static void shrink_dcache_for_umount_sub
 	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	dentry_lru_del_init(dentry);
-	spin_unlock(&dentry->d_lock);
 	__d_drop(dentry);
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 
 	for (;;) {
@@ -750,8 +756,8 @@ static void shrink_dcache_for_umount_sub
 					    d_u.d_child) {
 				spin_lock(&loop->d_lock);
 				dentry_lru_del_init(loop);
-				spin_unlock(&loop->d_lock);
 				__d_drop(loop);
+				spin_unlock(&loop->d_lock);
 				cond_resched_lock(&dcache_lock);
 			}
 			spin_unlock(&dcache_lock);
@@ -2016,7 +2022,8 @@ static int prepend_name(char **buffer, i
  * Returns a pointer into the buffer or an error code if the
  * path was too long.
  *
- * "buflen" should be positive. Caller holds the dcache_lock.
+ * "buflen" should be positive. Caller holds the dcache_lock and
+ * path->dentry->d_lock.
  *
  * If path is not reachable from the supplied root, then the value of
  * root is changed (without modifying refcounts).
@@ -2029,7 +2036,6 @@ char *__d_path(const struct path *path,
 	char *end = buffer + buflen;
 	char *retval;
 
-	vfsmount_read_lock();
 	prepend(&end, &buflen, "\0", 1);
 	if (!IS_ROOT(dentry) && d_unhashed(dentry) &&
 		(prepend(&end, &buflen, " (deleted)", 10) != 0))
@@ -2065,7 +2071,6 @@ char *__d_path(const struct path *path,
 	}
 
 out:
-	vfsmount_read_unlock();
 	return retval;
 
 global_root:
@@ -2118,8 +2123,12 @@ char *d_path(const struct path *path, ch
 	path_get(&root);
 	read_unlock(&current->fs->lock);
 	spin_lock(&dcache_lock);
+	vfsmount_read_lock();
+	spin_lock(&path->dentry->d_lock);
 	tmp = root;
 	res = __d_path(path, &tmp, buf, buflen);
+	spin_unlock(&path->dentry->d_lock);
+	vfsmount_read_unlock();
 	spin_unlock(&dcache_lock);
 	path_put(&root);
 	return res;
@@ -2155,6 +2164,7 @@ char *dentry_path(struct dentry *dentry,
 	char *retval;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	prepend(&end, &buflen, "\0", 1);
 	if (!IS_ROOT(dentry) && d_unhashed(dentry) &&
 		(prepend(&end, &buflen, "//deleted", 9) != 0))
@@ -2176,9 +2186,11 @@ char *dentry_path(struct dentry *dentry,
 		retval = end;
 		dentry = parent;
 	}
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 	return retval;
 Elong:
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 	return ERR_PTR(-ENAMETOOLONG);
 }
@@ -2220,12 +2232,16 @@ SYSCALL_DEFINE2(getcwd, char __user *, b
 	error = -ENOENT;
 	/* Has the current directory has been unlinked? */
 	spin_lock(&dcache_lock);
+	vfsmount_read_lock();
+	spin_lock(&pwd.dentry->d_lock);
 	if (IS_ROOT(pwd.dentry) || !d_unhashed(pwd.dentry)) {
 		unsigned long len;
 		struct path tmp = root;
 		char * cwd;
 
 		cwd = __d_path(&pwd, &tmp, page, PAGE_SIZE);
+		spin_unlock(&pwd.dentry->d_lock);
+		vfsmount_read_unlock();
 		spin_unlock(&dcache_lock);
 
 		error = PTR_ERR(cwd);
@@ -2239,8 +2255,11 @@ SYSCALL_DEFINE2(getcwd, char __user *, b
 			if (copy_to_user(buf, cwd, len))
 				error = -EFAULT;
 		}
-	} else
+	} else {
+		spin_unlock(&pwd.dentry->d_lock);
+		vfsmount_read_unlock();
 		spin_unlock(&dcache_lock);
+	}
 
 out:
 	path_put(&pwd);
@@ -2304,13 +2323,16 @@ resume:
 		struct list_head *tmp = next;
 		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
-		if (d_unhashed(dentry)||!dentry->d_inode)
+		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+		if (d_unhashed(dentry) || !dentry->d_inode) {
+			spin_unlock(&dentry->d_lock);
 			continue;
+		}
 		if (!list_empty(&dentry->d_subdirs)) {
+			spin_unlock(&dentry->d_lock);
 			this_parent = dentry;
 			goto repeat;
 		}
-		spin_lock(&dentry->d_lock);
 		dentry->d_count--;
 		spin_unlock(&dentry->d_lock);
 	}
Index: linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/inode.c
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
@@ -165,6 +165,9 @@ static void spufs_prune_dir(struct dentr
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			simple_unlink(dir->d_inode, dentry);
+			/* XXX: what is dcache_lock protecting here? Other
+			 * filesystems (IB, configfs) release dcache_lock
+			 * before unlink */
 			spin_unlock(&dcache_lock);
 			dput(dentry);
 		} else {
Index: linux-2.6/fs/configfs/configfs_internal.h
===================================================================
--- linux-2.6.orig/fs/configfs/configfs_internal.h
+++ linux-2.6/fs/configfs/configfs_internal.h
@@ -118,6 +118,7 @@ static inline struct config_item *config
 	struct config_item * item = NULL;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	if (!d_unhashed(dentry)) {
 		struct configfs_dirent * sd = dentry->d_fsdata;
 		if (sd->s_type & CONFIGFS_ITEM_LINK) {
@@ -126,6 +127,7 @@ static inline struct config_item *config
 		} else
 			item = config_item_get(sd->s_element);
 	}
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 
 	return item;
Index: linux-2.6/fs/ocfs2/dcache.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dcache.c
+++ linux-2.6/fs/ocfs2/dcache.c
@@ -145,13 +145,16 @@ struct dentry *ocfs2_find_local_alias(st
 	list_for_each(p, &inode->i_dentry) {
 		dentry = list_entry(p, struct dentry, d_alias);
 
+		spin_lock(&dentry->d_lock);
 		if (ocfs2_match_dentry(dentry, parent_blkno, skip_unhashed)) {
 			mlog(0, "dentry found: %.*s\n",
 			     dentry->d_name.len, dentry->d_name.name);
 
-			dget_locked(dentry);
+			dget_locked_dlock(dentry);
+			spin_unlock(&dentry->d_lock);
 			break;
 		}
+		spin_unlock(&dentry->d_lock);
 
 		dentry = NULL;
 	}




* [patch 12/27] fs: dcache scale subdirs
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (10 preceding siblings ...)
  2009-04-25  1:20 ` [patch 11/27] fs: dcache scale d_unhashed npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 13/27] fs: scale inode alias list npiggin
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-dcache-scale-d_subdirs.patch --]
[-- Type: text/plain, Size: 31930 bytes --]

Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).

XXX: probably don't need the parent lock in inotify (because the child lock
should stabilize the parent). Also, possibly some filesystems don't need so
much locking (eg. of the child dentry when modifying d_child, so long as the
parent is locked)... but be on the safe side. Hmm, maybe we should just say
the d_child list is protected by d_parent->d_lock; d_parent could remain
protected with d_lock.
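
For reference, the common conversion pattern for code that walks d_subdirs
now looks roughly like the sketch below. This is a sketch only, not taken
from the patch itself: walk_children is a made-up name, and note that
dcache_lock is still taken at this point in the series.

static void walk_children(struct dentry *parent)
{
	struct dentry *child;

	spin_lock(&dcache_lock);
	spin_lock(&parent->d_lock);
	list_for_each_entry(child, &parent->d_subdirs, d_u.d_child) {
		/* a child's d_lock nests inside its parent's */
		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
		/* ... inspect or flag the child here ... */
		spin_unlock(&child->d_lock);
	}
	spin_unlock(&parent->d_lock);
	spin_unlock(&dcache_lock);
}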

---
 drivers/usb/core/inode.c     |    6 +
 fs/autofs4/expire.c          |   81 ++++++++++++++-------
 fs/autofs4/inode.c           |    5 +
 fs/autofs4/root.c            |    9 ++
 fs/coda/cache.c              |    2 
 fs/dcache.c                  |  159 ++++++++++++++++++++++++++++++++++---------
 fs/libfs.c                   |   40 ++++++----
 fs/ncpfs/dir.c               |    3 
 fs/ncpfs/ncplib_kernel.h     |    4 +
 fs/notify/inotify/inotify.c  |    4 -
 fs/smbfs/cache.c             |    4 +
 include/linux/dcache.h       |    1 
 kernel/cgroup.c              |   19 ++++-
 net/sunrpc/rpc_pipe.c        |    2 
 security/selinux/selinuxfs.c |   12 ++-
 15 files changed, 271 insertions(+), 80 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -44,6 +44,8 @@
  *   - d_flags
  *   - d_name
  *   - d_lru
+ *   - d_unhashed
+ *   - d_subdirs and children's d_child
  *
  * Ordering:
  * dcache_lock
@@ -205,7 +207,8 @@ static void dentry_lru_del_init(struct d
  *
  * If this is the root of the dentry tree, return NULL.
  *
- * dcache_lock and d_lock must be held by caller, are dropped by d_kill.
+ * dcache_lock and d_lock and d_parent->d_lock must be held by caller, and
+ * are dropped by d_kill.
  */
 static struct dentry *d_kill(struct dentry *dentry)
 	__releases(dentry->d_lock)
@@ -214,12 +217,14 @@ static struct dentry *d_kill(struct dent
 	struct dentry *parent;
 
 	list_del(&dentry->d_u.d_child);
-	/*drops the locks, at that point nobody can reach this dentry */
-	dentry_iput(dentry);
+	if (dentry->d_parent && dentry != dentry->d_parent)
+		spin_unlock(&dentry->d_parent->d_lock);
 	if (IS_ROOT(dentry))
 		parent = NULL;
 	else
 		parent = dentry->d_parent;
+	/*drops the locks, at that point nobody can reach this dentry */
+	dentry_iput(dentry);
 	d_free(dentry);
 	return parent;
 }
@@ -255,6 +260,7 @@ static struct dentry *d_kill(struct dent
 
 void dput(struct dentry *dentry)
 {
+	struct dentry *parent = NULL;
 	if (!dentry)
 		return;
 
@@ -273,6 +279,15 @@ repeat:
 			spin_unlock(&dentry->d_lock);
 			goto repeat;
 		}
+		parent = dentry->d_parent;
+		if (parent) {
+			BUG_ON(parent == dentry);
+			if (!spin_trylock(&parent->d_lock)) {
+				spin_unlock(&dentry->d_lock);
+				spin_unlock(&dcache_lock);
+				goto repeat;
+			}
+		}
 	}
 	dentry->d_count--;
 	if (dentry->d_count) {
@@ -296,6 +311,8 @@ repeat:
 		dentry_lru_add(dentry);
   	}
  	spin_unlock(&dentry->d_lock);
+	if (parent)
+		spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 	return;
 
@@ -521,10 +538,22 @@ static void prune_one_dentry(struct dent
 	 * because dcache_lock needs to be taken anyway.
 	 */
 	while (dentry) {
+		struct dentry *parent = NULL;
+
 		spin_lock(&dcache_lock);
+again:
 		spin_lock(&dentry->d_lock);
+		if (dentry->d_parent && dentry != dentry->d_parent) {
+			if (!spin_trylock(&dentry->d_parent->d_lock)) {
+				spin_unlock(&dentry->d_lock);
+				goto again;
+			}
+ 			parent = dentry->d_parent;
+		}
 		dentry->d_count--;
 		if (dentry->d_count) {
+			if (parent)
+				spin_unlock(&parent->d_lock);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
 			return;
@@ -602,20 +631,28 @@ again:
 		dentry = list_entry(tmp.prev, struct dentry, d_lru);
 
 		if (!spin_trylock(&dentry->d_lock)) {
+again1:
 			spin_unlock(&dcache_lru_lock);
 			goto again;
 		}
-		__dentry_lru_del_init(dentry);
 		/*
 		 * We found an inuse dentry which was not removed from
 		 * the LRU because of laziness during lookup.  Do not free
 		 * it - just keep it off the LRU list.
 		 */
 		if (dentry->d_count) {
+			__dentry_lru_del_init(dentry);
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
-
+		if (dentry->d_parent) {
+			BUG_ON(dentry == dentry->d_parent);
+			if (!spin_trylock(&dentry->d_parent->d_lock)) {
+				spin_unlock(&dentry->d_lock);
+				goto again1;
+			}
+		}
+		__dentry_lru_del_init(dentry);
 		spin_unlock(&dcache_lru_lock);
 		prune_one_dentry(dentry);
 		/* dcache_lock and dentry->d_lock dropped */
@@ -752,14 +789,15 @@ static void shrink_dcache_for_umount_sub
 			/* this is a branch with children - detach all of them
 			 * from the system in one go */
 			spin_lock(&dcache_lock);
+			spin_lock(&dentry->d_lock);
 			list_for_each_entry(loop, &dentry->d_subdirs,
 					    d_u.d_child) {
-				spin_lock(&loop->d_lock);
+				spin_lock_nested(&loop->d_lock, DENTRY_D_LOCK_NESTED);
 				dentry_lru_del_init(loop);
 				__d_drop(loop);
 				spin_unlock(&loop->d_lock);
-				cond_resched_lock(&dcache_lock);
 			}
+			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
 
 			/* move to the first child */
@@ -787,16 +825,17 @@ static void shrink_dcache_for_umount_sub
 				BUG();
 			}
 
-			if (IS_ROOT(dentry))
+			if (IS_ROOT(dentry)) {
 				parent = NULL;
-			else {
+				list_del(&dentry->d_u.d_child);
+			} else {
 				parent = dentry->d_parent;
 				spin_lock(&parent->d_lock);
 				parent->d_count--;
+				list_del(&dentry->d_u.d_child);
 				spin_unlock(&parent->d_lock);
 			}
 
-			list_del(&dentry->d_u.d_child);
 			detached++;
 
 			inode = dentry->d_inode;
@@ -881,6 +920,7 @@ int have_submounts(struct dentry *parent
 	spin_lock(&dcache_lock);
 	if (d_mountpoint(parent))
 		goto positive;
+	spin_lock(&this_parent->d_lock);
 repeat:
 	next = this_parent->d_subdirs.next;
 resume:
@@ -888,22 +928,34 @@ resume:
 		struct list_head *tmp = next;
 		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
+
+		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
 		/* Have we found a mount point ? */
-		if (d_mountpoint(dentry))
+		if (d_mountpoint(dentry)) {
+			spin_unlock(&dentry->d_lock);
+			spin_unlock(&this_parent->d_lock);
 			goto positive;
+		}
 		if (!list_empty(&dentry->d_subdirs)) {
+			spin_unlock(&this_parent->d_lock);
+			spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
 			this_parent = dentry;
+			spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
 			goto repeat;
 		}
+		spin_unlock(&dentry->d_lock);
 	}
 	/*
 	 * All done at this level ... ascend and resume the search.
 	 */
 	if (this_parent != parent) {
 		next = this_parent->d_u.d_child.next;
+		spin_unlock(&this_parent->d_lock);
 		this_parent = this_parent->d_parent;
+		spin_lock(&this_parent->d_lock);
 		goto resume;
 	}
+	spin_unlock(&this_parent->d_lock);
 	spin_unlock(&dcache_lock);
 	return 0; /* No mount points found in tree */
 positive:
@@ -932,6 +984,7 @@ static int select_parent(struct dentry *
 	int found = 0;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&this_parent->d_lock);
 repeat:
 	next = this_parent->d_subdirs.next;
 resume:
@@ -939,8 +992,9 @@ resume:
 		struct list_head *tmp = next;
 		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
+		BUG_ON(this_parent == dentry);
 
-		spin_lock(&dentry->d_lock);
+		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
 		dentry_lru_del_init(dentry);
 		/* 
 		 * move only zero ref count dentries to the end 
@@ -950,33 +1004,45 @@ resume:
 			dentry_lru_add_tail(dentry);
 			found++;
 		}
-		spin_unlock(&dentry->d_lock);
 
 		/*
 		 * We can return to the caller if we have found some (this
 		 * ensures forward progress). We'll be coming back to find
 		 * the rest.
 		 */
-		if (found && need_resched())
+		if (found && need_resched()) {
+			spin_unlock(&dentry->d_lock);
 			goto out;
+		}
 
 		/*
 		 * Descend a level if the d_subdirs list is non-empty.
 		 */
 		if (!list_empty(&dentry->d_subdirs)) {
+			spin_unlock(&this_parent->d_lock);
+			spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
 			this_parent = dentry;
+			spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
 			goto repeat;
 		}
+
+		spin_unlock(&dentry->d_lock);
 	}
 	/*
 	 * All done at this level ... ascend and resume the search.
 	 */
 	if (this_parent != parent) {
+		struct dentry *tmp;
 		next = this_parent->d_u.d_child.next;
-		this_parent = this_parent->d_parent;
+		tmp = this_parent->d_parent;
+		spin_unlock(&this_parent->d_lock);
+		BUG_ON(tmp == this_parent);
+		this_parent = tmp;
+		spin_lock(&this_parent->d_lock);
 		goto resume;
 	}
 out:
+	spin_unlock(&this_parent->d_lock);
 	spin_unlock(&dcache_lock);
 	return found;
 }
@@ -1072,19 +1138,20 @@ struct dentry *d_alloc(struct dentry * p
 	INIT_LIST_HEAD(&dentry->d_lru);
 	INIT_LIST_HEAD(&dentry->d_subdirs);
 	INIT_LIST_HEAD(&dentry->d_alias);
-
-	if (parent) {
-		dentry->d_parent = dget(parent);
-		dentry->d_sb = parent->d_sb;
-	} else {
-		INIT_LIST_HEAD(&dentry->d_u.d_child);
-	}
+	INIT_LIST_HEAD(&dentry->d_u.d_child);
 
 	if (parent) {
 		spin_lock(&dcache_lock);
+		spin_lock(&parent->d_lock);
+		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+		dentry->d_parent = dget_dlock(parent);
+		dentry->d_sb = parent->d_sb;
 		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
+		spin_unlock(&dentry->d_lock);
+		spin_unlock(&parent->d_lock);
 		spin_unlock(&dcache_lock);
 	}
+
 	atomic_inc(&dentry_stat.nr_dentry);
 
 	return dentry;
@@ -1763,15 +1830,27 @@ static void d_move_locked(struct dentry
 		printk(KERN_WARNING "VFS: moving negative dcache entry\n");
 
 	write_seqlock(&rename_lock);
-	/*
-	 * XXXX: do we really need to take target->d_lock?
-	 */
+
+	if (target->d_parent != dentry->d_parent) {
+		if (target->d_parent < dentry->d_parent) {
+			spin_lock(&target->d_parent->d_lock);
+			spin_lock_nested(&dentry->d_parent->d_lock,
+						DENTRY_D_LOCK_NESTED);
+		} else {
+			spin_lock(&dentry->d_parent->d_lock);
+			spin_lock_nested(&target->d_parent->d_lock,
+						DENTRY_D_LOCK_NESTED);
+		}
+	} else {
+		spin_lock(&target->d_parent->d_lock);
+	}
+
 	if (target < dentry) {
-		spin_lock(&target->d_lock);
-		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+		spin_lock_nested(&target->d_lock, 2);
+		spin_lock_nested(&dentry->d_lock, 3);
 	} else {
-		spin_lock(&dentry->d_lock);
-		spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NESTED);
+		spin_lock_nested(&dentry->d_lock, 2);
+		spin_lock_nested(&target->d_lock, 3);
 	}
 
 	/* Move the dentry to the target hash queue, if on different bucket */
@@ -1804,7 +1883,10 @@ static void d_move_locked(struct dentry
 	}
 
 	list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
+	if (target->d_parent != dentry->d_parent)
+		spin_unlock(&dentry->d_parent->d_lock);
 	spin_unlock(&target->d_lock);
+	spin_unlock(&target->d_parent->d_lock);
 	fsnotify_d_move(dentry);
 	spin_unlock(&dentry->d_lock);
 	write_sequnlock(&rename_lock);
@@ -1903,6 +1985,12 @@ static void __d_materialise_dentry(struc
 	dparent = dentry->d_parent;
 	aparent = anon->d_parent;
 
+	/* XXX: hack */
+	spin_lock(&aparent->d_lock);
+	spin_lock(&dparent->d_lock);
+	spin_lock(&dentry->d_lock);
+	spin_lock(&anon->d_lock);
+
 	dentry->d_parent = (aparent == anon) ? dentry : aparent;
 	list_del(&dentry->d_u.d_child);
 	if (!IS_ROOT(dentry))
@@ -1917,6 +2005,11 @@ static void __d_materialise_dentry(struc
 	else
 		INIT_LIST_HEAD(&anon->d_u.d_child);
 
+	spin_unlock(&anon->d_lock);
+	spin_unlock(&dentry->d_lock);
+	spin_unlock(&dparent->d_lock);
+	spin_unlock(&aparent->d_lock);
+
 	anon->d_flags &= ~DCACHE_DISCONNECTED;
 }
 
@@ -2316,6 +2409,7 @@ void d_genocide(struct dentry *root)
 	struct list_head *next;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&this_parent->d_lock);
 repeat:
 	next = this_parent->d_subdirs.next;
 resume:
@@ -2329,8 +2423,10 @@ resume:
 			continue;
 		}
 		if (!list_empty(&dentry->d_subdirs)) {
-			spin_unlock(&dentry->d_lock);
+			spin_unlock(&this_parent->d_lock);
+			spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
 			this_parent = dentry;
+			spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
 			goto repeat;
 		}
 		dentry->d_count--;
@@ -2338,12 +2434,13 @@ resume:
 	}
 	if (this_parent != root) {
 		next = this_parent->d_u.d_child.next;
-		spin_lock(&this_parent->d_lock);
 		this_parent->d_count--;
 		spin_unlock(&this_parent->d_lock);
 		this_parent = this_parent->d_parent;
+		spin_lock(&this_parent->d_lock);
 		goto resume;
 	}
+	spin_unlock(&this_parent->d_lock);
 	spin_unlock(&dcache_lock);
 }
 
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -82,7 +82,8 @@ int dcache_dir_close(struct inode *inode
 
 loff_t dcache_dir_lseek(struct file *file, loff_t offset, int origin)
 {
-	mutex_lock(&file->f_path.dentry->d_inode->i_mutex);
+	struct dentry *dentry = file->f_path.dentry;
+	mutex_lock(&dentry->d_inode->i_mutex);
 	switch (origin) {
 		case 1:
 			offset += file->f_pos;
@@ -90,7 +91,7 @@ loff_t dcache_dir_lseek(struct file *fil
 			if (offset >= 0)
 				break;
 		default:
-			mutex_unlock(&file->f_path.dentry->d_inode->i_mutex);
+			mutex_unlock(&dentry->d_inode->i_mutex);
 			return -EINVAL;
 	}
 	if (offset != file->f_pos) {
@@ -100,23 +101,27 @@ loff_t dcache_dir_lseek(struct file *fil
 			struct dentry *cursor = file->private_data;
 			loff_t n = file->f_pos - 2;
 
-			spin_lock(&dcache_lock);
+			spin_lock(&dentry->d_lock);
+			spin_lock_nested(&cursor->d_lock, DENTRY_D_LOCK_NESTED);
 			list_del(&cursor->d_u.d_child);
-			p = file->f_path.dentry->d_subdirs.next;
-			while (n && p != &file->f_path.dentry->d_subdirs) {
+			spin_unlock(&cursor->d_lock);
+			p = dentry->d_subdirs.next;
+			while (n && p != &dentry->d_subdirs) {
 				struct dentry *next;
 				next = list_entry(p, struct dentry, d_u.d_child);
-				spin_lock(&next->d_lock);
+				spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
 				if (simple_positive(next))
 					n--;
 				spin_unlock(&next->d_lock);
 				p = p->next;
 			}
+			spin_lock_nested(&cursor->d_lock, DENTRY_D_LOCK_NESTED);
 			list_add_tail(&cursor->d_u.d_child, p);
-			spin_unlock(&dcache_lock);
+			spin_unlock(&cursor->d_lock);
+			spin_unlock(&dentry->d_lock);
 		}
 	}
-	mutex_unlock(&file->f_path.dentry->d_inode->i_mutex);
+	mutex_unlock(&dentry->d_inode->i_mutex);
 	return offset;
 }
 
@@ -156,9 +161,12 @@ int dcache_readdir(struct file * filp, v
 			i++;
 			/* fallthrough */
 		default:
-			spin_lock(&dcache_lock);
-			if (filp->f_pos == 2)
+			spin_lock(&dentry->d_lock);
+			if (filp->f_pos == 2) {
+				spin_lock_nested(&cursor->d_lock, DENTRY_D_LOCK_NESTED);
 				list_move(q, &dentry->d_subdirs);
+				spin_unlock(&cursor->d_lock);
+			}
 
 			for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
 				struct dentry *next;
@@ -170,19 +178,21 @@ int dcache_readdir(struct file * filp, v
 				}
 
 				spin_unlock(&next->d_lock);
-				spin_unlock(&dcache_lock);
+				spin_unlock(&dentry->d_lock);
 				if (filldir(dirent, next->d_name.name, 
 					    next->d_name.len, filp->f_pos, 
 					    next->d_inode->i_ino, 
 					    dt_type(next->d_inode)) < 0)
 					return 0;
-				spin_lock(&dcache_lock);
+				spin_lock(&dentry->d_lock);
+				spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
 				/* next is still alive */
 				list_move(q, p);
+				spin_unlock(&next->d_lock);
 				p = q;
 				filp->f_pos++;
 			}
-			spin_unlock(&dcache_lock);
+			spin_unlock(&dentry->d_lock);
 	}
 	return 0;
 }
@@ -279,7 +289,7 @@ int simple_empty(struct dentry *dentry)
 	struct dentry *child;
 	int ret = 0;
 
-	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child) {
 		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
 		if (simple_positive(child)) {
@@ -290,7 +300,7 @@ int simple_empty(struct dentry *dentry)
 	}
 	ret = 1;
 out:
-	spin_unlock(&dcache_lock);
+	spin_unlock(&dentry->d_lock);
 	return ret;
 }
 
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -188,17 +188,19 @@ static void set_dentry_child_flags(struc
 	list_for_each_entry(alias, &inode->i_dentry, d_alias) {
 		struct dentry *child;
 
+		spin_lock(&alias->d_lock);
 		list_for_each_entry(child, &alias->d_subdirs, d_u.d_child) {
 			if (!child->d_inode)
 				continue;
 
-			spin_lock(&child->d_lock);
+			spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
 			if (watched)
 				child->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
 			else
 				child->d_flags &=~DCACHE_INOTIFY_PARENT_WATCHED;
 			spin_unlock(&child->d_lock);
 		}
+		spin_unlock(&alias->d_lock);
 	}
 	spin_unlock(&dcache_lock);
 }
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -338,6 +338,7 @@ static inline struct dentry *dget_dlock(
 	}
 	return dentry;
 }
+
 static inline struct dentry *dget(struct dentry *dentry)
 {
 	if (dentry) {
Index: linux-2.6/drivers/usb/core/inode.c
===================================================================
--- linux-2.6.orig/drivers/usb/core/inode.c
+++ linux-2.6/drivers/usb/core/inode.c
@@ -344,16 +344,18 @@ static int usbfs_empty (struct dentry *d
 	struct list_head *list;
 
 	spin_lock(&dcache_lock);
-
+	spin_lock(&dentry->d_lock);
 	list_for_each(list, &dentry->d_subdirs) {
 		struct dentry *de = list_entry(list, struct dentry, d_u.d_child);
 		if (usbfs_positive(de)) {
+			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
 			return 0;
 		}
 	}
-
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
+
 	return 1;
 }
 
Index: linux-2.6/fs/autofs4/expire.c
===================================================================
--- linux-2.6.orig/fs/autofs4/expire.c
+++ linux-2.6/fs/autofs4/expire.c
@@ -92,22 +92,63 @@ done:
 /*
  * Calculate next entry in top down tree traversal.
  * From next_mnt in namespace.c - elegant.
+ *
+ * How is this supposed to work if we drop dcache_lock between calls anyway?
+ * How does it cope with renames?
+ * And also callers dput the returned dentry before taking dcache_lock again
+ * so what prevents it from being freed??
  */
-static struct dentry *next_dentry(struct dentry *p, struct dentry *root)
+static struct dentry *get_next_positive_dentry(struct dentry *p,
+						struct dentry *root)
 {
-	struct list_head *next = p->d_subdirs.next;
+	struct list_head *next;
+	struct dentry *ret;
 
+	spin_lock(&dcache_lock);
+	spin_lock(&p->d_lock);
+again:
+	next = p->d_subdirs.next;
 	if (next == &p->d_subdirs) {
 		while (1) {
-			if (p == root)
+			struct dentry *parent;
+
+			if (p == root) {
+				spin_unlock(&p->d_lock);
 				return NULL;
+			}
+
+			parent = p->d_parent;
+			if (!spin_trylock(&parent->d_lock)) {
+				dget_dlock(p);
+				spin_unlock(&p->d_lock);
+				parent = dget_parent(p);
+				spin_unlock(&dcache_lock);
+				dput(p);
+				spin_lock(&dcache_lock);
+				spin_lock(&parent->d_lock);
+			} else
+				spin_unlock(&p->d_lock);
 			next = p->d_u.d_child.next;
-			if (next != &p->d_parent->d_subdirs)
+			p = parent;
+			if (next != &parent->d_subdirs)
 				break;
-			p = p->d_parent;
 		}
 	}
-	return list_entry(next, struct dentry, d_u.d_child);
+	ret = list_entry(next, struct dentry, d_u.d_child);
+
+	spin_lock(&ret->d_lock);
+	/* Negative dentry - give up */
+	if (!simple_positive(ret)) {
+		spin_unlock(&ret->d_lock);
+		p = ret;
+		goto again;
+	}
+	dget_dlock(ret);
+	spin_unlock(&ret->d_lock);
+
+	spin_unlock(&dcache_lock);
+
+	return ret;
 }
 
 /*
@@ -157,18 +198,11 @@ static int autofs4_tree_busy(struct vfsm
 	if (!simple_positive(top))
 		return 1;
 
-	spin_lock(&dcache_lock);
-	for (p = top; p; p = next_dentry(p, top)) {
-		/* Negative dentry - give up */
-		if (!simple_positive(p))
-			continue;
+	for (p = top; p; p = get_next_positive_dentry(p, top)) {
 
 		DPRINTK("dentry %p %.*s",
 			p, (int) p->d_name.len, p->d_name.name);
 
-		p = dget(p);
-		spin_unlock(&dcache_lock);
-
 		/*
 		 * Is someone visiting anywhere in the subtree ?
 		 * If there's no mount we need to check the usage
@@ -204,9 +238,7 @@ static int autofs4_tree_busy(struct vfsm
 			}
 		}
 		dput(p);
-		spin_lock(&dcache_lock);
 	}
-	spin_unlock(&dcache_lock);
 
 	/* Timeout of a tree mount is ultimately determined by its top dentry */
 	if (!autofs4_can_expire(top, timeout, do_now))
@@ -225,18 +257,11 @@ static struct dentry *autofs4_check_leav
 	DPRINTK("parent %p %.*s",
 		parent, (int)parent->d_name.len, parent->d_name.name);
 
-	spin_lock(&dcache_lock);
-	for (p = parent; p; p = next_dentry(p, parent)) {
-		/* Negative dentry - give up */
-		if (!simple_positive(p))
-			continue;
+	for (p = parent; p; p = get_next_positive_dentry(p, parent)) {
 
 		DPRINTK("dentry %p %.*s",
 			p, (int) p->d_name.len, p->d_name.name);
 
-		p = dget(p);
-		spin_unlock(&dcache_lock);
-
 		if (d_mountpoint(p)) {
 			/* Can we umount this guy */
 			if (autofs4_mount_busy(mnt, p))
@@ -248,9 +273,7 @@ static struct dentry *autofs4_check_leav
 		}
 cont:
 		dput(p);
-		spin_lock(&dcache_lock);
 	}
-	spin_unlock(&dcache_lock);
 	return NULL;
 }
 
@@ -315,6 +338,7 @@ struct dentry *autofs4_expire_indirect(s
 	timeout = sbi->exp_timeout;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&root->d_lock);
 	next = root->d_subdirs.next;
 
 	/* On exit from the loop expire is set to a dgot dentry
@@ -329,6 +353,7 @@ struct dentry *autofs4_expire_indirect(s
 		}
 
 		dentry = dget(dentry);
+		spin_unlock(&root->d_lock);
 		spin_unlock(&dcache_lock);
 
 		spin_lock(&sbi->fs_lock);
@@ -395,8 +420,10 @@ next:
 		spin_unlock(&sbi->fs_lock);
 		dput(dentry);
 		spin_lock(&dcache_lock);
+		spin_lock(&root->d_lock);
 		next = next->next;
 	}
+	spin_unlock(&root->d_lock);
 	spin_unlock(&dcache_lock);
 	return NULL;
 
@@ -408,7 +435,9 @@ found:
 	init_completion(&ino->expire_complete);
 	spin_unlock(&sbi->fs_lock);
 	spin_lock(&dcache_lock);
+	spin_lock(&expired->d_parent->d_lock);
 	list_move(&expired->d_parent->d_subdirs, &expired->d_u.d_child);
+	spin_unlock(&expired->d_parent->d_lock);
 	spin_unlock(&dcache_lock);
 	return expired;
 }
Index: linux-2.6/fs/autofs4/inode.c
===================================================================
--- linux-2.6.orig/fs/autofs4/inode.c
+++ linux-2.6/fs/autofs4/inode.c
@@ -111,6 +111,7 @@ static void autofs4_force_release(struct
 
 	spin_lock(&dcache_lock);
 repeat:
+	spin_lock(&this_parent->d_lock);
 	next = this_parent->d_subdirs.next;
 resume:
 	while (next != &this_parent->d_subdirs) {
@@ -128,6 +129,7 @@ resume:
 		}
 
 		next = next->next;
+		spin_unlock(&this_parent->d_lock);
 		spin_unlock(&dcache_lock);
 
 		DPRINTK("dentry %p %.*s",
@@ -141,15 +143,18 @@ resume:
 		struct dentry *dentry = this_parent;
 
 		next = this_parent->d_u.d_child.next;
+		spin_unlock(&this_parent->d_lock);
 		this_parent = this_parent->d_parent;
 		spin_unlock(&dcache_lock);
 		DPRINTK("parent dentry %p %.*s",
 			dentry, (int)dentry->d_name.len, dentry->d_name.name);
 		dput(dentry);
 		spin_lock(&dcache_lock);
+		spin_lock(&this_parent->d_lock);
 		goto resume;
 	}
 	spin_unlock(&dcache_lock);
+	spin_unlock(&this_parent->d_lock);
 }
 
 void autofs4_kill_sb(struct super_block *sb)
Index: linux-2.6/fs/autofs4/root.c
===================================================================
--- linux-2.6.orig/fs/autofs4/root.c
+++ linux-2.6/fs/autofs4/root.c
@@ -93,10 +93,13 @@ static int autofs4_dir_open(struct inode
 	 * it.
 	 */
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	if (!d_mountpoint(dentry) && __simple_empty(dentry)) {
+		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 		return -ENOENT;
 	}
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 
 out:
@@ -212,8 +215,10 @@ static void *autofs4_follow_link(struct
 	 * mount it again.
 	 */
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	if (dentry->d_flags & DCACHE_AUTOFS_PENDING ||
 	    (!d_mountpoint(dentry) && __simple_empty(dentry))) {
+		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 
 		status = try_to_fill_dentry(dentry, 0);
@@ -222,6 +227,7 @@ static void *autofs4_follow_link(struct
 
 		goto follow;
 	}
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 follow:
 	/*
@@ -731,7 +737,9 @@ static int autofs4_dir_rmdir(struct inod
 		return -EACCES;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	if (!list_empty(&dentry->d_subdirs)) {
+		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 		return -ENOTEMPTY;
 	}
@@ -739,7 +747,6 @@ static int autofs4_dir_rmdir(struct inod
 	if (list_empty(&ino->expiring))
 		list_add(&ino->expiring, &sbi->expiring_list);
 	spin_unlock(&sbi->lookup_lock);
-	spin_lock(&dentry->d_lock);
 	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
Index: linux-2.6/fs/coda/cache.c
===================================================================
--- linux-2.6.orig/fs/coda/cache.c
+++ linux-2.6/fs/coda/cache.c
@@ -87,6 +87,7 @@ static void coda_flag_children(struct de
 	struct dentry *de;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 	list_for_each(child, &parent->d_subdirs)
 	{
 		de = list_entry(child, struct dentry, d_u.d_child);
@@ -95,6 +96,7 @@ static void coda_flag_children(struct de
 			continue;
 		coda_flag_inode(de->d_inode, flag);
 	}
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 	return; 
 }
Index: linux-2.6/fs/ncpfs/dir.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/dir.c
+++ linux-2.6/fs/ncpfs/dir.c
@@ -365,6 +365,7 @@ ncp_dget_fpos(struct dentry *dentry, str
 
 	/* If a pointer is invalid, we search the dentry. */
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
 		dent = list_entry(next, struct dentry, d_u.d_child);
@@ -373,11 +374,13 @@ ncp_dget_fpos(struct dentry *dentry, str
 				dget_locked(dent);
 			else
 				dent = NULL;
+			spin_unlock(&parent->d_lock);
 			spin_unlock(&dcache_lock);
 			goto out;
 		}
 		next = next->next;
 	}
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 	return NULL;
 
Index: linux-2.6/fs/ncpfs/ncplib_kernel.h
===================================================================
--- linux-2.6.orig/fs/ncpfs/ncplib_kernel.h
+++ linux-2.6/fs/ncpfs/ncplib_kernel.h
@@ -193,6 +193,7 @@ ncp_renew_dentries(struct dentry *parent
 	struct dentry *dentry;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
 		dentry = list_entry(next, struct dentry, d_u.d_child);
@@ -204,6 +205,7 @@ ncp_renew_dentries(struct dentry *parent
 
 		next = next->next;
 	}
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 }
 
@@ -215,6 +217,7 @@ ncp_invalidate_dircache_entries(struct d
 	struct dentry *dentry;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
 		dentry = list_entry(next, struct dentry, d_u.d_child);
@@ -222,6 +225,7 @@ ncp_invalidate_dircache_entries(struct d
 		ncp_age_dentry(server, dentry);
 		next = next->next;
 	}
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 }
 
Index: linux-2.6/fs/smbfs/cache.c
===================================================================
--- linux-2.6.orig/fs/smbfs/cache.c
+++ linux-2.6/fs/smbfs/cache.c
@@ -63,6 +63,7 @@ smb_invalidate_dircache_entries(struct d
 	struct dentry *dentry;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
 		dentry = list_entry(next, struct dentry, d_u.d_child);
@@ -70,6 +71,7 @@ smb_invalidate_dircache_entries(struct d
 		smb_age_dentry(server, dentry);
 		next = next->next;
 	}
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 }
 
@@ -97,6 +99,7 @@ smb_dget_fpos(struct dentry *dentry, str
 
 	/* If a pointer is invalid, we search the dentry. */
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
 		dent = list_entry(next, struct dentry, d_u.d_child);
@@ -111,6 +114,7 @@ smb_dget_fpos(struct dentry *dentry, str
 	}
 	dent = NULL;
 out_unlock:
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 	return dent;
 }
Index: linux-2.6/kernel/cgroup.c
===================================================================
--- linux-2.6.orig/kernel/cgroup.c
+++ linux-2.6/kernel/cgroup.c
@@ -694,23 +694,31 @@ static void cgroup_clear_directory(struc
 
 	BUG_ON(!mutex_is_locked(&dentry->d_inode->i_mutex));
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	node = dentry->d_subdirs.next;
 	while (node != &dentry->d_subdirs) {
 		struct dentry *d = list_entry(node, struct dentry, d_u.d_child);
+
+		spin_lock_nested(&d->d_lock, DENTRY_D_LOCK_NESTED);
 		list_del_init(node);
 		if (d->d_inode) {
 			/* This should never be called on a cgroup
 			 * directory with child cgroups */
 			BUG_ON(d->d_inode->i_mode & S_IFDIR);
-			d = dget_locked(d);
+			dget_locked_dlock(d);
+			spin_unlock(&d->d_lock);
+			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
 			d_delete(d);
 			simple_unlink(dentry->d_inode, d);
 			dput(d);
 			spin_lock(&dcache_lock);
-		}
+			spin_lock(&dentry->d_lock);
+		} else
+			spin_unlock(&d->d_lock);
 		node = dentry->d_subdirs.next;
 	}
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 }
 
@@ -719,10 +727,17 @@ static void cgroup_clear_directory(struc
  */
 static void cgroup_d_remove_dir(struct dentry *dentry)
 {
+	struct dentry *parent;
+
 	cgroup_clear_directory(dentry);
 
 	spin_lock(&dcache_lock);
+	parent = dentry->d_parent;
+	spin_lock(&parent->d_lock);
+	spin_lock(&dentry->d_lock);
 	list_del_init(&dentry->d_u.d_child);
+	spin_unlock(&dentry->d_lock);
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 	remove_dir(dentry);
 }
Index: linux-2.6/net/sunrpc/rpc_pipe.c
===================================================================
--- linux-2.6.orig/net/sunrpc/rpc_pipe.c
+++ linux-2.6/net/sunrpc/rpc_pipe.c
@@ -548,6 +548,7 @@ static void rpc_depopulate(struct dentry
 	mutex_lock_nested(&dir->i_mutex, I_MUTEX_CHILD);
 repeat:
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 	list_for_each_safe(pos, next, &parent->d_subdirs) {
 		dentry = list_entry(pos, struct dentry, d_u.d_child);
 		if (!dentry->d_inode ||
@@ -565,6 +566,7 @@ repeat:
 		} else
 			spin_unlock(&dentry->d_lock);
 	}
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 	if (n) {
 		do {
Index: linux-2.6/security/selinux/selinuxfs.c
===================================================================
--- linux-2.6.orig/security/selinux/selinuxfs.c
+++ linux-2.6/security/selinux/selinuxfs.c
@@ -948,22 +948,30 @@ static void sel_remove_entries(struct de
 	struct list_head *node;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&de->d_lock);
 	node = de->d_subdirs.next;
 	while (node != &de->d_subdirs) {
 		struct dentry *d = list_entry(node, struct dentry, d_u.d_child);
+
+		spin_lock_nested(&d->d_lock, DENTRY_D_LOCK_NESTED);
 		list_del_init(node);
 
 		if (d->d_inode) {
-			d = dget_locked(d);
+			dget_locked_dlock(d);
+			spin_unlock(&de->d_lock);
+			spin_unlock(&d->d_lock);
 			spin_unlock(&dcache_lock);
 			d_delete(d);
 			simple_unlink(de->d_inode, d);
 			dput(d);
 			spin_lock(&dcache_lock);
-		}
+			spin_lock(&de->d_lock);
+		} else
+			spin_unlock(&d->d_lock);
 		node = de->d_subdirs.next;
 	}
 
+	spin_unlock(&de->d_lock);
 	spin_unlock(&dcache_lock);
 }
 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 13/27] fs: scale inode alias list
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (11 preceding siblings ...)
  2009-04-25  1:20 ` [patch 12/27] fs: dcache scale subdirs npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 14/27] fs: use RCU / seqlock logic for reverse and multi-step operations npiggin
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-dcache-scale-i_dentry.patch --]
[-- Type: text/plain, Size: 13233 bytes --]

Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
from concurrent modification. d_alias is also protected by d_lock.
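
To illustrate the new ordering (dcache_lock -> dcache_inode_lock -> d_lock),
a walk over an inode's alias list now looks roughly like the sketch below
(find_first_alias is a made-up name; compare d_prune_aliases and
ocfs2_find_local_alias in this patch):

static struct dentry *find_first_alias(struct inode *inode)
{
	struct dentry *alias;

	spin_lock(&dcache_lock);
	spin_lock(&dcache_inode_lock);
	list_for_each_entry(alias, &inode->i_dentry, d_alias) {
		spin_lock(&alias->d_lock);
		if (!d_unhashed(alias)) {
			/* grab a reference before dropping the locks */
			dget_locked_dlock(alias);
			spin_unlock(&alias->d_lock);
			spin_unlock(&dcache_inode_lock);
			spin_unlock(&dcache_lock);
			return alias;
		}
		spin_unlock(&alias->d_lock);
	}
	spin_unlock(&dcache_inode_lock);
	spin_unlock(&dcache_lock);
	return NULL;
}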

---
 fs/affs/amigaffs.c          |    2 +
 fs/dcache.c                 |   56 +++++++++++++++++++++++++++++++++++++++-----
 fs/exportfs/expfs.c         |    4 +++
 fs/nfs/getroot.c            |    4 +++
 fs/notify/inotify/inotify.c |    2 +
 fs/ocfs2/dcache.c           |    3 +-
 fs/sysfs/dir.c              |    3 ++
 include/linux/dcache.h      |    1 
 8 files changed, 68 insertions(+), 7 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -36,6 +36,8 @@
 
 /*
  * Usage:
+ * dcache_inode_lock protects:
+ *   - the inode alias lists, d_inode
  * dcache_hash_lock protects:
  *   - the dcache hash table
  * dcache_lru_lock protects:
@@ -49,18 +51,21 @@
  *
  * Ordering:
  * dcache_lock
- *   dentry->d_lock
- *     dcache_lru_lock
- *     dcache_hash_lock
+ *   dcache_inode_lock
+ *     dentry->d_lock
+ *       dcache_lru_lock
+ *       dcache_hash_lock
  */
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
+__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_inode_lock);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
+EXPORT_SYMBOL(dcache_inode_lock);
 EXPORT_SYMBOL(dcache_hash_lock);
 EXPORT_SYMBOL(dcache_lock);
 
@@ -125,6 +130,7 @@ static void d_free(struct dentry *dentry
  */
 static void dentry_iput(struct dentry * dentry)
 	__releases(dentry->d_lock)
+	__releases(dcache_inode_lock)
 	__releases(dcache_lock)
 {
 	struct inode *inode = dentry->d_inode;
@@ -132,6 +138,7 @@ static void dentry_iput(struct dentry *
 		dentry->d_inode = NULL;
 		list_del_init(&dentry->d_alias);
 		spin_unlock(&dentry->d_lock);
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 		if (!inode->i_nlink)
 			fsnotify_inoderemove(inode);
@@ -141,6 +148,7 @@ static void dentry_iput(struct dentry *
 			iput(inode);
 	} else {
 		spin_unlock(&dentry->d_lock);
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 	}
 }
@@ -212,6 +220,7 @@ static void dentry_lru_del_init(struct d
  */
 static struct dentry *d_kill(struct dentry *dentry)
 	__releases(dentry->d_lock)
+	__releases(dcache_inode_lock)
 	__releases(dcache_lock)
 {
 	struct dentry *parent;
@@ -276,16 +285,21 @@ repeat:
 			 * want to reduce dcache_lock anyway so this will
 			 * get improved.
 			 */
+drop1:
 			spin_unlock(&dentry->d_lock);
 			goto repeat;
 		}
+		if (!spin_trylock(&dcache_inode_lock)) {
+drop2:
+			spin_unlock(&dcache_lock);
+			goto drop1;
+		}
 		parent = dentry->d_parent;
 		if (parent) {
 			BUG_ON(parent == dentry);
 			if (!spin_trylock(&parent->d_lock)) {
-				spin_unlock(&dentry->d_lock);
-				spin_unlock(&dcache_lock);
-				goto repeat;
+				spin_unlock(&dcache_inode_lock);
+				goto drop2;
 			}
 		}
 	}
@@ -313,6 +327,7 @@ repeat:
  	spin_unlock(&dentry->d_lock);
 	if (parent)
 		spin_unlock(&parent->d_lock);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 	return;
 
@@ -487,7 +502,9 @@ struct dentry * d_find_alias(struct inod
 
 	if (!list_empty(&inode->i_dentry)) {
 		spin_lock(&dcache_lock);
+		spin_lock(&dcache_inode_lock);
 		de = __d_find_alias(inode, 0);
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 	}
 	return de;
@@ -502,18 +519,21 @@ void d_prune_aliases(struct inode *inode
 	struct dentry *dentry;
 restart:
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
 		if (!dentry->d_count) {
 			__dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
+			spin_unlock(&dcache_inode_lock);
 			spin_unlock(&dcache_lock);
 			dput(dentry);
 			goto restart;
 		}
 		spin_unlock(&dentry->d_lock);
 	}
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 }
 
@@ -541,6 +561,7 @@ static void prune_one_dentry(struct dent
 		struct dentry *parent = NULL;
 
 		spin_lock(&dcache_lock);
+		spin_lock(&dcache_inode_lock);
 again:
 		spin_lock(&dentry->d_lock);
 		if (dentry->d_parent && dentry != dentry->d_parent) {
@@ -555,6 +576,7 @@ again:
 			if (parent)
 				spin_unlock(&parent->d_lock);
 			spin_unlock(&dentry->d_lock);
+			spin_unlock(&dcache_inode_lock);
 			spin_unlock(&dcache_lock);
 			return;
 		}
@@ -625,6 +647,7 @@ restart:
 	spin_unlock(&dcache_lru_lock);
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 again:
 	spin_lock(&dcache_lru_lock); /* lru_lock also protects tmp list */
 	while (!list_empty(&tmp)) {
@@ -657,8 +680,10 @@ again1:
 		prune_one_dentry(dentry);
 		/* dcache_lock and dentry->d_lock dropped */
 		spin_lock(&dcache_lock);
+		spin_lock(&dcache_inode_lock);
 		spin_lock(&dcache_lru_lock);
 	}
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 
 	if (count == NULL && !list_empty(&sb->s_dentry_lru))
@@ -1195,7 +1220,9 @@ void d_instantiate(struct dentry *entry,
 {
 	BUG_ON(!list_empty(&entry->d_alias));
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	__d_instantiate(entry, inode);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 	security_d_instantiate(entry, inode);
 }
@@ -1255,7 +1282,9 @@ struct dentry *d_instantiate_unique(stru
 	BUG_ON(!list_empty(&entry->d_alias));
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	result = __d_instantiate_unique(entry, inode);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 
 	if (!result) {
@@ -1345,8 +1374,10 @@ struct dentry *d_obtain_alias(struct ino
 	tmp->d_parent = tmp; /* make sure dput doesn't croak */
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	res = __d_find_alias(inode, 0);
 	if (res) {
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 		dput(tmp);
 		goto out_iput;
@@ -1361,6 +1392,7 @@ struct dentry *d_obtain_alias(struct ino
 	list_add(&tmp->d_alias, &inode->i_dentry);
 	hlist_add_head(&tmp->d_hash, &inode->i_sb->s_anon);
 	spin_unlock(&tmp->d_lock);
+	spin_unlock(&dcache_inode_lock);
 
 	spin_unlock(&dcache_lock);
 	return tmp;
@@ -1393,9 +1425,11 @@ struct dentry *d_splice_alias(struct ino
 
 	if (inode && S_ISDIR(inode->i_mode)) {
 		spin_lock(&dcache_lock);
+		spin_lock(&dcache_inode_lock);
 		new = __d_find_alias(inode, 1);
 		if (new) {
 			BUG_ON(!(new->d_flags & DCACHE_DISCONNECTED));
+			spin_unlock(&dcache_inode_lock);
 			spin_unlock(&dcache_lock);
 			security_d_instantiate(new, inode);
 			d_rehash(dentry);
@@ -1404,6 +1438,7 @@ struct dentry *d_splice_alias(struct ino
 		} else {
 			/* already taking dcache_lock, so d_add() by hand */
 			__d_instantiate(dentry, inode);
+			spin_unlock(&dcache_inode_lock);
 			spin_unlock(&dcache_lock);
 			security_d_instantiate(dentry, inode);
 			d_rehash(dentry);
@@ -1477,8 +1512,10 @@ struct dentry *d_add_ci(struct dentry *d
 	 * already has a dentry.
 	 */
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	if (!S_ISDIR(inode->i_mode) || list_empty(&inode->i_dentry)) {
 		__d_instantiate(found, inode);
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 		security_d_instantiate(found, inode);
 		return found;
@@ -1490,6 +1527,7 @@ struct dentry *d_add_ci(struct dentry *d
 	 */
 	new = list_entry(inode->i_dentry.next, struct dentry, d_alias);
 	dget_locked(new);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 	security_d_instantiate(found, inode);
 	d_move(new, found);
@@ -1705,6 +1743,7 @@ void d_delete(struct dentry * dentry)
 	 * Are we the only user?
 	 */
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	spin_lock(&dentry->d_lock);
 	isdir = S_ISDIR(dentry->d_inode->i_mode);
 	if (dentry->d_count == 1) {
@@ -1717,6 +1756,7 @@ void d_delete(struct dentry * dentry)
 		__d_drop(dentry);
 
 	spin_unlock(&dentry->d_lock);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 
 	fsnotify_nameremove(dentry, isdir);
@@ -1963,6 +2003,7 @@ out_unalias:
 	d_move_locked(alias, dentry);
 	ret = alias;
 out_err:
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 	if (m2)
 		mutex_unlock(m2);
@@ -2028,6 +2069,7 @@ struct dentry *d_materialise_unique(stru
 	BUG_ON(!d_unhashed(dentry));
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 
 	if (!inode) {
 		actual = dentry;
@@ -2072,6 +2114,7 @@ found:
 	_d_rehash(actual);
 	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&actual->d_lock);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 out_nolock:
 	if (actual == dentry) {
@@ -2083,6 +2126,7 @@ out_nolock:
 	return actual;
 
 shouldnt_be_hashed:
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 	BUG();
 }
Index: linux-2.6/fs/sysfs/dir.c
===================================================================
--- linux-2.6.orig/fs/sysfs/dir.c
+++ linux-2.6/fs/sysfs/dir.c
@@ -548,6 +548,7 @@ static void sysfs_drop_dentry(struct sys
 	 */
 repeat:
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
 		if (d_unhashed(dentry)) {
@@ -557,10 +558,12 @@ repeat:
 		dget_locked_dlock(dentry);
 		__d_drop(dentry);
 		spin_unlock(&dentry->d_lock);
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 		dput(dentry);
 		goto repeat;
 	}
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 
 	/* adjust nlink and update timestamp */
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -184,6 +184,7 @@ d_iput:		no		no		no       yes
 
 #define DCACHE_COOKIE		0x0040	/* For use by dcookie subsystem */
 
+extern spinlock_t dcache_inode_lock;
 extern spinlock_t dcache_hash_lock;
 extern spinlock_t dcache_lock;
 extern seqlock_t rename_lock;
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -185,6 +185,7 @@ static void set_dentry_child_flags(struc
 	struct dentry *alias;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(alias, &inode->i_dentry, d_alias) {
 		struct dentry *child;
 
@@ -202,6 +203,7 @@ static void set_dentry_child_flags(struc
 		}
 		spin_unlock(&alias->d_lock);
 	}
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 }
 
Index: linux-2.6/fs/exportfs/expfs.c
===================================================================
--- linux-2.6.orig/fs/exportfs/expfs.c
+++ linux-2.6/fs/exportfs/expfs.c
@@ -48,8 +48,10 @@ find_acceptable_alias(struct dentry *res
 		return result;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(dentry, &result->d_inode->i_dentry, d_alias) {
 		dget_locked(dentry);
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 		if (toput)
 			dput(toput);
@@ -58,8 +60,10 @@ find_acceptable_alias(struct dentry *res
 			return dentry;
 		}
 		spin_lock(&dcache_lock);
+		spin_lock(&dcache_inode_lock);
 		toput = dentry;
 	}
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 
 	if (toput)
Index: linux-2.6/fs/affs/amigaffs.c
===================================================================
--- linux-2.6.orig/fs/affs/amigaffs.c
+++ linux-2.6/fs/affs/amigaffs.c
@@ -129,6 +129,7 @@ affs_fix_dcache(struct dentry *dentry, u
 	struct list_head *head, *next;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	head = &inode->i_dentry;
 	next = head->next;
 	while (next != head) {
@@ -139,6 +140,7 @@ affs_fix_dcache(struct dentry *dentry, u
 		}
 		next = next->next;
 	}
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 }
 
Index: linux-2.6/fs/ocfs2/dcache.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dcache.c
+++ linux-2.6/fs/ocfs2/dcache.c
@@ -141,7 +141,7 @@ struct dentry *ocfs2_find_local_alias(st
 	struct dentry *dentry = NULL;
 
 	spin_lock(&dcache_lock);
-
+	spin_lock(&dcache_inode_lock);
 	list_for_each(p, &inode->i_dentry) {
 		dentry = list_entry(p, struct dentry, d_alias);
 
@@ -159,6 +159,7 @@ struct dentry *ocfs2_find_local_alias(st
 		dentry = NULL;
 	}
 
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 
 	return dentry;
Index: linux-2.6/fs/nfs/getroot.c
===================================================================
--- linux-2.6.orig/fs/nfs/getroot.c
+++ linux-2.6/fs/nfs/getroot.c
@@ -66,7 +66,11 @@ static int nfs_superblock_set_dummy_root
 		 * Oops, since the test for IS_ROOT() will fail.
 		 */
 		spin_lock(&dcache_lock);
+		spin_lock(&dcache_inode_lock);
+		spin_lock(&sb->s_root->d_lock);
 		list_del_init(&sb->s_root->d_alias);
+		spin_unlock(&sb->s_root->d_lock);
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 	}
 	return 0;



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 14/27] fs: use RCU / seqlock logic for reverse and multi-step operations
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (12 preceding siblings ...)
  2009-04-25  1:20 ` [patch 13/27] fs: scale inode alias list npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 15/27] fs: dcache remove dcache_lock npiggin
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-dcache_lock-multi-step.patch --]
[-- Type: text/plain, Size: 10946 bytes --]

The remaining use of dcache_lock is to allow atomic, multi-step read-side
operations over the directory tree, by excluding modifications to the tree.
It is also used to walk in the leaf->root direction in the tree, where we
don't have a natural d_lock ordering.

This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.

Solve this instead by using the rename seqlock for multi-step read-side
operations. Insert operations are not serialised. Delete operations are
tricky: when walking up the directory tree, our parent might have been
deleted while we dropped the locks, so we also need to check for that and
retry.

XXX: hmm, we could of course just take the rename lock if there is any worry
about livelock. Most of these are slow paths.
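
Stripped of the dcache_lock and d_lock details, the retry skeleton used by
these conversions is roughly the following sketch (path_length_example is a
made-up name; the real conversions below still take dcache_lock around the
walk at this stage of the series):

static int path_length_example(struct dentry *dentry, struct dentry *root)
{
	struct dentry *p;
	unsigned seq;
	int len;

rename_retry:
	len = 0;
	seq = read_seqbegin(&rename_lock);
	rcu_read_lock();	/* keeps the d_parent chain from being freed */
	for (p = dentry; p != root && !IS_ROOT(p); p = p->d_parent)
		len += p->d_name.len + 1;
	rcu_read_unlock();
	/* a rename may have raced with the walk; if so, start over */
	if (read_seqretry(&rename_lock, seq))
		goto rename_retry;
	return len;
}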

---
 drivers/staging/pohmelfs/path_entry.c |    7 ++
 fs/autofs4/waitq.c                    |   10 +++
 fs/dcache.c                           |  108 ++++++++++++++++++++++++++++++----
 fs/nfs/namespace.c                    |   10 +++
 fs/seq_file.c                         |    6 +
 5 files changed, 130 insertions(+), 11 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -936,11 +936,15 @@ void shrink_dcache_for_umount(struct sup
  * Return true if the parent or its subdirectories contain
  * a mount point
  */
- 
 int have_submounts(struct dentry *parent)
 {
-	struct dentry *this_parent = parent;
+	struct dentry *this_parent;
 	struct list_head *next;
+	unsigned seq;
+
+rename_retry:
+	this_parent = parent;
+	seq = read_seqbegin(&rename_lock);
 
 	spin_lock(&dcache_lock);
 	if (d_mountpoint(parent))
@@ -974,17 +978,37 @@ resume:
 	 * All done at this level ... ascend and resume the search.
 	 */
 	if (this_parent != parent) {
+		struct dentry *tmp;
+		struct dentry *child;
+
 		next = this_parent->d_u.d_child.next;
+		tmp = this_parent->d_parent;
+		rcu_read_lock();
 		spin_unlock(&this_parent->d_lock);
-		this_parent = this_parent->d_parent;
+		child = this_parent;
+		this_parent = tmp;
 		spin_lock(&this_parent->d_lock);
+		/* might go back up the wrong parent if we have had a rename
+		 * or deletion */
+		if (this_parent != child->d_parent ||
+				read_seqretry(&rename_lock, seq)) {
+			spin_unlock(&this_parent->d_lock);
+			spin_unlock(&dcache_lock);
+			rcu_read_unlock();
+			goto rename_retry;
+		}
+		rcu_read_unlock();
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
 	spin_unlock(&dcache_lock);
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
 	return 0; /* No mount points found in tree */
 positive:
 	spin_unlock(&dcache_lock);
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
 	return 1;
 }
 
@@ -1004,10 +1028,15 @@ positive:
  */
 static int select_parent(struct dentry * parent)
 {
-	struct dentry *this_parent = parent;
+	struct dentry *this_parent;
 	struct list_head *next;
+	unsigned seq;
 	int found = 0;
 
+rename_retry:
+	this_parent = parent;
+	seq = read_seqbegin(&rename_lock);
+
 	spin_lock(&dcache_lock);
 	spin_lock(&this_parent->d_lock);
 repeat:
@@ -1058,17 +1087,35 @@ resume:
 	 */
 	if (this_parent != parent) {
 		struct dentry *tmp;
+		struct dentry *child;
+
 		next = this_parent->d_u.d_child.next;
 		tmp = this_parent->d_parent;
+		rcu_read_lock();
 		spin_unlock(&this_parent->d_lock);
-		BUG_ON(tmp == this_parent);
+		child = this_parent;
 		this_parent = tmp;
 		spin_lock(&this_parent->d_lock);
+		/* might go back up the wrong parent if we have had a rename
+		 * or deletion */
+		if (this_parent != child->d_parent ||
+				read_seqretry(&rename_lock, seq)) {
+//		if (this_parent != parent &&
+//				(/* d_unhashed(this_parent) XXX: hmm... */ 0 ||
+//				read_seqretry(&rename_lock, seq))) {
+			spin_unlock(&this_parent->d_lock);
+			spin_unlock(&dcache_lock);
+			rcu_read_unlock();
+			goto rename_retry;
+		}
+		rcu_read_unlock();
 		goto resume;
 	}
 out:
 	spin_unlock(&this_parent->d_lock);
 	spin_unlock(&dcache_lock);
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
 	return found;
 }
 
@@ -2173,6 +2220,7 @@ char *__d_path(const struct path *path,
 	char *end = buffer + buflen;
 	char *retval;
 
+	rcu_read_lock();
 	prepend(&end, &buflen, "\0", 1);
 	if (!IS_ROOT(dentry) && d_unhashed(dentry) &&
 		(prepend(&end, &buflen, " (deleted)", 10) != 0))
@@ -2208,6 +2256,7 @@ char *__d_path(const struct path *path,
 	}
 
 out:
+	rcu_read_unlock();
 	return retval;
 
 global_root:
@@ -2244,6 +2293,7 @@ char *d_path(const struct path *path, ch
 	char *res;
 	struct path root;
 	struct path tmp;
+	unsigned seq;
 
 	/*
 	 * We have various synthetic filesystems that never get mounted.  On
@@ -2259,6 +2309,9 @@ char *d_path(const struct path *path, ch
 	root = current->fs->root;
 	path_get(&root);
 	read_unlock(&current->fs->lock);
+
+rename_retry:
+	seq = read_seqbegin(&rename_lock);
 	spin_lock(&dcache_lock);
 	vfsmount_read_lock();
 	spin_lock(&path->dentry->d_lock);
@@ -2267,6 +2320,9 @@ char *d_path(const struct path *path, ch
 	spin_unlock(&path->dentry->d_lock);
 	vfsmount_read_unlock();
 	spin_unlock(&dcache_lock);
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
+
 	path_put(&root);
 	return res;
 }
@@ -2297,9 +2353,14 @@ char *dynamic_dname(struct dentry *dentr
  */
 char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 {
-	char *end = buf + buflen;
+	char *end;
 	char *retval;
+	unsigned seq;
 
+rename_retry:
+	end = buf + buflen;
+	seq = read_seqbegin(&rename_lock);
+	rcu_read_lock(); /* protect parent */
 	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	prepend(&end, &buflen, "\0", 1);
@@ -2323,13 +2384,16 @@ char *dentry_path(struct dentry *dentry,
 		retval = end;
 		dentry = parent;
 	}
+out:
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
+	rcu_read_unlock();
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
 	return retval;
 Elong:
-	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
-	return ERR_PTR(-ENAMETOOLONG);
+	retval = ERR_PTR(-ENAMETOOLONG);
+	goto out;
 }
 
 /*
@@ -2449,9 +2513,13 @@ int is_subdir(struct dentry *new_dentry,
 
 void d_genocide(struct dentry *root)
 {
-	struct dentry *this_parent = root;
+	struct dentry *this_parent;
 	struct list_head *next;
+	unsigned seq;
 
+rename_retry:
+	this_parent = root;
+	seq = read_seqbegin(&rename_lock);
 	spin_lock(&dcache_lock);
 	spin_lock(&this_parent->d_lock);
 repeat:
@@ -2477,15 +2545,33 @@ resume:
 		spin_unlock(&dentry->d_lock);
 	}
 	if (this_parent != root) {
+		struct dentry *tmp;
+		struct dentry *child;
+
 		next = this_parent->d_u.d_child.next;
+		tmp = this_parent->d_parent;
 		this_parent->d_count--;
+		rcu_read_lock();
 		spin_unlock(&this_parent->d_lock);
-		this_parent = this_parent->d_parent;
+		child = this_parent;
+		this_parent = tmp;
 		spin_lock(&this_parent->d_lock);
+		/* might go back up the wrong parent if we have had a rename
+		 * or deletion */
+		if (this_parent != child->d_parent ||
+				read_seqretry(&rename_lock, seq)) {
+			spin_unlock(&this_parent->d_lock);
+			spin_unlock(&dcache_lock);
+			rcu_read_unlock();
+			goto rename_retry;
+		}
+		rcu_read_unlock();
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
 	spin_unlock(&dcache_lock);
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
 }
 
 /**
Index: linux-2.6/fs/seq_file.c
===================================================================
--- linux-2.6.orig/fs/seq_file.c
+++ linux-2.6/fs/seq_file.c
@@ -459,12 +459,18 @@ int seq_path_root(struct seq_file *m, st
 	if (m->count < m->size) {
 		char *s = m->buf + m->count;
 		char *p;
+		unsigned seq;
 
+rename_retry:
+		seq = read_seqbegin(&rename_lock);
 		spin_lock(&dcache_lock);
 		vfsmount_read_lock();
 		p = __d_path(path, root, s, m->size - m->count);
 		vfsmount_read_unlock();
 		spin_unlock(&dcache_lock);
+		if (read_seqretry(&rename_lock, seq))
+			goto rename_retry;
+
 		err = PTR_ERR(p);
 		if (!IS_ERR(p)) {
 			s = mangle_path(s, p, esc);
Index: linux-2.6/drivers/staging/pohmelfs/path_entry.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/path_entry.c
+++ linux-2.6/drivers/staging/pohmelfs/path_entry.c
@@ -85,6 +85,7 @@ int pohmelfs_path_length(struct pohmelfs
 {
 	struct dentry *d, *root, *first;
 	int len = 1; /* Root slash */
+	unsigned seq;
 
 	first = d = d_find_alias(&pi->vfs_inode);
 	if (!d) {
@@ -96,6 +97,9 @@ int pohmelfs_path_length(struct pohmelfs
 	root = dget(current->fs->root.dentry);
 	read_unlock(&current->fs->lock);
 
+	rcu_read_lock();
+rename_retry:
+	seq = read_seqbegin(&rename_lock);
 	spin_lock(&dcache_lock);
 
 	if (!IS_ROOT(d) && d_unhashed(d))
@@ -106,6 +110,9 @@ int pohmelfs_path_length(struct pohmelfs
 		d = d->d_parent;
 	}
 	spin_unlock(&dcache_lock);
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
+	rcu_read_unlock();
 
 	dput(root);
 	dput(first);
Index: linux-2.6/fs/autofs4/waitq.c
===================================================================
--- linux-2.6.orig/fs/autofs4/waitq.c
+++ linux-2.6/fs/autofs4/waitq.c
@@ -189,13 +189,20 @@ static int autofs4_getpath(struct autofs
 	char *buf = *name;
 	char *p;
 	int len = 0;
+	unsigned seq;
 
+	rcu_read_lock();
+rename_retry:
+	seq = read_seqbegin(&rename_lock);
 	spin_lock(&dcache_lock);
 	for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
 		len += tmp->d_name.len + 1;
 
 	if (!len || --len > NAME_MAX) {
 		spin_unlock(&dcache_lock);
+		if (read_seqretry(&rename_lock, seq))
+			goto rename_retry;
+		rcu_read_unlock();
 		return 0;
 	}
 
@@ -209,6 +216,9 @@ static int autofs4_getpath(struct autofs
 		strncpy(p, tmp->d_name.name, tmp->d_name.len);
 	}
 	spin_unlock(&dcache_lock);
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
+	rcu_read_unlock();
 
 	return len;
 }
Index: linux-2.6/fs/nfs/namespace.c
===================================================================
--- linux-2.6.orig/fs/nfs/namespace.c
+++ linux-2.6/fs/nfs/namespace.c
@@ -50,9 +50,13 @@ char *nfs_path(const char *base,
 {
 	char *end = buffer+buflen;
 	int namelen;
+	unsigned seq;
 
 	*--end = '\0';
 	buflen--;
+	rcu_read_lock();
+rename_retry:
+	seq = read_seqbegin(&rename_lock);
 	spin_lock(&dcache_lock);
 	while (!IS_ROOT(dentry) && dentry != droot) {
 		namelen = dentry->d_name.len;
@@ -65,6 +69,9 @@ char *nfs_path(const char *base,
 		dentry = dentry->d_parent;
 	}
 	spin_unlock(&dcache_lock);
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
+	rcu_read_unlock();
 	namelen = strlen(base);
 	/* Strip off excess slashes in base string */
 	while (namelen > 0 && base[namelen - 1] == '/')
@@ -77,6 +84,9 @@ char *nfs_path(const char *base,
 	return end;
 Elong_unlock:
 	spin_unlock(&dcache_lock);
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
+	rcu_read_unlock();
 Elong:
 	return ERR_PTR(-ENAMETOOLONG);
 }
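
All of the hunks above follow one shape: walk the d_parent chain under
rcu_read_lock() instead of dcache_lock, and retry if rename_lock's
sequence count shows a concurrent d_move(). A minimal sketch of that
pattern (illustrative only; walk_to_root_len() is a made-up name, not
part of the patch):

static int walk_to_root_len(struct dentry *dentry, struct dentry *root)
{
	struct dentry *tmp;
	unsigned seq;
	int len;

	rcu_read_lock();	/* parent dentries stay valid while we walk */
rename_retry:
	len = 0;		/* accumulators must be reset on every retry */
	seq = read_seqbegin(&rename_lock);
	for (tmp = dentry; tmp != root; tmp = tmp->d_parent)
		len += tmp->d_name.len + 1;
	if (read_seqretry(&rename_lock, seq))
		goto rename_retry;	/* raced with a rename, start over */
	rcu_read_unlock();
	return len;
}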



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 15/27] fs: dcache remove dcache_lock
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (13 preceding siblings ...)
  2009-04-25  1:20 ` [patch 14/27] fs: use RCU / seqlock logic for reverse and multi-step operations npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 16/27] fs: dcache reduce dput locking npiggin
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-dcache_lock-remove.patch --]
[-- Type: text/plain, Size: 50772 bytes --]

dcache_lock no longer protects anything (I hope). Remove it.

This breaks a lot of the tree where I haven't thought about the problem,
but it simplifies the dcache.c code quite a bit (and it's also probably
a good thing to break unconverted code). So I include this here before
making further changes to the locking.
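
To show the shape of the conversion (a sketch, not lifted from the diff):
a caller that used to nest under the global lock,

	spin_lock(&dcache_lock);
	spin_lock(&dentry->d_lock);
	__d_drop(dentry);
	spin_unlock(&dentry->d_lock);
	spin_unlock(&dcache_lock);

now relies on d_lock alone, as in the d_drop() change to
include/linux/dcache.h below:

	spin_lock(&dentry->d_lock);
	__d_drop(dentry);	/* hash removal serialized inside __d_drop() */
	spin_unlock(&dentry->d_lock);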

---
 Documentation/filesystems/Locking         |    2 
 arch/powerpc/platforms/cell/spufs/inode.c |    5 -
 drivers/infiniband/hw/ipath/ipath_fs.c    |    6 -
 drivers/staging/pohmelfs/path_entry.c     |    2 
 drivers/usb/core/inode.c                  |    3 
 fs/affs/amigaffs.c                        |    2 
 fs/autofs4/expire.c                       |   11 --
 fs/autofs4/inode.c                        |    6 -
 fs/autofs4/root.c                         |   20 ----
 fs/autofs4/waitq.c                        |    3 
 fs/coda/cache.c                           |    2 
 fs/configfs/configfs_internal.h           |    2 
 fs/configfs/inode.c                       |    6 -
 fs/dcache.c                               |  131 ++++--------------------------
 fs/exportfs/expfs.c                       |    4 
 fs/namei.c                                |    5 -
 fs/ncpfs/dir.c                            |    3 
 fs/ncpfs/ncplib_kernel.h                  |    4 
 fs/nfs/dir.c                              |    3 
 fs/nfs/getroot.c                          |    2 
 fs/nfs/namespace.c                        |    3 
 fs/notify/inotify/inotify.c               |    4 
 fs/ocfs2/dcache.c                         |    2 
 fs/seq_file.c                             |    2 
 fs/smbfs/cache.c                          |    4 
 fs/sysfs/dir.c                            |    3 
 include/linux/dcache.h                    |   17 +--
 kernel/cgroup.c                           |    6 -
 net/sunrpc/rpc_pipe.c                     |   11 +-
 security/selinux/selinuxfs.c              |    4 
 security/tomoyo/realpath.c                |    2 
 31 files changed, 36 insertions(+), 244 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -50,11 +50,10 @@
  *   - d_subdirs and children's d_child
  *
  * Ordering:
- * dcache_lock
- *   dcache_inode_lock
- *     dentry->d_lock
- *       dcache_lru_lock
- *       dcache_hash_lock
+ * dcache_inode_lock
+ *   dentry->d_lock
+ *     dcache_lru_lock
+ *     dcache_hash_lock
  */
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
@@ -62,12 +61,10 @@ EXPORT_SYMBOL_GPL(sysctl_vfs_cache_press
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_inode_lock);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
 EXPORT_SYMBOL(dcache_inode_lock);
 EXPORT_SYMBOL(dcache_hash_lock);
-EXPORT_SYMBOL(dcache_lock);
 
 static struct kmem_cache *dentry_cache __read_mostly;
 
@@ -109,7 +106,7 @@ static void d_callback(struct rcu_head *
 }
 
 /*
- * no dcache_lock, please.
+ * no locks, please.
  */
 static void d_free(struct dentry *dentry)
 {
@@ -131,7 +128,6 @@ static void d_free(struct dentry *dentry
 static void dentry_iput(struct dentry * dentry)
 	__releases(dentry->d_lock)
 	__releases(dcache_inode_lock)
-	__releases(dcache_lock)
 {
 	struct inode *inode = dentry->d_inode;
 	if (inode) {
@@ -139,7 +135,6 @@ static void dentry_iput(struct dentry *
 		list_del_init(&dentry->d_alias);
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 		if (!inode->i_nlink)
 			fsnotify_inoderemove(inode);
 		if (dentry->d_op && dentry->d_op->d_iput)
@@ -149,7 +144,6 @@ static void dentry_iput(struct dentry *
 	} else {
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 	}
 }
 
@@ -215,13 +209,12 @@ static void dentry_lru_del_init(struct d
  *
  * If this is the root of the dentry tree, return NULL.
  *
- * dcache_lock and d_lock and d_parent->d_lock must be held by caller, and
+ * d_lock and d_parent->d_lock must be held by caller, and
  * are dropped by d_kill.
  */
 static struct dentry *d_kill(struct dentry *dentry)
 	__releases(dentry->d_lock)
 	__releases(dcache_inode_lock)
-	__releases(dcache_lock)
 {
 	struct dentry *parent;
 
@@ -278,21 +271,10 @@ repeat:
 		might_sleep();
 	spin_lock(&dentry->d_lock);
 	if (dentry->d_count == 1) {
-		if (!spin_trylock(&dcache_lock)) {
-			/*
-			 * Something of a livelock possibility we could avoid
-			 * by taking dcache_lock and trying again, but we
-			 * want to reduce dcache_lock anyway so this will
-			 * get improved.
-			 */
-drop1:
-			spin_unlock(&dentry->d_lock);
-			goto repeat;
-		}
 		if (!spin_trylock(&dcache_inode_lock)) {
 drop2:
-			spin_unlock(&dcache_lock);
-			goto drop1;
+			spin_unlock(&dentry->d_lock);
+			goto repeat;
 		}
 		parent = dentry->d_parent;
 		if (parent) {
@@ -306,7 +288,6 @@ drop2:
 	dentry->d_count--;
 	if (dentry->d_count) {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
 		return;
 	}
 
@@ -328,7 +309,6 @@ drop2:
 	if (parent)
 		spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 	return;
 
 unhash_it:
@@ -358,11 +338,9 @@ int d_invalidate(struct dentry * dentry)
 	/*
 	 * If it's already been dropped, return OK.
 	 */
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	if (d_unhashed(dentry)) {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
 		return 0;
 	}
 	/*
@@ -371,9 +349,7 @@ int d_invalidate(struct dentry * dentry)
 	 */
 	if (!list_empty(&dentry->d_subdirs)) {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
 		shrink_dcache_parent(dentry);
-		spin_lock(&dcache_lock);
 	}
 
 	/*
@@ -390,14 +366,12 @@ int d_invalidate(struct dentry * dentry)
 	if (dentry->d_count > 1) {
 		if (dentry->d_inode && S_ISDIR(dentry->d_inode->i_mode)) {
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 			return -EBUSY;
 		}
 	}
 
 	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 	return 0;
 }
 
@@ -501,11 +475,9 @@ struct dentry * d_find_alias(struct inod
 	struct dentry *de = NULL;
 
 	if (!list_empty(&inode->i_dentry)) {
-		spin_lock(&dcache_lock);
 		spin_lock(&dcache_inode_lock);
 		de = __d_find_alias(inode, 0);
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 	}
 	return de;
 }
@@ -518,7 +490,6 @@ void d_prune_aliases(struct inode *inode
 {
 	struct dentry *dentry;
 restart:
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
@@ -527,14 +498,12 @@ restart:
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_inode_lock);
-			spin_unlock(&dcache_lock);
 			dput(dentry);
 			goto restart;
 		}
 		spin_unlock(&dentry->d_lock);
 	}
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 }
 
 /*
@@ -547,20 +516,16 @@ restart:
  */
 static void prune_one_dentry(struct dentry * dentry)
 	__releases(dentry->d_lock)
-	__releases(dcache_lock)
-	__acquires(dcache_lock)
 {
 	__d_drop(dentry);
 	dentry = d_kill(dentry);
 
 	/*
-	 * Prune ancestors.  Locking is simpler than in dput(),
-	 * because dcache_lock needs to be taken anyway.
+	 * Prune ancestors.
 	 */
 	while (dentry) {
 		struct dentry *parent = NULL;
 
-		spin_lock(&dcache_lock);
 		spin_lock(&dcache_inode_lock);
 again:
 		spin_lock(&dentry->d_lock);
@@ -577,7 +542,6 @@ again:
 				spin_unlock(&parent->d_lock);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_inode_lock);
-			spin_unlock(&dcache_lock);
 			return;
 		}
 
@@ -646,7 +610,6 @@ restart:
 	}
 	spin_unlock(&dcache_lru_lock);
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 again:
 	spin_lock(&dcache_lru_lock); /* lru_lock also protects tmp list */
@@ -677,14 +640,13 @@ again1:
 		}
 		__dentry_lru_del_init(dentry);
 		spin_unlock(&dcache_lru_lock);
+
 		prune_one_dentry(dentry);
-		/* dcache_lock and dentry->d_lock dropped */
-		spin_lock(&dcache_lock);
+		/* dentry->d_lock dropped */
 		spin_lock(&dcache_inode_lock);
 		spin_lock(&dcache_lru_lock);
 	}
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 
 	if (count == NULL && !list_empty(&sb->s_dentry_lru))
 		goto restart;
@@ -714,7 +676,6 @@ static void prune_dcache(int count)
 
 	if (unused == 0 || count == 0)
 		return;
-	spin_lock(&dcache_lock);
 restart:
 	if (count >= unused)
 		prune_ratio = 1;
@@ -750,11 +711,9 @@ restart:
 		if (down_read_trylock(&sb->s_umount)) {
 			if ((sb->s_root != NULL) &&
 			    (!list_empty(&sb->s_dentry_lru))) {
-				spin_unlock(&dcache_lock);
 				__shrink_dcache_sb(sb, &w_count,
 						DCACHE_REFERENCED);
 				pruned -= w_count;
-				spin_lock(&dcache_lock);
 			}
 			up_read(&sb->s_umount);
 		}
@@ -770,7 +729,6 @@ restart:
 		}
 	}
 	spin_unlock(&sb_lock);
-	spin_unlock(&dcache_lock);
 }
 
 /**
@@ -799,12 +757,10 @@ static void shrink_dcache_for_umount_sub
 	BUG_ON(!IS_ROOT(dentry));
 
 	/* detach this root from the system */
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	dentry_lru_del_init(dentry);
 	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 
 	for (;;) {
 		/* descend to the first leaf in the current subtree */
@@ -813,7 +769,6 @@ static void shrink_dcache_for_umount_sub
 
 			/* this is a branch with children - detach all of them
 			 * from the system in one go */
-			spin_lock(&dcache_lock);
 			spin_lock(&dentry->d_lock);
 			list_for_each_entry(loop, &dentry->d_subdirs,
 					    d_u.d_child) {
@@ -823,7 +778,6 @@ static void shrink_dcache_for_umount_sub
 				spin_unlock(&loop->d_lock);
 			}
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 
 			/* move to the first child */
 			dentry = list_entry(dentry->d_subdirs.next,
@@ -894,8 +848,7 @@ out:
 
 /*
  * destroy the dentries attached to a superblock on unmounting
- * - we don't need to use dentry->d_lock, and only need dcache_lock when
- *   removing the dentry from the system lists and hashes because:
+ * - we don't need to use dentry->d_lock because:
  *   - the superblock is detached from all mountings and open files, so the
  *     dentry trees will not be rearranged by the VFS
  *   - s_umount is write-locked, so the memory pressure shrinker will ignore
@@ -946,7 +899,6 @@ rename_retry:
 	this_parent = parent;
 	seq = read_seqbegin(&rename_lock);
 
-	spin_lock(&dcache_lock);
 	if (d_mountpoint(parent))
 		goto positive;
 	spin_lock(&this_parent->d_lock);
@@ -993,7 +945,6 @@ resume:
 		if (this_parent != child->d_parent ||
 				read_seqretry(&rename_lock, seq)) {
 			spin_unlock(&this_parent->d_lock);
-			spin_unlock(&dcache_lock);
 			rcu_read_unlock();
 			goto rename_retry;
 		}
@@ -1001,12 +952,10 @@ resume:
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
-	spin_unlock(&dcache_lock);
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 	return 0; /* No mount points found in tree */
 positive:
-	spin_unlock(&dcache_lock);
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 	return 1;
@@ -1037,7 +986,6 @@ rename_retry:
 	this_parent = parent;
 	seq = read_seqbegin(&rename_lock);
 
-	spin_lock(&dcache_lock);
 	spin_lock(&this_parent->d_lock);
 repeat:
 	next = this_parent->d_subdirs.next;
@@ -1104,7 +1052,6 @@ resume:
 //				(/* d_unhashed(this_parent) XXX: hmm... */ 0 ||
 //				read_seqretry(&rename_lock, seq))) {
 			spin_unlock(&this_parent->d_lock);
-			spin_unlock(&dcache_lock);
 			rcu_read_unlock();
 			goto rename_retry;
 		}
@@ -1113,7 +1060,6 @@ resume:
 	}
 out:
 	spin_unlock(&this_parent->d_lock);
-	spin_unlock(&dcache_lock);
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 	return found;
@@ -1213,7 +1159,6 @@ struct dentry *d_alloc(struct dentry * p
 	INIT_LIST_HEAD(&dentry->d_u.d_child);
 
 	if (parent) {
-		spin_lock(&dcache_lock);
 		spin_lock(&parent->d_lock);
 		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
 		dentry->d_parent = dget_dlock(parent);
@@ -1221,7 +1166,6 @@ struct dentry *d_alloc(struct dentry * p
 		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&parent->d_lock);
-		spin_unlock(&dcache_lock);
 	}
 
 	atomic_inc(&dentry_stat.nr_dentry);
@@ -1239,7 +1183,6 @@ struct dentry *d_alloc_name(struct dentr
 	return d_alloc(parent, &q);
 }
 
-/* the caller must hold dcache_lock */
 static void __d_instantiate(struct dentry *dentry, struct inode *inode)
 {
 	if (inode)
@@ -1266,11 +1209,9 @@ static void __d_instantiate(struct dentr
 void d_instantiate(struct dentry *entry, struct inode * inode)
 {
 	BUG_ON(!list_empty(&entry->d_alias));
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	__d_instantiate(entry, inode);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 	security_d_instantiate(entry, inode);
 }
 
@@ -1328,11 +1269,9 @@ struct dentry *d_instantiate_unique(stru
 
 	BUG_ON(!list_empty(&entry->d_alias));
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	result = __d_instantiate_unique(entry, inode);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 
 	if (!result) {
 		security_d_instantiate(entry, inode);
@@ -1420,12 +1359,10 @@ struct dentry *d_obtain_alias(struct ino
 	}
 	tmp->d_parent = tmp; /* make sure dput doesn't croak */
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	res = __d_find_alias(inode, 0);
 	if (res) {
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 		dput(tmp);
 		goto out_iput;
 	}
@@ -1441,7 +1378,6 @@ struct dentry *d_obtain_alias(struct ino
 	spin_unlock(&tmp->d_lock);
 	spin_unlock(&dcache_inode_lock);
 
-	spin_unlock(&dcache_lock);
 	return tmp;
 
  out_iput:
@@ -1471,22 +1407,19 @@ struct dentry *d_splice_alias(struct ino
 	struct dentry *new = NULL;
 
 	if (inode && S_ISDIR(inode->i_mode)) {
-		spin_lock(&dcache_lock);
 		spin_lock(&dcache_inode_lock);
 		new = __d_find_alias(inode, 1);
 		if (new) {
 			BUG_ON(!(new->d_flags & DCACHE_DISCONNECTED));
 			spin_unlock(&dcache_inode_lock);
-			spin_unlock(&dcache_lock);
 			security_d_instantiate(new, inode);
 			d_rehash(dentry);
 			d_move(new, dentry);
 			iput(inode);
 		} else {
-			/* already taking dcache_lock, so d_add() by hand */
+			/* already holding dcache_inode_lock, so d_add() by hand */
 			__d_instantiate(dentry, inode);
 			spin_unlock(&dcache_inode_lock);
-			spin_unlock(&dcache_lock);
 			security_d_instantiate(dentry, inode);
 			d_rehash(dentry);
 		}
@@ -1558,12 +1491,10 @@ struct dentry *d_add_ci(struct dentry *d
 	 * Negative dentry: instantiate it unless the inode is a directory and
 	 * already has a dentry.
 	 */
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	if (!S_ISDIR(inode->i_mode) || list_empty(&inode->i_dentry)) {
 		__d_instantiate(found, inode);
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 		security_d_instantiate(found, inode);
 		return found;
 	}
@@ -1575,7 +1506,6 @@ struct dentry *d_add_ci(struct dentry *d
 	new = list_entry(inode->i_dentry.next, struct dentry, d_alias);
 	dget_locked(new);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 	security_d_instantiate(found, inode);
 	d_move(new, found);
 	iput(inode);
@@ -1597,7 +1527,7 @@ err_out:
  * is returned. The caller must use dput to free the entry when it has
  * finished using it. %NULL is returned on failure.
  *
- * __d_lookup is dcache_lock free. The hash list is protected using RCU.
+ * __d_lookup takes no global lock. The hash list is protected using RCU.
  * Memory barriers are used while updating and doing lockless traversal. 
  * To avoid races with d_move while rename is happening, d_lock is used.
  *
@@ -1609,7 +1539,7 @@ err_out:
  *
  * The dentry unused LRU is not updated even if lookup finds the required dentry
  * in there. It is updated in places such as prune_dcache, shrink_dcache_sb,
- * select_parent and __dget_locked. This laziness saves lookup from dcache_lock
+ * select_parent and __dget_locked. This laziness saves lookup from LRU lock
  * acquisition.
  *
  * d_lookup() is protected against the concurrent renames in some unrelated
@@ -1739,25 +1669,22 @@ int d_validate(struct dentry *dentry, st
 	if (dentry->d_parent != dparent)
 		goto out;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	spin_lock(&dcache_hash_lock);
 	base = d_hash(dparent, dentry->d_name.hash);
 	hlist_for_each(lhp,base) { 
 		/* hlist_for_each_entry_rcu() not required for d_hash list
-		 * as it is parsed under dcache_lock
+		 * as it is parsed under dcache_hash_lock
 		 */
 		if (dentry == hlist_entry(lhp, struct dentry, d_hash)) {
 			spin_unlock(&dcache_hash_lock);
 			__dget_locked_dlock(dentry);
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 			return 1;
 		}
 	}
 	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 out:
 	return 0;
 }
@@ -1789,7 +1716,6 @@ void d_delete(struct dentry * dentry)
 	/*
 	 * Are we the only user?
 	 */
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	spin_lock(&dentry->d_lock);
 	isdir = S_ISDIR(dentry->d_inode->i_mode);
@@ -1804,7 +1730,6 @@ void d_delete(struct dentry * dentry)
 
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 
 	fsnotify_nameremove(dentry, isdir);
 }
@@ -1830,13 +1755,11 @@ static void _d_rehash(struct dentry * en
  
 void d_rehash(struct dentry * entry)
 {
-	spin_lock(&dcache_lock);
 	spin_lock(&entry->d_lock);
 	spin_lock(&dcache_hash_lock);
 	_d_rehash(entry);
 	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&entry->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 /*
@@ -1990,9 +1913,7 @@ static void d_move_locked(struct dentry
 
 void d_move(struct dentry * dentry, struct dentry * target)
 {
-	spin_lock(&dcache_lock);
 	d_move_locked(dentry, target);
-	spin_unlock(&dcache_lock);
 }
 
 /**
@@ -2018,13 +1939,12 @@ struct dentry *d_ancestor(struct dentry
  * This helper attempts to cope with remotely renamed directories
  *
  * It assumes that the caller is already holding
- * dentry->d_parent->d_inode->i_mutex and the dcache_lock
+ * dentry->d_parent->d_inode->i_mutex
  *
  * Note: If ever the locking in lock_rename() changes, then please
  * remember to update this too...
  */
 static struct dentry *__d_unalias(struct dentry *dentry, struct dentry *alias)
-	__releases(dcache_lock)
 {
 	struct mutex *m1 = NULL, *m2 = NULL;
 	struct dentry *ret;
@@ -2051,7 +1971,6 @@ out_unalias:
 	ret = alias;
 out_err:
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 	if (m2)
 		mutex_unlock(m2);
 	if (m1)
@@ -2115,7 +2034,6 @@ struct dentry *d_materialise_unique(stru
 
 	BUG_ON(!d_unhashed(dentry));
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 
 	if (!inode) {
@@ -2162,7 +2080,6 @@ found:
 	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&actual->d_lock);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 out_nolock:
 	if (actual == dentry) {
 		security_d_instantiate(dentry, inode);
@@ -2174,7 +2091,6 @@ out_nolock:
 
 shouldnt_be_hashed:
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 	BUG();
 }
 
@@ -2206,8 +2122,7 @@ static int prepend_name(char **buffer, i
  * Returns a pointer into the buffer or an error code if the
  * path was too long.
  *
- * "buflen" should be positive. Caller holds the dcache_lock and
- * path->dentry->d_lock.
+ * "buflen" should be positive. Caller holds the path->dentry->d_lock.
  *
  * If path is not reachable from the supplied root, then the value of
  * root is changed (without modifying refcounts).
@@ -2312,14 +2227,12 @@ char *d_path(const struct path *path, ch
 
 rename_retry:
 	seq = read_seqbegin(&rename_lock);
-	spin_lock(&dcache_lock);
 	vfsmount_read_lock();
 	spin_lock(&path->dentry->d_lock);
 	tmp = root;
 	res = __d_path(path, &tmp, buf, buflen);
 	spin_unlock(&path->dentry->d_lock);
 	vfsmount_read_unlock();
-	spin_unlock(&dcache_lock);
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 
@@ -2361,7 +2274,6 @@ rename_retry:
 	end = buf + buflen;
 	seq = read_seqbegin(&rename_lock);
 	rcu_read_lock(); /* protect parent */
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	prepend(&end, &buflen, "\0", 1);
 	if (!IS_ROOT(dentry) && d_unhashed(dentry) &&
@@ -2386,7 +2298,6 @@ rename_retry:
 	}
 out:
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 	rcu_read_unlock();
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
@@ -2432,7 +2343,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, b
 
 	error = -ENOENT;
 	/* Has the current directory has been unlinked? */
-	spin_lock(&dcache_lock);
 	vfsmount_read_lock();
 	spin_lock(&pwd.dentry->d_lock);
 	if (IS_ROOT(pwd.dentry) || !d_unhashed(pwd.dentry)) {
@@ -2443,7 +2353,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, b
 		cwd = __d_path(&pwd, &tmp, page, PAGE_SIZE);
 		spin_unlock(&pwd.dentry->d_lock);
 		vfsmount_read_unlock();
-		spin_unlock(&dcache_lock);
 
 		error = PTR_ERR(cwd);
 		if (IS_ERR(cwd))
@@ -2459,7 +2368,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, b
 	} else {
 		spin_unlock(&pwd.dentry->d_lock);
 		vfsmount_read_unlock();
-		spin_unlock(&dcache_lock);
 	}
 
 out:
@@ -2520,7 +2428,6 @@ void d_genocide(struct dentry *root)
 rename_retry:
 	this_parent = root;
 	seq = read_seqbegin(&rename_lock);
-	spin_lock(&dcache_lock);
 	spin_lock(&this_parent->d_lock);
 repeat:
 	next = this_parent->d_subdirs.next;
@@ -2561,7 +2468,6 @@ resume:
 		if (this_parent != child->d_parent ||
 				read_seqretry(&rename_lock, seq)) {
 			spin_unlock(&this_parent->d_lock);
-			spin_unlock(&dcache_lock);
 			rcu_read_unlock();
 			goto rename_retry;
 		}
@@ -2569,7 +2475,6 @@ resume:
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
-	spin_unlock(&dcache_lock);
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 }
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -754,14 +754,11 @@ static __always_inline void follow_dotdo
 			break;
 		}
                 read_unlock(&fs->lock);
-		spin_lock(&dcache_lock);
 		if (nd->path.dentry != nd->path.mnt->mnt_root) {
 			nd->path.dentry = dget(nd->path.dentry->d_parent);
-			spin_unlock(&dcache_lock);
 			dput(old);
 			break;
 		}
-		spin_unlock(&dcache_lock);
 		vfsmount_read_lock();
 		parent = nd->path.mnt->mnt_parent;
 		if (parent == nd->path.mnt) {
@@ -2114,12 +2111,10 @@ void dentry_unhash(struct dentry *dentry
 {
 	dget(dentry);
 	shrink_dcache_parent(dentry);
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	if (dentry->d_count == 2)
 		__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 int vfs_rmdir(struct inode *dir, struct dentry *dentry)
Index: linux-2.6/fs/seq_file.c
===================================================================
--- linux-2.6.orig/fs/seq_file.c
+++ linux-2.6/fs/seq_file.c
@@ -463,11 +463,9 @@ int seq_path_root(struct seq_file *m, st
 
 rename_retry:
 		seq = read_seqbegin(&rename_lock);
-		spin_lock(&dcache_lock);
 		vfsmount_read_lock();
 		p = __d_path(path, root, s, m->size - m->count);
 		vfsmount_read_unlock();
-		spin_unlock(&dcache_lock);
 		if (read_seqretry(&rename_lock, seq))
 			goto rename_retry;
 
Index: linux-2.6/fs/sysfs/dir.c
===================================================================
--- linux-2.6.orig/fs/sysfs/dir.c
+++ linux-2.6/fs/sysfs/dir.c
@@ -547,7 +547,6 @@ static void sysfs_drop_dentry(struct sys
 	 * dput to immediately free the dentry  if it is not in use.
 	 */
 repeat:
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
@@ -559,12 +558,10 @@ repeat:
 		__d_drop(dentry);
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 		dput(dentry);
 		goto repeat;
 	}
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 
 	/* adjust nlink and update timestamp */
 	mutex_lock(&inode->i_mutex);
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -150,13 +150,13 @@ struct dentry_operations {
 
 /*
 locking rules:
-		big lock	dcache_lock	d_lock   may block
-d_revalidate:	no		no		no       yes
-d_hash		no		no		no       yes
-d_compare:	no		yes		yes      no
-d_delete:	no		yes		no       no
-d_release:	no		no		no       yes
-d_iput:		no		no		no       yes
+		big lock	d_lock   may block
+d_revalidate:	no		no       yes
+d_hash		no		no       yes
+d_compare:	no		yes      no
+d_delete:	no		no       no
+d_release:	no		no       yes
+d_iput:		no		no       yes
  */
 
 /* d_flags entries */
@@ -186,7 +186,6 @@ d_iput:		no		no		no       yes
 
 extern spinlock_t dcache_inode_lock;
 extern spinlock_t dcache_hash_lock;
-extern spinlock_t dcache_lock;
 extern seqlock_t rename_lock;
 
 /**
@@ -217,11 +216,9 @@ static inline void __d_drop(struct dentr
 
 static inline void d_drop(struct dentry *dentry)
 {
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
  	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 static inline int dname_external(struct dentry *dentry)
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -184,7 +184,6 @@ static void set_dentry_child_flags(struc
 {
 	struct dentry *alias;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(alias, &inode->i_dentry, d_alias) {
 		struct dentry *child;
@@ -204,7 +203,6 @@ static void set_dentry_child_flags(struc
 		spin_unlock(&alias->d_lock);
 	}
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 }
 
 /*
@@ -272,6 +270,7 @@ void inotify_d_instantiate(struct dentry
 	if (!inode)
 		return;
 
+	/* XXX: need parent lock in place of dcache_lock? */
 	spin_lock(&entry->d_lock);
 	parent = entry->d_parent;
 	if (parent->d_inode && inotify_inode_watched(parent->d_inode))
@@ -286,6 +285,7 @@ void inotify_d_move(struct dentry *entry
 {
 	struct dentry *parent;
 
+	/* XXX: need parent lock in place of dcache_lock? */
 	parent = entry->d_parent;
 	if (inotify_inode_watched(parent->d_inode))
 		entry->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
Index: linux-2.6/fs/exportfs/expfs.c
===================================================================
--- linux-2.6.orig/fs/exportfs/expfs.c
+++ linux-2.6/fs/exportfs/expfs.c
@@ -47,24 +47,20 @@ find_acceptable_alias(struct dentry *res
 	if (acceptable(context, result))
 		return result;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(dentry, &result->d_inode->i_dentry, d_alias) {
 		dget_locked(dentry);
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 		if (toput)
 			dput(toput);
 		if (dentry != result && acceptable(context, dentry)) {
 			dput(result);
 			return dentry;
 		}
-		spin_lock(&dcache_lock);
 		spin_lock(&dcache_inode_lock);
 		toput = dentry;
 	}
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 
 	if (toput)
 		dput(toput);
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -17,7 +17,7 @@ prototypes:
 	void (*d_iput)(struct dentry *, struct inode *);
 	char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
 
-locking rules:
+locking rules: XXX: update these!!
 	none have BKL
 		dcache_lock	rename_lock	->d_lock	may block
 d_revalidate:	no		no		no		yes
Index: linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/inode.c
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
@@ -158,21 +158,18 @@ static void spufs_prune_dir(struct dentr
 
 	mutex_lock(&dir->d_inode->i_mutex);
 	list_for_each_entry_safe(dentry, tmp, &dir->d_subdirs, d_u.d_child) {
-		spin_lock(&dcache_lock);
 		spin_lock(&dentry->d_lock);
 		if (!(d_unhashed(dentry)) && dentry->d_inode) {
 			dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			simple_unlink(dir->d_inode, dentry);
-			/* XXX: what is dcache_lock protecting here? Other
+			/* XXX: what was dcache_lock protecting here? Other
 			 * filesystems (IB, configfs) release dcache_lock
 			 * before unlink */
-			spin_unlock(&dcache_lock);
 			dput(dentry);
 		} else {
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 		}
 	}
 	shrink_dcache_parent(dir);
Index: linux-2.6/drivers/infiniband/hw/ipath/ipath_fs.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ipath/ipath_fs.c
+++ linux-2.6/drivers/infiniband/hw/ipath/ipath_fs.c
@@ -272,18 +272,14 @@ static int remove_file(struct dentry *pa
 		goto bail;
 	}
 
-	spin_lock(&dcache_lock);
 	spin_lock(&tmp->d_lock);
 	if (!(d_unhashed(tmp) && tmp->d_inode)) {
 		dget_locked_dlock(tmp);
 		__d_drop(tmp);
 		spin_unlock(&tmp->d_lock);
-		spin_unlock(&dcache_lock);
 		simple_unlink(parent->d_inode, tmp);
-	} else {
+	} else
 		spin_unlock(&tmp->d_lock);
-		spin_unlock(&dcache_lock);
-	}
 
 	ret = 0;
 bail:
Index: linux-2.6/drivers/usb/core/inode.c
===================================================================
--- linux-2.6.orig/drivers/usb/core/inode.c
+++ linux-2.6/drivers/usb/core/inode.c
@@ -343,18 +343,15 @@ static int usbfs_empty (struct dentry *d
 {
 	struct list_head *list;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	list_for_each(list, &dentry->d_subdirs) {
 		struct dentry *de = list_entry(list, struct dentry, d_u.d_child);
 		if (usbfs_positive(de)) {
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 			return 0;
 		}
 	}
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 
 	return 1;
 }
Index: linux-2.6/fs/affs/amigaffs.c
===================================================================
--- linux-2.6.orig/fs/affs/amigaffs.c
+++ linux-2.6/fs/affs/amigaffs.c
@@ -128,7 +128,6 @@ affs_fix_dcache(struct dentry *dentry, u
 	void *data = dentry->d_fsdata;
 	struct list_head *head, *next;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	head = &inode->i_dentry;
 	next = head->next;
@@ -141,7 +140,6 @@ affs_fix_dcache(struct dentry *dentry, u
 		next = next->next;
 	}
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 }
 
 
Index: linux-2.6/fs/autofs4/expire.c
===================================================================
--- linux-2.6.orig/fs/autofs4/expire.c
+++ linux-2.6/fs/autofs4/expire.c
@@ -104,7 +104,6 @@ static struct dentry *get_next_positive_
 	struct list_head *next;
 	struct dentry *ret;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&p->d_lock);
 again:
 	next = p->d_subdirs.next;
@@ -122,9 +121,7 @@ again:
 				dget_dlock(p);
 				spin_unlock(&p->d_lock);
 				parent = dget_parent(p);
-				spin_unlock(&dcache_lock);
 				dput(p);
-				spin_lock(&dcache_lock);
 				spin_lock(&parent->d_lock);
 			} else
 				spin_unlock(&p->d_lock);
@@ -146,8 +143,6 @@ again:
 	dget_dlock(ret);
 	spin_unlock(&ret->d_lock);
 
-	spin_unlock(&dcache_lock);
-
 	return ret;
 }
 
@@ -337,7 +332,6 @@ struct dentry *autofs4_expire_indirect(s
 	now = jiffies;
 	timeout = sbi->exp_timeout;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&root->d_lock);
 	next = root->d_subdirs.next;
 
@@ -354,7 +348,6 @@ struct dentry *autofs4_expire_indirect(s
 
 		dentry = dget(dentry);
 		spin_unlock(&root->d_lock);
-		spin_unlock(&dcache_lock);
 
 		spin_lock(&sbi->fs_lock);
 		ino = autofs4_dentry_ino(dentry);
@@ -419,12 +412,10 @@ struct dentry *autofs4_expire_indirect(s
 next:
 		spin_unlock(&sbi->fs_lock);
 		dput(dentry);
-		spin_lock(&dcache_lock);
 		spin_lock(&root->d_lock);
 		next = next->next;
 	}
 	spin_unlock(&root->d_lock);
-	spin_unlock(&dcache_lock);
 	return NULL;
 
 found:
@@ -434,11 +425,9 @@ found:
 	ino->flags |= AUTOFS_INF_EXPIRING;
 	init_completion(&ino->expire_complete);
 	spin_unlock(&sbi->fs_lock);
-	spin_lock(&dcache_lock);
 	spin_lock(&expired->d_parent->d_lock);
 	list_move(&expired->d_parent->d_subdirs, &expired->d_u.d_child);
 	spin_unlock(&expired->d_parent->d_lock);
-	spin_unlock(&dcache_lock);
 	return expired;
 }
 
Index: linux-2.6/fs/autofs4/inode.c
===================================================================
--- linux-2.6.orig/fs/autofs4/inode.c
+++ linux-2.6/fs/autofs4/inode.c
@@ -109,7 +109,6 @@ static void autofs4_force_release(struct
 	if (!sbi->sb->s_root)
 		return;
 
-	spin_lock(&dcache_lock);
 repeat:
 	spin_lock(&this_parent->d_lock);
 	next = this_parent->d_subdirs.next;
@@ -130,13 +129,11 @@ resume:
 
 		next = next->next;
 		spin_unlock(&this_parent->d_lock);
-		spin_unlock(&dcache_lock);
 
 		DPRINTK("dentry %p %.*s",
 			dentry, (int)dentry->d_name.len, dentry->d_name.name);
 
 		dput(dentry);
-		spin_lock(&dcache_lock);
 	}
 
 	if (this_parent != sbi->sb->s_root) {
@@ -145,15 +142,12 @@ resume:
 		next = this_parent->d_u.d_child.next;
 		spin_unlock(&this_parent->d_lock);
 		this_parent = this_parent->d_parent;
-		spin_unlock(&dcache_lock);
 		DPRINTK("parent dentry %p %.*s",
 			dentry, (int)dentry->d_name.len, dentry->d_name.name);
 		dput(dentry);
-		spin_lock(&dcache_lock);
 		spin_lock(&this_parent->d_lock);
 		goto resume;
 	}
-	spin_unlock(&dcache_lock);
 	spin_unlock(&this_parent->d_lock);
 }
 
Index: linux-2.6/fs/autofs4/root.c
===================================================================
--- linux-2.6.orig/fs/autofs4/root.c
+++ linux-2.6/fs/autofs4/root.c
@@ -92,15 +92,12 @@ static int autofs4_dir_open(struct inode
 	 * autofs file system so just let the libfs routines handle
 	 * it.
 	 */
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	if (!d_mountpoint(dentry) && __simple_empty(dentry)) {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
 		return -ENOENT;
 	}
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 
 out:
 	return dcache_dir_open(inode, file);
@@ -214,12 +211,10 @@ static void *autofs4_follow_link(struct
 	 * multi-mount with no root mount offset. So don't try to
 	 * mount it again.
 	 */
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	if (dentry->d_flags & DCACHE_AUTOFS_PENDING ||
 	    (!d_mountpoint(dentry) && __simple_empty(dentry))) {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
 
 		status = try_to_fill_dentry(dentry, 0);
 		if (status)
@@ -228,7 +223,6 @@ static void *autofs4_follow_link(struct
 		goto follow;
 	}
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 follow:
 	/*
 	 * If there is no root mount it must be an autofs
@@ -299,13 +293,11 @@ static int autofs4_revalidate(struct den
 		return 0;
 
 	/* Check for a non-mountpoint directory with no contents */
-	spin_lock(&dcache_lock);
 	if (S_ISDIR(dentry->d_inode->i_mode) &&
 	    !d_mountpoint(dentry) && 
 	    __simple_empty(dentry)) {
 		DPRINTK("dentry=%p %.*s, emptydir",
 			 dentry, dentry->d_name.len, dentry->d_name.name);
-		spin_unlock(&dcache_lock);
 
 		/* The daemon never causes a mount to trigger */
 		if (oz_mode)
@@ -321,7 +313,6 @@ static int autofs4_revalidate(struct den
 
 		return status;
 	}
-	spin_unlock(&dcache_lock);
 
 	return 1;
 }
@@ -373,7 +364,6 @@ static struct dentry *autofs4_lookup_act
 	const unsigned char *str = name->name;
 	struct list_head *p, *head;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&sbi->lookup_lock);
 	head = &sbi->active_list;
 	list_for_each(p, head) {
@@ -406,14 +396,12 @@ static struct dentry *autofs4_lookup_act
 			dget_locked_dlock(dentry);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&sbi->lookup_lock);
-			spin_unlock(&dcache_lock);
 			return dentry;
 		}
 next:
 		spin_unlock(&dentry->d_lock);
 	}
 	spin_unlock(&sbi->lookup_lock);
-	spin_unlock(&dcache_lock);
 
 	return NULL;
 }
@@ -425,7 +413,6 @@ static struct dentry *autofs4_lookup_exp
 	const unsigned char *str = name->name;
 	struct list_head *p, *head;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&sbi->lookup_lock);
 	head = &sbi->expiring_list;
 	list_for_each(p, head) {
@@ -458,14 +445,12 @@ static struct dentry *autofs4_lookup_exp
 			dget_locked_dlock(dentry);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&sbi->lookup_lock);
-			spin_unlock(&dcache_lock);
 			return dentry;
 		}
 next:
 		spin_unlock(&dentry->d_lock);
 	}
 	spin_unlock(&sbi->lookup_lock);
-	spin_unlock(&dcache_lock);
 
 	return NULL;
 }
@@ -711,7 +696,6 @@ static int autofs4_dir_unlink(struct ino
 
 	dir->i_mtime = CURRENT_TIME;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&sbi->lookup_lock);
 	if (list_empty(&ino->expiring))
 		list_add(&ino->expiring, &sbi->expiring_list);
@@ -719,7 +703,6 @@ static int autofs4_dir_unlink(struct ino
 	spin_lock(&dentry->d_lock);
 	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 
 	return 0;
 }
@@ -736,11 +719,9 @@ static int autofs4_dir_rmdir(struct inod
 	if (!autofs4_oz_mode(sbi))
 		return -EACCES;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	if (!list_empty(&dentry->d_subdirs)) {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
 		return -ENOTEMPTY;
 	}
 	spin_lock(&sbi->lookup_lock);
@@ -749,7 +730,6 @@ static int autofs4_dir_rmdir(struct inod
 	spin_unlock(&sbi->lookup_lock);
 	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 
 	if (atomic_dec_and_test(&ino->count)) {
 		p_ino = autofs4_dentry_ino(dentry->d_parent);
Index: linux-2.6/fs/coda/cache.c
===================================================================
--- linux-2.6.orig/fs/coda/cache.c
+++ linux-2.6/fs/coda/cache.c
@@ -86,7 +86,6 @@ static void coda_flag_children(struct de
 	struct list_head *child;
 	struct dentry *de;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 	list_for_each(child, &parent->d_subdirs)
 	{
@@ -97,7 +96,6 @@ static void coda_flag_children(struct de
 		coda_flag_inode(de->d_inode, flag);
 	}
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 	return; 
 }
 
Index: linux-2.6/fs/configfs/configfs_internal.h
===================================================================
--- linux-2.6.orig/fs/configfs/configfs_internal.h
+++ linux-2.6/fs/configfs/configfs_internal.h
@@ -117,7 +117,6 @@ static inline struct config_item *config
 {
 	struct config_item * item = NULL;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	if (!d_unhashed(dentry)) {
 		struct configfs_dirent * sd = dentry->d_fsdata;
@@ -128,7 +127,6 @@ static inline struct config_item *config
 			item = config_item_get(sd->s_element);
 	}
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 
 	return item;
 }
Index: linux-2.6/fs/configfs/inode.c
===================================================================
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -215,18 +215,14 @@ void configfs_drop_dentry(struct configf
 	struct dentry * dentry = sd->s_dentry;
 
 	if (dentry) {
-		spin_lock(&dcache_lock);
 		spin_lock(&dentry->d_lock);
 		if (!(d_unhashed(dentry) && dentry->d_inode)) {
 			dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 			simple_unlink(parent->d_inode, dentry);
-		} else {
+		} else
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
-		}
 	}
 }
 
Index: linux-2.6/fs/ncpfs/dir.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/dir.c
+++ linux-2.6/fs/ncpfs/dir.c
@@ -364,7 +364,6 @@ ncp_dget_fpos(struct dentry *dentry, str
 	}
 
 	/* If a pointer is invalid, we search the dentry. */
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
@@ -375,13 +374,11 @@ ncp_dget_fpos(struct dentry *dentry, str
 			else
 				dent = NULL;
 			spin_unlock(&parent->d_lock);
-			spin_unlock(&dcache_lock);
 			goto out;
 		}
 		next = next->next;
 	}
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 	return NULL;
 
 out:
Index: linux-2.6/fs/ncpfs/ncplib_kernel.h
===================================================================
--- linux-2.6.orig/fs/ncpfs/ncplib_kernel.h
+++ linux-2.6/fs/ncpfs/ncplib_kernel.h
@@ -192,7 +192,6 @@ ncp_renew_dentries(struct dentry *parent
 	struct list_head *next;
 	struct dentry *dentry;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
@@ -206,7 +205,6 @@ ncp_renew_dentries(struct dentry *parent
 		next = next->next;
 	}
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 static inline void
@@ -216,7 +214,6 @@ ncp_invalidate_dircache_entries(struct d
 	struct list_head *next;
 	struct dentry *dentry;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
@@ -226,7 +223,6 @@ ncp_invalidate_dircache_entries(struct d
 		next = next->next;
 	}
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 struct ncp_cache_head {
Index: linux-2.6/fs/nfs/dir.c
===================================================================
--- linux-2.6.orig/fs/nfs/dir.c
+++ linux-2.6/fs/nfs/dir.c
@@ -1433,11 +1433,9 @@ static int nfs_unlink(struct inode *dir,
 	dfprintk(VFS, "NFS: unlink(%s/%ld, %s)\n", dir->i_sb->s_id,
 		dir->i_ino, dentry->d_name.name);
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	if (dentry->d_count > 1) {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
 		/* Start asynchronous writeout of the inode */
 		write_inode_now(dentry->d_inode, 0);
 		error = nfs_sillyrename(dir, dentry);
@@ -1448,7 +1446,6 @@ static int nfs_unlink(struct inode *dir,
 		need_rehash = 1;
 	}
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 	error = nfs_safe_remove(dentry);
 	if (!error || error == -ENOENT) {
 		nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
Index: linux-2.6/fs/ocfs2/dcache.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dcache.c
+++ linux-2.6/fs/ocfs2/dcache.c
@@ -140,7 +140,6 @@ struct dentry *ocfs2_find_local_alias(st
 	struct list_head *p;
 	struct dentry *dentry = NULL;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	list_for_each(p, &inode->i_dentry) {
 		dentry = list_entry(p, struct dentry, d_alias);
@@ -160,7 +159,6 @@ struct dentry *ocfs2_find_local_alias(st
 	}
 
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 
 	return dentry;
 }
Index: linux-2.6/fs/smbfs/cache.c
===================================================================
--- linux-2.6.orig/fs/smbfs/cache.c
+++ linux-2.6/fs/smbfs/cache.c
@@ -62,7 +62,6 @@ smb_invalidate_dircache_entries(struct d
 	struct list_head *next;
 	struct dentry *dentry;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
@@ -72,7 +71,6 @@ smb_invalidate_dircache_entries(struct d
 		next = next->next;
 	}
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 /*
@@ -98,7 +96,6 @@ smb_dget_fpos(struct dentry *dentry, str
 	}
 
 	/* If a pointer is invalid, we search the dentry. */
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
@@ -115,7 +112,6 @@ smb_dget_fpos(struct dentry *dentry, str
 	dent = NULL;
 out_unlock:
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 	return dent;
 }
 
Index: linux-2.6/kernel/cgroup.c
===================================================================
--- linux-2.6.orig/kernel/cgroup.c
+++ linux-2.6/kernel/cgroup.c
@@ -693,7 +693,6 @@ static void cgroup_clear_directory(struc
 	struct list_head *node;
 
 	BUG_ON(!mutex_is_locked(&dentry->d_inode->i_mutex));
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	node = dentry->d_subdirs.next;
 	while (node != &dentry->d_subdirs) {
@@ -708,18 +707,15 @@ static void cgroup_clear_directory(struc
 			dget_locked_dlock(d);
 			spin_unlock(&d->d_lock);
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 			d_delete(d);
 			simple_unlink(dentry->d_inode, d);
 			dput(d);
-			spin_lock(&dcache_lock);
 			spin_lock(&dentry->d_lock);
 		} else
 			spin_unlock(&d->d_lock);
 		node = dentry->d_subdirs.next;
 	}
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 /*
@@ -731,14 +727,12 @@ static void cgroup_d_remove_dir(struct d
 
 	cgroup_clear_directory(dentry);
 
-	spin_lock(&dcache_lock);
 	parent = dentry->d_parent;
 	spin_lock(&parent->d_lock);
 	spin_lock(&dentry->d_lock);
 	list_del_init(&dentry->d_u.d_child);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 	remove_dir(dentry);
 }
 
Index: linux-2.6/net/sunrpc/rpc_pipe.c
===================================================================
--- linux-2.6.orig/net/sunrpc/rpc_pipe.c
+++ linux-2.6/net/sunrpc/rpc_pipe.c
@@ -547,15 +547,14 @@ static void rpc_depopulate(struct dentry
 
 	mutex_lock_nested(&dir->i_mutex, I_MUTEX_CHILD);
 repeat:
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 	list_for_each_safe(pos, next, &parent->d_subdirs) {
 		dentry = list_entry(pos, struct dentry, d_u.d_child);
+		spin_lock(&dentry->d_lock);
 		if (!dentry->d_inode ||
 				dentry->d_inode->i_ino < start ||
 				dentry->d_inode->i_ino >= eof)
-			continue;
-		spin_lock(&dentry->d_lock);
+			goto next;
 		if (!d_unhashed(dentry)) {
 			dget_locked_dlock(dentry);
 			__d_drop(dentry);
@@ -563,11 +562,11 @@ repeat:
 			dvec[n++] = dentry;
 			if (n == ARRAY_SIZE(dvec))
 				break;
-		} else
-			spin_unlock(&dentry->d_lock);
+		}
+next:
+		spin_unlock(&dentry->d_lock);
 	}
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 	if (n) {
 		do {
 			dentry = dvec[--n];
Index: linux-2.6/security/selinux/selinuxfs.c
===================================================================
--- linux-2.6.orig/security/selinux/selinuxfs.c
+++ linux-2.6/security/selinux/selinuxfs.c
@@ -947,7 +947,6 @@ static void sel_remove_entries(struct de
 {
 	struct list_head *node;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&de->d_lock);
 	node = de->d_subdirs.next;
 	while (node != &de->d_subdirs) {
@@ -960,11 +959,9 @@ static void sel_remove_entries(struct de
 			dget_locked_dlock(d);
 			spin_unlock(&de->d_lock);
 			spin_unlock(&d->d_lock);
-			spin_unlock(&dcache_lock);
 			d_delete(d);
 			simple_unlink(de->d_inode, d);
 			dput(d);
-			spin_lock(&dcache_lock);
 			spin_lock(&de->d_lock);
 		} else
 			spin_unlock(&d->d_lock);
@@ -972,7 +969,6 @@ static void sel_remove_entries(struct de
 	}
 
 	spin_unlock(&de->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 #define BOOL_DIR_NAME "booleans"
Index: linux-2.6/security/tomoyo/realpath.c
===================================================================
--- linux-2.6.orig/security/tomoyo/realpath.c
+++ linux-2.6/security/tomoyo/realpath.c
@@ -102,10 +102,8 @@ int tomoyo_realpath_from_path2(struct pa
 		if (ns_root.mnt)
 			ns_root.dentry = dget(ns_root.mnt->mnt_root);
 		vfsmount_read_unlock();
-		spin_lock(&dcache_lock);
 		tmp = ns_root;
 		sp = __d_path(path, &tmp, newname, newname_len);
-		spin_unlock(&dcache_lock);
 		path_put(&root);
 		path_put(&ns_root);
 	}
Index: linux-2.6/fs/nfs/getroot.c
===================================================================
--- linux-2.6.orig/fs/nfs/getroot.c
+++ linux-2.6/fs/nfs/getroot.c
@@ -65,13 +65,11 @@ static int nfs_superblock_set_dummy_root
 		 * This again causes shrink_dcache_for_umount_subtree() to
 		 * Oops, since the test for IS_ROOT() will fail.
 		 */
-		spin_lock(&dcache_lock);
 		spin_lock(&dcache_inode_lock);
 		spin_lock(&sb->s_root->d_lock);
 		list_del_init(&sb->s_root->d_alias);
 		spin_unlock(&sb->s_root->d_lock);
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 	}
 	return 0;
 }
Index: linux-2.6/drivers/staging/pohmelfs/path_entry.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/path_entry.c
+++ linux-2.6/drivers/staging/pohmelfs/path_entry.c
@@ -100,7 +100,6 @@ int pohmelfs_path_length(struct pohmelfs
 	rcu_read_lock();
 rename_retry:
 	seq = read_seqbegin(&rename_lock);
-	spin_lock(&dcache_lock);
 
 	if (!IS_ROOT(d) && d_unhashed(d))
 		len += UNHASHED_OBSCURE_STRING_SIZE; /* Obscure " (deleted)" string */
@@ -109,7 +108,6 @@ rename_retry:
 		len += d->d_name.len + 1; /* Plus slash */
 		d = d->d_parent;
 	}
-	spin_unlock(&dcache_lock);
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 	rcu_read_unlock();
Index: linux-2.6/fs/autofs4/waitq.c
===================================================================
--- linux-2.6.orig/fs/autofs4/waitq.c
+++ linux-2.6/fs/autofs4/waitq.c
@@ -194,12 +194,10 @@ static int autofs4_getpath(struct autofs
 	rcu_read_lock();
 rename_retry:
 	seq = read_seqbegin(&rename_lock);
-	spin_lock(&dcache_lock);
 	for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
 		len += tmp->d_name.len + 1;
 
 	if (!len || --len > NAME_MAX) {
-		spin_unlock(&dcache_lock);
 		if (read_seqretry(&rename_lock, seq))
 			goto rename_retry;
 		rcu_read_unlock();
@@ -215,7 +213,6 @@ rename_retry:
 		p -= tmp->d_name.len;
 		strncpy(p, tmp->d_name.name, tmp->d_name.len);
 	}
-	spin_unlock(&dcache_lock);
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 	rcu_read_unlock();
Index: linux-2.6/fs/nfs/namespace.c
===================================================================
--- linux-2.6.orig/fs/nfs/namespace.c
+++ linux-2.6/fs/nfs/namespace.c
@@ -57,7 +57,6 @@ char *nfs_path(const char *base,
 	rcu_read_lock();
 rename_retry:
 	seq = read_seqbegin(&rename_lock);
-	spin_lock(&dcache_lock);
 	while (!IS_ROOT(dentry) && dentry != droot) {
 		namelen = dentry->d_name.len;
 		buflen -= namelen + 1;
@@ -68,7 +67,6 @@ rename_retry:
 		*--end = '/';
 		dentry = dentry->d_parent;
 	}
-	spin_unlock(&dcache_lock);
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 	rcu_read_unlock();
@@ -83,7 +81,6 @@ rename_retry:
 	memcpy(end, base, namelen);
 	return end;
 Elong_unlock:
-	spin_unlock(&dcache_lock);
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 	rcu_read_unlock();



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 16/27] fs: dcache reduce dput locking
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (14 preceding siblings ...)
  2009-04-25  1:20 ` [patch 15/27] fs: dcache remove dcache_lock npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 17/27] fs: dcache per-bucket dcache hash locking npiggin
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: dcache-dput-less-dcache_lock.patch --]
[-- Type: text/plain, Size: 2605 bytes --]

It is possible to run dput() without taking the heavier locks up-front.
In the many cases where we don't end up killing the dentry, those locks
are not required.

(I think... this needs more thought.) This also changes ->d_delete
locking, which has not been fully audited.
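
The resulting dput() has roughly this shape (a simplified sketch of the
code below, with the corner cases elided):

void dput(struct dentry *dentry)
{
	spin_lock(&dentry->d_lock);
	if (dentry->d_count > 1) {
		dentry->d_count--;	/* common case: no global locks */
		spin_unlock(&dentry->d_lock);
		return;
	}
	/* (the real code also retains unreferenced-but-hashed dentries
	 * on the LRU here, still without any global lock) */
	spin_unlock(&dentry->d_lock);

	/* last reference may go away: take the heavy locks, in order */
	spin_lock(&dcache_inode_lock);
relock:
	spin_lock(&dentry->d_lock);
	if (!spin_trylock(&dentry->d_parent->d_lock)) {
		spin_unlock(&dentry->d_lock);
		goto relock;	/* out-of-order acquire, trylock and retry */
	}
	/* ... recheck d_count, unhash, dentry_lru_del(), d_kill() ... */
}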

---
 fs/dcache.c |   59 ++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 32 insertions(+), 27 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -262,7 +262,8 @@ static struct dentry *d_kill(struct dent
 
 void dput(struct dentry *dentry)
 {
-	struct dentry *parent = NULL;
+	struct dentry *parent;
+
 	if (!dentry)
 		return;
 
@@ -270,23 +271,9 @@ repeat:
 	if (dentry->d_count == 1)
 		might_sleep();
 	spin_lock(&dentry->d_lock);
-	if (dentry->d_count == 1) {
-		if (!spin_trylock(&dcache_inode_lock)) {
-drop2:
-			spin_unlock(&dentry->d_lock);
-			goto repeat;
-		}
-		parent = dentry->d_parent;
-		if (parent) {
-			BUG_ON(parent == dentry);
-			if (!spin_trylock(&parent->d_lock)) {
-				spin_unlock(&dcache_inode_lock);
-				goto drop2;
-			}
-		}
-	}
-	dentry->d_count--;
-	if (dentry->d_count) {
+	BUG_ON(!dentry->d_count);
+	if (dentry->d_count > 1) {
+		dentry->d_count--;
 		spin_unlock(&dentry->d_lock);
 		return;
 	}
@@ -295,8 +282,10 @@ drop2:
 	 * AV: ->d_delete() is _NOT_ allowed to block now.
 	 */
 	if (dentry->d_op && dentry->d_op->d_delete) {
-		if (dentry->d_op->d_delete(dentry))
-			goto unhash_it;
+		if (dentry->d_op->d_delete(dentry)) {
+			__d_drop(dentry);
+			goto kill_it;
+		}
 	}
 	/* Unreachable? Get rid of it */
  	if (d_unhashed(dentry))
@@ -305,15 +294,31 @@ drop2:
   		dentry->d_flags |= DCACHE_REFERENCED;
 		dentry_lru_add(dentry);
   	}
- 	spin_unlock(&dentry->d_lock);
-	if (parent)
-		spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_inode_lock);
-	return;
+	dentry->d_count--;
+	spin_unlock(&dentry->d_lock);
+  	return;
 
-unhash_it:
-	__d_drop(dentry);
 kill_it:
+	spin_unlock(&dentry->d_lock);
+	spin_lock(&dcache_inode_lock);
+relock:
+	spin_lock(&dentry->d_lock);
+	parent = dentry->d_parent;
+	if (parent) {
+		BUG_ON(parent == dentry);
+		if (!spin_trylock(&parent->d_lock)) {
+			spin_unlock(&dentry->d_lock);
+			goto relock;
+		}
+	}
+	dentry->d_count--;
+	if (dentry->d_count) {
+		/* This case should be fine */
+		spin_unlock(&dentry->d_lock);
+		spin_unlock(&parent->d_lock);
+		spin_unlock(&dcache_inode_lock);
+		return;
+	}
 	/* if dentry was on the d_lru list delete it from there */
 	dentry_lru_del(dentry);
 	dentry = d_kill(dentry);



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 17/27] fs: dcache per-bucket dcache hash locking
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (15 preceding siblings ...)
  2009-04-25  1:20 ` [patch 16/27] fs: dcache reduce dput locking npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 18/27] fs: dcache reduce dcache_inode_lock npiggin
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: dcache-chain-hashlock.patch --]
[-- Type: text/plain, Size: 11164 bytes --]

We can turn the dcache hash locking from a global dcache_hash_lock into
per-bucket locking.

XXX: should probably use a bit lock in the first bit of the hash pointers
to avoid any space bloat (and a non-atomic unlock means no extra atomics either)
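
The bit-lock idea from the XXX would look roughly like this (untested
sketch, not part of the patch; it relies on chain pointers being at least
2-byte aligned so that bit 0 is free):

#include <linux/list.h>
#include <linux/bit_spinlock.h>

struct bl_bucket {
	unsigned long first;	/* hlist head pointer bits; bit 0 is the lock */
};

static inline void bucket_lock(struct bl_bucket *b)
{
	bit_spin_lock(0, &b->first);	/* no extra word per bucket */
}

static inline void bucket_unlock(struct bl_bucket *b)
{
	/* Non-atomic release, so no extra atomic on the unlock side. */
	__bit_spin_unlock(0, &b->first);
}

static inline struct hlist_node *bucket_first(struct bl_bucket *b)
{
	/* Mask off the lock bit to recover the real head pointer. */
	return (struct hlist_node *)(b->first & ~1UL);
}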
---
 fs/dcache.c            |  197 ++++++++++++++++++++++++++++---------------------
 include/linux/dcache.h |   20 ----
 2 files changed, 115 insertions(+), 102 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -38,7 +38,7 @@
  * Usage:
  * dcache_inode_lock protects:
  *   - the inode alias lists, d_inode
- * dcache_hash_lock protects:
+ * dcache_hash_bucket->lock protects:
  *   - the dcache hash table
  * dcache_lru_lock protects:
  *   - the dcache lru lists and counters
@@ -53,18 +53,16 @@
  * dcache_inode_lock
  *   dentry->d_lock
  *     dcache_lru_lock
- *     dcache_hash_lock
+ *     dcache_hash_bucket->lock
  */
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_inode_lock);
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
 EXPORT_SYMBOL(dcache_inode_lock);
-EXPORT_SYMBOL(dcache_hash_lock);
 
 static struct kmem_cache *dentry_cache __read_mostly;
 
@@ -83,7 +81,12 @@ static struct kmem_cache *dentry_cache _
 
 static unsigned int d_hash_mask __read_mostly;
 static unsigned int d_hash_shift __read_mostly;
-static struct hlist_head *dentry_hashtable __read_mostly;
+
+struct dcache_hash_bucket {
+	spinlock_t lock;
+	struct hlist_head head;
+};
+static struct dcache_hash_bucket *dentry_hashtable __read_mostly;
 
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {
@@ -91,6 +94,14 @@ struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
+static inline struct dcache_hash_bucket *d_hash(struct dentry *parent,
+					unsigned long hash)
+{
+	hash += ((unsigned long) parent ^ GOLDEN_RATIO_PRIME) / L1_CACHE_BYTES;
+	hash = hash ^ ((hash ^ GOLDEN_RATIO_PRIME) >> D_HASHBITS);
+	return dentry_hashtable + (hash & D_HASHMASK);
+}
+
 static void __d_free(struct dentry *dentry)
 {
 	WARN_ON(!list_empty(&dentry->d_alias));
@@ -231,6 +242,73 @@ static struct dentry *d_kill(struct dent
 	return parent;
 }
 
+void __d_drop(struct dentry *dentry)
+{
+	if (!(dentry->d_flags & DCACHE_UNHASHED)) {
+		struct dcache_hash_bucket *b;
+		b = d_hash(dentry->d_parent, dentry->d_name.hash);
+		dentry->d_flags |= DCACHE_UNHASHED;
+		spin_lock(&b->lock);
+		hlist_del_rcu(&dentry->d_hash);
+		spin_unlock(&b->lock);
+	}
+}
+
+void d_drop(struct dentry *dentry)
+{
+	spin_lock(&dentry->d_lock);
+ 	__d_drop(dentry);
+	spin_unlock(&dentry->d_lock);
+}
+
+/* This should be called _only_ with a lock pinning the dentry */
+static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
+{
+	dentry->d_count++;
+	dentry_lru_del_init(dentry);
+	return dentry;
+}
+
+static inline struct dentry * __dget_locked(struct dentry *dentry)
+{
+	spin_lock(&dentry->d_lock);
+	__dget_locked_dlock(dentry);
+	spin_unlock(&dentry->d_lock);
+	return dentry;
+}
+
+struct dentry * dget_locked_dlock(struct dentry *dentry)
+{
+	return __dget_locked_dlock(dentry);
+}
+
+struct dentry * dget_locked(struct dentry *dentry)
+{
+	return __dget_locked(dentry);
+}
+
+struct dentry *dget_parent(struct dentry *dentry)
+{
+	struct dentry *ret;
+
+repeat:
+	spin_lock(&dentry->d_lock);
+	ret = dentry->d_parent;
+	if (!ret)
+		goto out;
+	if (!spin_trylock(&ret->d_lock)) {
+		spin_unlock(&dentry->d_lock);
+		goto repeat;
+	}
+	BUG_ON(!ret->d_count);
+	ret->d_count++;
+	spin_unlock(&ret->d_lock);
+out:
+	spin_unlock(&dentry->d_lock);
+	return ret;
+}
+EXPORT_SYMBOL(dget_parent);
+
 /* 
  * This is dput
  *
@@ -380,54 +458,6 @@ int d_invalidate(struct dentry * dentry)
 	return 0;
 }
 
-/* This should be called _only_ with a lock pinning the dentry */
-static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
-{
-	dentry->d_count++;
-	dentry_lru_del_init(dentry);
-	return dentry;
-}
-
-static inline struct dentry * __dget_locked(struct dentry *dentry)
-{
-	spin_lock(&dentry->d_lock);
-	__dget_locked_dlock(dentry);
-	spin_lock(&dentry->d_lock);
-	return dentry;
-}
-
-struct dentry * dget_locked_dlock(struct dentry *dentry)
-{
-	return __dget_locked_dlock(dentry);
-}
-
-struct dentry * dget_locked(struct dentry *dentry)
-{
-	return __dget_locked(dentry);
-}
-
-struct dentry *dget_parent(struct dentry *dentry)
-{
-	struct dentry *ret;
-
-repeat:
-	spin_lock(&dentry->d_lock);
-	ret = dentry->d_parent;
-	if (!ret)
-		goto out;
-	if (!spin_trylock(&ret->d_lock)) {
-		spin_unlock(&dentry->d_lock);
-		goto repeat;
-	}
-	BUG_ON(!ret->d_count);
-	ret->d_count++;
-	spin_unlock(&ret->d_lock);
-out:
-	spin_unlock(&dentry->d_lock);
-	return ret;
-}
-EXPORT_SYMBOL(dget_parent);
-
 /**
  * d_find_alias - grab a hashed alias of inode
  * @inode: inode in question
@@ -1316,14 +1346,6 @@ struct dentry * d_alloc_root(struct inod
 	return res;
 }
 
-static inline struct hlist_head *d_hash(struct dentry *parent,
-					unsigned long hash)
-{
-	hash += ((unsigned long) parent ^ GOLDEN_RATIO_PRIME) / L1_CACHE_BYTES;
-	hash = hash ^ ((hash ^ GOLDEN_RATIO_PRIME) >> D_HASHBITS);
-	return dentry_hashtable + (hash & D_HASHMASK);
-}
-
 /**
  * d_obtain_alias - find or allocate a dentry for a given inode
  * @inode: inode to allocate the dentry for
@@ -1570,7 +1592,8 @@ struct dentry * __d_lookup(struct dentry
 	unsigned int len = name->len;
 	unsigned int hash = name->hash;
 	const unsigned char *str = name->name;
-	struct hlist_head *head = d_hash(parent,hash);
+	struct dcache_hash_bucket *b = d_hash(parent, hash);
+	struct hlist_head *head = &b->head;
 	struct dentry *found = NULL;
 	struct hlist_node *node;
 	struct dentry *dentry;
@@ -1664,6 +1687,7 @@ out:
  
 int d_validate(struct dentry *dentry, struct dentry *dparent)
 {
+	struct dcache_hash_bucket *b;
 	struct hlist_head *base;
 	struct hlist_node *lhp;
 
@@ -1675,20 +1699,21 @@ int d_validate(struct dentry *dentry, st
 		goto out;
 
 	spin_lock(&dentry->d_lock);
-	spin_lock(&dcache_hash_lock);
-	base = d_hash(dparent, dentry->d_name.hash);
-	hlist_for_each(lhp,base) { 
+	b = d_hash(dparent, dentry->d_name.hash);
+	base = &b->head;
+	spin_lock(&b->lock);
+	hlist_for_each(lhp, base) {
 		/* hlist_for_each_entry_rcu() not required for d_hash list
-		 * as it is parsed under dcache_hash_lock
+		 * as it is parsed under dcache_hash_bucket->lock
 		 */
 		if (dentry == hlist_entry(lhp, struct dentry, d_hash)) {
-			spin_unlock(&dcache_hash_lock);
+			spin_unlock(&b->lock);
 			__dget_locked_dlock(dentry);
 			spin_unlock(&dentry->d_lock);
 			return 1;
 		}
 	}
-	spin_unlock(&dcache_hash_lock);
+	spin_unlock(&b->lock);
 	spin_unlock(&dentry->d_lock);
 out:
 	return 0;
@@ -1739,11 +1764,12 @@ void d_delete(struct dentry * dentry)
 	fsnotify_nameremove(dentry, isdir);
 }
 
-static void __d_rehash(struct dentry * entry, struct hlist_head *list)
+static void __d_rehash(struct dentry * entry, struct dcache_hash_bucket *b)
 {
-
  	entry->d_flags &= ~DCACHE_UNHASHED;
- 	hlist_add_head_rcu(&entry->d_hash, list);
+	spin_lock(&b->lock);
+ 	hlist_add_head_rcu(&entry->d_hash, &b->head);
+	spin_unlock(&b->lock);
 }
 
 static void _d_rehash(struct dentry * entry)
@@ -1761,9 +1787,7 @@ static void _d_rehash(struct dentry * en
 void d_rehash(struct dentry * entry)
 {
 	spin_lock(&entry->d_lock);
-	spin_lock(&dcache_hash_lock);
 	_d_rehash(entry);
-	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&entry->d_lock);
 }
 
@@ -1841,6 +1865,7 @@ static void switch_names(struct dentry *
  */
 static void d_move_locked(struct dentry * dentry, struct dentry * target)
 {
+	struct dcache_hash_bucket *b;
 	if (!dentry->d_inode)
 		printk(KERN_WARNING "VFS: moving negative dcache entry\n");
 
@@ -1869,11 +1894,13 @@ static void d_move_locked(struct dentry
 	}
 
 	/* Move the dentry to the target hash queue, if on different bucket */
-	spin_lock(&dcache_hash_lock);
-	if (!d_unhashed(dentry))
+	if (!d_unhashed(dentry)) {
+		b = d_hash(dentry->d_parent, dentry->d_name.hash);
+		spin_lock(&b->lock);
 		hlist_del_rcu(&dentry->d_hash);
+		spin_unlock(&b->lock);
+	}
 	__d_rehash(dentry, d_hash(target->d_parent, target->d_name.hash));
-	spin_unlock(&dcache_hash_lock);
 
 	/* Unhash the target: dput() will then get rid of it */
 	__d_drop(target);
@@ -2080,9 +2107,7 @@ struct dentry *d_materialise_unique(stru
 found_lock:
 	spin_lock(&actual->d_lock);
 found:
-	spin_lock(&dcache_hash_lock);
 	_d_rehash(actual);
-	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&actual->d_lock);
 	spin_unlock(&dcache_inode_lock);
 out_nolock:
@@ -2534,7 +2559,7 @@ static void __init dcache_init_early(voi
 
 	dentry_hashtable =
 		alloc_large_system_hash("Dentry cache",
-					sizeof(struct hlist_head),
+					sizeof(struct dcache_hash_bucket),
 					dhash_entries,
 					13,
 					HASH_EARLY,
@@ -2542,8 +2567,10 @@ static void __init dcache_init_early(voi
 					&d_hash_mask,
 					0);
 
-	for (loop = 0; loop < (1 << d_hash_shift); loop++)
-		INIT_HLIST_HEAD(&dentry_hashtable[loop]);
+	for (loop = 0; loop < (1 << d_hash_shift); loop++) {
+		spin_lock_init(&dentry_hashtable[loop].lock);
+		INIT_HLIST_HEAD(&dentry_hashtable[loop].head);
+	}
 }
 
 static void __init dcache_init(void)
@@ -2566,7 +2593,7 @@ static void __init dcache_init(void)
 
 	dentry_hashtable =
 		alloc_large_system_hash("Dentry cache",
-					sizeof(struct hlist_head),
+					sizeof(struct dcache_hash_bucket),
 					dhash_entries,
 					13,
 					0,
@@ -2574,8 +2601,10 @@ static void __init dcache_init(void)
 					&d_hash_mask,
 					0);
 
-	for (loop = 0; loop < (1 << d_hash_shift); loop++)
-		INIT_HLIST_HEAD(&dentry_hashtable[loop]);
+	for (loop = 0; loop < (1 << d_hash_shift); loop++) {
+		spin_lock_init(&dentry_hashtable[loop].lock);
+		INIT_HLIST_HEAD(&dentry_hashtable[loop].head);
+	}
 }
 
 /* SLAB cache for __getname() consumers */
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -185,7 +185,6 @@ d_iput:		no		no       yes
 #define DCACHE_COOKIE		0x0040	/* For use by dcookie subsystem */
 
 extern spinlock_t dcache_inode_lock;
-extern spinlock_t dcache_hash_lock;
 extern seqlock_t rename_lock;
 
 /**
@@ -203,23 +202,8 @@ extern seqlock_t rename_lock;
  *
  * __d_drop requires dentry->d_lock.
  */
-
-static inline void __d_drop(struct dentry *dentry)
-{
-	if (!(dentry->d_flags & DCACHE_UNHASHED)) {
-		dentry->d_flags |= DCACHE_UNHASHED;
-		spin_lock(&dcache_hash_lock);
-		hlist_del_rcu(&dentry->d_hash);
-		spin_unlock(&dcache_hash_lock);
-	}
-}
-
-static inline void d_drop(struct dentry *dentry)
-{
-	spin_lock(&dentry->d_lock);
- 	__d_drop(dentry);
-	spin_unlock(&dentry->d_lock);
-}
+void d_drop(struct dentry *dentry);
+void __d_drop(struct dentry *dentry);
 
 static inline int dname_external(struct dentry *dentry)
 {



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 18/27] fs: dcache reduce dcache_inode_lock
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (16 preceding siblings ...)
  2009-04-25  1:20 ` [patch 17/27] fs: dcache per-bucket dcache hash locking npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 19/27] fs: dcache per-inode inode alias locking npiggin
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-dcache-d_delete-less-lock.patch --]
[-- Type: text/plain, Size: 1893 bytes --]

dcache_inode_lock can be avoided in d_delete() and d_materialise_unique()
in cases where it is not required.
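
The shape of the change, as a standalone sketch (using the struct obj and
outer_lock stand-ins from the sketch in patch 16; free_obj() is a made-up
helper): the wider lock nests outside the per-object lock, so it can only
be taken opportunistically here.

static void final_put(struct obj *o)
{
again:
	spin_lock(&o->lock);
	if (o->count == 1) {
		/*
		 * outer_lock nests outside o->lock; blocking on it here
		 * would invert the lock order, so trylock and retry.
		 */
		if (!spin_trylock(&outer_lock)) {
			spin_unlock(&o->lock);
			goto again;	/* contention is rare, retry is cheap */
		}
		free_obj(o);		/* frees o, drops both locks */
		return;
	}
	o->count--;			/* not the last ref: no wider lock */
	spin_unlock(&o->lock);
}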
---
 fs/dcache.c |   23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1746,10 +1746,14 @@ void d_delete(struct dentry * dentry)
 	/*
 	 * Are we the only user?
 	 */
-	spin_lock(&dcache_inode_lock);
+again:
 	spin_lock(&dentry->d_lock);
 	isdir = S_ISDIR(dentry->d_inode->i_mode);
 	if (dentry->d_count == 1) {
+		if (!spin_trylock(&dcache_inode_lock)) {
+			spin_unlock(&dentry->d_lock);
+			goto again;
+		}
 		dentry_iput(dentry);
 		fsnotify_nameremove(dentry, isdir);
 		return;
@@ -1759,7 +1763,6 @@ void d_delete(struct dentry * dentry)
 		__d_drop(dentry);
 
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_inode_lock);
 
 	fsnotify_nameremove(dentry, isdir);
 }
@@ -2066,14 +2069,15 @@ struct dentry *d_materialise_unique(stru
 
 	BUG_ON(!d_unhashed(dentry));
 
-	spin_lock(&dcache_inode_lock);
-
 	if (!inode) {
 		actual = dentry;
 		__d_instantiate(dentry, NULL);
-		goto found_lock;
+		d_rehash(actual);
+		goto out_nolock;
 	}
 
+	spin_lock(&dcache_inode_lock);
+
 	if (S_ISDIR(inode->i_mode)) {
 		struct dentry *alias;
 
@@ -2101,10 +2105,9 @@ struct dentry *d_materialise_unique(stru
 	actual = __d_instantiate_unique(dentry, inode);
 	if (!actual)
 		actual = dentry;
-	else if (unlikely(!d_unhashed(actual)))
-		goto shouldnt_be_hashed;
+	else
+		BUG_ON(!d_unhashed(actual));
 
-found_lock:
 	spin_lock(&actual->d_lock);
 found:
 	_d_rehash(actual);
@@ -2118,10 +2121,6 @@ out_nolock:
 
 	iput(inode);
 	return actual;
-
-shouldnt_be_hashed:
-	spin_unlock(&dcache_inode_lock);
-	BUG();
 }
 
 static int prepend(char **buffer, int *buflen, const char *str, int namelen)



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 19/27] fs: dcache per-inode inode alias locking
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (17 preceding siblings ...)
  2009-04-25  1:20 ` [patch 18/27] fs: dcache reduce dcache_inode_lock npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 20/27] fs: icache lock s_inodes list npiggin
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: dcache-split-inode_lock.patch --]
[-- Type: text/plain, Size: 15691 bytes --]

dcache_inode_lock can be replaced with per-inode locking. Use the existing
inode->i_lock for this. This is slightly non-trivial because we sometimes
need to find the inode from the dentry, which requires d_inode to be
stabilised (either with a refcount or with d_lock).

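The stabilisation dance, in sketch form (simplified stand-ins, not the
real code): ->backing can only change under o->lock, and backing->lock
orders before o->lock, so peek under o->lock, trylock, and retry on
contention.

struct backing {
	spinlock_t lock;		/* plays the role of inode->i_lock */
};

struct obj {
	spinlock_t lock;		/* plays the role of dentry->d_lock */
	struct backing *backing;	/* plays the role of dentry->d_inode */
};

/* Returns with o->lock held, and with the backing's lock held if any. */
static struct backing *lock_obj_and_backing(struct obj *o)
{
	struct backing *b;

relock:
	spin_lock(&o->lock);
	b = o->backing;		/* only stable while o->lock is held */
	if (b && !spin_trylock(&b->lock)) {
		/* b->lock orders before o->lock: back off and retry. */
		spin_unlock(&o->lock);
		goto relock;
	}
	return b;
}
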
---
 fs/affs/amigaffs.c          |    4 -
 fs/dcache.c                 |  118 +++++++++++++++++++++++++-------------------
 fs/exportfs/expfs.c         |   12 ++--
 fs/nfs/getroot.c            |    4 -
 fs/notify/inotify/inotify.c |    4 -
 fs/ocfs2/dcache.c           |    4 -
 fs/sysfs/dir.c              |    6 +-
 include/linux/dcache.h      |    1 
 8 files changed, 87 insertions(+), 66 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -36,7 +36,7 @@
 
 /*
  * Usage:
- * dcache_inode_lock protects:
+ * dentry->d_inode->i_lock protects:
  *   - the inode alias lists, d_inode
  * dcache_hash_bucket->lock protects:
  *   - the dcache hash table
@@ -50,7 +50,7 @@
  *   - d_subdirs and children's d_child
  *
  * Ordering:
- * dcache_inode_lock
+ * dentry->d_inode->i_lock
  *   dentry->d_lock
  *     dcache_lru_lock
  *     dcache_hash_bucket->lock
@@ -58,12 +58,9 @@
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_inode_lock);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
-EXPORT_SYMBOL(dcache_inode_lock);
-
 static struct kmem_cache *dentry_cache __read_mostly;
 
 #define DNAME_INLINE_LEN (sizeof(struct dentry)-offsetof(struct dentry,d_iname))
@@ -138,14 +135,13 @@ static void d_free(struct dentry *dentry
  */
 static void dentry_iput(struct dentry * dentry)
 	__releases(dentry->d_lock)
-	__releases(dcache_inode_lock)
 {
 	struct inode *inode = dentry->d_inode;
 	if (inode) {
 		dentry->d_inode = NULL;
 		list_del_init(&dentry->d_alias);
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_inode_lock);
+		spin_unlock(&inode->i_lock);
 		if (!inode->i_nlink)
 			fsnotify_inoderemove(inode);
 		if (dentry->d_op && dentry->d_op->d_iput)
@@ -154,7 +150,6 @@ static void dentry_iput(struct dentry *
 			iput(inode);
 	} else {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_inode_lock);
 	}
 }
 
@@ -225,7 +220,6 @@ static void dentry_lru_del_init(struct d
  */
 static struct dentry *d_kill(struct dentry *dentry)
 	__releases(dentry->d_lock)
-	__releases(dcache_inode_lock)
 {
 	struct dentry *parent;
 
@@ -341,6 +335,7 @@ EXPORT_SYMBOL(dget_parent);
 void dput(struct dentry *dentry)
 {
 	struct dentry *parent;
+	struct inode *inode;
 
 	if (!dentry)
 		return;
@@ -376,17 +371,24 @@ repeat:
 	spin_unlock(&dentry->d_lock);
   	return;
 
-kill_it:
-	spin_unlock(&dentry->d_lock);
-	spin_lock(&dcache_inode_lock);
-relock:
+relock1:
 	spin_lock(&dentry->d_lock);
+kill_it:
+	inode = dentry->d_inode;
+	if (inode) {
+		if (!spin_trylock(&inode->i_lock)) {
+relock2:
+			spin_unlock(&dentry->d_lock);
+			goto relock1;
+		}
+	}
 	parent = dentry->d_parent;
 	if (parent) {
 		BUG_ON(parent == dentry);
 		if (!spin_trylock(&parent->d_lock)) {
-			spin_unlock(&dentry->d_lock);
-			goto relock;
+			if (inode)
+				spin_unlock(&inode->i_lock);
+			goto relock2;
 		}
 	}
 	dentry->d_count--;
@@ -394,7 +396,8 @@ relock:
 		/* This case should be fine */
 		spin_unlock(&dentry->d_lock);
 		if (parent) spin_unlock(&parent->d_lock);
-		spin_unlock(&dcache_inode_lock);
+		if (inode)
+			spin_unlock(&inode->i_lock);
 		return;
 	}
 	/* if dentry was on the d_lru list delete it from there */
@@ -510,9 +513,9 @@ struct dentry * d_find_alias(struct inod
 	struct dentry *de = NULL;
 
 	if (!list_empty(&inode->i_dentry)) {
-		spin_lock(&dcache_inode_lock);
+		spin_lock(&inode->i_lock);
 		de = __d_find_alias(inode, 0);
-		spin_unlock(&dcache_inode_lock);
+		spin_unlock(&inode->i_lock);
 	}
 	return de;
 }
@@ -525,20 +528,20 @@ void d_prune_aliases(struct inode *inode
 {
 	struct dentry *dentry;
 restart:
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
 		if (!dentry->d_count) {
 			__dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_inode_lock);
+			spin_unlock(&inode->i_lock);
 			dput(dentry);
 			goto restart;
 		}
 		spin_unlock(&dentry->d_lock);
 	}
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 }
 
 /*
@@ -560,8 +563,10 @@ static void prune_one_dentry(struct dent
 	 */
 	while (dentry) {
 		struct dentry *parent = NULL;
+		struct inode *inode = dentry->d_inode;
 
-		spin_lock(&dcache_inode_lock);
+		if (inode)
+			spin_lock(&inode->i_lock);
 again:
 		spin_lock(&dentry->d_lock);
 		if (dentry->d_parent && dentry != dentry->d_parent) {
@@ -576,7 +581,8 @@ again:
 			if (parent)
 				spin_unlock(&parent->d_lock);
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_inode_lock);
+			if (inode)
+				spin_unlock(&inode->i_lock);
 			return;
 		}
 
@@ -645,10 +651,11 @@ restart:
 	}
 	spin_unlock(&dcache_lru_lock);
 
-	spin_lock(&dcache_inode_lock);
 again:
 	spin_lock(&dcache_lru_lock); /* lru_lock also protects tmp list */
 	while (!list_empty(&tmp)) {
+		struct inode *inode;
+
 		dentry = list_entry(tmp.prev, struct dentry, d_lru);
 
 		if (!spin_trylock(&dentry->d_lock)) {
@@ -666,11 +673,18 @@ again1:
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
+		inode = dentry->d_inode;
+		if (inode && !spin_trylock(&inode->i_lock)) {
+again2:
+			spin_unlock(&dentry->d_lock);
+			goto again1;
+		}
 		if (dentry->d_parent) {
 			BUG_ON(dentry == dentry->d_parent);
 			if (!spin_trylock(&dentry->d_parent->d_lock)) {
-				spin_unlock(&dentry->d_lock);
-				goto again1;
+				if (inode)
+					spin_unlock(&inode->i_lock);
+				goto again2;
 			}
 		}
 		__dentry_lru_del_init(dentry);
@@ -678,10 +692,8 @@ again1:
 
 		prune_one_dentry(dentry);
 		/* dentry->d_lock dropped */
-		spin_lock(&dcache_inode_lock);
 		spin_lock(&dcache_lru_lock);
 	}
-	spin_unlock(&dcache_inode_lock);
 
 	if (count == NULL && !list_empty(&sb->s_dentry_lru))
 		goto restart;
@@ -1244,9 +1256,11 @@ static void __d_instantiate(struct dentr
 void d_instantiate(struct dentry *entry, struct inode * inode)
 {
 	BUG_ON(!list_empty(&entry->d_alias));
-	spin_lock(&dcache_inode_lock);
+	if (inode)
+		spin_lock(&inode->i_lock);
 	__d_instantiate(entry, inode);
-	spin_unlock(&dcache_inode_lock);
+	if (inode)
+		spin_unlock(&inode->i_lock);
 	security_d_instantiate(entry, inode);
 }
 
@@ -1304,9 +1318,11 @@ struct dentry *d_instantiate_unique(stru
 
 	BUG_ON(!list_empty(&entry->d_alias));
 
-	spin_lock(&dcache_inode_lock);
+	if (inode)
+		spin_lock(&inode->i_lock);
 	result = __d_instantiate_unique(entry, inode);
-	spin_unlock(&dcache_inode_lock);
+	if (inode)
+		spin_unlock(&inode->i_lock);
 
 	if (!result) {
 		security_d_instantiate(entry, inode);
@@ -1386,10 +1402,10 @@ struct dentry *d_obtain_alias(struct ino
 	}
 	tmp->d_parent = tmp; /* make sure dput doesn't croak */
 
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	res = __d_find_alias(inode, 0);
 	if (res) {
-		spin_unlock(&dcache_inode_lock);
+		spin_unlock(&inode->i_lock);
 		dput(tmp);
 		goto out_iput;
 	}
@@ -1403,7 +1419,7 @@ struct dentry *d_obtain_alias(struct ino
 	list_add(&tmp->d_alias, &inode->i_dentry);
 	hlist_add_head(&tmp->d_hash, &inode->i_sb->s_anon);
 	spin_unlock(&tmp->d_lock);
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 
 	return tmp;
 
@@ -1434,19 +1450,19 @@ struct dentry *d_splice_alias(struct ino
 	struct dentry *new = NULL;
 
 	if (inode && S_ISDIR(inode->i_mode)) {
-		spin_lock(&dcache_inode_lock);
+		spin_lock(&inode->i_lock);
 		new = __d_find_alias(inode, 1);
 		if (new) {
 			BUG_ON(!(new->d_flags & DCACHE_DISCONNECTED));
-			spin_unlock(&dcache_inode_lock);
+			spin_unlock(&inode->i_lock);
 			security_d_instantiate(new, inode);
 			d_rehash(dentry);
 			d_move(new, dentry);
 			iput(inode);
 		} else {
-			/* already taken dcache_inode_lock, d_add() by hand */
+			/* already taken inode->i_lock, d_add() by hand */
 			__d_instantiate(dentry, inode);
-			spin_unlock(&dcache_inode_lock);
+			spin_unlock(&inode->i_lock);
 			security_d_instantiate(dentry, inode);
 			d_rehash(dentry);
 		}
@@ -1518,10 +1534,10 @@ struct dentry *d_add_ci(struct dentry *d
 	 * Negative dentry: instantiate it unless the inode is a directory and
 	 * already has a dentry.
 	 */
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	if (!S_ISDIR(inode->i_mode) || list_empty(&inode->i_dentry)) {
 		__d_instantiate(found, inode);
-		spin_unlock(&dcache_inode_lock);
+		spin_unlock(&inode->i_lock);
 		security_d_instantiate(found, inode);
 		return found;
 	}
@@ -1532,7 +1548,7 @@ struct dentry *d_add_ci(struct dentry *d
 	 */
 	new = list_entry(inode->i_dentry.next, struct dentry, d_alias);
 	dget_locked(new);
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 	security_d_instantiate(found, inode);
 	d_move(new, found);
 	iput(inode);
@@ -1742,15 +1758,17 @@ out:
  
 void d_delete(struct dentry * dentry)
 {
+	struct inode *inode;
 	int isdir = 0;
 	/*
 	 * Are we the only user?
 	 */
 again:
 	spin_lock(&dentry->d_lock);
-	isdir = S_ISDIR(dentry->d_inode->i_mode);
+	inode = dentry->d_inode;
+	isdir = S_ISDIR(inode->i_mode);
 	if (dentry->d_count == 1) {
-		if (!spin_trylock(&dcache_inode_lock)) {
+		if (inode && !spin_trylock(&inode->i_lock)) {
 			spin_unlock(&dentry->d_lock);
 			goto again;
 		}
@@ -1983,6 +2001,7 @@ static struct dentry *__d_unalias(struct
 {
 	struct mutex *m1 = NULL, *m2 = NULL;
 	struct dentry *ret;
+	struct inode *dir;
 
 	/* If alias and dentry share a parent, then no extra locks required */
 	if (alias->d_parent == dentry->d_parent)
@@ -1998,14 +2017,15 @@ static struct dentry *__d_unalias(struct
 	if (!mutex_trylock(&dentry->d_sb->s_vfs_rename_mutex))
 		goto out_err;
 	m1 = &dentry->d_sb->s_vfs_rename_mutex;
-	if (!mutex_trylock(&alias->d_parent->d_inode->i_mutex))
+	dir = alias->d_parent->d_inode;
+	if (!mutex_trylock(&dir->i_mutex))
 		goto out_err;
-	m2 = &alias->d_parent->d_inode->i_mutex;
+	m2 = &dir->i_mutex;
 out_unalias:
 	d_move_locked(alias, dentry);
 	ret = alias;
 out_err:
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&alias->d_inode->i_lock);
 	if (m2)
 		mutex_unlock(m2);
 	if (m1)
@@ -2076,7 +2096,7 @@ struct dentry *d_materialise_unique(stru
 		goto out_nolock;
 	}
 
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 
 	if (S_ISDIR(inode->i_mode)) {
 		struct dentry *alias;
@@ -2112,7 +2132,7 @@ struct dentry *d_materialise_unique(stru
 found:
 	_d_rehash(actual);
 	spin_unlock(&actual->d_lock);
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 out_nolock:
 	if (actual == dentry) {
 		security_d_instantiate(dentry, inode);
Index: linux-2.6/fs/sysfs/dir.c
===================================================================
--- linux-2.6.orig/fs/sysfs/dir.c
+++ linux-2.6/fs/sysfs/dir.c
@@ -547,7 +547,7 @@ static void sysfs_drop_dentry(struct sys
 	 * dput to immediately free the dentry  if it is not in use.
 	 */
 repeat:
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
 		if (d_unhashed(dentry)) {
@@ -557,11 +557,11 @@ repeat:
 		dget_locked_dlock(dentry);
 		__d_drop(dentry);
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_inode_lock);
+		spin_unlock(&inode->i_lock);
 		dput(dentry);
 		goto repeat;
 	}
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 
 	/* adjust nlink and update timestamp */
 	mutex_lock(&inode->i_mutex);
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -184,7 +184,6 @@ d_iput:		no		no       yes
 
 #define DCACHE_COOKIE		0x0040	/* For use by dcookie subsystem */
 
-extern spinlock_t dcache_inode_lock;
 extern seqlock_t rename_lock;
 
 /**
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -184,7 +184,7 @@ static void set_dentry_child_flags(struc
 {
 	struct dentry *alias;
 
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	list_for_each_entry(alias, &inode->i_dentry, d_alias) {
 		struct dentry *child;
 
@@ -202,7 +202,7 @@ static void set_dentry_child_flags(struc
 		}
 		spin_unlock(&alias->d_lock);
 	}
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 }
 
 /*
Index: linux-2.6/fs/exportfs/expfs.c
===================================================================
--- linux-2.6.orig/fs/exportfs/expfs.c
+++ linux-2.6/fs/exportfs/expfs.c
@@ -43,24 +43,26 @@ find_acceptable_alias(struct dentry *res
 		void *context)
 {
 	struct dentry *dentry, *toput = NULL;
+	struct inode *inode;
 
 	if (acceptable(context, result))
 		return result;
 
-	spin_lock(&dcache_inode_lock);
-	list_for_each_entry(dentry, &result->d_inode->i_dentry, d_alias) {
+	inode = result->d_inode;
+	spin_lock(&inode->i_lock);
+	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		dget_locked(dentry);
-		spin_unlock(&dcache_inode_lock);
+		spin_unlock(&inode->i_lock);
 		if (toput)
 			dput(toput);
 		if (dentry != result && acceptable(context, dentry)) {
 			dput(result);
 			return dentry;
 		}
-		spin_lock(&dcache_inode_lock);
+		spin_lock(&inode->i_lock);
 		toput = dentry;
 	}
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 
 	if (toput)
 		dput(toput);
Index: linux-2.6/fs/affs/amigaffs.c
===================================================================
--- linux-2.6.orig/fs/affs/amigaffs.c
+++ linux-2.6/fs/affs/amigaffs.c
@@ -128,7 +128,7 @@ affs_fix_dcache(struct dentry *dentry, u
 	void *data = dentry->d_fsdata;
 	struct list_head *head, *next;
 
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	head = &inode->i_dentry;
 	next = head->next;
 	while (next != head) {
@@ -139,7 +139,7 @@ affs_fix_dcache(struct dentry *dentry, u
 		}
 		next = next->next;
 	}
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 }
 
 
Index: linux-2.6/fs/nfs/getroot.c
===================================================================
--- linux-2.6.orig/fs/nfs/getroot.c
+++ linux-2.6/fs/nfs/getroot.c
@@ -65,11 +65,11 @@ static int nfs_superblock_set_dummy_root
 		 * This again causes shrink_dcache_for_umount_subtree() to
 		 * Oops, since the test for IS_ROOT() will fail.
 		 */
-		spin_lock(&dcache_inode_lock);
+		spin_lock(&sb->s_root->d_inode->i_lock);
 		spin_lock(&sb->s_root->d_lock);
 		list_del_init(&sb->s_root->d_alias);
 		spin_unlock(&sb->s_root->d_lock);
-		spin_unlock(&dcache_inode_lock);
+		spin_unlock(&sb->s_root->d_inode->i_lock);
 	}
 	return 0;
 }
Index: linux-2.6/fs/ocfs2/dcache.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dcache.c
+++ linux-2.6/fs/ocfs2/dcache.c
@@ -140,7 +140,7 @@ struct dentry *ocfs2_find_local_alias(st
 	struct list_head *p;
 	struct dentry *dentry = NULL;
 
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	list_for_each(p, &inode->i_dentry) {
 		dentry = list_entry(p, struct dentry, d_alias);
 
@@ -158,7 +158,7 @@ struct dentry *ocfs2_find_local_alias(st
 		dentry = NULL;
 	}
 
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 
 	return dentry;
 }



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 20/27] fs: icache lock s_inodes list
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (18 preceding siblings ...)
  2009-04-25  1:20 ` [patch 19/27] fs: dcache per-inode inode alias locking npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 21/27] fs: icache lock inode hash npiggin
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-inode_lock-scale.patch --]
[-- Type: text/plain, Size: 7498 bytes --]

Protect sb->s_inodes with a new lock, sb_inode_list_lock.
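
All the s_inodes walkers below follow the same walk-pin-drop shape.
Condensed sketch (do_blocking_work() is a stand-in; the I_FREEING etc.
state checks of the real walkers are omitted):

static void do_blocking_work(struct inode *inode);	/* stand-in */

static void walk_sb_inodes(struct super_block *sb)
{
	struct inode *inode, *toput = NULL;

	spin_lock(&inode_lock);
	spin_lock(&sb_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		__iget(inode);		/* pin: the cursor can't be freed */
		spin_unlock(&sb_inode_list_lock);
		spin_unlock(&inode_lock);

		do_blocking_work(inode);	/* may sleep */
		/*
		 * iput() may itself take inode_lock and sleep, so the
		 * previous pin is dropped out here, one iteration late;
		 * the current pin keeps our place in the list valid.
		 */
		iput(toput);
		toput = inode;

		spin_lock(&inode_lock);
		spin_lock(&sb_inode_list_lock);
	}
	spin_unlock(&sb_inode_list_lock);
	spin_unlock(&inode_lock);
	iput(toput);			/* drop the final pin */
}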
---
 fs/drop_caches.c            |    4 ++++
 fs/fs-writeback.c           |    4 ++++
 fs/hugetlbfs/inode.c        |    2 ++
 fs/inode.c                  |   12 ++++++++++++
 fs/notify/inotify/inotify.c |    2 ++
 fs/quota/dquot.c            |    6 ++++++
 include/linux/writeback.h   |    1 +
 7 files changed, 31 insertions(+)

Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c
+++ linux-2.6/fs/drop_caches.c
@@ -17,18 +17,22 @@ static void drop_pagecache_sb(struct sup
 	struct inode *inode, *toput_inode = NULL;
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
 			continue;
 		if (inode->i_mapping->nrpages == 0)
 			continue;
 		__iget(inode);
+		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 		__invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
 		iput(toput_inode);
 		toput_inode = inode;
 		spin_lock(&inode_lock);
+		spin_lock(&sb_inode_list_lock);
 	}
+	spin_unlock(&sb_inode_list_lock);
 	spin_unlock(&inode_lock);
 	iput(toput_inode);
 }
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -553,6 +553,7 @@ void generic_sync_sb_inodes(struct super
 		 * In which case, the inode may not be on the dirty list, but
 		 * we still have to wait for that writeout.
 		 */
+		spin_lock(&sb_inode_list_lock);
 		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 			struct address_space *mapping;
 
@@ -563,6 +564,7 @@ void generic_sync_sb_inodes(struct super
 			if (mapping->nrpages == 0)
 				continue;
 			__iget(inode);
+			spin_unlock(&sb_inode_list_lock);
 			spin_unlock(&inode_lock);
 			/*
 			 * We hold a reference to 'inode' so it couldn't have
@@ -580,7 +582,9 @@ void generic_sync_sb_inodes(struct super
 			cond_resched();
 
 			spin_lock(&inode_lock);
+			spin_lock(&sb_inode_list_lock);
 		}
+		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 		iput(old_inode);
 	} else
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -83,6 +83,7 @@ static struct hlist_head *inode_hashtabl
  * the i_state of an inode while it is in use..
  */
 DEFINE_SPINLOCK(inode_lock);
+DEFINE_SPINLOCK(sb_inode_list_lock);
 
 /*
  * iprune_mutex provides exclusion between the kswapd or try_to_free_pages
@@ -329,7 +330,9 @@ static void dispose_list(struct list_hea
 
 		spin_lock(&inode_lock);
 		hlist_del_init(&inode->i_hash);
+		spin_lock(&sb_inode_list_lock);
 		list_del_init(&inode->i_sb_list);
+		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 
 		wake_up_inode(inode);
@@ -361,6 +364,7 @@ static int invalidate_list(struct list_h
 		 * shrink_icache_memory() away.
 		 */
 		cond_resched_lock(&inode_lock);
+		cond_resched_lock(&sb_inode_list_lock);
 
 		next = next->next;
 		if (tmp == head)
@@ -398,8 +402,10 @@ int invalidate_inodes(struct super_block
 
 	mutex_lock(&iprune_mutex);
 	spin_lock(&inode_lock);
+	spin_lock(&sb_inode_list_lock);
 	inotify_unmount_inodes(&sb->s_inodes);
 	busy = invalidate_list(&sb->s_inodes, &throw_away);
+	spin_unlock(&sb_inode_list_lock);
 	spin_unlock(&inode_lock);
 
 	dispose_list(&throw_away);
@@ -584,7 +590,9 @@ __inode_add_to_lists(struct super_block
 {
 	inodes_stat.nr_inodes++;
 	list_add(&inode->i_list, &inode_in_use);
+	spin_lock(&sb_inode_list_lock);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
+	spin_unlock(&sb_inode_list_lock);
 	if (head)
 		hlist_add_head(&inode->i_hash, head);
 }
@@ -1159,7 +1167,9 @@ void generic_delete_inode(struct inode *
 	const struct super_operations *op = inode->i_sb->s_op;
 
 	list_del_init(&inode->i_list);
+	spin_lock(&sb_inode_list_lock);
 	list_del_init(&inode->i_sb_list);
+	spin_unlock(&sb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 	inodes_stat.nr_inodes--;
@@ -1213,7 +1223,9 @@ static void generic_forget_inode(struct
 		hlist_del_init(&inode->i_hash);
 	}
 	list_del_init(&inode->i_list);
+	spin_lock(&sb_inode_list_lock);
 	list_del_init(&inode->i_sb_list);
+	spin_unlock(&sb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 	inodes_stat.nr_inodes--;
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -430,6 +430,7 @@ void inotify_unmount_inodes(struct list_
 		 * will be added since the umount has begun.  Finally,
 		 * iprune_mutex keeps shrink_icache_memory() away.
 		 */
+		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 
 		if (need_iput_tmp)
@@ -452,6 +453,7 @@ void inotify_unmount_inodes(struct list_
 		iput(inode);		
 
 		spin_lock(&inode_lock);
+		spin_lock(&sb_inode_list_lock);
 	}
 }
 EXPORT_SYMBOL_GPL(inotify_unmount_inodes);
Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c
+++ linux-2.6/fs/quota/dquot.c
@@ -822,6 +822,7 @@ static void add_dquot_ref(struct super_b
 	struct inode *inode, *old_inode = NULL;
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
 			continue;
@@ -831,6 +832,7 @@ static void add_dquot_ref(struct super_b
 			continue;
 
 		__iget(inode);
+		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 
 		iput(old_inode);
@@ -842,7 +844,9 @@ static void add_dquot_ref(struct super_b
 		 * keep the reference and iput it later. */
 		old_inode = inode;
 		spin_lock(&inode_lock);
+		spin_lock(&sb_inode_list_lock);
 	}
+	spin_unlock(&sb_inode_list_lock);
 	spin_unlock(&inode_lock);
 	iput(old_inode);
 }
@@ -914,6 +918,7 @@ static void remove_dquot_ref(struct supe
 	struct inode *inode;
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
@@ -924,6 +929,7 @@ static void remove_dquot_ref(struct supe
 		if (!IS_NOQUOTA(inode))
 			remove_inode_dquot_ref(inode, type, tofree_head);
 	}
+	spin_unlock(&sb_inode_list_lock);
 	spin_unlock(&inode_lock);
 }
 
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -10,6 +10,7 @@
 struct backing_dev_info;
 
 extern spinlock_t inode_lock;
+extern spinlock_t sb_inode_list_lock;
 extern struct list_head inode_in_use;
 extern struct list_head inode_unused;
 
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -413,7 +413,9 @@ static void hugetlbfs_forget_inode(struc
 		hlist_del_init(&inode->i_hash);
 	}
 	list_del_init(&inode->i_list);
+	spin_lock(&sb_inode_list_lock);
 	list_del_init(&inode->i_sb_list);
+	spin_unlock(&sb_inode_list_lock);
 	inode->i_state |= I_FREEING;
 	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 21/27] fs: icache lock inode hash
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (19 preceding siblings ...)
  2009-04-25  1:20 ` [patch 20/27] fs: icache lock s_inodes list npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 22/27] fs: icache lock i_state npiggin
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-inode_lock-scale-2.patch --]
[-- Type: text/plain, Size: 5077 bytes --]

Add a new lock, inode_hash_lock, to protect the inode hash table lists.
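
The lookup side shows why the repeat label matters: we must not sleep
holding the chain lock, and once it is dropped the chain can change under
us. Condensed from the find_inode_fast() hunk below (the caller still
holds inode_lock):

repeat:
	spin_lock(&inode_hash_lock);
	hlist_for_each_entry(inode, node, head, i_hash) {
		if (inode->i_ino != ino || inode->i_sb != sb)
			continue;
		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
			/* Must not sleep with the chain lock held. */
			spin_unlock(&inode_hash_lock);
			__wait_on_freeing_inode(inode);
			goto repeat;	/* the chain may have changed */
		}
		break;
	}
	spin_unlock(&inode_hash_lock);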
---
 fs/hugetlbfs/inode.c      |    2 ++
 fs/inode.c                |   26 +++++++++++++++++++++++++-
 include/linux/writeback.h |    1 +
 3 files changed, 28 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -410,7 +410,9 @@ static void hugetlbfs_forget_inode(struc
 		spin_lock(&inode_lock);
 		inode->i_state &= ~I_WILL_FREE;
 		inodes_stat.nr_unused--;
+		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
+		spin_unlock(&inode_hash_lock);
 	}
 	list_del_init(&inode->i_list);
 	spin_lock(&sb_inode_list_lock);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -84,6 +84,7 @@ static struct hlist_head *inode_hashtabl
  */
 DEFINE_SPINLOCK(inode_lock);
 DEFINE_SPINLOCK(sb_inode_list_lock);
+DEFINE_SPINLOCK(inode_hash_lock);
 
 /*
  * iprune_mutex provides exclusion between the kswapd or try_to_free_pages
@@ -329,7 +330,9 @@ static void dispose_list(struct list_hea
 		clear_inode(inode);
 
 		spin_lock(&inode_lock);
+		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
+		spin_unlock(&inode_hash_lock);
 		spin_lock(&sb_inode_list_lock);
 		list_del_init(&inode->i_sb_list);
 		spin_unlock(&sb_inode_list_lock);
@@ -536,17 +539,20 @@ static struct inode * find_inode(struct
 	struct inode * inode = NULL;
 
 repeat:
+	spin_lock(&inode_hash_lock);
 	hlist_for_each_entry(inode, node, head, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
 		if (!test(inode, data))
 			continue;
 		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
+			spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
 		break;
 	}
+	spin_unlock(&inode_hash_lock);
 	return node ? inode : NULL;
 }
 
@@ -560,17 +566,20 @@ static struct inode * find_inode_fast(st
 	struct inode * inode = NULL;
 
 repeat:
+	spin_lock(&inode_hash_lock);
 	hlist_for_each_entry(inode, node, head, i_hash) {
 		if (inode->i_ino != ino)
 			continue;
 		if (inode->i_sb != sb)
 			continue;
 		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
+			spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
 		break;
 	}
+	spin_unlock(&inode_hash_lock);
 	return node ? inode : NULL;
 }
 
@@ -593,8 +602,11 @@ __inode_add_to_lists(struct super_block
 	spin_lock(&sb_inode_list_lock);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	spin_unlock(&sb_inode_list_lock);
-	if (head)
+	if (head) {
+		spin_lock(&inode_hash_lock);
 		hlist_add_head(&inode->i_hash, head);
+		spin_unlock(&inode_hash_lock);
+	}
 }
 
 /**
@@ -1070,7 +1082,9 @@ int insert_inode_locked(struct inode *in
 		spin_lock(&inode_lock);
 		old = find_inode_fast(sb, head, ino);
 		if (likely(!old)) {
+			spin_lock(&inode_hash_lock);
 			hlist_add_head(&inode->i_hash, head);
+			spin_unlock(&inode_hash_lock);
 			spin_unlock(&inode_lock);
 			return 0;
 		}
@@ -1100,7 +1114,9 @@ int insert_inode_locked4(struct inode *i
 		spin_lock(&inode_lock);
 		old = find_inode(sb, head, test, data);
 		if (likely(!old)) {
+			spin_lock(&inode_hash_lock);
 			hlist_add_head(&inode->i_hash, head);
+			spin_unlock(&inode_hash_lock);
 			spin_unlock(&inode_lock);
 			return 0;
 		}
@@ -1129,7 +1145,9 @@ void __insert_inode_hash(struct inode *i
 {
 	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
 	spin_lock(&inode_lock);
+	spin_lock(&inode_hash_lock);
 	hlist_add_head(&inode->i_hash, head);
+	spin_unlock(&inode_hash_lock);
 	spin_unlock(&inode_lock);
 }
 
@@ -1144,7 +1162,9 @@ EXPORT_SYMBOL(__insert_inode_hash);
 void remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode_lock);
+	spin_lock(&inode_hash_lock);
 	hlist_del_init(&inode->i_hash);
+	spin_unlock(&inode_hash_lock);
 	spin_unlock(&inode_lock);
 }
 
@@ -1191,7 +1211,9 @@ void generic_delete_inode(struct inode *
 		clear_inode(inode);
 	}
 	spin_lock(&inode_lock);
+	spin_lock(&inode_hash_lock);
 	hlist_del_init(&inode->i_hash);
+	spin_unlock(&inode_hash_lock);
 	spin_unlock(&inode_lock);
 	wake_up_inode(inode);
 	BUG_ON(inode->i_state != I_CLEAR);
@@ -1220,7 +1242,9 @@ static void generic_forget_inode(struct
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
 		inodes_stat.nr_unused--;
+		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
+		spin_unlock(&inode_hash_lock);
 	}
 	list_del_init(&inode->i_list);
 	spin_lock(&sb_inode_list_lock);
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -11,6 +11,7 @@ struct backing_dev_info;
 
 extern spinlock_t inode_lock;
 extern spinlock_t sb_inode_list_lock;
+extern spinlock_t inode_hash_lock;
 extern struct list_head inode_in_use;
 extern struct list_head inode_unused;
 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 22/27] fs: icache lock i_state
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (20 preceding siblings ...)
  2009-04-25  1:20 ` [patch 21/27] fs: icache lock inode hash npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 23/27] fs: icache lock i_count npiggin
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-inode_lock-scale-3.patch --]
[-- Type: text/plain, Size: 17236 bytes --]

Protect i_state updates with i_lock.
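
The recurring shape after this change: the i_state test and the reference
grab happen under i_lock, so they are atomic with respect to the inode
being torn down. Sketch (try_pin() is a made-up helper, not part of the
patch):

static struct inode *try_pin(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW)) {
		spin_unlock(&inode->i_lock);
		return NULL;	/* being freed or not yet set up */
	}
	__iget(inode);		/* __iget() now asserts i_lock is held */
	spin_unlock(&inode->i_lock);
	return inode;
}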
---
 fs/drop_caches.c     |    9 ++++--
 fs/fs-writeback.c    |   46 ++++++++++++++++++++++++---------
 fs/hugetlbfs/inode.c |    6 ++++
 fs/inode.c           |   71 +++++++++++++++++++++++++++++++++++++++++++--------
 fs/nilfs2/gcdat.c    |    1 
 fs/quota/dquot.c     |   14 +++++++---
 6 files changed, 118 insertions(+), 29 deletions(-)

Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c
+++ linux-2.6/fs/drop_caches.c
@@ -19,11 +19,14 @@ static void drop_pagecache_sb(struct sup
 	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
-			continue;
-		if (inode->i_mapping->nrpages == 0)
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW)
+				|| inode->i_mapping->nrpages == 0) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 		__iget(inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 		__invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -118,6 +118,7 @@ void __mark_inode_dirty(struct inode *in
 		struct dentry *dentry = NULL;
 		const char *name = "?";
 
+		/* XXX: someone forgot their locking here */
 		if (!list_empty(&inode->i_dentry)) {
 			dentry = list_entry(inode->i_dentry.next,
 					    struct dentry, d_alias);
@@ -133,6 +134,7 @@ void __mark_inode_dirty(struct inode *in
 	}
 
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
 
@@ -167,6 +169,7 @@ void __mark_inode_dirty(struct inode *in
 		}
 	}
 out:
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 }
 
@@ -296,6 +299,7 @@ __sync_single_inode(struct inode *inode,
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY;
 
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 
 	ret = do_writepages(mapping, wbc);
@@ -314,6 +318,7 @@ __sync_single_inode(struct inode *inode,
 	}
 
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
@@ -419,10 +424,12 @@ __writeback_single_inode(struct inode *i
 
 		wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 		do {
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 			__wait_on_bit(wqh, &wq, inode_wait,
 							TASK_UNINTERRUPTIBLE);
 			spin_lock(&inode_lock);
+			spin_lock(&inode->i_lock);
 		} while (inode->i_state & I_SYNC);
 	}
 	return __sync_single_inode(inode, wbc);
@@ -487,11 +494,6 @@ void generic_sync_sb_inodes(struct super
 			break;
 		}
 
-		if (inode->i_state & I_NEW) {
-			requeue_io(inode);
-			continue;
-		}
-
 		if (wbc->nonblocking && bdi_write_congested(bdi)) {
 			wbc->encountered_congestion = 1;
 			if (!sb_is_blkdev_sb(sb))
@@ -507,16 +509,27 @@ void generic_sync_sb_inodes(struct super
 			continue;		/* blockdev has wrong queue */
 		}
 
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & I_NEW) {
+			spin_unlock(&inode->i_lock);
+			requeue_io(inode);
+			continue;
+		}
+
 		/*
 		 * Was this inode dirtied after sync_sb_inodes was called?
 		 * This keeps sync from extra jobs and livelock.
 		 */
-		if (inode_dirtied_after(inode, start))
+		if (inode_dirtied_after(inode, start)) {
+			spin_unlock(&inode->i_lock);
 			break;
+		}
 
 		/* Is another pdflush already flushing this queue? */
-		if (current_is_pdflush() && !writeback_acquire(bdi))
+		if (current_is_pdflush() && !writeback_acquire(bdi)) {
+			spin_unlock(&inode->i_lock);
 			break;
+		}
 
 		BUG_ON(inode->i_state & I_FREEING);
 		__iget(inode);
@@ -531,6 +544,7 @@ void generic_sync_sb_inodes(struct super
 			 */
 			redirty_tail(inode);
 		}
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		iput(inode);
 		cond_resched();
@@ -555,15 +569,17 @@ void generic_sync_sb_inodes(struct super
 		 */
 		spin_lock(&sb_inode_list_lock);
 		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-			struct address_space *mapping;
+			struct address_space *mapping = inode->i_mapping;
 
+			spin_lock(&inode->i_lock);
 			if (inode->i_state &
-					(I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
-				continue;
-			mapping = inode->i_mapping;
-			if (mapping->nrpages == 0)
+					(I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW)
+					|| mapping->nrpages == 0) {
+				spin_unlock(&inode->i_lock);
 				continue;
+			}
 			__iget(inode);
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&sb_inode_list_lock);
 			spin_unlock(&inode_lock);
 			/*
@@ -756,7 +772,9 @@ int write_inode_now(struct inode *inode,
 
 	might_sleep();
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	ret = __writeback_single_inode(inode, &wbc);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	if (sync)
 		inode_sync_wait(inode);
@@ -780,7 +798,9 @@ int sync_inode(struct inode *inode, stru
 	int ret;
 
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	ret = __writeback_single_inode(inode, wbc);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	return ret;
 }
@@ -823,9 +843,11 @@ int generic_osync_inode(struct inode *in
 	}
 
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	if ((inode->i_state & I_DIRTY) &&
 	    ((what & OSYNC_INODE) || (inode->i_state & I_DIRTY_DATASYNC)))
 		need_write_inode_now = 1;
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 
 	if (need_write_inode_now) {
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -269,6 +269,7 @@ static void init_once(void *foo)
  */
 void __iget(struct inode * inode)
 {
+	assert_spin_locked(&inode->i_lock);
 	if (atomic_read(&inode->i_count)) {
 		atomic_inc(&inode->i_count);
 		return;
@@ -373,16 +374,21 @@ static int invalidate_list(struct list_h
 		if (tmp == head)
 			break;
 		inode = list_entry(tmp, struct inode, i_sb_list);
-		if (inode->i_state & I_NEW)
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & I_NEW) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 		invalidate_inode_buffers(inode);
 		if (!atomic_read(&inode->i_count)) {
 			list_move(&inode->i_list, dispose);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
+			spin_unlock(&inode->i_lock);
 			count++;
 			continue;
 		}
+		spin_unlock(&inode->i_lock);
 		busy = 1;
 	}
 	/* only unused inodes may be cached with i_count zero */
@@ -462,12 +468,15 @@ static void prune_icache(int nr_to_scan)
 
 		inode = list_entry(inode_unused.prev, struct inode, i_list);
 
+		spin_lock(&inode->i_lock);
 		if (inode->i_state || atomic_read(&inode->i_count)) {
 			list_move(&inode->i_list, &inode_unused);
+			spin_unlock(&inode->i_lock);
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
 			__iget(inode);
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
@@ -478,12 +487,16 @@ static void prune_icache(int nr_to_scan)
 			if (inode != list_entry(inode_unused.next,
 						struct inode, i_list))
 				continue;	/* wrong inode or list_empty */
-			if (!can_unuse(inode))
+			spin_lock(&inode->i_lock);
+			if (!can_unuse(inode)) {
+				spin_unlock(&inode->i_lock);
 				continue;
+			}
 		}
 		list_move(&inode->i_list, &freeable);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
+		spin_unlock(&inode->i_lock);
 		nr_pruned++;
 	}
 	inodes_stat.nr_unused -= nr_pruned;
@@ -543,8 +556,14 @@ repeat:
 	hlist_for_each_entry(inode, node, head, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
-		if (!test(inode, data))
+		if (!spin_trylock(&inode->i_lock)) {
+			spin_unlock(&inode_hash_lock);
+			goto repeat;
+		}
+		if (!test(inode, data)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
 			spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
@@ -572,6 +591,10 @@ repeat:
 			continue;
 		if (inode->i_sb != sb)
 			continue;
+		if (!spin_trylock(&inode->i_lock)) {
+			spin_unlock(&inode_hash_lock);
+			goto repeat;
+		}
 		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
 			spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
@@ -598,10 +621,10 @@ __inode_add_to_lists(struct super_block
 			struct inode *inode)
 {
 	inodes_stat.nr_inodes++;
-	list_add(&inode->i_list, &inode_in_use);
 	spin_lock(&sb_inode_list_lock);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	spin_unlock(&sb_inode_list_lock);
+	list_add(&inode->i_list, &inode_in_use);
 	if (head) {
 		spin_lock(&inode_hash_lock);
 		hlist_add_head(&inode->i_hash, head);
@@ -658,9 +681,9 @@ struct inode *new_inode(struct super_blo
 	inode = alloc_inode(sb);
 	if (inode) {
 		spin_lock(&inode_lock);
-		__inode_add_to_lists(sb, NULL, inode);
 		inode->i_ino = ++last_ino;
 		inode->i_state = 0;
+		__inode_add_to_lists(sb, NULL, inode);
 		spin_unlock(&inode_lock);
 	}
 	return inode;
@@ -718,8 +741,8 @@ static struct inode * get_new_inode(stru
 			if (set(inode, data))
 				goto set_failed;
 
-			__inode_add_to_lists(sb, head, inode);
 			inode->i_state = I_LOCK|I_NEW;
+			__inode_add_to_lists(sb, head, inode);
 			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -734,6 +757,7 @@ static struct inode * get_new_inode(stru
 		 * allocated.
 		 */
 		__iget(old);
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -742,6 +766,7 @@ static struct inode * get_new_inode(stru
 	return inode;
 
 set_failed:
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	destroy_inode(inode);
 	return NULL;
@@ -764,8 +789,8 @@ static struct inode * get_new_inode_fast
 		old = find_inode_fast(sb, head, ino);
 		if (!old) {
 			inode->i_ino = ino;
-			__inode_add_to_lists(sb, head, inode);
 			inode->i_state = I_LOCK|I_NEW;
+			__inode_add_to_lists(sb, head, inode);
 			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -780,6 +805,7 @@ static struct inode * get_new_inode_fast
 		 * allocated.
 		 */
 		__iget(old);
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -821,6 +847,7 @@ ino_t iunique(struct super_block *sb, in
 		res = counter++;
 		head = inode_hashtable + hash(sb, res);
 		inode = find_inode_fast(sb, head, res);
+		if (inode) spin_unlock(&inode->i_lock);
 	} while (inode != NULL);
 	spin_unlock(&inode_lock);
 
@@ -830,7 +857,10 @@ EXPORT_SYMBOL(iunique);
 
 struct inode *igrab(struct inode *inode)
 {
+	struct inode *ret = inode;
+
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	if (!(inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)))
 		__iget(inode);
 	else
@@ -839,9 +869,11 @@ struct inode *igrab(struct inode *inode)
 		 * called yet, and somebody is calling igrab
 		 * while the inode is getting freed.
 		 */
-		inode = NULL;
+		ret = NULL;
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
-	return inode;
+
+	return ret;
 }
 
 EXPORT_SYMBOL(igrab);
@@ -875,6 +907,7 @@ static struct inode *ifind(struct super_
 	inode = find_inode(sb, head, test, data);
 	if (inode) {
 		__iget(inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
@@ -908,6 +941,7 @@ static struct inode *ifind_fast(struct s
 	inode = find_inode_fast(sb, head, ino);
 	if (inode) {
 		__iget(inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(inode);
 		return inode;
@@ -1089,6 +1123,7 @@ int insert_inode_locked(struct inode *in
 			return 0;
 		}
 		__iget(old);
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1121,6 +1156,7 @@ int insert_inode_locked4(struct inode *i
 			return 0;
 		}
 		__iget(old);
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1186,12 +1222,14 @@ void generic_delete_inode(struct inode *
 {
 	const struct super_operations *op = inode->i_sb->s_op;
 
-	list_del_init(&inode->i_list);
 	spin_lock(&sb_inode_list_lock);
+	spin_lock(&inode->i_lock);
+	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&sb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
+	spin_unlock(&inode->i_lock);
 	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
 
@@ -1226,19 +1264,27 @@ static void generic_forget_inode(struct
 {
 	struct super_block *sb = inode->i_sb;
 
+	spin_lock(&sb_inode_list_lock);
+	spin_lock(&inode->i_lock);
 	if (!hlist_unhashed(&inode->i_hash)) {
 		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
 			list_move(&inode->i_list, &inode_unused);
 		inodes_stat.nr_unused++;
 		if (sb->s_flags & MS_ACTIVE) {
+			spin_unlock(&inode->i_lock);
+			spin_unlock(&sb_inode_list_lock);
 			spin_unlock(&inode_lock);
 			return;
 		}
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_WILL_FREE;
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 		write_inode_now(inode, 1);
 		spin_lock(&inode_lock);
+		spin_lock(&sb_inode_list_lock);
+		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
 		inodes_stat.nr_unused--;
@@ -1247,12 +1293,12 @@ static void generic_forget_inode(struct
 		spin_unlock(&inode_hash_lock);
 	}
 	list_del_init(&inode->i_list);
-	spin_lock(&sb_inode_list_lock);
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&sb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 	inodes_stat.nr_inodes--;
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	if (inode->i_data.nrpages)
 		truncate_inode_pages(&inode->i_data, 0);
@@ -1493,6 +1539,8 @@ EXPORT_SYMBOL(inode_wait);
  * wake_up_inode() after removing from the hash list will DTRT.
  *
  * This is called with inode_lock held.
+ *
+ * Called with i_lock held and returns with it dropped.
  */
 static void __wait_on_freeing_inode(struct inode *inode)
 {
@@ -1500,6 +1548,7 @@ static void __wait_on_freeing_inode(stru
 	DEFINE_WAIT_BIT(wait, &inode->i_state, __I_LOCK);
 	wq = bit_waitqueue(&inode->i_state, __I_LOCK);
 	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	schedule();
 	finish_wait(wq, &wait.wait);
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -400,7 +400,9 @@ static void hugetlbfs_forget_inode(struc
 			spin_unlock(&inode_lock);
 			return;
 		}
+		spin_lock(&inode->i_lock);
 		inode->i_state |= I_WILL_FREE;
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		/*
 		 * write_inode_now is a noop as we set BDI_CAP_NO_WRITEBACK
@@ -408,7 +410,9 @@ static void hugetlbfs_forget_inode(struc
 		 */
 		write_inode_now(inode, 1);
 		spin_lock(&inode_lock);
+		spin_lock(&inode->i_lock);
 		inode->i_state &= ~I_WILL_FREE;
+		spin_unlock(&inode->i_lock);
 		inodes_stat.nr_unused--;
 		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
@@ -418,7 +422,9 @@ static void hugetlbfs_forget_inode(struc
 	spin_lock(&sb_inode_list_lock);
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&sb_inode_list_lock);
+	spin_lock(&inode->i_lock);
 	inode->i_state |= I_FREEING;
+	spin_unlock(&inode->i_lock);
 	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
 	truncate_hugepages(inode, 0);
Index: linux-2.6/fs/nilfs2/gcdat.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/gcdat.c
+++ linux-2.6/fs/nilfs2/gcdat.c
@@ -27,6 +27,7 @@
 #include "page.h"
 #include "mdt.h"
 
+/* XXX: what protects i_state? */
 int nilfs_init_gcdat_inode(struct the_nilfs *nilfs)
 {
 	struct inode *dat = nilfs->ns_dat, *gcdat = nilfs->ns_gc_dat;
Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c
+++ linux-2.6/fs/quota/dquot.c
@@ -824,14 +824,22 @@ static void add_dquot_ref(struct super_b
 	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW)) {
+			spin_unlock(&inode->i_lock);
 			continue;
-		if (!atomic_read(&inode->i_writecount))
+		}
+		if (!atomic_read(&inode->i_writecount)) {
+			spin_unlock(&inode->i_lock);
 			continue;
-		if (!dqinit_needed(inode, type))
+		}
+		if (!dqinit_needed(inode, type)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 
 		__iget(inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 23/27] fs: icache lock i_count
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (21 preceding siblings ...)
  2009-04-25  1:20 ` [patch 22/27] fs: icache lock i_state npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 24/27] fs: icache atomic inodes_stat npiggin
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-inode_lock-scale-4.patch --]
[-- Type: text/plain, Size: 33238 bytes --]

Protect inode->i_count with i_lock, rather than having it atomic.
A next step is to batch adjacent operations under a single hold of i_lock
(e.g. moving the refcount increment into d_instantiate, which would remove
a lock/unlock cycle on i_lock).
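
As an illustration of the conversion pattern (a minimal sketch, not code
taken from any one hunk below): every atomic_inc(&inode->i_count) becomes
an increment under i_lock, and __iget() now asserts that the caller
already holds i_lock:

	/* before: refcount changes needed no lock */
	atomic_inc(&inode->i_count);

	/* after: i_count is a plain unsigned int protected by i_lock */
	spin_lock(&inode->i_lock);
	inode->i_count++;
	spin_unlock(&inode->i_lock);

Since the nesting order elsewhere is inode_lock -> sb_inode_list_lock ->
i_lock, iput() cannot simply take those locks while already holding
i_lock, so it uses spin_trylock() and restarts on failure.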
---
 arch/powerpc/platforms/cell/spufs/file.c |    2 -
 fs/affs/inode.c                          |    4 ++-
 fs/afs/dir.c                             |    4 ++-
 fs/anon_inodes.c                         |    4 ++-
 fs/bfs/dir.c                             |    4 ++-
 fs/block_dev.c                           |    8 +++++-
 fs/btrfs/inode.c                         |    4 ++-
 fs/coda/dir.c                            |    4 ++-
 fs/exofs/inode.c                         |   12 +++++++---
 fs/exofs/namei.c                         |    4 ++-
 fs/ext2/namei.c                          |    4 ++-
 fs/ext3/ialloc.c                         |    4 +--
 fs/ext3/namei.c                          |    4 ++-
 fs/ext4/ialloc.c                         |    4 +--
 fs/ext4/namei.c                          |    4 ++-
 fs/fs-writeback.c                        |    4 +--
 fs/gfs2/ops_inode.c                      |    4 ++-
 fs/hfsplus/dir.c                         |    4 ++-
 fs/hpfs/inode.c                          |    2 -
 fs/inode.c                               |   37 ++++++++++++++++++++-----------
 fs/jffs2/dir.c                           |    8 +++++-
 fs/jfs/jfs_txnmgr.c                      |    4 ++-
 fs/jfs/namei.c                           |    4 ++-
 fs/libfs.c                               |    4 ++-
 fs/locks.c                               |    3 --
 fs/minix/namei.c                         |    4 ++-
 fs/namei.c                               |    7 ++++-
 fs/nfs/dir.c                             |    4 ++-
 fs/nfs/getroot.c                         |    4 ++-
 fs/nfs/inode.c                           |    4 +--
 fs/nilfs2/mdt.c                          |    2 -
 fs/nilfs2/namei.c                        |    4 ++-
 fs/notify/inotify/inotify.c              |   28 +++++++++++++----------
 fs/ocfs2/namei.c                         |    4 ++-
 fs/reiserfs/file.c                       |    4 +--
 fs/reiserfs/namei.c                      |    4 ++-
 fs/reiserfs/stree.c                      |    2 -
 fs/sysv/namei.c                          |    4 ++-
 fs/ubifs/dir.c                           |    4 ++-
 fs/ubifs/super.c                         |    2 -
 fs/udf/namei.c                           |    4 ++-
 fs/ufs/namei.c                           |    4 ++-
 fs/xfs/linux-2.6/xfs_iops.c              |    4 ++-
 fs/xfs/xfs_iget.c                        |    2 -
 fs/xfs/xfs_inode.h                       |    6 +++--
 include/linux/fs.h                       |    2 -
 ipc/mqueue.c                             |    7 ++++-
 kernel/futex.c                           |    4 ++-
 mm/shmem.c                               |    4 ++-
 49 files changed, 177 insertions(+), 85 deletions(-)

Index: linux-2.6/arch/powerpc/platforms/cell/spufs/file.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/file.c
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/file.c
@@ -1548,7 +1548,7 @@ static int spufs_mfc_open(struct inode *
 	if (ctx->owner != current->mm)
 		return -EINVAL;
 
-	if (atomic_read(&inode->i_count) != 1)
+	if (inode->i_count != 1)
 		return -EBUSY;
 
 	mutex_lock(&ctx->mapping_lock);
Index: linux-2.6/fs/affs/inode.c
===================================================================
--- linux-2.6.orig/fs/affs/inode.c
+++ linux-2.6/fs/affs/inode.c
@@ -379,7 +379,9 @@ affs_add_entry(struct inode *dir, struct
 		affs_adjust_checksum(inode_bh, block - be32_to_cpu(chain));
 		mark_buffer_dirty_inode(inode_bh, inode);
 		inode->i_nlink = 2;
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_count++;
+		spin_unlock(&inode->i_lock);
 	}
 	affs_fix_checksum(sb, bh);
 	mark_buffer_dirty_inode(bh, inode);
Index: linux-2.6/fs/afs/dir.c
===================================================================
--- linux-2.6.orig/fs/afs/dir.c
+++ linux-2.6/fs/afs/dir.c
@@ -1007,7 +1007,9 @@ static int afs_link(struct dentry *from,
 	if (ret < 0)
 		goto link_error;
 
-	atomic_inc(&vnode->vfs_inode.i_count);
+	spin_lock(&vnode->vfs_inode.i_lock);
+	vnode->vfs_inode.i_count++;
+	spin_unlock(&vnode->vfs_inode.i_lock);
 	d_instantiate(dentry, &vnode->vfs_inode);
 	key_put(key);
 	_leave(" = 0");
Index: linux-2.6/fs/anon_inodes.c
===================================================================
--- linux-2.6.orig/fs/anon_inodes.c
+++ linux-2.6/fs/anon_inodes.c
@@ -104,7 +104,9 @@ int anon_inode_getfd(const char *name, c
 	 * so we can avoid doing an igrab() and we can use an open-coded
 	 * atomic_inc().
 	 */
-	atomic_inc(&anon_inode_inode->i_count);
+	spin_lock(&anon_inode_inode->i_lock);
+	anon_inode_inode->i_count++;
+	spin_unlock(&anon_inode_inode->i_lock);
 
 	dentry->d_op = &anon_inodefs_dentry_operations;
 	/* Do not publish this dentry inside the global dentry hash table */
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c
+++ linux-2.6/fs/block_dev.c
@@ -579,7 +579,9 @@ static struct block_device *bd_acquire(s
 	spin_lock(&bdev_lock);
 	bdev = inode->i_bdev;
 	if (bdev) {
-		atomic_inc(&bdev->bd_inode->i_count);
+		spin_lock(&bdev->bd_inode->i_lock);
+		bdev->bd_inode->i_count++;
+		spin_unlock(&bdev->bd_inode->i_lock);
 		spin_unlock(&bdev_lock);
 		return bdev;
 	}
@@ -595,7 +597,9 @@ static struct block_device *bd_acquire(s
 			 * So, we can access it via ->i_mapping always
 			 * without igrab().
 			 */
-			atomic_inc(&bdev->bd_inode->i_count);
+			spin_lock(&bdev->bd_inode->i_lock);
+			bdev->bd_inode->i_count++;
+			spin_unlock(&bdev->bd_inode->i_lock);
 			inode->i_bdev = bdev;
 			inode->i_mapping = bdev->bd_inode->i_mapping;
 			list_add(&inode->i_devices, &bdev->bd_inodes);
Index: linux-2.6/fs/ext2/namei.c
===================================================================
--- linux-2.6.orig/fs/ext2/namei.c
+++ linux-2.6/fs/ext2/namei.c
@@ -188,7 +188,9 @@ static int ext2_link (struct dentry * ol
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	err = ext2_add_link(dentry, inode);
 	if (!err) {
Index: linux-2.6/fs/ext3/ialloc.c
===================================================================
--- linux-2.6.orig/fs/ext3/ialloc.c
+++ linux-2.6/fs/ext3/ialloc.c
@@ -100,9 +100,9 @@ void ext3_free_inode (handle_t *handle,
 	struct ext3_sb_info *sbi;
 	int fatal = 0, err;
 
-	if (atomic_read(&inode->i_count) > 1) {
+	if (inode->i_count > 1) {
 		printk ("ext3_free_inode: inode has count=%d\n",
-					atomic_read(&inode->i_count));
+					inode->i_count);
 		return;
 	}
 	if (inode->i_nlink) {
Index: linux-2.6/fs/ext3/namei.c
===================================================================
--- linux-2.6.orig/fs/ext3/namei.c
+++ linux-2.6/fs/ext3/namei.c
@@ -2244,7 +2244,9 @@ retry:
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	err = ext3_add_entry(handle, dentry, inode);
 	if (!err) {
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -373,7 +373,7 @@ __sync_single_inode(struct inode *inode,
 			 * the pages.
 			 */
 			redirty_tail(inode);
-		} else if (atomic_read(&inode->i_count)) {
+		} else if (inode->i_count) {
 			/*
 			 * The inode is clean, inuse
 			 */
@@ -399,7 +399,7 @@ __writeback_single_inode(struct inode *i
 {
 	wait_queue_head_t *wqh;
 
-	if (!atomic_read(&inode->i_count))
+	if (!inode->i_count)
 		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
 	else
 		WARN_ON(inode->i_state & I_WILL_FREE);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -131,7 +131,7 @@ struct inode *inode_init_always(struct s
 	inode->i_sb = sb;
 	inode->i_blkbits = sb->s_blocksize_bits;
 	inode->i_flags = 0;
-	atomic_set(&inode->i_count, 1);
+	inode->i_count = 1;
 	inode->i_op = &empty_iops;
 	inode->i_fop = &empty_fops;
 	inode->i_nlink = 1;
@@ -270,11 +270,10 @@ static void init_once(void *foo)
 void __iget(struct inode * inode)
 {
 	assert_spin_locked(&inode->i_lock);
-	if (atomic_read(&inode->i_count)) {
-		atomic_inc(&inode->i_count);
+	inode->i_count++;
+	if (inode->i_count > 1)
 		return;
-	}
-	atomic_inc(&inode->i_count);
+
 	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
 		list_move(&inode->i_list, &inode_in_use);
 	inodes_stat.nr_unused--;
@@ -380,7 +379,7 @@ static int invalidate_list(struct list_h
 			continue;
 		}
 		invalidate_inode_buffers(inode);
-		if (!atomic_read(&inode->i_count)) {
+		if (!inode->i_count) {
 			list_move(&inode->i_list, dispose);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
@@ -431,7 +430,7 @@ static int can_unuse(struct inode *inode
 		return 0;
 	if (inode_has_buffers(inode))
 		return 0;
-	if (atomic_read(&inode->i_count))
+	if (inode->i_count)
 		return 0;
 	if (inode->i_data.nrpages)
 		return 0;
@@ -469,7 +468,7 @@ static void prune_icache(int nr_to_scan)
 		inode = list_entry(inode_unused.prev, struct inode, i_list);
 
 		spin_lock(&inode->i_lock);
-		if (inode->i_state || atomic_read(&inode->i_count)) {
+		if (inode->i_state || inode->i_count) {
 			list_move(&inode->i_list, &inode_unused);
 			spin_unlock(&inode->i_lock);
 			continue;
@@ -1222,8 +1221,6 @@ void generic_delete_inode(struct inode *
 {
 	const struct super_operations *op = inode->i_sb->s_op;
 
-	spin_lock(&sb_inode_list_lock);
-	spin_lock(&inode->i_lock);
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&sb_inode_list_lock);
@@ -1264,8 +1261,6 @@ static void generic_forget_inode(struct
 {
 	struct super_block *sb = inode->i_sb;
 
-	spin_lock(&sb_inode_list_lock);
-	spin_lock(&inode->i_lock);
 	if (!hlist_unhashed(&inode->i_hash)) {
 		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
 			list_move(&inode->i_list, &inode_unused);
@@ -1357,8 +1352,24 @@ void iput(struct inode *inode)
 	if (inode) {
 		BUG_ON(inode->i_state == I_CLEAR);
 
-		if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
+retry:
+		spin_lock(&inode->i_lock);
+		if (inode->i_count == 1) {
+			if (!spin_trylock(&inode_lock)) {
+				spin_unlock(&inode->i_lock);
+				goto retry;
+			}
+			if (!spin_trylock(&sb_inode_list_lock)) {
+				spin_unlock(&inode_lock);
+				spin_unlock(&inode->i_lock);
+				goto retry;
+			}
+			inode->i_count--;
 			iput_final(inode);
+		} else {
+			inode->i_count--;
+			spin_unlock(&inode->i_lock);
+		}
 	}
 }
 
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -278,7 +278,9 @@ int simple_link(struct dentry *old_dentr
 
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	dget(dentry);
 	d_instantiate(dentry, inode);
 	return 0;
Index: linux-2.6/fs/locks.c
===================================================================
--- linux-2.6.orig/fs/locks.c
+++ linux-2.6/fs/locks.c
@@ -1373,8 +1373,7 @@ int generic_setlease(struct file *filp,
 		if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
 			goto out;
 		if ((arg == F_WRLCK)
-		    && (dentry->d_count > 1
-			|| (atomic_read(&inode->i_count) > 1)))
+		    && (dentry->d_count > 1 || inode->i_count > 1))
 			goto out;
 	}
 
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -2267,8 +2267,11 @@ static long do_unlinkat(int dfd, const c
 		if (nd.last.name[nd.last.len])
 			goto slashes;
 		inode = dentry->d_inode;
-		if (inode)
-			atomic_inc(&inode->i_count);
+		if (inode) {
+			spin_lock(&inode->i_lock);
+			inode->i_count++;
+			spin_unlock(&inode->i_lock);
+		}
 		error = mnt_want_write(nd.path.mnt);
 		if (error)
 			goto exit2;
Index: linux-2.6/fs/nfs/dir.c
===================================================================
--- linux-2.6.orig/fs/nfs/dir.c
+++ linux-2.6/fs/nfs/dir.c
@@ -1537,7 +1537,9 @@ nfs_link(struct dentry *old_dentry, stru
 	d_drop(dentry);
 	error = NFS_PROTO(dir)->link(inode, dir, &dentry->d_name);
 	if (error == 0) {
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_count++;
+		spin_unlock(&inode->i_lock);
 		d_add(dentry, inode);
 	}
 	return error;
Index: linux-2.6/fs/nfs/getroot.c
===================================================================
--- linux-2.6.orig/fs/nfs/getroot.c
+++ linux-2.6/fs/nfs/getroot.c
@@ -56,7 +56,9 @@ static int nfs_superblock_set_dummy_root
 			return -ENOMEM;
 		}
 		/* Circumvent igrab(): we know the inode is not being freed */
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_count++;
+		spin_unlock(&inode->i_lock);
 		/*
 		 * Ensure that this dentry is invisible to d_find_alias().
 		 * Otherwise, it may be spliced into the tree by
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -405,23 +405,28 @@ void inotify_unmount_inodes(struct list_
 		 * evict all inodes with zero i_count from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
-		if (!atomic_read(&inode->i_count))
+		if (!inode->i_count)
 			continue;
 
 		need_iput_tmp = need_iput;
 		need_iput = NULL;
 		/* In case inotify_remove_watch_locked() drops a reference. */
-		if (inode != need_iput_tmp)
+		if (inode != need_iput_tmp) {
+			spin_lock(&inode->i_lock);
 			__iget(inode);
-		else
+			spin_unlock(&inode->i_lock);
+		} else
 			need_iput_tmp = NULL;
 		/* In case the dropping of a reference would nuke next_i. */
-		if ((&next_i->i_sb_list != list) &&
-				atomic_read(&next_i->i_count) &&
-				!(next_i->i_state & (I_CLEAR | I_FREEING |
-					I_WILL_FREE))) {
-			__iget(next_i);
-			need_iput = next_i;
+		if (&next_i->i_sb_list != list) {
+			spin_lock(&next_i->i_lock);
+			if (next_i->i_count &&
+				!(next_i->i_state &
+					(I_CLEAR|I_FREEING|I_WILL_FREE))) {
+				__iget(next_i);
+				need_iput = next_i;
+			}
+			spin_unlock(&next_i->i_lock);
 		}
 
 		/*
@@ -440,11 +445,10 @@ void inotify_unmount_inodes(struct list_
 		mutex_lock(&inode->inotify_mutex);
 		watches = &inode->inotify_watches;
 		list_for_each_entry_safe(watch, next_w, watches, i_list) {
-			struct inotify_handle *ih= watch->ih;
+			struct inotify_handle *ih = watch->ih;
 			get_inotify_watch(watch);
 			mutex_lock(&ih->mutex);
-			ih->in_ops->handle_event(watch, watch->wd, IN_UNMOUNT, 0,
-						 NULL, NULL);
+			ih->in_ops->handle_event(watch, watch->wd, IN_UNMOUNT, 0, NULL, NULL);
 			inotify_remove_watch_locked(ih, watch);
 			mutex_unlock(&ih->mutex);
 			put_inotify_watch(watch);
Index: linux-2.6/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_iops.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_iops.c
@@ -362,7 +362,9 @@ xfs_vn_link(
 	if (unlikely(error))
 		return -error;
 
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	d_instantiate(dentry, inode);
 	return 0;
 }
Index: linux-2.6/fs/xfs/xfs_iget.c
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_iget.c
+++ linux-2.6/fs/xfs/xfs_iget.c
@@ -822,7 +822,7 @@ xfs_isilocked(
 /*  0 */		(void *)(__psint_t)(vk),		\
 /*  1 */		(void *)(s),				\
 /*  2 */		(void *)(__psint_t) line,		\
-/*  3 */		(void *)(__psint_t)atomic_read(&VFS_I(ip)->i_count), \
+/*  3 */		(void *)(__psint_t)VFS_I(ip)->i_count,	\
 /*  4 */		(void *)(ra),				\
 /*  5 */		NULL,					\
 /*  6 */		(void *)(__psint_t)current_cpu(),	\
Index: linux-2.6/fs/xfs/xfs_inode.h
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_inode.h
+++ linux-2.6/fs/xfs/xfs_inode.h
@@ -561,8 +561,10 @@ extern void xfs_itrace_rele(struct xfs_i
 
 #define IHOLD(ip) \
 do { \
-	ASSERT(atomic_read(&VFS_I(ip)->i_count) > 0) ; \
-	atomic_inc(&(VFS_I(ip)->i_count)); \
+	spin_lock(&VFS_I(ip)->i_lock);		\
+	ASSERT(VFS_I(ip)->i_count > 0);		\
+	VFS_I(ip)->i_count++;			\
+	spin_unlock(&VFS_I(ip)->i_lock);	\
 	xfs_itrace_hold((ip), __FILE__, __LINE__, (inst_t *)__return_address); \
 } while (0)
 
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -716,7 +716,7 @@ struct inode {
 	struct list_head	i_sb_list;
 	struct list_head	i_dentry;
 	unsigned long		i_ino;
-	atomic_t		i_count;
+	unsigned int		i_count;
 	unsigned int		i_nlink;
 	uid_t			i_uid;
 	gid_t			i_gid;
Index: linux-2.6/ipc/mqueue.c
===================================================================
--- linux-2.6.orig/ipc/mqueue.c
+++ linux-2.6/ipc/mqueue.c
@@ -777,8 +777,11 @@ SYSCALL_DEFINE1(mq_unlink, const char __
 	}
 
 	inode = dentry->d_inode;
-	if (inode)
-		atomic_inc(&inode->i_count);
+	if (inode) {
+		spin_lock(&inode->i_lock);
+		inode->i_count++;
+		spin_unlock(&inode->i_lock);
+	}
 	err = mnt_want_write(ipc_ns->mq_mnt);
 	if (err)
 		goto out_err;
Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c
+++ linux-2.6/kernel/futex.c
@@ -158,7 +158,9 @@ static void get_futex_key_refs(union fut
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		atomic_inc(&key->shared.inode->i_count);
+		spin_lock(&key->shared.inode->i_lock);
+		key->shared.inode->i_count++;
+		spin_unlock(&key->shared.inode->i_lock);
 		break;
 	case FUT_OFF_MMSHARED:
 		atomic_inc(&key->private.mm->mm_count);
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1862,7 +1862,9 @@ static int shmem_link(struct dentry *old
 	dir->i_size += BOGO_DIRENT_SIZE;
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);	/* New dentry reference */
+	spin_lock(&inode->i_lock);
+	inode->i_count++;	/* New dentry reference */
+	spin_unlock(&inode->i_lock);
 	dget(dentry);		/* Extra pinning count for the created dentry */
 	d_instantiate(dentry, inode);
 out:
Index: linux-2.6/fs/bfs/dir.c
===================================================================
--- linux-2.6.orig/fs/bfs/dir.c
+++ linux-2.6/fs/bfs/dir.c
@@ -179,7 +179,9 @@ static int bfs_link(struct dentry *old,
 	inc_nlink(inode);
 	inode->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	d_instantiate(new, inode);
 	mutex_unlock(&info->bfs_lock);
 	return 0;
Index: linux-2.6/fs/btrfs/inode.c
===================================================================
--- linux-2.6.orig/fs/btrfs/inode.c
+++ linux-2.6/fs/btrfs/inode.c
@@ -3786,7 +3786,9 @@ static int btrfs_link(struct dentry *old
 	trans = btrfs_start_transaction(root, 1);
 
 	btrfs_set_trans_block_group(trans, dir);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	err = btrfs_add_nondir(trans, dentry, inode, 1, index);
 
Index: linux-2.6/fs/coda/dir.c
===================================================================
--- linux-2.6.orig/fs/coda/dir.c
+++ linux-2.6/fs/coda/dir.c
@@ -302,7 +302,9 @@ static int coda_link(struct dentry *sour
 	}
 
 	coda_dir_update_mtime(dir_inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	d_instantiate(de, inode);
 	inc_nlink(inode);
 
Index: linux-2.6/fs/exofs/inode.c
===================================================================
--- linux-2.6.orig/fs/exofs/inode.c
+++ linux-2.6/fs/exofs/inode.c
@@ -1037,7 +1037,9 @@ static void create_done(struct osd_reque
 	} else
 		set_obj_created(oi);
 
-	atomic_dec(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count--;
+	spin_unlock(&inode->i_lock);
 	wake_up(&oi->i_wq);
 }
 
@@ -1103,11 +1105,15 @@ struct inode *exofs_new_inode(struct ino
 	/* increment the refcount so that the inode will still be around when we
 	 * reach the callback
 	 */
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	ret = exofs_async_op(or, create_done, inode, oi->i_cred);
 	if (ret) {
-		atomic_dec(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_count--;
+		spin_unlock(&inode->i_lock);
 		osd_end_request(or);
 		return ERR_PTR(-EIO);
 	}
Index: linux-2.6/fs/exofs/namei.c
===================================================================
--- linux-2.6.orig/fs/exofs/namei.c
+++ linux-2.6/fs/exofs/namei.c
@@ -155,7 +155,9 @@ static int exofs_link(struct dentry *old
 
 	inode->i_ctime = CURRENT_TIME;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	return exofs_add_nondir(dentry, inode);
 }
Index: linux-2.6/fs/ext4/ialloc.c
===================================================================
--- linux-2.6.orig/fs/ext4/ialloc.c
+++ linux-2.6/fs/ext4/ialloc.c
@@ -190,9 +190,9 @@ void ext4_free_inode(handle_t *handle, s
 	struct ext4_sb_info *sbi;
 	int fatal = 0, err, count, cleared;
 
-	if (atomic_read(&inode->i_count) > 1) {
+	if (inode->i_count > 1) {
 		printk(KERN_ERR "ext4_free_inode: inode has count=%d\n",
-		       atomic_read(&inode->i_count));
+		       inode->i_count);
 		return;
 	}
 	if (inode->i_nlink) {
Index: linux-2.6/fs/ext4/namei.c
===================================================================
--- linux-2.6.orig/fs/ext4/namei.c
+++ linux-2.6/fs/ext4/namei.c
@@ -2327,7 +2327,9 @@ retry:
 
 	inode->i_ctime = ext4_current_time(inode);
 	ext4_inc_count(handle, inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	err = ext4_add_entry(handle, dentry, inode);
 	if (!err) {
Index: linux-2.6/fs/gfs2/ops_inode.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_inode.c
+++ linux-2.6/fs/gfs2/ops_inode.c
@@ -255,7 +255,9 @@ out_parent:
 	gfs2_holder_uninit(ghs);
 	gfs2_holder_uninit(ghs + 1);
 	if (!error) {
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_count++;
+		spin_unlock(&inode->i_lock);
 		d_instantiate(dentry, inode);
 		mark_inode_dirty(inode);
 	}
Index: linux-2.6/fs/hfsplus/dir.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/dir.c
+++ linux-2.6/fs/hfsplus/dir.c
@@ -301,7 +301,9 @@ static int hfsplus_link(struct dentry *s
 
 	inc_nlink(inode);
 	hfsplus_instantiate(dst_dentry, inode, cnid);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	inode->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
 	HFSPLUS_SB(sb).file_count++;
Index: linux-2.6/fs/hpfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hpfs/inode.c
+++ linux-2.6/fs/hpfs/inode.c
@@ -181,7 +181,7 @@ void hpfs_write_inode(struct inode *i)
 	struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
 	struct inode *parent;
 	if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return;
-	if (hpfs_inode->i_rddir_off && !atomic_read(&i->i_count)) {
+	if (hpfs_inode->i_rddir_off && !i->i_count) {
 		if (*hpfs_inode->i_rddir_off) printk("HPFS: write_inode: some position still there\n");
 		kfree(hpfs_inode->i_rddir_off);
 		hpfs_inode->i_rddir_off = NULL;
Index: linux-2.6/fs/jffs2/dir.c
===================================================================
--- linux-2.6.orig/fs/jffs2/dir.c
+++ linux-2.6/fs/jffs2/dir.c
@@ -287,7 +287,9 @@ static int jffs2_link (struct dentry *ol
 		mutex_unlock(&f->sem);
 		d_instantiate(dentry, old_dentry->d_inode);
 		dir_i->i_mtime = dir_i->i_ctime = ITIME(now);
-		atomic_inc(&old_dentry->d_inode->i_count);
+		spin_lock(&old_dentry->d_inode->i_lock);
+		old_dentry->d_inode->i_count++;
+		spin_unlock(&old_dentry->d_inode->i_lock);
 	}
 	return ret;
 }
@@ -866,7 +868,9 @@ static int jffs2_rename (struct inode *o
 		printk(KERN_NOTICE "jffs2_rename(): Link succeeded, unlink failed (err %d). You now have a hard link\n", ret);
 		/* Might as well let the VFS know */
 		d_instantiate(new_dentry, old_dentry->d_inode);
-		atomic_inc(&old_dentry->d_inode->i_count);
+		spin_lock(&old_dentry->d_inode->i_lock);
+		old_dentry->d_inode->i_count++;
+		spin_unlock(&old_dentry->d_inode->i_lock);
 		new_dir_i->i_mtime = new_dir_i->i_ctime = ITIME(now);
 		return ret;
 	}
Index: linux-2.6/fs/jfs/jfs_txnmgr.c
===================================================================
--- linux-2.6.orig/fs/jfs/jfs_txnmgr.c
+++ linux-2.6/fs/jfs/jfs_txnmgr.c
@@ -1279,7 +1279,9 @@ int txCommit(tid_t tid,		/* transaction
 	 * lazy commit thread finishes processing
 	 */
 	if (tblk->xflag & COMMIT_DELETE) {
-		atomic_inc(&tblk->u.ip->i_count);
+		spin_lock(&tblk->u.ip->i_lock);
+		tblk->u.ip->i_count++;
+		spin_unlock(&tblk->u.ip->i_lock);
 		/*
 		 * Avoid a rare deadlock
 		 *
Index: linux-2.6/fs/jfs/namei.c
===================================================================
--- linux-2.6.orig/fs/jfs/namei.c
+++ linux-2.6/fs/jfs/namei.c
@@ -831,7 +831,9 @@ static int jfs_link(struct dentry *old_d
 	ip->i_ctime = CURRENT_TIME;
 	dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	mark_inode_dirty(dir);
-	atomic_inc(&ip->i_count);
+	spin_lock(&ip->i_lock);
+	ip->i_count++;
+	spin_unlock(&ip->i_lock);
 
 	iplist[0] = ip;
 	iplist[1] = dir;
Index: linux-2.6/fs/minix/namei.c
===================================================================
--- linux-2.6.orig/fs/minix/namei.c
+++ linux-2.6/fs/minix/namei.c
@@ -103,7 +103,9 @@ static int minix_link(struct dentry * ol
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	return add_nondir(dentry, inode);
 }
 
Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -365,7 +365,7 @@ nfs_fhget(struct super_block *sb, struct
 	dprintk("NFS: nfs_fhget(%s/%Ld ct=%d)\n",
 		inode->i_sb->s_id,
 		(long long)NFS_FILEID(inode),
-		atomic_read(&inode->i_count));
+		inode->i_count);
 
 out:
 	return inode;
@@ -1149,7 +1149,7 @@ static int nfs_update_inode(struct inode
 
 	dfprintk(VFS, "NFS: %s(%s/%ld ct=%d info=0x%x)\n",
 			__func__, inode->i_sb->s_id, inode->i_ino,
-			atomic_read(&inode->i_count), fattr->valid);
+			inode->i_count, fattr->valid);
 
 	if ((fattr->valid & NFS_ATTR_FATTR_FILEID) && nfsi->fileid != fattr->fileid)
 		goto out_fileid;
Index: linux-2.6/fs/nilfs2/mdt.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/mdt.c
+++ linux-2.6/fs/nilfs2/mdt.c
@@ -466,7 +466,7 @@ nilfs_mdt_new_common(struct the_nilfs *n
 		inode->i_sb = sb; /* sb may be NULL for some meta data files */
 		inode->i_blkbits = nilfs->ns_blocksize_bits;
 		inode->i_flags = 0;
-		atomic_set(&inode->i_count, 1);
+		inode->i_count = 1;
 		inode->i_nlink = 1;
 		inode->i_ino = ino;
 		inode->i_mode = S_IFREG;
Index: linux-2.6/fs/nilfs2/namei.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/namei.c
+++ linux-2.6/fs/nilfs2/namei.c
@@ -221,7 +221,9 @@ static int nilfs_link(struct dentry *old
 
 	inode->i_ctime = CURRENT_TIME;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	err = nilfs_add_nondir(dentry, inode);
 	if (!err)
Index: linux-2.6/fs/ocfs2/namei.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/namei.c
+++ linux-2.6/fs/ocfs2/namei.c
@@ -719,7 +719,9 @@ static int ocfs2_link(struct dentry *old
 		goto out_commit;
 	}
 
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	dentry->d_op = &ocfs2_dentry_ops;
 	d_instantiate(dentry, inode);
 
Index: linux-2.6/fs/reiserfs/file.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/file.c
+++ linux-2.6/fs/reiserfs/file.c
@@ -39,7 +39,7 @@ static int reiserfs_file_release(struct
 	BUG_ON(!S_ISREG(inode->i_mode));
 
 	/* fast out for when nothing needs to be done */
-	if ((atomic_read(&inode->i_count) > 1 ||
+	if ((inode->i_count > 1 ||
 	     !(REISERFS_I(inode)->i_flags & i_pack_on_close_mask) ||
 	     !tail_has_to_be_packed(inode)) &&
 	    REISERFS_I(inode)->i_prealloc_count <= 0) {
@@ -94,7 +94,7 @@ static int reiserfs_file_release(struct
 	if (!err)
 		err = jbegin_failure;
 
-	if (!err && atomic_read(&inode->i_count) <= 1 &&
+	if (!err && inode->i_count <= 1 &&
 	    (REISERFS_I(inode)->i_flags & i_pack_on_close_mask) &&
 	    tail_has_to_be_packed(inode)) {
 		/* if regular file is released by last holder and it has been
Index: linux-2.6/fs/reiserfs/namei.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/namei.c
+++ linux-2.6/fs/reiserfs/namei.c
@@ -1155,7 +1155,9 @@ static int reiserfs_link(struct dentry *
 	inode->i_ctime = CURRENT_TIME_SEC;
 	reiserfs_update_sd(&th, inode);
 
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	d_instantiate(dentry, inode);
 	retval = journal_end(&th, dir->i_sb, jbegin_count);
 	reiserfs_write_unlock(dir->i_sb);
Index: linux-2.6/fs/reiserfs/stree.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/stree.c
+++ linux-2.6/fs/reiserfs/stree.c
@@ -1440,7 +1440,7 @@ static int maybe_indirect_to_direct(stru
 	 ** reading in the last block.  The user will hit problems trying to
 	 ** read the file, but for now we just skip the indirect2direct
 	 */
-	if (atomic_read(&inode->i_count) > 1 ||
+	if (inode->i_count > 1 ||
 	    !tail_has_to_be_packed(inode) ||
 	    !page || (REISERFS_I(inode)->i_flags & i_nopack_mask)) {
 		/* leave tail in an unformatted node */
Index: linux-2.6/fs/sysv/namei.c
===================================================================
--- linux-2.6.orig/fs/sysv/namei.c
+++ linux-2.6/fs/sysv/namei.c
@@ -126,7 +126,9 @@ static int sysv_link(struct dentry * old
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	return add_nondir(dentry, inode);
 }
Index: linux-2.6/fs/ubifs/dir.c
===================================================================
--- linux-2.6.orig/fs/ubifs/dir.c
+++ linux-2.6/fs/ubifs/dir.c
@@ -538,7 +538,9 @@ static int ubifs_link(struct dentry *old
 
 	lock_2_inodes(dir, inode);
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	inode->i_ctime = ubifs_current_time(inode);
 	dir->i_size += sz_change;
 	dir_ui->ui_size = dir->i_size;
Index: linux-2.6/fs/ubifs/super.c
===================================================================
--- linux-2.6.orig/fs/ubifs/super.c
+++ linux-2.6/fs/ubifs/super.c
@@ -340,7 +340,7 @@ static void ubifs_delete_inode(struct in
 		goto out;
 
 	dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
-	ubifs_assert(!atomic_read(&inode->i_count));
+	ubifs_assert(!inode->i_count);
 	ubifs_assert(inode->i_nlink == 0);
 
 	truncate_inode_pages(&inode->i_data, 0);
Index: linux-2.6/fs/udf/namei.c
===================================================================
--- linux-2.6.orig/fs/udf/namei.c
+++ linux-2.6/fs/udf/namei.c
@@ -1091,7 +1091,9 @@ static int udf_link(struct dentry *old_d
 	inc_nlink(inode);
 	inode->i_ctime = current_fs_time(inode->i_sb);
 	mark_inode_dirty(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	d_instantiate(dentry, inode);
 	unlock_kernel();
 
Index: linux-2.6/fs/ufs/namei.c
===================================================================
--- linux-2.6.orig/fs/ufs/namei.c
+++ linux-2.6/fs/ufs/namei.c
@@ -178,7 +178,9 @@ static int ufs_link (struct dentry * old
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	error = ufs_add_nondir(dentry, inode);
 	unlock_kernel();



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 24/27] fs: icache atomic inodes_stat
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (22 preceding siblings ...)
  2009-04-25  1:20 ` [patch 23/27] fs: icache lock i_count npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 25/27] fs: icache lock lru/writeback lists npiggin
                   ` (3 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-inode_lock-scale-5.patch --]
[-- Type: text/plain, Size: 7135 bytes --]

Protect inodes_stat statistics with atomic ops rather than inode_lock.
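
The conversion is mechanical; a minimal sketch of the before and after
(only the two counters this patch touches):

	/* before: int fields, only ever modified under inode_lock */
	inodes_stat.nr_inodes++;

	/* after: atomic_t fields, inode_lock no longer required */
	atomic_inc(&inodes_stat.nr_inodes);
	nr = atomic_read(&inodes_stat.nr_inodes) -
		atomic_read(&inodes_stat.nr_unused);

This also lets dispose_list() batch its accounting with a single
atomic_sub() instead of re-taking inode_lock just for the counters.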
---
 fs/cifs/inode.c      |    2 +-
 fs/fs-writeback.c    |    3 ++-
 fs/hugetlbfs/inode.c |    6 +++---
 fs/inode.c           |   28 +++++++++++++++-------------
 include/linux/fs.h   |    5 +++--
 mm/page-writeback.c  |    3 ++-
 6 files changed, 26 insertions(+), 21 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -691,7 +691,8 @@ void sync_inodes_sb(struct super_block *
 		unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
 
 		wbc.nr_to_write = nr_dirty + nr_unstable +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+			(atomic_read(&inodes_stat.nr_inodes) -
+			atomic_read(&inodes_stat.nr_unused));
 	} else
 		wbc.nr_to_write = LONG_MAX; /* doesn't actually matter */
 
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -395,7 +395,7 @@ static void hugetlbfs_forget_inode(struc
 	if (!hlist_unhashed(&inode->i_hash)) {
 		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
 			list_move(&inode->i_list, &inode_unused);
-		inodes_stat.nr_unused++;
+		atomic_inc(&inodes_stat.nr_unused);
 		if (!sb || (sb->s_flags & MS_ACTIVE)) {
 			spin_unlock(&inode_lock);
 			return;
@@ -413,7 +413,7 @@ static void hugetlbfs_forget_inode(struc
 		spin_lock(&inode->i_lock);
 		inode->i_state &= ~I_WILL_FREE;
 		spin_unlock(&inode->i_lock);
-		inodes_stat.nr_unused--;
+		atomic_dec(&inodes_stat.nr_unused);
 		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
 		spin_unlock(&inode_hash_lock);
@@ -425,7 +425,7 @@ static void hugetlbfs_forget_inode(struc
 	spin_lock(&inode->i_lock);
 	inode->i_state |= I_FREEING;
 	spin_unlock(&inode->i_lock);
-	inodes_stat.nr_inodes--;
+	atomic_dec(&inodes_stat.nr_inodes);
 	spin_unlock(&inode_lock);
 	truncate_hugepages(inode, 0);
 	clear_inode(inode);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -99,7 +99,10 @@ static DEFINE_MUTEX(iprune_mutex);
 /*
  * Statistics gathering..
  */
-struct inodes_stat_t inodes_stat;
+struct inodes_stat_t inodes_stat = {
+	.nr_inodes = ATOMIC_INIT(0),
+	.nr_unused = ATOMIC_INIT(0),
+};
 
 static struct kmem_cache * inode_cachep __read_mostly;
 
@@ -276,7 +279,7 @@ void __iget(struct inode * inode)
 
 	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
 		list_move(&inode->i_list, &inode_in_use);
-	inodes_stat.nr_unused--;
+	atomic_dec(&inodes_stat.nr_unused);
 }
 
 /**
@@ -342,9 +345,7 @@ static void dispose_list(struct list_hea
 		destroy_inode(inode);
 		nr_disposed++;
 	}
-	spin_lock(&inode_lock);
-	inodes_stat.nr_inodes -= nr_disposed;
-	spin_unlock(&inode_lock);
+	atomic_sub(nr_disposed, &inodes_stat.nr_inodes);
 }
 
 /*
@@ -391,7 +392,7 @@ static int invalidate_list(struct list_h
 		busy = 1;
 	}
 	/* only unused inodes may be cached with i_count zero */
-	inodes_stat.nr_unused -= count;
+	atomic_sub(count, &inodes_stat.nr_unused);
 	return busy;
 }
 
@@ -498,7 +499,7 @@ static void prune_icache(int nr_to_scan)
 		spin_unlock(&inode->i_lock);
 		nr_pruned++;
 	}
-	inodes_stat.nr_unused -= nr_pruned;
+	atomic_sub(nr_pruned, &inodes_stat.nr_unused);
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
@@ -530,7 +531,8 @@ static int shrink_icache_memory(int nr,
 			return -1;
 		prune_icache(nr);
 	}
-	return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
+	return (atomic_read(&inodes_stat.nr_unused) / 100) *
+					sysctl_vfs_cache_pressure;
 }
 
 static struct shrinker icache_shrinker = {
@@ -619,7 +621,7 @@ static inline void
 __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
 			struct inode *inode)
 {
-	inodes_stat.nr_inodes++;
+	atomic_inc(&inodes_stat.nr_inodes);
 	spin_lock(&sb_inode_list_lock);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	spin_unlock(&sb_inode_list_lock);
@@ -1227,7 +1229,7 @@ void generic_delete_inode(struct inode *
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 	spin_unlock(&inode->i_lock);
-	inodes_stat.nr_inodes--;
+	atomic_dec(&inodes_stat.nr_inodes);
 	spin_unlock(&inode_lock);
 
 	security_inode_delete(inode);
@@ -1264,7 +1266,7 @@ static void generic_forget_inode(struct
 	if (!hlist_unhashed(&inode->i_hash)) {
 		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
 			list_move(&inode->i_list, &inode_unused);
-		inodes_stat.nr_unused++;
+		atomic_inc(&inodes_stat.nr_unused);
 		if (sb->s_flags & MS_ACTIVE) {
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&sb_inode_list_lock);
@@ -1282,7 +1284,7 @@ static void generic_forget_inode(struct
 		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
-		inodes_stat.nr_unused--;
+		atomic_dec(&inodes_stat.nr_unused);
 		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
 		spin_unlock(&inode_hash_lock);
@@ -1292,7 +1294,7 @@ static void generic_forget_inode(struct
 	spin_unlock(&sb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
+	atomic_dec(&inodes_stat.nr_inodes);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	if (inode->i_data.nrpages)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -8,6 +8,7 @@
 
 #include <linux/limits.h>
 #include <linux/ioctl.h>
+#include <asm/atomic.h>
 
 /*
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -39,8 +40,8 @@ struct files_stat_struct {
 };
 
 struct inodes_stat_t {
-	int nr_inodes;
-	int nr_unused;
+	atomic_t nr_inodes;
+	atomic_t nr_unused;
 	int dummy[5];		/* padding for sysctl ABI compatibility */
 };
 
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -775,7 +775,8 @@ static void wb_kupdate(unsigned long arg
 	next_jif = start_jif + msecs_to_jiffies(dirty_writeback_interval * 10);
 	nr_to_write = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+			(atomic_read(&inodes_stat.nr_inodes) -
+			atomic_read(&inodes_stat.nr_unused));
 	while (nr_to_write > 0) {
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
Index: linux-2.6/fs/cifs/inode.c
===================================================================
--- linux-2.6.orig/fs/cifs/inode.c
+++ linux-2.6/fs/cifs/inode.c
@@ -1507,7 +1507,7 @@ int cifs_revalidate(struct dentry *diren
 	}
 	cFYI(1, ("Revalidate: %s inode 0x%p count %d dentry: 0x%p d_time %ld "
 		 "jiffies %ld", full_path, direntry->d_inode,
-		 direntry->d_inode->i_count.counter, direntry,
+		 direntry->d_inode->i_count, direntry,
 		 direntry->d_time, jiffies));
 
 	if (cifsInode->time == 0) {



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 25/27] fs: icache lock lru/writeback lists
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (23 preceding siblings ...)
  2009-04-25  1:20 ` [patch 24/27] fs: icache atomic inodes_stat npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 26/27] fs: icache protect inode state npiggin
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-inode_lock-scale-6.patch --]
[-- Type: text/plain, Size: 12764 bytes --]

Add a new lock, wb_inode_list_lock, to protect i_list and the various
lists an inode can be moved between (inode_in_use, inode_unused, and the
per-sb s_dirty/s_io/s_more_io writeback lists).

XXX: haven't audited ocfs2
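
wb_inode_list_lock nests inside inode_lock, and inside i_lock on the
writeback path; where a list walker already holds the list lock and
still needs i_lock, it must use spin_trylock() and restart rather than
invert the order. A minimal sketch of the pattern used by prune_icache()
and generic_sync_sb_inodes():

	spin_lock(&inode_lock);
again:
	spin_lock(&wb_inode_list_lock);
	/* pick an inode off the list ... */
	if (!spin_trylock(&inode->i_lock)) {
		spin_unlock(&wb_inode_list_lock);
		goto again;	/* retake the list lock and rescan */
	}
	/* here inode_lock, wb_inode_list_lock and i_lock are all held */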
---
 fs/fs-writeback.c         |   41 ++++++++++++++++++++++++++++++++++------
 fs/hugetlbfs/inode.c      |   11 +++++++---
 fs/inode.c                |   47 ++++++++++++++++++++++++++++++++++++----------
 include/linux/writeback.h |    1 
 4 files changed, 81 insertions(+), 19 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -165,7 +165,9 @@ void __mark_inode_dirty(struct inode *in
 		 */
 		if (!was_dirty) {
 			inode->dirtied_when = jiffies;
+			spin_lock(&wb_inode_list_lock);
 			list_move(&inode->i_list, &sb->s_dirty);
+			spin_unlock(&wb_inode_list_lock);
 		}
 	}
 out:
@@ -195,12 +197,12 @@ static void redirty_tail(struct inode *i
 {
 	struct super_block *sb = inode->i_sb;
 
+	assert_spin_locked(&wb_inode_list_lock);
 	if (!list_empty(&sb->s_dirty)) {
 		struct inode *tail_inode;
 
 		tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
-		if (time_before(inode->dirtied_when,
-				tail_inode->dirtied_when))
+		if (time_before(inode->dirtied_when, tail_inode->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
 	list_move(&inode->i_list, &sb->s_dirty);
@@ -211,6 +213,7 @@ static void redirty_tail(struct inode *i
  */
 static void requeue_io(struct inode *inode)
 {
+	assert_spin_locked(&wb_inode_list_lock);
 	list_move(&inode->i_list, &inode->i_sb->s_more_io);
 }
 
@@ -245,6 +248,7 @@ static void move_expired_inodes(struct l
 			       struct list_head *dispatch_queue,
 				unsigned long *older_than_this)
 {
+	assert_spin_locked(&wb_inode_list_lock);
 	while (!list_empty(delaying_queue)) {
 		struct inode *inode = list_entry(delaying_queue->prev,
 						struct inode, i_list);
@@ -299,6 +303,7 @@ __sync_single_inode(struct inode *inode,
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY;
 
+	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 
@@ -319,6 +324,7 @@ __sync_single_inode(struct inode *inode,
 
 	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
+	spin_lock(&wb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
@@ -424,12 +430,14 @@ __writeback_single_inode(struct inode *i
 
 		wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 		do {
+			spin_unlock(&wb_inode_list_lock);
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 			__wait_on_bit(wqh, &wq, inode_wait,
 							TASK_UNINTERRUPTIBLE);
 			spin_lock(&inode_lock);
 			spin_lock(&inode->i_lock);
+			spin_lock(&wb_inode_list_lock);
 		} while (inode->i_state & I_SYNC);
 	}
 	return __sync_single_inode(inode, wbc);
@@ -467,6 +475,8 @@ void generic_sync_sb_inodes(struct super
 	int sync = wbc->sync_mode == WB_SYNC_ALL;
 
 	spin_lock(&inode_lock);
+again:
+	spin_lock(&wb_inode_list_lock);
 	if (!wbc->for_kupdate || list_empty(&sb->s_io))
 		queue_io(sb, wbc->older_than_this);
 
@@ -477,6 +487,11 @@ void generic_sync_sb_inodes(struct super
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		long pages_skipped;
 
+		if (!spin_trylock(&inode->i_lock)) {
+			spin_unlock(&wb_inode_list_lock);
+			goto again;
+		}
+
 		if (!bdi_cap_writeback_dirty(bdi)) {
 			redirty_tail(inode);
 			if (sb_is_blkdev_sb(sb)) {
@@ -484,6 +499,7 @@ void generic_sync_sb_inodes(struct super
 				 * Dirty memory-backed blockdev: the ramdisk
 				 * driver does this.  Skip just this inode
 				 */
+				spin_unlock(&inode->i_lock);
 				continue;
 			}
 			/*
@@ -491,28 +507,34 @@ void generic_sync_sb_inodes(struct super
 			 * than the kernel-internal bdev filesystem.  Skip the
 			 * entire superblock.
 			 */
+			spin_unlock(&inode->i_lock);
 			break;
 		}
 
 		if (wbc->nonblocking && bdi_write_congested(bdi)) {
 			wbc->encountered_congestion = 1;
-			if (!sb_is_blkdev_sb(sb))
+			if (!sb_is_blkdev_sb(sb)) {
+				spin_unlock(&inode->i_lock);
 				break;		/* Skip a congested fs */
+			}
 			requeue_io(inode);
+			spin_unlock(&inode->i_lock);
 			continue;		/* Skip a congested blockdev */
 		}
 
 		if (wbc->bdi && bdi != wbc->bdi) {
-			if (!sb_is_blkdev_sb(sb))
+			if (!sb_is_blkdev_sb(sb)) {
+				spin_unlock(&inode->i_lock);
 				break;		/* fs has the wrong queue */
+			}
 			requeue_io(inode);
+			spin_unlock(&inode->i_lock);
 			continue;		/* blockdev has wrong queue */
 		}
 
-		spin_lock(&inode->i_lock);
 		if (inode->i_state & I_NEW) {
-			spin_unlock(&inode->i_lock);
 			requeue_io(inode);
+			spin_unlock(&inode->i_lock);
 			continue;
 		}
 
@@ -544,11 +566,13 @@ void generic_sync_sb_inodes(struct super
 			 */
 			redirty_tail(inode);
 		}
+		spin_unlock(&wb_inode_list_lock);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		iput(inode);
 		cond_resched();
 		spin_lock(&inode_lock);
+		spin_lock(&wb_inode_list_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
 			break;
@@ -556,6 +580,7 @@ void generic_sync_sb_inodes(struct super
 		if (!list_empty(&sb->s_more_io))
 			wbc->more_io = 1;
 	}
+	spin_unlock(&wb_inode_list_lock);
 
 	if (sync) {
 		struct inode *inode, *old_inode = NULL;
@@ -774,7 +799,9 @@ int write_inode_now(struct inode *inode,
 	might_sleep();
 	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
+	spin_lock(&wb_inode_list_lock);
 	ret = __writeback_single_inode(inode, &wbc);
+	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	if (sync)
@@ -800,7 +827,9 @@ int sync_inode(struct inode *inode, stru
 
 	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
+	spin_lock(&wb_inode_list_lock);
 	ret = __writeback_single_inode(inode, wbc);
+	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	return ret;
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -393,8 +393,11 @@ static void hugetlbfs_forget_inode(struc
 	struct super_block *sb = inode->i_sb;
 
 	if (!hlist_unhashed(&inode->i_hash)) {
-		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+		if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
+			spin_lock(&wb_inode_list_lock);
 			list_move(&inode->i_list, &inode_unused);
+			spin_unlock(&wb_inode_list_lock);
+		}
 		atomic_inc(&inodes_stat.nr_unused);
 		if (!sb || (sb->s_flags & MS_ACTIVE)) {
 			spin_unlock(&inode_lock);
@@ -412,13 +415,15 @@ static void hugetlbfs_forget_inode(struc
 		spin_lock(&inode_lock);
 		spin_lock(&inode->i_lock);
 		inode->i_state &= ~I_WILL_FREE;
-		spin_unlock(&inode->i_lock);
-		atomic_dec(&inodes_stat.nr_unused);
 		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
 		spin_unlock(&inode_hash_lock);
+		spin_unlock(&inode->i_lock);
+		atomic_dec(&inodes_stat.nr_unused);
 	}
+	spin_lock(&wb_inode_list_lock);
 	list_del_init(&inode->i_list);
+	spin_unlock(&wb_inode_list_lock);
 	spin_lock(&sb_inode_list_lock);
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&sb_inode_list_lock);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -84,6 +84,7 @@ static struct hlist_head *inode_hashtabl
  */
 DEFINE_SPINLOCK(inode_lock);
 DEFINE_SPINLOCK(sb_inode_list_lock);
+DEFINE_SPINLOCK(wb_inode_list_lock);
 DEFINE_SPINLOCK(inode_hash_lock);
 
 /*
@@ -277,8 +278,11 @@ void __iget(struct inode * inode)
 	if (inode->i_count > 1)
 		return;
 
-	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+	if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
+		spin_lock(&wb_inode_list_lock);
 		list_move(&inode->i_list, &inode_in_use);
+		spin_unlock(&wb_inode_list_lock);
+	}
 	atomic_dec(&inodes_stat.nr_unused);
 }
 
@@ -381,7 +385,9 @@ static int invalidate_list(struct list_h
 		}
 		invalidate_inode_buffers(inode);
 		if (!inode->i_count) {
+			spin_lock(&wb_inode_list_lock);
 			list_move(&inode->i_list, dispose);
+			spin_unlock(&wb_inode_list_lock);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
 			spin_unlock(&inode->i_lock);
@@ -460,6 +466,8 @@ static void prune_icache(int nr_to_scan)
 
 	mutex_lock(&iprune_mutex);
 	spin_lock(&inode_lock);
+again:
+	spin_lock(&wb_inode_list_lock);
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
 		struct inode *inode;
 
@@ -468,13 +476,17 @@ static void prune_icache(int nr_to_scan)
 
 		inode = list_entry(inode_unused.prev, struct inode, i_list);
 
-		spin_lock(&inode->i_lock);
+		if (!spin_trylock(&inode->i_lock)) {
+			spin_unlock(&wb_inode_list_lock);
+			goto again;
+		}
 		if (inode->i_state || inode->i_count) {
 			list_move(&inode->i_list, &inode_unused);
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+			spin_unlock(&wb_inode_list_lock);
 			__iget(inode);
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
@@ -483,11 +495,16 @@ static void prune_icache(int nr_to_scan)
 								0, -1);
 			iput(inode);
 			spin_lock(&inode_lock);
+again2:
+			spin_lock(&wb_inode_list_lock);
 
 			if (inode != list_entry(inode_unused.next,
 						struct inode, i_list))
 				continue;	/* wrong inode or list_empty */
-			spin_lock(&inode->i_lock);
+			if (!spin_trylock(&inode->i_lock)) {
+				spin_unlock(&wb_inode_list_lock);
+				goto again2;
+			}
 			if (!can_unuse(inode)) {
 				spin_unlock(&inode->i_lock);
 				continue;
@@ -505,6 +522,7 @@ static void prune_icache(int nr_to_scan)
 	else
 		__count_vm_events(PGINODESTEAL, reap);
 	spin_unlock(&inode_lock);
+	spin_unlock(&wb_inode_list_lock);
 
 	dispose_list(&freeable);
 	mutex_unlock(&iprune_mutex);
@@ -625,7 +643,9 @@ __inode_add_to_lists(struct super_block
 	spin_lock(&sb_inode_list_lock);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	spin_unlock(&sb_inode_list_lock);
+	spin_lock(&wb_inode_list_lock);
 	list_add(&inode->i_list, &inode_in_use);
+	spin_unlock(&wb_inode_list_lock);
 	if (head) {
 		spin_lock(&inode_hash_lock);
 		hlist_add_head(&inode->i_hash, head);
@@ -1223,14 +1243,16 @@ void generic_delete_inode(struct inode *
 {
 	const struct super_operations *op = inode->i_sb->s_op;
 
+	spin_lock(&wb_inode_list_lock);
 	list_del_init(&inode->i_list);
+	spin_unlock(&wb_inode_list_lock);
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&sb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 	spin_unlock(&inode->i_lock);
-	atomic_dec(&inodes_stat.nr_inodes);
 	spin_unlock(&inode_lock);
+	atomic_dec(&inodes_stat.nr_inodes);
 
 	security_inode_delete(inode);
 
@@ -1264,8 +1286,11 @@ static void generic_forget_inode(struct
 	struct super_block *sb = inode->i_sb;
 
 	if (!hlist_unhashed(&inode->i_hash)) {
-		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+		if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
+			spin_lock(&wb_inode_list_lock);
 			list_move(&inode->i_list, &inode_unused);
+			spin_unlock(&wb_inode_list_lock);
+		}
 		atomic_inc(&inodes_stat.nr_unused);
 		if (sb->s_flags & MS_ACTIVE) {
 			spin_unlock(&inode->i_lock);
@@ -1289,14 +1314,16 @@ static void generic_forget_inode(struct
 		hlist_del_init(&inode->i_hash);
 		spin_unlock(&inode_hash_lock);
 	}
+	spin_lock(&wb_inode_list_lock);
 	list_del_init(&inode->i_list);
+	spin_unlock(&wb_inode_list_lock);
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&sb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
-	atomic_dec(&inodes_stat.nr_inodes);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
+	atomic_dec(&inodes_stat.nr_inodes);
 	if (inode->i_data.nrpages)
 		truncate_inode_pages(&inode->i_data, 0);
 	clear_inode(inode);
@@ -1354,17 +1381,17 @@ void iput(struct inode *inode)
 	if (inode) {
 		BUG_ON(inode->i_state == I_CLEAR);
 
-retry:
+retry1:
 		spin_lock(&inode->i_lock);
 		if (inode->i_count == 1) {
 			if (!spin_trylock(&inode_lock)) {
+retry2:
 				spin_unlock(&inode->i_lock);
-				goto retry;
+				goto retry1;
 			}
 			if (!spin_trylock(&sb_inode_list_lock)) {
 				spin_unlock(&inode_lock);
-				spin_unlock(&inode->i_lock);
-				goto retry;
+				goto retry2;
 			}
 			inode->i_count--;
 			iput_final(inode);
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -11,6 +11,7 @@ struct backing_dev_info;
 
 extern spinlock_t inode_lock;
 extern spinlock_t sb_inode_list_lock;
+extern spinlock_t wb_inode_list_lock;
 extern spinlock_t inode_hash_lock;
 extern struct list_head inode_in_use;
 extern struct list_head inode_unused;



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 26/27] fs: icache protect inode state
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (24 preceding siblings ...)
  2009-04-25  1:20 ` [patch 25/27] fs: icache lock lru/writeback lists npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  1:20 ` [patch 27/27] fs: icache remove inode_lock npiggin
  2009-04-25  4:18 ` [patch 00/27] [rfc] vfs scalability patchset Al Viro
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-inode_lock-scale-6b.patch --]
[-- Type: text/plain, Size: 6769 bytes --]

Protect i_hash, i_sb_list etc members with i_lock.
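
The nesting order this ends up using on the common paths, sketched here
for reference rather than lifted from any single function:

	spin_lock(&inode_lock);
	spin_lock(&sb_inode_list_lock);
	spin_lock(&inode->i_lock);
	spin_lock(&inode_hash_lock);	/* or wb_inode_list_lock */
	...
	spin_unlock(&inode_hash_lock);
	spin_unlock(&inode->i_lock);
	spin_unlock(&sb_inode_list_lock);
	spin_unlock(&inode_lock);

(prune_icache is the exception: it trylocks i_lock under
wb_inode_list_lock and backs off on failure, to avoid inverting this
order.)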
---
 fs/hugetlbfs/inode.c |   14 +++++++++-----
 fs/inode.c           |   30 +++++++++++++++++++++++++++---
 2 files changed, 36 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -337,12 +337,14 @@ static void dispose_list(struct list_hea
 		clear_inode(inode);
 
 		spin_lock(&inode_lock);
+		spin_lock(&sb_inode_list_lock);
+		spin_lock(&inode->i_lock);
 		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
 		spin_unlock(&inode_hash_lock);
-		spin_lock(&sb_inode_list_lock);
 		list_del_init(&inode->i_sb_list);
 		spin_unlock(&sb_inode_list_lock);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 
 		wake_up_inode(inode);
@@ -640,7 +642,6 @@ __inode_add_to_lists(struct super_block
 			struct inode *inode)
 {
 	atomic_inc(&inodes_stat.nr_inodes);
-	spin_lock(&sb_inode_list_lock);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	spin_unlock(&sb_inode_list_lock);
 	spin_lock(&wb_inode_list_lock);
@@ -670,7 +671,10 @@ void inode_add_to_lists(struct super_blo
 	struct hlist_head *head = inode_hashtable + hash(sb, inode->i_ino);
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb_inode_list_lock);
+	spin_lock(&inode->i_lock);
 	__inode_add_to_lists(sb, head, inode);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL_GPL(inode_add_to_lists);
@@ -702,9 +706,12 @@ struct inode *new_inode(struct super_blo
 	inode = alloc_inode(sb);
 	if (inode) {
 		spin_lock(&inode_lock);
+		spin_lock(&sb_inode_list_lock);
+		spin_lock(&inode->i_lock);
 		inode->i_ino = ++last_ino;
 		inode->i_state = 0;
 		__inode_add_to_lists(sb, NULL, inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 	}
 	return inode;
@@ -759,11 +766,14 @@ static struct inode * get_new_inode(stru
 		/* We released the lock, so.. */
 		old = find_inode(sb, head, test, data);
 		if (!old) {
+			spin_lock(&sb_inode_list_lock);
+			spin_lock(&inode->i_lock);
 			if (set(inode, data))
 				goto set_failed;
 
 			inode->i_state = I_LOCK|I_NEW;
 			__inode_add_to_lists(sb, head, inode);
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -809,9 +819,12 @@ static struct inode * get_new_inode_fast
 		/* We released the lock, so.. */
 		old = find_inode_fast(sb, head, ino);
 		if (!old) {
+			spin_lock(&sb_inode_list_lock);
+			spin_lock(&inode->i_lock);
 			inode->i_ino = ino;
 			inode->i_state = I_LOCK|I_NEW;
 			__inode_add_to_lists(sb, head, inode);
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -1137,9 +1150,11 @@ int insert_inode_locked(struct inode *in
 		spin_lock(&inode_lock);
 		old = find_inode_fast(sb, head, ino);
 		if (likely(!old)) {
+			spin_lock(&inode->i_lock);
 			spin_lock(&inode_hash_lock);
 			hlist_add_head(&inode->i_hash, head);
 			spin_unlock(&inode_hash_lock);
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 			return 0;
 		}
@@ -1170,9 +1185,11 @@ int insert_inode_locked4(struct inode *i
 		spin_lock(&inode_lock);
 		old = find_inode(sb, head, test, data);
 		if (likely(!old)) {
+			spin_lock(&inode->i_lock);
 			spin_lock(&inode_hash_lock);
 			hlist_add_head(&inode->i_hash, head);
 			spin_unlock(&inode_hash_lock);
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 			return 0;
 		}
@@ -1201,10 +1218,13 @@ EXPORT_SYMBOL(insert_inode_locked4);
 void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 {
 	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	spin_lock(&inode_hash_lock);
 	hlist_add_head(&inode->i_hash, head);
 	spin_unlock(&inode_hash_lock);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 }
 
@@ -1219,9 +1239,11 @@ EXPORT_SYMBOL(__insert_inode_hash);
 void remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	spin_lock(&inode_hash_lock);
 	hlist_del_init(&inode->i_hash);
 	spin_unlock(&inode_hash_lock);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 }
 
@@ -1270,9 +1292,11 @@ void generic_delete_inode(struct inode *
 		clear_inode(inode);
 	}
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	spin_lock(&inode_hash_lock);
 	hlist_del_init(&inode->i_hash);
 	spin_unlock(&inode_hash_lock);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	wake_up_inode(inode);
 	BUG_ON(inode->i_state != I_CLEAR);
@@ -1309,10 +1333,10 @@ static void generic_forget_inode(struct
 		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
-		atomic_dec(&inodes_stat.nr_unused);
 		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
 		spin_unlock(&inode_hash_lock);
+		atomic_dec(&inodes_stat.nr_unused);
 	}
 	spin_lock(&wb_inode_list_lock);
 	list_del_init(&inode->i_list);
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -400,12 +400,15 @@ static void hugetlbfs_forget_inode(struc
 		}
 		atomic_inc(&inodes_stat.nr_unused);
 		if (!sb || (sb->s_flags & MS_ACTIVE)) {
+			spin_unlock(&inode->i_lock);
+			spin_unlock(&sb_inode_list_lock);
 			spin_unlock(&inode_lock);
 			return;
 		}
-		spin_lock(&inode->i_lock);
+		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_WILL_FREE;
 		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 		/*
 		 * write_inode_now is a noop as we set BDI_CAP_NO_WRITEBACK
@@ -413,27 +416,28 @@ static void hugetlbfs_forget_inode(struc
 		 */
 		write_inode_now(inode, 1);
 		spin_lock(&inode_lock);
+		spin_lock(&sb_inode_list_lock);
 		spin_lock(&inode->i_lock);
+		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
 		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
 		spin_unlock(&inode_hash_lock);
-		spin_unlock(&inode->i_lock);
 		atomic_dec(&inodes_stat.nr_unused);
 	}
 	spin_lock(&wb_inode_list_lock);
 	list_del_init(&inode->i_list);
 	spin_unlock(&wb_inode_list_lock);
-	spin_lock(&sb_inode_list_lock);
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&sb_inode_list_lock);
-	spin_lock(&inode->i_lock);
+	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 	spin_unlock(&inode->i_lock);
-	atomic_dec(&inodes_stat.nr_unused);
 	spin_unlock(&inode_lock);
+	atomic_dec(&inodes_stat.nr_unused);
 	truncate_hugepages(inode, 0);
 	clear_inode(inode);
+	/* XXX: why no wake_up_inode? */
 	destroy_inode(inode);
 }
 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch 27/27] fs: icache remove inode_lock
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (25 preceding siblings ...)
  2009-04-25  1:20 ` [patch 26/27] fs: icache protect inode state npiggin
@ 2009-04-25  1:20 ` npiggin
  2009-04-25  4:18 ` [patch 00/27] [rfc] vfs scalability patchset Al Viro
  27 siblings, 0 replies; 50+ messages in thread
From: npiggin @ 2009-04-25  1:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

[-- Attachment #1: fs-inode_lock-scale-7.patch --]
[-- Type: text/plain, Size: 21171 bytes --]

Remove the global inode_lock
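
With it gone, i_state is protected by i_lock alone; e.g. the
__mark_inode_dirty fast path below boils down to (a sketch):

	spin_lock(&inode->i_lock);
	if ((inode->i_state & flags) != flags) {
		inode->i_state |= flags;
		/* dirty/io list moves still take wb_inode_list_lock */
	}
	spin_unlock(&inode->i_lock);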
---
 fs/buffer.c                 |    2 -
 fs/drop_caches.c            |    4 --
 fs/fs-writeback.c           |   23 +-------------
 fs/hugetlbfs/inode.c        |    6 ---
 fs/inode.c                  |   71 ++++----------------------------------------
 fs/notify/inotify/inotify.c |    2 -
 fs/quota/dquot.c            |    6 ---
 include/linux/writeback.h   |    1 
 8 files changed, 11 insertions(+), 104 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1145,7 +1145,7 @@ __getblk_slow(struct block_device *bdev,
  * inode list.
  *
  * mark_buffer_dirty() is atomic.  It takes bh->b_page->mapping->private_lock,
- * mapping->tree_lock and the global inode_lock.
+ * and mapping->tree_lock.
  */
 void mark_buffer_dirty(struct buffer_head *bh)
 {
Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c
+++ linux-2.6/fs/drop_caches.c
@@ -16,7 +16,6 @@ static void drop_pagecache_sb(struct sup
 {
 	struct inode *inode, *toput_inode = NULL;
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
@@ -28,15 +27,12 @@ static void drop_pagecache_sb(struct sup
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb_inode_list_lock);
-		spin_unlock(&inode_lock);
 		__invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
 		iput(toput_inode);
 		toput_inode = inode;
-		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 	}
 	spin_unlock(&sb_inode_list_lock);
-	spin_unlock(&inode_lock);
 	iput(toput_inode);
 }
 
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -133,7 +133,6 @@ void __mark_inode_dirty(struct inode *in
 			       name, inode->i_sb->s_id);
 	}
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
@@ -172,7 +171,6 @@ void __mark_inode_dirty(struct inode *in
 	}
 out:
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 }
 
 EXPORT_SYMBOL(__mark_inode_dirty);
@@ -220,7 +218,7 @@ static void requeue_io(struct inode *ino
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
-	 * Prevent speculative execution through spin_unlock(&inode_lock);
+	 * Prevent speculative execution through spin_unlock(&inode->i_lock);
 	 */
 	smp_mb();
 	wake_up_bit(&inode->i_state, __I_SYNC);
@@ -305,7 +303,6 @@ __sync_single_inode(struct inode *inode,
 
 	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 
 	ret = do_writepages(mapping, wbc);
 
@@ -322,7 +319,6 @@ __sync_single_inode(struct inode *inode,
 			ret = err;
 	}
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	spin_lock(&wb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
@@ -432,10 +428,8 @@ __writeback_single_inode(struct inode *i
 		do {
 			spin_unlock(&wb_inode_list_lock);
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&inode_lock);
 			__wait_on_bit(wqh, &wq, inode_wait,
 							TASK_UNINTERRUPTIBLE);
-			spin_lock(&inode_lock);
 			spin_lock(&inode->i_lock);
 			spin_lock(&wb_inode_list_lock);
 		} while (inode->i_state & I_SYNC);
@@ -474,7 +468,6 @@ void generic_sync_sb_inodes(struct super
 	const unsigned long start = jiffies;	/* livelock avoidance */
 	int sync = wbc->sync_mode == WB_SYNC_ALL;
 
-	spin_lock(&inode_lock);
 again:
 	spin_lock(&wb_inode_list_lock);
 	if (!wbc->for_kupdate || list_empty(&sb->s_io))
@@ -568,10 +561,8 @@ again:
 		}
 		spin_unlock(&wb_inode_list_lock);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		iput(inode);
 		cond_resched();
-		spin_lock(&inode_lock);
 		spin_lock(&wb_inode_list_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
@@ -606,7 +597,6 @@ again:
 			__iget(inode);
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&sb_inode_list_lock);
-			spin_unlock(&inode_lock);
 			/*
 			 * We hold a reference to 'inode' so it couldn't have
 			 * been removed from s_inodes list while we dropped the
@@ -622,14 +612,11 @@ again:
 
 			cond_resched();
 
-			spin_lock(&inode_lock);
 			spin_lock(&sb_inode_list_lock);
 		}
 		spin_unlock(&sb_inode_list_lock);
-		spin_unlock(&inode_lock);
 		iput(old_inode);
-	} else
-		spin_unlock(&inode_lock);
+	}
 
 	return;		/* Leave any unwritten inodes on s_io */
 }
@@ -797,13 +784,11 @@ int write_inode_now(struct inode *inode,
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	spin_lock(&wb_inode_list_lock);
 	ret = __writeback_single_inode(inode, &wbc);
 	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	if (sync)
 		inode_sync_wait(inode);
 	return ret;
@@ -825,13 +810,11 @@ int sync_inode(struct inode *inode, stru
 {
 	int ret;
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	spin_lock(&wb_inode_list_lock);
 	ret = __writeback_single_inode(inode, wbc);
 	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	return ret;
 }
 EXPORT_SYMBOL(sync_inode);
@@ -872,13 +855,11 @@ int generic_osync_inode(struct inode *in
 			err = err2;
 	}
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	if ((inode->i_state & I_DIRTY) &&
 	    ((what & OSYNC_INODE) || (inode->i_state & I_DIRTY_DATASYNC)))
 		need_write_inode_now = 1;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 
 	if (need_write_inode_now) {
 		err2 = write_inode_now(inode, 1);
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -388,7 +388,7 @@ static void hugetlbfs_delete_inode(struc
 	clear_inode(inode);
 }
 
-static void hugetlbfs_forget_inode(struct inode *inode) __releases(inode_lock)
+static void hugetlbfs_forget_inode(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 
@@ -402,20 +402,17 @@ static void hugetlbfs_forget_inode(struc
 		if (!sb || (sb->s_flags & MS_ACTIVE)) {
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&sb_inode_list_lock);
-			spin_unlock(&inode_lock);
 			return;
 		}
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_WILL_FREE;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb_inode_list_lock);
-		spin_unlock(&inode_lock);
 		/*
 		 * write_inode_now is a noop as we set BDI_CAP_NO_WRITEBACK
 		 * in our backing_dev_info.
 		 */
 		write_inode_now(inode, 1);
-		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
@@ -433,7 +430,6 @@ static void hugetlbfs_forget_inode(struc
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	atomic_dec(&inodes_stat.nr_unused);
 	truncate_hugepages(inode, 0);
 	clear_inode(inode);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -82,7 +82,6 @@ static struct hlist_head *inode_hashtabl
  * NOTE! You also have to own the lock if you change
  * the i_state of an inode while it is in use..
  */
-DEFINE_SPINLOCK(inode_lock);
 DEFINE_SPINLOCK(sb_inode_list_lock);
 DEFINE_SPINLOCK(wb_inode_list_lock);
 DEFINE_SPINLOCK(inode_hash_lock);
@@ -336,16 +335,14 @@ static void dispose_list(struct list_hea
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
 
-		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 		spin_lock(&inode->i_lock);
 		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
 		spin_unlock(&inode_hash_lock);
 		list_del_init(&inode->i_sb_list);
-		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
+		spin_unlock(&sb_inode_list_lock);
 
 		wake_up_inode(inode);
 		destroy_inode(inode);
@@ -373,7 +370,6 @@ static int invalidate_list(struct list_h
 		 * change during umount anymore, and because iprune_mutex keeps
 		 * shrink_icache_memory() away.
 		 */
-		cond_resched_lock(&inode_lock);
 		cond_resched_lock(&sb_inode_list_lock);
 
 		next = next->next;
@@ -418,12 +414,10 @@ int invalidate_inodes(struct super_block
 	LIST_HEAD(throw_away);
 
 	mutex_lock(&iprune_mutex);
-	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 	inotify_unmount_inodes(&sb->s_inodes);
 	busy = invalidate_list(&sb->s_inodes, &throw_away);
 	spin_unlock(&sb_inode_list_lock);
-	spin_unlock(&inode_lock);
 
 	dispose_list(&throw_away);
 	mutex_unlock(&iprune_mutex);
@@ -467,7 +461,6 @@ static void prune_icache(int nr_to_scan)
 	unsigned long reap = 0;
 
 	mutex_lock(&iprune_mutex);
-	spin_lock(&inode_lock);
 again:
 	spin_lock(&wb_inode_list_lock);
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
@@ -491,12 +484,10 @@ again:
 			spin_unlock(&wb_inode_list_lock);
 			__iget(inode);
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
 								0, -1);
 			iput(inode);
-			spin_lock(&inode_lock);
 again2:
 			spin_lock(&wb_inode_list_lock);
 
@@ -523,7 +514,6 @@ again2:
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
 		__count_vm_events(PGINODESTEAL, reap);
-	spin_unlock(&inode_lock);
 	spin_unlock(&wb_inode_list_lock);
 
 	dispose_list(&freeable);
@@ -670,12 +660,10 @@ void inode_add_to_lists(struct super_blo
 {
 	struct hlist_head *head = inode_hashtable + hash(sb, inode->i_ino);
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 	spin_lock(&inode->i_lock);
 	__inode_add_to_lists(sb, head, inode);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL_GPL(inode_add_to_lists);
 
@@ -698,21 +686,17 @@ struct inode *new_inode(struct super_blo
 	 * error if st_ino won't fit in target struct field. Use 32bit counter
 	 * here to attempt to avoid that.
 	 */
-	static unsigned int last_ino;
+	static unsigned int last_ino; /* protected with sb_inode_list_lock for now */
 	struct inode * inode;
 
-	spin_lock_prefetch(&inode_lock);
-	
 	inode = alloc_inode(sb);
 	if (inode) {
-		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 		spin_lock(&inode->i_lock);
 		inode->i_ino = ++last_ino;
 		inode->i_state = 0;
 		__inode_add_to_lists(sb, NULL, inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 	}
 	return inode;
 }
@@ -762,7 +746,6 @@ static struct inode * get_new_inode(stru
 	if (inode) {
 		struct inode * old;
 
-		spin_lock(&inode_lock);
 		/* We released the lock, so.. */
 		old = find_inode(sb, head, test, data);
 		if (!old) {
@@ -774,7 +757,6 @@ static struct inode * get_new_inode(stru
 			inode->i_state = I_LOCK|I_NEW;
 			__inode_add_to_lists(sb, head, inode);
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -789,7 +771,6 @@ static struct inode * get_new_inode(stru
 		 */
 		__iget(old);
 		spin_unlock(&old->i_lock);
-		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
 		wait_on_inode(inode);
@@ -798,7 +779,6 @@ static struct inode * get_new_inode(stru
 
 set_failed:
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	destroy_inode(inode);
 	return NULL;
 }
@@ -815,7 +795,6 @@ static struct inode * get_new_inode_fast
 	if (inode) {
 		struct inode * old;
 
-		spin_lock(&inode_lock);
 		/* We released the lock, so.. */
 		old = find_inode_fast(sb, head, ino);
 		if (!old) {
@@ -825,7 +804,6 @@ static struct inode * get_new_inode_fast
 			inode->i_state = I_LOCK|I_NEW;
 			__inode_add_to_lists(sb, head, inode);
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -840,7 +818,6 @@ static struct inode * get_new_inode_fast
 		 */
 		__iget(old);
 		spin_unlock(&old->i_lock);
-		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
 		wait_on_inode(inode);
@@ -874,16 +851,16 @@ ino_t iunique(struct super_block *sb, in
 	struct hlist_head *head;
 	ino_t res;
 
-	spin_lock(&inode_lock);
 	do {
+		spin_lock(&sb_inode_list_lock); /* xxx: hack to protect counter */
 		if (counter <= max_reserved)
 			counter = max_reserved + 1;
 		res = counter++;
+		spin_unlock(&sb_inode_list_lock);
 		head = inode_hashtable + hash(sb, res);
 		inode = find_inode_fast(sb, head, res);
 		spin_unlock(&inode->i_lock);
 	} while (inode != NULL);
-	spin_unlock(&inode_lock);
 
 	return res;
 }
@@ -893,7 +870,6 @@ struct inode *igrab(struct inode *inode)
 {
 	struct inode *ret = inode;
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	if (!(inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)))
 		__iget(inode);
@@ -905,7 +881,6 @@ struct inode *igrab(struct inode *inode)
 		 */
 		ret = NULL;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 
 	return ret;
 }
@@ -937,17 +912,14 @@ static struct inode *ifind(struct super_
 {
 	struct inode *inode;
 
-	spin_lock(&inode_lock);
 	inode = find_inode(sb, head, test, data);
 	if (inode) {
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
 		return inode;
 	}
-	spin_unlock(&inode_lock);
 	return NULL;
 }
 
@@ -971,16 +943,13 @@ static struct inode *ifind_fast(struct s
 {
 	struct inode *inode;
 
-	spin_lock(&inode_lock);
 	inode = find_inode_fast(sb, head, ino);
 	if (inode) {
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		wait_on_inode(inode);
 		return inode;
 	}
-	spin_unlock(&inode_lock);
 	return NULL;
 }
 
@@ -1147,7 +1116,6 @@ int insert_inode_locked(struct inode *in
 
 	inode->i_state |= I_LOCK|I_NEW;
 	while (1) {
-		spin_lock(&inode_lock);
 		old = find_inode_fast(sb, head, ino);
 		if (likely(!old)) {
 			spin_lock(&inode->i_lock);
@@ -1155,12 +1123,10 @@ int insert_inode_locked(struct inode *in
 			hlist_add_head(&inode->i_hash, head);
 			spin_unlock(&inode_hash_lock);
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&inode_lock);
 			return 0;
 		}
 		__iget(old);
 		spin_unlock(&old->i_lock);
-		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
 			iput(old);
@@ -1169,7 +1135,6 @@ int insert_inode_locked(struct inode *in
 		iput(old);
 	}
 }
-
 EXPORT_SYMBOL(insert_inode_locked);
 
 int insert_inode_locked4(struct inode *inode, unsigned long hashval,
@@ -1182,7 +1147,6 @@ int insert_inode_locked4(struct inode *i
 	inode->i_state |= I_LOCK|I_NEW;
 
 	while (1) {
-		spin_lock(&inode_lock);
 		old = find_inode(sb, head, test, data);
 		if (likely(!old)) {
 			spin_lock(&inode->i_lock);
@@ -1190,12 +1154,10 @@ int insert_inode_locked4(struct inode *i
 			hlist_add_head(&inode->i_hash, head);
 			spin_unlock(&inode_hash_lock);
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&inode_lock);
 			return 0;
 		}
 		__iget(old);
 		spin_unlock(&old->i_lock);
-		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
 			iput(old);
@@ -1204,7 +1166,6 @@ int insert_inode_locked4(struct inode *i
 		iput(old);
 	}
 }
-
 EXPORT_SYMBOL(insert_inode_locked4);
 
 /**
@@ -1219,13 +1180,11 @@ void __insert_inode_hash(struct inode *i
 {
 	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	spin_lock(&inode_hash_lock);
 	hlist_add_head(&inode->i_hash, head);
 	spin_unlock(&inode_hash_lock);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 }
 
 EXPORT_SYMBOL(__insert_inode_hash);
@@ -1238,13 +1197,11 @@ EXPORT_SYMBOL(__insert_inode_hash);
  */
 void remove_inode_hash(struct inode *inode)
 {
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	spin_lock(&inode_hash_lock);
 	hlist_del_init(&inode->i_hash);
 	spin_unlock(&inode_hash_lock);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 }
 
 EXPORT_SYMBOL(remove_inode_hash);
@@ -1273,7 +1230,6 @@ void generic_delete_inode(struct inode *
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	atomic_dec(&inodes_stat.nr_inodes);
 
 	security_inode_delete(inode);
@@ -1291,13 +1247,11 @@ void generic_delete_inode(struct inode *
 		truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
 	}
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	spin_lock(&inode_hash_lock);
 	hlist_del_init(&inode->i_hash);
 	spin_unlock(&inode_hash_lock);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	wake_up_inode(inode);
 	BUG_ON(inode->i_state != I_CLEAR);
 	destroy_inode(inode);
@@ -1319,16 +1273,13 @@ static void generic_forget_inode(struct
 		if (sb->s_flags & MS_ACTIVE) {
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&sb_inode_list_lock);
-			spin_unlock(&inode_lock);
 			return;
 		}
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_WILL_FREE;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb_inode_list_lock);
-		spin_unlock(&inode_lock);
 		write_inode_now(inode, 1);
-		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
@@ -1346,7 +1297,6 @@ static void generic_forget_inode(struct
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	atomic_dec(&inodes_stat.nr_inodes);
 	if (inode->i_data.nrpages)
 		truncate_inode_pages(&inode->i_data, 0);
@@ -1405,17 +1355,12 @@ void iput(struct inode *inode)
 	if (inode) {
 		BUG_ON(inode->i_state == I_CLEAR);
 
-retry1:
+retry:
 		spin_lock(&inode->i_lock);
 		if (inode->i_count == 1) {
-			if (!spin_trylock(&inode_lock)) {
-retry2:
-				spin_unlock(&inode->i_lock);
-				goto retry1;
-			}
 			if (!spin_trylock(&sb_inode_list_lock)) {
-				spin_unlock(&inode_lock);
-				goto retry2;
+				spin_unlock(&inode->i_lock);
+				goto retry;
 			}
 			inode->i_count--;
 			iput_final(inode);
@@ -1613,10 +1558,8 @@ static void __wait_on_freeing_inode(stru
 	wq = bit_waitqueue(&inode->i_state, __I_LOCK);
 	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	schedule();
 	finish_wait(wq, &wait.wait);
-	spin_lock(&inode_lock);
 }
 
 static __initdata unsigned long ihash_entries;
Index: linux-2.6/fs/notify/inotify/inotify.c
===================================================================
--- linux-2.6.orig/fs/notify/inotify/inotify.c
+++ linux-2.6/fs/notify/inotify/inotify.c
@@ -436,7 +436,6 @@ void inotify_unmount_inodes(struct list_
 		 * iprune_mutex keeps shrink_icache_memory() away.
 		 */
 		spin_unlock(&sb_inode_list_lock);
-		spin_unlock(&inode_lock);
 
 		if (need_iput_tmp)
 			iput(need_iput_tmp);
@@ -456,7 +455,6 @@ void inotify_unmount_inodes(struct list_
 		mutex_unlock(&inode->inotify_mutex);
 		iput(inode);		
 
-		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 	}
 }
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -9,7 +9,6 @@
 
 struct backing_dev_info;
 
-extern spinlock_t inode_lock;
 extern spinlock_t sb_inode_list_lock;
 extern spinlock_t wb_inode_list_lock;
 extern spinlock_t inode_hash_lock;
Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c
+++ linux-2.6/fs/quota/dquot.c
@@ -821,7 +821,6 @@ static void add_dquot_ref(struct super_b
 {
 	struct inode *inode, *old_inode = NULL;
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
@@ -841,7 +840,6 @@ static void add_dquot_ref(struct super_b
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb_inode_list_lock);
-		spin_unlock(&inode_lock);
 
 		iput(old_inode);
 		sb->dq_op->initialize(inode, type);
@@ -851,11 +849,9 @@ static void add_dquot_ref(struct super_b
 		 * reference and we cannot iput it under inode_lock. So we
 		 * keep the reference and iput it later. */
 		old_inode = inode;
-		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 	}
 	spin_unlock(&sb_inode_list_lock);
-	spin_unlock(&inode_lock);
 	iput(old_inode);
 }
 
@@ -925,7 +921,6 @@ static void remove_dquot_ref(struct supe
 {
 	struct inode *inode;
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
@@ -938,7 +933,6 @@ static void remove_dquot_ref(struct supe
 			remove_inode_dquot_ref(inode, type, tofree_head);
 	}
 	spin_unlock(&sb_inode_list_lock);
-	spin_unlock(&inode_lock);
 }
 
 /* Gather all references from inodes and drop them */



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 01/27] fs: cleanup files_lock
  2009-04-25  1:20 ` [patch 01/27] fs: cleanup files_lock npiggin
@ 2009-04-25  3:20   ` Al Viro
  2009-04-25  5:35   ` Eric W. Biederman
  2009-04-25  9:42   ` Alan Cox
  2 siblings, 0 replies; 50+ messages in thread
From: Al Viro @ 2009-04-25  3:20 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel, Alan Cox

[Alan Cc'ed due to tty part of it]

On Sat, Apr 25, 2009 at 11:20:21AM +1000, npiggin@suse.de wrote:

>  	set_bit(TTY_PTY_LOCK, &tty->flags); /* LOCK THE SLAVE */
>  	filp->private_data = tty;
> -	file_move(filp, &tty->tty_files);
> +
> +	mutex_lock(&tty_mutex);
> +	file_list_del(filp);
> +	list_add(&filp->f_u.fu_list, &tty->tty_files);
> +	mutex_unlock(&tty_mutex);

Is there any problem with just shifting mutex_unlock down from several lines
above?


(in do_tty_hangup)
> +	mutex_lock(&tty_mutex);
> +
>  	/* inuse_filps is protected by the single kernel lock */
>  	lock_kernel();

isn't it too early?

> @@ -553,8 +566,7 @@ static void do_tty_hangup(struct work_st
>  	}
>  	spin_unlock(&redirect_lock);
>  
> -	check_tty_count(tty, "do_tty_hangup");
> -	file_list_lock();

i.e. why not here?

> +	__check_tty_count(tty, "do_tty_hangup");
>  	/* This breaks for file handles being sent over AF_UNIX sockets ? */
>  	list_for_each_entry(filp, &tty->tty_files, f_u.fu_list) {
>  		if (filp->f_op->write == redirected_tty_write)

> @@ -1467,9 +1479,9 @@ static void release_one_tty(struct kref
>  	tty_driver_kref_put(driver);
>  	module_put(driver->owner);
>  
> -	file_list_lock();
> +	mutex_lock(&tty_mutex);
>  	list_del_init(&tty->tty_files);
> -	file_list_unlock();
> +	mutex_unlock(&tty_mutex);

Umm... why is it safe from the deadlock POV?

> @@ -1836,8 +1849,12 @@ got_driver:
>  		return PTR_ERR(tty);
>  
>  	filp->private_data = tty;
> -	file_move(filp, &tty->tty_files);
> -	check_tty_count(tty, "tty_open");
> +	mutex_lock(&tty_mutex);
> +	BUG_ON(list_empty(&filp->f_u.fu_list));
> +	file_list_del(filp); /* __dentry_open has put it on the sb list */
> +	list_add(&filp->f_u.fu_list, &tty->tty_files);
> +	__check_tty_count(tty, "tty_open");
> +	mutex_unlock(&tty_mutex);

a) why not simply shift mutex_unlock from several lines above?
b) that code really looks b0rken - what happens if you block on that
mutex_lock and somebody else comes and sees (at least) inconsistent
tty->count?

====

Could you split that into direct move (one patch) + changes?

> +/**
> + *	mark_files_ro - mark all files read-only
> + *	@sb: superblock in question
> + *
> + *	All files are marked read-only.  We don't care about pending
> + *	delete files so this should be used in 'force' mode only.
> + */
> +void mark_files_ro(struct super_block *sb)

BTW, I'd rather merge the mnt_write_count one first, so reordering of those
would be appreciated; mnt_write_count + move that function + this patch
is the order I'd prefer.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 02/27] fs: scale files_lock
  2009-04-25  1:20 ` [patch 02/27] fs: scale files_lock npiggin
@ 2009-04-25  3:32   ` Al Viro
  0 siblings, 0 replies; 50+ messages in thread
From: Al Viro @ 2009-04-25  3:32 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel

On Sat, Apr 25, 2009 at 11:20:22AM +1000, npiggin@suse.de wrote:
> Improve scalability of files_lock by adding per-cpu, per-sb files lists,
> protected with per-cpu locking. Effectively turning it into a big-writer
> lock.

Og dumb.  Many locks.  Many ifdefs.  Og don't like.

>  void file_sb_list_add(struct file *file, struct super_block *sb)
>  {
> -	spin_lock(&files_lock);
> +	spinlock_t *lock;
> +	struct list_head *list;
> +	int cpu;
> +
> +	lock = &get_cpu_var(files_cpulock);
> +#ifdef CONFIG_SMP
> +	BUG_ON(file->f_sb_list_cpu != -1);
> +	cpu = smp_processor_id();
> +	list = per_cpu_ptr(sb->s_files, cpu);
> +	file->f_sb_list_cpu = cpu;
> +#else
> +	list = &sb->s_files;
> +#endif
> +	spin_lock(lock);
>  	BUG_ON(!list_empty(&file->f_u.fu_list));
> -	list_add(&file->f_u.fu_list, &sb->s_files);
> -	spin_unlock(&files_lock);
> +	list_add(&file->f_u.fu_list, list);
> +	spin_unlock(lock);
> +	put_cpu_var(files_cpulock);
>  }

Don't like overhead on hot paths either.

And grown memory footprint of struct super_block (with alloc_percpu())

>  	atomic_long_t		f_count;
>  	unsigned int 		f_flags;
>  	fmode_t			f_mode;
> @@ -1330,7 +1333,11 @@ struct super_block {
>  	struct list_head	s_io;		/* parked for writeback */
>  	struct list_head	s_more_io;	/* parked for more writeback */
>  	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
> +#ifdef CONFIG_SMP
> +	struct list_head	*s_files;
> +#else
>  	struct list_head	s_files;
> +#endif
>  	/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
>  	struct list_head	s_dentry_lru;	/* unused dentry lru */
>  	int			s_nr_dentry_unused;	/* # of dentry on lru */

... and ifdefs like that in structs.

What I really want to see is a rationale for all that.  Preferably with
more than microbenchmarks showing a visible impact.

Especially if you compare it with an alternative variant that simply splits
files_lock on a per-sb basis.
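
(I.e. something like this - a sketch only, s_files_lock is a made-up name:

	struct super_block {
		...
		spinlock_t		s_files_lock;	/* protects s_files */
		struct list_head	s_files;
		...
	};

with sb->s_files_lock taken wherever files_lock is taken now.)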

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 04/27] fs: introduce mnt_clone_write
  2009-04-25  1:20 ` [patch 04/27] fs: introduce mnt_clone_write npiggin
@ 2009-04-25  3:35   ` Al Viro
  0 siblings, 0 replies; 50+ messages in thread
From: Al Viro @ 2009-04-25  3:35 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel, Dave Hansen

On Sat, Apr 25, 2009 at 11:20:24AM +1000, npiggin@suse.de wrote:
> This patch speeds up lmbench lat_mmap test by about another 2% after the
> first patch.
> 
> Before:
>  avg = 462.286
>  std = 5.46106
> 
> After:
>  avg = 453.12
>  std = 9.58257
> 
> (50 runs of each, stddev gives a reasonable confidence)
> 
> It does this by introducing mnt_clone_write, which avoids some heavyweight
> operations of mnt_want_write if called on a vfsmount which we know already
> has a write count; and mnt_want_write_file, which can call mnt_clone_write
> if the file is open for write.
> 
> After these two patches, mnt_want_write and mnt_drop_write go from 7% on
> the profile down to 1.3% (including mnt_clone_write).

NAK in this form; nested mnt_want_write() *CAN* fail (note the check for
superblock itself being r/o).  Make your mnt_clone_write() return int
and do that superblock check, and I'm OK with it.
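
Something like this would be fine (a sketch only - __mnt_is_readonly and
mnt_inc_writers are placeholders for whatever the r/o test and the
lightweight count bump are actually called):

	int mnt_clone_write(struct vfsmount *mnt)
	{
		/* superblock itself may be r/o even with writers on mnt */
		if (__mnt_is_readonly(mnt))
			return -EROFS;
		mnt_inc_writers(mnt);
		return 0;
	}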

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 05/27] fs: brlock vfsmount_lock
  2009-04-25  1:20 ` [patch 05/27] fs: brlock vfsmount_lock npiggin
@ 2009-04-25  3:50   ` Al Viro
  2009-04-26  6:36     ` Nick Piggin
  0 siblings, 1 reply; 50+ messages in thread
From: Al Viro @ 2009-04-25  3:50 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel

On Sat, Apr 25, 2009 at 11:20:25AM +1000, npiggin@suse.de wrote:

[overall: sane idea, but...]

> +void vfsmount_read_lock(void)
> +{
> +	spinlock_t *lock;
> +
> +	lock = &get_cpu_var(vfsmount_lock);
> +	spin_lock(lock);
> +}
> +
> +void vfsmount_read_unlock(void)
> +{
> +	spinlock_t *lock;
> +
> +	lock = &__get_cpu_var(vfsmount_lock);
> +	spin_unlock(lock);
> +	put_cpu_var(vfsmount_lock);
> +}

These might be hot enough to be worth inlining, at least for the fs/namei.c
users.  Or not - really needs testing.

> @@ -68,9 +113,9 @@ static int mnt_alloc_id(struct vfsmount
>  
>  retry:
>  	ida_pre_get(&mnt_id_ida, GFP_KERNEL);
> -	spin_lock(&vfsmount_lock);
> +	vfsmount_write_lock();
>  	res = ida_get_new(&mnt_id_ida, &mnt->mnt_id);
> -	spin_unlock(&vfsmount_lock);
> +	vfsmount_write_unlock();

Yuck.  _Really_ an overkill here.

>  static void mnt_free_id(struct vfsmount *mnt)
>  {
> -	spin_lock(&vfsmount_lock);
> +	vfsmount_write_lock();
>  	ida_remove(&mnt_id_ida, mnt->mnt_id);
> -	spin_unlock(&vfsmount_lock);
> +	vfsmount_write_unlock();
>  }

Ditto.

Missing: description of when we need it for read/when we need it for write.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 00/27] [rfc] vfs scalability patchset
  2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
                   ` (26 preceding siblings ...)
  2009-04-25  1:20 ` [patch 27/27] fs: icache remove inode_lock npiggin
@ 2009-04-25  4:18 ` Al Viro
  2009-04-25  5:02   ` Nick Piggin
  2009-04-25  8:01   ` Christoph Hellwig
  27 siblings, 2 replies; 50+ messages in thread
From: Al Viro @ 2009-04-25  4:18 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel

On Sat, Apr 25, 2009 at 11:20:20AM +1000, npiggin@suse.de wrote:
> Here is my current patchset for improving vfs locking scalability. Since
> last posting, I have fixed several bugs, solved several more problems, and
> done an initial sweep of filesystems (autofs4 is probably the trickiest,
> and unfortunately I don't have a good test setup here for that yet, but
> at least I've looked through it).
> 
> Also started to tackle files_lock, vfsmount_lock, and inode_lock.
> (I included my mnt_want_write patches before the vfsmount_lock scalability
> stuff because that just made it a bit easier...). These appear to be the
> problematic global locks in the vfs.
> 
> It's running stably here so far on basic stress testing here on several file
> systems (xfs, tmpfs, ext?). But it still might eat your data of course.
> 
> Would be very interested in any feedback.

First of all, I happily admit that wrt locking I'm a barbarian, and proud
of it.  I.e. simpler locking scheme beats theoretical improvement, unless
we have really good evidence that there's a real-world problem.  All things
equal, complexity loses.  All things not quite equal - ditto.  Amount of
fuckups is at least quadratic in the number of lock types, with quite a big
chunk on top added by each per-something kind of lock.

Said that, I like mnt_want_write part, vfsmount_lock splitup (modulo
several questions) and _maybe_ doing something about files_lock.
Like as in "would seriously consider merging next cycle".  I'd keep
dcache and icache parts separate for now.

However, files_lock part 2 looks very dubious - if nothing else, I would
expect that you'll get *more* cross-CPU traffic that way, since the CPU
where final fput() runs will correlate only weakly (if at all) with one
where open() had been done.  So you are getting more cachelines bouncing.
I want to see the numbers for this one, and on different kinds of loads,
but as it is I'm very sceptical.  BTW, could you try to collect stats
along the lines of "CPU #i has done N_{i,j} removals from sb list for
files that had been in list #j"?
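
Something like this in the sb-list removal path would do (a sketch;
f_sb_list_cpu is the field patch 2 adds, the counter array is purely
for instrumentation):

	static unsigned long sb_del_stats[NR_CPUS][NR_CPUS];

	/* CPU doing the removal vs. the list the file had been put on */
	sb_del_stats[smp_processor_id()][file->f_sb_list_cpu]++;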

Splitting files_lock on per-sb basis might be an interesting variant, too.

Another thing: could you pull outright bugfixes as early as possible in the
queue?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 00/27] [rfc] vfs scalability patchset
  2009-04-25  4:18 ` [patch 00/27] [rfc] vfs scalability patchset Al Viro
@ 2009-04-25  5:02   ` Nick Piggin
  2009-04-25  8:01   ` Christoph Hellwig
  1 sibling, 0 replies; 50+ messages in thread
From: Nick Piggin @ 2009-04-25  5:02 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel

Thanks for taking a look. I'll spend a bit of time to go over your
feedback.


On Sat, Apr 25, 2009 at 05:18:29AM +0100, Al Viro wrote:
> On Sat, Apr 25, 2009 at 11:20:20AM +1000, npiggin@suse.de wrote:
> > Here is my current patchset for improving vfs locking scalability. Since
> > last posting, I have fixed several bugs, solved several more problems, and
> > done an initial sweep of filesystems (autofs4 is probably the trickiest,
> > and unfortunately I don't have a good test setup here for that yet, but
> > at least I've looked through it).
> > 
> > Also started to tackle files_lock, vfsmount_lock, and inode_lock.
> > (I included my mnt_want_write patches before the vfsmount_lock scalability
> > stuff because that just made it a bit easier...). These appear to be the
> > problematic global locks in the vfs.
> > 
> > It's running stably here so far on basic stress testing here on several file
> > systems (xfs, tmpfs, ext?). But it still might eat your data of course.
> > 
> > Would be very interested in any feedback.
> 
> First of all, I happily admit that wrt locking I'm a barbarian, and proud
> of it.  I.e. simpler locking scheme beats theoretical improvement, unless
> we have really good evidence that there's a real-world problem.  All things
> equal, complexity loses.  All things not quite equal - ditto.  Amount of
> fuckups is at least quadratic in the number of lock types, with quite a big
> chunk on top added by each per-something kind of lock.

Yes definitely. What recently prompted me to finally look at this is
the nasty looking "batched dput/iput" stuff that came out of google.
Unfortunately I don't remember seeing a description of the workload
but I'll ping them.

I do know that SGI has had problems with these locks on NFS server
workloads too (and not on insanely sized systems). I should be able
to get a recipe for reproducing this.

And this is an open call for anyone else seeing scalability problems
here too.

 
> Said that, I like mnt_want_write part, vfsmount_lock splitup (modulo
> several questions) and _maybe_ doing something about files_lock.
> Like as in "would seriously consider merging next cycle".

OK that's a good start. I do admit I didn't take enough time to grok
the tty stuff :P But I'll try to get it in shape.

>  I'd keep
> dcache and icache parts separate for now.

Yes they need a lot more review and results.
 

> However, files_lock part 2 looks very dubious - if nothing else, I would
> expect that you'll get *more* cross-CPU traffic that way, since the CPU
> where final fput() runs will correlate only weakly (if at all) with one
> where open() had been done.  So you are getting more cachelines bouncing.

You think? Weakly? Well I guess it will depend on the workload. In some
cases it will be. Although the alternative is all CPUs bouncing a single
lock cacheline, so with multiple lock cachelines we at least have
less contention at the cache coherency level (ie. we can have multiple
cacheline bounces in flight across the entire machine). But... enough
handwaving from me, I agree it needs results.


> I want to see the numbers for this one, and on different kinds of loads,
> but as it is I'm very sceptical.  BTW, could you try to collect stats
> along the lines of "CPU #i has done N_{i,j} removals from sb list for
> files that had been in list #j"?
> 
> Splitting files_lock on per-sb basis might be an interesting variant, too.

Yes that could help, although I had been trying to keep in mind
single-sb scalability too.

 
> Another thing: could you pull outright bugfixes as early as possible in the
> queue?

Sure thing.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 01/27] fs: cleanup files_lock
  2009-04-25  1:20 ` [patch 01/27] fs: cleanup files_lock npiggin
  2009-04-25  3:20   ` Al Viro
@ 2009-04-25  5:35   ` Eric W. Biederman
  2009-04-26  6:12     ` Nick Piggin
  2009-04-25  9:42   ` Alan Cox
  2 siblings, 1 reply; 50+ messages in thread
From: Eric W. Biederman @ 2009-04-25  5:35 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel, Al Viro, Alan Cox

npiggin@suse.de writes:

> Lock tty_files with tty_mutex, provide helpers to manipulate the per-sb
> files list, and unexport the files_lock spinlock.

This conflicts a bit with some of my ongoing work, which is generalizing
the file list to make it more useful and making the tty case much less
of a special case.

Do you know if the performance improvement would be anywhere near as good
if file_list and file_list_lock became per-inode?

Do you have any idea how big the performance improvement from changing
file_list_lock is?


> Index: linux-2.6/fs/open.c
> ===================================================================
> --- linux-2.6.orig/fs/open.c
> +++ linux-2.6/fs/open.c
> @@ -828,7 +828,7 @@ static struct file *__dentry_open(struct
>  	f->f_path.mnt = mnt;
>  	f->f_pos = 0;
>  	f->f_op = fops_get(inode->i_fop);
> -	file_move(f, &inode->i_sb->s_files);
> +	file_sb_list_add(f, inode->i_sb);

You can make this just:
	 if (!special_file(inode->i_mode))
		file_add(f, &inode->i_files);

And save yourself a lot of complexity.

Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 00/27] [rfc] vfs scalability patchset
  2009-04-25  4:18 ` [patch 00/27] [rfc] vfs scalability patchset Al Viro
  2009-04-25  5:02   ` Nick Piggin
@ 2009-04-25  8:01   ` Christoph Hellwig
  2009-04-25  8:06     ` Al Viro
  2009-04-25 19:08     ` Eric W. Biederman
  1 sibling, 2 replies; 50+ messages in thread
From: Christoph Hellwig @ 2009-04-25  8:01 UTC (permalink / raw)
  To: Al Viro; +Cc: npiggin, linux-fsdevel, linux-kernel

On Sat, Apr 25, 2009 at 05:18:29AM +0100, Al Viro wrote:
> However, files_lock part 2 looks very dubious - if nothing else, I would
> expect that you'll get *more* cross-CPU traffic that way, since the CPU
> where final fput() runs will correlate only weakly (if at all) with one
> where open() had been done.  So you are getting more cachelines bouncing.
> I want to see the numbers for this one, and on different kinds of loads,
> but as it is I'm very sceptical.  BTW, could you try to collect stats
> along the lines of "CPU #i has done N_{i,j} removals from sb list for
> files that had been in list #j"?
> 
> Splitting files_lock on per-sb basis might be an interesting variant, too.

We should just kill files_lock and s_files completely.  The remaining
users are the may-remount-r/o checks, and with counters in place not only on
the vfsmount but also on the superblock we can kill fs_may_remount_ro in
its current form.  The only interesting bit left after that is
mark_files_ro, which is so buggy that I'd prefer to kill it including the
underlying functionality.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 00/27] [rfc] vfs scalability patchset
  2009-04-25  8:01   ` Christoph Hellwig
@ 2009-04-25  8:06     ` Al Viro
  2009-04-28  9:09       ` Christoph Hellwig
  2009-04-25 19:08     ` Eric W. Biederman
  1 sibling, 1 reply; 50+ messages in thread
From: Al Viro @ 2009-04-25  8:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: npiggin, linux-fsdevel, linux-kernel

On Sat, Apr 25, 2009 at 04:01:43AM -0400, Christoph Hellwig wrote:
> On Sat, Apr 25, 2009 at 05:18:29AM +0100, Al Viro wrote:
> > However, files_lock part 2 looks very dubious - if nothing else, I would
> > expect that you'll get *more* cross-CPU traffic that way, since the CPU
> > where final fput() runs will correlate only weakly (if at all) with one
> > where open() had been done.  So you are getting more cachelines bouncing.
> > I want to see the numbers for this one, and on different kinds of loads,
> > but as it is I've very sceptical.  BTW, could you try to collect stats
> > along the lines of "CPU #i has done N_{i,j} removals from sb list for
> > files that had been in list #j"?
> > 
> > Splitting files_lock on per-sb basis might be an interesting variant, too.
> 
> We should just kill files_lock and s_files completely.  The remaining
> users are the may-remount-r/o checks, and with counters in place not only on
> the vfsmount but also on the superblock we can kill fs_may_remount_ro in
> its current form.  The only interesting bit left after that is
> mark_files_ro, which is so buggy that I'd prefer to kill it including the
> underlying functionality.

Maybe...  What Eric proposed is essentially a reuse of s_list for per-inode
list of struct file.  Presumably with something like i_lock for protection.
So that's not a conflict.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 01/27] fs: cleanup files_lock
  2009-04-25  1:20 ` [patch 01/27] fs: cleanup files_lock npiggin
  2009-04-25  3:20   ` Al Viro
  2009-04-25  5:35   ` Eric W. Biederman
@ 2009-04-25  9:42   ` Alan Cox
  2009-04-26  6:15     ` Nick Piggin
  2 siblings, 1 reply; 50+ messages in thread
From: Alan Cox @ 2009-04-25  9:42 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel

On Sat, 25 Apr 2009 11:20:21 +1000
npiggin@suse.de wrote:

> Lock tty_files with tty_mutex, provide helpers to manipulate the per-sb
> files list, and unexport the files_lock spinlock.

This looks half like a backward step to me: it swaps clean method calls
for open-coded stuff, and it adds more random undocumented uses of
tty_mutex, which has far too many already.

I don't think

-	file_move(filp, &tty->tty_files);
+
+	mutex_lock(&tty_mutex);
+	file_list_del(filp);
+	list_add(&filp->f_u.fu_list, &tty->tty_files);
+	mutex_unlock(&tty_mutex);

is exactly an improvement, nor is 

-	file_move(filp, &tty->tty_files);
-	check_tty_count(tty, "tty_open");
+	mutex_lock(&tty_mutex);
+	BUG_ON(list_empty(&filp->f_u.fu_list));
+	file_list_del(filp); /* __dentry_open has put it on the sb list */
+	list_add(&filp->f_u.fu_list, &tty->tty_files);
+	__check_tty_count(tty, "tty_open");
+	mutex_unlock(&tty_mutex);

The basic idea looks totally sound but it can use its own lock and there
should be helpers so this stuff doesn't have to get open coded.
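
Something like this, say (a sketch - the lock and the helper are made up,
the point is that callers shouldn't see the list manipulation):

	static DEFINE_SPINLOCK(tty_files_lock);

	static void tty_file_move(struct file *filp, struct tty_struct *tty)
	{
		spin_lock(&tty_files_lock);
		file_list_del(filp);	/* off the per-sb list */
		list_add(&filp->f_u.fu_list, &tty->tty_files);
		spin_unlock(&tty_files_lock);
	}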

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 00/27] [rfc] vfs scalability patchset
  2009-04-25  8:01   ` Christoph Hellwig
  2009-04-25  8:06     ` Al Viro
@ 2009-04-25 19:08     ` Eric W. Biederman
  2009-04-25 19:31       ` Al Viro
  1 sibling, 1 reply; 50+ messages in thread
From: Eric W. Biederman @ 2009-04-25 19:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Al Viro, npiggin, linux-fsdevel, linux-kernel

Christoph Hellwig <hch@infradead.org> writes:

> On Sat, Apr 25, 2009 at 05:18:29AM +0100, Al Viro wrote:
>> However, files_lock part 2 looks very dubious - if nothing else, I would
>> expect that you'll get *more* cross-CPU traffic that way, since the CPU
>> where final fput() runs will correlate only weakly (if at all) with one
>> where open() had been done.  So you are getting more cachelines bouncing.
>> I want to see the numbers for this one, and on different kinds of loads,
>> but as it is I'm very sceptical.  BTW, could you try to collect stats
>> along the lines of "CPU #i has done N_{i,j} removals from sb list for
>> files that had been in list #j"?
>> 
>> Splitting files_lock on per-sb basis might be an interesting variant, too.
>
> We should just kill files_lock and s_files completely.  The remaining
> users are the may-remount-r/o checks, and with counters in place not only on
> the vfsmount but also on the superblock we can kill fs_may_remount_ro in
> its current form.

Can we?  At my first glance at that code I asked myself if we could examine
i_writecount, instead of going to the file. My impression was that we
were deliberately only counting persistent write references from files
instead of transient write references.  As only the persistent write
references matter.  Transient write references can at least in theory
be flushed as the filesystem is remounting read-only.
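
For fs_may_remount_ro that would be something along these lines (a
sketch only):

	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		if (atomic_read(&inode->i_writecount) > 0)
			return 0;	/* a writer still holds the inode */
	}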

Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 00/27] [rfc] vfs scalability patchset
  2009-04-25 19:08     ` Eric W. Biederman
@ 2009-04-25 19:31       ` Al Viro
  2009-04-25 20:29         ` Eric W. Biederman
  0 siblings, 1 reply; 50+ messages in thread
From: Al Viro @ 2009-04-25 19:31 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Christoph Hellwig, npiggin, linux-fsdevel, linux-kernel

On Sat, Apr 25, 2009 at 12:08:16PM -0700, Eric W. Biederman wrote:

> Can we?  At my first glance at that code I asked myself if we could examine
> i_writecount, instead of going to the file. My impression was that we
> were deliberately only counting persistent write references from files

No, there's nothing deliberate about that.  The code is simply wrong;
some of that crap had been fixed with mnt_want_write series, but
the rest...

> instead of transient write references.  As only the persistent write
> references matter.  Transient write references can at least in theory
> be flushed as the filesystem is remounting read-only.

No.  It's far too painful to do and no fs is doing that.  You are looking
for deliberate behaviour in a place where we have a half-fixed pile of races.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 00/27] [rfc] vfs scalability patchset
  2009-04-25 19:31       ` Al Viro
@ 2009-04-25 20:29         ` Eric W. Biederman
  2009-04-25 22:05           ` Theodore Tso
  0 siblings, 1 reply; 50+ messages in thread
From: Eric W. Biederman @ 2009-04-25 20:29 UTC (permalink / raw)
  To: Al Viro; +Cc: Christoph Hellwig, npiggin, linux-fsdevel, linux-kernel

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Sat, Apr 25, 2009 at 12:08:16PM -0700, Eric W. Biederman wrote:
>
>> Can we?  On my first glance at that code I asked myself if we could examine
>> i_writecount, instead of going to the file. My impression was that we
>> were deliberately only counting persistent write references from files
>
> No, there's nothing deliberate about that.  The code is simply wrong;
> some of that crap had been fixed with mnt_want_write series, but
> the rest...
>
>> instead of transient write references, as only the persistent write
>> references matter.  Transient write references can at least in theory
>> be flushed as the filesystem is remounting read-only.
>
> No.  It's far too painful to do and no fs is doing that.  You are looking
> for deliberate behaviour in a place where we have a half-fixed pile of races.

I didn't trace it all of the way through but this comment in
ext3_remount fooled me:

	/*
	 * We have to unlock super so that we can wait for
	 * transactions.
	 */

Which was enough to make me think it might have been deliberate behavior, so I figured
it was worth asking.  It looked like the journal commit logic could have
been doing the blocking magic to wait on ongoing truncates and the like.

Still, even if it was deliberate, it is the job of user space to remove all writers
before we remount read-only, and making the guarantee that we pass to the
filesystems that the fs is read-only stronger should not hurt anything.

Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 00/27] [rfc] vfs scalability patchset
  2009-04-25 20:29         ` Eric W. Biederman
@ 2009-04-25 22:05           ` Theodore Tso
  0 siblings, 0 replies; 50+ messages in thread
From: Theodore Tso @ 2009-04-25 22:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Christoph Hellwig, npiggin, linux-fsdevel, linux-kernel

On Sat, Apr 25, 2009 at 01:29:01PM -0700, Eric W. Biederman wrote:
> 
> I didn't trace it all of the way through but this comment in
> ext3_remount fooled me:
> 
> 	/*
> 	 * We have to unlock super so that we can wait for
> 	 * transactions.
> 	 */
> 
> Which was enough to make me think it might have been deliberate behavior, so I figured
> it was worth asking.  It looked like the journal commit logic could have
> been doing the blocking magic to wait on ongoing truncates and the like.

Working on fixing this already for ext4.  Once we're convinced it's
right for ext4, we can backport the fixes for ext3.

      	  	       		    	      - Ted

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 01/27] fs: cleanup files_lock
  2009-04-25  5:35   ` Eric W. Biederman
@ 2009-04-26  6:12     ` Nick Piggin
  0 siblings, 0 replies; 50+ messages in thread
From: Nick Piggin @ 2009-04-26  6:12 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, linux-kernel, Al Viro, Alan Cox

On Fri, Apr 24, 2009 at 10:35:10PM -0700, Eric W. Biederman wrote:
> npiggin@suse.de writes:
> 
> > Lock tty_files with tty_mutex, provide helpers to manipulate the per-sb
> > files list, and unexport the files_lock spinlock.
> 
> This conflicts a bit with some of my ongoing work, which is generalizing
> the file list to make it more useful and makes the tty case much less
> of a special case.

OK. My first patch should be fine, though.

 
> Do you know if the performance improvement would be anywhere near as good if
> file_list and file_list_lock become per-inode?

Interesting (I didn't look closely at your patches yet). Probably that
would be quite reasonable.

 
> Do you have any idea what the performance improvement from changing the file_list_lock
> is?

Several of these locks hit in the same workloads so they mask each
other. I only just got the patchset to the stage where I can really
benchmark it. I could try your alternative as well.


 
> > Index: linux-2.6/fs/open.c
> > ===================================================================
> > --- linux-2.6.orig/fs/open.c
> > +++ linux-2.6/fs/open.c
> > @@ -828,7 +828,7 @@ static struct file *__dentry_open(struct
> >  	f->f_path.mnt = mnt;
> >  	f->f_pos = 0;
> >  	f->f_op = fops_get(inode->i_fop);
> > -	file_move(f, &inode->i_sb->s_files);
> > +	file_sb_list_add(f, inode->i_sb);
> 
> You can make this just:
> 	 if (!special_file(inode->i_mode))
> 		file_add(f, &inode->i_files);
> 
> And save yourself a lot of complexity.

Probably right, but I'll leave that for someone else to do.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 01/27] fs: cleanup files_lock
  2009-04-25  9:42   ` Alan Cox
@ 2009-04-26  6:15     ` Nick Piggin
  0 siblings, 0 replies; 50+ messages in thread
From: Nick Piggin @ 2009-04-26  6:15 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-fsdevel, linux-kernel

On Sat, Apr 25, 2009 at 10:42:34AM +0100, Alan Cox wrote:
> On Sat, 25 Apr 2009 11:20:21 +1000
> npiggin@suse.de wrote:
> 
> > Lock tty_files with tty_mutex, provide helpers to manipulate the per-sb
> > files list, and unexport the files_lock spinlock.
> 
> This looks half like a backward step to me: It swaps clean method calls
> for open coded stuff and it adds more random undocumented uses to
> tty_mutex, which has far too many already.
> 
> I don't think
> 
> -	file_move(filp, &tty->tty_files);
> +
> +	mutex_lock(&tty_mutex);
> +	file_list_del(filp);
> +	list_add(&filp->f_u.fu_list, &tty->tty_files);
> +	mutex_unlock(&tty_mutex);
> 
> is exactly an improvement, nor is 
> 
> -	file_move(filp, &tty->tty_files);
> -	check_tty_count(tty, "tty_open");
> +	mutex_lock(&tty_mutex);
> +	BUG_ON(list_empty(&filp->f_u.fu_list));
> +	file_list_del(filp); /* __dentry_open has put it on the sb list */
> +	list_add(&filp->f_u.fu_list, &tty->tty_files);
> +	__check_tty_count(tty, "tty_open");
> +	mutex_unlock(&tty_mutex);
> 
> The basic idea looks totally sound but it can use its own lock and there
> should be helpers so this stuff doesn't have to get open coded.

Yes, I agree it was silly to try reusing tty_mutex for this, as you
and Al point out. I've just added a new spinlock for the tty layer
for the moment, which makes it much more like a mechanical search/
replace.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 05/27] fs: brlock vfsmount_lock
  2009-04-25  3:50   ` Al Viro
@ 2009-04-26  6:36     ` Nick Piggin
  0 siblings, 0 replies; 50+ messages in thread
From: Nick Piggin @ 2009-04-26  6:36 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel

On Sat, Apr 25, 2009 at 04:50:40AM +0100, Al Viro wrote:
> On Sat, Apr 25, 2009 at 11:20:25AM +1000, npiggin@suse.de wrote:
> 
> [overall: sane idea, but...]
> 
> > +void vfsmount_read_lock(void)
> > +{
> > +	spinlock_t *lock;
> > +
> > +	lock = &get_cpu_var(vfsmount_lock);
> > +	spin_lock(lock);
> > +}
> > +
> > +void vfsmount_read_unlock(void)
> > +{
> > +	spinlock_t *lock;
> > +
> > +	lock = &__get_cpu_var(vfsmount_lock);
> > +	spin_unlock(lock);
> > +	put_cpu_var(vfsmount_lock);
> > +}
> 
> These might be hot enough to be worth inlining, at least in fs/namei.c
> users.  Or not - really needs testing.

Hmm, no you could be right. Most of the code is still OOL in the
spinlock call, so avoiding one level of call chain is probably
going to be a win. I'll see how much it increases code size.
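
Something like this in a header, presumably (same bodies as the quoted
patch, just marked inline -- a sketch, not measured):

	static inline void vfsmount_read_lock(void)
	{
		spin_lock(&get_cpu_var(vfsmount_lock));
	}

	static inline void vfsmount_read_unlock(void)
	{
		spin_unlock(&__get_cpu_var(vfsmount_lock));
		put_cpu_var(vfsmount_lock);
	}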

 
> > @@ -68,9 +113,9 @@ static int mnt_alloc_id(struct vfsmount
> >  
> >  retry:
> >  	ida_pre_get(&mnt_id_ida, GFP_KERNEL);
> > -	spin_lock(&vfsmount_lock);
> > +	vfsmount_write_lock();
> >  	res = ida_get_new(&mnt_id_ida, &mnt->mnt_id);
> > -	spin_unlock(&vfsmount_lock);
> > +	vfsmount_write_unlock();
> 
> Yuck.  _Really_ an overkill here.
> 
> >  static void mnt_free_id(struct vfsmount *mnt)
> >  {
> > -	spin_lock(&vfsmount_lock);
> > +	vfsmount_write_lock();
> >  	ida_remove(&mnt_id_ida, mnt->mnt_id);
> > -	spin_unlock(&vfsmount_lock);
> > +	vfsmount_write_unlock();
> >  }
> 
> Ditto.

Yeah, wanted to try going as simple as possible for the first cut.
Shall I just add another spinlock for it?
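
E.g. a private lock just for the ida, along these lines (mnt_id_lock is
a made-up name; untested sketch):

	static DEFINE_SPINLOCK(mnt_id_lock);	/* protects mnt_id_ida only */

	static int mnt_alloc_id(struct vfsmount *mnt)
	{
		int res;

	retry:
		ida_pre_get(&mnt_id_ida, GFP_KERNEL);
		spin_lock(&mnt_id_lock);
		res = ida_get_new(&mnt_id_ida, &mnt->mnt_id);
		spin_unlock(&mnt_id_lock);
		if (res == -EAGAIN)
			goto retry;
		return res;
	}

	static void mnt_free_id(struct vfsmount *mnt)
	{
		spin_lock(&mnt_id_lock);
		ida_remove(&mnt_id_ida, mnt->mnt_id);
		spin_unlock(&mnt_id_lock);
	}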

> Missing: description of when we need it for read/when we need it for write.

OK, I'll work on the documentation.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 00/27] [rfc] vfs scalability patchset
  2009-04-25  8:06     ` Al Viro
@ 2009-04-28  9:09       ` Christoph Hellwig
  2009-04-28  9:48         ` Nick Piggin
                           ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Christoph Hellwig @ 2009-04-28  9:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, npiggin, linux-fsdevel, linux-kernel,
	Peter Zijlstra

On Sat, Apr 25, 2009 at 09:06:49AM +0100, Al Viro wrote:
> Maybe...  What Eric proposed is essentially a reuse of s_list for per-inode
> list of struct file.  Presumably with something like i_lock for protection.
> So that's not a conflict.

But what do we actually want it for?  Right now it's only used for
ttys, which Nick has split out, and for remount r/o.  For the normal
remount r/o case it will go away once we have proper per-sb writer
counts.  And the forced remount r/o from sysrq is completely broken.

A while ago Peter had patches for files_lock scalability that went even
further than Nick's, and if I remember the arguments correctly just
splitting the lock wasn't really enough and he required additional
batching because there just were too many lock roundtrips.  (Peter, do
you remember the details?)


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 00/27] [rfc] vfs scalability patchset
  2009-04-28  9:09       ` Christoph Hellwig
@ 2009-04-28  9:48         ` Nick Piggin
  2009-04-28 10:58         ` Peter Zijlstra
  2009-04-28 11:32         ` Eric W. Biederman
  2 siblings, 0 replies; 50+ messages in thread
From: Nick Piggin @ 2009-04-28  9:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Al Viro, linux-fsdevel, linux-kernel, Peter Zijlstra

On Tue, Apr 28, 2009 at 05:09:30AM -0400, Christoph Hellwig wrote:
> On Sat, Apr 25, 2009 at 09:06:49AM +0100, Al Viro wrote:
> > Maybe...  What Eric proposed is essentially a reuse of s_list for per-inode
> > list of struct file.  Presumably with something like i_lock for protection.
> > So that's not a conflict.
> 
> But what do we actually want it for?  Right now it's only used for
> ttys, which Nick has split out, and for remount r/o.  For the normal
> remount r/o case it will go away once we have proper per-sb writer
> counts.  And the forced remount r/o from sysrq is completely broken.
> 
> A while ago Peter had patches for files_lock scalability that went even
> further than Nick's, and if I remember the arguments correctly just
> splitting the lock wasn't really enough and he required additional
> batching because there just were too many lock roundtrips.  (Peter, do
> you remember the details?)

Hmm, Peter's patch seemed like it was overkill to me. It avoids
the need for per-cpu files lists in the sb, but the cost is the
locked lists, the batching, etc. But even then it would still
generate more cacheline bounces because it needs to flush batches
back to the per-sb list.

Actually it should not be a problem to avoid the memory overhead
of my patch if we are willing to make the slowpath even slower.
Just have a global per-cpu list, and just test for sb equality
when walking the lists.
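
That is, keep one global set of per-cpu lists and filter when walking
(names made up, sketch only; list setup and removal locking omitted):

	static DEFINE_PER_CPU(spinlock_t, files_cpulock);
	static DEFINE_PER_CPU(struct list_head, files_cpulist);

	static void for_each_sb_file(struct super_block *sb,
				     void (*fn)(struct file *))
	{
		int cpu;

		for_each_possible_cpu(cpu) {
			spinlock_t *lock = &per_cpu(files_cpulock, cpu);
			struct file *file;

			spin_lock(lock);
			list_for_each_entry(file, &per_cpu(files_cpulist, cpu),
					    f_u.fu_list) {
				/* the slowpath pays for the saved memory here */
				if (file->f_path.mnt->mnt_sb != sb)
					continue;
				fn(file);
			}
			spin_unlock(lock);
		}
	}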

Anyway, I'll just continue to maintain this patch and if something
gets done with the file list first, then all the better.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 00/27] [rfc] vfs scalability patchset
  2009-04-28  9:09       ` Christoph Hellwig
  2009-04-28  9:48         ` Nick Piggin
@ 2009-04-28 10:58         ` Peter Zijlstra
  2009-04-28 11:32         ` Eric W. Biederman
  2 siblings, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2009-04-28 10:58 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Al Viro, npiggin, linux-fsdevel, linux-kernel

On Tue, 2009-04-28 at 05:09 -0400, Christoph Hellwig wrote:
> On Sat, Apr 25, 2009 at 09:06:49AM +0100, Al Viro wrote:
> > Maybe...  What Eric proposed is essentially a reuse of s_list for per-inode
> > list of struct file.  Presumably with something like i_lock for protection.
> > So that's not a conflict.
> 
> But what do we actually want it for?  Right now it's only used for
> ttys, which Nick has split out, and for remount r/o.  For the normal
> remount r/o case it will go away once we have proper per-sb writer
> counts.  And the forced remount r/o from sysrq is completely broken.
> 
> A while ago Peter had patches for files_lock scalability that went even
> further than Nick's, and if I remember the arguments correctly just
> splitting the lock wasn't really enough and he required additional
> batching because there just were too many lock roundtrips.  (Peter, do
> you remember the details?)

Suppose you have some task doing open/close on one filesystem (a rather
common scenario); then having the lock split at the superblock level doesn't
help you.

My patches were admittedly somewhat over the top, and they could cause
more cacheline bounces but significantly reduce the contention,
delivering an overall improvement, as can be seen from the
micro-benchmark results posted in that thread.

Anyway, your solution of simply removing all uses of the global files
list still seems like the most attractive option.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 00/27] [rfc] vfs scalability patchset
  2009-04-28  9:09       ` Christoph Hellwig
  2009-04-28  9:48         ` Nick Piggin
  2009-04-28 10:58         ` Peter Zijlstra
@ 2009-04-28 11:32         ` Eric W. Biederman
  2009-04-30  6:14           ` Nick Piggin
  2 siblings, 1 reply; 50+ messages in thread
From: Eric W. Biederman @ 2009-04-28 11:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Al Viro, npiggin, linux-fsdevel, linux-kernel, Peter Zijlstra

Christoph Hellwig <hch@infradead.org> writes:

> On Sat, Apr 25, 2009 at 09:06:49AM +0100, Al Viro wrote:
>> Maybe...  What Eric proposed is essentially a reuse of s_list for per-inode
>> list of struct file.  Presumably with something like i_lock for protection.
>> So that's not a conflict.
>
> But what do we actually want it for?  Right now it's only used for
> ttys, which Nick has split out, and for remount r/o.  For the normal
> remount r/o case it will go away once we have proper per-sb writer
> counts.  And the forced remount r/o from sysrq is completely broken.

The plan is to post my updated patches tomorrow after I have slept.

What I am looking at is that the tty layer is not a special case.  Any
subsystem that wants any kind of revoke functionality starts wanting
the list of files that are open.  My current list of places where we have
something like this is: sysfs, proc, sysctl, tun, tty, sound.

I am in the process of generalizing the handling and bringing all of this
into the VFS, where we only need to maintain it once, and can see
clearly what is going on so we can optimize it.

For that I essentially need per-inode lists of files.  Devices don't
have inodes but they usually have some kind of equivalent, like the
tty struct, that we can attach inodes to.
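
(In sketch form, assuming a hypothetical i_files list head in struct
inode, protected by i_lock -- made-up names, not the posted patches:)

	void file_inode_list_add(struct file *file, struct inode *inode)
	{
		spin_lock(&inode->i_lock);
		list_add(&file->f_u.fu_list, &inode->i_files);
		spin_unlock(&inode->i_lock);
	}

	void file_inode_list_del(struct file *file)
	{
		struct inode *inode = file->f_path.dentry->d_inode;

		spin_lock(&inode->i_lock);
		list_del_init(&file->f_u.fu_list);
		spin_unlock(&inode->i_lock);
	}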

It looks like what I have could pretty easily be used to implement
umount -f except for some weird cases like nfsd where the usual vfs
rules are not followed.  In particular, things like vfs_sync are a pain.

> A while ago Peter had patches for files_lock scalability that went even
> further than Nick's, and if I remember the arguments correctly just
> splitting the lock wasn't really enough and he required additional
> batching because there just were too many lock roundtrips.  (Peter, do
> you remember the details?)

I would love to hear what the issues are.  Since everyone is worried
about performance and contention I have gone ahead and made the
files_list_lock per inode in my patches.  We will see how well that works.
My goal has simply been to add functionality without making a significant
change in performance on the current workloads.

Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch 00/27] [rfc] vfs scalability patchset
  2009-04-28 11:32         ` Eric W. Biederman
@ 2009-04-30  6:14           ` Nick Piggin
  0 siblings, 0 replies; 50+ messages in thread
From: Nick Piggin @ 2009-04-30  6:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christoph Hellwig, Al Viro, linux-fsdevel, linux-kernel,
	Peter Zijlstra

On Tue, Apr 28, 2009 at 04:32:13AM -0700, Eric W. Biederman wrote:
> Christoph Hellwig <hch@infradead.org> writes:
> 
> > On Sat, Apr 25, 2009 at 09:06:49AM +0100, Al Viro wrote:
> >> Maybe...  What Eric proposed is essentially a reuse of s_list for per-inode
> >> list of struct file.  Presumably with something like i_lock for protection.
> >> So that's not a conflict.
> >
> > But what do we actually want it for?  Right now it's only used for
> > ttys, which Nick has split out, and for remount r/o.  For the normal
> > remount r/o case it will go away once we have proper per-sb writer
> > counts.  And the forced remount r/o from sysrq is completely broken.
> 
> The plan is to post my updated patches tomorrow after I have slept.
> 
> What I am looking at is that the tty layer is not a special case.  Any
> subsystem that wants any kind of revoke functionality starts wanting
> the list of files that are open.  My current list of places where we have
> something like this is: sysfs, proc, sysctl, tun, tty, sound.

How's this coming along? It would be good to get this change out
and reviewed ASAP, and ahead of the rest of your patchset IMO.
Hopefully we can get it into Al's tree for 2.6.31 if it all falls
out nicely.

BTW, I would just keep the single files_lock spinlock in the
first patch that would move to per-inode lists, and then a 2nd
patch could swap out the locking without making any other changes
(or else you could just leave the locking global and we can
evaluate it in the context of the rest of the vfs scalability
work).


^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2009-04-30  6:14 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-25  1:20 [patch 00/27] [rfc] vfs scalability patchset npiggin
2009-04-25  1:20 ` [patch 01/27] fs: cleanup files_lock npiggin
2009-04-25  3:20   ` Al Viro
2009-04-25  5:35   ` Eric W. Biederman
2009-04-26  6:12     ` Nick Piggin
2009-04-25  9:42   ` Alan Cox
2009-04-26  6:15     ` Nick Piggin
2009-04-25  1:20 ` [patch 02/27] fs: scale files_lock npiggin
2009-04-25  3:32   ` Al Viro
2009-04-25  1:20 ` [patch 03/27] fs: mnt_want_write speedup npiggin
2009-04-25  1:20 ` [patch 04/27] fs: introduce mnt_clone_write npiggin
2009-04-25  3:35   ` Al Viro
2009-04-25  1:20 ` [patch 05/27] fs: brlock vfsmount_lock npiggin
2009-04-25  3:50   ` Al Viro
2009-04-26  6:36     ` Nick Piggin
2009-04-25  1:20 ` [patch 06/27] fs: dcache fix LRU ordering npiggin
2009-04-25  1:20 ` [patch 07/27] fs: dcache scale hash npiggin
2009-04-25  1:20 ` [patch 08/27] fs: dcache scale lru npiggin
2009-04-25  1:20 ` [patch 09/27] fs: dcache scale nr_dentry npiggin
2009-04-25  1:20 ` [patch 10/27] fs: dcache scale dentry refcount npiggin
2009-04-25  1:20 ` [patch 11/27] fs: dcache scale d_unhashed npiggin
2009-04-25  1:20 ` [patch 12/27] fs: dcache scale subdirs npiggin
2009-04-25  1:20 ` [patch 13/27] fs: scale inode alias list npiggin
2009-04-25  1:20 ` [patch 14/27] fs: use RCU / seqlock logic for reverse and multi-step operations npiggin
2009-04-25  1:20 ` [patch 15/27] fs: dcache remove dcache_lock npiggin
2009-04-25  1:20 ` [patch 16/27] fs: dcache reduce dput locking npiggin
2009-04-25  1:20 ` [patch 17/27] fs: dcache per-bucket dcache hash locking npiggin
2009-04-25  1:20 ` [patch 18/27] fs: dcache reduce dcache_inode_lock npiggin
2009-04-25  1:20 ` [patch 19/27] fs: dcache per-inode inode alias locking npiggin
2009-04-25  1:20 ` [patch 20/27] fs: icache lock s_inodes list npiggin
2009-04-25  1:20 ` [patch 21/27] fs: icache lock inode hash npiggin
2009-04-25  1:20 ` [patch 22/27] fs: icache lock i_state npiggin
2009-04-25  1:20 ` [patch 23/27] fs: icache lock i_count npiggin
2009-04-25  1:20 ` [patch 24/27] fs: icache atomic inodes_stat npiggin
2009-04-25  1:20 ` [patch 25/27] fs: icache lock lru/writeback lists npiggin
2009-04-25  1:20 ` [patch 26/27] fs: icache protect inode state npiggin
2009-04-25  1:20 ` [patch 27/27] fs: icache remove inode_lock npiggin
2009-04-25  4:18 ` [patch 00/27] [rfc] vfs scalability patchset Al Viro
2009-04-25  5:02   ` Nick Piggin
2009-04-25  8:01   ` Christoph Hellwig
2009-04-25  8:06     ` Al Viro
2009-04-28  9:09       ` Christoph Hellwig
2009-04-28  9:48         ` Nick Piggin
2009-04-28 10:58         ` Peter Zijlstra
2009-04-28 11:32         ` Eric W. Biederman
2009-04-30  6:14           ` Nick Piggin
2009-04-25 19:08     ` Eric W. Biederman
2009-04-25 19:31       ` Al Viro
2009-04-25 20:29         ` Eric W. Biederman
2009-04-25 22:05           ` Theodore Tso

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).