* [patch 0/4] Initial vfs scalability patches again
@ 2010-06-04  6:43 Nick Piggin
  2010-06-04  6:43 ` [patch 1/4] fs: cleanup files_lock Nick Piggin
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Nick Piggin @ 2010-06-04  6:43 UTC
  To: Al Viro; +Cc: linux-kernel, linux-fsdevel, Frank Mayhar, John Stultz,
	Andi Kleen

OK, I realised what I was smoking last time. So I put down the pipe and
went to score some stronger crack. And then:
- reduced ifdefs as much as feasible
- added more comments, avoided churn
- vastly improved lock library code, works with lockdep
- added helpers for file list iterations
- added an lglock type for what was previously open coded in file list locking

It looks in much better shape now, I hope. Al, would you consider them?

With all patches applied, I ran some single-threaded microbenchmarks, and it
was difficult to tell much difference from the noise. I don't claim that there
is no slowdown, because there are more instructions and memory accesses for
SMP. But it doesn't seem too bad.

On an Opteron, I ran each test 30 times. Each run lasts 3 seconds, performing
as many operations as possible. Between every 10 runs, I rebooted. After all
that you still get artifacts, oh well.
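
Each test is essentially a tight syscall loop of this shape (a minimal
sketch of the harness, assuming a scratch file at /tmp/f; not the exact
code I used):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	unsigned long ops = 0;
	time_t end = time(NULL) + 3;	/* each run lasts 3 seconds */

	while (time(NULL) < end) {
		int fd = open("/tmp/f", O_RDONLY | O_CREAT, 0600);
		if (fd < 0)
			return 1;
		close(fd);
		ops++;
	}
	printf("%lu open/close pairs\n", ops);	/* compare across kernels */
	return 0;
}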

Difference at 95.0% confidence (times, positive means patch is slower)
dup/close    No difference proven at 95.0% confidence
open/close  -2.48989% +/- 0.538414%
creat/unlink 3.14688% +/- 0.32411%


* [patch 1/4] fs: cleanup files_lock
  2010-06-04  6:43 [patch 0/4] Initial vfs scalability patches again Nick Piggin
@ 2010-06-04  6:43 ` Nick Piggin
  2010-06-04  8:38   ` Christoph Hellwig
  2010-06-04 18:39   ` [PATCH, RFC] tty: stop abusing file->f_u.fu_list Christoph Hellwig
  2010-06-04  6:43 ` [patch 2/4] lglock: introduce special lglock and brlock spin locks Nick Piggin
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 18+ messages in thread
From: Nick Piggin @ 2010-06-04  6:43 UTC
  To: Al Viro
  Cc: linux-kernel, linux-fsdevel, Frank Mayhar, John Stultz,
	Andi Kleen, Alan Cox, Eric W. Biederman, Greg Kroah-Hartman

[-- Attachment #1: fs-files_list-improve.patch --]
[-- Type: text/plain, Size: 9998 bytes --]

Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
manipulate the per-sb files list; unexport the files_lock spinlock.
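
For reference, the before/after call pattern, condensed from the diff
below:

	/* before: one exported files_lock covered both uses */
	file_move(filp, &inode->i_sb->s_files);	/* or &tty->tty_files */
	file_kill(filp);

	/* after: the sb list gets helpers under a now-private files_lock */
	file_sb_list_add(filp, inode->i_sb);	/* in __dentry_open() */
	file_sb_list_del(filp);			/* in __fput(), put_filp() */

	/* ...and the tty list moves under its own tty_files_lock */
	spin_lock(&tty_files_lock);
	list_add(&filp->f_u.fu_list, &tty->tty_files);
	spin_unlock(&tty_files_lock);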

Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Frank Mayhar <fmayhar@google.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 drivers/char/pty.c       |    6 +++++-
 drivers/char/tty_io.c    |   26 ++++++++++++++++++--------
 fs/file_table.c          |   42 ++++++++++++++++++------------------------
 fs/open.c                |    4 ++--
 include/linux/fs.h       |    7 ++-----
 include/linux/tty.h      |    1 +
 security/selinux/hooks.c |    4 ++--
 7 files changed, 48 insertions(+), 42 deletions(-)

Index: linux-2.6/drivers/char/pty.c
===================================================================
--- linux-2.6.orig/drivers/char/pty.c
+++ linux-2.6/drivers/char/pty.c
@@ -650,7 +650,11 @@ static int __ptmx_open(struct inode *ino
 
 	set_bit(TTY_PTY_LOCK, &tty->flags); /* LOCK THE SLAVE */
 	filp->private_data = tty;
-	file_move(filp, &tty->tty_files);
+
+	file_sb_list_del(filp); /* __dentry_open has put it on the sb list */
+	spin_lock(&tty_files_lock);
+	list_add(&filp->f_u.fu_list, &tty->tty_files);
+	spin_unlock(&tty_files_lock);
 
 	retval = devpts_pty_new(inode, tty->link);
 	if (retval)
Index: linux-2.6/drivers/char/tty_io.c
===================================================================
--- linux-2.6.orig/drivers/char/tty_io.c
+++ linux-2.6/drivers/char/tty_io.c
@@ -136,6 +136,9 @@ LIST_HEAD(tty_drivers);			/* linked list
 DEFINE_MUTEX(tty_mutex);
 EXPORT_SYMBOL(tty_mutex);
 
+/* Spinlock to protect the tty->tty_files list */
+DEFINE_SPINLOCK(tty_files_lock);
+
 static ssize_t tty_read(struct file *, char __user *, size_t, loff_t *);
 static ssize_t tty_write(struct file *, const char __user *, size_t, loff_t *);
 ssize_t redirected_tty_write(struct file *, const char __user *,
@@ -234,11 +237,11 @@ static int check_tty_count(struct tty_st
 	struct list_head *p;
 	int count = 0;
 
-	file_list_lock();
+	spin_lock(&tty_files_lock);
 	list_for_each(p, &tty->tty_files) {
 		count++;
 	}
-	file_list_unlock();
+	spin_unlock(&tty_files_lock);
 	if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
 	    tty->driver->subtype == PTY_TYPE_SLAVE &&
 	    tty->link && tty->link->count)
@@ -517,7 +520,7 @@ static void do_tty_hangup(struct work_st
 	lock_kernel();
 	check_tty_count(tty, "do_tty_hangup");
 
-	file_list_lock();
+	spin_lock(&tty_files_lock);
 	/* This breaks for file handles being sent over AF_UNIX sockets ? */
 	list_for_each_entry(filp, &tty->tty_files, f_u.fu_list) {
 		if (filp->f_op->write == redirected_tty_write)
@@ -528,7 +531,7 @@ static void do_tty_hangup(struct work_st
 		tty_fasync(-1, filp, 0);	/* can't block */
 		filp->f_op = &hung_up_tty_fops;
 	}
-	file_list_unlock();
+	spin_unlock(&tty_files_lock);
 
 	tty_ldisc_hangup(tty);
 
@@ -1419,9 +1422,9 @@ static void release_one_tty(struct work_
 	tty_driver_kref_put(driver);
 	module_put(driver->owner);
 
-	file_list_lock();
+	spin_lock(&tty_files_lock);
 	list_del_init(&tty->tty_files);
-	file_list_unlock();
+	spin_unlock(&tty_files_lock);
 
 	put_pid(tty->pgrp);
 	put_pid(tty->session);
@@ -1666,7 +1669,10 @@ int tty_release(struct inode *inode, str
 	 *  - do_tty_hangup no longer sees this file descriptor as
 	 *    something that needs to be handled for hangups.
 	 */
-	file_kill(filp);
+	spin_lock(&tty_files_lock);
+	BUG_ON(list_empty(&filp->f_u.fu_list));
+	list_del_init(&filp->f_u.fu_list);
+	spin_unlock(&tty_files_lock);
 	filp->private_data = NULL;
 
 	/*
@@ -1835,7 +1841,11 @@ got_driver:
 	}
 
 	filp->private_data = tty;
-	file_move(filp, &tty->tty_files);
+	BUG_ON(list_empty(&filp->f_u.fu_list));
+	file_sb_list_del(filp); /* __dentry_open has put it on the sb list */
+	spin_lock(&tty_files_lock);
+	list_add(&filp->f_u.fu_list, &tty->tty_files);
+	spin_unlock(&tty_files_lock);
 	check_tty_count(tty, "tty_open");
 	if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
 	    tty->driver->subtype == PTY_TYPE_MASTER)
Index: linux-2.6/fs/file_table.c
===================================================================
--- linux-2.6.orig/fs/file_table.c
+++ linux-2.6/fs/file_table.c
@@ -32,8 +32,7 @@ struct files_stat_struct files_stat = {
 	.max_files = NR_FILE
 };
 
-/* public. Not pretty! */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
+static __cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
 
 /* SLAB cache for file structures */
 static struct kmem_cache *filp_cachep __read_mostly;
@@ -249,7 +248,7 @@ static void __fput(struct file *file)
 		cdev_put(inode->i_cdev);
 	fops_put(file->f_op);
 	put_pid(file->f_owner.pid);
-	file_kill(file);
+	file_sb_list_del(file);
 	if (file->f_mode & FMODE_WRITE)
 		drop_file_write_access(file);
 	file->f_path.dentry = NULL;
@@ -319,31 +318,29 @@ struct file *fget_light(unsigned int fd,
 	return file;
 }
 
-
 void put_filp(struct file *file)
 {
 	if (atomic_long_dec_and_test(&file->f_count)) {
 		security_file_free(file);
-		file_kill(file);
+		file_sb_list_del(file);
 		file_free(file);
 	}
 }
 
-void file_move(struct file *file, struct list_head *list)
+void file_sb_list_add(struct file *file, struct super_block *sb)
 {
-	if (!list)
-		return;
-	file_list_lock();
-	list_move(&file->f_u.fu_list, list);
-	file_list_unlock();
+	spin_lock(&files_lock);
+	BUG_ON(!list_empty(&file->f_u.fu_list));
+	list_add(&file->f_u.fu_list, &sb->s_files);
+	spin_unlock(&files_lock);
 }
 
-void file_kill(struct file *file)
+void file_sb_list_del(struct file *file)
 {
 	if (!list_empty(&file->f_u.fu_list)) {
-		file_list_lock();
+		spin_lock(&files_lock);
 		list_del_init(&file->f_u.fu_list);
-		file_list_unlock();
+		spin_unlock(&files_lock);
 	}
 }
 
@@ -352,7 +349,7 @@ int fs_may_remount_ro(struct super_block
 	struct file *file;
 
 	/* Check that no files are currently opened for writing. */
-	file_list_lock();
+	spin_lock(&files_lock);
 	list_for_each_entry(file, &sb->s_files, f_u.fu_list) {
 		struct inode *inode = file->f_path.dentry->d_inode;
 
@@ -364,10 +361,10 @@ int fs_may_remount_ro(struct super_block
 		if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
 			goto too_bad;
 	}
-	file_list_unlock();
+	spin_unlock(&files_lock);
 	return 1; /* Tis' cool bro. */
 too_bad:
-	file_list_unlock();
+	spin_unlock(&files_lock);
 	return 0;
 }
 
@@ -383,7 +380,7 @@ void mark_files_ro(struct super_block *s
 	struct file *f;
 
 retry:
-	file_list_lock();
+	spin_lock(&files_lock);
 	list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
 		struct vfsmount *mnt;
 		if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
@@ -399,16 +396,13 @@ retry:
 			continue;
 		file_release_write(f);
 		mnt = mntget(f->f_path.mnt);
-		file_list_unlock();
-		/*
-		 * This can sleep, so we can't hold
-		 * the file_list_lock() spinlock.
-		 */
+		/* This can sleep, so we can't hold the spinlock. */
+		spin_unlock(&files_lock);
 		mnt_drop_write(mnt);
 		mntput(mnt);
 		goto retry;
 	}
-	file_list_unlock();
+	spin_unlock(&files_lock);
 }
 
 void __init files_init(unsigned long mempages)
Index: linux-2.6/fs/open.c
===================================================================
--- linux-2.6.orig/fs/open.c
+++ linux-2.6/fs/open.c
@@ -675,7 +675,7 @@ static struct file *__dentry_open(struct
 	f->f_path.mnt = mnt;
 	f->f_pos = 0;
 	f->f_op = fops_get(inode->i_fop);
-	file_move(f, &inode->i_sb->s_files);
+	file_sb_list_add(f, inode->i_sb);
 
 	error = security_dentry_open(f, cred);
 	if (error)
@@ -721,7 +721,7 @@ cleanup_all:
 			mnt_drop_write(mnt);
 		}
 	}
-	file_kill(f);
+	file_sb_list_del(f);
 	f->f_path.dentry = NULL;
 	f->f_path.mnt = NULL;
 cleanup_file:
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -949,9 +949,6 @@ struct file {
 	unsigned long f_mnt_write_state;
 #endif
 };
-extern spinlock_t files_lock;
-#define file_list_lock() spin_lock(&files_lock);
-#define file_list_unlock() spin_unlock(&files_lock);
 
 #define get_file(x)	atomic_long_inc(&(x)->f_count)
 #define fput_atomic(x)	atomic_long_add_unless(&(x)->f_count, -1, 1)
@@ -2182,8 +2179,8 @@ static inline void insert_inode_hash(str
 	__insert_inode_hash(inode, inode->i_ino);
 }
 
-extern void file_move(struct file *f, struct list_head *list);
-extern void file_kill(struct file *f);
+extern void file_sb_list_add(struct file *f, struct super_block *sb);
+extern void file_sb_list_del(struct file *f);
 #ifdef CONFIG_BLOCK
 struct bio;
 extern void submit_bio(int, struct bio *);
Index: linux-2.6/security/selinux/hooks.c
===================================================================
--- linux-2.6.orig/security/selinux/hooks.c
+++ linux-2.6/security/selinux/hooks.c
@@ -2219,7 +2219,7 @@ static inline void flush_unauthorized_fi
 
 	tty = get_current_tty();
 	if (tty) {
-		file_list_lock();
+		spin_lock(&tty_files_lock);
 		if (!list_empty(&tty->tty_files)) {
 			struct inode *inode;
 
@@ -2235,7 +2235,7 @@ static inline void flush_unauthorized_fi
 				drop_tty = 1;
 			}
 		}
-		file_list_unlock();
+		spin_unlock(&tty_files_lock);
 		tty_kref_put(tty);
 	}
 	/* Reset controlling tty. */
Index: linux-2.6/include/linux/tty.h
===================================================================
--- linux-2.6.orig/include/linux/tty.h
+++ linux-2.6/include/linux/tty.h
@@ -467,6 +467,7 @@ extern struct tty_struct *tty_pair_get_t
 extern struct tty_struct *tty_pair_get_pty(struct tty_struct *tty);
 
 extern struct mutex tty_mutex;
+extern spinlock_t tty_files_lock;
 
 extern void tty_write_unlock(struct tty_struct *tty);
 extern int tty_write_lock(struct tty_struct *tty, int ndelay);




* [patch 2/4] lglock: introduce special lglock and brlock spin locks
  2010-06-04  6:43 [patch 0/4] Initial vfs scalability patches again Nick Piggin
  2010-06-04  6:43 ` [patch 1/4] fs: cleanup files_lock Nick Piggin
@ 2010-06-04  6:43 ` Nick Piggin
  2010-06-04  7:56   ` Eric Dumazet
  2010-06-04 15:03   ` Paul E. McKenney
  2010-06-04  6:43 ` [patch 3/4] fs: scale files_lock Nick Piggin
  2010-06-04  6:43 ` [patch 4/4] fs: brlock vfsmount_lock Nick Piggin
  3 siblings, 2 replies; 18+ messages in thread
From: Nick Piggin @ 2010-06-04  6:43 UTC
  To: Al Viro
  Cc: linux-kernel, linux-fsdevel, Paul E. McKenney, Frank Mayhar,
	John Stultz, Andi Kleen

[-- Attachment #1: kernel-introduce-brlock.patch --]
[-- Type: text/plain, Size: 6944 bytes --]

This patch introduces "local-global" locks (lglocks). These can be used to:

- Provide fast exclusive access to per-CPU data, with exclusive access to
  another CPU's data allowed but possibly subject to contention, and to provide
  very slow exclusive access to all per-CPU data.
- Or to provide very fast and scalable read serialisation, and to provide
  very slow exclusive serialisation of data (not necessarily per-CPU data).

Brlocks are also implemented as a short-hand notation for the latter use
case.

Thanks to Paul for local/global naming convention.
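
Usage ends up looking like this (a minimal sketch against the interface
below; example_lock is a made-up name):

/* in a header */
DECLARE_BRLOCK(example_lock);

/* in one .c file */
DEFINE_BRLOCK(example_lock);

	br_lock_init(example_lock);	/* once, before first use */

	br_read_lock(example_lock);	/* fast path: local CPU's lock only */
	/* ... read-side critical section ... */
	br_read_unlock(example_lock);

	br_write_lock(example_lock);	/* slow path: takes every CPU's lock */
	/* ... exclusive critical section ... */
	br_write_unlock(example_lock);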

Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Frank Mayhar <fmayhar@google.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 include/linux/lglock.h |  165 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 165 insertions(+)

Index: linux-2.6/include/linux/lglock.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/lglock.h
@@ -0,0 +1,165 @@
+/*
+ * Specialised local-global spinlock. Can only be declared as global variables
+ * to avoid overhead and keep things simple (and we don't want to start using
+ * these inside dynamically allocated structures).
+ *
+ * "local/global locks" (lglocks) can be used to:
+ *
+ * - Provide fast exclusive access to per-CPU data, with exclusive access to
+ *   another CPU's data allowed but possibly subject to contention, and to
+ *   provide very slow exclusive access to all per-CPU data.
+ * - Or to provide very fast and scalable read serialisation, and to provide
+ *   very slow exclusive serialisation of data (not necessarily per-CPU data).
+ *
+ * Brlocks are also implemented as a short-hand notation for the latter use
+ * case.
+ *
+ * Copyright 2009, 2010, Nick Piggin, Novell Inc.
+ */
+#ifndef __LINUX_LGLOCK_H
+#define __LINUX_LGLOCK_H
+
+#include <linux/spinlock.h>
+#include <linux/lockdep.h>
+#include <linux/percpu.h>
+#include <asm/atomic.h>
+
+/* can make br locks by using local lock for read side, global lock for write */
+#define br_lock_init(name)	name##_lock_init()
+#define br_read_lock(name)	name##_local_lock()
+#define br_read_unlock(name)	name##_local_unlock()
+#define br_write_lock(name)	name##_global_lock()
+#define br_write_unlock(name)	name##_global_unlock()
+#define atomic_dec_and_br_write_lock(atomic, name)	name##_atomic_dec_and_global_lock(atomic)
+
+#define DECLARE_BRLOCK(name)	DECLARE_LGLOCK(name)
+#define DEFINE_BRLOCK(name)	DEFINE_LGLOCK(name)
+
+
+#define lg_lock_init(name)	name##_lock_init()
+#define lg_local_lock(name)	name##_local_lock()
+#define lg_local_unlock(name)	name##_local_unlock()
+#define lg_local_lock_cpu(name, cpu)	name##_local_lock_cpu(cpu)
+#define lg_local_unlock_cpu(name, cpu)	name##_local_unlock_cpu(cpu)
+#define lg_global_lock(name)	name##_global_lock()
+#define lg_global_unlock(name)	name##_global_unlock()
+#define atomic_dec_and_lg_global_lock(atomic, name)	name##_atomic_dec_and_global_lock(atomic)
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+#define LOCKDEP_INIT_MAP lockdep_init_map
+
+#define DEFINE_LGLOCK_LOCKDEP(name)					\
+ struct lock_class_key name##_lock_key;					\
+ struct lockdep_map name##_lock_dep_map;				\
+ EXPORT_SYMBOL(name##_lock_dep_map)
+
+#else
+#define LOCKDEP_INIT_MAP(a, b, c, d)
+
+#define DEFINE_LGLOCK_LOCKDEP(name)
+#endif
+
+
+#define DECLARE_LGLOCK(name)						\
+ extern void name##_lock_init(void);					\
+ extern void name##_local_lock(void);					\
+ extern void name##_local_unlock(void);					\
+ extern void name##_local_lock_cpu(int cpu);				\
+ extern void name##_local_unlock_cpu(int cpu);				\
+ extern void name##_global_lock(void);					\
+ extern void name##_global_unlock(void);				\
+ extern int name##_atomic_dec_and_global_lock(atomic_t *a);		\
+
+#define DEFINE_LGLOCK(name)						\
+									\
+ DEFINE_PER_CPU(arch_spinlock_t, name##_lock);				\
+ DEFINE_LGLOCK_LOCKDEP(name);						\
+									\
+ void name##_lock_init(void) {						\
+	int i;								\
+	LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \
+	for_each_possible_cpu(i) {					\
+		arch_spinlock_t *lock;					\
+		lock = &per_cpu(name##_lock, i);			\
+		*lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;	\
+	}								\
+ }									\
+ EXPORT_SYMBOL(name##_lock_init);					\
+									\
+ void name##_local_lock(void) {						\
+	arch_spinlock_t *lock;						\
+	preempt_disable();						\
+	rwlock_acquire_read(&name##_lock_dep_map, 0, 0, _THIS_IP_);	\
+	lock = &__get_cpu_var(name##_lock);				\
+	arch_spin_lock(lock);						\
+ }									\
+ EXPORT_SYMBOL(name##_local_lock);					\
+									\
+ void name##_local_unlock(void) {					\
+	arch_spinlock_t *lock;						\
+	rwlock_release(&name##_lock_dep_map, 1, _THIS_IP_);		\
+	lock = &__get_cpu_var(name##_lock);				\
+	arch_spin_unlock(lock);						\
+	preempt_enable();						\
+ }									\
+ EXPORT_SYMBOL(name##_local_unlock);					\
+									\
+ void name##_local_lock_cpu(int cpu) {			\
+	arch_spinlock_t *lock;						\
+	preempt_disable();						\
+	rwlock_acquire_read(&name##_lock_dep_map, 0, 0, _THIS_IP_);	\
+	lock = &per_cpu(name##_lock, cpu);				\
+	arch_spin_lock(lock);						\
+ }									\
+ EXPORT_SYMBOL(name##_local_lock_cpu);					\
+									\
+ void name##_local_unlock_cpu(int cpu) {			\
+	arch_spinlock_t *lock;						\
+	rwlock_release(&name##_lock_dep_map, 1, _THIS_IP_);		\
+	lock = &per_cpu(name##_lock, cpu);				\
+	arch_spin_unlock(lock);						\
+	preempt_enable();						\
+ }									\
+ EXPORT_SYMBOL(name##_local_unlock_cpu);				\
+									\
+ void name##_global_lock(void) {					\
+	int i;								\
+	preempt_disable();						\
+	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
+	for_each_online_cpu(i) {					\
+		arch_spinlock_t *lock;					\
+		lock = &per_cpu(name##_lock, i);			\
+		arch_spin_lock(lock);					\
+	}								\
+ }									\
+ EXPORT_SYMBOL(name##_global_lock);					\
+									\
+ void name##_global_unlock(void) {					\
+	int i;								\
+	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
+	for_each_online_cpu(i) {					\
+		arch_spinlock_t *lock;					\
+		lock = &per_cpu(name##_lock, i);			\
+		arch_spin_unlock(lock);					\
+	}								\
+	preempt_enable();						\
+ }									\
+ EXPORT_SYMBOL(name##_global_unlock);					\
+									\
+ static int name##_atomic_dec_and_global_lock__failed(atomic_t *a) {	\
+	name##_global_lock();						\
+	if (!atomic_dec_and_test(a)) {					\
+		name##_global_unlock();					\
+		return 0;						\
+	}								\
+	return 1;							\
+ }									\
+ 									\
+ int name##_atomic_dec_and_global_lock(atomic_t *a) {			\
+	if (likely(atomic_add_unless(a, -1, 1)))			\
+		return 0;						\
+	return name##_atomic_dec_and_global_lock__failed(a);		\
+ }									\
+ EXPORT_SYMBOL(name##_atomic_dec_and_global_lock);
+
+#endif


* [patch 3/4] fs: scale files_lock
  2010-06-04  6:43 [patch 0/4] Initial vfs scalability patches again Nick Piggin
  2010-06-04  6:43 ` [patch 1/4] fs: cleanup files_lock Nick Piggin
  2010-06-04  6:43 ` [patch 2/4] lglock: introduce special lglock and brlock spin locks Nick Piggin
@ 2010-06-04  6:43 ` Nick Piggin
  2010-06-04  6:43 ` [patch 4/4] fs: brlock vfsmount_lock Nick Piggin
  3 siblings, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2010-06-04  6:43 UTC
  To: Al Viro
  Cc: linux-kernel, linux-fsdevel, Frank Mayhar, John Stultz,
	Eric W. Biederman, Tim Chen, Andi Kleen

[-- Attachment #1: fs-files_lock-scale.patch --]
[-- Type: text/plain, Size: 9279 bytes --]

Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu
lists to add and remove files. It also provides a snapshot of all the
per-cpu lists (although this is very slow).

One difficulty with this approach is that a file can be removed from the list
by another CPU, so we must track which per-cpu list the file is on. Scalability
could suffer if files are frequently removed from a different CPU's list.

However, loads with frequent removal of files imply a short interval between
adding and removing the files, and the scheduler attempts to avoid moving
processes too far away. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.

A worst-case test of 1 CPU allocating files that are subsequently freed by N
CPUs degenerates to contending on a single lock, which is no worse than before.
When more than one CPU is allocating files, even if they are always freed by
different CPUs, there will be more parallelism than in the single-lock case.
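
The add/remove pair at the heart of this, condensed from the diff below:

	/* add: always the local CPU's list; record which list was used */
	lg_local_lock(files_lglock);
	file->f_sb_list_cpu = smp_processor_id();
	list_add(&file->f_u.fu_list,
		 per_cpu_ptr(sb->s_files, file->f_sb_list_cpu));
	lg_local_unlock(files_lglock);

	/* remove: may run on another CPU, so take the recorded CPU's lock */
	lg_local_lock_cpu(files_lglock, file->f_sb_list_cpu);
	list_del_init(&file->f_u.fu_list);
	lg_local_unlock_cpu(files_lglock, file->f_sb_list_cpu);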


Testing results:

On a 2-socket, 8-core Opteron, I measured the number of times the lock is taken
to remove a file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.

Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)

So a file is removed by the same CPU it was added on over 90% of the time,
and it remains within the same node over 95% of the time.


Tim Chen ran some numbers for a 64-thread Nehalem system performing a compile.

                throughput
2.6.34-rc2      24.5
+patch          24.9

                us      sys     idle    IO wait (in %)
2.6.34-rc2      51.25   28.25   17.25   3.25
+patch          53.75   18.5    19      8.75

So significantly less CPU time spent in kernel code, higher idle time and
slightly higher throughput.


The single-threaded performance difference was within the noise of even my
microbenchmarks. That is not to say one does not exist: the code is
larger and more memory accesses are required.

Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Frank Mayhar <fmayhar@google.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/file_table.c    |   99 +++++++++++++++++++++++++++++++++++++++++++----------
 fs/super.c         |   18 +++++++++
 include/linux/fs.h |    7 +++
 3 files changed, 106 insertions(+), 18 deletions(-)

Index: linux-2.6/fs/file_table.c
===================================================================
--- linux-2.6.orig/fs/file_table.c
+++ linux-2.6/fs/file_table.c
@@ -20,7 +20,9 @@
 #include <linux/cdev.h>
 #include <linux/fsnotify.h>
 #include <linux/sysctl.h>
+#include <linux/lglock.h>
 #include <linux/percpu_counter.h>
+#include <linux/percpu.h>
 #include <linux/ima.h>
 
 #include <asm/atomic.h>
@@ -32,7 +34,8 @@ struct files_stat_struct files_stat = {
 	.max_files = NR_FILE
 };
 
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
+DECLARE_LGLOCK(files_lglock);
+DEFINE_LGLOCK(files_lglock);
 
 /* SLAB cache for file structures */
 static struct kmem_cache *filp_cachep __read_mostly;
@@ -327,30 +330,89 @@ void put_filp(struct file *file)
 	}
 }
 
+/* helper for file_sb_list_add to reduce ifdefs */
+static inline void __file_sb_list_add(struct file *file, struct super_block *sb)
+{
+	struct list_head *list;
+#ifdef CONFIG_SMP
+	int cpu;
+	cpu = smp_processor_id();
+	file->f_sb_list_cpu = cpu;
+	list = per_cpu_ptr(sb->s_files, cpu);
+#else
+	list = &sb->s_files;
+#endif
+	list_add(&file->f_u.fu_list, list);
+}
+
+/**
+ * file_sb_list_add - add a file to the sb's file list
+ * @file: file to add
+ * @sb: sb to add it to
+ *
+ * Use this function to associate a file with the superblock of the inode it
+ * refers to.
+ */
 void file_sb_list_add(struct file *file, struct super_block *sb)
 {
-	spin_lock(&files_lock);
-	BUG_ON(!list_empty(&file->f_u.fu_list));
-	list_add(&file->f_u.fu_list, &sb->s_files);
-	spin_unlock(&files_lock);
+	lg_local_lock(files_lglock);
+	__file_sb_list_add(file, sb);
+	lg_local_unlock(files_lglock);
 }
 
+/**
+ * file_sb_list_del - remove a file from the sb's file list
+ * @file: file to remove
+ *
+ * Use this function to remove a file from its superblock.
+ */
 void file_sb_list_del(struct file *file)
 {
 	if (!list_empty(&file->f_u.fu_list)) {
-		spin_lock(&files_lock);
+		lg_local_lock_cpu(files_lglock, file->f_sb_list_cpu);
 		list_del_init(&file->f_u.fu_list);
-		spin_unlock(&files_lock);
+		lg_local_unlock_cpu(files_lglock, file->f_sb_list_cpu);
 	}
 }
 
+#ifdef CONFIG_SMP
+
+/*
+ * These macros iterate all files on all CPUs for a given superblock.
+ * files_lglock must be held globally.
+ */
+#define do_file_list_for_each_entry(__sb, __file)		\
+{								\
+	int i;							\
+	for_each_possible_cpu(i) {				\
+		struct list_head *list;				\
+		list = per_cpu_ptr((__sb)->s_files, i);		\
+		list_for_each_entry((__file), list, f_u.fu_list)
+
+#define while_file_list_for_each_entry				\
+	}							\
+}
+
+#else
+
+#define do_file_list_for_each_entry(__sb, __file)		\
+{								\
+	struct list_head *list;					\
+	list = &(sb)->s_files;					\
+	list_for_each_entry((__file), list, f_u.fu_list)
+
+#define while_file_list_for_each_entry				\
+}
+
+#endif
+
 int fs_may_remount_ro(struct super_block *sb)
 {
 	struct file *file;
-
 	/* Check that no files are currently opened for writing. */
-	spin_lock(&files_lock);
-	list_for_each_entry(file, &sb->s_files, f_u.fu_list) {
+	lg_global_lock(files_lglock);
+	do_file_list_for_each_entry(sb, file) {
 		struct inode *inode = file->f_path.dentry->d_inode;
 
 		/* File with pending delete? */
@@ -360,11 +422,11 @@ int fs_may_remount_ro(struct super_block
 		/* Writeable file? */
 		if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
 			goto too_bad;
-	}
-	spin_unlock(&files_lock);
+	} while_file_list_for_each_entry;
+	lg_global_unlock(files_lglock);
 	return 1; /* Tis' cool bro. */
 too_bad:
-	spin_unlock(&files_lock);
+	lg_global_unlock(files_lglock);
 	return 0;
 }
 
@@ -380,8 +442,8 @@ void mark_files_ro(struct super_block *s
 	struct file *f;
 
 retry:
-	spin_lock(&files_lock);
-	list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
+	lg_global_lock(files_lglock);
+	do_file_list_for_each_entry(sb, f) {
 		struct vfsmount *mnt;
 		if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
 		       continue;
@@ -397,12 +459,12 @@ retry:
 		file_release_write(f);
 		mnt = mntget(f->f_path.mnt);
 		/* This can sleep, so we can't hold the spinlock. */
-		spin_unlock(&files_lock);
+		lg_global_unlock(files_lglock);
 		mnt_drop_write(mnt);
 		mntput(mnt);
 		goto retry;
-	}
-	spin_unlock(&files_lock);
+	} while_file_list_for_each_entry;
+	lg_global_unlock(files_lglock);
 }
 
 void __init files_init(unsigned long mempages)
@@ -422,5 +484,6 @@ void __init files_init(unsigned long mem
 	if (files_stat.max_files < NR_FILE)
 		files_stat.max_files = NR_FILE;
 	files_defer_init();
+	lg_lock_init(files_lglock);
 	percpu_counter_init(&nr_files, 0);
 } 
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -54,7 +54,22 @@ static struct super_block *alloc_super(s
 			s = NULL;
 			goto out;
 		}
+#ifdef CONFIG_SMP
+		s->s_files = alloc_percpu(struct list_head);
+		if (!s->s_files) {
+			security_sb_free(s);
+			kfree(s);
+			s = NULL;
+			goto out;
+		} else {
+			int i;
+
+			for_each_possible_cpu(i)
+				INIT_LIST_HEAD(per_cpu_ptr(s->s_files, i));
+		}
+#else
 		INIT_LIST_HEAD(&s->s_files);
+#endif
 		INIT_LIST_HEAD(&s->s_instances);
 		INIT_HLIST_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
@@ -108,6 +123,9 @@ out:
  */
 static inline void destroy_super(struct super_block *s)
 {
+#ifdef CONFIG_SMP
+	free_percpu(s->s_files);
+#endif
 	security_sb_free(s);
 	kfree(s->s_subtype);
 	kfree(s->s_options);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -925,6 +925,9 @@ struct file {
 #define f_vfsmnt	f_path.mnt
 	const struct file_operations	*f_op;
 	spinlock_t		f_lock;  /* f_ep_links, f_flags, no IRQ */
+#ifdef CONFIG_SMP
+	int			f_sb_list_cpu;
+#endif
 	atomic_long_t		f_count;
 	unsigned int 		f_flags;
 	fmode_t			f_mode;
@@ -1339,7 +1342,11 @@ struct super_block {
 
 	struct list_head	s_inodes;	/* all inodes */
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
+#ifdef CONFIG_SMP
+	struct list_head __percpu *s_files;
+#else
 	struct list_head	s_files;
+#endif
 	/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	int			s_nr_dentry_unused;	/* # of dentry on lru */




* [patch 4/4] fs: brlock vfsmount_lock
  2010-06-04  6:43 [patch 0/4] Initial vfs scalability patches again Nick Piggin
                   ` (2 preceding siblings ...)
  2010-06-04  6:43 ` [patch 3/4] fs: scale files_lock Nick Piggin
@ 2010-06-04  6:43 ` Nick Piggin
  3 siblings, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2010-06-04  6:43 UTC
  To: Al Viro; +Cc: linux-kernel, linux-fsdevel, Frank Mayhar, John Stultz,
	Andi Kleen

[-- Attachment #1: fs-vfsmount_lock-scale-2.patch --]
[-- Type: text/plain, Size: 19348 bytes --]

Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read when
performing mount hash lookups.

The number of atomics should remain the same for fastpath rlock cases, though
the code will be slightly slower due to per-cpu access. Scalability will
probably not be much improved in common cases yet, due to other locks getting
in the way. However, independent path lookups over mountpoints should be one
case where scalability is improved.

The slowpath will be made significantly slower due to the use of brlock. On a
64 core, 64 socket, 32 node Altix system (so a decent amount of latency to
remote nodes), a simple umount microbenchmark (mount --bind mnt mnt2; umount
mnt2, looped 1000 times) took 6.8s before this patch and 7.1s afterwards,
about a 5% increase in elapsed time.
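
The conversion itself is mechanical; condensed from the diff below:

	/* read side: mount hash lookups and walking back up the tree */
	br_read_lock(vfsmount_lock);
	child_mnt = __lookup_mnt(path->mnt, path->dentry, 1);
	br_read_unlock(vfsmount_lock);

	/* write side: any change to the mount hash or the mount tree */
	br_write_lock(vfsmount_lock);
	umount_tree(mnt, 1, &umount_list);
	br_write_unlock(vfsmount_lock);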

Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Frank Mayhar <fmayhar@google.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/dcache.c    |   11 +--
 fs/internal.h  |    5 +
 fs/namei.c     |    7 +-
 fs/namespace.c |  174 +++++++++++++++++++++++++++++++++++----------------------
 fs/pnode.c     |   11 ++-
 5 files changed, 131 insertions(+), 77 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1926,7 +1926,7 @@ char *__d_path(const struct path *path,
 	char *end = buffer + buflen;
 	char *retval;
 
-	spin_lock(&vfsmount_lock);
+	br_read_lock(vfsmount_lock);
 	prepend(&end, &buflen, "\0", 1);
 	if (d_unlinked(dentry) &&
 		(prepend(&end, &buflen, " (deleted)", 10) != 0))
@@ -1962,7 +1962,7 @@ char *__d_path(const struct path *path,
 	}
 
 out:
-	spin_unlock(&vfsmount_lock);
+	br_read_unlock(vfsmount_lock);
 	return retval;
 
 global_root:
@@ -2193,11 +2193,12 @@ int path_is_under(struct path *path1, st
 	struct vfsmount *mnt = path1->mnt;
 	struct dentry *dentry = path1->dentry;
 	int res;
-	spin_lock(&vfsmount_lock);
+
+	br_read_lock(vfsmount_lock);
 	if (mnt != path2->mnt) {
 		for (;;) {
 			if (mnt->mnt_parent == mnt) {
-				spin_unlock(&vfsmount_lock);
+				br_read_unlock(vfsmount_lock);
 				return 0;
 			}
 			if (mnt->mnt_parent == path2->mnt)
@@ -2207,7 +2208,7 @@ int path_is_under(struct path *path1, st
 		dentry = mnt->mnt_mountpoint;
 	}
 	res = is_subdir(dentry, path2->dentry);
-	spin_unlock(&vfsmount_lock);
+	br_read_unlock(vfsmount_lock);
 	return res;
 }
 EXPORT_SYMBOL(path_is_under);
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -601,15 +601,16 @@ int follow_up(struct path *path)
 {
 	struct vfsmount *parent;
 	struct dentry *mountpoint;
-	spin_lock(&vfsmount_lock);
+
+	br_read_lock(vfsmount_lock);
 	parent = path->mnt->mnt_parent;
 	if (parent == path->mnt) {
-		spin_unlock(&vfsmount_lock);
+		br_read_unlock(vfsmount_lock);
 		return 0;
 	}
 	mntget(parent);
 	mountpoint = dget(path->mnt->mnt_mountpoint);
-	spin_unlock(&vfsmount_lock);
+	br_read_unlock(vfsmount_lock);
 	dput(path->dentry);
 	path->dentry = mountpoint;
 	mntput(path->mnt);
Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c
+++ linux-2.6/fs/namespace.c
@@ -11,6 +11,8 @@
 #include <linux/syscalls.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
 #include <linux/smp_lock.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
@@ -37,12 +39,10 @@
 #define HASH_SHIFT ilog2(PAGE_SIZE / sizeof(struct list_head))
 #define HASH_SIZE (1UL << HASH_SHIFT)
 
-/* spinlock for vfsmount related operations, inplace of dcache_lock */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(vfsmount_lock);
-
 static int event;
 static DEFINE_IDA(mnt_id_ida);
 static DEFINE_IDA(mnt_group_ida);
+static DEFINE_SPINLOCK(mnt_id_lock);
 static int mnt_id_start = 0;
 static int mnt_group_start = 1;
 
@@ -54,6 +54,16 @@ static struct rw_semaphore namespace_sem
 struct kobject *fs_kobj;
 EXPORT_SYMBOL_GPL(fs_kobj);
 
+/*
+ * vfsmount lock may be taken for read to prevent changes to the
+ * vfsmount hash, ie. during mountpoint lookups or walking back
+ * up the tree.
+ *
+ * It should be taken for write in all cases where the vfsmount
+ * tree or hash is modified or when a vfsmount structure is modified.
+ */
+DEFINE_BRLOCK(vfsmount_lock);
+
 static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry)
 {
 	unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
@@ -64,18 +74,21 @@ static inline unsigned long hash(struct
 
 #define MNT_WRITER_UNDERFLOW_LIMIT -(1<<16)
 
-/* allocation is serialized by namespace_sem */
+/*
+ * allocation is serialized by namespace_sem, but we need the spinlock to
+ * serialize with freeing.
+ */
 static int mnt_alloc_id(struct vfsmount *mnt)
 {
 	int res;
 
 retry:
 	ida_pre_get(&mnt_id_ida, GFP_KERNEL);
-	spin_lock(&vfsmount_lock);
+	spin_lock(&mnt_id_lock);
 	res = ida_get_new_above(&mnt_id_ida, mnt_id_start, &mnt->mnt_id);
 	if (!res)
 		mnt_id_start = mnt->mnt_id + 1;
-	spin_unlock(&vfsmount_lock);
+	spin_unlock(&mnt_id_lock);
 	if (res == -EAGAIN)
 		goto retry;
 
@@ -85,11 +98,11 @@ retry:
 static void mnt_free_id(struct vfsmount *mnt)
 {
 	int id = mnt->mnt_id;
-	spin_lock(&vfsmount_lock);
+	spin_lock(&mnt_id_lock);
 	ida_remove(&mnt_id_ida, id);
 	if (mnt_id_start > id)
 		mnt_id_start = id;
-	spin_unlock(&vfsmount_lock);
+	spin_unlock(&mnt_id_lock);
 }
 
 /*
@@ -344,7 +357,7 @@ static int mnt_make_readonly(struct vfsm
 {
 	int ret = 0;
 
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	mnt->mnt_flags |= MNT_WRITE_HOLD;
 	/*
 	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -378,15 +391,15 @@ static int mnt_make_readonly(struct vfsm
 	 */
 	smp_wmb();
 	mnt->mnt_flags &= ~MNT_WRITE_HOLD;
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	return ret;
 }
 
 static void __mnt_unmake_readonly(struct vfsmount *mnt)
 {
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	mnt->mnt_flags &= ~MNT_READONLY;
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 }
 
 void simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
@@ -410,6 +423,7 @@ void free_vfsmnt(struct vfsmount *mnt)
 /*
  * find the first or last mount at @dentry on vfsmount @mnt depending on
  * @dir. If @dir is set return the first mount else return the last mount.
+ * vfsmount_lock must be held for read or write.
  */
 struct vfsmount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry,
 			      int dir)
@@ -439,10 +453,11 @@ struct vfsmount *__lookup_mnt(struct vfs
 struct vfsmount *lookup_mnt(struct path *path)
 {
 	struct vfsmount *child_mnt;
-	spin_lock(&vfsmount_lock);
+
+	br_read_lock(vfsmount_lock);
 	if ((child_mnt = __lookup_mnt(path->mnt, path->dentry, 1)))
 		mntget(child_mnt);
-	spin_unlock(&vfsmount_lock);
+	br_read_unlock(vfsmount_lock);
 	return child_mnt;
 }
 
@@ -451,6 +466,9 @@ static inline int check_mnt(struct vfsmo
 	return mnt->mnt_ns == current->nsproxy->mnt_ns;
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 static void touch_mnt_namespace(struct mnt_namespace *ns)
 {
 	if (ns) {
@@ -459,6 +477,9 @@ static void touch_mnt_namespace(struct m
 	}
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 static void __touch_mnt_namespace(struct mnt_namespace *ns)
 {
 	if (ns && ns->event != event) {
@@ -467,6 +488,9 @@ static void __touch_mnt_namespace(struct
 	}
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 static void detach_mnt(struct vfsmount *mnt, struct path *old_path)
 {
 	old_path->dentry = mnt->mnt_mountpoint;
@@ -478,6 +502,9 @@ static void detach_mnt(struct vfsmount *
 	old_path->dentry->d_mounted--;
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 void mnt_set_mountpoint(struct vfsmount *mnt, struct dentry *dentry,
 			struct vfsmount *child_mnt)
 {
@@ -486,6 +513,9 @@ void mnt_set_mountpoint(struct vfsmount
 	dentry->d_mounted++;
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 static void attach_mnt(struct vfsmount *mnt, struct path *path)
 {
 	mnt_set_mountpoint(path->mnt, path->dentry, mnt);
@@ -495,7 +525,7 @@ static void attach_mnt(struct vfsmount *
 }
 
 /*
- * the caller must hold vfsmount_lock
+ * vfsmount lock must be held for write
  */
 static void commit_tree(struct vfsmount *mnt)
 {
@@ -618,15 +648,15 @@ static inline void __mntput(struct vfsmo
 void mntput_no_expire(struct vfsmount *mnt)
 {
 repeat:
-	if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
+	if (atomic_dec_and_br_write_lock(&mnt->mnt_count, vfsmount_lock)) {
 		if (likely(!mnt->mnt_pinned)) {
-			spin_unlock(&vfsmount_lock);
+			br_write_unlock(vfsmount_lock);
 			__mntput(mnt);
 			return;
 		}
 		atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
 		mnt->mnt_pinned = 0;
-		spin_unlock(&vfsmount_lock);
+		br_write_unlock(vfsmount_lock);
 		acct_auto_close_mnt(mnt);
 		goto repeat;
 	}
@@ -636,21 +666,21 @@ EXPORT_SYMBOL(mntput_no_expire);
 
 void mnt_pin(struct vfsmount *mnt)
 {
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	mnt->mnt_pinned++;
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 }
 
 EXPORT_SYMBOL(mnt_pin);
 
 void mnt_unpin(struct vfsmount *mnt)
 {
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	if (mnt->mnt_pinned) {
 		atomic_inc(&mnt->mnt_count);
 		mnt->mnt_pinned--;
 	}
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 }
 
 EXPORT_SYMBOL(mnt_unpin);
@@ -741,12 +771,12 @@ int mnt_had_events(struct proc_mounts *p
 	struct mnt_namespace *ns = p->ns;
 	int res = 0;
 
-	spin_lock(&vfsmount_lock);
+	br_read_lock(vfsmount_lock);
 	if (p->event != ns->event) {
 		p->event = ns->event;
 		res = 1;
 	}
-	spin_unlock(&vfsmount_lock);
+	br_read_unlock(vfsmount_lock);
 
 	return res;
 }
@@ -948,12 +978,12 @@ int may_umount_tree(struct vfsmount *mnt
 	int minimum_refs = 0;
 	struct vfsmount *p;
 
-	spin_lock(&vfsmount_lock);
+	br_read_lock(vfsmount_lock);
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
 		actual_refs += atomic_read(&p->mnt_count);
 		minimum_refs += 2;
 	}
-	spin_unlock(&vfsmount_lock);
+	br_read_unlock(vfsmount_lock);
 
 	if (actual_refs > minimum_refs)
 		return 0;
@@ -980,10 +1010,10 @@ int may_umount(struct vfsmount *mnt)
 {
 	int ret = 1;
 	down_read(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	br_read_lock(vfsmount_lock);
 	if (propagate_mount_busy(mnt, 2))
 		ret = 0;
-	spin_unlock(&vfsmount_lock);
+	br_read_unlock(vfsmount_lock);
 	up_read(&namespace_sem);
 	return ret;
 }
@@ -999,13 +1029,14 @@ void release_mounts(struct list_head *he
 		if (mnt->mnt_parent != mnt) {
 			struct dentry *dentry;
 			struct vfsmount *m;
-			spin_lock(&vfsmount_lock);
+
+			br_write_lock(vfsmount_lock);
 			dentry = mnt->mnt_mountpoint;
 			m = mnt->mnt_parent;
 			mnt->mnt_mountpoint = mnt->mnt_root;
 			mnt->mnt_parent = mnt;
 			m->mnt_ghosts--;
-			spin_unlock(&vfsmount_lock);
+			br_write_unlock(vfsmount_lock);
 			dput(dentry);
 			mntput(m);
 		}
@@ -1013,6 +1044,10 @@ void release_mounts(struct list_head *he
 	}
 }
 
+/*
+ * vfsmount lock must be held for write
+ * namespace_sem must be held for write
+ */
 void umount_tree(struct vfsmount *mnt, int propagate, struct list_head *kill)
 {
 	struct vfsmount *p;
@@ -1103,7 +1138,7 @@ static int do_umount(struct vfsmount *mn
 	}
 
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	event++;
 
 	if (!(flags & MNT_DETACH))
@@ -1115,7 +1150,7 @@ static int do_umount(struct vfsmount *mn
 			umount_tree(mnt, 1, &umount_list);
 		retval = 0;
 	}
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
 	return retval;
@@ -1227,19 +1262,19 @@ struct vfsmount *copy_tree(struct vfsmou
 			q = clone_mnt(p, p->mnt_root, flag);
 			if (!q)
 				goto Enomem;
-			spin_lock(&vfsmount_lock);
+			br_write_lock(vfsmount_lock);
 			list_add_tail(&q->mnt_list, &res->mnt_list);
 			attach_mnt(q, &path);
-			spin_unlock(&vfsmount_lock);
+			br_write_unlock(vfsmount_lock);
 		}
 	}
 	return res;
 Enomem:
 	if (res) {
 		LIST_HEAD(umount_list);
-		spin_lock(&vfsmount_lock);
+		br_write_lock(vfsmount_lock);
 		umount_tree(res, 0, &umount_list);
-		spin_unlock(&vfsmount_lock);
+		br_write_unlock(vfsmount_lock);
 		release_mounts(&umount_list);
 	}
 	return NULL;
@@ -1258,9 +1293,9 @@ void drop_collected_mounts(struct vfsmou
 {
 	LIST_HEAD(umount_list);
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	umount_tree(mnt, 0, &umount_list);
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
 }
@@ -1388,7 +1423,7 @@ static int attach_recursive_mnt(struct v
 	if (err)
 		goto out_cleanup_ids;
 
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 
 	if (IS_MNT_SHARED(dest_mnt)) {
 		for (p = source_mnt; p; p = next_mnt(p, source_mnt))
@@ -1407,7 +1442,8 @@ static int attach_recursive_mnt(struct v
 		list_del_init(&child->mnt_hash);
 		commit_tree(child);
 	}
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
+
 	return 0;
 
  out_cleanup_ids:
@@ -1462,10 +1498,10 @@ static int do_change_type(struct path *p
 			goto out_unlock;
 	}
 
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL))
 		change_mnt_propagation(m, type);
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 
  out_unlock:
 	up_write(&namespace_sem);
@@ -1509,9 +1545,10 @@ static int do_loopback(struct path *path
 	err = graft_tree(mnt, path);
 	if (err) {
 		LIST_HEAD(umount_list);
-		spin_lock(&vfsmount_lock);
+
+		br_write_lock(vfsmount_lock);
 		umount_tree(mnt, 0, &umount_list);
-		spin_unlock(&vfsmount_lock);
+		br_write_unlock(vfsmount_lock);
 		release_mounts(&umount_list);
 	}
 
@@ -1564,16 +1601,16 @@ static int do_remount(struct path *path,
 	else
 		err = do_remount_sb(sb, flags, data, 0);
 	if (!err) {
-		spin_lock(&vfsmount_lock);
+		br_write_lock(vfsmount_lock);
 		mnt_flags |= path->mnt->mnt_flags & MNT_PROPAGATION_MASK;
 		path->mnt->mnt_flags = mnt_flags;
-		spin_unlock(&vfsmount_lock);
+		br_write_unlock(vfsmount_lock);
 	}
 	up_write(&sb->s_umount);
 	if (!err) {
-		spin_lock(&vfsmount_lock);
+		br_write_lock(vfsmount_lock);
 		touch_mnt_namespace(path->mnt->mnt_ns);
-		spin_unlock(&vfsmount_lock);
+		br_write_unlock(vfsmount_lock);
 	}
 	return err;
 }
@@ -1750,7 +1787,7 @@ void mark_mounts_for_expiry(struct list_
 		return;
 
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 
 	/* extract from the expiration list every vfsmount that matches the
 	 * following criteria:
@@ -1769,7 +1806,7 @@ void mark_mounts_for_expiry(struct list_
 		touch_mnt_namespace(mnt->mnt_ns);
 		umount_tree(mnt, 1, &umounts);
 	}
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	up_write(&namespace_sem);
 
 	release_mounts(&umounts);
@@ -1826,6 +1863,8 @@ resume:
 /*
  * process a list of expirable mountpoints with the intent of discarding any
  * submounts of a specific parent mountpoint
+ *
+ * vfsmount_lock must be held for write
  */
 static void shrink_submounts(struct vfsmount *mnt, struct list_head *umounts)
 {
@@ -2044,9 +2083,9 @@ static struct mnt_namespace *dup_mnt_ns(
 		kfree(new_ns);
 		return ERR_PTR(-ENOMEM);
 	}
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	list_add_tail(&new_ns->list, &new_ns->root->mnt_list);
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 
 	/*
 	 * Second pass: switch the tsk->fs->* elements and mark new vfsmounts
@@ -2243,7 +2282,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
 		goto out2; /* not attached */
 	/* make sure we can reach put_old from new_root */
 	tmp = old.mnt;
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	if (tmp != new.mnt) {
 		for (;;) {
 			if (tmp->mnt_parent == tmp)
@@ -2263,7 +2302,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
 	/* mount new_root on / */
 	attach_mnt(new.mnt, &root_parent);
 	touch_mnt_namespace(current->nsproxy->mnt_ns);
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	chroot_fs_refs(&root, &new);
 	error = 0;
 	path_put(&root_parent);
@@ -2278,7 +2317,7 @@ out1:
 out0:
 	return error;
 out3:
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	goto out2;
 }
 
@@ -2325,6 +2364,8 @@ void __init mnt_init(void)
 	for (u = 0; u < HASH_SIZE; u++)
 		INIT_LIST_HEAD(&mount_hashtable[u]);
 
+	br_lock_init(vfsmount_lock);
+
 	err = sysfs_init();
 	if (err)
 		printk(KERN_WARNING "%s: sysfs_init error: %d\n",
@@ -2343,9 +2384,9 @@ void put_mnt_ns(struct mnt_namespace *ns
 	if (!atomic_dec_and_test(&ns->count))
 		return;
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	umount_tree(ns->root, 0, &umount_list);
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
 	kfree(ns);
Index: linux-2.6/fs/pnode.c
===================================================================
--- linux-2.6.orig/fs/pnode.c
+++ linux-2.6/fs/pnode.c
@@ -126,6 +126,9 @@ static int do_make_slave(struct vfsmount
 	return 0;
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 void change_mnt_propagation(struct vfsmount *mnt, int type)
 {
 	if (type == MS_SHARED) {
@@ -270,12 +273,12 @@ int propagate_mnt(struct vfsmount *dest_
 		prev_src_mnt  = child;
 	}
 out:
-	spin_lock(&vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	while (!list_empty(&tmp_list)) {
 		child = list_first_entry(&tmp_list, struct vfsmount, mnt_hash);
 		umount_tree(child, 0, &umount_list);
 	}
-	spin_unlock(&vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	release_mounts(&umount_list);
 	return ret;
 }
@@ -296,6 +299,8 @@ static inline int do_refcount_check(stru
  * other mounts its parent propagates to.
  * Check if any of these mounts that **do not have submounts**
  * have more references than 'refcnt'. If so return busy.
+ *
+ * vfsmount lock must be held for read or write
  */
 int propagate_mount_busy(struct vfsmount *mnt, int refcnt)
 {
@@ -353,6 +358,8 @@ static void __propagate_umount(struct vf
  * collect all mounts that receive propagation from the mount in @list,
  * and return these additional mounts in the same list.
  * @list: the list of mounts to be unmounted.
+ *
+ * vfsmount lock must be held for write
  */
 int propagate_umount(struct list_head *list)
 {
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h
+++ linux-2.6/fs/internal.h
@@ -9,6 +9,8 @@
  * 2 of the License, or (at your option) any later version.
  */
 
+#include <linux/lglock.h>
+
 struct super_block;
 struct linux_binprm;
 struct path;
@@ -70,7 +72,8 @@ extern struct vfsmount *copy_tree(struct
 
 extern void __init mnt_init(void);
 
-extern spinlock_t vfsmount_lock;
+DECLARE_BRLOCK(vfsmount_lock);
+
 
 /*
  * fs_struct.c


* Re: [patch 2/4] lglock: introduce special lglock and brlock spin locks
  2010-06-04  6:43 ` [patch 2/4] lglock: introduce special lglock and brlock spin locks Nick Piggin
@ 2010-06-04  7:56   ` Eric Dumazet
  2010-06-04 14:13     ` Nick Piggin
  2010-06-04 15:03   ` Paul E. McKenney
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2010-06-04  7:56 UTC
  To: Nick Piggin
  Cc: Al Viro, linux-kernel, linux-fsdevel, Paul E. McKenney,
	Frank Mayhar, John Stultz, Andi Kleen

On Friday 04 June 2010 at 16:43 +1000, Nick Piggin wrote:
> plain text document attachment (kernel-introduce-brlock.patch)
> This patch introduces "local-global" locks (lglocks). These can be used to:
> 
> - Provide fast exclusive access to per-CPU data, with exclusive access to
>   another CPU's data allowed but possibly subject to contention, and to provide
>   very slow exclusive access to all per-CPU data.
> - Or to provide very fast and scalable read serialisation, and to provide
>   very slow exclusive serialisation of data (not necessarily per-CPU data).
> 
> Brlocks are also implemented as a short-hand notation for the latter use
> case.
> 
> Thanks to Paul for local/global naming convention.
> 
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: Al Viro <viro@ZenIV.linux.org.uk>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Frank Mayhar <fmayhar@google.com>,
> Cc: John Stultz <johnstul@us.ibm.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
>  include/linux/lglock.h |  165 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 165 insertions(+)
> 

IMHO some changes in Documentation/ would be needed

> 								\
> + void name##_global_lock(void) {					\
> +	int i;								\
> +	preempt_disable();						\
> +	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
> +	for_each_online_cpu(i) {					\

for_each_possible_cpu()

> +		arch_spinlock_t *lock;					\
> +		lock = &per_cpu(name##_lock, i);			\
> +		arch_spin_lock(lock);					\
> +	}								\
> + }									\
> + EXPORT_SYMBOL(name##_global_lock);					\
> +									\
> + void name##_global_unlock(void) {					\
> +	int i;								\
> +	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
> +	for_each_online_cpu(i) {					\

for_each_possible_cpu()

> +		arch_spinlock_t *lock;					\
> +		lock = &per_cpu(name##_lock, i);			\
> +		arch_spin_unlock(lock);					\
> +	}								\
> +	preempt_enable();						\
> + }									\
> + EXPORT_SYMBOL(name##_global_unlock);					\
> +									\


* Re: [patch 1/4] fs: cleanup files_lock
  2010-06-04  6:43 ` [patch 1/4] fs: cleanup files_lock Nick Piggin
@ 2010-06-04  8:38   ` Christoph Hellwig
  2010-06-04 14:20     ` Nick Piggin
  2010-06-04 18:39   ` [PATCH, RFC] tty: stop abusing file->f_u.fu_list Christoph Hellwig
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2010-06-04  8:38 UTC
  To: Nick Piggin
  Cc: Al Viro, linux-kernel, linux-fsdevel, Frank Mayhar, John Stultz,
	Andi Kleen, Alan Cox, Eric W. Biederman, Greg Kroah-Hartman

On Fri, Jun 04, 2010 at 04:43:08PM +1000, Nick Piggin wrote:
> Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
> manipulate the per-sb files list; unexport the files_lock spinlock.

I'm still not entirely happy with this.  You keep making the tty a
special case by removing it from the per-sb files list while nothing
else in the system is removed from it.

Things would be much better if you could untangle the tty code from its
abuse of file->f_u.fu_list entirely.  And from a naive look at the tty
code that actually seems pretty easy.  file->private_data for ttys
currently points directly to the tty struct.  If you add a tty_private
there which points back to the file and the tty, and which contains a
list_head, the open-file tracking in the tty code can be completely
divorced from the per-sb file tracking.  After that we can decide what
to do with the per-sb file tracking, where my favourite still is to get
rid of it entirely.
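
Roughly like this (a hypothetical sketch of the idea; the names are
invented):

struct tty_private {
	struct tty_struct	*tty;	/* what private_data points to today */
	struct file		*file;	/* back-pointer to the open file */
	struct list_head	list;	/* on tty->tty_files, under a tty lock */
};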



* Re: [patch 2/4] lglock: introduce special lglock and brlock spin locks
  2010-06-04  7:56   ` Eric Dumazet
@ 2010-06-04 14:13     ` Nick Piggin
  2010-06-04 14:24       ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2010-06-04 14:13 UTC
  To: Eric Dumazet
  Cc: Al Viro, linux-kernel, linux-fsdevel, Paul E. McKenney,
	Frank Mayhar, John Stultz, Andi Kleen

On Fri, Jun 04, 2010 at 09:56:03AM +0200, Eric Dumazet wrote:
> On Friday 04 June 2010 at 16:43 +1000, Nick Piggin wrote:
> > plain text document attachment (kernel-introduce-brlock.patch)
> > This patch introduces "local-global" locks (lglocks). These can be used to:
> > 
> > - Provide fast exclusive access to per-CPU data, with exclusive access to
> >   another CPU's data allowed but possibly subject to contention, and to provide
> >   very slow exclusive access to all per-CPU data.
> > - Or to provide very fast and scalable read serialisation, and to provide
> >   very slow exclusive serialisation of data (not necessarily per-CPU data).
> > 
> > Brlocks are also implemented as a short-hand notation for the latter use
> > case.
> > 
> > Thanks to Paul for local/global naming convention.
> > 
> > Cc: linux-kernel@vger.kernel.org
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: Al Viro <viro@ZenIV.linux.org.uk>
> > Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > Cc: Frank Mayhar <fmayhar@google.com>,
> > Cc: John Stultz <johnstul@us.ibm.com>
> > Cc: Andi Kleen <ak@linux.intel.com>
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > ---
> >  include/linux/lglock.h |  165 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 165 insertions(+)
> > 
> 
> IMHO some changes in Documentation/ would be needed

I wonder where, and what?

 
> > + void name##_global_lock(void) {					\
> > +	int i;								\
> > +	preempt_disable();						\
> > +	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
> > +	for_each_online_cpu(i) {					\
> 
> for_each_possible_cpu()

Oh, good spotting. brlock does not need this, but lglock does if it
protects offline CPU data too. It may be better to migrate the file
handles off a CPU's list in the event of hotplug.
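
To illustrate the hazard (a hypothetical interleaving, assuming entries
can remain on an offlined CPU's list):

/*
 *   CPU 3                              CPU 0
 *   file_sb_list_add(f, sb)
 *     f->f_sb_list_cpu = 3
 *   <CPU 3 goes offline>
 *                                      files_lglock_global_lock()
 *                                        for_each_online_cpu() skips 3,
 *                                        so CPU 3's list is walked with
 *                                        its spinlock not held
 *   file_sb_list_del(f)
 *     takes per_cpu(files_lglock_lock, 3)   <- not excluded by CPU 0
 */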


* Re: [patch 1/4] fs: cleanup files_lock
  2010-06-04  8:38   ` Christoph Hellwig
@ 2010-06-04 14:20     ` Nick Piggin
  2010-06-04 14:39       ` Andi Kleen
  2010-06-04 15:10       ` Christoph Hellwig
  0 siblings, 2 replies; 18+ messages in thread
From: Nick Piggin @ 2010-06-04 14:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Al Viro, linux-kernel, linux-fsdevel, Frank Mayhar, John Stultz,
	Andi Kleen, Alan Cox, Eric W. Biederman, Greg Kroah-Hartman

On Fri, Jun 04, 2010 at 04:38:18AM -0400, Christoph Hellwig wrote:
> On Fri, Jun 04, 2010 at 04:43:08PM +1000, Nick Piggin wrote:
> > Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
> > manipulate the per-sb files list; unexport the files_lock spinlock.
> 
> I'm still not entirely happy with this.  You keep making the tty a
> special case by removing it from the per-sb files list while
> nothing else in the system is removed from it.
> 
> Things would be much better if you could untangle the tty code from
> its abuse of file->f_u.fu_list entirely.  And from a naive look at the
> tty code that actually seems pretty easy.  file->private_data for ttys
> currently points directly to the tty struct.  If you add a tty_private
> there which points back to the file and the tty, and contains a
> list_head, the open-file tracking in the tty code can be completely
> divorced from the per-sb file tracking.

Well, it is already a special case; I just switched it to using a
different lock for its private list. I wanted to keep the surgery to
a minimum.


>  After that we can decide what to do
> with the per-sb file tracking, where my favourite still is to get
> rid of it entirely.

Again, this would be nice, but I didn't see an easy way to do it.
Even if refcounting obsoleted may_remount_ro, we still have
mark_files_ro. It's no more complex to rip this all out after my
patch. I don't see the problem in doing this patch. It has good
numbers.
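
(For context, mark_files_ro is roughly the following; a simplified sketch
of the 2.6.34-era fs/file_table.c code. The real function also drops the
mount's write count and retries the scan:)

	void mark_files_ro(struct super_block *sb)
	{
		struct file *f;

		file_list_lock();
		list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
			if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
				continue;
			if (!file_count(f))
				continue;
			if (!(f->f_mode & FMODE_WRITE))
				continue;
			/* stop further writes through this struct file */
			f->f_mode &= ~FMODE_WRITE;
		}
		file_list_unlock();
	}

This walk is the remaining consumer of the per-sb list that makes ripping
it out non-trivial.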

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 2/4] lglock: introduce special lglock and brlock spin locks
  2010-06-04 14:13     ` Nick Piggin
@ 2010-06-04 14:24       ` Eric Dumazet
  0 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2010-06-04 14:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Al Viro, linux-kernel, linux-fsdevel, Paul E. McKenney,
	Frank Mayhar, John Stultz, Andi Kleen

On Saturday 05 June 2010 at 00:13 +1000, Nick Piggin wrote:
> On Fri, Jun 04, 2010 at 09:56:03AM +0200, Eric Dumazet wrote:

> > IMHO some changes in Documentation/ would be needed
> 
> I wonder where, and what?
> 
>  

Documentation/memory-barriers.txt (around line 1111)

Documentation/spinlocks.txt  (change its name ?)

Section 1 : spinlocks

Section 2 : rwlocks

Section 3 : lglock/brlock  ?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/4] fs: cleanup files_lock
  2010-06-04 14:20     ` Nick Piggin
@ 2010-06-04 14:39       ` Andi Kleen
  2010-06-04 15:10       ` Christoph Hellwig
  1 sibling, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2010-06-04 14:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, Al Viro, linux-kernel, linux-fsdevel,
	Frank Mayhar, John Stultz, Andi Kleen, Alan Cox,
	Eric W. Biederman, Greg Kroah-Hartman

Nick Piggin <npiggin@suse.de> writes:
>
> Again, this would be nice, but I didn't see an easy way to do it.
> Even if refcounting obsoleted may_remount_ro, we still have
> mark_files_ro. It's no more complex to rip this all out after my
> patch. I don't see the problem in doing this patch. It has good
> numbers.

Yes, agreed. The global lock is currently really painful on workloads
that do a lot of opens, and anything to make this better would be good
ASAP.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 2/4] lglock: introduce special lglock and brlock spin locks
  2010-06-04  6:43 ` [patch 2/4] lglock: introduce special lglock and brlock spin locks Nick Piggin
  2010-06-04  7:56   ` Eric Dumazet
@ 2010-06-04 15:03   ` Paul E. McKenney
  2010-06-04 15:12     ` Nick Piggin
  1 sibling, 1 reply; 18+ messages in thread
From: Paul E. McKenney @ 2010-06-04 15:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Al Viro, linux-kernel, linux-fsdevel, Frank Mayhar, John Stultz,
	Andi Kleen

On Fri, Jun 04, 2010 at 04:43:09PM +1000, Nick Piggin wrote:
> This patch introduces "local-global" locks (lglocks). These can be used to:
> 
> - Provide fast exclusive access to per-CPU data, with exclusive access to
>   another CPU's data allowed but possibly subject to contention, and to provide
>   very slow exclusive access to all per-CPU data.
> - Or to provide very fast and scalable read serialisation, and to provide
>   very slow exclusive serialisation of data (not necessarily per-CPU data).
> 
> Brlocks are also implemented as a short-hand notation for the latter use
> case.
> 
> Thanks to Paul for local/global naming convention.

;-)

One set of questions about how this relates to real-time below.

(And I agree with Eric's point about for_each_possible_cpu(), FWIW.)

> Cc: linux-kernel@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: Al Viro <viro@ZenIV.linux.org.uk>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Frank Mayhar <fmayhar@google.com>,
> Cc: John Stultz <johnstul@us.ibm.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
>  include/linux/lglock.h |  165 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 165 insertions(+)
> 
> Index: linux-2.6/include/linux/lglock.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6/include/linux/lglock.h
> @@ -0,0 +1,165 @@
> +/*
> + * Specialised local-global spinlock. Can only be declared as global variables
> + * to avoid overhead and keep things simple (and we don't want to start using
> + * these inside dynamically allocated structures).
> + *
> + * "local/global locks" (lglocks) can be used to:
> + *
> + * - Provide fast exclusive access to per-CPU data, with exclusive access to
> + *   another CPU's data allowed but possibly subject to contention, and to
> + *   provide very slow exclusive access to all per-CPU data.
> + * - Or to provide very fast and scalable read serialisation, and to provide
> + *   very slow exclusive serialisation of data (not necessarily per-CPU data).
> + *
> + * Brlocks are also implemented as a short-hand notation for the latter use
> + * case.
> + *
> + * Copyright 2009, 2010, Nick Piggin, Novell Inc.
> + */
> +#ifndef __LINUX_LGLOCK_H
> +#define __LINUX_LGLOCK_H
> +
> +#include <linux/spinlock.h>
> +#include <linux/lockdep.h>
> +#include <linux/percpu.h>
> +#include <asm/atomic.h>
> +
> +/* can make br locks by using local lock for read side, global lock for write */
> +#define br_lock_init(name)	name##_lock_init()
> +#define br_read_lock(name)	name##_local_lock()
> +#define br_read_unlock(name)	name##_local_unlock()
> +#define br_write_lock(name)	name##_global_lock()
> +#define br_write_unlock(name)	name##_global_unlock()
> +#define atomic_dec_and_br_write_lock(atomic, name)	name##_atomic_dec_and_global_lock(atomic)
> +
> +#define DECLARE_BRLOCK(name)	DECLARE_LGLOCK(name)
> +#define DEFINE_BRLOCK(name)	DEFINE_LGLOCK(name)
> +
> +
> +#define lg_lock_init(name)	name##_lock_init()
> +#define lg_local_lock(name)	name##_local_lock()
> +#define lg_local_unlock(name)	name##_local_unlock()
> +#define lg_local_lock_cpu(name, cpu)	name##_local_lock_cpu(cpu)
> +#define lg_local_unlock_cpu(name, cpu)	name##_local_unlock_cpu(cpu)
> +#define lg_global_lock(name)	name##_global_lock()
> +#define lg_global_unlock(name)	name##_global_unlock()
> +#define atomic_dec_and_lg_global_lock(atomic, name)	name##_atomic_dec_and_global_lock(atomic)
> +
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +#define LOCKDEP_INIT_MAP lockdep_init_map
> +
> +#define DEFINE_LGLOCK_LOCKDEP(name)					\
> + struct lock_class_key name##_lock_key;					\
> + struct lockdep_map name##_lock_dep_map;				\
> + EXPORT_SYMBOL(name##_lock_dep_map)
> +
> +#else
> +#define LOCKDEP_INIT_MAP(a, b, c, d)
> +
> +#define DEFINE_LGLOCK_LOCKDEP(name)
> +#endif
> +
> +
> +#define DECLARE_LGLOCK(name)						\
> + extern void name##_lock_init(void);					\
> + extern void name##_local_lock(void);					\
> + extern void name##_local_unlock(void);					\
> + extern void name##_local_lock_cpu(int cpu);				\
> + extern void name##_local_unlock_cpu(int cpu);				\
> + extern void name##_global_lock(void);					\
> + extern void name##_global_unlock(void);				\
> + extern int name##_atomic_dec_and_global_lock(atomic_t *a);		\
> +
> +#define DEFINE_LGLOCK(name)						\
> +									\
> + DEFINE_PER_CPU(arch_spinlock_t, name##_lock);				\
> + DEFINE_LGLOCK_LOCKDEP(name);						\
> +									\
> + void name##_lock_init(void) {						\
> +	int i;								\
> +	LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \
> +	for_each_possible_cpu(i) {					\
> +		arch_spinlock_t *lock;					\
> +		lock = &per_cpu(name##_lock, i);			\
> +		*lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;	\
> +	}								\
> + }									\
> + EXPORT_SYMBOL(name##_lock_init);					\
> +									\
> + void name##_local_lock(void) {						\
> +	arch_spinlock_t *lock;						\
> +	preempt_disable();						\

In a -rt kernel, I believe we would not want the above preempt_disable().
Of course, in this case the arch_spin_lock() would need to become
spin_lock() or some such.

The main point of this approach is to avoid cross-CPU holding of these
locks, correct?  And then the point of arch_spin_lock() is to avoid the
redundant preempt_disable(), right?

							Thanx, Paul

> +	rwlock_acquire_read(&name##_lock_dep_map, 0, 0, _THIS_IP_);	\
> +	lock = &__get_cpu_var(name##_lock);				\
> +	arch_spin_lock(lock);						\
> + }									\
> + EXPORT_SYMBOL(name##_local_lock);					\
> +									\
> + void name##_local_unlock(void) {					\
> +	arch_spinlock_t *lock;						\
> +	rwlock_release(&name##_lock_dep_map, 1, _THIS_IP_);		\
> +	lock = &__get_cpu_var(name##_lock);				\
> +	arch_spin_unlock(lock);						\
> +	preempt_enable();						\
> + }									\
> + EXPORT_SYMBOL(name##_local_unlock);					\
> +									\
> + void name##_local_lock_cpu(int cpu) {			\
> +	arch_spinlock_t *lock;						\
> +	preempt_disable();						\
> +	rwlock_acquire_read(&name##_lock_dep_map, 0, 0, _THIS_IP_);	\
> +	lock = &per_cpu(name##_lock, cpu);				\
> +	arch_spin_lock(lock);						\
> + }									\
> + EXPORT_SYMBOL(name##_local_lock_cpu);					\
> +									\
> + void name##_local_unlock_cpu(int cpu) {			\
> +	arch_spinlock_t *lock;						\
> +	rwlock_release(&name##_lock_dep_map, 1, _THIS_IP_);		\
> +	lock = &per_cpu(name##_lock, cpu);				\
> +	arch_spin_unlock(lock);						\
> +	preempt_enable();						\
> + }									\
> + EXPORT_SYMBOL(name##_local_unlock_cpu);				\
> +									\
> + void name##_global_lock(void) {					\
> +	int i;								\
> +	preempt_disable();						\
> +	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
> +	for_each_online_cpu(i) {					\
> +		arch_spinlock_t *lock;					\
> +		lock = &per_cpu(name##_lock, i);			\
> +		arch_spin_lock(lock);					\
> +	}								\
> + }									\
> + EXPORT_SYMBOL(name##_global_lock);					\
> +									\
> + void name##_global_unlock(void) {					\
> +	int i;								\
> +	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
> +	for_each_online_cpu(i) {					\
> +		arch_spinlock_t *lock;					\
> +		lock = &per_cpu(name##_lock, i);			\
> +		arch_spin_unlock(lock);					\
> +	}								\
> +	preempt_enable();						\
> + }									\
> + EXPORT_SYMBOL(name##_global_unlock);					\
> +									\
> + static int name##_atomic_dec_and_global_lock__failed(atomic_t *a) {	\
> +	name##_global_lock();						\
> +	if (!atomic_dec_and_test(a)) {					\
> +		name##_global_unlock();					\
> +		return 0;						\
> +	}								\
> +	return 1;							\
> + }									\
> + 									\
> + int name##_atomic_dec_and_global_lock(atomic_t *a) {			\
> +	if (likely(atomic_add_unless(a, -1, 1)))			\
> +		return 0;						\
> +	return name##_atomic_dec_and_global_lock__failed(a);		\
> + }									\
> + EXPORT_SYMBOL(name##_atomic_dec_and_global_lock);
> +
> +#endif
> 
> 
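
(For orientation, a usage sketch of the brlock side of the API defined
above; vfsmount_lock is the name patch 4/4 gives its brlock, the
DECLARE_BRLOCK would normally live in a header, and the functions around
the calls are illustrative:)

	DEFINE_BRLOCK(vfsmount_lock);

	static void __init example_init(void)
	{
		br_lock_init(vfsmount_lock);	/* once, at boot */
	}

	static void example_reader(void)
	{
		/* read side: only this CPU's spinlock is taken, so readers scale */
		br_read_lock(vfsmount_lock);
		/* ... walk the mount structures; no writer can be in flight ... */
		br_read_unlock(vfsmount_lock);
	}

	static void example_writer(void)
	{
		/* write side: takes every CPU's spinlock; slow but rare */
		br_write_lock(vfsmount_lock);
		/* ... mutate the mount structures ... */
		br_write_unlock(vfsmount_lock);
	}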

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/4] fs: cleanup files_lock
  2010-06-04 14:20     ` Nick Piggin
  2010-06-04 14:39       ` Andi Kleen
@ 2010-06-04 15:10       ` Christoph Hellwig
  1 sibling, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2010-06-04 15:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, Al Viro, linux-kernel, linux-fsdevel,
	Frank Mayhar, John Stultz, Andi Kleen, Alan Cox,
	Eric W. Biederman, Greg Kroah-Hartman

On Sat, Jun 05, 2010 at 12:20:52AM +1000, Nick Piggin wrote:
> Well it is already a special case, I just switched it to using a
> different lock for its private list. I wanted to keep surgery to
> a minimum.

You make it even more special.  Really, the right thing here is to
fix that hack for real.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 2/4] lglock: introduce special lglock and brlock spin locks
  2010-06-04 15:03   ` Paul E. McKenney
@ 2010-06-04 15:12     ` Nick Piggin
  0 siblings, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2010-06-04 15:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Al Viro, linux-kernel, linux-fsdevel, Frank Mayhar, John Stultz,
	Andi Kleen

On Fri, Jun 04, 2010 at 08:03:27AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 04, 2010 at 04:43:09PM +1000, Nick Piggin wrote:
> > This patch introduces "local-global" locks (lglocks). These can be used to:
> > 
> > - Provide fast exclusive access to per-CPU data, with exclusive access to
> >   another CPU's data allowed but possibly subject to contention, and to provide
> >   very slow exclusive access to all per-CPU data.
> > - Or to provide very fast and scalable read serialisation, and to provide
> >   very slow exclusive serialisation of data (not necessarily per-CPU data).
> > 
> > Brlocks are also implemented as a short-hand notation for the latter use
> > case.
> > 
> > Thanks to Paul for local/global naming convention.
> 
> ;-)
> 
> One set of questions about how this relates to real-time below.
> 
> (And I agree with Eric's point about for_each_possible_cpu(), FWIW.)
 
...

> > + void name##_lock_init(void) {						\
> > +	int i;								\
> > +	LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \
> > +	for_each_possible_cpu(i) {					\
> > +		arch_spinlock_t *lock;					\
> > +		lock = &per_cpu(name##_lock, i);			\
> > +		*lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;	\
> > +	}								\
> > + }									\
> > + EXPORT_SYMBOL(name##_lock_init);					\
> > +									\
> > + void name##_local_lock(void) {						\
> > +	arch_spinlock_t *lock;						\
> > +	preempt_disable();						\
> 
> In a -rt kernel, I believe we would not want the above preempt_disable().
> Of course, in this case the arch_spin_lock() would need to become
> spin_lock() or some such.
> 
> The main point of this approach is to avoid cross-CPU holding of these
> locks, correct?  And then the point of arch_spin_lock() is to avoid the
> redundant preempt_disable(), right?

Yes. Preempt count and possibly lockdep will have issues with taking
so many nested locks in the write path.

The brlock version of this does avoid holding cross-CPU locks in the
fastpath. The lglock version used by files_list locking in the next
patch does need to sometimes take a cross-CPU lock.
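
(A sketch of the cross-CPU case being described, modelled on the per-CPU
file lists in patch 3/4; it assumes s_files has become a per-CPU list head
and that f_sb_list_cpu records the owning CPU, as that patch does:)

	DEFINE_LGLOCK(files_lglock);

	static void file_sb_list_add(struct file *file, struct super_block *sb)
	{
		lg_local_lock(files_lglock);	/* this CPU's lock only */
		file->f_sb_list_cpu = smp_processor_id();
		list_add(&file->f_u.fu_list,
			 per_cpu_ptr(sb->s_files, smp_processor_id()));
		lg_local_unlock(files_lglock);
	}

	static void file_sb_list_del(struct file *file)
	{
		if (!list_empty(&file->f_u.fu_list)) {
			/* may run on a different CPU than the add did,
			 * so take that CPU's lock, not the global one */
			lg_local_lock_cpu(files_lglock, file->f_sb_list_cpu);
			list_del_init(&file->f_u.fu_list);
			lg_local_unlock_cpu(files_lglock, file->f_sb_list_cpu);
		}
	}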

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH, RFC] tty: stop abusing file->f_u.fu_list
  2010-06-04  6:43 ` [patch 1/4] fs: cleanup files_lock Nick Piggin
  2010-06-04  8:38   ` Christoph Hellwig
@ 2010-06-04 18:39   ` Christoph Hellwig
  2010-06-04 19:35     ` Al Viro
                       ` (2 more replies)
  1 sibling, 3 replies; 18+ messages in thread
From: Christoph Hellwig @ 2010-06-04 18:39 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Al Viro, linux-kernel, linux-fsdevel, Alan Cox

The tty code currently abuses the file anchor for the per-sb file list
to track instances of a given tty.  But there's no good reason for
that; we can just install a proxy object in file->private_data that gets
added to the list and points to the tty, and keep the list away from
VFS internals.

Note that I've just #if 0'd the selinux mess poking into it.  While we
could trivially port it to the new code by making the tty_private
structure public, this code is just too revolting to be kept around.
It would never have been there anyway if a person with some amount of
clue had ever reviewed the selinux code.  And no, it's not just the
tty portion, the rest of that function is just as bad.

Index: linux-2.6/drivers/char/pty.c
===================================================================
--- linux-2.6.orig/drivers/char/pty.c	2010-06-04 17:36:40.370024374 +0200
+++ linux-2.6/drivers/char/pty.c	2010-06-04 20:10:11.115254505 +0200
@@ -649,8 +649,10 @@ static int __ptmx_open(struct inode *ino
 	}
 
 	set_bit(TTY_PTY_LOCK, &tty->flags); /* LOCK THE SLAVE */
-	filp->private_data = tty;
-	file_move(filp, &tty->tty_files);
+
+	retval = tty_add_file(tty, filp);
+	if (retval)
+		goto out;
 
 	retval = devpts_pty_new(inode, tty->link);
 	if (retval)
Index: linux-2.6/drivers/char/tty_io.c
===================================================================
--- linux-2.6.orig/drivers/char/tty_io.c	2010-06-04 17:22:38.056253946 +0200
+++ linux-2.6/drivers/char/tty_io.c	2010-06-04 20:11:36.872254644 +0200
@@ -112,6 +112,12 @@
 #define TTY_PARANOIA_CHECK 1
 #define CHECK_TTY_COUNT 1
 
+struct tty_private {
+	struct tty_struct *tty;
+	struct file *file;
+	struct list_head list;
+};
+
 struct ktermios tty_std_termios = {	/* for the benefit of tty drivers  */
 	.c_iflag = ICRNL | IXON,
 	.c_oflag = OPOST | ONLCR,
@@ -184,6 +190,26 @@ void free_tty_struct(struct tty_struct *
 	kfree(tty);
 }
 
+int tty_add_file(struct tty_struct *tty, struct file *file)
+{
+	struct tty_private *priv;
+
+	priv = kmalloc(sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->tty = tty;
+	priv->file = file;
+
+	spin_lock(&tty->tty_files_lock);
+	list_add(&priv->list, &tty->tty_files);
+	spin_unlock(&tty->tty_files_lock);
+
+	file->private_data = priv;
+	return 0;
+}
+
+
 #define TTY_NUMBER(tty) ((tty)->index + (tty)->driver->name_base)
 
 /**
@@ -234,11 +260,11 @@ static int check_tty_count(struct tty_st
 	struct list_head *p;
 	int count = 0;
 
-	file_list_lock();
+	spin_lock(&tty->tty_files_lock);
 	list_for_each(p, &tty->tty_files) {
 		count++;
 	}
-	file_list_unlock();
+	spin_unlock(&tty->tty_files_lock);
 	if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
 	    tty->driver->subtype == PTY_TYPE_SLAVE &&
 	    tty->link && tty->link->count)
@@ -495,6 +521,7 @@ static void do_tty_hangup(struct work_st
 {
 	struct tty_struct *tty =
 		container_of(work, struct tty_struct, hangup_work);
+	struct tty_private *priv;
 	struct file *cons_filp = NULL;
 	struct file *filp, *f = NULL;
 	struct task_struct *p;
@@ -507,9 +534,12 @@ static void do_tty_hangup(struct work_st
 
 
 	spin_lock(&redirect_lock);
-	if (redirect && redirect->private_data == tty) {
-		f = redirect;
-		redirect = NULL;
+	if (redirect) {
+		struct tty_private *priv = redirect->private_data;
+		if (priv->tty == tty) {
+			f = redirect;
+			redirect = NULL;
+		}
 	}
 	spin_unlock(&redirect_lock);
 
@@ -517,9 +547,11 @@ static void do_tty_hangup(struct work_st
 	lock_kernel();
 	check_tty_count(tty, "do_tty_hangup");
 
-	file_list_lock();
+	spin_lock(&tty->tty_files_lock);
 	/* This breaks for file handles being sent over AF_UNIX sockets ? */
-	list_for_each_entry(filp, &tty->tty_files, f_u.fu_list) {
+	list_for_each_entry(priv, &tty->tty_files, list) {
+		filp = priv->file;
+
 		if (filp->f_op->write == redirected_tty_write)
 			cons_filp = filp;
 		if (filp->f_op->write != tty_write)
@@ -528,7 +560,7 @@ static void do_tty_hangup(struct work_st
 		tty_fasync(-1, filp, 0);	/* can't block */
 		filp->f_op = &hung_up_tty_fops;
 	}
-	file_list_unlock();
+	spin_unlock(&tty->tty_files_lock);
 
 	tty_ldisc_hangup(tty);
 
@@ -874,13 +906,12 @@ EXPORT_SYMBOL(start_tty);
 static ssize_t tty_read(struct file *file, char __user *buf, size_t count,
 			loff_t *ppos)
 {
-	int i;
-	struct tty_struct *tty;
-	struct inode *inode;
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct tty_private *priv = file->private_data;
+	struct tty_struct *tty = priv->tty;
 	struct tty_ldisc *ld;
+	int i;
 
-	tty = (struct tty_struct *)file->private_data;
-	inode = file->f_path.dentry->d_inode;
 	if (tty_paranoia_check(tty, inode, "tty_read"))
 		return -EIO;
 	if (!tty || (test_bit(TTY_IO_ERROR, &tty->flags)))
@@ -1051,12 +1082,12 @@ void tty_write_message(struct tty_struct
 static ssize_t tty_write(struct file *file, const char __user *buf,
 						size_t count, loff_t *ppos)
 {
-	struct tty_struct *tty;
 	struct inode *inode = file->f_path.dentry->d_inode;
-	ssize_t ret;
+	struct tty_private *priv = file->private_data;
+	struct tty_struct *tty = priv->tty;
 	struct tty_ldisc *ld;
+	ssize_t ret;
 
-	tty = (struct tty_struct *)file->private_data;
 	if (tty_paranoia_check(tty, inode, "tty_write"))
 		return -EIO;
 	if (!tty || !tty->ops->write ||
@@ -1419,9 +1450,9 @@ static void release_one_tty(struct work_
 	tty_driver_kref_put(driver);
 	module_put(driver->owner);
 
-	file_list_lock();
+	spin_lock(&tty->tty_files_lock);
 	list_del_init(&tty->tty_files);
-	file_list_unlock();
+	spin_unlock(&tty->tty_files_lock);
 
 	put_pid(tty->pgrp);
 	put_pid(tty->session);
@@ -1502,13 +1533,14 @@ static void release_tty(struct tty_struc
 
 int tty_release(struct inode *inode, struct file *filp)
 {
-	struct tty_struct *tty, *o_tty;
+	struct tty_private *priv = filp->private_data;
+	struct tty_struct *tty = priv->tty;
+	struct tty_struct *o_tty;
 	int	pty_master, tty_closing, o_tty_closing, do_sleep;
 	int	devpts;
 	int	idx;
 	char	buf[64];
 
-	tty = (struct tty_struct *)filp->private_data;
 	if (tty_paranoia_check(tty, inode, "tty_release_dev"))
 		return 0;
 
@@ -1666,7 +1698,10 @@ int tty_release(struct inode *inode, str
 	 *  - do_tty_hangup no longer sees this file descriptor as
 	 *    something that needs to be handled for hangups.
 	 */
-	file_kill(filp);
+	spin_lock(&tty->tty_files_lock);
+	list_del(&priv->list);
+	spin_unlock(&tty->tty_files_lock);
+	kfree(priv);
 	filp->private_data = NULL;
 
 	/*
@@ -1834,8 +1869,12 @@ got_driver:
 		return PTR_ERR(tty);
 	}
 
-	filp->private_data = tty;
-	file_move(filp, &tty->tty_files);
+	retval = tty_add_file(tty, filp);
+	if (retval) {
+		unlock_kernel();
+		return retval;
+	}
+
 	check_tty_count(tty, "tty_open");
 	if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
 	    tty->driver->subtype == PTY_TYPE_MASTER)
@@ -1911,11 +1950,11 @@ got_driver:
 
 static unsigned int tty_poll(struct file *filp, poll_table *wait)
 {
-	struct tty_struct *tty;
+	struct tty_private *priv = filp->private_data;
+	struct tty_struct *tty = priv->tty;
 	struct tty_ldisc *ld;
 	int ret = 0;
 
-	tty = (struct tty_struct *)filp->private_data;
 	if (tty_paranoia_check(tty, filp->f_path.dentry->d_inode, "tty_poll"))
 		return 0;
 
@@ -1928,12 +1967,12 @@ static unsigned int tty_poll(struct file
 
 static int tty_fasync(int fd, struct file *filp, int on)
 {
-	struct tty_struct *tty;
+	struct tty_private *priv = filp->private_data;
+	struct tty_struct *tty = priv->tty;
 	unsigned long flags;
 	int retval = 0;
 
 	lock_kernel();
-	tty = (struct tty_struct *)filp->private_data;
 	if (tty_paranoia_check(tty, filp->f_path.dentry->d_inode, "tty_fasync"))
 		goto out;
 
@@ -2479,13 +2518,14 @@ EXPORT_SYMBOL(tty_pair_get_pty);
  */
 long tty_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
-	struct tty_struct *tty, *real_tty;
+	struct tty_private *priv = file->private_data;
+	struct tty_struct *tty = priv->tty;
+	struct tty_struct *real_tty;
 	void __user *p = (void __user *)arg;
 	int retval;
 	struct tty_ldisc *ld;
 	struct inode *inode = file->f_dentry->d_inode;
 
-	tty = (struct tty_struct *)file->private_data;
 	if (tty_paranoia_check(tty, inode, "tty_ioctl"))
 		return -EINVAL;
 
@@ -2607,7 +2647,8 @@ static long tty_compat_ioctl(struct file
 				unsigned long arg)
 {
 	struct inode *inode = file->f_dentry->d_inode;
-	struct tty_struct *tty = file->private_data;
+	struct tty_private *priv = file->private_data;
+	struct tty_struct *tty = priv->tty;
 	struct tty_ldisc *ld;
 	int retval = -ENOIOCTLCMD;
 
@@ -2658,6 +2699,7 @@ void __do_SAK(struct tty_struct *tty)
 	int		i;
 	struct file	*filp;
 	struct fdtable *fdt;
+	struct tty_private *priv;
 
 	if (!tty)
 		return;
@@ -2698,8 +2740,9 @@ void __do_SAK(struct tty_struct *tty)
 				filp = fcheck_files(p->files, i);
 				if (!filp)
 					continue;
+				priv = filp->private_data;
 				if (filp->f_op->read == tty_read &&
-				    filp->private_data == tty) {
+				    priv->tty == tty) {
 					printk(KERN_NOTICE "SAK: killed process %d"
 					    " (%s): fd#%d opened to the tty\n",
 					    task_pid_nr(p), p->comm, i);
@@ -2771,6 +2814,7 @@ void initialize_tty_struct(struct tty_st
 	spin_lock_init(&tty->read_lock);
 	spin_lock_init(&tty->ctrl_lock);
 	INIT_LIST_HEAD(&tty->tty_files);
+	spin_lock_init(&tty->tty_files_lock);
 	INIT_WORK(&tty->SAK_work, do_SAK_work);
 
 	tty->driver = driver;
Index: linux-2.6/security/selinux/hooks.c
===================================================================
--- linux-2.6.orig/security/selinux/hooks.c	2010-06-04 19:55:53.440253458 +0200
+++ linux-2.6/security/selinux/hooks.c	2010-06-04 19:56:39.370253946 +0200
@@ -2212,11 +2212,13 @@ static inline void flush_unauthorized_fi
 {
 	struct common_audit_data ad;
 	struct file *file, *devnull = NULL;
-	struct tty_struct *tty;
 	struct fdtable *fdt;
 	long j = -1;
 	int drop_tty = 0;
 
+#ifdef SANITY /* selinux on crack */
+	struct tty_struct *tty;
+
 	tty = get_current_tty();
 	if (tty) {
 		file_list_lock();
@@ -2238,6 +2240,7 @@ static inline void flush_unauthorized_fi
 		file_list_unlock();
 		tty_kref_put(tty);
 	}
+#endif
 	/* Reset controlling tty. */
 	if (drop_tty)
 		no_tty();
Index: linux-2.6/include/linux/tty.h
===================================================================
--- linux-2.6.orig/include/linux/tty.h	2010-06-04 20:04:37.892254224 +0200
+++ linux-2.6/include/linux/tty.h	2010-06-04 20:08:42.715254434 +0200
@@ -288,6 +288,7 @@ struct tty_struct {
 	void *disc_data;
 	void *driver_data;
 	struct list_head tty_files;
+	spinlock_t tty_files_lock;
 
 #define N_TTY_BUF_SIZE 4096
 
@@ -455,6 +456,7 @@ extern void proc_clear_tty(struct task_s
 extern struct tty_struct *get_current_tty(void);
 extern void tty_default_fops(struct file_operations *fops);
 extern struct tty_struct *alloc_tty_struct(void);
+extern int tty_add_file(struct tty_struct *tty, struct file *file);
 extern void free_tty_struct(struct tty_struct *tty);
 extern void initialize_tty_struct(struct tty_struct *tty,
 		struct tty_driver *driver, int idx);
Index: linux-2.6/fs/file_table.c
===================================================================
--- linux-2.6.orig/fs/file_table.c	2010-06-04 20:02:52.120024444 +0200
+++ linux-2.6/fs/file_table.c	2010-06-04 20:16:43.857005797 +0200
@@ -32,8 +32,9 @@ struct files_stat_struct files_stat = {
 	.max_files = NR_FILE
 };
 
-/* public. Not pretty! */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
+static __cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
+#define file_list_lock() spin_lock(&files_lock);
+#define file_list_unlock() spin_unlock(&files_lock);
 
 /* SLAB cache for file structures */
 static struct kmem_cache *filp_cachep __read_mostly;
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-06-04 20:16:14.958274130 +0200
+++ linux-2.6/include/linux/fs.h	2010-06-04 20:19:37.075254924 +0200
@@ -949,9 +949,6 @@ struct file {
 	unsigned long f_mnt_write_state;
 #endif
 };
-extern spinlock_t files_lock;
-#define file_list_lock() spin_lock(&files_lock);
-#define file_list_unlock() spin_unlock(&files_lock);
 
 #define get_file(x)	atomic_long_inc(&(x)->f_count)
 #define fput_atomic(x)	atomic_long_add_unless(&(x)->f_count, -1, 1)
@@ -2182,8 +2179,6 @@ static inline void insert_inode_hash(str
 	__insert_inode_hash(inode, inode->i_ino);
 }
 
-extern void file_move(struct file *f, struct list_head *list);
-extern void file_kill(struct file *f);
 #ifdef CONFIG_BLOCK
 struct bio;
 extern void submit_bio(int, struct bio *);
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h	2010-06-04 20:19:32.144254643 +0200
+++ linux-2.6/fs/internal.h	2010-06-04 20:19:49.453254853 +0200
@@ -80,6 +80,8 @@ extern void chroot_fs_refs(struct path *
 /*
  * file_table.c
  */
+extern void file_move(struct file *f, struct list_head *list);
+extern void file_kill(struct file *f);
 extern void mark_files_ro(struct super_block *);
 extern struct file *get_empty_filp(void);
 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, RFC] tty: stop abusing file->f_u.fu_list
  2010-06-04 18:39   ` [PATCH, RFC] tty: stop abusing file->f_u.fu_list Christoph Hellwig
@ 2010-06-04 19:35     ` Al Viro
  2010-06-05 11:39     ` Nick Piggin
  2010-06-08  5:22     ` Nick Piggin
  2 siblings, 0 replies; 18+ messages in thread
From: Al Viro @ 2010-06-04 19:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, linux-kernel, linux-fsdevel, Alan Cox

On Fri, Jun 04, 2010 at 02:39:34PM -0400, Christoph Hellwig wrote:
> The tty code currently abuses the file anchor for the per-sb file list
> to track instances of a given tty.  But there's no good reason for
> that; we can just install a proxy object in file->private_data that gets
> added to the list and points to the tty, and keep the list away from
> VFS internals.
> 
> Note that I've just #if 0'd the selinux mess poking into it.  While we
> could trivially port it to the new code by making the tty_private
> structure public, this code is just too revolting to be kept around.
> It would never have been there anyway if a person with some amount of
> clue had ever reviewed the selinux code.  And no, it's not just the
> tty portion, the rest of that function is just as bad.

This is disgusting, as much as the selinux code you've mentioned ;-/

FWIW, selinux problem here is interesting - essentially, it violates its
own rules since the real object here is not an inode.  It's tty.  And
inode pretty much serves as a name - potentially one of many.  So the
policy should've been associated with tty instead...

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, RFC] tty: stop abusing file->f_u.fu_list
  2010-06-04 18:39   ` [PATCH, RFC] tty: stop abusing file->f_u.fu_list Christoph Hellwig
  2010-06-04 19:35     ` Al Viro
@ 2010-06-05 11:39     ` Nick Piggin
  2010-06-08  5:22     ` Nick Piggin
  2 siblings, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2010-06-05 11:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Al Viro, linux-kernel, linux-fsdevel, Alan Cox

On Fri, Jun 04, 2010 at 02:39:34PM -0400, Christoph Hellwig wrote:
> The tty code currently abuses the file anchor for the per-sb file list
> to track instances of a given tty.  But there's no good reason for
> that; we can just install a proxy object in file->private_data that gets
> added to the list and points to the tty, and keep the list away from
> VFS internals.

Well, thanks for this. Yes, it is an obviously nicer way to do it, so
the tty code doesn't have to know what the vfs uses the files list for.

 
> Note that I've just #if 0'd the selinux mess poking into it.  While we
> could trivially port it to the new code by making the tty_private
> structure public, this code is just too revolting to be kept around.
> It would never have been there anyway if a person with some amount of
> clue had ever reviewed the selinux code.  And no, it's not just the
> tty portion, the rest of that function is just as bad.

Why is it a mess? Just because of the conceptual nastiness of checking
a tty object via a random one of its inodes? What would be a better way
to do this?

I think for a first pass, a simple conversion for all the code would be
good for me, because then it stops blocking the scaling patch (and it's
more bisectable).

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, RFC] tty: stop abusing file->f_u.fu_list
  2010-06-04 18:39   ` [PATCH, RFC] tty: stop abusing file->f_u.fu_list Christoph Hellwig
  2010-06-04 19:35     ` Al Viro
  2010-06-05 11:39     ` Nick Piggin
@ 2010-06-08  5:22     ` Nick Piggin
  2 siblings, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2010-06-08  5:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Al Viro, linux-kernel, linux-fsdevel, Alan Cox

On Fri, Jun 04, 2010 at 02:39:34PM -0400, Christoph Hellwig wrote:
> The tty code currently abuses the file anchor for the per-sb file list
> to track instances of a given tty.  But there's no good reason for
> that; we can just install a proxy object in file->private_data that gets
> added to the list and points to the tty, and keep the list away from
> VFS internals.

Well, it looks like the error handling in this code is broken, and it's
pretty convoluted to fix (e.g. tty_release requires the tty from the
filp, but is called to clean up code paths that run before the tty file
private structure is allocated).

So I really prefer to put my original patch first. I really don't see
how it makes the tty code more of a special case. In both cases, it
must know that the vfs does not require the file's presence on the
s_files list so it can be reused for tty code. After my patch, it no
longer knows any details about how the vfs does locking for the list.

Possibly a lighter way to do what you want is to have the vfs not use
fu_list for device inodes, and let drivers use it for their own
purposes. But that is easily possible after my patch, as sketched below.
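
(A sketch of that lighter variant; the file_sb_list_add helper name is an
assumption along the lines of patch 1/4's list helpers, and the S_ISCHR
test is illustrative:)

	/* in __dentry_open() or similar: only track ordinary files on
	 * s_files, leaving fu_list free for character-device drivers
	 * such as the tty layer to reuse */
	if (!S_ISCHR(inode->i_mode))
		file_sb_list_add(f, inode->i_sb);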

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2010-06-08  5:22 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-04  6:43 [patch 0/4] Initial vfs scalability patches again Nick Piggin
2010-06-04  6:43 ` [patch 1/4] fs: cleanup files_lock Nick Piggin
2010-06-04  8:38   ` Christoph Hellwig
2010-06-04 14:20     ` Nick Piggin
2010-06-04 14:39       ` Andi Kleen
2010-06-04 15:10       ` Christoph Hellwig
2010-06-04 18:39   ` [PATCH, RFC] tty: stop abusing file->f_u.fu_list Christoph Hellwig
2010-06-04 19:35     ` Al Viro
2010-06-05 11:39     ` Nick Piggin
2010-06-08  5:22     ` Nick Piggin
2010-06-04  6:43 ` [patch 2/4] lglock: introduce special lglock and brlock spin locks Nick Piggin
2010-06-04  7:56   ` Eric Dumazet
2010-06-04 14:13     ` Nick Piggin
2010-06-04 14:24       ` Eric Dumazet
2010-06-04 15:03   ` Paul E. McKenney
2010-06-04 15:12     ` Nick Piggin
2010-06-04  6:43 ` [patch 3/4] fs: scale files_lock Nick Piggin
2010-06-04  6:43 ` [patch 4/4] fs: brlock vfsmount_lock Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).