public inbox for linux-kernel@vger.kernel.org
* [patchset] Lockfree fd lookup 0 of 5
@ 2004-08-02 10:10 Ravikiran G Thirumalai
  2004-08-02 10:18 ` Ravikiran G Thirumalai
  2004-08-02 16:56 ` viro
  0 siblings, 2 replies; 12+ messages in thread
From: Ravikiran G Thirumalai @ 2004-08-02 10:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Greg KH, dipankar, viro

Here is a patchset that eliminates taking struct files_struct.file_lock on
the reader side, using RCU and RCU-based refcounting.  These patches
extend the kref api with a kref_lf_xxx api, including kref_lf_get_rcu, to
do lock-free refcounting, and use it.  As posted earlier, since fd
lookups (struct files_struct.fd[]) become lock free with these patches,
threaded workloads doing lots of I/O should see performance benefits
from this patchset.  I have observed a 13-15% improvement with tiobench
on a 4-way Xeon with this patchset.

The patchset contains:
1. kref-merged-2.6.7.patch -- kref shrinkage patch, which GregKH has applied
   to his tree.
2. kref-drivers-2.6.7.patch -- existing users of kref modified to use the
   'shrunk' krefs.  GregKH has applied this to his tree too.
3. kref-lf-2.6.7.patch -- kref api additions for lock-free refcounting.
   This patch relocates the kref api to kref.h as static inlines, since
   they are mostly wrappers around atomic_xxx operations.
4. files_struct-kref-s-2.6.7.patch -- change struct file.f_count to a kref
   and use the kref api for refcounting.  This adds no performance benefit
   by itself and is just an intermediate patch.
5. files_struct-rcu-kref-2.6.7.patch -- make fd lookups lock free using
   RCU and the kref_lf_xxx api for lock-free refcounting.

The patchset will follow this post.

Thanks,
Kiran


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-02 10:10 Ravikiran G Thirumalai
@ 2004-08-02 10:18 ` Ravikiran G Thirumalai
  2004-08-02 16:56 ` viro
  1 sibling, 0 replies; 12+ messages in thread
From: Ravikiran G Thirumalai @ 2004-08-02 10:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Greg KH, dipankar, viro

The following patch adds the kref_lf_xxx api, including kref_lf_get_rcu, to
enable refcounting of objects that are part of a lock-free collection.

Thanks,
Kiran


D:
D: kref-lf-2.6.7.patch:
D: This patch depends on the kref shrinkage patches.  It provides
D: infrastructure for lock-free refcounting and adds the kref_lf_xxx api
D: for lock-free refcounting.  It also changes the kref_xxx functions to
D: static inlines, since they are mostly one- or two-liners.
D:

diff -ruN -X /home/kiran/dontdiff linux-2.6.7/include/linux/kref.h test-2.6.7/include/linux/kref.h
--- linux-2.6.7/include/linux/kref.h	2004-08-01 22:37:46.000000000 +0530
+++ test-2.6.7/include/linux/kref.h	2004-08-01 22:50:06.000000000 +0530
@@ -16,16 +16,264 @@
 #define _KREF_H_
 
 #include <linux/types.h>
+#include <linux/kernel.h>
 #include <asm/atomic.h>
 
-
 struct kref {
 	atomic_t refcount;
 };
 
-void kref_init(struct kref *kref);
-struct kref *kref_get(struct kref *kref);
-void kref_put(struct kref *kref, void (*release) (struct kref *kref));
+/**
+ * kref_init - initialize object.
+ * @kref: object in question.
+ */
+static inline void kref_init(struct kref *kref)
+{
+	atomic_set(&kref->refcount, 1);
+}
+
+/**
+ * kref_get - increment refcount for object.
+ * @kref: object.
+ */
+static inline struct kref *kref_get(struct kref *kref)
+{
+	WARN_ON(!atomic_read(&kref->refcount));
+	atomic_inc(&kref->refcount);
+	return kref;
+}
+
+/**
+ * kref_put - decrement refcount for object.
+ * @kref: object.
+ * @release: pointer to the function that will clean up the object
+ *	     when the last reference to the object is released.
+ *	     This pointer is required.
+ *
+ * Decrement the refcount, and if 0, call release().
+ */
+static inline void
+kref_put(struct kref *kref, void (*release) (struct kref * kref))
+{
+	if (atomic_dec_and_test(&kref->refcount)) {
+		WARN_ON(release == NULL);
+		pr_debug("kref cleaning up\n");
+		release(kref);
+	}
+}
+
+/**
+ * kref_read - Return the refcount value.
+ * @kref: object.
+ */
+static inline int kref_read(struct kref *kref)
+{
+	return atomic_read(&kref->refcount);
+}
+
+/**
+ * kref_put_last - decrement refcount for object.
+ * @kref: object.
+ *
+ * Decrement the refcount, and if 0 return true.
+ * Returns false otherwise.
+ * Use this only if you cannot use kref_put -- when the
+ * release function of kref_put needs more than just the
+ * refcounted object. Use of kref_put_last when kref_put
+ * can do will be frowned upon.
+ */
+static inline int kref_put_last(struct kref *kref)
+{
+	return atomic_dec_and_test(&kref->refcount);
+}
+
+/* 
+ * Refcounter framework for elements of lists/arrays protected by
+ * RCU.
+ *
+ * Refcounting on elements of lists protected by traditional
+ * reader/writer spinlocks or semaphores is straightforward, as in:
+ * 
+ * 1.					2.
+ * add()				search_and_reference()
+ * {					{
+ * 	alloc_object				read_lock(&list_lock);
+ *	...					search_for_element
+ *	kref_init(&el->rc)			kref_get(&el->rc);	
+ *	write_lock(&list_lock);			...
+ *	add_element				read_unlock(&list_lock);
+ *	...					...
+ *	write_unlock(&list_lock);	}
+ * }					
+ *
+ * 3.					4.
+ * release_referenced()			delete()
+ * {					{
+ *	...				write_lock(&list_lock);
+ *	if (kref_put_last(&el->rc))		...
+ *		start_cleanup_object	...
+ *		free_object		delete_element
+ *	...				write_unlock(&list_lock);
+ * }					...
+ *					if (kref_put_last(&el->rc))
+ *						start_cleanup_object
+ *						free_object
+ *					}
+ *
+ * If this list/array is made lock free using RCU, i.e. by changing the
+ * write_lock in add() and delete() to spin_lock and changing the read_lock
+ * in search_and_reference() to rcu_read_lock(), the kref_get in
+ * search_and_reference() could potentially hold a reference to an element
+ * which has already been deleted from the list/array.  kref_lf_get_rcu
+ * takes care of this scenario.  search_and_reference() should look like:
+ * 2. 
+ * search_and_reference()
+ * {
+ *	rcu_read_lock();
+ *	search_for_element
+ *	if (!kref_lf_get_rcu(&el->rc)) {
+ *		rcu_read_unlock();
+ *		return FAIL;
+ *	}
+ *	...
+ *	...
+ *	rcu_read_unlock();
+ * }
+ * 
+ * Of course, free_object after kref_put_last should be batched using call_rcu.
+ * Sometimes, a reference to the element needs to be obtained in the
+ * update (write) stream.  In such cases, kref_lf_get_rcu might be overkill
+ * since the spinlock serialising list updates is held.  kref_lf_get
+ * is to be used in such cases.
+ * Note: Except for kref_lf_get_rcu, the kref_lf_xxx api are the same as
+ * the corresponding kref_xxx api on most arches.  However, on arches which
+ * do not have cmpxchg, the kref_lf_xxx api use a hashed spinlock
+ * implementation for kref_lf_get_rcu, and acquire the same hashed spinlock
+ * in kref_lf_get etc. to preserve atomicity.
+ * Note: Use the kref_lf_xxx api only if you need to use kref_lf_get_rcu on
+ * the refcounter in at least one place.  Mixing kref_lf_xxx and kref_xxx
+ * api might lead to races.
+ *
+ */
+
+#ifdef __HAVE_ARCH_CMPXCHG
+
+#define kref_lf_init kref_init
+#define kref_lf_get kref_get
+#define kref_lf_put kref_put
+#define kref_lf_put_last kref_put_last
+
+/* 
+ * cmpxchg is needed on UP too, if deletions to the list/array can happen
+ * in interrupt context.
+ */
+
+/**
+ * kref_lf_get_rcu - take a reference to an object in a lockfree collection
+ * while traversing a lockfree list/array.
+ * @kref: object.
+ *
+ * Try and increment the refcount by 1.  The increment might fail if
+ * the refcounted object has been through a 1 to 0 transition and 
+ * is no longer part of the lockfree list.
+ * Returns non-zero on successful increment and zero otherwise.
+ */
+static inline int kref_lf_get_rcu(struct kref *kref)
+{
+	int c, old;
+	c = atomic_read(&kref->refcount);
+	while (c && (old = cmpxchg(&kref->refcount.counter, c, c + 1)) != c)
+		c = old;
+	return c;
+}
+
+#else				/* !__HAVE_ARCH_CMPXCHG */
+
+#include <linux/spinlock.h>
+
+/* 
+ * We use an array of spinlocks for the krefs -- similar to ones in sparc
+ * 32 bit atomic_t implementations, and a hash function similar to that
+ * for our refcounting needs.
+ * Can't help multiprocessors which do not have cmpxchg :(
+ */
+
+#ifdef	CONFIG_SMP
+#define KREF_HASH_SIZE	4
+#define KREF_HASH(k) \
+	(&__kref_hash[(((unsigned long)k)>>8) & (KREF_HASH_SIZE-1)])
+#else
+#define	KREF_HASH_SIZE	1
+#define KREF_HASH(k) 	&__kref_hash[0]
+#endif				/* CONFIG_SMP */
+
+extern spinlock_t __kref_hash[KREF_HASH_SIZE];
+
+static inline void kref_lf_init(struct kref *kref)
+{
+	unsigned long flags;
+	spin_lock_irqsave(KREF_HASH(kref), flags);
+	kref->refcount.counter = 1;
+	spin_unlock_irqrestore(KREF_HASH(kref), flags);
+}
+
+static inline void kref_lf_get(struct kref *kref)
+{
+	unsigned long flags;
+	spin_lock_irqsave(KREF_HASH(kref), flags);
+	kref->refcount.counter += 1;
+	spin_unlock_irqrestore(KREF_HASH(kref), flags);
+}
+
+static inline void 
+kref_lf_put(struct kref *kref, void (*release) (struct kref * kref))
+{
+	unsigned long flags;
+	spin_lock_irqsave(KREF_HASH(kref), flags);
+	kref->refcount.counter--;
+	if (!kref->refcount.counter) {
+		spin_unlock_irqrestore(KREF_HASH(kref), flags);
+		WARN_ON(release == NULL);
+		pr_debug("kref cleaning up\n");
+		release(kref);
+	} else {
+		spin_unlock_irqrestore(KREF_HASH(kref), flags);
+	}
+}
+
+static inline int kref_lf_put_last(struct kref *kref)
+{
+	unsigned long flags;
+	spin_lock_irqsave(KREF_HASH(kref), flags);
+	kref->refcount.counter--;
+	if (!kref->refcount.counter) {
+		spin_unlock_irqrestore(KREF_HASH(kref), flags);
+		return 1;
+	} else {
+		spin_unlock_irqrestore(KREF_HASH(kref), flags);
+		return 0;
+	}
+}
+
+static inline int kref_lf_get_rcu(struct kref *kref)
+{
+	int ret;
+	unsigned long flags;
+	spin_lock_irqsave(KREF_HASH(kref), flags);
+	if (kref->refcount.counter)
+		ret = kref->refcount.counter++;
+	else
+		ret = 0;
+	spin_unlock_irqrestore(KREF_HASH(kref), flags);
+	return ret;
+}
+
+#endif				/* __HAVE_ARCH_CMPXCHG */
 
+/* Spinlocks are not needed in this case, if atomic_read is used */
+static inline int kref_lf_read(struct kref *kref)
+{
+	return atomic_read(&kref->refcount);
+}
 
-#endif /* _KREF_H_ */
+#endif				/* _KREF_H_ */
diff -ruN -X /home/kiran/dontdiff linux-2.6.7/lib/kref.c test-2.6.7/lib/kref.c
--- linux-2.6.7/lib/kref.c	2004-08-01 22:37:46.000000000 +0530
+++ test-2.6.7/lib/kref.c	2004-08-01 21:41:19.000000000 +0530
@@ -11,49 +11,12 @@
  *
  */
 
-/* #define DEBUG */
-
 #include <linux/kref.h>
 #include <linux/module.h>
 
-/**
- * kref_init - initialize object.
- * @kref: object in question.
- */
-void kref_init(struct kref *kref)
-{
-	atomic_set(&kref->refcount,1);
-}
-
-/**
- * kref_get - increment refcount for object.
- * @kref: object.
- */
-struct kref *kref_get(struct kref *kref)
-{
-	WARN_ON(!atomic_read(&kref->refcount));
-	atomic_inc(&kref->refcount);
-	return kref;
-}
-
-/**
- * kref_put - decrement refcount for object.
- * @kref: object.
- * @release: pointer to the function that will clean up the object
- *	     when the last reference to the object is released.
- *	     This pointer is required.
- *
- * Decrement the refcount, and if 0, call release().
- */
-void kref_put(struct kref *kref, void (*release) (struct kref *kref))
-{
-	WARN_ON(release == NULL);
-	if (atomic_dec_and_test(&kref->refcount)) {
-		pr_debug("kref cleaning up\n");
-		release(kref);
-	}
-}
-
-EXPORT_SYMBOL(kref_init);
-EXPORT_SYMBOL(kref_get);
-EXPORT_SYMBOL(kref_put);
+#ifndef __HAVE_ARCH_CMPXCHG
+spinlock_t __kref_hash[KREF_HASH_SIZE] = {
+        [0 ... (KREF_HASH_SIZE-1)] = SPIN_LOCK_UNLOCKED
+};
+EXPORT_SYMBOL(__kref_hash);
+#endif


* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-02 10:10 Ravikiran G Thirumalai
  2004-08-02 10:18 ` Ravikiran G Thirumalai
@ 2004-08-02 16:56 ` viro
  2004-08-02 20:07   ` David S. Miller
  2004-08-03  9:23   ` Ravikiran G Thirumalai
  1 sibling, 2 replies; 12+ messages in thread
From: viro @ 2004-08-02 16:56 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: Andrew Morton, linux-kernel, Greg KH, dipankar

On Mon, Aug 02, 2004 at 03:40:55PM +0530, Ravikiran G Thirumalai wrote:
> Here is a patchset that eliminates taking struct files_struct.file_lock on
> the reader side, using RCU and RCU-based refcounting.  These patches
> extend the kref api with a kref_lf_xxx api, including kref_lf_get_rcu, to
> do lock-free refcounting, and use it.  As posted earlier, since fd
> lookups (struct files_struct.fd[]) become lock free with these patches,
> threaded workloads doing lots of I/O should see performance benefits
> from this patchset.  I have observed a 13-15% improvement with tiobench
> on a 4-way Xeon with this patchset.

How about this for comparison?  That's just a dumb "convert to rwlock"
patch; we can be smarter in e.g. close_on_exec handling, but that's a
separate story.

diff -urN RC8-rc2-bk12/arch/mips/kernel/irixioctl.c RC8-rc2-bk12-current/arch/mips/kernel/irixioctl.c
--- RC8-rc2-bk12/arch/mips/kernel/irixioctl.c	2004-03-17 15:29:37.000000000 -0500
+++ RC8-rc2-bk12-current/arch/mips/kernel/irixioctl.c	2004-08-02 12:24:00.000000000 -0400
@@ -33,7 +33,7 @@
 	struct file *filp;
 	struct tty_struct *ttyp = NULL;
 
-	spin_lock(&current->files->file_lock);
+	read_lock(&current->files->file_lock);
 	filp = fcheck(fd);
 	if(filp && filp->private_data) {
 		ttyp = (struct tty_struct *) filp->private_data;
@@ -41,7 +41,7 @@
 		if(ttyp->magic != TTY_MAGIC)
 			ttyp =NULL;
 	}
-	spin_unlock(&current->files->file_lock);
+	read_unlock(&current->files->file_lock);
 	return ttyp;
 }
 
diff -urN RC8-rc2-bk12/arch/sparc64/solaris/ioctl.c RC8-rc2-bk12-current/arch/sparc64/solaris/ioctl.c
--- RC8-rc2-bk12/arch/sparc64/solaris/ioctl.c	2004-08-02 12:19:34.000000000 -0400
+++ RC8-rc2-bk12-current/arch/sparc64/solaris/ioctl.c	2004-08-02 12:24:23.000000000 -0400
@@ -294,15 +294,15 @@
 {
 	struct inode *ino;
 	/* I wonder which of these tests are superfluous... --patrik */
-	spin_lock(&current->files->file_lock);
+	read_lock(&current->files->file_lock);
 	if (! current->files->fd[fd] ||
 	    ! current->files->fd[fd]->f_dentry ||
 	    ! (ino = current->files->fd[fd]->f_dentry->d_inode) ||
 	    ! ino->i_sock) {
-		spin_unlock(&current->files->file_lock);
+		read_unlock(&current->files->file_lock);
 		return TBADF;
 	}
-	spin_unlock(&current->files->file_lock);
+	read_unlock(&current->files->file_lock);
 	
 	switch (cmd & 0xff) {
 	case 109: /* SI_SOCKPARAMS */
diff -urN RC8-rc2-bk12/drivers/char/tty_io.c RC8-rc2-bk12-current/drivers/char/tty_io.c
--- RC8-rc2-bk12/drivers/char/tty_io.c	2004-08-02 12:19:35.000000000 -0400
+++ RC8-rc2-bk12-current/drivers/char/tty_io.c	2004-08-02 12:24:52.139649424 -0400
@@ -1978,7 +1978,7 @@
 		}
 		task_lock(p);
 		if (p->files) {
-			spin_lock(&p->files->file_lock);
+			read_lock(&p->files->file_lock);
 			for (i=0; i < p->files->max_fds; i++) {
 				filp = fcheck_files(p->files, i);
 				if (!filp)
@@ -1992,7 +1992,7 @@
 					break;
 				}
 			}
-			spin_unlock(&p->files->file_lock);
+			read_unlock(&p->files->file_lock);
 		}
 		task_unlock(p);
 	}
diff -urN RC8-rc2-bk12/fs/exec.c RC8-rc2-bk12-current/fs/exec.c
--- RC8-rc2-bk12/fs/exec.c	2004-08-02 12:19:36.000000000 -0400
+++ RC8-rc2-bk12-current/fs/exec.c	2004-08-02 12:40:31.000000000 -0400
@@ -762,7 +762,7 @@
 {
 	long j = -1;
 
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	for (;;) {
 		unsigned long set, i;
 
@@ -774,16 +774,16 @@
 		if (!set)
 			continue;
 		files->close_on_exec->fds_bits[j] = 0;
-		spin_unlock(&files->file_lock);
+		write_unlock(&files->file_lock);
 		for ( ; set ; i++,set >>= 1) {
 			if (set & 1) {
 				sys_close(i);
 			}
 		}
-		spin_lock(&files->file_lock);
+		write_lock(&files->file_lock);
 
 	}
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 }
 
 int flush_old_exec(struct linux_binprm * bprm)
diff -urN RC8-rc2-bk12/fs/fcntl.c RC8-rc2-bk12-current/fs/fcntl.c
--- RC8-rc2-bk12/fs/fcntl.c	2004-07-18 09:08:41.000000000 -0400
+++ RC8-rc2-bk12-current/fs/fcntl.c	2004-08-02 12:39:55.000000000 -0400
@@ -22,21 +22,21 @@
 void fastcall set_close_on_exec(unsigned int fd, int flag)
 {
 	struct files_struct *files = current->files;
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	if (flag)
 		FD_SET(fd, files->close_on_exec);
 	else
 		FD_CLR(fd, files->close_on_exec);
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 }
 
 static inline int get_close_on_exec(unsigned int fd)
 {
 	struct files_struct *files = current->files;
 	int res;
-	spin_lock(&files->file_lock);
+	read_lock(&files->file_lock);
 	res = FD_ISSET(fd, files->close_on_exec);
-	spin_unlock(&files->file_lock);
+	read_unlock(&files->file_lock);
 	return res;
 }
 
@@ -133,15 +133,15 @@
 	struct files_struct * files = current->files;
 	int fd;
 
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	fd = locate_fd(files, file, start);
 	if (fd >= 0) {
 		FD_SET(fd, files->open_fds);
 		FD_CLR(fd, files->close_on_exec);
-		spin_unlock(&files->file_lock);
+		write_unlock(&files->file_lock);
 		fd_install(fd, file);
 	} else {
-		spin_unlock(&files->file_lock);
+		write_unlock(&files->file_lock);
 		fput(file);
 	}
 
@@ -154,7 +154,7 @@
 	struct file * file, *tofree;
 	struct files_struct * files = current->files;
 
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	if (!(file = fcheck(oldfd)))
 		goto out_unlock;
 	err = newfd;
@@ -185,7 +185,7 @@
 	files->fd[newfd] = file;
 	FD_SET(newfd, files->open_fds);
 	FD_CLR(newfd, files->close_on_exec);
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 
 	if (tofree)
 		filp_close(tofree, files);
@@ -193,11 +193,11 @@
 out:
 	return err;
 out_unlock:
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 	goto out;
 
 out_fput:
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 	fput(file);
 	goto out;
 }
diff -urN RC8-rc2-bk12/fs/file.c RC8-rc2-bk12-current/fs/file.c
--- RC8-rc2-bk12/fs/file.c	2004-07-18 09:08:41.000000000 -0400
+++ RC8-rc2-bk12-current/fs/file.c	2004-08-02 12:30:35.000000000 -0400
@@ -65,7 +65,7 @@
 		goto out;
 
 	nfds = files->max_fds;
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 
 	/* 
 	 * Expand to the max in easy steps, and keep expanding it until
@@ -89,7 +89,7 @@
 
 	error = -ENOMEM;
 	new_fds = alloc_fd_array(nfds);
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	if (!new_fds)
 		goto out;
 
@@ -110,15 +110,15 @@
 			memset(&new_fds[i], 0,
 			       (nfds-i) * sizeof(struct file *)); 
 
-			spin_unlock(&files->file_lock);
+			write_unlock(&files->file_lock);
 			free_fd_array(old_fds, i);
-			spin_lock(&files->file_lock);
+			write_lock(&files->file_lock);
 		}
 	} else {
 		/* Somebody expanded the array while we slept ... */
-		spin_unlock(&files->file_lock);
+		write_unlock(&files->file_lock);
 		free_fd_array(new_fds, nfds);
-		spin_lock(&files->file_lock);
+		write_lock(&files->file_lock);
 	}
 	error = 0;
 out:
@@ -167,7 +167,7 @@
 		goto out;
 
 	nfds = files->max_fdset;
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 
 	/* Expand to the max in easy steps */
 	do {
@@ -183,7 +183,7 @@
 	error = -ENOMEM;
 	new_openset = alloc_fdset(nfds);
 	new_execset = alloc_fdset(nfds);
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	if (!new_openset || !new_execset)
 		goto out;
 
@@ -208,21 +208,21 @@
 		nfds = xchg(&files->max_fdset, nfds);
 		new_openset = xchg(&files->open_fds, new_openset);
 		new_execset = xchg(&files->close_on_exec, new_execset);
-		spin_unlock(&files->file_lock);
+		write_unlock(&files->file_lock);
 		free_fdset (new_openset, nfds);
 		free_fdset (new_execset, nfds);
-		spin_lock(&files->file_lock);
+		write_lock(&files->file_lock);
 		return 0;
 	} 
 	/* Somebody expanded the array while we slept ... */
 
 out:
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 	if (new_openset)
 		free_fdset(new_openset, nfds);
 	if (new_execset)
 		free_fdset(new_execset, nfds);
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	return error;
 }
 
diff -urN RC8-rc2-bk12/fs/file_table.c RC8-rc2-bk12-current/fs/file_table.c
--- RC8-rc2-bk12/fs/file_table.c	2004-05-10 00:23:35.000000000 -0400
+++ RC8-rc2-bk12-current/fs/file_table.c	2004-08-02 12:31:01.000000000 -0400
@@ -197,11 +197,11 @@
 	struct file *file;
 	struct files_struct *files = current->files;
 
-	spin_lock(&files->file_lock);
+	read_lock(&files->file_lock);
 	file = fcheck_files(files, fd);
 	if (file)
 		get_file(file);
-	spin_unlock(&files->file_lock);
+	read_unlock(&files->file_lock);
 	return file;
 }
 
@@ -223,13 +223,13 @@
 	if (likely((atomic_read(&files->count) == 1))) {
 		file = fcheck_files(files, fd);
 	} else {
-		spin_lock(&files->file_lock);
+		read_lock(&files->file_lock);
 		file = fcheck_files(files, fd);
 		if (file) {
 			get_file(file);
 			*fput_needed = 1;
 		}
-		spin_unlock(&files->file_lock);
+		read_unlock(&files->file_lock);
 	}
 	return file;
 }
diff -urN RC8-rc2-bk12/fs/open.c RC8-rc2-bk12-current/fs/open.c
--- RC8-rc2-bk12/fs/open.c	2004-06-16 10:26:24.000000000 -0400
+++ RC8-rc2-bk12-current/fs/open.c	2004-08-02 12:33:29.000000000 -0400
@@ -841,7 +841,7 @@
 	int fd, error;
 
   	error = -EMFILE;
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 
 repeat:
  	fd = find_next_zero_bit(files->open_fds->fds_bits, 
@@ -890,7 +890,7 @@
 	error = fd;
 
 out:
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 	return error;
 }
 
@@ -906,9 +906,9 @@
 void fastcall put_unused_fd(unsigned int fd)
 {
 	struct files_struct *files = current->files;
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	__put_unused_fd(files, fd);
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 }
 
 EXPORT_SYMBOL(put_unused_fd);
@@ -929,11 +929,11 @@
 void fastcall fd_install(unsigned int fd, struct file * file)
 {
 	struct files_struct *files = current->files;
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	if (unlikely(files->fd[fd] != NULL))
 		BUG();
 	files->fd[fd] = file;
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 }
 
 EXPORT_SYMBOL(fd_install);
@@ -1024,7 +1024,7 @@
 	struct file * filp;
 	struct files_struct *files = current->files;
 
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	if (fd >= files->max_fds)
 		goto out_unlock;
 	filp = files->fd[fd];
@@ -1033,11 +1033,11 @@
 	files->fd[fd] = NULL;
 	FD_CLR(fd, files->close_on_exec);
 	__put_unused_fd(files, fd);
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 	return filp_close(filp, files);
 
 out_unlock:
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 	return -EBADF;
 }
 
diff -urN RC8-rc2-bk12/fs/proc/base.c RC8-rc2-bk12-current/fs/proc/base.c
--- RC8-rc2-bk12/fs/proc/base.c	2004-07-18 09:08:42.000000000 -0400
+++ RC8-rc2-bk12-current/fs/proc/base.c	2004-08-02 12:55:13.630740688 -0400
@@ -191,16 +191,16 @@
 
 	files = get_files_struct(task);
 	if (files) {
-		spin_lock(&files->file_lock);
+		read_lock(&files->file_lock);
 		file = fcheck_files(files, fd);
 		if (file) {
 			*mnt = mntget(file->f_vfsmnt);
 			*dentry = dget(file->f_dentry);
-			spin_unlock(&files->file_lock);
+			read_unlock(&files->file_lock);
 			put_files_struct(files);
 			return 0;
 		}
-		spin_unlock(&files->file_lock);
+		read_unlock(&files->file_lock);
 		put_files_struct(files);
 	}
 	return -ENOENT;
@@ -805,7 +805,7 @@
 			files = get_files_struct(p);
 			if (!files)
 				goto out;
-			spin_lock(&files->file_lock);
+			read_lock(&files->file_lock);
 			for (fd = filp->f_pos-2;
 			     fd < files->max_fds;
 			     fd++, filp->f_pos++) {
@@ -813,7 +813,7 @@
 
 				if (!fcheck_files(files, fd))
 					continue;
-				spin_unlock(&files->file_lock);
+				read_unlock(&files->file_lock);
 
 				j = NUMBUF;
 				i = fd;
@@ -825,12 +825,12 @@
 
 				ino = fake_ino(tid, PROC_TID_FD_DIR + fd);
 				if (filldir(dirent, buf+j, NUMBUF-j, fd+2, ino, DT_LNK) < 0) {
-					spin_lock(&files->file_lock);
+					read_lock(&files->file_lock);
 					break;
 				}
-				spin_lock(&files->file_lock);
+				read_lock(&files->file_lock);
 			}
-			spin_unlock(&files->file_lock);
+			read_unlock(&files->file_lock);
 			put_files_struct(files);
 	}
 out:
@@ -1003,9 +1003,9 @@
 
 	files = get_files_struct(task);
 	if (files) {
-		spin_lock(&files->file_lock);
+		read_lock(&files->file_lock);
 		if (fcheck_files(files, fd)) {
-			spin_unlock(&files->file_lock);
+			read_unlock(&files->file_lock);
 			put_files_struct(files);
 			if (task_dumpable(task)) {
 				inode->i_uid = task->euid;
@@ -1017,7 +1017,7 @@
 			security_task_to_inode(task, inode);
 			return 1;
 		}
-		spin_unlock(&files->file_lock);
+		read_unlock(&files->file_lock);
 		put_files_struct(files);
 	}
 	d_drop(dentry);
@@ -1109,7 +1109,7 @@
 	if (!files)
 		goto out_unlock;
 	inode->i_mode = S_IFLNK;
-	spin_lock(&files->file_lock);
+	read_lock(&files->file_lock);
 	file = fcheck_files(files, fd);
 	if (!file)
 		goto out_unlock2;
@@ -1117,7 +1117,7 @@
 		inode->i_mode |= S_IRUSR | S_IXUSR;
 	if (file->f_mode & 2)
 		inode->i_mode |= S_IWUSR | S_IXUSR;
-	spin_unlock(&files->file_lock);
+	read_unlock(&files->file_lock);
 	put_files_struct(files);
 	inode->i_op = &proc_pid_link_inode_operations;
 	inode->i_size = 64;
@@ -1127,7 +1127,7 @@
 	return NULL;
 
 out_unlock2:
-	spin_unlock(&files->file_lock);
+	read_unlock(&files->file_lock);
 	put_files_struct(files);
 out_unlock:
 	iput(inode);
diff -urN RC8-rc2-bk12/fs/select.c RC8-rc2-bk12-current/fs/select.c
--- RC8-rc2-bk12/fs/select.c	2003-10-09 17:34:47.000000000 -0400
+++ RC8-rc2-bk12-current/fs/select.c	2004-08-02 12:34:06.000000000 -0400
@@ -184,9 +184,9 @@
 	int retval, i;
 	long __timeout = *timeout;
 
- 	spin_lock(&current->files->file_lock);
+ 	read_lock(&current->files->file_lock);
 	retval = max_select_fd(n, fds);
-	spin_unlock(&current->files->file_lock);
+	read_unlock(&current->files->file_lock);
 
 	if (retval < 0)
 		return retval;
diff -urN RC8-rc2-bk12/include/linux/file.h RC8-rc2-bk12-current/include/linux/file.h
--- RC8-rc2-bk12/include/linux/file.h	2004-05-10 00:23:41.000000000 -0400
+++ RC8-rc2-bk12-current/include/linux/file.h	2004-08-02 12:35:52.374278504 -0400
@@ -21,7 +21,7 @@
  */
 struct files_struct {
         atomic_t count;
-        spinlock_t file_lock;     /* Protects all the below members.  Nests inside tsk->alloc_lock */
+        rwlock_t file_lock;     /* Protects all the below members.  Nests inside tsk->alloc_lock */
         int max_fds;
         int max_fdset;
         int next_fd;
diff -urN RC8-rc2-bk12/include/linux/init_task.h RC8-rc2-bk12-current/include/linux/init_task.h
--- RC8-rc2-bk12/include/linux/init_task.h	2004-07-18 09:08:44.000000000 -0400
+++ RC8-rc2-bk12-current/include/linux/init_task.h	2004-08-02 12:35:37.000000000 -0400
@@ -6,7 +6,7 @@
 #define INIT_FILES \
 { 							\
 	.count		= ATOMIC_INIT(1), 		\
-	.file_lock	= SPIN_LOCK_UNLOCKED, 		\
+	.file_lock	= RW_LOCK_UNLOCKED, 		\
 	.max_fds	= NR_OPEN_DEFAULT, 		\
 	.max_fdset	= __FD_SETSIZE, 		\
 	.next_fd	= 0, 				\
diff -urN RC8-rc2-bk12/kernel/fork.c RC8-rc2-bk12-current/kernel/fork.c
--- RC8-rc2-bk12/kernel/fork.c	2004-08-02 12:19:37.000000000 -0400
+++ RC8-rc2-bk12-current/kernel/fork.c	2004-08-02 12:37:08.000000000 -0400
@@ -680,7 +680,7 @@
 
 	atomic_set(&newf->count, 1);
 
-	newf->file_lock	    = SPIN_LOCK_UNLOCKED;
+	newf->file_lock	    = RW_LOCK_UNLOCKED;
 	newf->next_fd	    = 0;
 	newf->max_fds	    = NR_OPEN_DEFAULT;
 	newf->max_fdset	    = __FD_SETSIZE;
@@ -693,13 +693,13 @@
 	size = oldf->max_fdset;
 	if (size > __FD_SETSIZE) {
 		newf->max_fdset = 0;
-		spin_lock(&newf->file_lock);
+		write_lock(&newf->file_lock);
 		error = expand_fdset(newf, size-1);
-		spin_unlock(&newf->file_lock);
+		write_unlock(&newf->file_lock);
 		if (error)
 			goto out_release;
 	}
-	spin_lock(&oldf->file_lock);
+	read_lock(&oldf->file_lock);
 
 	open_files = count_open_files(oldf, size);
 
@@ -710,15 +710,15 @@
 	 */
 	nfds = NR_OPEN_DEFAULT;
 	if (open_files > nfds) {
-		spin_unlock(&oldf->file_lock);
+		read_unlock(&oldf->file_lock);
 		newf->max_fds = 0;
-		spin_lock(&newf->file_lock);
+		write_lock(&newf->file_lock);
 		error = expand_fd_array(newf, open_files-1);
-		spin_unlock(&newf->file_lock);
+		write_unlock(&newf->file_lock);
 		if (error) 
 			goto out_release;
 		nfds = newf->max_fds;
-		spin_lock(&oldf->file_lock);
+		read_lock(&oldf->file_lock);
 	}
 
 	old_fds = oldf->fd;
@@ -733,7 +733,7 @@
 			get_file(f);
 		*new_fds++ = f;
 	}
-	spin_unlock(&oldf->file_lock);
+	read_unlock(&oldf->file_lock);
 
 	/* compute the remainder to be cleared */
 	size = (newf->max_fds - open_files) * sizeof(struct file *);
diff -urN RC8-rc2-bk12/net/ipv4/netfilter/ipt_owner.c RC8-rc2-bk12-current/net/ipv4/netfilter/ipt_owner.c
--- RC8-rc2-bk12/net/ipv4/netfilter/ipt_owner.c	2004-07-18 09:08:45.000000000 -0400
+++ RC8-rc2-bk12-current/net/ipv4/netfilter/ipt_owner.c	2004-08-02 12:37:58.000000000 -0400
@@ -35,17 +35,17 @@
 		task_lock(p);
 		files = p->files;
 		if(files) {
-			spin_lock(&files->file_lock);
+			read_lock(&files->file_lock);
 			for (i=0; i < files->max_fds; i++) {
 				if (fcheck_files(files, i) ==
 				    skb->sk->sk_socket->file) {
-					spin_unlock(&files->file_lock);
+					read_unlock(&files->file_lock);
 					task_unlock(p);
 					read_unlock(&tasklist_lock);
 					return 1;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			read_unlock(&files->file_lock);
 		}
 		task_unlock(p);
 	} while_each_thread(g, p);
@@ -67,17 +67,17 @@
 	task_lock(p);
 	files = p->files;
 	if(files) {
-		spin_lock(&files->file_lock);
+		read_lock(&files->file_lock);
 		for (i=0; i < files->max_fds; i++) {
 			if (fcheck_files(files, i) ==
 			    skb->sk->sk_socket->file) {
-				spin_unlock(&files->file_lock);
+				read_unlock(&files->file_lock);
 				task_unlock(p);
 				read_unlock(&tasklist_lock);
 				return 1;
 			}
 		}
-		spin_unlock(&files->file_lock);
+		read_unlock(&files->file_lock);
 	}
 	task_unlock(p);
 out:
@@ -101,14 +101,14 @@
 		task_lock(p);
 		files = p->files;
 		if (files) {
-			spin_lock(&files->file_lock);
+			read_lock(&files->file_lock);
 			for (i=0; i < files->max_fds; i++) {
 				if (fcheck_files(files, i) == file) {
 					found = 1;
 					break;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			read_unlock(&files->file_lock);
 		}
 		task_unlock(p);
 		if (found)
diff -urN RC8-rc2-bk12/net/ipv6/netfilter/ip6t_owner.c RC8-rc2-bk12-current/net/ipv6/netfilter/ip6t_owner.c
--- RC8-rc2-bk12/net/ipv6/netfilter/ip6t_owner.c	2004-07-18 09:08:45.000000000 -0400
+++ RC8-rc2-bk12-current/net/ipv6/netfilter/ip6t_owner.c	2004-08-02 12:38:12.000000000 -0400
@@ -34,16 +34,16 @@
 	task_lock(p);
 	files = p->files;
 	if(files) {
-		spin_lock(&files->file_lock);
+		read_lock(&files->file_lock);
 		for (i=0; i < files->max_fds; i++) {
 			if (fcheck_files(files, i) == skb->sk->sk_socket->file) {
-				spin_unlock(&files->file_lock);
+				read_unlock(&files->file_lock);
 				task_unlock(p);
 				read_unlock(&tasklist_lock);
 				return 1;
 			}
 		}
-		spin_unlock(&files->file_lock);
+		read_unlock(&files->file_lock);
 	}
 	task_unlock(p);
 out:
@@ -67,14 +67,14 @@
 		task_lock(p);
 		files = p->files;
 		if (files) {
-			spin_lock(&files->file_lock);
+			read_lock(&files->file_lock);
 			for (i=0; i < files->max_fds; i++) {
 				if (fcheck_files(files, i) == file) {
 					found = 1;
 					break;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			read_unlock(&files->file_lock);
 		}
 		task_unlock(p);
 		if (found)
diff -urN RC8-rc2-bk12/security/selinux/hooks.c RC8-rc2-bk12-current/security/selinux/hooks.c
--- RC8-rc2-bk12/security/selinux/hooks.c	2004-08-02 12:19:37.000000000 -0400
+++ RC8-rc2-bk12-current/security/selinux/hooks.c	2004-08-02 12:39:03.000000000 -0400
@@ -1794,7 +1794,7 @@
 
 	AVC_AUDIT_DATA_INIT(&ad,FS);
 
-	spin_lock(&files->file_lock);
+	read_lock(&files->file_lock);
 	for (;;) {
 		unsigned long set, i;
 		int fd;
@@ -1806,7 +1806,7 @@
 		set = files->open_fds->fds_bits[j];
 		if (!set)
 			continue;
-		spin_unlock(&files->file_lock);
+		read_unlock(&files->file_lock);
 		for ( ; set ; i++,set >>= 1) {
 			if (set & 1) {
 				file = fget(i);
@@ -1838,10 +1838,10 @@
 				fput(file);
 			}
 		}
-		spin_lock(&files->file_lock);
+		read_lock(&files->file_lock);
 
 	}
-	spin_unlock(&files->file_lock);
+	read_unlock(&files->file_lock);
 }
 
 static void selinux_bprm_apply_creds(struct linux_binprm *bprm, int unsafe)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patchset] Lockfree fd lookup 0 of 5
@ 2004-08-02 18:50 Manfred Spraul
  0 siblings, 0 replies; 12+ messages in thread
From: Manfred Spraul @ 2004-08-02 18:50 UTC (permalink / raw)
  To: viro; +Cc: linux-kernel, Ravikiran G Thirumalai

viro wrote:

>How about this for comparison?  That's just a dumb "convert to rwlock"
>patch; we can be smarter in e.g. close_on_exec handling, but that's a
>separate story.
That won't help: the problem is the cache line thrashing from fget() and 
fget_light() with multithreaded apps.  A sequence lock might help, but an 
rwlock is not a solution.
Actually, 2.4 had an rwlock; it was converted to a spinlock because 
spinlocks are faster on i386:

    read_lock();
    short operation();
    read_unlock();

is a bit slower than

    spin_lock();
    short operation();
    spin_unlock();

because read_unlock() is a full memory barrier, while spin_unlock() is not 
(a barrier is unnecessary there because writes are ordered on i386).

--
    Manfred


* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-02 16:56 ` viro
@ 2004-08-02 20:07   ` David S. Miller
  2004-08-02 21:01     ` William Lee Irwin III
  2004-08-03  9:23   ` Ravikiran G Thirumalai
  1 sibling, 1 reply; 12+ messages in thread
From: David S. Miller @ 2004-08-02 20:07 UTC (permalink / raw)
  To: viro; +Cc: kiran, akpm, linux-kernel, greg, dipankar

On Mon, 2 Aug 2004 17:56:07 +0100
viro@parcelfarce.linux.theplanet.co.uk wrote:

> How about this for comparison?  That's just a dumb "convert to rwlock"
> patch; we can be smarter in e.g. close_on_exec handling, but that's a
> separate story.

Compared to plain spinlocks, rwlocks don't buy you much,
if anything, these days.

Especially for short sequences of code.


* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-02 20:07   ` David S. Miller
@ 2004-08-02 21:01     ` William Lee Irwin III
  2004-08-02 23:15       ` David S. Miller
  0 siblings, 1 reply; 12+ messages in thread
From: William Lee Irwin III @ 2004-08-02 21:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: viro, kiran, akpm, linux-kernel, greg, dipankar

On Mon, 2 Aug 2004 17:56:07 +0100 viro@parcelfarce.linux.theplanet.co.uk wrote:
>> How about this for comparison?  That's just a dumb "convert to rwlock"
>> patch; we can be smarter in e.g. close_on_exec handling, but that's a
>> separate story.

On Mon, Aug 02, 2004 at 01:07:29PM -0700, David S. Miller wrote:
> Compared to plain spinlocks, rwlocks don't buy you much,
> if anything, these days.
> Especially for short sequences of code.

I've found unusual results in this area. e.g. it does appear to matter
for mapping->tree_lock for database workloads that heavily share a
given file and access it in parallel. The radix tree walk, though
intuitively short, is long enough to make the rwlock a win in the
database-oriented uses and microbenchmarks starting around 4x.


-- wli


* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-02 21:01     ` William Lee Irwin III
@ 2004-08-02 23:15       ` David S. Miller
  2004-08-03  2:04         ` William Lee Irwin III
  0 siblings, 1 reply; 12+ messages in thread
From: David S. Miller @ 2004-08-02 23:15 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel, rusty

On Mon, 2 Aug 2004 14:01:19 -0700
William Lee Irwin III <wli@holomorphy.com> wrote:

> I've found unusual results in this area. e.g. it does appear to matter
> for mapping->tree_lock for database workloads that heavily share a
> given file and access it in parallel. The radix tree walk, though
> intuitively short, is long enough to make the rwlock a win in the
> database-oriented uses and microbenchmarks starting around 4x.

Thanks for the data point; I had a patch, sent to Rusty for 2.7.x,
which ripped rwlocks entirely out of the kernel.  We might have to
toss that idea :-)



* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-02 23:15       ` David S. Miller
@ 2004-08-03  2:04         ` William Lee Irwin III
  0 siblings, 0 replies; 12+ messages in thread
From: William Lee Irwin III @ 2004-08-03  2:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, rusty

On Mon, 2 Aug 2004 14:01:19 -0700 William Lee Irwin III wrote:
>> I've found unusual results in this area. e.g. it does appear to matter
>> for mapping->tree_lock for database workloads that heavily share a
>> given file and access it in parallel. The radix tree walk, though
>> intuitively short, is long enough to make the rwlock a win in the
>> database-oriented uses and microbenchmarks starting around 4x.

On Mon, Aug 02, 2004 at 04:15:14PM -0700, David S. Miller wrote:
> Thanks for the data point, because I had this patch I had sent
> to Rusty for 2.7.x which ripped rwlocks entirely out of the
> kernel.  We might have to toss that idea :-)

In all honesty, I'd rather use RCU, but that's a little more work than
most RCU conversions.


-- wli


* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-02 16:56 ` viro
  2004-08-02 20:07   ` David S. Miller
@ 2004-08-03  9:23   ` Ravikiran G Thirumalai
  2004-08-03  9:35     ` Ravikiran G Thirumalai
  2004-08-03 10:06     ` viro
  1 sibling, 2 replies; 12+ messages in thread
From: Ravikiran G Thirumalai @ 2004-08-03  9:23 UTC (permalink / raw)
  To: viro; +Cc: Andrew Morton, linux-kernel, Greg KH, dipankar

On Mon, Aug 02, 2004 at 05:56:07PM +0100, viro@parcelfarce.linux.theplanet.co.uk wrote:
> On Mon, Aug 02, 2004 at 03:40:55PM +0530, Ravikiran G Thirumalai wrote:
> > Here is a patchset to eliminate taking struct files_struct.file_lock on 
> > reader side using rcu and rcu based refcounting.  These patches
> > extend the kref api to include kref_lf_xxx api and kref_lf_get_rcu to
> > do lockfree refcounting, and use the same.  As posted earlier, since fd
> > lookups (struct files_struct.fd[]) will be lock free with these patches, 
> > threaded workloads doing lots of io should see performance benefits 
> > due to this patchset.  I have observed 13-15% improvement with tiobench 
> > on a 4 way xeon with this patchset.
> 
> How about this for comparison?  That's just a dumb "convert to rwlock"
> patch; we can be smarter in e.g. close_on_exec handling, but that's a
> separate story.
> 

I ran tiobench on this patch and here is the comparison:


Kernel		Seqread		Randread	Seqwrite	Randwrite
--------------------------------------------------------------------------
2.6.7		410.33		234.15		254.39		189.36
rwlocks-viro	401.84		232.69		254.09		194.62
refcount (kref)	455.72		281.75		272.87		230.10

All of this was 2.6.7 based.  Numbers are throughput rates in megabytes per second.
The test was on ramfs, 48 threads doing 100 MB of I/O per thread, averaged over
8 runs, on a 4-way PIII Xeon.

Thanks,
Kiran


* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-03  9:23   ` Ravikiran G Thirumalai
@ 2004-08-03  9:35     ` Ravikiran G Thirumalai
  2004-08-03 10:17       ` Dipankar Sarma
  2004-08-03 10:06     ` viro
  1 sibling, 1 reply; 12+ messages in thread
From: Ravikiran G Thirumalai @ 2004-08-03  9:35 UTC (permalink / raw)
  To: viro; +Cc: Andrew Morton, linux-kernel, Greg KH, dipankar

On Tue, Aug 03, 2004 at 02:53:17PM +0530, Ravikiran G Thirumalai wrote:
> > ...
> > How about this for comparison?  That's just a dumb "convert to rwlock"
> > patch; we can be smarter in e.g. close_on_exec handling, but that's a
> > separate story.
> > 
> 
> I ran tiobench on this patch and here is the comparison:
> 
> 
> Kernel		Seqread		Randread	Seqwrite	Randwrite
> --------------------------------------------------------------------------
> 2.6.7		410.33		234.15		254.39		189.36
> rwlocks-viro	401.84		232.69		254.09		194.62
> refcount (kref)	455.72		281.75		272.87		230.10
> 
> All of this was 2.6.7 based.  Numbers are throughput rates in megabytes per second.
> The test was on ramfs, 48 threads doing 100 MB of I/O per thread, averaged over
> 8 runs, on a 4-way PIII Xeon.

Just adding: this was with preemption off.  Here are the oprofile 
profiles for the above runs:

Thanks,
Kiran

1. Vanilla 2.6.7
=================
CPU: PIII, speed 699.726 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
31816    18.7302  __copy_to_user_ll
30421    17.9089  simple_prepare_write
13270     7.8121  __copy_from_user_ll
11514     6.7783  fget_light
8388      4.9380  mark_offset_tsc
5352      3.1507  sysenter_past_esp
5047      2.9712  default_idle
3723      2.1917  do_gettimeofday
2991      1.7608  increment_tail
2884      1.6978  generic_file_aio_write_nolock
2147      1.2639  do_generic_mapping_read
1922      1.1315  current_kernel_time
1800      1.0597  find_vma
1691      0.9955  get_offset_tsc
1683      0.9908  __pagevec_lru_add
1537      0.9048  add_event_entry
1511      0.8895  profile_hook
1413      0.8318  activate_page
1411      0.8307  find_lock_page
1342      0.7900  lookup_dcookie
1318      0.7759  sync_buffer
1289      0.7588  dnotify_parent
1248      0.7347  apic_timer_interrupt
1244      0.7323  find_get_page
1203      0.7082  .text.lock.file_table
1135      0.6682  release_pages
1067      0.6281  __set_page_dirty_buffers
1032      0.6075  scheduler_tick
1027      0.6046  radix_tree_lookup
920       0.5416  generic_file_write

2. rwlocks-2.6.7
=================
CPU: PIII, speed 699.717 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
32276    18.9282  __copy_to_user_ll
29917    17.5448  simple_prepare_write
13168     7.7224  fget_light
13167     7.7218  __copy_from_user_ll
8505      4.9877  mark_offset_tsc
5237      3.0712  sysenter_past_esp
5199      3.0489  default_idle
3697      2.1681  do_gettimeofday
3090      1.8121  generic_file_aio_write_nolock
2739      1.6063  increment_tail
1977      1.1594  current_kernel_time
1956      1.1471  do_generic_mapping_read
1845      1.0820  find_vma
1764      1.0345  get_offset_tsc
1717      1.0069  __pagevec_lru_add
1598      0.9371  add_event_entry
1511      0.8861  profile_hook
1478      0.8668  activate_page
1390      0.8152  sync_buffer
1350      0.7917  find_lock_page
1336      0.7835  lookup_dcookie
1307      0.7665  apic_timer_interrupt
1240      0.7272  release_pages
1239      0.7266  dnotify_parent
1223      0.7172  find_get_page
1173      0.6879  __set_page_dirty_buffers
999       0.5859  scheduler_tick

3. rcu refcounting -2.6.7
=========================
CPU: PIII, speed 699.717 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
31391    19.8978  __copy_to_user_ll
29672    18.8082  simple_prepare_write
13330     8.4495  __copy_from_user_ll
8076      5.1191  mark_offset_tsc
5323      3.3741  sysenter_past_esp
5071      3.2144  default_idle
3682      2.3339  do_gettimeofday
2890      1.8319  increment_tail
2789      1.7679  generic_file_aio_write_nolock
1997      1.2658  do_generic_mapping_read
1993      1.2633  fget_light
1873      1.1872  __pagevec_lru_add
1826      1.1574  find_vma
1808      1.1460  profile_hook
1797      1.1391  current_kernel_time
1738      1.1017  get_offset_tsc
1439      0.9121  add_event_entry
1354      0.8583  sync_buffer
1346      0.8532  lookup_dcookie
1285      0.8145  dnotify_parent
1267      0.8031  apic_timer_interrupt
1260      0.7987  activate_page
1178      0.7467  find_get_page
1173      0.7435  find_lock_page
1125      0.7131  release_pages
1034      0.6554  scheduler_tick
986       0.6250  __set_page_dirty_buffers
939       0.5952  sched_clock



* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-03  9:23   ` Ravikiran G Thirumalai
  2004-08-03  9:35     ` Ravikiran G Thirumalai
@ 2004-08-03 10:06     ` viro
  1 sibling, 0 replies; 12+ messages in thread
From: viro @ 2004-08-03 10:06 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: Andrew Morton, linux-kernel, Greg KH, dipankar

On Tue, Aug 03, 2004 at 02:53:17PM +0530, Ravikiran G Thirumalai wrote:
> > How about this for comparison?  That's just a dumb "convert to rwlock"
> > patch; we can be smarter in e.g. close_on_exec handling, but that's a
> > separate story.
> > 
> 
> I ran tiobench on this patch and here is the comparison:
> 
> 
> Kernel		Seqread		Randread	Seqwrite	Randwrite
> --------------------------------------------------------------------------
> 2.6.7		410.33		234.15		254.39		189.36
> rwlocks-viro	401.84		232.69		254.09		194.62
> refcount (kref)	455.72		281.75		272.87		230.10

Thanks.  IOW, we are really seeing cacheline bounces - not contention...

I'm still not sure that the patch in its current form is a good idea.  The thing
is, existing checks for ->f_count value are bogus in practically all cases;
IMO we should sort that out before making any decision based on the need for
such checks.  Ditto for uses of fcheck() (open-coded or not) in arch/*.

I agree that some form of "postpone freeing and make fget() lockless" would
make sense, but I'd rather clean the area *before* doing that; afterwards it
will be harder and results of cleanup can affect the patches in non-trivial
way.


* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-03  9:35     ` Ravikiran G Thirumalai
@ 2004-08-03 10:17       ` Dipankar Sarma
  0 siblings, 0 replies; 12+ messages in thread
From: Dipankar Sarma @ 2004-08-03 10:17 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: viro, Andrew Morton, linux-kernel, Greg KH

On Tue, Aug 03, 2004 at 03:05:55PM +0530, Ravikiran G Thirumalai wrote:
> On Tue, Aug 03, 2004 at 02:53:17PM +0530, Ravikiran G Thirumalai wrote:
> > I ran tiobench on this patch and here is the comparison:
> > 
> > 
> > Kernel		Seqread		Randread	Seqwrite	Randwrite
> > --------------------------------------------------------------------------
> > 2.6.7		410.33		234.15		254.39		189.36
> > rwlocks-viro	401.84		232.69		254.09		194.62
> > refcount (kref)	455.72		281.75		272.87		230.10
> > 

Hm...

11514     6.7783  fget_light (vanilla)
13168     7.7224  fget_light (rwlock)
1993      1.2633  fget_light (kref)

Total ticks -

169886  (vanilla)
170520  (rwlock)
157760  (kref)

Of the 12126 ticks saved by kref, 9521 came from the 
reduction in fget_light().  So lock-free fget_light() does help.
Also, it seems the lock contention is not much of an issue -

1203      0.7082  .text.lock.file_table

That explains why the rwlock didn't help.  I guess we are benefiting
mostly from avoiding the cacheline bouncing and from removing the
lock acquisition.

Thanks
Dipankar

