Re: [RFC] Lock free fd lookup

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [RFC] Lock free fd lookup
@ 2004-07-17 19:17 Albert Cahalan
  2004-07-29  0:14 ` William Lee Irwin III
  0 siblings, 1 reply; 19+ messages in thread
From: Albert Cahalan @ 2004-07-17 19:17 UTC (permalink / raw)
  To: linux-kernel mailing list; +Cc: kaos, wli, chrisw, jbarnes, kiran, dipankar

Keith Owens writes:

> Writer vs. writer starvation on NUMA is a lot harder.  I don't know
> of any algorithm that handles lists with lots of concurrent updates
> and also scales well on large cpus, unless the underlying hardware
> is fair in its handling of exclusive cache lines.

How about MCS (Mellor-Crummey and Scott) locks?

Linux code:
http://oss.software.ibm.com/linux/patches/?patch_id=218

Something supposedly better:
http://user.it.uu.se/~zoranr/rh_lock/

Scott's list of 11 scalable synchronization algorithms:
http://www.cs.rochester.edu/u/scott/synchronization/pseudocode/ss.html

Scott's collection of papers and so on:
http://www.cs.rochester.edu/u/scott/synchronization/

Simply asking Scott might be a wise move. He'd likely know of anything
else that might fit the requirements. That's scott at cs.rochester.edu



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-17 19:17 [RFC] Lock free fd lookup Albert Cahalan
@ 2004-07-29  0:14 ` William Lee Irwin III
  0 siblings, 0 replies; 19+ messages in thread
From: William Lee Irwin III @ 2004-07-29  0:14 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: linux-kernel mailing list, kaos, chrisw, jbarnes, kiran, dipankar

Keith Owens writes:
>> Writer vs. writer starvation on NUMA is a lot harder.  I don't know
>> of any algorithm that handles lists with lots of concurrent updates
>> and also scales well on large cpus, unless the underlying hardware
>> is fair in its handling of exclusive cache lines.

On Sat, Jul 17, 2004 at 03:17:55PM -0400, Albert Cahalan wrote:
> How about MCS (Mellor-Crummey and Scott) locks?
> Linux code:
> http://oss.software.ibm.com/linux/patches/?patch_id=218
> Something supposedly better:
> http://user.it.uu.se/~zoranr/rh_lock/
> Scott's list of 11 scalable synchronization algorithms:
> http://www.cs.rochester.edu/u/scott/synchronization/pseudocode/ss.html
> Scott's collection of papers and so on:
> http://www.cs.rochester.edu/u/scott/synchronization/
> Simply asking Scott might be a wise move. He'd likely know of anything
> else that might fit the requirements. That's scott at cs.rochester.edu

Did anyone follow up with Scott on this?


-- wli

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
@ 2004-07-17  8:50 Manfred Spraul
  2004-07-17  9:30 ` William Lee Irwin III
  0 siblings, 1 reply; 19+ messages in thread
From: Manfred Spraul @ 2004-07-17  8:50 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel, Keith Owens

wli wrote

>> * It requires type stable storage.  That is, once a data area has been
>>   allocated to a particular structure type, it always contains that
>>   structure type, even when it has been freed from the list.  Each list
>>   requires its own free pool, which can never be returned to the OS.
>
>The last of these is particularly lethal.
>  
>
It might be possible to combine such a lock free algorithms with RCU and 
then set Hugh's SLAB_DESTROY_BY_RCU: It inserts a call_rcu between 
leaving the free pool and returning the page to the OS.

--
    Manfred

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-17  8:50 Manfred Spraul
@ 2004-07-17  9:30 ` William Lee Irwin III
  0 siblings, 0 replies; 19+ messages in thread
From: William Lee Irwin III @ 2004-07-17  9:30 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel, Keith Owens

At some point in the past, Keith Owens wrote:
>>> It requires type stable storage.  That is, once a data area has been
>>>  allocated to a particular structure type, it always contains that
>>>  structure type, even when it has been freed from the list.  Each list
>>>  requires its own free pool, which can never be returned to the OS.

wli wrote
>>The last of these is particularly lethal.

On Sat, Jul 17, 2004 at 10:50:48AM +0200, Manfred Spraul wrote:
> It might be possible to combine such a lock free algorithms with RCU and 
> then set Hugh's SLAB_DESTROY_BY_RCU: It inserts a call_rcu between 
> leaving the free pool and returning the page to the OS.

At least I would prefer such a hybrid algorithm if things of this kind
were used in the kernel. From the looks of it, in such algorithms the
structure is still valid for checking of tickets after a voluntary
preemption, in this RCU hybrid not.


-- wli

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC] Refcounting of objects part of a lockfree collection
@ 2004-07-14  4:53 Ravikiran G Thirumalai
  2004-07-14  4:56 ` [RFC] Lock free fd lookup Ravikiran G Thirumalai
  0 siblings, 1 reply; 19+ messages in thread
From: Ravikiran G Thirumalai @ 2004-07-14  4:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: dipankar

Traditionally, refcounting technique is used to avoid or minimise
lock holding time for individual objects part of a collection.  The
'collection' of such objects (lists/arrays/etc) are proteced by traditional
locks such as spinlocks/reader-writer locks.  These locks protect
access to the collection data structure (...walking the objects part of
the collection, adding/deleting objects to the collection). Access to 
individual objects are protected through the refcounting mechanism.  

It makes sense for performance sake to make access to a read-mostly 
collection of objects lock-free with RCU techniques.  Introduction of RCU 
in place of traditional locking to protect such a collection creates some 
refcounting issues.  These issues come up because, with RCU, unlike 
traditional locking, writers and readers could execute concurrently.  

The attatched patch provides infrastructure for refcounting of objects
in a rcu protected collection.

This infrastructure was a result of our go at making the fd lookup lockfree.
fget right now holds the struct files_struct.file_lock to get the 
struct file pointer of the requested fd from the process' fd array.
The spinlock in fget is used to serialise read access to the 
files_struct.fd[] and incrementing the refcount.  We can save on the 
spin_lock and spin_unlock on the fd array lookup if reference can be 
taken on the struct file object in a safe and lock free manner 
(under rcu_readlock/rcu_readunlock and rcu on the update side).  This 
optimization should improve performance for threaded io intensive workloads.  
A patch to make use of the lockfree refcounting infrastructure to do 
just that follows. With the lockfree fd lookup patch, tiobench performance 
increases by 13% for sequential reads, 21 % for random reads on a 4 
processor pIII xeon.

Comments welcome.

Thanks,
Kiran

Signed off by: Ravikiran Thirumalai <kiran@in.ibm.com>

---

diff -ruN -X dontdiff linux-2.6.6/include/linux/refcount.h refcount-2.6.6/include/linux/refcount.h
--- linux-2.6.6/include/linux/refcount.h	1970-01-01 05:30:00.000000000 +0530
+++ refcount-2.6.6/include/linux/refcount.h	2004-05-24 18:31:25.523860952 +0530
@@ -0,0 +1,244 @@
+/* 
+ * Refcounter framework for elements of lists/arrays protected by
+ * RCU.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2004
+ *
+ * Author: Ravikiran Thirumalai <kiran@in.ibm.com>
+ * Credits: Dipankar Sarma <dipankar@in.ibm.com>.  refcount_get_rcu has
+ *	    been abstracted out from Dipankar's implementation of rcu based
+ *	    fd management.
+ * 
+ * Refcounting on elements of  lists which are protected by traditional
+ * reader/writer spinlocks or semaphores are straight forward as in:
+ * 
+ * 1.					2.
+ * add()				search_and_reference()
+ * {					{
+ * 	alloc_object				read_lock(&list_lock);
+ *	...					search_for_element
+ *	refcount_init(&el->rc)			refcount_get(&el->rc);	
+ *	write_lock(&list_lock);			...
+ *	add_element				read_unlock(&list_lock);
+ *	...					...
+ *	write_unlock(&list_lock);	}
+ * }					
+ *
+ * 3.					4.
+ * release_referenced()			delete()
+ * {					{
+ *	...				write_lock(&list_lock);
+ *	if (refcount_put(&el->rc))	...
+ *		start_cleanup_object	...
+ *		free_object		delete_element
+ *	...				write_unlock(&list_lock);
+ * }					...
+ *					if (refcount_put(&el->rc))
+ *						start_cleanup_object
+ *						free_object
+ *					}
+ *
+ * If this list/array is made lock free using rcu as in changing the 
+ * write_lock in add() and delete() to spin_lock and changing read_lock
+ * in search_and_reference to rcu_read_lock(), the refcount_get in 
+ * search_and_reference could potentially hold reference to an element which
+ * has already been deleted from the list/array.  refcount_get_rcu takes
+ * care of this scenario. search_and_reference should look as;
+ * 2. 
+ * search_and_reference()
+ * {
+ *	rcu_read_lock();
+ *	search_for_element
+ *	if (!refcount_get_rcu(&el->rc)) {
+ *		rcu_read_unlock();
+ *		return FAIL;
+ *	}
+ *	...
+ *	...
+ *	rcu_read_unlock();
+ * }
+ * 
+ * Of course, free_object after refcount_put should be batched using call_rcu.
+ * Sometimes, reference to the element need to be obtained in the 
+ * update (write) stream.  In such cases, refcount_get_rcu might be an overkill
+ * since the spinlock serialising list updates are held. refcount_get
+ * is to be used in such cases.  
+ * Note: Eventhough these refcount primitives can be used for lists which
+ * are protected by traditional locking, if the arch doesn't have cmpxchg,
+ * there might be extra penalties in refcount_get.
+ *
+ */
+
+#ifndef _LINUX_REFCOUNT_H
+#define _LINUX_REFCOUNT_H
+#ifdef  __KERNEL__
+
+#include <asm/atomic.h>
+#include <asm/system.h>
+
+typedef struct { atomic_t count; } refcount_t;
+
+/* 
+ * Refcount primitives for lists/array elements protected by traditional
+ * locking
+ */
+
+static inline void refcount_init(refcount_t *rc)	
+{
+	atomic_set(&rc->count, 1);
+}
+
+static inline int refcount_put(refcount_t *rc)
+{
+	return atomic_dec_and_test(&rc->count);
+}
+
+static inline void refcount_dec(refcount_t *rc)
+{
+	atomic_dec(&rc->count);
+}
+
+static inline int refcount_put_and_lock(refcount_t *rc, spinlock_t *lock)
+{
+	return atomic_dec_and_lock(&rc->count, lock);
+}
+
+static inline int refcount_read(refcount_t *rc)
+{
+	return atomic_read(&rc->count);
+}
+
+#ifdef __HAVE_ARCH_CMPXCHG
+
+static inline void refcount_get(refcount_t *rc)
+{
+	atomic_inc(&rc->count);
+}
+
+/* 
+ * Refcount primitives for lists/array elements protected by rcu
+ */
+
+/* 
+ * cmpxchg is needed on UP too, if deletions to the list/array can happen
+ * in interrupt context.
+ */
+static inline int refcount_get_rcu(refcount_t *rc)
+{
+	int c, old;
+	c = atomic_read(&rc->count);
+	while ( c && (old = cmpxchg(&rc->count.counter, c, c+1)) != c) 
+		c = old;
+	return c;
+}
+
+#else /* !__HAVE_ARCH_CMPXCHG */
+
+/* 
+ * We use an array of spinlocks for the refcounts -- similar to ones in sparc
+ * 32 bit atomic_t implementations, and a hash function similar to that
+ * for our refcounting needs.
+ * Can't help multiprocessors which donot have cmpxchg :(
+ */
+
+#ifdef	CONFIG_SMP
+#define REFCOUNT_HASH_SIZE	4
+#define REFCOUNT_HASH(r) \
+	(&__refcount_hash[(((unsigned long)r)>>8) & (REFCOUNT_HASH_SIZE-1)])
+#else
+#define	REFCOUNT_HASH_SIZE	1
+#define REFCOUNT_HASH(r) 	&__refcount_hash[0]
+#endif /* CONFIG_SMP */
+
+extern spinlock_t __refcount_hash[REFCOUNT_HASH_SIZE]
+
+static inline void refcount_init(refcoun_t *rc)	
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter = 1;
+	spin_unlock_irqrestore((REFCOUNT_HASH(rc), flags);
+}
+
+static inline int refcount_put(refcount_t *rc)
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter--;
+	if (!rc->count.counter) {
+		spin_unlock_irqrestore((REFCOUNT_HASH(rc), flags);
+		return 1;
+	} else {
+		spin_unlock_irqrestore((REFCOUNT_HASH(rc), flags);
+		return 0;
+	}
+}
+
+static inline void refcount_dec(refcount_t *rc)
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter--;
+	spin_unlock_irqrestore((REFCOUNT_HASH(rc), flags);
+}
+
+static inline int refcount_put_and_locked(refcount_t *rc, spinlock_t *lock)
+{
+	unsigned long flags;
+	spin_lock(lock);
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter--;
+	if (!rc->count.counter) {
+		spin_unlock_irqrestore((REFCOUNT_HASH(rc), flags);
+		return 1;
+	} else {
+		spin_unlock_irqrestore((REFCOUNT_HASH(rc), flags);
+		spin_unlock(lock)
+		return 0;
+	}
+}
+/* Spinlocks are not needed in this case, if atomic_read is used */
+static inline int refcount_read(refcount_t *rc)
+{
+	return atomic_read(&rc->count.counter);
+}
+
+static inline void refcount_get(refcount_t *rc)
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count++;
+	spin_unlock_irqrestore((REFCOUNT_HASH(rc), flags);
+}
+
+static inline int refcount_get_rcu(refcount_t *rc)
+{
+	int ret;
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	if (rc->count.counter) 
+		ret = rc->count.counter++;
+	else
+		ret = 0;
+	spin_unlock_irqrestore((REFCOUNT_HASH(rc), flags);
+	return ret;
+}
+
+#endif /* __HAVE_ARCH_CMPXCHG */
+
+#endif
+#endif /* _LINUX_REFCOUNT_H */
diff -ruN -X dontdiff linux-2.6.6/kernel/rcupdate.c refcount-2.6.6/kernel/rcupdate.c
--- linux-2.6.6/kernel/rcupdate.c	2004-05-10 08:01:58.000000000 +0530
+++ refcount-2.6.6/kernel/rcupdate.c	2004-05-24 15:41:15.000000000 +0530
@@ -44,6 +44,7 @@
 #include <linux/notifier.h>
 #include <linux/rcupdate.h>
 #include <linux/cpu.h>
+#include <linux/refcount.h>
 
 /* Definition for rcupdate control block. */
 struct rcu_ctrlblk rcu_ctrlblk = 
@@ -325,6 +326,17 @@
 	wait_for_completion(&completion);
 }
 
+/**
+ * spinlock_t array for arches which donot have cmpxchg.  This is for
+ * rcu based refcounting stuff.
+ */
+
+#ifndef __HAVE_ARCH_CMPXCHG
+spinlock_t __refcount_hash[REFCOUNT_HASH_SIZE] = {
+	[0 ... (REFCOUNT_HASH_SIZE-1)] = SPIN_LOCK_UNLOCKED
+};
+EXPORT_SYMBOL(__refcount_hash);
+#endif
 
 EXPORT_SYMBOL(call_rcu);
 EXPORT_SYMBOL(synchronize_kernel);

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC] Lock free fd lookup
  2004-07-14  4:53 [RFC] Refcounting of objects part of a lockfree collection Ravikiran G Thirumalai
@ 2004-07-14  4:56 ` Ravikiran G Thirumalai
  2004-07-14 15:17   ` Chris Wright
  0 siblings, 1 reply; 19+ messages in thread
From: Ravikiran G Thirumalai @ 2004-07-14  4:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: dipankar

This makes use of the lockfree refcounting infrastructure (see earlier
posting today) to make files_struct.fd[] lookup lockfree. This is carrying 
forward work done by Maneesh and Dipankar earlier. 

With the lockfree fd lookup patch, tiobench performance increases by 13% 
for sequential reads, 21 % for random reads on a 4 processor pIII xeon.

Comments welcome.

Thanks,
Kiran


diff -ruN -X dontdiff linux-2.6.6/arch/mips/kernel/irixioctl.c files_struct-2.6.6/arch/mips/kernel/irixioctl.c
--- linux-2.6.6/arch/mips/kernel/irixioctl.c	2004-05-10 08:02:53.000000000 +0530
+++ files_struct-2.6.6/arch/mips/kernel/irixioctl.c	2004-05-14 12:53:40.000000000 +0530
@@ -14,6 +14,7 @@
 #include <linux/syscalls.h>
 #include <linux/tty.h>
 #include <linux/file.h>
+#include <linux/rcupdate.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctl.h>
@@ -33,15 +34,15 @@
 	struct file *filp;
 	struct tty_struct *ttyp = NULL;
 
-	spin_lock(&current->files->file_lock);
-	filp = fcheck(fd);
+	rcu_read_lock();
+	filp = fcheck_rcu(fd);
 	if(filp && filp->private_data) {
 		ttyp = (struct tty_struct *) filp->private_data;
 
 		if(ttyp->magic != TTY_MAGIC)
 			ttyp =NULL;
 	}
-	spin_unlock(&current->files->file_lock);
+	rcu_read_unlock();
 	return ttyp;
 }
 
diff -ruN -X dontdiff linux-2.6.6/arch/sparc64/solaris/ioctl.c files_struct-2.6.6/arch/sparc64/solaris/ioctl.c
--- linux-2.6.6/arch/sparc64/solaris/ioctl.c	2004-05-10 08:02:38.000000000 +0530
+++ files_struct-2.6.6/arch/sparc64/solaris/ioctl.c	2004-05-14 12:53:40.000000000 +0530
@@ -24,6 +24,7 @@
 #include <linux/netdevice.h>
 #include <linux/mtio.h>
 #include <linux/time.h>
+#include <linux/rcupdate.h>
 #include <linux/compat.h>
 
 #include <net/sock.h>
@@ -291,15 +292,15 @@
 {
 	struct inode *ino;
 	/* I wonder which of these tests are superfluous... --patrik */
-	spin_lock(&current->files->file_lock);
+	rcu_read_lock();
 	if (! current->files->fd[fd] ||
 	    ! current->files->fd[fd]->f_dentry ||
 	    ! (ino = current->files->fd[fd]->f_dentry->d_inode) ||
 	    ! ino->i_sock) {
-		spin_unlock(&current->files->file_lock);
+		rcu_read_unlock();
 		return TBADF;
 	}
-	spin_unlock(&current->files->file_lock);
+	rcu_read_unlock();
 	
 	switch (cmd & 0xff) {
 	case 109: /* SI_SOCKPARAMS */
diff -ruN -X dontdiff linux-2.6.6/arch/sparc64/solaris/timod.c files_struct-2.6.6/arch/sparc64/solaris/timod.c
--- linux-2.6.6/arch/sparc64/solaris/timod.c	2004-05-10 08:02:28.000000000 +0530
+++ files_struct-2.6.6/arch/sparc64/solaris/timod.c	2004-05-14 12:53:40.000000000 +0530
@@ -143,9 +143,14 @@
 static void timod_wake_socket(unsigned int fd)
 {
 	struct socket *sock;
+	struct file *filp; 
 
 	SOLD("wakeing socket");
-	sock = SOCKET_I(current->files->fd[fd]->f_dentry->d_inode);
+	if (!( filp = fcheck(fd))) {
+		SOLD("BAD FD");
+		return;
+	}
+	sock = SOCKET_I(filp->f_dentry->d_inode);
 	wake_up_interruptible(&sock->wait);
 	read_lock(&sock->sk->sk_callback_lock);
 	if (sock->fasync_list && !test_bit(SOCK_ASYNC_WAITDATA, &sock->flags))
@@ -157,9 +162,14 @@
 static void timod_queue(unsigned int fd, struct T_primsg *it)
 {
 	struct sol_socket_struct *sock;
+	struct file *filp; 
 
 	SOLD("queuing primsg");
-	sock = (struct sol_socket_struct *)current->files->fd[fd]->private_data;
+	if (!( filp = fcheck(fd))) {
+		SOLD("BAD FD");
+		return;
+	}
+	sock = (struct sol_socket_struct *)filp->private_data;
 	it->next = sock->pfirst;
 	sock->pfirst = it;
 	if (!sock->plast)
@@ -171,9 +181,14 @@
 static void timod_queue_end(unsigned int fd, struct T_primsg *it)
 {
 	struct sol_socket_struct *sock;
+	struct file *filp; 
 
 	SOLD("queuing primsg at end");
-	sock = (struct sol_socket_struct *)current->files->fd[fd]->private_data;
+	if (!( filp = fcheck(fd))) {
+		SOLD("BAD FD");
+		return;
+	}
+	sock = (struct sol_socket_struct *)filp->private_data;
 	it->next = NULL;
 	if (sock->plast)
 		sock->plast->next = it;
@@ -351,7 +366,10 @@
 		(int (*)(int, unsigned long *))SYS(socketcall);
 	int (*sys_sendto)(int, void *, size_t, unsigned, struct sockaddr *, int) =
 		(int (*)(int, void *, size_t, unsigned, struct sockaddr *, int))SYS(sendto);
-	filp = current->files->fd[fd];
+
+	if (!(filp = fcheck(fd)))
+		return -EBADF;
+
 	ino = filp->f_dentry->d_inode;
 	sock = (struct sol_socket_struct *)filp->private_data;
 	SOLD("entry");
@@ -632,7 +650,10 @@
 	
 	SOLD("entry");
 	SOLDD(("%u %p %d %p %p %d %p %d\n", fd, ctl_buf, ctl_maxlen, ctl_len, data_buf, data_maxlen, data_len, *flags_p));
-	filp = current->files->fd[fd];
+
+	if (!(filp = fcheck(fd)))
+		return -EBADF;
+
 	ino = filp->f_dentry->d_inode;
 	sock = (struct sol_socket_struct *)filp->private_data;
 	SOLDD(("%p %p\n", sock->pfirst, sock->pfirst ? sock->pfirst->next : NULL));
@@ -848,7 +869,7 @@
 	lock_kernel();
 	if(fd >= NR_OPEN) goto out;
 
-	filp = current->files->fd[fd];
+	filp = fcheck(fd);
 	if(!filp) goto out;
 
 	ino = filp->f_dentry->d_inode;
@@ -915,7 +936,7 @@
 	lock_kernel();
 	if(fd >= NR_OPEN) goto out;
 
-	filp = current->files->fd[fd];
+	filp = fcheck(fd);
 	if(!filp) goto out;
 
 	ino = filp->f_dentry->d_inode;
diff -ruN -X dontdiff linux-2.6.6/drivers/char/tty_io.c files_struct-2.6.6/drivers/char/tty_io.c
--- linux-2.6.6/drivers/char/tty_io.c	2004-05-10 08:02:39.000000000 +0530
+++ files_struct-2.6.6/drivers/char/tty_io.c	2004-05-14 12:53:50.000000000 +0530
@@ -1936,9 +1936,9 @@
 		}
 		task_lock(p);
 		if (p->files) {
-			spin_lock(&p->files->file_lock);
+			rcu_read_lock();
 			for (i=0; i < p->files->max_fds; i++) {
-				filp = fcheck_files(p->files, i);
+				filp = fcheck_files_rcu(p->files, i);
 				if (!filp)
 					continue;
 				if (filp->f_op->read == tty_read &&
@@ -1950,7 +1950,7 @@
 					break;
 				}
 			}
-			spin_unlock(&p->files->file_lock);
+			rcu_read_unlock();
 		}
 		task_unlock(p);
 	}
diff -ruN -X dontdiff linux-2.6.6/drivers/net/ppp_generic.c files_struct-2.6.6/drivers/net/ppp_generic.c
--- linux-2.6.6/drivers/net/ppp_generic.c	2004-05-10 08:03:20.000000000 +0530
+++ files_struct-2.6.6/drivers/net/ppp_generic.c	2004-05-14 12:53:50.000000000 +0530
@@ -46,6 +46,7 @@
 #include <linux/rwsem.h>
 #include <linux/stddef.h>
 #include <linux/device.h>
+#include <linux/refcount.h>
 #include <net/slhc_vj.h>
 #include <asm/atomic.h>
 
@@ -525,12 +526,12 @@
 			if (file == ppp->owner)
 				ppp_shutdown_interface(ppp);
 		}
-		if (atomic_read(&file->f_count) <= 2) {
+		if (refcount_read(&file->f_count) <= 2) {
 			ppp_release(inode, file);
 			err = 0;
 		} else
 			printk(KERN_DEBUG "PPPIOCDETACH file->f_count=%d\n",
-			       atomic_read(&file->f_count));
+			       refcount_read(&file->f_count));
 		return err;
 	}
 
diff -ruN -X dontdiff linux-2.6.6/fs/affs/file.c files_struct-2.6.6/fs/affs/file.c
--- linux-2.6.6/fs/affs/file.c	2004-05-10 08:02:29.000000000 +0530
+++ files_struct-2.6.6/fs/affs/file.c	2004-05-14 12:53:50.000000000 +0530
@@ -62,7 +62,7 @@
 static int
 affs_file_open(struct inode *inode, struct file *filp)
 {
-	if (atomic_read(&filp->f_count) != 1)
+	if (refcount_read(&filp->f_count) != 1)
 		return 0;
 	pr_debug("AFFS: open(%d)\n", AFFS_I(inode)->i_opencnt);
 	AFFS_I(inode)->i_opencnt++;
@@ -72,7 +72,7 @@
 static int
 affs_file_release(struct inode *inode, struct file *filp)
 {
-	if (atomic_read(&filp->f_count) != 0)
+	if (refcount_read(&filp->f_count) != 0)
 		return 0;
 	pr_debug("AFFS: release(%d)\n", AFFS_I(inode)->i_opencnt);
 	AFFS_I(inode)->i_opencnt--;
diff -ruN -X dontdiff linux-2.6.6/fs/aio.c files_struct-2.6.6/fs/aio.c
--- linux-2.6.6/fs/aio.c	2004-05-10 08:02:29.000000000 +0530
+++ files_struct-2.6.6/fs/aio.c	2004-05-14 12:53:50.000000000 +0530
@@ -480,7 +480,7 @@
 static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 {
 	dprintk(KERN_DEBUG "aio_put(%p): f_count=%d\n",
-		req, atomic_read(&req->ki_filp->f_count));
+		req, refcount_read(&req->ki_filp->f_count));
 
 	req->ki_users --;
 	if (unlikely(req->ki_users < 0))
@@ -494,7 +494,7 @@
 	/* Must be done under the lock to serialise against cancellation.
 	 * Call this aio_fput as it duplicates fput via the fput_work.
 	 */
-	if (unlikely(atomic_dec_and_test(&req->ki_filp->f_count))) {
+	if (unlikely(refcount_put(&req->ki_filp->f_count))) {
 		get_ioctx(ctx);
 		spin_lock(&fput_lock);
 		list_add(&req->ki_list, &fput_head);
diff -ruN -X dontdiff linux-2.6.6/fs/fcntl.c files_struct-2.6.6/fs/fcntl.c
--- linux-2.6.6/fs/fcntl.c	2004-05-10 08:02:53.000000000 +0530
+++ files_struct-2.6.6/fs/fcntl.c	2004-05-14 12:53:50.000000000 +0530
@@ -34,9 +34,9 @@
 {
 	struct files_struct *files = current->files;
 	int res;
-	spin_lock(&files->file_lock);
+	rcu_read_lock();
 	res = FD_ISSET(fd, files->close_on_exec);
-	spin_unlock(&files->file_lock);
+	rcu_read_unlock();
 	return res;
 }
 
@@ -138,7 +138,7 @@
 	if (fd >= 0) {
 		FD_SET(fd, files->open_fds);
 		FD_CLR(fd, files->close_on_exec);
-		spin_unlock(&files->file_lock);
+	spin_unlock(&files->file_lock);
 		fd_install(fd, file);
 	} else {
 		spin_unlock(&files->file_lock);
@@ -183,6 +183,7 @@
 		goto out_fput;
 
 	files->fd[newfd] = file;
+	wmb();
 	FD_SET(newfd, files->open_fds);
 	FD_CLR(newfd, files->close_on_exec);
 	spin_unlock(&files->file_lock);
diff -ruN -X dontdiff linux-2.6.6/fs/file.c files_struct-2.6.6/fs/file.c
--- linux-2.6.6/fs/file.c	2004-05-10 08:02:00.000000000 +0530
+++ files_struct-2.6.6/fs/file.c	2004-05-14 12:53:50.000000000 +0530
@@ -14,7 +14,20 @@
 #include <linux/file.h>
 
 #include <asm/bitops.h>
+#include <linux/rcupdate.h>
 
+struct rcu_fd_array {
+	struct rcu_head rh;
+	struct file **array;   
+	int nfds;
+};
+
+struct rcu_fd_set {
+	struct rcu_head rh;
+	fd_set *openset;
+	fd_set *execset;
+	int nfds;
+};
 
 /*
  * Allocate an fd array, using kmalloc or vmalloc.
@@ -49,6 +62,13 @@
 		vfree(array);
 }
 
+static void fd_array_callback(struct rcu_head *rcu) 
+{
+	struct rcu_fd_array *a = container_of(rcu, struct rcu_fd_array, rh);
+	free_fd_array(a->array, a->nfds);
+	kfree(a); 
+}
+
 /*
  * Expand the fd array in the files_struct.  Called with the files
  * spinlock held for write.
@@ -56,8 +76,9 @@
 
 int expand_fd_array(struct files_struct *files, int nr)
 {
-	struct file **new_fds;
-	int error, nfds;
+	struct file **new_fds = NULL;
+	int error, nfds = 0;
+	struct rcu_fd_array *arg = NULL;
 
 	
 	error = -EMFILE;
@@ -89,18 +110,17 @@
 
 	error = -ENOMEM;
 	new_fds = alloc_fd_array(nfds);
+	arg = (struct rcu_fd_array *) kmalloc(sizeof(*arg), GFP_KERNEL);
+
 	spin_lock(&files->file_lock);
-	if (!new_fds)
+	if (!new_fds || !arg)
 		goto out;
 
 	/* Copy the existing array and install the new pointer */
 
 	if (nfds > files->max_fds) {
-		struct file **old_fds;
-		int i;
-		
-		old_fds = xchg(&files->fd, new_fds);
-		i = xchg(&files->max_fds, nfds);
+		struct file **old_fds = files->fd;
+		int i = files->max_fds;
 
 		/* Don't copy/clear the array if we are creating a new
 		   fd array for fork() */
@@ -109,19 +129,35 @@
 			/* clear the remainder of the array */
 			memset(&new_fds[i], 0,
 			       (nfds-i) * sizeof(struct file *)); 
-
-			spin_unlock(&files->file_lock);
-			free_fd_array(old_fds, i);
-			spin_lock(&files->file_lock);
 		}
+
+		smp_wmb();
+		files->fd = new_fds;
+		smp_wmb();
+		files->max_fds = nfds;
+		
+		if (i) {
+			arg->array = old_fds;
+			arg->nfds = i;
+			call_rcu(&arg->rh, (void (*)(void *))fd_array_callback,
+				 &arg->rh);
+		} else 
+			kfree(arg);
 	} else {
 		/* Somebody expanded the array while we slept ... */
 		spin_unlock(&files->file_lock);
 		free_fd_array(new_fds, nfds);
+		kfree(arg);
 		spin_lock(&files->file_lock);
 	}
-	error = 0;
+
+	return 0;
 out:
+	if (new_fds)
+		free_fd_array(new_fds, nfds);
+	if (arg)
+		kfree(arg);
+
 	return error;
 }
 
@@ -153,6 +189,14 @@
 		vfree(array);
 }
 
+static void fd_set_callback (struct rcu_head *rcu)
+{
+	struct rcu_fd_set *a = container_of(rcu, struct rcu_fd_set, rh); 
+	free_fdset(a->openset, a->nfds);
+	free_fdset(a->execset, a->nfds);
+	kfree(a);
+}
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
@@ -161,6 +205,7 @@
 {
 	fd_set *new_openset = 0, *new_execset = 0;
 	int error, nfds = 0;
+	struct rcu_fd_set *arg = NULL;
 
 	error = -EMFILE;
 	if (files->max_fdset >= NR_OPEN || nr >= NR_OPEN)
@@ -183,35 +228,43 @@
 	error = -ENOMEM;
 	new_openset = alloc_fdset(nfds);
 	new_execset = alloc_fdset(nfds);
+	arg = (struct rcu_fd_set *) kmalloc(sizeof(*arg), GFP_ATOMIC);
 	spin_lock(&files->file_lock);
-	if (!new_openset || !new_execset)
+	if (!new_openset || !new_execset || !arg)
 		goto out;
 
 	error = 0;
 	
 	/* Copy the existing tables and install the new pointers */
 	if (nfds > files->max_fdset) {
-		int i = files->max_fdset / (sizeof(unsigned long) * 8);
-		int count = (nfds - files->max_fdset) / 8;
+		fd_set *old_openset = files->open_fds;
+		fd_set *old_execset = files->close_on_exec;
+		int old_nfds = files->max_fdset;
+		int i = old_nfds / (sizeof(unsigned long) * 8);
+		int count = (nfds - old_nfds) / 8;
 		
 		/* 
 		 * Don't copy the entire array if the current fdset is
 		 * not yet initialised.  
 		 */
 		if (i) {
-			memcpy (new_openset, files->open_fds, files->max_fdset/8);
-			memcpy (new_execset, files->close_on_exec, files->max_fdset/8);
+			memcpy (new_openset, old_openset, old_nfds/8);
+			memcpy (new_execset, old_execset, old_nfds/8);
 			memset (&new_openset->fds_bits[i], 0, count);
 			memset (&new_execset->fds_bits[i], 0, count);
 		}
 		
-		nfds = xchg(&files->max_fdset, nfds);
-		new_openset = xchg(&files->open_fds, new_openset);
-		new_execset = xchg(&files->close_on_exec, new_execset);
-		spin_unlock(&files->file_lock);
-		free_fdset (new_openset, nfds);
-		free_fdset (new_execset, nfds);
-		spin_lock(&files->file_lock);
+		smp_wmb();
+		files->open_fds =  new_openset;
+		files->close_on_exec = new_execset;
+		smp_wmb();
+		files->max_fdset = nfds;
+
+		arg->openset = old_openset;
+		arg->execset = old_execset;
+		arg->nfds = nfds;
+		call_rcu(&arg->rh, (void (*)(void *))fd_set_callback, &arg->rh);
+
 		return 0;
 	} 
 	/* Somebody expanded the array while we slept ... */
@@ -222,6 +275,8 @@
 		free_fdset(new_openset, nfds);
 	if (new_execset)
 		free_fdset(new_execset, nfds);
+	if (arg)
+		kfree(arg);
 	spin_lock(&files->file_lock);
 	return error;
 }
diff -ruN -X dontdiff linux-2.6.6/fs/file_table.c files_struct-2.6.6/fs/file_table.c
--- linux-2.6.6/fs/file_table.c	2004-05-10 08:01:59.000000000 +0530
+++ files_struct-2.6.6/fs/file_table.c	2004-05-14 14:05:22.000000000 +0530
@@ -14,6 +14,7 @@
 #include <linux/fs.h>
 #include <linux/security.h>
 #include <linux/eventpoll.h>
+#include <linux/rcupdate.h>
 #include <linux/mount.h>
 #include <linux/cdev.h>
 
@@ -54,11 +55,17 @@
 	spin_unlock_irqrestore(&filp_count_lock, flags);
 }
 
-static inline void file_free(struct file *f)
+static inline void file_free_rcu(void *arg)
 {
+	struct file *f =  arg;
 	kmem_cache_free(filp_cachep, f);
 }
 
+static inline void file_free(struct file *f)
+{
+	call_rcu(&f->f_rcuhead, file_free_rcu, f);
+}
+
 /* Find an unused file structure and return a pointer to it.
  * Returns NULL, if there are no more free file structures or
  * we run out of memory.
@@ -81,7 +88,7 @@
 				goto fail;
 			}
 			eventpoll_init_file(f);
-			atomic_set(&f->f_count, 1);
+			refcount_init(&f->f_count);
 			f->f_uid = current->fsuid;
 			f->f_gid = current->fsgid;
 			f->f_owner.lock = RW_LOCK_UNLOCKED;
@@ -118,7 +125,7 @@
 	eventpoll_init_file(filp);
 	filp->f_flags  = flags;
 	filp->f_mode   = (flags+1) & O_ACCMODE;
-	atomic_set(&filp->f_count, 1);
+	refcount_init(&filp->f_count);
 	filp->f_dentry = dentry;
 	filp->f_mapping = dentry->d_inode->i_mapping;
 	filp->f_uid    = current->fsuid;
@@ -154,7 +161,7 @@
 
 void fastcall fput(struct file *file)
 {
-	if (atomic_dec_and_test(&file->f_count))
+	if (refcount_put(&file->f_count))
 		__fput(file);
 }
 
@@ -197,11 +204,16 @@
 	struct file *file;
 	struct files_struct *files = current->files;
 
-	spin_lock(&files->file_lock);
-	file = fcheck_files(files, fd);
-	if (file)
-		get_file(file);
-	spin_unlock(&files->file_lock);
+	rcu_read_lock();
+	file = fcheck_files_rcu(files, fd);
+	if (file) {
+		if (!refcount_get_rcu(&file->f_count)) {
+			/* File object ref couldn't be taken */
+			rcu_read_unlock();
+			return NULL;
+		}
+	}
+	rcu_read_unlock();
 	return file;
 }
 
@@ -223,21 +235,25 @@
 	if (likely((atomic_read(&files->count) == 1))) {
 		file = fcheck_files(files, fd);
 	} else {
-		spin_lock(&files->file_lock);
-		file = fcheck_files(files, fd);
+		rcu_read_lock();
+		file = fcheck_files_rcu(files, fd);
 		if (file) {
-			get_file(file);
-			*fput_needed = 1;
+			if (refcount_get_rcu(&file->f_count))
+				*fput_needed = 1;
+			else
+				/* Didn't get the reference, someone's freed */
+				file = NULL;
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 	}
+
 	return file;
 }
 
 
 void put_filp(struct file *file)
 {
-	if (atomic_dec_and_test(&file->f_count)) {
+	if (refcount_put(&file->f_count)) {
 		security_file_free(file);
 		file_kill(file);
 		file_free(file);
diff -ruN -X dontdiff linux-2.6.6/fs/open.c files_struct-2.6.6/fs/open.c
--- linux-2.6.6/fs/open.c	2004-05-10 08:01:59.000000000 +0530
+++ files_struct-2.6.6/fs/open.c	2004-05-14 12:53:50.000000000 +0530
@@ -1028,7 +1028,9 @@
 	filp = files->fd[fd];
 	if (!filp)
 		goto out_unlock;
-	files->fd[fd] = NULL;
+	files->fd[fd] = NULL;	
+	/* Need to make it conistent with open_fds in __put_unused_fd() */
+	smp_wmb();
 	FD_CLR(fd, files->close_on_exec);
 	__put_unused_fd(files, fd);
 	spin_unlock(&files->file_lock);
diff -ruN -X dontdiff linux-2.6.6/fs/proc/base.c files_struct-2.6.6/fs/proc/base.c
--- linux-2.6.6/fs/proc/base.c	2004-05-10 08:02:52.000000000 +0530
+++ files_struct-2.6.6/fs/proc/base.c	2004-05-14 12:53:50.000000000 +0530
@@ -28,6 +28,7 @@
 #include <linux/namespace.h>
 #include <linux/mm.h>
 #include <linux/smp_lock.h>
+#include <linux/rcupdate.h>
 #include <linux/kallsyms.h>
 #include <linux/mount.h>
 #include <linux/security.h>
@@ -191,16 +192,16 @@
 
 	files = get_files_struct(task);
 	if (files) {
-		spin_lock(&files->file_lock);
-		file = fcheck_files(files, fd);
+		rcu_read_lock();
+		file = fcheck_files_rcu(files, fd);
 		if (file) {
 			*mnt = mntget(file->f_vfsmnt);
 			*dentry = dget(file->f_dentry);
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 			put_files_struct(files);
 			return 0;
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 		put_files_struct(files);
 	}
 	return -ENOENT;
@@ -805,15 +806,15 @@
 			files = get_files_struct(p);
 			if (!files)
 				goto out;
-			spin_lock(&files->file_lock);
+			rcu_read_lock();
 			for (fd = filp->f_pos-2;
 			     fd < files->max_fds;
 			     fd++, filp->f_pos++) {
 				unsigned int i,j;
 
-				if (!fcheck_files(files, fd))
+				if (!fcheck_files_rcu(files, fd))
 					continue;
-				spin_unlock(&files->file_lock);
+				rcu_read_unlock();
 
 				j = NUMBUF;
 				i = fd;
@@ -825,12 +826,12 @@
 
 				ino = fake_ino(tid, PROC_TID_FD_DIR + fd);
 				if (filldir(dirent, buf+j, NUMBUF-j, fd+2, ino, DT_LNK) < 0) {
-					spin_lock(&files->file_lock);
+					rcu_read_lock();
 					break;
 				}
-				spin_lock(&files->file_lock);
+				rcu_read_lock();
 			}
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 			put_files_struct(files);
 	}
 out:
@@ -1003,9 +1004,9 @@
 
 	files = get_files_struct(task);
 	if (files) {
-		spin_lock(&files->file_lock);
-		if (fcheck_files(files, fd)) {
-			spin_unlock(&files->file_lock);
+		rcu_read_lock();
+		if (fcheck_files_rcu(files, fd)) {
+			rcu_read_unlock();
 			put_files_struct(files);
 			if (task_dumpable(task)) {
 				inode->i_uid = task->euid;
@@ -1017,7 +1018,7 @@
 			security_task_to_inode(task, inode);
 			return 1;
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 		put_files_struct(files);
 	}
 	d_drop(dentry);
@@ -1109,15 +1110,15 @@
 	if (!files)
 		goto out_unlock;
 	inode->i_mode = S_IFLNK;
-	spin_lock(&files->file_lock);
-	file = fcheck_files(files, fd);
+	rcu_read_lock();
+	file = fcheck_files_rcu(files, fd);
 	if (!file)
 		goto out_unlock2;
 	if (file->f_mode & 1)
 		inode->i_mode |= S_IRUSR | S_IXUSR;
 	if (file->f_mode & 2)
 		inode->i_mode |= S_IWUSR | S_IXUSR;
-	spin_unlock(&files->file_lock);
+	rcu_read_unlock();
 	put_files_struct(files);
 	inode->i_op = &proc_pid_link_inode_operations;
 	inode->i_size = 64;
@@ -1127,7 +1128,7 @@
 	return NULL;
 
 out_unlock2:
-	spin_unlock(&files->file_lock);
+	rcu_read_unlock();
 	put_files_struct(files);
 out_unlock:
 	iput(inode);
diff -ruN -X dontdiff linux-2.6.6/fs/select.c files_struct-2.6.6/fs/select.c
--- linux-2.6.6/fs/select.c	2004-05-10 08:01:59.000000000 +0530
+++ files_struct-2.6.6/fs/select.c	2004-05-14 12:53:50.000000000 +0530
@@ -21,6 +21,7 @@
 #include <linux/personality.h> /* for STICKY_TIMEOUTS */
 #include <linux/file.h>
 #include <linux/fs.h>
+#include <linux/rcupdate.h>
 
 #include <asm/uaccess.h>
 
@@ -131,13 +132,16 @@
 static int max_select_fd(unsigned long n, fd_set_bits *fds)
 {
 	unsigned long *open_fds;
+	fd_set *open_fdset;
 	unsigned long set;
 	int max;
 
 	/* handle last in-complete long-word first */
 	set = ~(~0UL << (n & (__NFDBITS-1)));
 	n /= __NFDBITS;
-	open_fds = current->files->open_fds->fds_bits+n;
+	open_fdset = current->files->open_fds;
+	smp_read_barrier_depends();
+	open_fds = open_fdset->fds_bits+n;
 	max = 0;
 	if (set) {
 		set &= BITS(fds, n);
@@ -184,9 +188,9 @@
 	int retval, i;
 	long __timeout = *timeout;
 
- 	spin_lock(&current->files->file_lock);
+	rcu_read_lock();
 	retval = max_select_fd(n, fds);
-	spin_unlock(&current->files->file_lock);
+	rcu_read_unlock();
 
 	if (retval < 0)
 		return retval;
diff -ruN -X dontdiff linux-2.6.6/include/linux/file.h files_struct-2.6.6/include/linux/file.h
--- linux-2.6.6/include/linux/file.h	2004-05-10 08:03:22.000000000 +0530
+++ files_struct-2.6.6/include/linux/file.h	2004-05-14 12:53:50.000000000 +0530
@@ -9,6 +9,7 @@
 #include <linux/posix_types.h>
 #include <linux/compiler.h>
 #include <linux/spinlock.h>
+#include <linux/rcupdate.h>
 
 /*
  * The default fd array needs to be at least BITS_PER_LONG,
@@ -69,10 +70,26 @@
 	return file;
 }
 
+/* Need Proper read barriers if fd_array is being indexed lockfree */
+static inline struct file * fcheck_files_rcu(struct files_struct *files, unsigned int fd)
+{
+	struct file * file = NULL;
+
+	if (fd < files->max_fds) {
+		struct file ** fd_array;
+		smp_rmb();
+	       	fd_array = files->fd;
+		smp_read_barrier_depends();
+		file = fd_array[fd];
+	}
+	return file;
+}
+
 /*
  * Check whether the specified fd has an open file.
  */
 #define fcheck(fd)	fcheck_files(current->files, fd)
+#define fcheck_rcu(fd)	fcheck_files_rcu(current->files, fd)
 
 extern void FASTCALL(fd_install(unsigned int fd, struct file * file));
 
diff -ruN -X dontdiff linux-2.6.6/include/linux/fs.h files_struct-2.6.6/include/linux/fs.h
--- linux-2.6.6/include/linux/fs.h	2004-05-10 08:02:26.000000000 +0530
+++ files_struct-2.6.6/include/linux/fs.h	2004-05-14 12:53:50.000000000 +0530
@@ -19,6 +19,7 @@
 #include <linux/cache.h>
 #include <linux/radix-tree.h>
 #include <linux/kobject.h>
+#include <linux/refcount.h>
 #include <asm/atomic.h>
 #include <linux/audit.h>
 
@@ -555,7 +556,7 @@
 	struct dentry		*f_dentry;
 	struct vfsmount         *f_vfsmnt;
 	struct file_operations	*f_op;
-	atomic_t		f_count;
+	refcount_t		f_count;
 	unsigned int 		f_flags;
 	mode_t			f_mode;
 	loff_t			f_pos;
@@ -576,13 +577,14 @@
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
+	struct rcu_head		f_rcuhead;
 };
 extern spinlock_t files_lock;
 #define file_list_lock() spin_lock(&files_lock);
 #define file_list_unlock() spin_unlock(&files_lock);
 
-#define get_file(x)	atomic_inc(&(x)->f_count)
-#define file_count(x)	atomic_read(&(x)->f_count)
+#define get_file(x)	refcount_get(&(x)->f_count)
+#define file_count(x)	refcount_read(&(x)->f_count)
 
 /* Initialize and open a private file and allocate its security structure. */
 extern int open_private_file(struct file *, struct dentry *, int);
diff -ruN -X dontdiff linux-2.6.6/kernel/fork.c files_struct-2.6.6/kernel/fork.c
--- linux-2.6.6/kernel/fork.c	2004-05-10 08:02:00.000000000 +0530
+++ files_struct-2.6.6/kernel/fork.c	2004-05-14 12:53:50.000000000 +0530
@@ -30,6 +30,7 @@
 #include <linux/syscalls.h>
 #include <linux/jiffies.h>
 #include <linux/futex.h>
+#include <linux/rcupdate.h>
 #include <linux/ptrace.h>
 #include <linux/mount.h>
 #include <linux/audit.h>
@@ -627,10 +628,12 @@
 static int count_open_files(struct files_struct *files, int size)
 {
 	int i;
+	fd_set *open_fds = files->open_fds;
 
+	smp_read_barrier_depends();
 	/* Find the last open fd */
 	for (i = size/(8*sizeof(long)); i > 0; ) {
-		if (files->open_fds->fds_bits[--i])
+		if (open_fds->fds_bits[--i])
 			break;
 	}
 	i = (i+1) * 8 * sizeof(long);
@@ -660,6 +663,14 @@
 	 * This works because we cache current->files (old) as oldf. Don't
 	 * break this.
 	 */
+	
+	/* 
+	 * We don't have oldf readlock, but even if the old fdset gets
+	 * grown now, we'll only copy upto "size" fds 
+	 */
+	size = oldf->max_fdset;	
+	smp_rmb();
+
 	tsk->files = NULL;
 	error = -ENOMEM;
 	newf = kmem_cache_alloc(files_cachep, SLAB_KERNEL);
@@ -676,9 +687,6 @@
 	newf->open_fds	    = &newf->open_fds_init;
 	newf->fd	    = &newf->fd_array[0];
 
-	/* We don't yet have the oldf readlock, but even if the old
-           fdset gets grown now, we'll only copy up to "size" fds */
-	size = oldf->max_fdset;
 	if (size > __FD_SETSIZE) {
 		newf->max_fdset = 0;
 		spin_lock(&newf->file_lock);
@@ -687,7 +695,7 @@
 		if (error)
 			goto out_release;
 	}
-	spin_lock(&oldf->file_lock);
+	rcu_read_lock();
 
 	open_files = count_open_files(oldf, size);
 
@@ -698,7 +706,7 @@
 	 */
 	nfds = NR_OPEN_DEFAULT;
 	if (open_files > nfds) {
-		spin_unlock(&oldf->file_lock);
+		rcu_read_unlock();
 		newf->max_fds = 0;
 		spin_lock(&newf->file_lock);
 		error = expand_fd_array(newf, open_files-1);
@@ -706,10 +714,11 @@
 		if (error) 
 			goto out_release;
 		nfds = newf->max_fds;
-		spin_lock(&oldf->file_lock);
+		rcu_read_lock();
 	}
 
 	old_fds = oldf->fd;
+	smp_read_barrier_depends();
 	new_fds = newf->fd;
 
 	memcpy(newf->open_fds->fds_bits, oldf->open_fds->fds_bits, open_files/8);
@@ -721,7 +730,7 @@
 			get_file(f);
 		*new_fds++ = f;
 	}
-	spin_unlock(&oldf->file_lock);
+	rcu_read_unlock();
 
 	/* compute the remainder to be cleared */
 	size = (newf->max_fds - open_files) * sizeof(struct file *);
diff -ruN -X dontdiff linux-2.6.6/net/ipv4/netfilter/ipt_owner.c files_struct-2.6.6/net/ipv4/netfilter/ipt_owner.c
--- linux-2.6.6/net/ipv4/netfilter/ipt_owner.c	2004-05-10 08:03:20.000000000 +0530
+++ files_struct-2.6.6/net/ipv4/netfilter/ipt_owner.c	2004-05-14 12:53:50.000000000 +0530
@@ -11,6 +11,7 @@
 #include <linux/module.h>
 #include <linux/skbuff.h>
 #include <linux/file.h>
+#include <linux/rcupdate.h>
 #include <net/sock.h>
 
 #include <linux/netfilter_ipv4/ipt_owner.h>
@@ -35,17 +36,17 @@
 		task_lock(p);
 		files = p->files;
 		if(files) {
-			spin_lock(&files->file_lock);
+			rcu_read_lock();
 			for (i=0; i < files->max_fds; i++) {
-				if (fcheck_files(files, i) ==
+				if (fcheck_files_rcu(files, i) ==
 				    skb->sk->sk_socket->file) {
-					spin_unlock(&files->file_lock);
+					rcu_read_unlock();
 					task_unlock(p);
 					read_unlock(&tasklist_lock);
 					return 1;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 		}
 		task_unlock(p);
 	} while_each_thread(g, p);
@@ -67,17 +68,17 @@
 	task_lock(p);
 	files = p->files;
 	if(files) {
-		spin_lock(&files->file_lock);
+		rcu_read_lock();
 		for (i=0; i < files->max_fds; i++) {
-			if (fcheck_files(files, i) ==
+			if (fcheck_files_rcu(files, i) ==
 			    skb->sk->sk_socket->file) {
-				spin_unlock(&files->file_lock);
+				rcu_read_unlock();
 				task_unlock(p);
 				read_unlock(&tasklist_lock);
 				return 1;
 			}
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 	}
 	task_unlock(p);
 out:
@@ -101,14 +102,14 @@
 		task_lock(p);
 		files = p->files;
 		if (files) {
-			spin_lock(&files->file_lock);
+			rcu_read_lock();
 			for (i=0; i < files->max_fds; i++) {
-				if (fcheck_files(files, i) == file) {
+				if (fcheck_files_rcu(files, i) == file) {
 					found = 1;
 					break;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 		}
 		task_unlock(p);
 		if (found)
diff -ruN -X dontdiff linux-2.6.6/net/ipv6/netfilter/ip6t_owner.c files_struct-2.6.6/net/ipv6/netfilter/ip6t_owner.c
--- linux-2.6.6/net/ipv6/netfilter/ip6t_owner.c	2004-05-10 08:03:13.000000000 +0530
+++ files_struct-2.6.6/net/ipv6/netfilter/ip6t_owner.c	2004-05-14 12:53:50.000000000 +0530
@@ -11,6 +11,7 @@
 #include <linux/module.h>
 #include <linux/skbuff.h>
 #include <linux/file.h>
+#include <linux/rcupdate.h>
 #include <net/sock.h>
 
 #include <linux/netfilter_ipv6/ip6t_owner.h>
@@ -34,16 +35,16 @@
 	task_lock(p);
 	files = p->files;
 	if(files) {
-		spin_lock(&files->file_lock);
+		rcu_read_lock();
 		for (i=0; i < files->max_fds; i++) {
-			if (fcheck_files(files, i) == skb->sk->sk_socket->file) {
-				spin_unlock(&files->file_lock);
+			if (fcheck_files_rcu(files, i) == skb->sk->sk_socket->file) {
+				rcu_read_unlock();
 				task_unlock(p);
 				read_unlock(&tasklist_lock);
 				return 1;
 			}
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 	}
 	task_unlock(p);
 out:
@@ -67,14 +68,14 @@
 		task_lock(p);
 		files = p->files;
 		if (files) {
-			spin_lock(&files->file_lock);
+			rcu_read_lock();
 			for (i=0; i < files->max_fds; i++) {
-				if (fcheck_files(files, i) == file) {
+				if (fcheck_files_rcu(files, i) == file) {
 					found = 1;
 					break;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 		}
 		task_unlock(p);
 		if (found)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-14  4:56 ` [RFC] Lock free fd lookup Ravikiran G Thirumalai
@ 2004-07-14 15:17   ` Chris Wright
  2004-07-15 14:22     ` Jesse Barnes
  2004-07-17 13:48     ` Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: Chris Wright @ 2004-07-14 15:17 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: linux-kernel, dipankar

* Ravikiran G Thirumalai (kiran@in.ibm.com) wrote:
> This makes use of the lockfree refcounting infrastructure (see earlier
> posting today) to make files_struct.fd[] lookup lockfree. This is carrying 
> forward work done by Maneesh and Dipankar earlier. 
> 
> With the lockfree fd lookup patch, tiobench performance increases by 13% 
> for sequential reads, 21 % for random reads on a 4 processor pIII xeon.

I'm curious, how much of the performance improvement is from RCU usage
vs. making the basic syncronization primitive aware of a reader and
writer distinction?  Do you have benchmark for simply moving to rwlock_t?

thanks,
-chris
-- 
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-14 15:17   ` Chris Wright
@ 2004-07-15 14:22     ` Jesse Barnes
  2004-07-15 16:10       ` Dipankar Sarma
  2004-07-16  6:27       ` William Lee Irwin III
  2004-07-17 13:48     ` Peter Zijlstra
  1 sibling, 2 replies; 19+ messages in thread
From: Jesse Barnes @ 2004-07-15 14:22 UTC (permalink / raw)
  To: Chris Wright; +Cc: Ravikiran G Thirumalai, linux-kernel, dipankar

On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote:
> * Ravikiran G Thirumalai (kiran@in.ibm.com) wrote:
> > This makes use of the lockfree refcounting infrastructure (see earlier
> > posting today) to make files_struct.fd[] lookup lockfree. This is
> > carrying forward work done by Maneesh and Dipankar earlier.
> >
> > With the lockfree fd lookup patch, tiobench performance increases by 13%
> > for sequential reads, 21 % for random reads on a 4 processor pIII xeon.
>
> I'm curious, how much of the performance improvement is from RCU usage
> vs. making the basic syncronization primitive aware of a reader and
> writer distinction?  Do you have benchmark for simply moving to rwlock_t?

That's a good point.  Also, even though the implementation may be 'lockless', 
there are still a lot of cachelines bouncing around, whether due to atomic 
counters or cmpxchg (in fact the latter will be worse than simple atomics).

It seems to me that RCU is basically rwlocks on steroids, which means that 
using it requires the same care to avoid starvation and/or other scalability 
problems (i.e. we'd better be really sure that a given codepath really should 
be using rwlocks before we change it).

Jesse

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-15 14:22     ` Jesse Barnes
@ 2004-07-15 16:10       ` Dipankar Sarma
  2004-07-15 16:22         ` Jesse Barnes
  2004-07-15 16:34         ` Chris Wright
  2004-07-16  6:27       ` William Lee Irwin III
  1 sibling, 2 replies; 19+ messages in thread
From: Dipankar Sarma @ 2004-07-15 16:10 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: Chris Wright, Ravikiran G Thirumalai, linux-kernel

On Thu, Jul 15, 2004 at 10:22:53AM -0400, Jesse Barnes wrote:
> On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote:
> > I'm curious, how much of the performance improvement is from RCU usage
> > vs. making the basic syncronization primitive aware of a reader and
> > writer distinction?  Do you have benchmark for simply moving to rwlock_t?
> 
> That's a good point.  Also, even though the implementation may be 'lockless', 
> there are still a lot of cachelines bouncing around, whether due to atomic 
> counters or cmpxchg (in fact the latter will be worse than simple atomics).

Chris raises an interesting issue. There are two ways we can benefit from
lock-free lookup - avoidance of atomic ops in lock acquisition/release
and avoidance of contention. The latter can also be provided by
rwlocks in read-mostly situations like this, but rwlock still has
two atomic ops for acquisition/release. So, in another
thread, I have suggested looking into the contention angle. IIUC,
tiobench is threaded and shares fd table. 

That said, atomic counters weren't introduced in this patch,
they are already there for refcounting. cmpxchg is costly,
but if you are replacing read_lock/atomic_inc/read_unlock,
lock-free + cmpxchg, it might not be all that bad. Atleast,
we can benchmark it and see if it is worth it. And in heavily
contended cases, unlike rwlocks, you are not going to have
starvation.

> 
> It seems to me that RCU is basically rwlocks on steroids, which means that 
> using it requires the same care to avoid starvation and/or other scalability 
> problems (i.e. we'd better be really sure that a given codepath really should 
> be using rwlocks before we change it).

The starvation is a problem with rwlocks in linux, not RCU. The
reader's do not impede writers at all with RCU. There are other
issues with RCU that one needs to be careful about, but certainly
not this one.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-15 16:10       ` Dipankar Sarma
@ 2004-07-15 16:22         ` Jesse Barnes
  2004-07-15 16:34         ` Chris Wright
  1 sibling, 0 replies; 19+ messages in thread
From: Jesse Barnes @ 2004-07-15 16:22 UTC (permalink / raw)
  To: dipankar; +Cc: Chris Wright, Ravikiran G Thirumalai, linux-kernel

On Thursday, July 15, 2004 12:10 pm, Dipankar Sarma wrote:
> Chris raises an interesting issue. There are two ways we can benefit from
> lock-free lookup - avoidance of atomic ops in lock acquisition/release
> and avoidance of contention. The latter can also be provided by
> rwlocks in read-mostly situations like this, but rwlock still has
> two atomic ops for acquisition/release. So, in another
> thread, I have suggested looking into the contention angle. IIUC,
> tiobench is threaded and shares fd table.

I must have missed that thread...  Anyway, that's a good idea.

>
> That said, atomic counters weren't introduced in this patch,
> they are already there for refcounting. cmpxchg is costly,
> but if you are replacing read_lock/atomic_inc/read_unlock,
> lock-free + cmpxchg, it might not be all that bad.

Yeah, I didn't mean to imply that atomics were unique to this patch.

> Atleast, 
> we can benchmark it and see if it is worth it. And in heavily
> contended cases, unlike rwlocks, you are not going to have
> starvation.

Which is good.

> > It seems to me that RCU is basically rwlocks on steroids, which means
> > that using it requires the same care to avoid starvation and/or other
> > scalability problems (i.e. we'd better be really sure that a given
> > codepath really should be using rwlocks before we change it).
>
> The starvation is a problem with rwlocks in linux, not RCU. The
> reader's do not impede writers at all with RCU. There are other
> issues with RCU that one needs to be careful about, but certainly
> not this one.

That's good, I didn't think that RCU would cause starvation, but based on 
previous reading of the code it seemed like it would hurt a lot in other 
ways... but I'm definitely not an expert in that area.

Thanks,
Jesse

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-15 16:10       ` Dipankar Sarma
  2004-07-15 16:22         ` Jesse Barnes
@ 2004-07-15 16:34         ` Chris Wright
  2004-07-16  5:38           ` Ravikiran G Thirumalai
  1 sibling, 1 reply; 19+ messages in thread
From: Chris Wright @ 2004-07-15 16:34 UTC (permalink / raw)
  To: Dipankar Sarma
  Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel

* Dipankar Sarma (dipankar@in.ibm.com) wrote:
> On Thu, Jul 15, 2004 at 10:22:53AM -0400, Jesse Barnes wrote:
> > On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote:
> > > I'm curious, how much of the performance improvement is from RCU usage
> > > vs. making the basic syncronization primitive aware of a reader and
> > > writer distinction?  Do you have benchmark for simply moving to rwlock_t?
> > 
> > That's a good point.  Also, even though the implementation may be 'lockless', 
> > there are still a lot of cachelines bouncing around, whether due to atomic 
> > counters or cmpxchg (in fact the latter will be worse than simple atomics).
> 
> Chris raises an interesting issue. There are two ways we can benefit from
> lock-free lookup - avoidance of atomic ops in lock acquisition/release
> and avoidance of contention. The latter can also be provided by
> rwlocks in read-mostly situations like this, but rwlock still has
> two atomic ops for acquisition/release. So, in another
> thread, I have suggested looking into the contention angle. IIUC,
> tiobench is threaded and shares fd table. 

Given the read heavy assumption that RCU makes (supported by your
benchmarks), I believe that the comparison with RCU vs. current scheme
is unfair.  Better comparison is against rwlock_t, which may give a
similar improvement w/out the added complexity.  But, I haven't a patch
nor a benchmark, so it's all handwavy at this point.

thanks,
-chris
-- 
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-15 16:34         ` Chris Wright
@ 2004-07-16  5:38           ` Ravikiran G Thirumalai
  0 siblings, 0 replies; 19+ messages in thread
From: Ravikiran G Thirumalai @ 2004-07-16  5:38 UTC (permalink / raw)
  To: Chris Wright; +Cc: Dipankar Sarma, Jesse Barnes, linux-kernel

On Thu, Jul 15, 2004 at 09:34:08AM -0700, Chris Wright wrote:
> * Dipankar Sarma (dipankar@in.ibm.com) wrote:
> > On Thu, Jul 15, 2004 at 10:22:53AM -0400, Jesse Barnes wrote:
> > > On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote:
> > ... 
> > Chris raises an interesting issue. There are two ways we can benefit from
> > lock-free lookup - avoidance of atomic ops in lock acquisition/release
> > and avoidance of contention. The latter can also be provided by
> > rwlocks in read-mostly situations like this, but rwlock still has
> > two atomic ops for acquisition/release. So, in another
> > thread, I have suggested looking into the contention angle. IIUC,
> > tiobench is threaded and shares fd table. 
> 
> Given the read heavy assumption that RCU makes (supported by your
> benchmarks), I believe that the comparison with RCU vs. current scheme
> is unfair.  Better comparison is against rwlock_t, which may give a
> similar improvement w/out the added complexity.  But, I haven't a patch
> nor a benchmark, so it's all handwavy at this point.

It would be a good datapoint to experiment with rwlock.
But note that on x86 (my testbed right now)
1. read_lock + read_unlock + atomic_inc will be 3 (bus locking) atomic ops
2. spin_lock + spin_unlock + atomic_inc will be 2 atomic ops 
   (x86 spin_unlock is just a move)
3. rcu_read_lock, rcu_read_unlock + cmpxchg is just one atomic op,
added to it the cs is small in fget, fget_light....

IIRC, the files_struct.file_lock was a rwlock sometime back.  
(it still is in 2.4 i think) I am not sure why it was changed to spinlocks.  
I will try to dig through the archives, but if someone can quickly 
fill in that would be nice.

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-15 14:22     ` Jesse Barnes
  2004-07-15 16:10       ` Dipankar Sarma
@ 2004-07-16  6:27       ` William Lee Irwin III
  2004-07-17  0:55         ` Keith Owens
  1 sibling, 1 reply; 19+ messages in thread
From: William Lee Irwin III @ 2004-07-16  6:27 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: Chris Wright, Ravikiran G Thirumalai, linux-kernel, dipankar

On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote:
>> I'm curious, how much of the performance improvement is from RCU usage
>> vs. making the basic syncronization primitive aware of a reader and
>> writer distinction?  Do you have benchmark for simply moving to rwlock_t?

On Thu, Jul 15, 2004 at 10:22:53AM -0400, Jesse Barnes wrote:
> That's a good point.  Also, even though the implementation may be 'lockless', 
> there are still a lot of cachelines bouncing around, whether due to atomic 
> counters or cmpxchg (in fact the latter will be worse than simple atomics).
> It seems to me that RCU is basically rwlocks on steroids, which means that 
> using it requires the same care to avoid starvation and/or other scalability 
> problems (i.e. we'd better be really sure that a given codepath really should 
> be using rwlocks before we change it).

Primarily not for either of your benefit(s):

files->lock actually started as an rwlock and was changed to a spinlock
on the grounds that cmpxchg for rwlocks sucked on Pentium 4.

Typical rwlock implementations don't have such starvation issues.
Normally attempts at acquiring them for write exclude new readers with
various forms of bias. Linux has come to rely on an unfair
implementation by means of recursion, both directly (I hear somewhere
in the net stack and haven't looked) and via recursive acquisition in
interrupt context (for signal delivery). The unfairness does cause
problems in practice, e.g. writers starving on tasklist_lock takes out
machines running large numbers of processes.

RCU does not have such fairness issues. While it is heavily read
biased, readers do not exclude writers. Things can, however, delay
freeing of memory if tasks don't voluntarily yield for long periods
of time (e.g. involuntary context switches caused by kernel preemption)
as the freeing of the data structures potentially referenced by readers
must be delayed until all cpus have gone through a voluntary context
switch. This generally resolves itself as waiting for memory involves a
voluntary context switch. Readers merely disable preemption, and only
writers require atomicity or mutual exclusion, which is more typical in
Linux in part owing to preferred data structures. I suspect this is in
part due to its initial usage for doubly-linked lists that are never
traversed backward. RCU is actually a lockless update protocol and can
be done with atomic updates on the write side as opposed to spinlocks
around the updates. This happily doesn't involve busywait on machines
able to implement such things directly, but also depends heavily on the
data structure.

For instance, an open-addressed hash table or a 4-way associative cache
could simply atomically update the pointers to the elements. To see why
things get much harder when you try to go completely lockfree with less
suitable data structures, a singly-linked list would require a cmpxchg
loop inside a retry loop, temporarily marking next pointers with some
invalid value that signals accessors to retry the read operation, and
some external method of preventing simultaneous removals of the same
element (e.g. refcounting), as the next pointer of the predecessor must
be what was expected during removal, the removed element's next pointer
remain be what was expected, and the updated predecessor must be what
was discovered in the list, and more still for cpus not supporting
cmpxchg or similar operations, though that's probably not an entirely
complete or accurate description of the headaches involved when the
data structure doesn't mesh well with lock-free atomic updates. The
gist of all this is that busywait-free atomic updates are only
implementable for data structures that don't link through the object
but rather maintain an external index with a single pointer to elements
needing updates, like radix trees, B+ trees, arrays of pointers, and
open-addressed hashtables. Otherwise, spinlocks on the write side are
probably better than the completely lock-free alternative for smaller
systems, though larger systems hate sharing the cachelines.

-- wli

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-16  6:27       ` William Lee Irwin III
@ 2004-07-17  0:55         ` Keith Owens
  2004-07-17  1:19           ` William Lee Irwin III
  0 siblings, 1 reply; 19+ messages in thread
From: Keith Owens @ 2004-07-17  0:55 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel,
	dipankar

On Thu, 15 Jul 2004 23:27:32 -0700, 
William Lee Irwin III <wli@holomorphy.com> wrote:
>The gist of all this is that busywait-free atomic updates are only
>implementable for data structures that don't link through the object
>but rather maintain an external index with a single pointer to elements
>needing updates, like radix trees, B+ trees, arrays of pointers, and
>open-addressed hashtables.

There are algorithms for busywait-free (lock free) traversal and
concurrent update of lists and structures that contain embedded
pointers.  I once had to implement one on an O/S where the user space
to kernel wait/alarm semantics could not satisfy the timing
requirements of the calling code (don't ask).

The beauty of these lockfree algorithms on large structures is that
nothing ever stalls indefinitely.  If the underlying SMP cache hardware
is fair, everything running a lockfree list will make progress.  These
algorithms do not suffer from reader vs. writer starvation.

The downside is that the algorithms rely on an extra sequence field per
word.  They copy out the structure, modify the local copy, write back
with an update of sequence field.  Writing back detects if any other
update has invalidated the current structure, at which point the second
update is discarded and the writer redrives its update.  Readers are
guaranteed to see a consistent view of the copy, even if the master
structure is being updated at the same time as it is being read.

Bottom line, it can be done but ...

* The structure size doubles with the addition of the sequence fields.
* The hardware must support cmpxchg on the combination of the sequence
  field and the data word that it protects.  LL/SC can be used instead
  of cmpxchg if the hardware supports LL/SC.
* If the hardware only supports cmpxchg4 then the sequence field is
  restricted to 2 bytes, which increases the risk of wraparound.  If an
  update is delayed for exactly the wraparound interval then the data
  may be corrupted, that is a standard risk of small sequence fields.
* The extra copy out just to read a structure needs more memory
  bandwidth, plus local allocation for the copy.
* The internal code of the API to traverse a list containing lockfree
  structures is pretty messy, although that is hidden from the caller.
* All writers must preserve enough state to redrive their update from
  scratch if a concurrent update has changed the same structure.  Note,
  only the curent structure, not the rest of the list.
* It requires type stable storage.  That is, once a data area has been
  allocated to a particular structure type, it always contains that
  structure type, even when it has been freed from the list.  Each list
  requires its own free pool, which can never be returned to the OS.

Nice research topics, but not suitable for day to day work.  I only
used the lockfree algorithms because I could not find an alternative on
that particular OS.  I am not sure that the Linux kernel has any
problems that require the additional complexity.

For some light reading, the code I implemented was based on Practical
Implementations of Non-Blocking Synchronization Primitives by Mark Moir
http://research.sun.com/scalable/Papers/moir-podc97.ps.  Note that page
6, line 6 has an error.  Replace "y := z" with "y := addr->data[i];".

I implemented a completely lockfree 2-3-4 tree with embedded pointers
using Moir's algorithm.  The implementation allowed multiple concurrent
readers and writers.  It even allowed concurrent list insert, delete
and rebalancing, while the structures on the were being read and
updated.

2-3-4 trees are self balancing, which gave decent lookup performance.
Since red-black trees are logically equivalent to 2-3-4 trees it should
be possible to use lockfree red-black trees.  However I could not come
up with a lockfree red-black tree, mainly because an read-black insert
requires atomic changes to multiple structures.  The 2-3-4 tree only
needs atomic update to one structure at a time.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-17  0:55         ` Keith Owens
@ 2004-07-17  1:19           ` William Lee Irwin III
  2004-07-17  2:12             ` Keith Owens
  2004-07-17  2:28             ` Keith Owens
  0 siblings, 2 replies; 19+ messages in thread
From: William Lee Irwin III @ 2004-07-17  1:19 UTC (permalink / raw)
  To: Keith Owens
  Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel,
	dipankar

On Thu, 15 Jul 2004 23:27:32 -0700, William Lee Irwin III wrote:
>> The gist of all this is that busywait-free atomic updates are only
>> implementable for data structures that don't link through the object
>> but rather maintain an external index with a single pointer to elements
>> needing updates, like radix trees, B+ trees, arrays of pointers, and
>> open-addressed hashtables.

On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
> There are algorithms for busywait-free (lock free) traversal and
> concurrent update of lists and structures that contain embedded
> pointers.  I once had to implement one on an O/S where the user space
> to kernel wait/alarm semantics could not satisfy the timing
> requirements of the calling code (don't ask).

Yes. I had in mind only RCU-based algorithms. Throwing more machinery at
the problem may get further.


On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
> The beauty of these lockfree algorithms on large structures is that
> nothing ever stalls indefinitely.  If the underlying SMP cache hardware
> is fair, everything running a lockfree list will make progress.  These
> algorithms do not suffer from reader vs. writer starvation.

That's a large assumption. NUMA hardware typically violates it.


On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
> The downside is that the algorithms rely on an extra sequence field per
> word.  They copy out the structure, modify the local copy, write back
> with an update of sequence field.  Writing back detects if any other
> update has invalidated the current structure, at which point the second
> update is discarded and the writer redrives its update.  Readers are
> guaranteed to see a consistent view of the copy, even if the master
> structure is being updated at the same time as it is being read.

I would classify this as "ticket-based".


On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
> Bottom line, it can be done but ...
> * The structure size doubles with the addition of the sequence fields.
> * The hardware must support cmpxchg on the combination of the sequence
>   field and the data word that it protects.  LL/SC can be used instead
>   of cmpxchg if the hardware supports LL/SC.
> * If the hardware only supports cmpxchg4 then the sequence field is
>   restricted to 2 bytes, which increases the risk of wraparound.  If an
>   update is delayed for exactly the wraparound interval then the data
>   may be corrupted, that is a standard risk of small sequence fields.
> * The extra copy out just to read a structure needs more memory
>   bandwidth, plus local allocation for the copy.
> * The internal code of the API to traverse a list containing lockfree
>   structures is pretty messy, although that is hidden from the caller.
> * All writers must preserve enough state to redrive their update from
>   scratch if a concurrent update has changed the same structure.  Note,
>   only the curent structure, not the rest of the list.
> * It requires type stable storage.  That is, once a data area has been
>   allocated to a particular structure type, it always contains that
>   structure type, even when it has been freed from the list.  Each list
>   requires its own free pool, which can never be returned to the OS.

The last of these is particularly lethal.


On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
> Nice research topics, but not suitable for day to day work.  I only
> used the lockfree algorithms because I could not find an alternative on
> that particular OS.  I am not sure that the Linux kernel has any
> problems that require the additional complexity.
> For some light reading, the code I implemented was based on Practical
> Implementations of Non-Blocking Synchronization Primitives by Mark Moir
> http://research.sun.com/scalable/Papers/moir-podc97.ps.  Note that page
> 6, line 6 has an error.  Replace "y := z" with "y := addr->data[i];".
> I implemented a completely lockfree 2-3-4 tree with embedded pointers
> using Moir's algorithm.  The implementation allowed multiple concurrent
> readers and writers.  It even allowed concurrent list insert, delete
> and rebalancing, while the structures on the were being read and
> updated.

Thanks. That may come in handy later.

On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
> 2-3-4 trees are self balancing, which gave decent lookup performance.
> Since red-black trees are logically equivalent to 2-3-4 trees it should
> be possible to use lockfree red-black trees.  However I could not come
> up with a lockfree red-black tree, mainly because an read-black insert
> requires atomic changes to multiple structures.  The 2-3-4 tree only
> needs atomic update to one structure at a time.

This actually appears to confirm my earlier assertion about the linkage
of the data structure. Is this conclusion what you had in mind?


-- wli

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-17  1:19           ` William Lee Irwin III
@ 2004-07-17  2:12             ` Keith Owens
  2004-07-17  2:34               ` William Lee Irwin III
  2004-07-17  2:28             ` Keith Owens
  1 sibling, 1 reply; 19+ messages in thread
From: Keith Owens @ 2004-07-17  2:12 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel,
	dipankar

On Fri, 16 Jul 2004 18:19:36 -0700, 
William Lee Irwin III <wli@holomorphy.com> wrote:
>On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
>> 2-3-4 trees are self balancing, which gave decent lookup performance.
>> Since red-black trees are logically equivalent to 2-3-4 trees it should
>> be possible to use lockfree red-black trees.  However I could not come
>> up with a lockfree red-black tree, mainly because an read-black insert
>> requires atomic changes to multiple structures.  The 2-3-4 tree only
>> needs atomic update to one structure at a time.
>
>This actually appears to confirm my earlier assertion about the linkage
>of the data structure. Is this conclusion what you had in mind?

Not quite.  The 2-3-4 tree has embedded linkage, but it can be done
lockfree if you really have to.  The problem is that a single 2-3-4
list entry maps to two red-black list entries.  I could atomically
update a single 2-3-4 list entry, including its pointers, even when the
list was being read or updated by other users.  I could not work out
how to do the equivalent update when the list linkage data was split
over two red-black nodes.

The list structure is an implementation detail, the use of 2-3-4 or
red-black is completely transparent to the main code.  The main code
wants to lookup a structure from the list, to update a structure, to
insert or to delete a structure without waiting.  How the list of
structures is maintained is a problem for the internals of the API.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-17  2:12             ` Keith Owens
@ 2004-07-17  2:34               ` William Lee Irwin III
  0 siblings, 0 replies; 19+ messages in thread
From: William Lee Irwin III @ 2004-07-17  2:34 UTC (permalink / raw)
  To: Keith Owens
  Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel,
	dipankar

On Fri, 16 Jul 2004 18:19:36 -0700, William Lee Irwin III wrote:
>> This actually appears to confirm my earlier assertion about the linkage
>> of the data structure. Is this conclusion what you had in mind?

On Sat, Jul 17, 2004 at 12:12:39PM +1000, Keith Owens wrote:
> Not quite.  The 2-3-4 tree has embedded linkage, but it can be done
> lockfree if you really have to.  The problem is that a single 2-3-4
> list entry maps to two red-black list entries.  I could atomically
> update a single 2-3-4 list entry, including its pointers, even when the
> list was being read or updated by other users.  I could not work out
> how to do the equivalent update when the list linkage data was split
> over two red-black nodes.

2-3 trees have external linkage just like B/B+ trees as I had
envisioned what external linkage is. Terminological issue I guess.


On Sat, Jul 17, 2004 at 12:12:39PM +1000, Keith Owens wrote:
> The list structure is an implementation detail, the use of 2-3-4 or
> red-black is completely transparent to the main code.  The main code
> wants to lookup a structure from the list, to update a structure, to
> insert or to delete a structure without waiting.  How the list of
> structures is maintained is a problem for the internals of the API.

That kind of genericity is tough to come by in C. I guess callbacks and
void * can do it when pressed.


-- wli

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-17  1:19           ` William Lee Irwin III
  2004-07-17  2:12             ` Keith Owens
@ 2004-07-17  2:28             ` Keith Owens
  2004-07-17  3:16               ` William Lee Irwin III
  1 sibling, 1 reply; 19+ messages in thread
From: Keith Owens @ 2004-07-17  2:28 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel,
	dipankar

On Fri, 16 Jul 2004 18:19:36 -0700, 
William Lee Irwin III <wli@holomorphy.com> wrote:
>On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
>> The beauty of these lockfree algorithms on large structures is that
>> nothing ever stalls indefinitely.  If the underlying SMP cache hardware
>> is fair, everything running a lockfree list will make progress.  These
>> algorithms do not suffer from reader vs. writer starvation.
>
>That's a large assumption. NUMA hardware typically violates it.

True, which is why I mentioned it.  However I suspect that you read
something into that paragraph which was not intended.

Just reading the lockfree list and the structures on the list does not
suffer from any NUMA problems, because reading does not perform any
global updates at all.  The SMP starvation problem only kicks in when
multiple concurrent updates are being done.  Even with multiple
writers, one of the writers is guaranteed to succeed every time, so
over time all the write operations will proceed, subject to fair access
to exclusive cache lines.

Lockfree reads with Moir's algorithms require extra memory bandwidth.
In the absence of updates, all the cache lines end up in shared state.
That reduces to local memory bandwidth for the (hopefully) common case
of lots of readers and few writers.  Lockfree code is nicely suited to
the same class of problem that RCU addresses, but without the reader
vs. writer starvation problems.

Writer vs. writer starvation on NUMA is a lot harder.  I don't know of
any algorithm that handles lists with lots of concurrent updates and
also scales well on large cpus, unless the underlying hardware is fair
in its handling of exclusive cache lines.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-17  2:28             ` Keith Owens
@ 2004-07-17  3:16               ` William Lee Irwin III
  0 siblings, 0 replies; 19+ messages in thread
From: William Lee Irwin III @ 2004-07-17  3:16 UTC (permalink / raw)
  To: Keith Owens
  Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel,
	dipankar

On Fri, 16 Jul 2004 18:19:36 -0700, William Lee Irwin III wrote:
>> That's a large assumption. NUMA hardware typically violates it.

On Sat, Jul 17, 2004 at 12:28:54PM +1000, Keith Owens wrote:
> True, which is why I mentioned it.  However I suspect that you read
> something into that paragraph which was not intended.

The NUMA issue is the only caveat I saw. I guess I just wanted to
mention it by name.


On Sat, Jul 17, 2004 at 12:28:54PM +1000, Keith Owens wrote:
> Just reading the lockfree list and the structures on the list does not
> suffer from any NUMA problems, because reading does not perform any
> global updates at all.  The SMP starvation problem only kicks in when
> multiple concurrent updates are being done.  Even with multiple
> writers, one of the writers is guaranteed to succeed every time, so
> over time all the write operations will proceed, subject to fair access
> to exclusive cache lines.

The only methods I can think of to repair this (basically queueing) are
not busywait-free.


On Sat, Jul 17, 2004 at 12:28:54PM +1000, Keith Owens wrote:
> Lockfree reads with Moir's algorithms require extra memory bandwidth.
> In the absence of updates, all the cache lines end up in shared state.
> That reduces to local memory bandwidth for the (hopefully) common case
> of lots of readers and few writers.  Lockfree code is nicely suited to
> the same class of problem that RCU addresses, but without the reader
> vs. writer starvation problems.

I suppose it's worth refining the starvation claim to delaying freeing
memory as opposed to readers causing writers to busywait indefinitely.


On Sat, Jul 17, 2004 at 12:28:54PM +1000, Keith Owens wrote:
> Writer vs. writer starvation on NUMA is a lot harder.  I don't know of
> any algorithm that handles lists with lots of concurrent updates and
> also scales well on large cpus, unless the underlying hardware is fair
> in its handling of exclusive cache lines.

Well, neither do I. =)


-- wli

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-14 15:17   ` Chris Wright
  2004-07-15 14:22     ` Jesse Barnes
@ 2004-07-17 13:48     ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2004-07-17 13:48 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 403 bytes --]

Hi,

maybe I'm way off, but have you considered multi state locks for your
problem? I attached a draft-paper; written by a former collegue and
myself; on Concurrent AVL trees that uses multi-state locking. This
paper is unpublished and unfinished but might be illustrative.
Do note that those email addresses are no longer valid AFAIK, reply to
reach me or the co-author.

just my 2ct.

Peter Zijlstra

[-- Attachment #2: cavl.ps.gz --]
[-- Type: application/x-gzpostscript, Size: 40823 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2004-07-29  0:20 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-07-17 19:17 [RFC] Lock free fd lookup Albert Cahalan
2004-07-29  0:14 ` William Lee Irwin III
  -- strict thread matches above, loose matches on Subject: below --
2004-07-17  8:50 Manfred Spraul
2004-07-17  9:30 ` William Lee Irwin III
2004-07-14  4:53 [RFC] Refcounting of objects part of a lockfree collection Ravikiran G Thirumalai
2004-07-14  4:56 ` [RFC] Lock free fd lookup Ravikiran G Thirumalai
2004-07-14 15:17   ` Chris Wright
2004-07-15 14:22     ` Jesse Barnes
2004-07-15 16:10       ` Dipankar Sarma
2004-07-15 16:22         ` Jesse Barnes
2004-07-15 16:34         ` Chris Wright
2004-07-16  5:38           ` Ravikiran G Thirumalai
2004-07-16  6:27       ` William Lee Irwin III
2004-07-17  0:55         ` Keith Owens
2004-07-17  1:19           ` William Lee Irwin III
2004-07-17  2:12             ` Keith Owens
2004-07-17  2:34               ` William Lee Irwin III
2004-07-17  2:28             ` Keith Owens
2004-07-17  3:16               ` William Lee Irwin III
2004-07-17 13:48     ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox