public inbox for linux-kernel@vger.kernel.org
* [RFC] Refcounting of objects part of a lockfree collection
@ 2004-07-14  4:53 Ravikiran G Thirumalai
  2004-07-14  4:56 ` [RFC] Lock free fd lookup Ravikiran G Thirumalai
  2004-07-14  7:07 ` [RFC] Refcounting of objects part of a lockfree collection Greg KH
  0 siblings, 2 replies; 33+ messages in thread
From: Ravikiran G Thirumalai @ 2004-07-14  4:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: dipankar

Traditionally, the refcounting technique is used to avoid or minimise
lock hold times for individual objects in a collection.  The collection
of such objects (lists, arrays, etc.) is protected by traditional locks
such as spinlocks or reader-writer locks.  These locks protect access to
the collection data structure itself (walking the objects in the
collection, adding/deleting objects).  Access to individual objects is
protected through the refcounting mechanism.

For performance, it makes sense to make access to a read-mostly
collection of objects lock-free with RCU techniques.  Introducing RCU
in place of traditional locking to protect such a collection creates
some refcounting issues.  These issues come up because, with RCU, unlike
with traditional locking, writers and readers can execute concurrently.

The attached patch provides infrastructure for refcounting of objects
in an RCU-protected collection.

This infrastructure is the result of our attempt at making the fd lookup
lock-free.  fget currently holds files_struct.file_lock to get the
struct file pointer for the requested fd from the process' fd array.
The spinlock in fget serialises read access to files_struct.fd[] and
the refcount increment.  We can save the spin_lock and spin_unlock on
the fd array lookup if a reference can be taken on the struct file
object in a safe, lock-free manner (under rcu_read_lock/rcu_read_unlock
on the read side, with RCU on the update side).  This optimization
should improve performance for threaded, I/O-intensive workloads.
A patch to make use of the lock-free refcounting infrastructure to do
just that follows.  With the lock-free fd lookup patch, tiobench
performance increases by 13% for sequential reads and 21% for random
reads on a 4-processor PIII Xeon.

Comments welcome.

Thanks,
Kiran

Signed-off-by: Ravikiran Thirumalai <kiran@in.ibm.com>

---

diff -ruN -X dontdiff linux-2.6.6/include/linux/refcount.h refcount-2.6.6/include/linux/refcount.h
--- linux-2.6.6/include/linux/refcount.h	1970-01-01 05:30:00.000000000 +0530
+++ refcount-2.6.6/include/linux/refcount.h	2004-05-24 18:31:25.523860952 +0530
@@ -0,0 +1,244 @@
+/* 
+ * Refcounter framework for elements of lists/arrays protected by
+ * RCU.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2004
+ *
+ * Author: Ravikiran Thirumalai <kiran@in.ibm.com>
+ * Credits: Dipankar Sarma <dipankar@in.ibm.com>.  refcount_get_rcu has
+ *	    been abstracted out from Dipankar's implementation of rcu based
+ *	    fd management.
+ * 
+ * Refcounting on elements of lists protected by traditional
+ * reader/writer spinlocks or semaphores is straightforward, as in:
+ * 
+ * 1.					2.
+ * add()				search_and_reference()
+ * {					{
+ * 	alloc_object				read_lock(&list_lock);
+ *	...					search_for_element
+ *	refcount_init(&el->rc)			refcount_get(&el->rc);	
+ *	write_lock(&list_lock);			...
+ *	add_element				read_unlock(&list_lock);
+ *	...					...
+ *	write_unlock(&list_lock);	}
+ * }					
+ *
+ * 3.					4.
+ * release_referenced()			delete()
+ * {					{
+ *	...				write_lock(&list_lock);
+ *	if (refcount_put(&el->rc))	...
+ *		start_cleanup_object	...
+ *		free_object		delete_element
+ *	...				write_unlock(&list_lock);
+ * }					...
+ *					if (refcount_put(&el->rc))
+ *						start_cleanup_object
+ *						free_object
+ *					}
+ *
+ * If this list/array is made lock-free using RCU -- by changing the
+ * write_lock in add() and delete() to spin_lock and the read_lock in
+ * search_and_reference() to rcu_read_lock() -- the refcount_get in
+ * search_and_reference() could take a reference to an element which
+ * has already been deleted from the list/array.  refcount_get_rcu
+ * handles this scenario.  search_and_reference() should then look like:
+ * 2. 
+ * search_and_reference()
+ * {
+ *	rcu_read_lock();
+ *	search_for_element
+ *	if (!refcount_get_rcu(&el->rc)) {
+ *		rcu_read_unlock();
+ *		return FAIL;
+ *	}
+ *	...
+ *	...
+ *	rcu_read_unlock();
+ * }
+ * 
+ * Of course, free_object after refcount_put should be deferred using
+ * call_rcu.  Sometimes a reference to the element needs to be obtained
+ * in the update (write) stream.  In such cases, refcount_get_rcu might
+ * be overkill, since the spinlock serialising list updates is already
+ * held; refcount_get is to be used in such cases.
+ * Note: Even though these refcount primitives can be used for lists
+ * which are protected by traditional locking, if the arch doesn't have
+ * cmpxchg, there might be extra penalties in refcount_get.
+ *
+ */
+
+#ifndef _LINUX_REFCOUNT_H
+#define _LINUX_REFCOUNT_H
+#ifdef  __KERNEL__
+
+#include <asm/atomic.h>
+#include <asm/system.h>
+
+typedef struct { atomic_t count; } refcount_t;
+
+/* 
+ * Refcount primitives for lists/array elements protected by traditional
+ * locking
+ */
+
+static inline void refcount_init(refcount_t *rc)	
+{
+	atomic_set(&rc->count, 1);
+}
+
+static inline int refcount_put(refcount_t *rc)
+{
+	return atomic_dec_and_test(&rc->count);
+}
+
+static inline void refcount_dec(refcount_t *rc)
+{
+	atomic_dec(&rc->count);
+}
+
+static inline int refcount_put_and_lock(refcount_t *rc, spinlock_t *lock)
+{
+	return atomic_dec_and_lock(&rc->count, lock);
+}
+
+static inline int refcount_read(refcount_t *rc)
+{
+	return atomic_read(&rc->count);
+}
+
+#ifdef __HAVE_ARCH_CMPXCHG
+
+static inline void refcount_get(refcount_t *rc)
+{
+	atomic_inc(&rc->count);
+}
+
+/* 
+ * Refcount primitives for lists/array elements protected by rcu
+ */
+
+/* 
+ * cmpxchg is needed on UP too, if deletions to the list/array can happen
+ * in interrupt context.
+ */
+static inline int refcount_get_rcu(refcount_t *rc)
+{
+	int c, old;
+	c = atomic_read(&rc->count);
+	while (c && (old = cmpxchg(&rc->count.counter, c, c+1)) != c)
+		c = old;
+	return c;
+}
+
+#else /* !__HAVE_ARCH_CMPXCHG */
+
+/* 
+ * We use an array of spinlocks for the refcounts -- similar to the ones
+ * in the sparc 32-bit atomic_t implementation -- along with a similar
+ * hash function mapping a counter to its lock.
+ * Can't help multiprocessors which do not have cmpxchg :(
+ */
+
+#ifdef	CONFIG_SMP
+#define REFCOUNT_HASH_SIZE	4
+#define REFCOUNT_HASH(r) \
+	(&__refcount_hash[(((unsigned long)r)>>8) & (REFCOUNT_HASH_SIZE-1)])
+#else
+#define	REFCOUNT_HASH_SIZE	1
+#define REFCOUNT_HASH(r) 	&__refcount_hash[0]
+#endif /* CONFIG_SMP */
+
+extern spinlock_t __refcount_hash[REFCOUNT_HASH_SIZE];
+
+static inline void refcount_init(refcount_t *rc)
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter = 1;
+	spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+}
+
+static inline int refcount_put(refcount_t *rc)
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter--;
+	if (!rc->count.counter) {
+		spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+		return 1;
+	} else {
+		spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+		return 0;
+	}
+}
+
+static inline void refcount_dec(refcount_t *rc)
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter--;
+	spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+}
+
+static inline int refcount_put_and_lock(refcount_t *rc, spinlock_t *lock)
+{
+	unsigned long flags;
+	spin_lock(lock);
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter--;
+	if (!rc->count.counter) {
+		spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+		return 1;
+	} else {
+		spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+		spin_unlock(lock);
+		return 0;
+	}
+}
+/* Spinlocks are not needed here; atomic_read suffices */
+static inline int refcount_read(refcount_t *rc)
+{
+	return atomic_read(&rc->count);
+}
+
+static inline void refcount_get(refcount_t *rc)
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter++;
+	spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+}
+
+static inline int refcount_get_rcu(refcount_t *rc)
+{
+	int ret;
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	if (rc->count.counter) 
+		ret = rc->count.counter++;
+	else
+		ret = 0;
+	spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+	return ret;
+}
+
+#endif /* __HAVE_ARCH_CMPXCHG */
+
+#endif
+#endif /* _LINUX_REFCOUNT_H */
diff -ruN -X dontdiff linux-2.6.6/kernel/rcupdate.c refcount-2.6.6/kernel/rcupdate.c
--- linux-2.6.6/kernel/rcupdate.c	2004-05-10 08:01:58.000000000 +0530
+++ refcount-2.6.6/kernel/rcupdate.c	2004-05-24 15:41:15.000000000 +0530
@@ -44,6 +44,7 @@
 #include <linux/notifier.h>
 #include <linux/rcupdate.h>
 #include <linux/cpu.h>
+#include <linux/refcount.h>
 
 /* Definition for rcupdate control block. */
 struct rcu_ctrlblk rcu_ctrlblk = 
@@ -325,6 +326,17 @@
 	wait_for_completion(&completion);
 }
 
+/*
+ * spinlock_t array for arches which do not have cmpxchg.  This is for
+ * rcu-based refcounting.
+ */
+
+#ifndef __HAVE_ARCH_CMPXCHG
+spinlock_t __refcount_hash[REFCOUNT_HASH_SIZE] = {
+	[0 ... (REFCOUNT_HASH_SIZE-1)] = SPIN_LOCK_UNLOCKED
+};
+EXPORT_SYMBOL(__refcount_hash);
+#endif
 
 EXPORT_SYMBOL(call_rcu);
 EXPORT_SYMBOL(synchronize_kernel);


* [RFC] Lock free fd lookup
  2004-07-14  4:53 [RFC] Refcounting of objects part of a lockfree collection Ravikiran G Thirumalai
@ 2004-07-14  4:56 ` Ravikiran G Thirumalai
  2004-07-14 15:17   ` Chris Wright
  2004-07-14  7:07 ` [RFC] Refcounting of objects part of a lockfree collection Greg KH
  1 sibling, 1 reply; 33+ messages in thread
From: Ravikiran G Thirumalai @ 2004-07-14  4:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: dipankar

This makes use of the lock-free refcounting infrastructure (see earlier
posting today) to make files_struct.fd[] lookup lock-free.  This carries
forward earlier work by Maneesh and Dipankar.

With the lock-free fd lookup patch, tiobench performance increases by 13%
for sequential reads and 21% for random reads on a 4-processor PIII Xeon.

Comments welcome.

Thanks,
Kiran


diff -ruN -X dontdiff linux-2.6.6/arch/mips/kernel/irixioctl.c files_struct-2.6.6/arch/mips/kernel/irixioctl.c
--- linux-2.6.6/arch/mips/kernel/irixioctl.c	2004-05-10 08:02:53.000000000 +0530
+++ files_struct-2.6.6/arch/mips/kernel/irixioctl.c	2004-05-14 12:53:40.000000000 +0530
@@ -14,6 +14,7 @@
 #include <linux/syscalls.h>
 #include <linux/tty.h>
 #include <linux/file.h>
+#include <linux/rcupdate.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctl.h>
@@ -33,15 +34,15 @@
 	struct file *filp;
 	struct tty_struct *ttyp = NULL;
 
-	spin_lock(&current->files->file_lock);
-	filp = fcheck(fd);
+	rcu_read_lock();
+	filp = fcheck_rcu(fd);
 	if(filp && filp->private_data) {
 		ttyp = (struct tty_struct *) filp->private_data;
 
 		if(ttyp->magic != TTY_MAGIC)
 			ttyp =NULL;
 	}
-	spin_unlock(&current->files->file_lock);
+	rcu_read_unlock();
 	return ttyp;
 }
 
diff -ruN -X dontdiff linux-2.6.6/arch/sparc64/solaris/ioctl.c files_struct-2.6.6/arch/sparc64/solaris/ioctl.c
--- linux-2.6.6/arch/sparc64/solaris/ioctl.c	2004-05-10 08:02:38.000000000 +0530
+++ files_struct-2.6.6/arch/sparc64/solaris/ioctl.c	2004-05-14 12:53:40.000000000 +0530
@@ -24,6 +24,7 @@
 #include <linux/netdevice.h>
 #include <linux/mtio.h>
 #include <linux/time.h>
+#include <linux/rcupdate.h>
 #include <linux/compat.h>
 
 #include <net/sock.h>
@@ -291,15 +292,15 @@
 {
 	struct inode *ino;
 	/* I wonder which of these tests are superfluous... --patrik */
-	spin_lock(&current->files->file_lock);
+	rcu_read_lock();
 	if (! current->files->fd[fd] ||
 	    ! current->files->fd[fd]->f_dentry ||
 	    ! (ino = current->files->fd[fd]->f_dentry->d_inode) ||
 	    ! ino->i_sock) {
-		spin_unlock(&current->files->file_lock);
+		rcu_read_unlock();
 		return TBADF;
 	}
-	spin_unlock(&current->files->file_lock);
+	rcu_read_unlock();
 	
 	switch (cmd & 0xff) {
 	case 109: /* SI_SOCKPARAMS */
diff -ruN -X dontdiff linux-2.6.6/arch/sparc64/solaris/timod.c files_struct-2.6.6/arch/sparc64/solaris/timod.c
--- linux-2.6.6/arch/sparc64/solaris/timod.c	2004-05-10 08:02:28.000000000 +0530
+++ files_struct-2.6.6/arch/sparc64/solaris/timod.c	2004-05-14 12:53:40.000000000 +0530
@@ -143,9 +143,14 @@
 static void timod_wake_socket(unsigned int fd)
 {
 	struct socket *sock;
+	struct file *filp; 
 
 	SOLD("wakeing socket");
-	sock = SOCKET_I(current->files->fd[fd]->f_dentry->d_inode);
+	if (!(filp = fcheck(fd))) {
+		SOLD("BAD FD");
+		return;
+	}
+	sock = SOCKET_I(filp->f_dentry->d_inode);
 	wake_up_interruptible(&sock->wait);
 	read_lock(&sock->sk->sk_callback_lock);
 	if (sock->fasync_list && !test_bit(SOCK_ASYNC_WAITDATA, &sock->flags))
@@ -157,9 +162,14 @@
 static void timod_queue(unsigned int fd, struct T_primsg *it)
 {
 	struct sol_socket_struct *sock;
+	struct file *filp; 
 
 	SOLD("queuing primsg");
-	sock = (struct sol_socket_struct *)current->files->fd[fd]->private_data;
+	if (!(filp = fcheck(fd))) {
+		SOLD("BAD FD");
+		return;
+	}
+	sock = (struct sol_socket_struct *)filp->private_data;
 	it->next = sock->pfirst;
 	sock->pfirst = it;
 	if (!sock->plast)
@@ -171,9 +181,14 @@
 static void timod_queue_end(unsigned int fd, struct T_primsg *it)
 {
 	struct sol_socket_struct *sock;
+	struct file *filp; 
 
 	SOLD("queuing primsg at end");
-	sock = (struct sol_socket_struct *)current->files->fd[fd]->private_data;
+	if (!(filp = fcheck(fd))) {
+		SOLD("BAD FD");
+		return;
+	}
+	sock = (struct sol_socket_struct *)filp->private_data;
 	it->next = NULL;
 	if (sock->plast)
 		sock->plast->next = it;
@@ -351,7 +366,10 @@
 		(int (*)(int, unsigned long *))SYS(socketcall);
 	int (*sys_sendto)(int, void *, size_t, unsigned, struct sockaddr *, int) =
 		(int (*)(int, void *, size_t, unsigned, struct sockaddr *, int))SYS(sendto);
-	filp = current->files->fd[fd];
+
+	if (!(filp = fcheck(fd)))
+		return -EBADF;
+
 	ino = filp->f_dentry->d_inode;
 	sock = (struct sol_socket_struct *)filp->private_data;
 	SOLD("entry");
@@ -632,7 +650,10 @@
 	
 	SOLD("entry");
 	SOLDD(("%u %p %d %p %p %d %p %d\n", fd, ctl_buf, ctl_maxlen, ctl_len, data_buf, data_maxlen, data_len, *flags_p));
-	filp = current->files->fd[fd];
+
+	if (!(filp = fcheck(fd)))
+		return -EBADF;
+
 	ino = filp->f_dentry->d_inode;
 	sock = (struct sol_socket_struct *)filp->private_data;
 	SOLDD(("%p %p\n", sock->pfirst, sock->pfirst ? sock->pfirst->next : NULL));
@@ -848,7 +869,7 @@
 	lock_kernel();
 	if(fd >= NR_OPEN) goto out;
 
-	filp = current->files->fd[fd];
+	filp = fcheck(fd);
 	if(!filp) goto out;
 
 	ino = filp->f_dentry->d_inode;
@@ -915,7 +936,7 @@
 	lock_kernel();
 	if(fd >= NR_OPEN) goto out;
 
-	filp = current->files->fd[fd];
+	filp = fcheck(fd);
 	if(!filp) goto out;
 
 	ino = filp->f_dentry->d_inode;
diff -ruN -X dontdiff linux-2.6.6/drivers/char/tty_io.c files_struct-2.6.6/drivers/char/tty_io.c
--- linux-2.6.6/drivers/char/tty_io.c	2004-05-10 08:02:39.000000000 +0530
+++ files_struct-2.6.6/drivers/char/tty_io.c	2004-05-14 12:53:50.000000000 +0530
@@ -1936,9 +1936,9 @@
 		}
 		task_lock(p);
 		if (p->files) {
-			spin_lock(&p->files->file_lock);
+			rcu_read_lock();
 			for (i=0; i < p->files->max_fds; i++) {
-				filp = fcheck_files(p->files, i);
+				filp = fcheck_files_rcu(p->files, i);
 				if (!filp)
 					continue;
 				if (filp->f_op->read == tty_read &&
@@ -1950,7 +1950,7 @@
 					break;
 				}
 			}
-			spin_unlock(&p->files->file_lock);
+			rcu_read_unlock();
 		}
 		task_unlock(p);
 	}
diff -ruN -X dontdiff linux-2.6.6/drivers/net/ppp_generic.c files_struct-2.6.6/drivers/net/ppp_generic.c
--- linux-2.6.6/drivers/net/ppp_generic.c	2004-05-10 08:03:20.000000000 +0530
+++ files_struct-2.6.6/drivers/net/ppp_generic.c	2004-05-14 12:53:50.000000000 +0530
@@ -46,6 +46,7 @@
 #include <linux/rwsem.h>
 #include <linux/stddef.h>
 #include <linux/device.h>
+#include <linux/refcount.h>
 #include <net/slhc_vj.h>
 #include <asm/atomic.h>
 
@@ -525,12 +526,12 @@
 			if (file == ppp->owner)
 				ppp_shutdown_interface(ppp);
 		}
-		if (atomic_read(&file->f_count) <= 2) {
+		if (refcount_read(&file->f_count) <= 2) {
 			ppp_release(inode, file);
 			err = 0;
 		} else
 			printk(KERN_DEBUG "PPPIOCDETACH file->f_count=%d\n",
-			       atomic_read(&file->f_count));
+			       refcount_read(&file->f_count));
 		return err;
 	}
 
diff -ruN -X dontdiff linux-2.6.6/fs/affs/file.c files_struct-2.6.6/fs/affs/file.c
--- linux-2.6.6/fs/affs/file.c	2004-05-10 08:02:29.000000000 +0530
+++ files_struct-2.6.6/fs/affs/file.c	2004-05-14 12:53:50.000000000 +0530
@@ -62,7 +62,7 @@
 static int
 affs_file_open(struct inode *inode, struct file *filp)
 {
-	if (atomic_read(&filp->f_count) != 1)
+	if (refcount_read(&filp->f_count) != 1)
 		return 0;
 	pr_debug("AFFS: open(%d)\n", AFFS_I(inode)->i_opencnt);
 	AFFS_I(inode)->i_opencnt++;
@@ -72,7 +72,7 @@
 static int
 affs_file_release(struct inode *inode, struct file *filp)
 {
-	if (atomic_read(&filp->f_count) != 0)
+	if (refcount_read(&filp->f_count) != 0)
 		return 0;
 	pr_debug("AFFS: release(%d)\n", AFFS_I(inode)->i_opencnt);
 	AFFS_I(inode)->i_opencnt--;
diff -ruN -X dontdiff linux-2.6.6/fs/aio.c files_struct-2.6.6/fs/aio.c
--- linux-2.6.6/fs/aio.c	2004-05-10 08:02:29.000000000 +0530
+++ files_struct-2.6.6/fs/aio.c	2004-05-14 12:53:50.000000000 +0530
@@ -480,7 +480,7 @@
 static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 {
 	dprintk(KERN_DEBUG "aio_put(%p): f_count=%d\n",
-		req, atomic_read(&req->ki_filp->f_count));
+		req, refcount_read(&req->ki_filp->f_count));
 
 	req->ki_users --;
 	if (unlikely(req->ki_users < 0))
@@ -494,7 +494,7 @@
 	/* Must be done under the lock to serialise against cancellation.
 	 * Call this aio_fput as it duplicates fput via the fput_work.
 	 */
-	if (unlikely(atomic_dec_and_test(&req->ki_filp->f_count))) {
+	if (unlikely(refcount_put(&req->ki_filp->f_count))) {
 		get_ioctx(ctx);
 		spin_lock(&fput_lock);
 		list_add(&req->ki_list, &fput_head);
diff -ruN -X dontdiff linux-2.6.6/fs/fcntl.c files_struct-2.6.6/fs/fcntl.c
--- linux-2.6.6/fs/fcntl.c	2004-05-10 08:02:53.000000000 +0530
+++ files_struct-2.6.6/fs/fcntl.c	2004-05-14 12:53:50.000000000 +0530
@@ -34,9 +34,9 @@
 {
 	struct files_struct *files = current->files;
 	int res;
-	spin_lock(&files->file_lock);
+	rcu_read_lock();
 	res = FD_ISSET(fd, files->close_on_exec);
-	spin_unlock(&files->file_lock);
+	rcu_read_unlock();
 	return res;
 }
 
@@ -138,7 +138,7 @@
 	if (fd >= 0) {
 		FD_SET(fd, files->open_fds);
 		FD_CLR(fd, files->close_on_exec);
-		spin_unlock(&files->file_lock);
+		spin_unlock(&files->file_lock);
 		fd_install(fd, file);
 	} else {
 		spin_unlock(&files->file_lock);
@@ -183,6 +183,7 @@
 		goto out_fput;
 
 	files->fd[newfd] = file;
+	wmb();
 	FD_SET(newfd, files->open_fds);
 	FD_CLR(newfd, files->close_on_exec);
 	spin_unlock(&files->file_lock);
diff -ruN -X dontdiff linux-2.6.6/fs/file.c files_struct-2.6.6/fs/file.c
--- linux-2.6.6/fs/file.c	2004-05-10 08:02:00.000000000 +0530
+++ files_struct-2.6.6/fs/file.c	2004-05-14 12:53:50.000000000 +0530
@@ -14,7 +14,20 @@
 #include <linux/file.h>
 
 #include <asm/bitops.h>
+#include <linux/rcupdate.h>
 
+struct rcu_fd_array {
+	struct rcu_head rh;
+	struct file **array;   
+	int nfds;
+};
+
+struct rcu_fd_set {
+	struct rcu_head rh;
+	fd_set *openset;
+	fd_set *execset;
+	int nfds;
+};
 
 /*
  * Allocate an fd array, using kmalloc or vmalloc.
@@ -49,6 +62,13 @@
 		vfree(array);
 }
 
+static void fd_array_callback(struct rcu_head *rcu) 
+{
+	struct rcu_fd_array *a = container_of(rcu, struct rcu_fd_array, rh);
+	free_fd_array(a->array, a->nfds);
+	kfree(a); 
+}
+
 /*
  * Expand the fd array in the files_struct.  Called with the files
  * spinlock held for write.
@@ -56,8 +76,9 @@
 
 int expand_fd_array(struct files_struct *files, int nr)
 {
-	struct file **new_fds;
-	int error, nfds;
+	struct file **new_fds = NULL;
+	int error, nfds = 0;
+	struct rcu_fd_array *arg = NULL;
 
 	
 	error = -EMFILE;
@@ -89,18 +110,17 @@
 
 	error = -ENOMEM;
 	new_fds = alloc_fd_array(nfds);
+	arg = (struct rcu_fd_array *) kmalloc(sizeof(*arg), GFP_KERNEL);
+
 	spin_lock(&files->file_lock);
-	if (!new_fds)
+	if (!new_fds || !arg)
 		goto out;
 
 	/* Copy the existing array and install the new pointer */
 
 	if (nfds > files->max_fds) {
-		struct file **old_fds;
-		int i;
-		
-		old_fds = xchg(&files->fd, new_fds);
-		i = xchg(&files->max_fds, nfds);
+		struct file **old_fds = files->fd;
+		int i = files->max_fds;
 
 		/* Don't copy/clear the array if we are creating a new
 		   fd array for fork() */
@@ -109,19 +129,35 @@
 			/* clear the remainder of the array */
 			memset(&new_fds[i], 0,
 			       (nfds-i) * sizeof(struct file *)); 
-
-			spin_unlock(&files->file_lock);
-			free_fd_array(old_fds, i);
-			spin_lock(&files->file_lock);
 		}
+
+		smp_wmb();
+		files->fd = new_fds;
+		smp_wmb();
+		files->max_fds = nfds;
+		
+		if (i) {
+			arg->array = old_fds;
+			arg->nfds = i;
+			call_rcu(&arg->rh, (void (*)(void *))fd_array_callback,
+				 &arg->rh);
+		} else 
+			kfree(arg);
 	} else {
 		/* Somebody expanded the array while we slept ... */
 		spin_unlock(&files->file_lock);
 		free_fd_array(new_fds, nfds);
+		kfree(arg);
 		spin_lock(&files->file_lock);
 	}
-	error = 0;
+
+	return 0;
 out:
+	if (new_fds)
+		free_fd_array(new_fds, nfds);
+	if (arg)
+		kfree(arg);
+
 	return error;
 }
 
@@ -153,6 +189,14 @@
 		vfree(array);
 }
 
+static void fd_set_callback (struct rcu_head *rcu)
+{
+	struct rcu_fd_set *a = container_of(rcu, struct rcu_fd_set, rh); 
+	free_fdset(a->openset, a->nfds);
+	free_fdset(a->execset, a->nfds);
+	kfree(a);
+}
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
@@ -161,6 +205,7 @@
 {
 	fd_set *new_openset = 0, *new_execset = 0;
 	int error, nfds = 0;
+	struct rcu_fd_set *arg = NULL;
 
 	error = -EMFILE;
 	if (files->max_fdset >= NR_OPEN || nr >= NR_OPEN)
@@ -183,35 +228,43 @@
 	error = -ENOMEM;
 	new_openset = alloc_fdset(nfds);
 	new_execset = alloc_fdset(nfds);
+	arg = (struct rcu_fd_set *) kmalloc(sizeof(*arg), GFP_ATOMIC);
 	spin_lock(&files->file_lock);
-	if (!new_openset || !new_execset)
+	if (!new_openset || !new_execset || !arg)
 		goto out;
 
 	error = 0;
 	
 	/* Copy the existing tables and install the new pointers */
 	if (nfds > files->max_fdset) {
-		int i = files->max_fdset / (sizeof(unsigned long) * 8);
-		int count = (nfds - files->max_fdset) / 8;
+		fd_set *old_openset = files->open_fds;
+		fd_set *old_execset = files->close_on_exec;
+		int old_nfds = files->max_fdset;
+		int i = old_nfds / (sizeof(unsigned long) * 8);
+		int count = (nfds - old_nfds) / 8;
 		
 		/* 
 		 * Don't copy the entire array if the current fdset is
 		 * not yet initialised.  
 		 */
 		if (i) {
-			memcpy (new_openset, files->open_fds, files->max_fdset/8);
-			memcpy (new_execset, files->close_on_exec, files->max_fdset/8);
+			memcpy (new_openset, old_openset, old_nfds/8);
+			memcpy (new_execset, old_execset, old_nfds/8);
 			memset (&new_openset->fds_bits[i], 0, count);
 			memset (&new_execset->fds_bits[i], 0, count);
 		}
 		
-		nfds = xchg(&files->max_fdset, nfds);
-		new_openset = xchg(&files->open_fds, new_openset);
-		new_execset = xchg(&files->close_on_exec, new_execset);
-		spin_unlock(&files->file_lock);
-		free_fdset (new_openset, nfds);
-		free_fdset (new_execset, nfds);
-		spin_lock(&files->file_lock);
+		smp_wmb();
+		files->open_fds =  new_openset;
+		files->close_on_exec = new_execset;
+		smp_wmb();
+		files->max_fdset = nfds;
+
+		arg->openset = old_openset;
+		arg->execset = old_execset;
+		arg->nfds = old_nfds;
+		call_rcu(&arg->rh, (void (*)(void *))fd_set_callback, &arg->rh);
+
 		return 0;
 	} 
 	/* Somebody expanded the array while we slept ... */
@@ -222,6 +275,8 @@
 		free_fdset(new_openset, nfds);
 	if (new_execset)
 		free_fdset(new_execset, nfds);
+	if (arg)
+		kfree(arg);
 	spin_lock(&files->file_lock);
 	return error;
 }
diff -ruN -X dontdiff linux-2.6.6/fs/file_table.c files_struct-2.6.6/fs/file_table.c
--- linux-2.6.6/fs/file_table.c	2004-05-10 08:01:59.000000000 +0530
+++ files_struct-2.6.6/fs/file_table.c	2004-05-14 14:05:22.000000000 +0530
@@ -14,6 +14,7 @@
 #include <linux/fs.h>
 #include <linux/security.h>
 #include <linux/eventpoll.h>
+#include <linux/rcupdate.h>
 #include <linux/mount.h>
 #include <linux/cdev.h>
 
@@ -54,11 +55,17 @@
 	spin_unlock_irqrestore(&filp_count_lock, flags);
 }
 
-static inline void file_free(struct file *f)
+static inline void file_free_rcu(void *arg)
 {
+	struct file *f = arg;
 	kmem_cache_free(filp_cachep, f);
 }
 
+static inline void file_free(struct file *f)
+{
+	call_rcu(&f->f_rcuhead, file_free_rcu, f);
+}
+
 /* Find an unused file structure and return a pointer to it.
  * Returns NULL, if there are no more free file structures or
  * we run out of memory.
@@ -81,7 +88,7 @@
 				goto fail;
 			}
 			eventpoll_init_file(f);
-			atomic_set(&f->f_count, 1);
+			refcount_init(&f->f_count);
 			f->f_uid = current->fsuid;
 			f->f_gid = current->fsgid;
 			f->f_owner.lock = RW_LOCK_UNLOCKED;
@@ -118,7 +125,7 @@
 	eventpoll_init_file(filp);
 	filp->f_flags  = flags;
 	filp->f_mode   = (flags+1) & O_ACCMODE;
-	atomic_set(&filp->f_count, 1);
+	refcount_init(&filp->f_count);
 	filp->f_dentry = dentry;
 	filp->f_mapping = dentry->d_inode->i_mapping;
 	filp->f_uid    = current->fsuid;
@@ -154,7 +161,7 @@
 
 void fastcall fput(struct file *file)
 {
-	if (atomic_dec_and_test(&file->f_count))
+	if (refcount_put(&file->f_count))
 		__fput(file);
 }
 
@@ -197,11 +204,16 @@
 	struct file *file;
 	struct files_struct *files = current->files;
 
-	spin_lock(&files->file_lock);
-	file = fcheck_files(files, fd);
-	if (file)
-		get_file(file);
-	spin_unlock(&files->file_lock);
+	rcu_read_lock();
+	file = fcheck_files_rcu(files, fd);
+	if (file) {
+		if (!refcount_get_rcu(&file->f_count)) {
+			/* File object ref couldn't be taken */
+			rcu_read_unlock();
+			return NULL;
+		}
+	}
+	rcu_read_unlock();
 	return file;
 }
 
@@ -223,21 +235,25 @@
 	if (likely((atomic_read(&files->count) == 1))) {
 		file = fcheck_files(files, fd);
 	} else {
-		spin_lock(&files->file_lock);
-		file = fcheck_files(files, fd);
+		rcu_read_lock();
+		file = fcheck_files_rcu(files, fd);
 		if (file) {
-			get_file(file);
-			*fput_needed = 1;
+			if (refcount_get_rcu(&file->f_count))
+				*fput_needed = 1;
+			else
+				/* Didn't get the reference; file is being freed */
+				file = NULL;
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 	}
+
 	return file;
 }
 
 
 void put_filp(struct file *file)
 {
-	if (atomic_dec_and_test(&file->f_count)) {
+	if (refcount_put(&file->f_count)) {
 		security_file_free(file);
 		file_kill(file);
 		file_free(file);
diff -ruN -X dontdiff linux-2.6.6/fs/open.c files_struct-2.6.6/fs/open.c
--- linux-2.6.6/fs/open.c	2004-05-10 08:01:59.000000000 +0530
+++ files_struct-2.6.6/fs/open.c	2004-05-14 12:53:50.000000000 +0530
@@ -1028,7 +1028,9 @@
 	filp = files->fd[fd];
 	if (!filp)
 		goto out_unlock;
-	files->fd[fd] = NULL;
+	files->fd[fd] = NULL;	
+	/* Need to make it consistent with open_fds in __put_unused_fd() */
+	smp_wmb();
 	FD_CLR(fd, files->close_on_exec);
 	__put_unused_fd(files, fd);
 	spin_unlock(&files->file_lock);
diff -ruN -X dontdiff linux-2.6.6/fs/proc/base.c files_struct-2.6.6/fs/proc/base.c
--- linux-2.6.6/fs/proc/base.c	2004-05-10 08:02:52.000000000 +0530
+++ files_struct-2.6.6/fs/proc/base.c	2004-05-14 12:53:50.000000000 +0530
@@ -28,6 +28,7 @@
 #include <linux/namespace.h>
 #include <linux/mm.h>
 #include <linux/smp_lock.h>
+#include <linux/rcupdate.h>
 #include <linux/kallsyms.h>
 #include <linux/mount.h>
 #include <linux/security.h>
@@ -191,16 +192,16 @@
 
 	files = get_files_struct(task);
 	if (files) {
-		spin_lock(&files->file_lock);
-		file = fcheck_files(files, fd);
+		rcu_read_lock();
+		file = fcheck_files_rcu(files, fd);
 		if (file) {
 			*mnt = mntget(file->f_vfsmnt);
 			*dentry = dget(file->f_dentry);
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 			put_files_struct(files);
 			return 0;
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 		put_files_struct(files);
 	}
 	return -ENOENT;
@@ -805,15 +806,15 @@
 			files = get_files_struct(p);
 			if (!files)
 				goto out;
-			spin_lock(&files->file_lock);
+			rcu_read_lock();
 			for (fd = filp->f_pos-2;
 			     fd < files->max_fds;
 			     fd++, filp->f_pos++) {
 				unsigned int i,j;
 
-				if (!fcheck_files(files, fd))
+				if (!fcheck_files_rcu(files, fd))
 					continue;
-				spin_unlock(&files->file_lock);
+				rcu_read_unlock();
 
 				j = NUMBUF;
 				i = fd;
@@ -825,12 +826,12 @@
 
 				ino = fake_ino(tid, PROC_TID_FD_DIR + fd);
 				if (filldir(dirent, buf+j, NUMBUF-j, fd+2, ino, DT_LNK) < 0) {
-					spin_lock(&files->file_lock);
+					rcu_read_lock();
 					break;
 				}
-				spin_lock(&files->file_lock);
+				rcu_read_lock();
 			}
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 			put_files_struct(files);
 	}
 out:
@@ -1003,9 +1004,9 @@
 
 	files = get_files_struct(task);
 	if (files) {
-		spin_lock(&files->file_lock);
-		if (fcheck_files(files, fd)) {
-			spin_unlock(&files->file_lock);
+		rcu_read_lock();
+		if (fcheck_files_rcu(files, fd)) {
+			rcu_read_unlock();
 			put_files_struct(files);
 			if (task_dumpable(task)) {
 				inode->i_uid = task->euid;
@@ -1017,7 +1018,7 @@
 			security_task_to_inode(task, inode);
 			return 1;
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 		put_files_struct(files);
 	}
 	d_drop(dentry);
@@ -1109,15 +1110,15 @@
 	if (!files)
 		goto out_unlock;
 	inode->i_mode = S_IFLNK;
-	spin_lock(&files->file_lock);
-	file = fcheck_files(files, fd);
+	rcu_read_lock();
+	file = fcheck_files_rcu(files, fd);
 	if (!file)
 		goto out_unlock2;
 	if (file->f_mode & 1)
 		inode->i_mode |= S_IRUSR | S_IXUSR;
 	if (file->f_mode & 2)
 		inode->i_mode |= S_IWUSR | S_IXUSR;
-	spin_unlock(&files->file_lock);
+	rcu_read_unlock();
 	put_files_struct(files);
 	inode->i_op = &proc_pid_link_inode_operations;
 	inode->i_size = 64;
@@ -1127,7 +1128,7 @@
 	return NULL;
 
 out_unlock2:
-	spin_unlock(&files->file_lock);
+	rcu_read_unlock();
 	put_files_struct(files);
 out_unlock:
 	iput(inode);
diff -ruN -X dontdiff linux-2.6.6/fs/select.c files_struct-2.6.6/fs/select.c
--- linux-2.6.6/fs/select.c	2004-05-10 08:01:59.000000000 +0530
+++ files_struct-2.6.6/fs/select.c	2004-05-14 12:53:50.000000000 +0530
@@ -21,6 +21,7 @@
 #include <linux/personality.h> /* for STICKY_TIMEOUTS */
 #include <linux/file.h>
 #include <linux/fs.h>
+#include <linux/rcupdate.h>
 
 #include <asm/uaccess.h>
 
@@ -131,13 +132,16 @@
 static int max_select_fd(unsigned long n, fd_set_bits *fds)
 {
 	unsigned long *open_fds;
+	fd_set *open_fdset;
 	unsigned long set;
 	int max;
 
 	/* handle last in-complete long-word first */
 	set = ~(~0UL << (n & (__NFDBITS-1)));
 	n /= __NFDBITS;
-	open_fds = current->files->open_fds->fds_bits+n;
+	open_fdset = current->files->open_fds;
+	smp_read_barrier_depends();
+	open_fds = open_fdset->fds_bits+n;
 	max = 0;
 	if (set) {
 		set &= BITS(fds, n);
@@ -184,9 +188,9 @@
 	int retval, i;
 	long __timeout = *timeout;
 
- 	spin_lock(&current->files->file_lock);
+	rcu_read_lock();
 	retval = max_select_fd(n, fds);
-	spin_unlock(&current->files->file_lock);
+	rcu_read_unlock();
 
 	if (retval < 0)
 		return retval;
diff -ruN -X dontdiff linux-2.6.6/include/linux/file.h files_struct-2.6.6/include/linux/file.h
--- linux-2.6.6/include/linux/file.h	2004-05-10 08:03:22.000000000 +0530
+++ files_struct-2.6.6/include/linux/file.h	2004-05-14 12:53:50.000000000 +0530
@@ -9,6 +9,7 @@
 #include <linux/posix_types.h>
 #include <linux/compiler.h>
 #include <linux/spinlock.h>
+#include <linux/rcupdate.h>
 
 /*
  * The default fd array needs to be at least BITS_PER_LONG,
@@ -69,10 +70,26 @@
 	return file;
 }
 
+/* Need Proper read barriers if fd_array is being indexed lockfree */
+static inline struct file * fcheck_files_rcu(struct files_struct *files, unsigned int fd)
+{
+	struct file * file = NULL;
+
+	if (fd < files->max_fds) {
+		struct file ** fd_array;
+		smp_rmb();
+	       	fd_array = files->fd;
+		smp_read_barrier_depends();
+		file = fd_array[fd];
+	}
+	return file;
+}
+
 /*
  * Check whether the specified fd has an open file.
  */
 #define fcheck(fd)	fcheck_files(current->files, fd)
+#define fcheck_rcu(fd)	fcheck_files_rcu(current->files, fd)
 
 extern void FASTCALL(fd_install(unsigned int fd, struct file * file));
 
diff -ruN -X dontdiff linux-2.6.6/include/linux/fs.h files_struct-2.6.6/include/linux/fs.h
--- linux-2.6.6/include/linux/fs.h	2004-05-10 08:02:26.000000000 +0530
+++ files_struct-2.6.6/include/linux/fs.h	2004-05-14 12:53:50.000000000 +0530
@@ -19,6 +19,7 @@
 #include <linux/cache.h>
 #include <linux/radix-tree.h>
 #include <linux/kobject.h>
+#include <linux/refcount.h>
 #include <asm/atomic.h>
 #include <linux/audit.h>
 
@@ -555,7 +556,7 @@
 	struct dentry		*f_dentry;
 	struct vfsmount         *f_vfsmnt;
 	struct file_operations	*f_op;
-	atomic_t		f_count;
+	refcount_t		f_count;
 	unsigned int 		f_flags;
 	mode_t			f_mode;
 	loff_t			f_pos;
@@ -576,13 +577,14 @@
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
+	struct rcu_head		f_rcuhead;
 };
 extern spinlock_t files_lock;
 #define file_list_lock() spin_lock(&files_lock);
 #define file_list_unlock() spin_unlock(&files_lock);
 
-#define get_file(x)	atomic_inc(&(x)->f_count)
-#define file_count(x)	atomic_read(&(x)->f_count)
+#define get_file(x)	refcount_get(&(x)->f_count)
+#define file_count(x)	refcount_read(&(x)->f_count)
 
 /* Initialize and open a private file and allocate its security structure. */
 extern int open_private_file(struct file *, struct dentry *, int);
diff -ruN -X dontdiff linux-2.6.6/kernel/fork.c files_struct-2.6.6/kernel/fork.c
--- linux-2.6.6/kernel/fork.c	2004-05-10 08:02:00.000000000 +0530
+++ files_struct-2.6.6/kernel/fork.c	2004-05-14 12:53:50.000000000 +0530
@@ -30,6 +30,7 @@
 #include <linux/syscalls.h>
 #include <linux/jiffies.h>
 #include <linux/futex.h>
+#include <linux/rcupdate.h>
 #include <linux/ptrace.h>
 #include <linux/mount.h>
 #include <linux/audit.h>
@@ -627,10 +628,12 @@
 static int count_open_files(struct files_struct *files, int size)
 {
 	int i;
+	fd_set *open_fds = files->open_fds;
 
+	smp_read_barrier_depends();
 	/* Find the last open fd */
 	for (i = size/(8*sizeof(long)); i > 0; ) {
-		if (files->open_fds->fds_bits[--i])
+		if (open_fds->fds_bits[--i])
 			break;
 	}
 	i = (i+1) * 8 * sizeof(long);
@@ -660,6 +663,14 @@
 	 * This works because we cache current->files (old) as oldf. Don't
 	 * break this.
 	 */
+	
+	/* 
+	 * We don't have oldf readlock, but even if the old fdset gets
+	 * grown now, we'll only copy up to "size" fds
+	 */
+	size = oldf->max_fdset;	
+	smp_rmb();
+
 	tsk->files = NULL;
 	error = -ENOMEM;
 	newf = kmem_cache_alloc(files_cachep, SLAB_KERNEL);
@@ -676,9 +687,6 @@
 	newf->open_fds	    = &newf->open_fds_init;
 	newf->fd	    = &newf->fd_array[0];
 
-	/* We don't yet have the oldf readlock, but even if the old
-           fdset gets grown now, we'll only copy up to "size" fds */
-	size = oldf->max_fdset;
 	if (size > __FD_SETSIZE) {
 		newf->max_fdset = 0;
 		spin_lock(&newf->file_lock);
@@ -687,7 +695,7 @@
 		if (error)
 			goto out_release;
 	}
-	spin_lock(&oldf->file_lock);
+	rcu_read_lock();
 
 	open_files = count_open_files(oldf, size);
 
@@ -698,7 +706,7 @@
 	 */
 	nfds = NR_OPEN_DEFAULT;
 	if (open_files > nfds) {
-		spin_unlock(&oldf->file_lock);
+		rcu_read_unlock();
 		newf->max_fds = 0;
 		spin_lock(&newf->file_lock);
 		error = expand_fd_array(newf, open_files-1);
@@ -706,10 +714,11 @@
 		if (error) 
 			goto out_release;
 		nfds = newf->max_fds;
-		spin_lock(&oldf->file_lock);
+		rcu_read_lock();
 	}
 
 	old_fds = oldf->fd;
+	smp_read_barrier_depends();
 	new_fds = newf->fd;
 
 	memcpy(newf->open_fds->fds_bits, oldf->open_fds->fds_bits, open_files/8);
@@ -721,7 +730,7 @@
 			get_file(f);
 		*new_fds++ = f;
 	}
-	spin_unlock(&oldf->file_lock);
+	rcu_read_unlock();
 
 	/* compute the remainder to be cleared */
 	size = (newf->max_fds - open_files) * sizeof(struct file *);
diff -ruN -X dontdiff linux-2.6.6/net/ipv4/netfilter/ipt_owner.c files_struct-2.6.6/net/ipv4/netfilter/ipt_owner.c
--- linux-2.6.6/net/ipv4/netfilter/ipt_owner.c	2004-05-10 08:03:20.000000000 +0530
+++ files_struct-2.6.6/net/ipv4/netfilter/ipt_owner.c	2004-05-14 12:53:50.000000000 +0530
@@ -11,6 +11,7 @@
 #include <linux/module.h>
 #include <linux/skbuff.h>
 #include <linux/file.h>
+#include <linux/rcupdate.h>
 #include <net/sock.h>
 
 #include <linux/netfilter_ipv4/ipt_owner.h>
@@ -35,17 +36,17 @@
 		task_lock(p);
 		files = p->files;
 		if(files) {
-			spin_lock(&files->file_lock);
+			rcu_read_lock();
 			for (i=0; i < files->max_fds; i++) {
-				if (fcheck_files(files, i) ==
+				if (fcheck_files_rcu(files, i) ==
 				    skb->sk->sk_socket->file) {
-					spin_unlock(&files->file_lock);
+					rcu_read_unlock();
 					task_unlock(p);
 					read_unlock(&tasklist_lock);
 					return 1;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 		}
 		task_unlock(p);
 	} while_each_thread(g, p);
@@ -67,17 +68,17 @@
 	task_lock(p);
 	files = p->files;
 	if(files) {
-		spin_lock(&files->file_lock);
+		rcu_read_lock();
 		for (i=0; i < files->max_fds; i++) {
-			if (fcheck_files(files, i) ==
+			if (fcheck_files_rcu(files, i) ==
 			    skb->sk->sk_socket->file) {
-				spin_unlock(&files->file_lock);
+				rcu_read_unlock();
 				task_unlock(p);
 				read_unlock(&tasklist_lock);
 				return 1;
 			}
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 	}
 	task_unlock(p);
 out:
@@ -101,14 +102,14 @@
 		task_lock(p);
 		files = p->files;
 		if (files) {
-			spin_lock(&files->file_lock);
+			rcu_read_lock();
 			for (i=0; i < files->max_fds; i++) {
-				if (fcheck_files(files, i) == file) {
+				if (fcheck_files_rcu(files, i) == file) {
 					found = 1;
 					break;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 		}
 		task_unlock(p);
 		if (found)
diff -ruN -X dontdiff linux-2.6.6/net/ipv6/netfilter/ip6t_owner.c files_struct-2.6.6/net/ipv6/netfilter/ip6t_owner.c
--- linux-2.6.6/net/ipv6/netfilter/ip6t_owner.c	2004-05-10 08:03:13.000000000 +0530
+++ files_struct-2.6.6/net/ipv6/netfilter/ip6t_owner.c	2004-05-14 12:53:50.000000000 +0530
@@ -11,6 +11,7 @@
 #include <linux/module.h>
 #include <linux/skbuff.h>
 #include <linux/file.h>
+#include <linux/rcupdate.h>
 #include <net/sock.h>
 
 #include <linux/netfilter_ipv6/ip6t_owner.h>
@@ -34,16 +35,16 @@
 	task_lock(p);
 	files = p->files;
 	if(files) {
-		spin_lock(&files->file_lock);
+		rcu_read_lock();
 		for (i=0; i < files->max_fds; i++) {
-			if (fcheck_files(files, i) == skb->sk->sk_socket->file) {
-				spin_unlock(&files->file_lock);
+			if (fcheck_files_rcu(files, i) == skb->sk->sk_socket->file) {
+				rcu_read_unlock();
 				task_unlock(p);
 				read_unlock(&tasklist_lock);
 				return 1;
 			}
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 	}
 	task_unlock(p);
 out:
@@ -67,14 +68,14 @@
 		task_lock(p);
 		files = p->files;
 		if (files) {
-			spin_lock(&files->file_lock);
+			rcu_read_lock();
 			for (i=0; i < files->max_fds; i++) {
-				if (fcheck_files(files, i) == file) {
+				if (fcheck_files_rcu(files, i) == file) {
 					found = 1;
 					break;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 		}
 		task_unlock(p);
 		if (found)


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-14  4:53 [RFC] Refcounting of objects part of a lockfree collection Ravikiran G Thirumalai
  2004-07-14  4:56 ` [RFC] Lock free fd lookup Ravikiran G Thirumalai
@ 2004-07-14  7:07 ` Greg KH
  2004-07-14  8:26   ` Dipankar Sarma
  2004-07-14  8:57   ` Ravikiran G Thirumalai
  1 sibling, 2 replies; 33+ messages in thread
From: Greg KH @ 2004-07-14  7:07 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: linux-kernel, dipankar

On Wed, Jul 14, 2004 at 10:23:50AM +0530, Ravikiran G Thirumalai wrote:
> 
> The attached patch provides infrastructure for refcounting of objects
> in a rcu protected collection.

This is really close to the kref implementation.  Why not just use that
instead?

Oh, and I think you need to use atomic_set() instead of initializing the
atomic_t by hand.

thanks,

greg k-h


* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-14  7:07 ` [RFC] Refcounting of objects part of a lockfree collection Greg KH
@ 2004-07-14  8:26   ` Dipankar Sarma
  2004-07-14 14:26     ` Greg KH
  2004-07-14  8:57   ` Ravikiran G Thirumalai
  1 sibling, 1 reply; 33+ messages in thread
From: Dipankar Sarma @ 2004-07-14  8:26 UTC (permalink / raw)
  To: Greg KH; +Cc: Ravikiran G Thirumalai, linux-kernel

On Wed, Jul 14, 2004 at 12:07:00AM -0700, Greg KH wrote:
> On Wed, Jul 14, 2004 at 10:23:50AM +0530, Ravikiran G Thirumalai wrote:
> > 
> > The attached patch provides infrastructure for refcounting of objects
> > in a rcu protected collection.
> 
> This is really close to the kref implementation.  Why not just use that
> instead?

Well, the kref has the same get/put race if used in a lock-free
look-up.  When you do a kref_get(), it is assumed that another
cpu will not see a 1-to-0 transition of the reference count.
If that indeed happens, ->release() will get invoked more
than once for that object, which is bad.  Kiran's patch actually
solves this fundamental lock-free ref-counting problem.

The other issue is that there are many refcounted data structures
like dentry, dst_entry, file etc. that do not use kref.  If everybody
were to use kref, we could possibly apply Kiran's lock-free extensions
to kref itself and be done with it.  Until then, we need the lock-free
refcounting support for non-kref refcounting objects.

Thanks
Dipankar


* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-14  7:07 ` [RFC] Refcounting of objects part of a lockfree collection Greg KH
  2004-07-14  8:26   ` Dipankar Sarma
@ 2004-07-14  8:57   ` Ravikiran G Thirumalai
  2004-07-14 17:08     ` Greg KH
  1 sibling, 1 reply; 33+ messages in thread
From: Ravikiran G Thirumalai @ 2004-07-14  8:57 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, dipankar

On Wed, Jul 14, 2004 at 12:07:00AM -0700, Greg KH wrote:
> On Wed, Jul 14, 2004 at 10:23:50AM +0530, Ravikiran G Thirumalai wrote:
> > 
> > The attached patch provides infrastructure for refcounting of objects
> > in a rcu protected collection.
> 
> This is really close to the kref implementation.  Why not just use that
> instead?

Close, but not the same.  I just had a quick look at krefs.
Actually, this reference count infrastructure I am proposing is not for
traditional refcounting.  It is for refcounting of elements of a list
or array (a collection) which can be 'read' in a lock-free manner.

For ex. With traditional refcounting, you can have 

1.                                  2.
add()                               search_and_reference()
{                                   {
    alloc_object                            read_lock(&list_lock);
    ...                                     search_for_element
    refcount_init(&el->rc)                  refcount_get(&el->rc);
    write_lock(&list_lock);                 ...
    add_element                             read_unlock(&list_lock);
    ...                                     ...
    write_unlock(&list_lock);       }
}

3.                                  4.
release_referenced()                delete()
{                                   {
    ...                             write_lock(&list_lock);
    if (refcount_put(&el->rc))      ...
            start_cleanup_object    ...
            free_object             delete_element
    ...                             write_unlock(&list_lock);
}                                   ...
                                    if (refcount_put(&el->rc))
                                            start_cleanup_object
                                            free_object
                                    }

add() puts the refcounted element into the system with the list_lock
taken for write; search_and_reference() takes the list_lock for read
and gets the refcount.  Now, if the list is read-mostly, we could
replace the list_lock rwlock with a spinlock, serialise updates to
the list with the spinlock held (in add()), and use rcu_read_lock()
and rcu_read_unlock() on the reader side (search_and_reference()),
replacing the read_locks in 2. above.

But with RCU, search_and_reference() could see stale elements, that is,
elements which have already been taken off the list by delete().  Using
call_rcu to free the object from release_referenced() (free_object in 3.)
will defer the freeing, so the &el memory location above is not freed
until the reader comes out of the rcu_read_lock protected code, _but_ the
search_and_reference() thread could still get a reference to a deleted
list element.  Hence, under lock-free conditions, a plain atomic_inc of
the refcount is not sufficient.
We do:

	...
	c = atomic_read(&rc->count);
	while ( c && (old = cmpxchg(&rc->count.counter, c, c+1)) != c)
		c = old;
	return c;


which is abstracted out as refcount_get_rcu()

Hence, in the example above, search_and_reference would look like

search_and_reference()
{
    rcu_read_lock();
    search_for_element
    if (!refcount_get_rcu(&el->rc)) {
            rcu_read_unlock();
            return FAIL;
    }
    ...
    ...
    rcu_read_unlock();
}
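To make this concrete, below is a minimal userspace model of the zero-detecting get described above.  GCC's __sync builtins stand in for the kernel's cmpxchg and atomic ops, and the names (refcount_init, refcount_get_rcu, refcount_put) merely mirror the proposed API; this is an illustrative sketch, not the patch itself.

```c
#include <assert.h>

typedef struct { int counter; } refcount_t;	/* atomic_t in the kernel */

static void refcount_init(refcount_t *rc)
{
	rc->counter = 1;
}

/*
 * Zero-detecting get: succeeds (returns the old nonzero count) only if
 * the count has not already dropped to zero; a dead object stays dead.
 */
static int refcount_get_rcu(refcount_t *rc)
{
	int c, old;

	c = rc->counter;
	while (c && (old = __sync_val_compare_and_swap(&rc->counter, c, c + 1)) != c)
		c = old;
	return c;
}

/* Returns nonzero when this put dropped the last reference. */
static int refcount_put(refcount_t *rc)
{
	return __sync_sub_and_fetch(&rc->counter, 1) == 0;
}
```

A lock-free search_and_reference() then simply bails out (FAIL above) whenever refcount_get_rcu() returns 0.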

Hope that clears things up.

> 
> Oh, and I think you need to use atomic_set() instead of initializing the
> atomic_t by hand.

I have used atomic_set for the case where the arch has cmpxchg.  But for
arches lacking cmpxchg, I need to use hashed spinlocks to implement
refcount_get_rcu.
No point in having more atomic operations when I hold spinlocks.  Admittedly,
it might be a bit yucky to assume atomic_t internals, but it is just one
header file :) <ducks>

Thanks,
Kiran


* Re: [RFC] Refcounting of objects part of a lockfree collection
@ 2004-07-14 10:24 Oleg Nesterov
  0 siblings, 0 replies; 33+ messages in thread
From: Oleg Nesterov @ 2004-07-14 10:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ravikiran G Thirumalai

Hello.

> #ifdef __HAVE_ARCH_CMPXCHG

in wrong place.

static inline void refcount_init(refcount_t *rc)
...
#ifdef __HAVE_ARCH_CMPXCHG
...
#else /* !__HAVE_ARCH_CMPXCHG */
...
static inline void refcount_init(refcoun_t *rc)  <--- redefinition
                                 ^^^^^^^^        <--- and typo
Oleg.


* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-14  8:26   ` Dipankar Sarma
@ 2004-07-14 14:26     ` Greg KH
  2004-07-14 15:22       ` Dipankar Sarma
  2004-07-15  6:21       ` Ravikiran G Thirumalai
  0 siblings, 2 replies; 33+ messages in thread
From: Greg KH @ 2004-07-14 14:26 UTC (permalink / raw)
  To: Dipankar Sarma; +Cc: Ravikiran G Thirumalai, linux-kernel

On Wed, Jul 14, 2004 at 01:56:22PM +0530, Dipankar Sarma wrote:
> On Wed, Jul 14, 2004 at 12:07:00AM -0700, Greg KH wrote:
> > On Wed, Jul 14, 2004 at 10:23:50AM +0530, Ravikiran G Thirumalai wrote:
> > > 
> > > The attached patch provides infrastructure for refcounting of objects
> > > in a rcu protected collection.
> > 
> > This is really close to the kref implementation.  Why not just use that
> > instead?
> 
> Well, the kref has the same get/put race if used in a lock-free
> look-up. When you do a kref_get() it is assumed that another
> cpu will not see a 1-to-0 transition of the reference count.

You mean kref_put(), right?

> If that indeed happens, ->release() will get invoked more
> than once for that object which is bad.

As kref_put() uses an atomic_t, how can that transition happen twice?

What can happen is that kref_get() and kref_put() can race if the last
kref_put() happens at the same time as a kref_get().  But that is solved
by having the caller guarantee that this cannot happen (see my 2004 OLS
paper for more info about this.)

> Kiran's patch actually solves this fundamental lock-free ref-counting
> problem.

Hm, maybe I'm missing something, but I don't see how it does so, based
on the implementation of refcount_get() and refcount_put().  Sure, the
*_rcu implementations are a bit different, but if that's the only
difference between this code and kref, why not just add the rcu versions
to the kref code?

> The other issue is that there are many refcounted data structures
> like dentry, dst_entry, file etc. that do not use kref.

At this time, sure.  But you could always change that :)
(and yes, to do so, we can always shrink the size of struct kref if
really needed...)

> If everybody were to use kref, we could possibly apply Kiran's
> lock-free extensions to kref itself and be done with it.

Ok, sounds like a plan to me.  Having 2 refcount implementations in the
kernel that work alike, yet a bit different, is not acceptable.  Please
rework struct kref to do this.

> Until then, we need the lock-free refcounting support for non-kref
> refcounting objects.

We've lived without it until now somehow :)

thanks,

greg k-h


* Re: [RFC] Lock free fd lookup
  2004-07-14  4:56 ` [RFC] Lock free fd lookup Ravikiran G Thirumalai
@ 2004-07-14 15:17   ` Chris Wright
  2004-07-15 14:22     ` Jesse Barnes
  2004-07-17 13:48     ` Peter Zijlstra
  0 siblings, 2 replies; 33+ messages in thread
From: Chris Wright @ 2004-07-14 15:17 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: linux-kernel, dipankar

* Ravikiran G Thirumalai (kiran@in.ibm.com) wrote:
> This makes use of the lockfree refcounting infrastructure (see earlier
> posting today) to make files_struct.fd[] lookup lockfree. This is carrying 
> forward work done by Maneesh and Dipankar earlier. 
> 
> With the lockfree fd lookup patch, tiobench performance increases by 13% 
> for sequential reads, 21 % for random reads on a 4 processor pIII xeon.

I'm curious, how much of the performance improvement is from RCU usage
vs. making the basic synchronization primitive aware of a reader and
writer distinction?  Do you have a benchmark for simply moving to rwlock_t?

thanks,
-chris
-- 
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net


* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-14 14:26     ` Greg KH
@ 2004-07-14 15:22       ` Dipankar Sarma
  2004-07-14 17:03         ` Greg KH
  2004-07-15  6:21       ` Ravikiran G Thirumalai
  1 sibling, 1 reply; 33+ messages in thread
From: Dipankar Sarma @ 2004-07-14 15:22 UTC (permalink / raw)
  To: Greg KH; +Cc: Ravikiran G Thirumalai, linux-kernel

On Wed, Jul 14, 2004 at 07:26:14AM -0700, Greg KH wrote:
> On Wed, Jul 14, 2004 at 01:56:22PM +0530, Dipankar Sarma wrote:
> > Well, the kref has the same get/put race if used in a lock-free
> > look-up. When you do a kref_get() it is assumed that another
> > cpu will not see a 1-to-0 transition of the reference count.
> 
> You mean kref_put(), right?

No, I meant kref_get(). See below.

> > If that indeed happens, ->release() will get invoked more
> > than once for that object which is bad.
> 
> As kref_put() uses an atomic_t, how can that transition happen twice?
> 
> What can happen is kref_get() and kref_put() can race if the last
> kref_put() happens at the same time that kref_get().  But that is solved
> by having the caller guarantee that this can not happen (see my 2004 OLS
> paper for more info about this.)

Yes, and how do the callers guarantee that ? Using a lock, right ?
What Kiran's patch does is to allow those callers to use lock-free
algorithms. Let's look at the race -

---------------------------------------------------------------
                                                                                
CPU #0                                 CPU #1
------                                 ------
                                                                                
                                 my_put() from a user who did my_lookup() ...
                                                                                
                                 [ ->count is 1 ]
                                                                                
In my_lookup() ...               atomic_dec_and_test(&m->count)
                                                                                
                                 [ ->count is 0 ]
m = my_get(my_list[i]);
                                                                                
[ ->count is 1 ]                 call_rcu(&m->head, free, m);
                                                                                
return m;
                                                                                
[This CPU can now context
 switch and allow RCU to
 proceed]
                                                                                
                                 free(m);
                                                                                
Somebody dereferences m and
invalid memory reference
---------------------------------------------------------------

This can happen if my_lookup() is lock-free.
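The difference between a plain atomic increment and the zero-detecting get can be replayed even single-threaded.  The userspace model below (GCC __sync builtins in place of atomic_inc/cmpxchg; all names are illustrative) starts from the state in the diagram, just after CPU #1's atomic_dec_and_test has taken ->count to 0:

```c
#include <assert.h>

struct obj { int count; };	/* ->count is an atomic_t in the kernel */

/* kref_get()-style unconditional increment: returns the new count. */
static int plain_get(struct obj *m)
{
	return __sync_add_and_fetch(&m->count, 1);
}

/* Zero-detecting get: refuses to resurrect an object whose count hit 0. */
static int get_rcu(struct obj *m)
{
	int c = m->count, old;

	while (c && (old = __sync_val_compare_and_swap(&m->count, c, c + 1)) != c)
		c = old;
	return c;
}
```

With plain_get(), CPU #0 happily takes the dead object from 0 back to 1 and later dereferences freed memory; with get_rcu(), the lookup notices that the 1-to-0 transition has already happened and fails, so only call_rcu()'s deferred free runs.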

> > The other issue is that there are many refcounted data structures
> > like dentry, dst_entry, file etc. that do not use kref.
> 
> At this time, sure.  But you could always change that :)
> (and yes, to do so, we can always shrink the size of struct kref if
> really needed...)

How are you going to shrink it ? You need the ->release() method
and that is a nice way for drivers to get rid of objects.

> 
> > If everybody were to use kref, we could possibly apply Kiran's
> > lock-free extensions to kref itself and be done with it.
> 
> Ok, sounds like a plan to me.  Having 2 refcount implementations in the
> kernel that work alike, yet a bit different, is not acceptable.  Please
> rework struct kref to do this.

And I suspect that Andrew would thwack me for trying to increase dentry size :)
Anyway, the summary is this - Kiran is not trying to introduce
a new refcounting API.  He is just adding lock-free support to
an existing refcounting mechanism that is used in VFS.  If kref users need to do
lock-free lookup, sure we should add it to the kref_xxx APIs also.

> > Until then, we need the lock-free refcounting support for non-kref
> > refcounting objects.
>
> We've lived without it until now somehow :)

Actually, we already use lock-free refcounting in route cache, dcache. In those
cases, we work around this race using a different algorithm.

Thanks
Dipankar


* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-14 15:22       ` Dipankar Sarma
@ 2004-07-14 17:03         ` Greg KH
  2004-07-14 17:49           ` Dipankar Sarma
  0 siblings, 1 reply; 33+ messages in thread
From: Greg KH @ 2004-07-14 17:03 UTC (permalink / raw)
  To: Dipankar Sarma; +Cc: Ravikiran G Thirumalai, linux-kernel

On Wed, Jul 14, 2004 at 08:52:35PM +0530, Dipankar Sarma wrote:
> On Wed, Jul 14, 2004 at 07:26:14AM -0700, Greg KH wrote:
> > On Wed, Jul 14, 2004 at 01:56:22PM +0530, Dipankar Sarma wrote:
> > > Well, the kref has the same get/put race if used in a lock-free
> > > look-up. When you do a kref_get() it is assumed that another
> > > cpu will not see a 1-to-0 transition of the reference count.
> > 
> > You mean kref_put(), right?
> 
> No, I meant kref_get(). See below.

Ok, we are both talking about the same thing, just from different
angles.  Yes, I agree with you about this.

> Yes, and how do the callers guarantee that ? Using a lock, right ?
> What Kiran's patch does is to allow those callers to use lock-free
> algorithms.

So modify kref to also use those algorithms.  That's all I'm asking.

> > > The other issue is that there are many refcounted data structures
> > > like dentry, dst_entry, file etc. that do not use kref.
> > 
> > At this time, sure.  But you could always change that :)
> > (and yes, to do so, we can always shrink the size of struct kref if
> > really needed...)
> 
> How are you going to shrink it ? You need the ->release() method
> and that is a nice way for drivers to get rid of objects.

struct kref can get rid of the release pointer, and require it as part
of the kref_put() call.
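A userspace sketch of that suggestion (GCC __sync builtins in place of atomic_t ops; whether kref ends up with exactly this signature is of course up to the eventual patch): struct kref shrinks to a bare counter, and the release callback becomes an argument to the put.

```c
#include <assert.h>

struct kref { int refcount; };	/* atomic_t in the kernel */

static void kref_init(struct kref *kref)
{
	kref->refcount = 1;
}

static void kref_get(struct kref *kref)
{
	__sync_add_and_fetch(&kref->refcount, 1);
}

/*
 * The release function is now a parameter instead of a struct member,
 * so every refcounted object pays one pointer less.  Returns 1 if the
 * object was released by this put.
 */
static int kref_put(struct kref *kref, void (*release)(struct kref *kref))
{
	if (__sync_sub_and_fetch(&kref->refcount, 1) == 0) {
		release(kref);
		return 1;
	}
	return 0;
}

/* Tiny release callback used below, just counting invocations. */
static int releases;
static void count_release(struct kref *kref)
{
	(void)kref;
	releases++;
}
```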

> > > If everybody were to use kref, we could possibly apply Kiran's
> > > lock-free extensions to kref itself and be done with it.
> > 
> > Ok, sounds like a plan to me.  Having 2 refcount implementations in the
> > kernel that work alike, yet a bit different, is not acceptable.  Please
> > rework struct kref to do this.
> 
> And I suspect that Andrew thwak me for trying to increase dentry size :)
> Anyway, the summary is this - Kiran is not trying to introduce
> a new refcounting API.

Yes he is.  He's calling it refcount.h, and creating a typedef called
refcount_t.  Sure looks like a new refcount API to me :)

> He is just adding lock-free support to an existing refcounting
> mechanism that is used in VFS.

If this is true, then I strongly object to the naming of this file, and
the name of the typedef (which shouldn't be a typedef at all) and this
should be made a private data structure to the vfs so no one else tries
to use it.  Otherwise it will be used.

> If kref users need to do lock-free lookup, sure we should add it to
> kref_xxx APIs also.

I still think you should just use a kref and add the apis.

> > > Until then, we need the lock-free refcounting support for non-kref
> > > refcounting objects.
> >
> > We've lived without it until now somehow :)
> 
> Actually, we already use lock-free refcounting in route cache, dcache. In those
> cases, we work around this race using a different algorithm.

I realize that.  Then, again, this shouldn't be advertised as such a
general api that everyone can use.  That's my main objection here.  If
you want to make it private to the specific file implementation in the
vfs, fine with me.

thanks,

greg k-h


* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-14  8:57   ` Ravikiran G Thirumalai
@ 2004-07-14 17:08     ` Greg KH
  2004-07-14 18:17       ` Dipankar Sarma
  2004-07-15  8:02       ` Ravikiran G Thirumalai
  0 siblings, 2 replies; 33+ messages in thread
From: Greg KH @ 2004-07-14 17:08 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: linux-kernel, dipankar

On Wed, Jul 14, 2004 at 02:27:58PM +0530, Ravikiran G Thirumalai wrote:
> On Wed, Jul 14, 2004 at 12:07:00AM -0700, Greg KH wrote:
> > On Wed, Jul 14, 2004 at 10:23:50AM +0530, Ravikiran G Thirumalai wrote:
> > > 
> > > The attached patch provides infrastructure for refcounting of objects
> > > in a rcu protected collection.
> > 
> > This is really close to the kref implementation.  Why not just use that
> > instead?
> 
> Close, but not the same.  I just had a quick look at krefs.
> Actually, this reference count infrastructure I am proposing is not for
> traditional refcounting.

But you are advertising it as such by calling it a refcount_t and
putting it in a file called refcount.h.

> > Oh, and I think you need to use atomic_set() instead of initializing the
> > atomic_t by hand.
> 
> I have used atomic_set for the case where arch has cmpxchg.  But for 
> arches lacking cmpxchg, I need to use hashed spinlocks to implement 
> the ref_count_get_rcu.
> No point in having more atomic operations when I hold spinlocks.  Admittedly,
> might be a bit yucky to assume atomic_t internals, but it is just one header
> file :) <ducks>

I still think you need to fix this; manipulating atomic_t variables by
hand is not always guaranteed to work on all arches, from what I
remember.

And what arches do not support cmpxchg?  How does this change affect the
performance of them?

thanks,

greg k-h


* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-14 17:03         ` Greg KH
@ 2004-07-14 17:49           ` Dipankar Sarma
  2004-07-14 18:03             ` Greg KH
  0 siblings, 1 reply; 33+ messages in thread
From: Dipankar Sarma @ 2004-07-14 17:49 UTC (permalink / raw)
  To: Greg KH; +Cc: Ravikiran G Thirumalai, linux-kernel

On Wed, Jul 14, 2004 at 10:03:37AM -0700, Greg KH wrote:
> On Wed, Jul 14, 2004 at 08:52:35PM +0530, Dipankar Sarma wrote:
> > And I suspect that Andrew thwak me for trying to increase dentry size :)
> > Anyway, the summary is this - Kiran is not trying to introduce
> > a new refcounting API.
> 
> Yes he is.  He's calling it refcount.h, and creating a typedef called
> refcount_t.  Sure looks like a new refcount API to me :)

OK, in terms of code, it is a new API.  Just the refcounting mechanism
is the same as the existing non-kref refcounts.

> > He is just adding lock-free support from an existing refcounting
> > mechanism that is used in VFS.
> 
> If this is true, then I strongly object to the naming of this file, and
> the name of the typedef (which shouldn't be a typedef at all) and this
> should be made a private data structure to the vfs so no one else tries
> to use it.  Otherwise it will be used.

I am reasonably sure that when that patch was done (months ago) kref wasn't
there. Now that kref.[ch] is around, everything can be put there.

Now, if struct kref is shrunk (want patch ? ;-), all this
can possibly be nicely collapsed into one set of APIs for refcounting.
There aren't many users of kref yet, so this seems like a good
time to do it. Was there any objection to shrinking it ?


Thanks
Dipankar


* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-14 17:49           ` Dipankar Sarma
@ 2004-07-14 18:03             ` Greg KH
  0 siblings, 0 replies; 33+ messages in thread
From: Greg KH @ 2004-07-14 18:03 UTC (permalink / raw)
  To: Dipankar Sarma; +Cc: Ravikiran G Thirumalai, linux-kernel

On Wed, Jul 14, 2004 at 11:19:49PM +0530, Dipankar Sarma wrote:
> On Wed, Jul 14, 2004 at 10:03:37AM -0700, Greg KH wrote:
> > On Wed, Jul 14, 2004 at 08:52:35PM +0530, Dipankar Sarma wrote:
> > > He is just adding lock-free support to an existing refcounting
> > > mechanism that is used in VFS.
> > 
> > If this is true, then I strongly object to the naming of this file, and
> > the name of the typedef (which shouldn't be a typedef at all) and this
> > should be made a private data structure to the vfs so no one else tries
> > to use it.  Otherwise it will be used.
> 
> I am reasonably sure that when that patch was done (months ago) kref wasn't
> there. Now that kref.[ch] is around, everything can be put there.

I agree.

> Now, if struct kref is shrinked (want patch ? ;-), all this 
> can possibly be nicely collapsed into one set of APIs for refcounting.

Sounds good to me.

> There aren't many users of kref yet, so this seems like a good
> time to do it. Was there any objection to shrinking it ?

None that I know of.  In fact, it's on my list of things to do, so a
patch from someone else to fix this would be greatly appreciated.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-14 17:08     ` Greg KH
@ 2004-07-14 18:17       ` Dipankar Sarma
  2004-07-15  8:02       ` Ravikiran G Thirumalai
  1 sibling, 0 replies; 33+ messages in thread
From: Dipankar Sarma @ 2004-07-14 18:17 UTC (permalink / raw)
  To: Greg KH; +Cc: Ravikiran G Thirumalai, linux-kernel

On Wed, Jul 14, 2004 at 10:08:00AM -0700, Greg KH wrote:
> On Wed, Jul 14, 2004 at 02:27:58PM +0530, Ravikiran G Thirumalai wrote:
> > might be a bit yucky to assume atomic_t internals, but it is just one header
> > file :) <ducks>
> 
> I still think you need to fix this, manipulating atomic_t variables by
> hand is not always guaranteed to work on all arches, from what I
> remember.

AFAICS, the hash-locked refcounting grabs a spin lock for all
operations on the atomic_t. Any reason why that should not be safe ?
Of course, I can't see why we can't have two versions of the
reference counter depending on __HAVE_ARCH_CMPXCHG. Kiran ?

> 
> And what arches do not support cmpxchg?  How does this change affect the
> performance of them?

mips64, smp arm ?? ;-)

With a hashed lock, it should not be all that bad on low-end SMPs.
Besides, we already use such a thing in the gettimeofday implementation
with a global lock. However, this is a valid issue, and performance numbers
from users of those arches would be useful.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-14 14:26     ` Greg KH
  2004-07-14 15:22       ` Dipankar Sarma
@ 2004-07-15  6:21       ` Ravikiran G Thirumalai
  2004-07-15  6:56         ` Dipankar Sarma
  1 sibling, 1 reply; 33+ messages in thread
From: Ravikiran G Thirumalai @ 2004-07-15  6:21 UTC (permalink / raw)
  To: Greg KH; +Cc: Dipankar Sarma, linux-kernel, oleg

On Wed, Jul 14, 2004 at 07:26:14AM -0700, Greg KH wrote:
> ... 
> As kref_put() uses a atomic_t, how can that transistion happen twice?
> 
> What can happen is kref_get() and kref_put() can race if the last
> kref_put() happens at the same time that kref_get().  But that is solved
> by having the caller guarantee that this can not happen (see my 2004 OLS
> paper for more info about this.)

The last kref_put (the one-to-zero transition) should happen only after the
refcounted object has been removed from the list, and that removal is
serialised through the traditional rw/spinlock (see my earlier example).
With traditional lock protection, if the reader holds the lock, the writer
cannot remove the element from the list, so there can be no concurrent last
put; and if the writer has already removed the element, the reader won't
even see the element when walking the list, so there is no question of
taking a reference on a deleted element.
Lockfree and rcu change all this.  Hence, we need refcount_get_rcu.

Now, it would have been nice if we did not have to use cmpxchg 
for refcount_get_rcu, or at least if all arches implemented cmpxchg in
hardware.  Since neither is true, for arches with no cmpxchg we use
the hashed spinlock method.  When we use a hashed spinlock for 
refcount_get_rcu, we need to use the same spinlock to protect the
refcount for the usual refcount_gets too.  So there is no atomic_inc for
refcount_get on arches with no cmpxchg.  
Why have refcount_get at all when you are using refcount_get_rcu for the
same refcount, one might ask.  With the fd patch, there are places where
refcount_get happens under traditional serialisation, and places where it
happens lockfree.  So it makes sense for performance's sake to have a
simple atomic_inc (refcount_get) instead of the whole cmpxchg shebang
for the same refcounter when a traditional lock is held.  Hence the
refcount_get part of the infrastructure.  The refcount infrastructure 
as a whole is _not_ meant for traditional refcounting, which kref 
accomplishes.  If it is used for traditional refcounting (i.e. refcount_get
on a refcounter is used without any refcount_get_rcus for the same
counter), arches with no cmpxchg will end up using hashed spinlocks for
simple refcount_gets :).  A note was included in the header file as
a warning:

<in the patch>
 * Note: Even though these refcount primitives can be used for lists which
 * are protected by traditional locking, if the arch doesn't have cmpxchg,
 * there might be extra penalties in refcount_get.
</>

To summarise, this infrastructure is not meant to replace kref, nor is it 
meant for traditional refcounting.  All the primitives are for lockfree
refcounting and associated operations.  The infrastructure came about 
because there is more than one application for it: files_struct is one,
fs_struct, which I am working on, is another.  There might be more later :).

As Oleg rightly pointed out, the #ifdef __HAVE_ARCH_CMPXCHG is in the wrong
place.  My bad.  Maybe this confusion wouldn't have arisen if it had been
in the right place to begin with.  Here is the corrected refcount patch.

Thanks,
Kiran


diff -ruN -X dontdiff linux-2.6.7/include/linux/refcount.h files_struct-2.6.7/include/linux/refcount.h
--- linux-2.6.7/include/linux/refcount.h	1970-01-01 05:30:00.000000000 +0530
+++ files_struct-2.6.7/include/linux/refcount.h	2004-07-15 11:13:08.000000000 +0530
@@ -0,0 +1,245 @@
+/* 
+ * Refcounter framework for elements of lists/arrays protected by
+ * RCU.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2004
+ *
+ * Author: Ravikiran Thirumalai <kiran@in.ibm.com>
+ * Credits: Dipankar Sarma <dipankar@in.ibm.com>.  refcount_get_rcu has
+ *	    been abstracted out from Dipankar's implementation of rcu based
+ *	    fd management.
+ * 
+ * Refcounting of elements of lists which are protected by traditional
+ * reader/writer spinlocks or semaphores is straightforward, as in:
+ * 
+ * 1.					2.
+ * add()				search_and_reference()
+ * {					{
+ * 	alloc_object				read_lock(&list_lock);
+ *	...					search_for_element
+ *	refcount_init(&el->rc)			refcount_get(&el->rc);	
+ *	write_lock(&list_lock);			...
+ *	add_element				read_unlock(&list_lock);
+ *	...					...
+ *	write_unlock(&list_lock);	}
+ * }					
+ *
+ * 3.					4.
+ * release_referenced()			delete()
+ * {					{
+ *	...				write_lock(&list_lock);
+ *	if (refcount_put(&el->rc))	...
+ *		start_cleanup_object	...
+ *		free_object		delete_element
+ *	...				write_unlock(&list_lock);
+ * }					...
+ *					if (refcount_put(&el->rc))
+ *						start_cleanup_object
+ *						free_object
+ *					}
+ *
+ * If this list/array is made lock-free using RCU, i.e. by changing the
+ * write_lock in add() and delete() to spin_lock and changing the read_lock
+ * in search_and_reference() to rcu_read_lock(), the refcount_get in
+ * search_and_reference() could potentially take a reference to an element
+ * which has already been deleted from the list/array.  refcount_get_rcu
+ * takes care of this scenario.  search_and_reference() should then look like:
+ * 2. 
+ * search_and_reference()
+ * {
+ *	rcu_read_lock();
+ *	search_for_element
+ *	if (!refcount_get_rcu(&el->rc)) {
+ *		rcu_read_unlock();
+ *		return FAIL;
+ *	}
+ *	...
+ *	...
+ *	rcu_read_unlock();
+ * }
+ * 
+ * Of course, free_object after refcount_put should be deferred using
+ * call_rcu.  Sometimes a reference to the element needs to be obtained
+ * in the update (write) stream.  In such cases, refcount_get_rcu might be
+ * overkill since the spinlock serialising list updates is already held;
+ * refcount_get is to be used there instead.
+ * Note: Even though these refcount primitives can be used for lists which
+ * are protected by traditional locking, if the arch doesn't have cmpxchg,
+ * there might be extra penalties in refcount_get.
+ *
+ */
+
+#ifndef _LINUX_REFCOUNT_H
+#define _LINUX_REFCOUNT_H
+#ifdef  __KERNEL__
+
+#include <asm/atomic.h>
+#include <asm/system.h>
+
+typedef struct { atomic_t count; } refcount_t;
+
+/* 
+ * Refcount primitives for lists/array elements protected by traditional
+ * locking
+ */
+
+#ifdef __HAVE_ARCH_CMPXCHG
+
+static inline void refcount_init(refcount_t *rc)	
+{
+	atomic_set(&rc->count, 1);
+}
+
+static inline int refcount_put(refcount_t *rc)
+{
+	return atomic_dec_and_test(&rc->count);
+}
+
+static inline void refcount_dec(refcount_t *rc)
+{
+	atomic_dec(&rc->count);
+}
+
+static inline int refcount_put_and_lock(refcount_t *rc, spinlock_t *lock)
+{
+	return atomic_dec_and_lock(&rc->count, lock);
+}
+
+static inline int refcount_read(refcount_t *rc)
+{
+	return atomic_read(&rc->count);
+}
+
+
+static inline void refcount_get(refcount_t *rc)
+{
+	atomic_inc(&rc->count);
+}
+
+/* 
+ * Refcount primitives for lists/array elements protected by rcu
+ */
+
+/* 
+ * cmpxchg is needed on UP too, if deletions to the list/array can happen
+ * in interrupt context.
+ */
+static inline int refcount_get_rcu(refcount_t *rc)
+{
+	int c, old;
+	c = atomic_read(&rc->count);
+	while ( c && (old = cmpxchg(&rc->count.counter, c, c+1)) != c) 
+		c = old;
+	return c;
+}
+
+#else /* !__HAVE_ARCH_CMPXCHG */
+
+/*
+ * We use an array of spinlocks for the refcounts, similar to the sparc
+ * 32-bit atomic_t implementation, with a similar hash function to pick
+ * a lock for each counter.
+ * Can't help multiprocessors which do not have cmpxchg :(
+ */
+
+#ifdef	CONFIG_SMP
+#define REFCOUNT_HASH_SIZE	4
+#define REFCOUNT_HASH(r) \
+	(&__refcount_hash[(((unsigned long)r)>>8) & (REFCOUNT_HASH_SIZE-1)])
+#else
+#define	REFCOUNT_HASH_SIZE	1
+#define REFCOUNT_HASH(r) 	&__refcount_hash[0]
+#endif /* CONFIG_SMP */
+
+extern spinlock_t __refcount_hash[REFCOUNT_HASH_SIZE];
+
+static inline void refcount_init(refcount_t *rc)	
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter = 1;
+	spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+}
+
+static inline int refcount_put(refcount_t *rc)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter--;
+	ret = !rc->count.counter;
+	spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+	return ret;
+}
+
+static inline void refcount_dec(refcount_t *rc)
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter--;
+	spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+}
+
+static inline int refcount_put_and_lock(refcount_t *rc, spinlock_t *lock)
+{
+	unsigned long flags;
+
+	spin_lock(lock);
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter--;
+	if (!rc->count.counter) {
+		spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+		return 1;	/* caller releases 'lock' after cleanup */
+	}
+	spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+	spin_unlock(lock);
+	return 0;
+}
+/* Spinlocks are not needed in this case, since atomic_read is used */
+static inline int refcount_read(refcount_t *rc)
+{
+	return atomic_read(&rc->count);
+}
+
+static inline void refcount_get(refcount_t *rc)
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter++;
+	spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+}
+
+static inline int refcount_get_rcu(refcount_t *rc)
+{
+	int ret;
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	if (rc->count.counter) 
+		ret = rc->count.counter++;
+	else
+		ret = 0;
+	spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+	return ret;
+}
+
+#endif /* __HAVE_ARCH_CMPXCHG */
+
+#endif
+#endif /* _LINUX_REFCOUNT_H */
diff -ruN -X dontdiff linux-2.6.7/kernel/rcupdate.c files_struct-2.6.7/kernel/rcupdate.c
--- linux-2.6.7/kernel/rcupdate.c	2004-06-16 10:48:38.000000000 +0530
+++ files_struct-2.6.7/kernel/rcupdate.c	2004-07-15 10:56:25.000000000 +0530
@@ -44,6 +44,7 @@
 #include <linux/notifier.h>
 #include <linux/rcupdate.h>
 #include <linux/cpu.h>
+#include <linux/refcount.h>
 
 /* Definition for rcupdate control block. */
 struct rcu_ctrlblk rcu_ctrlblk = 
@@ -325,6 +326,17 @@
 	wait_for_completion(&completion);
 }
 
+/*
+ * spinlock_t array for arches which do not have cmpxchg.  This is for
+ * the RCU-based refcounting infrastructure.
+ */
+
+#ifndef __HAVE_ARCH_CMPXCHG
+spinlock_t __refcount_hash[REFCOUNT_HASH_SIZE] = {
+	[0 ... (REFCOUNT_HASH_SIZE-1)] = SPIN_LOCK_UNLOCKED
+};
+EXPORT_SYMBOL(__refcount_hash);
+#endif
 
 EXPORT_SYMBOL(call_rcu);
 EXPORT_SYMBOL(synchronize_kernel);

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-15  6:21       ` Ravikiran G Thirumalai
@ 2004-07-15  6:56         ` Dipankar Sarma
  0 siblings, 0 replies; 33+ messages in thread
From: Dipankar Sarma @ 2004-07-15  6:56 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: Greg KH, linux-kernel, oleg

On Thu, Jul 15, 2004 at 11:51:04AM +0530, Ravikiran G Thirumalai wrote:
> Now it would have been nice if we did not have to use cmpxchg 
> for refcount_get_rcu, or atleast if all arches implemented cmpxchg in
> hardware.  Since neither is true, for arches with no cmpxchg, we use
> the hashed spinlock method.  When we use hashed spinlock for 
> refcount_get_rcu method, we need to use the same spinlock to protect the
> refcount for the usual refcount_gets too.  So no atomic_inc for
> refcount_gets on arches with no cmpxchg.  
> Now why refcount_get when you are using refcount_get_rcu for a refcount one 
> might ask. With the fd patch, there are places where the refcount_get 
> happens with traditional serialisation, and places where refcount_get 
> happens lockfree.  So it makes sense for performance sake to have a 
> simple atomic_inc (refcount_get) instead of and the whole cmpxchg shebang
> for the same refcounter when traditional locking is held.  Hence the
> refcount_get infrastructure.  The whole refcount infrastructure 
> provided is _not_ meant for traditinal refcounting ...which kref 
> accomplishes.  If it is used for traditional refcounting (i.e refcount_get
> on a refcounter is used without any refcount_get_rcus for the same
> counter), arches with nocmpxchg will end up using hashed spinlocks for
> simple refcount_gets :).  A Note was included in the header file as
> a warning:

Would this work -

kref_get - just atomic inc
kref_put - just atomic dec
kref_get_rcu - cmpxchg if the arch has it, else hashed spinlock
kref_put_rcu - atomic dec if the arch has cmpxchg, else hashed spinlock

If the object has lock-free look-up, it must use kref_get_rcu.
You can use kref_get on such an object if the look-up is being
done under the lock. Is there a situation that is not covered by this?

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-14 17:08     ` Greg KH
  2004-07-14 18:17       ` Dipankar Sarma
@ 2004-07-15  8:02       ` Ravikiran G Thirumalai
  2004-07-15  9:36         ` Dipankar Sarma
  2004-07-16 14:32         ` Greg KH
  1 sibling, 2 replies; 33+ messages in thread
From: Ravikiran G Thirumalai @ 2004-07-15  8:02 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, dipankar

On Wed, Jul 14, 2004 at 10:08:00AM -0700, Greg KH wrote:
> > ...
> > Close, but not the same.  I just had a quick look at krefs.
> > Actually, this refrerence count infrastructure I am proposing is not for 
> > traditional refcounting.
> 
> But you are advertising it as such by calling it a refcount_t and
> putting it in a file called refcount.h.

The naming is bad, I agree. But as Dipankar pointed out earlier, there
was no kref when I did this.  We (Dipankar and myself) had a discussion
and decided:
1. I will make a patch to shrink kref and feed it to Greg
2. Add a new set of kref APIs for lockfree refcounting --
	kref_lf_xxx (kref_lf_get, kref_lf_get_rcu, etc.)
3. Change the fd lookup patch to use the kref_lf_xxx API

Does that sound ok?

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-15  8:02       ` Ravikiran G Thirumalai
@ 2004-07-15  9:36         ` Dipankar Sarma
  2004-07-16 14:32         ` Greg KH
  1 sibling, 0 replies; 33+ messages in thread
From: Dipankar Sarma @ 2004-07-15  9:36 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: Greg KH, linux-kernel

On Thu, Jul 15, 2004 at 01:32:04PM +0530, Ravikiran G Thirumalai wrote:
> The naming is bad, I agree. But as Dipankar pointed out earlier, there
> was no kref when I did this.  We (Dipankar and myslef) had a discussion
> and decided:
> 1. I will make a patch to shrink kref and feed it to Greg

1.5 Convert files_struct to use kref


> 2. Add new set kref api for lockfree refcounting --
> 	kref_lf_xxx.  (kref_lf_get, kref_lf_get_rcu etc.,)
> 3. Change the fd lookup patch to use kref_lf_xxx api

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-14 15:17   ` Chris Wright
@ 2004-07-15 14:22     ` Jesse Barnes
  2004-07-15 16:10       ` Dipankar Sarma
  2004-07-16  6:27       ` William Lee Irwin III
  2004-07-17 13:48     ` Peter Zijlstra
  1 sibling, 2 replies; 33+ messages in thread
From: Jesse Barnes @ 2004-07-15 14:22 UTC (permalink / raw)
  To: Chris Wright; +Cc: Ravikiran G Thirumalai, linux-kernel, dipankar

On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote:
> * Ravikiran G Thirumalai (kiran@in.ibm.com) wrote:
> > This makes use of the lockfree refcounting infrastructure (see earlier
> > posting today) to make files_struct.fd[] lookup lockfree. This is
> > carrying forward work done by Maneesh and Dipankar earlier.
> >
> > With the lockfree fd lookup patch, tiobench performance increases by 13%
> > for sequential reads, 21 % for random reads on a 4 processor pIII xeon.
>
> I'm curious, how much of the performance improvement is from RCU usage
> vs. making the basic syncronization primitive aware of a reader and
> writer distinction?  Do you have benchmark for simply moving to rwlock_t?

That's a good point.  Also, even though the implementation may be 'lockless', 
there are still a lot of cachelines bouncing around, whether due to atomic 
counters or cmpxchg (in fact the latter will be worse than simple atomics).

It seems to me that RCU is basically rwlocks on steroids, which means that 
using it requires the same care to avoid starvation and/or other scalability 
problems (i.e. we'd better be really sure that a given codepath really should 
be using rwlocks before we change it).

Jesse

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-15 14:22     ` Jesse Barnes
@ 2004-07-15 16:10       ` Dipankar Sarma
  2004-07-15 16:22         ` Jesse Barnes
  2004-07-15 16:34         ` Chris Wright
  2004-07-16  6:27       ` William Lee Irwin III
  1 sibling, 2 replies; 33+ messages in thread
From: Dipankar Sarma @ 2004-07-15 16:10 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: Chris Wright, Ravikiran G Thirumalai, linux-kernel

On Thu, Jul 15, 2004 at 10:22:53AM -0400, Jesse Barnes wrote:
> On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote:
> > I'm curious, how much of the performance improvement is from RCU usage
> > vs. making the basic syncronization primitive aware of a reader and
> > writer distinction?  Do you have benchmark for simply moving to rwlock_t?
> 
> That's a good point.  Also, even though the implementation may be 'lockless', 
> there are still a lot of cachelines bouncing around, whether due to atomic 
> counters or cmpxchg (in fact the latter will be worse than simple atomics).

Chris raises an interesting issue. There are two ways we can benefit from
lock-free lookup - avoidance of atomic ops in lock acquisition/release
and avoidance of contention. The latter can also be provided by
rwlocks in read-mostly situations like this, but rwlock still has
two atomic ops for acquisition/release. So, in another
thread, I have suggested looking into the contention angle. IIUC,
tiobench is threaded and shares fd table. 

That said, atomic counters weren't introduced in this patch;
they were already there for refcounting. cmpxchg is costly,
but if you are replacing read_lock/atomic_inc/read_unlock with
lock-free lookup + cmpxchg, it might not be all that bad. At least
we can benchmark it and see if it is worth it. And in heavily
contended cases, unlike rwlocks, you are not going to have
starvation.

> 
> It seems to me that RCU is basically rwlocks on steroids, which means that 
> using it requires the same care to avoid starvation and/or other scalability 
> problems (i.e. we'd better be really sure that a given codepath really should 
> be using rwlocks before we change it).

Starvation is a problem with rwlocks in Linux, not with RCU.
Readers do not impede writers at all with RCU. There are other
issues with RCU that one needs to be careful about, but certainly
not this one.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-15 16:10       ` Dipankar Sarma
@ 2004-07-15 16:22         ` Jesse Barnes
  2004-07-15 16:34         ` Chris Wright
  1 sibling, 0 replies; 33+ messages in thread
From: Jesse Barnes @ 2004-07-15 16:22 UTC (permalink / raw)
  To: dipankar; +Cc: Chris Wright, Ravikiran G Thirumalai, linux-kernel

On Thursday, July 15, 2004 12:10 pm, Dipankar Sarma wrote:
> Chris raises an interesting issue. There are two ways we can benefit from
> lock-free lookup - avoidance of atomic ops in lock acquisition/release
> and avoidance of contention. The latter can also be provided by
> rwlocks in read-mostly situations like this, but rwlock still has
> two atomic ops for acquisition/release. So, in another
> thread, I have suggested looking into the contention angle. IIUC,
> tiobench is threaded and shares fd table.

I must have missed that thread...  Anyway, that's a good idea.

>
> That said, atomic counters weren't introduced in this patch,
> they are already there for refcounting. cmpxchg is costly,
> but if you are replacing read_lock/atomic_inc/read_unlock,
> lock-free + cmpxchg, it might not be all that bad.

Yeah, I didn't mean to imply that atomics were unique to this patch.

> Atleast, 
> we can benchmark it and see if it is worth it. And in heavily
> contended cases, unlike rwlocks, you are not going to have
> starvation.

Which is good.

> > It seems to me that RCU is basically rwlocks on steroids, which means
> > that using it requires the same care to avoid starvation and/or other
> > scalability problems (i.e. we'd better be really sure that a given
> > codepath really should be using rwlocks before we change it).
>
> The starvation is a problem with rwlocks in linux, not RCU. The
> reader's do not impede writers at all with RCU. There are other
> issues with RCU that one needs to be careful about, but certainly
> not this one.

That's good, I didn't think that RCU would cause starvation, but based on 
previous reading of the code it seemed like it would hurt a lot in other 
ways... but I'm definitely not an expert in that area.

Thanks,
Jesse

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-15 16:10       ` Dipankar Sarma
  2004-07-15 16:22         ` Jesse Barnes
@ 2004-07-15 16:34         ` Chris Wright
  2004-07-16  5:38           ` Ravikiran G Thirumalai
  1 sibling, 1 reply; 33+ messages in thread
From: Chris Wright @ 2004-07-15 16:34 UTC (permalink / raw)
  To: Dipankar Sarma
  Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel

* Dipankar Sarma (dipankar@in.ibm.com) wrote:
> On Thu, Jul 15, 2004 at 10:22:53AM -0400, Jesse Barnes wrote:
> > On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote:
> > > I'm curious, how much of the performance improvement is from RCU usage
> > > vs. making the basic syncronization primitive aware of a reader and
> > > writer distinction?  Do you have benchmark for simply moving to rwlock_t?
> > 
> > That's a good point.  Also, even though the implementation may be 'lockless', 
> > there are still a lot of cachelines bouncing around, whether due to atomic 
> > counters or cmpxchg (in fact the latter will be worse than simple atomics).
> 
> Chris raises an interesting issue. There are two ways we can benefit from
> lock-free lookup - avoidance of atomic ops in lock acquisition/release
> and avoidance of contention. The latter can also be provided by
> rwlocks in read-mostly situations like this, but rwlock still has
> two atomic ops for acquisition/release. So, in another
> thread, I have suggested looking into the contention angle. IIUC,
> tiobench is threaded and shares fd table. 

Given the read heavy assumption that RCU makes (supported by your
benchmarks), I believe that the comparison with RCU vs. current scheme
is unfair.  Better comparison is against rwlock_t, which may give a
similar improvement w/out the added complexity.  But, I haven't a patch
nor a benchmark, so it's all handwavy at this point.

thanks,
-chris
-- 
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-15 16:34         ` Chris Wright
@ 2004-07-16  5:38           ` Ravikiran G Thirumalai
  0 siblings, 0 replies; 33+ messages in thread
From: Ravikiran G Thirumalai @ 2004-07-16  5:38 UTC (permalink / raw)
  To: Chris Wright; +Cc: Dipankar Sarma, Jesse Barnes, linux-kernel

On Thu, Jul 15, 2004 at 09:34:08AM -0700, Chris Wright wrote:
> * Dipankar Sarma (dipankar@in.ibm.com) wrote:
> > On Thu, Jul 15, 2004 at 10:22:53AM -0400, Jesse Barnes wrote:
> > > On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote:
> > ... 
> > Chris raises an interesting issue. There are two ways we can benefit from
> > lock-free lookup - avoidance of atomic ops in lock acquisition/release
> > and avoidance of contention. The latter can also be provided by
> > rwlocks in read-mostly situations like this, but rwlock still has
> > two atomic ops for acquisition/release. So, in another
> > thread, I have suggested looking into the contention angle. IIUC,
> > tiobench is threaded and shares fd table. 
> 
> Given the read heavy assumption that RCU makes (supported by your
> benchmarks), I believe that the comparison with RCU vs. current scheme
> is unfair.  Better comparison is against rwlock_t, which may give a
> similar improvement w/out the added complexity.  But, I haven't a patch
> nor a benchmark, so it's all handwavy at this point.

It would be a good datapoint to experiment with rwlock.
But note that on x86 (my testbed right now):
1. read_lock + read_unlock + atomic_inc is 3 (bus-locking) atomic ops
2. spin_lock + spin_unlock + atomic_inc is 2 atomic ops 
   (x86 spin_unlock is just a move)
3. rcu_read_lock + rcu_read_unlock + cmpxchg is just one atomic op;
added to that, the critical section is small in fget, fget_light....

IIRC, files_struct.file_lock was an rwlock some time back
(it still is in 2.4, I think).  I am not sure why it was changed to a
spinlock.  I will try to dig through the archives, but if someone can
quickly fill in, that would be nice.

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-15 14:22     ` Jesse Barnes
  2004-07-15 16:10       ` Dipankar Sarma
@ 2004-07-16  6:27       ` William Lee Irwin III
  2004-07-17  0:55         ` Keith Owens
  1 sibling, 1 reply; 33+ messages in thread
From: William Lee Irwin III @ 2004-07-16  6:27 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: Chris Wright, Ravikiran G Thirumalai, linux-kernel, dipankar

On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote:
>> I'm curious, how much of the performance improvement is from RCU usage
>> vs. making the basic syncronization primitive aware of a reader and
>> writer distinction?  Do you have benchmark for simply moving to rwlock_t?

On Thu, Jul 15, 2004 at 10:22:53AM -0400, Jesse Barnes wrote:
> That's a good point.  Also, even though the implementation may be 'lockless', 
> there are still a lot of cachelines bouncing around, whether due to atomic 
> counters or cmpxchg (in fact the latter will be worse than simple atomics).
> It seems to me that RCU is basically rwlocks on steroids, which means that 
> using it requires the same care to avoid starvation and/or other scalability 
> problems (i.e. we'd better be really sure that a given codepath really should 
> be using rwlocks before we change it).

Primarily not for either of your benefit(s):

files->lock actually started as an rwlock and was changed to a spinlock
on the grounds that cmpxchg for rwlocks sucked on Pentium 4.

Typical rwlock implementations don't have such starvation issues.
Normally attempts at acquiring them for write exclude new readers with
various forms of bias. Linux has come to rely on an unfair
implementation by means of recursion, both directly (I hear somewhere
in the net stack and haven't looked) and via recursive acquisition in
interrupt context (for signal delivery). The unfairness does cause
problems in practice, e.g. writers starving on tasklist_lock takes out
machines running large numbers of processes.

RCU does not have such fairness issues. While it is heavily read
biased, readers do not exclude writers. Things can, however, delay
freeing of memory if tasks don't voluntarily yield for long periods
of time (e.g. involuntary context switches caused by kernel preemption)
as the freeing of the data structures potentially referenced by readers
must be delayed until all cpus have gone through a voluntary context
switch. This generally resolves itself as waiting for memory involves a
voluntary context switch. Readers merely disable preemption, and only
writers require atomicity or mutual exclusion, which is more typical in
Linux in part owing to preferred data structures. I suspect this is in
part due to its initial usage for doubly-linked lists that are never
traversed backward. RCU is actually a lockless update protocol and can
be done with atomic updates on the write side as opposed to spinlocks
around the updates. This happily doesn't involve busywait on machines
able to implement such things directly, but also depends heavily on the
data structure.

For instance, an open-addressed hash table or a 4-way associative cache
could simply update the pointers to the elements atomically. To see why
things get much harder when you try to go completely lock-free with less
suitable data structures, consider a singly-linked list. Removal would
require a cmpxchg loop inside a retry loop, temporarily marking next
pointers with some invalid value that signals accessors to retry the
read operation, plus some external method of preventing simultaneous
removals of the same element (e.g. refcounting): the predecessor's next
pointer must be what was expected during removal, the removed element's
next pointer must remain what was expected, and the updated predecessor
must be what was discovered in the list; and more still is needed for
cpus not supporting cmpxchg or similar operations. That's probably not
an entirely complete or accurate description of the headaches involved
when the data structure doesn't mesh well with lock-free atomic updates,
but the gist is that busywait-free atomic updates are only implementable
for data structures that don't link through the object, but rather
maintain an external index with a single pointer to the elements needing
updates: radix trees, B+ trees, arrays of pointers, and open-addressed
hashtables. Otherwise, spinlocks on the write side are probably better
than the completely lock-free alternative for smaller systems, though
larger systems hate sharing the cachelines.


-- wli

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-15  8:02       ` Ravikiran G Thirumalai
  2004-07-15  9:36         ` Dipankar Sarma
@ 2004-07-16 14:32         ` Greg KH
  2004-07-16 15:50           ` Ravikiran G Thirumalai
  1 sibling, 1 reply; 33+ messages in thread
From: Greg KH @ 2004-07-16 14:32 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: linux-kernel, dipankar

On Thu, Jul 15, 2004 at 01:32:04PM +0530, Ravikiran G Thirumalai wrote:
> The naming is bad, I agree. But as Dipankar pointed out earlier, there
> was no kref when I did this.

No offence meant, but lib/kref.c has been in the kernel tree since
2.6.5-rc1 which was released 4 months ago.  Did refcount.h take 4 months
to get released?  :)

> We (Dipankar and myself) had a discussion
> and decided:
> 1. I will make a patch to shrink kref and feed it to Greg
> 2. Add a new set of kref apis for lockfree refcounting --
> 	kref_lf_xxx.  (kref_lf_get, kref_lf_get_rcu etc.)

kref_*_rcu() as Dipankar noted is much nicer.

thanks,

greg k-h


* Re: [RFC] Refcounting of objects part of a lockfree collection
  2004-07-16 14:32         ` Greg KH
@ 2004-07-16 15:50           ` Ravikiran G Thirumalai
  0 siblings, 0 replies; 33+ messages in thread
From: Ravikiran G Thirumalai @ 2004-07-16 15:50 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, dipankar

On Fri, Jul 16, 2004 at 07:32:35AM -0700, Greg KH wrote:
>... 
> > We (Dipankar and myself) had a discussion
> > and decided:
> > 1. I will make a patch to shrink kref and feed it to Greg
> > 2. Add a new set of kref apis for lockfree refcounting --
> > 	kref_lf_xxx.  (kref_lf_get, kref_lf_get_rcu etc.)
> 
> kref_*_rcu() as Dipankar noted is much nicer.

Do you mean Dipankar had noted as in:

<Dipankar's earlier post>
> Would this work -
> 
> kref_get - just atomic inc
> kref_put - just atomic dec
> kref_get_rcu - cmpxchg if arch has or hashed spinlock
> kref_put_rcu - dec if has cmpxchg else hashed spinlock

> If the object has lock-free look-up, it must use kref_get_rcu.
> You can use kref_get on such an object if the look-up is being
> done with lock. Is there a situation that is not covered by this ?

</>

If that is what you want, the api cannot work well.
The problem is that I cannot use a hashed spinlock _just_ for
kref_get_rcu as Dipankar mentions above.  If I do that, the atomic_incs
in kref_get will race with the hashed-spinlock kref_get_rcu when
both have to be used on the same refcounter (I mentioned this in an
earlier post).  So I have to take the same hashed spinlock in kref_get as
well to maintain correctness -- obviously not a good thing for arches
with no cmpxchg which use krefs for non-lock-free purposes only.
Hence, the proposal now is to add kref_lf_get, kref_lf_put and
kref_lf_get_rcu to the kref api.  kref_lf_xxx is to be used whenever the
kref refcounter is used lock free in at least one place -- that is,
whenever kref_lf_get_rcu is used at least once on the struct kref.
Does that sound ok?

Thanks,
Kiran


* Re: [RFC] Lock free fd lookup
  2004-07-16  6:27       ` William Lee Irwin III
@ 2004-07-17  0:55         ` Keith Owens
  2004-07-17  1:19           ` William Lee Irwin III
  0 siblings, 1 reply; 33+ messages in thread
From: Keith Owens @ 2004-07-17  0:55 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel,
	dipankar

On Thu, 15 Jul 2004 23:27:32 -0700, 
William Lee Irwin III <wli@holomorphy.com> wrote:
>The gist of all this is that busywait-free atomic updates are only
>implementable for data structures that don't link through the object
>but rather maintain an external index with a single pointer to elements
>needing updates, like radix trees, B+ trees, arrays of pointers, and
>open-addressed hashtables.

There are algorithms for busywait-free (lock free) traversal and
concurrent update of lists and structures that contain embedded
pointers.  I once had to implement one on an O/S where the user space
to kernel wait/alarm semantics could not satisfy the timing
requirements of the calling code (don't ask).

The beauty of these lockfree algorithms on large structures is that
nothing ever stalls indefinitely.  If the underlying SMP cache hardware
is fair, everything running a lockfree list will make progress.  These
algorithms do not suffer from reader vs. writer starvation.

The downside is that the algorithms rely on an extra sequence field per
word.  They copy out the structure, modify the local copy, write back
with an update of sequence field.  Writing back detects if any other
update has invalidated the current structure, at which point the second
update is discarded and the writer redrives its update.  Readers are
guaranteed to see a consistent view of the copy, even if the master
structure is being updated at the same time as it is being read.

Bottom line, it can be done but ...

* The structure size doubles with the addition of the sequence fields.
* The hardware must support cmpxchg on the combination of the sequence
  field and the data word that it protects.  LL/SC can be used instead
  of cmpxchg if the hardware supports LL/SC.
* If the hardware only supports cmpxchg4 then the sequence field is
  restricted to 2 bytes, which increases the risk of wraparound.  If an
  update is delayed for exactly the wraparound interval then the data
  may be corrupted, that is a standard risk of small sequence fields.
* The extra copy out just to read a structure needs more memory
  bandwidth, plus local allocation for the copy.
* The internal code of the API to traverse a list containing lockfree
  structures is pretty messy, although that is hidden from the caller.
* All writers must preserve enough state to redrive their update from
  scratch if a concurrent update has changed the same structure.  Note,
  only the current structure, not the rest of the list.
* It requires type stable storage.  That is, once a data area has been
  allocated to a particular structure type, it always contains that
  structure type, even when it has been freed from the list.  Each list
  requires its own free pool, which can never be returned to the OS.

Nice research topics, but not suitable for day to day work.  I only
used the lockfree algorithms because I could not find an alternative on
that particular OS.  I am not sure that the Linux kernel has any
problems that require the additional complexity.

For some light reading, the code I implemented was based on Practical
Implementations of Non-Blocking Synchronization Primitives by Mark Moir
http://research.sun.com/scalable/Papers/moir-podc97.ps.  Note that page
6, line 6 has an error.  Replace "y := z" with "y := addr->data[i];".

I implemented a completely lockfree 2-3-4 tree with embedded pointers
using Moir's algorithm.  The implementation allowed multiple concurrent
readers and writers.  It even allowed concurrent list insert, delete
and rebalancing, while the structures on the list were being read and
updated.

2-3-4 trees are self balancing, which gave decent lookup performance.
Since red-black trees are logically equivalent to 2-3-4 trees it should
be possible to use lockfree red-black trees.  However I could not come
up with a lockfree red-black tree, mainly because a red-black insert
requires atomic changes to multiple structures.  The 2-3-4 tree only
needs an atomic update to one structure at a time.



* Re: [RFC] Lock free fd lookup
  2004-07-17  0:55         ` Keith Owens
@ 2004-07-17  1:19           ` William Lee Irwin III
  2004-07-17  2:12             ` Keith Owens
  2004-07-17  2:28             ` Keith Owens
  0 siblings, 2 replies; 33+ messages in thread
From: William Lee Irwin III @ 2004-07-17  1:19 UTC (permalink / raw)
  To: Keith Owens
  Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel,
	dipankar

On Thu, 15 Jul 2004 23:27:32 -0700, William Lee Irwin III wrote:
>> The gist of all this is that busywait-free atomic updates are only
>> implementable for data structures that don't link through the object
>> but rather maintain an external index with a single pointer to elements
>> needing updates, like radix trees, B+ trees, arrays of pointers, and
>> open-addressed hashtables.

On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
> There are algorithms for busywait-free (lock free) traversal and
> concurrent update of lists and structures that contain embedded
> pointers.  I once had to implement one on an O/S where the user space
> to kernel wait/alarm semantics could not satisfy the timing
> requirements of the calling code (don't ask).

Yes. I had in mind only RCU-based algorithms. Throwing more machinery at
the problem may get further.


On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
> The beauty of these lockfree algorithms on large structures is that
> nothing ever stalls indefinitely.  If the underlying SMP cache hardware
> is fair, everything running a lockfree list will make progress.  These
> algorithms do not suffer from reader vs. writer starvation.

That's a large assumption. NUMA hardware typically violates it.


On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
> The downside is that the algorithms rely on an extra sequence field per
> word.  They copy out the structure, modify the local copy, write back
> with an update of sequence field.  Writing back detects if any other
> update has invalidated the current structure, at which point the second
> update is discarded and the writer redrives its update.  Readers are
> guaranteed to see a consistent view of the copy, even if the master
> structure is being updated at the same time as it is being read.

I would classify this as "ticket-based".


On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
> Bottom line, it can be done but ...
> * The structure size doubles with the addition of the sequence fields.
> * The hardware must support cmpxchg on the combination of the sequence
>   field and the data word that it protects.  LL/SC can be used instead
>   of cmpxchg if the hardware supports LL/SC.
> * If the hardware only supports cmpxchg4 then the sequence field is
>   restricted to 2 bytes, which increases the risk of wraparound.  If an
>   update is delayed for exactly the wraparound interval then the data
>   may be corrupted, that is a standard risk of small sequence fields.
> * The extra copy out just to read a structure needs more memory
>   bandwidth, plus local allocation for the copy.
> * The internal code of the API to traverse a list containing lockfree
>   structures is pretty messy, although that is hidden from the caller.
> * All writers must preserve enough state to redrive their update from
>   scratch if a concurrent update has changed the same structure.  Note,
>   only the current structure, not the rest of the list.
> * It requires type stable storage.  That is, once a data area has been
>   allocated to a particular structure type, it always contains that
>   structure type, even when it has been freed from the list.  Each list
>   requires its own free pool, which can never be returned to the OS.

The last of these is particularly lethal.


On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
> Nice research topics, but not suitable for day to day work.  I only
> used the lockfree algorithms because I could not find an alternative on
> that particular OS.  I am not sure that the Linux kernel has any
> problems that require the additional complexity.
> For some light reading, the code I implemented was based on Practical
> Implementations of Non-Blocking Synchronization Primitives by Mark Moir
> http://research.sun.com/scalable/Papers/moir-podc97.ps.  Note that page
> 6, line 6 has an error.  Replace "y := z" with "y := addr->data[i];".
> I implemented a completely lockfree 2-3-4 tree with embedded pointers
> using Moir's algorithm.  The implementation allowed multiple concurrent
> readers and writers.  It even allowed concurrent list insert, delete
> and rebalancing, while the structures on the list were being read and
> updated.

Thanks. That may come in handy later.

On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
> 2-3-4 trees are self balancing, which gave decent lookup performance.
> Since red-black trees are logically equivalent to 2-3-4 trees it should
> be possible to use lockfree red-black trees.  However I could not come
> up with a lockfree red-black tree, mainly because a red-black insert
> requires atomic changes to multiple structures.  The 2-3-4 tree only
> needs atomic update to one structure at a time.

This actually appears to confirm my earlier assertion about the linkage
of the data structure. Is this conclusion what you had in mind?


-- wli


* Re: [RFC] Lock free fd lookup
  2004-07-17  1:19           ` William Lee Irwin III
@ 2004-07-17  2:12             ` Keith Owens
  2004-07-17  2:34               ` William Lee Irwin III
  2004-07-17  2:28             ` Keith Owens
  1 sibling, 1 reply; 33+ messages in thread
From: Keith Owens @ 2004-07-17  2:12 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel,
	dipankar

On Fri, 16 Jul 2004 18:19:36 -0700, 
William Lee Irwin III <wli@holomorphy.com> wrote:
>On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
>> 2-3-4 trees are self balancing, which gave decent lookup performance.
>> Since red-black trees are logically equivalent to 2-3-4 trees it should
>> be possible to use lockfree red-black trees.  However I could not come
>> up with a lockfree red-black tree, mainly because a red-black insert
>> requires atomic changes to multiple structures.  The 2-3-4 tree only
>> needs atomic update to one structure at a time.
>
>This actually appears to confirm my earlier assertion about the linkage
>of the data structure. Is this conclusion what you had in mind?

Not quite.  The 2-3-4 tree has embedded linkage, but it can be done
lockfree if you really have to.  The problem is that a single 2-3-4
list entry maps to two red-black list entries.  I could atomically
update a single 2-3-4 list entry, including its pointers, even when the
list was being read or updated by other users.  I could not work out
how to do the equivalent update when the list linkage data was split
over two red-black nodes.

The list structure is an implementation detail, the use of 2-3-4 or
red-black is completely transparent to the main code.  The main code
wants to lookup a structure from the list, to update a structure, to
insert or to delete a structure without waiting.  How the list of
structures is maintained is a problem for the internals of the API.



* Re: [RFC] Lock free fd lookup
  2004-07-17  1:19           ` William Lee Irwin III
  2004-07-17  2:12             ` Keith Owens
@ 2004-07-17  2:28             ` Keith Owens
  2004-07-17  3:16               ` William Lee Irwin III
  1 sibling, 1 reply; 33+ messages in thread
From: Keith Owens @ 2004-07-17  2:28 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel,
	dipankar

On Fri, 16 Jul 2004 18:19:36 -0700, 
William Lee Irwin III <wli@holomorphy.com> wrote:
>On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote:
>> The beauty of these lockfree algorithms on large structures is that
>> nothing ever stalls indefinitely.  If the underlying SMP cache hardware
>> is fair, everything running a lockfree list will make progress.  These
>> algorithms do not suffer from reader vs. writer starvation.
>
>That's a large assumption. NUMA hardware typically violates it.

True, which is why I mentioned it.  However I suspect that you read
something into that paragraph which was not intended.

Just reading the lockfree list and the structures on the list does not
suffer from any NUMA problems, because reading does not perform any
global updates at all.  The SMP starvation problem only kicks in when
multiple concurrent updates are being done.  Even with multiple
writers, one of the writers is guaranteed to succeed every time, so
over time all the write operations will proceed, subject to fair access
to exclusive cache lines.

Lockfree reads with Moir's algorithms require extra memory bandwidth.
In the absence of updates, all the cache lines end up in shared state.
That reduces to local memory bandwidth for the (hopefully) common case
of lots of readers and few writers.  Lockfree code is nicely suited to
the same class of problem that RCU addresses, but without the reader
vs. writer starvation problems.

Writer vs. writer starvation on NUMA is a lot harder.  I don't know of
any algorithm that handles lists with lots of concurrent updates and
also scales well on large cpus, unless the underlying hardware is fair
in its handling of exclusive cache lines.



* Re: [RFC] Lock free fd lookup
  2004-07-17  2:12             ` Keith Owens
@ 2004-07-17  2:34               ` William Lee Irwin III
  0 siblings, 0 replies; 33+ messages in thread
From: William Lee Irwin III @ 2004-07-17  2:34 UTC (permalink / raw)
  To: Keith Owens
  Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel,
	dipankar

On Fri, 16 Jul 2004 18:19:36 -0700, William Lee Irwin III wrote:
>> This actually appears to confirm my earlier assertion about the linkage
>> of the data structure. Is this conclusion what you had in mind?

On Sat, Jul 17, 2004 at 12:12:39PM +1000, Keith Owens wrote:
> Not quite.  The 2-3-4 tree has embedded linkage, but it can be done
> lockfree if you really have to.  The problem is that a single 2-3-4
> list entry maps to two red-black list entries.  I could atomically
> update a single 2-3-4 list entry, including its pointers, even when the
> list was being read or updated by other users.  I could not work out
> how to do the equivalent update when the list linkage data was split
> over two red-black nodes.

2-3 trees have external linkage just like B/B+ trees, at least as I had
envisioned external linkage. Terminological issue, I guess.


On Sat, Jul 17, 2004 at 12:12:39PM +1000, Keith Owens wrote:
> The list structure is an implementation detail, the use of 2-3-4 or
> red-black is completely transparent to the main code.  The main code
> wants to lookup a structure from the list, to update a structure, to
> insert or to delete a structure without waiting.  How the list of
> structures is maintained is a problem for the internals of the API.

That kind of genericity is tough to come by in C. I guess callbacks and
void * can do it when pressed.


-- wli


* Re: [RFC] Lock free fd lookup
  2004-07-17  2:28             ` Keith Owens
@ 2004-07-17  3:16               ` William Lee Irwin III
  0 siblings, 0 replies; 33+ messages in thread
From: William Lee Irwin III @ 2004-07-17  3:16 UTC (permalink / raw)
  To: Keith Owens
  Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel,
	dipankar

On Fri, 16 Jul 2004 18:19:36 -0700, William Lee Irwin III wrote:
>> That's a large assumption. NUMA hardware typically violates it.

On Sat, Jul 17, 2004 at 12:28:54PM +1000, Keith Owens wrote:
> True, which is why I mentioned it.  However I suspect that you read
> something into that paragraph which was not intended.

The NUMA issue is the only caveat I saw. I guess I just wanted to
mention it by name.


On Sat, Jul 17, 2004 at 12:28:54PM +1000, Keith Owens wrote:
> Just reading the lockfree list and the structures on the list does not
> suffer from any NUMA problems, because reading does not perform any
> global updates at all.  The SMP starvation problem only kicks in when
> multiple concurrent updates are being done.  Even with multiple
> writers, one of the writers is guaranteed to succeed every time, so
> over time all the write operations will proceed, subject to fair access
> to exclusive cache lines.

The only methods I can think of to repair this (basically queueing) are
not busywait-free.


On Sat, Jul 17, 2004 at 12:28:54PM +1000, Keith Owens wrote:
> Lockfree reads with Moir's algorithms require extra memory bandwidth.
> In the absence of updates, all the cache lines end up in shared state.
> That reduces to local memory bandwidth for the (hopefully) common case
> of lots of readers and few writers.  Lockfree code is nicely suited to
> the same class of problem that RCU addresses, but without the reader
> vs. writer starvation problems.

I suppose it's worth refining the starvation claim to delaying freeing
memory as opposed to readers causing writers to busywait indefinitely.


On Sat, Jul 17, 2004 at 12:28:54PM +1000, Keith Owens wrote:
> Writer vs. writer starvation on NUMA is a lot harder.  I don't know of
> any algorithm that handles lists with lots of concurrent updates and
> also scales well on large cpus, unless the underlying hardware is fair
> in its handling of exclusive cache lines.

Well, neither do I. =)


-- wli


* Re: [RFC] Lock free fd lookup
  2004-07-14 15:17   ` Chris Wright
  2004-07-15 14:22     ` Jesse Barnes
@ 2004-07-17 13:48     ` Peter Zijlstra
  1 sibling, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2004-07-17 13:48 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 403 bytes --]

Hi,

maybe I'm way off, but have you considered multi-state locks for your
problem?  I attached a draft paper, written by a former colleague and
myself, on concurrent AVL trees that use multi-state locking.  The
paper is unpublished and unfinished but might be illustrative.
Do note that those email addresses are no longer valid AFAIK; reply to
this mail to reach me or the co-author.

just my 2ct.

Peter Zijlstra


[-- Attachment #2: cavl.ps.gz --]
[-- Type: application/x-gzpostscript, Size: 40823 bytes --]


2004-07-14  4:53 [RFC] Refcounting of objects part of a lockfree collection Ravikiran G Thirumalai
2004-07-14  4:56 ` [RFC] Lock free fd lookup Ravikiran G Thirumalai
2004-07-14 15:17   ` Chris Wright
2004-07-15 14:22     ` Jesse Barnes
2004-07-15 16:10       ` Dipankar Sarma
2004-07-15 16:22         ` Jesse Barnes
2004-07-15 16:34         ` Chris Wright
2004-07-16  5:38           ` Ravikiran G Thirumalai
2004-07-16  6:27       ` William Lee Irwin III
2004-07-17  0:55         ` Keith Owens
2004-07-17  1:19           ` William Lee Irwin III
2004-07-17  2:12             ` Keith Owens
2004-07-17  2:34               ` William Lee Irwin III
2004-07-17  2:28             ` Keith Owens
2004-07-17  3:16               ` William Lee Irwin III
2004-07-17 13:48     ` Peter Zijlstra
2004-07-14  7:07 ` [RFC] Refcounting of objects part of a lockfree collection Greg KH
2004-07-14  8:26   ` Dipankar Sarma
2004-07-14 14:26     ` Greg KH
2004-07-14 15:22       ` Dipankar Sarma
2004-07-14 17:03         ` Greg KH
2004-07-14 17:49           ` Dipankar Sarma
2004-07-14 18:03             ` Greg KH
2004-07-15  6:21       ` Ravikiran G Thirumalai
2004-07-15  6:56         ` Dipankar Sarma
2004-07-14  8:57   ` Ravikiran G Thirumalai
2004-07-14 17:08     ` Greg KH
2004-07-14 18:17       ` Dipankar Sarma
2004-07-15  8:02       ` Ravikiran G Thirumalai
2004-07-15  9:36         ` Dipankar Sarma
2004-07-16 14:32         ` Greg KH
2004-07-16 15:50           ` Ravikiran G Thirumalai
  -- strict thread matches above, loose matches on Subject: below --
2004-07-14 10:24 Oleg Nesterov
