* Re: [RFC] Lock free fd lookup
@ 2004-07-17 19:17 Albert Cahalan
2004-07-29 0:14 ` William Lee Irwin III
0 siblings, 1 reply; 19+ messages in thread
From: Albert Cahalan @ 2004-07-17 19:17 UTC (permalink / raw)
To: linux-kernel mailing list; +Cc: kaos, wli, chrisw, jbarnes, kiran, dipankar
Keith Owens writes:
> Writer vs. writer starvation on NUMA is a lot harder. I don't know
> of any algorithm that handles lists with lots of concurrent updates
> and also scales well on large cpus, unless the underlying hardware
> is fair in its handling of exclusive cache lines.
How about MCS (Mellor-Crummey and Scott) locks?
Linux code:
http://oss.software.ibm.com/linux/patches/?patch_id=218
Something supposedly better:
http://user.it.uu.se/~zoranr/rh_lock/
Scott's list of 11 scalable synchronization algorithms:
http://www.cs.rochester.edu/u/scott/synchronization/pseudocode/ss.html
Scott's collection of papers and so on:
http://www.cs.rochester.edu/u/scott/synchronization/
Simply asking Scott might be a wise move. He'd likely know of anything
else that might fit the requirements. That's scott at cs.rochester.edu
^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-17 19:17 [RFC] Lock free fd lookup Albert Cahalan
@ 2004-07-29  0:14 ` William Lee Irwin III
  0 siblings, 0 replies; 19+ messages in thread
From: William Lee Irwin III @ 2004-07-29  0:14 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: linux-kernel mailing list, kaos, chrisw, jbarnes, kiran, dipankar

Keith Owens writes:
>> Writer vs. writer starvation on NUMA is a lot harder. I don't know
>> of any algorithm that handles lists with lots of concurrent updates
>> and also scales well on large cpus, unless the underlying hardware
>> is fair in its handling of exclusive cache lines.

On Sat, Jul 17, 2004 at 03:17:55PM -0400, Albert Cahalan wrote:
> How about MCS (Mellor-Crummey and Scott) locks?
> Linux code:
> http://oss.software.ibm.com/linux/patches/?patch_id=218
> Something supposedly better:
> http://user.it.uu.se/~zoranr/rh_lock/
> Scott's list of 11 scalable synchronization algorithms:
> http://www.cs.rochester.edu/u/scott/synchronization/pseudocode/ss.html
> Scott's collection of papers and so on:
> http://www.cs.rochester.edu/u/scott/synchronization/
> Simply asking Scott might be a wise move. He'd likely know of anything
> else that might fit the requirements. That's scott at cs.rochester.edu

Did anyone follow up with Scott on this?

-- wli

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC] Lock free fd lookup
@ 2004-07-17 8:50 Manfred Spraul
2004-07-17 9:30 ` William Lee Irwin III
0 siblings, 1 reply; 19+ messages in thread
From: Manfred Spraul @ 2004-07-17 8:50 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel, Keith Owens
wli wrote
>> * It requires type stable storage. That is, once a data area has been
>> allocated to a particular structure type, it always contains that
>> structure type, even when it has been freed from the list. Each list
>> requires its own free pool, which can never be returned to the OS.
>
>The last of these is particularly lethal.
>
>
It might be possible to combine such a lock-free algorithm with RCU and
then set Hugh's SLAB_DESTROY_BY_RCU: it inserts a call_rcu between
leaving the free pool and returning the page to the OS.
--
Manfred
^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] Lock free fd lookup
  2004-07-17  8:50 Manfred Spraul
@ 2004-07-17  9:30 ` William Lee Irwin III
  0 siblings, 0 replies; 19+ messages in thread
From: William Lee Irwin III @ 2004-07-17  9:30 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel, Keith Owens

At some point in the past, Keith Owens wrote:
>>> It requires type stable storage. That is, once a data area has been
>>> allocated to a particular structure type, it always contains that
>>> structure type, even when it has been freed from the list. Each list
>>> requires its own free pool, which can never be returned to the OS.

wli wrote
>> The last of these is particularly lethal.

On Sat, Jul 17, 2004 at 10:50:48AM +0200, Manfred Spraul wrote:
> It might be possible to combine such a lock free algorithms with RCU and
> then set Hugh's SLAB_DESTROY_BY_RCU: It inserts a call_rcu between
> leaving the free pool and returning the page to the OS.

At least I would prefer such a hybrid algorithm if things of this kind
were used in the kernel. From the looks of it, in such algorithms the
structure is still valid for checking of tickets after a voluntary
preemption, in this RCU hybrid not.

-- wli

^ permalink raw reply	[flat|nested] 19+ messages in thread
* [RFC] Refcounting of objects part of a lockfree collection
@ 2004-07-14 4:53 Ravikiran G Thirumalai
2004-07-14 4:56 ` [RFC] Lock free fd lookup Ravikiran G Thirumalai
0 siblings, 1 reply; 19+ messages in thread
From: Ravikiran G Thirumalai @ 2004-07-14 4:53 UTC (permalink / raw)
To: linux-kernel; +Cc: dipankar
Traditionally, refcounting is used to avoid or minimise lock holding
time for the individual objects in a collection. The collection of such
objects (lists/arrays/etc.) is protected by traditional locks such as
spinlocks or reader-writer locks. These locks protect access to the
collection data structure (walking the objects in the collection,
adding/deleting objects to/from the collection). Access to individual
objects is protected through the refcounting mechanism.
For performance, it makes sense to make access to a read-mostly
collection of objects lock-free with RCU techniques. Introducing RCU
in place of traditional locking to protect such a collection creates some
refcounting issues. These issues come up because, with RCU, unlike
traditional locking, writers and readers can execute concurrently.
The attached patch provides infrastructure for refcounting of objects
in an RCU-protected collection.
This infrastructure was a result of our attempt at making the fd lookup
lockfree. fget currently holds files_struct.file_lock to get the
struct file pointer of the requested fd from the process' fd array.
The spinlock in fget is used to serialise read access to
files_struct.fd[] and incrementing the refcount. We can save the
spin_lock and spin_unlock on the fd array lookup if a reference can be
taken on the struct file object in a safe and lock-free manner
(under rcu_read_lock/rcu_read_unlock and RCU on the update side). This
optimization should improve performance for threaded I/O-intensive workloads.
A patch that makes use of the lockfree refcounting infrastructure to do
just that follows. With the lockfree fd lookup patch, tiobench performance
increases by 13% for sequential reads and 21% for random reads on a
4-processor PIII Xeon.
Comments welcome.
Thanks,
Kiran
Signed-off-by: Ravikiran Thirumalai <kiran@in.ibm.com>
---
diff -ruN -X dontdiff linux-2.6.6/include/linux/refcount.h refcount-2.6.6/include/linux/refcount.h
--- linux-2.6.6/include/linux/refcount.h 1970-01-01 05:30:00.000000000 +0530
+++ refcount-2.6.6/include/linux/refcount.h 2004-05-24 18:31:25.523860952 +0530
@@ -0,0 +1,244 @@
+/*
+ * Refcounter framework for elements of lists/arrays protected by
+ * RCU.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2004
+ *
+ * Author: Ravikiran Thirumalai <kiran@in.ibm.com>
+ * Credits: Dipankar Sarma <dipankar@in.ibm.com>. refcount_get_rcu has
+ * been abstracted out from Dipankar's implementation of rcu based
+ * fd management.
+ *
+ * Refcounting on elements of lists which are protected by traditional
+ * reader/writer spinlocks or semaphores is straightforward, as in:
+ *
+ * 1. 2.
+ * add() search_and_reference()
+ * { {
+ * alloc_object read_lock(&list_lock);
+ * ... search_for_element
+ * refcount_init(&el->rc) refcount_get(&el->rc);
+ * write_lock(&list_lock); ...
+ * add_element read_unlock(&list_lock);
+ * ... ...
+ * write_unlock(&list_lock); }
+ * }
+ *
+ * 3. 4.
+ * release_referenced() delete()
+ * { {
+ * ... write_lock(&list_lock);
+ * if (refcount_put(&el->rc)) ...
+ * start_cleanup_object ...
+ * free_object delete_element
+ * ... write_unlock(&list_lock);
+ * } ...
+ * if (refcount_put(&el->rc))
+ * start_cleanup_object
+ * free_object
+ * }
+ *
+ * If this list/array is made lock free using rcu, i.e. by changing the
+ * write_lock in add() and delete() to spin_lock and changing read_lock
+ * in search_and_reference to rcu_read_lock(), the refcount_get in
+ * search_and_reference could potentially hold a reference to an element
+ * which has already been deleted from the list/array. refcount_get_rcu
+ * takes care of this scenario. search_and_reference should then look as:
+ * 2.
+ * search_and_reference()
+ * {
+ * rcu_read_lock();
+ * search_for_element
+ * if (!refcount_get_rcu(&el->rc)) {
+ * rcu_read_unlock();
+ * return FAIL;
+ * }
+ * ...
+ * ...
+ * rcu_read_unlock();
+ * }
+ *
+ * Of course, free_object after refcount_put should be batched using call_rcu.
+ * Sometimes a reference to the element needs to be obtained in the
+ * update (write) stream. In such cases, refcount_get_rcu might be
+ * overkill since the spinlock serialising list updates is held;
+ * refcount_get is to be used in such cases.
+ * Note: Even though these refcount primitives can be used for lists which
+ * are protected by traditional locking, if the arch doesn't have cmpxchg,
+ * there might be extra penalties in refcount_get.
+ *
+ */
+
+#ifndef _LINUX_REFCOUNT_H
+#define _LINUX_REFCOUNT_H
+#ifdef __KERNEL__
+
+#include <asm/atomic.h>
+#include <asm/system.h>
+
+typedef struct { atomic_t count; } refcount_t;
+
+#ifdef __HAVE_ARCH_CMPXCHG
+
+/*
+ * Refcount primitives for lists/array elements protected by traditional
+ * locking
+ */
+
+static inline void refcount_init(refcount_t *rc)
+{
+	atomic_set(&rc->count, 1);
+}
+
+static inline int refcount_put(refcount_t *rc)
+{
+	return atomic_dec_and_test(&rc->count);
+}
+
+static inline void refcount_dec(refcount_t *rc)
+{
+	atomic_dec(&rc->count);
+}
+
+static inline int refcount_put_and_lock(refcount_t *rc, spinlock_t *lock)
+{
+	return atomic_dec_and_lock(&rc->count, lock);
+}
+
+static inline int refcount_read(refcount_t *rc)
+{
+	return atomic_read(&rc->count);
+}
+
+static inline void refcount_get(refcount_t *rc)
+{
+	atomic_inc(&rc->count);
+}
+
+/*
+ * Refcount primitives for lists/array elements protected by rcu
+ */
+
+/*
+ * cmpxchg is needed on UP too, if deletions to the list/array can happen
+ * in interrupt context.
+ */
+static inline int refcount_get_rcu(refcount_t *rc)
+{
+	int c, old;
+	c = atomic_read(&rc->count);
+	while (c && (old = cmpxchg(&rc->count.counter, c, c+1)) != c)
+		c = old;
+	return c;
+}
+
+#else /* !__HAVE_ARCH_CMPXCHG */
+
+/*
+ * We use an array of spinlocks for the refcounts -- similar to the ones
+ * in the sparc 32-bit atomic_t implementation -- and a similar hash
+ * function to pick one.
+ * Can't help multiprocessors which do not have cmpxchg :(
+ */
+
+#ifdef CONFIG_SMP
+#define REFCOUNT_HASH_SIZE 4
+#define REFCOUNT_HASH(r) \
+ (&__refcount_hash[(((unsigned long)r)>>8) & (REFCOUNT_HASH_SIZE-1)])
+#else
+#define REFCOUNT_HASH_SIZE 1
+#define REFCOUNT_HASH(r) &__refcount_hash[0]
+#endif /* CONFIG_SMP */
+
+extern spinlock_t __refcount_hash[REFCOUNT_HASH_SIZE];
+
+static inline void refcount_init(refcount_t *rc)
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter = 1;
+	spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+}
+
+static inline int refcount_put(refcount_t *rc)
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter--;
+	if (!rc->count.counter) {
+		spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+		return 1;
+	} else {
+		spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+		return 0;
+	}
+}
+
+static inline void refcount_dec(refcount_t *rc)
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter--;
+	spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+}
+
+/* As with atomic_dec_and_lock, returns 1 with *lock held on final put */
+static inline int refcount_put_and_lock(refcount_t *rc, spinlock_t *lock)
+{
+	unsigned long flags;
+	spin_lock(lock);
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter--;
+	if (!rc->count.counter) {
+		spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+		return 1;
+	} else {
+		spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+		spin_unlock(lock);
+		return 0;
+	}
+}
+
+/* No spinlock needed here: reading the counter is atomic anyway */
+static inline int refcount_read(refcount_t *rc)
+{
+	return atomic_read(&rc->count);
+}
+
+static inline void refcount_get(refcount_t *rc)
+{
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	rc->count.counter++;
+	spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+}
+
+static inline int refcount_get_rcu(refcount_t *rc)
+{
+	int ret;
+	unsigned long flags;
+	spin_lock_irqsave(REFCOUNT_HASH(rc), flags);
+	if (rc->count.counter)
+		ret = rc->count.counter++;
+	else
+		ret = 0;
+	spin_unlock_irqrestore(REFCOUNT_HASH(rc), flags);
+	return ret;
+}
+
+#endif /* __HAVE_ARCH_CMPXCHG */
+
+#endif
+#endif /* _LINUX_REFCOUNT_H */
diff -ruN -X dontdiff linux-2.6.6/kernel/rcupdate.c refcount-2.6.6/kernel/rcupdate.c
--- linux-2.6.6/kernel/rcupdate.c 2004-05-10 08:01:58.000000000 +0530
+++ refcount-2.6.6/kernel/rcupdate.c 2004-05-24 15:41:15.000000000 +0530
@@ -44,6 +44,7 @@
#include <linux/notifier.h>
#include <linux/rcupdate.h>
#include <linux/cpu.h>
+#include <linux/refcount.h>
/* Definition for rcupdate control block. */
struct rcu_ctrlblk rcu_ctrlblk =
@@ -325,6 +326,17 @@
wait_for_completion(&completion);
}
+/*
+ * spinlock_t array for arches which do not have cmpxchg. This is for
+ * rcu-based refcounting.
+ */
+
+#ifndef __HAVE_ARCH_CMPXCHG
+spinlock_t __refcount_hash[REFCOUNT_HASH_SIZE] = {
+ [0 ... (REFCOUNT_HASH_SIZE-1)] = SPIN_LOCK_UNLOCKED
+};
+EXPORT_SYMBOL(__refcount_hash);
+#endif
EXPORT_SYMBOL(call_rcu);
EXPORT_SYMBOL(synchronize_kernel);
^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC] Lock free fd lookup
  2004-07-14  4:53 [RFC] Refcounting of objects part of a lockfree collection Ravikiran G Thirumalai
@ 2004-07-14  4:56 ` Ravikiran G Thirumalai
  2004-07-14 15:17   ` Chris Wright
  0 siblings, 1 reply; 19+ messages in thread
From: Ravikiran G Thirumalai @ 2004-07-14  4:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: dipankar

This makes use of the lockfree refcounting infrastructure (see earlier
posting today) to make files_struct.fd[] lookup lockfree.  This is
carrying forward work done by Maneesh and Dipankar earlier.  With the
lockfree fd lookup patch, tiobench performance increases by 13% for
sequential reads and 21% for random reads on a 4 processor PIII Xeon.

Comments welcome.

Thanks,
Kiran

diff -ruN -X dontdiff linux-2.6.6/arch/mips/kernel/irixioctl.c files_struct-2.6.6/arch/mips/kernel/irixioctl.c
--- linux-2.6.6/arch/mips/kernel/irixioctl.c	2004-05-10 08:02:53.000000000 +0530
+++ files_struct-2.6.6/arch/mips/kernel/irixioctl.c	2004-05-14 12:53:40.000000000 +0530
@@ -14,6 +14,7 @@
 #include <linux/syscalls.h>
 #include <linux/tty.h>
 #include <linux/file.h>
+#include <linux/rcupdate.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctl.h>
@@ -33,15 +34,15 @@
 	struct file *filp;
 	struct tty_struct *ttyp = NULL;
 
-	spin_lock(&current->files->file_lock);
-	filp = fcheck(fd);
+	rcu_read_lock();
+	filp = fcheck_rcu(fd);
 	if(filp && filp->private_data) {
 		ttyp = (struct tty_struct *) filp->private_data;
 
 		if(ttyp->magic != TTY_MAGIC)
 			ttyp = NULL;
 	}
-	spin_unlock(&current->files->file_lock);
+	rcu_read_unlock();
 	return ttyp;
 }
diff -ruN -X dontdiff linux-2.6.6/arch/sparc64/solaris/ioctl.c files_struct-2.6.6/arch/sparc64/solaris/ioctl.c
--- linux-2.6.6/arch/sparc64/solaris/ioctl.c	2004-05-10 08:02:38.000000000 +0530
+++ files_struct-2.6.6/arch/sparc64/solaris/ioctl.c	2004-05-14 12:53:40.000000000 +0530
@@ -24,6 +24,7 @@
 #include <linux/netdevice.h>
 #include <linux/mtio.h>
 #include <linux/time.h>
+#include <linux/rcupdate.h>
 #include <linux/compat.h>
 
 #include <net/sock.h>
@@ -291,15 +292,15 @@
 {
 	struct inode *ino;
 	/* I wonder which of these tests are superfluous... --patrik */
-	spin_lock(&current->files->file_lock);
+	rcu_read_lock();
 	if (! current->files->fd[fd] ||
 	    ! current->files->fd[fd]->f_dentry ||
 	    ! (ino = current->files->fd[fd]->f_dentry->d_inode) ||
 	    ! ino->i_sock) {
-		spin_unlock(&current->files->file_lock);
+		rcu_read_unlock();
 		return TBADF;
 	}
-	spin_unlock(&current->files->file_lock);
+	rcu_read_unlock();
 
 	switch (cmd & 0xff) {
 	case 109: /* SI_SOCKPARAMS */
diff -ruN -X dontdiff linux-2.6.6/arch/sparc64/solaris/timod.c files_struct-2.6.6/arch/sparc64/solaris/timod.c
--- linux-2.6.6/arch/sparc64/solaris/timod.c	2004-05-10 08:02:28.000000000 +0530
+++ files_struct-2.6.6/arch/sparc64/solaris/timod.c	2004-05-14 12:53:40.000000000 +0530
@@ -143,9 +143,14 @@
 static void timod_wake_socket(unsigned int fd)
 {
 	struct socket *sock;
+	struct file *filp;
 
 	SOLD("wakeing socket");
-	sock = SOCKET_I(current->files->fd[fd]->f_dentry->d_inode);
+	if (!(filp = fcheck(fd))) {
+		SOLD("BAD FD");
+		return;
+	}
+	sock = SOCKET_I(filp->f_dentry->d_inode);
 	wake_up_interruptible(&sock->wait);
 	read_lock(&sock->sk->sk_callback_lock);
 	if (sock->fasync_list && !test_bit(SOCK_ASYNC_WAITDATA, &sock->flags))
@@ -157,9 +162,14 @@
 static void timod_queue(unsigned int fd, struct T_primsg *it)
 {
 	struct sol_socket_struct *sock;
+	struct file *filp;
 
 	SOLD("queuing primsg");
-	sock = (struct sol_socket_struct *)current->files->fd[fd]->private_data;
+	if (!(filp = fcheck(fd))) {
+		SOLD("BAD FD");
+		return;
+	}
+	sock = (struct sol_socket_struct *)filp->private_data;
 	it->next = sock->pfirst;
 	sock->pfirst = it;
 	if (!sock->plast)
@@ -171,9 +181,14 @@
 static void timod_queue_end(unsigned int fd, struct T_primsg *it)
 {
 	struct sol_socket_struct *sock;
+	struct file *filp;
 
 	SOLD("queuing primsg at end");
-	sock = (struct sol_socket_struct *)current->files->fd[fd]->private_data;
+	if (!(filp = fcheck(fd))) {
+		SOLD("BAD FD");
+		return;
+	}
+	sock = (struct sol_socket_struct *)filp->private_data;
 	it->next = NULL;
 	if (sock->plast)
 		sock->plast->next = it;
@@ -351,7 +366,10 @@
 		(int (*)(int, unsigned long *))SYS(socketcall);
 	int (*sys_sendto)(int, void *, size_t, unsigned, struct sockaddr *, int) =
 		(int (*)(int, void *, size_t, unsigned, struct sockaddr *, int))SYS(sendto);
-	filp = current->files->fd[fd];
+
+	if (!(filp = fcheck(fd)))
+		return -EBADF;
+
 	ino = filp->f_dentry->d_inode;
 	sock = (struct sol_socket_struct *)filp->private_data;
 	SOLD("entry");
@@ -632,7 +650,10 @@
 	SOLD("entry");
 	SOLDD(("%u %p %d %p %p %d %p %d\n", fd, ctl_buf, ctl_maxlen, ctl_len,
 		data_buf, data_maxlen, data_len, *flags_p));
-	filp = current->files->fd[fd];
+
+	if (!(filp = fcheck(fd)))
+		return -EBADF;
+
 	ino = filp->f_dentry->d_inode;
 	sock = (struct sol_socket_struct *)filp->private_data;
 	SOLDD(("%p %p\n", sock->pfirst, sock->pfirst ? sock->pfirst->next : NULL));
@@ -848,7 +869,7 @@
 	lock_kernel();
 	if(fd >= NR_OPEN)
 		goto out;
-	filp = current->files->fd[fd];
+	filp = fcheck(fd);
 	if(!filp)
 		goto out;
 	ino = filp->f_dentry->d_inode;
@@ -915,7 +936,7 @@
 	lock_kernel();
 	if(fd >= NR_OPEN)
 		goto out;
-	filp = current->files->fd[fd];
+	filp = fcheck(fd);
 	if(!filp)
 		goto out;
 	ino = filp->f_dentry->d_inode;
diff -ruN -X dontdiff linux-2.6.6/drivers/char/tty_io.c files_struct-2.6.6/drivers/char/tty_io.c
--- linux-2.6.6/drivers/char/tty_io.c	2004-05-10 08:02:39.000000000 +0530
+++ files_struct-2.6.6/drivers/char/tty_io.c	2004-05-14 12:53:50.000000000 +0530
@@ -1936,9 +1936,9 @@
 		}
 		task_lock(p);
 		if (p->files) {
-			spin_lock(&p->files->file_lock);
+			rcu_read_lock();
 			for (i=0; i < p->files->max_fds; i++) {
-				filp = fcheck_files(p->files, i);
+				filp = fcheck_files_rcu(p->files, i);
 				if (!filp)
 					continue;
 				if (filp->f_op->read == tty_read &&
@@ -1950,7 +1950,7 @@
 					break;
 				}
 			}
-			spin_unlock(&p->files->file_lock);
+			rcu_read_unlock();
 		}
 		task_unlock(p);
 	}
diff -ruN -X dontdiff linux-2.6.6/drivers/net/ppp_generic.c files_struct-2.6.6/drivers/net/ppp_generic.c
--- linux-2.6.6/drivers/net/ppp_generic.c	2004-05-10 08:03:20.000000000 +0530
+++ files_struct-2.6.6/drivers/net/ppp_generic.c	2004-05-14 12:53:50.000000000 +0530
@@ -46,6 +46,7 @@
 #include <linux/rwsem.h>
 #include <linux/stddef.h>
 #include <linux/device.h>
+#include <linux/refcount.h>
 #include <net/slhc_vj.h>
 #include <asm/atomic.h>
 
@@ -525,12 +526,12 @@
 			if (file == ppp->owner)
 				ppp_shutdown_interface(ppp);
 		}
-		if (atomic_read(&file->f_count) <= 2) {
+		if (refcount_read(&file->f_count) <= 2) {
 			ppp_release(inode, file);
 			err = 0;
 		} else
 			printk(KERN_DEBUG "PPPIOCDETACH file->f_count=%d\n",
-			       atomic_read(&file->f_count));
+			       refcount_read(&file->f_count));
 		return err;
 	}
diff -ruN -X dontdiff linux-2.6.6/fs/affs/file.c files_struct-2.6.6/fs/affs/file.c
--- linux-2.6.6/fs/affs/file.c	2004-05-10 08:02:29.000000000 +0530
+++ files_struct-2.6.6/fs/affs/file.c	2004-05-14 12:53:50.000000000 +0530
@@ -62,7 +62,7 @@
 static int
 affs_file_open(struct inode *inode, struct file *filp)
 {
-	if (atomic_read(&filp->f_count) != 1)
+	if (refcount_read(&filp->f_count) != 1)
 		return 0;
 	pr_debug("AFFS: open(%d)\n", AFFS_I(inode)->i_opencnt);
 	AFFS_I(inode)->i_opencnt++;
@@ -72,7 +72,7 @@
 static int
 affs_file_release(struct inode *inode, struct file *filp)
 {
-	if (atomic_read(&filp->f_count) != 0)
+	if (refcount_read(&filp->f_count) != 0)
 		return 0;
 	pr_debug("AFFS: release(%d)\n", AFFS_I(inode)->i_opencnt);
 	AFFS_I(inode)->i_opencnt--;
diff -ruN -X dontdiff linux-2.6.6/fs/aio.c files_struct-2.6.6/fs/aio.c
--- linux-2.6.6/fs/aio.c	2004-05-10 08:02:29.000000000 +0530
+++ files_struct-2.6.6/fs/aio.c	2004-05-14 12:53:50.000000000 +0530
@@ -480,7 +480,7 @@
 static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 {
 	dprintk(KERN_DEBUG "aio_put(%p): f_count=%d\n",
-		req, atomic_read(&req->ki_filp->f_count));
+		req, refcount_read(&req->ki_filp->f_count));
 
 	req->ki_users --;
 	if (unlikely(req->ki_users < 0))
@@ -494,7 +494,7 @@
 	/* Must be done under the lock to serialise against cancellation.
	 * Call this aio_fput as it duplicates fput via the fput_work.
	 */
-	if (unlikely(atomic_dec_and_test(&req->ki_filp->f_count))) {
+	if (unlikely(refcount_put(&req->ki_filp->f_count))) {
 		get_ioctx(ctx);
 		spin_lock(&fput_lock);
 		list_add(&req->ki_list, &fput_head);
diff -ruN -X dontdiff linux-2.6.6/fs/fcntl.c files_struct-2.6.6/fs/fcntl.c
--- linux-2.6.6/fs/fcntl.c	2004-05-10 08:02:53.000000000 +0530
+++ files_struct-2.6.6/fs/fcntl.c	2004-05-14 12:53:50.000000000 +0530
@@ -34,9 +34,9 @@
 {
 	struct files_struct *files = current->files;
 	int res;
-	spin_lock(&files->file_lock);
+	rcu_read_lock();
 	res = FD_ISSET(fd, files->close_on_exec);
-	spin_unlock(&files->file_lock);
+	rcu_read_unlock();
 	return res;
 }
@@ -138,7 +138,7 @@
 	if (fd >= 0) {
 		FD_SET(fd, files->open_fds);
 		FD_CLR(fd, files->close_on_exec);
-		spin_unlock(&files->file_lock);
+		spin_unlock(&files->file_lock);
 		fd_install(fd, file);
 	} else {
 		spin_unlock(&files->file_lock);
@@ -183,6 +183,7 @@
 		goto out_fput;
 
 	files->fd[newfd] = file;
+	wmb();
 	FD_SET(newfd, files->open_fds);
 	FD_CLR(newfd, files->close_on_exec);
 	spin_unlock(&files->file_lock);
diff -ruN -X dontdiff linux-2.6.6/fs/file.c files_struct-2.6.6/fs/file.c
--- linux-2.6.6/fs/file.c	2004-05-10 08:02:00.000000000 +0530
+++ files_struct-2.6.6/fs/file.c	2004-05-14 12:53:50.000000000 +0530
@@ -14,7 +14,20 @@
 #include <linux/file.h>
 
 #include <asm/bitops.h>
+#include <linux/rcupdate.h>
 
+struct rcu_fd_array {
+	struct rcu_head rh;
+	struct file **array;
+	int nfds;
+};
+
+struct rcu_fd_set {
+	struct rcu_head rh;
+	fd_set *openset;
+	fd_set *execset;
+	int nfds;
+};
 
 /*
 * Allocate an fd array, using kmalloc or vmalloc.
@@ -49,6 +62,13 @@
 		vfree(array);
 }
 
+static void fd_array_callback(struct rcu_head *rcu)
+{
+	struct rcu_fd_array *a = container_of(rcu, struct rcu_fd_array, rh);
+	free_fd_array(a->array, a->nfds);
+	kfree(a);
+}
+
 /*
 * Expand the fd array in the files_struct.  Called with the files
 * spinlock held for write.
@@ -56,8 +76,9 @@
 int expand_fd_array(struct files_struct *files, int nr)
 {
-	struct file **new_fds;
-	int error, nfds;
+	struct file **new_fds = NULL;
+	int error, nfds = 0;
+	struct rcu_fd_array *arg = NULL;
 
 	error = -EMFILE;
@@ -89,18 +110,17 @@
 
 	error = -ENOMEM;
 	new_fds = alloc_fd_array(nfds);
+	arg = (struct rcu_fd_array *) kmalloc(sizeof(*arg), GFP_KERNEL);
+
 	spin_lock(&files->file_lock);
-	if (!new_fds)
+	if (!new_fds || !arg)
 		goto out;
 
 	/* Copy the existing array and install the new pointer */
 
 	if (nfds > files->max_fds) {
-		struct file **old_fds;
-		int i;
-
-		old_fds = xchg(&files->fd, new_fds);
-		i = xchg(&files->max_fds, nfds);
+		struct file **old_fds = files->fd;
+		int i = files->max_fds;
 
 		/* Don't copy/clear the array if we are creating a new
 		   fd array for fork() */
@@ -109,19 +129,35 @@
 
 			/* clear the remainder of the array */
 			memset(&new_fds[i], 0, (nfds-i) * sizeof(struct file *));
-
-			spin_unlock(&files->file_lock);
-			free_fd_array(old_fds, i);
-			spin_lock(&files->file_lock);
		}
+
+		smp_wmb();
+		files->fd = new_fds;
+		smp_wmb();
+		files->max_fds = nfds;
+
+		if (i) {
+			arg->array = old_fds;
+			arg->nfds = i;
+			call_rcu(&arg->rh, (void (*)(void *))fd_array_callback,
+					&arg->rh);
+		} else
+			kfree(arg);
	} else {
		/* Somebody expanded the array while we slept ... */
		spin_unlock(&files->file_lock);
		free_fd_array(new_fds, nfds);
+		kfree(arg);
		spin_lock(&files->file_lock);
	}
-	error = 0;
+
+	return 0;
 out:
+	if (new_fds)
+		free_fd_array(new_fds, nfds);
+	if (arg)
+		kfree(arg);
+
	return error;
 }
@@ -153,6 +189,14 @@
		vfree(array);
 }
 
+static void fd_set_callback (struct rcu_head *rcu)
+{
+	struct rcu_fd_set *a = container_of(rcu, struct rcu_fd_set, rh);
+	free_fdset(a->openset, a->nfds);
+	free_fdset(a->execset, a->nfds);
+	kfree(a);
+}
+
 /*
 * Expand the fdset in the files_struct.  Called with the files spinlock
 * held for write.
@@ -161,6 +205,7 @@
 {
	fd_set *new_openset = 0, *new_execset = 0;
	int error, nfds = 0;
+	struct rcu_fd_set *arg = NULL;
 
	error = -EMFILE;
	if (files->max_fdset >= NR_OPEN || nr >= NR_OPEN)
@@ -183,35 +228,43 @@
	error = -ENOMEM;
	new_openset = alloc_fdset(nfds);
	new_execset = alloc_fdset(nfds);
+	arg = (struct rcu_fd_set *) kmalloc(sizeof(*arg), GFP_ATOMIC);
	spin_lock(&files->file_lock);
-	if (!new_openset || !new_execset)
+	if (!new_openset || !new_execset || !arg)
		goto out;
 
	error = 0;
	
	/* Copy the existing tables and install the new pointers */
	if (nfds > files->max_fdset) {
-		int i = files->max_fdset / (sizeof(unsigned long) * 8);
-		int count = (nfds - files->max_fdset) / 8;
+		fd_set *old_openset = files->open_fds;
+		fd_set *old_execset = files->close_on_exec;
+		int old_nfds = files->max_fdset;
+		int i = old_nfds / (sizeof(unsigned long) * 8);
+		int count = (nfds - old_nfds) / 8;
		
		/* 
		 * Don't copy the entire array if the current fdset is
		 * not yet initialised.  
		 */
		if (i) {
-			memcpy (new_openset, files->open_fds, files->max_fdset/8);
-			memcpy (new_execset, files->close_on_exec, files->max_fdset/8);
+			memcpy (new_openset, old_openset, old_nfds/8);
+			memcpy (new_execset, old_execset, old_nfds/8);
			memset (&new_openset->fds_bits[i], 0, count);
			memset (&new_execset->fds_bits[i], 0, count);
		}
		
-		nfds = xchg(&files->max_fdset, nfds);
-		new_openset = xchg(&files->open_fds, new_openset);
-		new_execset = xchg(&files->close_on_exec, new_execset);
-		spin_unlock(&files->file_lock);
-		free_fdset (new_openset, nfds);
-		free_fdset (new_execset, nfds);
-		spin_lock(&files->file_lock);
+		smp_wmb();
+		files->open_fds = new_openset;
+		files->close_on_exec = new_execset;
+		smp_wmb();
+		files->max_fdset = nfds;
+
+		arg->openset = old_openset;
+		arg->execset = old_execset;
+		arg->nfds = nfds;
+		call_rcu(&arg->rh, (void (*)(void *))fd_set_callback, &arg->rh);
+		return 0;
	}
 
	/* Somebody expanded the array while we slept ... */
@@ -222,6 +275,8 @@
		free_fdset(new_openset, nfds);
	if (new_execset)
		free_fdset(new_execset, nfds);
+	if (arg)
+		kfree(arg);
	spin_lock(&files->file_lock);
	return error;
 }
diff -ruN -X dontdiff linux-2.6.6/fs/file_table.c files_struct-2.6.6/fs/file_table.c
--- linux-2.6.6/fs/file_table.c	2004-05-10 08:01:59.000000000 +0530
+++ files_struct-2.6.6/fs/file_table.c	2004-05-14 14:05:22.000000000 +0530
@@ -14,6 +14,7 @@
 #include <linux/fs.h>
 #include <linux/security.h>
 #include <linux/eventpoll.h>
+#include <linux/rcupdate.h>
 #include <linux/mount.h>
 #include <linux/cdev.h>
 
@@ -54,11 +55,17 @@
	spin_unlock_irqrestore(&filp_count_lock, flags);
 }
 
-static inline void file_free(struct file *f)
+static inline void file_free_rcu(void *arg)
 {
+	struct file *f = arg;
	kmem_cache_free(filp_cachep, f);
 }
 
+static inline void file_free(struct file *f)
+{
+	call_rcu(&f->f_rcuhead, file_free_rcu, f);
+}
+
 /* Find an unused file structure and return a pointer to it.
 * Returns NULL, if there are no more free file structures or
 * we run out of memory.
@@ -81,7 +88,7 @@
		goto fail;
	}
	eventpoll_init_file(f);
-	atomic_set(&f->f_count, 1);
+	refcount_init(&f->f_count);
	f->f_uid = current->fsuid;
	f->f_gid = current->fsgid;
	f->f_owner.lock = RW_LOCK_UNLOCKED;
@@ -118,7 +125,7 @@
	eventpoll_init_file(filp);
	filp->f_flags = flags;
	filp->f_mode = (flags+1) & O_ACCMODE;
-	atomic_set(&filp->f_count, 1);
+	refcount_init(&filp->f_count);
	filp->f_dentry = dentry;
	filp->f_mapping = dentry->d_inode->i_mapping;
	filp->f_uid = current->fsuid;
@@ -154,7 +161,7 @@
 
 void fastcall fput(struct file *file)
 {
-	if (atomic_dec_and_test(&file->f_count))
+	if (refcount_put(&file->f_count))
		__fput(file);
 }
@@ -197,11 +204,16 @@
	struct file *file;
	struct files_struct *files = current->files;
 
-	spin_lock(&files->file_lock);
-	file = fcheck_files(files, fd);
-	if (file)
-		get_file(file);
-	spin_unlock(&files->file_lock);
+	rcu_read_lock();
+	file = fcheck_files_rcu(files, fd);
+	if (file) {
+		if (!refcount_get_rcu(&file->f_count)) {
+			/* File object ref couldn't be taken */
+			rcu_read_unlock();
+			return NULL;
+		}
+	}
+	rcu_read_unlock();
	return file;
 }
@@ -223,21 +235,25 @@
	if (likely((atomic_read(&files->count) == 1))) {
		file = fcheck_files(files, fd);
	} else {
-		spin_lock(&files->file_lock);
-		file = fcheck_files(files, fd);
+		rcu_read_lock();
+		file = fcheck_files_rcu(files, fd);
		if (file) {
-			get_file(file);
-			*fput_needed = 1;
+			if (refcount_get_rcu(&file->f_count))
+				*fput_needed = 1;
+			else
+				/* Didn't get the reference, someone's freed */
+				file = NULL;
		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
	}
+
	return file;
 }
 
 void put_filp(struct file *file)
 {
-	if (atomic_dec_and_test(&file->f_count)) {
+	if (refcount_put(&file->f_count)) {
		security_file_free(file);
		file_kill(file);
		file_free(file);
diff -ruN -X dontdiff linux-2.6.6/fs/open.c files_struct-2.6.6/fs/open.c
--- linux-2.6.6/fs/open.c	2004-05-10 08:01:59.000000000 +0530
+++ files_struct-2.6.6/fs/open.c	2004-05-14 12:53:50.000000000 +0530
@@ -1028,7 +1028,9 @@
	filp = files->fd[fd];
	if (!filp)
		goto out_unlock;
-	files->fd[fd] = NULL;
+	files->fd[fd] = NULL;
+	/* Need to make it consistent with open_fds in __put_unused_fd() */
+	smp_wmb();
	FD_CLR(fd, files->close_on_exec);
	__put_unused_fd(files, fd);
	spin_unlock(&files->file_lock);
diff -ruN -X dontdiff linux-2.6.6/fs/proc/base.c files_struct-2.6.6/fs/proc/base.c
--- linux-2.6.6/fs/proc/base.c	2004-05-10 08:02:52.000000000 +0530
+++ files_struct-2.6.6/fs/proc/base.c	2004-05-14 12:53:50.000000000 +0530
@@ -28,6 +28,7 @@
 #include <linux/namespace.h>
 #include <linux/mm.h>
 #include <linux/smp_lock.h>
+#include <linux/rcupdate.h>
 #include <linux/kallsyms.h>
 #include <linux/mount.h>
 #include <linux/security.h>
@@ -191,16 +192,16 @@
	files = get_files_struct(task);
	if (files) {
-		spin_lock(&files->file_lock);
-		file = fcheck_files(files, fd);
+		rcu_read_lock();
+		file = fcheck_files_rcu(files, fd);
		if (file) {
			*mnt = mntget(file->f_vfsmnt);
			*dentry = dget(file->f_dentry);
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
			put_files_struct(files);
			return 0;
		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
		put_files_struct(files);
	}
	return -ENOENT;
@@ -805,15 +806,15 @@
	files = get_files_struct(p);
	if (!files)
		goto out;
-	spin_lock(&files->file_lock);
+	rcu_read_lock();
	for (fd = filp->f_pos-2;
	     fd < files->max_fds;
	     fd++, filp->f_pos++) {
		unsigned int i,j;
 
-		if (!fcheck_files(files, fd))
+		if (!fcheck_files_rcu(files, fd))
			continue;
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 
		j = NUMBUF;
		i = fd;
@@ -825,12 +826,12 @@
		ino = fake_ino(tid, PROC_TID_FD_DIR + fd);
		if (filldir(dirent, buf+j, NUMBUF-j, fd+2, ino, DT_LNK) < 0) {
-			spin_lock(&files->file_lock);
+			rcu_read_lock();
			break;
		}
-		spin_lock(&files->file_lock);
+		rcu_read_lock();
	}
-	spin_unlock(&files->file_lock);
+	rcu_read_unlock();
	put_files_struct(files);
 out:
@@ -1003,9 +1004,9 @@
	files = get_files_struct(task);
	if (files) {
-		spin_lock(&files->file_lock);
-		if (fcheck_files(files, fd)) {
-			spin_unlock(&files->file_lock);
+		rcu_read_lock();
+		if (fcheck_files_rcu(files, fd)) {
+			rcu_read_unlock();
			put_files_struct(files);
			if (task_dumpable(task)) {
				inode->i_uid = task->euid;
@@ -1017,7 +1018,7 @@
			security_task_to_inode(task, inode);
			return 1;
		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
		put_files_struct(files);
	}
	d_drop(dentry);
@@ -1109,15 +1110,15 @@
	if (!files)
		goto out_unlock;
	inode->i_mode = S_IFLNK;
-	spin_lock(&files->file_lock);
-	file = fcheck_files(files, fd);
+	rcu_read_lock();
+	file = fcheck_files_rcu(files, fd);
	if (!file)
		goto out_unlock2;
	if (file->f_mode & 1)
		inode->i_mode |= S_IRUSR | S_IXUSR;
	if (file->f_mode & 2)
		inode->i_mode |= S_IWUSR | S_IXUSR;
-	spin_unlock(&files->file_lock);
+	rcu_read_unlock();
	put_files_struct(files);
	inode->i_op = &proc_pid_link_inode_operations;
	inode->i_size = 64;
@@ -1127,7 +1128,7 @@
	return NULL;
 
 out_unlock2:
-	spin_unlock(&files->file_lock);
+	rcu_read_unlock();
	put_files_struct(files);
 out_unlock:
	iput(inode);
diff -ruN -X dontdiff linux-2.6.6/fs/select.c files_struct-2.6.6/fs/select.c
--- linux-2.6.6/fs/select.c	2004-05-10 08:01:59.000000000 +0530
+++ files_struct-2.6.6/fs/select.c	2004-05-14 12:53:50.000000000 +0530
@@ -21,6 +21,7 @@
 #include <linux/personality.h> /* for STICKY_TIMEOUTS */
 #include <linux/file.h>
 #include <linux/fs.h>
+#include <linux/rcupdate.h>
 
 #include <asm/uaccess.h>
@@ -131,13 +132,16 @@
 static int max_select_fd(unsigned long n, fd_set_bits *fds)
 {
	unsigned long *open_fds;
+	fd_set *open_fdset;
	unsigned long set;
	int max;
 
	/* handle last in-complete long-word first */
	set = ~(~0UL << (n & (__NFDBITS-1)));
	n /= __NFDBITS;
-	open_fds = current->files->open_fds->fds_bits+n;
+	open_fdset = current->files->open_fds;
+	smp_read_barrier_depends();
+	open_fds = open_fdset->fds_bits+n;
	max = 0;
	if (set) {
		set &= BITS(fds, n);
@@ -184,9 +188,9 @@
	int retval, i;
	long __timeout = *timeout;
 
-	spin_lock(&current->files->file_lock);
+	rcu_read_lock();
	retval = max_select_fd(n, fds);
spin_unlock(¤t->files->file_lock); + rcu_read_unlock(); if (retval < 0) return retval; diff -ruN -X dontdiff linux-2.6.6/include/linux/file.h files_struct-2.6.6/include/linux/file.h --- linux-2.6.6/include/linux/file.h 2004-05-10 08:03:22.000000000 +0530 +++ files_struct-2.6.6/include/linux/file.h 2004-05-14 12:53:50.000000000 +0530 @@ -9,6 +9,7 @@ #include <linux/posix_types.h> #include <linux/compiler.h> #include <linux/spinlock.h> +#include <linux/rcupdate.h> /* * The default fd array needs to be at least BITS_PER_LONG, @@ -69,10 +70,26 @@ return file; } +/* Need Proper read barriers if fd_array is being indexed lockfree */ +static inline struct file * fcheck_files_rcu(struct files_struct *files, unsigned int fd) +{ + struct file * file = NULL; + + if (fd < files->max_fds) { + struct file ** fd_array; + smp_rmb(); + fd_array = files->fd; + smp_read_barrier_depends(); + file = fd_array[fd]; + } + return file; +} + /* * Check whether the specified fd has an open file. */ #define fcheck(fd) fcheck_files(current->files, fd) +#define fcheck_rcu(fd) fcheck_files_rcu(current->files, fd) extern void FASTCALL(fd_install(unsigned int fd, struct file * file)); diff -ruN -X dontdiff linux-2.6.6/include/linux/fs.h files_struct-2.6.6/include/linux/fs.h --- linux-2.6.6/include/linux/fs.h 2004-05-10 08:02:26.000000000 +0530 +++ files_struct-2.6.6/include/linux/fs.h 2004-05-14 12:53:50.000000000 +0530 @@ -19,6 +19,7 @@ #include <linux/cache.h> #include <linux/radix-tree.h> #include <linux/kobject.h> +#include <linux/refcount.h> #include <asm/atomic.h> #include <linux/audit.h> @@ -555,7 +556,7 @@ struct dentry *f_dentry; struct vfsmount *f_vfsmnt; struct file_operations *f_op; - atomic_t f_count; + refcount_t f_count; unsigned int f_flags; mode_t f_mode; loff_t f_pos; @@ -576,13 +577,14 @@ spinlock_t f_ep_lock; #endif /* #ifdef CONFIG_EPOLL */ struct address_space *f_mapping; + struct rcu_head f_rcuhead; }; extern spinlock_t files_lock; #define file_list_lock() 
spin_lock(&files_lock); #define file_list_unlock() spin_unlock(&files_lock); -#define get_file(x) atomic_inc(&(x)->f_count) -#define file_count(x) atomic_read(&(x)->f_count) +#define get_file(x) refcount_get(&(x)->f_count) +#define file_count(x) refcount_read(&(x)->f_count) /* Initialize and open a private file and allocate its security structure. */ extern int open_private_file(struct file *, struct dentry *, int); diff -ruN -X dontdiff linux-2.6.6/kernel/fork.c files_struct-2.6.6/kernel/fork.c --- linux-2.6.6/kernel/fork.c 2004-05-10 08:02:00.000000000 +0530 +++ files_struct-2.6.6/kernel/fork.c 2004-05-14 12:53:50.000000000 +0530 @@ -30,6 +30,7 @@ #include <linux/syscalls.h> #include <linux/jiffies.h> #include <linux/futex.h> +#include <linux/rcupdate.h> #include <linux/ptrace.h> #include <linux/mount.h> #include <linux/audit.h> @@ -627,10 +628,12 @@ static int count_open_files(struct files_struct *files, int size) { int i; + fd_set *open_fds = files->open_fds; + smp_read_barrier_depends(); /* Find the last open fd */ for (i = size/(8*sizeof(long)); i > 0; ) { - if (files->open_fds->fds_bits[--i]) + if (open_fds->fds_bits[--i]) break; } i = (i+1) * 8 * sizeof(long); @@ -660,6 +663,14 @@ * This works because we cache current->files (old) as oldf. Don't * break this. 
*/ + + /* + * We don't have oldf readlock, but even if the old fdset gets + * grown now, we'll only copy upto "size" fds + */ + size = oldf->max_fdset; + smp_rmb(); + tsk->files = NULL; error = -ENOMEM; newf = kmem_cache_alloc(files_cachep, SLAB_KERNEL); @@ -676,9 +687,6 @@ newf->open_fds = &newf->open_fds_init; newf->fd = &newf->fd_array[0]; - /* We don't yet have the oldf readlock, but even if the old - fdset gets grown now, we'll only copy up to "size" fds */ - size = oldf->max_fdset; if (size > __FD_SETSIZE) { newf->max_fdset = 0; spin_lock(&newf->file_lock); @@ -687,7 +695,7 @@ if (error) goto out_release; } - spin_lock(&oldf->file_lock); + rcu_read_lock(); open_files = count_open_files(oldf, size); @@ -698,7 +706,7 @@ */ nfds = NR_OPEN_DEFAULT; if (open_files > nfds) { - spin_unlock(&oldf->file_lock); + rcu_read_unlock(); newf->max_fds = 0; spin_lock(&newf->file_lock); error = expand_fd_array(newf, open_files-1); @@ -706,10 +714,11 @@ if (error) goto out_release; nfds = newf->max_fds; - spin_lock(&oldf->file_lock); + rcu_read_lock(); } old_fds = oldf->fd; + smp_read_barrier_depends(); new_fds = newf->fd; memcpy(newf->open_fds->fds_bits, oldf->open_fds->fds_bits, open_files/8); @@ -721,7 +730,7 @@ get_file(f); *new_fds++ = f; } - spin_unlock(&oldf->file_lock); + rcu_read_unlock(); /* compute the remainder to be cleared */ size = (newf->max_fds - open_files) * sizeof(struct file *); diff -ruN -X dontdiff linux-2.6.6/net/ipv4/netfilter/ipt_owner.c files_struct-2.6.6/net/ipv4/netfilter/ipt_owner.c --- linux-2.6.6/net/ipv4/netfilter/ipt_owner.c 2004-05-10 08:03:20.000000000 +0530 +++ files_struct-2.6.6/net/ipv4/netfilter/ipt_owner.c 2004-05-14 12:53:50.000000000 +0530 @@ -11,6 +11,7 @@ #include <linux/module.h> #include <linux/skbuff.h> #include <linux/file.h> +#include <linux/rcupdate.h> #include <net/sock.h> #include <linux/netfilter_ipv4/ipt_owner.h> @@ -35,17 +36,17 @@ task_lock(p); files = p->files; if(files) { - spin_lock(&files->file_lock); + 
rcu_read_lock(); for (i=0; i < files->max_fds; i++) { - if (fcheck_files(files, i) == + if (fcheck_files_rcu(files, i) == skb->sk->sk_socket->file) { - spin_unlock(&files->file_lock); + rcu_read_unlock(); task_unlock(p); read_unlock(&tasklist_lock); return 1; } } - spin_unlock(&files->file_lock); + rcu_read_unlock(); } task_unlock(p); } while_each_thread(g, p); @@ -67,17 +68,17 @@ task_lock(p); files = p->files; if(files) { - spin_lock(&files->file_lock); + rcu_read_lock(); for (i=0; i < files->max_fds; i++) { - if (fcheck_files(files, i) == + if (fcheck_files_rcu(files, i) == skb->sk->sk_socket->file) { - spin_unlock(&files->file_lock); + rcu_read_unlock(); task_unlock(p); read_unlock(&tasklist_lock); return 1; } } - spin_unlock(&files->file_lock); + rcu_read_unlock(); } task_unlock(p); out: @@ -101,14 +102,14 @@ task_lock(p); files = p->files; if (files) { - spin_lock(&files->file_lock); + rcu_read_lock(); for (i=0; i < files->max_fds; i++) { - if (fcheck_files(files, i) == file) { + if (fcheck_files_rcu(files, i) == file) { found = 1; break; } } - spin_unlock(&files->file_lock); + rcu_read_unlock(); } task_unlock(p); if (found) diff -ruN -X dontdiff linux-2.6.6/net/ipv6/netfilter/ip6t_owner.c files_struct-2.6.6/net/ipv6/netfilter/ip6t_owner.c --- linux-2.6.6/net/ipv6/netfilter/ip6t_owner.c 2004-05-10 08:03:13.000000000 +0530 +++ files_struct-2.6.6/net/ipv6/netfilter/ip6t_owner.c 2004-05-14 12:53:50.000000000 +0530 @@ -11,6 +11,7 @@ #include <linux/module.h> #include <linux/skbuff.h> #include <linux/file.h> +#include <linux/rcupdate.h> #include <net/sock.h> #include <linux/netfilter_ipv6/ip6t_owner.h> @@ -34,16 +35,16 @@ task_lock(p); files = p->files; if(files) { - spin_lock(&files->file_lock); + rcu_read_lock(); for (i=0; i < files->max_fds; i++) { - if (fcheck_files(files, i) == skb->sk->sk_socket->file) { - spin_unlock(&files->file_lock); + if (fcheck_files_rcu(files, i) == skb->sk->sk_socket->file) { + rcu_read_unlock(); task_unlock(p); 
read_unlock(&tasklist_lock); return 1; } } - spin_unlock(&files->file_lock); + rcu_read_unlock(); } task_unlock(p); out: @@ -67,14 +68,14 @@ task_lock(p); files = p->files; if (files) { - spin_lock(&files->file_lock); + rcu_read_lock(); for (i=0; i < files->max_fds; i++) { - if (fcheck_files(files, i) == file) { + if (fcheck_files_rcu(files, i) == file) { found = 1; break; } } - spin_unlock(&files->file_lock); + rcu_read_unlock(); } task_unlock(p); if (found) ^ permalink raw reply [flat|nested] 19+ messages in thread
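The fast path the patch above introduces hinges on one semantic detail: taking a reference must fail if the count has already dropped to zero, because an RCU reader can race with the final fput() of a file being freed. Below is a hedged userspace sketch of that semantic only — the names refcount_get_rcu and refcount_put mirror the patch's helpers, but this model uses C11 atomics and is not the kernel's actual refcount infrastructure:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the refcount_get_rcu()/refcount_put()
 * semantics the patch relies on.  The key property: a get must FAIL if
 * the count has already reached zero, because an RCU-protected lookup
 * can race with the final fput() and must not resurrect a dying file. */
typedef struct { atomic_int count; } refcount_t;

static bool refcount_get_rcu(refcount_t *r)
{
    int old = atomic_load(&r->count);
    do {
        if (old == 0)
            return false; /* free already in flight; lookup must fail */
    } while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
    return true;
}

static bool refcount_put(refcount_t *r)
{
    /* true when the caller dropped the last reference (caller frees) */
    return atomic_fetch_sub(&r->count, 1) == 1;
}
```

In the patch's fget(), a failed refcount_get_rcu() simply makes the lookup return NULL, exactly as if the fd were already closed.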
* Re: [RFC] Lock free fd lookup 2004-07-14 4:56 ` [RFC] Lock free fd lookup Ravikiran G Thirumalai @ 2004-07-14 15:17 ` Chris Wright 2004-07-15 14:22 ` Jesse Barnes 2004-07-17 13:48 ` Peter Zijlstra 0 siblings, 2 replies; 19+ messages in thread From: Chris Wright @ 2004-07-14 15:17 UTC (permalink / raw) To: Ravikiran G Thirumalai; +Cc: linux-kernel, dipankar * Ravikiran G Thirumalai (kiran@in.ibm.com) wrote: > This makes use of the lockfree refcounting infrastructure (see earlier > posting today) to make files_struct.fd[] lookup lockfree. This is carrying > forward work done by Maneesh and Dipankar earlier. > > With the lockfree fd lookup patch, tiobench performance increases by 13% > for sequential reads, 21 % for random reads on a 4 processor pIII xeon. I'm curious, how much of the performance improvement is from RCU usage vs. making the basic synchronization primitive aware of a reader and writer distinction? Do you have a benchmark for simply moving to rwlock_t? thanks, -chris -- Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] Lock free fd lookup 2004-07-14 15:17 ` Chris Wright @ 2004-07-15 14:22 ` Jesse Barnes 2004-07-15 16:10 ` Dipankar Sarma 2004-07-16 6:27 ` William Lee Irwin III 2004-07-17 13:48 ` Peter Zijlstra 1 sibling, 2 replies; 19+ messages in thread From: Jesse Barnes @ 2004-07-15 14:22 UTC (permalink / raw) To: Chris Wright; +Cc: Ravikiran G Thirumalai, linux-kernel, dipankar On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote: > * Ravikiran G Thirumalai (kiran@in.ibm.com) wrote: > > This makes use of the lockfree refcounting infrastructure (see earlier > > posting today) to make files_struct.fd[] lookup lockfree. This is > > carrying forward work done by Maneesh and Dipankar earlier. > > > > With the lockfree fd lookup patch, tiobench performance increases by 13% > > for sequential reads, 21 % for random reads on a 4 processor pIII xeon. > > I'm curious, how much of the performance improvement is from RCU usage > vs. making the basic syncronization primitive aware of a reader and > writer distinction? Do you have benchmark for simply moving to rwlock_t? That's a good point. Also, even though the implementation may be 'lockless', there are still a lot of cachelines bouncing around, whether due to atomic counters or cmpxchg (in fact the latter will be worse than simple atomics). It seems to me that RCU is basically rwlocks on steroids, which means that using it requires the same care to avoid starvation and/or other scalability problems (i.e. we'd better be really sure that a given codepath really should be using rwlocks before we change it). Jesse ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] Lock free fd lookup 2004-07-15 14:22 ` Jesse Barnes @ 2004-07-15 16:10 ` Dipankar Sarma 2004-07-15 16:22 ` Jesse Barnes 2004-07-15 16:34 ` Chris Wright 1 sibling, 2 replies; 19+ messages in thread From: Dipankar Sarma @ 2004-07-15 16:10 UTC (permalink / raw) To: Jesse Barnes; +Cc: Chris Wright, Ravikiran G Thirumalai, linux-kernel On Thu, Jul 15, 2004 at 10:22:53AM -0400, Jesse Barnes wrote: > On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote: > > I'm curious, how much of the performance improvement is from RCU usage > > vs. making the basic syncronization primitive aware of a reader and > > writer distinction? Do you have benchmark for simply moving to rwlock_t? > > That's a good point. Also, even though the implementation may be 'lockless', > there are still a lot of cachelines bouncing around, whether due to atomic > counters or cmpxchg (in fact the latter will be worse than simple atomics). Chris raises an interesting issue. There are two ways we can benefit from lock-free lookup - avoidance of atomic ops in lock acquisition/release and avoidance of contention. The latter can also be provided by rwlocks in read-mostly situations like this, but rwlock still has two atomic ops for acquisition/release. So, in another thread, I have suggested looking into the contention angle. IIUC, tiobench is threaded and shares the fd table. That said, atomic counters weren't introduced in this patch, they are already there for refcounting. cmpxchg is costly, but if you are replacing read_lock/atomic_inc/read_unlock, lock-free + cmpxchg, it might not be all that bad. At least, we can benchmark it and see if it is worth it. And in heavily contended cases, unlike rwlocks, you are not going to have starvation. > > It seems to me that RCU is basically rwlocks on steroids, which means that > using it requires the same care to avoid starvation and/or other scalability > problems (i.e.
we'd better be really sure that a given codepath really should > be using rwlocks before we change it). The starvation is a problem with rwlocks in Linux, not RCU. The readers do not impede writers at all with RCU. There are other issues with RCU that one needs to be careful about, but certainly not this one. Thanks Dipankar ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] Lock free fd lookup 2004-07-15 16:10 ` Dipankar Sarma @ 2004-07-15 16:22 ` Jesse Barnes 2004-07-15 16:34 ` Chris Wright 1 sibling, 0 replies; 19+ messages in thread From: Jesse Barnes @ 2004-07-15 16:22 UTC (permalink / raw) To: dipankar; +Cc: Chris Wright, Ravikiran G Thirumalai, linux-kernel On Thursday, July 15, 2004 12:10 pm, Dipankar Sarma wrote: > Chris raises an interesting issue. There are two ways we can benefit from > lock-free lookup - avoidance of atomic ops in lock acquisition/release > and avoidance of contention. The latter can also be provided by > rwlocks in read-mostly situations like this, but rwlock still has > two atomic ops for acquisition/release. So, in another > thread, I have suggested looking into the contention angle. IIUC, > tiobench is threaded and shares fd table. I must have missed that thread... Anyway, that's a good idea. > > That said, atomic counters weren't introduced in this patch, > they are already there for refcounting. cmpxchg is costly, > but if you are replacing read_lock/atomic_inc/read_unlock, > lock-free + cmpxchg, it might not be all that bad. Yeah, I didn't mean to imply that atomics were unique to this patch. > Atleast, > we can benchmark it and see if it is worth it. And in heavily > contended cases, unlike rwlocks, you are not going to have > starvation. Which is good. > > It seems to me that RCU is basically rwlocks on steroids, which means > > that using it requires the same care to avoid starvation and/or other > > scalability problems (i.e. we'd better be really sure that a given > > codepath really should be using rwlocks before we change it). > > The starvation is a problem with rwlocks in linux, not RCU. The > reader's do not impede writers at all with RCU. There are other > issues with RCU that one needs to be careful about, but certainly > not this one. 
That's good, I didn't think that RCU would cause starvation, but based on previous reading of the code it seemed like it would hurt a lot in other ways... but I'm definitely not an expert in that area. Thanks, Jesse ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] Lock free fd lookup 2004-07-15 16:10 ` Dipankar Sarma 2004-07-15 16:22 ` Jesse Barnes @ 2004-07-15 16:34 ` Chris Wright 2004-07-16 5:38 ` Ravikiran G Thirumalai 1 sibling, 1 reply; 19+ messages in thread From: Chris Wright @ 2004-07-15 16:34 UTC (permalink / raw) To: Dipankar Sarma Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel * Dipankar Sarma (dipankar@in.ibm.com) wrote: > On Thu, Jul 15, 2004 at 10:22:53AM -0400, Jesse Barnes wrote: > > On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote: > > > I'm curious, how much of the performance improvement is from RCU usage > > > vs. making the basic syncronization primitive aware of a reader and > > > writer distinction? Do you have benchmark for simply moving to rwlock_t? > > > > That's a good point. Also, even though the implementation may be 'lockless', > > there are still a lot of cachelines bouncing around, whether due to atomic > > counters or cmpxchg (in fact the latter will be worse than simple atomics). > > Chris raises an interesting issue. There are two ways we can benefit from > lock-free lookup - avoidance of atomic ops in lock acquisition/release > and avoidance of contention. The latter can also be provided by > rwlocks in read-mostly situations like this, but rwlock still has > two atomic ops for acquisition/release. So, in another > thread, I have suggested looking into the contention angle. IIUC, > tiobench is threaded and shares fd table. Given the read heavy assumption that RCU makes (supported by your benchmarks), I believe that the comparison with RCU vs. current scheme is unfair. Better comparison is against rwlock_t, which may give a similar improvement w/out the added complexity. But, I haven't a patch nor a benchmark, so it's all handwavy at this point. thanks, -chris -- Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] Lock free fd lookup 2004-07-15 16:34 ` Chris Wright @ 2004-07-16 5:38 ` Ravikiran G Thirumalai 0 siblings, 0 replies; 19+ messages in thread From: Ravikiran G Thirumalai @ 2004-07-16 5:38 UTC (permalink / raw) To: Chris Wright; +Cc: Dipankar Sarma, Jesse Barnes, linux-kernel On Thu, Jul 15, 2004 at 09:34:08AM -0700, Chris Wright wrote: > * Dipankar Sarma (dipankar@in.ibm.com) wrote: > > On Thu, Jul 15, 2004 at 10:22:53AM -0400, Jesse Barnes wrote: > > > On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote: > > ... > > Chris raises an interesting issue. There are two ways we can benefit from > > lock-free lookup - avoidance of atomic ops in lock acquisition/release > > and avoidance of contention. The latter can also be provided by > > rwlocks in read-mostly situations like this, but rwlock still has > > two atomic ops for acquisition/release. So, in another > > thread, I have suggested looking into the contention angle. IIUC, > > tiobench is threaded and shares fd table. > > Given the read heavy assumption that RCU makes (supported by your > benchmarks), I believe that the comparison with RCU vs. current scheme > is unfair. Better comparison is against rwlock_t, which may give a > similar improvement w/out the added complexity. But, I haven't a patch > nor a benchmark, so it's all handwavy at this point. It would be a good datapoint to experiment with rwlock. But note that on x86 (my testbed right now) 1. read_lock + read_unlock + atomic_inc will be 3 (bus locking) atomic ops 2. spin_lock + spin_unlock + atomic_inc will be 2 atomic ops (x86 spin_unlock is just a move) 3. rcu_read_lock, rcu_read_unlock + cmpxchg is just one atomic op, and on top of that the critical section in fget and fget_light is small. IIRC, the files_struct.file_lock was a rwlock sometime back. (it still is in 2.4, I think) I am not sure why it was changed to spinlocks. I will try to dig through the archives, but if someone can quickly fill in that would be nice.
Thanks, Kiran ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] Lock free fd lookup 2004-07-15 14:22 ` Jesse Barnes 2004-07-15 16:10 ` Dipankar Sarma @ 2004-07-16 6:27 ` William Lee Irwin III 2004-07-17 0:55 ` Keith Owens 1 sibling, 1 reply; 19+ messages in thread From: William Lee Irwin III @ 2004-07-16 6:27 UTC (permalink / raw) To: Jesse Barnes; +Cc: Chris Wright, Ravikiran G Thirumalai, linux-kernel, dipankar On Wednesday, July 14, 2004 11:17 am, Chris Wright wrote: >> I'm curious, how much of the performance improvement is from RCU usage >> vs. making the basic syncronization primitive aware of a reader and >> writer distinction? Do you have benchmark for simply moving to rwlock_t? On Thu, Jul 15, 2004 at 10:22:53AM -0400, Jesse Barnes wrote: > That's a good point. Also, even though the implementation may be 'lockless', > there are still a lot of cachelines bouncing around, whether due to atomic > counters or cmpxchg (in fact the latter will be worse than simple atomics). > It seems to me that RCU is basically rwlocks on steroids, which means that > using it requires the same care to avoid starvation and/or other scalability > problems (i.e. we'd better be really sure that a given codepath really should > be using rwlocks before we change it). Primarily not for either of your benefit(s): files->lock actually started as an rwlock and was changed to a spinlock on the grounds that cmpxchg for rwlocks sucked on Pentium 4. Typical rwlock implementations don't have such starvation issues. Normally attempts at acquiring them for write exclude new readers with various forms of bias. Linux has come to rely on an unfair implementation by means of recursion, both directly (I hear somewhere in the net stack and haven't looked) and via recursive acquisition in interrupt context (for signal delivery). The unfairness does cause problems in practice, e.g. writers starving on tasklist_lock takes out machines running large numbers of processes. RCU does not have such fairness issues. 
While it is heavily read biased, readers do not exclude writers. Things can, however, delay freeing of memory if tasks don't voluntarily yield for long periods of time (e.g. involuntary context switches caused by kernel preemption) as the freeing of the data structures potentially referenced by readers must be delayed until all cpus have gone through a voluntary context switch. This generally resolves itself as waiting for memory involves a voluntary context switch. Readers merely disable preemption, and only writers require atomicity or mutual exclusion, which is more typical in Linux in part owing to preferred data structures. I suspect this is in part due to its initial usage for doubly-linked lists that are never traversed backward. RCU is actually a lockless update protocol and can be done with atomic updates on the write side as opposed to spinlocks around the updates. This happily doesn't involve busywait on machines able to implement such things directly, but also depends heavily on the data structure. For instance, an open-addressed hash table or a 4-way associative cache could simply atomically update the pointers to the elements. To see why things get much harder when you try to go completely lockfree with less suitable data structures, a singly-linked list would require a cmpxchg loop inside a retry loop, temporarily marking next pointers with some invalid value that signals accessors to retry the read operation, and some external method of preventing simultaneous removals of the same element (e.g. 
refcounting), as the next pointer of the predecessor must be what was expected during removal, the removed element's next pointer must remain what was expected, and the updated predecessor must be what was discovered in the list, and more still for cpus not supporting cmpxchg or similar operations, though that's probably not an entirely complete or accurate description of the headaches involved when the data structure doesn't mesh well with lock-free atomic updates. The gist of all this is that busywait-free atomic updates are only implementable for data structures that don't link through the object but rather maintain an external index with a single pointer to elements needing updates, like radix trees, B+ trees, arrays of pointers, and open-addressed hashtables. Otherwise, spinlocks on the write side are probably better than the completely lock-free alternative for smaller systems, though larger systems hate sharing the cachelines. -- wli ^ permalink raw reply [flat|nested] 19+ messages in thread
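The "external index" structures described above — arrays of pointers, radix trees, open-addressed hashtables — share the property that an update touches exactly one pointer, so it can be published atomically. A minimal userspace sketch of that pattern follows; the kernel analogues are rcu_assign_pointer() and rcu_dereference(), while the slot array and item type here are invented for illustration:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Sketch of the "external index" update model: an array of pointers in
 * which each update replaces exactly one slot.  The release store that
 * publishes the pointer guarantees readers never see a half-initialized
 * element; the kernel analogues are rcu_assign_pointer() on the write
 * side and rcu_dereference() on the read side. */
struct item { int key; int val; };

#define NSLOTS 16
static _Atomic(struct item *) slots[NSLOTS]; /* all NULL initially */

static void publish(int slot, int key, int val)
{
    struct item *it = malloc(sizeof(*it));
    it->key = key;
    it->val = val;
    /* release: the fields become globally visible before the pointer */
    atomic_store_explicit(&slots[slot], it, memory_order_release);
    /* a real RCU writer would also defer freeing any displaced old
     * element until all readers have passed a quiescent state */
}

static struct item *lookup(int slot)
{
    /* pairs with the release store in publish() */
    return atomic_load_explicit(&slots[slot], memory_order_acquire);
}
```

Readers never take a lock: they either see the old pointer or the new one, each pointing at a fully initialized element.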
* Re: [RFC] Lock free fd lookup 2004-07-16 6:27 ` William Lee Irwin III @ 2004-07-17 0:55 ` Keith Owens 2004-07-17 1:19 ` William Lee Irwin III 0 siblings, 1 reply; 19+ messages in thread From: Keith Owens @ 2004-07-17 0:55 UTC (permalink / raw) To: William Lee Irwin III Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel, dipankar On Thu, 15 Jul 2004 23:27:32 -0700, William Lee Irwin III <wli@holomorphy.com> wrote: >The gist of all this is that busywait-free atomic updates are only >implementable for data structures that don't link through the object >but rather maintain an external index with a single pointer to elements >needing updates, like radix trees, B+ trees, arrays of pointers, and >open-addressed hashtables. There are algorithms for busywait-free (lock free) traversal and concurrent update of lists and structures that contain embedded pointers. I once had to implement one on an O/S where the user space to kernel wait/alarm semantics could not satisfy the timing requirements of the calling code (don't ask). The beauty of these lockfree algorithms on large structures is that nothing ever stalls indefinitely. If the underlying SMP cache hardware is fair, everything running a lockfree list will make progress. These algorithms do not suffer from reader vs. writer starvation. The downside is that the algorithms rely on an extra sequence field per word. They copy out the structure, modify the local copy, write back with an update of sequence field. Writing back detects if any other update has invalidated the current structure, at which point the second update is discarded and the writer redrives its update. Readers are guaranteed to see a consistent view of the copy, even if the master structure is being updated at the same time as it is being read. Bottom line, it can be done but ... * The structure size doubles with the addition of the sequence fields. 
* The hardware must support cmpxchg on the combination of the sequence field and the data word that it protects. LL/SC can be used instead of cmpxchg if the hardware supports LL/SC. * If the hardware only supports cmpxchg4 then the sequence field is restricted to 2 bytes, which increases the risk of wraparound. If an update is delayed for exactly the wraparound interval then the data may be corrupted, that is a standard risk of small sequence fields. * The extra copy out just to read a structure needs more memory bandwidth, plus local allocation for the copy. * The internal code of the API to traverse a list containing lockfree structures is pretty messy, although that is hidden from the caller. * All writers must preserve enough state to redrive their update from scratch if a concurrent update has changed the same structure. Note, only the current structure, not the rest of the list. * It requires type stable storage. That is, once a data area has been allocated to a particular structure type, it always contains that structure type, even when it has been freed from the list. Each list requires its own free pool, which can never be returned to the OS. Nice research topics, but not suitable for day to day work. I only used the lockfree algorithms because I could not find an alternative on that particular OS. I am not sure that the Linux kernel has any problems that require the additional complexity. For some light reading, the code I implemented was based on Practical Implementations of Non-Blocking Synchronization Primitives by Mark Moir http://research.sun.com/scalable/Papers/moir-podc97.ps. Note that page 6, line 6 has an error. Replace "y := z" with "y := addr->data[i];". I implemented a completely lockfree 2-3-4 tree with embedded pointers using Moir's algorithm. The implementation allowed multiple concurrent readers and writers. It even allowed concurrent list insert, delete and rebalancing, while the structures on the list were being read and updated.
2-3-4 trees are self balancing, which gave decent lookup performance. Since red-black trees are logically equivalent to 2-3-4 trees, it should be possible to use lockfree red-black trees. However, I could not come up with a lockfree red-black tree, mainly because a red-black insert requires atomic changes to multiple structures. The 2-3-4 tree only needs atomic update to one structure at a time. ^ permalink raw reply [flat|nested] 19+ messages in thread
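The per-word sequence scheme described in the message above can be illustrated in miniature: pair each data word with a sequence number, update both in a single wide compare-and-swap, and treat a failed attempt as "a concurrent update won, redrive from scratch". A toy C11 sketch under stated assumptions — the packing of sequence and data into one 64-bit word, and the names, are illustrative, not Moir's actual pseudocode:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy single-word version of the sequence-field scheme: a 32-bit
 * sequence number and a 32-bit data word share one 64-bit cell and are
 * updated together by one compare-and-swap.  A writer whose snapshot
 * went stale loses the CAS (or the expected-data check) and must
 * redrive its update from scratch. */
typedef struct { uint32_t seq; uint32_t data; } seqword_t;

static _Atomic uint64_t cell; /* starts as { seq = 0, data = 0 } */

static uint64_t pack(seqword_t w)
{
    return ((uint64_t)w.seq << 32) | w.data;
}

static seqword_t unpack(uint64_t v)
{
    return (seqword_t){ (uint32_t)(v >> 32), (uint32_t)v };
}

/* copy out, modify locally, write back with a sequence bump;
 * returns false when a concurrent update won and we must redrive */
static bool try_update(uint32_t expected_data, uint32_t new_data)
{
    uint64_t old = atomic_load(&cell);
    seqword_t w = unpack(old);

    if (w.data != expected_data)
        return false;
    seqword_t nw = { w.seq + 1, new_data };
    return atomic_compare_exchange_strong(&cell, &old, pack(nw));
}
```

The sequence bump is what makes the scheme ABA-safe: even if the data word returns to an old value, the CAS still fails because the sequence half has moved on — which is also why small sequence fields carry the wraparound risk noted above.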
* Re: [RFC] Lock free fd lookup 2004-07-17 0:55 ` Keith Owens @ 2004-07-17 1:19 ` William Lee Irwin III 2004-07-17 2:12 ` Keith Owens 2004-07-17 2:28 ` Keith Owens 0 siblings, 2 replies; 19+ messages in thread From: William Lee Irwin III @ 2004-07-17 1:19 UTC (permalink / raw) To: Keith Owens Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel, dipankar On Thu, 15 Jul 2004 23:27:32 -0700, William Lee Irwin III wrote: >> The gist of all this is that busywait-free atomic updates are only >> implementable for data structures that don't link through the object >> but rather maintain an external index with a single pointer to elements >> needing updates, like radix trees, B+ trees, arrays of pointers, and >> open-addressed hashtables. On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote: > There are algorithms for busywait-free (lock free) traversal and > concurrent update of lists and structures that contain embedded > pointers. I once had to implement one on an O/S where the user space > to kernel wait/alarm semantics could not satisfy the timing > requirements of the calling code (don't ask). Yes. I had in mind only RCU-based algorithms. Throwing more machinery at the problem may get further. On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote: > The beauty of these lockfree algorithms on large structures is that > nothing ever stalls indefinitely. If the underlying SMP cache hardware > is fair, everything running a lockfree list will make progress. These > algorithms do not suffer from reader vs. writer starvation. That's a large assumption. NUMA hardware typically violates it. On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote: > The downside is that the algorithms rely on an extra sequence field per > word. They copy out the structure, modify the local copy, write back > with an update of sequence field. 
Writing back detects if any other > update has invalidated the current structure, at which point the second > update is discarded and the writer redrives its update. Readers are > guaranteed to see a consistent view of the copy, even if the master > structure is being updated at the same time as it is being read. I would classify this as "ticket-based". On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote: > Bottom line, it can be done but ... > * The structure size doubles with the addition of the sequence fields. > * The hardware must support cmpxchg on the combination of the sequence > field and the data word that it protects. LL/SC can be used instead > of cmpxchg if the hardware supports LL/SC. > * If the hardware only supports cmpxchg4 then the sequence field is > restricted to 2 bytes, which increases the risk of wraparound. If an > update is delayed for exactly the wraparound interval then the data > may be corrupted, that is a standard risk of small sequence fields. > * The extra copy out just to read a structure needs more memory > bandwidth, plus local allocation for the copy. > * The internal code of the API to traverse a list containing lockfree > structures is pretty messy, although that is hidden from the caller. > * All writers must preserve enough state to redrive their update from > scratch if a concurrent update has changed the same structure. Note, > only the curent structure, not the rest of the list. > * It requires type stable storage. That is, once a data area has been > allocated to a particular structure type, it always contains that > structure type, even when it has been freed from the list. Each list > requires its own free pool, which can never be returned to the OS. The last of these is particularly lethal. On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote: > Nice research topics, but not suitable for day to day work. 
I only > used the lockfree algorithms because I could not find an alternative on > that particular OS. I am not sure that the Linux kernel has any > problems that require the additional complexity. > For some light reading, the code I implemented was based on Practical > Implementations of Non-Blocking Synchronization Primitives by Mark Moir > http://research.sun.com/scalable/Papers/moir-podc97.ps. Note that page > 6, line 6 has an error. Replace "y := z" with "y := addr->data[i];". > I implemented a completely lockfree 2-3-4 tree with embedded pointers > using Moir's algorithm. The implementation allowed multiple concurrent > readers and writers. It even allowed concurrent list insert, delete > and rebalancing, while the structures on the tree were being read and > updated. Thanks. That may come in handy later. On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote: > 2-3-4 trees are self-balancing, which gave decent lookup performance. > Since red-black trees are logically equivalent to 2-3-4 trees it should > be possible to use lockfree red-black trees. However I could not come > up with a lockfree red-black tree, mainly because a red-black insert > requires atomic changes to multiple structures. The 2-3-4 tree only > needs atomic update to one structure at a time. This actually appears to confirm my earlier assertion about the linkage of the data structure. Is this conclusion what you had in mind? -- wli ^ permalink raw reply [flat|nested] 19+ messages in thread
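The copy-out / write-back scheme described in the exchange above can be sketched in miniature. The following is a minimal single-word illustration, not Moir's algorithm itself: a 32-bit payload shares a 64-bit word with a 32-bit sequence half so that one compare-and-swap covers both, and all names here are invented for the sketch. Moir's full scheme extends this idea to multi-word structures with a sequence field per word.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical single-word version of the sequence-field scheme: the
 * high 32 bits are the sequence number, the low 32 bits the payload. */
struct seq_word { _Atomic uint64_t v; };

static uint32_t seq_word_read(struct seq_word *w)
{
    /* One atomic load captures payload and sequence together, so the
     * reader's snapshot is always self-consistent. */
    return (uint32_t)atomic_load(&w->v);
}

static void seq_word_update(struct seq_word *w, uint32_t (*fn)(uint32_t))
{
    uint64_t old, new;

    do {
        /* Copy out, modify the local copy, write back with the sequence
         * half bumped.  If another writer got in first the sequence no
         * longer matches, the CAS fails, and the update is redriven
         * from scratch against the fresh value -- exactly the retry
         * behaviour described in the message. */
        old = atomic_load(&w->v);
        new = (((old >> 32) + 1) << 32) | fn((uint32_t)old);
    } while (!atomic_compare_exchange_weak(&w->v, &old, new));
}
```

Note how the listed downsides show up even at this scale: the word doubles in size to make room for the sequence half, and the hardware must support an atomic cmpxchg spanning both halves.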
* Re: [RFC] Lock free fd lookup 2004-07-17 1:19 ` William Lee Irwin III @ 2004-07-17 2:12 ` Keith Owens 2004-07-17 2:34 ` William Lee Irwin III 2004-07-17 2:28 ` Keith Owens 1 sibling, 1 reply; 19+ messages in thread From: Keith Owens @ 2004-07-17 2:12 UTC (permalink / raw) To: William Lee Irwin III Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel, dipankar On Fri, 16 Jul 2004 18:19:36 -0700, William Lee Irwin III <wli@holomorphy.com> wrote: >On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote: >> 2-3-4 trees are self-balancing, which gave decent lookup performance. >> Since red-black trees are logically equivalent to 2-3-4 trees it should >> be possible to use lockfree red-black trees. However I could not come >> up with a lockfree red-black tree, mainly because a red-black insert >> requires atomic changes to multiple structures. The 2-3-4 tree only >> needs atomic update to one structure at a time. > >This actually appears to confirm my earlier assertion about the linkage >of the data structure. Is this conclusion what you had in mind? Not quite. The 2-3-4 tree has embedded linkage, but it can be done lockfree if you really have to. The problem is that a single 2-3-4 list entry maps to two red-black list entries. I could atomically update a single 2-3-4 list entry, including its pointers, even when the list was being read or updated by other users. I could not work out how to do the equivalent update when the list linkage data was split over two red-black nodes. The list structure is an implementation detail, the use of 2-3-4 or red-black is completely transparent to the main code. The main code wants to look up a structure from the list, to update a structure, to insert or to delete a structure without waiting. How the list of structures is maintained is a problem for the internals of the API. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] Lock free fd lookup 2004-07-17 2:12 ` Keith Owens @ 2004-07-17 2:34 ` William Lee Irwin III 0 siblings, 0 replies; 19+ messages in thread From: William Lee Irwin III @ 2004-07-17 2:34 UTC (permalink / raw) To: Keith Owens Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel, dipankar On Fri, 16 Jul 2004 18:19:36 -0700, William Lee Irwin III wrote: >> This actually appears to confirm my earlier assertion about the linkage >> of the data structure. Is this conclusion what you had in mind? On Sat, Jul 17, 2004 at 12:12:39PM +1000, Keith Owens wrote: > Not quite. The 2-3-4 tree has embedded linkage, but it can be done > lockfree if you really have to. The problem is that a single 2-3-4 > list entry maps to two red-black list entries. I could atomically > update a single 2-3-4 list entry, including its pointers, even when the > list was being read or updated by other users. I could not work out > how to do the equivalent update when the list linkage data was split > over two red-black nodes. 2-3 trees have external linkage just like B/B+ trees, at least as I had envisioned external linkage. A terminological issue, I guess. On Sat, Jul 17, 2004 at 12:12:39PM +1000, Keith Owens wrote: > The list structure is an implementation detail, the use of 2-3-4 or > red-black is completely transparent to the main code. The main code > wants to look up a structure from the list, to update a structure, to > insert or to delete a structure without waiting. How the list of > structures is maintained is a problem for the internals of the API. That kind of genericity is tough to come by in C. I guess callbacks and void * can do it when pressed. -- wli ^ permalink raw reply [flat|nested] 19+ messages in thread
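The "callbacks and void *" genericity wli mentions can be sketched as follows. All names here are invented, and nothing below is lock-free: the point is only the API shape, where the caller sees opaque lookup/insert operations and the internals are free to be a 2-3-4 tree, a red-black tree, or (as in this deliberately trivial stand-in) a linked list.

```c
#include <stdlib.h>

/* Caller-supplied comparison callback: the container never interprets
 * items itself, which is what makes the internals swappable. */
typedef int (*cmp_fn)(const void *a, const void *b);

struct node { void *item; struct node *next; };
struct container { struct node *head; cmp_fn cmp; };

static void container_init(struct container *c, cmp_fn cmp)
{
    c->head = NULL;
    c->cmp = cmp;
}

static int container_insert(struct container *c, void *item)
{
    struct node *n = malloc(sizeof(*n));

    if (!n)
        return -1;
    n->item = item;
    n->next = c->head;      /* linkage is an implementation detail */
    c->head = n;
    return 0;
}

static void *container_lookup(struct container *c, const void *key)
{
    for (struct node *n = c->head; n; n = n->next)
        if (c->cmp(n->item, key) == 0)
            return n->item;
    return NULL;
}
```

Swapping the list for a balanced tree changes only the bodies of these three functions; callers, holding only `void *` items and the opaque handle, never notice.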
* Re: [RFC] Lock free fd lookup 2004-07-17 1:19 ` William Lee Irwin III 2004-07-17 2:12 ` Keith Owens @ 2004-07-17 2:28 ` Keith Owens 2004-07-17 3:16 ` William Lee Irwin III 1 sibling, 1 reply; 19+ messages in thread From: Keith Owens @ 2004-07-17 2:28 UTC (permalink / raw) To: William Lee Irwin III Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel, dipankar On Fri, 16 Jul 2004 18:19:36 -0700, William Lee Irwin III <wli@holomorphy.com> wrote: >On Sat, Jul 17, 2004 at 10:55:59AM +1000, Keith Owens wrote: >> The beauty of these lockfree algorithms on large structures is that >> nothing ever stalls indefinitely. If the underlying SMP cache hardware >> is fair, everything running a lockfree list will make progress. These >> algorithms do not suffer from reader vs. writer starvation. > >That's a large assumption. NUMA hardware typically violates it. True, which is why I mentioned it. However I suspect that you read something into that paragraph which was not intended. Just reading the lockfree list and the structures on the list does not suffer from any NUMA problems, because reading does not perform any global updates at all. The SMP starvation problem only kicks in when multiple concurrent updates are being done. Even with multiple writers, one of the writers is guaranteed to succeed every time, so over time all the write operations will proceed, subject to fair access to exclusive cache lines. Lockfree reads with Moir's algorithms require extra memory bandwidth. In the absence of updates, all the cache lines end up in shared state. That reduces to local memory bandwidth for the (hopefully) common case of lots of readers and few writers. Lockfree code is nicely suited to the same class of problem that RCU addresses, but without the reader vs. writer starvation problems. Writer vs. writer starvation on NUMA is a lot harder. 
I don't know of any algorithm that handles lists with lots of concurrent updates and also scales well at large CPU counts, unless the underlying hardware is fair in its handling of exclusive cache lines. ^ permalink raw reply [flat|nested] 19+ messages in thread
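Keith's point that "reading does not perform any global updates at all" can be made concrete. The sketch below is a simplified, seqcount-style analogue of the per-word validation in Moir's scheme, with invented names: the reader issues only loads, so in the absence of writers every cache line it touches stays in shared state. Unlike the full algorithm, this naive variant can retry repeatedly under sustained writes, and the writer side as shown assumes a single writer.

```c
#include <stdatomic.h>
#include <stdint.h>

struct snap {
    _Atomic unsigned seq;   /* even when stable, odd mid-update  */
    uint64_t a, b;          /* payload, copied out by readers    */
};

static void snap_read(struct snap *s, uint64_t *a, uint64_t *b)
{
    unsigned v;

    do {
        /* Wait out an in-flight writer, then take the snapshot.
         * Only loads: no global state changes, no cache-line
         * exclusivity needed on the read side. */
        while ((v = atomic_load_explicit(&s->seq,
                                         memory_order_acquire)) & 1)
            ;
        *a = s->a;
        *b = s->b;
        atomic_thread_fence(memory_order_acquire);
        /* A changed sequence means a writer raced us: copy again. */
    } while (atomic_load_explicit(&s->seq, memory_order_relaxed) != v);
}

static void snap_write(struct snap *s, uint64_t a, uint64_t b)
{
    /* Make the sequence odd while the payload is inconsistent.
     * Concurrent writers would additionally need cmpxchg or mutual
     * exclusion among themselves, as the thread discusses. */
    atomic_fetch_add_explicit(&s->seq, 1, memory_order_acquire);
    s->a = a;
    s->b = b;
    atomic_fetch_add_explicit(&s->seq, 1, memory_order_release);
}
```

This also illustrates why the read-mostly case reduces to local memory bandwidth: with no writers bumping `seq`, readers validate against an unchanging shared line.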
* Re: [RFC] Lock free fd lookup 2004-07-17 2:28 ` Keith Owens @ 2004-07-17 3:16 ` William Lee Irwin III 0 siblings, 0 replies; 19+ messages in thread From: William Lee Irwin III @ 2004-07-17 3:16 UTC (permalink / raw) To: Keith Owens Cc: Jesse Barnes, Chris Wright, Ravikiran G Thirumalai, linux-kernel, dipankar On Fri, 16 Jul 2004 18:19:36 -0700, William Lee Irwin III wrote: >> That's a large assumption. NUMA hardware typically violates it. On Sat, Jul 17, 2004 at 12:28:54PM +1000, Keith Owens wrote: > True, which is why I mentioned it. However I suspect that you read > something into that paragraph which was not intended. The NUMA issue is the only caveat I saw. I guess I just wanted to mention it by name. On Sat, Jul 17, 2004 at 12:28:54PM +1000, Keith Owens wrote: > Just reading the lockfree list and the structures on the list does not > suffer from any NUMA problems, because reading does not perform any > global updates at all. The SMP starvation problem only kicks in when > multiple concurrent updates are being done. Even with multiple > writers, one of the writers is guaranteed to succeed every time, so > over time all the write operations will proceed, subject to fair access > to exclusive cache lines. The only methods I can think of to repair this (basically queueing) are not busywait-free. On Sat, Jul 17, 2004 at 12:28:54PM +1000, Keith Owens wrote: > Lockfree reads with Moir's algorithms require extra memory bandwidth. > In the absence of updates, all the cache lines end up in shared state. > That reduces to local memory bandwidth for the (hopefully) common case > of lots of readers and few writers. Lockfree code is nicely suited to > the same class of problem that RCU addresses, but without the reader > vs. writer starvation problems. I suppose it's worth refining the starvation claim to delaying freeing memory as opposed to readers causing writers to busywait indefinitely. On Sat, Jul 17, 2004 at 12:28:54PM +1000, Keith Owens wrote: > Writer vs. 
writer starvation on NUMA is a lot harder. I don't know of > any algorithm that handles lists with lots of concurrent updates and > also scales well at large CPU counts, unless the underlying hardware is fair > in its handling of exclusive cache lines. Well, neither do I. =) -- wli ^ permalink raw reply [flat|nested] 19+ messages in thread
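The queueing repair wli alludes to is essentially what an MCS-style queued spinlock (the Mellor-Crummey/Scott locks mentioned elsewhere in this thread) provides: each waiter spins on its own node, so handoff is FIFO and writer starvation disappears even on unfair cache hardware, but the spin itself is exactly why queueing is not busywait-free. A minimal sketch with invented names, assuming C11 atomics:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
};

static void mcs_lock(_Atomic(struct mcs_node *) *tail, struct mcs_node *me)
{
    struct mcs_node *pred;

    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    pred = atomic_exchange(tail, me);      /* join the queue atomically */
    if (pred) {
        atomic_store(&pred->next, me);     /* link behind predecessor   */
        while (atomic_load(&me->locked))   /* spin on our OWN node, not */
            ;                              /* a globally contended line */
    }
}

static void mcs_unlock(_Atomic(struct mcs_node *) *tail, struct mcs_node *me)
{
    struct mcs_node *succ = atomic_load(&me->next);

    if (!succ) {
        struct mcs_node *expect = me;

        /* No visible successor: try to swing the tail back to empty. */
        if (atomic_compare_exchange_strong(tail, &expect, NULL))
            return;
        while (!(succ = atomic_load(&me->next)))
            ;                              /* successor is mid-enqueue  */
    }
    atomic_store(&succ->locked, false);    /* FIFO handoff: no waiter   */
}                                          /* can be starved            */
```

The per-waiter spin location is what makes this attractive on NUMA: each CPU busywaits on a line it owns, rather than all writers hammering one exclusive line.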
* Re: [RFC] Lock free fd lookup 2004-07-14 15:17 ` Chris Wright 2004-07-15 14:22 ` Jesse Barnes @ 2004-07-17 13:48 ` Peter Zijlstra 1 sibling, 0 replies; 19+ messages in thread From: Peter Zijlstra @ 2004-07-17 13:48 UTC (permalink / raw) To: linux-kernel [-- Attachment #1: Type: text/plain, Size: 403 bytes --] Hi, maybe I'm way off, but have you considered multi-state locks for your problem? I attached a draft paper, written by a former colleague and myself, on concurrent AVL trees that uses multi-state locking. This paper is unpublished and unfinished but might be illustrative. Do note that those email addresses are no longer valid AFAIK; reply here to reach me or the co-author. just my 2ct. Peter Zijlstra [-- Attachment #2: cavl.ps.gz --] [-- Type: application/x-gzpostscript, Size: 40823 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2004-07-29 0:20 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2004-07-17 19:17 [RFC] Lock free fd lookup Albert Cahalan 2004-07-29 0:14 ` William Lee Irwin III -- strict thread matches above, loose matches on Subject: below -- 2004-07-17 8:50 Manfred Spraul 2004-07-17 9:30 ` William Lee Irwin III 2004-07-14 4:53 [RFC] Refcounting of objects part of a lockfree collection Ravikiran G Thirumalai 2004-07-14 4:56 ` [RFC] Lock free fd lookup Ravikiran G Thirumalai 2004-07-14 15:17 ` Chris Wright 2004-07-15 14:22 ` Jesse Barnes 2004-07-15 16:10 ` Dipankar Sarma 2004-07-15 16:22 ` Jesse Barnes 2004-07-15 16:34 ` Chris Wright 2004-07-16 5:38 ` Ravikiran G Thirumalai 2004-07-16 6:27 ` William Lee Irwin III 2004-07-17 0:55 ` Keith Owens 2004-07-17 1:19 ` William Lee Irwin III 2004-07-17 2:12 ` Keith Owens 2004-07-17 2:34 ` William Lee Irwin III 2004-07-17 2:28 ` Keith Owens 2004-07-17 3:16 ` William Lee Irwin III 2004-07-17 13:48 ` Peter Zijlstra