public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [patchset] Lockfree fd lookup 0 of 5
@ 2004-08-02 10:10 Ravikiran G Thirumalai
  2004-08-02 10:13 ` [patchset] Lockfree fd lookup 1 " Ravikiran G Thirumalai
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Ravikiran G Thirumalai @ 2004-08-02 10:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Greg KH, dipankar, viro

Here is a patchset to eliminate taking struct files_struct.file_lock on 
reader side using rcu and rcu based refcounting.  These patches
extend the kref api to include kref_lf_xxx api and kref_lf_get_rcu to
do lockfree refcounting, and use the same.  As posted earlier, since fd
lookups (struct files_struct.fd[]) will be lock free with these patches, 
threaded workloads doing lots of io should see performance benefits 
due to this patchset.  I have observed 13-15% improvement with tiobench 
on a 4 way xeon with this patchset.

The patchset contains:
1. kref-merged-2.6.7.patch -- kref shrinkage patch which GregKH has applied to
   his tree.
2. kref-drivers-2.6.7.patch -- existing users of kref modified to use the
   'shrunk' krefs.  GregKH has applied this to his tree too
3. kref-lf-2.6.7.patch -- kref api additions for lock free refcounting.  
   This patch relocates kref api to kref.h as static inlines since they
   are mostly wrappers around atomic_xxx operations
4. files_struct-kref-s-2.6.7.patch -- change struct file.f_count to a kref
   and use kref api for refcounting.  This does not add any performance
   benefit and is just an intermediate patch
5. files_struct-rcu-kref-2.6.7.patch -- Make fd lookups lock free by using
   rcu and kref_lf_xxx api for lockfree refcounting

The patchset will follow this post.

Thanks,
Kiran


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 1 of 5
  2004-08-02 10:10 [patchset] Lockfree fd lookup 0 of 5 Ravikiran G Thirumalai
@ 2004-08-02 10:13 ` Ravikiran G Thirumalai
  2004-08-02 13:38   ` Christoph Hellwig
  2004-08-03  0:44   ` Greg KH
  2004-08-02 10:16 ` [patchset] Lockfree fd lookup 2 " Ravikiran G Thirumalai
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 23+ messages in thread
From: Ravikiran G Thirumalai @ 2004-08-02 10:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Greg KH, dipankar, viro

Here's the first patch.  This patch 'shrinks' struct kref by removing
the release pointer in the struct kref.  GregKH has applied this to his tree

Thanks,
Kiran


D:
D: Signed-off-by: Ravikiran Thirumalai <kiran@in.ibm.com>
D:
D: kref-merged-2.6.7.patch:
D: This patch is the kref shrinkage patch which GregKH has agreed to include,
D: and has applied to his tree which he will push into mainline.
D:
diff -ruN -X dontdiff2 linux-2.6.7/include/linux/kref.h files_struct-kref-2.6.7/include/linux/kref.h
--- linux-2.6.7/include/linux/kref.h	2004-06-16 10:48:59.000000000 +0530
+++ files_struct-kref-2.6.7/include/linux/kref.h	2004-07-26 16:38:23.604361208 +0530
@@ -21,12 +21,11 @@
 
 struct kref {
 	atomic_t refcount;
-	void (*release)(struct kref *kref);
 };
 
-void kref_init(struct kref *kref, void (*release)(struct kref *));
+void kref_init(struct kref *kref);
 struct kref *kref_get(struct kref *kref);
-void kref_put(struct kref *kref);
+void kref_put(struct kref *kref, void (*release) (struct kref *kref));
 
 
 #endif /* _KREF_H_ */
diff -ruN -X dontdiff2 linux-2.6.7/lib/kref.c files_struct-kref-2.6.7/lib/kref.c
--- linux-2.6.7/lib/kref.c	2004-06-16 10:50:26.000000000 +0530
+++ files_struct-kref-2.6.7/lib/kref.c	2004-07-26 16:52:11.617484080 +0530
@@ -19,15 +19,10 @@
 /**
  * kref_init - initialize object.
  * @kref: object in question.
- * @release: pointer to a function that will clean up the object
- *	     when the last reference to the object is released.
- *	     This pointer is required.
  */
-void kref_init(struct kref *kref, void (*release)(struct kref *kref))
+void kref_init(struct kref *kref)
 {
-	WARN_ON(release == NULL);
 	atomic_set(&kref->refcount,1);
-	kref->release = release;
 }
 
 /**
@@ -44,14 +39,18 @@
 /**
  * kref_put - decrement refcount for object.
  * @kref: object.
+ * @release: pointer to the function that will clean up the object
+ *	     when the last reference to the object is released.
+ *	     This pointer is required.
  *
- * Decrement the refcount, and if 0, call kref->release().
+ * Decrement the refcount, and if 0, call release().
  */
-void kref_put(struct kref *kref)
+void kref_put(struct kref *kref, void (*release) (struct kref *kref))
 {
+	WARN_ON(release == NULL);
 	if (atomic_dec_and_test(&kref->refcount)) {
 		pr_debug("kref cleaning up\n");
-		kref->release(kref);
+		release(kref);
 	}
 }
 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 2 of 5
  2004-08-02 10:10 [patchset] Lockfree fd lookup 0 of 5 Ravikiran G Thirumalai
  2004-08-02 10:13 ` [patchset] Lockfree fd lookup 1 " Ravikiran G Thirumalai
@ 2004-08-02 10:16 ` Ravikiran G Thirumalai
  2004-08-03  0:43   ` Greg KH
  2004-08-03  6:44   ` Greg KH
  2004-08-02 10:18 ` [patchset] Lockfree fd lookup 0 " Ravikiran G Thirumalai
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 23+ messages in thread
From: Ravikiran G Thirumalai @ 2004-08-02 10:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Greg KH, dipankar, viro

Here's the second patch.  This changes the current kref users to use
the 'shrunk' kref objects and api.  GregKH has applied this to his tree too.

Thanks,
Kiran


D:
D: Signed-off-by: Ravikiran Thirumalai <kiran@in.ibm.com>
D:
D: kref-drivers-2.6.7.patch:
D: This patch changes the current users of kref to use the 'shrunk' kref
D: GregKH has applied this too to his tree,
D:
diff -ruN -X dontdiff2 linux-2.6.7/drivers/scsi/sd.c kref-2.6.7/drivers/scsi/sd.c
--- linux-2.6.7/drivers/scsi/sd.c	2004-06-16 10:49:44.000000000 +0530
+++ kref-2.6.7/drivers/scsi/sd.c	2004-07-20 13:03:49.000000000 +0530
@@ -186,7 +186,7 @@
 	return sdkp;
 
  out_put:
-	kref_put(&sdkp->kref);
+	kref_put(&sdkp->kref, scsi_disk_release);
  out_sdkp:
 	sdkp = NULL;
  out:
@@ -198,7 +198,7 @@
 {
 	down(&sd_ref_sem);
 	scsi_device_put(sdkp->device);
-	kref_put(&sdkp->kref);
+	kref_put(&sdkp->kref, scsi_disk_release);
 	up(&sd_ref_sem);
 }
 
@@ -1360,7 +1360,7 @@
 		goto out;
 
 	memset (sdkp, 0, sizeof(*sdkp));
-	kref_init(&sdkp->kref, scsi_disk_release);
+	kref_init(&sdkp->kref);
 
 	/* Note: We can accomodate 64 partitions, but the genhd code
 	 * assumes partitions allocate consecutive minors, which they don't.
@@ -1462,7 +1462,7 @@
 	del_gendisk(sdkp->disk);
 	sd_shutdown(dev);
 	down(&sd_ref_sem);
-	kref_put(&sdkp->kref);
+	kref_put(&sdkp->kref, scsi_disk_release);
 	up(&sd_ref_sem);
 
 	return 0;
diff -ruN -X dontdiff2 linux-2.6.7/drivers/scsi/sr.c kref-2.6.7/drivers/scsi/sr.c
--- linux-2.6.7/drivers/scsi/sr.c	2004-06-16 10:50:27.000000000 +0530
+++ kref-2.6.7/drivers/scsi/sr.c	2004-07-20 13:03:49.000000000 +0530
@@ -144,7 +144,7 @@
 	goto out;
 
  out_put:
-	kref_put(&cd->kref);
+	kref_put(&cd->kref, sr_kref_release);
  out_null:
 	cd = NULL;
  out:
@@ -156,7 +156,7 @@
 {
 	down(&sr_ref_sem);
 	scsi_device_put(cd->device);
-	kref_put(&cd->kref);
+	kref_put(&cd->kref, sr_kref_release);
 	up(&sr_ref_sem);
 }
 
@@ -573,7 +573,7 @@
 		goto fail;
 	memset(cd, 0, sizeof(*cd));
 
-	kref_init(&cd->kref, sr_kref_release);
+	kref_init(&cd->kref);
 
 	disk = alloc_disk(1);
 	if (!disk)
@@ -952,7 +952,7 @@
 	del_gendisk(cd->disk);
 
 	down(&sr_ref_sem);
-	kref_put(&cd->kref);
+	kref_put(&cd->kref, sr_kref_release);
 	up(&sr_ref_sem);
 
 	return 0;
diff -ruN -X dontdiff2 linux-2.6.7/drivers/usb/core/config.c kref-2.6.7/drivers/usb/core/config.c
--- linux-2.6.7/drivers/usb/core/config.c	2004-06-16 10:48:52.000000000 +0530
+++ kref-2.6.7/drivers/usb/core/config.c	2004-07-20 13:03:49.000000000 +0530
@@ -356,7 +356,7 @@
 		if (!intfc)
 			return -ENOMEM;
 		memset(intfc, 0, len);
-		kref_init(&intfc->ref, usb_release_interface_cache);
+		kref_init(&intfc->ref);
 	}
 
 	/* Skip over any Class Specific or Vendor Specific descriptors;
@@ -422,7 +422,8 @@
 
 		for (i = 0; i < cf->desc.bNumInterfaces; i++) {
 			if (cf->intf_cache[i])
-				kref_put(&cf->intf_cache[i]->ref);
+				kref_put(&cf->intf_cache[i]->ref, 
+					  usb_release_interface_cache);
 		}
 	}
 	kfree(dev->config);
diff -ruN -X dontdiff2 linux-2.6.7/drivers/usb/core/message.c kref-2.6.7/drivers/usb/core/message.c
--- linux-2.6.7/drivers/usb/core/message.c	2004-06-16 10:49:02.000000000 +0530
+++ kref-2.6.7/drivers/usb/core/message.c	2004-07-20 15:07:24.000000000 +0530
@@ -1077,11 +1077,12 @@
 
 static void release_interface(struct device *dev)
 {
+	extern void destroy_serial(struct kref *kref);
 	struct usb_interface *intf = to_usb_interface(dev);
 	struct usb_interface_cache *intfc =
 			altsetting_to_usb_interface_cache(intf->altsetting);
 
-	kref_put(&intfc->ref);
+	kref_put(&intfc->ref, destroy_serial);
 	kfree(intf);
 }
 
diff -ruN -X dontdiff2 linux-2.6.7/drivers/usb/core/urb.c kref-2.6.7/drivers/usb/core/urb.c
--- linux-2.6.7/drivers/usb/core/urb.c	2004-06-16 10:49:36.000000000 +0530
+++ kref-2.6.7/drivers/usb/core/urb.c	2004-07-20 13:03:49.000000000 +0530
@@ -39,7 +39,7 @@
 {
 	if (urb) {
 		memset(urb, 0, sizeof(*urb));
-		kref_init(&urb->kref, urb_destroy);
+		kref_init(&urb->kref);
 		spin_lock_init(&urb->lock);
 	}
 }
@@ -88,7 +88,7 @@
 void usb_free_urb(struct urb *urb)
 {
 	if (urb)
-		kref_put(&urb->kref);
+		kref_put(&urb->kref, urb_destroy);
 }
 
 /**
diff -ruN -X dontdiff2 linux-2.6.7/drivers/usb/host/ehci-mem.c kref-2.6.7/drivers/usb/host/ehci-mem.c
--- linux-2.6.7/drivers/usb/host/ehci-mem.c	2004-06-16 10:50:03.000000000 +0530
+++ kref-2.6.7/drivers/usb/host/ehci-mem.c	2004-07-20 13:03:49.000000000 +0530
@@ -114,7 +114,7 @@
 		return qh;
 
 	memset (qh, 0, sizeof *qh);
-	kref_init(&qh->kref, qh_destroy);
+	kref_init(&qh->kref);
 	qh->ehci = ehci;
 	qh->qh_dma = dma;
 	// INIT_LIST_HEAD (&qh->qh_list);
@@ -139,7 +139,7 @@
 
 static inline void qh_put (struct ehci_qh *qh)
 {
-	kref_put(&qh->kref);
+	kref_put(&qh->kref, qh_destroy);
 }
 
 /*-------------------------------------------------------------------------*/
diff -ruN -X dontdiff2 linux-2.6.7/drivers/usb/serial/usb-serial.c kref-2.6.7/drivers/usb/serial/usb-serial.c
--- linux-2.6.7/drivers/usb/serial/usb-serial.c	2004-06-16 10:49:17.000000000 +0530
+++ kref-2.6.7/drivers/usb/serial/usb-serial.c	2004-07-20 15:07:52.000000000 +0530
@@ -443,6 +443,69 @@
 	return;
 }
 
+static void serial_shutdown (struct usb_serial *serial)
+{
+	dbg ("%s", __FUNCTION__);
+
+	serial->type->shutdown(serial);
+}
+
+void destroy_serial(struct kref *kref)
+{
+	struct usb_serial *serial;
+	struct usb_serial_port *port;
+	int i;
+
+	serial = to_usb_serial(kref);
+
+	dbg ("%s - %s", __FUNCTION__, serial->type->name);
+	serial_shutdown (serial);
+
+	/* return the minor range that this device had */
+	return_serial(serial);
+
+	for (i = 0; i < serial->num_ports; ++i)
+		serial->port[i]->open_count = 0;
+
+	/* the ports are cleaned up and released in port_release() */
+	for (i = 0; i < serial->num_ports; ++i)
+		if (serial->port[i]->dev.parent != NULL) {
+			device_unregister(&serial->port[i]->dev);
+			serial->port[i] = NULL;
+		}
+
+	/* If this is a "fake" port, we have to clean it up here, as it will
+	 * not get cleaned up in port_release() as it was never registered with
+	 * the driver core */
+	if (serial->num_ports < serial->num_port_pointers) {
+		for (i = serial->num_ports; i < serial->num_port_pointers; ++i) {
+			port = serial->port[i];
+			if (!port)
+				continue;
+			if (port->read_urb) {
+				usb_unlink_urb(port->read_urb);
+				usb_free_urb(port->read_urb);
+			}
+			if (port->write_urb) {
+				usb_unlink_urb(port->write_urb);
+				usb_free_urb(port->write_urb);
+			}
+			if (port->interrupt_in_urb) {
+				usb_unlink_urb(port->interrupt_in_urb);
+				usb_free_urb(port->interrupt_in_urb);
+			}
+			kfree(port->bulk_in_buffer);
+			kfree(port->bulk_out_buffer);
+			kfree(port->interrupt_in_buffer);
+		}
+	}
+
+	usb_put_dev(serial->dev);
+
+	/* free up any memory that we allocated */
+	kfree (serial);
+}
+
 /*****************************************************************************
  * Driver tty interface functions
  *****************************************************************************/
@@ -487,7 +550,7 @@
 		if (retval) {
 			port->open_count = 0;
 			module_put(serial->type->owner);
-			kref_put(&serial->kref);
+			kref_put(&serial->kref, destroy_serial);
 		}
 	}
 bailout:
@@ -518,7 +581,7 @@
 	}
 
 	module_put(port->serial->type->owner);
-	kref_put(&port->serial->kref);
+	kref_put(&port->serial->kref, destroy_serial);
 }
 
 static int serial_write (struct tty_struct * tty, int from_user, const unsigned char *buf, int count)
@@ -676,13 +739,6 @@
 	;
 }
 
-static void serial_shutdown (struct usb_serial *serial)
-{
-	dbg ("%s", __FUNCTION__);
-
-	serial->type->shutdown(serial);
-}
-
 static int serial_read_proc (char *page, char **start, off_t off, int count, int *eof, void *data)
 {
 	struct usb_serial *serial;
@@ -716,7 +772,7 @@
 			begin += length;
 			length = 0;
 		}
-		kref_put(&serial->kref);
+		kref_put(&serial->kref, destroy_serial);
 	}
 	*eof = 1;
 done:
@@ -785,62 +841,6 @@
 	wake_up_interruptible(&tty->write_wait);
 }
 
-static void destroy_serial(struct kref *kref)
-{
-	struct usb_serial *serial;
-	struct usb_serial_port *port;
-	int i;
-
-	serial = to_usb_serial(kref);
-
-	dbg ("%s - %s", __FUNCTION__, serial->type->name);
-	serial_shutdown (serial);
-
-	/* return the minor range that this device had */
-	return_serial(serial);
-
-	for (i = 0; i < serial->num_ports; ++i)
-		serial->port[i]->open_count = 0;
-
-	/* the ports are cleaned up and released in port_release() */
-	for (i = 0; i < serial->num_ports; ++i)
-		if (serial->port[i]->dev.parent != NULL) {
-			device_unregister(&serial->port[i]->dev);
-			serial->port[i] = NULL;
-		}
-
-	/* If this is a "fake" port, we have to clean it up here, as it will
-	 * not get cleaned up in port_release() as it was never registered with
-	 * the driver core */
-	if (serial->num_ports < serial->num_port_pointers) {
-		for (i = serial->num_ports; i < serial->num_port_pointers; ++i) {
-			port = serial->port[i];
-			if (!port)
-				continue;
-			if (port->read_urb) {
-				usb_unlink_urb(port->read_urb);
-				usb_free_urb(port->read_urb);
-			}
-			if (port->write_urb) {
-				usb_unlink_urb(port->write_urb);
-				usb_free_urb(port->write_urb);
-			}
-			if (port->interrupt_in_urb) {
-				usb_unlink_urb(port->interrupt_in_urb);
-				usb_free_urb(port->interrupt_in_urb);
-			}
-			kfree(port->bulk_in_buffer);
-			kfree(port->bulk_out_buffer);
-			kfree(port->interrupt_in_buffer);
-		}
-	}
-
-	usb_put_dev(serial->dev);
-
-	/* free up any memory that we allocated */
-	kfree (serial);
-}
-
 static void port_release(struct device *dev)
 {
 	struct usb_serial_port *port = to_usb_serial_port(dev);
@@ -881,7 +881,7 @@
 	serial->interface = interface;
 	serial->vendor = dev->descriptor.idVendor;
 	serial->product = dev->descriptor.idProduct;
-	kref_init(&serial->kref, destroy_serial);
+	kref_init(&serial->kref);
 
 	return serial;
 }
@@ -1231,7 +1231,7 @@
 	if (serial) {
 		/* let the last holder of this object 
 		 * cause it to be cleaned up */
-		kref_put(&serial->kref);
+		kref_put(&serial->kref, destroy_serial);
 	}
 	dev_info(dev, "device disconnected\n");
 }

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-02 10:10 [patchset] Lockfree fd lookup 0 of 5 Ravikiran G Thirumalai
  2004-08-02 10:13 ` [patchset] Lockfree fd lookup 1 " Ravikiran G Thirumalai
  2004-08-02 10:16 ` [patchset] Lockfree fd lookup 2 " Ravikiran G Thirumalai
@ 2004-08-02 10:18 ` Ravikiran G Thirumalai
  2004-08-02 10:20 ` [patchset] Lockfree fd lookup 4 " Ravikiran G Thirumalai
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Ravikiran G Thirumalai @ 2004-08-02 10:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Greg KH, dipankar, viro

The following patch adds kref_lf_xxx api and kref_lf_get_rcu to enable
refcounting of objects part of a lockfree collection.

Thanks,
Kiran


D:
D: kref-lf-2.6.7.patch:
D: This patch depends on the kref shrinkage patches.  This provides
D: infrastructure for lockfree refcounting and adds kref_lf_xxx api
D: for lockfree refcounting.  This patch also changes kref_xxx to 
D: static inlines since kref_xxx are mostly one-two liners
D:

diff -ruN -X /home/kiran/dontdiff linux-2.6.7/include/linux/kref.h test-2.6.7/include/linux/kref.h
--- linux-2.6.7/include/linux/kref.h	2004-08-01 22:37:46.000000000 +0530
+++ test-2.6.7/include/linux/kref.h	2004-08-01 22:50:06.000000000 +0530
@@ -16,16 +16,264 @@
 #define _KREF_H_
 
 #include <linux/types.h>
+#include <linux/kernel.h>
 #include <asm/atomic.h>
 
-
 struct kref {
 	atomic_t refcount;
 };
 
-void kref_init(struct kref *kref);
-struct kref *kref_get(struct kref *kref);
-void kref_put(struct kref *kref, void (*release) (struct kref *kref));
+/**
+ * kref_init - initialize object.
+ * @kref: object in question.
+ */
+static inline void kref_init(struct kref *kref)
+{
+	atomic_set(&kref->refcount, 1);
+}
+
+/**
+ * kref_get - increment refcount for object.
+ * @kref: object.
+ */
+static inline struct kref *kref_get(struct kref *kref)
+{
+	WARN_ON(!atomic_read(&kref->refcount));
+	atomic_inc(&kref->refcount);
+	return kref;
+}
+
+/**
+ * kref_put - decrement refcount for object.
+ * @kref: object.
+ * @release: pointer to the function that will clean up the object
+ *	     when the last reference to the object is released.
+ *	     This pointer is required.
+ *
+ * Decrement the refcount, and if 0, call release().
+ */
+static inline void
+kref_put(struct kref *kref, void (*release) (struct kref * kref))
+{
+	if (atomic_dec_and_test(&kref->refcount)) {
+		WARN_ON(release == NULL);
+		pr_debug("kref cleaning up\n");
+		release(kref);
+	}
+}
+
+/**
+ * kref_read - Return the refcount value.
+ * @kref: object.
+ */
+static inline int kref_read(struct kref *kref)
+{
+	return atomic_read(&kref->refcount);
+}
+
+/**
+ * kref_put_last - decrement refcount for object.
+ * @kref: object.
+ *
+ * Decrement the refcount, and if 0 return true.
+ * Returns false otherwise.
+ * Use this only if you cannot use kref_put -- when the
+ * release function of kref_put needs more than just the
+ * refcounted object. Use of kref_put_last when kref_put
+ * can do will be frowned upon.
+ */
+static inline int kref_put_last(struct kref *kref)
+{
+	return atomic_dec_and_test(&kref->refcount);
+}
+
+/* 
+ * Refcounter framework for elements of lists/arrays protected by
+ * RCU.
+ *
+ * Refcounting on elements of  lists which are protected by traditional
+ * reader/writer spinlocks or semaphores are straight forward as in:
+ * 
+ * 1.					2.
+ * add()				search_and_reference()
+ * {					{
+ * 	alloc_object				read_lock(&list_lock);
+ *	...					search_for_element
+ *	kref_init(&el->rc)			kref_get(&el->rc);	
+ *	write_lock(&list_lock);			...
+ *	add_element				read_unlock(&list_lock);
+ *	...					...
+ *	write_unlock(&list_lock);	}
+ * }					
+ *
+ * 3.					4.
+ * release_referenced()			delete()
+ * {					{
+ *	...				write_lock(&list_lock);
+ *	if (kref_put_last(&el->rc))		...
+ *		start_cleanup_object	...
+ *		free_object		delete_element
+ *	...				write_unlock(&list_lock);
+ * }					...
+ *					if (kref_put_last(&el->rc))
+ *						start_cleanup_object
+ *						free_object
+ *					}
+ *
+ * If this list/array is made lock free using rcu as in changing the 
+ * write_lock in add() and delete() to spin_lock and changing read_lock
+ * in search_and_reference to rcu_read_lock(), the kref_get in 
+ * search_and_reference could potentially hold reference to an element which
+ * has already been deleted from the list/array.  kref_lf_get_rcu takes
+ * care of this scenario. search_and_reference should look as;
+ * 2. 
+ * search_and_reference()
+ * {
+ *	rcu_read_lock();
+ *	search_for_element
+ *	if (!kref_lf_get_rcu(&el->rc)) {
+ *		rcu_read_unlock();
+ *		return FAIL;
+ *	}
+ *	...
+ *	...
+ *	rcu_read_unlock();
+ * }
+ * 
+ * Of course, free_object after kref_put_last should be batched using call_rcu.
+ * Sometimes, reference to the element need to be obtained in the 
+ * update (write) stream.  In such cases, kref_lf_get_rcu might be an overkill
+ * since the spinlock serialising list updates are held. kref_lf_get
+ * is to be used in such cases.  
+ * Note: Except for kref_lf_get_rcu, kref_lf_xxx api are the same as 
+ * corresponding kref_xxx api for most arches.  However, for arches which 
+ * donot have cmpxchg kref_lf_xxx api use a hashed spinlock implementation to
+ * implement kref_lf_get_rcu, and acquire the same hashed spinlock for 
+ * kref_lf_get etc to preserve atomicity.
+ * Note: Use kref_lf_xxx api only if you need to use kref_lf_get_rcu on the
+ * refcounter atleast at one place.  Mixing kref_lf_xx and kref_xxx api
+ * might lead to races.
+ *
+ */
+
+#ifdef __HAVE_ARCH_CMPXCHG
+
+#define kref_lf_init kref_init
+#define kref_lf_get kref_get
+#define kref_lf_put kref_put
+#define kref_lf_put_last kref_put_last
+
+/* 
+ * cmpxchg is needed on UP too, if deletions to the list/array can happen
+ * in interrupt context.
+ */
+
+/**
+ * kref_lf_get_rcu - Take reference to an object of a lockfree collection
+ * by traversing a lockfree list/array.
+ * @kref: object.
+ *
+ * Try and increment the refcount by 1.  The increment might fail if
+ * the refcounted object has been through a 1 to 0 transition and 
+ * is no longer part of the lockfree list.
+ * Returns non-zero on successful increment and zero otherwise.
+ */
+static inline int kref_lf_get_rcu(struct kref *kref)
+{
+	int c, old;
+	c = atomic_read(&kref->refcount);
+	while (c && (old = cmpxchg(&kref->refcount.counter, c, c + 1)) != c)
+		c = old;
+	return c;
+}
+
+#else				/* !__HAVE_ARCH_CMPXCHG */
+
+#include <linux/spinlock.h>
+
+/* 
+ * We use an array of spinlocks for the krefs -- similar to ones in sparc
+ * 32 bit atomic_t implementations, and a hash function similar to that
+ * for our refcounting needs.
+ * Can't help multiprocessors which donot have cmpxchg :(
+ */
+
+#ifdef	CONFIG_SMP
+#define KREF_HASH_SIZE	4
+#define KREF_HASH(k) \
+	(&__kref_hash[(((unsigned long)k)>>8) & (KREF_HASH_SIZE-1)])
+#else
+#define	KREF_HASH_SIZE	1
+#define KREF_HASH(k) 	&__kref_hash[0]
+#endif				/* CONFIG_SMP */
+
+extern spinlock_t __kref_hash[KREF_HASH_SIZE];
+
+static inline void kref_lf_init(struct kref *kref)
+{
+	unsigned long flags;
+	spin_lock_irqsave(KREF_HASH(kref), flags);
+	kref->refcount.counter = 1;
+	spin_unlock_irqrestore(KREF_HASH(kref), flags);
+}
+
+static inline void kref_lf_get(struct kref *kref)
+{
+	unsigned long flags;
+	spin_lock_irqsave(KREF_HASH(kref), flags);
+	kref->refcount.counter += 1;
+	spin_unlock_irqrestore(KREF_HASH(kref), flags);
+}
+
+static inline void 
+kref_lf_put(struct kref *kref, void (*release) (struct kref * kref))
+{
+	unsigned long flags;
+	spin_lock_irqsave(KREF_HASH(kref), flags);
+	kref->refcount.counter--;
+	if (!kref->refcount.counter) {
+		spin_unlock_irqrestore(KREF_HASH(kref), flags);
+		WARN_ON(release == NULL);
+		pr_debug("kref cleaning up\n");
+		release(kref);
+	} else {
+		spin_unlock_irqrestore(KREF_HASH(kref), flags);
+	}
+}
+
+static inline int kref_lf_put_last(struct kref *kref)
+{
+	unsigned long flags;
+	spin_lock_irqsave(KREF_HASH(kref), flags);
+	kref->refcount.counter--;
+	if (!kref->refcount.counter) {
+		spin_unlock_irqrestore(KREF_HASH(kref), flags);
+		return 1;
+	} else {
+		spin_unlock_irqrestore(KREF_HASH(kref), flags);
+		return 0;
+	}
+}
+
+static inline int kref_lf_get_rcu(struct kref *kref)
+{
+	int ret;
+	unsigned long flags;
+	spin_lock_irqsave(KREF_HASH(kref), flags);
+	if (kref->refcount.counter)
+		ret = kref->refcount.counter++;
+	else
+		ret = 0;
+	spin_unlock_irqrestore(KREF_HASH(kref), flags);
+	return ret;
+}
+
+#endif				/* __HAVE_ARCH_CMPXCHG */
 
+/* Spinlocks are not needed in this case, if atomic_read is used */
+static inline int kref_lf_read(struct kref *kref)
+{
+	return atomic_read(&kref->refcount);
+}
 
-#endif /* _KREF_H_ */
+#endif				/* _KREF_H_ */
diff -ruN -X /home/kiran/dontdiff linux-2.6.7/lib/kref.c test-2.6.7/lib/kref.c
--- linux-2.6.7/lib/kref.c	2004-08-01 22:37:46.000000000 +0530
+++ test-2.6.7/lib/kref.c	2004-08-01 21:41:19.000000000 +0530
@@ -11,49 +11,12 @@
  *
  */
 
-/* #define DEBUG */
-
 #include <linux/kref.h>
 #include <linux/module.h>
 
-/**
- * kref_init - initialize object.
- * @kref: object in question.
- */
-void kref_init(struct kref *kref)
-{
-	atomic_set(&kref->refcount,1);
-}
-
-/**
- * kref_get - increment refcount for object.
- * @kref: object.
- */
-struct kref *kref_get(struct kref *kref)
-{
-	WARN_ON(!atomic_read(&kref->refcount));
-	atomic_inc(&kref->refcount);
-	return kref;
-}
-
-/**
- * kref_put - decrement refcount for object.
- * @kref: object.
- * @release: pointer to the function that will clean up the object
- *	     when the last reference to the object is released.
- *	     This pointer is required.
- *
- * Decrement the refcount, and if 0, call release().
- */
-void kref_put(struct kref *kref, void (*release) (struct kref *kref))
-{
-	WARN_ON(release == NULL);
-	if (atomic_dec_and_test(&kref->refcount)) {
-		pr_debug("kref cleaning up\n");
-		release(kref);
-	}
-}
-
-EXPORT_SYMBOL(kref_init);
-EXPORT_SYMBOL(kref_get);
-EXPORT_SYMBOL(kref_put);
+#ifndef __HAVE_ARCH_CMPXCHG
+spinlock_t __kref_hash[KREF_HASH_SIZE] = {
+        [0 ... (KREF_HASH_SIZE-1)] = SPIN_LOCK_UNLOCKED
+};
+EXPORT_SYMBOL(__kref_hash);
+#endif

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 4 of 5
  2004-08-02 10:10 [patchset] Lockfree fd lookup 0 of 5 Ravikiran G Thirumalai
                   ` (2 preceding siblings ...)
  2004-08-02 10:18 ` [patchset] Lockfree fd lookup 0 " Ravikiran G Thirumalai
@ 2004-08-02 10:20 ` Ravikiran G Thirumalai
  2004-08-02 10:23 ` [patchset] Lockfree fd lookup 5 " Ravikiran G Thirumalai
  2004-08-02 16:56 ` [patchset] Lockfree fd lookup 0 " viro
  5 siblings, 0 replies; 23+ messages in thread
From: Ravikiran G Thirumalai @ 2004-08-02 10:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Greg KH, dipankar, viro

This patch changes struct file.f_count to use kref api. 

Thanks,
Kiran


D:
D: Signed-off-by: Ravikiran Thirumalai <kiran@in.ibm.com>
D:
D: files_struct-kref-s-2.6.7.patch
D: Patch to changes struct file.f_count to kref based refcounting
D: No value add in terms of performance.  Just converting the f_count refcounter
D: to use kref api.
D:
diff -ruN -X dontdiff2 linux-2.6.7/drivers/net/ppp_generic.c files_struct-kref-2.6.7/drivers/net/ppp_generic.c
--- linux-2.6.7/drivers/net/ppp_generic.c	2004-06-16 10:50:03.000000000 +0530
+++ files_struct-kref-2.6.7/drivers/net/ppp_generic.c	2004-07-27 18:55:18.551578696 +0530
@@ -525,12 +525,12 @@
 			if (file == ppp->owner)
 				ppp_shutdown_interface(ppp);
 		}
-		if (atomic_read(&file->f_count) <= 2) {
+		if (file_count(file) <= 2) {
 			ppp_release(inode, file);
 			err = 0;
 		} else
 			printk(KERN_DEBUG "PPPIOCDETACH file->f_count=%d\n",
-			       atomic_read(&file->f_count));
+			       file_count(file));
 		return err;
 	}
 
diff -ruN -X dontdiff2 linux-2.6.7/fs/affs/file.c files_struct-kref-2.6.7/fs/affs/file.c
--- linux-2.6.7/fs/affs/file.c	2004-06-16 10:49:22.000000000 +0530
+++ files_struct-kref-2.6.7/fs/affs/file.c	2004-07-27 18:55:18.000000000 +0530
@@ -62,7 +62,7 @@
 static int
 affs_file_open(struct inode *inode, struct file *filp)
 {
-	if (atomic_read(&filp->f_count) != 1)
+	if (file_count(filp) != 1)
 		return 0;
 	pr_debug("AFFS: open(%d)\n", AFFS_I(inode)->i_opencnt);
 	AFFS_I(inode)->i_opencnt++;
@@ -72,7 +72,7 @@
 static int
 affs_file_release(struct inode *inode, struct file *filp)
 {
-	if (atomic_read(&filp->f_count) != 0)
+	if (file_count(filp) != 0)
 		return 0;
 	pr_debug("AFFS: release(%d)\n", AFFS_I(inode)->i_opencnt);
 	AFFS_I(inode)->i_opencnt--;
diff -ruN -X dontdiff2 linux-2.6.7/fs/aio.c files_struct-kref-2.6.7/fs/aio.c
--- linux-2.6.7/fs/aio.c	2004-06-16 10:49:22.000000000 +0530
+++ files_struct-kref-2.6.7/fs/aio.c	2004-07-27 18:55:18.000000000 +0530
@@ -475,7 +475,7 @@
 static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 {
 	dprintk(KERN_DEBUG "aio_put(%p): f_count=%d\n",
-		req, atomic_read(&req->ki_filp->f_count));
+		req, file_count(req->ki_filp));
 
 	req->ki_users --;
 	if (unlikely(req->ki_users < 0))
@@ -489,7 +489,7 @@
 	/* Must be done under the lock to serialise against cancellation.
 	 * Call this aio_fput as it duplicates fput via the fput_work.
 	 */
-	if (unlikely(atomic_dec_and_test(&req->ki_filp->f_count))) {
+	if (unlikely(kref_put_last(&req->ki_filp->f_count))) {
 		get_ioctx(ctx);
 		spin_lock(&fput_lock);
 		list_add(&req->ki_list, &fput_head);
diff -ruN -X dontdiff2 linux-2.6.7/fs/file_table.c files_struct-kref-2.6.7/fs/file_table.c
--- linux-2.6.7/fs/file_table.c	2004-06-16 10:48:56.000000000 +0530
+++ files_struct-kref-2.6.7/fs/file_table.c	2004-07-27 18:55:35.218045008 +0530
@@ -81,7 +81,7 @@
 				goto fail;
 			}
 			eventpoll_init_file(f);
-			atomic_set(&f->f_count, 1);
+			kref_init(&f->f_count);
 			f->f_uid = current->fsuid;
 			f->f_gid = current->fsgid;
 			f->f_owner.lock = RW_LOCK_UNLOCKED;
@@ -118,7 +118,7 @@
 	eventpoll_init_file(filp);
 	filp->f_flags  = flags;
 	filp->f_mode   = (flags+1) & O_ACCMODE;
-	atomic_set(&filp->f_count, 1);
+	kref_init(&filp->f_count);
 	filp->f_dentry = dentry;
 	filp->f_mapping = dentry->d_inode->i_mapping;
 	filp->f_uid    = current->fsuid;
@@ -152,10 +152,17 @@
 
 EXPORT_SYMBOL(close_private_file);
 
+#define kref_to_file(kref) container_of(kref, struct file, f_count)
+
+static void __fput_kref(struct kref *kref)
+{
+	struct file *file = kref_to_file(kref);
+	__fput(file);
+}
+
 void fastcall fput(struct file *file)
 {
-	if (atomic_dec_and_test(&file->f_count))
-		__fput(file);
+	kref_put(&file->f_count, __fput_kref);
 }
 
 EXPORT_SYMBOL(fput);
@@ -235,13 +242,17 @@
 }
 
 
+static void put_filp_kref(struct kref *kref)
+{
+	struct file *file = kref_to_file(kref);
+	security_file_free(file);
+	file_kill(file);
+	file_free(file);
+}
+	
 void put_filp(struct file *file)
 {
-	if (atomic_dec_and_test(&file->f_count)) {
-		security_file_free(file);
-		file_kill(file);
-		file_free(file);
-	}
+	kref_put(&file->f_count, put_filp_kref);
 }
 
 EXPORT_SYMBOL(put_filp);
diff -ruN -X dontdiff2 linux-2.6.7/fs/hfs/inode.c files_struct-kref-2.6.7/fs/hfs/inode.c
--- linux-2.6.7/fs/hfs/inode.c	2004-06-16 10:50:26.000000000 +0530
+++ files_struct-kref-2.6.7/fs/hfs/inode.c	2004-07-27 18:55:18.560577328 +0530
@@ -523,7 +523,7 @@
 {
 	if (HFS_IS_RSRC(inode))
 		inode = HFS_I(inode)->rsrc_inode;
-	if (atomic_read(&file->f_count) != 1)
+	if (file_count(file) != 1)
 		return 0;
 	atomic_inc(&HFS_I(inode)->opencnt);
 	return 0;
@@ -535,7 +535,7 @@
 
 	if (HFS_IS_RSRC(inode))
 		inode = HFS_I(inode)->rsrc_inode;
-	if (atomic_read(&file->f_count) != 0)
+	if (file_count(file) != 0)
 		return 0;
 	if (atomic_dec_and_test(&HFS_I(inode)->opencnt)) {
 		down(&inode->i_sem);
diff -ruN -X dontdiff2 linux-2.6.7/fs/hfsplus/inode.c files_struct-kref-2.6.7/fs/hfsplus/inode.c
--- linux-2.6.7/fs/hfsplus/inode.c	2004-06-16 10:49:02.000000000 +0530
+++ files_struct-kref-2.6.7/fs/hfsplus/inode.c	2004-07-27 18:55:18.561577176 +0530
@@ -268,7 +268,7 @@
 {
 	if (HFSPLUS_IS_RSRC(inode))
 		inode = HFSPLUS_I(inode).rsrc_inode;
-	if (atomic_read(&file->f_count) != 1)
+	if (file_count(file) != 1)
 		return 0;
 	atomic_inc(&HFSPLUS_I(inode).opencnt);
 	return 0;
@@ -280,7 +280,7 @@
 
 	if (HFSPLUS_IS_RSRC(inode))
 		inode = HFSPLUS_I(inode).rsrc_inode;
-	if (atomic_read(&file->f_count) != 0)
+	if (file_count(file) != 0)
 		return 0;
 	if (atomic_dec_and_test(&HFSPLUS_I(inode).opencnt)) {
 		down(&inode->i_sem);
diff -ruN -X dontdiff2 linux-2.6.7/include/linux/fs.h files_struct-kref-2.6.7/include/linux/fs.h
--- linux-2.6.7/include/linux/fs.h	2004-06-16 10:49:02.000000000 +0530
+++ files_struct-kref-2.6.7/include/linux/fs.h	2004-07-27 18:55:35.228043488 +0530
@@ -18,6 +18,7 @@
 #include <linux/cache.h>
 #include <linux/prio_tree.h>
 #include <linux/kobject.h>
+#include <linux/kref.h>
 #include <asm/atomic.h>
 
 struct iovec;
@@ -558,7 +559,7 @@
 	struct dentry		*f_dentry;
 	struct vfsmount         *f_vfsmnt;
 	struct file_operations	*f_op;
-	atomic_t		f_count;
+	struct kref		f_count;
 	unsigned int 		f_flags;
 	mode_t			f_mode;
 	int			f_error;
@@ -584,8 +585,8 @@
 #define file_list_lock() spin_lock(&files_lock);
 #define file_list_unlock() spin_unlock(&files_lock);
 
-#define get_file(x)	atomic_inc(&(x)->f_count)
-#define file_count(x)	atomic_read(&(x)->f_count)
+#define get_file(x)	kref_get(&(x)->f_count)
+#define file_count(x)	kref_read(&(x)->f_count)
 
 /* Initialize and open a private file and allocate its security structure. */
 extern int open_private_file(struct file *, struct dentry *, int);
diff -ruN -X dontdiff2 linux-2.6.7/security/selinux/hooks.c files_struct-kref-2.6.7/security/selinux/hooks.c
--- linux-2.6.7/security/selinux/hooks.c	2004-06-16 10:50:04.000000000 +0530
+++ files_struct-kref-2.6.7/security/selinux/hooks.c	2004-07-27 18:55:18.000000000 +0530
@@ -1811,7 +1811,7 @@
 						continue;
 					}
 					if (devnull) {
-						atomic_inc(&devnull->f_count);
+						kref_get(&devnull->f_count);
 					} else {
 						devnull = open_devnull();
 						if (!devnull) {

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 5 of 5
  2004-08-02 10:10 [patchset] Lockfree fd lookup 0 of 5 Ravikiran G Thirumalai
                   ` (3 preceding siblings ...)
  2004-08-02 10:20 ` [patchset] Lockfree fd lookup 4 " Ravikiran G Thirumalai
@ 2004-08-02 10:23 ` Ravikiran G Thirumalai
  2004-08-02 16:56 ` [patchset] Lockfree fd lookup 0 " viro
  5 siblings, 0 replies; 23+ messages in thread
From: Ravikiran G Thirumalai @ 2004-08-02 10:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Greg KH, dipankar, viro

This patch uses rcu to make struct files_struct.fd[]
array lock free and use kref_lf_xxx api for struct file.f_count refcounter.

This is carrying forward work done earlier by Maneesh and Dipankar.

Thanks,
Kiran


D:
D: Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
D: Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
D: Signed-off-by: Ravikiran Thirumalai <kiran@in.ibm.com>
D:
D: files_struct-rcu-kref-2.6.7.patch
D: Patch to eliminate struct files_struct.file_lock spinlock on the reader
D: side and use rcu refcounting kref_lf_xxx api for the f_count refcounter.
D: This patch depends on the kref_lf patch which adds kref_lf_xxx api.
D: This patch also assumes f_count is converted to use kref api 
D: (files_struct-kref-s patch).
D: This patch improves performance of threaded io intensive workloads.
D: There has been a 15 % improvement in tiobench performance with this patch
D: on a 4 way xeon
D:
diff -ruN -X dontdiff2 linux-2.6.7/arch/mips/kernel/irixioctl.c files_struct-kref-2.6.7/arch/mips/kernel/irixioctl.c
--- linux-2.6.7/arch/mips/kernel/irixioctl.c	2004-06-16 10:49:42.000000000 +0530
+++ files_struct-kref-2.6.7/arch/mips/kernel/irixioctl.c	2004-07-27 19:52:38.790582872 +0530
@@ -14,6 +14,7 @@
 #include <linux/syscalls.h>
 #include <linux/tty.h>
 #include <linux/file.h>
+#include <linux/rcupdate.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctl.h>
@@ -33,15 +34,15 @@
 	struct file *filp;
 	struct tty_struct *ttyp = NULL;
 
-	spin_lock(&current->files->file_lock);
-	filp = fcheck(fd);
+	rcu_read_lock();
+	filp = fcheck_rcu(fd);
 	if(filp && filp->private_data) {
 		ttyp = (struct tty_struct *) filp->private_data;
 
 		if(ttyp->magic != TTY_MAGIC)
 			ttyp =NULL;
 	}
-	spin_unlock(&current->files->file_lock);
+	rcu_read_unlock();
 	return ttyp;
 }
 
diff -ruN -X dontdiff2 linux-2.6.7/arch/sparc64/solaris/ioctl.c files_struct-kref-2.6.7/arch/sparc64/solaris/ioctl.c
--- linux-2.6.7/arch/sparc64/solaris/ioctl.c	2004-06-16 10:49:36.000000000 +0530
+++ files_struct-kref-2.6.7/arch/sparc64/solaris/ioctl.c	2004-07-27 19:52:38.791582720 +0530
@@ -24,6 +24,7 @@
 #include <linux/netdevice.h>
 #include <linux/mtio.h>
 #include <linux/time.h>
+#include <linux/rcupdate.h>
 #include <linux/compat.h>
 
 #include <net/sock.h>
@@ -291,15 +292,15 @@
 {
 	struct inode *ino;
 	/* I wonder which of these tests are superfluous... --patrik */
-	spin_lock(&current->files->file_lock);
+	rcu_read_lock();
 	if (! current->files->fd[fd] ||
 	    ! current->files->fd[fd]->f_dentry ||
 	    ! (ino = current->files->fd[fd]->f_dentry->d_inode) ||
 	    ! ino->i_sock) {
-		spin_unlock(&current->files->file_lock);
+		rcu_read_unlock();
 		return TBADF;
 	}
-	spin_unlock(&current->files->file_lock);
+	rcu_read_unlock();
 	
 	switch (cmd & 0xff) {
 	case 109: /* SI_SOCKPARAMS */
diff -ruN -X dontdiff2 linux-2.6.7/arch/sparc64/solaris/timod.c files_struct-kref-2.6.7/arch/sparc64/solaris/timod.c
--- linux-2.6.7/arch/sparc64/solaris/timod.c	2004-06-16 10:49:13.000000000 +0530
+++ files_struct-kref-2.6.7/arch/sparc64/solaris/timod.c	2004-07-27 19:52:38.793582416 +0530
@@ -143,9 +143,14 @@
 static void timod_wake_socket(unsigned int fd)
 {
 	struct socket *sock;
+	struct file *filp; 
 
 	SOLD("wakeing socket");
-	sock = SOCKET_I(current->files->fd[fd]->f_dentry->d_inode);
+	if (!( filp = fcheck(fd))) {
+		SOLD("BAD FD");
+		return;
+	}
+	sock = SOCKET_I(filp->f_dentry->d_inode);
 	wake_up_interruptible(&sock->wait);
 	read_lock(&sock->sk->sk_callback_lock);
 	if (sock->fasync_list && !test_bit(SOCK_ASYNC_WAITDATA, &sock->flags))
@@ -157,9 +162,14 @@
 static void timod_queue(unsigned int fd, struct T_primsg *it)
 {
 	struct sol_socket_struct *sock;
+	struct file *filp; 
 
 	SOLD("queuing primsg");
-	sock = (struct sol_socket_struct *)current->files->fd[fd]->private_data;
+	if (!( filp = fcheck(fd))) {
+		SOLD("BAD FD");
+		return;
+	}
+	sock = (struct sol_socket_struct *)filp->private_data;
 	it->next = sock->pfirst;
 	sock->pfirst = it;
 	if (!sock->plast)
@@ -171,9 +181,14 @@
 static void timod_queue_end(unsigned int fd, struct T_primsg *it)
 {
 	struct sol_socket_struct *sock;
+	struct file *filp; 
 
 	SOLD("queuing primsg at end");
-	sock = (struct sol_socket_struct *)current->files->fd[fd]->private_data;
+	if (!( filp = fcheck(fd))) {
+		SOLD("BAD FD");
+		return;
+	}
+	sock = (struct sol_socket_struct *)filp->private_data;
 	it->next = NULL;
 	if (sock->plast)
 		sock->plast->next = it;
@@ -351,7 +366,10 @@
 		(int (*)(int, unsigned long *))SYS(socketcall);
 	int (*sys_sendto)(int, void *, size_t, unsigned, struct sockaddr *, int) =
 		(int (*)(int, void *, size_t, unsigned, struct sockaddr *, int))SYS(sendto);
-	filp = current->files->fd[fd];
+
+	if (!(filp = fcheck(fd)))
+		return -EBADF;
+
 	ino = filp->f_dentry->d_inode;
 	sock = (struct sol_socket_struct *)filp->private_data;
 	SOLD("entry");
@@ -632,7 +650,10 @@
 	
 	SOLD("entry");
 	SOLDD(("%u %p %d %p %p %d %p %d\n", fd, ctl_buf, ctl_maxlen, ctl_len, data_buf, data_maxlen, data_len, *flags_p));
-	filp = current->files->fd[fd];
+
+	if (!(filp = fcheck(fd)))
+		return -EBADF;
+
 	ino = filp->f_dentry->d_inode;
 	sock = (struct sol_socket_struct *)filp->private_data;
 	SOLDD(("%p %p\n", sock->pfirst, sock->pfirst ? sock->pfirst->next : NULL));
@@ -848,7 +869,7 @@
 	lock_kernel();
 	if(fd >= NR_OPEN) goto out;
 
-	filp = current->files->fd[fd];
+	filp = fcheck(fd);
 	if(!filp) goto out;
 
 	ino = filp->f_dentry->d_inode;
@@ -915,7 +936,7 @@
 	lock_kernel();
 	if(fd >= NR_OPEN) goto out;
 
-	filp = current->files->fd[fd];
+	filp = fcheck(fd);
 	if(!filp) goto out;
 
 	ino = filp->f_dentry->d_inode;
diff -ruN -X dontdiff2 linux-2.6.7/drivers/char/tty_io.c files_struct-kref-2.6.7/drivers/char/tty_io.c
--- linux-2.6.7/drivers/char/tty_io.c	2004-06-16 10:49:36.000000000 +0530
+++ files_struct-kref-2.6.7/drivers/char/tty_io.c	2004-07-27 19:52:38.797581808 +0530
@@ -1937,9 +1937,9 @@
 		}
 		task_lock(p);
 		if (p->files) {
-			spin_lock(&p->files->file_lock);
+			rcu_read_lock();
 			for (i=0; i < p->files->max_fds; i++) {
-				filp = fcheck_files(p->files, i);
+				filp = fcheck_files_rcu(p->files, i);
 				if (!filp)
 					continue;
 				if (filp->f_op->read == tty_read &&
@@ -1951,7 +1951,7 @@
 					break;
 				}
 			}
-			spin_unlock(&p->files->file_lock);
+			rcu_read_unlock();
 		}
 		task_unlock(p);
 	}
diff -ruN -X dontdiff2 linux-2.6.7/fs/aio.c files_struct-kref-2.6.7/fs/aio.c
--- linux-2.6.7/fs/aio.c	2004-07-27 19:59:54.740308440 +0530
+++ files_struct-kref-2.6.7/fs/aio.c	2004-07-27 19:57:03.738304696 +0530
@@ -489,7 +489,7 @@
 	/* Must be done under the lock to serialise against cancellation.
 	 * Call this aio_fput as it duplicates fput via the fput_work.
 	 */
-	if (unlikely(kref_put_last(&req->ki_filp->f_count))) {
+	if (unlikely(kref_lf_put_last(&req->ki_filp->f_count))) {
 		get_ioctx(ctx);
 		spin_lock(&fput_lock);
 		list_add(&req->ki_list, &fput_head);
diff -ruN -X dontdiff2 linux-2.6.7/fs/fcntl.c files_struct-kref-2.6.7/fs/fcntl.c
--- linux-2.6.7/fs/fcntl.c	2004-06-16 10:49:37.000000000 +0530
+++ files_struct-kref-2.6.7/fs/fcntl.c	2004-07-27 19:52:38.798581656 +0530
@@ -34,9 +34,9 @@
 {
 	struct files_struct *files = current->files;
 	int res;
-	spin_lock(&files->file_lock);
+	rcu_read_lock();
 	res = FD_ISSET(fd, files->close_on_exec);
-	spin_unlock(&files->file_lock);
+	rcu_read_unlock();
 	return res;
 }
 
@@ -138,7 +138,7 @@
 	if (fd >= 0) {
 		FD_SET(fd, files->open_fds);
 		FD_CLR(fd, files->close_on_exec);
-		spin_unlock(&files->file_lock);
+	spin_unlock(&files->file_lock);
 		fd_install(fd, file);
 	} else {
 		spin_unlock(&files->file_lock);
@@ -183,6 +183,7 @@
 		goto out_fput;
 
 	files->fd[newfd] = file;
+	smp_wmb();
 	FD_SET(newfd, files->open_fds);
 	FD_CLR(newfd, files->close_on_exec);
 	spin_unlock(&files->file_lock);
diff -ruN -X dontdiff2 linux-2.6.7/fs/file.c files_struct-kref-2.6.7/fs/file.c
--- linux-2.6.7/fs/file.c	2004-06-16 10:48:57.000000000 +0530
+++ files_struct-kref-2.6.7/fs/file.c	2004-07-27 19:52:38.800581352 +0530
@@ -14,7 +14,20 @@
 #include <linux/file.h>
 
 #include <asm/bitops.h>
+#include <linux/rcupdate.h>
 
+struct rcu_fd_array {
+	struct rcu_head rh;
+	struct file **array;   
+	int nfds;
+};
+
+struct rcu_fd_set {
+	struct rcu_head rh;
+	fd_set *openset;
+	fd_set *execset;
+	int nfds;
+};
 
 /*
  * Allocate an fd array, using kmalloc or vmalloc.
@@ -49,6 +62,13 @@
 		vfree(array);
 }
 
+static void fd_array_callback(struct rcu_head *rcu) 
+{
+	struct rcu_fd_array *a = container_of(rcu, struct rcu_fd_array, rh);
+	free_fd_array(a->array, a->nfds);
+	kfree(a); 
+}
+
 /*
  * Expand the fd array in the files_struct.  Called with the files
  * spinlock held for write.
@@ -56,8 +76,9 @@
 
 int expand_fd_array(struct files_struct *files, int nr)
 {
-	struct file **new_fds;
-	int error, nfds;
+	struct file **new_fds = NULL;
+	int error, nfds = 0;
+	struct rcu_fd_array *arg = NULL;
 
 	
 	error = -EMFILE;
@@ -89,18 +110,17 @@
 
 	error = -ENOMEM;
 	new_fds = alloc_fd_array(nfds);
+	arg = (struct rcu_fd_array *) kmalloc(sizeof(*arg), GFP_KERNEL);
+
 	spin_lock(&files->file_lock);
-	if (!new_fds)
+	if (!new_fds || !arg)
 		goto out;
 
 	/* Copy the existing array and install the new pointer */
 
 	if (nfds > files->max_fds) {
-		struct file **old_fds;
-		int i;
-		
-		old_fds = xchg(&files->fd, new_fds);
-		i = xchg(&files->max_fds, nfds);
+		struct file **old_fds = files->fd;
+		int i = files->max_fds;
 
 		/* Don't copy/clear the array if we are creating a new
 		   fd array for fork() */
@@ -109,19 +129,35 @@
 			/* clear the remainder of the array */
 			memset(&new_fds[i], 0,
 			       (nfds-i) * sizeof(struct file *)); 
-
-			spin_unlock(&files->file_lock);
-			free_fd_array(old_fds, i);
-			spin_lock(&files->file_lock);
 		}
+
+		smp_wmb();
+		files->fd = new_fds;
+		smp_wmb();
+		files->max_fds = nfds;
+		
+		if (i) {
+			arg->array = old_fds;
+			arg->nfds = i;
+			call_rcu(&arg->rh, (void (*)(void *))fd_array_callback,
+				 &arg->rh);
+		} else 
+			kfree(arg);
 	} else {
 		/* Somebody expanded the array while we slept ... */
 		spin_unlock(&files->file_lock);
 		free_fd_array(new_fds, nfds);
+		kfree(arg);
 		spin_lock(&files->file_lock);
 	}
-	error = 0;
+
+	return 0;
 out:
+	if (new_fds)
+		free_fd_array(new_fds, nfds);
+	if (arg)
+		kfree(arg);
+
 	return error;
 }
 
@@ -153,6 +189,14 @@
 		vfree(array);
 }
 
+static void fd_set_callback (struct rcu_head *rcu)
+{
+	struct rcu_fd_set *a = container_of(rcu, struct rcu_fd_set, rh); 
+	free_fdset(a->openset, a->nfds);
+	free_fdset(a->execset, a->nfds);
+	kfree(a);
+}
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
@@ -161,6 +205,7 @@
 {
 	fd_set *new_openset = 0, *new_execset = 0;
 	int error, nfds = 0;
+	struct rcu_fd_set *arg = NULL;
 
 	error = -EMFILE;
 	if (files->max_fdset >= NR_OPEN || nr >= NR_OPEN)
@@ -183,35 +228,43 @@
 	error = -ENOMEM;
 	new_openset = alloc_fdset(nfds);
 	new_execset = alloc_fdset(nfds);
+	arg = (struct rcu_fd_set *) kmalloc(sizeof(*arg), GFP_ATOMIC);
 	spin_lock(&files->file_lock);
-	if (!new_openset || !new_execset)
+	if (!new_openset || !new_execset || !arg)
 		goto out;
 
 	error = 0;
 	
 	/* Copy the existing tables and install the new pointers */
 	if (nfds > files->max_fdset) {
-		int i = files->max_fdset / (sizeof(unsigned long) * 8);
-		int count = (nfds - files->max_fdset) / 8;
+		fd_set *old_openset = files->open_fds;
+		fd_set *old_execset = files->close_on_exec;
+		int old_nfds = files->max_fdset;
+		int i = old_nfds / (sizeof(unsigned long) * 8);
+		int count = (nfds - old_nfds) / 8;
 		
 		/* 
 		 * Don't copy the entire array if the current fdset is
 		 * not yet initialised.  
 		 */
 		if (i) {
-			memcpy (new_openset, files->open_fds, files->max_fdset/8);
-			memcpy (new_execset, files->close_on_exec, files->max_fdset/8);
+			memcpy (new_openset, old_openset, old_nfds/8);
+			memcpy (new_execset, old_execset, old_nfds/8);
 			memset (&new_openset->fds_bits[i], 0, count);
 			memset (&new_execset->fds_bits[i], 0, count);
 		}
 		
-		nfds = xchg(&files->max_fdset, nfds);
-		new_openset = xchg(&files->open_fds, new_openset);
-		new_execset = xchg(&files->close_on_exec, new_execset);
-		spin_unlock(&files->file_lock);
-		free_fdset (new_openset, nfds);
-		free_fdset (new_execset, nfds);
-		spin_lock(&files->file_lock);
+		smp_wmb();
+		files->open_fds =  new_openset;
+		files->close_on_exec = new_execset;
+		smp_wmb();
+		files->max_fdset = nfds;
+
+		arg->openset = old_openset;
+		arg->execset = old_execset;
+		arg->nfds = nfds;
+		call_rcu(&arg->rh, (void (*)(void *))fd_set_callback, &arg->rh);
+
 		return 0;
 	} 
 	/* Somebody expanded the array while we slept ... */
@@ -222,6 +275,8 @@
 		free_fdset(new_openset, nfds);
 	if (new_execset)
 		free_fdset(new_execset, nfds);
+	if (arg)
+		kfree(arg);
 	spin_lock(&files->file_lock);
 	return error;
 }
diff -ruN -X dontdiff2 linux-2.6.7/fs/file_table.c files_struct-kref-2.6.7/fs/file_table.c
--- linux-2.6.7/fs/file_table.c	2004-07-27 19:59:54.741308288 +0530
+++ files_struct-kref-2.6.7/fs/file_table.c	2004-07-27 19:58:50.484076872 +0530
@@ -14,6 +14,7 @@
 #include <linux/fs.h>
 #include <linux/security.h>
 #include <linux/eventpoll.h>
+#include <linux/rcupdate.h>
 #include <linux/mount.h>
 #include <linux/cdev.h>
 
@@ -54,11 +55,17 @@
 	spin_unlock_irqrestore(&filp_count_lock, flags);
 }
 
-static inline void file_free(struct file *f)
+static inline void file_free_rcu(void *arg)
 {
+	struct file *f =  arg;
 	kmem_cache_free(filp_cachep, f);
 }
 
+static inline void file_free(struct file *f)
+{
+	call_rcu(&f->f_rcuhead, file_free_rcu, f);
+}
+
 /* Find an unused file structure and return a pointer to it.
  * Returns NULL, if there are no more free file structures or
  * we run out of memory.
@@ -81,7 +88,7 @@
 				goto fail;
 			}
 			eventpoll_init_file(f);
-			kref_init(&f->f_count);
+			kref_lf_init(&f->f_count);
 			f->f_uid = current->fsuid;
 			f->f_gid = current->fsgid;
 			f->f_owner.lock = RW_LOCK_UNLOCKED;
@@ -118,7 +125,7 @@
 	eventpoll_init_file(filp);
 	filp->f_flags  = flags;
 	filp->f_mode   = (flags+1) & O_ACCMODE;
-	kref_init(&filp->f_count);
+	kref_lf_init(&filp->f_count);
 	filp->f_dentry = dentry;
 	filp->f_mapping = dentry->d_inode->i_mapping;
 	filp->f_uid    = current->fsuid;
@@ -162,7 +169,7 @@
 
 void fastcall fput(struct file *file)
 {
-	kref_put(&file->f_count, __fput_kref);
+	kref_lf_put(&file->f_count, __fput_kref);
 }
 
 EXPORT_SYMBOL(fput);
@@ -204,11 +211,17 @@
 	struct file *file;
 	struct files_struct *files = current->files;
 
-	spin_lock(&files->file_lock);
-	file = fcheck_files(files, fd);
-	if (file)
-		get_file(file);
-	spin_unlock(&files->file_lock);
+	rcu_read_lock();
+	file = fcheck_files_rcu(files, fd);
+	if (file) {
+		if (!kref_lf_get_rcu(&file->f_count)) {
+			/* File object ref couldn't be taken */
+			rcu_read_unlock();
+			return NULL;
+		}
+	}	
+	rcu_read_unlock();
+
 	return file;
 }
 
@@ -230,14 +243,18 @@
 	if (likely((atomic_read(&files->count) == 1))) {
 		file = fcheck_files(files, fd);
 	} else {
-		spin_lock(&files->file_lock);
-		file = fcheck_files(files, fd);
+		rcu_read_lock();
+		file = fcheck_files_rcu(files, fd);
 		if (file) {
-			get_file(file);
-			*fput_needed = 1;
+			if (kref_lf_get_rcu(&file->f_count))
+				*fput_needed = 1;
+			else
+				/* Didn't get the reference, someone's freed */
+				file = NULL;
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 	}
+
 	return file;
 }
 
@@ -252,7 +269,7 @@
 	
 void put_filp(struct file *file)
 {
-	kref_put(&file->f_count, put_filp_kref);
+	kref_lf_put(&file->f_count, put_filp_kref);
 }
 
 EXPORT_SYMBOL(put_filp);
diff -ruN -X dontdiff2 linux-2.6.7/fs/open.c files_struct-kref-2.6.7/fs/open.c
--- linux-2.6.7/fs/open.c	2004-06-16 10:48:56.000000000 +0530
+++ files_struct-kref-2.6.7/fs/open.c	2004-07-27 19:52:38.802581048 +0530
@@ -1031,6 +1031,8 @@
 	if (!filp)
 		goto out_unlock;
 	files->fd[fd] = NULL;
+	/* Need to make it conistent with open_fds in __put_unused_fd() */
+	smp_wmb();
 	FD_CLR(fd, files->close_on_exec);
 	__put_unused_fd(files, fd);
 	spin_unlock(&files->file_lock);
diff -ruN -X dontdiff2 linux-2.6.7/fs/proc/base.c files_struct-kref-2.6.7/fs/proc/base.c
--- linux-2.6.7/fs/proc/base.c	2004-06-16 10:49:37.000000000 +0530
+++ files_struct-kref-2.6.7/fs/proc/base.c	2004-07-27 19:52:38.805580592 +0530
@@ -28,6 +28,7 @@
 #include <linux/namespace.h>
 #include <linux/mm.h>
 #include <linux/smp_lock.h>
+#include <linux/rcupdate.h>
 #include <linux/kallsyms.h>
 #include <linux/mount.h>
 #include <linux/security.h>
@@ -191,16 +192,16 @@
 
 	files = get_files_struct(task);
 	if (files) {
-		spin_lock(&files->file_lock);
-		file = fcheck_files(files, fd);
+		rcu_read_lock();
+		file = fcheck_files_rcu(files, fd);
 		if (file) {
 			*mnt = mntget(file->f_vfsmnt);
 			*dentry = dget(file->f_dentry);
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 			put_files_struct(files);
 			return 0;
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 		put_files_struct(files);
 	}
 	return -ENOENT;
@@ -805,15 +806,15 @@
 			files = get_files_struct(p);
 			if (!files)
 				goto out;
-			spin_lock(&files->file_lock);
+			rcu_read_lock();
 			for (fd = filp->f_pos-2;
 			     fd < files->max_fds;
 			     fd++, filp->f_pos++) {
 				unsigned int i,j;
 
-				if (!fcheck_files(files, fd))
+				if (!fcheck_files_rcu(files, fd))
 					continue;
-				spin_unlock(&files->file_lock);
+				rcu_read_unlock();
 
 				j = NUMBUF;
 				i = fd;
@@ -825,12 +826,12 @@
 
 				ino = fake_ino(tid, PROC_TID_FD_DIR + fd);
 				if (filldir(dirent, buf+j, NUMBUF-j, fd+2, ino, DT_LNK) < 0) {
-					spin_lock(&files->file_lock);
+					rcu_read_lock();
 					break;
 				}
-				spin_lock(&files->file_lock);
+				rcu_read_lock();
 			}
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 			put_files_struct(files);
 	}
 out:
@@ -1003,9 +1004,9 @@
 
 	files = get_files_struct(task);
 	if (files) {
-		spin_lock(&files->file_lock);
-		if (fcheck_files(files, fd)) {
-			spin_unlock(&files->file_lock);
+		rcu_read_lock();
+		if (fcheck_files_rcu(files, fd)) {
+			rcu_read_unlock();
 			put_files_struct(files);
 			if (task_dumpable(task)) {
 				inode->i_uid = task->euid;
@@ -1017,7 +1018,7 @@
 			security_task_to_inode(task, inode);
 			return 1;
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 		put_files_struct(files);
 	}
 	d_drop(dentry);
@@ -1109,15 +1110,15 @@
 	if (!files)
 		goto out_unlock;
 	inode->i_mode = S_IFLNK;
-	spin_lock(&files->file_lock);
-	file = fcheck_files(files, fd);
+	rcu_read_lock();
+	file = fcheck_files_rcu(files, fd);
 	if (!file)
 		goto out_unlock2;
 	if (file->f_mode & 1)
 		inode->i_mode |= S_IRUSR | S_IXUSR;
 	if (file->f_mode & 2)
 		inode->i_mode |= S_IWUSR | S_IXUSR;
-	spin_unlock(&files->file_lock);
+	rcu_read_unlock();
 	put_files_struct(files);
 	inode->i_op = &proc_pid_link_inode_operations;
 	inode->i_size = 64;
@@ -1127,7 +1128,7 @@
 	return NULL;
 
 out_unlock2:
-	spin_unlock(&files->file_lock);
+	rcu_read_unlock();
 	put_files_struct(files);
 out_unlock:
 	iput(inode);
diff -ruN -X dontdiff2 linux-2.6.7/fs/select.c files_struct-kref-2.6.7/fs/select.c
--- linux-2.6.7/fs/select.c	2004-06-16 10:48:54.000000000 +0530
+++ files_struct-kref-2.6.7/fs/select.c	2004-07-27 19:52:38.806580440 +0530
@@ -21,6 +21,7 @@
 #include <linux/personality.h> /* for STICKY_TIMEOUTS */
 #include <linux/file.h>
 #include <linux/fs.h>
+#include <linux/rcupdate.h>
 
 #include <asm/uaccess.h>
 
@@ -131,13 +132,16 @@
 static int max_select_fd(unsigned long n, fd_set_bits *fds)
 {
 	unsigned long *open_fds;
+	fd_set *open_fdset;
 	unsigned long set;
 	int max;
 
 	/* handle last in-complete long-word first */
 	set = ~(~0UL << (n & (__NFDBITS-1)));
 	n /= __NFDBITS;
-	open_fds = current->files->open_fds->fds_bits+n;
+	open_fdset = current->files->open_fds;
+	smp_read_barrier_depends();
+	open_fds = open_fdset->fds_bits+n;
 	max = 0;
 	if (set) {
 		set &= BITS(fds, n);
@@ -184,9 +188,9 @@
 	int retval, i;
 	long __timeout = *timeout;
 
- 	spin_lock(&current->files->file_lock);
+	rcu_read_lock();
 	retval = max_select_fd(n, fds);
-	spin_unlock(&current->files->file_lock);
+	rcu_read_unlock();
 
 	if (retval < 0)
 		return retval;
diff -ruN -X dontdiff2 linux-2.6.7/include/linux/file.h files_struct-kref-2.6.7/include/linux/file.h
--- linux-2.6.7/include/linux/file.h	2004-06-16 10:50:26.000000000 +0530
+++ files_struct-kref-2.6.7/include/linux/file.h	2004-07-27 19:52:38.807580288 +0530
@@ -9,6 +9,7 @@
 #include <linux/posix_types.h>
 #include <linux/compiler.h>
 #include <linux/spinlock.h>
+#include <linux/rcupdate.h>
 
 /*
  * The default fd array needs to be at least BITS_PER_LONG,
@@ -69,10 +70,26 @@
 	return file;
 }
 
+/* Need Proper read barriers if fd_array is being indexed lockfree */
+static inline struct file * fcheck_files_rcu(struct files_struct *files, unsigned int fd)
+{
+	struct file * file = NULL;
+
+	if (fd < files->max_fds) {
+		struct file ** fd_array;
+		smp_rmb();
+	       	fd_array = files->fd;
+		smp_read_barrier_depends();
+		file = fd_array[fd];
+	}
+	return file;
+}
+
 /*
  * Check whether the specified fd has an open file.
  */
 #define fcheck(fd)	fcheck_files(current->files, fd)
+#define fcheck_rcu(fd)	fcheck_files_rcu(current->files, fd)
 
 extern void FASTCALL(fd_install(unsigned int fd, struct file * file));
 
diff -ruN -X dontdiff2 linux-2.6.7/include/linux/fs.h files_struct-kref-2.6.7/include/linux/fs.h
--- linux-2.6.7/include/linux/fs.h	2004-07-27 19:59:54.751306768 +0530
+++ files_struct-kref-2.6.7/include/linux/fs.h	2004-07-27 19:59:38.190824344 +0530
@@ -580,13 +580,14 @@
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
+	struct rcu_head 	f_rcuhead;
 };
 extern spinlock_t files_lock;
 #define file_list_lock() spin_lock(&files_lock);
 #define file_list_unlock() spin_unlock(&files_lock);
 
-#define get_file(x)	kref_get(&(x)->f_count)
-#define file_count(x)	kref_read(&(x)->f_count)
+#define get_file(x)	kref_lf_get(&(x)->f_count)
+#define file_count(x)	kref_lf_read(&(x)->f_count)
 
 /* Initialize and open a private file and allocate its security structure. */
 extern int open_private_file(struct file *, struct dentry *, int);
diff -ruN -X dontdiff2 linux-2.6.7/kernel/fork.c files_struct-kref-2.6.7/kernel/fork.c
--- linux-2.6.7/kernel/fork.c	2004-06-16 10:48:57.000000000 +0530
+++ files_struct-kref-2.6.7/kernel/fork.c	2004-07-27 19:52:38.000000000 +0530
@@ -32,6 +32,7 @@
 #include <linux/syscalls.h>
 #include <linux/jiffies.h>
 #include <linux/futex.h>
+#include <linux/rcupdate.h>
 #include <linux/ptrace.h>
 #include <linux/mount.h>
 #include <linux/audit.h>
@@ -639,10 +640,12 @@
 static int count_open_files(struct files_struct *files, int size)
 {
 	int i;
+	fd_set *open_fds = files->open_fds;
 
+	smp_read_barrier_depends();
 	/* Find the last open fd */
 	for (i = size/(8*sizeof(long)); i > 0; ) {
-		if (files->open_fds->fds_bits[--i])
+		if (open_fds->fds_bits[--i])
 			break;
 	}
 	i = (i+1) * 8 * sizeof(long);
@@ -672,6 +675,14 @@
 	 * This works because we cache current->files (old) as oldf. Don't
 	 * break this.
 	 */
+	
+	/* 
+	 * We don't have oldf readlock, but even if the old fdset gets
+	 * grown now, we'll only copy upto "size" fds 
+	 */
+	size = oldf->max_fdset;	
+	smp_rmb();
+
 	tsk->files = NULL;
 	error = -ENOMEM;
 	newf = kmem_cache_alloc(files_cachep, SLAB_KERNEL);
@@ -688,9 +699,6 @@
 	newf->open_fds	    = &newf->open_fds_init;
 	newf->fd	    = &newf->fd_array[0];
 
-	/* We don't yet have the oldf readlock, but even if the old
-           fdset gets grown now, we'll only copy up to "size" fds */
-	size = oldf->max_fdset;
 	if (size > __FD_SETSIZE) {
 		newf->max_fdset = 0;
 		spin_lock(&newf->file_lock);
@@ -699,7 +707,7 @@
 		if (error)
 			goto out_release;
 	}
-	spin_lock(&oldf->file_lock);
+	rcu_read_lock();
 
 	open_files = count_open_files(oldf, size);
 
@@ -710,7 +718,7 @@
 	 */
 	nfds = NR_OPEN_DEFAULT;
 	if (open_files > nfds) {
-		spin_unlock(&oldf->file_lock);
+		rcu_read_unlock();
 		newf->max_fds = 0;
 		spin_lock(&newf->file_lock);
 		error = expand_fd_array(newf, open_files-1);
@@ -718,10 +726,11 @@
 		if (error) 
 			goto out_release;
 		nfds = newf->max_fds;
-		spin_lock(&oldf->file_lock);
+		rcu_read_lock();
 	}
 
 	old_fds = oldf->fd;
+	smp_read_barrier_depends();
 	new_fds = newf->fd;
 
 	memcpy(newf->open_fds->fds_bits, oldf->open_fds->fds_bits, open_files/8);
@@ -733,7 +742,7 @@
 			get_file(f);
 		*new_fds++ = f;
 	}
-	spin_unlock(&oldf->file_lock);
+	rcu_read_unlock();
 
 	/* compute the remainder to be cleared */
 	size = (newf->max_fds - open_files) * sizeof(struct file *);
diff -ruN -X dontdiff2 linux-2.6.7/net/ipv4/netfilter/ipt_owner.c files_struct-kref-2.6.7/net/ipv4/netfilter/ipt_owner.c
--- linux-2.6.7/net/ipv4/netfilter/ipt_owner.c	2004-06-16 10:50:03.000000000 +0530
+++ files_struct-kref-2.6.7/net/ipv4/netfilter/ipt_owner.c	2004-07-27 19:52:38.814579224 +0530
@@ -11,6 +11,7 @@
 #include <linux/module.h>
 #include <linux/skbuff.h>
 #include <linux/file.h>
+#include <linux/rcupdate.h>
 #include <net/sock.h>
 
 #include <linux/netfilter_ipv4/ipt_owner.h>
@@ -35,17 +36,17 @@
 		task_lock(p);
 		files = p->files;
 		if(files) {
-			spin_lock(&files->file_lock);
+			rcu_read_lock();
 			for (i=0; i < files->max_fds; i++) {
-				if (fcheck_files(files, i) ==
+				if (fcheck_files_rcu(files, i) ==
 				    skb->sk->sk_socket->file) {
-					spin_unlock(&files->file_lock);
+					rcu_read_unlock();
 					task_unlock(p);
 					read_unlock(&tasklist_lock);
 					return 1;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 		}
 		task_unlock(p);
 	} while_each_thread(g, p);
@@ -67,17 +68,17 @@
 	task_lock(p);
 	files = p->files;
 	if(files) {
-		spin_lock(&files->file_lock);
+		rcu_read_lock();
 		for (i=0; i < files->max_fds; i++) {
-			if (fcheck_files(files, i) ==
+			if (fcheck_files_rcu(files, i) ==
 			    skb->sk->sk_socket->file) {
-				spin_unlock(&files->file_lock);
+				rcu_read_unlock();
 				task_unlock(p);
 				read_unlock(&tasklist_lock);
 				return 1;
 			}
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 	}
 	task_unlock(p);
 out:
@@ -101,14 +102,14 @@
 		task_lock(p);
 		files = p->files;
 		if (files) {
-			spin_lock(&files->file_lock);
+			rcu_read_lock();
 			for (i=0; i < files->max_fds; i++) {
-				if (fcheck_files(files, i) == file) {
+				if (fcheck_files_rcu(files, i) == file) {
 					found = 1;
 					break;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 		}
 		task_unlock(p);
 		if (found)
diff -ruN -X dontdiff2 linux-2.6.7/net/ipv6/netfilter/ip6t_owner.c files_struct-kref-2.6.7/net/ipv6/netfilter/ip6t_owner.c
--- linux-2.6.7/net/ipv6/netfilter/ip6t_owner.c	2004-06-16 10:49:52.000000000 +0530
+++ files_struct-kref-2.6.7/net/ipv6/netfilter/ip6t_owner.c	2004-07-27 19:52:38.000000000 +0530
@@ -11,6 +11,7 @@
 #include <linux/module.h>
 #include <linux/skbuff.h>
 #include <linux/file.h>
+#include <linux/rcupdate.h>
 #include <net/sock.h>
 
 #include <linux/netfilter_ipv6/ip6t_owner.h>
@@ -34,16 +35,16 @@
 	task_lock(p);
 	files = p->files;
 	if(files) {
-		spin_lock(&files->file_lock);
+		rcu_read_lock();
 		for (i=0; i < files->max_fds; i++) {
-			if (fcheck_files(files, i) == skb->sk->sk_socket->file) {
-				spin_unlock(&files->file_lock);
+			if (fcheck_files_rcu(files, i) == skb->sk->sk_socket->file) {
+				rcu_read_unlock();
 				task_unlock(p);
 				read_unlock(&tasklist_lock);
 				return 1;
 			}
 		}
-		spin_unlock(&files->file_lock);
+		rcu_read_unlock();
 	}
 	task_unlock(p);
 out:
@@ -67,14 +68,14 @@
 		task_lock(p);
 		files = p->files;
 		if (files) {
-			spin_lock(&files->file_lock);
+			rcu_read_lock();
 			for (i=0; i < files->max_fds; i++) {
-				if (fcheck_files(files, i) == file) {
+				if (fcheck_files_rcu(files, i) == file) {
 					found = 1;
 					break;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
 		}
 		task_unlock(p);
 		if (found)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 1 of 5
  2004-08-02 10:13 ` [patchset] Lockfree fd lookup 1 " Ravikiran G Thirumalai
@ 2004-08-02 13:38   ` Christoph Hellwig
  2004-08-02 16:12     ` viro
  2004-08-03  0:44   ` Greg KH
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2004-08-02 13:38 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Andrew Morton, linux-kernel, Greg KH, dipankar, viro

On Mon, Aug 02, 2004 at 03:43:52PM +0530, Ravikiran G Thirumalai wrote:
> Here's the first patch.  This patch 'shrinks' struct kref by removing
> the release pointer in the struct kref.  GregKH has applied this to his tree

What's the advantage of this kref API over opencoded atomic_t usage?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 1 of 5
  2004-08-02 13:38   ` Christoph Hellwig
@ 2004-08-02 16:12     ` viro
  0 siblings, 0 replies; 23+ messages in thread
From: viro @ 2004-08-02 16:12 UTC (permalink / raw)
  To: Christoph Hellwig, Ravikiran G Thirumalai, Andrew Morton,
	linux-kernel, Greg KH, dipankar

On Mon, Aug 02, 2004 at 02:38:25PM +0100, Christoph Hellwig wrote:
> On Mon, Aug 02, 2004 at 03:43:52PM +0530, Ravikiran G Thirumalai wrote:
> > Here's the first patch.  This patch 'shrinks' struct kref by removing
> > the release pointer in the struct kref.  GregKH has applied this to his tree
> 
> What's the advantage of this kref API over opencoded atomic_t usage?

More interesting question is how much does it win compared to simple
switch to rwlock?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-02 10:10 [patchset] Lockfree fd lookup 0 of 5 Ravikiran G Thirumalai
                   ` (4 preceding siblings ...)
  2004-08-02 10:23 ` [patchset] Lockfree fd lookup 5 " Ravikiran G Thirumalai
@ 2004-08-02 16:56 ` viro
  2004-08-02 20:07   ` David S. Miller
  2004-08-03  9:23   ` Ravikiran G Thirumalai
  5 siblings, 2 replies; 23+ messages in thread
From: viro @ 2004-08-02 16:56 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: Andrew Morton, linux-kernel, Greg KH, dipankar

On Mon, Aug 02, 2004 at 03:40:55PM +0530, Ravikiran G Thirumalai wrote:
> Here is a patchset to eliminate taking struct files_struct.file_lock on 
> reader side using rcu and rcu based refcounting.  These patches
> extend the kref api to include kref_lf_xxx api and kref_lf_get_rcu to
> do lockfree refcounting, and use the same.  As posted earlier, since fd
> lookups (struct files_struct.fd[]) will be lock free with these patches, 
> threaded workloads doing lots of io should see performance benefits 
> due to this patchset.  I have observed 13-15% improvement with tiobench 
> on a 4 way xeon with this patchset.

How about this for comparison?  That's just a dumb "convert to rwlock"
patch; we can be smarter in e.g. close_on_exec handling, but that's a
separate story.

diff -urN RC8-rc2-bk12/arch/mips/kernel/irixioctl.c RC8-rc2-bk12-current/arch/mips/kernel/irixioctl.c
--- RC8-rc2-bk12/arch/mips/kernel/irixioctl.c	2004-03-17 15:29:37.000000000 -0500
+++ RC8-rc2-bk12-current/arch/mips/kernel/irixioctl.c	2004-08-02 12:24:00.000000000 -0400
@@ -33,7 +33,7 @@
 	struct file *filp;
 	struct tty_struct *ttyp = NULL;
 
-	spin_lock(&current->files->file_lock);
+	read_lock(&current->files->file_lock);
 	filp = fcheck(fd);
 	if(filp && filp->private_data) {
 		ttyp = (struct tty_struct *) filp->private_data;
@@ -41,7 +41,7 @@
 		if(ttyp->magic != TTY_MAGIC)
 			ttyp =NULL;
 	}
-	spin_unlock(&current->files->file_lock);
+	read_unlock(&current->files->file_lock);
 	return ttyp;
 }
 
diff -urN RC8-rc2-bk12/arch/sparc64/solaris/ioctl.c RC8-rc2-bk12-current/arch/sparc64/solaris/ioctl.c
--- RC8-rc2-bk12/arch/sparc64/solaris/ioctl.c	2004-08-02 12:19:34.000000000 -0400
+++ RC8-rc2-bk12-current/arch/sparc64/solaris/ioctl.c	2004-08-02 12:24:23.000000000 -0400
@@ -294,15 +294,15 @@
 {
 	struct inode *ino;
 	/* I wonder which of these tests are superfluous... --patrik */
-	spin_lock(&current->files->file_lock);
+	read_lock(&current->files->file_lock);
 	if (! current->files->fd[fd] ||
 	    ! current->files->fd[fd]->f_dentry ||
 	    ! (ino = current->files->fd[fd]->f_dentry->d_inode) ||
 	    ! ino->i_sock) {
-		spin_unlock(&current->files->file_lock);
+		read_unlock(&current->files->file_lock);
 		return TBADF;
 	}
-	spin_unlock(&current->files->file_lock);
+	read_unlock(&current->files->file_lock);
 	
 	switch (cmd & 0xff) {
 	case 109: /* SI_SOCKPARAMS */
diff -urN RC8-rc2-bk12/drivers/char/tty_io.c RC8-rc2-bk12-current/drivers/char/tty_io.c
--- RC8-rc2-bk12/drivers/char/tty_io.c	2004-08-02 12:19:35.000000000 -0400
+++ RC8-rc2-bk12-current/drivers/char/tty_io.c	2004-08-02 12:24:52.139649424 -0400
@@ -1978,7 +1978,7 @@
 		}
 		task_lock(p);
 		if (p->files) {
-			spin_lock(&p->files->file_lock);
+			read_lock(&p->files->file_lock);
 			for (i=0; i < p->files->max_fds; i++) {
 				filp = fcheck_files(p->files, i);
 				if (!filp)
@@ -1992,7 +1992,7 @@
 					break;
 				}
 			}
-			spin_unlock(&p->files->file_lock);
+			read_unlock(&p->files->file_lock);
 		}
 		task_unlock(p);
 	}
diff -urN RC8-rc2-bk12/fs/exec.c RC8-rc2-bk12-current/fs/exec.c
--- RC8-rc2-bk12/fs/exec.c	2004-08-02 12:19:36.000000000 -0400
+++ RC8-rc2-bk12-current/fs/exec.c	2004-08-02 12:40:31.000000000 -0400
@@ -762,7 +762,7 @@
 {
 	long j = -1;
 
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	for (;;) {
 		unsigned long set, i;
 
@@ -774,16 +774,16 @@
 		if (!set)
 			continue;
 		files->close_on_exec->fds_bits[j] = 0;
-		spin_unlock(&files->file_lock);
+		write_unlock(&files->file_lock);
 		for ( ; set ; i++,set >>= 1) {
 			if (set & 1) {
 				sys_close(i);
 			}
 		}
-		spin_lock(&files->file_lock);
+		write_lock(&files->file_lock);
 
 	}
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 }
 
 int flush_old_exec(struct linux_binprm * bprm)
diff -urN RC8-rc2-bk12/fs/fcntl.c RC8-rc2-bk12-current/fs/fcntl.c
--- RC8-rc2-bk12/fs/fcntl.c	2004-07-18 09:08:41.000000000 -0400
+++ RC8-rc2-bk12-current/fs/fcntl.c	2004-08-02 12:39:55.000000000 -0400
@@ -22,21 +22,21 @@
 void fastcall set_close_on_exec(unsigned int fd, int flag)
 {
 	struct files_struct *files = current->files;
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	if (flag)
 		FD_SET(fd, files->close_on_exec);
 	else
 		FD_CLR(fd, files->close_on_exec);
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 }
 
 static inline int get_close_on_exec(unsigned int fd)
 {
 	struct files_struct *files = current->files;
 	int res;
-	spin_lock(&files->file_lock);
+	read_lock(&files->file_lock);
 	res = FD_ISSET(fd, files->close_on_exec);
-	spin_unlock(&files->file_lock);
+	read_unlock(&files->file_lock);
 	return res;
 }
 
@@ -133,15 +133,15 @@
 	struct files_struct * files = current->files;
 	int fd;
 
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	fd = locate_fd(files, file, start);
 	if (fd >= 0) {
 		FD_SET(fd, files->open_fds);
 		FD_CLR(fd, files->close_on_exec);
-		spin_unlock(&files->file_lock);
+		write_unlock(&files->file_lock);
 		fd_install(fd, file);
 	} else {
-		spin_unlock(&files->file_lock);
+		write_unlock(&files->file_lock);
 		fput(file);
 	}
 
@@ -154,7 +154,7 @@
 	struct file * file, *tofree;
 	struct files_struct * files = current->files;
 
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	if (!(file = fcheck(oldfd)))
 		goto out_unlock;
 	err = newfd;
@@ -185,7 +185,7 @@
 	files->fd[newfd] = file;
 	FD_SET(newfd, files->open_fds);
 	FD_CLR(newfd, files->close_on_exec);
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 
 	if (tofree)
 		filp_close(tofree, files);
@@ -193,11 +193,11 @@
 out:
 	return err;
 out_unlock:
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 	goto out;
 
 out_fput:
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 	fput(file);
 	goto out;
 }
diff -urN RC8-rc2-bk12/fs/file.c RC8-rc2-bk12-current/fs/file.c
--- RC8-rc2-bk12/fs/file.c	2004-07-18 09:08:41.000000000 -0400
+++ RC8-rc2-bk12-current/fs/file.c	2004-08-02 12:30:35.000000000 -0400
@@ -65,7 +65,7 @@
 		goto out;
 
 	nfds = files->max_fds;
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 
 	/* 
 	 * Expand to the max in easy steps, and keep expanding it until
@@ -89,7 +89,7 @@
 
 	error = -ENOMEM;
 	new_fds = alloc_fd_array(nfds);
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	if (!new_fds)
 		goto out;
 
@@ -110,15 +110,15 @@
 			memset(&new_fds[i], 0,
 			       (nfds-i) * sizeof(struct file *)); 
 
-			spin_unlock(&files->file_lock);
+			write_unlock(&files->file_lock);
 			free_fd_array(old_fds, i);
-			spin_lock(&files->file_lock);
+			write_lock(&files->file_lock);
 		}
 	} else {
 		/* Somebody expanded the array while we slept ... */
-		spin_unlock(&files->file_lock);
+		write_unlock(&files->file_lock);
 		free_fd_array(new_fds, nfds);
-		spin_lock(&files->file_lock);
+		write_lock(&files->file_lock);
 	}
 	error = 0;
 out:
@@ -167,7 +167,7 @@
 		goto out;
 
 	nfds = files->max_fdset;
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 
 	/* Expand to the max in easy steps */
 	do {
@@ -183,7 +183,7 @@
 	error = -ENOMEM;
 	new_openset = alloc_fdset(nfds);
 	new_execset = alloc_fdset(nfds);
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	if (!new_openset || !new_execset)
 		goto out;
 
@@ -208,21 +208,21 @@
 		nfds = xchg(&files->max_fdset, nfds);
 		new_openset = xchg(&files->open_fds, new_openset);
 		new_execset = xchg(&files->close_on_exec, new_execset);
-		spin_unlock(&files->file_lock);
+		write_unlock(&files->file_lock);
 		free_fdset (new_openset, nfds);
 		free_fdset (new_execset, nfds);
-		spin_lock(&files->file_lock);
+		write_lock(&files->file_lock);
 		return 0;
 	} 
 	/* Somebody expanded the array while we slept ... */
 
 out:
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 	if (new_openset)
 		free_fdset(new_openset, nfds);
 	if (new_execset)
 		free_fdset(new_execset, nfds);
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	return error;
 }
 
diff -urN RC8-rc2-bk12/fs/file_table.c RC8-rc2-bk12-current/fs/file_table.c
--- RC8-rc2-bk12/fs/file_table.c	2004-05-10 00:23:35.000000000 -0400
+++ RC8-rc2-bk12-current/fs/file_table.c	2004-08-02 12:31:01.000000000 -0400
@@ -197,11 +197,11 @@
 	struct file *file;
 	struct files_struct *files = current->files;
 
-	spin_lock(&files->file_lock);
+	read_lock(&files->file_lock);
 	file = fcheck_files(files, fd);
 	if (file)
 		get_file(file);
-	spin_unlock(&files->file_lock);
+	read_unlock(&files->file_lock);
 	return file;
 }
 
@@ -223,13 +223,13 @@
 	if (likely((atomic_read(&files->count) == 1))) {
 		file = fcheck_files(files, fd);
 	} else {
-		spin_lock(&files->file_lock);
+		read_lock(&files->file_lock);
 		file = fcheck_files(files, fd);
 		if (file) {
 			get_file(file);
 			*fput_needed = 1;
 		}
-		spin_unlock(&files->file_lock);
+		read_unlock(&files->file_lock);
 	}
 	return file;
 }
diff -urN RC8-rc2-bk12/fs/open.c RC8-rc2-bk12-current/fs/open.c
--- RC8-rc2-bk12/fs/open.c	2004-06-16 10:26:24.000000000 -0400
+++ RC8-rc2-bk12-current/fs/open.c	2004-08-02 12:33:29.000000000 -0400
@@ -841,7 +841,7 @@
 	int fd, error;
 
   	error = -EMFILE;
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 
 repeat:
  	fd = find_next_zero_bit(files->open_fds->fds_bits, 
@@ -890,7 +890,7 @@
 	error = fd;
 
 out:
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 	return error;
 }
 
@@ -906,9 +906,9 @@
 void fastcall put_unused_fd(unsigned int fd)
 {
 	struct files_struct *files = current->files;
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	__put_unused_fd(files, fd);
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 }
 
 EXPORT_SYMBOL(put_unused_fd);
@@ -929,11 +929,11 @@
 void fastcall fd_install(unsigned int fd, struct file * file)
 {
 	struct files_struct *files = current->files;
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	if (unlikely(files->fd[fd] != NULL))
 		BUG();
 	files->fd[fd] = file;
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 }
 
 EXPORT_SYMBOL(fd_install);
@@ -1024,7 +1024,7 @@
 	struct file * filp;
 	struct files_struct *files = current->files;
 
-	spin_lock(&files->file_lock);
+	write_lock(&files->file_lock);
 	if (fd >= files->max_fds)
 		goto out_unlock;
 	filp = files->fd[fd];
@@ -1033,11 +1033,11 @@
 	files->fd[fd] = NULL;
 	FD_CLR(fd, files->close_on_exec);
 	__put_unused_fd(files, fd);
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 	return filp_close(filp, files);
 
 out_unlock:
-	spin_unlock(&files->file_lock);
+	write_unlock(&files->file_lock);
 	return -EBADF;
 }
 
diff -urN RC8-rc2-bk12/fs/proc/base.c RC8-rc2-bk12-current/fs/proc/base.c
--- RC8-rc2-bk12/fs/proc/base.c	2004-07-18 09:08:42.000000000 -0400
+++ RC8-rc2-bk12-current/fs/proc/base.c	2004-08-02 12:55:13.630740688 -0400
@@ -191,16 +191,16 @@
 
 	files = get_files_struct(task);
 	if (files) {
-		spin_lock(&files->file_lock);
+		read_lock(&files->file_lock);
 		file = fcheck_files(files, fd);
 		if (file) {
 			*mnt = mntget(file->f_vfsmnt);
 			*dentry = dget(file->f_dentry);
-			spin_unlock(&files->file_lock);
+			read_unlock(&files->file_lock);
 			put_files_struct(files);
 			return 0;
 		}
-		spin_unlock(&files->file_lock);
+		read_unlock(&files->file_lock);
 		put_files_struct(files);
 	}
 	return -ENOENT;
@@ -805,7 +805,7 @@
 			files = get_files_struct(p);
 			if (!files)
 				goto out;
-			spin_lock(&files->file_lock);
+			read_lock(&files->file_lock);
 			for (fd = filp->f_pos-2;
 			     fd < files->max_fds;
 			     fd++, filp->f_pos++) {
@@ -813,7 +813,7 @@
 
 				if (!fcheck_files(files, fd))
 					continue;
-				spin_unlock(&files->file_lock);
+				read_unlock(&files->file_lock);
 
 				j = NUMBUF;
 				i = fd;
@@ -825,12 +825,12 @@
 
 				ino = fake_ino(tid, PROC_TID_FD_DIR + fd);
 				if (filldir(dirent, buf+j, NUMBUF-j, fd+2, ino, DT_LNK) < 0) {
-					spin_lock(&files->file_lock);
+					read_lock(&files->file_lock);
 					break;
 				}
-				spin_lock(&files->file_lock);
+				read_lock(&files->file_lock);
 			}
-			spin_unlock(&files->file_lock);
+			read_unlock(&files->file_lock);
 			put_files_struct(files);
 	}
 out:
@@ -1003,9 +1003,9 @@
 
 	files = get_files_struct(task);
 	if (files) {
-		spin_lock(&files->file_lock);
+		read_lock(&files->file_lock);
 		if (fcheck_files(files, fd)) {
-			spin_unlock(&files->file_lock);
+			read_unlock(&files->file_lock);
 			put_files_struct(files);
 			if (task_dumpable(task)) {
 				inode->i_uid = task->euid;
@@ -1017,7 +1017,7 @@
 			security_task_to_inode(task, inode);
 			return 1;
 		}
-		spin_unlock(&files->file_lock);
+		read_unlock(&files->file_lock);
 		put_files_struct(files);
 	}
 	d_drop(dentry);
@@ -1109,7 +1109,7 @@
 	if (!files)
 		goto out_unlock;
 	inode->i_mode = S_IFLNK;
-	spin_lock(&files->file_lock);
+	read_lock(&files->file_lock);
 	file = fcheck_files(files, fd);
 	if (!file)
 		goto out_unlock2;
@@ -1117,7 +1117,7 @@
 		inode->i_mode |= S_IRUSR | S_IXUSR;
 	if (file->f_mode & 2)
 		inode->i_mode |= S_IWUSR | S_IXUSR;
-	spin_unlock(&files->file_lock);
+	read_unlock(&files->file_lock);
 	put_files_struct(files);
 	inode->i_op = &proc_pid_link_inode_operations;
 	inode->i_size = 64;
@@ -1127,7 +1127,7 @@
 	return NULL;
 
 out_unlock2:
-	spin_unlock(&files->file_lock);
+	read_unlock(&files->file_lock);
 	put_files_struct(files);
 out_unlock:
 	iput(inode);
diff -urN RC8-rc2-bk12/fs/select.c RC8-rc2-bk12-current/fs/select.c
--- RC8-rc2-bk12/fs/select.c	2003-10-09 17:34:47.000000000 -0400
+++ RC8-rc2-bk12-current/fs/select.c	2004-08-02 12:34:06.000000000 -0400
@@ -184,9 +184,9 @@
 	int retval, i;
 	long __timeout = *timeout;
 
- 	spin_lock(&current->files->file_lock);
+ 	read_lock(&current->files->file_lock);
 	retval = max_select_fd(n, fds);
-	spin_unlock(&current->files->file_lock);
+	read_unlock(&current->files->file_lock);
 
 	if (retval < 0)
 		return retval;
diff -urN RC8-rc2-bk12/include/linux/file.h RC8-rc2-bk12-current/include/linux/file.h
--- RC8-rc2-bk12/include/linux/file.h	2004-05-10 00:23:41.000000000 -0400
+++ RC8-rc2-bk12-current/include/linux/file.h	2004-08-02 12:35:52.374278504 -0400
@@ -21,7 +21,7 @@
  */
 struct files_struct {
         atomic_t count;
-        spinlock_t file_lock;     /* Protects all the below members.  Nests inside tsk->alloc_lock */
+        rwlock_t file_lock;     /* Protects all the below members.  Nests inside tsk->alloc_lock */
         int max_fds;
         int max_fdset;
         int next_fd;
diff -urN RC8-rc2-bk12/include/linux/init_task.h RC8-rc2-bk12-current/include/linux/init_task.h
--- RC8-rc2-bk12/include/linux/init_task.h	2004-07-18 09:08:44.000000000 -0400
+++ RC8-rc2-bk12-current/include/linux/init_task.h	2004-08-02 12:35:37.000000000 -0400
@@ -6,7 +6,7 @@
 #define INIT_FILES \
 { 							\
 	.count		= ATOMIC_INIT(1), 		\
-	.file_lock	= SPIN_LOCK_UNLOCKED, 		\
+	.file_lock	= RW_LOCK_UNLOCKED, 		\
 	.max_fds	= NR_OPEN_DEFAULT, 		\
 	.max_fdset	= __FD_SETSIZE, 		\
 	.next_fd	= 0, 				\
diff -urN RC8-rc2-bk12/kernel/fork.c RC8-rc2-bk12-current/kernel/fork.c
--- RC8-rc2-bk12/kernel/fork.c	2004-08-02 12:19:37.000000000 -0400
+++ RC8-rc2-bk12-current/kernel/fork.c	2004-08-02 12:37:08.000000000 -0400
@@ -680,7 +680,7 @@
 
 	atomic_set(&newf->count, 1);
 
-	newf->file_lock	    = SPIN_LOCK_UNLOCKED;
+	newf->file_lock	    = RW_LOCK_UNLOCKED;
 	newf->next_fd	    = 0;
 	newf->max_fds	    = NR_OPEN_DEFAULT;
 	newf->max_fdset	    = __FD_SETSIZE;
@@ -693,13 +693,13 @@
 	size = oldf->max_fdset;
 	if (size > __FD_SETSIZE) {
 		newf->max_fdset = 0;
-		spin_lock(&newf->file_lock);
+		write_lock(&newf->file_lock);
 		error = expand_fdset(newf, size-1);
-		spin_unlock(&newf->file_lock);
+		write_unlock(&newf->file_lock);
 		if (error)
 			goto out_release;
 	}
-	spin_lock(&oldf->file_lock);
+	read_lock(&oldf->file_lock);
 
 	open_files = count_open_files(oldf, size);
 
@@ -710,15 +710,15 @@
 	 */
 	nfds = NR_OPEN_DEFAULT;
 	if (open_files > nfds) {
-		spin_unlock(&oldf->file_lock);
+		read_unlock(&oldf->file_lock);
 		newf->max_fds = 0;
-		spin_lock(&newf->file_lock);
+		write_lock(&newf->file_lock);
 		error = expand_fd_array(newf, open_files-1);
-		spin_unlock(&newf->file_lock);
+		write_unlock(&newf->file_lock);
 		if (error) 
 			goto out_release;
 		nfds = newf->max_fds;
-		spin_lock(&oldf->file_lock);
+		read_lock(&oldf->file_lock);
 	}
 
 	old_fds = oldf->fd;
@@ -733,7 +733,7 @@
 			get_file(f);
 		*new_fds++ = f;
 	}
-	spin_unlock(&oldf->file_lock);
+	read_unlock(&oldf->file_lock);
 
 	/* compute the remainder to be cleared */
 	size = (newf->max_fds - open_files) * sizeof(struct file *);
diff -urN RC8-rc2-bk12/net/ipv4/netfilter/ipt_owner.c RC8-rc2-bk12-current/net/ipv4/netfilter/ipt_owner.c
--- RC8-rc2-bk12/net/ipv4/netfilter/ipt_owner.c	2004-07-18 09:08:45.000000000 -0400
+++ RC8-rc2-bk12-current/net/ipv4/netfilter/ipt_owner.c	2004-08-02 12:37:58.000000000 -0400
@@ -35,17 +35,17 @@
 		task_lock(p);
 		files = p->files;
 		if(files) {
-			spin_lock(&files->file_lock);
+			read_lock(&files->file_lock);
 			for (i=0; i < files->max_fds; i++) {
 				if (fcheck_files(files, i) ==
 				    skb->sk->sk_socket->file) {
-					spin_unlock(&files->file_lock);
+					read_unlock(&files->file_lock);
 					task_unlock(p);
 					read_unlock(&tasklist_lock);
 					return 1;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			read_unlock(&files->file_lock);
 		}
 		task_unlock(p);
 	} while_each_thread(g, p);
@@ -67,17 +67,17 @@
 	task_lock(p);
 	files = p->files;
 	if(files) {
-		spin_lock(&files->file_lock);
+		read_lock(&files->file_lock);
 		for (i=0; i < files->max_fds; i++) {
 			if (fcheck_files(files, i) ==
 			    skb->sk->sk_socket->file) {
-				spin_unlock(&files->file_lock);
+				read_unlock(&files->file_lock);
 				task_unlock(p);
 				read_unlock(&tasklist_lock);
 				return 1;
 			}
 		}
-		spin_unlock(&files->file_lock);
+		read_unlock(&files->file_lock);
 	}
 	task_unlock(p);
 out:
@@ -101,14 +101,14 @@
 		task_lock(p);
 		files = p->files;
 		if (files) {
-			spin_lock(&files->file_lock);
+			read_lock(&files->file_lock);
 			for (i=0; i < files->max_fds; i++) {
 				if (fcheck_files(files, i) == file) {
 					found = 1;
 					break;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			read_unlock(&files->file_lock);
 		}
 		task_unlock(p);
 		if (found)
diff -urN RC8-rc2-bk12/net/ipv6/netfilter/ip6t_owner.c RC8-rc2-bk12-current/net/ipv6/netfilter/ip6t_owner.c
--- RC8-rc2-bk12/net/ipv6/netfilter/ip6t_owner.c	2004-07-18 09:08:45.000000000 -0400
+++ RC8-rc2-bk12-current/net/ipv6/netfilter/ip6t_owner.c	2004-08-02 12:38:12.000000000 -0400
@@ -34,16 +34,16 @@
 	task_lock(p);
 	files = p->files;
 	if(files) {
-		spin_lock(&files->file_lock);
+		read_lock(&files->file_lock);
 		for (i=0; i < files->max_fds; i++) {
 			if (fcheck_files(files, i) == skb->sk->sk_socket->file) {
-				spin_unlock(&files->file_lock);
+				read_unlock(&files->file_lock);
 				task_unlock(p);
 				read_unlock(&tasklist_lock);
 				return 1;
 			}
 		}
-		spin_unlock(&files->file_lock);
+		read_unlock(&files->file_lock);
 	}
 	task_unlock(p);
 out:
@@ -67,14 +67,14 @@
 		task_lock(p);
 		files = p->files;
 		if (files) {
-			spin_lock(&files->file_lock);
+			read_lock(&files->file_lock);
 			for (i=0; i < files->max_fds; i++) {
 				if (fcheck_files(files, i) == file) {
 					found = 1;
 					break;
 				}
 			}
-			spin_unlock(&files->file_lock);
+			read_unlock(&files->file_lock);
 		}
 		task_unlock(p);
 		if (found)
diff -urN RC8-rc2-bk12/security/selinux/hooks.c RC8-rc2-bk12-current/security/selinux/hooks.c
--- RC8-rc2-bk12/security/selinux/hooks.c	2004-08-02 12:19:37.000000000 -0400
+++ RC8-rc2-bk12-current/security/selinux/hooks.c	2004-08-02 12:39:03.000000000 -0400
@@ -1794,7 +1794,7 @@
 
 	AVC_AUDIT_DATA_INIT(&ad,FS);
 
-	spin_lock(&files->file_lock);
+	read_lock(&files->file_lock);
 	for (;;) {
 		unsigned long set, i;
 		int fd;
@@ -1806,7 +1806,7 @@
 		set = files->open_fds->fds_bits[j];
 		if (!set)
 			continue;
-		spin_unlock(&files->file_lock);
+		read_unlock(&files->file_lock);
 		for ( ; set ; i++,set >>= 1) {
 			if (set & 1) {
 				file = fget(i);
@@ -1838,10 +1838,10 @@
 				fput(file);
 			}
 		}
-		spin_lock(&files->file_lock);
+		read_lock(&files->file_lock);
 
 	}
-	spin_unlock(&files->file_lock);
+	read_unlock(&files->file_lock);
 }
 
 static void selinux_bprm_apply_creds(struct linux_binprm *bprm, int unsafe)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 0 of 5
@ 2004-08-02 18:50 Manfred Spraul
  0 siblings, 0 replies; 23+ messages in thread
From: Manfred Spraul @ 2004-08-02 18:50 UTC (permalink / raw)
  To: viro; +Cc: linux-kernel, Ravikiran G Thirumalai

viro wrote:

>How about this for comparison?  That's just a dumb "convert to rwlock"
>patch; we can be smarter in e.g. close_on_exec handling, but that's a
>separate story.
>  
>
That won't help:
The problem is the cache line trashing from fget() and fget_light() with 
multithreaded apps. A sequence lock might help, but an rw lock is not a 
solution.
Actually: 2.4 had an rwlock, it was converted to a spinlock because 
spinlocks are faster on i386:

    read_lock();
    short operation();
    read_unlock();

is a bit slower than

    spin_lock();
    short operation();
    spin_unlock();

because read_unlock() is a full memory barrier and spin_unlock is not a 
memory barrier (not necessary because writes are ordered)

--
    Manfred

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-02 16:56 ` [patchset] Lockfree fd lookup 0 " viro
@ 2004-08-02 20:07   ` David S. Miller
  2004-08-02 21:01     ` William Lee Irwin III
  2004-08-03  9:23   ` Ravikiran G Thirumalai
  1 sibling, 1 reply; 23+ messages in thread
From: David S. Miller @ 2004-08-02 20:07 UTC (permalink / raw)
  To: viro; +Cc: kiran, akpm, linux-kernel, greg, dipankar

On Mon, 2 Aug 2004 17:56:07 +0100
viro@parcelfarce.linux.theplanet.co.uk wrote:

> How about this for comparison?  That's just a dumb "convert to rwlock"
> patch; we can be smarter in e.g. close_on_exec handling, but that's a
> separate story.

Compares to plain spinlocks, rwlock's don't buy you much,
if anything, these days.

Especially for short sequences of code.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-02 20:07   ` David S. Miller
@ 2004-08-02 21:01     ` William Lee Irwin III
  2004-08-02 23:15       ` David S. Miller
  0 siblings, 1 reply; 23+ messages in thread
From: William Lee Irwin III @ 2004-08-02 21:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: viro, kiran, akpm, linux-kernel, greg, dipankar

On Mon, 2 Aug 2004 17:56:07 +0100 viro@parcelfarce.linux.theplanet.co.uk wrote:
>> How about this for comparison?  That's just a dumb "convert to rwlock"
>> patch; we can be smarter in e.g. close_on_exec handling, but that's a
>> separate story.

On Mon, Aug 02, 2004 at 01:07:29PM -0700, David S. Miller wrote:
> Compares to plain spinlocks, rwlock's don't buy you much,
> if anything, these days.
> Especially for short sequences of code.

I've found unusual results in this area. e.g. it does appear to matter
for mapping->tree_lock for database workloads that heavily share a
given file and access it in parallel. The radix tree walk, though
intuitively short, is long enough to make the rwlock a win in the
database-oriented uses and microbenchmarks starting around 4x.


-- wli

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-02 21:01     ` William Lee Irwin III
@ 2004-08-02 23:15       ` David S. Miller
  2004-08-03  2:04         ` William Lee Irwin III
  0 siblings, 1 reply; 23+ messages in thread
From: David S. Miller @ 2004-08-02 23:15 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel, rusty

On Mon, 2 Aug 2004 14:01:19 -0700
William Lee Irwin III <wli@holomorphy.com> wrote:

> I've found unusual results in this area. e.g. it does appear to matter
> for mapping->tree_lock for database workloads that heavily share a
> given file and access it in parallel. The radix tree walk, though
> intuitively short, is long enough to make the rwlock a win in the
> database-oriented uses and microbenchmarks starting around 4x.

Thanks for the data point, because I had this patch I had sent
to Rusty for 2.7.x which ripped rwlocks entirely out of the
kernel.  We might have to toss that idea :-)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 2 of 5
  2004-08-02 10:16 ` [patchset] Lockfree fd lookup 2 " Ravikiran G Thirumalai
@ 2004-08-03  0:43   ` Greg KH
  2004-08-03  6:28     ` Ravikiran G Thirumalai
  2004-08-03  6:44   ` Greg KH
  1 sibling, 1 reply; 23+ messages in thread
From: Greg KH @ 2004-08-03  0:43 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: Andrew Morton, linux-kernel, dipankar, viro

On Mon, Aug 02, 2004 at 03:46:06PM +0530, Ravikiran G Thirumalai wrote:
> Here's the second patch.  This changes the current kref users to use
> the 'shrunk' kref objects and api.  GregKH has applied this to his tree too.

Well, I applied this, and then fixed it to actually link and work
properly.  Next time, please at least build your patches...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 1 of 5
  2004-08-02 10:13 ` [patchset] Lockfree fd lookup 1 " Ravikiran G Thirumalai
  2004-08-02 13:38   ` Christoph Hellwig
@ 2004-08-03  0:44   ` Greg KH
  1 sibling, 0 replies; 23+ messages in thread
From: Greg KH @ 2004-08-03  0:44 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: Andrew Morton, linux-kernel, dipankar, viro

On Mon, Aug 02, 2004 at 03:43:52PM +0530, Ravikiran G Thirumalai wrote:
> Here's the first patch.  This patch 'shrinks' struct kref by removing
> the release pointer in the struct kref.  GregKH has applied this to his tree

I applied a tweaked version of this.  It's now in the bk-drivers tree.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-02 23:15       ` David S. Miller
@ 2004-08-03  2:04         ` William Lee Irwin III
  0 siblings, 0 replies; 23+ messages in thread
From: William Lee Irwin III @ 2004-08-03  2:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, rusty

On Mon, 2 Aug 2004 14:01:19 -0700 William Lee Irwin III wrote:
>> I've found unusual results in this area. e.g. it does appear to matter
>> for mapping->tree_lock for database workloads that heavily share a
>> given file and access it in parallel. The radix tree walk, though
>> intuitively short, is long enough to make the rwlock a win in the
>> database-oriented uses and microbenchmarks starting around 4x.

On Mon, Aug 02, 2004 at 04:15:14PM -0700, David S. Miller wrote:
> Thanks for the data point, because I had this patch I had sent
> to Rusty for 2.7.x which ripped rwlocks entirely out of the
> kernel.  We might have to toss that idea :-)

In all honesty, I'd rather use RCU, but that's a little more work than
most RCU conversions.


-- wli

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 2 of 5
  2004-08-03  0:43   ` Greg KH
@ 2004-08-03  6:28     ` Ravikiran G Thirumalai
  0 siblings, 0 replies; 23+ messages in thread
From: Ravikiran G Thirumalai @ 2004-08-03  6:28 UTC (permalink / raw)
  To: Greg KH; +Cc: Andrew Morton, linux-kernel, dipankar, viro

On Mon, Aug 02, 2004 at 05:43:43PM -0700, Greg KH wrote:
> On Mon, Aug 02, 2004 at 03:46:06PM +0530, Ravikiran G Thirumalai wrote:
> > Here's the second patch.  This changes the current kref users to use
> > the 'shrunk' kref objects and api.  GregKH has applied this to his tree too.
> 
> Well, I applied this, and then fixed it to actually link and work
> properly.  Next time, please at least build your patches...
> ...

I did build it before sending it.  My initial submission to you
was based on 2.6.7 and you'd said it was applied.  My current submission
is 2.6.7 based too, and I didn't check for any driver changes since
you'd said you already applied it (although I couldn't confirm it since 
I did not know which bk tree you use).  The current patchset builds fine on
my setup on 2.6.7.  Please let me know what I missed so that I can take
care of such things in future.

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 2 of 5
  2004-08-02 10:16 ` [patchset] Lockfree fd lookup 2 " Ravikiran G Thirumalai
  2004-08-03  0:43   ` Greg KH
@ 2004-08-03  6:44   ` Greg KH
  2004-08-03  7:41     ` Ravikiran G Thirumalai
  1 sibling, 1 reply; 23+ messages in thread
From: Greg KH @ 2004-08-03  6:44 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: Andrew Morton, linux-kernel, dipankar, viro

On Mon, Aug 02, 2004 at 03:46:06PM +0530, Ravikiran G Thirumalai wrote:
> diff -ruN -X dontdiff2 linux-2.6.7/drivers/usb/core/message.c kref-2.6.7/drivers/usb/core/message.c
> --- linux-2.6.7/drivers/usb/core/message.c	2004-06-16 10:49:02.000000000 +0530
> +++ kref-2.6.7/drivers/usb/core/message.c	2004-07-20 15:07:24.000000000 +0530
> @@ -1077,11 +1077,12 @@
>  
>  static void release_interface(struct device *dev)
>  {
> +	extern void destroy_serial(struct kref *kref);
>  	struct usb_interface *intf = to_usb_interface(dev);
>  	struct usb_interface_cache *intfc =
>  			altsetting_to_usb_interface_cache(intf->altsetting);
>  
> -	kref_put(&intfc->ref);
> +	kref_put(&intfc->ref, destroy_serial);
>  	kfree(intf);
>  }

This is the bug.  destroy_serial() is for the usb_serial core and does
not clean up for this type of structure (and is not exported, so it will
not even build properly).  Also, never put a function prototype within a
function like you did here.

So, I'm guessing you didn't try to remove any USB devices after applying
your patch?  :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 2 of 5
  2004-08-03  6:44   ` Greg KH
@ 2004-08-03  7:41     ` Ravikiran G Thirumalai
  0 siblings, 0 replies; 23+ messages in thread
From: Ravikiran G Thirumalai @ 2004-08-03  7:41 UTC (permalink / raw)
  To: Greg KH; +Cc: Andrew Morton, linux-kernel, dipankar, viro

On Mon, Aug 02, 2004 at 11:44:24PM -0700, Greg KH wrote:
> On Mon, Aug 02, 2004 at 03:46:06PM +0530, Ravikiran G Thirumalai wrote:
> > diff -ruN -X dontdiff2 linux-2.6.7/drivers/usb/core/message.c kref-2.6.7/drivers/usb/core/message.c
> > --- linux-2.6.7/drivers/usb/core/message.c	2004-06-16 10:49:02.000000000 +0530
> > +++ kref-2.6.7/drivers/usb/core/message.c	2004-07-20 15:07:24.000000000 +0530
> > @@ -1077,11 +1077,12 @@
> >  
> >  static void release_interface(struct device *dev)
> >  {
> > +	extern void destroy_serial(struct kref *kref);
> >  	struct usb_interface *intf = to_usb_interface(dev);
> >  	struct usb_interface_cache *intfc =
> >  			altsetting_to_usb_interface_cache(intf->altsetting);
> >  
> > -	kref_put(&intfc->ref);
> > +	kref_put(&intfc->ref, destroy_serial);
> >  	kfree(intf);
> >  }
> 
> This is the bug.  destroy_serial() is for the usb_serial core and does
> not clean up for this type of structure (and is not exported, so it will
> not even build properly).  Also, never put a function prototype within a
> function like you did here.
> 
> So, I'm guessing you didn't try to remove any USB devices after applying
> your patch?  :)

No I didn't :).  I just checked the build not the working as such.  
>From what I could infer from the 2.6.7 tree, there is no kref_init()
in core/message.c.  Well I dunno what made me think destroy_serial
was used to init (kref_init) that refcounter at that time so I have 
even exported destroy_serial in the patch so that it builds.  
(expored as in changing static to non static .. I build all drivers 
into the kernel usually and so the build didn't break for me).
I now realise destroy_serial is the wrong release function.  I am no expert 
in usb-drivers area, but I should be more careful than that. 

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-02 16:56 ` [patchset] Lockfree fd lookup 0 " viro
  2004-08-02 20:07   ` David S. Miller
@ 2004-08-03  9:23   ` Ravikiran G Thirumalai
  2004-08-03  9:35     ` Ravikiran G Thirumalai
  2004-08-03 10:06     ` viro
  1 sibling, 2 replies; 23+ messages in thread
From: Ravikiran G Thirumalai @ 2004-08-03  9:23 UTC (permalink / raw)
  To: viro; +Cc: Andrew Morton, linux-kernel, Greg KH, dipankar

On Mon, Aug 02, 2004 at 05:56:07PM +0100, viro@parcelfarce.linux.theplanet.co.uk wrote:
> On Mon, Aug 02, 2004 at 03:40:55PM +0530, Ravikiran G Thirumalai wrote:
> > Here is a patchset to eliminate taking struct files_struct.file_lock on 
> > reader side using rcu and rcu based refcounting.  These patches
> > extend the kref api to include kref_lf_xxx api and kref_lf_get_rcu to
> > do lockfree refcounting, and use the same.  As posted earlier, since fd
> > lookups (struct files_struct.fd[]) will be lock free with these patches, 
> > threaded workloads doing lots of io should see performance benefits 
> > due to this patchset.  I have observed 13-15% improvement with tiobench 
> > on a 4 way xeon with this patchset.
> 
> How about this for comparison?  That's just a dumb "convert to rwlock"
> patch; we can be smarter in e.g. close_on_exec handling, but that's a
> separate story.
> 

I ran tiobench on this patch and here is the comparison:


Kernel		Seqread		Randread	Seqwrite	Randwrite
--------------------------------------------------------------------------
2.6.7		410.33		234.15		254.39		189.36
rwlocks-viro	401.84		232.69		254.09		194.62
refcount (kref)	455.72		281.75		272.87		230.10

All of this was 2.6.7 based.  Nos are througput rates in mega bytes per sec.
Test was on ramfs, 48 threads doing 100 MB of io per thread averaged over
8 runs.  This was on a 4 way PIII xeon. 

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-03  9:23   ` Ravikiran G Thirumalai
@ 2004-08-03  9:35     ` Ravikiran G Thirumalai
  2004-08-03 10:17       ` Dipankar Sarma
  2004-08-03 10:06     ` viro
  1 sibling, 1 reply; 23+ messages in thread
From: Ravikiran G Thirumalai @ 2004-08-03  9:35 UTC (permalink / raw)
  To: viro; +Cc: Andrew Morton, linux-kernel, Greg KH, dipankar

On Tue, Aug 03, 2004 at 02:53:17PM +0530, Ravikiran G Thirumalai wrote:
> > ...
> > How about this for comparison?  That's just a dumb "convert to rwlock"
> > patch; we can be smarter in e.g. close_on_exec handling, but that's a
> > separate story.
> > 
> 
> I ran tiobench on this patch and here is the comparison:
> 
> 
> Kernel		Seqread		Randread	Seqwrite	Randwrite
> --------------------------------------------------------------------------
> 2.6.7		410.33		234.15		254.39		189.36
> rwlocks-viro	401.84		232.69		254.09		194.62
> refcount (kref)	455.72		281.75		272.87		230.10
> 
> All of this was 2.6.7 based.  Nos are througput rates in mega bytes per sec.
> Test was on ramfs, 48 threads doing 100 MB of io per thread averaged over
> 8 runs.  This was on a 4 way PIII xeon. 

Just adding. This was with premption off and here are the oprofile 
profiles for the above runs;

Thanks,
Kiran

1. Vanilla 2.6.7
=================
CPU: PIII, speed 699.726 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
31816    18.7302  __copy_to_user_ll
30421    17.9089  simple_prepare_write
13270     7.8121  __copy_from_user_ll
11514     6.7783  fget_light
8388      4.9380  mark_offset_tsc
5352      3.1507  sysenter_past_esp
5047      2.9712  default_idle
3723      2.1917  do_gettimeofday
2991      1.7608  increment_tail
2884      1.6978  generic_file_aio_write_nolock
2147      1.2639  do_generic_mapping_read
1922      1.1315  current_kernel_time
1800      1.0597  find_vma
1691      0.9955  get_offset_tsc
1683      0.9908  __pagevec_lru_add
1537      0.9048  add_event_entry
1511      0.8895  profile_hook
1413      0.8318  activate_page
1411      0.8307  find_lock_page
1342      0.7900  lookup_dcookie
1318      0.7759  sync_buffer
1289      0.7588  dnotify_parent
1248      0.7347  apic_timer_interrupt
1244      0.7323  find_get_page
1203      0.7082  .text.lock.file_table
1135      0.6682  release_pages
1067      0.6281  __set_page_dirty_buffers
1032      0.6075  scheduler_tick
1027      0.6046  radix_tree_lookup
920       0.5416  generic_file_write

2. rwlocks-2.6.7
=================
CPU: PIII, speed 699.717 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
32276    18.9282  __copy_to_user_ll
29917    17.5448  simple_prepare_write
13168     7.7224  fget_light
13167     7.7218  __copy_from_user_ll
8505      4.9877  mark_offset_tsc
5237      3.0712  sysenter_past_esp
5199      3.0489  default_idle
3697      2.1681  do_gettimeofday
3090      1.8121  generic_file_aio_write_nolock
2739      1.6063  increment_tail
1977      1.1594  current_kernel_time
1956      1.1471  do_generic_mapping_read
1845      1.0820  find_vma
1764      1.0345  get_offset_tsc
1717      1.0069  __pagevec_lru_add
1598      0.9371  add_event_entry
1511      0.8861  profile_hook
1478      0.8668  activate_page
1390      0.8152  sync_buffer
1350      0.7917  find_lock_page
1336      0.7835  lookup_dcookie
1307      0.7665  apic_timer_interrupt
1240      0.7272  release_pages
1239      0.7266  dnotify_parent
1223      0.7172  find_get_page
1173      0.6879  __set_page_dirty_buffers
999       0.5859  scheduler_tick

3. rcu refcounting -2.6.7
=========================
CPU: PIII, speed 699.717 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
31391    19.8978  __copy_to_user_ll
29672    18.8082  simple_prepare_write
13330     8.4495  __copy_from_user_ll
8076      5.1191  mark_offset_tsc
5323      3.3741  sysenter_past_esp
5071      3.2144  default_idle
3682      2.3339  do_gettimeofday
2890      1.8319  increment_tail
2789      1.7679  generic_file_aio_write_nolock
1997      1.2658  do_generic_mapping_read
1993      1.2633  fget_light
1873      1.1872  __pagevec_lru_add
1826      1.1574  find_vma
1808      1.1460  profile_hook
1797      1.1391  current_kernel_time
1738      1.1017  get_offset_tsc
1439      0.9121  add_event_entry
1354      0.8583  sync_buffer
1346      0.8532  lookup_dcookie
1285      0.8145  dnotify_parent
1267      0.8031  apic_timer_interrupt
1260      0.7987  activate_page
1178      0.7467  find_get_page
1173      0.7435  find_lock_page
1125      0.7131  release_pages
1034      0.6554  scheduler_tick
986       0.6250  __set_page_dirty_buffers
939       0.5952  sched_clock


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-03  9:23   ` Ravikiran G Thirumalai
  2004-08-03  9:35     ` Ravikiran G Thirumalai
@ 2004-08-03 10:06     ` viro
  1 sibling, 0 replies; 23+ messages in thread
From: viro @ 2004-08-03 10:06 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: Andrew Morton, linux-kernel, Greg KH, dipankar

On Tue, Aug 03, 2004 at 02:53:17PM +0530, Ravikiran G Thirumalai wrote:
> > How about this for comparison?  That's just a dumb "convert to rwlock"
> > patch; we can be smarter in e.g. close_on_exec handling, but that's a
> > separate story.
> > 
> 
> I ran tiobench on this patch and here is the comparison:
> 
> 
> Kernel		Seqread		Randread	Seqwrite	Randwrite
> --------------------------------------------------------------------------
> 2.6.7		410.33		234.15		254.39		189.36
> rwlocks-viro	401.84		232.69		254.09		194.62
> refcount (kref)	455.72		281.75		272.87		230.10

Thanks.  IOW, we are really seeing cacheline bounces - not contention...

I'm still not sure that in the current form patch is a good idea.  The thing
is, existing checks for ->f_count value are bogus in practically all cases;
IMO we should sort that out before making any decision based on the need for
such checks.  Ditto for uses of fcheck() (open-coded or not) in arch/*.

I agree that some form of "postpone freeing and make fget() lockless" would
make sense, but I'd rather clean the area *before* doing that; afterwards it
will be harder and results of cleanup can affect the patches in non-trivial
way.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patchset] Lockfree fd lookup 0 of 5
  2004-08-03  9:35     ` Ravikiran G Thirumalai
@ 2004-08-03 10:17       ` Dipankar Sarma
  0 siblings, 0 replies; 23+ messages in thread
From: Dipankar Sarma @ 2004-08-03 10:17 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: viro, Andrew Morton, linux-kernel, Greg KH

On Tue, Aug 03, 2004 at 03:05:55PM +0530, Ravikiran G Thirumalai wrote:
> On Tue, Aug 03, 2004 at 02:53:17PM +0530, Ravikiran G Thirumalai wrote:
> > I ran tiobench on this patch and here is the comparison:
> > 
> > 
> > Kernel		Seqread		Randread	Seqwrite	Randwrite
> > --------------------------------------------------------------------------
> > 2.6.7		410.33		234.15		254.39		189.36
> > rwlocks-viro	401.84		232.69		254.09		194.62
> > refcount (kref)	455.72		281.75		272.87		230.10
> > 

Hm...

11514     6.7783  fget_light (vanilla)
13168     7.7224  fget_light (rwlock)
1993      1.2633  fget_light (kref)

Total ticks -

169886  (vanilla)
170520  (rwlock)
157760  (kref)

Of the 12126 ticks that were reduced by kref, 9521 came from 
reduction in fget_light(). So, lock-free fget_light() does help.
Also, it seems the lock contention is not much of an issue -

1203      0.7082  .text.lock.file_table

That explains why rwlock didn't help. I guess we are benefiting
mostly from avoiding the cacheline bouncing and removal of the
lock acquisition.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2004-08-03 10:21 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-08-02 10:10 [patchset] Lockfree fd lookup 0 of 5 Ravikiran G Thirumalai
2004-08-02 10:13 ` [patchset] Lockfree fd lookup 1 " Ravikiran G Thirumalai
2004-08-02 13:38   ` Christoph Hellwig
2004-08-02 16:12     ` viro
2004-08-03  0:44   ` Greg KH
2004-08-02 10:16 ` [patchset] Lockfree fd lookup 2 " Ravikiran G Thirumalai
2004-08-03  0:43   ` Greg KH
2004-08-03  6:28     ` Ravikiran G Thirumalai
2004-08-03  6:44   ` Greg KH
2004-08-03  7:41     ` Ravikiran G Thirumalai
2004-08-02 10:18 ` [patchset] Lockfree fd lookup 0 " Ravikiran G Thirumalai
2004-08-02 10:20 ` [patchset] Lockfree fd lookup 4 " Ravikiran G Thirumalai
2004-08-02 10:23 ` [patchset] Lockfree fd lookup 5 " Ravikiran G Thirumalai
2004-08-02 16:56 ` [patchset] Lockfree fd lookup 0 " viro
2004-08-02 20:07   ` David S. Miller
2004-08-02 21:01     ` William Lee Irwin III
2004-08-02 23:15       ` David S. Miller
2004-08-03  2:04         ` William Lee Irwin III
2004-08-03  9:23   ` Ravikiran G Thirumalai
2004-08-03  9:35     ` Ravikiran G Thirumalai
2004-08-03 10:17       ` Dipankar Sarma
2004-08-03 10:06     ` viro
  -- strict thread matches above, loose matches on Subject: below --
2004-08-02 18:50 Manfred Spraul

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox