public inbox for linux-kernel@vger.kernel.org
* Re: [PATCH] 2.5: push BKL out of llseek
@ 2002-01-30 21:14 Martin Wirth
  0 siblings, 0 replies; 24+ messages in thread
From: Martin Wirth @ 2002-01-30 21:14 UTC (permalink / raw)
  To: linux-kernel

Hi,

This is just a general idea I had a few months ago; it might be of some
value for replacing the BKL or long-held spinlocks in future 2.5
development.

While writing a device driver for real-time data acquisition I had a
similar problem. I had to protect a driver data structure that is
heavily accessed from multiple processes, mostly just to read a few
variables consistently. But from time to time there are bigger tasks to
be done, where holding a spinlock is not appropriate.

So I used a combination of a spinlock and a semaphore. You can lock this
combilock for short-term issues in a spin-lock mode:

       combi_spin_lock(struct combilock *x)
       combi_spin_unlock(struct combilock *x)

and for longer lasting tasks in a semaphore mode by:

       combi_mutex_lock(struct combilock *x)
       combi_mutex_unlock(struct combilock *x)

If a spin-lock request is blocked by a mutex-lock, the spin-lock
attempt also sleeps i.e. behaves like a semaphore.

This approach is less automatic than a first_spin_then_sleep mutex,
but normally the programmer knows better whether he is going to do quick
things or possibly unbounded stuff.

Note: For a preemptible kernel this approach could lead to much less
scheduling ping-pong, even on UP, if a spinlock is replaced by a
combilock instead of a semaphore.


The code is quite simple and borrows a bit from the completion handler
stuff in sched.c. (Of course the owner could be a simple flag, but I had
a later extension to a priority inheritance scheme in mind.)

struct combilock {
       wait_queue_head_t wait;
       task_t            *owner;
};


void combi_spin_lock(struct combilock *x)
{
       spin_lock(&x->wait.lock);
       if (x->owner) {
              DECLARE_WAITQUEUE(wait, current);
              wait.flags |= WQ_FLAG_EXCLUSIVE;
	      __add_wait_queue_tail(&x->wait, &wait);
	      do {
	             __set_current_state(TASK_UNINTERRUPTIBLE);
		     spin_unlock(&x->wait.lock);
		     schedule();
		     spin_lock(&x->wait.lock);
	      } while (x->owner);
	      __remove_wait_queue(&x->wait, &wait);
       }
}


void combi_spin_unlock(struct combilock *x)
{
       spin_unlock(&x->wait.lock);  
}


void combi_mutex_lock(struct combilock *x)
{
       spin_lock(&x->wait.lock);
       if (x->owner) {
              DECLARE_WAITQUEUE(wait, current);
              wait.flags |= WQ_FLAG_EXCLUSIVE;
	      __add_wait_queue_tail(&x->wait, &wait);
	      do {
		     __set_current_state(TASK_UNINTERRUPTIBLE);
		     spin_unlock(&x->wait.lock);
		     schedule();
		     spin_lock(&x->wait.lock);
	      } while (x->owner);
	      __remove_wait_queue(&x->wait, &wait);
       } else 
              x->owner=current;  
       spin_unlock(&x->wait.lock);
}


void combi_mutex_unlock(struct combilock *x)
{
	spin_lock(&x->wait.lock);
	x->owner=NULL;
	__wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE,
			 1, 0);
	spin_unlock(&x->wait.lock);
}


Martin Wirth

* Re: [PATCH] 2.5: push BKL out of llseek
@ 2002-02-01 19:29 John Hawkes
  0 siblings, 0 replies; 24+ messages in thread
From: John Hawkes @ 2002-02-01 19:29 UTC (permalink / raw)
  To: Linux-Kernel Mailing List

From: "Dave Jones" <davej@suse.de>
>  did you benchmark with anything other than dbench ?

I've done substantial AIM7 benchmarking on a 28p ia64 NUMA system, and
llseek's BKL usage is a significant contributor to poor scaling.  For
500 AIM7 "tasks" and ext2 filesystems, waiting on the BKL consumes about
half of the available CPU cycles, and sys_lseek()'s usage is the most
significant cycle waster, followed by ext2_get_block() and
ext2_write_inode().  Anton's llseek patch from last November does make
a measurable improvement in AIM7 throughput.

--
John Hawkes
hawkes@sgi.com



* Re: [PATCH] 2.5: push BKL out of llseek
@ 2002-01-31 15:39 Martin Wirth
  2002-01-31 21:06 ` Nigel Gamble
  0 siblings, 1 reply; 24+ messages in thread
From: Martin Wirth @ 2002-01-31 15:39 UTC (permalink / raw)
  To: linux-kernel

On 30 Jan 2002, Martin Wirth wrote:
>
>void combi_mutex_lock(struct combilock *x)
.....
>       } else <---
>              x->owner=current;  
>       spin_unlock(&x->wait.lock);

Uugh, the else is wrong of course. The owner has to be set in any
case. (I just deleted some debugging code and reformatted a bit too
quickly :))
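
To make the corrected logic concrete, here is a minimal userspace sketch
of the same idea, assuming POSIX threads. It is not kernel code: a
pthread mutex stands in for wait.lock (so "spin" mode does not really
spin), a condition variable stands in for the wait queue, and the owner
is reduced to a simple flag. It also wakes all waiters on unlock, a
simplification compared with the single exclusive wakeup in the posted
code. The field names guard and owned and the COMBILOCK_INIT macro are
made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace analogue of the combilock; field names are hypothetical. */
struct combilock {
	pthread_mutex_t guard;	/* plays the role of wait.lock */
	pthread_cond_t  wait;	/* plays the role of the wait queue */
	bool            owned;	/* true while a mutex-mode holder exists */
};

#define COMBILOCK_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }

/* Short-term mode: wait out any mutex-mode owner, return holding guard. */
void combi_spin_lock(struct combilock *x)
{
	pthread_mutex_lock(&x->guard);
	while (x->owned)
		pthread_cond_wait(&x->wait, &x->guard);
}

void combi_spin_unlock(struct combilock *x)
{
	pthread_mutex_unlock(&x->guard);
}

/* Long-term mode: claim ownership, then drop guard so the holder may block. */
void combi_mutex_lock(struct combilock *x)
{
	pthread_mutex_lock(&x->guard);
	while (x->owned)
		pthread_cond_wait(&x->wait, &x->guard);
	x->owned = true;	/* set in every case, with or without waiting */
	pthread_mutex_unlock(&x->guard);
}

void combi_mutex_unlock(struct combilock *x)
{
	pthread_mutex_lock(&x->guard);
	assert(x->owned);
	x->owned = false;
	/* Assumption: wake everyone, so no waiter is left behind. */
	pthread_cond_broadcast(&x->wait);
	pthread_mutex_unlock(&x->guard);
}
```

A caller would initialise a lock with
struct combilock l = COMBILOCK_INIT; and then pair combi_spin_lock()
with combi_spin_unlock() for quick accesses, or combi_mutex_lock() with
combi_mutex_unlock() around work that may block.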

A further note: Although the combilock shares some advantages with a
spin-lock (no unnecessary scheduling for short-term locking), it may
behave like a semaphore on entry even if you call combi_spin_lock.
For example:

       spin_lock(&slock);
       combi_spin_lock(&clock);

is a BUG because combi_spin_lock may sleep while holding slock!

Would be nice if there were some comments.

Martin Wirth

* [PATCH] 2.5: push BKL out of llseek
@ 2002-01-30  0:00 Robert Love
  2002-01-30  0:09 ` Linus Torvalds
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Robert Love @ 2002-01-30  0:00 UTC (permalink / raw)
  To: torvalds; +Cc: viro, linux-kernel

This patch pushes the BKL out of llseek() and into the individual llseek
methods.  For generic_file_llseek, I replaced it with the inode
semaphore.  The lock contention is noticeable even on 2-way systems. 
Since we simply push the BKL further down the call chain (it's the llseek
method's responsibility now) we aren't doing anything hackish or
unsafe.
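
As a rough illustration of the shape of the change, here is a userspace
toy model, assuming POSIX threads. The dispatcher takes no lock at all,
and the generic method serialises on a per-inode lock instead of a
global one. The toy_* names and structures are made up for this sketch
and are not the kernel's; the origin values 0, 1 and 2 follow the patch
below. Unlike the kernel code, the toy rejects unknown origins outright.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins for the kernel's inode and file structures. */
struct toy_inode {
	pthread_mutex_t sem;	/* plays the role of i_sem */
	long long size;		/* plays the role of i_size */
};

struct toy_file {
	struct toy_inode *inode;
	long long pos;		/* plays the role of f_pos */
	long long (*llseek)(struct toy_file *, long long, int);
};

/* Generic method: takes only the lock of the inode being seeked on. */
long long toy_generic_llseek(struct toy_file *f, long long offset, int origin)
{
	long long retval = -EINVAL;

	assert(f && f->inode);
	pthread_mutex_lock(&f->inode->sem);
	switch (origin) {
	case 2:			/* relative to end of file */
		offset += f->inode->size;
		break;
	case 1:			/* relative to current position */
		offset += f->pos;
		break;
	case 0:			/* absolute */
		break;
	default:		/* toy choice: reject unknown origins */
		goto out;
	}
	if (offset >= 0) {
		f->pos = offset;
		retval = offset;
	}
out:
	pthread_mutex_unlock(&f->inode->sem);
	return retval;
}

/* Dispatcher: no global lock here any more; locking is the method's job. */
long long toy_llseek(struct toy_file *f, long long offset, int origin)
{
	assert(f && f->llseek);
	return f->llseek(f, offset, origin);
}
```

Seeks on different inodes no longer contend with each other in this
model, which is the effect the patch is after for generic_file_llseek.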

I suspect some (Al) may consider this a suboptimal solution, and I
agree.  However it is a first step -- tightening the locks -- toward a
better locking scheme, which is hopefully devoid of the BKL.

The best scores from a slew of dbench runs:

	(2.5.3-pre6 on 2-way Athlon)
	with patch	133.651	165.575	66.9876	37.5297	24.9436
	without patch	132.541	160.774	60.1174	33.2065	22.0126
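
For reading the two rows side by side, the relative gain per column can
be computed as (with - without) / without. The small helper below is
only a sketch for that arithmetic; the function and array names are
made up, and the meaning of the five columns is not stated above, so
they are left unlabeled. The gain comes out under 1% for the first
column and around 13% for the last.

```c
#include <assert.h>
#include <stddef.h>

/* dbench scores copied from the table above, column by column. */
static const double with_patch[]    = { 133.651, 165.575, 66.9876, 37.5297, 24.9436 };
static const double without_patch[] = { 132.541, 160.774, 60.1174, 33.2065, 22.0126 };

/* Relative improvement of `with` over `without`, in percent. */
double gain_percent(double with, double without)
{
	assert(without > 0.0);
	return (with - without) / without * 100.0;
}

/* Fill out[0..n-1] with the per-column gains. */
void all_gains(const double *with, const double *without,
	       double *out, size_t n)
{
	for (size_t i = 0; i < n; i++)
		out[i] = gain_percent(with[i], without[i]);
}
```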

Interestingly, the shorter lock times corresponded to an 8.9% reduction
in scheduling latency (under the above dbench load) with the preemptible
kernel.

	Robert Love

diff -urN linux-2.5.3-pre6/Documentation/filesystems/Locking linux/Documentation/filesystems/Locking
--- linux-2.5.3-pre6/Documentation/filesystems/Locking	Mon Jan 28 18:30:27 2002
+++ linux/Documentation/filesystems/Locking	Tue Jan 29 17:07:37 2002
@@ -219,7 +219,7 @@
 locking rules:
 	All except ->poll() may block.
 		BKL
-llseek:		yes
+llseek:		yes	(see below)
 read:		no
 write:		no
 readdir:	yes	(see below)
@@ -235,6 +235,10 @@
 readv:		no
 writev:		no
 
+->llseek() locking has moved from llseek to the individual llseek
+implementations.  If your fs is not using generic_file_llseek, you
+need to acquire and release the BKL in your ->llseek().
+
 ->open() locking is in-transit: big lock partially moved into the methods.
 The only exception is ->open() in the instances of file_operations that never
 end up in ->i_fop/->proc_fops, i.e. ones that belong to character devices
diff -urN linux-2.5.3-pre6/fs/block_dev.c linux/fs/block_dev.c
--- linux-2.5.3-pre6/fs/block_dev.c	Mon Jan 28 18:30:22 2002
+++ linux/fs/block_dev.c	Tue Jan 29 16:49:52 2002
@@ -170,6 +170,8 @@
 	loff_t size = file->f_dentry->d_inode->i_bdev->bd_inode->i_size;
 	loff_t retval;
 
+	lock_kernel();
+
 	switch (origin) {
 		case 2:
 			offset += size;
@@ -186,6 +188,7 @@
 		}
 		retval = offset;
 	}
+	unlock_kernel();
 	return retval;
 }
 	
diff -urN linux-2.5.3-pre6/fs/hfs/file_cap.c linux/fs/hfs/file_cap.c
--- linux-2.5.3-pre6/fs/hfs/file_cap.c	Mon Jan 28 18:30:22 2002
+++ linux/fs/hfs/file_cap.c	Tue Jan 29 16:49:52 2002
@@ -91,6 +91,8 @@
 {
 	long long retval;
 
+	lock_kernel();
+
 	switch (origin) {
 		case 2:
 			offset += file->f_dentry->d_inode->i_size;
@@ -106,6 +108,7 @@
 		}
 		retval = offset;
 	}
+	unlock_kernel();
 	return retval;
 }
 
diff -urN linux-2.5.3-pre6/fs/hfs/file_hdr.c linux/fs/hfs/file_hdr.c
--- linux-2.5.3-pre6/fs/hfs/file_hdr.c	Mon Jan 28 18:30:22 2002
+++ linux/fs/hfs/file_hdr.c	Tue Jan 29 16:49:52 2002
@@ -347,6 +347,8 @@
 {
 	long long retval;
 
+	lock_kernel();
+
 	switch (origin) {
 		case 2:
 			offset += file->f_dentry->d_inode->i_size;
@@ -362,6 +364,7 @@
 		}
 		retval = offset;
 	}
+	unlock_kernel();
 	return retval;
 }
 
diff -urN linux-2.5.3-pre6/fs/hpfs/dir.c linux/fs/hpfs/dir.c
--- linux-2.5.3-pre6/fs/hpfs/dir.c	Mon Jan 28 18:30:22 2002
+++ linux/fs/hpfs/dir.c	Tue Jan 29 16:49:52 2002
@@ -29,6 +29,9 @@
 	struct inode *i = filp->f_dentry->d_inode;
 	struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
 	struct super_block *s = i->i_sb;
+
+	lock_kernel();
+
 	/*printk("dir lseek\n");*/
 	if (new_off == 0 || new_off == 1 || new_off == 11 || new_off == 12 || new_off == 13) goto ok;
 	hpfs_lock_inode(i);
@@ -40,10 +43,12 @@
 	}
 	hpfs_unlock_inode(i);
 	ok:
+	unlock_kernel();
 	return filp->f_pos = new_off;
 	fail:
 	hpfs_unlock_inode(i);
 	/*printk("illegal lseek: %016llx\n", new_off);*/
+	unlock_kernel();
 	return -ESPIPE;
 }
 
diff -urN linux-2.5.3-pre6/fs/proc/generic.c linux/fs/proc/generic.c
--- linux-2.5.3-pre6/fs/proc/generic.c	Mon Jan 28 18:30:22 2002
+++ linux/fs/proc/generic.c	Tue Jan 29 16:49:52 2002
@@ -16,6 +16,7 @@
 #include <linux/stat.h>
 #define __NO_VERSION__
 #include <linux/module.h>
+#include <linux/smp_lock.h>
 #include <asm/bitops.h>
 
 static ssize_t proc_file_read(struct file * file, char * buf,
@@ -140,22 +141,30 @@
 static loff_t
 proc_file_lseek(struct file * file, loff_t offset, int orig)
 {
+    lock_kernel();
+
     switch (orig) {
     case 0:
 	if (offset < 0)
-	    return -EINVAL;    
+	    goto out;
 	file->f_pos = offset;
+	unlock_kernel();
 	return(file->f_pos);
     case 1:
 	if (offset + file->f_pos < 0)
-	    return -EINVAL;    
+	    goto out;
 	file->f_pos += offset;
+	unlock_kernel();
 	return(file->f_pos);
     case 2:
-	return(-EINVAL);
+	goto out;
     default:
-	return(-EINVAL);
+	goto out;
     }
+
+out:
+    unlock_kernel();
+    return -EINVAL;
 }
 
 /*
diff -urN linux-2.5.3-pre6/fs/read_write.c linux/fs/read_write.c
--- linux-2.5.3-pre6/fs/read_write.c	Mon Jan 28 18:30:22 2002
+++ linux/fs/read_write.c	Tue Jan 29 16:49:52 2002
@@ -29,6 +29,8 @@
 {
 	long long retval;
 
+	down(&file->f_dentry->d_inode->i_sem);
+
 	switch (origin) {
 		case 2:
 			offset += file->f_dentry->d_inode->i_size;
@@ -45,6 +47,7 @@
 		}
 		retval = offset;
 	}
+	up(&file->f_dentry->d_inode->i_sem);
 	return retval;
 }
 
@@ -57,6 +60,8 @@
 {
 	long long retval;
 
+	lock_kernel();
+
 	switch (origin) {
 		case 2:
 			offset += file->f_dentry->d_inode->i_size;
@@ -73,6 +78,7 @@
 		}
 		retval = offset;
 	}
+	unlock_kernel();
 	return retval;
 }
 
@@ -84,9 +90,7 @@
 	fn = default_llseek;
 	if (file->f_op && file->f_op->llseek)
 		fn = file->f_op->llseek;
-	lock_kernel();
 	retval = fn(file, offset, origin);
-	unlock_kernel();
 	return retval;
 }
 
diff -urN linux-2.5.3-pre6/fs/ufs/file.c linux/fs/ufs/file.c
--- linux-2.5.3-pre6/fs/ufs/file.c	Mon Jan 28 18:30:22 2002
+++ linux/fs/ufs/file.c	Tue Jan 29 16:49:52 2002
@@ -47,6 +47,8 @@
 	long long retval;
 	struct inode *inode = file->f_dentry->d_inode;
 
+	lock_kernel();
+
 	switch (origin) {
 		case 2:
 			offset += inode->i_size;
@@ -64,6 +66,7 @@
 		}
 		retval = offset;
 	}
+	unlock_kernel();
 	return retval;
 }



Thread overview: 24+ messages
2002-01-30 21:14 [PATCH] 2.5: push BKL out of llseek Martin Wirth
  -- strict thread matches above, loose matches on Subject: below --
2002-02-01 19:29 John Hawkes
2002-01-31 15:39 Martin Wirth
2002-01-31 21:06 ` Nigel Gamble
2002-01-30  0:00 Robert Love
2002-01-30  0:09 ` Linus Torvalds
2002-01-30  0:41   ` Robert Love
2002-01-30  0:52     ` Linus Torvalds
2002-01-30  2:24       ` Robert Love
2002-01-30  1:26     ` Andrew Morton
2002-01-30  2:16       ` Linus Torvalds
2002-01-30  2:20       ` Robert Love
2002-01-30  2:20         ` Andrew Morton
2002-01-30  2:21         ` Dave Jones
2002-01-30  2:37           ` Robert Love
2002-01-30  2:50         ` Nigel Gamble
2002-01-30  3:19           ` Andrew Morton
2002-01-30  9:34             ` Nigel Gamble
2002-01-30 10:36         ` Russell King
2002-01-30  4:54   ` Alexander Viro
2002-01-30  8:00     ` Trond Myklebust
2002-01-30 13:39       ` Robert Love
2002-01-30  4:50 ` Anton Blanchard
2002-01-30  5:03 ` Robert Love
