* Re: AIM7 40% regression with 2.6.26-rc1
[not found] ` <20080506114449.GC32591@elte.hu>
@ 2008-05-06 12:09 ` Matthew Wilcox
2008-05-06 16:23 ` Matthew Wilcox
0 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-06 12:09 UTC (permalink / raw)
To: Ingo Molnar
Cc: Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds,
Andrew Morton, linux-fsdevel
On Tue, May 06, 2008 at 01:44:49PM +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> > After I manually reverted the patch against 2.6.26-rc1 while fixing
> > lots of conflicts/errors, aim7 regression became less than 2%.
>
> hm, which exact semaphore would that be due to?
>
> My first blind guess would be the BKL - there's not much other semaphore
> use left in the core kernel otherwise that would affect AIM7 normally.
> The VFS still makes frequent use of the BKL and AIM7 is very VFS
> intense. Getting rid of that BKL use from the VFS might be useful to
> performance anyway.
That's slightly slanderous to the VFS ;-) The BKL really isn't used
that much any more. So little that I've gone through and produced a
list of places it's used:
fs/block_dev.c opening and closing a block device. Unlikely to be
provoked by AIM7.
fs/char_dev.c chrdev_open. Unlikely to be provoked by AIM7.
fs/compat.c mount. Unlikely to be provoked by AIM7.
fs/compat_ioctl.c held around calls to ioctl translator.
fs/exec.c coredump. If this is a contention problem ...
fs/fcntl.c held around call to ->fasync.
fs/ioctl.c held around f_op ->ioctl call (tmpfs doesn't have
ioctl). ditto bmap. there's fasync, as previously
mentioned.
fs/locks.c hellhole. I hope AIM7 doesn't use locks.
fs/namespace.c mount, umount. Unlikely to be provoked by AIM7.
fs/read_write.c llseek. tmpfs uses the unlocked version.
fs/super.c shutdown, remount. Unlikely to be provoked by AIM7.
So the only likely things I can see are:
- file locks
- fasync
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 12:09 ` AIM7 40% regression with 2.6.26-rc1 Matthew Wilcox
@ 2008-05-06 16:23 ` Matthew Wilcox
2008-05-06 16:36 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-06 16:23 UTC (permalink / raw)
To: Ingo Molnar, J. Bruce Fields
Cc: Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds,
Andrew Morton, linux-fsdevel
On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> So the only likely things I can see are:
>
> - file locks
> - fasync
I've wanted to fix file locks for a while. Here's a first attempt.
It was done quickly, so I concede that it may well have bugs in it.
I found (and fixed) one with LTP.
It takes *no account* of nfsd, nor remote filesystems. We need to have
a serious discussion about their requirements.
diff --git a/fs/locks.c b/fs/locks.c
index 663c069..cb09765 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -140,6 +140,8 @@ int lease_break_time = 45;
#define for_each_lock(inode, lockp) \
for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
+static DEFINE_SPINLOCK(file_lock_lock);
+
static LIST_HEAD(file_lock_list);
static LIST_HEAD(blocked_list);
@@ -510,9 +512,9 @@ static void __locks_delete_block(struct file_lock *waiter)
*/
static void locks_delete_block(struct file_lock *waiter)
{
- lock_kernel();
+ spin_lock(&file_lock_lock);
__locks_delete_block(waiter);
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
}
/* Insert waiter into blocker's block list.
@@ -649,7 +651,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
{
struct file_lock *cfl;
- lock_kernel();
+ spin_lock(&file_lock_lock);
for (cfl = filp->f_path.dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) {
if (!IS_POSIX(cfl))
continue;
@@ -662,7 +664,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
fl->fl_pid = pid_vnr(cfl->fl_nspid);
} else
fl->fl_type = F_UNLCK;
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return;
}
EXPORT_SYMBOL(posix_test_lock);
@@ -735,18 +737,21 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
int error = 0;
int found = 0;
- lock_kernel();
- if (request->fl_flags & FL_ACCESS)
+ if (request->fl_flags & FL_ACCESS) {
+ spin_lock(&file_lock_lock);
goto find_conflict;
+ }
if (request->fl_type != F_UNLCK) {
error = -ENOMEM;
+
new_fl = locks_alloc_lock();
if (new_fl == NULL)
- goto out;
+ goto out_unlocked;
error = 0;
}
+ spin_lock(&file_lock_lock);
for_each_lock(inode, before) {
struct file_lock *fl = *before;
if (IS_POSIX(fl))
@@ -772,10 +777,13 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
* If a higher-priority process was blocked on the old file lock,
* give it the opportunity to lock the file.
*/
- if (found)
+ if (found) {
+ spin_unlock(&file_lock_lock);
cond_resched();
+ spin_lock(&file_lock_lock);
+ }
-find_conflict:
+ find_conflict:
for_each_lock(inode, before) {
struct file_lock *fl = *before;
if (IS_POSIX(fl))
@@ -796,8 +804,9 @@ find_conflict:
new_fl = NULL;
error = 0;
-out:
- unlock_kernel();
+ out:
+ spin_unlock(&file_lock_lock);
+ out_unlocked:
if (new_fl)
locks_free_lock(new_fl);
return error;
@@ -826,7 +835,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
new_fl2 = locks_alloc_lock();
}
- lock_kernel();
+ spin_lock(&file_lock_lock);
if (request->fl_type != F_UNLCK) {
for_each_lock(inode, before) {
fl = *before;
@@ -994,7 +1003,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
locks_wake_up_blocks(left);
}
out:
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
/*
* Free any unused locks.
*/
@@ -1069,14 +1078,14 @@ int locks_mandatory_locked(struct inode *inode)
/*
* Search the lock list for this inode for any POSIX locks.
*/
- lock_kernel();
+ spin_lock(&file_lock_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (!IS_POSIX(fl))
continue;
if (fl->fl_owner != owner)
break;
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return fl ? -EAGAIN : 0;
}
@@ -1190,7 +1199,7 @@ int __break_lease(struct inode *inode, unsigned int mode)
new_fl = lease_alloc(NULL, mode & FMODE_WRITE ? F_WRLCK : F_RDLCK);
- lock_kernel();
+ spin_lock(&file_lock_lock);
time_out_leases(inode);
@@ -1251,8 +1260,10 @@ restart:
break_time++;
}
locks_insert_block(flock, new_fl);
+ spin_unlock(&file_lock_lock);
error = wait_event_interruptible_timeout(new_fl->fl_wait,
!new_fl->fl_next, break_time);
+ spin_lock(&file_lock_lock);
__locks_delete_block(new_fl);
if (error >= 0) {
if (error == 0)
@@ -1266,8 +1277,8 @@ restart:
error = 0;
}
-out:
- unlock_kernel();
+ out:
+ spin_unlock(&file_lock_lock);
if (!IS_ERR(new_fl))
locks_free_lock(new_fl);
return error;
@@ -1323,7 +1334,7 @@ int fcntl_getlease(struct file *filp)
struct file_lock *fl;
int type = F_UNLCK;
- lock_kernel();
+ spin_lock(&file_lock_lock);
time_out_leases(filp->f_path.dentry->d_inode);
for (fl = filp->f_path.dentry->d_inode->i_flock; fl && IS_LEASE(fl);
fl = fl->fl_next) {
@@ -1332,7 +1343,7 @@ int fcntl_getlease(struct file *filp)
break;
}
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return type;
}
@@ -1363,6 +1374,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
if (error)
return error;
+ spin_lock(&file_lock_lock);
time_out_leases(inode);
BUG_ON(!(*flp)->fl_lmops->fl_break);
@@ -1370,10 +1382,11 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
lease = *flp;
if (arg != F_UNLCK) {
+ spin_unlock(&file_lock_lock);
error = -ENOMEM;
new_fl = locks_alloc_lock();
if (new_fl == NULL)
- goto out;
+ goto out_unlocked;
error = -EAGAIN;
if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
@@ -1382,6 +1395,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
&& ((atomic_read(&dentry->d_count) > 1)
|| (atomic_read(&inode->i_count) > 1)))
goto out;
+ spin_lock(&file_lock_lock);
}
/*
@@ -1429,11 +1443,14 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
locks_copy_lock(new_fl, lease);
locks_insert_lock(before, new_fl);
+ spin_unlock(&file_lock_lock);
*flp = new_fl;
return 0;
-out:
+ out:
+ spin_unlock(&file_lock_lock);
+ out_unlocked:
if (new_fl != NULL)
locks_free_lock(new_fl);
return error;
@@ -1471,12 +1488,10 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
{
int error;
- lock_kernel();
if (filp->f_op && filp->f_op->setlease)
error = filp->f_op->setlease(filp, arg, lease);
else
error = generic_setlease(filp, arg, lease);
- unlock_kernel();
return error;
}
@@ -1503,12 +1518,11 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
if (error)
return error;
- lock_kernel();
-
error = vfs_setlease(filp, arg, &flp);
if (error || arg == F_UNLCK)
- goto out_unlock;
+ return error;
+ lock_kernel();
error = fasync_helper(fd, filp, 1, &flp->fl_fasync);
if (error < 0) {
/* remove lease just inserted by setlease */
@@ -1519,7 +1533,7 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
}
error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
-out_unlock:
+ out_unlock:
unlock_kernel();
return error;
}
@@ -2024,7 +2038,7 @@ void locks_remove_flock(struct file *filp)
fl.fl_ops->fl_release_private(&fl);
}
- lock_kernel();
+ spin_lock(&file_lock_lock);
before = &inode->i_flock;
while ((fl = *before) != NULL) {
@@ -2042,7 +2056,7 @@ void locks_remove_flock(struct file *filp)
}
before = &fl->fl_next;
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
}
/**
@@ -2057,12 +2071,12 @@ posix_unblock_lock(struct file *filp, struct file_lock *waiter)
{
int status = 0;
- lock_kernel();
+ spin_lock(&file_lock_lock);
if (waiter->fl_next)
__locks_delete_block(waiter);
else
status = -ENOENT;
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return status;
}
@@ -2175,7 +2189,7 @@ static int locks_show(struct seq_file *f, void *v)
static void *locks_start(struct seq_file *f, loff_t *pos)
{
- lock_kernel();
+ spin_lock(&file_lock_lock);
f->private = (void *)1;
return seq_list_start(&file_lock_list, *pos);
}
@@ -2187,7 +2201,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
static void locks_stop(struct seq_file *f, void *v)
{
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
}
struct seq_operations locks_seq_operations = {
@@ -2215,7 +2229,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
{
struct file_lock *fl;
int result = 1;
- lock_kernel();
+ spin_lock(&file_lock_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (IS_POSIX(fl)) {
if (fl->fl_type == F_RDLCK)
@@ -2232,7 +2246,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
result = 0;
break;
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return result;
}
@@ -2255,7 +2269,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
{
struct file_lock *fl;
int result = 1;
- lock_kernel();
+ spin_lock(&file_lock_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (IS_POSIX(fl)) {
if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
@@ -2270,7 +2284,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
result = 0;
break;
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return result;
}
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:23 ` Matthew Wilcox
@ 2008-05-06 16:36 ` Linus Torvalds
2008-05-06 16:42 ` Matthew Wilcox
2008-05-06 16:44 ` J. Bruce Fields
2008-05-06 17:21 ` Andrew Morton
2008-05-08 3:24 ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
2 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-06 16:36 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro,
Andrew Morton, linux-fsdevel
On Tue, 6 May 2008, Matthew Wilcox wrote:
>
> I've wanted to fix file locks for a while. Here's a first attempt.
> It was done quickly, so I concede that it may well have bugs in it.
> I found (and fixed) one with LTP.
Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is
there some user that doesn't have the inode info, or does anything that
might cross inode boundaries?
This does seem to drop all locking around the "setlease()" calls down to
the filesystem, which worries me. That said, we clearly do need to do
this. Probably should have done it a long time ago.
Also, why do people do this:
> -find_conflict:
> + find_conflict:
Hmm?
Linus
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:42 ` Matthew Wilcox
@ 2008-05-06 16:39 ` Alan Cox
2008-05-06 16:51 ` Matthew Wilcox
2008-05-06 20:28 ` Linus Torvalds
1 sibling, 1 reply; 43+ messages in thread
From: Alan Cox @ 2008-05-06 16:39 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, Andrew Morton, linux-fsdevel
> > Hmm?
>
> So that find_conflict doesn't end up in the first column, which causes
> diff to treat it as a function name for the purposes of the @@ lines.
Please can we just fix the tools, not mangle the kernel to work around
silly bugs?
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:36 ` Linus Torvalds
@ 2008-05-06 16:42 ` Matthew Wilcox
2008-05-06 16:39 ` Alan Cox
2008-05-06 20:28 ` Linus Torvalds
2008-05-06 16:44 ` J. Bruce Fields
1 sibling, 2 replies; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-06 16:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro,
Andrew Morton, linux-fsdevel
On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote:
> On Tue, 6 May 2008, Matthew Wilcox wrote:
> > I've wanted to fix file locks for a while. Here's a first attempt.
> > It was done quickly, so I concede that it may well have bugs in it.
> > I found (and fixed) one with LTP.
>
> Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is
> there some user that doesn't have the inode info, or does anything that
> might cross inode boundaries?
/proc/locks and deadlock detection both cross inode boundaries (and even
filesystem boundaries). The BKL-removal brigade tried this back in 2.4
and the locking ended up scaling worse than just plonking a single
spinlock around the whole thing.
> This does seem to drop all locking around the "setlease()" calls down to
> the filesystem, which worries me. That said, we clearly do need to do
> this. Probably should have done it a long time ago.
The only filesystems that are going to have their own setlease methods
will be remote ones (nfs, smbfs, etc). They're going to need to sleep
while the server responds to them. So holding a spinlock while we call
them is impolite at best.
> Also, why do people do this:
>
> > -find_conflict:
> > + find_conflict:
>
> Hmm?
So that find_conflict doesn't end up in the first column, which causes
diff to treat it as a function name for the purposes of the @@ lines.
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:36 ` Linus Torvalds
2008-05-06 16:42 ` Matthew Wilcox
@ 2008-05-06 16:44 ` J. Bruce Fields
1 sibling, 0 replies; 43+ messages in thread
From: J. Bruce Fields @ 2008-05-06 16:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Ingo Molnar, Zhang, Yanmin, LKML, Alexander Viro,
Andrew Morton, linux-fsdevel, richterd
On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 6 May 2008, Matthew Wilcox wrote:
> >
> > I've wanted to fix file locks for a while. Here's a first attempt.
> > It was done quickly, so I concede that it may well have bugs in it.
> > I found (and fixed) one with LTP.
>
> Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is
> there some user that doesn't have the inode info, or does anything that
> might cross inode boundaries?
The deadlock detection crosses inode boundaries.
--b.
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:51 ` Matthew Wilcox
@ 2008-05-06 16:45 ` Alan Cox
2008-05-06 17:42 ` Linus Torvalds
1 sibling, 0 replies; 43+ messages in thread
From: Alan Cox @ 2008-05-06 16:45 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, Andrew Morton, linux-fsdevel
On Tue, 6 May 2008 10:51:12 -0600
Matthew Wilcox <matthew@wil.cx> wrote:
> On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote:
> > > > Hmm?
> > >
> > > So that find_conflict doesn't end up in the first column, which causes
> > > diff to treat it as a function name for the purposes of the @@ lines.
> >
> > Please can we just fix the tools, not mangle the kernel to work around
> > silly bugs?
>
> The people who control the tools refuse to fix them.
That would be their problem. We should refuse to mash the kernel up
because they aren't doing their job. If need be someone can fork a
private git-patch.
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:39 ` Alan Cox
@ 2008-05-06 16:51 ` Matthew Wilcox
2008-05-06 16:45 ` Alan Cox
2008-05-06 17:42 ` Linus Torvalds
0 siblings, 2 replies; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-06 16:51 UTC (permalink / raw)
To: Alan Cox
Cc: Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, Andrew Morton, linux-fsdevel
On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote:
> > > Hmm?
> >
> > So that find_conflict doesn't end up in the first column, which causes
> > diff to treat it as a function name for the purposes of the @@ lines.
>
> Please can we just fix the tools, not mangle the kernel to work around
> silly bugs?
The people who control the tools refuse to fix them.
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:23 ` Matthew Wilcox
2008-05-06 16:36 ` Linus Torvalds
@ 2008-05-06 17:21 ` Andrew Morton
2008-05-06 17:31 ` Matthew Wilcox
` (3 more replies)
2008-05-08 3:24 ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
2 siblings, 4 replies; 43+ messages in thread
From: Andrew Morton @ 2008-05-06 17:21 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro,
Linus Torvalds, linux-fsdevel
On Tue, 6 May 2008 10:23:32 -0600 Matthew Wilcox <matthew@wil.cx> wrote:
> On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > So the only likely things I can see are:
> >
> > - file locks
> > - fasync
>
> I've wanted to fix file locks for a while. Here's a first attempt.
Do we actually know that the locks code is implicated in this regression?
I'd initially thought "lseek" but afaict tmpfs doesn't hit default_llseek()
or remote_llseek().
tmpfs tends to do weird stuff - it would be interesting to know if the
regression is also present on ramfs or ext2/ext3/xfs/etc.
It would be interesting to see if the context switch rate has increased.
Finally: how come we regressed by swapping the semaphore implementation
anyway? We went from one sleeping lock implementation to another - I'd
have expected performance to be pretty much the same.
<looks at the implementation>
down(), down_interruptible() and down_try() should use spin_lock_irq(), not
irqsave.
up() seems to be doing wake-one, FIFO which is nice. Did the
implementation which we just removed also do that? Was it perhaps
accidentally doing LIFO or something like that?
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 17:21 ` Andrew Morton
@ 2008-05-06 17:31 ` Matthew Wilcox
2008-05-06 17:49 ` Ingo Molnar
2008-05-06 17:39 ` Ingo Molnar
` (2 subsequent siblings)
3 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-06 17:31 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro,
Linus Torvalds, linux-fsdevel
On Tue, May 06, 2008 at 10:21:53AM -0700, Andrew Morton wrote:
> Do we actually know that the locks code is implicated in this regression?
Not yet. We don't even know it's the BKL. It's just my best guess.
We're waiting for the original reporter to run some tests Ingo pointed
him at.
> I'd initially thought "lseek" but afaict tmpfs doesn't hit default_llseek()
> or remote_llseek().
Correct.
> Finally: how come we regressed by swapping the semaphore implementation
> anyway? We went from one sleeping lock implementation to another - I'd
> have expected performance to be pretty much the same.
>
> <looks at the implementation>
>
> down(), down_interruptible() and down_try() should use spin_lock_irq(), not
> irqsave.
We talked about this ... the BKL actually requires that you be able to
acquire it with interrupts disabled. Maybe we should make lock_kernel
do this:
	if (likely(!depth)) {
		unsigned long flags;

		local_save_flags(flags);
		down(&kernel_sem);
		local_irq_restore(flags);
	}
But tweaking down() is not worth it -- we should be eliminating users of
both the BKL and semaphores instead.
> up() seems to be doing wake-one, FIFO which is nice. Did the
> implementation which we just removed also do that? Was it perhaps
> accidentally doing LIFO or something like that?
That's a question for someone who knows x86 assembler, I think.
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 17:21 ` Andrew Morton
2008-05-06 17:31 ` Matthew Wilcox
@ 2008-05-06 17:39 ` Ingo Molnar
2008-05-07 6:49 ` Zhang, Yanmin
2008-05-06 17:45 ` Linus Torvalds
2008-05-07 16:38 ` Matthew Wilcox
3 siblings, 1 reply; 43+ messages in thread
From: Ingo Molnar @ 2008-05-06 17:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Matthew Wilcox, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, Linus Torvalds, linux-fsdevel
* Andrew Morton <akpm@linux-foundation.org> wrote:
> Finally: how come we regressed by swapping the semaphore
> implementation anyway? We went from one sleeping lock implementation
> to another - I'd have expected performance to be pretty much the same.
>
> <looks at the implementation>
>
> down(), down_interruptible() and down_try() should use
> spin_lock_irq(), not irqsave.
>
> up() seems to be doing wake-one, FIFO which is nice. Did the
> implementation which we just removed also do that? Was it perhaps
> accidentally doing LIFO or something like that?
i just checked the old implementation on x86. It used
lib/semaphore-sleepers.c which does one weird thing:
- __down() when it returns wakes up yet another task via
wake_up_locked().
i.e. we'll always keep yet another task in flight. This can mask wakeup
latencies especially when it takes time.
The patch (hack) below tries to emulate this weirdness - it 'kicks'
another task as well and keeps it busy. Most of the time this just
causes extra scheduling, but if AIM7 is _just_ saturating the number of
CPUs, it might make a difference. Yanmin, does the patch below make any
difference to the AIM7 results?
( it would be useful data to get a meaningful context switch trace from
the whole regressed workload, and compare it to a context switch trace
with the revert added. )
Ingo
---
kernel/semaphore.c | 10 ++++++++++
1 file changed, 10 insertions(+)
Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -261,4 +261,14 @@ static noinline void __sched __up(struct
list_del(&waiter->list);
waiter->up = 1;
wake_up_process(waiter->task);
+
+ if (likely(list_empty(&sem->wait_list)))
+ return;
+ /*
+ * Opportunistically wake up another task as well but do not
+ * remove it from the list:
+ */
+ waiter = list_first_entry(&sem->wait_list,
+ struct semaphore_waiter, list);
+ wake_up_process(waiter->task);
}
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:51 ` Matthew Wilcox
2008-05-06 16:45 ` Alan Cox
@ 2008-05-06 17:42 ` Linus Torvalds
1 sibling, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-06 17:42 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Alan Cox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, Andrew Morton, linux-fsdevel
On Tue, 6 May 2008, Matthew Wilcox wrote:
> On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote:
> > > > Hmm?
> > >
> > > So that find_conflict doesn't end up in the first column, which causes
> > > diff to treat it as a function name for the purposes of the @@ lines.
> >
> > Please can we just fix the tools, not mangle the kernel to work around
> > silly bugs?
>
> The people who control the tools refuse to fix them.
That's just plain bollocks.
First off, that "@@" line thing in diffs is not important enough to screw
up the source code for. Ever. It's just a small hint to make it somewhat
easier to see more context for humans.
Second, it's not even that bad to show the last label there, rather than
the function name.
Third, you seem to be a git user, so if you actually really care that much
about the @@ line, then git actually lets you set your very own pattern
for those things.
In fact, you can even do it on a per-file basis based on things like
filename rules (ie you can have different patterns for what to trigger on
for a *.c file and for a *.S file, since in a *.S file the 'name:' thing
_is_ the right pattern).
So not only are you making idiotic changes just for irrelevant tool usage,
you're also apparently lying about people "refusing to fix" things as an
excuse.
You can play with it. It's documented in gitattributes (see "Defining a
custom hunk header"), and the default one is just the same one that GNU
diff uses for "-p". I think.
You can add something like this to your ~/.gitconfig:
[diff "default"]
funcname=^[a-zA-Z_$].*(.*$
to only trigger the funcname pattern on a line that starts with a valid C
identifier thing, and contains a '('.
And you can just override the default like the above (that way you don't
have to specify attributes), but if you want to do things differently for
*.c files than from *.S files, you can edit your .git/info/attributes file
and make it contain something like
*.S diff=assembler
*.c diff=C
and now you can make your ~/.gitconfig actually show them differently, ie
something like
[diff "C"]
funcname=^[a-zA-Z_$].*(.*$
[diff "assembler"]
funcname=^[a-zA-Z_$].*:
etc.
Of course, there is a real cost to this, but it's cheap enough in practice
that you'll never notice.
Linus
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 17:21 ` Andrew Morton
2008-05-06 17:31 ` Matthew Wilcox
2008-05-06 17:39 ` Ingo Molnar
@ 2008-05-06 17:45 ` Linus Torvalds
2008-05-07 16:38 ` Matthew Wilcox
3 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-06 17:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Tue, 6 May 2008, Andrew Morton wrote:
>
> down(), down_interruptible() and down_try() should use spin_lock_irq(), not
> irqsave.
down_trylock() is used in atomic code. See for example kernel/printk.c. So
no, that one needs to be irqsafe.
Linus
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 17:31 ` Matthew Wilcox
@ 2008-05-06 17:49 ` Ingo Molnar
2008-05-06 18:07 ` Andrew Morton
0 siblings, 1 reply; 43+ messages in thread
From: Ingo Molnar @ 2008-05-06 17:49 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, Linus Torvalds, linux-fsdevel
* Matthew Wilcox <matthew@wil.cx> wrote:
> > down(), down_interruptible() and down_try() should use
> > spin_lock_irq(), not irqsave.
>
> We talked about this ... the BKL actually requires that you be able to
> acquire it with interrupts disabled. [...]
hm, where does it require it, besides the early bootup code? (which
should just be fixed)
down_trylock() is OK as irqsave/irqrestore for legacy reasons, but that
is fundamentally atomic anyway.
> > up() seems to be doing wake-one, FIFO which is nice. Did the
> > implementation which we just removed also do that? Was it perhaps
> > accidentally doing LIFO or something like that?
>
> That's a question for someone who knows x86 assembler, I think.
the assembly is mostly just for the fastpath - and a 40% regression
cannot be about fastpath differences. In the old code the scheduling
happens in lib/semaphore-sleepers.c, and from the looks of it it appears
to be a proper FIFO as well. (plus this small wakeup weirdness it has)
i reviewed the new code in kernel/semaphore.c as well and can see
nothing bad in it - it does proper wake-up, FIFO queueing, like the
mutex code.
Ingo
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 17:49 ` Ingo Molnar
@ 2008-05-06 18:07 ` Andrew Morton
2008-05-11 11:11 ` Matthew Wilcox
0 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2008-05-06 18:07 UTC (permalink / raw)
To: Ingo Molnar
Cc: matthew, bfields, yanmin_zhang, linux-kernel, viro, torvalds,
linux-fsdevel
On Tue, 6 May 2008 19:49:54 +0200
Ingo Molnar <mingo@elte.hu> wrote:
>
> * Matthew Wilcox <matthew@wil.cx> wrote:
>
> > > down(), down_interruptible() and down_try() should use
> > > spin_lock_irq(), not irqsave.
> >
> > We talked about this ... the BKL actually requires that you be able to
> > acquire it with interrupts disabled. [...]
>
> hm, where does it require it, besides the early bootup code? (which
> should just be fixed)
Yeah, the early bootup code. The kernel does accidental lock_kernel()s in
various places and if that re-enables interrupts then powerpc goeth crunch.
Matthew, that seemingly-unneeded irqsave in kernel/semaphore.c is a prime site
for /* one of these things */, no?
> down_trylock() is OK as irqsave/irqrestore for legacy reasons, but that
> is fundamentally atomic anyway.
yes, trylock should be made irq-safe.
> > > up() seems to be doing wake-one, FIFO which is nice. Did the
> > > implementation which we just removed also do that? Was it perhaps
> > > accidentally doing LIFO or something like that?
> >
> > That's a question for someone who knows x86 assembler, I think.
>
> the assembly is mostly just for the fastpath - and a 40% regression
> cannot be about fastpath differences. In the old code the scheduling
> happens in lib/semaphore-sleepers.c, and from the looks of it it appears
> to be a proper FIFO as well. (plus this small wakeup weirdness it has)
>
> i reviewed the new code in kernel/semaphore.c as well and can see
> nothing bad in it - it does proper wake-up, FIFO queueing, like the
> mutex code.
>
There's the weird wakeup in down() which I understood for about five
minutes five years ago. Perhaps that accidentally sped something up.
Oh well, more investigation needed..
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:42 ` Matthew Wilcox
2008-05-06 16:39 ` Alan Cox
@ 2008-05-06 20:28 ` Linus Torvalds
1 sibling, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-06 20:28 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro,
Andrew Morton, linux-fsdevel
On Tue, 6 May 2008, Matthew Wilcox wrote:
> On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote:
> >
> > Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is
> > there some user that doesn't have the inode info, or does anything that
> > might cross inode boundaries?
>
> /proc/locks and deadlock detection both cross inode boundaries (and even
> filesystem boundaries). The BKL-removal brigade tried this back in 2.4
> and the locking ended up scaling worse than just plonking a single
> spinlock around the whole thing.
Ok, no worries. Just as long as I know why it's a single lock. Looks ok to
me, apart from the need for testing (and talking to NFS etc people).
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 17:39 ` Ingo Molnar
@ 2008-05-07 6:49 ` Zhang, Yanmin
0 siblings, 0 replies; 43+ messages in thread
From: Zhang, Yanmin @ 2008-05-07 6:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Matthew Wilcox, J. Bruce Fields, LKML,
Alexander Viro, Linus Torvalds, linux-fsdevel
On Tue, 2008-05-06 at 19:39 +0200, Ingo Molnar wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > Finally: how come we regressed by swapping the semaphore
> > implementation anyway? We went from one sleeping lock implementation
> > to another - I'd have expected performance to be pretty much the same.
> i.e. we'll always keep yet another task in flight. This can mask wakeup
> latencies especially when it takes time.
>
> The patch (hack) below tries to emulate this weirdness - it 'kicks'
> another task as well and keeps it busy. Most of the time this just
> causes extra scheduling, but if AIM7 is _just_ saturating the number of
> CPUs, it might make a difference. Yanmin, does the patch below make any
> difference to the AIM7 results?
I tested it on my 8-core Stoakley and the result is 12% worse than that of
pure 2.6.26-rc1.
-yanmin
>
> ( it would be useful data to get a meaningful context switch trace from
> the whole regressed workload, and compare it to a context switch trace
> with the revert added. )
>
> Ingo
>
> ---
> kernel/semaphore.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> Index: linux/kernel/semaphore.c
> ===================================================================
> --- linux.orig/kernel/semaphore.c
> +++ linux/kernel/semaphore.c
> @@ -261,4 +261,14 @@ static noinline void __sched __up(struct
> list_del(&waiter->list);
> waiter->up = 1;
> wake_up_process(waiter->task);
> +
> + if (likely(list_empty(&sem->wait_list)))
> + return;
> + /*
> + * Opportunistically wake up another task as well but do not
> + * remove it from the list:
> + */
> + waiter = list_first_entry(&sem->wait_list,
> + struct semaphore_waiter, list);
> + wake_up_process(waiter->task);
> }
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 17:21 ` Andrew Morton
` (2 preceding siblings ...)
2008-05-06 17:45 ` Linus Torvalds
@ 2008-05-07 16:38 ` Matthew Wilcox
2008-05-07 16:55 ` Linus Torvalds
3 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-07 16:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro,
Linus Torvalds, linux-fsdevel
On Tue, May 06, 2008 at 10:21:53AM -0700, Andrew Morton wrote:
> up() seems to be doing wake-one, FIFO which is nice. Did the
> implementation which we just removed also do that? Was it perhaps
> accidentally doing LIFO or something like that?
If heavily contended, it could do this.
up() would increment sem->count and call __up(), which would call wake_up().
down() would decrement sem->count.
The unlucky task woken by __up() would lose the race and go back to sleep.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 16:38 ` Matthew Wilcox
@ 2008-05-07 16:55 ` Linus Torvalds
2008-05-07 17:08 ` Linus Torvalds
0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 16:55 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Matthew Wilcox wrote:
>
> If heavily contended, it could do this.
It doesn't have to be heavily contended - if it's just hot and a bit lucky,
it would potentially never schedule at all, because it would never take
the spinlock and serialize the callers.
It doesn't even need "unfairness" to work that way. The old semaphore
implementation was very much designed to be lock-free, and if you had one
CPU doing a lock while another did an unlock, the *common* situation was
that the unlock would succeed first, because the unlocker was also the
person who had the spinlock exclusively in its cache!
The above may count as "lucky", but the hot-cache-line thing is a big
deal. It turns "lucky" into something that isn't a 50:50 chance, but
something that is quite possible to trigger consistently if you just have
mostly short holders of the lock.
Which, btw, is probably true. The BKL is normally held for short times,
and released (by that thread) for relatively much longer times. Which
is when spinlocks tend to work the best, even when they are fair (because
it's not so much a fairness issue, it's simply a cost-of-taking-the-lock
issue!)
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 16:55 ` Linus Torvalds
@ 2008-05-07 17:08 ` Linus Torvalds
2008-05-07 17:16 ` Andrew Morton
2008-05-07 17:22 ` Ingo Molnar
0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 17:08 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Linus Torvalds wrote:
>
> Which, btw, is probably true. The BKL is normally held for short times,
> and released (by that thread) for relatively much longer times. Which
> is when spinlocks tend to work the best, even when they are fair (because
> it's not so much a fairness issue, it's simply a cost-of-taking-the-lock
> issue!)
.. and don't get me wrong: the old semaphores (and the new mutexes) should
also have this property when lucky: taking the lock is often a hot-path
case.
And the spinlock+generic semaphore thing probably makes that "lucky"
behavior be exponentially less likely, because now to hit the lucky case,
rather than the hot path having just *one* access to the interesting cache
line, it has basically something like 4 accesses (spinlock, count test,
count decrement, spinunlock), in addition to various serializing
instructions, so I suspect it quite often gets serialized simply because
even the "fast path" is actually about ten times as long!
As a result, a slow "fast path" means that the thing gets saturated much
more easily, and that in turn means that the "fast path" turns into a
"slow path" more easily, which is how you end up in the scheduler rather
than just taking the fast path.
This is why sleeping locks are more expensive in general: they have a
*huge* cost from when they get contended. Hundreds of times higher than a
spinlock. And the faster they are, the longer it takes for them to get
contended under load. So slowing them down in the fast path is a double
whammy, in that it shows their bad behaviour much earlier.
And the generic semaphores really are slower than the old optimized ones
in that fast path. By a *big* amount.
Which is why I'm 100% convinced it's not even worth saving the old code.
It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with
"slow path" other than the fact that it gets to that slow path much more
these days.
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:08 ` Linus Torvalds
@ 2008-05-07 17:16 ` Andrew Morton
2008-05-07 17:27 ` Linus Torvalds
2008-05-07 17:22 ` Ingo Molnar
1 sibling, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2008-05-07 17:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Wed, 7 May 2008 10:08:18 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Which is why I'm 100% convinced it's not even worth saving the old code.
> It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with
> "slow path" other than the fact that it gets to that slow path much more
> these days.
Stupid question: why doesn't lock_kernel() use a mutex?
(stupid answer: it'll trigger might_sleep() checks when we do it early in
boot with irqs disabled, but we can fix that)
(And __might_sleep()'s system_state check might even save us from that)
Of course, we shouldn't change anything until we've worked out why the new
semaphores got slower.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:08 ` Linus Torvalds
2008-05-07 17:16 ` Andrew Morton
@ 2008-05-07 17:22 ` Ingo Molnar
2008-05-07 17:25 ` Ingo Molnar
2008-05-07 17:31 ` Linus Torvalds
1 sibling, 2 replies; 43+ messages in thread
From: Ingo Molnar @ 2008-05-07 17:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Which is why I'm 100% convinced it's not even worth saving the old
> code. It needs to use mutexes, or spinlocks. I bet it has *nothing* to
> do with "slow path" other than the fact that it gets to that slow path
> much more these days.
i think your theory should be easy to test: Yanmin, could you turn on
CONFIG_MUTEX_DEBUG=y and check by how much AIM7 regresses?
Because in the CONFIG_MUTEX_DEBUG=y case the mutex debug code does
exactly that: it doesn't use the single-instruction fastpath [it uses
asm-generic/mutex-null.h] but always drops into the slowpath (to be able
to access debug state). That debug code is about as expensive as the
generic semaphore code's current fastpath. (perhaps even more
expensive.)
There's far more normal mutex fastpath use during an AIM7 run than any
BKL use. So if it's due to any direct fastpath overhead and the
resulting widening of the window for the real slowdown, we should see a
severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?
Ingo
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:22 ` Ingo Molnar
@ 2008-05-07 17:25 ` Ingo Molnar
2008-05-07 17:31 ` Linus Torvalds
1 sibling, 0 replies; 43+ messages in thread
From: Ingo Molnar @ 2008-05-07 17:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
* Ingo Molnar <mingo@elte.hu> wrote:
> There's far more normal mutex fastpath use during an AIM7 run than any
> BKL use. So if it's due to any direct fastpath overhead and the
> resulting widening of the window for the real slowdown, we should see
> a severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?
my own guesstimate about AIM7 performance impact resulting out of
CONFIG_MUTEX_DEBUG=y: performance overhead will not be measurable, or
will at most be in the sub-1% range. But i've been badly wrong before :)
Ingo
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:16 ` Andrew Morton
@ 2008-05-07 17:27 ` Linus Torvalds
0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 17:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Andrew Morton wrote:
> On Wed, 7 May 2008 10:08:18 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > Which is why I'm 100% convinced it's not even worth saving the old code.
> > It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with
> > "slow path" other than the fact that it gets to that slow path much more
> > these days.
>
> Stupid question: why doesn't lock_kernel() use a mutex?
Not stupid.
The only reason some code didn't get turned over to mutexes was literally
that they didn't want the debugging because they were doing intentionally
bad things.
I think the BKL is one of them (the console semaphore was another, iirc).
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:22 ` Ingo Molnar
2008-05-07 17:25 ` Ingo Molnar
@ 2008-05-07 17:31 ` Linus Torvalds
2008-05-07 17:47 ` Linus Torvalds
2008-05-07 17:49 ` Ingo Molnar
1 sibling, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 17:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Ingo Molnar wrote:
>
> There's far more normal mutex fastpath use during an AIM7 run than any
> BKL use. So if it's due to any direct fastpath overhead and the
> resulting widening of the window for the real slowdown, we should see a
> severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?
Not agreed.
The BKL is special because it is a *single* lock.
All the "normal" mutex code use fine-grained locking, so even if you slow
down the fast path, that won't cause the same kind of fastpath->slowpath
increase.
In order to see the fastpath->slowpath thing, you do need to have many
threads hitting the same lock: ie the slowdown has to result in real
contention.
Almost no mutexes have any potential for contention what-so-ever, except
for things that very consciously try to hit it (multiple threads doing
readdir and/or file creation on the *same* directory etc).
The BKL really is special.
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:31 ` Linus Torvalds
@ 2008-05-07 17:47 ` Linus Torvalds
2008-05-07 17:49 ` Ingo Molnar
1 sibling, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 17:47 UTC (permalink / raw)
To: Ingo Molnar
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Linus Torvalds wrote:
>
> All the "normal" mutex code use fine-grained locking, so even if you slow
> down the fast path, that won't cause the same kind of fastpath->slowpath
> increase.
Put another way: let's say that the "good fastpath" is basically a single
locked instruction - ~12 cycles on AMD, ~35 Core 2. That's the
no-bouncing, no-contention case.
Doing it with debugging (call overhead, spinlocks, local irq saving etc.)
will probably easily triple it or more, but we're not changing anything
else. There's no "downstream" effect: the behaviour itself doesn't change.
It doesn't get more bouncing, it doesn't start sleeping.
But what happens if the lock has the *potential* for conflicts is
different.
There, a "longish pause + fast lock + short average code sequence + fast
unlock" is quite likely to stay uncontended for a fair amount of time, and
while it will be much slower than the no-contention-at-all case (because
you do get a pretty likely cacheline event at the "fast lock" part), with
a fairly low number of CPU's and a long enough pause, you *also* easily
get into a pattern where the thing that got the lock will likely also get
to unlock without dropping the cacheline.
So far so good.
But that basically depends on the fact that "lock + work + unlock" is
_much_ shorter than the "longish pause" in between, so that even if you
have <n> CPU's all doing the same thing, their pauses between the locked
section are still bigger than <n> times that short time.
Once that is no longer true, you now start to bounce both at the lock
*and* the unlock, and now that whole sequence got likely ten times slower.
*AND* because it now actually has real contention, it actually got even
worse: if the lock is a sleeping one, you get *another* order of magnitude
just because you now started doing scheduling overhead too!
So the thing is, it just breaks down very badly. A spinlock that gets
contention probably gets ten times slower due to bouncing the cacheline. A
semaphore that gets contention probably gets a *hundred* times slower, or
more.
And so my bet is that both the old and the new semaphores had the same bad
break-down situation, but the new semaphores just are a lot easier to
trigger it because they are at least three times costlier than the old
ones, so you just hit the bad behaviour with much lower loads (or fewer
number of CPU's).
But spinlocks really do behave much better when contended, because at
least they don't get the even bigger hit of also hitting the scheduler. So
the old semaphores would have behaved badly too *eventually*, they just
needed a more extreme case to show that bad behavior.
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:31 ` Linus Torvalds
2008-05-07 17:47 ` Linus Torvalds
@ 2008-05-07 17:49 ` Ingo Molnar
2008-05-07 18:02 ` Linus Torvalds
1 sibling, 1 reply; 43+ messages in thread
From: Ingo Molnar @ 2008-05-07 17:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > There's far more normal mutex fastpath use during an AIM7 run than
> > any BKL use. So if it's due to any direct fastpath overhead and the
> > resulting widening of the window for the real slowdown, we should
> > see a severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?
>
> Not agreed.
>
> The BKL is special because it is a *single* lock.
ok, indeed my suggestion is wrong and this would not be a good
comparison.
another idea: my trial-balloon patch should test your theory too, because
the generic down_trylock() is still the 'fat' version, it does:
        spin_lock_irqsave(&sem->lock, flags);
        count = sem->count - 1;
        if (likely(count >= 0))
                sem->count = count;
        spin_unlock_irqrestore(&sem->lock, flags);
if there is a noticeable performance difference between your
trial-ballon patch and mine, then the micro-cost of the BKL very much
matters to this workload. Agreed about that?
but i'd be _hugely_ surprised about it. The tty code's BKL use should i
think only happen when a task exits and releases the tty - and a task
exit - even if this is a threaded test (which AIM7 can be - not sure
which exact parameters Yanmin used) - the costs of thread creation and
thread exit are just not in the same ballpark as any BKL micro-costs.
Dunno, maybe i overlooked some high-freq BKL user. (but any such site
would have shown up before) Even assuming a widening of the critical
path and some catastrophic domino effect (that does show up as increased
scheduling) i've never seen a 40% drop like this.
this regression, to me, has "different scheduling behavior" written all
over it - but that's just an impression. I'm not going to bet against
you though ;-)
Ingo
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:49 ` Ingo Molnar
@ 2008-05-07 18:02 ` Linus Torvalds
2008-05-07 18:17 ` Ingo Molnar
0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 18:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Ingo Molnar wrote:
>
> another idea: my trial-balloon patch should test your theory too, because
> the generic down_trylock() is still the 'fat' version, it does:
I agree that your trial-balloon should likely get rid of the big
regression, since it avoids the scheduler.
So with your patch, lock_kernel() ends up being just a rather expensive
spinlock. And yes, I'd expect that it should get rid of the 40% cost,
because while it makes lock_kernel() more expensive than a spinlock and
you might end up having a few more cacheline bounces on the lock due to
that, that's still the "small" expense compared to going through the whole
scheduler on conflicts.
So I'd expect that realistically the performance difference between your
version and just plain spinlocks shouldn't be *that* big. I'd expect it to
be visible, but in the (low) single-digit percentage range rather than in
any 40% range. That's just a guess.
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 18:02 ` Linus Torvalds
@ 2008-05-07 18:17 ` Ingo Molnar
2008-05-07 18:27 ` Linus Torvalds
0 siblings, 1 reply; 43+ messages in thread
From: Ingo Molnar @ 2008-05-07 18:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 7 May 2008, Ingo Molnar wrote:
> >
> > another idea: my trial-balloon patch should test your theory too,
> > because the generic down_trylock() is still the 'fat' version, it
> > does:
>
> I agree that your trial-balloon should likely get rid of the big
> regression, since it avoids the scheduler.
>
> So with your patch, lock_kernel() ends up being just a rather
> expensive spinlock. And yes, I'd expect that it should get rid of the
> 40% cost, because while it makes lock_kernel() more expensive than a
> spinlock and you might end up having a few more cacheline bounces on
> the lock due to that, that's still the "small" expense compared to
> going through the whole scheduler on conflicts.
>
> So I'd expect that realistically the performance difference between
> your version and just plain spinlocks shouldn't be *that* big. I'd
> expect it to be visible, but in the (low) single-digit percentage
> range rather than in any 40% range. That's just a guess.
third attempt - the patch below ontop of v2.6.25 should be quite similar
fastpath atomic overhead to what generic semaphores do? So if Yanmin
tests this patch ontop of v2.6.25, we should see the direct fastpath
overhead - without any changes to the semaphore wakeup/scheduling logic
otherwise.
[ this patch should in fact be a bit worse, because there's two more
atomics in the fastpath - the fastpath atomics of the old semaphore
code. ]
Ingo
------------------>
Subject: v2.6.25 BKL: add atomic overhead
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 20:09:13 CEST 2008
---
lib/kernel_lock.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
Index: linux-2.6.25/lib/kernel_lock.c
===================================================================
--- linux-2.6.25.orig/lib/kernel_lock.c
+++ linux-2.6.25/lib/kernel_lock.c
@@ -24,6 +24,7 @@
* Don't use in new code.
*/
static DECLARE_MUTEX(kernel_sem);
+static DEFINE_SPINLOCK(global_lock);
/*
* Re-acquire the kernel semaphore.
@@ -47,6 +48,9 @@ int __lockfunc __reacquire_kernel_lock(v
down(&kernel_sem);
+ spin_lock(&global_lock);
+ spin_unlock(&global_lock);
+
preempt_disable();
task->lock_depth = saved_lock_depth;
@@ -55,6 +59,9 @@ int __lockfunc __reacquire_kernel_lock(v
void __lockfunc __release_kernel_lock(void)
{
+ spin_lock(&global_lock);
+ spin_unlock(&global_lock);
+
up(&kernel_sem);
}
@@ -66,12 +73,16 @@ void __lockfunc lock_kernel(void)
struct task_struct *task = current;
int depth = task->lock_depth + 1;
- if (likely(!depth))
+ if (likely(!depth)) {
/*
* No recursion worries - we set up lock_depth _after_
*/
down(&kernel_sem);
+ spin_lock(&global_lock);
+ spin_unlock(&global_lock);
+ }
+
task->lock_depth = depth;
}
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 18:17 ` Ingo Molnar
@ 2008-05-07 18:27 ` Linus Torvalds
2008-05-07 18:43 ` Ingo Molnar
0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 18:27 UTC (permalink / raw)
To: Ingo Molnar
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Ingo Molnar wrote:
>
> [ this patch should in fact be a bit worse, because there's two more
> atomics in the fastpath - the fastpath atomics of the old semaphore
> code. ]
Well, it doesn't have the irq stuff, which is also pretty costly. Also, it
doesn't nest the accesses the same way (with the counts being *inside* the
spinlock and serialized against each other), so I'm not 100% sure you'd
get the same behaviour.
But yes, it certainly has the potential to show the same slowdown. But
it's not a very good patch, since not showing it doesn't really prove
much.
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 18:27 ` Linus Torvalds
@ 2008-05-07 18:43 ` Ingo Molnar
2008-05-07 19:01 ` Linus Torvalds
0 siblings, 1 reply; 43+ messages in thread
From: Ingo Molnar @ 2008-05-07 18:43 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > [ this patch should in fact be a bit worse, because there's two more
> > atomics in the fastpath - the fastpath atomics of the old
> > semaphore code. ]
>
> Well, it doesn't have the irq stuff, which is also pretty costly.
> Also, it doesn't nest the accesses the same way (with the counts being
> *inside* the spinlock and serialized against each other), so I'm not
> 100% sure you'd get the same behaviour.
>
> But yes, it certainly has the potential to show the same slowdown. But
> it's not a very good patch, since not showing it doesn't really prove
> much.
ok, the one below does irq ops and the counter behavior - and because
the critical section also has the old-semaphore atomics i think this
should definitely be a more expensive fastpath than what the new generic
code introduces. So if this patch produces a 40% AIM7 slowdown on
v2.6.25 it's the fastpath overhead (and its effects on slowpath
probability) that makes the difference.
Ingo
------------------->
Subject: add BKL atomic overhead
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 20:09:13 CEST 2008
NOT-Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
lib/kernel_lock.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
Index: linux-2.6.25/lib/kernel_lock.c
===================================================================
--- linux-2.6.25.orig/lib/kernel_lock.c
+++ linux-2.6.25/lib/kernel_lock.c
@@ -24,6 +24,8 @@
* Don't use in new code.
*/
static DECLARE_MUTEX(kernel_sem);
+static int global_count;
+static DEFINE_SPINLOCK(global_lock);
/*
* Re-acquire the kernel semaphore.
@@ -39,6 +41,7 @@ int __lockfunc __reacquire_kernel_lock(v
{
struct task_struct *task = current;
int saved_lock_depth = task->lock_depth;
+ unsigned long flags;
BUG_ON(saved_lock_depth < 0);
@@ -47,6 +50,10 @@ int __lockfunc __reacquire_kernel_lock(v
down(&kernel_sem);
+ spin_lock_irqsave(&global_lock, flags);
+ global_count++;
+ spin_unlock_irqrestore(&global_lock, flags);
+
preempt_disable();
task->lock_depth = saved_lock_depth;
@@ -55,6 +62,10 @@ int __lockfunc __reacquire_kernel_lock(v
void __lockfunc __release_kernel_lock(void)
{
+ spin_lock_irqsave(&global_lock, flags);
+ global_count--;
+ spin_unlock_irqrestore(&global_lock, flags);
+
up(&kernel_sem);
}
@@ -66,12 +77,17 @@ void __lockfunc lock_kernel(void)
struct task_struct *task = current;
int depth = task->lock_depth + 1;
- if (likely(!depth))
+ if (likely(!depth)) {
/*
* No recursion worries - we set up lock_depth _after_
*/
down(&kernel_sem);
+ spin_lock_irqsave(&global_lock, flags);
+ global_count++;
+ spin_unlock_irqrestore(&global_lock, flags);
+ }
+
task->lock_depth = depth;
}
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 18:43 ` Ingo Molnar
@ 2008-05-07 19:01 ` Linus Torvalds
2008-05-07 19:09 ` Ingo Molnar
2008-05-07 19:24 ` Matthew Wilcox
0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 19:01 UTC (permalink / raw)
To: Ingo Molnar
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Ingo Molnar wrote:
>
> ok, the one below does irq ops and the counter behavior
No it doesn't. The counter isn't used for any actual *testing*, so the
locking around it and the serialization of it has absolutely no impact on
the scheduling behaviour!
Since the big slowdown was clearly accompanied by sleeping behaviour (the
processes who didn't get the lock end up sleeping!), that is a *big* part
of the slowdown.
Is it possible that your patch gets similar behaviour? Absolutely. But
you're missing the whole point here. Anybody can make code behave badly
and perform worse. But if you want to just verify that it's about the
sleeping behaviour and timings of the BKL, then you need to do exactly
that: emulate the sleeping behavior, not just the timings _outside_ of the
sleeping behavior.
The thing is, we definitely are interested to see whether it's the BKL or
some other semaphore that is the problem. But the best way to test that is
to just try my patch that *guarantees* that the BKL doesn't have any
semaphore behaviour AT ALL.
Could it be something else entirely? Yes. We know it's semaphore-related.
We don't know for a fact that it's the BKL itself. There could be other
semaphores that are that hot. It sounds unlikely, but quite frankly,
regardless, I don't really see the point of your patches.
If Yanmin tries my patch, it is *guaranteed* to show something. It either
shows that it's about the BKL (and that we absolutely have to do the BKL
as something _other_ than a generic semaphore), or it shows that it's not
about the BKL (and that _all_ the patches in this discussion are likely
pointless).
In contrast, these "try to emulate bad behavior with the old known-ok
semaphores" don't show anything AT ALL. We already know it's related to
semaphores. And your patches aren't even guaranteed to show the same
issue.
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 19:01 ` Linus Torvalds
@ 2008-05-07 19:09 ` Ingo Molnar
2008-05-07 19:24 ` Matthew Wilcox
1 sibling, 0 replies; 43+ messages in thread
From: Ingo Molnar @ 2008-05-07 19:09 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> In contrast, these "try to emulate bad behavior with the old known-ok
> semaphores" don't show anything AT ALL. We already know it's related
> to semaphores. And your patches aren't even guaranteed to show the
> same issue.
yeah, i was just trying to come up with patches to probe which one of
the following two possibilities is actually the case:
- if the regression is due to the difference in scheduling behavior of
new semaphores (different wakeup patterns, etc.), that's fixable in
the new semaphore code => then the BKL code need not change.
- if the regression is due to a difference in the fastpath cost, then
the new semaphores can probably not be improved (much of their appeal
comes from them not being complex and not being in assembly) => then
the BKL code needs to change to become cheaper [i.e. then we want
your patch].
Ingo
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 19:01 ` Linus Torvalds
2008-05-07 19:09 ` Ingo Molnar
@ 2008-05-07 19:24 ` Matthew Wilcox
2008-05-07 19:44 ` Linus Torvalds
1 sibling, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-07 19:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Wed, May 07, 2008 at 12:01:28PM -0700, Linus Torvalds wrote:
> The thing is, we definitely are interested to see whether it's the BKL or
> some other semaphore that is the problem. But the best way to test that is
> to just try my patch that *guarantees* that the BKL doesn't have any
> semaphore behaviour AT ALL.
>
> Could it be something else entirely? Yes. We know it's semaphore-related.
> We don't know for a fact that it's the BKL itself. There could be other
> semaphores that are that hot. It sounds unlikely, but quite frankly,
> regardless, I don't really see the point of your patches.
>
> If Yanmin tries my patch, it is *guaranteed* to show something. It either
> shows that it's about the BKL (and that we absolutely have to do the BKL
> as something _else_ than a generic semaphore), or it shows that it's not
> about the BKL (and that _all_ the patches in this discussion are likely
> pointless).
One patch I'd still like Yanmin to test is my one from yesterday which
removes the BKL from fs/locks.c.
http://marc.info/?l=linux-fsdevel&m=121009123427437&w=2
Obviously, it won't help if the problem isn't the BKL.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 19:24 ` Matthew Wilcox
@ 2008-05-07 19:44 ` Linus Torvalds
2008-05-07 20:00 ` Oi. NFS people. Read this Matthew Wilcox
0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 19:44 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Matthew Wilcox wrote:
>
> One patch I'd still like Yanmin to test is my one from yesterday which
> removes the BKL from fs/locks.c.
And I'd personally rather have the network-fs people test and comment on
that one ;)
I think that patch is worth looking at regardless, but the problems with
that one aren't about performance, but about what the implications are for
the filesystems (if any)...
Linus
* Oi. NFS people. Read this.
2008-05-07 19:44 ` Linus Torvalds
@ 2008-05-07 20:00 ` Matthew Wilcox
2008-05-07 22:10 ` Trond Myklebust
0 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-07 20:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Wed, May 07, 2008 at 12:44:48PM -0700, Linus Torvalds wrote:
> On Wed, 7 May 2008, Matthew Wilcox wrote:
> >
> > One patch I'd still like Yanmin to test is my one from yesterday which
> > removes the BKL from fs/locks.c.
>
> And I'd personally rather have the network-fs people test and comment on
> that one ;)
>
> I think that patch is worth looking at regardless, but the problems with
> that one aren't about performance, but about what the implications are for
> the filesystems (if any)...
Oh, well, they don't seem interested.
I can comment on some of the problems though.
fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c,
fs/nfsd/nfs4state.c all walk the i_flock list under the BKL. That won't
protect them against locks.c any more. That's probably OK for fs/nfs/*
since they'll be protected by their own data structures (Someone please
check me on that?), but it's a bad idea for lockd/nfsd which are walking
the lists for filesystems.
Are we going to have to export the file_lock_lock? I'd rather not. But
we need to keep nfsd/lockd from tripping over locks.c.
Maybe we could come up with a decent API that lockd could use? It all
seems a bit complex at the moment ... maybe lockd should be keeping
track of the locks it owns anyway (since surely the posix deadlock
detection code can't work properly if it's just passing all the locks
through).
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: Oi. NFS people. Read this.
2008-05-07 20:00 ` Oi. NFS people. Read this Matthew Wilcox
@ 2008-05-07 22:10 ` Trond Myklebust
2008-05-09 1:43 ` J. Bruce Fields
0 siblings, 1 reply; 43+ messages in thread
From: Trond Myklebust @ 2008-05-07 22:10 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, J. Bruce Fields,
Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel
On Wed, 2008-05-07 at 14:00 -0600, Matthew Wilcox wrote:
> On Wed, May 07, 2008 at 12:44:48PM -0700, Linus Torvalds wrote:
> > On Wed, 7 May 2008, Matthew Wilcox wrote:
> > >
> > > One patch I'd still like Yanmin to test is my one from yesterday which
> > > removes the BKL from fs/locks.c.
> >
> > And I'd personally rather have the network-fs people test and comment on
> > that one ;)
> >
> > I think that patch is worth looking at regardless, but the problems with
> > that one aren't about performance, but about what the implications are for
> > the filesystems (if any)...
>
> Oh, well, they don't seem interested.
Poor timing: we're all preparing for and travelling to the annual
Connectathon interoperability testing conference which starts tomorrow.
> I can comment on some of the problems though.
>
> fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c,
> fs/nfsd/nfs4state.c all walk the i_flock list under the BKL. That won't
> protect them against locks.c any more. That's probably OK for fs/nfs/*
> since they'll be protected by their own data structures (Someone please
> check me on that?), but it's a bad idea for lockd/nfsd which are walking
> the lists for filesystems.
Yes. fs/nfs is just reusing the code in fs/locks.c in order to track the
locks it holds on the server. We could alternatively have coded a
private lock implementation, but this seemed easier.
> Are we going to have to export the file_lock_lock? I'd rather not. But
> we need to keep nfsd/lockd from tripping over locks.c.
>
> Maybe we could come up with a decent API that lockd could use? It all
> seems a bit complex at the moment ... maybe lockd should be keeping
> track of the locks it owns anyway (since surely the posix deadlock
> detection code can't work properly if it's just passing all the locks
> through).
I'm not sure what you mean when you talk about lockd keeping track of
the locks it owns. It has to keep those locks on inode->i_flock in order
to make them visible to the host filesystem...
All lockd really needs is the ability to find a lock it owns, and then
obtain a copy. As for the nfs client, I suspect we can make do with
something similar...
Cheers
Trond
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:23 ` Matthew Wilcox
2008-05-06 16:36 ` Linus Torvalds
2008-05-06 17:21 ` Andrew Morton
@ 2008-05-08 3:24 ` Zhang, Yanmin
2008-05-08 3:34 ` Linus Torvalds
2 siblings, 1 reply; 43+ messages in thread
From: Zhang, Yanmin @ 2008-05-08 3:24 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ingo Molnar, J. Bruce Fields, LKML, Alexander Viro,
Linus Torvalds, Andrew Morton, linux-fsdevel
On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote:
> On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > So the only likely things I can see are:
> >
> > - file locks
> > - fasync
>
> I've wanted to fix file locks for a while. Here's a first attempt.
> It was done quickly, so I concede that it may well have bugs in it.
> I found (and fixed) one with LTP.
>
> It takes *no account* of nfsd, nor remote filesystems. We need to have
> a serious discussion about their requirements.
I tested it on an 8-core Stoakley machine. The AIM7 result becomes 23% worse
than that of pure 2.6.26-rc1.
I'm replying to this email in case you have many patches and I might otherwise
test one you don't expect.
-yanmin
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 663c069..cb09765 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -140,6 +140,8 @@ int lease_break_time = 45;
> #define for_each_lock(inode, lockp) \
> for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
>
> +static DEFINE_SPINLOCK(file_lock_lock);
> +
> static LIST_HEAD(file_lock_list);
> static LIST_HEAD(blocked_list);
>
> @@ -510,9 +512,9 @@ static void __locks_delete_block(struct file_lock *waiter)
> */
> static void locks_delete_block(struct file_lock *waiter)
> {
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> __locks_delete_block(waiter);
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> }
>
> /* Insert waiter into blocker's block list.
> @@ -649,7 +651,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
> {
> struct file_lock *cfl;
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> for (cfl = filp->f_path.dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) {
> if (!IS_POSIX(cfl))
> continue;
> @@ -662,7 +664,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
> fl->fl_pid = pid_vnr(cfl->fl_nspid);
> } else
> fl->fl_type = F_UNLCK;
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return;
> }
> EXPORT_SYMBOL(posix_test_lock);
> @@ -735,18 +737,21 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
> int error = 0;
> int found = 0;
>
> - lock_kernel();
> - if (request->fl_flags & FL_ACCESS)
> + if (request->fl_flags & FL_ACCESS) {
> + spin_lock(&file_lock_lock);
> goto find_conflict;
> + }
>
> if (request->fl_type != F_UNLCK) {
> error = -ENOMEM;
> +
> new_fl = locks_alloc_lock();
> if (new_fl == NULL)
> - goto out;
> + goto out_unlocked;
> error = 0;
> }
>
> + spin_lock(&file_lock_lock);
> for_each_lock(inode, before) {
> struct file_lock *fl = *before;
> if (IS_POSIX(fl))
> @@ -772,10 +777,13 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
> * If a higher-priority process was blocked on the old file lock,
> * give it the opportunity to lock the file.
> */
> - if (found)
> + if (found) {
> + spin_unlock(&file_lock_lock);
> cond_resched();
> + spin_lock(&file_lock_lock);
> + }
>
> -find_conflict:
> + find_conflict:
> for_each_lock(inode, before) {
> struct file_lock *fl = *before;
> if (IS_POSIX(fl))
> @@ -796,8 +804,9 @@ find_conflict:
> new_fl = NULL;
> error = 0;
>
> -out:
> - unlock_kernel();
> + out:
> + spin_unlock(&file_lock_lock);
> + out_unlocked:
> if (new_fl)
> locks_free_lock(new_fl);
> return error;
> @@ -826,7 +835,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> new_fl2 = locks_alloc_lock();
> }
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> if (request->fl_type != F_UNLCK) {
> for_each_lock(inode, before) {
> fl = *before;
> @@ -994,7 +1003,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> locks_wake_up_blocks(left);
> }
> out:
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> /*
> * Free any unused locks.
> */
> @@ -1069,14 +1078,14 @@ int locks_mandatory_locked(struct inode *inode)
> /*
> * Search the lock list for this inode for any POSIX locks.
> */
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (!IS_POSIX(fl))
> continue;
> if (fl->fl_owner != owner)
> break;
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return fl ? -EAGAIN : 0;
> }
>
> @@ -1190,7 +1199,7 @@ int __break_lease(struct inode *inode, unsigned int mode)
>
> new_fl = lease_alloc(NULL, mode & FMODE_WRITE ? F_WRLCK : F_RDLCK);
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
>
> time_out_leases(inode);
>
> @@ -1251,8 +1260,10 @@ restart:
> break_time++;
> }
> locks_insert_block(flock, new_fl);
> + spin_unlock(&file_lock_lock);
> error = wait_event_interruptible_timeout(new_fl->fl_wait,
> !new_fl->fl_next, break_time);
> + spin_lock(&file_lock_lock);
> __locks_delete_block(new_fl);
> if (error >= 0) {
> if (error == 0)
> @@ -1266,8 +1277,8 @@ restart:
> error = 0;
> }
>
> -out:
> - unlock_kernel();
> + out:
> + spin_unlock(&file_lock_lock);
> if (!IS_ERR(new_fl))
> locks_free_lock(new_fl);
> return error;
> @@ -1323,7 +1334,7 @@ int fcntl_getlease(struct file *filp)
> struct file_lock *fl;
> int type = F_UNLCK;
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> time_out_leases(filp->f_path.dentry->d_inode);
> for (fl = filp->f_path.dentry->d_inode->i_flock; fl && IS_LEASE(fl);
> fl = fl->fl_next) {
> @@ -1332,7 +1343,7 @@ int fcntl_getlease(struct file *filp)
> break;
> }
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return type;
> }
>
> @@ -1363,6 +1374,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
> if (error)
> return error;
>
> + spin_lock(&file_lock_lock);
> time_out_leases(inode);
>
> BUG_ON(!(*flp)->fl_lmops->fl_break);
> @@ -1370,10 +1382,11 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
> lease = *flp;
>
> if (arg != F_UNLCK) {
> + spin_unlock(&file_lock_lock);
> error = -ENOMEM;
> new_fl = locks_alloc_lock();
> if (new_fl == NULL)
> - goto out;
> + goto out_unlocked;
>
> error = -EAGAIN;
> if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
> @@ -1382,6 +1395,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
> && ((atomic_read(&dentry->d_count) > 1)
> || (atomic_read(&inode->i_count) > 1)))
> goto out;
> + spin_lock(&file_lock_lock);
> }
>
> /*
> @@ -1429,11 +1443,14 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
>
> locks_copy_lock(new_fl, lease);
> locks_insert_lock(before, new_fl);
> + spin_unlock(&file_lock_lock);
>
> *flp = new_fl;
> return 0;
>
> -out:
> + out:
> + spin_unlock(&file_lock_lock);
> + out_unlocked:
> if (new_fl != NULL)
> locks_free_lock(new_fl);
> return error;
> @@ -1471,12 +1488,10 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
> {
> int error;
>
> - lock_kernel();
> if (filp->f_op && filp->f_op->setlease)
> error = filp->f_op->setlease(filp, arg, lease);
> else
> error = generic_setlease(filp, arg, lease);
> - unlock_kernel();
>
> return error;
> }
> @@ -1503,12 +1518,11 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
> if (error)
> return error;
>
> - lock_kernel();
> -
> error = vfs_setlease(filp, arg, &flp);
> if (error || arg == F_UNLCK)
> - goto out_unlock;
> + return error;
>
> + lock_kernel();
> error = fasync_helper(fd, filp, 1, &flp->fl_fasync);
> if (error < 0) {
> /* remove lease just inserted by setlease */
> @@ -1519,7 +1533,7 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
> }
>
> error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
> -out_unlock:
> + out_unlock:
> unlock_kernel();
> return error;
> }
> @@ -2024,7 +2038,7 @@ void locks_remove_flock(struct file *filp)
> fl.fl_ops->fl_release_private(&fl);
> }
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> before = &inode->i_flock;
>
> while ((fl = *before) != NULL) {
> @@ -2042,7 +2056,7 @@ void locks_remove_flock(struct file *filp)
> }
> before = &fl->fl_next;
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> }
>
> /**
> @@ -2057,12 +2071,12 @@ posix_unblock_lock(struct file *filp, struct file_lock *waiter)
> {
> int status = 0;
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> if (waiter->fl_next)
> __locks_delete_block(waiter);
> else
> status = -ENOENT;
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return status;
> }
>
> @@ -2175,7 +2189,7 @@ static int locks_show(struct seq_file *f, void *v)
>
> static void *locks_start(struct seq_file *f, loff_t *pos)
> {
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> f->private = (void *)1;
> return seq_list_start(&file_lock_list, *pos);
> }
> @@ -2187,7 +2201,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
>
> static void locks_stop(struct seq_file *f, void *v)
> {
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> }
>
> struct seq_operations locks_seq_operations = {
> @@ -2215,7 +2229,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
> {
> struct file_lock *fl;
> int result = 1;
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (IS_POSIX(fl)) {
> if (fl->fl_type == F_RDLCK)
> @@ -2232,7 +2246,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
> result = 0;
> break;
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return result;
> }
>
> @@ -2255,7 +2269,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
> {
> struct file_lock *fl;
> int result = 1;
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (IS_POSIX(fl)) {
> if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
> @@ -2270,7 +2284,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
> result = 0;
> break;
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return result;
> }
>
>
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-08 3:24 ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
@ 2008-05-08 3:34 ` Linus Torvalds
2008-05-08 4:37 ` Zhang, Yanmin
0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-05-08 3:34 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML,
Alexander Viro, Andrew Morton, linux-fsdevel
On Thu, 8 May 2008, Zhang, Yanmin wrote:
>
> On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote:
> > On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > > So the only likely things I can see are:
> > >
> > > - file locks
> > > - fasync
> >
> > I've wanted to fix file locks for a while. Here's a first attempt.
> > It was done quickly, so I concede that it may well have bugs in it.
> > I found (and fixed) one with LTP.
> >
> > It takes *no account* of nfsd, nor remote filesystems. We need to have
> > a serious discussion about their requirements.
>
> I tested it on an 8-core Stoakley machine. The AIM7 result becomes 23% worse
> than that of pure 2.6.26-rc1.
Ouch. That's really odd. The BKL->spinlock conversion looks really
obvious, so it shouldn't be that noticeably slower.
The *one* difference is that the BKL has the whole "you can take it
recursively and you can sleep without dropping it because the scheduler
will drop it for you" thing. The spinlock conversion changed all of that
into explicit "drop and retake" locks, and maybe that causes some issues.
But 23% worse? That sounds really odd/extreme.
Can you do a oprofile run or something?
Linus
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-08 3:34 ` Linus Torvalds
@ 2008-05-08 4:37 ` Zhang, Yanmin
2008-05-08 14:58 ` Linus Torvalds
0 siblings, 1 reply; 43+ messages in thread
From: Zhang, Yanmin @ 2008-05-08 4:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML,
Alexander Viro, Andrew Morton, linux-fsdevel
On Wed, 2008-05-07 at 20:34 -0700, Linus Torvalds wrote:
>
> On Thu, 8 May 2008, Zhang, Yanmin wrote:
> >
> > On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote:
> > > On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > > > So the only likely things I can see are:
> > > >
> > > > - file locks
> > > > - fasync
> > >
> > > I've wanted to fix file locks for a while. Here's a first attempt.
> > > It was done quickly, so I concede that it may well have bugs in it.
> > > I found (and fixed) one with LTP.
> > >
> > > It takes *no account* of nfsd, nor remote filesystems. We need to have
> > > a serious discussion about their requirements.
> >
> > I tested it on an 8-core Stoakley machine. The AIM7 result becomes 23%
> > worse than that of pure 2.6.26-rc1.
>
> Ouch. That's really odd. The BKL->spinlock conversion looks really
> obvious, so it shouldn't be that noticeably slower.
>
> The *one* difference is that the BKL has the whole "you can take it
> recursively and you can sleep without dropping it because the scheduler
> will drop it for you" thing. The spinlock conversion changed all of that
> into explicit "drop and retake" locks, and maybe that causes some issues.
>
> But 23% worse? That sounds really odd/extreme.
>
> Can you do a oprofile run or something?
I collected oprofile data. It doesn't look very useful, as CPU idle time is more than 50%.
samples % app name symbol name
270157 9.4450 multitask add_long
266419 9.3143 multitask add_int
238934 8.3534 multitask add_double
187184 6.5442 multitask mul_double
159448 5.5745 multitask add_float
156312 5.4649 multitask sieve
148081 5.1771 multitask mul_float
127192 4.4468 multitask add_short
80480 2.8137 multitask string_rtns_1
57520 2.0110 vmlinux clear_page_c
53935 1.8856 multitask div_long
48753 1.7045 libc-2.6.90.so strncat
40825 1.4273 multitask array_rtns
32807 1.1470 vmlinux __copy_user_nocache
31995 1.1186 multitask div_int
31143 1.0888 multitask div_float
28821 1.0076 multitask div_double
26400 0.9230 vmlinux find_lock_page
26159 0.9146 vmlinux unmap_vmas
25249 0.8827 multitask div_short
21509 0.7520 vmlinux native_read_tsc
18865 0.6595 vmlinux copy_user_generic_string
17993 0.6291 vmlinux copy_page_c
16367 0.5722 vmlinux system_call
14616 0.5110 libc-2.6.90.so msort_with_tmp
13630 0.4765 vmlinux native_sched_clock
12952 0.4528 vmlinux copy_page_range
12817 0.4481 libc-2.6.90.so strcat
12708 0.4443 vmlinux calc_delta_mine
12611 0.4409 libc-2.6.90.so memset
11631 0.4066 bash (no symbols)
9991 0.3493 vmlinux update_curr
9328 0.3261 vmlinux unlock_page
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-08 4:37 ` Zhang, Yanmin
@ 2008-05-08 14:58 ` Linus Torvalds
0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-08 14:58 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML,
Alexander Viro, Andrew Morton, linux-fsdevel
On Thu, 8 May 2008, Zhang, Yanmin wrote:
>
> I collected oprofile data. It doesn't look very useful, as CPU idle time is more than 50%.
Ahh, so it's probably still the BKL that is the problem, it's just not in
the file locking code. The changes to fs/locks.c probably didn't matter
all that much, and the additional regression was likely just some
perturbation.
So it's probably fasync that AIM7 tests. Quite possibly coupled with
/dev/tty etc. No file locking.
Linus
* Re: Oi. NFS people. Read this.
2008-05-07 22:10 ` Trond Myklebust
@ 2008-05-09 1:43 ` J. Bruce Fields
0 siblings, 0 replies; 43+ messages in thread
From: J. Bruce Fields @ 2008-05-09 1:43 UTC (permalink / raw)
To: Trond Myklebust
Cc: Matthew Wilcox, Linus Torvalds, Ingo Molnar, Andrew Morton,
Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel
On Wed, May 07, 2008 at 03:10:27PM -0700, Trond Myklebust wrote:
> On Wed, 2008-05-07 at 14:00 -0600, Matthew Wilcox wrote:
> > On Wed, May 07, 2008 at 12:44:48PM -0700, Linus Torvalds wrote:
> > > On Wed, 7 May 2008, Matthew Wilcox wrote:
> > > >
> > > > One patch I'd still like Yanmin to test is my one from yesterday which
> > > > removes the BKL from fs/locks.c.
> > >
> > > And I'd personally rather have the network-fs people test and comment on
> > > that one ;)
> > >
> > > I think that patch is worth looking at regardless, but the problems with
> > > that one aren't about performance, but about what the implications are for
> > > the filesystems (if any)...
> >
> > Oh, well, they don't seem interested.
>
> Poor timing: we're all preparing for and travelling to the annual
> Connectathon interoperability testing conference which starts tomorrow.
>
> > I can comment on some of the problems though.
> >
> > fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c,
> > fs/nfsd/nfs4state.c all walk the i_flock list under the BKL. That won't
> > protect them against locks.c any more. That's probably OK for fs/nfs/*
> > since they'll be protected by their own data structures (Someone please
> > check me on that?), but it's a bad idea for lockd/nfsd which are walking
> > the lists for filesystems.
>
> Yes. fs/nfs is just reusing the code in fs/locks.c in order to track the
> locks it holds on the server. We could alternatively have coded a
> private lock implementation, but this seemed easier.
So, assuming nfs is taking care of its own locking (I don't know if
that's right), that leaves nlm_traverse_locks() and nlm_file_inuse()
(both in fs/lockd/svcsubs.c) as the problem spots.
> > Are we going to have to export the file_lock_lock? I'd rather not. But
> > we need to keep nfsd/lockd from tripping over locks.c.
> >
> > Maybe we could come up with a decent API that lockd could use? It all
> > seems a bit complex at the moment ... maybe lockd should be keeping
> > track of the locks it owns anyway (since surely the posix deadlock
> > detection code can't work properly if it's just passing all the locks
> > through).
>
> I'm not sure what you mean when you talk about lockd keeping track of
> the locks it owns. It has to keep those locks on inode->i_flock in order
> to make them visible to the host filesystem...
>
> All lockd really needs is the ability to find a lock it owns, and then
> obtain a copy.
That sounds right.
--b.
> As for the nfs client, I suspect we can make do with
> something similar...
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 18:07 ` Andrew Morton
@ 2008-05-11 11:11 ` Matthew Wilcox
0 siblings, 0 replies; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-11 11:11 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, bfields, yanmin_zhang, linux-kernel, viro, torvalds,
linux-fsdevel
On Tue, May 06, 2008 at 11:07:52AM -0700, Andrew Morton wrote:
> Yeah, the early bootup code. The kernel does accidental lock_kernel()s in
> various places and if that re-enables interrupts then powerpc goeth crunch.
>
> Matthew, that seemingly-unneeded irqsave in lib/semaphore.c is a prime site
> for /* one of these things */, no?
I was just reviewing the code and I came across one of these:
/*
* Some notes on the implementation:
*
* The spinlock controls access to the other members of the semaphore.
* down_trylock() and up() can be called from interrupt context, so we
* have to disable interrupts when taking the lock. It turns out various
* parts of the kernel expect to be able to use down() on a semaphore in
* interrupt context when they know it will succeed, so we have to use
* irqsave variants for down(), down_interruptible() and down_killable()
* too.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
end of thread, other threads: [~2008-05-11 11:11 UTC | newest]
Thread overview: 43+ messages
[not found] <1210052904.3453.30.camel@ymzhang>
[not found] ` <20080506114449.GC32591@elte.hu>
2008-05-06 12:09 ` AIM7 40% regression with 2.6.26-rc1 Matthew Wilcox
2008-05-06 16:23 ` Matthew Wilcox
2008-05-06 16:36 ` Linus Torvalds
2008-05-06 16:42 ` Matthew Wilcox
2008-05-06 16:39 ` Alan Cox
2008-05-06 16:51 ` Matthew Wilcox
2008-05-06 16:45 ` Alan Cox
2008-05-06 17:42 ` Linus Torvalds
2008-05-06 20:28 ` Linus Torvalds
2008-05-06 16:44 ` J. Bruce Fields
2008-05-06 17:21 ` Andrew Morton
2008-05-06 17:31 ` Matthew Wilcox
2008-05-06 17:49 ` Ingo Molnar
2008-05-06 18:07 ` Andrew Morton
2008-05-11 11:11 ` Matthew Wilcox
2008-05-06 17:39 ` Ingo Molnar
2008-05-07 6:49 ` Zhang, Yanmin
2008-05-06 17:45 ` Linus Torvalds
2008-05-07 16:38 ` Matthew Wilcox
2008-05-07 16:55 ` Linus Torvalds
2008-05-07 17:08 ` Linus Torvalds
2008-05-07 17:16 ` Andrew Morton
2008-05-07 17:27 ` Linus Torvalds
2008-05-07 17:22 ` Ingo Molnar
2008-05-07 17:25 ` Ingo Molnar
2008-05-07 17:31 ` Linus Torvalds
2008-05-07 17:47 ` Linus Torvalds
2008-05-07 17:49 ` Ingo Molnar
2008-05-07 18:02 ` Linus Torvalds
2008-05-07 18:17 ` Ingo Molnar
2008-05-07 18:27 ` Linus Torvalds
2008-05-07 18:43 ` Ingo Molnar
2008-05-07 19:01 ` Linus Torvalds
2008-05-07 19:09 ` Ingo Molnar
2008-05-07 19:24 ` Matthew Wilcox
2008-05-07 19:44 ` Linus Torvalds
2008-05-07 20:00 ` Oi. NFS people. Read this Matthew Wilcox
2008-05-07 22:10 ` Trond Myklebust
2008-05-09 1:43 ` J. Bruce Fields
2008-05-08 3:24 ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
2008-05-08 3:34 ` Linus Torvalds
2008-05-08 4:37 ` Zhang, Yanmin
2008-05-08 14:58 ` Linus Torvalds