* Re: AIM7 40% regression with 2.6.26-rc1
[not found] ` <20080506114449.GC32591@elte.hu>
@ 2008-05-06 12:09 ` Matthew Wilcox
2008-05-06 16:23 ` Matthew Wilcox
0 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-06 12:09 UTC (permalink / raw)
To: Ingo Molnar
Cc: Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds,
Andrew Morton, linux-fsdevel
On Tue, May 06, 2008 at 01:44:49PM +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> > After I manually reverted the patch against 2.6.26-rc1 while fixing
> > lots of conflicts/errors, aim7 regression became less than 2%.
>
> hm, which exact semaphore would that be due to?
>
> My first blind guess would be the BKL - there's not much other semaphore
> use left in the core kernel otherwise that would affect AIM7 normally.
> The VFS still makes frequent use of the BKL and AIM7 is very VFS
> intense. Getting rid of that BKL use from the VFS might be useful to
> performance anyway.
That's slightly slanderous to the VFS ;-) The BKL really isn't used
that much any more. So little that I've gone through and produced a
list of places it's used:
fs/block_dev.c opening and closing a block device. Unlikely to be
provoked by AIM7.
fs/char_dev.c chrdev_open. Unlikely to be provoked by AIM7.
fs/compat.c mount. Unlikely to be provoked by AIM7.
fs/compat_ioctl.c held around calls to ioctl translator.
fs/exec.c coredump. If this is a contention problem ...
fs/fcntl.c held around call to ->fasync.
fs/ioctl.c held around f_op ->ioctl call (tmpfs doesn't have
ioctl). ditto bmap. there's fasync, as previously
mentioned.
fs/locks.c hellhole. I hope AIM7 doesn't use locks.
fs/namespace.c mount, umount. Unlikely to be provoked by AIM7.
fs/read_write.c llseek. tmpfs uses the unlocked version.
fs/super.c shutdown, remount. Unlikely to be provoked by AIM7.
So the only likely things I can see are:
- file locks
- fasync
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 12:09 ` AIM7 40% regression with 2.6.26-rc1 Matthew Wilcox
@ 2008-05-06 16:23 ` Matthew Wilcox
2008-05-06 16:36 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-06 16:23 UTC (permalink / raw)
To: Ingo Molnar, J. Bruce Fields
Cc: Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds,
Andrew Morton, linux-fsdevel
On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> So the only likely things I can see are:
>
> - file locks
> - fasync
I've wanted to fix file locks for a while. Here's a first attempt.
It was done quickly, so I concede that it may well have bugs in it.
I found (and fixed) one with LTP.
It takes *no account* of nfsd, nor remote filesystems. We need to have
a serious discussion about their requirements.
diff --git a/fs/locks.c b/fs/locks.c
index 663c069..cb09765 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -140,6 +140,8 @@ int lease_break_time = 45;
#define for_each_lock(inode, lockp) \
for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
+static DEFINE_SPINLOCK(file_lock_lock);
+
static LIST_HEAD(file_lock_list);
static LIST_HEAD(blocked_list);
@@ -510,9 +512,9 @@ static void __locks_delete_block(struct file_lock *waiter)
*/
static void locks_delete_block(struct file_lock *waiter)
{
- lock_kernel();
+ spin_lock(&file_lock_lock);
__locks_delete_block(waiter);
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
}
/* Insert waiter into blocker's block list.
@@ -649,7 +651,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
{
struct file_lock *cfl;
- lock_kernel();
+ spin_lock(&file_lock_lock);
for (cfl = filp->f_path.dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) {
if (!IS_POSIX(cfl))
continue;
@@ -662,7 +664,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
fl->fl_pid = pid_vnr(cfl->fl_nspid);
} else
fl->fl_type = F_UNLCK;
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return;
}
EXPORT_SYMBOL(posix_test_lock);
@@ -735,18 +737,21 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
int error = 0;
int found = 0;
- lock_kernel();
- if (request->fl_flags & FL_ACCESS)
+ if (request->fl_flags & FL_ACCESS) {
+ spin_lock(&file_lock_lock);
goto find_conflict;
+ }
if (request->fl_type != F_UNLCK) {
error = -ENOMEM;
+
new_fl = locks_alloc_lock();
if (new_fl == NULL)
- goto out;
+ goto out_unlocked;
error = 0;
}
+ spin_lock(&file_lock_lock);
for_each_lock(inode, before) {
struct file_lock *fl = *before;
if (IS_POSIX(fl))
@@ -772,10 +777,13 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
* If a higher-priority process was blocked on the old file lock,
* give it the opportunity to lock the file.
*/
- if (found)
+ if (found) {
+ spin_unlock(&file_lock_lock);
cond_resched();
+ spin_lock(&file_lock_lock);
+ }
-find_conflict:
+ find_conflict:
for_each_lock(inode, before) {
struct file_lock *fl = *before;
if (IS_POSIX(fl))
@@ -796,8 +804,9 @@ find_conflict:
new_fl = NULL;
error = 0;
-out:
- unlock_kernel();
+ out:
+ spin_unlock(&file_lock_lock);
+ out_unlocked:
if (new_fl)
locks_free_lock(new_fl);
return error;
@@ -826,7 +835,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
new_fl2 = locks_alloc_lock();
}
- lock_kernel();
+ spin_lock(&file_lock_lock);
if (request->fl_type != F_UNLCK) {
for_each_lock(inode, before) {
fl = *before;
@@ -994,7 +1003,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
locks_wake_up_blocks(left);
}
out:
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
/*
* Free any unused locks.
*/
@@ -1069,14 +1078,14 @@ int locks_mandatory_locked(struct inode *inode)
/*
* Search the lock list for this inode for any POSIX locks.
*/
- lock_kernel();
+ spin_lock(&file_lock_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (!IS_POSIX(fl))
continue;
if (fl->fl_owner != owner)
break;
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return fl ? -EAGAIN : 0;
}
@@ -1190,7 +1199,7 @@ int __break_lease(struct inode *inode, unsigned int mode)
new_fl = lease_alloc(NULL, mode & FMODE_WRITE ? F_WRLCK : F_RDLCK);
- lock_kernel();
+ spin_lock(&file_lock_lock);
time_out_leases(inode);
@@ -1251,8 +1260,10 @@ restart:
break_time++;
}
locks_insert_block(flock, new_fl);
+ spin_unlock(&file_lock_lock);
error = wait_event_interruptible_timeout(new_fl->fl_wait,
!new_fl->fl_next, break_time);
+ spin_lock(&file_lock_lock);
__locks_delete_block(new_fl);
if (error >= 0) {
if (error == 0)
@@ -1266,8 +1277,8 @@ restart:
error = 0;
}
-out:
- unlock_kernel();
+ out:
+ spin_unlock(&file_lock_lock);
if (!IS_ERR(new_fl))
locks_free_lock(new_fl);
return error;
@@ -1323,7 +1334,7 @@ int fcntl_getlease(struct file *filp)
struct file_lock *fl;
int type = F_UNLCK;
- lock_kernel();
+ spin_lock(&file_lock_lock);
time_out_leases(filp->f_path.dentry->d_inode);
for (fl = filp->f_path.dentry->d_inode->i_flock; fl && IS_LEASE(fl);
fl = fl->fl_next) {
@@ -1332,7 +1343,7 @@ int fcntl_getlease(struct file *filp)
break;
}
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return type;
}
@@ -1363,6 +1374,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
if (error)
return error;
+ spin_lock(&file_lock_lock);
time_out_leases(inode);
BUG_ON(!(*flp)->fl_lmops->fl_break);
@@ -1370,10 +1382,11 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
lease = *flp;
if (arg != F_UNLCK) {
+ spin_unlock(&file_lock_lock);
error = -ENOMEM;
new_fl = locks_alloc_lock();
if (new_fl == NULL)
- goto out;
+ goto out_unlocked;
error = -EAGAIN;
if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
@@ -1382,6 +1395,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
&& ((atomic_read(&dentry->d_count) > 1)
|| (atomic_read(&inode->i_count) > 1)))
goto out;
+ spin_lock(&file_lock_lock);
}
/*
@@ -1429,11 +1443,14 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
locks_copy_lock(new_fl, lease);
locks_insert_lock(before, new_fl);
+ spin_unlock(&file_lock_lock);
*flp = new_fl;
return 0;
-out:
+ out:
+ spin_unlock(&file_lock_lock);
+ out_unlocked:
if (new_fl != NULL)
locks_free_lock(new_fl);
return error;
@@ -1471,12 +1488,10 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
{
int error;
- lock_kernel();
if (filp->f_op && filp->f_op->setlease)
error = filp->f_op->setlease(filp, arg, lease);
else
error = generic_setlease(filp, arg, lease);
- unlock_kernel();
return error;
}
@@ -1503,12 +1518,11 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
if (error)
return error;
- lock_kernel();
-
error = vfs_setlease(filp, arg, &flp);
if (error || arg == F_UNLCK)
- goto out_unlock;
+ return error;
+ lock_kernel();
error = fasync_helper(fd, filp, 1, &flp->fl_fasync);
if (error < 0) {
/* remove lease just inserted by setlease */
@@ -1519,7 +1533,7 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
}
error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
-out_unlock:
+ out_unlock:
unlock_kernel();
return error;
}
@@ -2024,7 +2038,7 @@ void locks_remove_flock(struct file *filp)
fl.fl_ops->fl_release_private(&fl);
}
- lock_kernel();
+ spin_lock(&file_lock_lock);
before = &inode->i_flock;
while ((fl = *before) != NULL) {
@@ -2042,7 +2056,7 @@ void locks_remove_flock(struct file *filp)
}
before = &fl->fl_next;
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
}
/**
@@ -2057,12 +2071,12 @@ posix_unblock_lock(struct file *filp, struct file_lock *waiter)
{
int status = 0;
- lock_kernel();
+ spin_lock(&file_lock_lock);
if (waiter->fl_next)
__locks_delete_block(waiter);
else
status = -ENOENT;
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return status;
}
@@ -2175,7 +2189,7 @@ static int locks_show(struct seq_file *f, void *v)
static void *locks_start(struct seq_file *f, loff_t *pos)
{
- lock_kernel();
+ spin_lock(&file_lock_lock);
f->private = (void *)1;
return seq_list_start(&file_lock_list, *pos);
}
@@ -2187,7 +2201,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
static void locks_stop(struct seq_file *f, void *v)
{
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
}
struct seq_operations locks_seq_operations = {
@@ -2215,7 +2229,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
{
struct file_lock *fl;
int result = 1;
- lock_kernel();
+ spin_lock(&file_lock_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (IS_POSIX(fl)) {
if (fl->fl_type == F_RDLCK)
@@ -2232,7 +2246,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
result = 0;
break;
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return result;
}
@@ -2255,7 +2269,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
{
struct file_lock *fl;
int result = 1;
- lock_kernel();
+ spin_lock(&file_lock_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (IS_POSIX(fl)) {
if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
@@ -2270,7 +2284,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
result = 0;
break;
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return result;
}
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:23 ` Matthew Wilcox
@ 2008-05-06 16:36 ` Linus Torvalds
2008-05-06 16:42 ` Matthew Wilcox
2008-05-06 16:44 ` J. Bruce Fields
2008-05-06 17:21 ` Andrew Morton
2008-05-08 3:24 ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
2 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-06 16:36 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro,
Andrew Morton, linux-fsdevel
On Tue, 6 May 2008, Matthew Wilcox wrote:
>
> I've wanted to fix file locks for a while. Here's a first attempt.
> It was done quickly, so I concede that it may well have bugs in it.
> I found (and fixed) one with LTP.
Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is
there some user that doesn't have the inode info, or does anything that
might cross inode boundaries?
This does seem to drop all locking around the "setlease()" calls down to
the filesystem, which worries me. That said, we clearly do need to do
this. Probably should have done it a long time ago.
Also, why do people do this:
> -find_conflict:
> + find_conflict:
Hmm?
Linus
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:42 ` Matthew Wilcox
@ 2008-05-06 16:39 ` Alan Cox
2008-05-06 16:51 ` Matthew Wilcox
2008-05-06 20:28 ` Linus Torvalds
1 sibling, 1 reply; 43+ messages in thread
From: Alan Cox @ 2008-05-06 16:39 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, Andrew Morton, linux-fsdevel
> > Hmm?
>
> So that find_conflict doesn't end up in the first column, which causes
> diff to treat it as a function name for the purposes of the @@ lines.
Please can we just fix the tools, not mangle the kernel to work around
silly bugs?
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:36 ` Linus Torvalds
@ 2008-05-06 16:42 ` Matthew Wilcox
2008-05-06 16:39 ` Alan Cox
2008-05-06 20:28 ` Linus Torvalds
2008-05-06 16:44 ` J. Bruce Fields
1 sibling, 2 replies; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-06 16:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro,
Andrew Morton, linux-fsdevel
On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote:
> On Tue, 6 May 2008, Matthew Wilcox wrote:
> > I've wanted to fix file locks for a while. Here's a first attempt.
> > It was done quickly, so I concede that it may well have bugs in it.
> > I found (and fixed) one with LTP.
>
> Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is
> there some user that doesn't have the inode info, or does anything that
> might cross inode boundaries?
/proc/locks and deadlock detection both cross inode boundaries (and even
filesystem boundaries). The BKL-removal brigade tried this back in 2.4
and the locking ended up scaling worse than just plonking a single
spinlock around the whole thing.
> This does seem to drop all locking around the "setlease()" calls down to
> the filesystem, which worries me. That said, we clearly do need to do
> this. Probably should have done it a long time ago.
The only filesystems that are going to have their own setlease methods
will be remote ones (nfs, smbfs, etc). They're going to need to sleep
while the server responds to them. So holding a spinlock while we call
them is impolite at best.
> Also, why do people do this:
>
> > -find_conflict:
> > + find_conflict:
>
> Hmm?
So that find_conflict doesn't end up in the first column, which causes
diff to treat it as a function name for the purposes of the @@ lines.
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:36 ` Linus Torvalds
2008-05-06 16:42 ` Matthew Wilcox
@ 2008-05-06 16:44 ` J. Bruce Fields
1 sibling, 0 replies; 43+ messages in thread
From: J. Bruce Fields @ 2008-05-06 16:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Ingo Molnar, Zhang, Yanmin, LKML, Alexander Viro,
Andrew Morton, linux-fsdevel, richterd
On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 6 May 2008, Matthew Wilcox wrote:
> >
> > I've wanted to fix file locks for a while. Here's a first attempt.
> > It was done quickly, so I concede that it may well have bugs in it.
> > I found (and fixed) one with LTP.
>
> Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is
> there some user that doesn't have the inode info, or does anything that
> might cross inode boundaries?
The deadlock detection crosses inode boundaries.
--b.
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:51 ` Matthew Wilcox
@ 2008-05-06 16:45 ` Alan Cox
2008-05-06 17:42 ` Linus Torvalds
1 sibling, 0 replies; 43+ messages in thread
From: Alan Cox @ 2008-05-06 16:45 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, Andrew Morton, linux-fsdevel
On Tue, 6 May 2008 10:51:12 -0600
Matthew Wilcox <matthew@wil.cx> wrote:
> On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote:
> > > > Hmm?
> > >
> > > So that find_conflict doesn't end up in the first column, which causes
> > > diff to treat it as a function name for the purposes of the @@ lines.
> >
> > Please can we just fix the tools, not mangle the kernel to work around
> > silly bugs?
>
> The people who control the tools refuse to fix them.
That would be their problem. We should refuse to mash the kernel up
because they aren't doing their job. If need be someone can fork a
private git-patch.
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:39 ` Alan Cox
@ 2008-05-06 16:51 ` Matthew Wilcox
2008-05-06 16:45 ` Alan Cox
2008-05-06 17:42 ` Linus Torvalds
0 siblings, 2 replies; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-06 16:51 UTC (permalink / raw)
To: Alan Cox
Cc: Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, Andrew Morton, linux-fsdevel
On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote:
> > > Hmm?
> >
> > So that find_conflict doesn't end up in the first column, which causes
> > diff to treat it as a function name for the purposes of the @@ lines.
>
> Please can we just fix the tools, not mangle the kernel to work around
> silly bugs?
The people who control the tools refuse to fix them.
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:23 ` Matthew Wilcox
2008-05-06 16:36 ` Linus Torvalds
@ 2008-05-06 17:21 ` Andrew Morton
2008-05-06 17:31 ` Matthew Wilcox
` (3 more replies)
2008-05-08 3:24 ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
2 siblings, 4 replies; 43+ messages in thread
From: Andrew Morton @ 2008-05-06 17:21 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro,
Linus Torvalds, linux-fsdevel
On Tue, 6 May 2008 10:23:32 -0600 Matthew Wilcox <matthew@wil.cx> wrote:
> On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > So the only likely things I can see are:
> >
> > - file locks
> > - fasync
>
> I've wanted to fix file locks for a while. Here's a first attempt.
Do we actually know that the locks code is implicated in this regression?
I'd initially thought "lseek" but afaict tmpfs doesn't hit default_llseek()
or remote_llseek().
tmpfs tends to do weird stuff - it would be interesting to know if the
regression is also present on ramfs or ext2/ext3/xfs/etc.
It would be interesting to see if the context switch rate has increased.
Finally: how come we regressed by swapping the semaphore implementation
anyway? We went from one sleeping lock implementation to another - I'd
have expected performance to be pretty much the same.
<looks at the implementation>
down(), down_interruptible() and down_try() should use spin_lock_irq(), not
irqsave.
up() seems to be doing wake-one, FIFO which is nice. Did the
implementation which we just removed also do that? Was it perhaps
accidentally doing LIFO or something like that?
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 17:21 ` Andrew Morton
@ 2008-05-06 17:31 ` Matthew Wilcox
2008-05-06 17:49 ` Ingo Molnar
2008-05-06 17:39 ` Ingo Molnar
` (2 subsequent siblings)
3 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-06 17:31 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro,
Linus Torvalds, linux-fsdevel
On Tue, May 06, 2008 at 10:21:53AM -0700, Andrew Morton wrote:
> Do we actually know that the locks code is implicated in this regression?
Not yet. We don't even know it's the BKL. It's just my best guess.
We're waiting for the original reporter to run some tests Ingo pointed
him at.
> I'd initially thought "lseek" but afaict tmpfs doesn't hit default_llseek()
> or remote_llseek().
Correct.
> Finally: how come we regressed by swapping the semaphore implementation
> anyway? We went from one sleeping lock implementation to another - I'd
> have expected performance to be pretty much the same.
>
> <looks at the implementation>
>
> down(), down_interruptible() and down_try() should use spin_lock_irq(), not
> irqsave.
We talked about this ... the BKL actually requires that you be able to
acquire it with interrupts disabled. Maybe we should make lock_kernel
do this:
	if (likely(!depth)) {
		unsigned long flags;

		local_save_flags(flags);
		down(&kernel_sem);
		local_irq_restore(flags);
	}
But tweaking down() is not worth it -- we should be eliminating users of
both the BKL and semaphores instead.
> up() seems to be doing wake-one, FIFO which is nice. Did the
> implementation which we just removed also do that? Was it perhaps
> accidentally doing LIFO or something like that?
That's a question for someone who knows x86 assembler, I think.
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 17:21 ` Andrew Morton
2008-05-06 17:31 ` Matthew Wilcox
@ 2008-05-06 17:39 ` Ingo Molnar
2008-05-07 6:49 ` Zhang, Yanmin
2008-05-06 17:45 ` Linus Torvalds
2008-05-07 16:38 ` Matthew Wilcox
3 siblings, 1 reply; 43+ messages in thread
From: Ingo Molnar @ 2008-05-06 17:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Matthew Wilcox, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, Linus Torvalds, linux-fsdevel
* Andrew Morton <akpm@linux-foundation.org> wrote:
> Finally: how come we regressed by swapping the semaphore
> implementation anyway? We went from one sleeping lock implementation
> to another - I'd have expected performance to be pretty much the same.
>
> <looks at the implementation>
>
> down(), down_interruptible() and down_try() should use
> spin_lock_irq(), not irqsave.
>
> up() seems to be doing wake-one, FIFO which is nice. Did the
> implementation which we just removed also do that? Was it perhaps
> accidentally doing LIFO or something like that?
i just checked the old implementation on x86. It used
lib/semaphore-sleepers.c which does one weird thing:
- __down() when it returns wakes up yet another task via
wake_up_locked().
i.e. we'll always keep yet another task in flight. This can mask wakeup
latencies especially when it takes time.
The patch (hack) below tries to emulate this weirdness - it 'kicks'
another task as well and keeps it busy. Most of the time this just
causes extra scheduling, but if AIM7 is _just_ saturating the number of
CPUs, it might make a difference. Yanmin, does the patch below make any
difference to the AIM7 results?
( it would be useful data to get a meaningful context switch trace from
the whole regressed workload, and compare it to a context switch trace
with the revert added. )
Ingo
---
kernel/semaphore.c | 10 ++++++++++
1 file changed, 10 insertions(+)
Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -261,4 +261,14 @@ static noinline void __sched __up(struct
list_del(&waiter->list);
waiter->up = 1;
wake_up_process(waiter->task);
+
+ if (likely(list_empty(&sem->wait_list)))
+ return;
+ /*
+ * Opportunistically wake up another task as well but do not
+ * remove it from the list:
+ */
+ waiter = list_first_entry(&sem->wait_list,
+ struct semaphore_waiter, list);
+ wake_up_process(waiter->task);
}
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:51 ` Matthew Wilcox
2008-05-06 16:45 ` Alan Cox
@ 2008-05-06 17:42 ` Linus Torvalds
1 sibling, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-06 17:42 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Alan Cox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, Andrew Morton, linux-fsdevel
On Tue, 6 May 2008, Matthew Wilcox wrote:
> On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote:
> > > > Hmm?
> > >
> > > So that find_conflict doesn't end up in the first column, which causes
> > > diff to treat it as a function name for the purposes of the @@ lines.
> >
> > Please can we just fix the tools, not mangle the kernel to work around
> > silly bugs?
>
> The people who control the tools refuse to fix them.
That's just plain bollocks.
First off, that "@@" line thing in diffs is not important enough to screw
up the source code for. Ever. It's just a small hint to make it somewhat
easier to see more context for humans.
Second, it's not even that bad to show the last label there, rather than
the function name.
Third, you seem to be a git user, so if you actually really care that much
about the @@ line, then git actually lets you set your very own pattern
for those things.
In fact, you can even do it on a per-file basis based on things like
filename rules (ie you can have different patterns for what to trigger on
for a *.c file and for a *.S file, since in a *.S file the 'name:' thing
_is_ the right pattern).
So not only are you making idiotic changes just for irrelevant tool usage,
you're also apparently lying about people "refusing to fix" things as an
excuse.
You can play with it. It's documented in gitattributes (see "Defining a
custom hunk header"), and the default one is just the same one that GNU
diff uses for "-p". I think.
You can add something like this to your ~/.gitconfig:
[diff "default"]
funcname=^[a-zA-Z_$].*(.*$
to only trigger the funcname pattern on a line that starts with a valid C
identifier thing, and contains a '('.
And you can just override the default like the above (that way you don't
have to specify attributes), but if you want to do things differently for
*.c files than from *.S files, you can edit your .git/info/attributes file
and make it contain something like
*.S diff=assembler
*.c diff=C
and now you can make your ~/.gitconfig actually show them differently, ie
something like
[diff "C"]
funcname=^[a-zA-Z_$].*(.*$
[diff "assembler"]
funcname=^[a-zA-Z_$].*:
etc.
Of course, there is a real cost to this, but it's cheap enough in practice
that you'll never notice.
Linus
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 17:21 ` Andrew Morton
2008-05-06 17:31 ` Matthew Wilcox
2008-05-06 17:39 ` Ingo Molnar
@ 2008-05-06 17:45 ` Linus Torvalds
2008-05-07 16:38 ` Matthew Wilcox
3 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-06 17:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Tue, 6 May 2008, Andrew Morton wrote:
>
> down(), down_interruptible() and down_try() should use spin_lock_irq(), not
> irqsave.
down_trylock() is used in atomic code. See for example kernel/printk.c. So
no, that one needs to be irqsafe.
Linus
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 17:31 ` Matthew Wilcox
@ 2008-05-06 17:49 ` Ingo Molnar
2008-05-06 18:07 ` Andrew Morton
0 siblings, 1 reply; 43+ messages in thread
From: Ingo Molnar @ 2008-05-06 17:49 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, Linus Torvalds, linux-fsdevel
* Matthew Wilcox <matthew@wil.cx> wrote:
> > down(), down_interruptible() and down_try() should use
> > spin_lock_irq(), not irqsave.
>
> We talked about this ... the BKL actually requires that you be able to
> acquire it with interrupts disabled. [...]
hm, where does it require it, besides the early bootup code? (which
should just be fixed)
down_trylock() is OK as irqsave/irqrestore for legacy reasons, but that
is fundamentally atomic anyway.
> > up() seems to be doing wake-one, FIFO which is nice. Did the
> > implementation which we just removed also do that? Was it perhaps
> > accidentally doing LIFO or something like that?
>
> That's a question for someone who knows x86 assembler, I think.
the assembly is mostly just for the fastpath - and a 40% regression
cannot be about fastpath differences. In the old code the scheduling
happens in lib/semaphore-sleepers.c, and from the looks of it it appears
to be a proper FIFO as well. (plus this small wakeup weirdness it has)
i reviewed the new code in kernel/semaphore.c as well and can see
nothing bad in it - it does proper wake-up, FIFO queueing, like the
mutex code.
Ingo
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 17:49 ` Ingo Molnar
@ 2008-05-06 18:07 ` Andrew Morton
2008-05-11 11:11 ` Matthew Wilcox
0 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2008-05-06 18:07 UTC (permalink / raw)
To: Ingo Molnar
Cc: matthew, bfields, yanmin_zhang, linux-kernel, viro, torvalds,
linux-fsdevel
On Tue, 6 May 2008 19:49:54 +0200
Ingo Molnar <mingo@elte.hu> wrote:
>
> * Matthew Wilcox <matthew@wil.cx> wrote:
>
> > > down(), down_interruptible() and down_try() should use
> > > spin_lock_irq(), not irqsave.
> >
> > We talked about this ... the BKL actually requires that you be able to
> > acquire it with interrupts disabled. [...]
>
> hm, where does it require it, besides the early bootup code? (which
> should just be fixed)
Yeah, the early bootup code. The kernel does accidental lock_kernel()s in
various places and if that re-enables interrupts then powerpc goeth crunch.
Matthew, that seemingly-unneeded irqsave in kernel/semaphore.c is a prime site
for /* one of these things */, no?
> down_trylock() is OK as irqsave/irqrestore for legacy reasons, but that
> is fundamentally atomic anyway.
yes, trylock should be made irq-safe.
> > > up() seems to be doing wake-one, FIFO which is nice. Did the
> > > implementation which we just removed also do that? Was it perhaps
> > > accidentally doing LIFO or something like that?
> >
> > That's a question for someone who knows x86 assembler, I think.
>
> the assembly is mostly just for the fastpath - and a 40% regression
> cannot be about fastpath differences. In the old code the scheduling
> happens in lib/semaphore-sleepers.c, and from the looks of it it appears
> to be a proper FIFO as well. (plus this small wakeup weirdness it has)
>
> i reviewed the new code in kernel/semaphore.c as well and can see
> nothing bad in it - it does proper wake-up, FIFO queueing, like the
> mutex code.
>
There's the weird wakeup in down() which I understood for about five
minutes five years ago. Perhaps that accidentally sped something up.
Oh well, more investigation needed..
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:42 ` Matthew Wilcox
2008-05-06 16:39 ` Alan Cox
@ 2008-05-06 20:28 ` Linus Torvalds
1 sibling, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-06 20:28 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro,
Andrew Morton, linux-fsdevel
On Tue, 6 May 2008, Matthew Wilcox wrote:
> On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote:
> >
> > Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is
> > there some user that doesn't have the inode info, or does anything that
> > might cross inode boundaries?
>
> /proc/locks and deadlock detection both cross inode boundaries (and even
> filesystem boundaries). The BKL-removal brigade tried this back in 2.4
> and the locking ended up scaling worse than just plonking a single
> spinlock around the whole thing.
Ok, no worries. Just as long as I know why it's a single lock. Looks ok to
me, apart from the need for testing (and talking to NFS etc people).
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 17:39 ` Ingo Molnar
@ 2008-05-07 6:49 ` Zhang, Yanmin
0 siblings, 0 replies; 43+ messages in thread
From: Zhang, Yanmin @ 2008-05-07 6:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Matthew Wilcox, J. Bruce Fields, LKML,
Alexander Viro, Linus Torvalds, linux-fsdevel
On Tue, 2008-05-06 at 19:39 +0200, Ingo Molnar wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > Finally: how come we regressed by swapping the semaphore
> > implementation anyway? We went from one sleeping lock implementation
> > to another - I'd have expected performance to be pretty much the same.
> i.e. we'll always keep yet another task in flight. This can mask wakeup
> latencies especially when it takes time.
>
> The patch (hack) below tries to emulate this weirdness - it 'kicks'
> another task as well and keeps it busy. Most of the time this just
> causes extra scheduling, but if AIM7 is _just_ saturating the number of
> CPUs, it might make a difference. Yanmin, does the patch below make any
> difference to the AIM7 results?
I tested it on my 8-core Stoakley and the result is 12% worse than that of
pure 2.6.26-rc1.
-yanmin
>
> ( it would be useful data to get a meaningful context switch trace from
> the whole regressed workload, and compare it to a context switch trace
> with the revert added. )
>
> Ingo
>
> ---
> kernel/semaphore.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> Index: linux/kernel/semaphore.c
> ===================================================================
> --- linux.orig/kernel/semaphore.c
> +++ linux/kernel/semaphore.c
> @@ -261,4 +261,14 @@ static noinline void __sched __up(struct
> list_del(&waiter->list);
> waiter->up = 1;
> wake_up_process(waiter->task);
> +
> + if (likely(list_empty(&sem->wait_list)))
> + return;
> + /*
> + * Opportunistically wake up another task as well but do not
> + * remove it from the list:
> + */
> + waiter = list_first_entry(&sem->wait_list,
> + struct semaphore_waiter, list);
> + wake_up_process(waiter->task);
> }
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 17:21 ` Andrew Morton
` (2 preceding siblings ...)
2008-05-06 17:45 ` Linus Torvalds
@ 2008-05-07 16:38 ` Matthew Wilcox
2008-05-07 16:55 ` Linus Torvalds
3 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-07 16:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro,
Linus Torvalds, linux-fsdevel
On Tue, May 06, 2008 at 10:21:53AM -0700, Andrew Morton wrote:
> up() seems to be doing wake-one, FIFO which is nice. Did the
> implementation which we just removed also do that? Was it perhaps
> accidentally doing LIFO or something like that?
If heavily contended, it could do this.
up() would increment sem->count and call __up(), which would call wake_up().
down() would decrement sem->count.
The unlucky task woken by __up() would lose the race and go back to sleep.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 16:38 ` Matthew Wilcox
@ 2008-05-07 16:55 ` Linus Torvalds
2008-05-07 17:08 ` Linus Torvalds
0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 16:55 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Matthew Wilcox wrote:
>
> If heavily contended, it could do this.
It doesn't have to be heavily contended - if it's just hot and a bit lucky,
it would potentially never schedule at all, because it would never take
the spinlock and serialize the callers.
It doesn't even need "unfairness" to work that way. The old semaphore
implementation was very much designed to be lock-free, and if you had one
CPU doing a lock while another did an unlock, the *common* situation was
that the unlock would succeed first, because the unlocker was also the
person who had the spinlock exclusively in its cache!
The above may count as "lucky", but the hot-cache-line thing is a big
deal. It turns "lucky" into something that isn't a 50:50 chance, but
something that is quite possible to trigger consistently if you just have
mostly short holders of the lock.
Which, btw, is probably true. The BKL is normally held for short times,
and released (by that thread) for relatively much longer times. Which
is when spinlocks tend to work the best, even when they are fair (because
it's not so much a fairness issue, it's simply a cost-of-taking-the-lock
issue!)
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 16:55 ` Linus Torvalds
@ 2008-05-07 17:08 ` Linus Torvalds
2008-05-07 17:16 ` Andrew Morton
2008-05-07 17:22 ` Ingo Molnar
0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 17:08 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Linus Torvalds wrote:
>
> Which, btw, is probably true. The BKL is normally held for short times,
> and released (by that thread) for relatively much longer times. Which
> is when spinlocks tend to work the best, even when they are fair (because
> it's not so much a fairness issue, it's simply a cost-of-taking-the-lock
> issue!)
.. and don't get me wrong: the old semaphores (and the new mutexes) should
also have this property when lucky: taking the lock is often a hot-path
case.
And the spinlock+generic semaphore thing probably makes that "lucky"
behavior be exponentially less likely, because now to hit the lucky case,
rather than the hot path having just *one* access to the interesting cache
line, it has basically something like 4 accesses (spinlock, count test,
count decrement, spinunlock), in addition to various serializing
instructions, so I suspect it quite often gets serialized simply because
even the "fast path" is actually about ten times as long!
As a result, a slow "fast path" means that the thing gets saturated much
more easily, and that in turn means that the "fast path" turns into a
"slow path" more easily, which is how you end up in the scheduler rather
than just taking the fast path.
This is why sleeping locks are more expensive in general: they have a
*huge* cost from when they get contended. Hundreds of times higher than a
spinlock. And the faster they are, the longer it takes for them to get
contended under load. So slowing them down in the fast path is a double
whammy, in that it shows their bad behaviour much earlier.
And the generic semaphores really are slower than the old optimized ones
in that fast path. By a *big* amount.
Which is why I'm 100% convinced it's not even worth saving the old code.
It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with
"slow path" other than the fact that it gets to that slow path much more
these days.
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:08 ` Linus Torvalds
@ 2008-05-07 17:16 ` Andrew Morton
2008-05-07 17:27 ` Linus Torvalds
2008-05-07 17:22 ` Ingo Molnar
1 sibling, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2008-05-07 17:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Wed, 7 May 2008 10:08:18 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Which is why I'm 100% convinced it's not even worth saving the old code.
> It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with
> "slow path" other than the fact that it gets to that slow path much more
> these days.
Stupid question: why doesn't lock_kernel() use a mutex?
(stupid answer: it'll trigger might_sleep() checks when we do it early in
boot with irqs disabled, but we can fix that)
(And __might_sleep()'s system_state check might even save us from that)
Of course, we shouldn't change anything until we've worked out why the new
semaphores got slower.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:08 ` Linus Torvalds
2008-05-07 17:16 ` Andrew Morton
@ 2008-05-07 17:22 ` Ingo Molnar
2008-05-07 17:25 ` Ingo Molnar
2008-05-07 17:31 ` Linus Torvalds
1 sibling, 2 replies; 43+ messages in thread
From: Ingo Molnar @ 2008-05-07 17:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Which is why I'm 100% convinced it's not even worth saving the old
> code. It needs to use mutexes, or spinlocks. I bet it has *nothing* to
> do with "slow path" other than the fact that it gets to that slow path
> much more these days.
i think your theory should be easy to test: Yanmin, could you turn on
CONFIG_MUTEX_DEBUG=y and check by how much AIM7 regresses?
Because in the CONFIG_MUTEX_DEBUG=y case the mutex debug code does
exactly that: it doesn't use the single-instruction fastpath [it uses
asm-generic/mutex-null.h] but always drops into the slowpath (to be able
to access debug state). That debug code is about as expensive as the
generic semaphore code's current fastpath. (perhaps even more
expensive.)
There's far more normal mutex fastpath use during an AIM7 run than any
BKL use. So if it's due to any direct fastpath overhead and the
resulting widening of the window for the real slowdown, we should see a
severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?
Ingo
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:22 ` Ingo Molnar
@ 2008-05-07 17:25 ` Ingo Molnar
2008-05-07 17:31 ` Linus Torvalds
1 sibling, 0 replies; 43+ messages in thread
From: Ingo Molnar @ 2008-05-07 17:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
* Ingo Molnar <mingo@elte.hu> wrote:
> There's far more normal mutex fastpath use during an AIM7 run than any
> BKL use. So if it's due to any direct fastpath overhead and the
> resulting widening of the window for the real slowdown, we should see
> a severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?
my own guesstimate about AIM7 performance impact resulting out of
CONFIG_MUTEX_DEBUG=y: performance overhead will not be measurable, or
will at most be in the sub-1% range. But i've been badly wrong before :)
Ingo
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:16 ` Andrew Morton
@ 2008-05-07 17:27 ` Linus Torvalds
0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 17:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Andrew Morton wrote:
> On Wed, 7 May 2008 10:08:18 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > Which is why I'm 100% convinced it's not even worth saving the old code.
> > It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with
> > "slow path" other than the fact that it gets to that slow path much more
> > these days.
>
> Stupid question: why doesn't lock_kernel() use a mutex?
Not stupid.
The only reason some code didn't get turned over to mutexes was literally
that they didn't want the debugging because they were doing intentionally
bad things.
I think the BKL is one of them (the console semaphore was another, iirc).
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:22 ` Ingo Molnar
2008-05-07 17:25 ` Ingo Molnar
@ 2008-05-07 17:31 ` Linus Torvalds
2008-05-07 17:47 ` Linus Torvalds
2008-05-07 17:49 ` Ingo Molnar
1 sibling, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 17:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Ingo Molnar wrote:
>
> There's far more normal mutex fastpath use during an AIM7 run than any
> BKL use. So if it's due to any direct fastpath overhead and the
> resulting widening of the window for the real slowdown, we should see a
> severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?
Not agreed.
The BKL is special because it is a *single* lock.
All the "normal" mutex code use fine-grained locking, so even if you slow
down the fast path, that won't cause the same kind of fastpath->slowpath
increase.
In order to see the fastpath->slowpath thing, you do need to have many
threads hitting the same lock: ie the slowdown has to result in real
contention.
Almost no mutexes have any potential for contention what-so-ever, except
for things that very consciously try to hit it (multiple threads doing
readdir and/or file creation on the *same* directory etc).
The BKL really is special.
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:31 ` Linus Torvalds
@ 2008-05-07 17:47 ` Linus Torvalds
2008-05-07 17:49 ` Ingo Molnar
1 sibling, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 17:47 UTC (permalink / raw)
To: Ingo Molnar
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Linus Torvalds wrote:
>
> All the "normal" mutex code use fine-grained locking, so even if you slow
> down the fast path, that won't cause the same kind of fastpath->slowpath
> increase.
Put another way: let's say that the "good fastpath" is basically a single
locked instruction - ~12 cycles on AMD, ~35 Core 2. That's the
no-bouncing, no-contention case.
Doing it with debugging (call overhead, spinlocks, local irq saving etc.)
will probably easily triple it or more, but we're not changing anything
else. There's no "downstream" effect: the behaviour itself doesn't change.
It doesn't get more bouncing, it doesn't start sleeping.
But what happens if the lock has the *potential* for conflicts is
different.
There, a "longish pause + fast lock + short average code sequence + fast
unlock" is quite likely to stay uncontended for a fair amount of time, and
while it will be much slower than the no-contention-at-all case (because
you do get a pretty likely cacheline event at the "fast lock" part), with
a fairly low number of CPU's and a long enough pause, you *also* easily
get into a pattern where the thing that got the lock will likely also get
to unlock without dropping the cacheline.
So far so good.
But that basically depends on the fact that "lock + work + unlock" is
_much_ shorter than the "longish pause" in between, so that even if you
have <n> CPU's all doing the same thing, their pauses between the locked
section are still bigger than <n> times that short time.
Once that is no longer true, you now start to bounce both at the lock
*and* the unlock, and now that whole sequence got likely ten times slower.
*AND* because it now actually has real contention, it actually got even
worse: if the lock is a sleeping one, you get *another* order of magnitude
just because you now started doing scheduling overhead too!
So the thing is, it just breaks down very badly. A spinlock that gets
contention probably gets ten times slower due to bouncing the cacheline. A
semaphore that gets contention probably gets a *hundred* times slower, or
more.
And so my bet is that both the old and the new semaphores had the same bad
break-down situation, but the new semaphores just are a lot easier to
trigger it because they are at least three times costlier than the old
ones, so you just hit the bad behaviour with much lower loads (or fewer
number of CPU's).
But spinlocks really do behave much better when contended, because at
least they don't get the even bigger hit of also hitting the scheduler. So
the old semaphores would have behaved badly too *eventually*, they just
needed a more extreme case to show that bad behavior.
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:31 ` Linus Torvalds
2008-05-07 17:47 ` Linus Torvalds
@ 2008-05-07 17:49 ` Ingo Molnar
2008-05-07 18:02 ` Linus Torvalds
1 sibling, 1 reply; 43+ messages in thread
From: Ingo Molnar @ 2008-05-07 17:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > There's far more normal mutex fastpath use during an AIM7 run than
> > any BKL use. So if it's due to any direct fastpath overhead and the
> > resulting widening of the window for the real slowdown, we should
> > see a severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?
>
> Not agreed.
>
> The BKL is special because it is a *single* lock.
ok, indeed my suggestion is wrong and this would not be a good
comparison.
another idea: my trial-balloon patch should test your theory too, because
the generic down_trylock() is still the 'fat' version, it does:
        spin_lock_irqsave(&sem->lock, flags);
        count = sem->count - 1;
        if (likely(count >= 0))
                sem->count = count;
        spin_unlock_irqrestore(&sem->lock, flags);
if there is a noticeable performance difference between your
trial-ballon patch and mine, then the micro-cost of the BKL very much
matters to this workload. Agreed about that?
but i'd be _hugely_ surprised about it. The tty code's BKL use should i
think only happen when a task exits and releases the tty - and a task
exit - even if this is a threaded test (which AIM7 can be - not sure
which exact parameters Yanmin used) - the costs of thread creation and
thread exit are just not in the same ballpark as any BKL micro-costs.
Dunno, maybe i overlooked some high-freq BKL user. (but any such site
would have shown up before) Even assuming a widening of the critical
path and some catastrophic domino effect (that does show up as increased
scheduling) i've never seen a 40% drop like this.
this regression, to me, has "different scheduling behavior" written all
over it - but that's just an impression. I'm not going to bet against
you though ;-)
Ingo
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 17:49 ` Ingo Molnar
@ 2008-05-07 18:02 ` Linus Torvalds
2008-05-07 18:17 ` Ingo Molnar
0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 18:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Ingo Molnar wrote:
>
> another idea: my trial-balloon patch should test your theory too, because
> the generic down_trylock() is still the 'fat' version, it does:
I agree that your trial-balloon should likely get rid of the big
regression, since it avoids the scheduler.
So with your patch, lock_kernel() ends up being just a rather expensive
spinlock. And yes, I'd expect that it should get rid of the 40% cost,
because while it makes lock_kernel() more expensive than a spinlock and
you might end up having a few more cacheline bounces on the lock due to
that, that's still the "small" expense compared to going through the whole
scheduler on conflicts.
So I'd expect that realistically the performance difference between your
version and just plain spinlocks shouldn't be *that* big. I'd expect it to
be visible, but in the (low) single-digit percentage range rather than in
any 40% range. That's just a guess.
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 18:02 ` Linus Torvalds
@ 2008-05-07 18:17 ` Ingo Molnar
2008-05-07 18:27 ` Linus Torvalds
0 siblings, 1 reply; 43+ messages in thread
From: Ingo Molnar @ 2008-05-07 18:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 7 May 2008, Ingo Molnar wrote:
> >
> > another idea: my trial-balloon patch should test your theory too,
> > because the generic down_trylock() is still the 'fat' version, it
> > does:
>
> I agree that your trial-balloon should likely get rid of the big
> regression, since it avoids the scheduler.
>
> So with your patch, lock_kernel() ends up being just a rather
> expensive spinlock. And yes, I'd expect that it should get rid of the
> 40% cost, because while it makes lock_kernel() more expensive than a
> spinlock and you might end up having a few more cacheline bounces on
> the lock due to that, that's still the "small" expense compared to
> going through the whole scheduler on conflicts.
>
> So I'd expect that realistically the performance difference between
> your version and just plain spinlocks shouldn't be *that* big. I'd
> expect it to be visible, but in the (low) single-digit percentage
> range rather than in any 40% range. That's just a guess.
third attempt - the patch below ontop of v2.6.25 should be quite similar
fastpath atomic overhead to what generic semaphores do? So if Yanmin
tests this patch ontop of v2.6.25, we should see the direct fastpath
overhead - without any changes to the semaphore wakeup/scheduling logic
otherwise.
[ this patch should in fact be a bit worse, because there's two more
atomics in the fastpath - the fastpath atomics of the old semaphore
code. ]
Ingo
------------------>
Subject: v2.6.25 BKL: add atomic overhead
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 20:09:13 CEST 2008
---
lib/kernel_lock.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
Index: linux-2.6.25/lib/kernel_lock.c
===================================================================
--- linux-2.6.25.orig/lib/kernel_lock.c
+++ linux-2.6.25/lib/kernel_lock.c
@@ -24,6 +24,7 @@
* Don't use in new code.
*/
static DECLARE_MUTEX(kernel_sem);
+static DEFINE_SPINLOCK(global_lock);
/*
* Re-acquire the kernel semaphore.
@@ -47,6 +48,9 @@ int __lockfunc __reacquire_kernel_lock(v
down(&kernel_sem);
+ spin_lock(&global_lock);
+ spin_unlock(&global_lock);
+
preempt_disable();
task->lock_depth = saved_lock_depth;
@@ -55,6 +59,9 @@ int __lockfunc __reacquire_kernel_lock(v
void __lockfunc __release_kernel_lock(void)
{
+ spin_lock(&global_lock);
+ spin_unlock(&global_lock);
+
up(&kernel_sem);
}
@@ -66,12 +73,16 @@ void __lockfunc lock_kernel(void)
struct task_struct *task = current;
int depth = task->lock_depth + 1;
- if (likely(!depth))
+ if (likely(!depth)) {
/*
* No recursion worries - we set up lock_depth _after_
*/
down(&kernel_sem);
+ spin_lock(&global_lock);
+ spin_unlock(&global_lock);
+ }
+
task->lock_depth = depth;
}
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 18:17 ` Ingo Molnar
@ 2008-05-07 18:27 ` Linus Torvalds
2008-05-07 18:43 ` Ingo Molnar
0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 18:27 UTC (permalink / raw)
To: Ingo Molnar
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Ingo Molnar wrote:
>
> [ this patch should in fact be a bit worse, because there's two more
> atomics in the fastpath - the fastpath atomics of the old semaphore
> code. ]
Well, it doesn't have the irq stuff, which is also pretty costly. Also, it
doesn't nest the accesses the same way (with the counts being *inside* the
spinlock and serialized against each other), so I'm not 100% sure you'd
get the same behaviour.
But yes, it certainly has the potential to show the same slowdown. But
it's not a very good patch, since not showing it doesn't really prove
much.
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 18:27 ` Linus Torvalds
@ 2008-05-07 18:43 ` Ingo Molnar
2008-05-07 19:01 ` Linus Torvalds
0 siblings, 1 reply; 43+ messages in thread
From: Ingo Molnar @ 2008-05-07 18:43 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > [ this patch should in fact be a bit worse, because there's two more
> > atomics in the fastpath - the fastpath atomics of the old
> > semaphore code. ]
>
> Well, it doesn't have the irq stuff, which is also pretty costly.
> Also, it doesn't nest the accesses the same way (with the counts being
> *inside* the spinlock and serialized against each other), so I'm not
> 100% sure you'd get the same behaviour.
>
> But yes, it certainly has the potential to show the same slowdown. But
> it's not a very good patch, since not showing it doesn't really prove
> much.
ok, the one below does irq ops and the counter behavior - and because
the critical section also has the old-semaphore atomics i think this
should definitely be a more expensive fastpath than what the new generic
code introduces. So if this patch produces a 40% AIM7 slowdown on
v2.6.25 it's the fastpath overhead (and its effects on slowpath
probability) that makes the difference.
Ingo
------------------->
Subject: add BKL atomic overhead
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 20:09:13 CEST 2008
NOT-Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
lib/kernel_lock.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
Index: linux-2.6.25/lib/kernel_lock.c
===================================================================
--- linux-2.6.25.orig/lib/kernel_lock.c
+++ linux-2.6.25/lib/kernel_lock.c
@@ -24,6 +24,8 @@
* Don't use in new code.
*/
static DECLARE_MUTEX(kernel_sem);
+static int global_count;
+static DEFINE_SPINLOCK(global_lock);
/*
* Re-acquire the kernel semaphore.
@@ -39,6 +41,7 @@ int __lockfunc __reacquire_kernel_lock(v
{
struct task_struct *task = current;
int saved_lock_depth = task->lock_depth;
+ unsigned long flags;
BUG_ON(saved_lock_depth < 0);
@@ -47,6 +50,10 @@ int __lockfunc __reacquire_kernel_lock(v
down(&kernel_sem);
+ spin_lock_irqsave(&global_lock, flags);
+ global_count++;
+ spin_unlock_irqrestore(&global_lock, flags);
+
preempt_disable();
task->lock_depth = saved_lock_depth;
@@ -55,6 +62,10 @@ int __lockfunc __reacquire_kernel_lock(v
void __lockfunc __release_kernel_lock(void)
{
+ spin_lock_irqsave(&global_lock, flags);
+ global_count--;
+ spin_unlock_irqrestore(&global_lock, flags);
+
up(&kernel_sem);
}
@@ -66,12 +77,17 @@ void __lockfunc lock_kernel(void)
struct task_struct *task = current;
int depth = task->lock_depth + 1;
- if (likely(!depth))
+ if (likely(!depth)) {
/*
* No recursion worries - we set up lock_depth _after_
*/
down(&kernel_sem);
+ spin_lock_irqsave(&global_lock, flags);
+ global_count++;
+ spin_unlock_irqrestore(&global_lock, flags);
+ }
+
task->lock_depth = depth;
}
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 18:43 ` Ingo Molnar
@ 2008-05-07 19:01 ` Linus Torvalds
2008-05-07 19:09 ` Ingo Molnar
2008-05-07 19:24 ` Matthew Wilcox
0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 19:01 UTC (permalink / raw)
To: Ingo Molnar
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Ingo Molnar wrote:
>
> ok, the one below does irq ops and the counter behavior
No it doesn't. The counter isn't used for any actual *testing*, so the
locking around it and the serialization of it has absolutely no impact on
the scheduling behaviour!
Since the big slowdown was clearly accompanied by sleeping behaviour (the
processes who didn't get the lock end up sleeping!), that is a *big* part
of the slowdown.
Is it possible that your patch gets similar behaviour? Absolutely. But
you're missing the whole point here. Anybody can make code behave badly
and perform worse. But if you want to just verify that it's about the
sleeping behaviour and timings of the BKL, then you need to do exactly
that: emulate the sleeping behavior, not just the timings _outside_ of the
sleeping behavior.
The thing is, we definitely are interested to see whether it's the BKL or
some other semaphore that is the problem. But the best way to test that is
to just try my patch that *guarantees* that the BKL doesn't have any
semaphore behaviour AT ALL.
Could it be something else entirely? Yes. We know it's semaphore-related.
We don't know for a fact that it's the BKL itself. There could be other
semaphores that are that hot. It sounds unlikely, but quite frankly,
regardless, I don't really see the point of your patches.
If Yanmin tries my patch, it is *guaranteed* to show something. It either
shows that it's about the BKL (and that we absolutely have to do the BKL
as something _other_ than a generic semaphore), or it shows that it's not
about the BKL (and that _all_ the patches in this discussion are likely
pointless).
In contrast, these "try to emulate bad behavior with the old known-ok
semaphores" don't show anything AT ALL. We already know it's related to
semaphores. And your patches aren't even guaranteed to show the same
issue.
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 19:01 ` Linus Torvalds
@ 2008-05-07 19:09 ` Ingo Molnar
2008-05-07 19:24 ` Matthew Wilcox
1 sibling, 0 replies; 43+ messages in thread
From: Ingo Molnar @ 2008-05-07 19:09 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
LKML, Alexander Viro, linux-fsdevel
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> In contrast, these "try to emulate bad behavior with the old known-ok
> semaphores" don't show anything AT ALL. We already know it's related
> to semaphores. And your patches aren't even guaranteed to show the
> same issue.
yeah, i was just trying to come up with patches to probe which one of
the following two possibilities is actually the case:
- if the regression is due to the difference in scheduling behavior of
new semaphores (different wakeup patterns, etc.), that's fixable in
the new semaphore code => then the BKL code need not change.
- if the regression is due to a difference in the fastpath cost, then
the new semaphores can probably not be improved (much of their appeal
comes from them not being complex and not being in assembly) => then
the BKL code needs to change to become cheaper [i.e. then we want
your patch].
Ingo
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 19:01 ` Linus Torvalds
2008-05-07 19:09 ` Ingo Molnar
@ 2008-05-07 19:24 ` Matthew Wilcox
2008-05-07 19:44 ` Linus Torvalds
1 sibling, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-07 19:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Wed, May 07, 2008 at 12:01:28PM -0700, Linus Torvalds wrote:
> The thing is, we definitely are interested to see whether it's the BKL or
> some other semaphore that is the problem. But the best way to test that is
> to just try my patch that *guarantees* that the BKL doesn't have any
> semaphore behaviour AT ALL.
>
> Could it be something else entirely? Yes. We know it's semaphore-related.
> We don't know for a fact that it's the BKL itself. There could be other
> semaphores that are that hot. It sounds unlikely, but quite frankly,
> regardless, I don't really see the point of your patches.
>
> If Yanmin tries my patch, it is *guaranteed* to show something. It either
> shows that it's about the BKL (and that we absolutely have to do the BKL
> as something _else_ than a generic semaphore), or it shows that it's not
> about the BKL (and that _all_ the patches in this discussion are likely
> pointless).
One patch I'd still like Yanmin to test is my one from yesterday which
removes the BKL from fs/locks.c.
http://marc.info/?l=linux-fsdevel&m=121009123427437&w=2
Obviously, it won't help if the problem isn't the BKL.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-07 19:24 ` Matthew Wilcox
@ 2008-05-07 19:44 ` Linus Torvalds
2008-05-07 20:00 ` Oi. NFS people. Read this Matthew Wilcox
0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-05-07 19:44 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Wed, 7 May 2008, Matthew Wilcox wrote:
>
> One patch I'd still like Yanmin to test is my one from yesterday which
> removes the BKL from fs/locks.c.
And I'd personally rather have the network-fs people test and comment on
that one ;)
I think that patch is worth looking at regardless, but the problems with
that one aren't about performance, but about what the implications are for
the filesystems (if any)...
Linus
* Oi. NFS people. Read this.
2008-05-07 19:44 ` Linus Torvalds
@ 2008-05-07 20:00 ` Matthew Wilcox
2008-05-07 22:10 ` Trond Myklebust
0 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-07 20:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML,
Alexander Viro, linux-fsdevel
On Wed, May 07, 2008 at 12:44:48PM -0700, Linus Torvalds wrote:
> On Wed, 7 May 2008, Matthew Wilcox wrote:
> >
> > One patch I'd still like Yanmin to test is my one from yesterday which
> > removes the BKL from fs/locks.c.
>
> And I'd personally rather have the network-fs people test and comment on
> that one ;)
>
> I think that patch is worth looking at regardless, but the problems with
> that one aren't about performance, but about what the implications are for
> the filesystems (if any)...
Oh, well, they don't seem interested.
I can comment on some of the problems though.
fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c,
fs/nfsd/nfs4state.c all walk the i_flock list under the BKL. That won't
protect them against locks.c any more. That's probably OK for fs/nfs/*
since they'll be protected by their own data structures (Someone please
check me on that?), but it's a bad idea for lockd/nfsd which are walking
the lists for filesystems.
Are we going to have to export the file_lock_lock? I'd rather not. But
we need to keep nfsd/lockd from tripping over locks.c.
Maybe we could come up with a decent API that lockd could use? It all
seems a bit complex at the moment ... maybe lockd should be keeping
track of the locks it owns anyway (since surely the posix deadlock
detection code can't work properly if it's just passing all the locks
through).
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: Oi. NFS people. Read this.
2008-05-07 20:00 ` Oi. NFS people. Read this Matthew Wilcox
@ 2008-05-07 22:10 ` Trond Myklebust
2008-05-09 1:43 ` J. Bruce Fields
0 siblings, 1 reply; 43+ messages in thread
From: Trond Myklebust @ 2008-05-07 22:10 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, J. Bruce Fields,
Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel
On Wed, 2008-05-07 at 14:00 -0600, Matthew Wilcox wrote:
> On Wed, May 07, 2008 at 12:44:48PM -0700, Linus Torvalds wrote:
> > On Wed, 7 May 2008, Matthew Wilcox wrote:
> > >
> > > One patch I'd still like Yanmin to test is my one from yesterday which
> > > removes the BKL from fs/locks.c.
> >
> > And I'd personally rather have the network-fs people test and comment on
> > that one ;)
> >
> > I think that patch is worth looking at regardless, but the problems with
> > that one aren't about performance, but about what the implications are for
> > the filesystems (if any)...
>
> Oh, well, they don't seem interested.
Poor timing: we're all preparing for and travelling to the annual
Connectathon interoperability testing conference which starts tomorrow.
> I can comment on some of the problems though.
>
> fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c,
> fs/nfsd/nfs4state.c all walk the i_flock list under the BKL. That won't
> protect them against locks.c any more. That's probably OK for fs/nfs/*
> since they'll be protected by their own data structures (Someone please
> check me on that?), but it's a bad idea for lockd/nfsd which are walking
> the lists for filesystems.
Yes. fs/nfs is just reusing the code in fs/locks.c in order to track the
locks it holds on the server. We could alternatively have coded a
private lock implementation, but this seemed easier.
> Are we going to have to export the file_lock_lock? I'd rather not. But
> we need to keep nfsd/lockd from tripping over locks.c.
>
> Maybe we could come up with a decent API that lockd could use? It all
> seems a bit complex at the moment ... maybe lockd should be keeping
> track of the locks it owns anyway (since surely the posix deadlock
> detection code can't work properly if it's just passing all the locks
> through).
I'm not sure what you mean when you talk about lockd keeping track of
the locks it owns. It has to keep those locks on inode->i_flock in order
to make them visible to the host filesystem...
All lockd really needs is the ability to find a lock it owns, and then
obtain a copy. As for the nfs client, I suspect we can make do with
something similar...
Cheers
Trond
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 16:23 ` Matthew Wilcox
2008-05-06 16:36 ` Linus Torvalds
2008-05-06 17:21 ` Andrew Morton
@ 2008-05-08 3:24 ` Zhang, Yanmin
2008-05-08 3:34 ` Linus Torvalds
2 siblings, 1 reply; 43+ messages in thread
From: Zhang, Yanmin @ 2008-05-08 3:24 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ingo Molnar, J. Bruce Fields, LKML, Alexander Viro,
Linus Torvalds, Andrew Morton, linux-fsdevel
On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote:
> On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > So the only likely things I can see are:
> >
> > - file locks
> > - fasync
>
> I've wanted to fix file locks for a while. Here's a first attempt.
> It was done quickly, so I concede that it may well have bugs in it.
> I found (and fixed) one with LTP.
>
> It takes *no account* of nfsd, nor remote filesystems. We need to have
> a serious discussion about their requirements.
I tested it on an 8-core Stoakley machine. The AIM7 result becomes 23% worse
than that of pure 2.6.26-rc1.
I'm replying to this email in case you have many patches and I might otherwise
test one you don't expect.
-yanmin
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 663c069..cb09765 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -140,6 +140,8 @@ int lease_break_time = 45;
> #define for_each_lock(inode, lockp) \
> for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
>
> +static DEFINE_SPINLOCK(file_lock_lock);
> +
> static LIST_HEAD(file_lock_list);
> static LIST_HEAD(blocked_list);
>
> @@ -510,9 +512,9 @@ static void __locks_delete_block(struct file_lock *waiter)
> */
> static void locks_delete_block(struct file_lock *waiter)
> {
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> __locks_delete_block(waiter);
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> }
>
> /* Insert waiter into blocker's block list.
> @@ -649,7 +651,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
> {
> struct file_lock *cfl;
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> for (cfl = filp->f_path.dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) {
> if (!IS_POSIX(cfl))
> continue;
> @@ -662,7 +664,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
> fl->fl_pid = pid_vnr(cfl->fl_nspid);
> } else
> fl->fl_type = F_UNLCK;
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return;
> }
> EXPORT_SYMBOL(posix_test_lock);
> @@ -735,18 +737,21 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
> int error = 0;
> int found = 0;
>
> - lock_kernel();
> - if (request->fl_flags & FL_ACCESS)
> + if (request->fl_flags & FL_ACCESS) {
> + spin_lock(&file_lock_lock);
> goto find_conflict;
> + }
>
> if (request->fl_type != F_UNLCK) {
> error = -ENOMEM;
> +
> new_fl = locks_alloc_lock();
> if (new_fl == NULL)
> - goto out;
> + goto out_unlocked;
> error = 0;
> }
>
> + spin_lock(&file_lock_lock);
> for_each_lock(inode, before) {
> struct file_lock *fl = *before;
> if (IS_POSIX(fl))
> @@ -772,10 +777,13 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
> * If a higher-priority process was blocked on the old file lock,
> * give it the opportunity to lock the file.
> */
> - if (found)
> + if (found) {
> + spin_unlock(&file_lock_lock);
> cond_resched();
> + spin_lock(&file_lock_lock);
> + }
>
> -find_conflict:
> + find_conflict:
> for_each_lock(inode, before) {
> struct file_lock *fl = *before;
> if (IS_POSIX(fl))
> @@ -796,8 +804,9 @@ find_conflict:
> new_fl = NULL;
> error = 0;
>
> -out:
> - unlock_kernel();
> + out:
> + spin_unlock(&file_lock_lock);
> + out_unlocked:
> if (new_fl)
> locks_free_lock(new_fl);
> return error;
> @@ -826,7 +835,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> new_fl2 = locks_alloc_lock();
> }
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> if (request->fl_type != F_UNLCK) {
> for_each_lock(inode, before) {
> fl = *before;
> @@ -994,7 +1003,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> locks_wake_up_blocks(left);
> }
> out:
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> /*
> * Free any unused locks.
> */
> @@ -1069,14 +1078,14 @@ int locks_mandatory_locked(struct inode *inode)
> /*
> * Search the lock list for this inode for any POSIX locks.
> */
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (!IS_POSIX(fl))
> continue;
> if (fl->fl_owner != owner)
> break;
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return fl ? -EAGAIN : 0;
> }
>
> @@ -1190,7 +1199,7 @@ int __break_lease(struct inode *inode, unsigned int mode)
>
> new_fl = lease_alloc(NULL, mode & FMODE_WRITE ? F_WRLCK : F_RDLCK);
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
>
> time_out_leases(inode);
>
> @@ -1251,8 +1260,10 @@ restart:
> break_time++;
> }
> locks_insert_block(flock, new_fl);
> + spin_unlock(&file_lock_lock);
> error = wait_event_interruptible_timeout(new_fl->fl_wait,
> !new_fl->fl_next, break_time);
> + spin_lock(&file_lock_lock);
> __locks_delete_block(new_fl);
> if (error >= 0) {
> if (error == 0)
> @@ -1266,8 +1277,8 @@ restart:
> error = 0;
> }
>
> -out:
> - unlock_kernel();
> + out:
> + spin_unlock(&file_lock_lock);
> if (!IS_ERR(new_fl))
> locks_free_lock(new_fl);
> return error;
> @@ -1323,7 +1334,7 @@ int fcntl_getlease(struct file *filp)
> struct file_lock *fl;
> int type = F_UNLCK;
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> time_out_leases(filp->f_path.dentry->d_inode);
> for (fl = filp->f_path.dentry->d_inode->i_flock; fl && IS_LEASE(fl);
> fl = fl->fl_next) {
> @@ -1332,7 +1343,7 @@ int fcntl_getlease(struct file *filp)
> break;
> }
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return type;
> }
>
> @@ -1363,6 +1374,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
> if (error)
> return error;
>
> + spin_lock(&file_lock_lock);
> time_out_leases(inode);
>
> BUG_ON(!(*flp)->fl_lmops->fl_break);
> @@ -1370,10 +1382,11 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
> lease = *flp;
>
> if (arg != F_UNLCK) {
> + spin_unlock(&file_lock_lock);
> error = -ENOMEM;
> new_fl = locks_alloc_lock();
> if (new_fl == NULL)
> - goto out;
> + goto out_unlocked;
>
> error = -EAGAIN;
> if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
> @@ -1382,6 +1395,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
> && ((atomic_read(&dentry->d_count) > 1)
> || (atomic_read(&inode->i_count) > 1)))
> goto out;
> + spin_lock(&file_lock_lock);
> }
>
> /*
> @@ -1429,11 +1443,14 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
>
> locks_copy_lock(new_fl, lease);
> locks_insert_lock(before, new_fl);
> + spin_unlock(&file_lock_lock);
>
> *flp = new_fl;
> return 0;
>
> -out:
> + out:
> + spin_unlock(&file_lock_lock);
> + out_unlocked:
> if (new_fl != NULL)
> locks_free_lock(new_fl);
> return error;
> @@ -1471,12 +1488,10 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
> {
> int error;
>
> - lock_kernel();
> if (filp->f_op && filp->f_op->setlease)
> error = filp->f_op->setlease(filp, arg, lease);
> else
> error = generic_setlease(filp, arg, lease);
> - unlock_kernel();
>
> return error;
> }
> @@ -1503,12 +1518,11 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
> if (error)
> return error;
>
> - lock_kernel();
> -
> error = vfs_setlease(filp, arg, &flp);
> if (error || arg == F_UNLCK)
> - goto out_unlock;
> + return error;
>
> + lock_kernel();
> error = fasync_helper(fd, filp, 1, &flp->fl_fasync);
> if (error < 0) {
> /* remove lease just inserted by setlease */
> @@ -1519,7 +1533,7 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
> }
>
> error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
> -out_unlock:
> + out_unlock:
> unlock_kernel();
> return error;
> }
> @@ -2024,7 +2038,7 @@ void locks_remove_flock(struct file *filp)
> fl.fl_ops->fl_release_private(&fl);
> }
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> before = &inode->i_flock;
>
> while ((fl = *before) != NULL) {
> @@ -2042,7 +2056,7 @@ void locks_remove_flock(struct file *filp)
> }
> before = &fl->fl_next;
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> }
>
> /**
> @@ -2057,12 +2071,12 @@ posix_unblock_lock(struct file *filp, struct file_lock *waiter)
> {
> int status = 0;
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> if (waiter->fl_next)
> __locks_delete_block(waiter);
> else
> status = -ENOENT;
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return status;
> }
>
> @@ -2175,7 +2189,7 @@ static int locks_show(struct seq_file *f, void *v)
>
> static void *locks_start(struct seq_file *f, loff_t *pos)
> {
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> f->private = (void *)1;
> return seq_list_start(&file_lock_list, *pos);
> }
> @@ -2187,7 +2201,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
>
> static void locks_stop(struct seq_file *f, void *v)
> {
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> }
>
> struct seq_operations locks_seq_operations = {
> @@ -2215,7 +2229,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
> {
> struct file_lock *fl;
> int result = 1;
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (IS_POSIX(fl)) {
> if (fl->fl_type == F_RDLCK)
> @@ -2232,7 +2246,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
> result = 0;
> break;
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return result;
> }
>
> @@ -2255,7 +2269,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
> {
> struct file_lock *fl;
> int result = 1;
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (IS_POSIX(fl)) {
> if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
> @@ -2270,7 +2284,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
> result = 0;
> break;
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return result;
> }
>
>
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-08 3:24 ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
@ 2008-05-08 3:34 ` Linus Torvalds
2008-05-08 4:37 ` Zhang, Yanmin
0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-05-08 3:34 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML,
Alexander Viro, Andrew Morton, linux-fsdevel
On Thu, 8 May 2008, Zhang, Yanmin wrote:
>
> On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote:
> > On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > > So the only likely things I can see are:
> > >
> > > - file locks
> > > - fasync
> >
> > I've wanted to fix file locks for a while. Here's a first attempt.
> > It was done quickly, so I concede that it may well have bugs in it.
> > I found (and fixed) one with LTP.
> >
> > It takes *no account* of nfsd, nor remote filesystems. We need to have
> > a serious discussion about their requirements.
>
> I tested it on an 8-core Stoakley machine. The AIM7 result becomes 23% worse
> than that of pure 2.6.26-rc1.
Ouch. That's really odd. The BKL->spinlock conversion looks really
obvious, so it shouldn't be that noticeably slower.
The *one* difference is that the BKL has the whole "you can take it
recursively and you can sleep without dropping it because the scheduler
will drop it for you" thing. The spinlock conversion changed all of that
into explicit "drop and retake" locks, and maybe that causes some issues.
But 23% worse? That sounds really odd/extreme.
Can you do a oprofile run or something?
Linus
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-08 3:34 ` Linus Torvalds
@ 2008-05-08 4:37 ` Zhang, Yanmin
2008-05-08 14:58 ` Linus Torvalds
0 siblings, 1 reply; 43+ messages in thread
From: Zhang, Yanmin @ 2008-05-08 4:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML,
Alexander Viro, Andrew Morton, linux-fsdevel
On Wed, 2008-05-07 at 20:34 -0700, Linus Torvalds wrote:
>
> On Thu, 8 May 2008, Zhang, Yanmin wrote:
> >
> > On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote:
> > > On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > > > So the only likely things I can see are:
> > > >
> > > > - file locks
> > > > - fasync
> > >
> > > I've wanted to fix file locks for a while. Here's a first attempt.
> > > It was done quickly, so I concede that it may well have bugs in it.
> > > I found (and fixed) one with LTP.
> > >
> > > It takes *no account* of nfsd, nor remote filesystems. We need to have
> > > a serious discussion about their requirements.
> >
> > I tested it on an 8-core Stoakley machine. The AIM7 result becomes 23%
> > worse than that of pure 2.6.26-rc1.
>
> Ouch. That's really odd. The BKL->spinlock conversion looks really
> obvious, so it shouldn't be that noticeably slower.
>
> The *one* difference is that the BKL has the whole "you can take it
> recursively and you can sleep without dropping it because the scheduler
> will drop it for you" thing. The spinlock conversion changed all of that
> into explicit "drop and retake" locks, and maybe that causes some issues.
>
> But 23% worse? That sounds really odd/extreme.
>
> Can you do a oprofile run or something?
I collected oprofile data. It doesn't look very useful, as CPU idle time is more than 50%.
samples % app name symbol name
270157 9.4450 multitask add_long
266419 9.3143 multitask add_int
238934 8.3534 multitask add_double
187184 6.5442 multitask mul_double
159448 5.5745 multitask add_float
156312 5.4649 multitask sieve
148081 5.1771 multitask mul_float
127192 4.4468 multitask add_short
80480 2.8137 multitask string_rtns_1
57520 2.0110 vmlinux clear_page_c
53935 1.8856 multitask div_long
48753 1.7045 libc-2.6.90.so strncat
40825 1.4273 multitask array_rtns
32807 1.1470 vmlinux __copy_user_nocache
31995 1.1186 multitask div_int
31143 1.0888 multitask div_float
28821 1.0076 multitask div_double
26400 0.9230 vmlinux find_lock_page
26159 0.9146 vmlinux unmap_vmas
25249 0.8827 multitask div_short
21509 0.7520 vmlinux native_read_tsc
18865 0.6595 vmlinux copy_user_generic_string
17993 0.6291 vmlinux copy_page_c
16367 0.5722 vmlinux system_call
14616 0.5110 libc-2.6.90.so msort_with_tmp
13630 0.4765 vmlinux native_sched_clock
12952 0.4528 vmlinux copy_page_range
12817 0.4481 libc-2.6.90.so strcat
12708 0.4443 vmlinux calc_delta_mine
12611 0.4409 libc-2.6.90.so memset
11631 0.4066 bash (no symbols)
9991 0.3493 vmlinux update_curr
9328 0.3261 vmlinux unlock_page
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-08 4:37 ` Zhang, Yanmin
@ 2008-05-08 14:58 ` Linus Torvalds
0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-05-08 14:58 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML,
Alexander Viro, Andrew Morton, linux-fsdevel
On Thu, 8 May 2008, Zhang, Yanmin wrote:
>
> I collected oprofile data. It doesn't look very useful, as CPU idle time is more than 50%.
Ahh, so it's probably still the BKL that is the problem, it's just not in
the file locking code. The changes to fs/locks.c probably didn't matter
all that much, and the additional regression was likely just some
perturbation.
So it's probably fasync that AIM7 tests. Quite possibly coupled with
/dev/tty etc. No file locking.
Linus
* Re: Oi. NFS people. Read this.
2008-05-07 22:10 ` Trond Myklebust
@ 2008-05-09 1:43 ` J. Bruce Fields
0 siblings, 0 replies; 43+ messages in thread
From: J. Bruce Fields @ 2008-05-09 1:43 UTC (permalink / raw)
To: Trond Myklebust
Cc: Matthew Wilcox, Linus Torvalds, Ingo Molnar, Andrew Morton,
Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel
On Wed, May 07, 2008 at 03:10:27PM -0700, Trond Myklebust wrote:
> On Wed, 2008-05-07 at 14:00 -0600, Matthew Wilcox wrote:
> > On Wed, May 07, 2008 at 12:44:48PM -0700, Linus Torvalds wrote:
> > > On Wed, 7 May 2008, Matthew Wilcox wrote:
> > > >
> > > > One patch I'd still like Yanmin to test is my one from yesterday which
> > > > removes the BKL from fs/locks.c.
> > >
> > > And I'd personally rather have the network-fs people test and comment on
> > > that one ;)
> > >
> > > I think that patch is worth looking at regardless, but the problems with
> > > that one aren't about performance, but about what the implications are for
> > > the filesystems (if any)...
> >
> > Oh, well, they don't seem interested.
>
> Poor timing: we're all preparing for and travelling to the annual
> Connectathon interoperability testing conference which starts tomorrow.
>
> > I can comment on some of the problems though.
> >
> > fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c,
> > fs/nfsd/nfs4state.c all walk the i_flock list under the BKL. That won't
> > protect them against locks.c any more. That's probably OK for fs/nfs/*
> > since they'll be protected by their own data structures (Someone please
> > check me on that?), but it's a bad idea for lockd/nfsd which are walking
> > the lists for filesystems.
>
> Yes. fs/nfs is just reusing the code in fs/locks.c in order to track the
> locks it holds on the server. We could alternatively have coded a
> private lock implementation, but this seemed easier.
So, assuming nfs is taking care of its own locking (I don't know if
that's right), that leaves nlm_traverse_locks() and nlm_file_inuse()
(both in fs/lockd/svcsubs.c) as the problem spots.
> > Are we going to have to export the file_lock_lock? I'd rather not. But
> > we need to keep nfsd/lockd from tripping over locks.c.
> >
> > Maybe we could come up with a decent API that lockd could use? It all
> > seems a bit complex at the moment ... maybe lockd should be keeping
> > track of the locks it owns anyway (since surely the posix deadlock
> > detection code can't work properly if it's just passing all the locks
> > through).
>
> I'm not sure what you mean when you talk about lockd keeping track of
> the locks it owns. It has to keep those locks on inode->i_flock in order
> to make them visible to the host filesystem...
>
> All lockd really needs is the ability to find a lock it owns, and then
> obtain a copy.
That sounds right.
--b.
> As for the nfs client, I suspect we can make do with
> something similar...
* Re: AIM7 40% regression with 2.6.26-rc1
2008-05-06 18:07 ` Andrew Morton
@ 2008-05-11 11:11 ` Matthew Wilcox
0 siblings, 0 replies; 43+ messages in thread
From: Matthew Wilcox @ 2008-05-11 11:11 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, bfields, yanmin_zhang, linux-kernel, viro, torvalds,
linux-fsdevel
On Tue, May 06, 2008 at 11:07:52AM -0700, Andrew Morton wrote:
> Yeah, the early bootup code. The kernel does accidental lock_kernel()s in
> various places and if that re-enables interrupts then powerpc goeth crunch.
>
> Matthew, that seemingly-unneeded irqsave in lib/semaphore.c is a prime site
> for /* one of these things */, no?
I was just reviewing the code and I came across one of these:
/*
* Some notes on the implementation:
*
* The spinlock controls access to the other members of the semaphore.
* down_trylock() and up() can be called from interrupt context, so we
* have to disable interrupts when taking the lock. It turns out various
* parts of the kernel expect to be able to use down() on a semaphore in
* interrupt context when they know it will succeed, so we have to use
* irqsave variants for down(), down_interruptible() and down_killable()
* too.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
end of thread, other threads: [~2008-05-11 11:11 UTC | newest]
Thread overview: 43+ messages
[not found] <1210052904.3453.30.camel@ymzhang>
[not found] ` <20080506114449.GC32591@elte.hu>
2008-05-06 12:09 ` AIM7 40% regression with 2.6.26-rc1 Matthew Wilcox
2008-05-06 16:23 ` Matthew Wilcox
2008-05-06 16:36 ` Linus Torvalds
2008-05-06 16:42 ` Matthew Wilcox
2008-05-06 16:39 ` Alan Cox
2008-05-06 16:51 ` Matthew Wilcox
2008-05-06 16:45 ` Alan Cox
2008-05-06 17:42 ` Linus Torvalds
2008-05-06 20:28 ` Linus Torvalds
2008-05-06 16:44 ` J. Bruce Fields
2008-05-06 17:21 ` Andrew Morton
2008-05-06 17:31 ` Matthew Wilcox
2008-05-06 17:49 ` Ingo Molnar
2008-05-06 18:07 ` Andrew Morton
2008-05-11 11:11 ` Matthew Wilcox
2008-05-06 17:39 ` Ingo Molnar
2008-05-07 6:49 ` Zhang, Yanmin
2008-05-06 17:45 ` Linus Torvalds
2008-05-07 16:38 ` Matthew Wilcox
2008-05-07 16:55 ` Linus Torvalds
2008-05-07 17:08 ` Linus Torvalds
2008-05-07 17:16 ` Andrew Morton
2008-05-07 17:27 ` Linus Torvalds
2008-05-07 17:22 ` Ingo Molnar
2008-05-07 17:25 ` Ingo Molnar
2008-05-07 17:31 ` Linus Torvalds
2008-05-07 17:47 ` Linus Torvalds
2008-05-07 17:49 ` Ingo Molnar
2008-05-07 18:02 ` Linus Torvalds
2008-05-07 18:17 ` Ingo Molnar
2008-05-07 18:27 ` Linus Torvalds
2008-05-07 18:43 ` Ingo Molnar
2008-05-07 19:01 ` Linus Torvalds
2008-05-07 19:09 ` Ingo Molnar
2008-05-07 19:24 ` Matthew Wilcox
2008-05-07 19:44 ` Linus Torvalds
2008-05-07 20:00 ` Oi. NFS people. Read this Matthew Wilcox
2008-05-07 22:10 ` Trond Myklebust
2008-05-09 1:43 ` J. Bruce Fields
2008-05-08 3:24 ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
2008-05-08 3:34 ` Linus Torvalds
2008-05-08 4:37 ` Zhang, Yanmin
2008-05-08 14:58 ` Linus Torvalds