fs/locks.c BKL removal

All of lore.kernel.org
 help / color / mirror / Atom feed

* fs/locks.c BKL removal
@ 2002-05-10 21:48 Dave Hansen
  2002-05-10 22:13 ` Dave Hansen
  2002-05-11 19:45 ` Matthew Wilcox
  0 siblings, 2 replies; 11+ messages in thread
From: Dave Hansen @ 2002-05-10 21:48 UTC (permalink / raw)
  To: matthew; +Cc: linux-kernel

Matthew,
Al Viro pointed me your way.

I'm looking into the fs/locks.c mess.  It appears that there was an 
attempt to convert this over to a semaphore, but it was removed just 
before the 2.4 release because of some deadlocks.

Whenever the i_flock list is traversed, the BKL is held.  It is also 
held while running through the file_lock_list which I think is used 
only for /proc/locks.

We definitely need a semaphore because of all the blocking that goes 
on.  We can either have a global lock for all of them, which I think 
was tried last time.  Or, we can split it up a bit more.  With the 
current design, there will need to be a lock for the global list, each 
individual list, and one for each individual lock to protect against 
access from the reference in the file_lock_list and the inode->i_flock 
list.

However, I think that the file_lock_list complexity may be able to be 
reduced.  If we make the file_lock_list a list of inodes (or just the 
i_flocks) with active locks, we can avoid the complexity of having an 
individual file_lock lock.  That way, we at least reduce the number of 
_types_ of locks.  It increases the number of dereferences, but this 
is /proc we're talking about.  Any comments?

Talking about locks for locks is confusing :)

-- 
Dave Hansen
haveblue@us.ibm.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fs/locks.c BKL removal
  2002-05-10 21:48 fs/locks.c BKL removal Dave Hansen
@ 2002-05-10 22:13 ` Dave Hansen
  2002-05-10 23:17   ` Andrew Morton
  2002-05-11 19:45 ` Matthew Wilcox
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Hansen @ 2002-05-10 22:13 UTC (permalink / raw)
  Cc: matthew, linux-kernel

As Linus pointed out, a semaphore is probably the wrong way to go. 
The only things that really needs to be protected are the list 
operations themselves.

> No, I really think the code should use a spinlock for the global list, and
> then on a per-lock basis something like a reference count and a blocking
> lock (which might be a semaphore).


-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fs/locks.c BKL removal
  2002-05-10 22:13 ` Dave Hansen
@ 2002-05-10 23:17   ` Andrew Morton
  2002-05-11 19:48     ` Matthew Wilcox
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2002-05-10 23:17 UTC (permalink / raw)
  To: Dave Hansen; +Cc: matthew, linux-kernel

Dave Hansen wrote:
> 
> As Linus pointed out, a semaphore is probably the wrong way to go.
> The only things that really needs to be protected are the list
> operations themselves.
> 

It was I who put the BKL back into locks.c, much to
Matthew's disgust...

The problem was that replacing the BKL with a semaphore
seriously damaged Apache thoughput on 8-way.  Apache
was using flock()-based synchronisation and replacing
a spin with a schedule just killed it.

So.. Apache isn't doing that any more, but it is an
instructive case.  Replacing the BKL with a semaphore
can sometimes be a very bad thing.

See http://www.uwsg.iu.edu/hypermail/linux/kernel/0010.3/ -
search for "scalability"

-

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fs/locks.c BKL removal
  2002-05-10 23:17   ` Andrew Morton
@ 2002-05-11 19:48     ` Matthew Wilcox
  0 siblings, 0 replies; 11+ messages in thread
From: Matthew Wilcox @ 2002-05-11 19:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Dave Hansen, matthew, linux-kernel

On Fri, May 10, 2002 at 04:17:29PM -0700, Andrew Morton wrote:
> It was I who put the BKL back into locks.c, much to
> Matthew's disgust...

The disgust was targetted more at removing the abstraction of
locking scheme which I'd put in and having explicit lock_kernel() /
unlock_kernel() calls.  I'd used (iirc) acquire_lock() / release_lock()
macros which could have just been redefined.

> The problem was that replacing the BKL with a semaphore
> seriously damaged Apache thoughput on 8-way.  Apache
> was using flock()-based synchronisation and replacing
> a spin with a schedule just killed it.

Which says that our semaphores suck, because they don't try to spin for a
bit before scheduling.  Of course, your change back was the right thing
to do in the 2.3.late timeframe.

-- 
Revolutions do not require corporate support.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fs/locks.c BKL removal
  2002-05-10 21:48 fs/locks.c BKL removal Dave Hansen
  2002-05-10 22:13 ` Dave Hansen
@ 2002-05-11 19:45 ` Matthew Wilcox
  2002-05-12  1:40   ` Dave Hansen
       [not found]   ` <20020513171729.4165.qmail@thales.mathematik.uni-ulm.de>
  1 sibling, 2 replies; 11+ messages in thread
From: Matthew Wilcox @ 2002-05-11 19:45 UTC (permalink / raw)
  To: Dave Hansen; +Cc: matthew, linux-kernel

On Fri, May 10, 2002 at 02:48:39PM -0700, Dave Hansen wrote:
> I'm looking into the fs/locks.c mess.  It appears that there was an 
> attempt to convert this over to a semaphore, but it was removed just 
> before the 2.4 release because of some deadlocks.

Actually, performance problems ...

> Whenever the i_flock list is traversed, the BKL is held.  It is also 
> held while running through the file_lock_list which I think is used 
> only for /proc/locks.

Correct.

> We definitely need a semaphore because of all the blocking that goes 
> on.  We can either have a global lock for all of them, which I think 
> was tried last time.  Or, we can split it up a bit more.  With the 
> current design, there will need to be a lock for the global list, each 
> individual list, and one for each individual lock to protect against 
> access from the reference in the file_lock_list and the inode->i_flock 
> list.

Nah.  Though I'm glad you missed it too; it means that I'm not as
stupid as I thought I was for only noticing it 2 years later.  Look at
locks_wake_up_blocks (this is basically the _only_ tricky part).  This has
to be called with a wait argument which is true.  The only time that
can happen is if locks_delete_lock is called with a `true' parameter.
And the only time _that_ happens is when the _type_ of an flock lock is
being changed.

And really, what's happening here?  We have a BSD flock which is blocking
one or more locks.  Those processes have to have the opportunity
to acquire the lock before the previously-blocking process gets the
opportunity to acquire its lock.  But that doesn't mean we need to schedule once
for _each_ task which is blocked; we only need to yield once.

So we can eliminate the `wait' argument to locks_delete_lock,
locks_wake_up_blocks and the arm of the conditional in
locks_wake_up_blocks which sleeps.  We only need to check in
flock_lock_file whether we're unlocking and yield if we aren't.

I'm currently doing a major restructure of fs/locks.c, and this problem
(along with several others) simply disappears.  I'm looking for a
testsuite before I release this code to the world ... anybody got one?

> However, I think that the file_lock_list complexity may be able to be 
> reduced.  If we make the file_lock_list a list of inodes (or just the 
> i_flocks) with active locks, we can avoid the complexity of having an 
> individual file_lock lock.  That way, we at least reduce the number of 
> _types_ of locks.  It increases the number of dereferences, but this 
> is /proc we're talking about.  Any comments?

Ick... I'd really like to see one spinlock protecting all activity in this
area.  And obviously not the magic BKL ;-)

> Talking about locks for locks is confusing :)

Tell me about it!  I'm close to calling things `blocks' `plocks',
`leases' and `mlocks', just to reduce the namespace conflicts.  But it's
not obvious those refer to BSD locks, POSIX locks and Mandatory locks...

-- 
Revolutions do not require corporate support.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fs/locks.c BKL removal
  2002-05-11 19:45 ` Matthew Wilcox
@ 2002-05-12  1:40   ` Dave Hansen
  2002-05-12  2:07     ` Matthew Wilcox
       [not found]   ` <20020513171729.4165.qmail@thales.mathematik.uni-ulm.de>
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Hansen @ 2002-05-12  1:40 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel

Matthew Wilcox wrote:
> Ick... I'd really like to see one spinlock protecting all activity in this
> area.  And obviously not the magic BKL ;-)

Do you really think a single lock is the way to go?  Maybe I'm just 
paranoid, but somebody is going to run into a locking bottleneck here 
eventually.  I also just don't like global locks.

I'll ask our benchmarking team if they have test suites for file 
locking.  I crossing my fingers.

-- 
Dave Hansen
haveblue@us.ibm.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fs/locks.c BKL removal
  2002-05-12  1:40   ` Dave Hansen
@ 2002-05-12  2:07     ` Matthew Wilcox
  0 siblings, 0 replies; 11+ messages in thread
From: Matthew Wilcox @ 2002-05-12  2:07 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Matthew Wilcox, linux-kernel

On Sat, May 11, 2002 at 06:40:00PM -0700, Dave Hansen wrote:
> Do you really think a single lock is the way to go?  Maybe I'm just 
> paranoid, but somebody is going to run into a locking bottleneck here 
> eventually.  I also just don't like global locks.

Well, we have a global entity being protected -- the file_lock_list.
Clearly we don't want to use one of the existing per-inode semaphores
since semaphores have a noted bad effect on both benchmarks and real-world
scenarios.  And I don't think introducing a new per-inode lock would
really be welcome.

> I'll ask our benchmarking team if they have test suites for file 
> locking.  I crossing my fingers.

Cool, thanks.

-- 
Revolutions do not require corporate support.

^ permalink raw reply	[flat|nested] 11+ messages in thread

[parent not found: <20020513171729.4165.qmail@thales.mathematik.uni-ulm.de>]

* Re: fs/locks.c BKL removal
       [not found]   ` <20020513171729.4165.qmail@thales.mathematik.uni-ulm.de>
@ 2002-05-13 19:43     ` Matthew Wilcox
  2002-05-13 23:15       ` Christian Ehrhardt
  2002-05-13 23:18       ` Andreas Dilger
  0 siblings, 2 replies; 11+ messages in thread
From: Matthew Wilcox @ 2002-05-13 19:43 UTC (permalink / raw)
  To: Christian Ehrhardt; +Cc: Matthew Wilcox, linux-fsdevel

On Mon, May 13, 2002 at 07:17:29PM +0200, Christian Ehrhardt wrote:
> Two Programs (source Code below):
> * lockf which locks a region in file testfile (start offset and
>   length given as arguments)
> * doit which exploits the bug by writing to an area that is supposed to
>   be mandatory locked.
> 
> The transscript of a session which shows that it's possible to write
> to the file testfile at offset 10 even if another process holds a
> mandatory lock on this section of the file. The reason is that there
> is a possibility that the region checked with locks_verify_area
> in read_write.c is NOT the region that we actually write data to.
> Similar issues exist with O_APPEND and locks that extend beyond end
> of file. Let me know if you need more details.

Basically, this is a race.  Here's the code (in sys_write()):

	struct inode *inode = file->f_dentry->d_inode;
	ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file,
		file->f_pos, count);
	if (!ret) {
		ssize_t (*write)(struct file *, const char *, size_t, loff_t *);
		ret = -EINVAL;
		if (file->f_op && (write = file->f_op->write) != NULL)
			ret = write(file, buf, count, &file->f_pos);
	}

Any time between calling locks_verify_area and the filesystem's write()
method, another task which shares the struct file may modify f_pos.  This
race could be fixed by doing something like:

	struct inode *inode = file->f_dentry->d_inode;
	loff_t start = file->f_pos;
	ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file,
		start, count);
	if (!ret) {
		ssize_t (*write)(struct file *, const char *, size_t, loff_t *);		ret = -EINVAL;
		if (file->f_op && (write = file->f_op->write) != NULL)
			ret = write(file, buf, count, &start);
	}
	if (!ret) {
		file->f_pos = start;
	}

Would this cause problems with programs that share file descriptors and
expect that the second seek in your example to not be overridden by the
result of the first write?

-- 
Revolutions do not require corporate support.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fs/locks.c BKL removal
  2002-05-13 19:43     ` Matthew Wilcox
@ 2002-05-13 23:15       ` Christian Ehrhardt
  2002-05-13 23:18       ` Andreas Dilger
  1 sibling, 0 replies; 11+ messages in thread
From: Christian Ehrhardt @ 2002-05-13 23:15 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel

Hi,

On Mon, May 13, 2002 at 08:43:47PM +0100, Matthew Wilcox wrote:
> On Mon, May 13, 2002 at 07:17:29PM +0200, Christian Ehrhardt wrote:
> > Two Programs (source Code below):
> > * lockf which locks a region in file testfile (start offset and
> >   length given as arguments)
> > * doit which exploits the bug by writing to an area that is supposed to
> >   be mandatory locked.
> > 
>  [ ... ]
> Any time between calling locks_verify_area and the filesystem's write()
> method, another task which shares the struct file may modify f_pos.

True. But there are other cases: The file system has full access to the
file pointer and can ignore the value start. This is the case if the file
is opened with O_APPEND and the current position is not at the end of
file. This is only relevant wrt mandatory locking if a lock extends
beyond EOF. I'm not sure if this is legal but it probably is.
There might also be write functions that rely on &filp->f_pos == &start
in subtle ways.

> This race could be fixed by doing something like:
> 
> 	struct inode *inode = file->f_dentry->d_inode;
> 	loff_t start = file->f_pos;
> 	ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file,
> 		start, count);
> 	if (!ret) {
> 		ssize_t (*write)(struct file *, const char *, size_t, loff_t *);
>		ret = -EINVAL;
> 		if (file->f_op && (write = file->f_op->write) != NULL)
> 			ret = write(file, buf, count, &start);
> 	}
> 	if (!ret) {
> 		file->f_pos = start;
> 	}
> 
> Would this cause problems with programs that share file descriptors and
> expect that the second seek in your example to not be overridden by the
> result of the first write?

This is the behaviour observed on Solaris with the same test, i.e.
the write goes to the offset 0 and the seek has no effect on the write.
I really doubt that any relevant standard requires that a seek AFTER a
write affects that write. Programs relying on the current behaviour should
be considered seriously broken.

Note that there is a third kind of race that should be considered:
What if someone gets a write lock on a file while a write to this
region is already in progress. Such a program could read parts of the
locked region bevor the write to the same region is finished.

Having said that let me mention that the change suggested by you
would probably make us BUG-compatible with most other unices out there.
I have no idea how to test for the third race, but I did test the
O_APPEND case on Solaris and Solaris gets this wrong as well.

    regards   Christian Ehrhardt

-- 
THAT'S ALL FOLKS!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fs/locks.c BKL removal
  2002-05-13 19:43     ` Matthew Wilcox
  2002-05-13 23:15       ` Christian Ehrhardt
@ 2002-05-13 23:18       ` Andreas Dilger
  2002-05-14  9:02         ` Christian Ehrhardt
  1 sibling, 1 reply; 11+ messages in thread
From: Andreas Dilger @ 2002-05-13 23:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Christian Ehrhardt, linux-fsdevel

On May 13, 2002  20:43 +0100, Matthew Wilcox wrote:
> Basically, this is a race.  Here's the code (in sys_write()):
> 
> 	struct inode *inode = file->f_dentry->d_inode;
> 	ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file,
> 		file->f_pos, count);
> 	if (!ret) {
> 		ssize_t (*write)(struct file *, const char *, size_t, loff_t *);
> 		ret = -EINVAL;
> 		if (file->f_op && (write = file->f_op->write) != NULL)
> 			ret = write(file, buf, count, &file->f_pos);
> 	}
> 
> Any time between calling locks_verify_area and the filesystem's write()
> method, another task which shares the struct file may modify f_pos.  This
> race could be fixed by doing something like:
> 
> 	struct inode *inode = file->f_dentry->d_inode;
> 	loff_t start = file->f_pos;
> 	ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file,
> 		start, count);
> 	if (!ret) {
> 		ssize_t (*write)(struct file *, const char *, size_t, loff_t *);		ret = -EINVAL;
> 		if (file->f_op && (write = file->f_op->write) != NULL)
> 			ret = write(file, buf, count, &start);
> 	}
> 	if (!ret) {
> 		file->f_pos = start;
> 	}
> 
> Would this cause problems with programs that share file descriptors and
> expect that the second seek in your example to not be overridden by the
> result of the first write?

Two things here:
1) You probably need to set file->f_pos unconditionally after calling
   write() because it is possible for a write to partially complete and
   you want the file descriptor left at the end of the completed part.
2) If you have two threads writing to the same descriptor without any
   locking on their part, they can't expect _anything_ w.r.t. the
   position of the file descriptor, or you wouldn't have this problem
   in the first place.  Hence, no need to worry about the result.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fs/locks.c BKL removal
  2002-05-13 23:18       ` Andreas Dilger
@ 2002-05-14  9:02         ` Christian Ehrhardt
  0 siblings, 0 replies; 11+ messages in thread
From: Christian Ehrhardt @ 2002-05-14  9:02 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel


Hi,

On Mon, May 13, 2002 at 05:18:21PM -0600, Andreas Dilger wrote:
> > Any time between calling locks_verify_area and the filesystem's write()
> > method, another task which shares the struct file may modify f_pos.  This
> > race could be fixed by doing something like:
> > 
> > 	struct inode *inode = file->f_dentry->d_inode;
> > 	loff_t start = file->f_pos;
> > 	ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file,
> > 		start, count);
> > 	if (!ret) {
> > 		ssize_t (*write)(struct file *, const char *, size_t, loff_t *);
> >		ret = -EINVAL;
> > 		if (file->f_op && (write = file->f_op->write) != NULL)
> > 			ret = write(file, buf, count, &start);
> > 	}
> > 	if (!ret) {
> > 		file->f_pos = start;
> > 	}
> 
> Two things here:
> 1) You probably need to set file->f_pos unconditionally after calling
>    write() because it is possible for a write to partially complete and
>    you want the file descriptor left at the end of the completed part.

write() should return the number of bytes written in this case, i.e.
we probably want either (ret > 0) or (ret >= 0) here, (!ret) is
definitely bogus.

      regards   Christian Ehrhardt

-- 
THAT'S ALL FOLKS!

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2002-05-14  9:02 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-05-10 21:48 fs/locks.c BKL removal Dave Hansen
2002-05-10 22:13 ` Dave Hansen
2002-05-10 23:17   ` Andrew Morton
2002-05-11 19:48     ` Matthew Wilcox
2002-05-11 19:45 ` Matthew Wilcox
2002-05-12  1:40   ` Dave Hansen
2002-05-12  2:07     ` Matthew Wilcox
     [not found]   ` <20020513171729.4165.qmail@thales.mathematik.uni-ulm.de>
2002-05-13 19:43     ` Matthew Wilcox
2002-05-13 23:15       ` Christian Ehrhardt
2002-05-13 23:18       ` Andreas Dilger
2002-05-14  9:02         ` Christian Ehrhardt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.