* fs/locks.c BKL removal @ 2002-05-10 21:48 Dave Hansen 2002-05-10 22:13 ` Dave Hansen 2002-05-11 19:45 ` Matthew Wilcox 0 siblings, 2 replies; 11+ messages in thread From: Dave Hansen @ 2002-05-10 21:48 UTC (permalink / raw) To: matthew; +Cc: linux-kernel Matthew, Al Viro pointed me your way. I'm looking into the fs/locks.c mess. It appears that there was an attempt to convert this over to a semaphore, but it was removed just before the 2.4 release because of some deadlocks. Whenever the i_flock list is traversed, the BKL is held. It is also held while running through the file_lock_list which I think is used only for /proc/locks. We definitely need a semaphore because of all the blocking that goes on. We can either have a global lock for all of them, which I think was tried last time. Or, we can split it up a bit more. With the current design, there will need to be a lock for the global list, each individual list, and one for each individual lock to protect against access from the reference in the file_lock_list and the inode->i_flock list. However, I think that the file_lock_list complexity may be able to be reduced. If we make the file_lock_list a list of inodes (or just the i_flocks) with active locks, we can avoid the complexity of having an individual file_lock lock. That way, we at least reduce the number of _types_ of locks. It increases the number of dereferences, but this is /proc we're talking about. Any comments? Talking about locks for locks is confusing :) -- Dave Hansen haveblue@us.ibm.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fs/locks.c BKL removal 2002-05-10 21:48 fs/locks.c BKL removal Dave Hansen @ 2002-05-10 22:13 ` Dave Hansen 2002-05-10 23:17 ` Andrew Morton 2002-05-11 19:45 ` Matthew Wilcox 1 sibling, 1 reply; 11+ messages in thread From: Dave Hansen @ 2002-05-10 22:13 UTC (permalink / raw) Cc: matthew, linux-kernel As Linus pointed out, a semaphore is probably the wrong way to go. The only things that really needs to be protected are the list operations themselves. > No, I really think the code should use a spinlock for the global list, and > then on a per-lock basis something like a reference count and a blocking > lock (which might be a semaphore). -- Dave Hansen haveblue@us.ibm.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fs/locks.c BKL removal 2002-05-10 22:13 ` Dave Hansen @ 2002-05-10 23:17 ` Andrew Morton 2002-05-11 19:48 ` Matthew Wilcox 0 siblings, 1 reply; 11+ messages in thread From: Andrew Morton @ 2002-05-10 23:17 UTC (permalink / raw) To: Dave Hansen; +Cc: matthew, linux-kernel Dave Hansen wrote: > > As Linus pointed out, a semaphore is probably the wrong way to go. > The only things that really needs to be protected are the list > operations themselves. > It was I who put the BKL back into locks.c, much to Matthew's disgust... The problem was that replacing the BKL with a semaphore seriously damaged Apache thoughput on 8-way. Apache was using flock()-based synchronisation and replacing a spin with a schedule just killed it. So.. Apache isn't doing that any more, but it is an instructive case. Replacing the BKL with a semaphore can sometimes be a very bad thing. See http://www.uwsg.iu.edu/hypermail/linux/kernel/0010.3/ - search for "scalability" - ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fs/locks.c BKL removal 2002-05-10 23:17 ` Andrew Morton @ 2002-05-11 19:48 ` Matthew Wilcox 0 siblings, 0 replies; 11+ messages in thread From: Matthew Wilcox @ 2002-05-11 19:48 UTC (permalink / raw) To: Andrew Morton; +Cc: Dave Hansen, matthew, linux-kernel On Fri, May 10, 2002 at 04:17:29PM -0700, Andrew Morton wrote: > It was I who put the BKL back into locks.c, much to > Matthew's disgust... The disgust was targetted more at removing the abstraction of locking scheme which I'd put in and having explicit lock_kernel() / unlock_kernel() calls. I'd used (iirc) acquire_lock() / release_lock() macros which could have just been redefined. > The problem was that replacing the BKL with a semaphore > seriously damaged Apache thoughput on 8-way. Apache > was using flock()-based synchronisation and replacing > a spin with a schedule just killed it. Which says that our semaphores suck, because they don't try to spin for a bit before scheduling. Of course, your change back was the right thing to do in the 2.3.late timeframe. -- Revolutions do not require corporate support. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fs/locks.c BKL removal 2002-05-10 21:48 fs/locks.c BKL removal Dave Hansen 2002-05-10 22:13 ` Dave Hansen @ 2002-05-11 19:45 ` Matthew Wilcox 2002-05-12 1:40 ` Dave Hansen [not found] ` <20020513171729.4165.qmail@thales.mathematik.uni-ulm.de> 1 sibling, 2 replies; 11+ messages in thread From: Matthew Wilcox @ 2002-05-11 19:45 UTC (permalink / raw) To: Dave Hansen; +Cc: matthew, linux-kernel On Fri, May 10, 2002 at 02:48:39PM -0700, Dave Hansen wrote: > I'm looking into the fs/locks.c mess. It appears that there was an > attempt to convert this over to a semaphore, but it was removed just > before the 2.4 release because of some deadlocks. Actually, performance problems ... > Whenever the i_flock list is traversed, the BKL is held. It is also > held while running through the file_lock_list which I think is used > only for /proc/locks. Correct. > We definitely need a semaphore because of all the blocking that goes > on. We can either have a global lock for all of them, which I think > was tried last time. Or, we can split it up a bit more. With the > current design, there will need to be a lock for the global list, each > individual list, and one for each individual lock to protect against > access from the reference in the file_lock_list and the inode->i_flock > list. Nah. Though I'm glad you missed it too; it means that I'm not as stupid as I thought I was for only noticing it 2 years later. Look at locks_wake_up_blocks (this is basically the _only_ tricky part). This has to be called with a wait argument which is true. The only time that can happen is if locks_delete_lock is called with a `true' parameter. And the only time _that_ happens is when the _type_ of an flock lock is being changed. And really, what's happening here? We have a BSD flock which is blocking one or more locks. Those processes have to have the opportunity to acquire the lock before the previously-blocking process gets the opportunity to acquire its lock. But that doesn't mean we need to schedule once for _each_ task which is blocked; we only need to yield once. So we can eliminate the `wait' argument to locks_delete_lock, locks_wake_up_blocks and the arm of the conditional in locks_wake_up_blocks which sleeps. We only need to check in flock_lock_file whether we're unlocking and yield if we aren't. I'm currently doing a major restructure of fs/locks.c, and this problem (along with several others) simply disappears. I'm looking for a testsuite before I release this code to the world ... anybody got one? > However, I think that the file_lock_list complexity may be able to be > reduced. If we make the file_lock_list a list of inodes (or just the > i_flocks) with active locks, we can avoid the complexity of having an > individual file_lock lock. That way, we at least reduce the number of > _types_ of locks. It increases the number of dereferences, but this > is /proc we're talking about. Any comments? Ick... I'd really like to see one spinlock protecting all activity in this area. And obviously not the magic BKL ;-) > Talking about locks for locks is confusing :) Tell me about it! I'm close to calling things `blocks' `plocks', `leases' and `mlocks', just to reduce the namespace conflicts. But it's not obvious those refer to BSD locks, POSIX locks and Mandatory locks... -- Revolutions do not require corporate support. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fs/locks.c BKL removal 2002-05-11 19:45 ` Matthew Wilcox @ 2002-05-12 1:40 ` Dave Hansen 2002-05-12 2:07 ` Matthew Wilcox [not found] ` <20020513171729.4165.qmail@thales.mathematik.uni-ulm.de> 1 sibling, 1 reply; 11+ messages in thread From: Dave Hansen @ 2002-05-12 1:40 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-kernel Matthew Wilcox wrote: > Ick... I'd really like to see one spinlock protecting all activity in this > area. And obviously not the magic BKL ;-) Do you really think a single lock is the way to go? Maybe I'm just paranoid, but somebody is going to run into a locking bottleneck here eventually. I also just don't like global locks. I'll ask our benchmarking team if they have test suites for file locking. I crossing my fingers. -- Dave Hansen haveblue@us.ibm.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fs/locks.c BKL removal 2002-05-12 1:40 ` Dave Hansen @ 2002-05-12 2:07 ` Matthew Wilcox 0 siblings, 0 replies; 11+ messages in thread From: Matthew Wilcox @ 2002-05-12 2:07 UTC (permalink / raw) To: Dave Hansen; +Cc: Matthew Wilcox, linux-kernel On Sat, May 11, 2002 at 06:40:00PM -0700, Dave Hansen wrote: > Do you really think a single lock is the way to go? Maybe I'm just > paranoid, but somebody is going to run into a locking bottleneck here > eventually. I also just don't like global locks. Well, we have a global entity being protected -- the file_lock_list. Clearly we don't want to use one of the existing per-inode semaphores since semaphores have a noted bad effect on both benchmarks and real-world scenarios. And I don't think introducing a new per-inode lock would really be welcome. > I'll ask our benchmarking team if they have test suites for file > locking. I crossing my fingers. Cool, thanks. -- Revolutions do not require corporate support. ^ permalink raw reply [flat|nested] 11+ messages in thread
[parent not found: <20020513171729.4165.qmail@thales.mathematik.uni-ulm.de>]
* Re: fs/locks.c BKL removal [not found] ` <20020513171729.4165.qmail@thales.mathematik.uni-ulm.de> @ 2002-05-13 19:43 ` Matthew Wilcox 2002-05-13 23:15 ` Christian Ehrhardt 2002-05-13 23:18 ` Andreas Dilger 0 siblings, 2 replies; 11+ messages in thread From: Matthew Wilcox @ 2002-05-13 19:43 UTC (permalink / raw) To: Christian Ehrhardt; +Cc: Matthew Wilcox, linux-fsdevel On Mon, May 13, 2002 at 07:17:29PM +0200, Christian Ehrhardt wrote: > Two Programs (source Code below): > * lockf which locks a region in file testfile (start offset and > length given as arguments) > * doit which exploits the bug by writing to an area that is supposed to > be mandatory locked. > > The transscript of a session which shows that it's possible to write > to the file testfile at offset 10 even if another process holds a > mandatory lock on this section of the file. The reason is that there > is a possibility that the region checked with locks_verify_area > in read_write.c is NOT the region that we actually write data to. > Similar issues exist with O_APPEND and locks that extend beyond end > of file. Let me know if you need more details. Basically, this is a race. Here's the code (in sys_write()): struct inode *inode = file->f_dentry->d_inode; ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file, file->f_pos, count); if (!ret) { ssize_t (*write)(struct file *, const char *, size_t, loff_t *); ret = -EINVAL; if (file->f_op && (write = file->f_op->write) != NULL) ret = write(file, buf, count, &file->f_pos); } Any time between calling locks_verify_area and the filesystem's write() method, another task which shares the struct file may modify f_pos. This race could be fixed by doing something like: struct inode *inode = file->f_dentry->d_inode; loff_t start = file->f_pos; ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file, start, count); if (!ret) { ssize_t (*write)(struct file *, const char *, size_t, loff_t *); ret = -EINVAL; if (file->f_op && (write = file->f_op->write) != NULL) ret = write(file, buf, count, &start); } if (!ret) { file->f_pos = start; } Would this cause problems with programs that share file descriptors and expect that the second seek in your example to not be overridden by the result of the first write? -- Revolutions do not require corporate support. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fs/locks.c BKL removal 2002-05-13 19:43 ` Matthew Wilcox @ 2002-05-13 23:15 ` Christian Ehrhardt 2002-05-13 23:18 ` Andreas Dilger 1 sibling, 0 replies; 11+ messages in thread From: Christian Ehrhardt @ 2002-05-13 23:15 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel Hi, On Mon, May 13, 2002 at 08:43:47PM +0100, Matthew Wilcox wrote: > On Mon, May 13, 2002 at 07:17:29PM +0200, Christian Ehrhardt wrote: > > Two Programs (source Code below): > > * lockf which locks a region in file testfile (start offset and > > length given as arguments) > > * doit which exploits the bug by writing to an area that is supposed to > > be mandatory locked. > > > [ ... ] > Any time between calling locks_verify_area and the filesystem's write() > method, another task which shares the struct file may modify f_pos. True. But there are other cases: The file system has full access to the file pointer and can ignore the value start. This is the case if the file is opened with O_APPEND and the current position is not at the end of file. This is only relevant wrt mandatory locking if a lock extends beyond EOF. I'm not sure if this is legal but it probably is. There might also be write functions that rely on &filp->f_pos == &start in subtle ways. > This race could be fixed by doing something like: > > struct inode *inode = file->f_dentry->d_inode; > loff_t start = file->f_pos; > ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file, > start, count); > if (!ret) { > ssize_t (*write)(struct file *, const char *, size_t, loff_t *); > ret = -EINVAL; > if (file->f_op && (write = file->f_op->write) != NULL) > ret = write(file, buf, count, &start); > } > if (!ret) { > file->f_pos = start; > } > > Would this cause problems with programs that share file descriptors and > expect that the second seek in your example to not be overridden by the > result of the first write? This is the behaviour observed on Solaris with the same test, i.e. the write goes to the offset 0 and the seek has no effect on the write. I really doubt that any relevant standard requires that a seek AFTER a write affects that write. Programs relying on the current behaviour should be considered seriously broken. Note that there is a third kind of race that should be considered: What if someone gets a write lock on a file while a write to this region is already in progress. Such a program could read parts of the locked region bevor the write to the same region is finished. Having said that let me mention that the change suggested by you would probably make us BUG-compatible with most other unices out there. I have no idea how to test for the third race, but I did test the O_APPEND case on Solaris and Solaris gets this wrong as well. regards Christian Ehrhardt -- THAT'S ALL FOLKS! ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fs/locks.c BKL removal 2002-05-13 19:43 ` Matthew Wilcox 2002-05-13 23:15 ` Christian Ehrhardt @ 2002-05-13 23:18 ` Andreas Dilger 2002-05-14 9:02 ` Christian Ehrhardt 1 sibling, 1 reply; 11+ messages in thread From: Andreas Dilger @ 2002-05-13 23:18 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Christian Ehrhardt, linux-fsdevel On May 13, 2002 20:43 +0100, Matthew Wilcox wrote: > Basically, this is a race. Here's the code (in sys_write()): > > struct inode *inode = file->f_dentry->d_inode; > ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file, > file->f_pos, count); > if (!ret) { > ssize_t (*write)(struct file *, const char *, size_t, loff_t *); > ret = -EINVAL; > if (file->f_op && (write = file->f_op->write) != NULL) > ret = write(file, buf, count, &file->f_pos); > } > > Any time between calling locks_verify_area and the filesystem's write() > method, another task which shares the struct file may modify f_pos. This > race could be fixed by doing something like: > > struct inode *inode = file->f_dentry->d_inode; > loff_t start = file->f_pos; > ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file, > start, count); > if (!ret) { > ssize_t (*write)(struct file *, const char *, size_t, loff_t *); ret = -EINVAL; > if (file->f_op && (write = file->f_op->write) != NULL) > ret = write(file, buf, count, &start); > } > if (!ret) { > file->f_pos = start; > } > > Would this cause problems with programs that share file descriptors and > expect that the second seek in your example to not be overridden by the > result of the first write? Two things here: 1) You probably need to set file->f_pos unconditionally after calling write() because it is possible for a write to partially complete and you want the file descriptor left at the end of the completed part. 2) If you have two threads writing to the same descriptor without any locking on their part, they can't expect _anything_ w.r.t. the position of the file descriptor, or you wouldn't have this problem in the first place. Hence, no need to worry about the result. Cheers, Andreas -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fs/locks.c BKL removal 2002-05-13 23:18 ` Andreas Dilger @ 2002-05-14 9:02 ` Christian Ehrhardt 0 siblings, 0 replies; 11+ messages in thread From: Christian Ehrhardt @ 2002-05-14 9:02 UTC (permalink / raw) To: Matthew Wilcox, linux-fsdevel Hi, On Mon, May 13, 2002 at 05:18:21PM -0600, Andreas Dilger wrote: > > Any time between calling locks_verify_area and the filesystem's write() > > method, another task which shares the struct file may modify f_pos. This > > race could be fixed by doing something like: > > > > struct inode *inode = file->f_dentry->d_inode; > > loff_t start = file->f_pos; > > ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file, > > start, count); > > if (!ret) { > > ssize_t (*write)(struct file *, const char *, size_t, loff_t *); > > ret = -EINVAL; > > if (file->f_op && (write = file->f_op->write) != NULL) > > ret = write(file, buf, count, &start); > > } > > if (!ret) { > > file->f_pos = start; > > } > > Two things here: > 1) You probably need to set file->f_pos unconditionally after calling > write() because it is possible for a write to partially complete and > you want the file descriptor left at the end of the completed part. write() should return the number of bytes written in this case, i.e. we probably want either (ret > 0) or (ret >= 0) here, (!ret) is definitely bogus. regards Christian Ehrhardt -- THAT'S ALL FOLKS! ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2002-05-14 9:02 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-05-10 21:48 fs/locks.c BKL removal Dave Hansen
2002-05-10 22:13 ` Dave Hansen
2002-05-10 23:17 ` Andrew Morton
2002-05-11 19:48 ` Matthew Wilcox
2002-05-11 19:45 ` Matthew Wilcox
2002-05-12 1:40 ` Dave Hansen
2002-05-12 2:07 ` Matthew Wilcox
[not found] ` <20020513171729.4165.qmail@thales.mathematik.uni-ulm.de>
2002-05-13 19:43 ` Matthew Wilcox
2002-05-13 23:15 ` Christian Ehrhardt
2002-05-13 23:18 ` Andreas Dilger
2002-05-14 9:02 ` Christian Ehrhardt
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.