* 2.0, 2.2, 2.4, 2.5: fsync buffer race
@ 2003-02-02 23:32 Mikulas Patocka
  2003-02-03  0:00 ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread

From: Mikulas Patocka @ 2003-02-02 23:32 UTC (permalink / raw)
To: linux-kernel

Hi

there's a race condition in the filesystem

let's have two inodes that are placed in the same buffer.

call fsync on inode 1
it goes down to ext2_update_inode [update == 1]
it calls ll_rw_block at the end
ll_rw_block starts to write the buffer
ext2_update_inode waits on the buffer

while the buffer is being written, another process calls fsync on inode 2
it goes again to ext2_update_inode
it calls ll_rw_block
ll_rw_block sees the buffer locked and exits immediately
ext2_update_inode waits for the buffer
the first write finishes, ext2_update_inode exits, and the changes made by
the second process to inode 2 ARE NOT WRITTEN TO DISK.

This bug means that when you simultaneously fsync two inodes in the same
buffer, only the first is really written to disk.

Mikulas

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-02 23:32 2.0, 2.2, 2.4, 2.5: fsync buffer race Mikulas Patocka
@ 2003-02-03  0:00 ` Andrew Morton
  2003-02-03  1:13   ` Mikulas Patocka
  2003-02-04 23:16   ` Pavel Machek
  0 siblings, 2 replies; 16+ messages in thread

From: Andrew Morton @ 2003-02-03 0:00 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: linux-kernel

Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
>
> Hi
>
> there's a race condition in the filesystem
>
> let's have two inodes that are placed in the same buffer.
>
> call fsync on inode 1
> it goes down to ext2_update_inode [update == 1]
> it calls ll_rw_block at the end
> ll_rw_block starts to write the buffer
> ext2_update_inode waits on the buffer
>
> while the buffer is being written, another process calls fsync on inode 2
> it goes again to ext2_update_inode
> it calls ll_rw_block
> ll_rw_block sees the buffer locked and exits immediately
> ext2_update_inode waits for the buffer
> the first write finishes, ext2_update_inode exits, and the changes made by
> the second process to inode 2 ARE NOT WRITTEN TO DISK.

hmm, yes.  This is a general weakness in the ll_rw_block() interface.  It
is not suitable for data-integrity writeouts, as you've pointed out.

A suitable fix would be to create a new

	void wait_and_rw_block(...)
	{
		wait_on_buffer(bh);
		ll_rw_block(...);
	}

and go use that in all the appropriate places.

I shall make that change for 2.5, thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-03  0:00 ` Andrew Morton
@ 2003-02-03  1:13 ` Mikulas Patocka
  2003-02-03  1:20   ` Andrew Morton
  2003-02-04 23:16   ` Pavel Machek
  1 sibling, 1 reply; 16+ messages in thread

From: Mikulas Patocka @ 2003-02-03 1:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel

On Sun, 2 Feb 2003, Andrew Morton wrote:

> Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
> >
> > Hi
> >
> > there's a race condition in the filesystem
> >
> > let's have two inodes that are placed in the same buffer.
> >
> > call fsync on inode 1
> > it goes down to ext2_update_inode [update == 1]
> > it calls ll_rw_block at the end
> > ll_rw_block starts to write the buffer
> > ext2_update_inode waits on the buffer
> >
> > while the buffer is being written, another process calls fsync on inode 2
> > it goes again to ext2_update_inode
> > it calls ll_rw_block
> > ll_rw_block sees the buffer locked and exits immediately
> > ext2_update_inode waits for the buffer
> > the first write finishes, ext2_update_inode exits, and the changes made by
> > the second process to inode 2 ARE NOT WRITTEN TO DISK.
>
> hmm, yes.  This is a general weakness in the ll_rw_block() interface.  It
> is not suitable for data-integrity writeouts, as you've pointed out.
>
> A suitable fix would be to create a new
>
> 	void wait_and_rw_block(...)
> 	{
> 		wait_on_buffer(bh);
> 		ll_rw_block(...);
> 	}

It would fail if another CPU submits I/O with ll_rw_block after
wait_on_buffer but before ll_rw_block.  ll_rw_block really needs to be
rewritten.

Mikulas

> and go use that in all the appropriate places.
>
> I shall make that change for 2.5, thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-03  1:13 ` Mikulas Patocka
@ 2003-02-03  1:20 ` Andrew Morton
  2003-02-03  9:29   ` Mikulas Patocka
  0 siblings, 1 reply; 16+ messages in thread

From: Andrew Morton @ 2003-02-03 1:20 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: linux-kernel

Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
>
> > 	void wait_and_rw_block(...)
> > 	{
> > 		wait_on_buffer(bh);
> > 		ll_rw_block(...);
> > 	}
>
> It would fail if another CPU submits I/O with ll_rw_block after
> wait_on_buffer but before ll_rw_block.

In that case, the caller's data gets written anyway, and the caller will
wait upon the I/O which the other CPU started.  So the ll_rw_block()
behaviour is appropriate.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-03  1:20 ` Andrew Morton
@ 2003-02-03  9:29 ` Mikulas Patocka
  0 siblings, 0 replies; 16+ messages in thread

From: Mikulas Patocka @ 2003-02-03 9:29 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel

On Sun, 2 Feb 2003, Andrew Morton wrote:

> Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
> >
> > > 	void wait_and_rw_block(...)
> > > 	{
> > > 		wait_on_buffer(bh);
> > > 		ll_rw_block(...);
> > > 	}
> >
> > It would fail if another CPU submits I/O with ll_rw_block after
> > wait_on_buffer but before ll_rw_block.
>
> In that case, the caller's data gets written anyway, and the caller will
> wait upon the I/O which the other CPU started.  So the ll_rw_block()
> behaviour is appropriate.

You are partly right, but it suffers from an SMP memory-ordering bug:

CPU 1
  writes data to the buffer (but the data sit in a CPU-local store buffer
  and do not go to the bus yet)
  tests buffer_locked in wait_and_rw_block->wait_on_buffer, sees unlocked

CPU 2
  starts to write the buffer, but does not see the data written by CPU 1 yet

CPU 1
  flushes its data to the bus
  calls ll_rw_block, which sees buffer_locked and exits.  The new data are
  lost.

There should be smp_mb() before wait_on_buffer in wait_and_rw_block.

BTW, why don't you just patch ll_rw_block so that it waits if it sees a
locked buffer -- you get much cleaner code with only one test for a locked
buffer.

Mikulas

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-03  0:00 ` Andrew Morton
  2003-02-03  1:13 ` Mikulas Patocka
@ 2003-02-04 23:16 ` Pavel Machek
  2003-02-05 15:13   ` Mikulas Patocka
  2003-02-10 13:07   ` Andrea Arcangeli
  1 sibling, 2 replies; 16+ messages in thread

From: Pavel Machek @ 2003-02-04 23:16 UTC (permalink / raw)
To: Andrew Morton; +Cc: Mikulas Patocka, linux-kernel

Hi!

> > there's a race condition in the filesystem
> >
> > let's have two inodes that are placed in the same buffer.
> >
> > call fsync on inode 1
> > it goes down to ext2_update_inode [update == 1]
> > it calls ll_rw_block at the end
> > ll_rw_block starts to write the buffer
> > ext2_update_inode waits on the buffer
> >
> > while the buffer is being written, another process calls fsync on inode 2
> > it goes again to ext2_update_inode
> > it calls ll_rw_block
> > ll_rw_block sees the buffer locked and exits immediately
> > ext2_update_inode waits for the buffer
> > the first write finishes, ext2_update_inode exits, and the changes made by
> > the second process to inode 2 ARE NOT WRITTEN TO DISK.
>
> hmm, yes.  This is a general weakness in the ll_rw_block() interface.  It
> is not suitable for data-integrity writeouts, as you've pointed out.
>
> A suitable fix would be to create a new
>
> 	void wait_and_rw_block(...)
> 	{
> 		wait_on_buffer(bh);
> 		ll_rw_block(...);
> 	}
>
> and go use that in all the appropriate places.
>
> I shall make that change for 2.5, thanks.

Should this be fixed at least in 2.4, too?  It seems pretty serious for
mail servers (etc)...
								Pavel
--
Worst form of spam?  Adding advertisement signatures ala sourceforge.net.
What goes next?  Inserting advertisement *into* email?

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-04 23:16 ` Pavel Machek
@ 2003-02-05 15:13 ` Mikulas Patocka
  0 siblings, 0 replies; 16+ messages in thread

From: Mikulas Patocka @ 2003-02-05 15:13 UTC (permalink / raw)
To: Pavel Machek; +Cc: Andrew Morton, linux-kernel

> Hi!
>
> > > there's a race condition in the filesystem
> > >
> > > let's have two inodes that are placed in the same buffer.
> > >
> > > call fsync on inode 1
> > > it goes down to ext2_update_inode [update == 1]
> > > it calls ll_rw_block at the end
> > > ll_rw_block starts to write the buffer
> > > ext2_update_inode waits on the buffer
> > >
> > > while the buffer is being written, another process calls fsync on inode 2
> > > it goes again to ext2_update_inode
> > > it calls ll_rw_block
> > > ll_rw_block sees the buffer locked and exits immediately
> > > ext2_update_inode waits for the buffer
> > > the first write finishes, ext2_update_inode exits, and the changes made by
> > > the second process to inode 2 ARE NOT WRITTEN TO DISK.
> >
> > hmm, yes.  This is a general weakness in the ll_rw_block() interface.  It
> > is not suitable for data-integrity writeouts, as you've pointed out.
> >
> > A suitable fix would be to create a new
> >
> > 	void wait_and_rw_block(...)
> > 	{
> > 		wait_on_buffer(bh);
> > 		ll_rw_block(...);
> > 	}
> >
> > and go use that in all the appropriate places.
> >
> > I shall make that change for 2.5, thanks.
>
> Should this be fixed at least in 2.4, too?  It seems pretty serious for
> mail servers (etc)...
> 								Pavel

It should, but it is hazardous.  The problem is that every use of
ll_rw_block has this bug, not only the one in ext2 fsync.  The cleanest
thing would be to modify ll_rw_block to wait until the buffer becomes
unlocked; no one knows if that can produce some weird effects.

Even Linus didn't know what he was doing -- see this comment around the
buggy part in 2.2, 2.0 and previous kernels, ll_rw_blk.c:

	/* Uhhuh.. Nasty dead-lock possible here.. */
	if (buffer_locked(bh))
		return;
	/* Maybe the above fixes it, and maybe it doesn't boot. Life is interesting */
	lock_buffer(bh);

Mikulas

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-04 23:16 ` Pavel Machek
  2003-02-05 15:13 ` Mikulas Patocka
@ 2003-02-10 13:07 ` Andrea Arcangeli
  2003-02-10 16:28   ` Mikulas Patocka
  1 sibling, 1 reply; 16+ messages in thread

From: Andrea Arcangeli @ 2003-02-10 13:07 UTC (permalink / raw)
To: Pavel Machek; +Cc: Andrew Morton, Mikulas Patocka, linux-kernel

On Wed, Feb 05, 2003 at 12:16:53AM +0100, Pavel Machek wrote:
> Hi!
>
> > > there's a race condition in the filesystem
> > >
> > > let's have two inodes that are placed in the same buffer.
> > >
> > > call fsync on inode 1
> > > it goes down to ext2_update_inode [update == 1]
> > > it calls ll_rw_block at the end
> > > ll_rw_block starts to write the buffer
> > > ext2_update_inode waits on the buffer
> > >
> > > while the buffer is being written, another process calls fsync on inode 2
> > > it goes again to ext2_update_inode
> > > it calls ll_rw_block
> > > ll_rw_block sees the buffer locked and exits immediately
> > > ext2_update_inode waits for the buffer
> > > the first write finishes, ext2_update_inode exits, and the changes made by
> > > the second process to inode 2 ARE NOT WRITTEN TO DISK.
> >
> > hmm, yes.  This is a general weakness in the ll_rw_block() interface.  It
> > is not suitable for data-integrity writeouts, as you've pointed out.
> >
> > A suitable fix would be to create a new
> >
> > 	void wait_and_rw_block(...)
> > 	{
> > 		wait_on_buffer(bh);
> > 		ll_rw_block(...);
> > 	}
> >
> > and go use that in all the appropriate places.
> >
> > I shall make that change for 2.5, thanks.
>
> Should this be fixed at least in 2.4, too?  It seems pretty serious for
> mail servers (etc)...

Actually the low level currently is supposed to take care of that if it
writes directly with ll_rw_block, like fsync_buffers_list takes care of
it before calling ll_rw_block.
But the whole point is that normally the write_inode path only marks the
buffer dirty, it never writes directly, so no dirty bitflag can be lost,
and we flush those dirty buffers with the proper wait_on_buffer before
ll_rw_block.  So I don't think it's really happening in 2.4, at least w/o
mounting the fs with -osync.

But I thought about Mikulas's suggestion of adding lock_buffer in place of
the test-and-set bit.  This looks very appealing.  The main reason we
didn't do that while fixing fsync_buffers_list a few months ago, in 2.4.20
or so, has been that it is a very risky change; I mean, I'm feeling like
it'll break something subtle.

In theory the only thing those bitflags control is the coherency of the
buffer cache here; the rest is serialized by design at the high level.
And the buffer cache should be ok with the lock_buffer(), since we'll see
the buffer update and we'll skip the write if we race with other reads
(we'll sleep in lock_buffer rather than in wait_on_buffer).  For the
writes we'll overwrite the data one more time (the feature incidentally ;).
So it sounds safe.  Performance-wise it shouldn't matter; for reads it
can't matter, in fact.

So I guess it's ok to make that change; then we could even move the
wait_on_buffer from fsync_buffers_list to the xfs buffer_delay path of
write_buffer.  I'm tempted to make this change for next -aa.  I feel it
makes more sense and it's a better fix even if risky.  Can anybody see a
problem with that?

Andrea

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-10 13:07 ` Andrea Arcangeli
@ 2003-02-10 16:28 ` Mikulas Patocka
  2003-02-10 16:57   ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread

From: Mikulas Patocka @ 2003-02-10 16:28 UTC (permalink / raw)
To: torvalds; +Cc: Andrea Arcangeli, Pavel Machek, Andrew Morton, linux-kernel

Linus, do you remember what "nasty deadlock" you meant when writing this
in 2.[0-2].* and maybe 1.*?

ll_rw_blk.c - ll_rw_block():

	/* Uhhuh.. Nasty dead-lock possible here.. */
	if (buffer_locked(bh))
		return;
	/* Maybe the above fixes it, and maybe it doesn't boot. Life is interesting */
	lock_buffer(bh);

Mikulas

On Mon, 10 Feb 2003, Andrea Arcangeli wrote:

> On Wed, Feb 05, 2003 at 12:16:53AM +0100, Pavel Machek wrote:
> > Hi!
> >
> > > > there's a race condition in the filesystem
> > > >
> > > > let's have two inodes that are placed in the same buffer.
> > > >
> > > > call fsync on inode 1
> > > > it goes down to ext2_update_inode [update == 1]
> > > > it calls ll_rw_block at the end
> > > > ll_rw_block starts to write the buffer
> > > > ext2_update_inode waits on the buffer
> > > >
> > > > while the buffer is being written, another process calls fsync on inode 2
> > > > it goes again to ext2_update_inode
> > > > it calls ll_rw_block
> > > > ll_rw_block sees the buffer locked and exits immediately
> > > > ext2_update_inode waits for the buffer
> > > > the first write finishes, ext2_update_inode exits, and the changes made by
> > > > the second process to inode 2 ARE NOT WRITTEN TO DISK.
> > >
> > > hmm, yes.  This is a general weakness in the ll_rw_block() interface.  It
> > > is not suitable for data-integrity writeouts, as you've pointed out.
> > >
> > > A suitable fix would be to create a new
> > >
> > > 	void wait_and_rw_block(...)
> > > 	{
> > > 		wait_on_buffer(bh);
> > > 		ll_rw_block(...);
> > > 	}
> > >
> > > and go use that in all the appropriate places.
> > >
> > > I shall make that change for 2.5, thanks.
> >
> > Should this be fixed at least in 2.4, too?  It seems pretty serious for
> > mail servers (etc)...
>
> actually the low level currently is supposed to take care of that if it
> writes directly with ll_rw_block, like fsync_buffers_list takes care of
> it before calling ll_rw_block.  But the whole point is that normally the
> write_inode path only marks the buffer dirty, it never writes directly,
> so no dirty bitflag can be lost, and we flush those dirty buffers with
> the proper wait_on_buffer before ll_rw_block.  So I don't think it's
> really happening in 2.4, at least w/o mounting the fs with -osync.
>
> But I thought about Mikulas's suggestion of adding lock_buffer in place
> of the test-and-set bit.  This looks very appealing.  The main reason we
> didn't do that while fixing fsync_buffers_list a few months ago, in
> 2.4.20 or so, has been that it is a very risky change; I mean, I'm
> feeling like it'll break something subtle.
>
> In theory the only thing those bitflags control is the coherency of the
> buffer cache here; the rest is serialized by design at the high level.
> And the buffer cache should be ok with the lock_buffer(), since we'll
> see the buffer update and we'll skip the write if we race with other
> reads (we'll sleep in lock_buffer rather than in wait_on_buffer).  For
> the writes we'll overwrite the data one more time (the feature
> incidentally ;).  So it sounds safe.  Performance-wise it shouldn't
> matter; for reads it can't matter, in fact.
>
> So I guess it's ok to make that change; then we could even move the
> wait_on_buffer from fsync_buffers_list to the xfs buffer_delay path of
> write_buffer.  I'm tempted to make this change for next -aa.  I feel it
> makes more sense and it's a better fix even if risky.  Can anybody see
> a problem with that?
>
> Andrea

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-10 16:28 ` Mikulas Patocka
@ 2003-02-10 16:57 ` Linus Torvalds
  2003-02-10 20:40   ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread

From: Linus Torvalds @ 2003-02-10 16:57 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Andrea Arcangeli, Pavel Machek, Andrew Morton, linux-kernel

On Mon, 10 Feb 2003, Mikulas Patocka wrote:
>
> Linus, do you remember what "nasty deadlock" you meant when writing this
> in 2.[0-2].* and maybe 1.*?
>
> ll_rw_blk.c - ll_rw_block():
>
> 	/* Uhhuh.. Nasty dead-lock possible here.. */
> 	if (buffer_locked(bh))
> 		return;
> 	/* Maybe the above fixes it, and maybe it doesn't boot. Life is interesting */
> 	lock_buffer(bh);

Nope, that's a _long_ time ago.  It may well have been simply a MM bug
that got fixed long since, ie something like

 - dirty writeout happens
 - writeout needs memory
 - buffer.c tries to clean up pages
 - hang.

However, some of the comments in this thread seem to simply not be
correct:

> > > > > while the buffer is being written, another process calls fsync on inode 2
> > > > > it goes again to ext2_update_inode
> > > > > it calls ll_rw_block
> > > > > ll_rw_block sees the buffer locked and exits immediately
> > > > > ext2_update_inode waits for the buffer
> > > > > the first write finishes, ext2_update_inode exits, and the changes made by
> > > > > the second process to inode 2 ARE NOT WRITTEN TO DISK.

The dirty bit is not cleared, so the changes _are_ written to disk.
Eventually.

The fact that something tests "unlocked" instead of "dirty" is not the
fault of ll_rw_block(), and never has been.  So the bug is not in
ll_rw_block(), but in the waiter.

If you expect to be able to wait for dirty blocks, you can't do what the
code apparently does.  You can't assume that "unlocked" means "no longer
dirty".  Unlocked means unlocked, and dirty means dirty.

If you want to synchronously wait for dirty blocks, you should do
something like

	lock_buffer();
	.. dirty buffer..
	mark_buffer_dirty();
	unlock_buffer();

	ll_rw_block(..)

Btw, it is probably a really bad idea to just do

	wait_and_write()
	{
		wait_for_buffer(..);
		ll_rw_block(..);
	}

for things like this.  It's just a better idea to lock the buffer before
making the modification.

Of course, these days, if you really want to, you can just use
submit_bh() etc instead.  At which point _you_ are responsible for all
the locking and unlocking.

		Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-10 16:57 ` Linus Torvalds
@ 2003-02-10 20:40 ` Andrew Morton
  2003-02-10 21:18   ` Andrea Arcangeli
  0 siblings, 1 reply; 16+ messages in thread

From: Andrew Morton @ 2003-02-10 20:40 UTC (permalink / raw)
To: Linus Torvalds; +Cc: mikulas, andrea, pavel, linux-kernel

Linus Torvalds <torvalds@transmeta.com> wrote:
>
> If you want to synchronously wait for dirty blocks, you should do
> something like
>
> 	lock_buffer();
> 	.. dirty buffer..
> 	mark_buffer_dirty();
> 	unlock_buffer();
>
> 	ll_rw_block(..)

Almost all of the problematic code paths were doing this:

	alter_data_at(bh->b_data);
	mark_buffer_dirty(bh);
	if (IS_SYNC(inode)) {
		ll_rw_block(bh);
		wait_on_buffer(bh);
	}

and the bug is that if writeout was already underway, the new changes are
not committed to disk here.

The approach you describe here would involve changing that to:

	if (IS_SYNC(inode))
		lock_buffer(bh);
	alter_data_at(bh->b_data);
	mark_buffer_dirty(bh);
	if (IS_SYNC(inode)) {
		submit_bh(bh);
		wait_on_buffer(bh);
	}

which seems a bit odd.  Or perhaps

	lock_buffer(bh);
	alter_data_at(bh->b_data);
	mark_buffer_dirty(bh);
	if (IS_SYNC(inode)) {
		submit_bh(bh);
		wait_on_buffer(bh);
	} else {
		unlock_buffer(bh);
	}

which is nice.  But it'll suck for the non-IS_SYNC case due to undesirable
serialisation.  We don't need the buffer lock in there for serialisation
because that is handled at a higher level - usually i_sem.

I wrote the below patch to address this problem.


Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> points out a bug in
ll_rw_block() usage.

Typical usage is:

	mark_buffer_dirty(bh);
	ll_rw_block(WRITE, 1, &bh);
	wait_on_buffer(bh);

the problem is that if the buffer was locked on entry to this code
sequence (due to in-progress I/O), ll_rw_block() will not start new I/O.
So this code will wait on the _old_ I/O, and will then continue
execution, leaving the buffer dirty.
It turns out that all callers were only writing one buffer, and they were
all waiting on that writeout.  So I added a new sync_dirty_buffer()
function:

	void sync_dirty_buffer(struct buffer_head *bh)
	{
		lock_buffer(bh);
		if (test_clear_buffer_dirty(bh)) {
			get_bh(bh);
			bh->b_end_io = end_buffer_io_sync;
			submit_bh(WRITE, bh);
		} else {
			unlock_buffer(bh);
		}
	}

which allowed a fair amount of code to be removed, while adding the
desired data-integrity guarantees.

UFS has its own wrappers around ll_rw_block() which got in the way, so
this operation was open-coded in that case.


 fs/buffer.c                 |   18 ++++++++++++++++++
 fs/ext2/balloc.c            |   12 ++++--------
 fs/ext2/ialloc.c            |   12 ++++--------
 fs/ext2/inode.c             |    9 +++------
 fs/ext2/super.c             |    3 +--
 fs/ext2/xattr.c             |    9 +++------
 fs/ext3/super.c             |    6 ++----
 fs/jbd/commit.c             |    3 +--
 fs/jbd/journal.c            |    8 ++++----
 fs/jbd/transaction.c        |    6 ++----
 fs/jfs/jfs_imap.c           |    3 +--
 fs/jfs/jfs_mount.c          |    3 +--
 fs/jfs/namei.c              |    6 ++----
 fs/jfs/resize.c             |    9 +++------
 fs/minix/inode.c            |    3 +--
 fs/ntfs/super.c             |    3 +--
 fs/qnx4/inode.c             |    3 +--
 fs/reiserfs/journal.c       |    6 ++----
 fs/reiserfs/resize.c        |    3 +--
 fs/sysv/inode.c             |    3 +--
 fs/sysv/itree.c             |    6 ++----
 fs/udf/inode.c              |    3 +--
 fs/ufs/balloc.c             |   16 ++++++++--------
 fs/ufs/dir.c                |   18 ++++++------------
 fs/ufs/ialloc.c             |    2 ++
 fs/ufs/inode.c              |   12 ++++--------
 fs/ufs/truncate.c           |    3 +++
 include/linux/buffer_head.h |    1 +
 include/linux/hfs_sysdep.h  |    9 ++-------
 kernel/ksyms.c              |    1 +
 30 files changed, 86 insertions(+), 113 deletions(-)

diff -puN fs/buffer.c~ll_rw_block-fix fs/buffer.c
--- 25/fs/buffer.c~ll_rw_block-fix	Wed Feb 5 16:27:02 2003
+++ 25-akpm/fs/buffer.c	Wed Feb 5 16:27:03 2003
@@ -2622,6 +2622,24 @@ void ll_rw_block(int rw, int nr, struct
 }
 
 /*
+ * For a data-integrity writeout, we need to wait upon any in-progress I/O
+ * and then start new I/O and then wait upon it.
+ */
+void sync_dirty_buffer(struct buffer_head *bh)
+{
+	WARN_ON(atomic_read(&bh->b_count) < 1);
+	lock_buffer(bh);
+	if (test_clear_buffer_dirty(bh)) {
+		get_bh(bh);
+		bh->b_end_io = end_buffer_io_sync;
+		submit_bh(WRITE, bh);
+		wait_on_buffer(bh);
+	} else {
+		unlock_buffer(bh);
+	}
+}
+
+/*
  * Sanity checks for try_to_free_buffers.
  */
 static void check_ttfb_buffer(struct page *page, struct buffer_head *bh)
diff -puN fs/ext2/balloc.c~ll_rw_block-fix fs/ext2/balloc.c
--- 25/fs/ext2/balloc.c~ll_rw_block-fix	Wed Feb 5 16:27:02 2003
+++ 25-akpm/fs/ext2/balloc.c	Wed Feb 5 16:27:03 2003
@@ -233,10 +233,8 @@ do_more:
 	}
 	mark_buffer_dirty(bitmap_bh);
-	if (sb->s_flags & MS_SYNCHRONOUS) {
-		ll_rw_block(WRITE, 1, &bitmap_bh);
-		wait_on_buffer(bitmap_bh);
-	}
+	if (sb->s_flags & MS_SYNCHRONOUS)
+		sync_dirty_buffer(bitmap_bh);
 	group_release_blocks(desc, bh2, group_freed);
 	freed += group_freed;
@@ -466,10 +464,8 @@ got_block:
 	write_unlock(&EXT2_I(inode)->i_meta_lock);
 	mark_buffer_dirty(bitmap_bh);
-	if (sb->s_flags & MS_SYNCHRONOUS) {
-		ll_rw_block(WRITE, 1, &bitmap_bh);
-		wait_on_buffer(bitmap_bh);
-	}
+	if (sb->s_flags & MS_SYNCHRONOUS)
+		sync_dirty_buffer(bitmap_bh);
 	ext2_debug ("allocating block %d.
", block); diff -puN fs/ext2/ialloc.c~ll_rw_block-fix fs/ext2/ialloc.c --- 25/fs/ext2/ialloc.c~ll_rw_block-fix Wed Feb 5 16:27:02 2003 +++ 25-akpm/fs/ext2/ialloc.c Wed Feb 5 16:27:03 2003 @@ -146,10 +146,8 @@ void ext2_free_inode (struct inode * ino mark_buffer_dirty(EXT2_SB(sb)->s_sbh); } mark_buffer_dirty(bitmap_bh); - if (sb->s_flags & MS_SYNCHRONOUS) { - ll_rw_block(WRITE, 1, &bitmap_bh); - wait_on_buffer(bitmap_bh); - } + if (sb->s_flags & MS_SYNCHRONOUS) + sync_dirty_buffer(bitmap_bh); sb->s_dirt = 1; error_return: brelse(bitmap_bh); @@ -485,10 +483,8 @@ repeat: ext2_set_bit(i, bitmap_bh->b_data); mark_buffer_dirty(bitmap_bh); - if (sb->s_flags & MS_SYNCHRONOUS) { - ll_rw_block(WRITE, 1, &bitmap_bh); - wait_on_buffer(bitmap_bh); - } + if (sb->s_flags & MS_SYNCHRONOUS) + sync_dirty_buffer(bitmap_bh); brelse(bitmap_bh); ino = group * EXT2_INODES_PER_GROUP(sb) + i + 1; diff -puN fs/ext2/inode.c~ll_rw_block-fix fs/ext2/inode.c --- 25/fs/ext2/inode.c~ll_rw_block-fix Wed Feb 5 16:27:02 2003 +++ 25-akpm/fs/ext2/inode.c Wed Feb 5 16:27:03 2003 @@ -443,10 +443,8 @@ static int ext2_alloc_branch(struct inod * But we now rely upon generic_osync_inode() * and b_inode_buffers. But not for directories. 
		 */
-		if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode)) {
-			ll_rw_block(WRITE, 1, &bh);
-			wait_on_buffer(bh);
-		}
+		if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode))
+			sync_dirty_buffer(bh);
 		parent = nr;
 	}
 	if (n == num)
@@ -1208,8 +1206,7 @@ static int ext2_update_inode(struct inod
 		raw_inode->i_block[n] = ei->i_data[n];
 	mark_buffer_dirty(bh);
 	if (do_sync) {
-		ll_rw_block (WRITE, 1, &bh);
-		wait_on_buffer (bh);
+		sync_dirty_buffer(bh);
 		if (buffer_req(bh) && !buffer_uptodate(bh)) {
 			printk ("IO error syncing ext2 inode [%s:%08lx]\n",
 				sb->s_id, (unsigned long) ino);
diff -puN fs/ext2/super.c~ll_rw_block-fix fs/ext2/super.c
--- 25/fs/ext2/super.c~ll_rw_block-fix	Wed Feb 5 16:27:02 2003
+++ 25-akpm/fs/ext2/super.c	Wed Feb 5 16:27:03 2003
@@ -842,8 +842,7 @@ static void ext2_sync_super(struct super
 {
 	es->s_wtime = cpu_to_le32(get_seconds());
 	mark_buffer_dirty(EXT2_SB(sb)->s_sbh);
-	ll_rw_block(WRITE, 1, &EXT2_SB(sb)->s_sbh);
-	wait_on_buffer(EXT2_SB(sb)->s_sbh);
+	sync_dirty_buffer(EXT2_SB(sb)->s_sbh);
 	sb->s_dirt = 0;
 }
diff -puN fs/ext2/xattr.c~ll_rw_block-fix fs/ext2/xattr.c
--- 25/fs/ext2/xattr.c~ll_rw_block-fix	Wed Feb 5 16:27:02 2003
+++ 25-akpm/fs/ext2/xattr.c	Wed Feb 5 16:27:03 2003
@@ -774,8 +774,7 @@ ext2_xattr_set2(struct inode *inode, str
 		}
 		mark_buffer_dirty(new_bh);
 		if (IS_SYNC(inode)) {
-			ll_rw_block(WRITE, 1, &new_bh);
-			wait_on_buffer(new_bh);
+			sync_dirty_buffer(new_bh);
 			error = -EIO;
 			if (buffer_req(new_bh) && !buffer_uptodate(new_bh))
 				goto cleanup;
@@ -865,10 +864,8 @@ ext2_xattr_delete_inode(struct inode *in
 		HDR(bh)->h_refcount = cpu_to_le32(
 			le32_to_cpu(HDR(bh)->h_refcount) - 1);
 		mark_buffer_dirty(bh);
-		if (IS_SYNC(inode)) {
-			ll_rw_block(WRITE, 1, &bh);
-			wait_on_buffer(bh);
-		}
+		if (IS_SYNC(inode))
+			sync_dirty_buffer(bh);
 		DQUOT_FREE_BLOCK(inode, 1);
 	}
 	EXT2_I(inode)->i_file_acl = 0;
diff -puN fs/ext3/super.c~ll_rw_block-fix fs/ext3/super.c
--- 25/fs/ext3/super.c~ll_rw_block-fix	Wed Feb 5 16:27:02 2003
+++ 25-akpm/fs/ext3/super.c	Wed Feb 5 16:27:03 2003
@@ -1627,10 +1627,8 @@ static void ext3_commit_super (struct su
 	es->s_wtime = cpu_to_le32(get_seconds());
 	BUFFER_TRACE(EXT3_SB(sb)->s_sbh, "marking dirty");
 	mark_buffer_dirty(EXT3_SB(sb)->s_sbh);
-	if (sync) {
-		ll_rw_block(WRITE, 1, &EXT3_SB(sb)->s_sbh);
-		wait_on_buffer(EXT3_SB(sb)->s_sbh);
-	}
+	if (sync)
+		sync_dirty_buffer(EXT3_SB(sb)->s_sbh);
 }
diff -puN fs/jbd/commit.c~ll_rw_block-fix fs/jbd/commit.c
--- 25/fs/jbd/commit.c~ll_rw_block-fix	Wed Feb 5 16:27:02 2003
+++ 25-akpm/fs/jbd/commit.c	Wed Feb 5 16:27:03 2003
@@ -607,8 +607,7 @@ start_journal_io:
 	{
 		struct buffer_head *bh = jh2bh(descriptor);
 		set_buffer_uptodate(bh);
-		ll_rw_block(WRITE, 1, &bh);
-		wait_on_buffer(bh);
+		sync_dirty_buffer(bh);
 		__brelse(bh);		/* One for getblk() */
 		journal_unlock_journal_head(descriptor);
 	}
diff -puN fs/jbd/journal.c~ll_rw_block-fix fs/jbd/journal.c
--- 25/fs/jbd/journal.c~ll_rw_block-fix	Wed Feb 5 16:27:02 2003
+++ 25-akpm/fs/jbd/journal.c	Wed Feb 5 16:27:03 2003
@@ -960,9 +960,10 @@ void journal_update_superblock(journal_t
 	BUFFER_TRACE(bh, "marking dirty");
 	mark_buffer_dirty(bh);
-	ll_rw_block(WRITE, 1, &bh);
 	if (wait)
-		wait_on_buffer(bh);
+		sync_dirty_buffer(bh);
+	else
+		ll_rw_block(WRITE, 1, &bh);
 
 	/* If we have just flushed the log (by marking s_start==0), then
 	 * any future commit will have to be careful to update the
@@ -1296,8 +1297,7 @@ static int journal_convert_superblock_v1
 	bh = journal->j_sb_buffer;
 	BUFFER_TRACE(bh, "marking dirty");
 	mark_buffer_dirty(bh);
-	ll_rw_block(WRITE, 1, &bh);
-	wait_on_buffer(bh);
+	sync_dirty_buffer(bh);
 	return 0;
 }
diff -puN fs/jbd/transaction.c~ll_rw_block-fix fs/jbd/transaction.c
--- 25/fs/jbd/transaction.c~ll_rw_block-fix	Wed Feb 5 16:27:02 2003
+++ 25-akpm/fs/jbd/transaction.c	Wed Feb 5 16:27:03 2003
@@ -1079,8 +1079,7 @@ int journal_dirty_data (handle_t *handle
 			atomic_inc(&bh->b_count);
 			spin_unlock(&journal_datalist_lock);
 			need_brelse = 1;
-			ll_rw_block(WRITE, 1, &bh);
-			wait_on_buffer(bh);
+			sync_dirty_buffer(bh);
			spin_lock(&journal_datalist_lock);
 			/* The buffer may become locked again at any
 			   time if it is redirtied */
@@ -1361,8 +1360,7 @@ void journal_sync_buffer(struct buffer_h
 		}
 		atomic_inc(&bh->b_count);
 		spin_unlock(&journal_datalist_lock);
-		ll_rw_block (WRITE, 1, &bh);
-		wait_on_buffer(bh);
+		sync_dirty_buffer(bh);
 		__brelse(bh);
 		goto out;
 	}
diff -puN fs/jfs/jfs_imap.c~ll_rw_block-fix fs/jfs/jfs_imap.c
--- 25/fs/jfs/jfs_imap.c~ll_rw_block-fix	Wed Feb 5 16:27:02 2003
+++ 25-akpm/fs/jfs/jfs_imap.c	Wed Feb 5 16:27:03 2003
@@ -2980,8 +2980,7 @@ static void duplicateIXtree(struct super
 		j_sb->s_flag |= JFS_BAD_SAIT;
 		mark_buffer_dirty(bh);
-		ll_rw_block(WRITE, 1, &bh);
-		wait_on_buffer(bh);
+		sync_dirty_buffer(bh);
 		brelse(bh);
 		return;
 	}
diff -puN fs/jfs/jfs_mount.c~ll_rw_block-fix fs/jfs/jfs_mount.c
--- 25/fs/jfs/jfs_mount.c~ll_rw_block-fix	Wed Feb 5 16:27:02 2003
+++ 25-akpm/fs/jfs/jfs_mount.c	Wed Feb 5 16:27:03 2003
@@ -449,8 +449,7 @@ int updateSuper(struct super_block *sb,
 	}
 	mark_buffer_dirty(bh);
-	ll_rw_block(WRITE, 1, &bh);
-	wait_on_buffer(bh);
+	sync_dirty_buffer(bh);
 	brelse(bh);
 	return 0;
diff -puN fs/jfs/namei.c~ll_rw_block-fix fs/jfs/namei.c
--- 25/fs/jfs/namei.c~ll_rw_block-fix	Wed Feb 5 16:27:02 2003
+++ 25-akpm/fs/jfs/namei.c	Wed Feb 5 16:27:03 2003
@@ -972,10 +972,8 @@ int jfs_symlink(struct inode *dip, struc
 #if 0
 			set_buffer_uptodate(bp);
 			mark_buffer_dirty(bp, 1);
-			if (IS_SYNC(dip)) {
-				ll_rw_block(WRITE, 1, &bp);
-				wait_on_buffer(bp);
-			}
+			if (IS_SYNC(dip))
+				sync_dirty_buffer(bp);
 			brelse(bp);
 #endif				/* 0 */
 			ssize -= copy_size;
diff -puN fs/jfs/resize.c~ll_rw_block-fix fs/jfs/resize.c
--- 25/fs/jfs/resize.c~ll_rw_block-fix	Wed Feb 5 16:27:02 2003
+++ 25-akpm/fs/jfs/resize.c	Wed Feb 5 16:27:03 2003
@@ -243,8 +243,7 @@ int jfs_extendfs(struct super_block *sb,
 	/* synchronously update superblock */
 	mark_buffer_dirty(bh);
-	ll_rw_block(WRITE, 1, &bh);
-	wait_on_buffer(bh);
+	sync_dirty_buffer(bh);
 	brelse(bh);
 
 	/*
@@ -512,15 +511,13 @@ int jfs_extendfs(struct
super_block *sb, memcpy(j_sb2, j_sb, sizeof (struct jfs_superblock)); mark_buffer_dirty(bh); - ll_rw_block(WRITE, 1, &bh2); - wait_on_buffer(bh2); + sync_dirty_buffer(bh2); brelse(bh2); } /* write primary superblock */ mark_buffer_dirty(bh); - ll_rw_block(WRITE, 1, &bh); - wait_on_buffer(bh); + sync_dirty_buffer(bh); brelse(bh); goto resume; diff -puN fs/minix/inode.c~ll_rw_block-fix fs/minix/inode.c --- 25/fs/minix/inode.c~ll_rw_block-fix Wed Feb 5 16:27:02 2003 +++ 25-akpm/fs/minix/inode.c Wed Feb 5 16:27:03 2003 @@ -517,8 +517,7 @@ int minix_sync_inode(struct inode * inod bh = minix_update_inode(inode); if (bh && buffer_dirty(bh)) { - ll_rw_block(WRITE, 1, &bh); - wait_on_buffer(bh); + sync_dirty_buffer(bh); if (buffer_req(bh) && !buffer_uptodate(bh)) { printk ("IO error syncing minix inode [%s:%08lx]\n", diff -puN fs/ntfs/super.c~ll_rw_block-fix fs/ntfs/super.c --- 25/fs/ntfs/super.c~ll_rw_block-fix Wed Feb 5 16:27:02 2003 +++ 25-akpm/fs/ntfs/super.c Wed Feb 5 16:27:03 2003 @@ -505,8 +505,7 @@ hotfix_primary_boot_sector: memcpy(bh_primary->b_data, bh_backup->b_data, sb->s_blocksize); mark_buffer_dirty(bh_primary); - ll_rw_block(WRITE, 1, &bh_primary); - wait_on_buffer(bh_primary); + sync_dirty_buffer(bh_primary); if (buffer_uptodate(bh_primary)) { brelse(bh_backup); return bh_primary; diff -puN fs/qnx4/inode.c~ll_rw_block-fix fs/qnx4/inode.c --- 25/fs/qnx4/inode.c~ll_rw_block-fix Wed Feb 5 16:27:03 2003 +++ 25-akpm/fs/qnx4/inode.c Wed Feb 5 16:27:03 2003 @@ -44,8 +44,7 @@ int qnx4_sync_inode(struct inode *inode) bh = qnx4_update_inode(inode); if (bh && buffer_dirty(bh)) { - ll_rw_block(WRITE, 1, &bh); - wait_on_buffer(bh); + sync_dirty_buffer(bh); if (buffer_req(bh) && !buffer_uptodate(bh)) { printk ("IO error syncing qnx4 inode [%s:%08lx]\n", diff -puN fs/reiserfs/journal.c~ll_rw_block-fix fs/reiserfs/journal.c --- 25/fs/reiserfs/journal.c~ll_rw_block-fix Wed Feb 5 16:27:03 2003 +++ 25-akpm/fs/reiserfs/journal.c Wed Feb 5 16:27:03 2003 @@ -735,8 +735,7 @@ 
reiserfs_panic(s, "journal-539: flush_co } mark_buffer_dirty(jl->j_commit_bh) ; - ll_rw_block(WRITE, 1, &(jl->j_commit_bh)) ; - wait_on_buffer(jl->j_commit_bh) ; + sync_dirty_buffer(jl->j_commit_bh) ; if (!buffer_uptodate(jl->j_commit_bh)) { reiserfs_panic(s, "journal-615: buffer write failed\n") ; } @@ -828,8 +827,7 @@ static int _update_journal_header_block( jh->j_first_unflushed_offset = cpu_to_le32(offset) ; jh->j_mount_id = cpu_to_le32(SB_JOURNAL(p_s_sb)->j_mount_id) ; set_buffer_dirty(SB_JOURNAL(p_s_sb)->j_header_bh) ; - ll_rw_block(WRITE, 1, &(SB_JOURNAL(p_s_sb)->j_header_bh)) ; - wait_on_buffer((SB_JOURNAL(p_s_sb)->j_header_bh)) ; + sync_dirty_buffer(SB_JOURNAL(p_s_sb)->j_header_bh) ; if (!buffer_uptodate(SB_JOURNAL(p_s_sb)->j_header_bh)) { printk( "reiserfs: journal-837: IO error during journal replay\n" ); return -EIO ; diff -puN fs/reiserfs/resize.c~ll_rw_block-fix fs/reiserfs/resize.c --- 25/fs/reiserfs/resize.c~ll_rw_block-fix Wed Feb 5 16:27:03 2003 +++ 25-akpm/fs/reiserfs/resize.c Wed Feb 5 16:27:03 2003 @@ -120,8 +120,7 @@ int reiserfs_resize (struct super_block mark_buffer_dirty(bitmap[i].bh) ; set_buffer_uptodate(bitmap[i].bh); - ll_rw_block(WRITE, 1, &bitmap[i].bh); - wait_on_buffer(bitmap[i].bh); + sync_dirty_buffer(bitmap[i].bh); // update bitmap_info stuff bitmap[i].first_zero_hint=1; bitmap[i].free_count = sb_blocksize(sb) * 8 - 1; diff -puN fs/sysv/inode.c~ll_rw_block-fix fs/sysv/inode.c --- 25/fs/sysv/inode.c~ll_rw_block-fix Wed Feb 5 16:27:03 2003 +++ 25-akpm/fs/sysv/inode.c Wed Feb 5 16:27:03 2003 @@ -265,8 +265,7 @@ int sysv_sync_inode(struct inode * inode bh = sysv_update_inode(inode); if (bh && buffer_dirty(bh)) { - ll_rw_block(WRITE, 1, &bh); - wait_on_buffer(bh); + sync_dirty_buffer(bh); if (buffer_req(bh) && !buffer_uptodate(bh)) { printk ("IO error syncing sysv inode [%s:%08lx]\n", inode->i_sb->s_id, inode->i_ino); diff -puN fs/sysv/itree.c~ll_rw_block-fix fs/sysv/itree.c --- 25/fs/sysv/itree.c~ll_rw_block-fix Wed Feb 5 16:27:03 
2003 +++ 25-akpm/fs/sysv/itree.c Wed Feb 5 16:27:03 2003 @@ -15,10 +15,8 @@ enum {DIRECT = 10, DEPTH = 4}; /* Have t static inline void dirty_indirect(struct buffer_head *bh, struct inode *inode) { mark_buffer_dirty_inode(bh, inode); - if (IS_SYNC(inode)) { - ll_rw_block (WRITE, 1, &bh); - wait_on_buffer (bh); - } + if (IS_SYNC(inode)) + sync_dirty_buffer(bh); } static int block_to_path(struct inode *inode, long block, int offsets[DEPTH]) diff -puN fs/udf/inode.c~ll_rw_block-fix fs/udf/inode.c --- 25/fs/udf/inode.c~ll_rw_block-fix Wed Feb 5 16:27:03 2003 +++ 25-akpm/fs/udf/inode.c Wed Feb 5 16:27:03 2003 @@ -1520,8 +1520,7 @@ udf_update_inode(struct inode *inode, in mark_buffer_dirty(bh); if (do_sync) { - ll_rw_block(WRITE, 1, &bh); - wait_on_buffer(bh); + sync_dirty_buffer(bh); if (buffer_req(bh) && !buffer_uptodate(bh)) { printk("IO error syncing udf inode [%s:%08lx]\n", diff -puN fs/ufs/balloc.c~ll_rw_block-fix fs/ufs/balloc.c --- 25/fs/ufs/balloc.c~ll_rw_block-fix Wed Feb 5 16:27:03 2003 +++ 25-akpm/fs/ufs/balloc.c Wed Feb 5 16:27:03 2003 @@ -114,6 +114,7 @@ void ufs_free_fragments (struct inode * ubh_mark_buffer_dirty (USPI_UBH); ubh_mark_buffer_dirty (UCPI_UBH); if (sb->s_flags & MS_SYNCHRONOUS) { + ubh_wait_on_buffer (UCPI_UBH); ubh_ll_rw_block (WRITE, 1, (struct ufs_buffer_head **)&ucpi); ubh_wait_on_buffer (UCPI_UBH); } @@ -199,6 +200,7 @@ do_more: ubh_mark_buffer_dirty (USPI_UBH); ubh_mark_buffer_dirty (UCPI_UBH); if (sb->s_flags & MS_SYNCHRONOUS) { + ubh_wait_on_buffer (UCPI_UBH); ubh_ll_rw_block (WRITE, 1, (struct ufs_buffer_head **)&ucpi); ubh_wait_on_buffer (UCPI_UBH); } @@ -228,10 +230,8 @@ failed: memset (bh->b_data, 0, sb->s_blocksize); \ set_buffer_uptodate(bh); \ mark_buffer_dirty (bh); \ - if (IS_SYNC(inode)) { \ - ll_rw_block (WRITE, 1, &bh); \ - wait_on_buffer (bh); \ - } \ + if (IS_SYNC(inode)) \ + sync_dirty_buffer(bh); \ brelse (bh); \ } @@ -364,10 +364,8 @@ unsigned ufs_new_fragments (struct inode clear_buffer_dirty(bh); bh->b_blocknr = 
result + i; mark_buffer_dirty (bh); - if (IS_SYNC(inode)) { - ll_rw_block (WRITE, 1, &bh); - wait_on_buffer (bh); - } + if (IS_SYNC(inode)) + sync_dirty_buffer(bh); brelse (bh); } else @@ -459,6 +457,7 @@ unsigned ufs_add_fragments (struct inode ubh_mark_buffer_dirty (USPI_UBH); ubh_mark_buffer_dirty (UCPI_UBH); if (sb->s_flags & MS_SYNCHRONOUS) { + ubh_wait_on_buffer (UCPI_UBH); ubh_ll_rw_block (WRITE, 1, (struct ufs_buffer_head **)&ucpi); ubh_wait_on_buffer (UCPI_UBH); } @@ -584,6 +583,7 @@ succed: ubh_mark_buffer_dirty (USPI_UBH); ubh_mark_buffer_dirty (UCPI_UBH); if (sb->s_flags & MS_SYNCHRONOUS) { + ubh_wait_on_buffer (UCPI_UBH); ubh_ll_rw_block (WRITE, 1, (struct ufs_buffer_head **)&ucpi); ubh_wait_on_buffer (UCPI_UBH); } diff -puN fs/ufs/dir.c~ll_rw_block-fix fs/ufs/dir.c --- 25/fs/ufs/dir.c~ll_rw_block-fix Wed Feb 5 16:27:03 2003 +++ 25-akpm/fs/ufs/dir.c Wed Feb 5 16:27:03 2003 @@ -356,10 +356,8 @@ void ufs_set_link(struct inode *dir, str dir->i_version++; de->d_ino = cpu_to_fs32(dir->i_sb, inode->i_ino); mark_buffer_dirty(bh); - if (IS_DIRSYNC(dir)) { - ll_rw_block (WRITE, 1, &bh); - wait_on_buffer(bh); - } + if (IS_DIRSYNC(dir)) + sync_dirty_buffer(bh); brelse (bh); } @@ -457,10 +455,8 @@ int ufs_add_link(struct dentry *dentry, de->d_ino = cpu_to_fs32(sb, inode->i_ino); ufs_set_de_type(sb, de, inode->i_mode); mark_buffer_dirty(bh); - if (IS_DIRSYNC(dir)) { - ll_rw_block (WRITE, 1, &bh); - wait_on_buffer (bh); - } + if (IS_DIRSYNC(dir)) + sync_dirty_buffer(bh); brelse (bh); dir->i_mtime = dir->i_ctime = CURRENT_TIME; dir->i_version++; @@ -508,10 +504,8 @@ int ufs_delete_entry (struct inode * ino inode->i_ctime = inode->i_mtime = CURRENT_TIME; mark_inode_dirty(inode); mark_buffer_dirty(bh); - if (IS_DIRSYNC(inode)) { - ll_rw_block(WRITE, 1, &bh); - wait_on_buffer(bh); - } + if (IS_DIRSYNC(inode)) + sync_dirty_buffer(bh); brelse(bh); UFSD(("EXIT\n")) return 0; diff -puN fs/ufs/ialloc.c~ll_rw_block-fix fs/ufs/ialloc.c --- 25/fs/ufs/ialloc.c~ll_rw_block-fix 
Wed Feb 5 16:27:03 2003 +++ 25-akpm/fs/ufs/ialloc.c Wed Feb 5 16:27:03 2003 @@ -124,6 +124,7 @@ void ufs_free_inode (struct inode * inod ubh_mark_buffer_dirty (USPI_UBH); ubh_mark_buffer_dirty (UCPI_UBH); if (sb->s_flags & MS_SYNCHRONOUS) { + ubh_wait_on_buffer (UCPI_UBH); ubh_ll_rw_block (WRITE, 1, (struct ufs_buffer_head **) &ucpi); ubh_wait_on_buffer (UCPI_UBH); } @@ -248,6 +249,7 @@ cg_found: ubh_mark_buffer_dirty (USPI_UBH); ubh_mark_buffer_dirty (UCPI_UBH); if (sb->s_flags & MS_SYNCHRONOUS) { + ubh_wait_on_buffer (UCPI_UBH); ubh_ll_rw_block (WRITE, 1, (struct ufs_buffer_head **) &ucpi); ubh_wait_on_buffer (UCPI_UBH); } diff -puN fs/ufs/inode.c~ll_rw_block-fix fs/ufs/inode.c --- 25/fs/ufs/inode.c~ll_rw_block-fix Wed Feb 5 16:27:03 2003 +++ 25-akpm/fs/ufs/inode.c Wed Feb 5 16:27:03 2003 @@ -298,10 +298,8 @@ repeat: } mark_buffer_dirty(bh); - if (IS_SYNC(inode)) { - ll_rw_block (WRITE, 1, &bh); - wait_on_buffer (bh); - } + if (IS_SYNC(inode)) + sync_dirty_buffer(bh); inode->i_ctime = CURRENT_TIME; mark_inode_dirty(inode); out: @@ -635,10 +633,8 @@ static int ufs_update_inode(struct inode memset (ufs_inode, 0, sizeof(struct ufs_inode)); mark_buffer_dirty(bh); - if (do_sync) { - ll_rw_block (WRITE, 1, &bh); - wait_on_buffer (bh); - } + if (do_sync) + sync_dirty_buffer(bh); brelse (bh); UFSD(("EXIT\n")) diff -puN fs/ufs/truncate.c~ll_rw_block-fix fs/ufs/truncate.c --- 25/fs/ufs/truncate.c~ll_rw_block-fix Wed Feb 5 16:27:03 2003 +++ 25-akpm/fs/ufs/truncate.c Wed Feb 5 16:27:03 2003 @@ -284,6 +284,7 @@ next:; } } if (IS_SYNC(inode) && ind_ubh && ubh_buffer_dirty(ind_ubh)) { + ubh_wait_on_buffer (ind_ubh); ubh_ll_rw_block (WRITE, 1, &ind_ubh); ubh_wait_on_buffer (ind_ubh); } @@ -351,6 +352,7 @@ static int ufs_trunc_dindirect (struct i } } if (IS_SYNC(inode) && dind_bh && ubh_buffer_dirty(dind_bh)) { + ubh_wait_on_buffer (dind_bh); ubh_ll_rw_block (WRITE, 1, &dind_bh); ubh_wait_on_buffer (dind_bh); } @@ -415,6 +417,7 @@ static int ufs_trunc_tindirect (struct i } } if 
(IS_SYNC(inode) && tind_bh && ubh_buffer_dirty(tind_bh)) { + ubh_wait_on_buffer (tind_bh); ubh_ll_rw_block (WRITE, 1, &tind_bh); ubh_wait_on_buffer (tind_bh); } diff -puN include/linux/buffer_head.h~ll_rw_block-fix include/linux/buffer_head.h --- 25/include/linux/buffer_head.h~ll_rw_block-fix Wed Feb 5 16:27:03 2003 +++ 25-akpm/include/linux/buffer_head.h Wed Feb 5 16:27:03 2003 @@ -169,6 +169,7 @@ struct buffer_head *alloc_buffer_head(vo void free_buffer_head(struct buffer_head * bh); void FASTCALL(unlock_buffer(struct buffer_head *bh)); void ll_rw_block(int, int, struct buffer_head * bh[]); +void sync_dirty_buffer(struct buffer_head *bh); int submit_bh(int, struct buffer_head *); void write_boundary_block(struct block_device *bdev, sector_t bblock, unsigned blocksize); diff -puN include/linux/hfs_sysdep.h~ll_rw_block-fix include/linux/hfs_sysdep.h --- 25/include/linux/hfs_sysdep.h~ll_rw_block-fix Wed Feb 5 16:27:03 2003 +++ 25-akpm/include/linux/hfs_sysdep.h Wed Feb 5 16:27:03 2003 @@ -155,13 +155,8 @@ static inline void hfs_buffer_dirty(hfs_ } static inline void hfs_buffer_sync(hfs_buffer buffer) { - while (buffer_locked(buffer)) { - wait_on_buffer(buffer); - } - if (buffer_dirty(buffer)) { - ll_rw_block(WRITE, 1, &buffer); - wait_on_buffer(buffer); - } + if (buffer_dirty(buffer)) + sync_dirty_buffer(buffer); } static inline void *hfs_buffer_data(const hfs_buffer buffer) { diff -puN kernel/ksyms.c~ll_rw_block-fix kernel/ksyms.c --- 25/kernel/ksyms.c~ll_rw_block-fix Wed Feb 5 16:27:03 2003 +++ 25-akpm/kernel/ksyms.c Wed Feb 5 16:27:03 2003 @@ -208,6 +208,7 @@ EXPORT_SYMBOL(close_bdev_excl); EXPORT_SYMBOL(__brelse); EXPORT_SYMBOL(__bforget); EXPORT_SYMBOL(ll_rw_block); +EXPORT_SYMBOL(sync_dirty_buffer); EXPORT_SYMBOL(submit_bh); EXPORT_SYMBOL(unlock_buffer); EXPORT_SYMBOL(__wait_on_buffer); _ ^ permalink raw reply [flat|nested] 16+ messages in thread
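[Editorial note] The semantics the patch above changes are easiest to see outside the kernel: ll_rw_block() merely try-locks each buffer and silently skips any that is already under I/O, while sync_dirty_buffer() first waits out the in-flight write before submitting its own. Below is a deterministic, single-threaded sketch of the two disciplines; the toy_* structure and helpers are invented for illustration and are not kernel interfaces.

```c
#include <stdbool.h>

/* Toy buffer: a lock bit, a dirty bit, and a count of completed writes.
 * All names here are illustrative; none of this is the kernel API. */
struct toy_buf {
	bool locked;	/* a write I/O is in flight */
	bool dirty;	/* in-core changes not yet on disk */
	int  writes;	/* writes that actually reached "disk" */
};

/* I/O completion: the in-flight write finishes and drops the lock. */
static void toy_io_complete(struct toy_buf *b)
{
	b->writes++;
	b->locked = false;
}

/* Models 2.4 ll_rw_block(WRITE): buffer already locked => skip it. */
static void toy_ll_rw_block(struct toy_buf *b)
{
	if (b->locked)
		return;			/* the silent skip behind the fsync race */
	if (b->dirty) {
		b->locked = true;	/* start the write... */
		b->dirty = false;
		toy_io_complete(b);	/* ...and let it finish immediately */
	}
}

/* Models sync_dirty_buffer(): wait out any in-flight write first. */
static void toy_sync_dirty_buffer(struct toy_buf *b)
{
	if (b->locked)
		toy_io_complete(b);	/* lock_buffer(): block until I/O is done */
	if (b->dirty) {
		b->locked = true;
		b->dirty = false;
		toy_io_complete(b);
	}
}
```

Replaying the race reported at the head of the thread: fsync #1 has a write in flight (buffer locked) when fsync #2's inode changes re-dirty the buffer. The trylock flavour returns without doing anything, so the second inode's changes never reach disk; the blocking flavour waits and then writes them out.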
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race 2003-02-10 20:40 ` Andrew Morton @ 2003-02-10 21:18 ` Andrea Arcangeli 2003-02-10 21:44 ` Andrew Morton 0 siblings, 1 reply; 16+ messages in thread From: Andrea Arcangeli @ 2003-02-10 21:18 UTC (permalink / raw) To: Andrew Morton; +Cc: Linus Torvalds, mikulas, pavel, linux-kernel On Mon, Feb 10, 2003 at 12:40:00PM -0800, Andrew Morton wrote: > void sync_dirty_buffer(struct buffer_head *bh) > { > lock_buffer(bh); > if (test_clear_buffer_dirty(bh)) { > get_bh(bh); > bh->b_end_io = end_buffer_io_sync; > submit_bh(WRITE, bh); > } else { > unlock_buffer(bh); > } > } If we don't take the lock around mark_buffer_dirty as Linus suggested (to avoid serializing in the non-sync case), why don't you simply add lock_buffer() to ll_rw_block() as we suggested originally, and #define sync_dirty_buffer as ll_rw_block+wait_on_buffer if you really want to make the cleanup? There would be no difference. I don't see the need for the above specialized ll_rw_block-cloned function just for the O_SYNC usage; O_SYNC isn't that special a case. lock_buffer in ll_rw_block makes more sense to me, leaving the test-and-set-bit in there, and having a secondary ll_rw_block with different behaviour just for O_SYNC doesn't look that clean to me. Especially in 2.4 I wouldn't like to make the below change, which is 100% equivalent to a one-liner patch that just adds lock_buffer() instead of the test-and-set-bit (for reads I see no problems either). BTW, Linus's suggestion of taking the lock around the data modifications (unconditionally) would also enforce metadata coherency, so it would provide an additional coherency guarantee (but it's not directly related to this problem and it may be overkill). Normally we always allow in-core modifications of the buffer during write-IO to disk (also for the data in pagecache). 
Only the journal commits must be very careful in avoiding that (just as applications must be careful to run fsync and not overwrite the data during the fsync). So normally, taking the lock around the in-core modification and mark_buffer_dirty would be overkill IMHO. Andrea ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race 2003-02-10 21:18 ` Andrea Arcangeli @ 2003-02-10 21:44 ` Andrew Morton 2003-02-10 21:59 ` Andrea Arcangeli 0 siblings, 1 reply; 16+ messages in thread From: Andrew Morton @ 2003-02-10 21:44 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: torvalds, mikulas, pavel, linux-kernel Andrea Arcangeli <andrea@suse.de> wrote: > > On Mon, Feb 10, 2003 at 12:40:00PM -0800, Andrew Morton wrote: > > void sync_dirty_buffer(struct buffer_head *bh) > > { > > lock_buffer(bh); > > if (test_clear_buffer_dirty(bh)) { > > get_bh(bh); > > bh->b_end_io = end_buffer_io_sync; > > submit_bh(WRITE, bh); > > } else { > > unlock_buffer(bh); > > } > > } > > If you we don't take the lock around the mark_dirty_buffer as Linus > suggested (to avoid serializing in the non-sync case), why don't you > simply add lock_buffer() to ll_rw_block() as we suggested originally That is undesirable for READA. > and > you #define sync_dirty_buffer as ll_rw_block+wait_on_buffer if you > really want to make the cleanup? Linux 2.4 tends to contain costly confusion between writeout for memory cleansing and writeout for data integrity. In 2.5 I have been trying to make it very clear and explicit that these are fundamentally different things. > ... > Especially in 2.4 I wouldn't like to make the below change that is > 100% equivalent to a one liner patch that just adds lock_buffer() > instead of the test-and-set-bit (for reads I see no problems either). That'd probably be OK, with a dont-do-that for READA. > BTW, Linus's way that suggests the lock around the data modifications > (unconditionally), would also enforce metadata coherency so it would > provide an additional coherency guarantee (but it's not directly related > to this problem and it may be overkill). Normally we always allow > in-core modifications of the buffer during write-IO to disk (also for > the data in pagecache). 
Only the journal commits must be very careful in > avoiding that (like applications must be careful to run fsync and not to > overwrite the data during the fsync). So normally taking the lock around > the in-core modification and mark_buffer_dirty, would be overkill IMHO. Yup. Except for a non-uptodate buffer. If software is bringing a non-uptodate buffer uptodate by hand it should generally be locked, else a concurrent read may stomp on the changes. There are few places where this happens. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race 2003-02-10 21:44 ` Andrew Morton @ 2003-02-10 21:59 ` Andrea Arcangeli 2003-03-11 13:58 ` Andrea Arcangeli 0 siblings, 1 reply; 16+ messages in thread From: Andrea Arcangeli @ 2003-02-10 21:59 UTC (permalink / raw) To: Andrew Morton; +Cc: torvalds, mikulas, pavel, linux-kernel On Mon, Feb 10, 2003 at 01:44:34PM -0800, Andrew Morton wrote: > Andrea Arcangeli <andrea@suse.de> wrote: > > > > On Mon, Feb 10, 2003 at 12:40:00PM -0800, Andrew Morton wrote: > > > void sync_dirty_buffer(struct buffer_head *bh) > > > { > > > lock_buffer(bh); > > > if (test_clear_buffer_dirty(bh)) { > > > get_bh(bh); > > > bh->b_end_io = end_buffer_io_sync; > > > submit_bh(WRITE, bh); > > > } else { > > > unlock_buffer(bh); > > > } > > > } > > > > If you we don't take the lock around the mark_dirty_buffer as Linus > > suggested (to avoid serializing in the non-sync case), why don't you > > simply add lock_buffer() to ll_rw_block() as we suggested originally > > That is undesirable for READA. in 2.4 we killed READA a few releases ago because of races: case READA: #if 0 /* bread() misinterprets failed READA attempts as IO errors on SMP */ rw_ahead = 1; #endif so I wasn't focusing on it. > > and > > you #define sync_dirty_buffer as ll_rw_block+wait_on_buffer if you > > really want to make the cleanup? > > Linux 2.4 tends to contain costly confusion between writeout for memory > cleansing and writeout for data integrity. > > In 2.5 I have been trying to make it very clear and explicit that these are > fundamentally different things. yes, and these are in the memory cleansing area (only the journal commits [and somehow the superblock again in journaling] need to be writeouts for data integrity). Data writeout doesn't need to provide data integrity either; userspace has to take care of that.
> > Especially in 2.4 I wouldn't like to make the below change that is > > 100% equivalent to a one liner patch that just adds lock_buffer() > > instead of the test-and-set-bit (for reads I see no problems either). > > That'd probably be OK, with a dont-do-that for READA. Not an issue in 2.4. In 2.5 the bio layer still provides reada though, so it would be wrong to wait_on_buffer there. > > BTW, Linus's way that suggests the lock around the data modifications > > (unconditionally), would also enforce metadata coherency so it would > > provide an additional coherency guarantee (but it's not directly related > > to this problem and it may be overkill). Normally we always allow > > in-core modifications of the buffer during write-IO to disk (also for > > the data in pagecache). Only the journal commits must be very careful in > avoiding that (like applications must be careful to run fsync and not to > overwrite the data during the fsync). So normally taking the lock around > the in-core modification and mark_buffer_dirty, would be overkill IMHO. > > Yup. Except for a non-uptodate buffer. If software is bringing a > non-uptodate buffer uptodate by hand it should generally be locked, else a > concurrent read may stomp on the changes. There are few places where this > happens. I recall I fixed all those cases where we write a non-uptodate buffer under reads in ext2 in 2.2, with proper wait_on_buffers before modifying the buffer. That was enough to guarantee any underlying read would complete before we were going to touch the buffer. However, today with threading (if they're not protected against reads by the big kernel lock) they should all be converted to lock_buffer(). They were easy to spot by grepping for getblk, IIRC. Andrea ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race 2003-02-10 21:59 ` Andrea Arcangeli @ 2003-03-11 13:58 ` Andrea Arcangeli 2003-03-14 6:42 ` Andrea Arcangeli 0 siblings, 1 reply; 16+ messages in thread From: Andrea Arcangeli @ 2003-03-11 13:58 UTC (permalink / raw) To: Andrew Morton; +Cc: torvalds, mikulas, pavel, linux-kernel On Mon, Feb 10, 2003 at 10:59:40PM +0100, Andrea Arcangeli wrote: > On Mon, Feb 10, 2003 at 01:44:34PM -0800, Andrew Morton wrote: > > Andrea Arcangeli <andrea@suse.de> wrote: > > > > > > On Mon, Feb 10, 2003 at 12:40:00PM -0800, Andrew Morton wrote: > > > > void sync_dirty_buffer(struct buffer_head *bh) > > > > { > > > > lock_buffer(bh); > > > > if (test_clear_buffer_dirty(bh)) { > > > > get_bh(bh); > > > > bh->b_end_io = end_buffer_io_sync; > > > > submit_bh(WRITE, bh); > > > > } else { > > > > unlock_buffer(bh); > > > > } > > > > } > > > > > > If you we don't take the lock around the mark_dirty_buffer as Linus > > > suggested (to avoid serializing in the non-sync case), why don't you > > > simply add lock_buffer() to ll_rw_block() as we suggested originally > > > > That is undesirable for READA. > > in 2.4 we killed READA some release ago becuse of races: > > case READA: > #if 0 /* bread() misinterprets failed READA attempts as IO errors on SMP */ > rw_ahead = 1; > #endif > > so I wasn't focusing on it. > > > > and > > > you #define sync_dirty_buffer as ll_rw_block+wait_on_buffer if you > > > really want to make the cleanup? > > > > Linux 2.4 tends to contain costly confusion between writeout for memory > > cleansing and writeout for data integrity. > > > > In 2.5 I have been trying to make it very clear and explicit that these are > > fundamentally different things. > > yes, and these are in the memory cleansing area (only the journal > commits [and somehow the superblock again in journaling] needs to be > writeouts for data integrity). Data doesn't need to provide data > integrity either, userspace has to care about that. 
I've been experimenting with the lock_buffer() in ll_rw_block for a month (not only for the sync writes as in Andrew's patch, but for everything including reads), and I believe this is what triggered a deadlock on my most stressed machine last night. I attached below the best debug info I have collected so far (I have the full SYSRQ+T too, of course, but the below should make it easier to focus on the right place). I'm going to back out that patch (http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21pre4aa3/00_ll_rw_block-sync-race-1) until this issue is resolved. One simple case where it could introduce a deadlock is if a locked buffer visible to the rest of the fs/vm isn't immediately inserted in the I/O queue: if two tasks run that code at the same time, they can deadlock against each other. Another possibility is that the journal is running ll_rw_block on a locked buffer not yet in the I/O queue, and expects not to deadlock for whatever reason. In all cases I believe this is an issue in ext3 that had better be fixed, but it isn't necessarily a bug once the 00_ll_rw_block patch is backed out. Note: it may be something else in my tree that deadlocked the box, but I really doubt it could be something else; also see the wait_on_buffer from ll_rw_block in the first line, which makes it quite obvious that the new behaviour is happening during the deadlock. Any help from the ext3 developers and everybody involved in this thread (the 2.0, 2.2, 2.4 fsync races) is welcome, thanks! 
(of course the vfs down shouldn't matter, the wait_on_buffers are interesting here, but since they were in D state I didn't cut them out) mutt D 00000292 0 14627 24890 (NOTLB) Call Trace: [__wait_on_buffer+86/144] [ext3_free_inode+391/1248] [ll_rw_block+309/464] [ext3_find_entry+828/864] [ext3_lookup+65/208] [lookup_hash+194/288] [sys_unlink+139/288] [system_call+51/56] tee D ECD560C0 0 111 11445 110 (NOTLB) Call Trace: [journal_dirty_metadata+420/608] [__wait_on_buffer+86/144] [__cleanup_transaction+362/368] [journal_cancel_revoke+91/448] [log_do_checkpoint+299/448] [n_tty_receive_buf+400/1808] [journal_dirty_metadata+420/608] [n_tty_receive_buf+400/1808] [journal_dirty_metadata+420/608] [ext3_do_update_inode+375/992] [__jbd_kmalloc+35/176] [log_wait_for_space+130/144] [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [ext3_prepare_write+567/576] [write_chan+335/512] [generic_file_write_nolock+1023/2032] [pipe_wait+125/160] [generic_file_write+77/112] [ext3_file_write+57/224] [sys_write+163/336] [system_call+51/56] python D 00000180 0 21334 112 21336 21335 (NOTLB) Call Trace: [start_request+409/544] [__down+105/192] [__down_failed+8/12] [.text.lock.checkpoint+15/173] [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [ext3_lookup+0/208] [ext3_create+300/320] [vfs_create+153/304] [open_namei+1301/1456] [filp_open+62/112] [sys_open+83/192] [system_call+51/56] python D E2EB302C 2388 21335 112 21337 21334 (NOTLB) Call Trace: [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112] [sys_open+83/192] [system_call+51/56] procmail D E51AC016 0 21378 1 19102 (NOTLB) Call Trace: [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112] [sys_open+83/192] [system_call+51/56] python D C01186FF 0 21470 107 21471 (NOTLB) Call Trace: [do_page_fault+463/1530] [__down+105/192] [__jbd_kmalloc+35/176] [__down_failed+8/12] [.text.lock.checkpoint+15/173] 
[start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [do_wp_page+416/1088] [ext3_dirty_inode+336/368] [__mark_inode_dirty+195/208] [update_inode_times+39/64] [generic_file_write_nolock+630/2032] [generic_file_write+77/112] [ext3_file_write+57/224] [sys_write+163/336] [system_call+51/56] python D C01186FF 0 21471 107 21470 (NOTLB) Call Trace: [do_page_fault+463/1530] [__down+105/192] [__jbd_kmalloc+35/176] [__down_failed+8/12] [.text.lock.checkpoint+15/173] [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [do_wp_page+416/1088] [ext3_dirty_inode+336/368] [__mark_inode_dirty+195/208] [update_inode_times+39/64] [generic_file_write_nolock+630/2032] [generic_file_write+77/112] [ext3_file_write+57/224] [sys_write+163/336] [system_call+51/56] python D 000001E0 0 21732 21731 (NOTLB) Call Trace: [bread+32/160] [__down+105/192] [__jbd_kmalloc+35/176] [__down_failed+8/12] [.text.lock.checkpoint+15/173] [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [dput+33/320] [ext3_unlink+440/448] [vfs_unlink+233/480] [sys_unlink+183/288] [system_call+51/56] procmail D D0E13016 0 21914 21913 (NOTLB) Call Trace: [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112] [sys_open+83/192] [system_call+51/56] procmail D E39F3016 0 21999 21998 (NOTLB) Call Trace: [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112] [sys_open+83/192] [system_call+51/56] Andrea ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race 2003-03-11 13:58 ` Andrea Arcangeli @ 2003-03-14 6:42 ` Andrea Arcangeli 0 siblings, 0 replies; 16+ messages in thread From: Andrea Arcangeli @ 2003-03-14 6:42 UTC (permalink / raw) To: Andrew Morton; +Cc: torvalds, mikulas, pavel, linux-kernel Hi! > mutt D 00000292 0 14627 24890 (NOTLB) > Call Trace: [__wait_on_buffer+86/144] [ext3_free_inode+391/1248] [ll_rw_block+309/464] [ext3_find_entry+828/864] [ext3_lookup+65/208] > [lookup_hash+194/288] [sys_unlink+139/288] [system_call+51/56] > tee D ECD560C0 0 111 11445 110 (NOTLB) > Call Trace: [journal_dirty_metadata+420/608] [__wait_on_buffer+86/144] [__cleanup_transaction+362/368] [journal_cancel_revoke+91/448] [log_do_checkpoint+299/448] > [n_tty_receive_buf+400/1808] [journal_dirty_metadata+420/608] [n_tty_receive_buf+400/1808] [journal_dirty_metadata+420/608] [ext3_do_update_inode+375/992] [__jbd_kmalloc+35/176] > [log_wait_for_space+130/144] [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [ext3_prepare_write+567/576] [write_chan+335/512] > [generic_file_write_nolock+1023/2032] [pipe_wait+125/160] [generic_file_write+77/112] [ext3_file_write+57/224] [sys_write+163/336] [system_call+51/56] > python D 00000180 0 21334 112 21336 21335 (NOTLB) > Call Trace: [start_request+409/544] [__down+105/192] [__down_failed+8/12] [.text.lock.checkpoint+15/173] [start_this_handle+191/864] > [new_handle+75/112] [journal_start+165/208] [ext3_lookup+0/208] [ext3_create+300/320] [vfs_create+153/304] [open_namei+1301/1456] > [filp_open+62/112] [sys_open+83/192] [system_call+51/56] > python D E2EB302C 2388 21335 112 21337 21334 (NOTLB) > Call Trace: [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112] > [sys_open+83/192] [system_call+51/56] > procmail D E51AC016 0 21378 1 19102 (NOTLB) > Call Trace: [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112] > [sys_open+83/192] 
[system_call+51/56] > python D C01186FF 0 21470 107 21471 (NOTLB) > Call Trace: [do_page_fault+463/1530] [__down+105/192] [__jbd_kmalloc+35/176] [__down_failed+8/12] [.text.lock.checkpoint+15/173] > [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [do_wp_page+416/1088] [ext3_dirty_inode+336/368] [__mark_inode_dirty+195/208] > [update_inode_times+39/64] [generic_file_write_nolock+630/2032] [generic_file_write+77/112] [ext3_file_write+57/224] [sys_write+163/336] [system_call+51/56] > python D C01186FF 0 21471 107 21470 (NOTLB) > Call Trace: [do_page_fault+463/1530] [__down+105/192] [__jbd_kmalloc+35/176] [__down_failed+8/12] [.text.lock.checkpoint+15/173] > [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [do_wp_page+416/1088] [ext3_dirty_inode+336/368] [__mark_inode_dirty+195/208] > [update_inode_times+39/64] [generic_file_write_nolock+630/2032] [generic_file_write+77/112] [ext3_file_write+57/224] [sys_write+163/336] [system_call+51/56] > python D 000001E0 0 21732 21731 (NOTLB) > Call Trace: [bread+32/160] [__down+105/192] [__jbd_kmalloc+35/176] [__down_failed+8/12] [.text.lock.checkpoint+15/173] > [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [dput+33/320] [ext3_unlink+440/448] [vfs_unlink+233/480] > [sys_unlink+183/288] [system_call+51/56] > procmail D D0E13016 0 21914 21913 (NOTLB) > Call Trace: [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112] > [sys_open+83/192] [system_call+51/56] > procmail D E39F3016 0 21999 21998 (NOTLB) > Call Trace: [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112] > [sys_open+83/192] [system_call+51/56] never mind, I found the real bug and it's not the fsync fix or in ext3. 
This other more recent cleanedup trace made the real bug more obvious since less stuff was involved on the fs side (because it wasn't an fs problem after all): gzip D CEAE5740 1616 1234 1229 (NOTLB) Call Trace: [ext3_new_block+1062/2448] [do_schedule+284/416] [__wait_on_buffer+84/144] [__cleanup_transaction+252/256] [log_do_checkpoint+286/432] [ide_build_sglist+156/512] [__ide_dma_begin+57/80] [__ide_dma_count+22/32] [__ide_dma_write+326/352] [dma_timer_expiry+0/224] [do_rw_disk+884/1712] [context_switch+131/320] [ide_wait_stat+159/304] [start_request+409/544] [__jbd_kmalloc+35/176] [log_wait_for_space+108/128] [start_this_handle+184/848] [new_handle+75/112] [journal_start+165/208] [__alloc_pages+100/624] [ext3_prepare_write+336/352] [lru_cache_add+99/112] [generic_file_write_nolock+999/1856] [generic_file_write+76/112] [ext3_file_write+57/224] [sys_write+163/304] [system_call+51/56] shutdown D DFE7D940 0 2308 1 2309 2169 (NOTLB) Call Trace: [do_schedule+284/416] [__wait_on_buffer+84/144] [__refile_buffer+86/112] [wait_for_buffers+172/176] [wait_for_locked_buffers+35/48] [sync_buffers+106/112] [fsync_dev+68/80] [sys_sync+15/32] [system_call+51/56] sshd D D0CC93DC 0 2338 601 2339 2340 1226 (NOTLB) Call Trace: [do_schedule+284/416] [__down+90/160] [__jbd_kmalloc+35/176] [__down_failed+8/12] [.text.lock.checkpoint+15/109] [start_this_handle+184/848] [new_handle+75/112] [journal_start+165/208] [ext3_dirty_inode+276/288] [__mark_inode_dirty+150/160] [update_atime+94/96] [generic_file_read+174/352] [file_read_actor+0/144] [sys_read+163/304] [sys_fstat64+73/128] [system_call+51/56] sync D DFF858C0 80 2422 2394 (NOTLB) Call Trace: [do_schedule+284/416] [__wait_on_buffer+84/144] [wait_for_buffers+172/176] [wait_for_locked_buffers+35/48] [sync_buffers+106/112] [fsync_dev+68/80] [sys_sync+15/32] [system_call+51/56] sshd waits for gzip on the journal lock, gzip and shutdown waits on a buffer that was supposed to be submitted by `sync`, but really `sync` never submitted it, it 
only locked it. This bug is present only in 2.4.21pre4aa3 (previous releases didn't have this problem). This was extremely hard to trigger, so it's not surprising few people noticed it, especially on fast machines. The lowlatency improvements in write_some_buffers zeroed out 'count' at the second pass, and so we lost track of those bh if need_resched was set during the write_some_buffers pass. This will be fixed in my next tree, which I can now comfortably release back in rock-solid state, with the fsync-race fix re-added too ;) Here is the untested fix against 2.4.21pre4aa3 that I will include in 2.4.21pre5aa1:

--- x/fs/buffer.c.~1~	2003-03-13 06:15:29.000000000 +0100
+++ x/fs/buffer.c	2003-03-14 07:36:40.000000000 +0100
@@ -236,10 +236,10 @@ static int write_some_buffers(kdev_t dev
 	unsigned int count;
 	int nr, num_sched = 0;
 
+	count = 0;
 restart:
 	next = lru_list[BUF_DIRTY];
 	nr = nr_buffers_type[BUF_DIRTY];
-	count = 0;
 	while (next && --nr >= 0) {
 		struct buffer_head * bh = next;
 		next = bh->b_next_free;

Sorry for the noise. Andrea ^ permalink raw reply [flat|nested] 16+ messages in thread
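[Editorial note] The one-liner above is subtler than it looks, so here is a miniature model of it: write_some_buffers() collects dirty buffers into a batch and restarts its scan after submitting each batch (or when need_resched intervenes); zeroing the batch counter on the restart path drops any partially collected batch on the floor, which is how `sync` ended up locking buffers it never submitted. The sketch below is standalone and purely illustrative — BATCH, flush_dirty() and the need_resched stand-in are invented names, not the kernel code.

```c
#define BATCH 4			/* buffers submitted per pass */

/* Toy version of the batching loop in 2.4's write_some_buffers().
 * reset_on_restart == 1 models the 2.4.21pre4aa3 bug: zeroing the
 * batch counter on the restart path forgets buffers already pulled
 * off the dirty list but not yet counted as submitted. */
static int flush_dirty(int ndirty, int reset_on_restart)
{
	int flushed = 0;
	int count = 0;			/* the fix: initialise once, up front */

restart:
	if (reset_on_restart)
		count = 0;		/* the bug: restart forgets the batch */
	while (ndirty > 0) {
		ndirty--;
		count++;
		if (count == BATCH) {	/* batch full: submit it and restart */
			flushed += count;
			count = 0;
			goto restart;
		}
		if (ndirty == 5)	/* stand-in for need_resched firing   */
			goto restart;	/* mid-batch, with buffers collected  */
	}
	flushed += count;		/* submit whatever is left over */
	return flushed;
}
```

With ten dirty buffers, the fixed variant accounts for all ten; the buggy variant loses the one buffer collected before the simulated reschedule, leaving it "locked but never submitted".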
end of thread, other threads:[~2003-03-14 6:32 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2003-02-02 23:32 2.0, 2.2, 2.4, 2.5: fsync buffer race Mikulas Patocka 2003-02-03 0:00 ` Andrew Morton 2003-02-03 1:13 ` Mikulas Patocka 2003-02-03 1:20 ` Andrew Morton 2003-02-03 9:29 ` Mikulas Patocka 2003-02-04 23:16 ` Pavel Machek 2003-02-05 15:13 ` Mikulas Patocka 2003-02-10 13:07 ` Andrea Arcangeli 2003-02-10 16:28 ` Mikulas Patocka 2003-02-10 16:57 ` Linus Torvalds 2003-02-10 20:40 ` Andrew Morton 2003-02-10 21:18 ` Andrea Arcangeli 2003-02-10 21:44 ` Andrew Morton 2003-02-10 21:59 ` Andrea Arcangeli 2003-03-11 13:58 ` Andrea Arcangeli 2003-03-14 6:42 ` Andrea Arcangeli