* kjournald() with DIO
@ 2005-09-12 23:23 Badari Pulavarty
2005-09-12 23:37 ` Andrew Morton
0 siblings, 1 reply; 29+ messages in thread
From: Badari Pulavarty @ 2005-09-12 23:23 UTC (permalink / raw)
To: linux-fsdevel; +Cc: akpm, sct
Hi,
I have been chasing a race condition, which leads to
returning premature EIO with DIO (iozone tests) on
ext3 filesystems.
It seems to be a race between kjournald() & DIO process
invalidating the page. Is this a known issue ?
Thanks,
Badari
Here is the race:
kjournald submitted buffers for IO and is waiting for them to finish.
Note that it holds a reference against each buffer.
journal_commit_transaction()
...
	submitted buffers for IO
/* Waiting for IO to complete */
while (commit_transaction->t_locked_list) {
...
get_bh(bh);
if (buffer_locked(bh)) {
spin_unlock(&journal->j_list_lock);
wait_on_buffer(bh); <<----------
spin_lock(&journal->j_list_lock);
}
..
put_bh(bh);
}
Now, the DIO process comes in and frees the jh through
journal_try_to_free_buffers(), but drop_buffers() fails since
kjournald() holds a reference against the buffer.
invalidate_complete_page()
..
ext3_releasepage()
journal_try_to_free_buffers()
journal_put_journal_head()
__journal_try_to_free_buffer()
<<--- freed jh
try_to_free_buffers()
drop_buffers()
if (buffer_busy(bh))
goto failed;
<<--- returns EIO due to b_count
^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kjournald() with DIO
  2005-09-12 23:23 kjournald() with DIO Badari Pulavarty
@ 2005-09-12 23:37 ` Andrew Morton
  2005-09-13  0:06   ` Badari Pulavarty
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2005-09-12 23:37 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: linux-fsdevel, sct

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> I have been chasing a race condition, which leads to
> returning premature EIO with DIO (iozone tests) on
> ext3 filesystems.

In Linux, I assume ;)  Any particular version?

> It seems to be a race between kjournald() & DIO process
> invalidating the page. Is this a known issue ?

Sort-of.  Jan Kara has suggested a fix for this.

Go into block_write_full_page(), replace the call to
block_invalidatepage() with a call to do_invalidatepage(), please.
* Re: kjournald() with DIO
  2005-09-12 23:37 ` Andrew Morton
@ 2005-09-13  0:06 ` Badari Pulavarty
  2005-09-13  0:29   ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Badari Pulavarty @ 2005-09-13 0:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, sct

On Mon, 2005-09-12 at 16:37 -0700, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > I have been chasing a race condition, which leads to
> > returning premature EIO with DIO (iozone tests) on
> > ext3 filesystems.
>
> In Linux, I assume ;)  Any particular version?

I am working with a SLES test kernel (SLES9 SP2), but it has most DIO
fixes from mainline.

> > It seems to be a race between kjournald() & DIO process
> > invalidating the page. Is this a known issue ?
>
> Sort-of.  Jan Kara has suggested a fix for this.
>
> Go into block_write_full_page(), replace the call to block_invalidatepage()
> with a call to do_invalidatepage(), please.

Didn't help.  How can this help?  The DIO code is kicking back to
buffered mode, since it's trying to fill in the holes.  The fix you
suggested might have helped if the IO is beyond the file size, or if the
file size does not end on a page boundary.  What am I missing?

Thanks,
Badari
* Re: kjournald() with DIO
  2005-09-13  0:06 ` Badari Pulavarty
@ 2005-09-13  0:29 ` Andrew Morton
  2005-09-13 16:52   ` Badari Pulavarty
                     ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Andrew Morton @ 2005-09-13 0:29 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: linux-fsdevel, sct

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> Didn't help. How can this help ?

Oh well.  Different bug.  See
http://bugzilla.kernel.org/show_bug.cgi?id=4964

> DIO code is kicking back to buffered mode, since its trying to
> fill-in the holes. The fix you suggested might have helped, if
> IO is beyond file size or file size not ending on page boundary.
> What am I missing ?

Have you spoken with Mingming about this?  She's chasing a similar race.

She's currently working on getting the ext3-debug patch working so we
can find out how those buffers (which apparently aren't even attached to
the journal) came to have an elevated refcount.

Apparently the debug patch is currently oopsing.  We need to fix that.
* Re: kjournald() with DIO
  2005-09-13  0:29 ` Andrew Morton
@ 2005-09-13 16:52 ` Badari Pulavarty
  2005-09-13 23:07   ` Andrew Morton
  2005-09-13 17:53 ` Mingming Cao
  2005-09-16 13:42 ` Stephen C. Tweedie
  2 siblings, 1 reply; 29+ messages in thread
From: Badari Pulavarty @ 2005-09-13 16:52 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, sct

On Mon, 2005-09-12 at 17:29 -0700, Andrew Morton wrote:
> > DIO code is kicking back to buffered mode, since its trying to
> > fill-in the holes. The fix you suggested might have helped, if
> > IO is beyond file size or file size not ending on page boundary.
> > What am I missing ?
>
> Have you spoken with Mingming about this?  She's chasing a similar race.
> She's currently working on getting the ext3-debug patch working so we can
> find out how those buffers (which apparently aren't even attached to the
> journal) came to have an elevated refcount.

I am helping her on the same bug.

The buffers were attached to the journal (earlier).  kjournald()
submitted them for IO and is waiting for the IO to finish, so it holds a
reference count on them.

journal_commit_transaction()
...
	submitted buffers for IO
	/* Waiting for IO to complete */
	while (commit_transaction->t_locked_list) {
	...
		get_bh(bh);
		if (buffer_locked(bh)) {
			spin_unlock(&journal->j_list_lock);
			wait_on_buffer(bh);	<<<<<<

In the meanwhile, the DIO process, through releasepage, invalidated the
page, released the journal head attached to the buffer and tried to drop
the buffer.  But the buffer drop fails due to the kjournald reference
count.

invalidate_complete_page():
..
	ext3_releasepage()
		journal_try_to_free_buffers()
			journal_put_journal_head()
			__journal_try_to_free_buffer()
				<--- freed jh
		try_to_free_buffers()
			drop_buffers()
				if (buffer_busy(bh))
					goto failed;	<<--

Thanks,
Badari
* Re: kjournald() with DIO
  2005-09-13 16:52 ` Badari Pulavarty
@ 2005-09-13 23:07 ` Andrew Morton
  2005-09-14 17:23   ` Mingming Cao
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2005-09-13 23:07 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: linux-fsdevel, sct

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> The buffers were attached to the journal (earlier). kjournald()
> submitted them for IO and is waiting for the IO to finish, so it holds
> a reference count on them.
>
> journal_commit_transaction()
> ...
> 	submitted buffers for IO
> 	/* Waiting for IO to complete */
> 	while (commit_transaction->t_locked_list) {
> 	...
> 		get_bh(bh);
> 		if (buffer_locked(bh)) {
> 			spin_unlock(&journal->j_list_lock);
> 			wait_on_buffer(bh);	<<<<<<
>
> In the meanwhile, the DIO process, through releasepage, invalidated the
> page, released the journal head attached to the buffer and tried to
> drop the buffer.  But the buffer drop fails due to the kjournald
> reference count.
>
> invalidate_complete_page():
> ..
> 	ext3_releasepage()
> 		journal_try_to_free_buffers()
> 			journal_put_journal_head()
> 			__journal_try_to_free_buffer()
> 				<--- freed jh
> 		try_to_free_buffers()
> 			drop_buffers()
> 				if (buffer_busy(bh))
> 					goto failed;	<<--

OK.  <checks email>

The problem is that

	__generic_file_aio_write_nolock
	->invalidate_inode_pages2_range
	  ->invalidate_complete_page

was unable to invalidate the page, and so -EIO was propagated back.

Well, that was wrong of us.  invalidate_*_pages() is best-effort.  It
does not guarantee to remove the pages from pagecache or anything else.

The truncate_*_pages() functions do have such guarantees.  They will
block on that locked buffer, for example.
	/*
	 * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
	 * @mapping: the address_space which holds the pages to invalidate
	 * @start: the offset 'from' which to invalidate
	 * @end: the offset 'to' which to invalidate (inclusive)
	 *
	 * This function only removes the unlocked pages, if you want to
	 * remove all the pages of one inode, you must call truncate_inode_pages.
	 *
	 * invalidate_mapping_pages() will not block on IO activity. It will not
	 * invalidate pages which are dirty, locked, under writeback or mapped into
	 * pagetables.
	 */

So if we really want to take down that pagecache we'll need to call
->invalidatepage() against each page, not ->releasepage().

Yes, the naming is whacky.  truncate_inode_pages() uses
->invalidatepage() and invalidate_inode_pages() uses ->releasepage().

Or simply ignore the invalidate_inode_pages2_range() return value in
generic_file_direct_IO().
* Re: kjournald() with DIO
  2005-09-13 23:07 ` Andrew Morton
@ 2005-09-14 17:23 ` Mingming Cao
  2005-09-14 18:18   ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Mingming Cao @ 2005-09-14 17:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: Badari Pulavarty, linux-fsdevel, sct

On Tue, 2005-09-13 at 16:07 -0700, Andrew Morton wrote:
>
> Or simply ignore the invalidate_inode_pages2_range() return value in
> generic_file_direct_IO().
>
Could we simply do that?

I found some discussions about why we check the return value of
invalidate_inode_pages2_range() in generic_file_direct_IO():
http://marc.theaimsgroup.com/?l=linux-kernel&m=109850054025709&w=2

It seems the check for EIO was added in 2.6.11 to handle the case of
parallel direct IO and mapped IO.  It is possible that the mapped IO
dirties the pages after the a_ops->direct_IO.  In that case, an error is
returned to the caller of the DIO to indicate the race.

Thanks,
Mingming
* Re: kjournald() with DIO
  2005-09-14 17:23 ` Mingming Cao
@ 2005-09-14 18:18 ` Andrew Morton
  2005-09-14 21:40   ` Mingming Cao
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2005-09-14 18:18 UTC (permalink / raw)
To: cmm; +Cc: pbadari, linux-fsdevel, sct

Mingming Cao <cmm@us.ibm.com> wrote:
>
> On Tue, 2005-09-13 at 16:07 -0700, Andrew Morton wrote:
>
> > Or simply ignore the invalidate_inode_pages2_range() return value in
> > generic_file_direct_IO().
> >
> Could we simply do that?
>
> I found some discussions about why we check the return value of
> invalidate_inode_pages2_range() in generic_file_direct_IO():
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109850054025709&w=2

Well found.  That brings it back.

> It seems the check for EIO was added to 2.6.11 to handle the case of
> parallel direct IO and mapped IO. It is possible that the mapped IO
> dirty the pages after the a_ops->direct_IO. In that case, an error will
> return back to the caller of DIO to indicate the race.

According to the logic we discussed last year,
invalidate_inode_pages2_range() only needs to return -EIO if it failed
to invalidate a page, and that page was dirty.

The -EIO is there to tell the caller that another process dirtied
pagecache against the file (within the range of the direct-io write())
after generic_file_direct_IO() has synced the pagecache to disk.

The -EIO is telling the direct-io write()r "hey, the data which you
wrote was overwritten by a racing buffered-write() or mmapped-write".
It's not obvious to me _why_ we should tell the direct-io write()r this
- after all, we assume that's what the application developer wanted to
do.
--- devel/mm/truncate.c~a	2005-09-14 11:16:42.000000000 -0700
+++ devel-akpm/mm/truncate.c	2005-09-14 11:17:17.000000000 -0700
@@ -249,7 +249,7 @@ EXPORT_SYMBOL(invalidate_inode_pages);
  * Any pages which are found to be mapped into pagetables are unmapped prior to
  * invalidation.
  *
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIO if any pages could not be invalidated and were dirty.
  */
 int invalidate_inode_pages2_range(struct address_space *mapping,
 				  pgoff_t start, pgoff_t end)
@@ -307,9 +307,10 @@ int invalidate_inode_pages2_range(struct
 		}
 		was_dirty = test_clear_page_dirty(page);
 		if (!invalidate_complete_page(mapping, page)) {
-			if (was_dirty)
+			if (was_dirty) {
 				set_page_dirty(page);
-			ret = -EIO;
+				ret = -EIO;
+			}
 		}
 		unlock_page(page);
 	}
_
* Re: kjournald() with DIO
  2005-09-14 18:18 ` Andrew Morton
@ 2005-09-14 21:40 ` Mingming Cao
  2005-09-14 22:02   ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Mingming Cao @ 2005-09-14 21:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: pbadari, linux-fsdevel, sct, andrea

On Wed, 2005-09-14 at 11:18 -0700, Andrew Morton wrote:
> According to the logic we discussed last year,
> invalidate_inode_pages2_range() only needs to return -EIO if it failed
> to invalidate a page, and that page was dirty.
> ...
> Still, we don't have to worry about that at present because
> invalidate_inode_pages2_range() is just doing the wrong thing: it's
> treating this elevated-refcount buffer_head as if it was a dirty page,
> and it's not.
>
> How about this?

I proposed a similar idea to Andrea in the bug report before.  Andrea
expressed this concern: with this (try_to_free_buffers() still failing
to drop the buffer because of the refcount elevated by kjournald),
block_read_full_page() will not re-read the buffers from disk the next
time a buffered-IO read hits the page, after the direct-io has
completed.  This is because the buffer is marked uptodate.  How could we
handle this?

Mingming
* Re: kjournald() with DIO
  2005-09-14 21:40 ` Mingming Cao
@ 2005-09-14 22:02 ` Andrew Morton
  2005-09-15 11:11   ` Suparna Bhattacharya
  2005-09-15 15:03   ` Badari Pulavarty
  0 siblings, 2 replies; 29+ messages in thread
From: Andrew Morton @ 2005-09-14 22:02 UTC (permalink / raw)
To: cmm; +Cc: pbadari, linux-fsdevel, sct, andrea

Mingming Cao <cmm@us.ibm.com> wrote:
>
> I proposed a similar idea to Andrea in the bug report before.  Andrea
> expressed this concern: with this (try_to_free_buffers() still failing
> to drop the buffer because of the refcount elevated by kjournald),
> block_read_full_page() will not re-read the buffers from disk the next
> time a buffered-IO read hits the page, after the direct-io has
> completed.  This is because the buffer is marked uptodate.  How could
> we handle this?

Well.  Application A has written data x with direct-io.  Application B
has, at a later time, written, dirtied (or even read) data y with
buffered I/O.

So the contents of pagecache are in fact correct, but it is not known
whether or not the data on disk is correct.

I guess one way of resolving it is to make
invalidate_inode_pages2_range() provide a hard guarantee: call
truncate_complete_page() instead of invalidate_complete_page().  We need
to have a think about the implications of that.  For starters, if
someone gets in at the right time and reinstantiates a pte against the
page, we'll end up converting that into an anonymous page.
* Re: kjournald() with DIO
  2005-09-14 22:02 ` Andrew Morton
@ 2005-09-15 11:11 ` Suparna Bhattacharya
  2005-09-15 18:52   ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Suparna Bhattacharya @ 2005-09-15 11:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: cmm, pbadari, linux-fsdevel, sct, andrea

On Wed, Sep 14, 2005 at 03:02:24PM -0700, Andrew Morton wrote:
> Well.  Application A has written data x with direct-io.  Application B
> has, at a later time, written, dirtied (or even read) data y with
> buffered I/O.

I guess the concern is that the page B has written was based off stale
data, which has been mmap-written with a few new bytes.

But I wonder, do we really have to, or intend to, guarantee DIO vs
mmapped-write consistency?  What we have is sort of best-effort -- we
just ensure no corruptions and exposures, and try to keep data
consistent to the extent we can.  Can we actually do a perfect job?

> So the contents of pagecache are in fact correct, but it is not known
> whether or not the data on disk is correct.
>
> I guess one way of resolving it is to make invalidate_inode_pages2_range()
> provide a hard guarantee: call truncate_complete_page() instead of
> invalidate_complete_page().  We need to have a think about the
> implications of that.  For starters, if someone gets in at the right
> time and reinstantiates a pte against the page, we'll end up converting
> that into an anonymous page.

Regards
Suparna

--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India
* Re: kjournald() with DIO
  2005-09-15 11:11 ` Suparna Bhattacharya
@ 2005-09-15 18:52 ` Andrew Morton
  0 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2005-09-15 18:52 UTC (permalink / raw)
To: suparna; +Cc: cmm, pbadari, linux-fsdevel, sct, andrea

Suparna Bhattacharya <suparna@in.ibm.com> wrote:
>
> But I wonder, do we really have to or intend to guarantee DIO vs mmaped
> write consistency ? What we have is sort of best effort -- we just ensure
> no corruptions and exposures, and try to keep data consistent to the
> extent we can.

Yup.  There will be holes where inconsistencies can get in.  We just
need to understand what they are and arrange for them to be as benign as
poss.

> Can we actually do a perfect job ?

Certainly we can, but I don't think there's a way of doing it in an
acceptably performant way.
* Re: kjournald() with DIO
  2005-09-14 22:02 ` Andrew Morton
  2005-09-15 11:11 ` Suparna Bhattacharya
@ 2005-09-15 15:03 ` Badari Pulavarty
  2005-09-15 19:22   ` Andrea Arcangeli
  1 sibling, 1 reply; 29+ messages in thread
From: Badari Pulavarty @ 2005-09-15 15:03 UTC (permalink / raw)
To: Andrew Morton; +Cc: cmm, linux-fsdevel, sct, andrea

On Wed, 2005-09-14 at 15:02 -0700, Andrew Morton wrote:
> I guess one way of resolving it is to make invalidate_inode_pages2_range()
> provide a hard guarantee: call truncate_complete_page() instead of
> invalidate_complete_page().  We need to have a think about the
> implications of that.  For starters, if someone gets in at the right
> time and reinstantiates a pte against the page, we'll end up converting
> that into an anonymous page.

FWIW, I changed it to call truncate_complete_page() and the tests ran
fine overnight.

Thanks,
Badari
* Re: kjournald() with DIO
  2005-09-15 15:03 ` Badari Pulavarty
@ 2005-09-15 19:22 ` Andrea Arcangeli
  2005-09-15 20:00   ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2005-09-15 19:22 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: Andrew Morton, cmm, linux-fsdevel, sct

On Thu, Sep 15, 2005 at 08:03:23AM -0700, Badari Pulavarty wrote:
> FWIW, I changed it to call truncate_complete_page() and the tests ran
> fine overnight.

What is the main difference between the two?  (Besides the removal of
the pagecache, which isn't interesting here.)  That it calls
->invalidatepage (ext3_invalidatepage) instead of ->releasepage?

->invalidatepage must be freeing the buffers too (the fs thinks the data
block will be released), so why is there a difference between
->releasepage and ->invalidatepage in the first place?  What if you
define both of them to the invalidatepage implementation?  (With a local
change to ext3 calling ext3_invalidatepage with offset = 0 inside
ext3_releasepage; the gfpmask should also be checked, but that's not
important for now.)

Just trying to understand why truncate blocks and waits while
releasepage doesn't, which seems to be the reason we have a problem in
the first place.
* Re: kjournald() with DIO
  2005-09-15 19:22 ` Andrea Arcangeli
@ 2005-09-15 20:00 ` Andrew Morton
  2005-09-15 20:20   ` Andrea Arcangeli
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2005-09-15 20:00 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: pbadari, cmm, linux-fsdevel, sct

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Thu, Sep 15, 2005 at 08:03:23AM -0700, Badari Pulavarty wrote:
> > FWIW, I changed it to call truncate_complete_page() and the tests ran
> > fine overnight.
>
> What is the main difference between the two? (besides the removal of
> the pagecache that isn't interesting here). That it calls
> ->invalidatepage (ext3_invalidatepage) instead of ->releasepage?

aops.releasepage():

  This is a best-effort function which was originally called only from
  vmscan.c.  It's the filesystem's version of try_to_free_buffers().
  Later called by direct-io and by fadvise.

aops.invalidatepage():

  Provides hard guarantees.  Called by truncate.  It's the filesystem's
  version of block_truncate_page().

Major difference: invalidatepage() is supposed to kill dirty buffers,
releasepage() must not.

> ->invalidatepage must be freeing the buffers too (the fs thinks the data
> block will be released) so why is there a difference between
> ->releasepage and ->invalidatepage in the first place?

Memory reclaim versus truncate, basically.

> What if you
> define both of them to the invalidatepage implementation?

Memory reclaim will reclaim dirty pages ;)

(For those parts of the page outside the truncation point.  The
truncation point isn't meaningful for ->releasepage().)

Probably we could simulate ->releasepage(page) by using
->invalidatepage(page, PAGE_CACHE_SIZE), but that's relying on
side-effects, or would need new semantics defined, or something.
> (with a local
> change to ext3 calling ext3_invalidatepage with offset = 0 inside
> ext3_releasepage, the gfpmask should also be checked but that's not
> important for now)
>
> Just trying to understand why truncatepage blocks and waits, and
> releasepage doesn't (which seems the reason why we have a problem in the
> first place).

I made releasepage() nonblocking to avoid blocking processes in the
memory reclaim paths.  Given that it's best-effort, it may as well just
trylock everything it needs and give up if that doesn't work out.

I think ->releasepage() was being called under locks at some stage, but
that appears to no longer be the case.
* Re: kjournald() with DIO
  2005-09-15 20:00 ` Andrew Morton
@ 2005-09-15 20:20 ` Andrea Arcangeli
  2005-09-15 20:35   ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2005-09-15 20:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: pbadari, cmm, linux-fsdevel, sct

On Thu, Sep 15, 2005 at 01:00:18PM -0700, Andrew Morton wrote:
> Memory reclaim will reclaim dirty pages ;)

;)

> Probably we could simulate ->releasepage(page) by using
> ->invalidatepage(page, PAGE_CACHE_SIZE), but that's relying on
> side-effects, or would need new semantics defined, or something.

The problem is that memory reclaim is one thing, invalidate_inode_pages
is another, and truncate is a third.  They need three different
behaviours.

But we have only two APIs to ask the fs to get rid of the buffers: one
destroys dirty buffers, the other is non-blocking.  We might need one
that is blocking and doesn't destroy dirty buffers.

The "wait/gfpmask" parameter to releasepage is ignored by ext3; perhaps
we should change the semantics of that bit, and have it passed as
GFP_KERNEL only by invalidate_inode_pages.

> I made releasepage() nonblocking to avoid blocking processes in the memory
> reclaim paths.  Given that it's best-effort, it may as well just trylock
> everything it needs and give up if that doesn't work out.

I agree the memory reclaim can remain nonblocking.

However, note that the failure here is in try_to_free_buffers(), because
the buffer is pinned by the journal code (it's not a trylock failing).
The buffer is BH_Mapped, BH_Req, BH_Uptodate (0x19) - nicely tracked by
IBM.

> I think ->releasepage() was being called under locks at some stage but
> that appears to no longer be the case.

I'm a bit worried about the i_sem if waiting in releasepage, but there
should be no other locks in the way.  We need the blocking behaviour
only for invalidate_inode_pages.
* Re: kjournald() with DIO
  2005-09-15 20:20 ` Andrea Arcangeli
@ 2005-09-15 20:35 ` Andrew Morton
  2005-09-15 20:49   ` Badari Pulavarty
  2005-09-15 21:03   ` Andrea Arcangeli
  0 siblings, 2 replies; 29+ messages in thread
From: Andrew Morton @ 2005-09-15 20:35 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: pbadari, cmm, linux-fsdevel, sct

Andrea Arcangeli <andrea@suse.de> wrote:
>
> The problem is that one thing is the memory reclaim, one thing is the
> invalidate_inode_pages and one thing is the truncate.
>
> They need three different behaviours.

Yup.

> But we've only two API to ask the fs to get rid of the buffers, one is
> destroying dirty buffers, the other is non-blocking, we might need one
> that is blocking and it isn't destroying dirty buffers.
>
> The "wait/gfpmask" parameter to releasepage is ignored by ext3, perhaps
> we should change the semantics of that bit, and have it passed as
> GFP_KERNEL only by the invalidate_inode_pages.

That would work.

> However note that the failure here is in try_to_free_buffers because the
> buffer is pinned by the journal code (it's not a trylock failing).
> Buffer is BH_Mapped, BH_Req, BH_Uptodate (0x19) - nicely tracked by IBM.

I'll need reminding - why was the buffer unfreeable?  Because kjournald
had a ref on it?  Whereabouts is that happening?
* Re: kjournald() with DIO
  2005-09-15 20:35 ` Andrew Morton
@ 2005-09-15 20:49 ` Badari Pulavarty
  2005-09-15 21:06   ` Andrea Arcangeli
  2005-09-15 21:20   ` Andrew Morton
  1 sibling, 2 replies; 29+ messages in thread
From: Badari Pulavarty @ 2005-09-15 20:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andrea Arcangeli, cmm, linux-fsdevel, sct

On Thu, 2005-09-15 at 13:35 -0700, Andrew Morton wrote:
> I'll need reminding - why was the buffer unfreeable?  Because kjournald
> had a ref on it?  Whereabouts is that happening?

kjournald() is committing the transaction and doing IO to the buffers,
since the previous DIO write kicked it back to buffered mode (due to
hole filling).  Here is the race:

DIO Process				kjournald()

					journal_commit_transaction()
					...
					/* submitted buffers for IO */
					/* Waiting for IO to complete */
					while (t_locked_list) {
					...
						get_bh(bh);
						if (buffer_locked(bh)) {
							spin_unlock(&journal->j_list_lock);
							wait_on_buffer(bh);

invalidate_complete_page()
..
ext3_releasepage()
	journal_try_to_free_buffers()
		journal_put_journal_head()
		__journal_try_to_free_buffer()
			<--- freed jh
	try_to_free_buffers()
		drop_buffers()
			if (buffer_busy(bh))
				goto failed;	<<--- returns EIO due to b_count
* Re: kjournald() with DIO
  2005-09-15 20:49 ` Badari Pulavarty
@ 2005-09-15 21:06 ` Andrea Arcangeli
  0 siblings, 0 replies; 29+ messages in thread
From: Andrea Arcangeli @ 2005-09-15 21:06 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: Andrew Morton, cmm, linux-fsdevel, sct

On Thu, Sep 15, 2005 at 01:49:31PM -0700, Badari Pulavarty wrote:
> previous DIO write kicked it back to buffered mode (due to hole
> filling).

Btw, I guess with data=journal it'd trigger even without the hole
filling, and in fact data=writeback apparently hides the bug.
* Re: kjournald() with DIO
  2005-09-15 20:49         ` Badari Pulavarty
  2005-09-15 21:06           ` Andrea Arcangeli
@ 2005-09-15 21:20           ` Andrew Morton
  2005-09-15 22:22             ` Badari Pulavarty
  1 sibling, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2005-09-15 21:20 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: andrea, cmm, linux-fsdevel, sct

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> Here is the race:
>
> DIO Process				kjournald()
>
> 					journal_commit_transaction()
> 					...
> 					/* submitted buffers for IO */
> 					/* waiting for IO to complete */
> 					while (t_locked_list) {
> 						...
> 						get_bh(bh);
> 						if (buffer_locked(bh)) {
> 							spin_unlock(&journal->j_list_lock);
> 							wait_on_buffer(bh);
>
> invalidate_complete_page()
>   ..
>   ext3_releasepage()
>     journal_try_to_free_buffers()
>       journal_put_journal_head()
>         __journal_try_to_free_buffer()
>           <--- freed jh
>       try_to_free_buffers()
>         drop_buffers()
>           if (buffer_busy(bh))
>             goto failed;
>           <<--- returns EIO due to b_count

Right.  But there's still a race in there.

invalidate_complete_page() may still fail for these buffers.  The only
blocking it does is lock_buffer().  So if kjournald does:

	get_bh(bh);
	wait_on_buffer(bh);
	<- window
	put_bh(bh);

ext3_invalidatepage() may hit that unlocked, pinned buffer and fail to
release the page.

In that case, truncate_complete_page() will convert the page into an
anonymous page (if it's mmapped).  If it's just a pagecache page then I
guess it'll be converted into a zero-ref page on the page LRU.

It's not possible for ext3_invalidatepage() to block until a buffer comes
unpinned.  We'd have to add additional stuff for that (bring back
wake_up_buffer(), for example).
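[Editor's note] Andrew's "window" can also be sketched in userspace C. The point being modeled: since the invalidate path only ever blocks in lock_buffer(), waiting for the buffer lock does not help when the buffer is already unlocked but still pinned by a reference. All names here are illustrative, not kernel API.

```c
#include <assert.h>

/* Hypothetical model of the invalidate window. */
struct bh_sim2 {
	int b_count;	/* models atomic_t b_count */
	int b_locked;	/* models the BH_Lock bit */
};

/* Models the invalidate path: the only blocking it does is waiting
 * for the buffer lock; a pinned-but-unlocked buffer still defeats it. */
static int invalidate_sim(struct bh_sim2 *bh)
{
	while (bh->b_locked)		/* lock_buffer(): would sleep here */
		bh->b_locked = 0;	/* IO completion unlocks the buffer */
	/* In the window between wait_on_buffer() and put_bh(),
	 * b_count is still elevated, so the free fails. */
	return (bh->b_count > 0) ? -1 : 0;
}
```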
* Re: kjournald() with DIO
  2005-09-15 21:20           ` Andrew Morton
@ 2005-09-15 22:22             ` Badari Pulavarty
  0 siblings, 0 replies; 29+ messages in thread
From: Badari Pulavarty @ 2005-09-15 22:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: andrea, cmm, linux-fsdevel, sct

On Thu, 2005-09-15 at 14:20 -0700, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > Here is the race:
> >
> > DIO Process				kjournald()
> >
> > 					journal_commit_transaction()
> > 					...
> > 					/* submitted buffers for IO */
> > 					/* waiting for IO to complete */
> > 					while (t_locked_list) {
> > 						...
> > 						get_bh(bh);
> > 						if (buffer_locked(bh)) {
> > 							spin_unlock(&journal->j_list_lock);
> > 							wait_on_buffer(bh);
> >
> > invalidate_complete_page()
> >   ..
> >   ext3_releasepage()
> >     journal_try_to_free_buffers()
> >       journal_put_journal_head()
> >         __journal_try_to_free_buffer()
> >           <--- freed jh
> >       try_to_free_buffers()
> >         drop_buffers()
> >           if (buffer_busy(bh))
> >             goto failed;
> >           <<--- returns EIO due to b_count
>
> Right.  But there's still a race in there.
>
> invalidate_complete_page() may still fail for these buffers.  The only
> blocking it does is lock_buffer().  So if kjournald does:
>
> 	get_bh(bh);
> 	wait_on_buffer(bh);
> 	<- window
> 	put_bh(bh);
>
> ext3_invalidatepage() may hit that unlocked, pinned buffer and fail to
> release the page.

Yep. That's exactly the window.

BTW, the reason we hit this so easily on SLES9 SP1, but not on mainline,
is that mainline invalidates only the range of pages the DIO involves,
while SLES9 SP1 invalidates the whole file. (SLES9 and RHEL4 ignore the
error - so we don't see the failure there.)

> In that case, truncate_complete_page() will convert the page into an
> anonymous page (if it's mmapped).  If it's just a pagecache page then I
> guess it'll be converted into a zero-ref page on the page LRU.
>
> It's not possible for ext3_invalidatepage() to block until a buffer comes
> unpinned.  We'd have to add additional stuff for that (bring back
> wake_up_buffer(), for example).

Thanks,
Badari
* Re: kjournald() with DIO
  2005-09-15 20:35       ` Andrew Morton
  2005-09-15 20:49         ` Badari Pulavarty
@ 2005-09-15 21:03         ` Andrea Arcangeli
  2005-09-15 21:26           ` Andrew Morton
  1 sibling, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2005-09-15 21:03 UTC (permalink / raw)
To: Andrew Morton; +Cc: pbadari, cmm, linux-fsdevel, sct

On Thu, Sep 15, 2005 at 01:35:00PM -0700, Andrew Morton wrote:
> I'll need reminding - why was the buffer unfreeable?  Because kjournald had
> a ref on it?  Whereabouts is that happening?

Yes, kjournald had a ref (b_count) on it. It's happening when running
iozone in some way (Badari knows how to reproduce).

I was now wondering why it's a problem to destroy dirty buffers in
invalidate_inode_pages2? (ok, ignoring the detail that PageDirty may
return true inside invalidate_complete_page)

Clearly we can't do that inside releasepage, but invalidate_inode_pages2
is all about destroying the dirty and uptodate information, since we
just finished writing with direct-io (or at least the dirty info isn't
that important anymore, we're still in our write context for those pages
that we're invalidating [even better with the range]).

If we could just have a blocking version of try_to_release_page I guess
that would be better, but even the ->invalidatepage behaviour may not be
that bad for invalidate_inode_pages2. Or am I missing something?
* Re: kjournald() with DIO
  2005-09-15 21:03         ` Andrea Arcangeli
@ 2005-09-15 21:26           ` Andrew Morton
  2005-09-15 22:04             ` Andrea Arcangeli
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2005-09-15 21:26 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: pbadari, cmm, linux-fsdevel, sct

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Thu, Sep 15, 2005 at 01:35:00PM -0700, Andrew Morton wrote:
> > I'll need reminding - why was the buffer unfreeable?  Because kjournald had
> > a ref on it?  Whereabouts is that happening?
>
> Yes, kjournald had a ref (b_count) on it. It's happening when running
> iozone in some way (Badari knows how to reproduce).
>
> I was now wondering why it's a problem to destroy dirty buffers in
> invalidate_inode_pages2? (ok, ignoring the detail that PageDirty may
> return true inside invalidate_complete_page)

Bear in mind that generic_file_direct_IO() has just fsynced that section of
the file so there shouldn't be any dirty buffers unless something funny is
happening.  Like, direct-io fell back to buffered IO.  In that case perhaps
we should run sync_page_range() before trying the
invalidate_inode_pages2_range()?

> Clearly we can't do that inside releasepage, but invalidate_inode_pages2
> is all about destroying the dirty and uptodate information, since we
> just finished writing with direct-io (or at least the dirty info isn't
> that important anymore, we're still in our write context for those pages
> that we're invalidating [even better with the range]).

Those dirty pages/buffers could be there because
__generic_file_aio_write_nolock() fell back to buffered I/O.

If we run sync_page_range() before invalidate_inode_pages2_range() and we
*still* find dirty pages/buffers then yes, it might be right to simply nuke
those dirty bits.
* Re: kjournald() with DIO
  2005-09-15 21:26           ` Andrew Morton
@ 2005-09-15 22:04             ` Andrea Arcangeli
  2005-09-15 23:28               ` Mingming Cao
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2005-09-15 22:04 UTC (permalink / raw)
To: Andrew Morton; +Cc: pbadari, cmm, linux-fsdevel, sct

On Thu, Sep 15, 2005 at 02:26:42PM -0700, Andrew Morton wrote:
> Bear in mind that generic_file_direct_IO() has just fsynced that section of
> the file so there shouldn't be any dirty buffers unless something funny is

That was clear yes, or we cannot start writing to the platter if there's
outstanding dirty cache.

> happening.  Like, direct-io fell back to buffered IO.  In that case perhaps
> we should run sync_page_range() before trying the
> invalidate_inode_pages2_range()?

The buffered-IO fallback would happen _after_ (not before) the
invalidate regardless, so it doesn't matter since the pages would be
marked dirty again later (still inside write context, so we're safe).

The fallback actually happens outside generic_file_direct_IO, so it
doesn't matter what we do inside, since we'll rewrite and redirty all
pagecache later (if we fail the direct-IO).

> Those dirty pages/buffers could be there because
> __generic_file_aio_write_nolock() fell back to buffered I/O.
>
> If we run sync_page_range() before invalidate_inode_pages2_range() and we
> *still* find dirty pages/buffers then yes, it might be right to simply nuke
> those dirty bits.

See above, it doesn't seem necessary.
* Re: kjournald() with DIO
  2005-09-15 22:04             ` Andrea Arcangeli
@ 2005-09-15 23:28               ` Mingming Cao
  2005-09-16  0:18                 ` Andrea Arcangeli
  0 siblings, 1 reply; 29+ messages in thread
From: Mingming Cao @ 2005-09-15 23:28 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, pbadari, linux-fsdevel, sct

On Fri, 2005-09-16 at 00:04 +0200, Andrea Arcangeli wrote:
> On Thu, Sep 15, 2005 at 02:26:42PM -0700, Andrew Morton wrote:
> > Bear in mind that generic_file_direct_IO() has just fsynced that section of
> > the file so there shouldn't be any dirty buffers unless something funny is
>
> That was clear yes, or we cannot start writing to the platter if there's
> outstanding dirty cache.
>
> > happening.  Like, direct-io fell back to buffered IO.  In that case perhaps
> > we should run sync_page_range() before trying the
> > invalidate_inode_pages2_range()?

Aren't direct-io and buffered IO serialized by i_sem? If so, the buffers
should not be dirtied by another concurrent DIO, DIO-fallen-back-to-
buffered-IO, or buffered-IO. Did I miss something?

> The buffered-IO fallback would happen _after_ (not before) the
> invalidate regardless, so it doesn't matter since the pages would be
> marked dirty again later (still inside write context, so we're safe).
>
> The fallback actually happens outside generic_file_direct_IO, so it
> doesn't matter what we do inside, since we'll rewrite and redirty all
> pagecache later (if we fail the direct-IO).
>
> > Those dirty pages/buffers could be there because
> > __generic_file_aio_write_nolock() fell back to buffered I/O.
> >
> > If we run sync_page_range() before invalidate_inode_pages2_range() and we
> > *still* find dirty pages/buffers then yes, it might be right to simply nuke
> > those dirty bits.
>
> See above, it doesn't seem necessary.
* Re: kjournald() with DIO
  2005-09-15 23:28               ` Mingming Cao
@ 2005-09-16  0:18                 ` Andrea Arcangeli
  0 siblings, 0 replies; 29+ messages in thread
From: Andrea Arcangeli @ 2005-09-16 0:18 UTC (permalink / raw)
To: Mingming Cao; +Cc: Andrew Morton, pbadari, linux-fsdevel, sct

On Thu, Sep 15, 2005 at 04:28:44PM -0700, Mingming Cao wrote:
> Isn't direct-io and buffered IO being serialized by i_sem? If so, the

Page faults aren't serialized by i_sem, so if a page fault happens the
vm may unmap the pte under memory pressure, marking the page dirty. But
we have the guarantee that such a dirty bit in the pte was set after we
were inside write(), so we can drop it (or we can as well leave it);
it's undefined.

But the point is that the buffered-io fallback happens after the
invalidate, so it shouldn't matter if we drop dirty buffers or dirty
pages in the invalidate; they should be re-generated.

So overall I believe it's semantically equivalent whether we drop the
dirty bits or not. Clearly not dropping them is a bit less intrusive,
but current code seems ready to drop them by calling ->invalidatepage
(destructive) instead of creating a blocking version of ->releasepage
(non destructive).
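[Editor's note] Andrea's equivalence argument - that the buffered-IO fallback runs after the invalidate and re-dirties everything, so destroying the dirty bits is harmless - can be sketched as a tiny state model. This is purely illustrative; the struct and function names are invented for the sketch.

```c
#include <assert.h>

/* Hypothetical model of page state across invalidate + fallback. */
struct page_sim {
	int dirty;
	int uptodate;
};

/* Models the invalidate step: either ->invalidatepage() semantics
 * (destroys the dirty bit) or a non-destructive release. */
static void invalidate_sim(struct page_sim *p, int destroy_dirty)
{
	p->uptodate = 0;
	if (destroy_dirty)
		p->dirty = 0;
}

/* Models the buffered-IO fallback after a failed direct-IO write:
 * it rewrites the data and marks the page dirty again. */
static void buffered_fallback_sim(struct page_sim *p)
{
	p->uptodate = 1;
	p->dirty = 1;
}
```

Because the fallback runs last, both variants of the invalidate converge to the same final state, which is the semantic-equivalence claim above.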
* Re: kjournald() with DIO
  2005-09-13  0:29     ` Andrew Morton
  2005-09-13 16:52       ` Badari Pulavarty
@ 2005-09-13 17:53       ` Mingming Cao
  2005-09-16 13:42       ` Stephen C. Tweedie
  2 siblings, 0 replies; 29+ messages in thread
From: Mingming Cao @ 2005-09-13 17:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Badari Pulavarty, linux-fsdevel, sct

On Mon, 2005-09-12 at 17:29 -0700, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > Didn't help. How can this help ?
>
> Oh well.  Different bug.  See http://bugzilla.kernel.org/show_bug.cgi?id=4964
>
> > DIO code is kicking back to buffered mode, since its trying to
> > fill-in the holes. The fix you suggested might have helped, if
> > IO is beyond file size or file size not ending on page boundary.
> > What am I missing ?
>
> Have you spoken with Mingming about this?  She's chasing a similar race.
> She's currently working on getting the ext3-debug patch working so we can
> find out how those buffers (which apparently aren't even attached to the
> journal) came to have an elevated refcount.

Hi Andrew,

Badari was looking at the race with me, and he came up with a hack patch
to "remember" the last two callers of get_bh(); once we find a buffer
whose reference count is >1 but whose journal head is NULL, it starts
tracing that buffer head (dumping the last two callers of get_bh()).

The analysis shows that there is a race between
journal_try_to_free_buffers() and kjournald. It is possible that while
kjournald is waiting for a buffer to be unlocked, it has released the
journal->j_list_lock (commit.c, line 398) but is still holding a
reference count on that buffer across wait_on_buffer(). This allows
__journal_try_to_free_buffer() to proceed (it was waiting for the
j_list_lock and will eventually unlink the journal head from the
buffer). journal_try_to_free_buffers() then calls try_to_free_buffers()
since the journal head is NULL; at that point, since kjournald is still
holding a reference count on the buffer, try_to_free_buffers()->
drop_buffers() fails because the buffer is busy.

This race happened on SLES9 SP1/SP2, and also on 2.6.11, but could not
be reproduced on mainline 2.6.13. SLES9 SP1/SP2 and 2.6.11 all have the
patch that allows invalidate_inode_page() to return -EIO to deal with
the DIO-with-dirty-mapped-write case. But 2.6.13 invalidates using a
range; that is probably why mainline doesn't show the problem (the
chance of calling invalidate_inode_pages2_range() on a busy page is
reduced when the range is passed as a parameter).

> Apparently the debug patch is currently oopsing.  We need to fix that.

Sorry about the lack of response on this; it is certainly a very
powerful and useful tool. Last time I tried it, the kernel would not
boot. I am working on it, slowly :)

Mingming
* Re: kjournald() with DIO
  2005-09-13  0:29     ` Andrew Morton
  2005-09-13 16:52       ` Badari Pulavarty
  2005-09-13 17:53       ` Mingming Cao
@ 2005-09-16 13:42       ` Stephen C. Tweedie
  2005-09-21 18:22         ` Mingming Cao
  2 siblings, 1 reply; 29+ messages in thread
From: Stephen C. Tweedie @ 2005-09-16 13:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Badari Pulavarty, linux-fsdevel, Mingming Cao, Stephen Tweedie

Hi,

On Tue, 2005-09-13 at 01:29, Andrew Morton wrote:
> Have you spoken with Mingming about this?  She's chasing a similar race.
> She's currently working on getting the ext3-debug patch working so we can
> find out how those buffers (which apparently aren't even attached to the
> journal) came to have an elevated refcount.

Here's the latest version of the ext3 buffer tracing which I've been
using. It's basically just the same as the one from Andrew of about
2.6.5 vintage, with a few changes in places to deal with new bits of
code that need temporary buffer_heads initialised properly for tracing.

It has survived a couple of hours of fsx and fsstress in parallel, but I
haven't run any O_DIRECT tests with it on the current git kernel, so
there may still be problems in that area.

Cheers,
 Stephen

Buffer_head tracing for ext3/jbd

---
 drivers/ide/ide-io.c         |   22 ++++
 fs/Kconfig                   |    4 +
 fs/Makefile                  |    2 
 fs/buffer.c                  |  100 ++++++++++++++++
 fs/direct-io.c               |    1 
 fs/ext3/inode.c              |    7 +
 fs/ext3/namei.c              |   40 ++++++-
 fs/ext3/super.c              |   81 +++++++++++++
 fs/ext3/xattr.c              |    4 +
 fs/jbd-kernel.c              |  255 ++++++++++++++++++++++++++++++++++++++++++
 fs/jbd/commit.c              |   19 ++-
 fs/jbd/transaction.c         |   46 +++++--
 fs/mpage.c                   |    2 
 include/linux/buffer-trace.h |   85 ++++++++++++++
 include/linux/buffer_head.h  |    5 +
 include/linux/jbd.h          |   26 ++--
 16 files changed, 654 insertions(+), 45 deletions(-)

diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -890,6 +890,28 @@ static ide_startstop_t start_request (id
 	ide_startstop_t startstop;
 	sector_t block;
 
+#ifdef CONFIG_JBD_DEBUG
+	/*
+	 * Silently stop writing to this disk to simulate a crash.
+	 */
+	extern struct block_device *journal_no_write[2];
+	int i;
+
+	if (!(rq->flags & REQ_RW))
+		goto its_a_read;
+
+	for (i = 0; i < 2; i++) {
+		struct block_device *bdev = journal_no_write[i];
+		if (bdev && bdev_get_queue(bdev) == rq->q) {
+			printk("%s: write suppressed\n", __FUNCTION__);
+			ide_end_request(drive, 1, rq->nr_sectors);
+			return ide_stopped;
+		}
+	}
+its_a_read:
+	;
+#endif
+
 	BUG_ON(!(rq->flags & REQ_STARTED));
 
 #ifdef DEBUG
diff --git a/fs/Kconfig b/fs/Kconfig
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -172,6 +172,10 @@ config JBD_DEBUG
 	  generated.  To turn debugging off again, do
 	  "echo 0 > /proc/sys/fs/jbd-debug".
+config BUFFER_DEBUG + bool "buffer-layer tracing" + depends on JBD + config FS_MBCACHE # Meta block cache for Extended Attributes (ext2/ext3) tristate diff --git a/fs/Makefile b/fs/Makefile --- a/fs/Makefile +++ b/fs/Makefile @@ -23,6 +23,8 @@ obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout obj-$(CONFIG_BINFMT_EM86) += binfmt_em86.o obj-$(CONFIG_BINFMT_MISC) += binfmt_misc.o +obj-$(CONFIG_BUFFER_DEBUG) += jbd-kernel.o + # binfmt_script is always there obj-y += binfmt_script.o diff --git a/fs/buffer.c b/fs/buffer.c --- a/fs/buffer.c +++ b/fs/buffer.c @@ -35,7 +35,9 @@ #include <linux/hash.h> #include <linux/suspend.h> #include <linux/buffer_head.h> +#include <linux/buffer-trace.h> #include <linux/bio.h> +#include <linux/jbd.h> #include <linux/notifier.h> #include <linux/cpu.h> #include <linux/bitops.h> @@ -52,6 +54,7 @@ init_buffer(struct buffer_head *bh, bh_e { bh->b_end_io = handler; bh->b_private = private; + buffer_trace_init(&bh->b_history); } static int sync_buffer(void *word) @@ -60,6 +63,7 @@ static int sync_buffer(void *word) struct buffer_head *bh = container_of(word, struct buffer_head, b_state); + BUFFER_TRACE(bh, "enter"); smp_mb(); bd = bh->b_bdev; if (bd) @@ -104,6 +108,7 @@ static void buffer_io_error(struct buffe { char b[BDEVNAME_SIZE]; + BUFFER_TRACE(bh, "I/O error"); printk(KERN_ERR "Buffer I/O error on device %s, logical block %Lu\n", bdevname(bh->b_bdev, b), (unsigned long long)bh->b_blocknr); @@ -115,10 +120,12 @@ static void buffer_io_error(struct buffe */ void end_buffer_read_sync(struct buffer_head *bh, int uptodate) { + BUFFER_TRACE(bh, "enter"); if (uptodate) { set_buffer_uptodate(bh); } else { /* This happens, due to failed READA attempts. 
*/ + BUFFER_TRACE(bh, "clear uptodate"); clear_buffer_uptodate(bh); } unlock_buffer(bh); @@ -130,8 +137,10 @@ void end_buffer_write_sync(struct buffer char b[BDEVNAME_SIZE]; if (uptodate) { + BUFFER_TRACE(bh, "uptodate"); set_buffer_uptodate(bh); } else { + BUFFER_TRACE(bh, "Not uptodate"); if (!buffer_eopnotsupp(bh) && printk_ratelimit()) { buffer_io_error(bh); printk(KERN_WARNING "lost page write due to " @@ -139,6 +148,7 @@ void end_buffer_write_sync(struct buffer bdevname(bh->b_bdev, b)); } set_buffer_write_io_error(bh); + BUFFER_TRACE(bh, "clear uptodate"); clear_buffer_uptodate(bh); } unlock_buffer(bh); @@ -520,12 +530,15 @@ static void end_buffer_async_read(struct struct page *page; int page_uptodate = 1; + BUFFER_TRACE(bh, "enter"); + BUG_ON(!buffer_async_read(bh)); page = bh->b_page; if (uptodate) { set_buffer_uptodate(bh); } else { + BUFFER_TRACE(bh, "clear uptodate"); clear_buffer_uptodate(bh); if (printk_ratelimit()) buffer_io_error(bh); @@ -582,6 +595,8 @@ void end_buffer_async_write(struct buffe struct buffer_head *tmp; struct page *page; + BUFFER_TRACE(bh, "enter"); + BUG_ON(!buffer_async_write(bh)); page = bh->b_page; @@ -595,6 +610,7 @@ void end_buffer_async_write(struct buffe bdevname(bh->b_bdev, b)); } set_bit(AS_EIO, &page->mapping->flags); + BUFFER_TRACE(bh, "clear uptodate"); clear_buffer_uptodate(bh); SetPageError(page); } @@ -802,6 +818,7 @@ void mark_buffer_dirty_inode(struct buff struct address_space *mapping = inode->i_mapping; struct address_space *buffer_mapping = bh->b_page->mapping; + BUFFER_TRACE(bh, "mark dirty"); mark_buffer_dirty(bh); if (!mapping->assoc_mapping) { mapping->assoc_mapping = buffer_mapping; @@ -853,7 +870,25 @@ int __set_page_dirty_buffers(struct page struct buffer_head *bh = head; do { - set_buffer_dirty(bh); + if (buffer_uptodate(bh)) { + BUFFER_TRACE(bh, "set dirty"); + /* The following test can only succeed + * if jbd is built in, not if it's a + * module... 
*/ +#ifdef CONFIG_JBD + if (buffer_jbd(bh)) { + struct journal_head *jh; + + jh = journal_add_journal_head(bh); + WARN_ON(jh->b_jlist == BJ_Metadata); + journal_put_journal_head(jh); + } +#endif + set_buffer_dirty(bh); + } else { + printk("%s: !uptodate buffer\n", __FUNCTION__); + buffer_assertion_failure(bh); + } bh = bh->b_this_page; } while (bh != head); } @@ -1043,6 +1078,7 @@ no_grow: do { bh = head; head = head->b_this_page; + bh->b_this_page = NULL; free_buffer_head(bh); } while (head); } @@ -1258,6 +1294,7 @@ __getblk_slow(struct block_device *bdev, */ void fastcall mark_buffer_dirty(struct buffer_head *bh) { + BUFFER_TRACE(bh, "entry"); if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh)) __set_page_dirty_nobuffers(bh->b_page); } @@ -1271,12 +1308,14 @@ void fastcall mark_buffer_dirty(struct b */ void __brelse(struct buffer_head * buf) { + BUFFER_TRACE(buf, "entry"); if (atomic_read(&buf->b_count)) { put_bh(buf); return; } printk(KERN_ERR "VFS: brelse: Trying to free free buffer\n"); WARN_ON(1); + buffer_assertion_failure(buf); } /* @@ -1285,6 +1324,7 @@ void __brelse(struct buffer_head * buf) */ void __bforget(struct buffer_head *bh) { + BUFFER_TRACE(bh, "enter"); clear_buffer_dirty(bh); if (!list_empty(&bh->b_assoc_buffers)) { struct address_space *buffer_mapping = bh->b_page->mapping; @@ -1546,6 +1586,7 @@ EXPORT_SYMBOL(set_bh_page); */ static inline void discard_buffer(struct buffer_head * bh) { + BUFFER_TRACE(bh, "enter"); lock_buffer(bh); clear_buffer_dirty(bh); bh->b_bdev = NULL; @@ -1696,6 +1737,7 @@ void unmap_underlying_metadata(struct bl old_bh = __find_get_block_slow(bdev, block, 0); if (old_bh) { + BUFFER_TRACE(old_bh, "alias: unmap it"); clear_buffer_dirty(old_bh); wait_on_buffer(old_bh); clear_buffer_req(old_bh); @@ -1775,6 +1817,7 @@ static int __block_write_full_page(struc /* * The buffer was zeroed by block_write_full_page() */ + BUFFER_TRACE(bh, "outside last_block"); clear_buffer_dirty(bh); set_buffer_uptodate(bh); } else if 
(!buffer_mapped(bh) && buffer_dirty(bh)) { @@ -1782,6 +1825,8 @@ static int __block_write_full_page(struc if (err) goto recover; if (buffer_new(bh)) { + BUFFER_TRACE(bh, "new: call " + "unmap_underlying_metadata"); /* blockdev mappings never come here */ clear_buffer_new(bh); unmap_underlying_metadata(bh->b_bdev, @@ -1809,6 +1854,11 @@ static int __block_write_full_page(struc continue; } if (test_clear_buffer_dirty(bh)) { + if (!buffer_uptodate(bh)) { + printk("%s: dirty non-uptodate buffer\n", + __FUNCTION__); + buffer_assertion_failure(bh); + } mark_buffer_async_write(bh); } else { unlock_buffer(bh); @@ -1870,6 +1920,7 @@ recover: /* Recovery: lock and submit the mapped buffers */ do { if (buffer_mapped(bh) && buffer_dirty(bh)) { + BUFFER_TRACE(bh, "lock it"); lock_buffer(bh); mark_buffer_async_write(bh); } else { @@ -1935,9 +1986,12 @@ static int __block_prepare_write(struct if (err) break; if (buffer_new(bh)) { + BUFFER_TRACE(bh, "new: call " + "unmap_underlying_metadata"); unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr); if (PageUptodate(page)) { + BUFFER_TRACE(bh, "setting uptodate"); set_buffer_uptodate(bh); continue; } @@ -1958,12 +2012,14 @@ static int __block_prepare_write(struct } } if (PageUptodate(page)) { + BUFFER_TRACE(bh, "setting uptodate"); if (!buffer_uptodate(bh)) set_buffer_uptodate(bh); continue; } if (!buffer_uptodate(bh) && !buffer_delay(bh) && (block_start < from || block_end > to)) { + BUFFER_TRACE(bh, "reading"); ll_rw_block(READ, 1, &bh); *wait_bh++=bh; } @@ -2006,6 +2062,7 @@ static int __block_prepare_write(struct memset(kaddr+block_start, 0, bh->b_size); kunmap_atomic(kaddr, KM_USER0); set_buffer_uptodate(bh); + BUFFER_TRACE(bh, "mark dirty"); mark_buffer_dirty(bh); } next_bh: @@ -2034,6 +2091,7 @@ static int __block_commit_write(struct i partial = 1; } else { set_buffer_uptodate(bh); + BUFFER_TRACE(bh, "mark dirty"); mark_buffer_dirty(bh); } } @@ -2328,6 +2386,7 @@ static void end_buffer_read_nobh(struct set_buffer_uptodate(bh); 
} else { /* This happens, due to failed READA attempts. */ + BUFFER_TRACE(bh, "clear uptodate"); clear_buffer_uptodate(bh); } unlock_buffer(bh); @@ -2419,6 +2478,7 @@ int nobh_prepare_write(struct page *page bh->b_data = (char *)(long)block_start; bh->b_bdev = map_bh.b_bdev; bh->b_private = NULL; + buffer_trace_init(&bh->b_history); read_bh[nr_reads++] = bh; } } @@ -2662,6 +2722,7 @@ int block_truncate_page(struct address_s flush_dcache_page(page); kunmap_atomic(kaddr, KM_USER0); + BUFFER_TRACE(bh, "mark dirty"); mark_buffer_dirty(bh); err = 0; @@ -2696,11 +2757,20 @@ int block_write_full_page(struct page *p * they may have been added in ext3_writepage(). Make them * freeable here, so the page does not leak. */ + if (page_has_buffers(page)) { + struct buffer_head *b = page_buffers(page); + BUFFER_TRACE(b, "EIO"); + } block_invalidatepage(page, 0); unlock_page(page); return 0; /* don't care */ } + if (page_has_buffers(page)) { + struct buffer_head *b = page_buffers(page); + BUFFER_TRACE(b, "partial"); + } + /* * The page straddles i_size. It must be zeroed out on each and every * writepage invokation because it may be mmapped. 
"A file is mapped @@ -2720,8 +2790,10 @@ sector_t generic_block_bmap(struct addre { struct buffer_head tmp; struct inode *inode = mapping->host; + tmp.b_state = 0; tmp.b_blocknr = 0; + tmp.b_page = NULL; get_block(inode, block, &tmp, 0); return tmp.b_blocknr; } @@ -2748,9 +2820,11 @@ int submit_bh(int rw, struct buffer_head struct bio *bio; int ret = 0; - BUG_ON(!buffer_locked(bh)); - BUG_ON(!buffer_mapped(bh)); - BUG_ON(!bh->b_end_io); + BUFFER_TRACE(bh, "enter"); + + J_ASSERT_BH(bh, buffer_locked(bh)); + J_ASSERT_BH(bh, buffer_mapped(bh)); + J_ASSERT_BH(bh, bh->b_end_io != NULL); if (buffer_ordered(bh) && (rw == WRITE)) rw = WRITE_BARRIER; @@ -2762,6 +2836,11 @@ int submit_bh(int rw, struct buffer_head if (test_set_buffer_req(bh) && (rw == WRITE || rw == WRITE_BARRIER)) clear_buffer_write_io_error(bh); + if (rw == WRITE) + BUFFER_TRACE(bh, "write"); + else + BUFFER_TRACE(bh, "read"); + /* * from here on down, it's all bio -- do the initial mapping, * submit_bio -> generic_make_request may further map this bio around @@ -2966,6 +3045,7 @@ out: do { struct buffer_head *next = bh->b_this_page; + bh->b_this_page = NULL; free_buffer_head(bh); bh = next; } while (bh != buffers_to_free); @@ -3052,6 +3132,7 @@ struct buffer_head *alloc_buffer_head(un get_cpu_var(bh_accounting).nr++; recalc_bh_state(); put_cpu_var(bh_accounting); + buffer_trace_init(&ret->b_history); } return ret; } @@ -3059,7 +3140,16 @@ EXPORT_SYMBOL(alloc_buffer_head); void free_buffer_head(struct buffer_head *bh) { - BUG_ON(!list_empty(&bh->b_assoc_buffers)); + J_ASSERT_BH(bh, bh->b_this_page == NULL); + J_ASSERT_BH(bh, list_empty(&bh->b_assoc_buffers)); +#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE) + if (buffer_jbd(bh)) { + J_ASSERT_BH(bh, bh2jh(bh)->b_transaction == 0); + J_ASSERT_BH(bh, bh2jh(bh)->b_next_transaction == 0); + J_ASSERT_BH(bh, bh2jh(bh)->b_frozen_data == 0); + J_ASSERT_BH(bh, bh2jh(bh)->b_committed_data == 0); + } +#endif kmem_cache_free(bh_cachep, bh); 
get_cpu_var(bh_accounting).nr--; recalc_bh_state(); diff --git a/fs/direct-io.c b/fs/direct-io.c --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -519,6 +519,7 @@ static int get_more_blocks(struct dio *d if (ret == 0) { map_bh->b_state = 0; map_bh->b_size = 0; + map_bh->b_page = 0; BUG_ON(dio->block_in_file >= dio->final_block_in_request); fs_startblk = dio->block_in_file >> dio->blkfactor; dio_count = dio->final_block_in_request - dio->block_in_file; diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -859,6 +859,7 @@ struct buffer_head *ext3_getblk(handle_t dummy.b_state = 0; dummy.b_blocknr = -1000; + dummy.b_page = NULL; buffer_trace_init(&dummy.b_history); *errp = ext3_get_block_handle(handle, inode, block, &dummy, create, 1); if (!*errp && buffer_mapped(&dummy)) { @@ -1286,6 +1287,11 @@ static int ext3_ordered_writepage(struct walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, NULL, bget_one); + { + struct buffer_head *b = page_buffers(page); + BUFFER_TRACE(b, "call block_write_full_page"); + } + ret = block_write_full_page(page, ext3_get_block, wbc); /* @@ -1685,6 +1691,7 @@ static int ext3_block_truncate_page(hand } else { if (ext3_should_order_data(inode)) err = ext3_journal_dirty_data(handle, bh); + BUFFER_TRACE(bh, "mark dirty"); mark_buffer_dirty(bh); } diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c --- a/fs/ext3/namei.c +++ b/fs/ext3/namei.c @@ -60,6 +60,7 @@ static struct buffer_head *ext3_append(h EXT3_I(inode)->i_disksize = inode->i_size; ext3_journal_get_write_access(handle,bh); } + BUFFER_TRACE(bh, "exit"); return bh; } @@ -340,6 +341,7 @@ dx_probe(struct dentry *dentry, struct i dir = dentry->d_parent->d_inode; if (!(bh = ext3_bread (NULL,dir, 0, 0, err))) goto fail; + BUFFER_TRACE(bh, "bread it"); root = (struct dx_root *) bh->b_data; if (root->info.hash_version != DX_HASH_TEA && root->info.hash_version != DX_HASH_HALF_MD4 && @@ -414,18 +416,21 @@ dx_probe(struct dentry *dentry, struct i at = p - 1; 
dxtrace(printk(" %x->%u\n", at == entries? 0: dx_get_hash(at), dx_get_block(at))); + BUFFER_TRACE(bh, "add to frame"); frame->bh = bh; frame->entries = entries; frame->at = at; if (!indirect--) return frame; if (!(bh = ext3_bread (NULL,dir, dx_get_block(at), 0, err))) goto fail2; + BUFFER_TRACE(bh, "read it"); at = entries = ((struct dx_node *) bh->b_data)->entries; assert (dx_get_limit(entries) == dx_node_limit (dir)); frame++; } fail2: while (frame >= frame_in) { + BUFFER_TRACE(bh, "brelse it"); brelse(frame->bh); frame--; } @@ -438,8 +443,11 @@ static void dx_release (struct dx_frame if (frames[0].bh == NULL) return; - if (((struct dx_root *) frames[0].bh->b_data)->info.indirect_levels) + if (((struct dx_root *) frames[0].bh->b_data)->info.indirect_levels) { + BUFFER_TRACE(frames[1].bh, "brelse it"); brelse(frames[1].bh); + } + BUFFER_TRACE(frames[0].bh, "brelse it"); brelse(frames[0].bh); } @@ -963,6 +971,8 @@ static struct buffer_head * ext3_dx_find dx_release (frames); return bh; } + BUFFER_TRACE(bh, "brelse it"); + BUFFER_TRACE(bh, "brelse it"); brelse (bh); /* Check to see if we should continue to search */ retval = ext3_htree_next_block(dir, hash, frame, @@ -997,6 +1007,7 @@ static struct dentry *ext3_lookup(struct inode = NULL; if (bh) { unsigned long ino = le32_to_cpu(de->inode); + BUFFER_TRACE(bh, "brelse it"); brelse (bh); inode = iget(dir->i_sb, ino); @@ -1118,16 +1129,20 @@ static struct ext3_dir_entry_2 *do_split bh2 = ext3_append (handle, dir, &newblock, error); if (!(bh2)) { + BUFFER_TRACE(*bh, "brelse it"); brelse(*bh); *bh = NULL; goto errout; } + BUFFER_TRACE(bh2, "using it"); BUFFER_TRACE(*bh, "get_write_access"); err = ext3_journal_get_write_access(handle, *bh); if (err) { journal_error: + BUFFER_TRACE(*bh, "brelse"); brelse(*bh); + BUFFER_TRACE(bh2, "brelse"); brelse(bh2); *bh = NULL; ext3_std_error(dir->i_sb, err); @@ -1163,6 +1178,8 @@ static struct ext3_dir_entry_2 *do_split /* Which block gets the new entry? 
*/ if (hinfo->hash >= hash2) { + BUFFER_TRACE(*bh, "swapping"); + BUFFER_TRACE(bh2, "swapping"); swap(*bh, bh2); de = de2; } @@ -1171,8 +1188,11 @@ static struct ext3_dir_entry_2 *do_split if (err) goto journal_error; err = ext3_journal_dirty_metadata (handle, frame->bh); - if (err) + if (err) { + BUFFER_TRACE(bh2, "error"); goto journal_error; + } + BUFFER_TRACE(*bh, "brelse"); brelse (bh2); dxtrace(dx_show_index ("frame", frame->entries)); errout: @@ -1215,6 +1235,7 @@ static int add_dirent_to_buf(handle_t *h return -EIO; } if (ext3_match (namelen, name, de)) { + BUFFER_TRACE(bh, "brelse"); brelse (bh); return -EEXIST; } @@ -1272,6 +1293,7 @@ static int add_dirent_to_buf(handle_t *h err = ext3_journal_dirty_metadata(handle, bh); if (err) ext3_std_error(dir->i_sb, err); + BUFFER_TRACE(bh, "brelse"); brelse(bh); return 0; } @@ -1302,6 +1324,7 @@ static int make_indexed_dir(handle_t *ha blocksize = dir->i_sb->s_blocksize; dxtrace(printk("Creating index\n")); + BUFFER_TRACE(bh, "get write access"); retval = ext3_journal_get_write_access(handle, bh); if (retval) { ext3_std_error(dir->i_sb, retval); @@ -1346,6 +1369,7 @@ static int make_indexed_dir(handle_t *ha frame = frames; frame->entries = entries; frame->at = entries; + BUFFER_TRACE(bh, "adding to frame"); frame->bh = bh; bh = bh2; de = do_split(handle,dir, &bh, frame, &hinfo, &retval); @@ -1402,6 +1426,7 @@ static int ext3_add_entry (handle_t *han bh = ext3_bread(handle, dir, block, 0, &retval); if(!bh) return retval; + BUFFER_TRACE(bh, "read it"); retval = add_dirent_to_buf(handle, dentry, inode, NULL, bh); if (retval != -ENOSPC) return retval; @@ -1416,6 +1441,7 @@ static int ext3_add_entry (handle_t *han bh = ext3_append(handle, dir, &block, &retval); if (!bh) return retval; + BUFFER_TRACE(bh, "append returned it"); de = (struct ext3_dir_entry_2 *) bh->b_data; de->inode = 0; de->rec_len = cpu_to_le16(rlen = blocksize); @@ -1481,6 +1507,7 @@ static int ext3_dx_add_entry(handle_t *h bh2 = ext3_append (handle, 
dir, &newblock, &err); if (!(bh2)) goto cleanup; + BUFFER_TRACE(bh2, "append returned it"); node2 = (struct dx_node *)(bh2->b_data); entries2 = node2->entries; node2->fake.rec_len = cpu_to_le16(sb->s_blocksize); @@ -1510,6 +1537,8 @@ static int ext3_dx_add_entry(handle_t *h if (at - entries >= icount1) { frame->at = at = at - entries - icount1 + entries2; frame->entries = entries = entries2; + BUFFER_TRACE(frame->bh, "swap"); + BUFFER_TRACE(bh2, "swap"); swap(frame->bh, bh2); } dx_insert_block (frames + 0, hash2, newblock); @@ -1519,6 +1548,7 @@ static int ext3_dx_add_entry(handle_t *h err = ext3_journal_dirty_metadata(handle, bh2); if (err) goto journal_error; + BUFFER_TRACE(bh2, "brelse"); brelse (bh2); } else { dxtrace(printk("Creating second level index...\n")); @@ -1535,6 +1565,8 @@ static int ext3_dx_add_entry(handle_t *h frame = frames + 1; frame->at = at = at - entries + entries2; frame->entries = entries = entries2; + BUFFER_TRACE(frame->bh, "overwrite"); + BUFFER_TRACE(bh2, "add to frame"); frame->bh = bh2; err = ext3_journal_get_write_access(handle, frame->bh); @@ -1553,8 +1585,10 @@ static int ext3_dx_add_entry(handle_t *h journal_error: ext3_std_error(dir->i_sb, err); cleanup: - if (bh) + if (bh) { + BUFFER_TRACE(bh, "brelse"); brelse(bh); + } dx_release(frames); return err; } diff --git a/fs/ext3/super.c b/fs/ext3/super.c --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -40,6 +40,10 @@ #include "xattr.h" #include "acl.h" +#ifdef CONFIG_JBD_DEBUG +static int ext3_ro_after; /* Make fs read-only after this many jiffies */ +#endif + static int ext3_load_journal(struct super_block *, struct ext3_super_block *); static int ext3_create_journal(struct super_block *, struct ext3_super_block *, int); @@ -59,6 +63,65 @@ static void ext3_unlockfs(struct super_b static void ext3_write_super (struct super_block * sb); static void ext3_write_super_lockfs(struct super_block *sb); +#ifdef CONFIG_JBD_DEBUG +struct block_device *journal_no_write[2]; + +/* + * Debug code 
for turning filesystems "read-only" after a specified
+ * amount of time.  This is for crash/recovery testing.
+ */
+
+static void make_rdonly(struct block_device *bdev,
+			struct block_device **no_write)
+{
+	char b[BDEVNAME_SIZE];
+
+	if (bdev) {
+		printk(KERN_WARNING "Turning device %s read-only\n",
+			bdevname(bdev, b));
+		*no_write = bdev;
+	}
+}
+
+static void turn_fs_readonly(unsigned long arg)
+{
+	struct super_block *sb = (struct super_block *)arg;
+
+	make_rdonly(sb->s_bdev, &journal_no_write[0]);
+	make_rdonly(EXT3_SB(sb)->s_journal->j_dev, &journal_no_write[1]);
+	wake_up(&EXT3_SB(sb)->ro_wait_queue);
+}
+
+static void setup_ro_after(struct super_block *sb)
+{
+	struct ext3_sb_info *sbi = EXT3_SB(sb);
+	init_timer(&sbi->turn_ro_timer);
+	if (ext3_ro_after) {
+		printk(KERN_DEBUG "fs will go read-only in %d jiffies\n",
+			ext3_ro_after);
+		init_waitqueue_head(&sbi->ro_wait_queue);
+		journal_no_write[0] = NULL;
+		journal_no_write[1] = NULL;
+		sbi->turn_ro_timer.function = turn_fs_readonly;
+		sbi->turn_ro_timer.data = (unsigned long)sb;
+		sbi->turn_ro_timer.expires = jiffies + ext3_ro_after;
+		ext3_ro_after = 0;
+		add_timer(&sbi->turn_ro_timer);
+	}
+}
+
+static void clear_ro_after(struct super_block *sb)
+{
+	del_timer_sync(&EXT3_SB(sb)->turn_ro_timer);
+	journal_no_write[0] = NULL;
+	journal_no_write[1] = NULL;
+	ext3_ro_after = 0;
+}
+#else
+#define setup_ro_after(sb) do {} while (0)
+#define clear_ro_after(sb) do {} while (0)
+#endif
+
 /*
  * Wrappers for journal_start/end.
* @@ -427,6 +490,7 @@ static void ext3_put_super (struct super invalidate_bdev(sbi->journal_bdev, 0); ext3_blkdev_remove(sbi); } + clear_ro_after(sb); sb->s_fs_info = NULL; kfree(sbi); return; @@ -619,6 +683,7 @@ enum { Opt_bsd_df, Opt_minix_df, Opt_grpid, Opt_nogrpid, Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, Opt_err_ro, Opt_nouid32, Opt_check, Opt_nocheck, Opt_debug, Opt_oldalloc, Opt_orlov, + Opt_ro_after, Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl, Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_commit, Opt_journal_update, Opt_journal_inum, @@ -649,6 +714,7 @@ static match_table_t tokens = { {Opt_debug, "debug"}, {Opt_oldalloc, "oldalloc"}, {Opt_orlov, "orlov"}, + {Opt_ro_after, "ro_after"}, {Opt_user_xattr, "user_xattr"}, {Opt_nouser_xattr, "nouser_xattr"}, {Opt_acl, "acl"}, @@ -786,6 +852,13 @@ static int parse_options (char * options case Opt_orlov: clear_opt (sbi->s_mount_opt, OLDALLOC); break; +#ifdef CONFIG_JBD_DEBUG + case Opt_ro_after: + if (match_int(&args[0], &option)) + return 0; + ext3_ro_after = option; + break; +#endif #ifdef CONFIG_EXT3_FS_XATTR case Opt_user_xattr: set_opt (sbi->s_mount_opt, XATTR_USER); @@ -1114,6 +1187,7 @@ static int ext3_setup_super(struct super ext3_check_inodes_bitmap (sb); } #endif + setup_ro_after(sb); return res; } @@ -1348,6 +1422,9 @@ static int ext3_fill_super (struct super int needs_recovery; __le32 features; +#ifdef CONFIG_JBD_DEBUG + ext3_ro_after = 0; +#endif sbi = kmalloc(sizeof(*sbi), GFP_KERNEL); if (!sbi) return -ENOMEM; @@ -1356,6 +1433,7 @@ static int ext3_fill_super (struct super sbi->s_mount_opt = 0; sbi->s_resuid = EXT3_DEF_RESUID; sbi->s_resgid = EXT3_DEF_RESGID; + setup_ro_after(sb); unlock_kernel(); @@ -2072,6 +2150,8 @@ static void ext3_clear_journal_err(struc journal = EXT3_SB(sb)->s_journal; + clear_ro_after(sb); + /* * Now check for any error status which may have been recorded in the * journal by a prior ext3_error() or ext3_abort() @@ -2138,6 +2218,7 @@ 
static int ext3_sync_fs(struct super_blo if (wait) log_wait_commit(EXT3_SB(sb)->s_journal, target); } + setup_ro_after(sb); return 0; } diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c --- a/fs/ext3/xattr.c +++ b/fs/ext3/xattr.c @@ -482,12 +482,14 @@ ext3_xattr_release_block(handle_t *handl ce = mb_cache_entry_get(ext3_xattr_cache, bh->b_bdev, bh->b_blocknr); if (BHDR(bh)->h_refcount == cpu_to_le32(1)) { ea_bdebug(bh, "refcount now=0; freeing"); + BUFFER_TRACE(bh, "xattr delete"); if (ce) mb_cache_entry_free(ce); ext3_free_blocks(handle, inode, bh->b_blocknr, 1); get_bh(bh); ext3_forget(handle, 1, inode, bh, bh->b_blocknr); } else { + BUFFER_TRACE(bh, "xattr deref"); if (ext3_journal_get_write_access(handle, bh) == 0) { lock_buffer(bh); BHDR(bh)->h_refcount = cpu_to_le32( @@ -758,6 +760,7 @@ ext3_xattr_block_set(handle_t *handle, s inserted: if (!IS_LAST_ENTRY(s->first)) { new_bh = ext3_xattr_cache_find(inode, header(s->base), &ce); + BUFFER_TRACE2(new_bh, bs->bh, "xattr cache looked up"); if (new_bh) { /* We found an identical block in the cache. */ if (new_bh == bs->bh) @@ -802,6 +805,7 @@ inserted: ea_idebug(inode, "creating block %d", block); new_bh = sb_getblk(sb, block); + BUFFER_TRACE(new_bh, "new xattr block"); if (!new_bh) { getblk_failed: ext3_free_blocks(handle, inode, block, 1); diff --git a/fs/jbd-kernel.c b/fs/jbd-kernel.c new file mode 100644 --- /dev/null +++ b/fs/jbd-kernel.c @@ -0,0 +1,255 @@ +/* + * fs/jbd-kernel.c + * + * Support code for the Journalling Block Device layer. + * This file contains things which have to be in-kernel when + * JBD is a module. + * + * 15 May 2001 Andrew Morton <andrewm@uow.edu.au> + * Created + */ + +#include <linux/config.h> +#include <linux/fs.h> +#include <linux/jbd.h> +#include <linux/module.h> +#include <linux/sched.h> +#include <linux/mm.h> + +/* + * Some sanity testing which is called from mark_buffer_clean(), + * and must be present in the main kernel. 
+ */ + +void jbd_preclean_buffer_check(struct buffer_head *bh) +{ + if (buffer_jbd(bh)) { + struct journal_head *jh = bh2jh(bh); + + transaction_t *transaction = jh->b_transaction; + journal_t *journal; + + if (jh->b_jlist == 0 && transaction == NULL) + return; + + J_ASSERT_JH(jh, transaction != NULL); + /* The kernel may be unmapping old data. We expect it + * to be dirty in that case, unless the buffer has + * already been forgotten by a transaction. */ + if (jh->b_jlist != BJ_Forget) { +#if 1 + if (!buffer_dirty(bh)) { + printk("%s: clean of clean buffer\n", + __FUNCTION__); + print_buffer_trace(bh); + return; + } +#endif + J_ASSERT_BH(bh, buffer_dirty(bh)); + } + + journal = transaction->t_journal; + J_ASSERT_JH(jh, + transaction == journal->j_running_transaction || + transaction == journal->j_committing_transaction); + } +} +EXPORT_SYMBOL(jbd_preclean_buffer_check); + +/* + * Support functions for BUFFER_TRACE() + */ + +static spinlock_t trace_lock = SPIN_LOCK_UNLOCKED; + +void buffer_trace(const char *function, struct buffer_head *dest, + struct buffer_head *src, char *info) +{ + struct buffer_history_item *bhist_i; + struct page *page; + unsigned long flags; + + if (dest == 0 || src == 0) + return; + + spin_lock_irqsave(&trace_lock, flags); + + /* + * Sometimes we don't initialise the ring pointers. (locally declared + * temp buffer_heads). Feebly attempt to detect and correct that here. 
+ */ + if ((dest->b_history.b_history_head - dest->b_history.b_history_tail > + BUFFER_HISTORY_SIZE)) { + dest->b_history.b_history_head = 0; + dest->b_history.b_history_tail = 0; + } + bhist_i = dest->b_history.b + + (dest->b_history.b_history_head & (BUFFER_HISTORY_SIZE - 1)); + bhist_i->function = function; + bhist_i->info = info; + bhist_i->b_state = src->b_state; + page = src->b_page; + if (page) + bhist_i->pg_dirty = !!PageDirty(page); + else + bhist_i->pg_dirty = 0; + +#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE) + bhist_i->b_trans_is_running = 0; + bhist_i->b_trans_is_committing = 0; + bhist_i->b_blocknr = src->b_blocknr; + if (buffer_jbd(src)) { + struct journal_head *jh; + journal_t *journal; + transaction_t *transaction; + + /* Footwork to avoid racing with journal_remove_journal_head */ + jh = src->b_private; + if (jh == 0) + goto raced; + transaction = jh->b_transaction; + if (src->b_private == 0) + goto raced; + bhist_i->b_jcount = jh->b_jcount; + bhist_i->b_jbd = 1; + bhist_i->b_jlist = jh->b_jlist; + bhist_i->b_frozen_data = jh->b_frozen_data; + bhist_i->b_committed_data = jh->b_committed_data; + bhist_i->b_transaction = !!jh->b_transaction; + bhist_i->b_next_transaction = !!jh->b_next_transaction; + bhist_i->b_cp_transaction = !!jh->b_cp_transaction; + + if (transaction) { + journal = transaction->t_journal; + bhist_i->b_trans_is_running = transaction == + journal->j_running_transaction; + bhist_i->b_trans_is_committing = transaction == + journal->j_committing_transaction; + } + } else { +raced: + bhist_i->b_jcount = 0; + bhist_i->b_jbd = 0; + bhist_i->b_jlist = 0; + bhist_i->b_frozen_data = 0; + bhist_i->b_committed_data = 0; + bhist_i->b_transaction = 0; + bhist_i->b_next_transaction = 0; + bhist_i->b_cp_transaction = 0; + } +#endif /* defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE) */ + + bhist_i->cpu = smp_processor_id(); + bhist_i->b_count = atomic_read(&src->b_count); + + dest->b_history.b_history_head++; + if 
(dest->b_history.b_history_head - dest->b_history.b_history_tail > + BUFFER_HISTORY_SIZE) + dest->b_history.b_history_tail = + dest->b_history.b_history_head - BUFFER_HISTORY_SIZE; + + spin_unlock_irqrestore(&trace_lock, flags); +} + +static const char *b_jlist_to_string(unsigned int b_list) +{ + switch (b_list) { +#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE) + case BJ_None: return "BJ_None"; + case BJ_SyncData: return "BJ_SyncData"; + case BJ_Metadata: return "BJ_Metadata"; + case BJ_Forget: return "BJ_Forget"; + case BJ_IO: return "BJ_IO"; + case BJ_Shadow: return "BJ_Shadow"; + case BJ_LogCtl: return "BJ_LogCtl"; + case BJ_Reserved: return "BJ_Reserved"; +#endif + default: return "Bad b_jlist"; + } +} + +static void print_one_hist(struct buffer_history_item *bhist_i) +{ + printk(" %s():%s\n", bhist_i->function, bhist_i->info); + printk(" b_state:0x%lx b_jlist:%s cpu:%d b_count:%d b_blocknr:%lu\n", + bhist_i->b_state, + b_jlist_to_string(bhist_i->b_jlist), + bhist_i->cpu, + bhist_i->b_count, + bhist_i->b_blocknr); +#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE) + printk(" b_jbd:%u b_frozen_data:%p b_committed_data:%p\n", + bhist_i->b_jbd, + bhist_i->b_frozen_data, + bhist_i->b_committed_data); + printk(" b_transaction:%u b_next_transaction:%u " + "b_cp_transaction:%u b_trans_is_running:%u\n", + bhist_i->b_transaction, + bhist_i->b_next_transaction, + bhist_i->b_cp_transaction, + bhist_i->b_trans_is_running); + printk(" b_trans_is_comitting:%u b_jcount:%u pg_dirty:%u", + bhist_i->b_trans_is_committing, + bhist_i->b_jcount, + bhist_i->pg_dirty); +#endif + printk("\n"); +} + +void print_buffer_fields(struct buffer_head *bh) +{ + printk("b_blocknr:%llu b_count:%d\n", + (unsigned long long)bh->b_blocknr, atomic_read(&bh->b_count)); + printk("b_this_page:%p b_data:%p b_page:%p\n", + bh->b_this_page, bh->b_data, bh->b_page); +#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE) + if (buffer_jbd(bh)) { + struct journal_head *jh = bh2jh(bh); + + 
printk("b_jlist:%u b_frozen_data:%p b_committed_data:%p\n", + jh->b_jlist, jh->b_frozen_data, jh->b_committed_data); + printk(" b_transaction:%p b_next_transaction:%p " + "b_cp_transaction:%p\n", + jh->b_transaction, jh->b_next_transaction, + jh->b_cp_transaction); + printk("b_cpnext:%p b_cpprev:%p\n", + jh->b_cpnext, jh->b_cpprev); + } +#endif +} + +void print_buffer_trace(struct buffer_head *bh) +{ + unsigned long idx, count; + unsigned long flags; + + printk("buffer trace for buffer at 0x%p (I am CPU %d)\n", + bh, smp_processor_id()); + BUFFER_TRACE(bh, ""); /* Record state now */ + + spin_lock_irqsave(&trace_lock, flags); + for ( idx = bh->b_history.b_history_tail, count = 0; + idx < bh->b_history.b_history_head && + count < BUFFER_HISTORY_SIZE; + idx++, count++) + print_one_hist(bh->b_history.b + + (idx & (BUFFER_HISTORY_SIZE - 1))); + + print_buffer_fields(bh); + spin_unlock_irqrestore(&trace_lock, flags); + dump_stack(); + printk("\n"); +} + +static struct buffer_head *failed_buffer_head; /* For access with debuggers */ + +void buffer_assertion_failure(struct buffer_head *bh) +{ + console_verbose(); + failed_buffer_head = bh; + print_buffer_trace(bh); +} +EXPORT_SYMBOL(buffer_trace); +EXPORT_SYMBOL(print_buffer_trace); +EXPORT_SYMBOL(buffer_assertion_failure); +EXPORT_SYMBOL(print_buffer_fields); diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c --- a/fs/jbd/commit.c +++ b/fs/jbd/commit.c @@ -28,10 +28,12 @@ static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate) { BUFFER_TRACE(bh, ""); - if (uptodate) + if (uptodate) { set_buffer_uptodate(bh); - else + } else { + BUFFER_TRACE(bh, "clear uptodate"); clear_buffer_uptodate(bh); + } unlock_buffer(bh); } @@ -341,7 +343,7 @@ write_out_data: BUFFER_TRACE(bh, "locked"); if (!inverted_lock(journal, bh)) goto write_out_data; - __journal_temp_unlink_buffer(jh); + __journal_temp_unlink_buffer(journal, jh); __journal_file_buffer(jh, commit_transaction, BJ_Locked); jbd_unlock_bh_state(bh); @@ -367,7 
+369,7 @@ write_out_data: BUFFER_TRACE(bh, "writeout complete: unfile"); if (!inverted_lock(journal, bh)) goto write_out_data; - __journal_unfile_buffer(jh); + __journal_unfile_buffer(journal, jh); jbd_unlock_bh_state(bh); journal_remove_journal_head(bh); put_bh(bh); @@ -408,7 +410,7 @@ write_out_data: continue; } if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) { - __journal_unfile_buffer(jh); + __journal_unfile_buffer(journal, jh); jbd_unlock_bh_state(bh); journal_remove_journal_head(bh); put_bh(bh); @@ -780,6 +782,7 @@ restart_loop: * behind for writeback and gets reallocated for another * use in a different page. */ if (buffer_freed(bh)) { + BUFFER_TRACE(bh, "buffer_freed"); clear_buffer_freed(bh); clear_buffer_jbddirty(bh); } @@ -788,12 +791,14 @@ restart_loop: JBUFFER_TRACE(jh, "add to new checkpointing trans"); __journal_insert_checkpoint(jh, commit_transaction); JBUFFER_TRACE(jh, "refile for checkpoint writeback"); - __journal_refile_buffer(jh); + __journal_refile_buffer(journal, jh); + JBUFFER_TRACE(jh, "did writepage hack"); jbd_unlock_bh_state(bh); } else { + BUFFER_TRACE(bh, "not jbddirty"); J_ASSERT_BH(bh, !buffer_dirty(bh)); J_ASSERT_JH(jh, jh->b_next_transaction == NULL); - __journal_unfile_buffer(jh); + __journal_unfile_buffer(journal, jh); jbd_unlock_bh_state(bh); journal_remove_journal_head(bh); /* needs a brelse */ release_buffer_page(bh); diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c --- a/fs/jbd/transaction.c +++ b/fs/jbd/transaction.c @@ -1034,7 +1034,7 @@ int journal_dirty_data(handle_t *handle, /* journal_clean_data_list() may have got there first */ if (jh->b_transaction != NULL) { JBUFFER_TRACE(jh, "unfile from commit"); - __journal_temp_unlink_buffer(jh); + __journal_temp_unlink_buffer(journal, jh); /* It still points to the committing * transaction; move it to this one so * that the refile assert checks are @@ -1053,7 +1053,7 @@ int journal_dirty_data(handle_t *handle, if (jh->b_jlist != BJ_SyncData && jh->b_jlist != BJ_Locked) 
{ JBUFFER_TRACE(jh, "not on correct data list: unfile"); J_ASSERT_JH(jh, jh->b_jlist != BJ_Shadow); - __journal_temp_unlink_buffer(jh); + __journal_temp_unlink_buffer(journal, jh); jh->b_transaction = handle->h_transaction; JBUFFER_TRACE(jh, "file as data"); __journal_file_buffer(jh, handle->h_transaction, @@ -1249,10 +1249,10 @@ int journal_forget (handle_t *handle, st */ if (jh->b_cp_transaction) { - __journal_temp_unlink_buffer(jh); + __journal_temp_unlink_buffer(journal, jh); __journal_file_buffer(jh, transaction, BJ_Forget); } else { - __journal_unfile_buffer(jh); + __journal_unfile_buffer(journal, jh); journal_remove_journal_head(bh); __brelse(bh); if (!buffer_jbd(bh)) { @@ -1433,6 +1433,8 @@ int journal_force_commit(journal_t *jour static inline void __blist_add_buffer(struct journal_head **list, struct journal_head *jh) { + J_ASSERT_JH(jh, jh->b_tnext == NULL); + J_ASSERT_JH(jh, jh->b_tprev == NULL); if (!*list) { jh->b_tnext = jh->b_tprev = jh; *list = jh; @@ -1457,6 +1459,8 @@ __blist_add_buffer(struct journal_head * static inline void __blist_del_buffer(struct journal_head **list, struct journal_head *jh) { + J_ASSERT_JH(jh, jh->b_tnext != NULL); + J_ASSERT_JH(jh, jh->b_tprev != NULL); if (*list == jh) { *list = jh->b_tnext; if (*list == jh) @@ -1464,6 +1468,8 @@ __blist_del_buffer(struct journal_head * } jh->b_tprev->b_tnext = jh->b_tnext; jh->b_tnext->b_tprev = jh->b_tprev; + jh->b_tprev = NULL; + jh->b_tnext = NULL; } /* @@ -1477,16 +1483,19 @@ __blist_del_buffer(struct journal_head * * * Called under j_list_lock. The journal may not be locked. 
*/ -void __journal_temp_unlink_buffer(struct journal_head *jh) +void __journal_temp_unlink_buffer(journal_t *journal, struct journal_head *jh) { struct journal_head **list = NULL; transaction_t *transaction; struct buffer_head *bh = jh2bh(jh); + JBUFFER_TRACE(jh, "entry"); J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh)); transaction = jh->b_transaction; - if (transaction) + if (transaction) { + J_ASSERT_JH(jh, jh->b_transaction->t_journal == journal); assert_spin_locked(&transaction->t_journal->j_list_lock); + } J_ASSERT_JH(jh, jh->b_jlist < BJ_Types); if (jh->b_jlist != BJ_None) @@ -1529,9 +1538,9 @@ void __journal_temp_unlink_buffer(struct mark_buffer_dirty(bh); /* Expose it to the VM */ } -void __journal_unfile_buffer(struct journal_head *jh) +void __journal_unfile_buffer(journal_t *journal, struct journal_head *jh) { - __journal_temp_unlink_buffer(jh); + __journal_temp_unlink_buffer(journal, jh); jh->b_transaction = NULL; } @@ -1539,7 +1548,7 @@ void journal_unfile_buffer(journal_t *jo { jbd_lock_bh_state(jh2bh(jh)); spin_lock(&journal->j_list_lock); - __journal_unfile_buffer(jh); + __journal_unfile_buffer(journal, jh); spin_unlock(&journal->j_list_lock); jbd_unlock_bh_state(jh2bh(jh)); } @@ -1567,7 +1576,7 @@ __journal_try_to_free_buffer(journal_t * if (jh->b_jlist == BJ_SyncData || jh->b_jlist == BJ_Locked) { /* A written-back ordered data buffer */ JBUFFER_TRACE(jh, "release data"); - __journal_unfile_buffer(jh); + __journal_unfile_buffer(journal, jh); journal_remove_journal_head(bh); __brelse(bh); } @@ -1672,7 +1681,8 @@ static int __dispose_buffer(struct journ int may_free = 1; struct buffer_head *bh = jh2bh(jh); - __journal_unfile_buffer(jh); + JBUFFER_TRACE(jh, "unfile"); + __journal_unfile_buffer(transaction->t_journal, jh); if (jh->b_cp_transaction) { JBUFFER_TRACE(jh, "on running+cp transaction"); @@ -1928,6 +1938,7 @@ void __journal_file_buffer(struct journa int was_dirty = 0; struct buffer_head *bh = jh2bh(jh); + BUFFER_TRACE(bh, "entry"); 
J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh)); assert_spin_locked(&transaction->t_journal->j_list_lock); @@ -1950,7 +1961,7 @@ void __journal_file_buffer(struct journa } if (jh->b_transaction) - __journal_temp_unlink_buffer(jh); + __journal_temp_unlink_buffer(transaction->t_journal, jh); jh->b_transaction = transaction; switch (jlist) { @@ -1990,6 +2001,7 @@ void __journal_file_buffer(struct journa if (was_dirty) set_buffer_jbddirty(bh); + BUFFER_TRACE(bh, "filed"); } void journal_file_buffer(struct journal_head *jh, @@ -2012,18 +2024,20 @@ void journal_file_buffer(struct journal_ * * Called under jbd_lock_bh_state(jh2bh(jh)) */ -void __journal_refile_buffer(struct journal_head *jh) +void __journal_refile_buffer(journal_t *journal, struct journal_head *jh) { int was_dirty; struct buffer_head *bh = jh2bh(jh); + JBUFFER_TRACE(jh, "entry"); J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh)); if (jh->b_transaction) assert_spin_locked(&jh->b_transaction->t_journal->j_list_lock); /* If the buffer is now unused, just drop it. 
*/ if (jh->b_next_transaction == NULL) { - __journal_unfile_buffer(jh); + BUFFER_TRACE(bh, "unfile"); + __journal_unfile_buffer(journal, jh); return; } @@ -2033,7 +2047,7 @@ void __journal_refile_buffer(struct jour */ was_dirty = test_clear_buffer_jbddirty(bh); - __journal_temp_unlink_buffer(jh); + __journal_temp_unlink_buffer(journal, jh); jh->b_transaction = jh->b_next_transaction; jh->b_next_transaction = NULL; __journal_file_buffer(jh, jh->b_transaction, BJ_Metadata); @@ -2064,7 +2078,7 @@ void journal_refile_buffer(journal_t *jo jbd_lock_bh_state(bh); spin_lock(&journal->j_list_lock); - __journal_refile_buffer(jh); + __journal_refile_buffer(journal, jh); jbd_unlock_bh_state(bh); journal_remove_journal_head(bh); diff --git a/fs/mpage.c b/fs/mpage.c --- a/fs/mpage.c +++ b/fs/mpage.c @@ -191,6 +191,7 @@ do_mpage_readpage(struct bio *bio, struc for (page_block = 0; page_block < blocks_per_page; page_block++, block_in_file++) { bh.b_state = 0; + bh.b_page = page; if (block_in_file < last_block) { if (get_block(inode, block_in_file, &bh, 0)) goto confused; @@ -472,6 +473,7 @@ __mpage_writepage(struct bio *bio, struc for (page_block = 0; page_block < blocks_per_page; ) { map_bh.b_state = 0; + map_bh.b_page = page; if (get_block(inode, block_in_file, &map_bh, 1)) goto confused; if (buffer_new(&map_bh)) diff --git a/include/linux/buffer-trace.h b/include/linux/buffer-trace.h new file mode 100644 --- /dev/null +++ b/include/linux/buffer-trace.h @@ -0,0 +1,85 @@ +/* + * include/linux/buffer-trace.h + * + * Debugging support for recording buffer_head state transitions + * + * May 2001, akpm + * Created + */ + +#ifndef BUFFER_TRACE_H_INCLUDED +#define BUFFER_TRACE_H_INCLUDED + +#include <linux/config.h> + +#ifdef CONFIG_BUFFER_DEBUG + +/* The number of records per buffer_head. 
Must be a power of two */
+#define BUFFER_HISTORY_SIZE 32
+
+struct buffer_head;
+
+/* This gets embedded in struct buffer_head */
+struct buffer_history {
+	struct buffer_history_item {
+		const char *function;
+		char *info;
+		unsigned long b_state;
+		unsigned b_list:3;
+		unsigned b_jlist:4;
+		unsigned pg_dirty:1;
+		unsigned cpu:3;
+		unsigned b_count:8;
+		unsigned long b_blocknr;	/* For src != dest */
+#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE)
+		unsigned b_jcount:4;
+		unsigned b_jbd:1;
+		unsigned b_transaction:1;
+		unsigned b_next_transaction:1;
+		unsigned b_cp_transaction:1;
+		unsigned b_trans_is_running:1;
+		unsigned b_trans_is_committing:1;
+		void *b_frozen_data;
+		void *b_committed_data;
+#endif
+	} b[BUFFER_HISTORY_SIZE];
+	unsigned long b_history_head;	/* Next place to write */
+	unsigned long b_history_tail;	/* Oldest valid entry */
+};
+
+static inline void buffer_trace_init(struct buffer_history *bhist)
+{
+	bhist->b_history_head = 0;
+	bhist->b_history_tail = 0;
+}
+extern void buffer_trace(const char *function, struct buffer_head *dest,
+			struct buffer_head *src, char *info);
+extern void print_buffer_fields(struct buffer_head *bh);
+extern void print_buffer_trace(struct buffer_head *bh);
+
+#define BUFFER_STRINGIFY2(X) #X
+#define BUFFER_STRINGIFY(X) BUFFER_STRINGIFY2(X)
+
+#define BUFFER_TRACE2(dest, src, info)				\
+	do {							\
+		buffer_trace(__FUNCTION__, (dest), (src),	\
+			"["__FILE__":"				\
+			BUFFER_STRINGIFY(__LINE__)"] " info);	\
+	} while (0)
+
+#define BUFFER_TRACE(bh, info)	BUFFER_TRACE2(bh, bh, info)
+#define JBUFFER_TRACE(jh, info)	BUFFER_TRACE(jh2bh(jh), info)
+
+#else /* CONFIG_BUFFER_DEBUG */
+
+#define buffer_trace_init(bh)		do {} while (0)
+#define print_buffer_fields(bh)		do {} while (0)
+#define print_buffer_trace(bh)		do {} while (0)
+#define BUFFER_TRACE(bh, info)		do {} while (0)
+#define BUFFER_TRACE2(bh, bh2, info)	do {} while (0)
+#define JBUFFER_TRACE(jh, info)		do {} while (0)
+
+#endif /* CONFIG_BUFFER_DEBUG */
+
+#endif /*
BUFFER_TRACE_H_INCLUDED */ + diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -13,6 +13,7 @@ #include <linux/pagemap.h> #include <linux/wait.h> #include <asm/atomic.h> +#include <linux/buffer-trace.h> enum bh_state_bits { BH_Uptodate, /* Contains valid data */ @@ -65,6 +66,10 @@ struct buffer_head { bh_end_io_t *b_end_io; /* I/O completion */ void *b_private; /* reserved for b_end_io */ struct list_head b_assoc_buffers; /* associated with another mapping */ + +#ifdef CONFIG_BUFFER_DEBUG + struct buffer_history b_history; +#endif }; /* diff --git a/include/linux/jbd.h b/include/linux/jbd.h --- a/include/linux/jbd.h +++ b/include/linux/jbd.h @@ -41,7 +41,7 @@ * hardware _can_ fail, but for debugging purposes when running tests on * known-good hardware we may want to trap these errors. */ -#undef JBD_PARANOID_IOFAIL +#define JBD_PARANOID_IOFAIL /* * The default maximum commit age, in seconds. @@ -55,7 +55,9 @@ * CONFIG_JBD_DEBUG is on. */ #define JBD_EXPENSIVE_CHECKING + extern int journal_enable_debug; +extern struct block_device *journal_no_write[2]; #define jbd_debug(n, f, a...) \ do { \ @@ -266,6 +268,7 @@ void buffer_assertion_failure(struct buf #else #define J_ASSERT_BH(bh, expr) J_ASSERT(expr) #define J_ASSERT_JH(jh, expr) J_ASSERT(expr) +#define buffer_assertion_failure(bh) do { } while (0) #endif #else @@ -273,9 +276,9 @@ void buffer_assertion_failure(struct buf #endif /* JBD_ASSERTIONS */ #if defined(JBD_PARANOID_IOFAIL) -#define J_EXPECT(expr, why...) J_ASSERT(expr) -#define J_EXPECT_BH(bh, expr, why...) J_ASSERT_BH(bh, expr) -#define J_EXPECT_JH(jh, expr, why...) J_ASSERT_JH(jh, expr) +#define J_EXPECT(expr, why...) ({ J_ASSERT(expr); 1; }) +#define J_EXPECT_BH(bh, expr, why...) ({ J_ASSERT_BH(bh, expr); 1; }) +#define J_EXPECT_JH(jh, expr, why...) ({ J_ASSERT_JH(jh, expr); 1; }) #else #define __journal_expect(expr, why...) 
\ ({ \ @@ -823,10 +826,10 @@ struct journal_s */ /* Filing buffers */ -extern void __journal_temp_unlink_buffer(struct journal_head *jh); +extern void __journal_temp_unlink_buffer(journal_t *, struct journal_head *jh); extern void journal_unfile_buffer(journal_t *, struct journal_head *); -extern void __journal_unfile_buffer(struct journal_head *); -extern void __journal_refile_buffer(struct journal_head *); +extern void __journal_unfile_buffer(journal_t *, struct journal_head *); +extern void __journal_refile_buffer(journal_t *, struct journal_head *); extern void journal_refile_buffer(journal_t *, struct journal_head *); extern void __journal_file_buffer(struct journal_head *, transaction_t *, int); extern void __journal_free_buffer(struct journal_head *bh); @@ -1055,6 +1058,8 @@ static inline int jbd_space_needed(journ * Definitions which augment the buffer_head layer */ +/* JBD additions */ + /* journaling buffer types */ #define BJ_None 0 /* Not journaled */ #define BJ_SyncData 1 /* Normal data: flush before commit */ @@ -1071,13 +1076,6 @@ extern int jbd_blocks_per_page(struct in #ifdef __KERNEL__ -#define buffer_trace_init(bh) do {} while (0) -#define print_buffer_fields(bh) do {} while (0) -#define print_buffer_trace(bh) do {} while (0) -#define BUFFER_TRACE(bh, info) do {} while (0) -#define BUFFER_TRACE2(bh, bh2, info) do {} while (0) -#define JBUFFER_TRACE(jh, info) do {} while (0) - #endif /* __KERNEL__ */ #endif /* CONFIG_JBD || CONFIG_JBD_MODULE || !__KERNEL__ */ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kjournald() with DIO
  2005-09-16 13:42 ` Stephen C. Tweedie
@ 2005-09-21 18:22 ` Mingming Cao
  0 siblings, 0 replies; 29+ messages in thread
From: Mingming Cao @ 2005-09-21 18:22 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrew Morton, Badari Pulavarty, linux-fsdevel, Mingming Cao

On Fri, 2005-09-16 at 14:42 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, 2005-09-13 at 01:29, Andrew Morton wrote:
>
> > Have you spoken with Mingming about this?  She's chasing a similar race.
> > She's currently working on getting the ext3-debug patch working so we can
> > find out how those buffers (which apparently aren't even attached to the
> > journal) came to have an elevated refcount.
>
> Here's the latest version of ext3 buffer tracing which I've been using.
> It's basically just the same as one from Andrew from about 2.6.5
> vintage, with a few changes in places to deal with new bits of code that
> need temporary buffer_heads initialised properly for tracing.
>
> It has survived a couple of hours of fsx and fsstress in parallel, but I
> haven't run any O_DIRECT tests with it on the current git kernel, so
> there may still be problems in that area.

Your patch ran smoothly through overnight iozone DIO tests here. It also
successfully shows the last 32 activities in the history of the busy buffer
which caused the EIO failure, and the results match the previous analysis.

A minor improvement: translate the BJ_Locked state into a proper string.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---

 linux-2.6.5-7.191-cmm/fs/jbd-kernel.c |    1 +
 1 files changed, 1 insertion(+)

diff -puN fs/jbd-kernel.c~buffer-trace-fix fs/jbd-kernel.c
--- linux-2.6.5-7.191/fs/jbd-kernel.c~buffer-trace-fix	2005-09-19 15:48:03.000000000 -0700
+++ linux-2.6.5-7.191-cmm/fs/jbd-kernel.c	2005-09-19 15:50:16.000000000 -0700
@@ -163,6 +163,7 @@ static const char *b_jlist_to_string(uns
 	case BJ_Shadow:		return "BJ_Shadow";
 	case BJ_LogCtl:		return "BJ_LogCtl";
 	case BJ_Reserved:	return "BJ_Reserved";
+	case BJ_Locked:		return "BJ_Locked";
 #endif
 	default:		return "Bad b_jlist";
 	}
_

^ permalink raw reply	[flat|nested] 29+ messages in thread
end of thread, other threads:[~2005-09-21 18:22 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2005-09-12 23:23 kjournald() with DIO Badari Pulavarty
2005-09-12 23:37 ` Andrew Morton
2005-09-13  0:06 ` Badari Pulavarty
2005-09-13  0:29 ` Andrew Morton
2005-09-13 16:52 ` Badari Pulavarty
2005-09-13 23:07 ` Andrew Morton
2005-09-14 17:23 ` Mingming Cao
2005-09-14 18:18 ` Andrew Morton
2005-09-14 21:40 ` Mingming Cao
2005-09-14 22:02 ` Andrew Morton
2005-09-15 11:11 ` Suparna Bhattacharya
2005-09-15 18:52 ` Andrew Morton
2005-09-15 15:03 ` Badari Pulavarty
2005-09-15 19:22 ` Andrea Arcangeli
2005-09-15 20:00 ` Andrew Morton
2005-09-15 20:20 ` Andrea Arcangeli
2005-09-15 20:35 ` Andrew Morton
2005-09-15 20:49 ` Badari Pulavarty
2005-09-15 21:06 ` Andrea Arcangeli
2005-09-15 21:20 ` Andrew Morton
2005-09-15 22:22 ` Badari Pulavarty
2005-09-15 21:03 ` Andrea Arcangeli
2005-09-15 21:26 ` Andrew Morton
2005-09-15 22:04 ` Andrea Arcangeli
2005-09-15 23:28 ` Mingming Cao
2005-09-16  0:18 ` Andrea Arcangeli
2005-09-13 17:53 ` Mingming Cao
2005-09-16 13:42 ` Stephen C. Tweedie
2005-09-21 18:22 ` Mingming Cao
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).