* kjournald() with DIO
@ 2005-09-12 23:23 Badari Pulavarty
2005-09-12 23:37 ` Andrew Morton
0 siblings, 1 reply; 29+ messages in thread
From: Badari Pulavarty @ 2005-09-12 23:23 UTC (permalink / raw)
To: linux-fsdevel; +Cc: akpm, sct
Hi,
I have been chasing a race condition, which leads to
returning premature EIO with DIO (iozone tests) on
ext3 filesystems.
It seems to be a race between kjournald() & DIO process
invalidating the page. Is this a known issue ?
Thanks,
Badari
Here is the race:
kjournald submitted buffers for IO and is waiting for them to finish.
Note that it holds a reference against each buffer.
journal_commit_transaction()
...
	submitted buffers for IO
/* Waiting for IO to complete */
while (commit_transaction->t_locked_list) {
...
get_bh(bh);
if (buffer_locked(bh)) {
spin_unlock(&journal->j_list_lock);
wait_on_buffer(bh); <<----------
spin_lock(&journal->j_list_lock);
}
..
put_bh(bh);
}
Now, the DIO process comes in and frees the jh through
journal_try_to_free_buffers(), but drop_buffers() fails since
kjournald() holds a reference against the buffer.
invalidate_complete_page()
..
ext3_releasepage()
journal_try_to_free_buffers()
journal_put_journal_head()
__journal_try_to_free_buffer()
<<--- freed jh
try_to_free_buffers()
drop_buffers()
if (buffer_busy(bh))
goto failed;
<<--- returns EIO due to b_count
^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kjournald() with DIO
  2005-09-12 23:23 kjournald() with DIO Badari Pulavarty
@ 2005-09-12 23:37 ` Andrew Morton
  2005-09-13  0:06   ` Badari Pulavarty
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2005-09-12 23:37 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: linux-fsdevel, sct

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> I have been chasing a race condition, which leads to
> returning premature EIO with DIO (iozone tests) on
> ext3 filesystems.

In Linux, I assume ;)  Any particular version?

> It seems to be a race between kjournald() & DIO process
> invalidating the page. Is this a known issue ?

Sort-of.  Jan Kara has suggested a fix for this.

Go into block_write_full_page(), replace the call to
block_invalidatepage() with a call to do_invalidatepage(), please.
* Re: kjournald() with DIO
  2005-09-12 23:37 ` Andrew Morton
@ 2005-09-13  0:06 ` Badari Pulavarty
  2005-09-13  0:29   ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Badari Pulavarty @ 2005-09-13 0:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, sct

On Mon, 2005-09-12 at 16:37 -0700, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > I have been chasing a race condition, which leads to
> > returning premature EIO with DIO (iozone tests) on
> > ext3 filesystems.
>
> In Linux, I assume ;)  Any particular version?

I am working with a SLES test kernel (SLES9 SP2), but it has most DIO
fixes from mainline.

> > It seems to be a race between kjournald() & DIO process
> > invalidating the page. Is this a known issue ?
>
> Sort-of.  Jan Kara has suggested a fix for this.
>
> Go into block_write_full_page(), replace the call to block_invalidatepage()
> with a call to do_invalidatepage(), please.

Didn't help.  How can this help?  The DIO code is kicking back to
buffered mode, since it's trying to fill in the holes.  The fix you
suggested might have helped if the IO is beyond the file size, or if the
file size does not end on a page boundary.  What am I missing?

Thanks,
Badari
* Re: kjournald() with DIO
  2005-09-13  0:06 ` Badari Pulavarty
@ 2005-09-13  0:29 ` Andrew Morton
  2005-09-13 16:52   ` Badari Pulavarty
                     ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Andrew Morton @ 2005-09-13 0:29 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: linux-fsdevel, sct

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> Didn't help. How can this help ?

Oh well.  Different bug.  See
http://bugzilla.kernel.org/show_bug.cgi?id=4964

> DIO code is kicking back to buffered mode, since its trying to
> fill-in the holes. The fix you suggested might have helped, if
> IO is beyond file size or file size not ending on page boundary.
> What am I missing ?

Have you spoken with Mingming about this?  She's chasing a similar race.

She's currently working on getting the ext3-debug patch working so we
can find out how those buffers (which apparently aren't even attached to
the journal) came to have an elevated refcount.

Apparently the debug patch is currently oopsing.  We need to fix that.
* Re: kjournald() with DIO
  2005-09-13  0:29 ` Andrew Morton
@ 2005-09-13 16:52 ` Badari Pulavarty
  2005-09-13 23:07   ` Andrew Morton
  2005-09-13 17:53 ` Mingming Cao
  2005-09-16 13:42 ` Stephen C. Tweedie
  2 siblings, 1 reply; 29+ messages in thread
From: Badari Pulavarty @ 2005-09-13 16:52 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, sct

On Mon, 2005-09-12 at 17:29 -0700, Andrew Morton wrote:
> > DIO code is kicking back to buffered mode, since its trying to
> > fill-in the holes. The fix you suggested might have helped, if
> > IO is beyond file size or file size not ending on page boundary.
> > What am I missing ?
>
> Have you spoken with Mingming about this?  She's chasing a similar race.
> She's currently working on getting the ext3-debug patch working so we can
> find out how those buffers (which apparently aren't even attached to the
> journal) came to have an elevated refcount.

I am helping her on the same bug.

The buffers were attached to the journal (earlier).  kjournald()
submitted them for IO and is waiting for the IO to finish, so it holds a
reference count on them.

journal_commit_transaction()
...
	submitted buffers for IO
	/* Waiting for IO to complete */
	while (commit_transaction->t_locked_list) {
	...
		get_bh(bh);
		if (buffer_locked(bh)) {
			spin_unlock(&journal->j_list_lock);
			wait_on_buffer(bh);	<<<<<<

In the meanwhile, the DIO process, through releasepage, invalidated the
page, released the journal head attached to the buffer and tried to drop
the buffer.  But the buffer drop fails due to the kjournald reference
count.

invalidate_complete_page():
..
	ext3_releasepage()
		journal_try_to_free_buffers()
			journal_put_journal_head()
			__journal_try_to_free_buffer()
				<--- freed jh
		try_to_free_buffers()
			drop_buffers()
				if (buffer_busy(bh))
					goto failed;	<<--

Thanks,
Badari
* Re: kjournald() with DIO
  2005-09-13 16:52 ` Badari Pulavarty
@ 2005-09-13 23:07 ` Andrew Morton
  2005-09-14 17:23   ` Mingming Cao
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2005-09-13 23:07 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: linux-fsdevel, sct

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> The buffers were attached to the journal (earlier). kjournald()
> submitted them for IO and is waiting for the IO to finish, so it holds
> a reference count on them.
>
> journal_commit_transaction()
> ...
> 	submitted buffers for IO
> 	/* Waiting for IO to complete */
> 	while (commit_transaction->t_locked_list) {
> 	...
> 		get_bh(bh);
> 		if (buffer_locked(bh)) {
> 			spin_unlock(&journal->j_list_lock);
> 			wait_on_buffer(bh);	<<<<<<
>
> In the meanwhile, the DIO process, through releasepage, invalidated the
> page, released the journal head attached to the buffer and tried to
> drop the buffer.  But the buffer drop fails due to the kjournald
> reference count.
>
> invalidate_complete_page():
> ..
> 	ext3_releasepage()
> 		journal_try_to_free_buffers()
> 			journal_put_journal_head()
> 			__journal_try_to_free_buffer()
> 				<--- freed jh
> 		try_to_free_buffers()
> 			drop_buffers()
> 				if (buffer_busy(bh))
> 					goto failed;	<<--

OK.  <checks email>

The problem is that

	__generic_file_aio_write_nolock
	->invalidate_inode_pages2_range
	  ->invalidate_complete_page

was unable to invalidate the page, and so -EIO was propagated back.

Well, that was wrong of us.  invalidate_*_pages() is best-effort.  It
does not guarantee to remove the pages from pagecache or anything else.

The truncate_*_pages() functions do have such guarantees.  They will
block on that locked buffer, for example.
	/*
	 * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
	 * @mapping: the address_space which holds the pages to invalidate
	 * @start: the offset 'from' which to invalidate
	 * @end: the offset 'to' which to invalidate (inclusive)
	 *
	 * This function only removes the unlocked pages, if you want to
	 * remove all the pages of one inode, you must call truncate_inode_pages.
	 *
	 * invalidate_mapping_pages() will not block on IO activity. It will not
	 * invalidate pages which are dirty, locked, under writeback or mapped into
	 * pagetables.
	 */

So if we really want to take down that pagecache we'll need to call
->invalidatepage() against each page, not ->releasepage().

Yes, the naming is whacky.  truncate_inode_pages() uses
->invalidatepage() and invalidate_inode_pages() uses ->releasepage().

Or simply ignore the invalidate_inode_pages2_range() return value in
generic_file_direct_IO().
* Re: kjournald() with DIO
  2005-09-13 23:07 ` Andrew Morton
@ 2005-09-14 17:23 ` Mingming Cao
  2005-09-14 18:18   ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Mingming Cao @ 2005-09-14 17:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: Badari Pulavarty, linux-fsdevel, sct

On Tue, 2005-09-13 at 16:07 -0700, Andrew Morton wrote:
>
> Or simply ignore the invalidate_inode_pages2_range() return value in
> generic_file_direct_IO().
>
Could we simply do that?

I found some discussions about why we check the return value of
invalidate_inode_pages2_range() in generic_file_direct_IO():
http://marc.theaimsgroup.com/?l=linux-kernel&m=109850054025709&w=2

It seems the check for EIO was added in 2.6.11 to handle the case of
parallel direct IO and mapped IO.  It is possible that the mapped IO
dirties the pages after the a_ops->direct_IO.  In that case, an error is
returned to the caller of the DIO to indicate the race.

Thanks,
Mingming
* Re: kjournald() with DIO
  2005-09-14 17:23 ` Mingming Cao
@ 2005-09-14 18:18 ` Andrew Morton
  2005-09-14 21:40   ` Mingming Cao
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2005-09-14 18:18 UTC (permalink / raw)
To: cmm; +Cc: pbadari, linux-fsdevel, sct

Mingming Cao <cmm@us.ibm.com> wrote:
>
> On Tue, 2005-09-13 at 16:07 -0700, Andrew Morton wrote:
>
> > Or simply ignore the invalidate_inode_pages2_range() return value in
> > generic_file_direct_IO().
> >
> Could we simply do that?
>
> I found some discussions about why we check the return value of
> invalidate_inode_pages2_range() in generic_file_direct_IO():
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109850054025709&w=2

Well found.  That brings it back.

> It seems the check for EIO was added to 2.6.11 to handle the case of
> parallel direct IO and mapped IO. It is possible that the mapped IO
> dirty the pages after the a_ops->direct_IO. In that case, an error will
> return back to the caller of DIO to indicate the race.

According to the logic we discussed last year,
invalidate_inode_pages2_range() only needs to return -EIO if it failed
to invalidate a page, and that page was dirty.

The -EIO is there to tell the caller that another process dirtied
pagecache against the file (within the range of the direct-io write())
after generic_file_direct_IO() has synced the pagecache to disk.

The -EIO is telling the direct-io write()r "hey, the data which you
wrote was overwritten by a racing buffered-write() or mmapped-write".
It's not obvious to me _why_ we should tell the direct-io write()r this
- after all, we assume that's what the application developer wanted to
do.
--- devel/mm/truncate.c~a	2005-09-14 11:16:42.000000000 -0700
+++ devel-akpm/mm/truncate.c	2005-09-14 11:17:17.000000000 -0700
@@ -249,7 +249,7 @@ EXPORT_SYMBOL(invalidate_inode_pages);
  * Any pages which are found to be mapped into pagetables are unmapped prior to
  * invalidation.
  *
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIO if any pages could not be invalidated and were dirty.
  */
 int invalidate_inode_pages2_range(struct address_space *mapping,
 				  pgoff_t start, pgoff_t end)
@@ -307,9 +307,10 @@ int invalidate_inode_pages2_range(struct
 		}
 		was_dirty = test_clear_page_dirty(page);
 		if (!invalidate_complete_page(mapping, page)) {
-			if (was_dirty)
+			if (was_dirty) {
 				set_page_dirty(page);
-			ret = -EIO;
+				ret = -EIO;
+			}
 		}
 		unlock_page(page);
 	}
_
* Re: kjournald() with DIO
  2005-09-14 18:18 ` Andrew Morton
@ 2005-09-14 21:40 ` Mingming Cao
  2005-09-14 22:02   ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Mingming Cao @ 2005-09-14 21:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: pbadari, linux-fsdevel, sct, andrea

On Wed, 2005-09-14 at 11:18 -0700, Andrew Morton wrote:
> According to the logic we discussed last year,
> invalidate_inode_pages2_range() only needs to return -EIO if it failed
> to invalidate a page, and that page was dirty.
> ...
> Still, we don't have to worry about that at present because
> invalidate_inode_pages2_range() is just doing the wrong thing: it's
> treating this elevated-refcount buffer_head as if it was a dirty page,
> and it's not.
>
> How about this?

I proposed a similar idea to Andrea in the bug report before.  Andrea
expressed this concern: with this (try_to_free_buffers() still failing
to drop the buffer because of the refcount elevated by kjournald),
block_read_full_page() will not re-read the buffers from disk the next
time a buffered-IO read hits the page, after the direct-io has
completed.  This is because the buffer is marked uptodate.  How could we
handle this?

Mingming
* Re: kjournald() with DIO
  2005-09-14 21:40 ` Mingming Cao
@ 2005-09-14 22:02 ` Andrew Morton
  2005-09-15 11:11   ` Suparna Bhattacharya
  2005-09-15 15:03   ` Badari Pulavarty
  0 siblings, 2 replies; 29+ messages in thread
From: Andrew Morton @ 2005-09-14 22:02 UTC (permalink / raw)
To: cmm; +Cc: pbadari, linux-fsdevel, sct, andrea

Mingming Cao <cmm@us.ibm.com> wrote:
>
> I proposed a similar idea to Andrea in the bug report before.  Andrea
> expressed this concern: with this (try_to_free_buffers() still failing
> to drop the buffer because of the refcount elevated by kjournald),
> block_read_full_page() will not re-read the buffers from disk the next
> time a buffered-IO read hits the page, after the direct-io has
> completed.  This is because the buffer is marked uptodate.  How could
> we handle this?

Well.  Application A has written data x with direct-io.  Application B
has, at a later time, written, dirtied (or even read) data y with
buffered I/O.

So the contents of pagecache are in fact correct, but it is not known
whether or not the data on disk is correct.

I guess one way of resolving it is to make
invalidate_inode_pages2_range() provide a hard guarantee: call
truncate_complete_page() instead of invalidate_complete_page().  We need
to have a think about the implications of that.  For starters, if
someone gets in at the right time and reinstantiates a pte against the
page, we'll end up converting that into an anonymous page.
* Re: kjournald() with DIO
  2005-09-14 22:02 ` Andrew Morton
@ 2005-09-15 11:11 ` Suparna Bhattacharya
  2005-09-15 18:52   ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Suparna Bhattacharya @ 2005-09-15 11:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: cmm, pbadari, linux-fsdevel, sct, andrea

On Wed, Sep 14, 2005 at 03:02:24PM -0700, Andrew Morton wrote:
> Well.  Application A has written data x with direct-io.  Application B
> has, at a later time, written, dirtied (or even read) data y with
> buffered I/O.

I guess the concern is that the page B has written was based off stale
data, which has been mmap-written with a few new bytes.

But I wonder, do we really have to, or intend to, guarantee DIO vs
mmapped-write consistency?  What we have is sort of best-effort -- we
just ensure no corruptions and exposures, and try to keep data
consistent to the extent we can.  Can we actually do a perfect job?

> So the contents of pagecache are in fact correct, but it is not known
> whether or not the data on disk is correct.
>
> I guess one way of resolving it is to make invalidate_inode_pages2_range()
> provide a hard guarantee: call truncate_complete_page() instead of
> invalidate_complete_page().  We need to have a think about the
> implications of that.  For starters, if someone gets in at the right
> time and reinstantiates a pte against the page, we'll end up converting
> that into an anonymous page.

Regards
Suparna

--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India
* Re: kjournald() with DIO
  2005-09-15 11:11 ` Suparna Bhattacharya
@ 2005-09-15 18:52 ` Andrew Morton
  0 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2005-09-15 18:52 UTC (permalink / raw)
To: suparna; +Cc: cmm, pbadari, linux-fsdevel, sct, andrea

Suparna Bhattacharya <suparna@in.ibm.com> wrote:
>
> But I wonder, do we really have to or intend to guarantee DIO vs mmaped
> write consistency ? What we have is sort of best effort -- we just ensure
> no corruptions and exposures, and try to keep data consistent to the
> extent we can.

Yup.  There will be holes where inconsistencies can get in.  We just
need to understand what they are and arrange for them to be as benign as
poss.

> Can we actually do a perfect job ?

Certainly we can, but I don't think there's a way of doing it in an
acceptably performant way.
* Re: kjournald() with DIO
  2005-09-14 22:02 ` Andrew Morton
  2005-09-15 11:11 ` Suparna Bhattacharya
@ 2005-09-15 15:03 ` Badari Pulavarty
  2005-09-15 19:22   ` Andrea Arcangeli
  1 sibling, 1 reply; 29+ messages in thread
From: Badari Pulavarty @ 2005-09-15 15:03 UTC (permalink / raw)
To: Andrew Morton; +Cc: cmm, linux-fsdevel, sct, andrea

On Wed, 2005-09-14 at 15:02 -0700, Andrew Morton wrote:
> I guess one way of resolving it is to make invalidate_inode_pages2_range()
> provide a hard guarantee: call truncate_complete_page() instead of
> invalidate_complete_page().  We need to have a think about the
> implications of that.  For starters, if someone gets in at the right
> time and reinstantiates a pte against the page, we'll end up converting
> that into an anonymous page.

FWIW, I changed it to call truncate_complete_page() and the tests ran
fine overnight.

Thanks,
Badari
* Re: kjournald() with DIO
  2005-09-15 15:03 ` Badari Pulavarty
@ 2005-09-15 19:22 ` Andrea Arcangeli
  2005-09-15 20:00   ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2005-09-15 19:22 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: Andrew Morton, cmm, linux-fsdevel, sct

On Thu, Sep 15, 2005 at 08:03:23AM -0700, Badari Pulavarty wrote:
> FWIW, I changed it to call truncate_complete_page() and the tests ran
> fine overnight.

What is the main difference between the two?  (Besides the removal of
the pagecache, which isn't interesting here.)  That it calls
->invalidatepage (ext3_invalidatepage) instead of ->releasepage?

->invalidatepage must be freeing the buffers too (the fs thinks the data
block will be released), so why is there a difference between
->releasepage and ->invalidatepage in the first place?  What if you
define both of them to the invalidatepage implementation?  (With a local
change to ext3 calling ext3_invalidatepage with offset = 0 inside
ext3_releasepage; the gfpmask should also be checked, but that's not
important for now.)

Just trying to understand why truncate blocks and waits while
releasepage doesn't, which seems to be the reason we have a problem in
the first place.
* Re: kjournald() with DIO
  2005-09-15 19:22 ` Andrea Arcangeli
@ 2005-09-15 20:00 ` Andrew Morton
  2005-09-15 20:20   ` Andrea Arcangeli
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2005-09-15 20:00 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: pbadari, cmm, linux-fsdevel, sct

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Thu, Sep 15, 2005 at 08:03:23AM -0700, Badari Pulavarty wrote:
> > FWIW, I changed it to call truncate_complete_page() and the tests ran
> > fine overnight.
>
> What is the main difference between the two? (besides the removal of
> the pagecache that isn't interesting here). That it calls
> ->invalidatepage (ext3_invalidatepage) instead of ->releasepage?

aops.releasepage():

  This is a best-effort function which was originally called only from
  vmscan.c.  It's the filesystem's version of try_to_free_buffers().
  Later called by direct-io and by fadvise.

aops.invalidatepage():

  Provides hard guarantees.  Called by truncate.  It's the filesystem's
  version of block_truncate_page().

Major difference: invalidatepage() is supposed to kill dirty buffers,
releasepage() must not.

> ->invalidatepage must be freeing the buffers too (the fs thinks the data
> block will be released) so why is there a difference between
> ->releasepage and ->invalidatepage in the first place?

Memory reclaim versus truncate, basically.

> What if you
> define both of them to the invalidatepage implementation?

Memory reclaim will reclaim dirty pages ;)

(For those parts of the page outside the truncation point.  The
truncation point isn't meaningful for ->releasepage().)

Probably we could simulate ->releasepage(page) by using
->invalidatepage(page, PAGE_CACHE_SIZE), but that's relying on
side-effects, or would need new semantics defined, or something.
> (with a local
> change to ext3 calling ext3_invalidatepage with offset = 0 inside
> ext3_releasepage, the gfpmask should also be checked but that's not
> important for now)
>
> Just trying to understand why truncatepage blocks and waits, and
> releasepage doesn't (which seems the reason why we have a problem in the
> first place).

I made releasepage() nonblocking to avoid blocking processes in the
memory reclaim paths.  Given that it's best-effort, it may as well just
trylock everything it needs and give up if that doesn't work out.

I think ->releasepage() was being called under locks at some stage, but
that appears to no longer be the case.
* Re: kjournald() with DIO
  2005-09-15 20:00 ` Andrew Morton
@ 2005-09-15 20:20 ` Andrea Arcangeli
  2005-09-15 20:35   ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2005-09-15 20:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: pbadari, cmm, linux-fsdevel, sct

On Thu, Sep 15, 2005 at 01:00:18PM -0700, Andrew Morton wrote:
> Memory reclaim will reclaim dirty pages ;)

;)

> Probably we could simulate ->releasepage(page) by using
> ->invalidatepage(page, PAGE_CACHE_SIZE), but that's relying on
> side-effects, or would need new semantics defined, or something.

The problem is that memory reclaim is one thing, invalidate_inode_pages
is another, and truncate is a third.  They need three different
behaviours.

But we have only two APIs to ask the fs to get rid of the buffers: one
destroys dirty buffers, the other is non-blocking.  We might need one
that is blocking and doesn't destroy dirty buffers.

The "wait/gfpmask" parameter to releasepage is ignored by ext3; perhaps
we should change the semantics of that bit, and have it passed as
GFP_KERNEL only by invalidate_inode_pages.

> I made releasepage() nonblocking to avoid blocking processes in the memory
> reclaim paths.  Given that it's best-effort, it may as well just trylock
> everything it needs and give up if that doesn't work out.

I agree the memory reclaim can remain nonblocking.

However, note that the failure here is in try_to_free_buffers(), because
the buffer is pinned by the journal code (it's not a trylock failing).
The buffer is BH_Mapped, BH_Req, BH_Uptodate (0x19) - nicely tracked by
IBM.

> I think ->releasepage() was being called under locks at some stage but
> that appears to no longer be the case.

I'm a bit worried about the i_sem if waiting in releasepage, but there
should be no other locks in the way.  We need the blocking behaviour
only for invalidate_inode_pages.
* Re: kjournald() with DIO
  2005-09-15 20:20 ` Andrea Arcangeli
@ 2005-09-15 20:35 ` Andrew Morton
  2005-09-15 20:49   ` Badari Pulavarty
  2005-09-15 21:03   ` Andrea Arcangeli
  0 siblings, 2 replies; 29+ messages in thread
From: Andrew Morton @ 2005-09-15 20:35 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: pbadari, cmm, linux-fsdevel, sct

Andrea Arcangeli <andrea@suse.de> wrote:
>
> The problem is that one thing is the memory reclaim, one thing is the
> invalidate_inode_pages and one thing is the truncate.
>
> They need three different behaviours.

Yup.

> But we've only two API to ask the fs to get rid of the buffers, one is
> destroying dirty buffers, the other is non-blocking, we might need one
> that is blocking and it isn't destroying dirty buffers.
>
> The "wait/gfpmask" parameter to releasepage is ignored by ext3, perhaps
> we should change the semantics of that bit, and have it passed as
> GFP_KERNEL only by the invalidate_inode_pages.

That would work.

> However note that the failure here is in try_to_free_buffers because the
> buffer is pinned by the journal code (it's not a trylock failing).
> Buffer is BH_Mapped, BH_Req, BH_Uptodate (0x19) - nicely tracked by IBM.

I'll need reminding - why was the buffer unfreeable?  Because kjournald
had a ref on it?  Whereabouts is that happening?
* Re: kjournald() with DIO
  2005-09-15 20:35 ` Andrew Morton
@ 2005-09-15 20:49 ` Badari Pulavarty
  2005-09-15 21:06   ` Andrea Arcangeli
  2005-09-15 21:20   ` Andrew Morton
  1 sibling, 2 replies; 29+ messages in thread
From: Badari Pulavarty @ 2005-09-15 20:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andrea Arcangeli, cmm, linux-fsdevel, sct

On Thu, 2005-09-15 at 13:35 -0700, Andrew Morton wrote:
> I'll need reminding - why was the buffer unfreeable?  Because kjournald
> had a ref on it?  Whereabouts is that happening?

kjournald() is committing the transaction and doing IO to the buffers,
since the previous DIO write kicked it back to buffered mode (due to
hole filling).  Here is the race:

DIO Process				kjournald()

					journal_commit_transaction()
					...
					/* submitted buffers for IO */
					/* Waiting for IO to complete */
					while (t_locked_list) {
					...
						get_bh(bh);
						if (buffer_locked(bh)) {
							spin_unlock(&journal->j_list_lock);
							wait_on_buffer(bh);

invalidate_complete_page()
..
ext3_releasepage()
	journal_try_to_free_buffers()
		journal_put_journal_head()
		__journal_try_to_free_buffer()
			<--- freed jh
	try_to_free_buffers()
		drop_buffers()
			if (buffer_busy(bh))
				goto failed;	<<--- returns EIO due to b_count
* Re: kjournald() with DIO
  2005-09-15 20:49 ` Badari Pulavarty
@ 2005-09-15 21:06 ` Andrea Arcangeli
  0 siblings, 0 replies; 29+ messages in thread
From: Andrea Arcangeli @ 2005-09-15 21:06 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: Andrew Morton, cmm, linux-fsdevel, sct

On Thu, Sep 15, 2005 at 01:49:31PM -0700, Badari Pulavarty wrote:
> previous DIO write kicked it back to buffered mode (due to hole
> filling).

Btw, I guess with data=journal it'd trigger even without the hole
filling, and in fact data=writeback apparently hides the bug.
* Re: kjournald() with DIO
  2005-09-15 20:49         ` Badari Pulavarty
  2005-09-15 21:06           ` Andrea Arcangeli
@ 2005-09-15 21:20           ` Andrew Morton
  2005-09-15 22:22             ` Badari Pulavarty
  1 sibling, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2005-09-15 21:20 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: andrea, cmm, linux-fsdevel, sct

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> Here is the race:
>
> DIO Process				kjournald()
>
> 					journal_commit_transaction()
> 					...
> 					/* submitted buffers for IO */
> 					/* waiting for IO to complete */
> 					while (t_locked_list) {
> 						...
> 						get_bh(bh);
> 						if (buffer_locked(bh)) {
> 							spin_unlock(&journal->j_list_lock);
> 							wait_on_buffer(bh);
>
> invalidate_complete_page()
>   ..
>   ext3_releasepage()
>     journal_try_to_free_buffers()
>       journal_put_journal_head()
>         __journal_try_to_free_buffer()
>           <--- freed jh
>       try_to_free_buffers()
>         drop_buffers()
>           if (buffer_busy(bh))
>             goto failed;
>           <<--- returns EIO due to b_count

Right.  But there's still a race in there.

invalidate_complete_page() may still fail for these buffers.  The only
blocking it does is lock_buffer().  So if kjournald does:

	get_bh(bh);
	wait_on_buffer(bh);
	<- window
	put_bh(bh);

ext3_invalidatepage() may hit that unlocked, pinned buffer and fail to
release the page.

In that case, truncate_complete_page() will convert the page into an
anonymous page (if it's mmapped).  If it's just a pagecache page then I
guess it'll be converted into a zero-ref page on the page LRU.

It's not possible for ext3_invalidatepage() to block until a buffer comes
unpinned.  We'd have to add additional stuff for that (bring back
wake_up_buffer(), for example).
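[Editor's note] Andrew's "window" can also be sketched in userspace C. The point being modeled: since the invalidate path only ever blocks in lock_buffer(), waiting for the buffer lock does not help when the buffer is already unlocked but still pinned by a reference. All names here are illustrative, not kernel API.

```c
#include <assert.h>

/* Hypothetical model of the invalidate window. */
struct bh_sim2 {
	int b_count;	/* models atomic_t b_count */
	int b_locked;	/* models the BH_Lock bit */
};

/* Models the invalidate path: the only blocking it does is waiting
 * for the buffer lock; a pinned-but-unlocked buffer still defeats it. */
static int invalidate_sim(struct bh_sim2 *bh)
{
	while (bh->b_locked)		/* lock_buffer(): would sleep here */
		bh->b_locked = 0;	/* IO completion unlocks the buffer */
	/* In the window between wait_on_buffer() and put_bh(),
	 * b_count is still elevated, so the free fails. */
	return (bh->b_count > 0) ? -1 : 0;
}
```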
* Re: kjournald() with DIO
  2005-09-15 21:20           ` Andrew Morton
@ 2005-09-15 22:22             ` Badari Pulavarty
  0 siblings, 0 replies; 29+ messages in thread
From: Badari Pulavarty @ 2005-09-15 22:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: andrea, cmm, linux-fsdevel, sct

On Thu, 2005-09-15 at 14:20 -0700, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > Here is the race:
> >
> > DIO Process				kjournald()
> >
> > 					journal_commit_transaction()
> > 					...
> > 					/* submitted buffers for IO */
> > 					/* waiting for IO to complete */
> > 					while (t_locked_list) {
> > 						...
> > 						get_bh(bh);
> > 						if (buffer_locked(bh)) {
> > 							spin_unlock(&journal->j_list_lock);
> > 							wait_on_buffer(bh);
> >
> > invalidate_complete_page()
> >   ..
> >   ext3_releasepage()
> >     journal_try_to_free_buffers()
> >       journal_put_journal_head()
> >         __journal_try_to_free_buffer()
> >           <--- freed jh
> >       try_to_free_buffers()
> >         drop_buffers()
> >           if (buffer_busy(bh))
> >             goto failed;
> >           <<--- returns EIO due to b_count
>
> Right.  But there's still a race in there.
>
> invalidate_complete_page() may still fail for these buffers.  The only
> blocking it does is lock_buffer().  So if kjournald does:
>
> 	get_bh(bh);
> 	wait_on_buffer(bh);
> 	<- window
> 	put_bh(bh);
>
> ext3_invalidatepage() may hit that unlocked, pinned buffer and fail to
> release the page.

Yep. That's exactly the window.

BTW, the reason we hit this so easily on SLES9 SP1, but not on mainline,
is that mainline invalidates only the range of pages the DIO involves,
while SLES9 SP1 invalidates the whole file. (SLES9 and RHEL4 ignore the
error - so we don't see the failure there.)

> In that case, truncate_complete_page() will convert the page into an
> anonymous page (if it's mmapped).  If it's just a pagecache page then I
> guess it'll be converted into a zero-ref page on the page LRU.
>
> It's not possible for ext3_invalidatepage() to block until a buffer comes
> unpinned.  We'd have to add additional stuff for that (bring back
> wake_up_buffer(), for example).

Thanks,
Badari
* Re: kjournald() with DIO
  2005-09-15 20:35       ` Andrew Morton
  2005-09-15 20:49         ` Badari Pulavarty
@ 2005-09-15 21:03         ` Andrea Arcangeli
  2005-09-15 21:26           ` Andrew Morton
  1 sibling, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2005-09-15 21:03 UTC (permalink / raw)
To: Andrew Morton; +Cc: pbadari, cmm, linux-fsdevel, sct

On Thu, Sep 15, 2005 at 01:35:00PM -0700, Andrew Morton wrote:
> I'll need reminding - why was the buffer unfreeable?  Because kjournald had
> a ref on it?  Whereabouts is that happening?

Yes, kjournald had a ref (b_count) on it. It's happening when running
iozone in some way (Badari knows how to reproduce).

I was now wondering why it's a problem to destroy dirty buffers in
invalidate_inode_pages2? (ok, ignoring the detail that PageDirty may
return true inside invalidate_complete_page)

Clearly we can't do that inside releasepage, but invalidate_inode_pages2
is all about destroying the dirty and uptodate information, since we
just finished writing with direct-io (or at least the dirty info isn't
that important anymore, we're still in our write context for those pages
that we're invalidating [even better with the range]).

If we could just have a blocking version of try_to_release_page I guess
that would be better, but even the ->invalidatepage behaviour may not be
that bad for invalidate_inode_pages2. Or am I missing something?
* Re: kjournald() with DIO
  2005-09-15 21:03         ` Andrea Arcangeli
@ 2005-09-15 21:26           ` Andrew Morton
  2005-09-15 22:04             ` Andrea Arcangeli
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2005-09-15 21:26 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: pbadari, cmm, linux-fsdevel, sct

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Thu, Sep 15, 2005 at 01:35:00PM -0700, Andrew Morton wrote:
> > I'll need reminding - why was the buffer unfreeable?  Because kjournald had
> > a ref on it?  Whereabouts is that happening?
>
> Yes, kjournald had a ref (b_count) on it. It's happening when running
> iozone in some way (Badari knows how to reproduce).
>
> I was now wondering why it's a problem to destroy dirty buffers in
> invalidate_inode_pages2? (ok, ignoring the detail that PageDirty may
> return true inside invalidate_complete_page)

Bear in mind that generic_file_direct_IO() has just fsynced that section of
the file so there shouldn't be any dirty buffers unless something funny is
happening.  Like, direct-io fell back to buffered IO.  In that case perhaps
we should run sync_page_range() before trying the
invalidate_inode_pages2_range()?

> Clearly we can't do that inside releasepage, but invalidate_inode_pages2
> is all about destroying the dirty and uptodate information, since we
> just finished writing with direct-io (or at least the dirty info isn't
> that important anymore, we're still in our write context for those pages
> that we're invalidating [even better with the range]).

Those dirty pages/buffers could be there because
__generic_file_aio_write_nolock() fell back to buffered I/O.

If we run sync_page_range() before invalidate_inode_pages2_range() and we
*still* find dirty pages/buffers then yes, it might be right to simply nuke
those dirty bits.
* Re: kjournald() with DIO
  2005-09-15 21:26           ` Andrew Morton
@ 2005-09-15 22:04             ` Andrea Arcangeli
  2005-09-15 23:28               ` Mingming Cao
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2005-09-15 22:04 UTC (permalink / raw)
To: Andrew Morton; +Cc: pbadari, cmm, linux-fsdevel, sct

On Thu, Sep 15, 2005 at 02:26:42PM -0700, Andrew Morton wrote:
> Bear in mind that generic_file_direct_IO() has just fsynced that section of
> the file so there shouldn't be any dirty buffers unless something funny is

That was clear yes, or we cannot start writing to the platter if there's
outstanding dirty cache.

> happening.  Like, direct-io fell back to buffered IO.  In that case perhaps
> we should run sync_page_range() before trying the
> invalidate_inode_pages2_range()?

The buffered-IO fallback would happen _after_ (not before) the
invalidate regardless, so it doesn't matter since the pages would be
marked dirty again later (still inside write context, so we're safe).

The fallback actually happens outside generic_file_direct_IO, so it
doesn't matter what we do inside, since we'll rewrite and redirty all
pagecache later (if we fail the direct-IO).

> Those dirty pages/buffers could be there because
> __generic_file_aio_write_nolock() fell back to buffered I/O.
>
> If we run sync_page_range() before invalidate_inode_pages2_range() and we
> *still* find dirty pages/buffers then yes, it might be right to simply nuke
> those dirty bits.

See above, it doesn't seem necessary.
* Re: kjournald() with DIO
  2005-09-15 22:04             ` Andrea Arcangeli
@ 2005-09-15 23:28               ` Mingming Cao
  2005-09-16  0:18                 ` Andrea Arcangeli
  0 siblings, 1 reply; 29+ messages in thread
From: Mingming Cao @ 2005-09-15 23:28 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, pbadari, linux-fsdevel, sct

On Fri, 2005-09-16 at 00:04 +0200, Andrea Arcangeli wrote:
> On Thu, Sep 15, 2005 at 02:26:42PM -0700, Andrew Morton wrote:
> > Bear in mind that generic_file_direct_IO() has just fsynced that section of
> > the file so there shouldn't be any dirty buffers unless something funny is
>
> That was clear yes, or we cannot start writing to the platter if there's
> outstanding dirty cache.
>
> > happening.  Like, direct-io fell back to buffered IO.  In that case perhaps
> > we should run sync_page_range() before trying the
> > invalidate_inode_pages2_range()?

Aren't direct-io and buffered IO serialized by i_sem? If so, the buffers
should not be dirtied by another concurrent DIO, DIO-fallen-back-to-
buffered-IO, or buffered-IO. Did I miss something?

> The buffered-IO fallback would happen _after_ (not before) the
> invalidate regardless, so it doesn't matter since the pages would be
> marked dirty again later (still inside write context, so we're safe).
>
> The fallback actually happens outside generic_file_direct_IO, so it
> doesn't matter what we do inside, since we'll rewrite and redirty all
> pagecache later (if we fail the direct-IO).
>
> > Those dirty pages/buffers could be there because
> > __generic_file_aio_write_nolock() fell back to buffered I/O.
> >
> > If we run sync_page_range() before invalidate_inode_pages2_range() and we
> > *still* find dirty pages/buffers then yes, it might be right to simply nuke
> > those dirty bits.
>
> See above, it doesn't seem necessary.
* Re: kjournald() with DIO
  2005-09-15 23:28               ` Mingming Cao
@ 2005-09-16  0:18                 ` Andrea Arcangeli
  0 siblings, 0 replies; 29+ messages in thread
From: Andrea Arcangeli @ 2005-09-16 0:18 UTC (permalink / raw)
To: Mingming Cao; +Cc: Andrew Morton, pbadari, linux-fsdevel, sct

On Thu, Sep 15, 2005 at 04:28:44PM -0700, Mingming Cao wrote:
> Isn't direct-io and buffered IO being serialized by i_sem? If so, the

Page faults aren't serialized by i_sem, so if a page fault happens the
vm may unmap the pte under memory pressure, marking the page dirty. But
we have the guarantee that such a dirty bit in the pte was set after we
were inside write(), so we can drop it (or we can as well leave it);
it's undefined.

But the point is that the buffered-io fallback happens after the
invalidate, so it shouldn't matter if we drop dirty buffers or dirty
pages in the invalidate; they should be re-generated.

So overall I believe it's semantically equivalent whether we drop the
dirty bits or not. Clearly not dropping them is a bit less intrusive,
but current code seems ready to drop them by calling ->invalidatepage
(destructive) instead of creating a blocking version of ->releasepage
(non destructive).
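[Editor's note] Andrea's equivalence argument - that the buffered-IO fallback runs after the invalidate and re-dirties everything, so destroying the dirty bits is harmless - can be sketched as a tiny state model. This is purely illustrative; the struct and function names are invented for the sketch.

```c
#include <assert.h>

/* Hypothetical model of page state across invalidate + fallback. */
struct page_sim {
	int dirty;
	int uptodate;
};

/* Models the invalidate step: either ->invalidatepage() semantics
 * (destroys the dirty bit) or a non-destructive release. */
static void invalidate_sim(struct page_sim *p, int destroy_dirty)
{
	p->uptodate = 0;
	if (destroy_dirty)
		p->dirty = 0;
}

/* Models the buffered-IO fallback after a failed direct-IO write:
 * it rewrites the data and marks the page dirty again. */
static void buffered_fallback_sim(struct page_sim *p)
{
	p->uptodate = 1;
	p->dirty = 1;
}
```

Because the fallback runs last, both variants of the invalidate converge to the same final state, which is the semantic-equivalence claim above.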
* Re: kjournald() with DIO
  2005-09-13  0:29     ` Andrew Morton
  2005-09-13 16:52       ` Badari Pulavarty
@ 2005-09-13 17:53       ` Mingming Cao
  2005-09-16 13:42       ` Stephen C. Tweedie
  2 siblings, 0 replies; 29+ messages in thread
From: Mingming Cao @ 2005-09-13 17:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Badari Pulavarty, linux-fsdevel, sct

On Mon, 2005-09-12 at 17:29 -0700, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > Didn't help. How can this help ?
>
> Oh well.  Different bug.  See http://bugzilla.kernel.org/show_bug.cgi?id=4964
>
> > DIO code is kicking back to buffered mode, since its trying to
> > fill-in the holes. The fix you suggested might have helped, if
> > IO is beyond file size or file size not ending on page boundary.
> > What am I missing ?
>
> Have you spoken with Mingming about this?  She's chasing a similar race.
> She's currently working on getting the ext3-debug patch working so we can
> find out how those buffers (which apparently aren't even attached to the
> journal) came to have an elevated refcount.

Hi Andrew,

Badari was looking at the race with me, and he came up with a hack patch
to "remember" the last two callers of get_bh(); once we find a buffer
whose reference count is >1 but whose journal head is NULL, it starts
tracing that buffer head (dumping the last two callers of get_bh()).

The analysis shows that there is a race between
journal_try_to_free_buffers() and kjournald. It is possible that while
kjournald is waiting for a buffer to be unlocked, it has released the
journal->j_list_lock (commit.c, line 398) but is still holding a
reference count on that buffer across wait_on_buffer(). This allows
__journal_try_to_free_buffer() to proceed (it was waiting for the
j_list_lock and will eventually unlink the journal head from the
buffer). journal_try_to_free_buffers() then calls try_to_free_buffers()
since the journal head is NULL; at that point, since kjournald is still
holding a reference count on the buffer, try_to_free_buffers()->
drop_buffers() fails because the buffer is busy.

This race happened on SLES9 SP1/SP2, and also on 2.6.11, but could not
be reproduced on mainline 2.6.13. SLES9 SP1/SP2 and 2.6.11 all have the
patch that allows invalidate_inode_page() to return -EIO to deal with
the DIO-with-dirty-mapped-write case. But 2.6.13 invalidates using a
range; that is probably why mainline doesn't show the problem (the
chance of calling invalidate_inode_pages2_range() on a busy page is
reduced when the range is passed as a parameter).

> Apparently the debug patch is currently oopsing.  We need to fix that.

Sorry about the lack of response on this; it is certainly a very
powerful and useful tool. Last time I tried it, the kernel would not
boot. I am working on it, slowly :)

Mingming
* Re: kjournald() with DIO
  2005-09-13  0:29     ` Andrew Morton
  2005-09-13 16:52       ` Badari Pulavarty
  2005-09-13 17:53       ` Mingming Cao
@ 2005-09-16 13:42       ` Stephen C. Tweedie
  2005-09-21 18:22         ` Mingming Cao
  2 siblings, 1 reply; 29+ messages in thread
From: Stephen C. Tweedie @ 2005-09-16 13:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Badari Pulavarty, linux-fsdevel, Mingming Cao, Stephen Tweedie

Hi,

On Tue, 2005-09-13 at 01:29, Andrew Morton wrote:
> Have you spoken with Mingming about this?  She's chasing a similar race.
> She's currently working on getting the ext3-debug patch working so we can
> find out how those buffers (which apparently aren't even attached to the
> journal) came to have an elevated refcount.

Here's the latest version of the ext3 buffer tracing which I've been
using. It's basically just the same as the one from Andrew of about
2.6.5 vintage, with a few changes in places to deal with new bits of
code that need temporary buffer_heads initialised properly for tracing.

It has survived a couple of hours of fsx and fsstress in parallel, but I
haven't run any O_DIRECT tests with it on the current git kernel, so
there may still be problems in that area.

Cheers,
 Stephen

Buffer_head tracing for ext3/jbd

---
 drivers/ide/ide-io.c         |   22 ++++
 fs/Kconfig                   |    4 +
 fs/Makefile                  |    2 
 fs/buffer.c                  |  100 ++++++++++++++++
 fs/direct-io.c               |    1 
 fs/ext3/inode.c              |    7 +
 fs/ext3/namei.c              |   40 ++++++-
 fs/ext3/super.c              |   81 +++++++++++++
 fs/ext3/xattr.c              |    4 +
 fs/jbd-kernel.c              |  255 ++++++++++++++++++++++++++++++++++++++++++
 fs/jbd/commit.c              |   19 ++-
 fs/jbd/transaction.c         |   46 +++++--
 fs/mpage.c                   |    2 
 include/linux/buffer-trace.h |   85 ++++++++++++++
 include/linux/buffer_head.h  |    5 +
 include/linux/jbd.h          |   26 ++--
 16 files changed, 654 insertions(+), 45 deletions(-)

diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -890,6 +890,28 @@ static ide_startstop_t start_request (id
 	ide_startstop_t startstop;
 	sector_t block;
 
+#ifdef CONFIG_JBD_DEBUG
+	/*
+	 * Silently stop writing to this disk to simulate a crash.
+	 */
+	extern struct block_device *journal_no_write[2];
+	int i;
+
+	if (!(rq->flags & REQ_RW))
+		goto its_a_read;
+
+	for (i = 0; i < 2; i++) {
+		struct block_device *bdev = journal_no_write[i];
+		if (bdev && bdev_get_queue(bdev) == rq->q) {
+			printk("%s: write suppressed\n", __FUNCTION__);
+			ide_end_request(drive, 1, rq->nr_sectors);
+			return ide_stopped;
+		}
+	}
+its_a_read:
+	;
+#endif
+
 	BUG_ON(!(rq->flags & REQ_STARTED));
 
 #ifdef DEBUG
diff --git a/fs/Kconfig b/fs/Kconfig
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -172,6 +172,10 @@ config JBD_DEBUG
 	  generated.  To turn debugging off again, do
 	  "echo 0 > /proc/sys/fs/jbd-debug".
+config BUFFER_DEBUG + bool "buffer-layer tracing" + depends on JBD + config FS_MBCACHE # Meta block cache for Extended Attributes (ext2/ext3) tristate diff --git a/fs/Makefile b/fs/Makefile --- a/fs/Makefile +++ b/fs/Makefile @@ -23,6 +23,8 @@ obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout obj-$(CONFIG_BINFMT_EM86) += binfmt_em86.o obj-$(CONFIG_BINFMT_MISC) += binfmt_misc.o +obj-$(CONFIG_BUFFER_DEBUG) += jbd-kernel.o + # binfmt_script is always there obj-y += binfmt_script.o diff --git a/fs/buffer.c b/fs/buffer.c --- a/fs/buffer.c +++ b/fs/buffer.c @@ -35,7 +35,9 @@ #include <linux/hash.h> #include <linux/suspend.h> #include <linux/buffer_head.h> +#include <linux/buffer-trace.h> #include <linux/bio.h> +#include <linux/jbd.h> #include <linux/notifier.h> #include <linux/cpu.h> #include <linux/bitops.h> @@ -52,6 +54,7 @@ init_buffer(struct buffer_head *bh, bh_e { bh->b_end_io = handler; bh->b_private = private; + buffer_trace_init(&bh->b_history); } static int sync_buffer(void *word) @@ -60,6 +63,7 @@ static int sync_buffer(void *word) struct buffer_head *bh = container_of(word, struct buffer_head, b_state); + BUFFER_TRACE(bh, "enter"); smp_mb(); bd = bh->b_bdev; if (bd) @@ -104,6 +108,7 @@ static void buffer_io_error(struct buffe { char b[BDEVNAME_SIZE]; + BUFFER_TRACE(bh, "I/O error"); printk(KERN_ERR "Buffer I/O error on device %s, logical block %Lu\n", bdevname(bh->b_bdev, b), (unsigned long long)bh->b_blocknr); @@ -115,10 +120,12 @@ static void buffer_io_error(struct buffe */ void end_buffer_read_sync(struct buffer_head *bh, int uptodate) { + BUFFER_TRACE(bh, "enter"); if (uptodate) { set_buffer_uptodate(bh); } else { /* This happens, due to failed READA attempts. 
*/ + BUFFER_TRACE(bh, "clear uptodate"); clear_buffer_uptodate(bh); } unlock_buffer(bh); @@ -130,8 +137,10 @@ void end_buffer_write_sync(struct buffer char b[BDEVNAME_SIZE]; if (uptodate) { + BUFFER_TRACE(bh, "uptodate"); set_buffer_uptodate(bh); } else { + BUFFER_TRACE(bh, "Not uptodate"); if (!buffer_eopnotsupp(bh) && printk_ratelimit()) { buffer_io_error(bh); printk(KERN_WARNING "lost page write due to " @@ -139,6 +148,7 @@ void end_buffer_write_sync(struct buffer bdevname(bh->b_bdev, b)); } set_buffer_write_io_error(bh); + BUFFER_TRACE(bh, "clear uptodate"); clear_buffer_uptodate(bh); } unlock_buffer(bh); @@ -520,12 +530,15 @@ static void end_buffer_async_read(struct struct page *page; int page_uptodate = 1; + BUFFER_TRACE(bh, "enter"); + BUG_ON(!buffer_async_read(bh)); page = bh->b_page; if (uptodate) { set_buffer_uptodate(bh); } else { + BUFFER_TRACE(bh, "clear uptodate"); clear_buffer_uptodate(bh); if (printk_ratelimit()) buffer_io_error(bh); @@ -582,6 +595,8 @@ void end_buffer_async_write(struct buffe struct buffer_head *tmp; struct page *page; + BUFFER_TRACE(bh, "enter"); + BUG_ON(!buffer_async_write(bh)); page = bh->b_page; @@ -595,6 +610,7 @@ void end_buffer_async_write(struct buffe bdevname(bh->b_bdev, b)); } set_bit(AS_EIO, &page->mapping->flags); + BUFFER_TRACE(bh, "clear uptodate"); clear_buffer_uptodate(bh); SetPageError(page); } @@ -802,6 +818,7 @@ void mark_buffer_dirty_inode(struct buff struct address_space *mapping = inode->i_mapping; struct address_space *buffer_mapping = bh->b_page->mapping; + BUFFER_TRACE(bh, "mark dirty"); mark_buffer_dirty(bh); if (!mapping->assoc_mapping) { mapping->assoc_mapping = buffer_mapping; @@ -853,7 +870,25 @@ int __set_page_dirty_buffers(struct page struct buffer_head *bh = head; do { - set_buffer_dirty(bh); + if (buffer_uptodate(bh)) { + BUFFER_TRACE(bh, "set dirty"); + /* The following test can only succeed + * if jbd is built in, not if it's a + * module... 
*/ +#ifdef CONFIG_JBD + if (buffer_jbd(bh)) { + struct journal_head *jh; + + jh = journal_add_journal_head(bh); + WARN_ON(jh->b_jlist == BJ_Metadata); + journal_put_journal_head(jh); + } +#endif + set_buffer_dirty(bh); + } else { + printk("%s: !uptodate buffer\n", __FUNCTION__); + buffer_assertion_failure(bh); + } bh = bh->b_this_page; } while (bh != head); } @@ -1043,6 +1078,7 @@ no_grow: do { bh = head; head = head->b_this_page; + bh->b_this_page = NULL; free_buffer_head(bh); } while (head); } @@ -1258,6 +1294,7 @@ __getblk_slow(struct block_device *bdev, */ void fastcall mark_buffer_dirty(struct buffer_head *bh) { + BUFFER_TRACE(bh, "entry"); if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh)) __set_page_dirty_nobuffers(bh->b_page); } @@ -1271,12 +1308,14 @@ void fastcall mark_buffer_dirty(struct b */ void __brelse(struct buffer_head * buf) { + BUFFER_TRACE(buf, "entry"); if (atomic_read(&buf->b_count)) { put_bh(buf); return; } printk(KERN_ERR "VFS: brelse: Trying to free free buffer\n"); WARN_ON(1); + buffer_assertion_failure(buf); } /* @@ -1285,6 +1324,7 @@ void __brelse(struct buffer_head * buf) */ void __bforget(struct buffer_head *bh) { + BUFFER_TRACE(bh, "enter"); clear_buffer_dirty(bh); if (!list_empty(&bh->b_assoc_buffers)) { struct address_space *buffer_mapping = bh->b_page->mapping; @@ -1546,6 +1586,7 @@ EXPORT_SYMBOL(set_bh_page); */ static inline void discard_buffer(struct buffer_head * bh) { + BUFFER_TRACE(bh, "enter"); lock_buffer(bh); clear_buffer_dirty(bh); bh->b_bdev = NULL; @@ -1696,6 +1737,7 @@ void unmap_underlying_metadata(struct bl old_bh = __find_get_block_slow(bdev, block, 0); if (old_bh) { + BUFFER_TRACE(old_bh, "alias: unmap it"); clear_buffer_dirty(old_bh); wait_on_buffer(old_bh); clear_buffer_req(old_bh); @@ -1775,6 +1817,7 @@ static int __block_write_full_page(struc /* * The buffer was zeroed by block_write_full_page() */ + BUFFER_TRACE(bh, "outside last_block"); clear_buffer_dirty(bh); set_buffer_uptodate(bh); } else if 
(!buffer_mapped(bh) && buffer_dirty(bh)) { @@ -1782,6 +1825,8 @@ static int __block_write_full_page(struc if (err) goto recover; if (buffer_new(bh)) { + BUFFER_TRACE(bh, "new: call " + "unmap_underlying_metadata"); /* blockdev mappings never come here */ clear_buffer_new(bh); unmap_underlying_metadata(bh->b_bdev, @@ -1809,6 +1854,11 @@ static int __block_write_full_page(struc continue; } if (test_clear_buffer_dirty(bh)) { + if (!buffer_uptodate(bh)) { + printk("%s: dirty non-uptodate buffer\n", + __FUNCTION__); + buffer_assertion_failure(bh); + } mark_buffer_async_write(bh); } else { unlock_buffer(bh); @@ -1870,6 +1920,7 @@ recover: /* Recovery: lock and submit the mapped buffers */ do { if (buffer_mapped(bh) && buffer_dirty(bh)) { + BUFFER_TRACE(bh, "lock it"); lock_buffer(bh); mark_buffer_async_write(bh); } else { @@ -1935,9 +1986,12 @@ static int __block_prepare_write(struct if (err) break; if (buffer_new(bh)) { + BUFFER_TRACE(bh, "new: call " + "unmap_underlying_metadata"); unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr); if (PageUptodate(page)) { + BUFFER_TRACE(bh, "setting uptodate"); set_buffer_uptodate(bh); continue; } @@ -1958,12 +2012,14 @@ static int __block_prepare_write(struct } } if (PageUptodate(page)) { + BUFFER_TRACE(bh, "setting uptodate"); if (!buffer_uptodate(bh)) set_buffer_uptodate(bh); continue; } if (!buffer_uptodate(bh) && !buffer_delay(bh) && (block_start < from || block_end > to)) { + BUFFER_TRACE(bh, "reading"); ll_rw_block(READ, 1, &bh); *wait_bh++=bh; } @@ -2006,6 +2062,7 @@ static int __block_prepare_write(struct memset(kaddr+block_start, 0, bh->b_size); kunmap_atomic(kaddr, KM_USER0); set_buffer_uptodate(bh); + BUFFER_TRACE(bh, "mark dirty"); mark_buffer_dirty(bh); } next_bh: @@ -2034,6 +2091,7 @@ static int __block_commit_write(struct i partial = 1; } else { set_buffer_uptodate(bh); + BUFFER_TRACE(bh, "mark dirty"); mark_buffer_dirty(bh); } } @@ -2328,6 +2386,7 @@ static void end_buffer_read_nobh(struct set_buffer_uptodate(bh); 
} else { /* This happens, due to failed READA attempts. */ + BUFFER_TRACE(bh, "clear uptodate"); clear_buffer_uptodate(bh); } unlock_buffer(bh); @@ -2419,6 +2478,7 @@ int nobh_prepare_write(struct page *page bh->b_data = (char *)(long)block_start; bh->b_bdev = map_bh.b_bdev; bh->b_private = NULL; + buffer_trace_init(&bh->b_history); read_bh[nr_reads++] = bh; } } @@ -2662,6 +2722,7 @@ int block_truncate_page(struct address_s flush_dcache_page(page); kunmap_atomic(kaddr, KM_USER0); + BUFFER_TRACE(bh, "mark dirty"); mark_buffer_dirty(bh); err = 0; @@ -2696,11 +2757,20 @@ int block_write_full_page(struct page *p * they may have been added in ext3_writepage(). Make them * freeable here, so the page does not leak. */ + if (page_has_buffers(page)) { + struct buffer_head *b = page_buffers(page); + BUFFER_TRACE(b, "EIO"); + } block_invalidatepage(page, 0); unlock_page(page); return 0; /* don't care */ } + if (page_has_buffers(page)) { + struct buffer_head *b = page_buffers(page); + BUFFER_TRACE(b, "partial"); + } + /* * The page straddles i_size. It must be zeroed out on each and every * writepage invokation because it may be mmapped. 
"A file is mapped @@ -2720,8 +2790,10 @@ sector_t generic_block_bmap(struct addre { struct buffer_head tmp; struct inode *inode = mapping->host; + tmp.b_state = 0; tmp.b_blocknr = 0; + tmp.b_page = NULL; get_block(inode, block, &tmp, 0); return tmp.b_blocknr; } @@ -2748,9 +2820,11 @@ int submit_bh(int rw, struct buffer_head struct bio *bio; int ret = 0; - BUG_ON(!buffer_locked(bh)); - BUG_ON(!buffer_mapped(bh)); - BUG_ON(!bh->b_end_io); + BUFFER_TRACE(bh, "enter"); + + J_ASSERT_BH(bh, buffer_locked(bh)); + J_ASSERT_BH(bh, buffer_mapped(bh)); + J_ASSERT_BH(bh, bh->b_end_io != NULL); if (buffer_ordered(bh) && (rw == WRITE)) rw = WRITE_BARRIER; @@ -2762,6 +2836,11 @@ int submit_bh(int rw, struct buffer_head if (test_set_buffer_req(bh) && (rw == WRITE || rw == WRITE_BARRIER)) clear_buffer_write_io_error(bh); + if (rw == WRITE) + BUFFER_TRACE(bh, "write"); + else + BUFFER_TRACE(bh, "read"); + /* * from here on down, it's all bio -- do the initial mapping, * submit_bio -> generic_make_request may further map this bio around @@ -2966,6 +3045,7 @@ out: do { struct buffer_head *next = bh->b_this_page; + bh->b_this_page = NULL; free_buffer_head(bh); bh = next; } while (bh != buffers_to_free); @@ -3052,6 +3132,7 @@ struct buffer_head *alloc_buffer_head(un get_cpu_var(bh_accounting).nr++; recalc_bh_state(); put_cpu_var(bh_accounting); + buffer_trace_init(&ret->b_history); } return ret; } @@ -3059,7 +3140,16 @@ EXPORT_SYMBOL(alloc_buffer_head); void free_buffer_head(struct buffer_head *bh) { - BUG_ON(!list_empty(&bh->b_assoc_buffers)); + J_ASSERT_BH(bh, bh->b_this_page == NULL); + J_ASSERT_BH(bh, list_empty(&bh->b_assoc_buffers)); +#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE) + if (buffer_jbd(bh)) { + J_ASSERT_BH(bh, bh2jh(bh)->b_transaction == 0); + J_ASSERT_BH(bh, bh2jh(bh)->b_next_transaction == 0); + J_ASSERT_BH(bh, bh2jh(bh)->b_frozen_data == 0); + J_ASSERT_BH(bh, bh2jh(bh)->b_committed_data == 0); + } +#endif kmem_cache_free(bh_cachep, bh); 
get_cpu_var(bh_accounting).nr--; recalc_bh_state(); diff --git a/fs/direct-io.c b/fs/direct-io.c --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -519,6 +519,7 @@ static int get_more_blocks(struct dio *d if (ret == 0) { map_bh->b_state = 0; map_bh->b_size = 0; + map_bh->b_page = 0; BUG_ON(dio->block_in_file >= dio->final_block_in_request); fs_startblk = dio->block_in_file >> dio->blkfactor; dio_count = dio->final_block_in_request - dio->block_in_file; diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -859,6 +859,7 @@ struct buffer_head *ext3_getblk(handle_t dummy.b_state = 0; dummy.b_blocknr = -1000; + dummy.b_page = NULL; buffer_trace_init(&dummy.b_history); *errp = ext3_get_block_handle(handle, inode, block, &dummy, create, 1); if (!*errp && buffer_mapped(&dummy)) { @@ -1286,6 +1287,11 @@ static int ext3_ordered_writepage(struct walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, NULL, bget_one); + { + struct buffer_head *b = page_buffers(page); + BUFFER_TRACE(b, "call block_write_full_page"); + } + ret = block_write_full_page(page, ext3_get_block, wbc); /* @@ -1685,6 +1691,7 @@ static int ext3_block_truncate_page(hand } else { if (ext3_should_order_data(inode)) err = ext3_journal_dirty_data(handle, bh); + BUFFER_TRACE(bh, "mark dirty"); mark_buffer_dirty(bh); } diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c --- a/fs/ext3/namei.c +++ b/fs/ext3/namei.c @@ -60,6 +60,7 @@ static struct buffer_head *ext3_append(h EXT3_I(inode)->i_disksize = inode->i_size; ext3_journal_get_write_access(handle,bh); } + BUFFER_TRACE(bh, "exit"); return bh; } @@ -340,6 +341,7 @@ dx_probe(struct dentry *dentry, struct i dir = dentry->d_parent->d_inode; if (!(bh = ext3_bread (NULL,dir, 0, 0, err))) goto fail; + BUFFER_TRACE(bh, "bread it"); root = (struct dx_root *) bh->b_data; if (root->info.hash_version != DX_HASH_TEA && root->info.hash_version != DX_HASH_HALF_MD4 && @@ -414,18 +416,21 @@ dx_probe(struct dentry *dentry, struct i at = p - 1; 
dxtrace(printk(" %x->%u\n", at == entries? 0: dx_get_hash(at), dx_get_block(at))); + BUFFER_TRACE(bh, "add to frame"); frame->bh = bh; frame->entries = entries; frame->at = at; if (!indirect--) return frame; if (!(bh = ext3_bread (NULL,dir, dx_get_block(at), 0, err))) goto fail2; + BUFFER_TRACE(bh, "read it"); at = entries = ((struct dx_node *) bh->b_data)->entries; assert (dx_get_limit(entries) == dx_node_limit (dir)); frame++; } fail2: while (frame >= frame_in) { + BUFFER_TRACE(bh, "brelse it"); brelse(frame->bh); frame--; } @@ -438,8 +443,11 @@ static void dx_release (struct dx_frame if (frames[0].bh == NULL) return; - if (((struct dx_root *) frames[0].bh->b_data)->info.indirect_levels) + if (((struct dx_root *) frames[0].bh->b_data)->info.indirect_levels) { + BUFFER_TRACE(frames[1].bh, "brelse it"); brelse(frames[1].bh); + } + BUFFER_TRACE(frames[0].bh, "brelse it"); brelse(frames[0].bh); } @@ -963,6 +971,8 @@ static struct buffer_head * ext3_dx_find dx_release (frames); return bh; } + BUFFER_TRACE(bh, "brelse it"); + BUFFER_TRACE(bh, "brelse it"); brelse (bh); /* Check to see if we should continue to search */ retval = ext3_htree_next_block(dir, hash, frame, @@ -997,6 +1007,7 @@ static struct dentry *ext3_lookup(struct inode = NULL; if (bh) { unsigned long ino = le32_to_cpu(de->inode); + BUFFER_TRACE(bh, "brelse it"); brelse (bh); inode = iget(dir->i_sb, ino); @@ -1118,16 +1129,20 @@ static struct ext3_dir_entry_2 *do_split bh2 = ext3_append (handle, dir, &newblock, error); if (!(bh2)) { + BUFFER_TRACE(*bh, "brelse it"); brelse(*bh); *bh = NULL; goto errout; } + BUFFER_TRACE(bh2, "using it"); BUFFER_TRACE(*bh, "get_write_access"); err = ext3_journal_get_write_access(handle, *bh); if (err) { journal_error: + BUFFER_TRACE(*bh, "brelse"); brelse(*bh); + BUFFER_TRACE(bh2, "brelse"); brelse(bh2); *bh = NULL; ext3_std_error(dir->i_sb, err); @@ -1163,6 +1178,8 @@ static struct ext3_dir_entry_2 *do_split /* Which block gets the new entry? 
*/ if (hinfo->hash >= hash2) { + BUFFER_TRACE(*bh, "swapping"); + BUFFER_TRACE(bh2, "swapping"); swap(*bh, bh2); de = de2; } @@ -1171,8 +1188,11 @@ static struct ext3_dir_entry_2 *do_split if (err) goto journal_error; err = ext3_journal_dirty_metadata (handle, frame->bh); - if (err) + if (err) { + BUFFER_TRACE(bh2, "error"); goto journal_error; + } + BUFFER_TRACE(*bh, "brelse"); brelse (bh2); dxtrace(dx_show_index ("frame", frame->entries)); errout: @@ -1215,6 +1235,7 @@ static int add_dirent_to_buf(handle_t *h return -EIO; } if (ext3_match (namelen, name, de)) { + BUFFER_TRACE(bh, "brelse"); brelse (bh); return -EEXIST; } @@ -1272,6 +1293,7 @@ static int add_dirent_to_buf(handle_t *h err = ext3_journal_dirty_metadata(handle, bh); if (err) ext3_std_error(dir->i_sb, err); + BUFFER_TRACE(bh, "brelse"); brelse(bh); return 0; } @@ -1302,6 +1324,7 @@ static int make_indexed_dir(handle_t *ha blocksize = dir->i_sb->s_blocksize; dxtrace(printk("Creating index\n")); + BUFFER_TRACE(bh, "get write access"); retval = ext3_journal_get_write_access(handle, bh); if (retval) { ext3_std_error(dir->i_sb, retval); @@ -1346,6 +1369,7 @@ static int make_indexed_dir(handle_t *ha frame = frames; frame->entries = entries; frame->at = entries; + BUFFER_TRACE(bh, "adding to frame"); frame->bh = bh; bh = bh2; de = do_split(handle,dir, &bh, frame, &hinfo, &retval); @@ -1402,6 +1426,7 @@ static int ext3_add_entry (handle_t *han bh = ext3_bread(handle, dir, block, 0, &retval); if(!bh) return retval; + BUFFER_TRACE(bh, "read it"); retval = add_dirent_to_buf(handle, dentry, inode, NULL, bh); if (retval != -ENOSPC) return retval; @@ -1416,6 +1441,7 @@ static int ext3_add_entry (handle_t *han bh = ext3_append(handle, dir, &block, &retval); if (!bh) return retval; + BUFFER_TRACE(bh, "append returned it"); de = (struct ext3_dir_entry_2 *) bh->b_data; de->inode = 0; de->rec_len = cpu_to_le16(rlen = blocksize); @@ -1481,6 +1507,7 @@ static int ext3_dx_add_entry(handle_t *h bh2 = ext3_append (handle, 
dir, &newblock, &err); if (!(bh2)) goto cleanup; + BUFFER_TRACE(bh2, "append returned it"); node2 = (struct dx_node *)(bh2->b_data); entries2 = node2->entries; node2->fake.rec_len = cpu_to_le16(sb->s_blocksize); @@ -1510,6 +1537,8 @@ static int ext3_dx_add_entry(handle_t *h if (at - entries >= icount1) { frame->at = at = at - entries - icount1 + entries2; frame->entries = entries = entries2; + BUFFER_TRACE(frame->bh, "swap"); + BUFFER_TRACE(bh2, "swap"); swap(frame->bh, bh2); } dx_insert_block (frames + 0, hash2, newblock); @@ -1519,6 +1548,7 @@ static int ext3_dx_add_entry(handle_t *h err = ext3_journal_dirty_metadata(handle, bh2); if (err) goto journal_error; + BUFFER_TRACE(bh2, "brelse"); brelse (bh2); } else { dxtrace(printk("Creating second level index...\n")); @@ -1535,6 +1565,8 @@ static int ext3_dx_add_entry(handle_t *h frame = frames + 1; frame->at = at = at - entries + entries2; frame->entries = entries = entries2; + BUFFER_TRACE(frame->bh, "overwrite"); + BUFFER_TRACE(bh2, "add to frame"); frame->bh = bh2; err = ext3_journal_get_write_access(handle, frame->bh); @@ -1553,8 +1585,10 @@ static int ext3_dx_add_entry(handle_t *h journal_error: ext3_std_error(dir->i_sb, err); cleanup: - if (bh) + if (bh) { + BUFFER_TRACE(bh, "brelse"); brelse(bh); + } dx_release(frames); return err; } diff --git a/fs/ext3/super.c b/fs/ext3/super.c --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -40,6 +40,10 @@ #include "xattr.h" #include "acl.h" +#ifdef CONFIG_JBD_DEBUG +static int ext3_ro_after; /* Make fs read-only after this many jiffies */ +#endif + static int ext3_load_journal(struct super_block *, struct ext3_super_block *); static int ext3_create_journal(struct super_block *, struct ext3_super_block *, int); @@ -59,6 +63,65 @@ static void ext3_unlockfs(struct super_b static void ext3_write_super (struct super_block * sb); static void ext3_write_super_lockfs(struct super_block *sb); +#ifdef CONFIG_JBD_DEBUG +struct block_device *journal_no_write[2]; + +/* + * Debug code 
for turning filesystems "read-only" after a specified
+ * amount of time.  This is for crash/recovery testing.
+ */
+
+static void make_rdonly(struct block_device *bdev,
+			struct block_device **no_write)
+{
+	char b[BDEVNAME_SIZE];
+
+	if (bdev) {
+		printk(KERN_WARNING "Turning device %s read-only\n",
+			bdevname(bdev, b));
+		*no_write = bdev;
+	}
+}
+
+static void turn_fs_readonly(unsigned long arg)
+{
+	struct super_block *sb = (struct super_block *)arg;
+
+	make_rdonly(sb->s_bdev, &journal_no_write[0]);
+	make_rdonly(EXT3_SB(sb)->s_journal->j_dev, &journal_no_write[1]);
+	wake_up(&EXT3_SB(sb)->ro_wait_queue);
+}
+
+static void setup_ro_after(struct super_block *sb)
+{
+	struct ext3_sb_info *sbi = EXT3_SB(sb);
+	init_timer(&sbi->turn_ro_timer);
+	if (ext3_ro_after) {
+		printk(KERN_DEBUG "fs will go read-only in %d jiffies\n",
+			ext3_ro_after);
+		init_waitqueue_head(&sbi->ro_wait_queue);
+		journal_no_write[0] = NULL;
+		journal_no_write[1] = NULL;
+		sbi->turn_ro_timer.function = turn_fs_readonly;
+		sbi->turn_ro_timer.data = (unsigned long)sb;
+		sbi->turn_ro_timer.expires = jiffies + ext3_ro_after;
+		ext3_ro_after = 0;
+		add_timer(&sbi->turn_ro_timer);
+	}
+}
+
+static void clear_ro_after(struct super_block *sb)
+{
+	del_timer_sync(&EXT3_SB(sb)->turn_ro_timer);
+	journal_no_write[0] = NULL;
+	journal_no_write[1] = NULL;
+	ext3_ro_after = 0;
+}
+#else
+#define setup_ro_after(sb) do {} while (0)
+#define clear_ro_after(sb) do {} while (0)
+#endif
+
 /*
  * Wrappers for journal_start/end.
* @@ -427,6 +490,7 @@ static void ext3_put_super (struct super invalidate_bdev(sbi->journal_bdev, 0); ext3_blkdev_remove(sbi); } + clear_ro_after(sb); sb->s_fs_info = NULL; kfree(sbi); return; @@ -619,6 +683,7 @@ enum { Opt_bsd_df, Opt_minix_df, Opt_grpid, Opt_nogrpid, Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, Opt_err_ro, Opt_nouid32, Opt_check, Opt_nocheck, Opt_debug, Opt_oldalloc, Opt_orlov, + Opt_ro_after, Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl, Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_commit, Opt_journal_update, Opt_journal_inum, @@ -649,6 +714,7 @@ static match_table_t tokens = { {Opt_debug, "debug"}, {Opt_oldalloc, "oldalloc"}, {Opt_orlov, "orlov"}, + {Opt_ro_after, "ro_after"}, {Opt_user_xattr, "user_xattr"}, {Opt_nouser_xattr, "nouser_xattr"}, {Opt_acl, "acl"}, @@ -786,6 +852,13 @@ static int parse_options (char * options case Opt_orlov: clear_opt (sbi->s_mount_opt, OLDALLOC); break; +#ifdef CONFIG_JBD_DEBUG + case Opt_ro_after: + if (match_int(&args[0], &option)) + return 0; + ext3_ro_after = option; + break; +#endif #ifdef CONFIG_EXT3_FS_XATTR case Opt_user_xattr: set_opt (sbi->s_mount_opt, XATTR_USER); @@ -1114,6 +1187,7 @@ static int ext3_setup_super(struct super ext3_check_inodes_bitmap (sb); } #endif + setup_ro_after(sb); return res; } @@ -1348,6 +1422,9 @@ static int ext3_fill_super (struct super int needs_recovery; __le32 features; +#ifdef CONFIG_JBD_DEBUG + ext3_ro_after = 0; +#endif sbi = kmalloc(sizeof(*sbi), GFP_KERNEL); if (!sbi) return -ENOMEM; @@ -1356,6 +1433,7 @@ static int ext3_fill_super (struct super sbi->s_mount_opt = 0; sbi->s_resuid = EXT3_DEF_RESUID; sbi->s_resgid = EXT3_DEF_RESGID; + setup_ro_after(sb); unlock_kernel(); @@ -2072,6 +2150,8 @@ static void ext3_clear_journal_err(struc journal = EXT3_SB(sb)->s_journal; + clear_ro_after(sb); + /* * Now check for any error status which may have been recorded in the * journal by a prior ext3_error() or ext3_abort() @@ -2138,6 +2218,7 @@ 
static int ext3_sync_fs(struct super_blo if (wait) log_wait_commit(EXT3_SB(sb)->s_journal, target); } + setup_ro_after(sb); return 0; } diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c --- a/fs/ext3/xattr.c +++ b/fs/ext3/xattr.c @@ -482,12 +482,14 @@ ext3_xattr_release_block(handle_t *handl ce = mb_cache_entry_get(ext3_xattr_cache, bh->b_bdev, bh->b_blocknr); if (BHDR(bh)->h_refcount == cpu_to_le32(1)) { ea_bdebug(bh, "refcount now=0; freeing"); + BUFFER_TRACE(bh, "xattr delete"); if (ce) mb_cache_entry_free(ce); ext3_free_blocks(handle, inode, bh->b_blocknr, 1); get_bh(bh); ext3_forget(handle, 1, inode, bh, bh->b_blocknr); } else { + BUFFER_TRACE(bh, "xattr deref"); if (ext3_journal_get_write_access(handle, bh) == 0) { lock_buffer(bh); BHDR(bh)->h_refcount = cpu_to_le32( @@ -758,6 +760,7 @@ ext3_xattr_block_set(handle_t *handle, s inserted: if (!IS_LAST_ENTRY(s->first)) { new_bh = ext3_xattr_cache_find(inode, header(s->base), &ce); + BUFFER_TRACE2(new_bh, bs->bh, "xattr cache looked up"); if (new_bh) { /* We found an identical block in the cache. */ if (new_bh == bs->bh) @@ -802,6 +805,7 @@ inserted: ea_idebug(inode, "creating block %d", block); new_bh = sb_getblk(sb, block); + BUFFER_TRACE(new_bh, "new xattr block"); if (!new_bh) { getblk_failed: ext3_free_blocks(handle, inode, block, 1); diff --git a/fs/jbd-kernel.c b/fs/jbd-kernel.c new file mode 100644 --- /dev/null +++ b/fs/jbd-kernel.c @@ -0,0 +1,255 @@ +/* + * fs/jbd-kernel.c + * + * Support code for the Journalling Block Device layer. + * This file contains things which have to be in-kernel when + * JBD is a module. + * + * 15 May 2001 Andrew Morton <andrewm@uow.edu.au> + * Created + */ + +#include <linux/config.h> +#include <linux/fs.h> +#include <linux/jbd.h> +#include <linux/module.h> +#include <linux/sched.h> +#include <linux/mm.h> + +/* + * Some sanity testing which is called from mark_buffer_clean(), + * and must be present in the main kernel. 
+ */ + +void jbd_preclean_buffer_check(struct buffer_head *bh) +{ + if (buffer_jbd(bh)) { + struct journal_head *jh = bh2jh(bh); + + transaction_t *transaction = jh->b_transaction; + journal_t *journal; + + if (jh->b_jlist == 0 && transaction == NULL) + return; + + J_ASSERT_JH(jh, transaction != NULL); + /* The kernel may be unmapping old data. We expect it + * to be dirty in that case, unless the buffer has + * already been forgotten by a transaction. */ + if (jh->b_jlist != BJ_Forget) { +#if 1 + if (!buffer_dirty(bh)) { + printk("%s: clean of clean buffer\n", + __FUNCTION__); + print_buffer_trace(bh); + return; + } +#endif + J_ASSERT_BH(bh, buffer_dirty(bh)); + } + + journal = transaction->t_journal; + J_ASSERT_JH(jh, + transaction == journal->j_running_transaction || + transaction == journal->j_committing_transaction); + } +} +EXPORT_SYMBOL(jbd_preclean_buffer_check); + +/* + * Support functions for BUFFER_TRACE() + */ + +static spinlock_t trace_lock = SPIN_LOCK_UNLOCKED; + +void buffer_trace(const char *function, struct buffer_head *dest, + struct buffer_head *src, char *info) +{ + struct buffer_history_item *bhist_i; + struct page *page; + unsigned long flags; + + if (dest == 0 || src == 0) + return; + + spin_lock_irqsave(&trace_lock, flags); + + /* + * Sometimes we don't initialise the ring pointers. (locally declared + * temp buffer_heads). Feebly attempt to detect and correct that here. 
+ */ + if ((dest->b_history.b_history_head - dest->b_history.b_history_tail > + BUFFER_HISTORY_SIZE)) { + dest->b_history.b_history_head = 0; + dest->b_history.b_history_tail = 0; + } + bhist_i = dest->b_history.b + + (dest->b_history.b_history_head & (BUFFER_HISTORY_SIZE - 1)); + bhist_i->function = function; + bhist_i->info = info; + bhist_i->b_state = src->b_state; + page = src->b_page; + if (page) + bhist_i->pg_dirty = !!PageDirty(page); + else + bhist_i->pg_dirty = 0; + +#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE) + bhist_i->b_trans_is_running = 0; + bhist_i->b_trans_is_committing = 0; + bhist_i->b_blocknr = src->b_blocknr; + if (buffer_jbd(src)) { + struct journal_head *jh; + journal_t *journal; + transaction_t *transaction; + + /* Footwork to avoid racing with journal_remove_journal_head */ + jh = src->b_private; + if (jh == 0) + goto raced; + transaction = jh->b_transaction; + if (src->b_private == 0) + goto raced; + bhist_i->b_jcount = jh->b_jcount; + bhist_i->b_jbd = 1; + bhist_i->b_jlist = jh->b_jlist; + bhist_i->b_frozen_data = jh->b_frozen_data; + bhist_i->b_committed_data = jh->b_committed_data; + bhist_i->b_transaction = !!jh->b_transaction; + bhist_i->b_next_transaction = !!jh->b_next_transaction; + bhist_i->b_cp_transaction = !!jh->b_cp_transaction; + + if (transaction) { + journal = transaction->t_journal; + bhist_i->b_trans_is_running = transaction == + journal->j_running_transaction; + bhist_i->b_trans_is_committing = transaction == + journal->j_committing_transaction; + } + } else { +raced: + bhist_i->b_jcount = 0; + bhist_i->b_jbd = 0; + bhist_i->b_jlist = 0; + bhist_i->b_frozen_data = 0; + bhist_i->b_committed_data = 0; + bhist_i->b_transaction = 0; + bhist_i->b_next_transaction = 0; + bhist_i->b_cp_transaction = 0; + } +#endif /* defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE) */ + + bhist_i->cpu = smp_processor_id(); + bhist_i->b_count = atomic_read(&src->b_count); + + dest->b_history.b_history_head++; + if 
(dest->b_history.b_history_head - dest->b_history.b_history_tail > + BUFFER_HISTORY_SIZE) + dest->b_history.b_history_tail = + dest->b_history.b_history_head - BUFFER_HISTORY_SIZE; + + spin_unlock_irqrestore(&trace_lock, flags); +} + +static const char *b_jlist_to_string(unsigned int b_list) +{ + switch (b_list) { +#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE) + case BJ_None: return "BJ_None"; + case BJ_SyncData: return "BJ_SyncData"; + case BJ_Metadata: return "BJ_Metadata"; + case BJ_Forget: return "BJ_Forget"; + case BJ_IO: return "BJ_IO"; + case BJ_Shadow: return "BJ_Shadow"; + case BJ_LogCtl: return "BJ_LogCtl"; + case BJ_Reserved: return "BJ_Reserved"; +#endif + default: return "Bad b_jlist"; + } +} + +static void print_one_hist(struct buffer_history_item *bhist_i) +{ + printk(" %s():%s\n", bhist_i->function, bhist_i->info); + printk(" b_state:0x%lx b_jlist:%s cpu:%d b_count:%d b_blocknr:%lu\n", + bhist_i->b_state, + b_jlist_to_string(bhist_i->b_jlist), + bhist_i->cpu, + bhist_i->b_count, + bhist_i->b_blocknr); +#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE) + printk(" b_jbd:%u b_frozen_data:%p b_committed_data:%p\n", + bhist_i->b_jbd, + bhist_i->b_frozen_data, + bhist_i->b_committed_data); + printk(" b_transaction:%u b_next_transaction:%u " + "b_cp_transaction:%u b_trans_is_running:%u\n", + bhist_i->b_transaction, + bhist_i->b_next_transaction, + bhist_i->b_cp_transaction, + bhist_i->b_trans_is_running); + printk(" b_trans_is_comitting:%u b_jcount:%u pg_dirty:%u", + bhist_i->b_trans_is_committing, + bhist_i->b_jcount, + bhist_i->pg_dirty); +#endif + printk("\n"); +} + +void print_buffer_fields(struct buffer_head *bh) +{ + printk("b_blocknr:%llu b_count:%d\n", + (unsigned long long)bh->b_blocknr, atomic_read(&bh->b_count)); + printk("b_this_page:%p b_data:%p b_page:%p\n", + bh->b_this_page, bh->b_data, bh->b_page); +#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE) + if (buffer_jbd(bh)) { + struct journal_head *jh = bh2jh(bh); + + 
printk("b_jlist:%u b_frozen_data:%p b_committed_data:%p\n", + jh->b_jlist, jh->b_frozen_data, jh->b_committed_data); + printk(" b_transaction:%p b_next_transaction:%p " + "b_cp_transaction:%p\n", + jh->b_transaction, jh->b_next_transaction, + jh->b_cp_transaction); + printk("b_cpnext:%p b_cpprev:%p\n", + jh->b_cpnext, jh->b_cpprev); + } +#endif +} + +void print_buffer_trace(struct buffer_head *bh) +{ + unsigned long idx, count; + unsigned long flags; + + printk("buffer trace for buffer at 0x%p (I am CPU %d)\n", + bh, smp_processor_id()); + BUFFER_TRACE(bh, ""); /* Record state now */ + + spin_lock_irqsave(&trace_lock, flags); + for ( idx = bh->b_history.b_history_tail, count = 0; + idx < bh->b_history.b_history_head && + count < BUFFER_HISTORY_SIZE; + idx++, count++) + print_one_hist(bh->b_history.b + + (idx & (BUFFER_HISTORY_SIZE - 1))); + + print_buffer_fields(bh); + spin_unlock_irqrestore(&trace_lock, flags); + dump_stack(); + printk("\n"); +} + +static struct buffer_head *failed_buffer_head; /* For access with debuggers */ + +void buffer_assertion_failure(struct buffer_head *bh) +{ + console_verbose(); + failed_buffer_head = bh; + print_buffer_trace(bh); +} +EXPORT_SYMBOL(buffer_trace); +EXPORT_SYMBOL(print_buffer_trace); +EXPORT_SYMBOL(buffer_assertion_failure); +EXPORT_SYMBOL(print_buffer_fields); diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c --- a/fs/jbd/commit.c +++ b/fs/jbd/commit.c @@ -28,10 +28,12 @@ static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate) { BUFFER_TRACE(bh, ""); - if (uptodate) + if (uptodate) { set_buffer_uptodate(bh); - else + } else { + BUFFER_TRACE(bh, "clear uptodate"); clear_buffer_uptodate(bh); + } unlock_buffer(bh); } @@ -341,7 +343,7 @@ write_out_data: BUFFER_TRACE(bh, "locked"); if (!inverted_lock(journal, bh)) goto write_out_data; - __journal_temp_unlink_buffer(jh); + __journal_temp_unlink_buffer(journal, jh); __journal_file_buffer(jh, commit_transaction, BJ_Locked); jbd_unlock_bh_state(bh); @@ -367,7 
+369,7 @@ write_out_data: BUFFER_TRACE(bh, "writeout complete: unfile"); if (!inverted_lock(journal, bh)) goto write_out_data; - __journal_unfile_buffer(jh); + __journal_unfile_buffer(journal, jh); jbd_unlock_bh_state(bh); journal_remove_journal_head(bh); put_bh(bh); @@ -408,7 +410,7 @@ write_out_data: continue; } if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) { - __journal_unfile_buffer(jh); + __journal_unfile_buffer(journal, jh); jbd_unlock_bh_state(bh); journal_remove_journal_head(bh); put_bh(bh); @@ -780,6 +782,7 @@ restart_loop: * behind for writeback and gets reallocated for another * use in a different page. */ if (buffer_freed(bh)) { + BUFFER_TRACE(bh, "buffer_freed"); clear_buffer_freed(bh); clear_buffer_jbddirty(bh); } @@ -788,12 +791,14 @@ restart_loop: JBUFFER_TRACE(jh, "add to new checkpointing trans"); __journal_insert_checkpoint(jh, commit_transaction); JBUFFER_TRACE(jh, "refile for checkpoint writeback"); - __journal_refile_buffer(jh); + __journal_refile_buffer(journal, jh); + JBUFFER_TRACE(jh, "did writepage hack"); jbd_unlock_bh_state(bh); } else { + BUFFER_TRACE(bh, "not jbddirty"); J_ASSERT_BH(bh, !buffer_dirty(bh)); J_ASSERT_JH(jh, jh->b_next_transaction == NULL); - __journal_unfile_buffer(jh); + __journal_unfile_buffer(journal, jh); jbd_unlock_bh_state(bh); journal_remove_journal_head(bh); /* needs a brelse */ release_buffer_page(bh); diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c --- a/fs/jbd/transaction.c +++ b/fs/jbd/transaction.c @@ -1034,7 +1034,7 @@ int journal_dirty_data(handle_t *handle, /* journal_clean_data_list() may have got there first */ if (jh->b_transaction != NULL) { JBUFFER_TRACE(jh, "unfile from commit"); - __journal_temp_unlink_buffer(jh); + __journal_temp_unlink_buffer(journal, jh); /* It still points to the committing * transaction; move it to this one so * that the refile assert checks are @@ -1053,7 +1053,7 @@ int journal_dirty_data(handle_t *handle, if (jh->b_jlist != BJ_SyncData && jh->b_jlist != BJ_Locked) 
{ JBUFFER_TRACE(jh, "not on correct data list: unfile"); J_ASSERT_JH(jh, jh->b_jlist != BJ_Shadow); - __journal_temp_unlink_buffer(jh); + __journal_temp_unlink_buffer(journal, jh); jh->b_transaction = handle->h_transaction; JBUFFER_TRACE(jh, "file as data"); __journal_file_buffer(jh, handle->h_transaction, @@ -1249,10 +1249,10 @@ int journal_forget (handle_t *handle, st */ if (jh->b_cp_transaction) { - __journal_temp_unlink_buffer(jh); + __journal_temp_unlink_buffer(journal, jh); __journal_file_buffer(jh, transaction, BJ_Forget); } else { - __journal_unfile_buffer(jh); + __journal_unfile_buffer(journal, jh); journal_remove_journal_head(bh); __brelse(bh); if (!buffer_jbd(bh)) { @@ -1433,6 +1433,8 @@ int journal_force_commit(journal_t *jour static inline void __blist_add_buffer(struct journal_head **list, struct journal_head *jh) { + J_ASSERT_JH(jh, jh->b_tnext == NULL); + J_ASSERT_JH(jh, jh->b_tprev == NULL); if (!*list) { jh->b_tnext = jh->b_tprev = jh; *list = jh; @@ -1457,6 +1459,8 @@ __blist_add_buffer(struct journal_head * static inline void __blist_del_buffer(struct journal_head **list, struct journal_head *jh) { + J_ASSERT_JH(jh, jh->b_tnext != NULL); + J_ASSERT_JH(jh, jh->b_tprev != NULL); if (*list == jh) { *list = jh->b_tnext; if (*list == jh) @@ -1464,6 +1468,8 @@ __blist_del_buffer(struct journal_head * } jh->b_tprev->b_tnext = jh->b_tnext; jh->b_tnext->b_tprev = jh->b_tprev; + jh->b_tprev = NULL; + jh->b_tnext = NULL; } /* @@ -1477,16 +1483,19 @@ __blist_del_buffer(struct journal_head * * * Called under j_list_lock. The journal may not be locked. 
*/ -void __journal_temp_unlink_buffer(struct journal_head *jh) +void __journal_temp_unlink_buffer(journal_t *journal, struct journal_head *jh) { struct journal_head **list = NULL; transaction_t *transaction; struct buffer_head *bh = jh2bh(jh); + JBUFFER_TRACE(jh, "entry"); J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh)); transaction = jh->b_transaction; - if (transaction) + if (transaction) { + J_ASSERT_JH(jh, jh->b_transaction->t_journal == journal); assert_spin_locked(&transaction->t_journal->j_list_lock); + } J_ASSERT_JH(jh, jh->b_jlist < BJ_Types); if (jh->b_jlist != BJ_None) @@ -1529,9 +1538,9 @@ void __journal_temp_unlink_buffer(struct mark_buffer_dirty(bh); /* Expose it to the VM */ } -void __journal_unfile_buffer(struct journal_head *jh) +void __journal_unfile_buffer(journal_t *journal, struct journal_head *jh) { - __journal_temp_unlink_buffer(jh); + __journal_temp_unlink_buffer(journal, jh); jh->b_transaction = NULL; } @@ -1539,7 +1548,7 @@ void journal_unfile_buffer(journal_t *jo { jbd_lock_bh_state(jh2bh(jh)); spin_lock(&journal->j_list_lock); - __journal_unfile_buffer(jh); + __journal_unfile_buffer(journal, jh); spin_unlock(&journal->j_list_lock); jbd_unlock_bh_state(jh2bh(jh)); } @@ -1567,7 +1576,7 @@ __journal_try_to_free_buffer(journal_t * if (jh->b_jlist == BJ_SyncData || jh->b_jlist == BJ_Locked) { /* A written-back ordered data buffer */ JBUFFER_TRACE(jh, "release data"); - __journal_unfile_buffer(jh); + __journal_unfile_buffer(journal, jh); journal_remove_journal_head(bh); __brelse(bh); } @@ -1672,7 +1681,8 @@ static int __dispose_buffer(struct journ int may_free = 1; struct buffer_head *bh = jh2bh(jh); - __journal_unfile_buffer(jh); + JBUFFER_TRACE(jh, "unfile"); + __journal_unfile_buffer(transaction->t_journal, jh); if (jh->b_cp_transaction) { JBUFFER_TRACE(jh, "on running+cp transaction"); @@ -1928,6 +1938,7 @@ void __journal_file_buffer(struct journa int was_dirty = 0; struct buffer_head *bh = jh2bh(jh); + BUFFER_TRACE(bh, "entry"); 
J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh)); assert_spin_locked(&transaction->t_journal->j_list_lock); @@ -1950,7 +1961,7 @@ void __journal_file_buffer(struct journa } if (jh->b_transaction) - __journal_temp_unlink_buffer(jh); + __journal_temp_unlink_buffer(transaction->t_journal, jh); jh->b_transaction = transaction; switch (jlist) { @@ -1990,6 +2001,7 @@ void __journal_file_buffer(struct journa if (was_dirty) set_buffer_jbddirty(bh); + BUFFER_TRACE(bh, "filed"); } void journal_file_buffer(struct journal_head *jh, @@ -2012,18 +2024,20 @@ void journal_file_buffer(struct journal_ * * Called under jbd_lock_bh_state(jh2bh(jh)) */ -void __journal_refile_buffer(struct journal_head *jh) +void __journal_refile_buffer(journal_t *journal, struct journal_head *jh) { int was_dirty; struct buffer_head *bh = jh2bh(jh); + JBUFFER_TRACE(jh, "entry"); J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh)); if (jh->b_transaction) assert_spin_locked(&jh->b_transaction->t_journal->j_list_lock); /* If the buffer is now unused, just drop it. 
*/ if (jh->b_next_transaction == NULL) { - __journal_unfile_buffer(jh); + BUFFER_TRACE(bh, "unfile"); + __journal_unfile_buffer(journal, jh); return; } @@ -2033,7 +2047,7 @@ void __journal_refile_buffer(struct jour */ was_dirty = test_clear_buffer_jbddirty(bh); - __journal_temp_unlink_buffer(jh); + __journal_temp_unlink_buffer(journal, jh); jh->b_transaction = jh->b_next_transaction; jh->b_next_transaction = NULL; __journal_file_buffer(jh, jh->b_transaction, BJ_Metadata); @@ -2064,7 +2078,7 @@ void journal_refile_buffer(journal_t *jo jbd_lock_bh_state(bh); spin_lock(&journal->j_list_lock); - __journal_refile_buffer(jh); + __journal_refile_buffer(journal, jh); jbd_unlock_bh_state(bh); journal_remove_journal_head(bh); diff --git a/fs/mpage.c b/fs/mpage.c --- a/fs/mpage.c +++ b/fs/mpage.c @@ -191,6 +191,7 @@ do_mpage_readpage(struct bio *bio, struc for (page_block = 0; page_block < blocks_per_page; page_block++, block_in_file++) { bh.b_state = 0; + bh.b_page = page; if (block_in_file < last_block) { if (get_block(inode, block_in_file, &bh, 0)) goto confused; @@ -472,6 +473,7 @@ __mpage_writepage(struct bio *bio, struc for (page_block = 0; page_block < blocks_per_page; ) { map_bh.b_state = 0; + map_bh.b_page = page; if (get_block(inode, block_in_file, &map_bh, 1)) goto confused; if (buffer_new(&map_bh)) diff --git a/include/linux/buffer-trace.h b/include/linux/buffer-trace.h new file mode 100644 --- /dev/null +++ b/include/linux/buffer-trace.h @@ -0,0 +1,85 @@ +/* + * include/linux/buffer-trace.h + * + * Debugging support for recording buffer_head state transitions + * + * May 2001, akpm + * Created + */ + +#ifndef BUFFER_TRACE_H_INCLUDED +#define BUFFER_TRACE_H_INCLUDED + +#include <linux/config.h> + +#ifdef CONFIG_BUFFER_DEBUG + +/* The number of records per buffer_head. 
Must be a power of two */
+#define BUFFER_HISTORY_SIZE 32
+
+struct buffer_head;
+
+/* This gets embedded in struct buffer_head */
+struct buffer_history {
+	struct buffer_history_item {
+		const char *function;
+		char *info;
+		unsigned long b_state;
+		unsigned b_list:3;
+		unsigned b_jlist:4;
+		unsigned pg_dirty:1;
+		unsigned cpu:3;
+		unsigned b_count:8;
+		unsigned long b_blocknr;	/* For src != dest */
+#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE)
+		unsigned b_jcount:4;
+		unsigned b_jbd:1;
+		unsigned b_transaction:1;
+		unsigned b_next_transaction:1;
+		unsigned b_cp_transaction:1;
+		unsigned b_trans_is_running:1;
+		unsigned b_trans_is_committing:1;
+		void *b_frozen_data;
+		void *b_committed_data;
+#endif
+	} b[BUFFER_HISTORY_SIZE];
+	unsigned long b_history_head;	/* Next place to write */
+	unsigned long b_history_tail;	/* Oldest valid entry */
+};
+
+static inline void buffer_trace_init(struct buffer_history *bhist)
+{
+	bhist->b_history_head = 0;
+	bhist->b_history_tail = 0;
+}
+extern void buffer_trace(const char *function, struct buffer_head *dest,
+			struct buffer_head *src, char *info);
+extern void print_buffer_fields(struct buffer_head *bh);
+extern void print_buffer_trace(struct buffer_head *bh);
+
+#define BUFFER_STRINGIFY2(X) #X
+#define BUFFER_STRINGIFY(X) BUFFER_STRINGIFY2(X)
+
+#define BUFFER_TRACE2(dest, src, info)				\
+	do {							\
+		buffer_trace(__FUNCTION__, (dest), (src),	\
+			"["__FILE__":"				\
+			BUFFER_STRINGIFY(__LINE__)"] " info);	\
+	} while (0)
+
+#define BUFFER_TRACE(bh, info)	BUFFER_TRACE2(bh, bh, info)
+#define JBUFFER_TRACE(jh, info)	BUFFER_TRACE(jh2bh(jh), info)
+
+#else /* CONFIG_BUFFER_DEBUG */
+
+#define buffer_trace_init(bh)		do {} while (0)
+#define print_buffer_fields(bh)		do {} while (0)
+#define print_buffer_trace(bh)		do {} while (0)
+#define BUFFER_TRACE(bh, info)		do {} while (0)
+#define BUFFER_TRACE2(bh, bh2, info)	do {} while (0)
+#define JBUFFER_TRACE(jh, info)		do {} while (0)
+
+#endif /* CONFIG_BUFFER_DEBUG */
+
+#endif /*
BUFFER_TRACE_H_INCLUDED */ + diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -13,6 +13,7 @@ #include <linux/pagemap.h> #include <linux/wait.h> #include <asm/atomic.h> +#include <linux/buffer-trace.h> enum bh_state_bits { BH_Uptodate, /* Contains valid data */ @@ -65,6 +66,10 @@ struct buffer_head { bh_end_io_t *b_end_io; /* I/O completion */ void *b_private; /* reserved for b_end_io */ struct list_head b_assoc_buffers; /* associated with another mapping */ + +#ifdef CONFIG_BUFFER_DEBUG + struct buffer_history b_history; +#endif }; /* diff --git a/include/linux/jbd.h b/include/linux/jbd.h --- a/include/linux/jbd.h +++ b/include/linux/jbd.h @@ -41,7 +41,7 @@ * hardware _can_ fail, but for debugging purposes when running tests on * known-good hardware we may want to trap these errors. */ -#undef JBD_PARANOID_IOFAIL +#define JBD_PARANOID_IOFAIL /* * The default maximum commit age, in seconds. @@ -55,7 +55,9 @@ * CONFIG_JBD_DEBUG is on. */ #define JBD_EXPENSIVE_CHECKING + extern int journal_enable_debug; +extern struct block_device *journal_no_write[2]; #define jbd_debug(n, f, a...) \ do { \ @@ -266,6 +268,7 @@ void buffer_assertion_failure(struct buf #else #define J_ASSERT_BH(bh, expr) J_ASSERT(expr) #define J_ASSERT_JH(jh, expr) J_ASSERT(expr) +#define buffer_assertion_failure(bh) do { } while (0) #endif #else @@ -273,9 +276,9 @@ void buffer_assertion_failure(struct buf #endif /* JBD_ASSERTIONS */ #if defined(JBD_PARANOID_IOFAIL) -#define J_EXPECT(expr, why...) J_ASSERT(expr) -#define J_EXPECT_BH(bh, expr, why...) J_ASSERT_BH(bh, expr) -#define J_EXPECT_JH(jh, expr, why...) J_ASSERT_JH(jh, expr) +#define J_EXPECT(expr, why...) ({ J_ASSERT(expr); 1; }) +#define J_EXPECT_BH(bh, expr, why...) ({ J_ASSERT_BH(bh, expr); 1; }) +#define J_EXPECT_JH(jh, expr, why...) ({ J_ASSERT_JH(jh, expr); 1; }) #else #define __journal_expect(expr, why...) 
\ ({ \ @@ -823,10 +826,10 @@ struct journal_s */ /* Filing buffers */ -extern void __journal_temp_unlink_buffer(struct journal_head *jh); +extern void __journal_temp_unlink_buffer(journal_t *, struct journal_head *jh); extern void journal_unfile_buffer(journal_t *, struct journal_head *); -extern void __journal_unfile_buffer(struct journal_head *); -extern void __journal_refile_buffer(struct journal_head *); +extern void __journal_unfile_buffer(journal_t *, struct journal_head *); +extern void __journal_refile_buffer(journal_t *, struct journal_head *); extern void journal_refile_buffer(journal_t *, struct journal_head *); extern void __journal_file_buffer(struct journal_head *, transaction_t *, int); extern void __journal_free_buffer(struct journal_head *bh); @@ -1055,6 +1058,8 @@ static inline int jbd_space_needed(journ * Definitions which augment the buffer_head layer */ +/* JBD additions */ + /* journaling buffer types */ #define BJ_None 0 /* Not journaled */ #define BJ_SyncData 1 /* Normal data: flush before commit */ @@ -1071,13 +1076,6 @@ extern int jbd_blocks_per_page(struct in #ifdef __KERNEL__ -#define buffer_trace_init(bh) do {} while (0) -#define print_buffer_fields(bh) do {} while (0) -#define print_buffer_trace(bh) do {} while (0) -#define BUFFER_TRACE(bh, info) do {} while (0) -#define BUFFER_TRACE2(bh, bh2, info) do {} while (0) -#define JBUFFER_TRACE(jh, info) do {} while (0) - #endif /* __KERNEL__ */ #endif /* CONFIG_JBD || CONFIG_JBD_MODULE || !__KERNEL__ */ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kjournald() with DIO
  2005-09-16 13:42 ` Stephen C. Tweedie
@ 2005-09-21 18:22 ` Mingming Cao
  0 siblings, 0 replies; 29+ messages in thread
From: Mingming Cao @ 2005-09-21 18:22 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrew Morton, Badari Pulavarty, linux-fsdevel, Mingming Cao

On Fri, 2005-09-16 at 14:42 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, 2005-09-13 at 01:29, Andrew Morton wrote:
>
> > Have you spoken with Mingming about this?  She's chasing a similar race.
> > She's currently working on getting the ext3-debug patch working so we can
> > find out how those buffers (which apparently aren't even attached to the
> > journal) came to have an elevated refcount.
>
> Here's the latest version of ext3 buffer tracing which I've been using.
> It's basically just the same as one from Andrew from about 2.6.5
> vintage, with a few changes in places to deal with new bits of code that
> need temporary buffer_heads initialised properly for tracing.
>
> It has survived a couple of hours of fsx and fsstress in parallel, but I
> haven't run any O_DIRECT tests with it on the current git kernel, so
> there may still be problems in that area.

Your patch ran smoothly through overnight iozone DIO tests here. It also
successfully shows the last 32 activities in the history of the busy buffer
which caused the EIO failure, and the results match the previous analysis.

A minor improvement: translate the BJ_Locked state into a proper string.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---

 linux-2.6.5-7.191-cmm/fs/jbd-kernel.c |    1 +
 1 files changed, 1 insertion(+)

diff -puN fs/jbd-kernel.c~buffer-trace-fix fs/jbd-kernel.c
--- linux-2.6.5-7.191/fs/jbd-kernel.c~buffer-trace-fix	2005-09-19 15:48:03.000000000 -0700
+++ linux-2.6.5-7.191-cmm/fs/jbd-kernel.c	2005-09-19 15:50:16.000000000 -0700
@@ -163,6 +163,7 @@ static const char *b_jlist_to_string(uns
 	case BJ_Shadow:		return "BJ_Shadow";
 	case BJ_LogCtl:		return "BJ_LogCtl";
 	case BJ_Reserved:	return "BJ_Reserved";
+	case BJ_Locked:		return "BJ_Locked";
 #endif
 	default:		return "Bad b_jlist";
 	}
_

^ permalink raw reply	[flat|nested] 29+ messages in thread
end of thread, other threads:[~2005-09-21 18:22 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2005-09-12 23:23 kjournald() with DIO Badari Pulavarty
2005-09-12 23:37 ` Andrew Morton
2005-09-13  0:06 ` Badari Pulavarty
2005-09-13  0:29 ` Andrew Morton
2005-09-13 16:52 ` Badari Pulavarty
2005-09-13 23:07 ` Andrew Morton
2005-09-14 17:23 ` Mingming Cao
2005-09-14 18:18 ` Andrew Morton
2005-09-14 21:40 ` Mingming Cao
2005-09-14 22:02 ` Andrew Morton
2005-09-15 11:11 ` Suparna Bhattacharya
2005-09-15 18:52 ` Andrew Morton
2005-09-15 15:03 ` Badari Pulavarty
2005-09-15 19:22 ` Andrea Arcangeli
2005-09-15 20:00 ` Andrew Morton
2005-09-15 20:20 ` Andrea Arcangeli
2005-09-15 20:35 ` Andrew Morton
2005-09-15 20:49 ` Badari Pulavarty
2005-09-15 21:06 ` Andrea Arcangeli
2005-09-15 21:20 ` Andrew Morton
2005-09-15 22:22 ` Badari Pulavarty
2005-09-15 21:03 ` Andrea Arcangeli
2005-09-15 21:26 ` Andrew Morton
2005-09-15 22:04 ` Andrea Arcangeli
2005-09-15 23:28 ` Mingming Cao
2005-09-16  0:18 ` Andrea Arcangeli
2005-09-13 17:53 ` Mingming Cao
2005-09-16 13:42 ` Stephen C. Tweedie
2005-09-21 18:22 ` Mingming Cao
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).