Re: Odd "leak" of extent info into data blocks?

From: Curt Wohlgemuth <curtw@google.com>
To: Theodore Tso <tytso@mit.edu>
Cc: Valerie Aurora <vaurora@redhat.com>,
	ext4 development <linux-ext4@vger.kernel.org>
Subject: Re: Odd "leak" of extent info into data blocks?
Date: Tue, 8 Sep 2009 21:00:50 -0700
Message-ID: <6601abe90909082100n48afdba9qee087ff46bfe4e3f@mail.gmail.com>
In-Reply-To: <20090908233644.GV22901@mit.edu>

Hi Ted:

On Tue, Sep 8, 2009 at 4:36 PM, Theodore Tso <tytso@mit.edu> wrote:
> On Tue, Sep 08, 2009 at 02:18:35PM -0700, Curt Wohlgemuth wrote:
>>
>> All bforget() does is clear the buffer's dirty bit.  Meanwhile, the
>> page is still marked dirty, and can be in the middle of writeback;
>> it's true that __block_write_full_page() will check the dirty bit for
>> each buffer in the page, but there doesn't seem to be any
>> synchronization to ensure that the write won't take place at some
>> point in time after bforget() is called.  Which means it can be called
>> after the bitmap is changed.
>
> Let me make sure I got this right.  The problem that you're worried
> about is a block that had previously contained an extent tree node for
> an inode that gets deleted, and then that block gets reallocated for
> use as a data block.

Correct.

> In ext3 and ext4, metadata blocks (such as
> extent tree blocks), aren't stored in the page cache.

Hmm.  You're saying that in the absence of a journal, all metadata
writes go direct to disk?  Where should I look for this in the code?

Looking at ext4_ext_new_meta_block() and code that uses it, I don't
see anything that prevents the use of the page cache.  And if this
were the case, wouldn't the call to mark_buffer_dirty() in
__ext4_handle_dirty_metadata() (when there's no journal) do nothing?

I also put in code in submit_bio() to scan all pages for the extent
header pattern that I was seeing ("leaking" into the data pages).
When I saw it, the stack trace was always from pdflush() (from
wb_kupdate()).  I.e., these are from the page cache.
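
A minimal sketch of the kind of check such a scan could make, written as
userspace C rather than as the actual submit_bio() hook (which is not shown
in this message).  It assumes the standard ext4 on-disk extent header layout:
a little-endian 16-bit magic of 0xF30A, followed by 16-bit eh_entries and
eh_max fields, in a 12-byte header with 12-byte entries.  The function name
looks_like_extent_header() and the helper get_le16() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXT4_EXT_MAGIC   0xf30a  /* on-disk magic, stored little-endian */
#define EXT_HDR_SIZE     12      /* size of the extent header in bytes  */
#define EXT_ENTRY_SIZE   12      /* size of one extent or index entry   */

/* Read a little-endian 16-bit value regardless of host byte order. */
static uint16_t get_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Hypothetical helper: return true if the block starts with something
 * that looks like an ext4 extent header.  Beyond the magic, it checks
 * that eh_entries <= eh_max and that eh_max entries fit in the block,
 * to cut down on false positives from ordinary file data.
 */
bool looks_like_extent_header(const unsigned char *blk, size_t blksize)
{
    assert(blk != NULL);

    if (blksize < EXT_HDR_SIZE)
        return false;

    uint16_t magic   = get_le16(blk);      /* eh_magic   */
    uint16_t entries = get_le16(blk + 2);  /* eh_entries */
    uint16_t max     = get_le16(blk + 4);  /* eh_max     */

    if (magic != EXT4_EXT_MAGIC)
        return false;
    if (max == 0 || entries > max)
        return false;

    return (size_t)max <= (blksize - EXT_HDR_SIZE) / EXT_ENTRY_SIZE;
}
```

A check like this only says that a block *looks* like extent metadata; a data
block could legitimately contain the same bytes, so a hit still has to be
correlated with which file owns the block.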

> So I'm not sure why you're worried about the page being marked dirty.
> What's the scenario you are concerned about?

If you're right that metadata writes are not through the page cache,
then there is no scenario I'm worried about :-) .

The problem is that I've seen this in real life.  And the patch below
seems to fix it.  (Unfortunately, I haven't been able to recreate this
in a simple example, after several days' work.  I've only seen this in
a *very* small number of cases on heavily loaded machines.)

> If it's the case of a data block for a deleted inode getting
> rewritten after the inode is deleted: when the inode is deleted,
> truncate_inode_pages() ends up dropping the pages from the page cache
> *before* the block allocation bitmap is dropped.

It's quite possible that there's an interaction with older code that
we have in our 2.6.26-based kernels -- our ext4/jbd2 code is pretty
up-to-date, but the rest of the code base is not.  But really, in my
case -- and you'll have to trust me -- I've seen this pattern:

1. file A (~8MB) is written out and closed, with a final mod time of 12:08 p.m.
2. My submit_bio() scan sees the "bad extent header" written out to
physical block B at 12:15 p.m.
3. Looking at file A later, its logical block 2048 corresponds to
physical block B -- and contains the "bad extent header" pattern.

truncate_inode_pages() only deals with data blocks, right?  So it
should have no effect on metadata...

>> This is why I opted to wait for the buffer to be written out before
>> continuing on to ext4_free_blocks().
>
> Just to be clear, which buffer are you talking about here?

The leaf extent block's buffer_head.  Here's the patch, as applied to a
2.6.30.3 version of extents.c:

diff -Naur orig/fs/ext4/extents.c new/fs/ext4/extents.c
--- orig/fs/ext4/extents.c	2009-09-08 20:28:46.000000000 -0700
+++ new/fs/ext4/extents.c	2009-09-08 20:31:42.000000000 -0700
@@ -1958,6 +1958,15 @@
 		return err;
 	ext_debug("index is empty, remove it, free block %llu\n", leaf);
 	bh = sb_find_get_block(inode->i_sb, leaf);
+
+	/*
+	 * If we don't have a journal, then we've dirtied the BH for the leaf
+	 * block, but we're freeing the block now.  We need to wait here for
+	 * the page to be written out before we proceed.
+	 */
+	if (!ext4_handle_valid(handle) && bh)
+		sync_dirty_buffer(bh);
+
 	ext4_forget(handle, 1, inode, bh, leaf);
 	ext4_free_blocks(handle, inode, leaf, 1, 1);
 	return err;
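
The ordering that the patch is meant to enforce can be illustrated with a toy
userspace model.  This is only a sketch of the suspected race as described in
this message, not kernel code and not a confirmed root cause: all toy_* names
are hypothetical, toy_forget() stands in for the "clear the dirty bit"
behaviour attributed to bforget(), and toy_sync() stands in for the wait added
by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { BLK = 16 };                      /* toy block size in bytes */

struct toy_disk { char block[BLK]; };   /* one physical block */

struct toy_buffer {
    char data[BLK];
    bool dirty;
    bool write_in_flight;   /* writeback has already picked this up */
};

/* Writeback picks up a dirty buffer; the I/O completes some time later. */
static void toy_start_writeback(struct toy_buffer *b)
{
    if (b->dirty) {
        b->dirty = false;
        b->write_in_flight = true;
    }
}

/* The previously started write finally reaches the disk. */
static void toy_complete_writeback(struct toy_buffer *b, struct toy_disk *d)
{
    if (b->write_in_flight) {
        memcpy(d->block, b->data, BLK);
        b->write_in_flight = false;
    }
}

/* Only clears the dirty bit; does not wait for a write already started. */
static void toy_forget(struct toy_buffer *b)
{
    b->dirty = false;
}

/* Start a write if needed and wait for any write to finish. */
static void toy_sync(struct toy_buffer *b, struct toy_disk *d)
{
    toy_start_writeback(b);
    toy_complete_writeback(b, d);
}

/* Returns true if stale extent metadata ends up in the reused block. */
bool toy_run(bool sync_before_free)
{
    struct toy_disk disk = { { 0 } };
    struct toy_buffer meta = { "EXTENT-HDR", true, false };

    assert(sizeof "FILE-A-DATA" <= BLK);

    toy_start_writeback(&meta);         /* background writeback begins */
    if (sync_before_free)
        toy_sync(&meta, &disk);         /* wait before freeing the block */
    toy_forget(&meta);

    /* Block is freed, reallocated as a data block, and new data lands. */
    memcpy(disk.block, "FILE-A-DATA", sizeof "FILE-A-DATA");

    toy_complete_writeback(&meta, &disk);   /* late metadata write, if any */
    return strncmp(disk.block, "EXTENT-HDR", 10) == 0;
}
```

In this model, toy_run(false) leaves the old extent header in the reused
block, while toy_run(true) does not, because the metadata write has already
completed by the time the block is freed.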

Thanks,
Curt