linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: 丁定华 <dingdinghua85@gmail.com>
Cc: Jan Kara <jack@suse.cz>, linux-ext4@vger.kernel.org
Subject: Re: Should we discard jbddirty bit if BH_Freed is set?
Date: Thu, 28 Jan 2010 13:53:55 +0100	[thread overview]
Message-ID: <20100128125354.GA3124@quack.suse.cz> (raw)
In-Reply-To: <7bb361261001271723n4fdad0e9l2171aa092baa0523@mail.gmail.com>

  Hi,

On Thu 28-01-10 09:23:43, 丁定华 wrote:
>     As you wrote, if T2!=T1, then T2 is committing transaction while T1
>     is running transaction, and if T1 complete commit, we don't care
>     about the content of buffers.  But there is a  prerequisite -->"T1
>     complete commit", if T1 start commit and another transaction T3
>     becomes the new running transaction, T3 may need to reuse T2 log
>     space and force checkpoint, and since we have clean the BH_dirty bit
>     of buffers after T2 commits, so T2 may be freed before T1 complete
>     commit, and unfortunately, T1 doesn't complete commit, so after
>     replay, updates of T2 get lost, fs becomes inconsistent.
  Hmm, you are right. That could probably happen. It only hits ext3/4 in
data=journaled mode but the bug is there. But it's hard to fix it in a
reasonable way - if we would just leave the dirty bit set, we will have
aliasing problems - buffer B is attached to some page which used to be from
file F, so unmap_underlying_metadata will not find it because it looks only
into page cache the block device, not to the pagecaches of individual
files. So if we reallocate the block of B for some other file G before the
buffer B is checkpointed, we have two dirty buffers representing the same
block and thus data corruption can happen.
  Maybe we could handle them by setting b_next_transaction to the
transaction that deleted the buffer (in jbd2_journal_unmap_buffer) and
setting buffer freed like we do now. Commit code would handle freed
buffers like:
If b_next_transaction is set, file buffer to forget list of the next
transaction. If b_next_transaction isn't set, clear all dirty bits.
What do you think?

								Honza

> 2010/1/27 Jan Kara <jack@suse.cz>:
> >  Hi,
> >
> > On Wed 27-01-10 10:32:18, 丁定华 wrote:
> >>         I'm a little confused about BH_Freed bit. The only place it is set
> >> is journal_unmap_buffer, which is called by jbd2_journal_invalidatepage when
> >> we want to truncate a file. Since jbd2_journal_invalidatepage is called
> >> outside of transaction, We can't make sure whether the "add to orphan"
> >> operation belongs to committing transaction or not,  so we can't touch the
> >> buffer belongs to committing transaction, instead BH_Freed bit is set to
> >> indicate that this buffer can be discarded in running transaction. But i
> >> think we shouldn't clear BH_JBDdirty in jbd2_journal_commit_transaction, as
> >> following codes does:
> >>                 /* A buffer which has been freed while still being
> >>                  * journaled by a previous transaction may end up still
> >>                  * being dirty here, but we want to avoid writing back
> >>                  * that buffer in the future now that the last use has
> >>                  * been committed.  That's not only a performance gain,
> >>                  * it also stops aliasing problems if the buffer is left
> >>                  * behind for writeback and gets reallocated for another
> >>                  * use in a different page. */
> >>                 if (buffer_freed(bh)) {
> >>                         clear_buffer_freed(bh);
> >>                         clear_buffer_jbddirty(bh);
> >>                 }
> >> Note that, *We can't make sure "current running transaction" can complete
> >> commit work.* If we clear BH_JBDdirty bit here, this buffer may be freed
> >> here,  the log space of older transaction may be freed before the "current
> >> running transaction" complete commit work, and if this happends, filesystem
> >> will be inconsistent.
> >  Let me sketch the situation here:
> > The file F gets truncated. The inode is added to orphan list in some
> > transaction T1, only then jbd2_journal_invalidatepage can be called.
> > As you wrote above, it can happen that jbd2_journal_invalidatepage on
> > buffer B runs when some transaction T2 containing B is being committed and
> > in that case we set BH_Freed.  If T2 != T1 - i.e., T2 is being committed
> > and T1 is the running transaction, note that we clear the dirty bit only
> > when T2 is fully committed and we are processing forget list. So buffer has
> > been properly written to T2 and we just won't write it in the transaction
> > T1. And that is fine because as soon as transaction T1 finishes commit, we
> > don't care about what happens with buffers of F because the fact that F is
> > truncated is recorded and in case of crash we finish truncate during
> > journal replay. And if we crash before T1 finishes commit, we don't care
> > about contents of T1 either. If T2 == T1, the above reasoning applies as
> > well and the situation is even simpler.
> >
> >                                                                Honza
> > --
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
> >
> 
> 
> 
> -- 
> 丁定华
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2010-01-28 12:53 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <7bb361261001261832wb4f9ac2u96fdb6460aa45fa2@mail.gmail.com>
2010-01-27 12:23 ` Should we discard jbddirty bit if BH_Freed is set? Jan Kara
2010-01-28  1:23   ` 丁定华
2010-01-28 12:53     ` Jan Kara [this message]
2010-01-29  1:43       ` 丁定华
2010-01-30  6:41         ` 丁定华
2010-02-02 22:02           ` Jan Kara
2010-01-27  2:39 丁定华

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100128125354.GA3124@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=dingdinghua85@gmail.com \
    --cc=linux-ext4@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).