From: Dmitry Monakhov <dmonakhov@openvz.org>
To: tytso@mit.edu
Cc: linux-ext4@vger.kernel.org
Subject: Re: [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal
Date: Mon, 22 Mar 2010 19:14:13 +0300 [thread overview]
Message-ID: <87d3yw49m2.fsf@openvz.org> (raw)
In-Reply-To: <20100322150342.GG11560@thunk.org> (tytso@mit.edu's message of "Mon, 22 Mar 2010 11:03:42 -0400")
tytso@mit.edu writes:
> On Mon, Mar 22, 2010 at 05:04:19PM +0300, Dmitry Monakhov wrote:
>> tytso@mit.edu writes:
>>
>> > On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote:
>> >> start_journal_io:
>> >> + if (bufs)
>> >> + commit_transaction->t_flushed_data_blocks = 1;
>> >> +
>> >
>> > I'm not convinced this is right.
>> >
>> > From your test case, the problem isn't because we have journaled
>> > metadata blocks (which is what bufs) counts, but because fsync()
>> > depends on data blocks also getting flushed out to disks.
>> >
>> > However, if we aren't closing the transaction because of fsync(), I
>> > don't think we need to do a barrier in the case of an external
>> > journal. So instead of effectively unconditionally setting
>> > t_flushed_data_blocks (since bufs is nearly always going to be
>> > non-zero), I think the better fix is to test to see if the journal
>> > device != to the fs data device in fsync(), and if so, start the
>> > barrier operation there.
>> >
>> > Do you agree?
>> Yes.
>
> Just to be clear, since I realized I wrote fsync() when I should have
> written fs/ext4/fsync.c, my proposal was to put this check in
> ext4_sync_file().
This means that we will end up with two io-barriers in a row
for external journal case:
ext4_sync_file()
->jbd2_log_start_commit()
->blkdev_issue_flush(j_fs_dev)
/* seems following flush is mandatory to guarantee the metadata
* consistency */
->blkdev_issue_flush(j_fs_dev)
may be it will be better to pass some sort of barrier flag to
jbd2_log_start_commit() this function mark journal->j_commit_request
with ->t_flushed_data_blocks = 1, so io-carrier will be issued
even if transaction has only metadata
Advantages:
1) approach will works for all data modes
2) only one io-barrier is needed to guarantee the data and
metadata consiscency.
3) changes not so complicated.
I've already started to work with the patch but was surprised with
commit logic.
->__jbd2_log_start_commit(target)
journal->j_commit_request = target
->wake_up(&journal->j_wait_commit)
->kjournald2
transaction = journal->j_running_transaction
->jbd2_journal_commit_transaction(journal);
commit_transaction = journal->j_running_transaction
...
/* So j_commit_request is used only for comparison. But actually
committing journal->j_running_transaction->t_tid transaction
instead of j_commit_request. Why?
*/
>
>> BTW Would it be correct to update j_tail in
>> jbd2_journal_commit_transaction() to something more recent
>> if we have issued an io-barrier to j_fs_dev?
>> This will helps to reduce journal_recovery time which may be really
>> painful in some slow devices.
>
> Um, maybe. We are already calling __jbd2_journal_clean_checkpoint_list(),
> and the barrier operation *is* expensive.
>
> On the other hand, updating the journal superblock on every sync is
> another seek that would have to made before the barrier operation, and
> I'm a bit concerned that this seek would be noticeable. If it is
> noticeable, is it worth it to optimize for the uncommon case (a power
> failure requiring a journal replay) when it might cost us something,
> however, small, on every single journal update?
>
> Do we really think the journal replay time is really something that is
> a major pain point. I can think of optimizations that involve
> skipping writes that will get updated later in future transactions,
> but it means complicating the replay code, which has been stable for a
> long time, and it's not clear to me that the costs are worth the
> benefits.
Never mind. It was just my thoughts. The price of broken recovery stage
is to expansive to fix such rare cases.
>
>> I've take a look at async commit logic: fs/jbd2/commit.c
>> void jbd2_journal_commit_transaction(journal_t *journal)
>> {
>> 725: /* Done it all: now write the commit record asynchronously. */
>> if (JBD2_HAS_INCOMPAT_FEATURE(journal,
>> JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT))
>> {
>> err = journal_submit_commit_record(journal,
>> commit_transaction,
>> &cbh, crc32_sum);
>> if (err)
>> __jbd2_journal_abort_hard(journal);
>> if (journal->j_flags & JBD2_BARRIER)
>> blkdev_issue_flush(journal->j_dev, NULL);
>> <<< blkdev_issue_flush is wait for barrier to complete by default, but
>> <<< in fact we don't have to wait for barrier here. I've prepared a
>> <<< patch wich add flags to control blkdev_issue_flush() wait
>> <<< behavior, and this is the place for no-wait variant.
>
> I think that's right, as long as we're confident that subsequent
> writes won't get scheduled before the no-wait barrier. If it did, it
> would be a bug in the block I/O layer, so it should be OK.
>
> - Ted
>
next prev parent reply other threads:[~2010-03-22 16:14 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-03-12 17:26 [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal Dmitry Monakhov
2010-03-12 17:26 ` [PATCH 2/2] ext3, jbd: Add barriers for file systems with exernal journals Dmitry Monakhov
2010-03-22 1:20 ` [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal tytso
2010-03-22 14:04 ` Dmitry Monakhov
2010-03-22 15:03 ` tytso
2010-03-22 16:14 ` Dmitry Monakhov [this message]
2010-03-22 20:22 ` tytso
2010-03-30 2:14 ` Jan Kara
2010-03-30 1:47 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87d3yw49m2.fsf@openvz.org \
--to=dmonakhov@openvz.org \
--cc=linux-ext4@vger.kernel.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).