From: Ric Wheeler <rwheeler@redhat.com>
To: Theodore Tso <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>,
Christian Fischer <Christian.Fischer@easterngraphics.com>,
linux-ext4@vger.kernel.org
Subject: Re: Enable asynchronous commits by default patch revoked?
Date: Mon, 24 Aug 2009 16:28:16 -0400 [thread overview]
Message-ID: <4A92F7E0.9010001@redhat.com> (raw)
In-Reply-To: <20090824201027.GC17684@mit.edu>
Theodore Tso wrote:
> On Mon, Aug 24, 2009 at 12:31:19PM -0600, Andreas Dilger wrote:
>
>>> No one has gotten around to looking at this closely; I think adding a
>>> strategically placed blkdev_issue_flush() will allow us to safely
>>> enable this feature, but it needs careful study.
>>>
>> I don't think that was the issue, but rather that we wanted to have
>> per-block checksums in order to handle the case were some block in
>> transaction A is causing a transaction checksum failure, yet transaction
>> B has already committed and begun checkpointing.
>>
>
> There are multiple problems that are going on here.
>
> Adding per-block checksums is useful in providing better resiliance in
> the case of write errors in the journal, and providing better error
> handling when we detect a checksum error in the commit block.
>
My issue with the async commit is that it is basically a detection
mechanism.
Drives will (almost always) write to platter sequential writes in order.
Async commit lets us send down things out of order which means that we
have a wider window of "bad state" for any given transaction...
ric
> However, even if we add even if we add per-block checksums, there is
> still the problem that there is logic in the jbd layer where we avoid
> reusing certain blocks until we are sure the transaction has
> committed. With asynchronous commits, there is no way of knowing when
> it is safe to reuse those blocks.
>
> To illustrate, consider a data block for an inode which has just been
> deleted. We have code which prevents that block from being reused
> until the transaction has committed; but when we use asynchronous
> commits, we don't know when that will be. Currently the async commit
> code assumes that once we send the commit block to be written, the
> commit has happened; this opens up a race where between when the
> commit record (and all of the associated journal blocks) is actually
> written to disk, and when a data block gets reused.
>
> Most of the time, this will cause silent corruption of a data file
> that was about to be deleted right before the power failure --- but if
> the block in question was part of a directory that was in the process
> of being deleted, it could result in a filesystem corruption
> detectable by e2fsck.
>
> Looking at the code, the best we can do in the short-term is to write
> the commit record where we do, but do so with a barrier requested, and
> then wait for the commit block where we do. This will provide some
> performance improvement, since we won't wait for all of the journal
> blocks to be sent to disk before writing the commit record. However,
> we still have to wait for the commit record (and all of the blocks
> before it) to be committed to stable store before we can mark the
> transaction as being truly committed. So it's not a true "async
> commit", but it is a benefit of adding journal checksums.
>
> It might be possible in the long term to batch together multiple
> transaction in the "committing" state, instead of only allowing one
> transaction to be in the committing state, and to prevent blocks from
> being reused or allowing pinned buffer_heads from writing the contents
> of the blocks to their final location on disk until we are 100% sure
> the commit block and all of the associated journal blocks really have
> made it to disk. However, it's probably not worth doing this, since
> the only time this would make a big difference is when we have several
> transactions closing within the space of a short time; and in that
> case the fsync() requires a barrier operation anyway.
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
next prev parent reply other threads:[~2009-08-24 20:29 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <200908241033.10527.Christian.Fischer@easterngraphics.com>
2009-08-24 13:34 ` Enable asynchronous commits by default patch revoked? Theodore Tso
2009-08-24 18:31 ` Andreas Dilger
2009-08-24 18:37 ` Ric Wheeler
2009-08-24 20:10 ` Theodore Tso
2009-08-24 20:28 ` Ric Wheeler [this message]
2009-08-24 22:07 ` Theodore Tso
2009-08-24 22:12 ` Ric Wheeler
2009-08-24 23:28 ` Theodore Tso
2009-08-24 23:43 ` Andreas Dilger
2009-08-25 0:15 ` Theodore Tso
2009-08-25 17:52 ` Andreas Dilger
2009-08-25 18:07 ` Ric Wheeler
2009-08-25 21:11 ` Theodore Tso
2009-08-26 9:50 ` Andreas Dilger
2009-08-26 13:14 ` Theodore Tso
2009-08-26 22:00 ` Andreas Dilger
2009-08-26 22:55 ` Theodore Tso
2009-08-25 18:21 ` Ric Wheeler
2009-08-26 16:02 ` Jan Kara
2009-08-24 22:46 ` Andreas Dilger
2009-08-24 23:52 ` Theodore Tso
2009-09-02 14:48 ` Tom Vier
2009-09-02 15:03 ` Theodore Tso
2009-08-24 21:28 ` Andreas Dilger
2009-08-25 6:16 ` Christian Fischer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4A92F7E0.9010001@redhat.com \
--to=rwheeler@redhat.com \
--cc=Christian.Fischer@easterngraphics.com \
--cc=adilger@sun.com \
--cc=linux-ext4@vger.kernel.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.