linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Theodore Tso <tytso@mit.edu>
To: Andreas Dilger <adilger@sun.com>
Cc: Christian Fischer <Christian.Fischer@easterngraphics.com>,
	linux-ext4@vger.kernel.org
Subject: Re: Enable asynchronous commits by default patch revoked?
Date: Mon, 24 Aug 2009 16:10:27 -0400	[thread overview]
Message-ID: <20090824201027.GC17684@mit.edu> (raw)
In-Reply-To: <20090824183119.GI5931@webber.adilger.int>

On Mon, Aug 24, 2009 at 12:31:19PM -0600, Andreas Dilger wrote:
> > No one has gotten around to looking at this closely; I think adding a
> > strategically placed blkdev_issue_flush() will allow us to safely
> > enable this feature, but it needs careful study.
> 
> I don't think that was the issue, but rather that we wanted to have
> per-block checksums in order to handle the case were some block in
> transaction A is causing a transaction checksum failure, yet transaction
> B has already committed and begun checkpointing.

There are multiple problems that are going on here.

Adding per-block checksums is useful in providing better resiliance in
the case of write errors in the journal, and providing better error
handling when we detect a checksum error in the commit block.

However, even if we add even if we add per-block checksums, there is
still the problem that there is logic in the jbd layer where we avoid
reusing certain blocks until we are sure the transaction has
committed.  With asynchronous commits, there is no way of knowing when
it is safe to reuse those blocks.  

To illustrate, consider a data block for an inode which has just been
deleted.  We have code which prevents that block from being reused
until the transaction has committed; but when we use asynchronous
commits, we don't know when that will be.  Currently the async commit
code assumes that once we send the commit block to be written, the
commit has happened; this opens up a race where between when the
commit record (and all of the associated journal blocks) is actually
written to disk, and when a data block gets reused.

Most of the time, this will cause silent corruption of a data file
that was about to be deleted right before the power failure --- but if
the block in question was part of a directory that was in the process
of being deleted, it could result in a filesystem corruption
detectable by e2fsck.

Looking at the code, the best we can do in the short-term is to write
the commit record where we do, but do so with a barrier requested, and
then wait for the commit block where we do.  This will provide some
performance improvement, since we won't wait for all of the journal
blocks to be sent to disk before writing the commit record.  However,
we still have to wait for the commit record (and all of the blocks
before it) to be committed to stable store before we can mark the
transaction as being truly committed.  So it's not a true "async
commit", but it is a benefit of adding journal checksums.

It might be possible in the long term to batch together multiple
transaction in the "committing" state, instead of only allowing one
transaction to be in the committing state, and to prevent blocks from
being reused or allowing pinned buffer_heads from writing the contents
of the blocks to their final location on disk until we are 100% sure
the commit block and all of the associated journal blocks really have
made it to disk.  However, it's probably not worth doing this, since
the only time this would make a big difference is when we have several
transactions closing within the space of a short time; and in that
case the fsync() requires a barrier operation anyway.

							- Ted

  parent reply	other threads:[~2009-08-24 20:10 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <200908241033.10527.Christian.Fischer@easterngraphics.com>
2009-08-24 13:34 ` Enable asynchronous commits by default patch revoked? Theodore Tso
2009-08-24 18:31   ` Andreas Dilger
2009-08-24 18:37     ` Ric Wheeler
2009-08-24 20:10     ` Theodore Tso [this message]
2009-08-24 20:28       ` Ric Wheeler
2009-08-24 22:07         ` Theodore Tso
2009-08-24 22:12           ` Ric Wheeler
2009-08-24 23:28             ` Theodore Tso
2009-08-24 23:43               ` Andreas Dilger
2009-08-25  0:15                 ` Theodore Tso
2009-08-25 17:52                   ` Andreas Dilger
2009-08-25 18:07                     ` Ric Wheeler
2009-08-25 21:11                       ` Theodore Tso
2009-08-26  9:50                         ` Andreas Dilger
2009-08-26 13:14                           ` Theodore Tso
2009-08-26 22:00                             ` Andreas Dilger
2009-08-26 22:55                               ` Theodore Tso
2009-08-25 18:21                     ` Ric Wheeler
2009-08-26 16:02                   ` Jan Kara
2009-08-24 22:46           ` Andreas Dilger
2009-08-24 23:52             ` Theodore Tso
2009-09-02 14:48           ` Tom Vier
2009-09-02 15:03             ` Theodore Tso
2009-08-24 21:28       ` Andreas Dilger
2009-08-25  6:16   ` Christian Fischer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090824201027.GC17684@mit.edu \
    --to=tytso@mit.edu \
    --cc=Christian.Fischer@easterngraphics.com \
    --cc=adilger@sun.com \
    --cc=linux-ext4@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).