linux-ext4.vger.kernel.org archive mirror
From: Theodore Tso <tytso@mit.edu>
To: Andreas Dilger <adilger@sun.com>
Cc: Ric Wheeler <rwheeler@redhat.com>,
	Christian Fischer <Christian.Fischer@easterngraphics.com>,
	linux-ext4@vger.kernel.org
Subject: Re: Enable asynchronous commits by default patch revoked?
Date: Wed, 26 Aug 2009 18:55:15 -0400
Message-ID: <20090826225515.GK6997@mit.edu>
In-Reply-To: <20090826220045.GG4197@webber.adilger.int>

On Wed, Aug 26, 2009 at 04:00:45PM -0600, Andreas Dilger wrote:
> I'm still missing something.  With async_commit enabled, it doesn't
> matter if the commit block is reordered, since the transaction checksum
> will verify if all of the data + commit block are written for that
> transaction, in case of a crash.  That is the whole point of async_commit.

The problem isn't reordering with respect to the journal blocks alone;
the problem is reordering with respect to the journal blocks *plus*
normal filesystem metadata.

The key point here is that jbd pins filesystem metadata blocks and
prevents them from being pushed out to disk until the transaction has
committed.  Once the transaction has been committed, they are free to
be written to disk, and {directory,indirect,extent} blocks which have
been released during the last transaction are now free to be reused
by the block allocator.

If the system is under memory pressure and is getting lots of
fsync() calls, there are a large number of transaction boundaries.
So it's possible for an I/O stream of the form:

	 ...
	 commit seq #17
	 journal of block #12
	 journal of block #52
	 journal of block #36
	 journal of block allocation bitmap releasing block #23
	 commit seq #18
	 update of block #12
	 write of reallocated block #23
	 ...

could get reordered as follows:

	 ...
	 commit seq #17
	 journal of block #12
	 journal of block #52
	 update of block #12
	 write of reallocated block #23
	 journal of block #36
	 <crash>
	 (journal of block allocation bitmap releasing block #23)
	 (commit seq #18)

OK, so what's happened?  Since there was no barrier when we write the
commit block for transaction #18, some of the (non-journal) I/O that
was only supposed to have happened *after* the commit has completed,
has happened too early, and then the system crashed before all of the
journal blocks associated with commit #18 could be written out.

So from the perspective of the journal replay, commit #18 never
happened.  So among other things the act of releasing block #23 never
happened --- but block #23 has gotten reused already, since a write
that was only supposed to happen *after* commit #18 has already hit
the disk, due to reordering on the disk drive.
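
To make the failure concrete, here is a toy model of the reordered
stream above.  It is only a sketch: the tuple encoding, replay() and
the variable names are made up for illustration and have nothing to do
with how jbd represents a journal.

```python
# Toy model of the reordered I/O stream. Hypothetical names throughout;
# this is not jbd's on-disk format or replay logic.

def replay(disk_writes):
    """Return the highest sequence number whose commit record is on disk."""
    committed = [w[1] for w in disk_writes if w[0] == "commit"]
    return max(committed) if committed else None

# Writes that actually reached the disk before the crash, in drive order.
on_disk = [
    ("commit", 17),
    ("journal", 12),
    ("journal", 52),
    ("update", 12),         # in-place write, meant to follow commit #18
    ("write_realloc", 23),  # new contents for block #23
    ("journal", 36),
    # crash: the bitmap journal block and commit #18 never made it
]

last_commit = replay(on_disk)

# Block #23 is released only by transaction #18.
freed_by_replay = set()
if last_commit is not None and last_commit >= 18:
    freed_by_replay.add(23)

# Blocks that were overwritten although replay still considers them in use.
overwritten = {w[1] for w in on_disk if w[0] == "write_realloc"}
corrupted = overwritten - freed_by_replay
```

After replay the last valid commit is #17, so block #23 still belongs
to its old owner, yet it already holds the new contents.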

This is what Chris Mason has demonstrated with his barrier=0 file
system corruption workload.  And this is something that journal
checksums don't help with, because it's not about the commit block
getting written out before the rest of the journal blocks.  *That*
case will be detected by an incorrect journal checksum.  The problem
is other I/O taking place to other parts of the filesystem.
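
A rough sketch of what the checksum does and does not cover.  CRC32
from zlib is used here purely as a stand-in (I am not claiming it is
what the journal code computes), and the block contents and the
txn_checksum() helper are invented for the example.

```python
import zlib

def txn_checksum(blocks):
    """Running CRC32 over the journal blocks of one transaction (stand-in)."""
    crc = 0
    for b in blocks:
        crc = zlib.crc32(b, crc)
    return crc

# What transaction #18 intended to log, and the commit record describing it.
journal_blocks = [b"block 12", b"block 52", b"block 36", b"bitmap: free 23"]
commit_record = {"seq": 18, "checksum": txn_checksum(journal_blocks)}

# The commit record reached the disk, but the third journal block did not,
# so that slot of the journal still holds stale bytes.
on_disk_blocks = [b"block 12", b"block 52", b"block 37", b"bitmap: free 23"]
commit_is_valid = txn_checksum(on_disk_blocks) == commit_record["checksum"]

# The checksum input never included block #23's in-place contents, so the
# verdict is the same whether or not block #23 was overwritten early.
```

Here commit_is_valid comes out False, so replay correctly discards
transaction #18.  Nothing in that check looks at the in-place write to
block #23, which is the write that causes the damage.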

I've actually used bad numbers here, since the journal is typically at
the very front of the disk (for ext3) or in the middle of the disk
(for ext4).  If the I/O for the rest of the filesystem is at the very
end of the disk, it's in fact very believable that the drive might defer
the journal update (at the beginning of the disk) and try to do lots
of filesystem metadata updates (at the end of the disk) to avoid
seeking back and forth, without realizing that this violates the
ordering constraints that the jbd layer needs for correctness.

Unfortunately, the only way we can communicate these constraints to the
disk drive is via barriers.
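
As an illustration of the constraint, here is a small simulation of a
drive that reorders freely except across barriers.  The queue
encoding, drive_reorder() and the placement of a barrier on each side
of the commit block are assumptions of this sketch, not a description
of the real block layer.

```python
import random

BARRIER = "barrier"

def drive_reorder(queue, rng):
    """Shuffle writes freely, but never move a write across a barrier."""
    out, segment = [], []
    for item in queue:
        if item == BARRIER:
            rng.shuffle(segment)
            out.extend(segment)
            segment = []
        else:
            segment.append(item)
    rng.shuffle(segment)
    out.extend(segment)
    return out

def commit_precedes_metadata(order):
    """True if commit #18 hits the disk before both in-place writes."""
    first_metadata = min(order.index("update 12"), order.index("write 23"))
    return order.index("commit 18") < first_metadata

journal = ["journal 12", "journal 52", "journal 36", "journal bitmap"]
with_barriers = journal + [BARRIER, "commit 18", BARRIER,
                           "update 12", "write 23"]
without_barriers = journal + ["commit 18", "update 12", "write 23"]

rng = random.Random(0)
safe = all(commit_precedes_metadata(drive_reorder(with_barriers, rng))
           for _ in range(1000))
unsafe_seen = any(not commit_precedes_metadata(drive_reorder(without_barriers, rng))
                  for _ in range(1000))
```

With the barriers in the queue the ordering holds on every run; without
them the simulated drive ends up writing metadata ahead of the commit.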

						- Ted
