linux-ext4.vger.kernel.org archive mirror
From: Theodore Tso <tytso@mit.edu>
To: Ric Wheeler <rwheeler@redhat.com>
Cc: Andreas Dilger <adilger@sun.com>,
	Christian Fischer <Christian.Fischer@easterngraphics.com>,
	linux-ext4@vger.kernel.org
Subject: Re: Enable asynchronous commits by default patch revoked?
Date: Mon, 24 Aug 2009 18:07:39 -0400
Message-ID: <20090824220738.GG17684@mit.edu>
In-Reply-To: <4A92F7E0.9010001@redhat.com>

On Mon, Aug 24, 2009 at 04:28:16PM -0400, Ric Wheeler wrote:
>
> My issue with the async commit is that it is basically a detection  
> mechanism.
>
> Drives will (almost always) write to platter sequential writes in order.  
> Async commit lets us send down things out of order which means that we  
> have a wider window of "bad state" for any given transaction...

Sure, agreed.  But let's look a bit closer at what "async commit"
really means.

What ext3 and ext4 do by default is this:

1)  Write data blocks required by data=ordered mode (if any)

2)  Write the journal blocks

3)  Wait for the journal blocks to be sent to disk.  (We don't actually
do a barrier operation, so this just means the blocks have been sent
to the disk, not necessarily that they are forced to a platter.)

4)  Write the commit block, with the barrier flag set.

5)  Wait for the commit block.
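
The five steps above can be sketched with a toy model.  This is an
illustration only, not kernel code: ToyDisk, its write() method and
default_commit() are made-up names, and the model assumes a barrier
write forces everything sent earlier out of the drive's write cache
before the barrier block itself lands.

```python
class ToyDisk:
    """Hypothetical drive with a volatile write cache (not a kernel API)."""

    def __init__(self):
        self.cache = []    # blocks sent to the drive, not yet on the platter
        self.platter = []  # blocks on stable storage

    def write(self, block, barrier=False):
        if barrier:
            # Assumed barrier semantics: flush everything sent earlier,
            # then put the barrier block itself on stable storage.
            self.platter.extend(self.cache)
            self.cache.clear()
            self.platter.append(block)
        else:
            self.cache.append(block)


def default_commit(disk, data_blocks, journal_blocks):
    for b in data_blocks:               # 1) data=ordered data blocks
        disk.write(b)
    for b in journal_blocks:            # 2) journal blocks
        disk.write(b)
    # 3) "wait" for the journal blocks: in this model write() returns once
    #    the block is in the cache, which is all that step 3 guarantees.
    disk.write("commit", barrier=True)  # 4) commit block, barrier flag set
    # 5) the commit write has completed by the time write() returns.
    return disk


disk = default_commit(ToyDisk(), ["d1"], ["j1", "j2"])
```

In this model the platter ends up holding d1, j1, j2 and then the
commit block, and the cache is empty once step 5 is done.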

-----

What the current async commit code does is this:

1)  Write data blocks required by data=ordered mode (if any)

2)  Write the journal blocks

3)  Write the commit block, without a barrier.

4)  Wait for the journal blocks to be sent to disk.

5)  Wait for the commit block (since no barrier is requested, this is
just when it was sent to the disk, not when it is actually committed
to stable store).

Since there are no barriers at all, the async mount option basically
works the same as barrier=0, and is subject to exactly the same
problems as barrier=0 --- problems which I've actually demonstrated
exist in practice.
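
To see why the missing barrier matters, here is a toy enumeration (an
illustration, not kernel code; the block names and the bad_states()
helper are made up).  It assumes a drive with a volatile write cache
may have persisted any subset of the blocks it was sent when power is
lost, and lists the subsets in which the commit block is on the platter
but some journal block is not.

```python
from itertools import combinations


def bad_states(cached_blocks):
    """Subsets of cached blocks that leave a commit block on the platter
    without all of its journal blocks (names starting with 'j')."""
    journal = {b for b in cached_blocks if b.startswith("j")}
    bad = []
    for n in range(len(cached_blocks) + 1):
        for subset in combinations(cached_blocks, n):
            persisted = set(subset)
            if "commit" in persisted and not journal <= persisted:
                bad.append(subset)
    return bad


# No barrier anywhere: journal blocks and commit block all sit in the cache.
no_barrier = bad_states(["j1", "j2", "commit"])
```

With j1, j2 and the commit block all still in the cache, three of the
eight possible subsets are bad states.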

----

What I think we can do safely in ext4 is this:

1)  Write data blocks required by data=ordered mode (if any)

2)  Write the journal blocks

3)  Write the commit block, WITH a barrier requested.

4)  Wait for the commit block to be completed.

5)  Wait for the journal blocks to be sent to disk.  #4 implies that
all of the journal block I/O will have been completed, so this is just
to collect their completion status; we shouldn't actually block
during step #5, assuming the block layer's barrier operation was
implemented correctly.


This should save us a little bit, since it implies the commit record
will be sent to disk in the same I/O request to the storage device as
the other journal blocks, which is _not_ the case today.
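
The difference in ordering can be written down as a small sketch.  The
step names and submitted_together() are hypothetical labels for the
lists in this mail, not jbd2 functions; the check simply asks whether a
wait sits between submitting the journal blocks and submitting the
commit block.

```python
# Step lists transcribed from the default and the proposed sequences.
DEFAULT = ["write_data", "write_journal", "wait_journal",
           "write_commit+barrier", "wait_commit"]
PROPOSED = ["write_data", "write_journal", "write_commit+barrier",
            "wait_commit", "wait_journal"]


def submitted_together(steps):
    """True if no wait separates the journal writes from the commit write,
    so both can be queued to the device back to back."""
    j = steps.index("write_journal")
    c = next(i for i, s in enumerate(steps) if s.startswith("write_commit"))
    return not any(s.startswith("wait") for s in steps[j + 1:c])
```

submitted_together(DEFAULT) is False because of the wait in step 3,
while submitted_together(PROPOSED) is True.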


Technically, what ext3 does today could result in problems, since
without the barrier between the journal blocks and the commit block,
the two could theoretically get reordered by the disk such that the
commit block is written before the journal blocks are completely
written --- and since ext3 doesn't have journal checksumming, this
would never be noticed.  Fortunately in practice this generally won't
happen since the commit block is adjacent to the rest of the journal
blocks, so a sane disk drive will likely coalesce the two write
requests together.
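
For illustration, this is roughly how a checksum in the commit block
lets replay notice that reordering.  It is a sketch under assumptions:
it uses CRC32 from Python's zlib and made-up helper names, and it does
not reproduce the real on-disk journal format.

```python
import zlib


def commit_checksum(journal_blocks):
    """CRC32 over all journal blocks of a transaction, in order."""
    crc = 0
    for block in journal_blocks:
        crc = zlib.crc32(block, crc)
    return crc


def transaction_valid(blocks_on_disk, stored_crc):
    """At replay time, recompute the checksum and compare with the one
    stored in the commit block."""
    return commit_checksum(blocks_on_disk) == stored_crc


blocks = [b"aaaa", b"bbbb"]
stored = commit_checksum(blocks)

# Commit block reached the platter, but the second journal block did not:
# the disk still holds stale contents (zeros here) at that location.
torn = [b"aaaa", b"\x00\x00\x00\x00"]
```

Here transaction_valid(blocks, stored) holds, while
transaction_valid(torn, stored) does not, so the torn transaction would
be skipped at replay instead of being applied.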

						- Ted

