linux-fsdevel.vger.kernel.org archive mirror
From: Jamie Lokier <jamie@shareable.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Theodore Tso <tytso@mit.edu>,
	Christoph Hellwig <hch@infradead.org>,
	Jens Axboe <jens.axboe@oracle.com>,
	linux-fsdevel@vger.kernel.org, linux-scsi@vger.kernel.org
Subject: Re: O_DIRECT and barriers
Date: Wed, 26 Aug 2009 16:01:02 +0100
Message-ID: <20090826150102.GB22027@shareable.org>
In-Reply-To: <20090826063455.GA2417@discord.disaster>

Dave Chinner wrote:
> On Fri, Aug 21, 2009 at 06:08:52PM -0400, Theodore Tso wrote:
> > On Fri, Aug 21, 2009 at 10:26:35AM -0400, Christoph Hellwig wrote:
> > > > It turns out that applications needing integrity must use fdatasync or
> > > > O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> > > > choose to use buffered writes at any time, with no signal to the
> > > > application.
> > > 
> > > The fallback was a relatively recent addition to the O_DIRECT semantics
> > > for broken filesystems that can't handle holes very well.  Fortunately
> > > enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> > > semantics for that already.
> > 
> > Um, actually, we don't.  If we did that, we would have to wait for a
> > journal commit to complete before allowing the write(2) to complete,
> > which would be especially painfully slow for ext3.
> > 
> > This question recently came up on the ext4 developer's list, because
> > of a question of how direct I/O to a preallocated (uninitialized)
> > extent should be handled.  Are we supposed to guarantee synchronous
> > updates of the metadata by the time write(2) returns, or not?  One of
> > the ext4 developers (I can't remember if it was Mingming or Eric)
> > asked an XFS developer what they did in that case, and I believe the
> > answer they were given was that XFS started a commit, but did *not*
> > wait for the commit to complete before returning from the Direct I/O
> > write.  In fact, they were told (I believe this was from an SGI
> > engineer, but I don't remember the name; we can track that down if
> > it's important) that if an application wanted to guarantee metadata
> > would be updated for an extending write, they had to use fsync() or
> > O_SYNC/O_DSYNC.  
> 
> That would have been Eric asking me. My answer was that O_DIRECT does
> not imply any new data integrity guarantees associated with a
> write(2) call - it just avoids system caches. You get the same
> guarantees of resiliency as a non-O_DIRECT write(2) call at
> completion - it may or may not be there if you crash. If you want
> some guarantee of integrity, then you need to use O_DSYNC, O_SYNC or
> call f[data]sync(2) just like all other IO.
> 
> Also, note that direct IO is not necessarily synchronous - you can
> do asynchronous direct IO.....

I agree with all of the above, except:

  1. If the automatic O_SYNC fallback mentioned by Christoph is
     currently implemented at all, even in a subset of filesystems,
     then I think it should be removed.

     An app which wants integrity should be calling fsync/fdatasync or
     using O_DSYNC/O_SYNC explicitly - with fsync/fdatasync giving
     more control over batching.

     If it doesn't do any of those things, it may be using O_DIRECT
     for performance, and not wish to be penalised by an expensive
     O_SYNC on every individual write, especially once O_SYNC is
     fixed to commit drive caches.

  2. I agree with everything Dave said about needing to use some other
     mechanism for an integrity commit; O_DIRECT is not enough.

     We can't realistically make O_DIRECT (by itself) do integrity
     commits anyway, because on some drives that involves committing
     the drive cache, and it would be a large performance regression.
     Given O_DIRECT is often used for its performance, that's not an
     option.

  3. Currently none of the options provides a good integrity commit.

     All of them fail to commit drive caches under some circumstances;
     even fsync on ext3 with barriers enabled (because it doesn't
     commit a journal record if there were writes but no inode change
     with data=ordered).

     This should be changed (or at least made optionally available),
     and that's all the more reason to avoid commit operations except
     when requested.

  4. On drives which need it, fdatasync/fsync must trigger a drive
     cache flush even when there is no dirty page cache to write,
     because dirty pages may have been written in the background
     already, and because O_DIRECT writes dirty the drive cache but
     not the page cache.

     A per-drive flag would make sense to optimise this: it is set by
     any non-FUA write sent to the drive while the drive's writeback
     cache is enabled, and cleared when any cache flush command is
     sent.  While the flag is clear, further cache flush commands don't
     need to be sent.
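
To make the flag in point 4 concrete, here is a minimal sketch of the
logic as a toy Python model.  It is not kernel code: the Drive class
and the names write, flush, sync, cache_maybe_dirty and flushes_sent
are all made up for illustration, and it assumes a FUA write leaves
the flag untouched because it does not add anything to the volatile
cache.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    """Toy model of one drive; hypothetical, not a kernel interface."""
    write_cache_enabled: bool = True
    cache_maybe_dirty: bool = False   # the proposed per-drive flag
    flushes_sent: int = 0             # cache flush commands actually issued

    def write(self, fua: bool = False) -> None:
        """A write reaches the drive, via page cache writeback or O_DIRECT."""
        # Assumption: a FUA write goes to stable media itself, so it
        # leaves the flag as it was (it neither sets nor clears it).
        if self.write_cache_enabled and not fua:
            self.cache_maybe_dirty = True

    def flush(self) -> None:
        """Send a cache flush command; the volatile cache is clean afterwards."""
        self.flushes_sent += 1
        self.cache_maybe_dirty = False

    def sync(self) -> bool:
        """Drive-level part of fsync/fdatasync: flush only if the flag is set.

        The decision depends on the drive flag, never on whether the page
        cache happens to hold dirty pages.  Returns True if a flush was sent.
        """
        if not self.cache_maybe_dirty:
            return False
        self.flush()
        return True


if __name__ == "__main__":
    d = Drive()
    d.write()             # e.g. an O_DIRECT write: the page cache stays clean
    print(d.sync())       # flag is set, so a flush is sent
    print(d.sync())       # flag is clear, so the flush is skipped
    d.write(fua=True)     # a FUA write does not set the flag
    print(d.sync())
    print(d.flushes_sent)
```

The point of keeping the state on the drive rather than on the file
is that sync() still flushes after an O_DIRECT write or after
background writeback, when there is no dirty page cache left to
notice.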

-- Jamie
