From: Jamie Lokier <jamie@shareable.org>
To: Theodore Tso <tytso@mit.edu>,
	Christoph Hellwig <hch@infradead.org>,
	Jens Axboe <jens.axboe@oracle.com>,
	linux-fsdevel@vger.kernel.org, linux-scsi@vger.kernel.org
Subject: Re: O_DIRECT and barriers
Date: Sat, 22 Aug 2009 01:56:13 +0100
Message-ID: <20090822005613.GB22530@shareable.org>
In-Reply-To: <20090821220852.GM9529@mit.edu>

Theodore Tso wrote:
> On Fri, Aug 21, 2009 at 10:26:35AM -0400, Christoph Hellwig wrote:
> > > It turns out that applications needing integrity must use fdatasync or
> > > O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> > > choose to use buffered writes at any time, with no signal to the
> > > application.
> > 
> > The fallback was a relatively recent addition to the O_DIRECT semantics
> > for broken filesystems that can't handle holes very well.  Fortunately
> > enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> > semantics for that already.
> 
> Um, actually, we don't.  If we did that, we would have to wait for a
> journal commit to complete before allowing the write(2) to complete,
> which would be especially painfully slow for ext3.
> 
> This question recently came up on the ext4 developer's list, because
> of a question of how direct I/O to a preallocated (uninitialized)
> extent should be handled.  Are we supposed to guarantee synchronous
> updates of the metadata by the time write(2) returns, or not?  One of
> the ext4 developers (I can't remember if it was Mingming or Eric)
> asked an XFS developer what they did in that case, and I believe the
> answer they were given was that XFS started a commit, but did *not*
> wait for the commit to complete before returning from the Direct I/O
> write.  In fact, they were told (I believe this was from an SGI
> engineer, but I don't remember the name; we can track that down if
> it's important) that if an application wanted to guarantee metadata
> would be updated for an extending write, they had to use fsync() or
> O_SYNC/O_DSYNC.  
> 
> Perhaps they were given an incorrect answer, but it's clear the
> semantics of exactly how Direct I/O works in edge cases aren't well
> defined, or at least not clearly and widely understood.

And that's not even a hardware cache issue, just whether filesystem
metadata is written.

AIX behaves like XFS according to documentation:

    [ http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm ]

    Direct I/O and Data I/O Integrity Completion

    Although direct I/O writes are done synchronously, they do not
    provide synchronized I/O data integrity completion, as defined by
    POSIX. Applications that need this feature should use O_DSYNC in
    addition to O_DIRECT. O_DSYNC guarantees that all of the data and
    enough of the metadata (for example, indirect blocks) have written
    to the stable store to be able to retrieve the data after a system
    crash. O_DIRECT only writes the data; it does not write the
    metadata.

That's another reason to use O_DIRECT|O_DSYNC in moderately portable
code.
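
A minimal sketch of that combination in C, assuming Linux with glibc
(where O_DIRECT needs _GNU_SOURCE).  The helper name
durable_block_write and the 4096-byte BLOCK_SIZE are hypothetical,
chosen for illustration; the real alignment requirement depends on the
filesystem and device.  If the filesystem rejects O_DIRECT at open
time, the sketch falls back to O_DSYNC alone, since O_DSYNC is what
carries the integrity guarantee:

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096         /* assumed alignment; varies by fs/device */

/*
 * Hypothetical helper: write up to one block at offset 0 so that the
 * data, and the metadata needed to find it, are stable on return.
 * Returns 0 on success, -1 on error.
 */
static int durable_block_write(const char *path, const void *data, size_t len)
{
    void *buf = NULL;
    int fd, rc = -1;

    assert(path != NULL && data != NULL);
    if (len > BLOCK_SIZE)
        return -1;

    /* O_DIRECT wants the buffer, offset and length block-aligned. */
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0)
        return -1;
    memset(buf, 0, BLOCK_SIZE);
    memcpy(buf, data, len);

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    if (fd < 0 && errno == EINVAL) {
        /* Filesystem rejects O_DIRECT: keep O_DSYNC for integrity. */
        fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
    }
    if (fd < 0)
        goto out;

    /* With O_DSYNC, pwrite() returns only after synchronized I/O
     * data integrity completion, whether or not the kernel used a
     * direct or a buffered write underneath. */
    if (pwrite(fd, buf, BLOCK_SIZE, 0) == (ssize_t)BLOCK_SIZE)
        rc = 0;
    if (close(fd) != 0)
        rc = -1;
out:
    free(buf);
    return rc;
}
```

A caller would use it as durable_block_write("journal.bin", rec, len)
and treat -1 as "not known to be on stable storage".  Note this says
nothing about the drive's own write cache, which is a separate issue.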

> I have an early draft (for discussion only) of what we think it means
> and what is currently implemented in Linux, which I've put up (again,
> let me emphasize, for *discussion*) here:
> 
> http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
> 
> Comments are welcome, either on the wiki's talk page, or directly to
> me, or to the linux-fsdevel or linux-ext4 lists.

I haven't read it yet.  One thing which comes to mind is it would be
good to summarise what other OSes as well as Linux do with O_DIRECT
w.r.t. data-finding metadata, preallocation, file extending, hole
filling, unaligned access and what alignment is required, block
devices vs. files and different filesystems and behaviour-modifying
mount options, file open for buffered I/O on another descriptor, file
has mapped pages, mlocked pages, and of course drive cache write
through or not.

-- Jamie
