From: Christoph Hellwig <hch@infradead.org>
To: Jamie Lokier <jamie@shareable.org>
Cc: Christoph Hellwig <hch@infradead.org>,
Jens Axboe <jens.axboe@oracle.com>,
linux-fsdevel@vger.kernel.org, linux-scsi@vger.kernel.org
Subject: Re: O_DIRECT and barriers
Date: Fri, 21 Aug 2009 13:45:25 -0400
Message-ID: <20090821174525.GA28861@infradead.org>
In-Reply-To: <20090821152459.GC6929@shareable.org>
On Fri, Aug 21, 2009 at 04:24:59PM +0100, Jamie Lokier wrote:
> In measurements I've done, disabling a disk's write cache results in
> much slower ext3 filesystem writes than using barriers. Others report
> similar results. This is with disks that don't have NCQ; good NCQ may
> be better.
On a SCSI disk and a SATA SSD with NCQ I get different results. Most
workloads, in particular metadata-intensive ones and large streaming
writes, are noticeably better with the write cache simply turned off. The
only ones that benefit from it are relatively small writes without O_SYNC
or many fsyncs. This is however using XFS, which tends to issue many
more barriers than ext3.
> Using FUA for all writes should be equivalent to writing with write
> cache disabled.
>
> A journalling filesystem or database tends to write like this:
>
> (guest) WRITE
> (guest) WRITE
> (guest) WRITE
> (guest) WRITE
> (guest) WRITE
> (guest) CACHE FLUSH
> (guest) WRITE
> (guest) CACHE FLUSH
> (guest) WRITE
> (guest) WRITE
> (guest) WRITE
In the optimal case, yeah.
> Assuming that WRITE FUA is equivalent to disabling write cache, we may
> expect the WRITE FUA version to run much slower than the CACHE FLUSH
> version.
For a workload that only does FUA writes, yeah. That is however the use
case for virtual machines. As I'm looking into those issues I will run
some benchmarks comparing both variants.
> It's also too weak, of course, on drives which don't support FUA.
> Then you have to use CACHE FLUSH anyway, so the code should support
> that (or disable the write cache entirely, which also performs badly).
> If you don't handle drives without FUA, then you're back to "integrity
> sometimes, user must check type of hardware", which is something we're
> trying to get away from. Integrity should not be a surprise when the
> application requests it.
As mentioned in the previous mails, FUA would only be an optimization
(if it ends up helping); we do need to support the cache flush case.
> > I thought about this a lot. It would be sensible to only require
> > the FUA semantics if O_SYNC is specified. But from looking around at
> > users of O_DIRECT no one seems to actually specify O_SYNC with it.
>
> O_DIRECT with true POSIX O_SYNC is a bad idea, because it flushes
> inode metadata (like mtime) too. O_DIRECT|O_DSYNC is better.
O_SYNC above is the Linux O_SYNC aka POSIX O_DSYNC.
> O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
> integrity problems when direct writes are converted to buffered writes
> - which applies to all or nearly all OSes according to their
> documentation (I've read a lot of them).
It did not happen on IRIX, where O_DIRECT originated, nor does it
happen on Linux when using XFS. Then again, at least on Linux we provide
O_SYNC (that is Linux O_SYNC, aka POSIX O_DSYNC) semantics for that
case.
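
A minimal sketch of the two ways an application can ask for integrity on
top of O_DIRECT: O_DIRECT|O_DSYNC, or O_DIRECT followed by fdatasync().
The function name, the 4096-byte alignment and the zero-filled payload
are assumptions for illustration only; real devices and filesystems may
want a different alignment, and what the kernel does with these requests
is the open question in this thread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* exposes O_DIRECT in <fcntl.h> on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0           /* fallback: sketch degrades to a cached write */
#endif

#define SKETCH_ALIGN 4096    /* assumed alignment; real devices may differ */

/*
 * Write one zero-filled aligned block at offset 0 using O_DIRECT.
 * use_dsync != 0: open with O_DIRECT | O_DSYNC.
 * use_dsync == 0: open with O_DIRECT only, then call fdatasync().
 * Returns 0 on success or a negative errno value.
 */
int direct_write_durable(const char *path, size_t len, int use_dsync)
{
    void *buf = NULL;
    int flags = O_WRONLY | O_CREAT | O_DIRECT;
    int ret;

    assert(path != NULL);
    if (len == 0 || len % SKETCH_ALIGN != 0)
        return -EINVAL;
    if (use_dsync)
        flags |= O_DSYNC;

    /* O_DIRECT generally wants an aligned buffer, length and offset. */
    ret = posix_memalign(&buf, SKETCH_ALIGN, len);
    if (ret != 0)
        return -ret;
    memset(buf, 0, len);

    int fd = open(path, flags, 0644);
    if (fd < 0) {
        ret = -errno;        /* e.g. -EINVAL where O_DIRECT is unsupported */
        free(buf);
        return ret;
    }

    ssize_t n = pwrite(fd, buf, len, 0);
    if (n < 0)
        ret = -errno;
    else if ((size_t)n != len)
        ret = -EIO;          /* short write: treated as failure here */

    /* Without O_DSYNC, request the flush explicitly. */
    if (ret == 0 && !use_dsync && fdatasync(fd) != 0)
        ret = -errno;

    if (close(fd) != 0 && ret == 0)
        ret = -errno;
    free(buf);
    return ret;
}
```

The explicit fdatasync() path also covers the case where a filesystem
silently services the direct write through the page cache.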
> Imho, integrity should not be something which depends on the user
> knowing the details of their hardware to decide application
> configuration options - at least, not out of the box.
That is what I meant. Only doing cache flushes/FUA for O_DIRECT|O_DSYNC
is not what users naively expect. And the wording in our manpages also
suggests this behaviour, although it is not entirely clear:

    O_DIRECT (Since Linux 2.4.10)
        Try to minimize cache effects of the I/O to and from this file. In
        general this will degrade performance, but it is useful in special
        situations, such as when applications do their own caching. File I/O
        is done directly to/from user space buffers. The I/O is synchronous,
        that is, at the completion of a read(2) or write(2), data is
        guaranteed to have been transferred. See NOTES below for further
        discussion.

(And yeah, the whole wording is horrible; I will send an update once
we've sorted out the semantics, including caveats about older kernels.)
> > And on Linux where O_SYNC really means O_DSYNC that's pretty sensible -
> > if O_DIRECT bypasses the filesystem cache there is nothing else
> > left to sync for a non-extending write.
>
> Oh, O_SYNC means O_DSYNC? I thought it was the other way around.
> Ugh, how messy.
Yes. Except when using XFS and using the "osyncisosync" mount option :)
> > The fallback was a relatively recent addition to the O_DIRECT semantics
> > for broken filesystems that can't handle holes very well. Fortunately
> > enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> > semantics for that already.
>
> Ok, so you're saying there's no _harm_ in specifying O_DSYNC with
> O_DIRECT either? :-)
No. In the generic code and the filesystems I looked at, it simply has
no effect at all.