From: Christoph Hellwig <hch@infradead.org>
To: Jamie Lokier <jamie@shareable.org>
Cc: Christoph Hellwig <hch@infradead.org>,
Jens Axboe <jens.axboe@oracle.com>,
linux-fsdevel@vger.kernel.org, linux-scsi@vger.kernel.org
Subject: Re: O_DIRECT and barriers
Date: Sun, 23 Aug 2009 22:34:22 -0400
Message-ID: <20090824023422.GA775@infradead.org>
In-Reply-To: <20090822005006.GA22530@shareable.org>
On Sat, Aug 22, 2009 at 01:50:06AM +0100, Jamie Lokier wrote:
> Oh, I agree with that. That comes from observing that quasi-portable
> code using O_DIRECT needs to use O_DSYNC too because several OSes and
> filesystems on those OSes revert to buffered writes under some
> circumstances, in which case you want O_DSYNC too. That has nothing
> to do with hardware caches, but it's a lucky coincidence that
> fdatasync() would form a nice barrier function, and O_DIRECT|O_DSYNC
> would then make sense as an FUA equivalent.
I agree. I do however worry about everything that uses O_DIRECT
today. Less so about the databases and HPC workloads on expensive
hardware, because they usually run on vendor-approved SCSI disks that
have the write back cache disabled, but rather about things like
virtualization software or other things that get run on commodity
hardware.

Then again they already don't get what they expect and never did,
so if we clearly document and communicate the O_SYNC (that is, Linux
O_SYNC) requirement we might be able to go with this.
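
To make the flag combination under discussion concrete, here is a minimal
sketch of what an application following that advice might do. The helper
names (open_durable, write_durable_block) and the 4096-byte alignment are
assumptions for illustration, not anything proposed in this thread. It
falls back to plain O_DSYNC when the filesystem rejects O_DIRECT, and
whether O_DSYNC actually reaches through a volatile drive cache is exactly
the open question here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_DIRECT in <fcntl.h> on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* assumption: degrade to O_DSYNC only */
#endif

#define DURABLE_BLOCK 4096      /* assumed alignment; real devices may differ */

/* Hypothetical helper: ask for O_DIRECT|O_DSYNC, and retry without
 * O_DIRECT if the filesystem refuses it with EINVAL. */
static int open_durable(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    if (fd < 0 && errno == EINVAL)
        fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
    return fd;
}

/* Hypothetical helper: copy up to one block of data into an aligned,
 * zero-padded buffer and write the whole block at an aligned offset.
 * Returns 0 on success, -1 on failure. */
static int write_durable_block(int fd, const void *data, size_t len,
                               off_t offset)
{
    void *buf;
    ssize_t n;

    assert(offset % DURABLE_BLOCK == 0);   /* O_DIRECT wants aligned offsets */
    if (len > DURABLE_BLOCK)
        return -1;
    if (posix_memalign(&buf, DURABLE_BLOCK, DURABLE_BLOCK) != 0)
        return -1;
    memset(buf, 0, DURABLE_BLOCK);
    memcpy(buf, data, len);

    n = pwrite(fd, buf, DURABLE_BLOCK, offset);
    free(buf);
    return n == DURABLE_BLOCK ? 0 : -1;
}
```

A caller would do fd = open_durable(path) once and then
write_durable_block(fd, data, len, 0) for each block; every write pads the
payload out to a full block.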
> Perhaps in the same way that fsync/fdatasync aren't clear on disk
> cache behaviour either. On Linux and some other OSes.
The disk write cache really is an implementation detail; it has no
business in POSIX.

POSIX seems to define the semantics for fdatasync and co. relatively
well (that is, if you like the specification speak in there):
"The fdatasync() function forces all currently queued I/O operations
associated with the file indicated by file descriptor fildes to the
synchronised I/O completion state."
"synchronised I/O data integrity completion
o For read, when the operation has been completed or diagnosed if
unsuccessful. The read is complete only when an image of the data has
been successfully transferred to the requesting process. If there were
any pending write requests affecting the data to be read at the time
that the synchronised read operation was requested, these write
requests shall be successfully transferred prior to reading the
data.
o For write, when the operation has been completed or diagnosed if
unsuccessful. The write is complete only when the data specified in the
write request is successfully transferred and all file system
information required to retrieve the data is successfully transferred."
Given that it talks about the data being retrievable, a volatile cache
does not seem to meet the above criteria. But yeah, it's horrible language.
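
As a rough illustration of using fdatasync() as the barrier discussed
above, the following sketch writes a payload, waits for it to reach
"synchronised I/O data integrity completion", and only then writes a commit
marker. The helper names (write_all, append_then_commit) are hypothetical.
The code assumes the reading of POSIX given above; whether a particular
kernel and filesystem also flush the drive's volatile cache on fdatasync()
is not something the code can guarantee.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write the whole buffer, retrying on short
 * writes and on EINTR. Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: the marker is only written after fdatasync()
 * has reported the payload as complete, so fdatasync() acts as an
 * ordering barrier between the two writes. */
static int append_then_commit(int fd, const char *payload,
                              const char *marker)
{
    assert(payload != NULL && marker != NULL);

    if (write_all(fd, payload, strlen(payload)) != 0)
        return -1;
    if (fdatasync(fd) != 0)         /* barrier: payload before marker */
        return -1;
    if (write_all(fd, marker, strlen(marker)) != 0)
        return -1;
    return fdatasync(fd);           /* make the marker itself durable */
}
```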
> What does IRIX do? Does O_DIRECT on IRIX write through the drive's
> cache? What about Solaris?
IRIX only came pre-packaged with SGI MIPS systems, which, like most of
the more expensive hardware, were not configured with write back
caches. That, by the way, is still the case for all the more expensive
hardware I have. The whole issue with volatile write back caches is just
too much of a data integrity nightmare to enable them where your
customers actually care about their data.