From: Theodore Tso <tytso@mit.edu>
To: Lawrence Greenfield <leg@google.com>
Cc: linux-ext4@vger.kernel.org
Subject: Re: RFC: Clarifying Direct I/O Semantics
Date: Sat, 22 Aug 2009 16:40:11 -0400
Message-ID: <20090822204011.GC4800@mit.edu>
In-Reply-To: <5956ddbe0908220625h6a6eeba2w679602d3a1f6336c@mail.gmail.com>

On Sat, Aug 22, 2009 at 09:25:20AM -0400, Lawrence Greenfield wrote:
> > The question in my mind is whether we should guarantee that the data
> > block is written synchronously for allocating writes when the file
> > metadata is not written synchronously; what's the point? After all,
> > the application can't distinguish between the data block not making it
> > out to disk, versus the metadata that will allow the data block to be
> > accessed after a crash, why should one be synchronous but not the
> > other?
>
> O_DIRECT is about avoiding polluting the buffer cache, not only about
> data integrity. If an application wants allocating writes to have a
> data integrity guarantee, they can open the file O_DIRECT|O_DSYNC, at
> the cost that writes they think might be one disk seek end up being 2
> (or more). But please don't fall back to putting the data into the
> buffer cache!
Well, it really depends on who you talk to. This goes back to the
problem that O_DIRECT's goals and semantics aren't well defined.
I find it really hard to believe that the main point is to avoid
polluting the page/buffer cache. If that were true, then fadvise's
FADV_NOREUSE would be sufficient, and much simpler semantics to
implement than O_DIRECT's rather baroque restrictions and
requirements.
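To make the comparison concrete, here is a rough sketch of what the
cache-avoidance-only approach looks like from user space. It uses Python's
os.posix_fadvise() wrapper with POSIX_FADV_NOREUSE; the helper name
read_once() and the 1 MiB chunk size are made up for illustration, and
whether the kernel actually acts on the hint depends on the kernel version.

```python
import os


def read_once(path, chunk=1 << 20):
    """Read a file sequentially, hinting that its pages will not be reused.

    Hypothetical helper: returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and len=0 apply the advice to the whole file.
        # This is only a hint; the kernel is free to ignore it.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_NOREUSE)
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break
            total += len(buf)
    finally:
        os.close(fd)
    return total
```

Note that nothing here needs aligned buffers or block-sized transfers, which
is the sense in which the semantics are simpler than O_DIRECT's.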
For the enterprise database folks (who were the ones who originally
asked the Solaris, AIX, and Irix OS's of the world for this feature)
it was always about performance/speed; they wanted to avoid copying
data in and out of the buffer/page cache for speed reasons. But if
you need to take time out to manipulate allocation data structures, the
memory copy in and out of the buffer cache is in the noise compared to
the disk reads/writes.
> I think it would be useful to be explicit to applications what they
> need to do for O_DIRECT writes to be guaranteed to be visible after a
> crash. As a naive application writer, I would have thought using
> posix_fallocate would have been "good enough". If I understand
> correctly, an application that wants to know that O_DIRECT writes will
> both avoid the buffer cache and be visible after a crash must
> guarantee that it's previously written to those blocks either O_DSYNC
> or has used fdatasync() on the file after such writes. All subsequent
> writes can be done with only O_DIRECT.
>
> That means that a database must explicitly initialize its files by
> writing 0s: it can't rely on posix_fallocate. (Amusingly, it would
> have worked before fallocate() was introduced into the kernel!)
Well, all a database needs to do is use fdatasync() after an
application-level commit. If there haven't been any metadata changes,
the fdatasync() is cheap. If the application is keeping track of when
it might be doing an allocating write() and when it isn't, it can try
to work out when it can omit the fdatasync() call.
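As a rough sketch of that bookkeeping (not a recommendation for any
particular database), something like the following would do. The class
CommitTracker, the 4096-byte BLOCK size, and the "block already written and
synced once" rule are all assumptions for illustration. Skipping the
fdatasync() is only plausible if the caller opened fd with O_DIRECT and uses
suitably aligned buffers, and it ignores any volatile write cache in the
disk itself.

```python
import os

BLOCK = 4096  # assumed filesystem block size


class CommitTracker:
    """Hypothetical helper: call fdatasync() at commit only when needed."""

    def __init__(self, fd):
        self.fd = fd                 # caller is assumed to have opened this with O_DIRECT
        self.stable_blocks = set()   # blocks written and fdatasync()ed at least once
        self.pending_blocks = set()  # possibly-allocating writes since the last sync
        self.sync_calls = 0

    def write(self, offset, data):
        n = os.pwrite(self.fd, data, offset)
        if n != len(data):
            raise OSError("short write")
        if not data:
            return
        first = offset // BLOCK
        last = (offset + len(data) - 1) // BLOCK
        for block in range(first, last + 1):
            # A block we have never synced may need an allocation or an
            # uninitialized-extent conversion, i.e. a metadata update.
            if block not in self.stable_blocks:
                self.pending_blocks.add(block)

    def commit(self):
        """Application-level commit point."""
        if self.pending_blocks:
            os.fdatasync(self.fd)
            self.sync_calls += 1
            self.stable_blocks |= self.pending_blocks
            self.pending_blocks.clear()
```

The simpler and safer variant is to drop the tracking and call fdatasync()
at every commit, since it is cheap when no metadata has changed.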
- Ted