From: Jamie Lokier <jamie@shareable.org>
To: Chris Mason <chris.mason@oracle.com>
Cc: Theodore Tso <tytso@mit.edu>, Nick Piggin <npiggin@suse.de>,
	linux-fsdevel@vger.kernel.org, Eric Sandeen <sandeen@sandeen.net>
Subject: Re: [rfc] fsync_range?
Date: Wed, 21 Jan 2009 20:41:05 +0000
Message-ID: <20090121204105.GA16133@shareable.org>
In-Reply-To: <1232548550.17244.3.camel@think.oraclecorp.com>

Chris Mason wrote:
> On Wed, 2009-01-21 at 09:12 -0500, Theodore Tso wrote:
> > On Wed, Jan 21, 2009 at 12:37:11PM +0000, Jamie Lokier wrote:
> > > 
> > > What about btrfs with data checksums?  Doesn't that count among
> > > data-retrieval metadata?  What about nilfs, which always writes data
> > > to a new place?  Etc.
> > > 
> > > I'm wondering what exactly sync_file_range() definitely writes, and
> > > what it doesn't write.
> > > 
> > > If it's just in use by Oracle, and nobody's sure what it does, that
> > > smacks of those secret APIs in Windows that made Word run a bit faster
> > > than everyone else's word processor...  sort of. :-)
> > 
> > Actually, I take that back; Oracle (and most other enterprise
> > databases; the world is not just Oracle --- there's also DB2, for
> > example) generally uses Direct I/O, so I wonder if they are using
> > sync_file_range() at all.
> 
> Usually if they don't use O_DIRECT, they use O_SYNC.

There's a case for using both together.

An O_DIRECT write can be converted to a non-direct (buffered) write
under some conditions.  When that happens, you want the properties of
O_SYNC.  This is documented to happen on some other OSes - and maybe
for VxFS on Linux.

Linux is nicer than some other platforms in that it usually returns
EINVAL for O_DIRECT I/O whose alignment isn't satisfactory, but it can
still fall back to buffered I/O in some circumstances.  I think current
kernels do a sync in that case, but some earlier 2.6 kernels failed to.

Oh, you'd use O_DSYNC instead, of course...  There's no point
committing every inode update, only size increases, and most OSes
document that O_DSYNC does commit size increases.
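
To make the combination concrete, here is a minimal sketch of opening a
file with O_DIRECT and O_DSYNC together, so that a write which ends up
buffered is still synchronous.  The helper names (open_direct_dsync,
write_block) and the 4096-byte BLOCK alignment are assumptions made for
illustration; real code should find out the alignment its device and
filesystem require.  The application-level fallbacks are only one
possible policy and are separate from any fallback the kernel performs
internally.

```python
import errno
import fcntl
import mmap
import os

BLOCK = 4096  # assumed alignment; not guaranteed for every device/filesystem


def open_direct_dsync(path):
    """Open path for writing with O_DIRECT | O_DSYNC (Linux).

    Returns (fd, direct).  direct is False if open() rejected O_DIRECT
    with EINVAL and the file was opened with O_DSYNC alone.
    """
    base = os.O_WRONLY | os.O_CREAT | os.O_DSYNC
    try:
        return os.open(path, base | os.O_DIRECT, 0o600), True
    except OSError as exc:
        if exc.errno != errno.EINVAL:
            raise
        return os.open(path, base, 0o600), False


def write_block(fd, offset, payload):
    """Write exactly one BLOCK of data at a BLOCK-aligned offset."""
    if len(payload) != BLOCK or offset % BLOCK:
        raise ValueError("payload and offset must be BLOCK-aligned")
    buf = mmap.mmap(-1, BLOCK)  # anonymous mappings are page-aligned
    try:
        buf[:] = payload
        try:
            return os.pwrite(fd, buf, offset)
        except OSError as exc:
            if exc.errno != errno.EINVAL:
                raise
            # The direct write was refused: clear O_DIRECT and retry.
            # O_DSYNC was set at open() and stays in effect.
            flags = fcntl.fcntl(fd, fcntl.F_GETFL)
            fcntl.fcntl(fd, fcntl.F_SETFL, flags & ~os.O_DIRECT)
            return os.pwrite(fd, buf, offset)
    finally:
        buf.close()
```

A caller would do something like fd, direct = open_direct_dsync(path)
followed by write_block(fd, 0, data), and could log the value of direct
to see which path was taken.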

By the way, emulators/VMs like QEMU and KVM use much the same methods
to access virtual disk images as databases do, for the same reasons.

> > I do wonder though how well or poorly Oracle will work on btrfs, or
> > indeed any filesystem that uses WAFL-like or log-structured
> > filesystem-like algorithms.  Most of the enterprise databases have
> > been optimized for use on block devices and filesystems where you do
> > write-in-place accesses; and some enterprise databases do their own
> > data checksumming.  So if I had to guess, I suspect the answer to the
> > question I posed is "disastrously".  :-)
> 
> Yes, I think btrfs' nodatacow option is pretty important for database
> use.

Does O_DIRECT on btrfs still allocate new data blocks?
That's not very direct :-)

I'm thinking if O_DIRECT is set, considering what's likely to request
it, it may be reasonable for it to mean "overwrite in place" too
(except for files which are actually COW-shared with others of course).

> > After all, such db's
> > generally are happiest when the OS acts as a program loader that then
> > gets the heck out of the way of the filesystem, hence their use of
> > DIO.
> > 
> > Which again brings me back to the question --- I wonder who is
> > actually using sync_file_range, and what for?  I would assume it is
> > some database, most likely; so maybe we should check with MySQL or
> > Postgres?
> 
> Eric, didn't you have a magic script for grepping the sources/binaries
> in fedora for syscalls? 

sync_file_range does not appear anywhere in

    db-4.7.25
    mysql-dfsg-5.0.67
    postgresql-8.3.5
    sqlite3-3.5.9

(On Ubuntu; presumably the same in other distros).
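
For anyone who wants to repeat the check on other packages, a search of
this kind can be sketched as below.  This is not Eric's script and not
the exact method used for the list above; find_symbol is a hypothetical
helper that assumes the package sources have already been unpacked into
a directory tree.

```python
import os


def find_symbol(root, symbol=b"sync_file_range"):
    """Return sorted paths (relative to root) of files containing symbol."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    if symbol in f.read():
                        hits.append(os.path.relpath(path, root))
            except OSError:
                continue  # unreadable file: ignore and carry on
    return sorted(hits)
```

An empty list for a source tree corresponds to "does not appear
anywhere" above.  A plain byte search also matches comments and
strings, so any hits still need to be looked at by hand.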

-- Jamie