From: Jamie Lokier <jamie@shareable.org>
To: Chris Mason <chris.mason@oracle.com>
Cc: Theodore Tso <tytso@mit.edu>, Nick Piggin <npiggin@suse.de>,
linux-fsdevel@vger.kernel.org, Eric Sandeen <sandeen@sandeen.net>
Subject: Re: [rfc] fsync_range?
Date: Wed, 21 Jan 2009 20:41:05 +0000
Message-ID: <20090121204105.GA16133@shareable.org>
In-Reply-To: <1232548550.17244.3.camel@think.oraclecorp.com>

Chris Mason wrote:
> On Wed, 2009-01-21 at 09:12 -0500, Theodore Tso wrote:
> > On Wed, Jan 21, 2009 at 12:37:11PM +0000, Jamie Lokier wrote:
> > >
> > > What about btrfs with data checksums? Doesn't that count among
> > > data-retrieval metadata? What about nilfs, which always writes data
> > > to a new place? Etc.
> > >
> > > I'm wondering what exactly sync_file_range() definitely writes, and
> > > what it doesn't write.
> > >
> > > If it's just in use by Oracle, and nobody's sure what it does, that
> > > smacks of those secret APIs in Windows that made Word run a bit faster
> > > than everyone else's word processor... sort of. :-)
> >
> > Actually, I take that back; Oracle (and most other enterprise
> > databases; the world is not just Oracle --- there's also DB2, for
> > example) generally uses Direct I/O, so I wonder if they are using
> > sync_file_range() at all.
>
> Usually if they don't use O_DIRECT, they use O_SYNC.

There's a case for using both together.

An O_DIRECT write can convert to a non-direct (buffered) write under
some conditions. When that happens, you want the properties of O_SYNC.
This is documented to happen on some other OSes, and maybe for VxFS on
Linux.

Linux is nicer than some other platforms in that it usually returns
EINVAL for O_DIRECT I/O whose alignment isn't satisfactory, but it can
still fall back to buffered I/O in some circumstances. I think current
kernels do a sync in that case, but some earlier 2.6 kernels failed to.

Oh, you'd use O_DSYNC instead, of course. There's no point committing
inode updates all the time, only size increases, and most OSes document
that O_DSYNC does commit size increases.

By the way, emulators/VMs like QEMU and KVM use much the same methods
to access virtual disk images as databases do, for the same reasons.

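
As a rough illustration of the O_DIRECT plus O_DSYNC combination
described above, here is a minimal Python sketch, assuming Linux. The
helper names direct_dsync_flags and write_block, the 4096-byte block
size, and the fallback order are assumptions made for this example and
do not come from the thread. It only shows how the two flags are
combined so that a write which ends up buffered still gets O_DSYNC
behaviour; it says nothing about what any particular filesystem does
internally.

```python
import mmap
import os


def direct_dsync_flags(base=os.O_WRONLY | os.O_CREAT):
    """Return open(2) flags that request O_DSYNC, plus O_DIRECT if defined."""
    flags = base | os.O_DSYNC
    # os.O_DIRECT only exists on platforms that support it (e.g. Linux).
    flags |= getattr(os, "O_DIRECT", 0)
    return flags


def write_block(path, payload, block_size=4096):
    """Write payload, zero-padded to one block, at offset 0 of path.

    Tries O_DIRECT | O_DSYNC first. If the open or the write is rejected,
    retries with O_DSYNC alone, so the data is still synced on return.
    """
    # An anonymous mmap is page-aligned, which O_DIRECT transfers need.
    buf = mmap.mmap(-1, block_size)
    buf.write(payload)  # raises ValueError if payload exceeds block_size
    try:
        plain = os.O_WRONLY | os.O_CREAT | os.O_DSYNC
        for flags in (direct_dsync_flags(), plain):
            try:
                fd = os.open(path, flags, 0o644)
            except OSError:
                continue  # e.g. the filesystem refuses O_DIRECT at open time
            try:
                return os.pwrite(fd, buf, 0)
            except OSError:
                continue  # e.g. EINVAL for an unsupported direct transfer
            finally:
                os.close(fd)
        raise OSError("could not write block with either flag set")
    finally:
        buf.close()
```

Calling write_block("/path/to/file", b"record") returns the number of
bytes written. Whether the direct path or the O_DSYNC-only fallback was
taken is deliberately hidden here; a real application would probably
want to log that.
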
> > I do wonder though how well or poorly Oracle will work on btrfs, or
> > indeed any filesystem that uses WAFL-like or log-structured
> > filesystem-like algorithms. Most of the enterprise databases have
> > been optimized for use on block devices and filesystems where you do
> > write-in-place accesses; and some enterprise databases do their own
> > data checksumming. So if I had to guess, I suspect the answer to the
> > question I posed is "disastrously". :-)
>
> Yes, I think btrfs' nodatacow option is pretty important for database
> use.

Does O_DIRECT on btrfs still allocate new data blocks?
That's not very direct :-)

I'm thinking that if O_DIRECT is set, then considering what's likely
to request it, it may be reasonable for it to mean "overwrite in place"
too (except for files which are actually COW-shared with others, of
course).

> > After all, such db's
> > generally are happiest when the OS acts as a program loader that then
> > gets the heck out of the way of the filesystem, hence their use of
> > DIO.
> >
> > Which again brings me back to the question --- I wonder who is
> > actually using sync_file_range, and what for? I would assume it is
> > some database, most likely; so maybe we should check with MySQL or
> > Postgres?
>
> Eric, didn't you have a magic script for grepping the sources/binaries
> in fedora for syscalls?

sync_file_range does not appear anywhere in

    db-4.7.25
    mysql-dfsg-5.0.67
    postgresql-8.3.5
    sqlite3-3.5.9

(On Ubuntu; presumably the same in other distros).

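
The message does not say how the source packages above were searched.
One plausible way to repeat such a check on unpacked source trees is
sketched below. The function name find_symbol and the plain byte
substring match are assumptions for illustration, not the method
actually used for the list above.

```python
import os


def find_symbol(root, symbol=b"sync_file_range"):
    """Return sorted paths (relative to root) of files containing symbol."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Read as bytes so binary and oddly encoded files work too.
                with open(path, "rb") as fh:
                    if symbol in fh.read():
                        hits.append(os.path.relpath(path, root))
            except OSError:
                continue  # unreadable file or dangling symlink: skip it
    return sorted(hits)
```

An empty result from find_symbol("/path/to/unpacked/source") means only
that the literal name does not occur in that tree. A program could
still reach the syscall indirectly, for instance through a raw syscall
number. Each file is read whole into memory, which is fine for source
trees but would need chunking for very large files.
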
-- Jamie