From: Jamie Lokier <jamie@shareable.org>
To: Chris Mason <chris.mason@oracle.com>
Cc: Theodore Tso <tytso@mit.edu>, Nick Piggin <npiggin@suse.de>,
linux-fsdevel@vger.kernel.org, Eric Sandeen <sandeen@sandeen.net>
Subject: Re: [rfc] fsync_range?
Date: Wed, 21 Jan 2009 20:41:05 +0000
Message-ID: <20090121204105.GA16133@shareable.org>
In-Reply-To: <1232548550.17244.3.camel@think.oraclecorp.com>

Chris Mason wrote:
> On Wed, 2009-01-21 at 09:12 -0500, Theodore Tso wrote:
> > On Wed, Jan 21, 2009 at 12:37:11PM +0000, Jamie Lokier wrote:
> > >
> > > What about btrfs with data checksums? Doesn't that count among
> > > data-retrieval metadata? What about nilfs, which always writes data
> > > to a new place? Etc.
> > >
> > > I'm wondering what exactly sync_file_range() definitely writes, and
> > > what it doesn't write.
> > >
> > > If it's just in use by Oracle, and nobody's sure what it does, that
> > > smacks of those secret APIs in Windows that made Word run a bit faster
> > > than everyone else's word processor... sort of. :-)
> >
> > Actually, I take that back; Oracle (and most other enterprise
> > databases; the world is not just Oracle --- there's also DB2, for
> > example) generally uses Direct I/O, so I wonder if they are using
> > sync_file_range() at all.
>
> Usually if they don't use O_DIRECT, they use O_SYNC.

There's a case for using both together.

An O_DIRECT write can be converted to a non-direct (buffered) write
under some conditions. When that happens, you want the properties of
O_SYNC. It is documented to happen on some other OSes - and maybe for
VxFS on Linux.

Linux is nicer than some other platforms in that it usually returns
EINVAL for O_DIRECT I/O whose alignment isn't satisfactory, but it can
still fall back to buffered I/O in some circumstances. I think current
kernels do a sync in that case, but some earlier 2.6 kernels failed to.
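
A minimal sketch of guarding against that fallback from user space,
assuming the fd was opened with O_DIRECT but without O_SYNC/O_DSYNC.
The helper name direct_write_then_sync() and its parameters are made
up for illustration, and the right alignment depends on the device
and filesystem (4096 is only a common guess):

```c
#define _GNU_SOURCE           /* exposes O_DIRECT in <fcntl.h> on glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `len` bytes of `fill` at offset `off`.
 * `align` must be a power of two; `len` and `off` should be multiples
 * of it for O_DIRECT to accept the request.
 * Returns 0 on success, -1 with errno set on failure.
 */
int direct_write_then_sync(int fd, off_t off, size_t len, size_t align, int fill)
{
    void *buf = NULL;
    int err = posix_memalign(&buf, align, len);  /* aligned user buffer */
    if (err != 0) {
        errno = err;
        return -1;
    }
    memset(buf, fill, len);

    ssize_t n = pwrite(fd, buf, len, off);
    int saved = errno;
    free(buf);

    if (n < 0) {
        errno = saved;        /* EINVAL usually means unsatisfactory alignment */
        return -1;
    }
    if ((size_t)n != len) {
        errno = EIO;          /* treat a short write as an error in this sketch */
        return -1;
    }

    /*
     * If the kernel quietly fell back to buffered I/O, the data may
     * still be sitting in the page cache; fdatasync() covers that case.
     */
    return fdatasync(fd);
}
```

The explicit fdatasync() costs an extra call on every write, which is
the price of not relying on what a particular kernel does after a
fallback.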

Oh, you'd use O_DSYNC instead of O_SYNC, of course... There's no point
committing inode updates all the time, only size increases, and most
OSes document that O_DSYNC does commit size increases.
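
For illustration, a minimal sketch of opening a file with both flags.
The helper name open_direct_dsync() and the retry policy are
assumptions of mine, not something any particular database does:

```c
#define _GNU_SOURCE           /* exposes O_DIRECT in <fcntl.h> on glibc */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` for direct I/O with O_DSYNC as the
 * safety net, so a write that ends up buffered is still committed
 * before it returns. If the filesystem rejects O_DIRECT (EINVAL),
 * retry with O_DSYNC alone. *used_direct reports which mode was
 * obtained. Returns the fd, or -1 with errno set.
 */
int open_direct_dsync(const char *path, int *used_direct)
{
    int flags = O_RDWR | O_CREAT | O_DSYNC;

    int fd = open(path, flags | O_DIRECT, 0644);
    if (fd >= 0) {
        *used_direct = 1;
        return fd;
    }
    if (errno != EINVAL)
        return -1;            /* some other failure; don't mask it */

    /* O_DIRECT not supported here; synchronous buffered I/O instead. */
    *used_direct = 0;
    return open(path, flags, 0644);
}
```

A caller would check the return value as with open(), then consult
used_direct to decide whether buffers still need to be aligned.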

By the way, emulators/VMs like QEMU and KVM use much the same methods
to access virtual disk images as databases do, for the same reasons.

> > I do wonder though how well or poorly Oracle will work on btrfs, or
> > indeed any filesystem that uses WAFL-like or log-structured
> > filesystem-like algorithms. Most of the enterprise databases have
> > been optimized for use on block devices and filesystems where you do
> > write-in-place accesses; and some enterprise databases do their own
> > data checksumming. So if I had to guess, I suspect the answer to the
> > question I posed is "disastrously". :-)
>
> Yes, I think btrfs' nodatacow option is pretty important for database
> use.

Does O_DIRECT on btrfs still allocate new data blocks?
That's not very direct :-)

I'm thinking that if O_DIRECT is set, then considering what's likely
to request it, it may be reasonable for it to mean "overwrite in
place" too (except for files which are actually COW-shared with
others, of course).

> > After all, such db's
> > generally are happiest when the OS acts as a program loader that then
> > gets the heck out of the way of the filesystem, hence their use of
> > DIO.
> >
> > Which again brings me back to the question --- I wonder who is
> > actually using sync_file_range, and what for? I would assume it is
> > some database, most likely; so maybe we should check with MySQL or
> > Postgres?
>
> Eric, didn't you have a magic script for grepping the sources/binaries
> in fedora for syscalls?

sync_file_range does not appear anywhere in:

    db-4.7.25
    mysql-dfsg-5.0.67
    postgresql-8.3.5
    sqlite3-3.5.9

(On Ubuntu; presumably the same in other distros.)

-- Jamie