From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jamie Lokier
Subject: Re: [rfc] fsync_range?
Date: Wed, 21 Jan 2009 20:41:05 +0000
Message-ID: <20090121204105.GA16133@shareable.org>
References: <20090120183120.GD27464@shareable.org>
 <20090121012900.GD24891@wotan.suse.de>
 <20090121031500.GA2354@shareable.org>
 <20090121041604.GI24891@wotan.suse.de>
 <20090121045921.GA3944@shareable.org>
 <20090121062306.GK24891@wotan.suse.de>
 <20090121121308.GA31253@mit.edu>
 <20090121123711.GA10637@shareable.org>
 <20090121141207.GD31253@mit.edu>
 <1232548550.17244.3.camel@think.oraclecorp.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Theodore Tso, Nick Piggin, linux-fsdevel@vger.kernel.org, Eric Sandeen
To: Chris Mason
Return-path:
Received: from mail2.shareable.org ([80.68.89.115]:44997 "EHLO
 mail2.shareable.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1751518AbZAUUlO (ORCPT );
 Wed, 21 Jan 2009 15:41:14 -0500
Content-Disposition: inline
In-Reply-To: <1232548550.17244.3.camel@think.oraclecorp.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

Chris Mason wrote:
> On Wed, 2009-01-21 at 09:12 -0500, Theodore Tso wrote:
> > On Wed, Jan 21, 2009 at 12:37:11PM +0000, Jamie Lokier wrote:
> > >
> > > What about btrfs with data checksums?  Doesn't that count among
> > > data-retrieval metadata?  What about nilfs, which always writes data
> > > to a new place?  Etc.
> > >
> > > I'm wondering what exactly sync_file_range() definitely writes, and
> > > what it doesn't write.
> > >
> > > If it's just in use by Oracle, and nobody's sure what it does, that
> > > smacks of those secret APIs in Windows that made Word run a bit faster
> > > than everyone else's word processor... sort of. :-)
> >
> > Actually, I take that back; Oracle (and most other enterprise
> > databases; the world is not just Oracle --- there's also DB2, for
> > example) generally uses Direct I/O, so I wonder if they are using
> > sync_file_range() at all.
>
> Usually if they don't use O_DIRECT, they use O_SYNC.

There's a case for using both together.  An O_DIRECT write can be
converted to a non-direct (buffered) write in some conditions.  When
that happens, you want the properties of O_SYNC.  It is documented to
happen on some other OSes - and maybe for VxFS on Linux.

Linux is nicer than some other platforms in that it usually returns
EINVAL for O_DIRECT I/O whose alignment isn't satisfactory, but it can
still fall back to buffered I/O in some circumstances.  I think current
kernels do a sync in that case, but some earlier 2.6 kernels failed to.

Oh, you'd use O_DSYNC instead of course...  No point committing inode
updates all the time, only size increases, and most OSes document that
O_DSYNC does commit size increases.

By the way, emulators/VMs like QEMU and KVM use much the same methods
to access virtual disk images as databases do, for the same reasons.

> > I do wonder though how well or poorly Oracle will work on btrfs, or
> > indeed any filesystem that uses WAFL-like or log-structured
> > filesystem-like algorithms.  Most of the enterprise databases have
> > been optimized for use on block devices and filesystems where you do
> > write-in-place accesses; and some enterprise databases do their own
> > data checksumming.  So if I had to guess, I suspect the answer to the
> > question I posed is "disastrously". :-)
>
> Yes, I think btrfs' nodatacow option is pretty important for database
> use.

Does O_DIRECT on btrfs still allocate new data blocks?
That's not very direct :-)

I'm thinking if O_DIRECT is set, considering what's likely to request
it, it may be reasonable for it to mean "overwrite in place" too
(except for files which are actually COW-shared with others of course).

> > After all, such db's
> > generally are happiest when the OS acts as a program loader that then
> > gets the heck out of the way of the filesystem, hence their use of
> > DIO.
> >
> > Which again brings me back to the question --- I wonder who is
> > actually using sync_file_range, and what for?  I would assume it is
> > some database, most likely; so maybe we should check with MySQL or
> > Postgres?
>
> Eric, didn't you have a magic script for grepping the sources/binaries
> in fedora for syscalls?

sync_file_range does not appear anywhere in

    db-4.7.25
    mysql-dfsg-5.0.67
    postgresql-8.3.5
    sqlite3-3.5.9

(On Ubuntu; presumably the same in other distros).

-- Jamie
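
For reference, a minimal sketch of the "O_DIRECT together with O_DSYNC" combination discussed above. The names open_direct_dsync(), write_block() and DB_BLOCK_SIZE are made up for illustration, the 4096-byte alignment is an assumption (real code should ask the device or filesystem), and falling back to plain O_DSYNC when open() returns EINVAL is one possible policy, not something any particular database is known to do.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc exposes O_DIRECT only with this */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* not exposed here: degrade to O_DSYNC alone */
#endif

#define DB_BLOCK_SIZE 4096      /* assumed alignment, hypothetical constant */

/* Open a file for database-style in-place writes.
 * O_DSYNC is kept in every case, so a write that the kernel quietly
 * turns into buffered I/O still has to reach stable storage before
 * the call returns.  Returns a file descriptor, or -1 on error. */
int open_direct_dsync(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_DIRECT | O_DSYNC, 0600);
    if (fd < 0 && errno == EINVAL) {
        /* The filesystem rejects O_DIRECT outright; keep O_DSYNC. */
        fd = open(path, O_RDWR | O_CREAT | O_DSYNC, 0600);
    }
    return fd;
}

/* Write one block filled with `fill` at block index `blkno`.
 * Buffer, length and file offset are all multiples of DB_BLOCK_SIZE,
 * which is what O_DIRECT generally requires.
 * Returns 0 on success, -1 on failure. */
int write_block(int fd, off_t blkno, unsigned char fill)
{
    void *buf;
    ssize_t n;

    assert(blkno >= 0);
    if (posix_memalign(&buf, DB_BLOCK_SIZE, DB_BLOCK_SIZE) != 0)
        return -1;
    memset(buf, fill, DB_BLOCK_SIZE);
    n = pwrite(fd, buf, DB_BLOCK_SIZE, blkno * (off_t)DB_BLOCK_SIZE);
    free(buf);
    return n == DB_BLOCK_SIZE ? 0 : -1;
}
```

A caller would do something like fd = open_direct_dsync("data.db") followed by write_block(fd, 7, 0) to overwrite block 7. Whether that overwrite happens in place is up to the filesystem, which is the btrfs question raised above.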