From mboxrd@z Thu Jan 1 00:00:00 1970
From: jim owens
Subject: Re: btrfs O_DIRECT was [rfc] fsync_range?
Date: Thu, 22 Jan 2009 08:50:37 -0500
Message-ID: <497879AD.30204@hp.com>
References: <20090121045921.GA3944@shareable.org>
 <20090121062306.GK24891@wotan.suse.de>
 <20090121121308.GA31253@mit.edu>
 <20090121123711.GA10637@shareable.org>
 <20090121141207.GD31253@mit.edu>
 <1232548550.17244.3.camel@think.oraclecorp.com>
 <20090121204105.GA16133@shareable.org>
 <4977926E.30703@hp.com>
 <20090121215921.GG16133@shareable.org>
 <4977AAFA.7050503@hp.com>
 <20090122000636.GC20407@shareable.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Chris Mason, linux-fsdevel@vger.kernel.org
To: Jamie Lokier
Return-path:
Received: from g4t0017.houston.hp.com ([15.201.24.20]:36583 "EHLO
 g4t0017.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1753142AbZAVNul (ORCPT );
 Thu, 22 Jan 2009 08:50:41 -0500
In-Reply-To: <20090122000636.GC20407@shareable.org>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

Jamie Lokier wrote:
> jim owens wrote:
>> Jamie Lokier wrote:
>>> Writing in place or new-place on a *non-shared* (i.e. non-snapshotted)
>>> file is the choice which is useful. It's a filesystem implementation
>>> detail, not a semantic difference. I'm suggesting writing in place
>>> may do no harm and be more like the expected behaviour with programs
>>> that use O_DIRECT, which are usually databases.
>>>
>>> How about a btrfs mount option?
>>> in_place_write=never/always/direct_only. (Default direct_only).
>> The harm is creating a special guarantee for just one case
>> of "don't move my data" based on a transient file open mode.
>>
>> What about defragmenting or moving the extent to another
>> device for performance or for (failing) device removal?
>>
>> We are on a slippery slope for presumed expectations.
>
> Don't make it a guarantee, just a hint to filesystem write strategy.
>
> It's ok to move data around when useful, we're not talking about a
> hard requirement, but a performance knob.
>
> The question is just what performance and fragmentation
> characteristics do programs that use O_DIRECT have?
>
> They are nearly all databases, filesystems-in-a-file, or virtual
> machine disks. I'm guessing virtually all of those _particular_
> applications programs would perform significantly differently with a
> write-in-place strategy for most writes, although you'd still want
> access to the bells and whistles of snapshots and COW and so on when
> requested.
>
> Note I said differently :-) I'm not sure write-in-place performs
> better for those sort of applications. It's just a guess.

I'm very certain that write-in-place performs much better
than COW because, as we all know, doing storage allocation
is expensive. So, many databases preallocate their files.

> Oracle probably has a really good idea how it performs on ZFS compared
> with a block device (which is always in place) - and knows whether ZFS
> does in-place writes with O_DIRECT or not. Chris?

We only disagree on how the rule to write-in-place is defined
and, more importantly, documented so it is easy to understand.

Btrfs allows each individual file to have "nodatacow" set
as an attribute. That is an easy rule to document for the
db admin. Much easier than "if nothing else takes precedence
to make it cow, O_DIRECT will write-in-place".

jim
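
To make the "easy rule for the db admin" concrete, here is a rough
sketch of what an application could do when it creates a data file:
ask for the per-file no-COW attribute, then preallocate. This is only
an illustration under stated assumptions, not anything btrfs or a
particular database ships. The helper names try_set_nocow() and
create_db_file() are made up for the example, the ioctl numbers are
assumed to be the x86-64 values from <linux/fs.h>, and FS_NOCOW_FL is
assumed to be the flag that `chattr +C` sets.

```python
import fcntl
import os
import struct

# Assumption: these match <linux/fs.h> on x86-64 Linux. Other
# architectures may encode the ioctl request numbers differently.
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL = 0x00800000  # assumed to be the flag behind `chattr +C`


def try_set_nocow(fd):
    """Ask the filesystem not to copy-on-write this file's data.

    Returns True if the flag was accepted, False if the filesystem
    refused it or does not implement the ioctl at all.
    """
    try:
        # The kernel fills in a 32-bit flags word; pass an 8-byte
        # buffer because the request number declares a long.
        raw = fcntl.ioctl(fd, FS_IOC_GETFLAGS, bytes(8))
        flags = struct.unpack("i", raw[:4])[0]
        new_flags = struct.pack("i4x", flags | FS_NOCOW_FL)
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, new_flags)
        return True
    except OSError:
        return False


def create_db_file(path, size):
    """Create a new file, request no-COW, then preallocate `size` bytes.

    The flag is requested while the file is still empty, on the
    assumption that the filesystem only honours it reliably for
    files that have no data yet.
    """
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o600)
    try:
        nocow = try_set_nocow(fd)
        # Reserve the space up front so later writes need no allocation.
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
    return nocow
```

Calling create_db_file("/path/to/table.dat", 1 << 30) would reserve
1 GiB and report whether the no-COW request was accepted. On a
filesystem without such an attribute it simply returns False and the
file is still preallocated.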