From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jamie Lokier
Subject: Re: [PATCH] direct I/O fallback sync simplification
Date: Wed, 23 Sep 2009 15:04:30 +0100
Message-ID: <20090923140430.GB15256@shareable.org>
References: <20090923130730.GC10759@lst.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Al Viro, Nick Piggin, Jan Kara, linux-fsdevel@vger.kernel.org
To: Christoph Hellwig
Received: from mail2.shareable.org ([80.68.89.115]:59995 "EHLO mail2.shareable.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751693AbZIWOE2 (ORCPT ); Wed, 23 Sep 2009 10:04:28 -0400
Content-Disposition: inline
In-Reply-To: <20090923130730.GC10759@lst.de>
Sender: linux-fsdevel-owner@vger.kernel.org

Christoph Hellwig wrote:
> In the case of direct I/O falling back to buffered I/O we sync data
> twice currently: once at the end of generic_file_buffered_write using
> filemap_write_and_wait_range and once a little later in
> __generic_file_aio_write using do_sync_mapping_range with all flags set.
>
> The wait before write of the do_sync_mapping_range call does not make
> any sense, so just keep the filemap_write_and_wait_range call and move
> it to the right spot.

Are you sure this is an expectation of O_DIRECT?

A few notes from the net, including some documentation from IBM, advise
using O_DIRECT|O_DSYNC if you need sync when direct I/O falls back to
buffered on some other OSes.

IBM (about AIX I believe):

   http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm

   Direct I/O and Data I/O Integrity Completion

   Although direct I/O writes are done synchronously, they do not
   provide synchronized I/O data integrity completion, as defined by
   POSIX. Applications that need this feature should use O_DSYNC in
   addition to O_DIRECT.
   O_DSYNC guarantees that all of the data and enough of the
   metadata (for example, indirect blocks) have written to the stable
   store to be able to retrieve the data after a system crash.
   O_DIRECT only writes the data; it does not write the metadata.

From an earlier thread, "O_DIRECT and barriers":

Theodore Tso wrote:
> On Fri, Aug 21, 2009 at 10:26:35AM -0400, Christoph Hellwig wrote:
> > > It turns out that applications needing integrity must use fdatasync or
> > > O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> > > choose to use buffered writes at any time, with no signal to the
> > > application.
> >
> > The fallback was a relatively recent addition to the O_DIRECT semantics
> > for broken filesystems that can't handle holes very well. Fortunately
> > enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> > semantics for that already.
>
> Um, actually, we don't. If we did that, we would have to wait for a
> journal commit to complete before allowing the write(2) to complete,
> which would be especially painfully slow for ext3.

There's no point in a "half-sync". Nobody expects or can usefully
depend on it. So imho we should drop the filemap_write_and_wait_range
entirely when O_DSYNC is not set.

O_DIRECT without syncing in the buffered fallback will be a useful
performance optimisation for applications (including virtual machines)
which do sequences of writes interspersed with fdatasync calls on
sparse files, or when extending files, or to filesystems which don't
implement O_DIRECT. Since they need fdatasync anyway, even with direct
I/O to get integrity on some hardware, that's a sensible coding
pattern.

-- 
Jamie
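
For illustration only, the application-side coding pattern described
above (a run of O_DIRECT writes followed by one fdatasync) might look
roughly like the user-space sketch below. It is not kernel code and
says nothing about how the patch itself should be written. The names
open_for_direct_write, write_batch_then_sync and BLOCK are
hypothetical, the 4096-byte alignment is an assumption rather than
something queried from the filesystem, and dropping O_DIRECT when
open() returns EINVAL is just one way an application could cope with
filesystems that don't implement it.

```python
import errno
import mmap
import os

BLOCK = 4096  # assumed alignment; real code should query the filesystem/device


def open_for_direct_write(path):
    """Open path for writing with O_DIRECT if possible, else buffered.

    Returns (fd, used_direct). Hypothetical helper for this sketch.
    """
    flags = os.O_WRONLY | os.O_CREAT
    direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    if direct:
        try:
            return os.open(path, flags | direct, 0o644), True
        except OSError as exc:
            # Some filesystems reject O_DIRECT at open time with EINVAL.
            if exc.errno != errno.EINVAL:
                raise
    return os.open(path, flags, 0o644), False


def write_batch_then_sync(path, blocks):
    """Write each item at successive BLOCK offsets, then fdatasync once.

    Integrity comes from the explicit fdatasync, not from O_DIRECT itself,
    so the result is the same whether or not the writes were buffered.
    """
    fd, used_direct = open_for_direct_write(path)
    buf = mmap.mmap(-1, BLOCK)  # anonymous mapping gives page-aligned memory
    try:
        for i, data in enumerate(blocks):
            buf.seek(0)
            buf.write(data.ljust(BLOCK, b"\0")[:BLOCK])  # pad or trim to BLOCK
            written = os.pwrite(fd, buf, i * BLOCK)
            if written != BLOCK:
                raise OSError(errno.EIO, "short write")
        os.fdatasync(fd)  # one sync for the whole batch
    finally:
        buf.close()
        os.close(fd)
    return used_direct
```

A caller would pass a path and a list of byte strings, for example
write_batch_then_sync("disk.img", [b"first", b"second"]), and gets back
whether O_DIRECT was actually in effect.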