From mboxrd@z Thu Jan 1 00:00:00 1970
From: Christoph Hellwig
Subject: Re: JFYI: ext4 bug triggerable by kvm
Date: Tue, 17 Aug 2010 10:28:08 -0400
Message-ID: <20100817142808.GA22412@infradead.org>
References: <4C694483.5010903@msgid.tls.msk.ru>
 <4C694E7D.3060600@codemonkey.ws>
 <20100816184237.GA16579@infradead.org>
 <4C69A0C4.2080102@codemonkey.ws>
 <20100817090755.GA11110@infradead.org>
 <4C6A86E4.9080600@codemonkey.ws>
 <20100817130702.GA16635@infradead.org>
 <4C6A9AB5.6050404@codemonkey.ws>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Christoph Hellwig, Michael Tokarev, KVM list, Kevin Wolf
To: Anthony Liguori
Return-path:
Received: from bombadil.infradead.org ([18.85.46.34]:39510 "EHLO
 bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1757980Ab0HQO2K (ORCPT );
 Tue, 17 Aug 2010 10:28:10 -0400
Content-Disposition: inline
In-Reply-To: <4C6A9AB5.6050404@codemonkey.ws>
Sender: kvm-owner@vger.kernel.org
List-ID:

On Tue, Aug 17, 2010 at 09:20:37AM -0500, Anthony Liguori wrote:
> On 08/17/2010 08:07 AM, Christoph Hellwig wrote:
> >>The point is that we don't want to flush the disk write cache.  The
> >>intention of writethrough is not to make the disk cache writethrough
> >>but to treat the host's cache as writethrough.
> >
> >We need to make sure data is not in the disk write cache if want to
> >provide data integrity.
>
> When the guest explicitly flushes the emulated disk's write cache.
> Not on every single write completion.

That depends on the cache= mode.  For cache=none and cache=writeback we
present a write-back cache to the guest, and the guest does explicit
cache flushes.  For cache=writethrough we present a writethrough cache
to the guest, and we need to make sure data actually has hit the disk
before returning I/O completion to the guest.

> >  It has nothing to do with the qemu caching
> >mode - for data=writeback or none it's commited as part of the fdatasync
> >call, and for data=writethrough it's commited as part of the O_SYNC
> >write.  Note that both these path end up calling the filesystems ->fsync
> >method which is what's require to make writes stable.  That's exactly
> >what is missing out in sync_file_range, and that's why that API is not
> >useful at all for data integrity operations.
>
> For normal writes from a guest, we don't need to follow the write
> with an fsync().  We should only need to issue an fsync() given an
> explicit flush from the guest.

Define normal writes.  For cache=none and cache=writeback we don't have
to follow each write with an fsync(), and instead issue explicit
fsync()/fdatasync() calls when we get a cache flush from the guest.  For
cache=writethrough we guarantee data has made it to disk, and we
implement this using O_DSYNC/O_SYNC when opening the file.  That tells
the operating system not to return until data has hit the disk.  For
Linux this is internally implemented using a range-fsync/fdatasync after
the actual write.

> fsync() being slow is orthogonal to my point.  I don't see why we
> need to do an fsync() on *every* write.  It should only be necessary
> when a guest injects an actual barrier.

See above.
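
For illustration only, the split described above could be sketched in C
roughly as follows.  This is not QEMU's code: enum cache_mode,
image_open(), image_write() and image_flush() are hypothetical names
made up for this sketch.  cache=none is left out, since as far as
flushing goes it is handled like cache=writeback above.  The only
things taken from the discussion are that writethrough opens the file
with O_DSYNC, and that the write-back modes call fdatasync() when the
guest asks for a flush.

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical mode names for this sketch, not QEMU identifiers. */
enum cache_mode { CACHE_WRITEBACK, CACHE_WRITETHROUGH };

/*
 * Open the image file.  In writethrough mode O_DSYNC asks the kernel
 * not to complete a write until the data is stable.
 */
int image_open(const char *path, enum cache_mode mode)
{
    int flags = O_RDWR | O_CREAT;

    if (mode == CACHE_WRITETHROUGH)
        flags |= O_DSYNC;
    return open(path, flags, 0600);
}

/* Guest write: push the whole buffer out at the given offset. */
int image_write(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);

        if (n < 0 && errno == EINTR)
            continue;           /* interrupted, retry */
        if (n <= 0)
            return -1;          /* real error or no progress */
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/*
 * Guest cache flush.  With O_DSYNC every completed write is already
 * stable, so only the write-back mode needs an explicit fdatasync().
 */
int image_flush(int fd, enum cache_mode mode)
{
    if (mode == CACHE_WRITETHROUGH)
        return 0;
    return fdatasync(fd);
}
```

With this shape the guest-visible difference is only where the cost is
paid: in image_write() for writethrough, in image_flush() for
write-back.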