From: Anthony Liguori
Date: Mon, 20 Sep 2010 19:18:14 -0500
Message-ID: <4C97F9C6.60501@codemonkey.ws>
In-Reply-To: <20100920231742.GB18512@lst.de>
Subject: [Qemu-devel] Re: Caching modes
To: Christoph Hellwig
Cc: Kevin Wolf, qemu-devel

On 09/20/2010 06:17 PM, Christoph Hellwig wrote:
> On Mon, Sep 20, 2010 at 03:11:31PM -0500, Anthony Liguori wrote:
>
>>>> All read and write requests SHOULD avoid any type of caching in the
>>>> host. Any write request MUST complete after the next level of storage
>>>> reports that the write request has completed. A flush from the guest
>>>> MUST complete after all pending I/O requests for the guest have been
>>>> completed.
>>>>
>>>> As an implementation detail, with the raw format, these guarantees are
>>>> only in place for preallocated images. Sparse images do not provide as
>>>> strong a guarantee.
>>>>
>>> That's not how cache=none ever worked nor works currently.
>>>
>> How does it work today compared to what I wrote above?
>>
> From the guest's point of view it works exactly as you describe
> cache=writeback. There are no ordering or cache flushing guarantees. By
> using O_DIRECT we do bypass the host file cache, but we don't even try
> on the others (the disk cache, committing the metadata transactions that
> are required to actually see the committed data for sparse, preallocated
> or growing images).

O_DIRECT alone to a preallocated file on a normal file system should
result in the data being visible without any additional metadata
transactions. The only time that isn't true is when dealing with CoW or
other special filesystem features.

> What you describe above is the equivalent of O_DSYNC|O_DIRECT, which
> doesn't exist in current qemu, except that O_DSYNC|O_DIRECT also
> guarantees the semantics for sparse images. Sparse images really aren't
> special in any way - preallocation using posix_fallocate or CoW
> filesystems like btrfs, nilfs2 or zfs have exactly the same issues.
>
>>>                        | WC enable | WC disable
>>> ----------------------+-----------+-----------
>>> direct                |           |
>>> buffer                |           |
>>> buffer + ignore flush |           |
>>>
>>> currently we only have:
>>>
>>> cache=none          direct + WC enable
>>> cache=writeback     buffer + WC enable
>>> cache=writethrough  buffer + WC disable
>>> cache=unsafe        buffer + ignore flush + WC enable
>>>
>> Where does O_DSYNC fit into this chart?
>>
> O_DSYNC is used for all WC disable modes.
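So, to restate the mapping in code: the sketch below is how I read that
table (the struct and mode names are made up for illustration; this is
not qemu's actual code):

/* Sketch only: how the current cache= modes map onto open(2) flags and
 * the write cache state advertised to the guest. */
#define _GNU_SOURCE   /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdbool.h>

struct cache_mode {
    int  open_flags;    /* flags OR'd into open(2) */
    bool wc_enable;     /* advertise a volatile write cache to the guest */
    bool ignore_flush;  /* complete guest flushes without doing anything */
};

static const struct cache_mode cache_none         = { O_DIRECT, true,  false };
static const struct cache_mode cache_writeback    = { 0,        true,  false };
static const struct cache_mode cache_writethrough = { O_DSYNC,  false, false };
static const struct cache_mode cache_unsafe       = { 0,        true,  true  };

/* the direct + WC disable combination missing from qemu today */
static const struct cache_mode direct_sync        = { O_DIRECT | O_DSYNC,
                                                      false, false };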
>> Do all modern filesystems implement O_DSYNC without generating
>> additional barriers per request?
>>
>> Having a barrier per-write request is ultimately not the right semantic
>> for any of the modes. However, without the use of O_DSYNC (or
>> sync_file_range(), which I know you dislike), I don't see how we can
>> have reasonable semantics without always implementing writeback caching
>> in the host.
>>
> Barriers are a Linux-specific implementation detail that is in the
> process of going away, probably in Linux 2.6.37. But if you want
> O_DSYNC semantics with a volatile disk write cache there is no way
> around using a cache flush or the FUA bit on all I/O caused by it.

If you have a volatile disk write cache, then we don't need O_DSYNC
semantics.

> We currently use the cache flush, and although I plan to experiment a
> bit more with the FUA bit for O_DIRECT | O_DSYNC writes I would be
> very surprised if they actually are any faster.

The thing I struggle with understanding is that if the guest is sending
us a write request, why are we sending the underlying disk a write +
flush request? That doesn't seem logical at all to me. Even if we
advertise WC disable, it should be up to the guest to decide when to
issue flushes.

>> I'm certainly happy to break up the caching option. However, I still
>> don't know how we get a reasonable equivalent to cache=writethrough
>> without assuming that ext4 is mounted with barriers disabled.
>>
> There are two problems here - one is a Linux-wide problem, and that's
> the barrier primitive, which is currently the only way to flush a
> volatile disk cache. We've sorted this out for 2.6.37. The other is
> that ext3 and ext4 have really bad fsync implementations. Just use a
> better filesystem or bug one of its developers if you want that fixed.
> But except for disabling the disk cache there is no way to get data
> integrity without cache flushes (the FUA bit is nothing but an
> implicit flush).

But why are we issuing more flushes than the guest is issuing if we
don't have to worry about filesystem metadata (i.e. preallocated
storage or physical devices)?

Regards,

Anthony Liguori
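P.S. To make that last point concrete: with WC enabled, I'd expect the
request handlers to look roughly like the sketch below (made-up names,
error handling and short-write loops omitted):

/* WC enable: the host never flushes on its own; it only pushes data
 * down to stable storage when the guest explicitly asks for it. */
#include <unistd.h>
#include <sys/types.h>

static ssize_t handle_guest_write(int fd, const void *buf, size_t len,
                                  off_t off)
{
    /* just issue the write; no host-initiated flush per request */
    return pwrite(fd, buf, len, off);
}

static int handle_guest_flush(int fd)
{
    /* drain outstanding data to the disk only on a guest flush */
    return fdatasync(fd);
}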