Date: Mon, 31 Aug 2009 23:46:45 +0100
From: Jamie Lokier
Subject: Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag
Message-ID: <20090831224645.GD24318@shareable.org>
In-Reply-To: <20090831221622.GA8834@lst.de>
To: Christoph Hellwig
Cc: qemu-devel@nongnu.org

Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 11:09:50PM +0100, Jamie Lokier wrote:
> > Right now, on a Linux host O_SYNC is unsafe with hardware that has a
> > volatile write cache.  That might not be changed, but if it is, then
> > performance with cache=writethrough will plummet (due to issuing a
> > CACHE FLUSH to the hardware after every write), while performance
> > with cache=writeback will be reasonable.
>
> Currently all modes are more or less unsafe with volatile write
> caches, at least when using ext3 or raw block device accesses.  XFS
> is safe, two thirds due to doing the right thing and one third due to
> sheer luck.

Right, but now you've made it worse.  By not calling fdatasync at all,
you've reduced the integrity.  Previously data would reach the drive's
cache and take whatever (short) time it took to reach the platter.
Now you're leaving data in the host cache, where it can stay for much
longer and is vulnerable to host kernel crashes.

Oh, and QEMU could call whatever "hdparm -F" does when using raw block
devices ;-)  (See the sketch further below.)

> > nothing (perhaps fdatasync on QEMU blockdev close)
>
> Fine with me, let the flame war begin :)

Well, I'd like to start by pointing out that your patch introduces a
regression in the combination of cache=writeback with emulated SCSI,
because it effectively removes the fdatasync calls in that case :-)

So please amend the patch before it gets applied, lest such silliness
be propagated.  Thanks :-)

I've actually been using cache=writeback with emulated IDE on deployed
server VMs, assuming that worked with KVM.  It's been an eye opener to
find that it was broken all along because the driver failed to set the
"has write cache" bit.  Thank you for the detective work.

It goes to show that no matter how hard we try, data integrity is a
slippery thing: getting it wrong does not show up under normal
circumstances, only during catastrophic system failures.

Ironically, with emulated SCSI, I used cache=writethrough, thinking
guests would not issue CACHE FLUSH commands over SCSI because
historically performance has been reached by having overlapping writes
instead.
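(For concreteness: as far as I know, all "hdparm -F" does is issue the
ATA FLUSH CACHE command through the HDIO_DRIVE_CMD ioctl, so QEMU
could do the same directly on a raw ATA block device.  An untested
sketch - don't hold me to the details:

    #include <sys/ioctl.h>
    #include <linux/hdreg.h>

    /* Ask an ATA drive to flush its volatile write cache.  0xE7 is
     * the ATA FLUSH CACHE opcode - the same command "hdparm -F"
     * sends.  Needs root or CAP_SYS_RAWIO; returns 0 on success,
     * -1 with errno set on failure. */
    static int flush_drive_cache(int fd)
    {
        unsigned char args[4] = { 0xE7, 0, 0, 0 };

        return ioctl(fd, HDIO_DRIVE_CMD, args);
    }

A SCSI device would want SYNCHRONIZE CACHE via SG_IO instead, and note
this flushes the drive's whole cache, not just our own data.)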
> > When using guest OSes which issue CACHE FLUSH commands (that's a
> > guest config issue), why would you ever use cache=writethrough?
> > cache=writeback should be faster and equally safe - provided you do
> > actually advertise the write cache!
>
> And provided the guest OS actually issues cache flushes when it
> should, something that at least Linux historically utterly failed at,
> and some other operating systems haven't even tried.

For the hosts, yes - fsync/fdatasync/O_SYNC/O_DIRECT all utterly fail.
(Afaict, Windows hosts do it in some combinations.)  But I'd like to
think we're about to fix Linux hosts soon, thanks to your good work on
that elsewhere.

For guests, Linux has been good at issuing the necessary flushes for
ordinary journalling (in ext3's case, provided barrier=1 is given in
the mount options), which is quite important.  It failed with fsync,
which is also important to applications, but filesystem integrity is
the most important thing, and Linux has been good at that for many
years.

> cache=writethrough is the equivalent of turning off the volatile
> write cache of the real disk.  It might be slower (which isn't even
> always the case for real disks), but it is much safer.

When O_SYNC is made to flush the hardware cache on Linux hosts, it
will be excruciatingly slow: it'll have to seek twice for every write,
once for the data and once for the inode update.  That's another
reason O_DSYNC is important.

> E.g. if you want to move your old SCO Unix box into a VM it's the
> only safe option.

I agree, and for that reason cache=writethrough or cache=none are the
only reasonable defaults.

By the way, all this has led me to another idea...

When emulating SCSI or virtio-blk (anything which allows overlapping
write commands from the guest), we may find that O_SYNC is slower than
batching several writes followed by one fdatasync, whose completion
allows all of those writes to report as completed.  (Rough sketch in
the P.S. below.)

-- Jamie
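P.S. For anyone who wants to experiment with that batching idea, the
shape I have in mind is roughly the following - an untested sketch
with made-up names (struct req, the ack callback), not QEMU's real
block API:

    #include <stddef.h>
    #include <unistd.h>

    /* Hypothetical per-write request record: one per guest write
     * that has reached the host page cache but has not yet been
     * acknowledged to the guest. */
    struct req {
        struct req *next;
        /* ... guest completion cookie, etc. ... */
    };

    /* One fdatasync() covers every write issued before it, so a
     * whole queue of pending writes can be acknowledged after a
     * single flush instead of paying an O_SYNC cost on each. */
    static void flush_write_batch(int fd, struct req *pending,
                                  void (*ack)(struct req *))
    {
        struct req *r;

        if (fdatasync(fd) != 0)
            return;  /* must fail the batch here, not ack it */

        for (r = pending; r != NULL; r = r->next)
            ack(r);  /* report this write as completed */
    }

Whether that actually beats O_SYNC will depend on how many overlapping
writes the guest keeps in flight.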