Date: Tue, 1 Sep 2009 11:38:35 +0100
From: Jamie Lokier
To: Christoph Hellwig
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag
Message-ID: <20090901103835.GA9548@shareable.org>
In-Reply-To: <20090831230612.GC10220@lst.de>
References: <20090831201627.GA4811@lst.de> <20090831201651.GA4874@lst.de>
 <20090831220950.GB24318@shareable.org> <20090831221622.GA8834@lst.de>
 <20090831224645.GD24318@shareable.org> <20090831230612.GC10220@lst.de>

Christoph Hellwig wrote:
> > Oh, and QEMU could call whatever "hdparm -F" does when using raw block
> > devices ;-)
>
> Actually for ide/scsi implementing cache control is on my todo list.
> Not sure about virtio yet.

I think hdparm -f -F does for some block devices what fdatasync
ideally does for files.

What I was getting at is that, until we have a perfect fdatasync on
block devices in Linux, QEMU could use the block-device ioctls to
accomplish the same thing on older kernels.
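For concreteness, what hdparm -F does for ATA devices is send a FLUSH
CACHE command through the HDIO_DRIVE_CMD ioctl - roughly the sketch
below.  This is illustrative only, not QEMU code (the function name is
made up); a real implementation would prefer the FLUSH CACHE EXT
variant when the drive supports it, and SCSI devices would want
SYNCHRONIZE CACHE via SG_IO instead:

    /* Illustrative sketch: flush a drive's write cache the way
     * "hdparm -F" does for ATA devices.  Error handling simplified. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/hdreg.h>   /* HDIO_DRIVE_CMD, WIN_FLUSH_CACHE */

    int flush_drive_cache(const char *dev)
    {
        /* args[0] = ATA command; remaining bytes unused for this
         * non-data command.  0xEA (FLUSH CACHE EXT) would be used
         * on drives that support it. */
        unsigned char args[4] = { WIN_FLUSH_CACHE, 0, 0, 0 };

        int fd = open(dev, O_RDONLY | O_NONBLOCK);
        if (fd < 0) {
            perror("open");
            return -1;
        }
        if (ioctl(fd, HDIO_DRIVE_CMD, args) < 0) {
            perror("HDIO_DRIVE_CMD(FLUSH CACHE)");
            close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }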
> > It goes to show no matter how hard we try, data integrity is a
> > slippery thing where getting it wrong does not show up under normal
> > circumstances, only during catastrophic system failures.
>
> Honestly, it should not.  Digging through all this was a bit of work,
> but I was extremely surprised at how careless most people who touched
> it before were.  It's not rocket science and can be tested quite
> easily using various tools - qemu being the easiest nowadays, but
> scsi_debug or an instrumented iscsi target would do the same thing.

Oh I agree - we have increasingly good debugging tools.  What's
missing is a dirty script^H^H^H^H^H^H a good validation test which
stresses the various combinations of ways to sync data on block
devices, various filesystems, various types of emulated hardware with
and without caches enabled, and various mount options, and checks
that the I/O does what is desired in every case.

> > It failed with fsync, which is also important to applications, but
> > filesystem integrity is the most important thing and it's been
> > good at that for many years.
>
> Users might disagree with that.  With my user hat on I couldn't care
> less what state the internal metadata is in, as long as I get back
> my data which the OS has guaranteed me has reached the disk after a
> successful fsync/fdatasync/O_SYNC write.

I guess it depends what you're doing.  I've observed more instances
of filesystem corruption due to lack of barriers, resulting in an
inability to find files, than I've ever noticed missing data inside
files - but then I hardly ever keep large amounts of data in
databases.  And I get so much mail I wouldn't notice if a few got
lost ;-)

> > > E.g. if you want to move your old SCO Unix box into a VM it's the
> > > only safe option.
> >
> > I agree, and for that reason, cache=writethrough or cache=none are
> > the only reasonable defaults.
>
> Despite the extremely misleading name, cache=none is _NOT_ an
> alternative, unless we make it open the image using O_DIRECT|O_SYNC.

Good point about the misleading name, and good point about O_DIRECT
being insufficient too.

For a safe emulation default with reasonable performance, I wonder if
it would work to emulate the drive cache as _off_ at the beginning,
but with the capability for the guest to enable it?  The theory is
that old guests don't know about drive caches and will leave it off
and be safe (getting O_DSYNC or O_DIRECT|O_DSYNC) [*], and newer
guests will turn it on if they also implement barriers (getting
nothing or O_DIRECT, and fdatasync when they issue barriers).

Do you think that would work with typical guests we know about?

[*] - O_DSYNC as opposed to O_SYNC strikes me as important once
proper cache flushes are implemented, as it may behave very similarly
to real hardware when doing data overwrites, whereas O_SYNC would
seek back and forth between the data and inode areas for every write,
if it is updating its nanosecond timestamps correctly.

-- Jamie
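A minimal sketch of the distinction the [*] footnote draws, assuming
a Linux system; the paths, labels, and payload are placeholders.  One
caveat: before Linux 2.6.33 glibc defined O_DSYNC to the same value
as O_SYNC, so the two flags only became genuinely distinct on later
kernels:

    /* Sketch contrasting O_DSYNC and O_SYNC semantics on Linux. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    static void sync_write(const char *path, int sync_flag,
                           const char *label)
    {
        int fd = open(path, O_WRONLY | O_CREAT | sync_flag, 0644);
        if (fd < 0) {
            perror(label);
            return;
        }
        /* With O_DSYNC, write() returns once the file data (and any
         * metadata needed to read it back) is on stable storage, so a
         * pure overwrite need not touch the inode.  With O_SYNC it
         * also waits for timestamp updates, so every overwrite
         * dirties the inode block as well. */
        if (write(fd, "data", 4) != 4)
            perror(label);
        close(fd);
    }

    int main(void)
    {
        sync_write("/tmp/example-dsync", O_DSYNC, "O_DSYNC write");
        sync_write("/tmp/example-sync",  O_SYNC,  "O_SYNC write");
        return 0;
    }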