Date: Tue, 1 Sep 2009 11:38:35 +0100
From: Jamie Lokier
To: Christoph Hellwig
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag
Message-ID: <20090901103835.GA9548@shareable.org>
In-Reply-To: <20090831230612.GC10220@lst.de>
References: <20090831201627.GA4811@lst.de> <20090831201651.GA4874@lst.de>
 <20090831220950.GB24318@shareable.org> <20090831221622.GA8834@lst.de>
 <20090831224645.GD24318@shareable.org> <20090831230612.GC10220@lst.de>

Christoph Hellwig wrote:
> > Oh, and QEMU could call whatever "hdparm -F" does when using raw block
> > devices ;-)
>
> Actually for ide/scsi implementing cache control is on my todo list.
> Not sure about virtio yet.

I think hdparm -f -F does for some block devices what fdatasync
ideally does for files.

What I was getting at is that, until we have a perfect fdatasync on
block devices in Linux, QEMU could use the block-device ioctls to
accomplish the same thing on older kernels.
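For concreteness, what hdparm -F does for ATA devices is send a FLUSH
CACHE command through the HDIO_DRIVE_CMD ioctl - roughly the sketch
below.  This is illustrative only, not QEMU code (the function name is
made up); a real implementation would prefer the FLUSH CACHE EXT
variant when the drive supports it, and SCSI devices would want
SYNCHRONIZE CACHE via SG_IO instead:

    /* Illustrative sketch: flush a drive's write cache the way
     * "hdparm -F" does for ATA devices.  Error handling simplified. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/hdreg.h>   /* HDIO_DRIVE_CMD, WIN_FLUSH_CACHE */

    int flush_drive_cache(const char *dev)
    {
        /* args[0] = ATA command; remaining bytes unused for this
         * non-data command.  0xEA (FLUSH CACHE EXT) would be used
         * on drives that support it. */
        unsigned char args[4] = { WIN_FLUSH_CACHE, 0, 0, 0 };

        int fd = open(dev, O_RDONLY | O_NONBLOCK);
        if (fd < 0) {
            perror("open");
            return -1;
        }
        if (ioctl(fd, HDIO_DRIVE_CMD, args) < 0) {
            perror("HDIO_DRIVE_CMD(FLUSH CACHE)");
            close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }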
> > It goes to show no matter how hard we try, data integrity is a
> > slippery thing where getting it wrong does not show up under normal
> > circumstances, only during catastrophic system failures.
>
> Honestly, it should not.  Digging through all this was a bit of work,
> but I was extremely surprised at how careless most people who touched
> it before were.  It's not rocket science and can be tested quite
> easily using various tools - qemu being the easiest nowadays, but
> scsi_debug or an instrumented iscsi target would do the same thing.

Oh I agree - we have increasingly good debugging tools.  What's
missing is a dirty script^H^H^H^H^H^H a good validation test which
stresses the various combinations of ways to sync data on block
devices, various filesystems, various types of emulated hardware with
and without caches enabled, and various mount options, and checks
that the I/O does what is desired in every case.

> > It failed with fsync, which is also important to applications, but
> > filesystem integrity is the most important thing and it's been
> > good at that for many years.
>
> Users might disagree with that.  With my user hat on I couldn't care
> less what state the internal metadata is in, as long as I get back
> my data which the OS has guaranteed me has reached the disk after a
> successful fsync/fdatasync/O_SYNC write.

I guess it depends what you're doing.  I've observed more instances
of filesystem corruption due to lack of barriers, resulting in an
inability to find files, than I've ever noticed missing data inside
files - but then I hardly ever keep large amounts of data in
databases.  And I get so much mail I wouldn't notice if a few got
lost ;-)

> > > E.g. if you want to move your old SCO Unix box into a VM it's the
> > > only safe option.
> >
> > I agree, and for that reason, cache=writethrough or cache=none are
> > the only reasonable defaults.
>
> Despite the extremely misleading name, cache=none is _NOT_ an
> alternative, unless we make it open the image using O_DIRECT|O_SYNC.

Good point about the misleading name, and good point about O_DIRECT
being insufficient too.

For a safe emulation default with reasonable performance, I wonder if
it would work to emulate the drive cache as _off_ at the beginning,
but with the capability for the guest to enable it?  The theory is
that old guests don't know about drive caches and will leave it off
and be safe (getting O_DSYNC or O_DIRECT|O_DSYNC) [*], and newer
guests will turn it on if they also implement barriers (getting
nothing or O_DIRECT, and fdatasync when they issue barriers).

Do you think that would work with typical guests we know about?

[*] - O_DSYNC as opposed to O_SYNC strikes me as important once
proper cache flushes are implemented, as it may behave very similarly
to real hardware when doing data overwrites, whereas O_SYNC would
seek back and forth between the data and inode areas for every write,
if it is updating its nanosecond timestamps correctly.

-- Jamie
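A minimal sketch of the distinction the [*] footnote draws, assuming
a Linux system; the paths, labels, and payload are placeholders.  One
caveat: before Linux 2.6.33 glibc defined O_DSYNC to the same value
as O_SYNC, so the two flags only became genuinely distinct on later
kernels:

    /* Sketch contrasting O_DSYNC and O_SYNC semantics on Linux. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    static void sync_write(const char *path, int sync_flag,
                           const char *label)
    {
        int fd = open(path, O_WRONLY | O_CREAT | sync_flag, 0644);
        if (fd < 0) {
            perror(label);
            return;
        }
        /* With O_DSYNC, write() returns once the file data (and any
         * metadata needed to read it back) is on stable storage, so a
         * pure overwrite need not touch the inode.  With O_SYNC it
         * also waits for timestamp updates, so every overwrite
         * dirties the inode block as well. */
        if (write(fd, "data", 4) != 4)
            perror(label);
        close(fd);
    }

    int main(void)
    {
        sync_write("/tmp/example-dsync", O_DSYNC, "O_DSYNC write");
        sync_write("/tmp/example-sync",  O_SYNC,  "O_SYNC write");
        return 0;
    }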