Date: Mon, 31 Aug 2009 23:46:45 +0100
From: Jamie Lokier
Subject: Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag
Message-ID: <20090831224645.GD24318@shareable.org>
In-Reply-To: <20090831221622.GA8834@lst.de>
To: Christoph Hellwig
Cc: qemu-devel@nongnu.org

Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 11:09:50PM +0100, Jamie Lokier wrote:
> > Right now, on a Linux host O_SYNC is unsafe with hardware that has a
> > volatile write cache.  That might not be changed, but if it is, then
> > performance with cache=writethrough will plummet (due to issuing a
> > CACHE FLUSH to the hardware after every write), while performance
> > with cache=writeback will be reasonable.
>
> Currently all modes are more or less unsafe with volatile write
> caches, at least when using ext3 or raw block device accesses.  XFS
> is safe, two thirds due to doing the right thing and one third due to
> sheer luck.

Right, but now you've made it worse.  By not calling fdatasync at all,
you've reduced the integrity.  Previously data would reach the drive's
cache and take whatever (short) time it took to reach the platter.
Now you're leaving data in the host cache, where it can stay for much
longer and is vulnerable to host kernel crashes.

Oh, and QEMU could call whatever "hdparm -F" does when using raw block
devices ;-)  (See the sketch further below.)

> > nothing (perhaps fdatasync on QEMU blockdev close)
>
> Fine with me, let the flame war begin :)

Well, I'd like to start by pointing out that your patch introduces a
regression in the combination of cache=writeback with emulated SCSI,
because it effectively removes the fdatasync calls in that case :-)

So please amend the patch before it gets applied, lest such silliness
be propagated.  Thanks :-)

I've actually been using cache=writeback with emulated IDE on deployed
server VMs, assuming that worked with KVM.  It's been an eye opener to
find that it was broken all along because the driver failed to set the
"has write cache" bit.  Thank you for the detective work.

It goes to show that no matter how hard we try, data integrity is a
slippery thing: getting it wrong does not show up under normal
circumstances, only during catastrophic system failures.

Ironically, with emulated SCSI, I used cache=writethrough, thinking
guests would not issue CACHE FLUSH commands over SCSI because
historically performance has been reached by having overlapping writes
instead.
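(For concreteness: as far as I know, all "hdparm -F" does is issue the
ATA FLUSH CACHE command through the HDIO_DRIVE_CMD ioctl, so QEMU
could do the same directly on a raw ATA block device.  An untested
sketch - don't hold me to the details:

    #include <sys/ioctl.h>
    #include <linux/hdreg.h>

    /* Ask an ATA drive to flush its volatile write cache.  0xE7 is
     * the ATA FLUSH CACHE opcode - the same command "hdparm -F"
     * sends.  Needs root or CAP_SYS_RAWIO; returns 0 on success,
     * -1 with errno set on failure. */
    static int flush_drive_cache(int fd)
    {
        unsigned char args[4] = { 0xE7, 0, 0, 0 };

        return ioctl(fd, HDIO_DRIVE_CMD, args);
    }

A SCSI device would want SYNCHRONIZE CACHE via SG_IO instead, and note
this flushes the drive's whole cache, not just our own data.)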
> > When using guest OSes which issue CACHE FLUSH commands (that's a
> > guest config issue), why would you ever use cache=writethrough?
> > cache=writeback should be faster and equally safe - provided you do
> > actually advertise the write cache!
>
> And provided the guest OS actually issues cache flushes when it
> should, something that at least Linux historically utterly failed at,
> and some other operating systems haven't even tried.

For the hosts, yes - fsync/fdatasync/O_SYNC/O_DIRECT all utterly fail.
(Afaict, Windows hosts do it in some combinations.)  But I'd like to
think we're about to fix Linux hosts soon, thanks to your good work on
that elsewhere.

For guests, Linux has been good at issuing the necessary flushes for
ordinary journalling (in ext3's case, provided barrier=1 is given in
the mount options), which is quite important.  It failed with fsync,
which is also important to applications, but filesystem integrity is
the most important thing, and Linux has been good at that for many
years.

> cache=writethrough is the equivalent of turning off the volatile
> write cache of the real disk.  It might be slower (which isn't even
> always the case for real disks), but it is much safer.

When O_SYNC is made to flush the hardware cache on Linux hosts, it
will be excruciatingly slow: it'll have to seek twice for every write,
once for the data and once for the inode update.  That's another
reason O_DSYNC is important.

> E.g. if you want to move your old SCO Unix box into a VM it's the
> only safe option.

I agree, and for that reason cache=writethrough or cache=none are the
only reasonable defaults.

By the way, all this has led me to another idea...

When emulating SCSI or virtio-blk (anything which allows overlapping
write commands from the guest), we may find that O_SYNC is slower than
batching several writes followed by one fdatasync, whose completion
allows all of those writes to report as completed.  (Rough sketch in
the P.S. below.)

-- Jamie
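P.S. For anyone who wants to experiment with that batching idea, the
shape I have in mind is roughly the following - an untested sketch
with made-up names (struct req, the ack callback), not QEMU's real
block API:

    #include <stddef.h>
    #include <unistd.h>

    /* Hypothetical per-write request record: one per guest write
     * that has reached the host page cache but has not yet been
     * acknowledged to the guest. */
    struct req {
        struct req *next;
        /* ... guest completion cookie, etc. ... */
    };

    /* One fdatasync() covers every write issued before it, so a
     * whole queue of pending writes can be acknowledged after a
     * single flush instead of paying an O_SYNC cost on each. */
    static void flush_write_batch(int fd, struct req *pending,
                                  void (*ack)(struct req *))
    {
        struct req *r;

        if (fdatasync(fd) != 0)
            return;  /* must fail the batch here, not ack it */

        for (r = pending; r != NULL; r = r->next)
            ack(r);  /* report this write as completed */
    }

Whether that actually beats O_SYNC will depend on how many overlapping
writes the guest keeps in flight.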