From mboxrd@z Thu Jan  1 00:00:00 1970
From: Avi Kivity <avi@redhat.com>
Subject: Re: writeback cache + h700 controller w/1gb nvcache, corruption on
 power loss
Date: Tue, 24 Apr 2012 16:43:27 +0300
Message-ID: <4F96ADFF.6000502@redhat.com>
References: <CAJSP0QXZ355p6B4u7nDGucBPjQZDEAa-bhSXSSge97kmvU==Cg@mail.gmail.com> <28769.660.1334566274126.JavaMail.root@sys1.internetdefensetechnologies.com> <CAJSP0QXtaKo32dy+AzxR1WMR9P4TgVXBvBb+e8A2jPv0uG9M7w@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: Ron Edison <ron@idthq.com>, kvm@vger.kernel.org,
	Kevin Wolf <kwolf@redhat.com>,
	Christoph Hellwig <hch@infradead.org>
To: Stefan Hajnoczi <stefanha@gmail.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:14607 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752861Ab2DXNnr (ORCPT <rfc822;kvm@vger.kernel.org>);
	Tue, 24 Apr 2012 09:43:47 -0400
In-Reply-To: <CAJSP0QXtaKo32dy+AzxR1WMR9P4TgVXBvBb+e8A2jPv0uG9M7w@mail.gmail.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On 04/17/2012 11:41 AM, Stefan Hajnoczi wrote:
> > The disk corruption experienced was indeed lost data -- an fsck was necessary for 4 of the guests to boot at all in RW mode; they first came up read-only. In the case of one of the guests, files and data were actually lost after fsck was manually run upon reboot in single-user mode. In some cases these were config files, in others database indexes, etc. This one of the 4 guests, the one with the most severe corruption, was not usable, and we had to revert to a backup and pull as much current data out of it as possible.
>
> Since you used QEMU -drive cache=writeback data loss is expected on
> host power failure.  cache=writeback uses the (volatile) host page
> cache and therefore data may not have made it to the RAID controller
> before power was lost.
>
> Guest file system recovery - either a quick journal replay or a
> painful fsck - is also expected on host power failure.  The file
> systems are dirty since the guest stopped executing without cleanly
> unmounting its file systems.  If you use cache=none or
> cache=directsync then you should get a quick journal replay and the
> risk of a painful fsck should be reduced (most/all of the data will
> have been preserved).

Both cache=writeback and cache=none should result in journal replay if
the guest is using barriers.  If you're using ext4 and data=journal, you
should expect no data loss.  With the defaults you can expect to lose
recently written data (quite a lot with data=writeback) but the
filesystem structure itself should be fine.
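
To make the reasoning above concrete, here is a rough sketch of how the
QEMU cache modes differ.  It is a simplified model, not QEMU code: the
CacheMode class, the field names and the helper function are all made
up for illustration, and it assumes that everything below the host
(controller, disks) honours flushes or has a non-volatile cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheMode:
    # All field names are illustrative; they are not QEMU identifiers.
    uses_host_page_cache: bool   # False means the image is opened with O_DIRECT
    write_through: bool          # True means each write is stable before it completes
    honors_guest_flush: bool     # False means guest flushes/barriers are dropped


CACHE_MODES = {
    "writeback":    CacheMode(True,  False, True),
    "none":         CacheMode(False, False, True),
    "writethrough": CacheMode(True,  True,  True),
    "directsync":   CacheMode(False, True,  True),
    "unsafe":       CacheMode(True,  False, False),
}


def write_survives_host_power_loss(mode: str, guest_flushed: bool) -> bool:
    """Simplified model: is a completed guest write expected to be on
    stable storage after the host loses power?

    Assumes the layers below the host do not lose flushed data.
    """
    m = CACHE_MODES[mode]
    if m.write_through:
        return True
    # Otherwise the write is only safe once the guest has issued a
    # flush (barrier) and the host actually acted on it.
    return guest_flushed and m.honors_guest_flush
```

In this model cache=writeback and cache=none behave the same for data
the guest has flushed, which is why a guest using barriers should see
a journal replay rather than a damaged filesystem in either mode.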

Even with cache=directsync you should expect some data loss as the guest
may hold data in its own page cache.
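
As an illustration of the guest side, an application that cannot
afford to lose a write has to push it out of the guest page cache
itself.  A minimal sketch, assuming a Linux guest; durable_write is a
made-up helper name, not something from this thread:

```python
import os


def durable_write(path: str, data: bytes) -> int:
    """Write data to path and ask the kernel to flush it to the disk.

    Until fsync() returns, the data may exist only in the guest's page
    cache, whatever cache= mode the host uses.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to flush the file to the device

    # Also fsync the parent directory so the directory entry for a
    # newly created file is persisted (works on Linux).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return len(data)
```

Data written without the fsync() calls can be lost on power failure
even with cache=directsync, since it may never have left the guest.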

So something unexpected is happening.  Which guests (by OS type) are
failing?

-- 
error compiling committee.c: too many arguments to function