writeback cache + h700 controller w/1gb nvcache, corruption on power loss

public inbox for kvm@vger.kernel.org

* writeback cache + h700 controller w/1gb nvcache, corruption on power loss
       [not found] <22130654.645.1334462335379.JavaMail.root@sys1.internetdefensetechnologies.com>
@ 2012-04-15  4:16 ` Ron Edison
  2012-04-16  8:34   ` Stefan Hajnoczi
  0 siblings, 1 reply; 6+ messages in thread
From: Ron Edison @ 2012-04-15  4:16 UTC (permalink / raw)
  To: kvm

Dear list,

I recently had a machine lose power that was unfortunately running between 15 and 20 KVM guests.

The server is a Dell R710 with an H700 controller with 1 GB of NV cache. Writeback cache is enabled on the controller. There is a mix of Linux and Windows guests, some with qcow2 format vdisks and others with raw format vdisks. Some of these guests have writeback cache enabled on the vdisks and some do not.

About a third of the guests experienced disk corruption when they came back up after the host lost power. Based on what I have read, this should not have happened with the above configuration. The operating system is CentOS 6.2, and all storage is direct-attached, configured as RAID 1 mirrors.

I'm hoping someone has a comment or suggestion on this so that I can take action to prevent corruption in the future. 

The motivation to enable write caching is primarily performance. 

Thanks, 

Ron 


* Re: writeback cache + h700 controller w/1gb nvcache, corruption on power loss
  2012-04-15  4:16 ` writeback cache + h700 controller w/1gb nvcache, corruption on power loss Ron Edison
@ 2012-04-16  8:34   ` Stefan Hajnoczi
  2012-04-16  8:51     ` Ron Edison
  2012-04-25  3:57     ` Christoph Hellwig
  0 siblings, 2 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2012-04-16  8:34 UTC (permalink / raw)
  To: Ron Edison; +Cc: kvm

On Sun, Apr 15, 2012 at 5:16 AM, Ron Edison <ron@idthq.com> wrote:
> The server is a Dell R710 with an H700 controller with 1gb of nvcache. Writeback cache is enabled on the controller. There is a mix of linux and windows guests, some with qcow2 format vdisks and others with raw format vdisks. Some of these guests have wb cache enabled on the vdisks and some do not.

-drive cache=writeback is safe when the guest flushes appropriately.
If the guest is not sending flushes (e.g. ext3/4 barrier=0) then there
are no guarantees.

> About a third of the guests experienced disk corruption after coming back up after the host lost power. Based on what I have read, this should not have happened using the above configuration. The operating system is Centos 6.2, this is all direct attached storage configured as raid 1 mirrors.

Is the H700 battery charged?  My understanding is that writeback
caching on the H700 is only safe when the battery is present and
charged.

What do you mean by "guests experienced disk corruption"?  I would
expect the guests to run an fsck when they are started again, just like a
physical machine that underwent power failure.  What exactly is
corrupted?  Does QEMU refuse to open the .qcow2 file?  Is there data
missing inside the guest?  Has application data (e.g. database) gone
into a bad state so you get errors?

Stefan


* Re: writeback cache + h700 controller w/1gb nvcache, corruption on power loss
  2012-04-16  8:34   ` Stefan Hajnoczi
@ 2012-04-16  8:51     ` Ron Edison
  2012-04-17  8:41       ` Stefan Hajnoczi
  2012-04-25  3:57     ` Christoph Hellwig
  1 sibling, 1 reply; 6+ messages in thread
From: Ron Edison @ 2012-04-16  8:51 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kvm

Thank you very much, Stefan,

I would be very interested in how to ensure the guests are sending flushes. I'm unfamiliar with the example you gave; where is that configured? The guests are primarily CentOS 4, 5 or 6. I am also curious whether it would be advisable to switch to writethrough cache on each guest virtual disk while leaving writeback enabled on the controller, and whether that would adversely affect guest performance.

The H700 controller had a charged battery. In fact, the server itself is virtually brand new and has had no issues.

The disk corruption experienced was indeed lost data. An fsck was necessary for 4 of the guests to boot at all in RW mode; they first came up read-only. In the case of one of the guests, files and data were actually lost after fsck was manually run in single-user mode upon reboot. In some cases these were config files, in others database indexes, etc. That guest, the one of the 4 with the most severe corruption, was not usable, and we had to revert to a backup and pull as much current data out of it as possible.

In contrast, we had an R410 server in the same rack that experienced the same power loss, also running an H700 controller with NVcache, KVM and CentOS 6, but with only 1 guest and considerably less load. That server and the one guest on it experienced zero corruption and came back up without manual intervention.

Ron


* Re: writeback cache + h700 controller w/1gb nvcache, corruption on power loss
  2012-04-16  8:51     ` Ron Edison
@ 2012-04-17  8:41       ` Stefan Hajnoczi
  2012-04-24 13:43         ` Avi Kivity
  0 siblings, 1 reply; 6+ messages in thread
From: Stefan Hajnoczi @ 2012-04-17  8:41 UTC (permalink / raw)
  To: Ron Edison; +Cc: kvm, Kevin Wolf, Christoph Hellwig

On Mon, Apr 16, 2012 at 9:51 AM, Ron Edison <ron@idthq.com> wrote:
> I would be very interested in how to ensure the guests are sending flushes. I'm unfamiliar with the example you gave, where is that configured?

"mount -o barrier=1 /dev/sda /mnt" is a mount option for ext3 and ext4
file systems.

You probably don't want this actually since you have a battery-backed
RAID controller.  See below for more.
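
As a rough illustration of checking this inside a Linux guest, the sketch below classifies ext3/ext4 entries from /proc/mounts text by their explicit barrier option. The helper name barrier_status is hypothetical. An entry with no explicit option is reported as "default", because the effective setting then depends on the filesystem and kernel version, so that result is inconclusive on its own.

```python
# Hypothetical helper, not part of any existing tool.
EXT_TYPES = {"ext3", "ext4"}


def barrier_status(mounts_text):
    """Map mount point -> 'off', 'on' or 'default' for ext3/ext4 entries.

    mounts_text is the content of /proc/mounts (or text in the same format).
    """
    status = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in EXT_TYPES:
            continue
        mount_point = fields[1]
        options = set(fields[3].split(","))
        if options & {"barrier=0", "nobarrier"}:
            status[mount_point] = "off"
        elif options & {"barrier=1", "barrier"}:
            status[mount_point] = "on"
        else:
            # No explicit option: the filesystem/kernel default applies.
            status[mount_point] = "default"
    return status
```

Inside a guest this could be called as barrier_status(open("/proc/mounts").read()).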

> Primarily the guests are CentOS 4, 5 or 6. I am also curious if it would be advisable to switch to writethrough cache on each guest virtual disk and leave writeback enabled on the controller and if that would adversely affect performance of the guests.

The most conservative modes are cache=writethrough (uses host page
cache) and cache=directsync (does not use host page cache).  They both
ensure that every single write is flushed to disk.  Therefore they
have a performance penalty.  cache=directsync minimizes stress on host
memory because it bypasses the page cache.

Since you have a non-volatile cache in your RAID controller you can
also use cache=none.  This also bypasses the host page cache but it
does not flush every single write.  The guest may still send flushes
but even if it does not, the writes are going to the RAID controller's
non-volatile cache.
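
To make the comparison concrete, here is a small sketch that tabulates the cache modes as they are described in this thread and builds a -drive option string. The function names drive_option and relies_on_guest_flushes are made up for illustration, and the second function only encodes the reasoning given in this thread; it is not a QEMU API.

```python
# Each entry: (uses host page cache, flushes after every write),
# following the descriptions given in this thread.
CACHE_MODES = {
    "writethrough": (True, True),
    "directsync": (False, True),
    "none": (False, False),
    "writeback": (True, False),
}


def drive_option(path, fmt, cache):
    """Build the value for a QEMU -drive argument (illustrative helper)."""
    if cache not in CACHE_MODES:
        raise ValueError("unknown cache mode: %s" % cache)
    # QEMU option parsing treats ',' as a separator; a literal comma is doubled.
    return "file=%s,format=%s,cache=%s" % (path.replace(",", ",,"), fmt, cache)


def relies_on_guest_flushes(cache, controller_cache_nonvolatile):
    """True if completed writes can be lost on host power failure
    unless the guest sends flushes (simplified model of the thread)."""
    uses_page_cache, flushes_every_write = CACHE_MODES[cache]
    if flushes_every_write:
        return False
    if uses_page_cache:
        # The host page cache is volatile regardless of the controller.
        return True
    # Page cache bypassed: writes go straight to the controller cache.
    return not controller_cache_nonvolatile
```

For example, drive_option("guest1.img", "raw", "none") yields "file=guest1.img,format=raw,cache=none", which would be passed after -drive on the QEMU command line.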

> The disk corruption experienced was indeed lost data. An fsck was necessary for 4 of the guests to boot at all in RW mode; they first came up read-only. In the case of one of the guests, files and data were actually lost after fsck was manually run in single-user mode upon reboot. In some cases these were config files, in others database indexes, etc. That guest, the one of the 4 with the most severe corruption, was not usable, and we had to revert to a backup and pull as much current data out of it as possible.

Since you used QEMU -drive cache=writeback data loss is expected on
host power failure.  cache=writeback uses the (volatile) host page
cache and therefore data may not have made it to the RAID controller
before power was lost.

Guest file system recovery - either a quick journal replay or a
painful fsck - is also expected on host power failure.  The file
systems are dirty since the guest stopped executing without cleanly
unmounting its file systems.  If you use cache=none or
cache=directsync then you should get a quick journal replay and the
risk of a painful fsck should be reduced (most/all of the data will
have been preserved).

Stefan


* Re: writeback cache + h700 controller w/1gb nvcache, corruption on power loss
  2012-04-17  8:41       ` Stefan Hajnoczi
@ 2012-04-24 13:43         ` Avi Kivity
  0 siblings, 0 replies; 6+ messages in thread
From: Avi Kivity @ 2012-04-24 13:43 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Ron Edison, kvm, Kevin Wolf, Christoph Hellwig

On 04/17/2012 11:41 AM, Stefan Hajnoczi wrote:
> > The disk corruption experienced was indeed lost data. An fsck was necessary for 4 of the guests to boot at all in RW mode; they first came up read-only. In the case of one of the guests, files and data were actually lost after fsck was manually run in single-user mode upon reboot. In some cases these were config files, in others database indexes, etc. That guest, the one of the 4 with the most severe corruption, was not usable, and we had to revert to a backup and pull as much current data out of it as possible.
>
> Since you used QEMU -drive cache=writeback data loss is expected on
> host power failure.  cache=writeback uses the (volatile) host page
> cache and therefore data may not have made it to the RAID controller
> before power was lost.
>
> Guest file system recovery - either a quick journal replay or a
> painful fsck - is also expected on host power failure.  The file
> systems are dirty since the guest stopped executing without cleanly
> unmounting its file systems.  If you use cache=none or
> cache=directsync then you should get a quick journal replay and the
> risk of a painful fsck should be reduced (most/all of the data will
> have been preserved).

Both cache=writeback and cache=none should result in journal replay if
the guest is using barriers.  If you're using ext4 and data=journal, you
should expect no data loss.  With the defaults you can expect to lose
recently written data (quite a lot with data=writeback) but the
filesystem structure itself should be fine.

Even with cache=directsync you should expect some data loss as the guest
may hold data in its own page cache.

So something unexpected is happening.  Which guests (by OS type) are
failing?
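
The point about the guest's own page cache applies to applications too: data that was written but never synced can still be lost, whatever the host cache mode. A minimal sketch of a write that asks the guest kernel to push data to stable storage before returning, assuming a POSIX guest and that flushes are honoured further down the stack; durable_write is a hypothetical name.

```python
import os
import tempfile


def durable_write(path, data):
    """Replace path with data (bytes), syncing the file and its directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # Python buffer -> guest page cache
            os.fsync(tmp.fileno())  # guest page cache -> (virtual) disk
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself is recorded.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Whether the fsync actually reaches non-volatile storage still depends on the guest filesystem sending flushes and on the host cache mode, as discussed in this thread.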

-- 
error compiling committee.c: too many arguments to function



* Re: writeback cache + h700 controller w/1gb nvcache, corruption on power loss
  2012-04-16  8:34   ` Stefan Hajnoczi
  2012-04-16  8:51     ` Ron Edison
@ 2012-04-25  3:57     ` Christoph Hellwig
  1 sibling, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2012-04-25  3:57 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Ron Edison, kvm

On Mon, Apr 16, 2012 at 09:34:41AM +0100, Stefan Hajnoczi wrote:
> On Sun, Apr 15, 2012 at 5:16 AM, Ron Edison <ron@idthq.com> wrote:
> > The server is a Dell R710 with an H700 controller with 1gb of nvcache. Writeback cache is enabled on the controller. There is a mix of linux and windows guests, some with qcow2 format vdisks and others with raw format vdisks. Some of these guests have wb cache enabled on the vdisks and some do not.
> 
> -drive cache=writeback is safe when the guest flushes appropriately.
> If the guest is not sending flushes (e.g. ext3/4 barrier=0) then there
> are no guarantees.

Which is the default for ext3 on RHEL/CentOS < 6, btw.

