* disk corruption after virsh destroy
@ 2013-07-02 14:40 Brian J. Murrell
2013-07-02 15:26 ` Brian J. Murrell
2013-07-03 8:47 ` Stefan Hajnoczi
0 siblings, 2 replies; 5+ messages in thread
From: Brian J. Murrell @ 2013-07-02 14:40 UTC (permalink / raw)
To: kvm
I have a cluster of VMs set up with shared virtio-scsi disks. The
purpose of sharing a disk is that if a VM goes down, another can pick up
and mount the (ext4) filesystem on the shared disk and provide service
to it.
But just to be super clear, only one VM ever has a filesystem mounted at
a time even though multiple VMs technically can access the device at the
same time. A VM mounting a filesystem ensures absolutely that no other
node has it mounted before mounting it.
That said, what I am finding is that when a node dies and another
node tries to mount the (ext4) filesystem, it is found dirty and needs
an fsck.
My understanding is that with ext{3,4} this should not happen, and
indeed that matches my experience on real hardware with coherent disk
caching (i.e. no non-battery-backed caching disk controllers lying to
the O/S about what has been written to physical disk). That is, a node
failing does not leave an ext{3,4} filesystem dirty such that it needs
an fsck.
So, clearly, somewhere between the KVM VM and the physical disk, there
is a cache that is resulting in the guest O/S believing data is being
written to physical disk that is not actually being written there. To
that end, I have set "cache=none" on these shared disks, but this does
not seem to have fixed the problem.
Here is my KVM commandline. Please bear with the unfortunate line
wrapping since my MUA (Thunderbird) doesn't allow for one to specify
lines which shouldn't be wrapped. I have tried to ameliorate that by
indenting all of the lines that start command line options with two spaces.
/usr/bin/qemu-kvm -name wtm-60vm5 -S -M pc-0.14 -enable-kvm -m 8192 \
  -smp 1,sockets=1,cores=1,threads=1 \
  -uuid 5cbc2568-e32d-11e2-9c1f-001e67293bea -no-user-config \
  -nodefaults \
  -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/wtm-60vm5.monitor,server,nowait \
  -mon chardev=charmonitor,id=monitor,mode=control \
  -rtc base=utc -no-shutdown \
  -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
  -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x5 \
  -drive file=/var/lib/libvirt/images/wtm-60vm5.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,serial=node1-root \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
  -drive file=/dev/vg_00/disk1,if=none,id=drive-scsi0-0-0-0,format=raw,serial=disk1,cache=none \
  -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 \
  -drive file=/dev/vg_00/disk2,if=none,id=drive-scsi0-0-0-1,format=raw,serial=disk2,cache=none \
  -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1 \
  -drive file=/dev/vg_00/disk3,if=none,id=drive-scsi0-0-0-2,format=raw,serial=disk3,cache=none \
  -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=2,drive=drive-scsi0-0-0-2,id=scsi0-0-0-2 \
  -drive file=/dev/vg_00/disk4,if=none,id=drive-scsi0-0-0-3,format=raw,serial=disk4,cache=none \
  -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=3,drive=drive-scsi0-0-0-3,id=scsi0-0-0-3 \
  -drive file=/dev/vg_00/disk5,if=none,id=drive-scsi0-0-0-4,format=raw,serial=disk5,cache=none \
  -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=4,drive=drive-scsi0-0-0-4,id=scsi0-0-0-4 \
  -netdev tap,fd=23,id=hostnet0,vhost=on,vhostfd=27 \
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:60:d9:05,bus=pci.0,addr=0x3 \
  -netdev tap,fd=31,id=hostnet1,vhost=on,vhostfd=32 \
  -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:60:a7:05,bus=pci.0,addr=0x8 \
  -chardev pty,id=charserial0 \
  -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:3 \
  -vga cirrus -device AC97,id=sound0,bus=pci.0,addr=0x4 \
  -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7
Clearly it's the 5 scsi disks which are yielding corruption when a VM is
destroyed with "virsh destroy".
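For anyone wanting to double-check which cache mode each -drive actually
carries, here is a minimal sketch. The function name drive_cache_modes is
made up for illustration, and it assumes the command line is fed in on
standard input and that no path contains spaces or commas. It only
reports what is written on the command line: a drive with no cache=
option is printed as "unset" rather than guessing at QEMU's default.

```sh
# Hypothetical helper: list "<file> <cache mode>" for every -drive option
# found in a qemu command line read from standard input.
drive_cache_modes() {
    # Put every token on its own line (spaces, tabs and line-continuation
    # backslashes all become newlines), then look at the token that
    # follows each "-drive".
    tr ' \t\\' '\n\n\n' | awk '
        NF {
            if (prev == "-drive") {
                file = "?"; cache = "unset"
                n = split($0, kv, ",")
                for (i = 1; i <= n; i++) {
                    if (kv[i] ~ /^file=/)  file  = substr(kv[i], 6)
                    if (kv[i] ~ /^cache=/) cache = substr(kv[i], 7)
                }
                print file, cache
            }
            prev = $0
        }'
}

# Possible usage against a running guest (the PID is a placeholder):
#   ps -o args= -p <qemu-pid> | drive_cache_modes
```

Run against the command line above, this would list the qcow2 root disk
as "unset" and the five shared logical volumes as "none".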
Any ideas on what I need to do to ensure that writes at the guest O/S
layer which are to be sent to physical disk actually make it to physical
disk on the host?
Of course, I am happy to provide any additional information, debugging,
etc. that may be needed.
Cheers,
b.
* Re: disk corruption after virsh destroy
From: Brian J. Murrell @ 2013-07-02 15:26 UTC (permalink / raw)
To: kvm
I really should have included some version info. I had intended to,
after I got all of the details written down but simply forgot before I
hit Send. Apologies.
In any case, this was all on Fedora 18. Here are the versions of all
packages related to qemu, kvm and libvirt:
$ rpm -qa | grep -e kvm -e qemu -e libvirt | sort
fence-virtd-libvirt-0.3.0-5.fc18.x86_64
ipxe-roms-qemu-20120328-2.gitaac9718.fc18.noarch
libvirt-0.10.2.6-1.fc18.x86_64
libvirt-client-0.10.2.6-1.fc18.x86_64
libvirt-daemon-0.10.2.6-1.fc18.x86_64
libvirt-daemon-config-network-0.10.2.6-1.fc18.x86_64
libvirt-daemon-config-nwfilter-0.10.2.6-1.fc18.x86_64
libvirt-daemon-driver-interface-0.10.2.6-1.fc18.x86_64
libvirt-daemon-driver-libxl-0.10.2.6-1.fc18.x86_64
libvirt-daemon-driver-lxc-0.10.2.6-1.fc18.x86_64
libvirt-daemon-driver-network-0.10.2.6-1.fc18.x86_64
libvirt-daemon-driver-nodedev-0.10.2.6-1.fc18.x86_64
libvirt-daemon-driver-nwfilter-0.10.2.6-1.fc18.x86_64
libvirt-daemon-driver-qemu-0.10.2.6-1.fc18.x86_64
libvirt-daemon-driver-secret-0.10.2.6-1.fc18.x86_64
libvirt-daemon-driver-storage-0.10.2.6-1.fc18.x86_64
libvirt-daemon-driver-uml-0.10.2.6-1.fc18.x86_64
libvirt-daemon-driver-xen-0.10.2.6-1.fc18.x86_64
libvirt-daemon-kvm-0.10.2.6-1.fc18.x86_64
libvirt-daemon-qemu-0.10.2.6-1.fc18.x86_64
libvirt-python-0.10.2.6-1.fc18.x86_64
qemu-1.2.2-13.fc18.x86_64
qemu-common-1.2.2-13.fc18.x86_64
qemu-img-1.2.2-13.fc18.x86_64
qemu-kvm-1.2.2-13.fc18.x86_64
qemu-system-alpha-1.2.2-13.fc18.x86_64
qemu-system-arm-1.2.2-13.fc18.x86_64
qemu-system-cris-1.2.2-13.fc18.x86_64
qemu-system-lm32-1.2.2-13.fc18.x86_64
qemu-system-m68k-1.2.2-13.fc18.x86_64
qemu-system-microblaze-1.2.2-13.fc18.x86_64
qemu-system-mips-1.2.2-13.fc18.x86_64
qemu-system-or32-1.2.2-13.fc18.x86_64
qemu-system-ppc-1.2.2-13.fc18.x86_64
qemu-system-s390x-1.2.2-13.fc18.x86_64
qemu-system-sh4-1.2.2-13.fc18.x86_64
qemu-system-sparc-1.2.2-13.fc18.x86_64
qemu-system-unicore32-1.2.2-13.fc18.x86_64
qemu-system-x86-1.2.2-13.fc18.x86_64
qemu-system-xtensa-1.2.2-13.fc18.x86_64
qemu-user-1.2.2-13.fc18.x86_64
ruby-libvirt-0.4.0-4.fc18.x86_64
Running kernel is 3.9.2-200.fc18.x86_64
Let me know if there is anything more I can add.
Cheers,
b.
* Re: disk corruption after virsh destroy
From: Stefan Hajnoczi @ 2013-07-03 8:47 UTC (permalink / raw)
To: Brian J. Murrell; +Cc: kvm
On Tue, Jul 02, 2013 at 10:40:11AM -0400, Brian J. Murrell wrote:
> I have a cluster of VMs set up with shared virtio-scsi disks. The
> purpose of sharing a disk is that if a VM goes down, another can
> pick up and mount the (ext4) filesystem on the shared disk and provide
> service to it.
>
> But just to be super clear, only one VM ever has a filesystem
> mounted at a time even though multiple VMs technically can access
> the device at the same time. A VM mounting a filesystem ensures
> absolutely that no other node has it mounted before mounting it.
>
> That said, what I am finding is that when a node dies and
> another node tries to mount the (ext4) filesystem, it is found dirty
> and needs an fsck.
>
> My understanding is that with ext{3,4}, this should not be the case
> and indeed it is my experience, on real hardware with coherent disk
> caching (i.e. no non-battery-backed caching disk controllers lying
> to the O/S about what has been written to physical disk) that this
> is the case. That is, a node failing does not leave an ext{3,4}
> filesystem dirty such that it needs an fsck.
>
> So, clearly, somewhere between the KVM VM and the physical disk,
> there is a cache that is resulting in the guest O/S believing data
> is being written to physical disk that is not actually being written
> there. To that end, I have ensured that on these shared disks that
> I set "cache=none", but this does not seem to have fixed the
> problem.
I expect journal replay and possibly fsck when an ext4 file system was
left in a mounted state and with I/O pending (e.g. due to power
failure).
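To tell a plain journal replay apart from a file system that is actually
marked as having errors, the superblock can be inspected from a node
that does not have the file system mounted. A minimal sketch, assuming
the "Filesystem state:" and "Filesystem features:" lines printed by
dumpe2fs -h; the function name fs_recovery_status and the device path in
the usage comment are hypothetical:

```sh
# Hypothetical helper: summarise "dumpe2fs -h" output read from stdin.
# A needs_recovery feature flag means the journal has not been replayed
# yet; the state line says whether the file system is marked clean.
fs_recovery_status() {
    awk -F: '
        /^Filesystem state:/    { state = $2; sub(/^[ \t]+/, "", state) }
        /^Filesystem features:/ { if ($2 ~ /needs_recovery/) replay = 1 }
        END {
            if (replay) print "journal replay pending (state: " state ")"
            else        print "no replay pending (state: " state ")"
        }'
}

# Possible usage (the device name is a placeholder):
#   dumpe2fs -h /dev/sdX 2>/dev/null | fs_recovery_status
```

If only a replay is pending, mounting should normally recover the
journal on its own; a state other than "clean" is what would point
towards a real fsck.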
A few questions:
1. Is the guest mounting the file system with barrier=0? barrier=1 is
the default.
2. Do the physical disks have a volatile write cache enabled? If yes,
the guest should use barrier=1. If the physical disks have a
non-volatile write cache, or the write cache is disabled, then
barrier=0 is okay.
3. Have you tested without the cluster? Run a single VM and kill it
while it is busy. Then start it up again and see if there is fsck.
4. Is it possible that your previous cluster setup used tune2fs(8) to
disable fsck in some cases? That could explain why you didn't see
fsck before but do now.
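Questions 1 and 4 can be checked from inside the guest. A minimal sketch
for question 1, assuming the mount options are taken from the fourth
field of /proc/mounts; the function name barriers_disabled and the mount
point /mnt/shared are made up for illustration:

```sh
# Hypothetical helper: succeed (exit 0) if a comma-separated ext4 mount
# option string turns write barriers off, fail (exit 1) otherwise.
barriers_disabled() {
    case ",$1," in
        *,barrier=0,*|*,nobarrier,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible usage in the guest (the mount point is a placeholder):
#   opts=$(awk '$2 == "/mnt/shared" { print $4 }' /proc/mounts)
#   if barriers_disabled "$opts"; then echo "barriers are off"; fi
```

For question 4, the "Maximum mount count" and "Check interval" lines in
the output of tune2fs -l show whether periodic checks have been switched
off on a given file system.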
Stefan
* Re: disk corruption after virsh destroy
From: Bernd Schubert @ 2013-07-06 13:03 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Brian J. Murrell, kvm
On 07/03/2013 10:47 AM, Stefan Hajnoczi wrote:
> I expect journal replay and possibly fsck when an ext4 file system was
> left in a mounted state and with I/O pending (e.g. due to power
> failure).
>
> A few questions:
>
> 1. Is the guest mounting the file system with barrier=0? barrier=1 is
> the default.
>
> 2. Do the physical disks have a volatile write cache enabled? If yes,
> the guest should use barrier=1. If the physical disks have a
> non-volatile write cache, or the write cache is disabled, then
> barrier=0 is okay.
Er, why? As far as I understood Brian, the physical disks have not
been reset, so their cache should be irrelevant?
If the VM needs barrier=1, then there must be some VM caching involved,
but Brian tried to disable that. At least in the past that worked fine
with the emulated LSI scsi controller. That way I simulated shared
storage, and as long as I used the raw disk format and cache=none there
was never any corruption (although qcow2 didn't work and introduced
issues).
Brian, maybe you could figure out the pattern of the corruption? I need
to add a very mode to ql-fstest, but with some network file systems on
top of ext4 it also should work.
https://bitbucket.org/aakef/ql-fstest
Cheers,
Bernd
* Re: disk corruption after virsh destroy
From: Stefan Hajnoczi @ 2013-07-15 1:23 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Brian J. Murrell, kvm
On Sat, Jul 06, 2013 at 03:03:07PM +0200, Bernd Schubert wrote:
> On 07/03/2013 10:47 AM, Stefan Hajnoczi wrote:
> > I expect journal replay and possibly fsck when an ext4 file system was
> > left in a mounted state and with I/O pending (e.g. due to power
> > failure).
> >
> > A few questions:
> >
> > 1. Is the guest mounting the file system with barrier=0? barrier=1 is
> > the default.
> >
> > 2. Do the physical disks have a volatile write cache enabled? If yes,
> > the guest should use barrier=1. If the physical disks have a
> > non-volatile write cache, or the write cache is disabled, then
> > barrier=0 is okay.
>
> Er, why? As far as I understood Brian, the physical disks have not
> been reset, so their cache should be irrelevant?
You are right. The physical disk write cache should not matter in this
case.
Stefan