* [Qemu-devel] cache=writeback and migrations over shared storage
From: Filippos Giannakos @ 2013-09-11 14:30 UTC
To: qemu-devel; +Cc: synnefo-devel
Hello All,
I stumbled upon this link [1], which, among other things, contains the following:
"iSCSI, FC, or other forms of direct attached storage are only safe to use with
live migration if you use cache=none."
How valid is this assertion with current QEMU versions?
I checked out the source code and was left with the impression that
during migration, and *before* handing control over to the destination, a flush is
performed on all disks of the VM. Since the VM is started on the destination
only after the flush is done, its very first read will bring consistent data
from disk.
I can understand that in the corner case in which the storage device has
already been mapped and perhaps has data in the page cache of the destination
node, there is no way to invalidate those pages, so the VM will read stale
data despite the flushes that happened on the source node.
In our case, we provision VMs using our custom storage layer, called
Archipelago [2], which presents volumes as block devices on the host. We would
like to run VMs in cache=writeback mode. If we guarantee externally that there
will be no incoherent cached data on the destination host of the migration
(e.g., by making sure the volume is not mapped on the destination node before
the migration), would it be safe to do so?
Can you comment on the aforementioned approach? Please let me know if there's
something I have misunderstood.
[1] http://wiki.qemu.org/Migration/Storage
[2] http://www.synnefo.org/docs/archipelago/latest
Kind Regards,
--
Filippos
<philipgian@grnet.gr>
* Re: [Qemu-devel] cache=writeback and migrations over shared storage
From: Stefan Hajnoczi @ 2013-09-26 7:31 UTC
To: Filippos Giannakos; +Cc: synnefo-devel, qemu-devel
On Wed, Sep 11, 2013 at 05:30:10PM +0300, Filippos Giannakos wrote:
> [...]
Hi Filippos,
Late response but this may help start the discussion...
Cache consistency during migration was discussed a lot on the mailing
list. You might be able to find threads from about 2 years ago that
discuss this in detail.
Here is what I remember:
During migration, the QEMU process on the destination host must be
started. When QEMU starts up, it opens the image file and reads the
first sector (for disk geometry and image format probing). At this
point the destination populates its page cache while the source is
still running the guest.
We're in trouble because the destination host has stale pages in its
page cache. Hence the recommendation to use cache=none.
There are a few things to look at if you are really eager to use
cache=writeback (rough sketches follow the list):
1. Can you avoid geometry probing? I think by setting the geometry
options on the -drive you can skip probing. See
hw/block/hd-geometry.c.
2. Can you avoid format probing? Use -drive format=raw to skip format
probing.
3. Make sure to use raw image files. Do not use a format since that
would require reading a header and metadata before migration
handover.
4. Check if ioctl(BLKFLSBUF) can be used. Unfortunately it requires
CAP_SYS_ADMIN, so the QEMU process cannot issue it when running
without privileges. Perhaps an external tool like libvirt could
issue it, but that's tricky since live migration handover is a
delicate operation - it's important to avoid dependencies between
multiple processes to keep guest downtime low and reduce the
possibility of failures.
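To make 1-3 concrete, the -drive options would look something like the
following (untested sketch; the file path and geometry values are only
placeholders, and the exact option spellings are worth double-checking
against your QEMU version):

    -drive file=/dev/myvolume,format=raw,cache=writeback,cyls=16383,heads=16,secs=63

For 4, the ioctl itself is simple; the hard part is who gets to call it.
A minimal sketch of the kind of privileged helper I mean (the device path
is again a placeholder):

    /* blkflsbuf.c - flush dirty pages and drop clean ones for a block
     * device via BLKFLSBUF.  Requires CAP_SYS_ADMIN, so it would have to
     * run outside the unprivileged QEMU process. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>           /* BLKFLSBUF */

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1] : "/dev/myvolume"; /* placeholder */
        int fd = open(dev, O_RDONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (ioctl(fd, BLKFLSBUF, 0) < 0) {
            perror("ioctl(BLKFLSBUF)");
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }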
So you might be able to get away with cache=writeback *if* you carefully
study the code and double-check with strace that the destination QEMU
process does not access the image file before handover has completed.
Stefan
* Re: [Qemu-devel] cache=writeback and migrations over shared storage
From: Filippos Giannakos @ 2013-10-03 8:10 UTC
To: qemu-devel; +Cc: stefanha, synnefo-devel
On Thu, Sep 26, 2013 at 09:31:00AM +0200, Stefan Hajnoczi wrote:
> [...]
Hi Stefan,
Thanks for your response. You've been really helpful.
I believe it should be possible to use a writeback cache if we address the
problems you pointed out.
I'll give it a try by providing the disk geometry and by using the raw drive
format. I don't think I need to use ioctl(BLKFLSBUF). As far as I can tell
from a brief look at the kernel code, it flushes the dirty pages of the
device and invalidates the now-clean pages. Since a) the source QEMU process
flushes all block devices before handing control over to the destination
process and b) we remove the block device after the migration, this should
happen automatically.
To be extra sure about this, I'll run a modified QEMU version to pause
execution of the source hypervisor after the VM is stopped on the source and
right before it hands control over to the destination. I believe that this
window exists after:
vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
and right before:
qemu_savevm_state_complete(s->file);
in migration_thread() in migration.c.
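Something along these lines is what I have in mind (purely a temporary
debugging hack, not something I would propose upstream; the marker file
name is arbitrary):

    /* hypothetical pause, inserted in migration_thread() right after
     * vm_stop_force_state(RUN_STATE_FINISH_MIGRATE) and right before
     * qemu_savevm_state_complete(s->file); needs <unistd.h> */
    while (access("/tmp/hold-migration", F_OK) == 0) {
        /* the test harness creates the marker file before migrating and
         * removes it once it has finished strace-ing the destination */
        g_usleep(100 * 1000);       /* poll every 100 ms */
    }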
I'll be strace-ing the destination to make sure there are no reads issued on
the block device. Additionally, our custom storage layer allows us to monitor
all I/O requests performed on the block device, so we have another layer that
ensures no data is read from the storage.
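For the strace part, I plan to attach to the destination QEMU before the
handover with something like the following (the syscall set is just a
starting point):

    strace -f -tt -e trace=open,read,pread64,preadv -p <destination-qemu-pid> -o dest-qemu.trace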
Kind Regards,
--
Filippos
<philipgian@grnet.gr>