Subject: [Qemu-devel] Live migration debugging
From: Paul Boven
Date: 2014-07-29 11:31 UTC
To: qemu-devel

Hi folks,

Recently there have been several patches to fix kvmclock issues during
migration, which were subsequently reverted. I hope the observations
below can be helpful in pinning down the actual issues, to make live
migration work again in the future.

Live migration has been broken since at least release 1.4.0 (as shipped 
with Ubuntu 13.04), and still has the same problems in 2.1.0-rc2, but 
briefly worked in 2.0-git-20140609.

The problem is that once the live migration completes and the guest is
started on the destination server, it hangs for a long time, consuming
100% CPU. The hang can last mere seconds, but I've also observed hangs
of as long as 11 minutes. Then the guest suddenly becomes responsive
again as if nothing happened, but its clock has not progressed at all
while the machine was hanging.

What I have observed is that the time spent hanging equals exactly the
clock drift the host has accumulated, relative to 'real' (NTP) time,
since the previous migration. If you multiply the time since the
previous migration by the PPM offset as determined by NTP (see
/var/lib/ntp/ntp.drift), that is exactly how many seconds the guest
spends at 100% CPU before becoming responsive again. I have observed
this on two different pairs of KVM servers; each of the servers has a
negative PPM value according to NTP.

Example: a guest having nearly 9 days of uptime, with (according to NTP) 
a clock rate of -34 ppm, froze for 27 seconds when I migrated it. I have 
done quite a few test migrations, and this relationship holds quite 
precisely.
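
To make the arithmetic explicit, here is a trivial back-of-the-envelope
check using the figures from the example above (the file name and the C
form are just for illustration; adjust uptime and ppm for your own
measurements):

  /* freeze_estimate.c: expected post-migration freeze, computed as the
   * clock drift accumulated since the previous migration. */
  #include <stdio.h>

  int main(void)
  {
      double uptime_s = 9.0 * 24 * 3600; /* ~9 days since last migration */
      double ppm      = 34.0;            /* |drift|, see /var/lib/ntp/ntp.drift */

      printf("expected freeze: %.1f s\n", uptime_s * ppm * 1e-6);
      /* prints ~26.4 s, matching the ~27 s freeze observed above */
      return 0;
  }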

As the duration of the freeze is proportional to the time since the
previous migration, debugging is a bit difficult: you have to wait a
while before you can demonstrate the problem. This is probably also why
the issue is underreported; the freeze is hardly noticeable if you
migrate right after starting the VM, but looks like a complete crash
after a few months of uptime.

With the 2.0 sources from 2014-06-09, the problem does *not* occur. A
side effect of that patch is that the guest clock has a lot of jitter
until the first migration, but behaves normally (and still without
hangs) after subsequent migrations.

Is there a way to directly read the kvmclock from the guest or the
host, so we can compare the two before and after migration and see
precisely what goes wrong?
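
If I read the sources correctly, QEMU itself uses the KVM_GET_CLOCK
ioctl during migration (hw/i386/kvm/clock.c). That ioctl operates on
the VM file descriptor, so it cannot peek at another process' guest;
the sketch below only demonstrates the call, on a freshly created
(empty) VM:

  /* kvmclock_read.c: minimal demonstration of KVM_GET_CLOCK. To inspect
   * a real guest, the ioctl would have to be issued from within QEMU,
   * which owns the VM file descriptor. Error handling kept minimal. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  int main(void)
  {
      int kvm = open("/dev/kvm", O_RDWR);
      int vm  = ioctl(kvm, KVM_CREATE_VM, 0);
      struct kvm_clock_data data = { 0 };

      if (kvm < 0 || vm < 0 || ioctl(vm, KVM_GET_CLOCK, &data) < 0) {
          perror("KVM_GET_CLOCK");
          return 1;
      }
      printf("kvmclock: %llu ns (flags 0x%x)\n",
             (unsigned long long)data.clock, data.flags);
      return 0;
  }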

See also https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1297218

Regards, Paul Boven.
-- 
Paul Boven <boven@jive.nl> +31 (0)521-596547
Unix/Linux/Networking specialist
Joint Institute for VLBI in Europe - www.jive.nl
VLBI - It's a fringe science

