From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 15 Oct 2015 13:15:12 +0200 (CEST)
From: Alexandre DERUMIER
Message-ID: <43269660.84310434.1444907712286.JavaMail.zimbra@oxygem.tv>
In-Reply-To: <20151015101607.17529.2880.malone@chaenomeles.canonical.com>
References: <20150910143810.8783.51069.malonedeb@wampee.canonical.com>
 <20151015101607.17529.2880.malone@chaenomeles.canonical.com>
Subject: Re: [Qemu-devel] [Bug 1494350] Re: QEMU: causes vCPU steal time overflow on live migration
To: Bug 1494350 <1494350@bugs.launchpad.net>
Cc: qemu-devel

>> Hi,

Hi

>> I've seen the same issue with debian jessie.
>>
>> Compiled 4.2.3 from kernel.org with "make localyesconfig",
>> no problem any more

host kernel or guest kernel?
----- Original message -----
From: "Markus Breitegger" <1494350@bugs.launchpad.net>
To: "qemu-devel"
Sent: Thursday, 15 October 2015 12:16:07
Subject: [Qemu-devel] [Bug 1494350] Re: QEMU: causes vCPU steal time overflow on live migration

Hi,

I've seen the same issue with debian jessie.
Compiled 4.2.3 from kernel.org with "make localyesconfig",
no problem any more

--
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1494350

Title:
  QEMU: causes vCPU steal time overflow on live migration

Status in QEMU:
  New

Bug description:
  I'm pasting in text from Debian Bug 785557
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785557
  b/c I couldn't find this issue reported.

  It is present in QEMU 2.3, but I haven't tested later versions.
  Perhaps someone else will find this bug and confirm for later
  versions. (Or I will when I have time!)

  --------------------------------------------------------------------------------

  Hi,

  I'm trying to debug an issue we're having with some debian.org machines
  running in QEMU 2.1.2 instances (see [1] for more background). In short,
  after a live migration guests running Debian Jessie (linux 3.16) stop
  accounting CPU time properly.
  /proc/stat in the guest shows no increase in user and system time
  anymore (regardless of workload) and what stands out are extremely
  large values for steal time:

  % cat /proc/stat
  cpu  2400 0 1842 650879168 2579640 0 25 136562317270 0 0
  cpu0 1366 0 1028 161392988 1238598 0 11 383803090749 0 0
  cpu1 294 0 240 162582008 639105 0 8 39686436048 0 0
  cpu2 406 0 338 163331066 383867 0 4 333994238765 0 0
  cpu3 332 0 235 163573105 318069 0 1 1223752959076 0 0
  intr 355773871 33 10 0 0 0 0 3 0 1 0 0 36 144 0 0 1638612 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5001741 41 0 8516993 0
  3669582 0 0 0 0 0 0 0 0 0 [remaining interrupt counters are all zero,
  trimmed here for readability]
  ctxt 837862829
  btime 1431642967
  processes 8529939
  procs_running 1
  procs_blocked 0
  softirq 225193331 2 77532878 172 7250024 819289 0 54 33739135 176552 105675225

  Reading the memory pointed to by the steal time MSRs pre- and
  post-migration, I can see that post-migration the high bytes are set to
  0xff:

  (qemu) xp /8b 0x1fc0cfc0
  000000001fc0cfc0: 0x94 0x57 0x77 0xf5 0xff 0xff 0xff 0xff

  The "jump" in steal time happens when the guest is resumed on the
  receiving side.

  I've also been able to consistently reproduce this on a Ganeti cluster
  at work, using QEMU 2.1.3 and kernels 3.16 and 4.0 in the guests. The
  issue goes away if I disable the steal time MSR using `-cpu
  qemu64,-kvm_steal_time`.

  So, it looks to me as if the steal time MSR is not set/copied properly
  during live migration, although AFAICT this should be the case after
  917367aa968fd4fef29d340e0c7ec8c608dffaab.

  After investigating a bit more, it looks like the issue comes from an
  overflow in the kernel's accumulate_steal_time() (arch/x86/kvm/x86.c:2023):

  static void accumulate_steal_time(struct kvm_vcpu *vcpu)
  {
          u64 delta;

          if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
                  return;

          delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;

  Using systemtap with the attached script to trace KVM execution on the
  receiving host kernel, we can see that shortly before marking the vCPUs
  as runnable on a migrated KVM instance with 2 vCPUs, the following
  happens (** marks lines of interest):

  **  0 qemu-system-x86(18446): kvm_arch_vcpu_load: run_delay=7856949 ns steal=7856949 ns
      0 qemu-system-x86(18446): -> kvm_arch_vcpu_load
      0 vhost-18446(18447): -> kvm_arch_vcpu_should_kick
      5 vhost-18446(18447): <- kvm_arch_vcpu_should_kick
     23 qemu-system-x86(18446): <- kvm_arch_vcpu_load
      0 qemu-system-x86(18446): -> kvm_arch_vcpu_ioctl
      2 qemu-system-x86(18446): <- kvm_arch_vcpu_ioctl
      0 qemu-system-x86(18446): -> kvm_arch_vcpu_put
      2 qemu-system-x86(18446): -> kvm_put_guest_fpu
      3 qemu-system-x86(18446): <- kvm_put_guest_fpu
      4 qemu-system-x86(18446): <- kvm_arch_vcpu_put
  **  0 qemu-system-x86(18446): kvm_arch_vcpu_load: run_delay=7856949 ns steal=7856949 ns
      0 qemu-system-x86(18446): -> kvm_arch_vcpu_load
      1 qemu-system-x86(18446): <- kvm_arch_vcpu_load
      0 qemu-system-x86(18446): -> kvm_arch_vcpu_ioctl
      1 qemu-system-x86(18446): <- kvm_arch_vcpu_ioctl
      0 qemu-system-x86(18446): -> kvm_arch_vcpu_put
      1 qemu-system-x86(18446): -> kvm_put_guest_fpu
      2 qemu-system-x86(18446): <- kvm_put_guest_fpu
      3 qemu-system-x86(18446): <- kvm_arch_vcpu_put
  **  0 qemu-system-x86(18449): kvm_arch_vcpu_load: run_delay=40304 ns steal=7856949 ns
      0 qemu-system-x86(18449): -> kvm_arch_vcpu_load
  **  7 qemu-system-x86(18449): delta: 18446744073701734971 ns, steal=7856949 ns, run_delay=40304 ns
     10 qemu-system-x86(18449): <- kvm_arch_vcpu_load
  **  0 qemu-system-x86(18449): -> kvm_arch_vcpu_ioctl_run
      4 qemu-system-x86(18449): -> kvm_arch_vcpu_runnable
      6 qemu-system-x86(18449): <- kvm_arch_vcpu_runnable
  ...
      0 qemu-system-x86(18448): kvm_arch_vcpu_load: run_delay=0 ns steal=7856949 ns
      0 qemu-system-x86(18448): -> kvm_arch_vcpu_load
  ** 34 qemu-system-x86(18448): delta: 18446744073701694667 ns, steal=7856949 ns, run_delay=0 ns
     40 qemu-system-x86(18448): <- kvm_arch_vcpu_load
  **  0 qemu-system-x86(18448): -> kvm_arch_vcpu_ioctl_run
      5 qemu-system-x86(18448): -> kvm_arch_vcpu_runnable

  Now, what's really interesting is that current->sched_info.run_delay
  gets reset because the tasks (threads) using the vCPUs change, and thus
  have a different current->sched_info: it looks like task 18446 created
  the two vCPUs, and then they were handed over to 18448 and 18449
  respectively.
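  [Editorial note: the huge delta values in the trace are exactly what
  unsigned 64-bit wraparound predicts. When the new thread's run_delay
  (40304 ns) is smaller than the saved last_steal (7856949 ns), the
  subtraction in accumulate_steal_time() wraps modulo 2^64. A minimal
  standalone check, using the values from the trace above:]

  ```c
  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      /* Values observed in the kvm_arch_vcpu_load trace for task 18449. */
      uint64_t last_steal = 7856949; /* saved from task 18446's run_delay */
      uint64_t run_delay  = 40304;   /* task 18449's own run_delay */

      /* Same expression as the kernel's
       * delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
       * unsigned subtraction wraps modulo 2^64 when run_delay < last_steal. */
      uint64_t delta = run_delay - last_steal;

      printf("delta = %llu ns\n", (unsigned long long)delta);
      /* prints: delta = 18446744073701734971 ns -- the value in the trace */
      return 0;
  }
  ```

  [The same arithmetic reproduces the other vCPU's delta: run_delay=0
  minus last_steal=7856949 wraps to 18446744073701694667 ns.]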
  This is also verified by the fact that during the overflow, both vCPUs
  have the old steal time of the last vcpu_load of task 18446. However,
  according to Documentation/virtual/kvm/api.txt:

    - vcpu ioctls: These query and set attributes that control the
      operation of a single virtual cpu.

      Only run vcpu ioctls from the same thread that was used to create
      the vcpu.

  So it seems qemu is doing something that it shouldn't: calling vCPU
  ioctls from a thread that didn't create the vCPU. Note that this
  probably happens on every QEMU startup, but is not visible because the
  guest kernel zeroes out the steal time on boot.

  There are at least two ways to mitigate the issue without a kernel
  recompilation:

   - The first one is to disable the steal time propagation from host to
     guest by invoking qemu with `-cpu qemu64,-kvm_steal_time`. This will
     short-circuit accumulate_steal_time() due to (vcpu->arch.st.msr_val &
     KVM_MSR_ENABLED) and will completely disable steal time reporting in
     the guest, which may not be desired if people rely on it to detect
     CPU congestion.

   - The other one is using the following systemtap script to prevent the
     steal time counter from overflowing by dropping the problematic
     samples (WARNING: systemtap guru mode required, use at your own
     risk):

       probe module("kvm").statement("*@arch/x86/kvm/x86.c:2024") {
         if (@defined($delta) && $delta < 0) {
           printk(4, "kvm: steal time delta < 0, dropping")
           $delta = 0
         }
       }

  Note that not all *guests* handle this condition in the same way: 3.2
  guests still get the overflow in /proc/stat, but their scheduler
  continues to work as expected.
  3.16 guests OTOH go nuts once steal time overflows and stop
  accumulating system & user time, while entering an erratic state where
  steal time in /proc/stat is *decreasing* on every clock tick.

  --------------------------------------------

  Revised statement:

  > Now, what's really interesting is that current->sched_info.run_delay
  > gets reset because the tasks (threads) using the vCPUs change, and
  > thus have a different current->sched_info: it looks like task 18446
  > created the two vCPUs, and then they were handed over to 18448 and
  > 18449 respectively. This is also verified by the fact that during the
  > overflow, both vCPUs have the old steal time of the last vcpu_load of
  > task 18446. However, according to Documentation/virtual/kvm/api.txt:

  The above is not entirely accurate: the vCPUs were created by the
  threads that are used to run them (18448 and 18449 respectively), it's
  just that the main thread is issuing ioctls during initialization, as
  illustrated by the strace output on a different process:

  [ vCPU #0 thread creating vCPU #0 (fd 20) ]
  [pid 1861] ioctl(14, KVM_CREATE_VCPU, 0) = 20
  [pid 1861] ioctl(20, KVM_X86_SETUP_MCE, 0x7fbd3ca40cd8) = 0
  [pid 1861] ioctl(20, KVM_SET_CPUID2, 0x7fbd3ca40ce0) = 0
  [pid 1861] ioctl(20, KVM_SET_SIGNAL_MASK, 0x7fbd380008f0) = 0

  [ vCPU #1 thread creating vCPU #1 (fd 21) ]
  [pid 1862] ioctl(14, KVM_CREATE_VCPU, 0x1) = 21
  [pid 1862] ioctl(21, KVM_X86_SETUP_MCE, 0x7fbd37ffdcd8) = 0
  [pid 1862] ioctl(21, KVM_SET_CPUID2, 0x7fbd37ffdce0) = 0
  [pid 1862] ioctl(21, KVM_SET_SIGNAL_MASK, 0x7fbd300008f0) = 0

  [ Main thread calling kvm_arch_put_registers() on vCPU #0 ]
  [pid 1859] ioctl(20, KVM_SET_REGS, 0x7ffc98aac230) = 0
  [pid 1859] ioctl(20, KVM_SET_XSAVE or KVM_SIGNAL_MSI, 0x7fbd38001000) = 0
  [pid 1859] ioctl(20, KVM_PPC_ALLOCATE_HTAB or KVM_SET_XCRS, 0x7ffc98aac010) = 0
  [pid 1859] ioctl(20, KVM_SET_SREGS, 0x7ffc98aac050) = 0
  [pid 1859] ioctl(20, KVM_SET_MSRS, 0x7ffc98aab820) = 87
  [pid 1859] ioctl(20, KVM_SET_MP_STATE, 0x7ffc98aac230) = 0
  [pid 1859] ioctl(20, KVM_SET_LAPIC, 0x7ffc98aabd80) = 0
  [pid 1859] ioctl(20, KVM_SET_MSRS, 0x7ffc98aac1b0) = 1
  [pid 1859] ioctl(20, KVM_SET_PIT2 or KVM_SET_VCPU_EVENTS, 0x7ffc98aac1b0) = 0
  [pid 1859] ioctl(20, KVM_SET_DEBUGREGS or KVM_SET_TSC_KHZ, 0x7ffc98aac1b0) = 0

  [ Main thread calling kvm_arch_put_registers() on vCPU #1 ]
  [pid 1859] ioctl(21, KVM_SET_REGS, 0x7ffc98aac230) = 0
  [pid 1859] ioctl(21, KVM_SET_XSAVE or KVM_SIGNAL_MSI, 0x7fbd30001000) = 0
  [pid 1859] ioctl(21, KVM_PPC_ALLOCATE_HTAB or KVM_SET_XCRS, 0x7ffc98aac010) = 0
  [pid 1859] ioctl(21, KVM_SET_SREGS, 0x7ffc98aac050) = 0
  [pid 1859] ioctl(21, KVM_SET_MSRS, 0x7ffc98aab820) = 87
  [pid 1859] ioctl(21, KVM_SET_MP_STATE, 0x7ffc98aac230) = 0
  [pid 1859] ioctl(21, KVM_SET_LAPIC, 0x7ffc98aabd80) = 0
  [pid 1859] ioctl(21, KVM_SET_MSRS, 0x7ffc98aac1b0) = 1
  [pid 1859] ioctl(21, KVM_SET_PIT2 or KVM_SET_VCPU_EVENTS, 0x7ffc98aac1b0) = 0
  [pid 1859] ioctl(21, KVM_SET_DEBUGREGS or KVM_SET_TSC_KHZ, 0x7ffc98aac1b0) = 0

  Using systemtap again, I noticed that the main thread's run_delay is
  copied to last_steal only after a KVM_SET_MSRS ioctl which enables the
  steal time MSR is issued by the main thread (see linux
  3.16.7-ckt11-1/arch/x86/kvm/x86.c:2162).
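  [Editorial note: the ordering problem can be sketched in a few lines of
  illustrative C (a model, not kernel or QEMU code; names and values are
  taken from the trace earlier in this report). The KVM_SET_MSRS write
  that enables the steal-time MSR runs in the main thread's context, so
  last_steal is seeded from the main thread's run_delay; the first
  vcpu_load on the real vCPU thread then subtracts that larger value from
  its own run_delay and wraps. A signed check of the kind used in the
  systemtap workaround above would drop the bogus sample:]

  ```c
  #include <stdint.h>
  #include <stdio.h>

  /* Illustrative model only: per-thread scheduler wait time, in ns. */
  struct task { const char *comm; uint64_t run_delay; };

  int main(void)
  {
      struct task main_thread = { "qemu main",  7856949 }; /* issues KVM_SET_MSRS */
      struct task vcpu_thread = { "qemu vcpu",  40304   }; /* runs the guest */

      /* Step 1: steal-time MSR enabled from the main thread's context,
       * so last_steal picks up the *main* thread's run_delay. */
      uint64_t last_steal = main_thread.run_delay;

      /* Step 2: first vcpu_load on the vCPU thread computes the delta
       * against its own, smaller run_delay -- and wraps modulo 2^64. */
      uint64_t delta = vcpu_thread.run_delay - last_steal;
      printf("raw delta: %llu ns\n", (unsigned long long)delta);

      /* A signed-comparison guard, like the systemtap workaround,
       * drops the wrapped sample instead of accumulating it. */
      if ((int64_t)delta < 0)
          delta = 0;
      printf("guarded delta: %llu ns\n", (unsigned long long)delta);
      return 0;
  }
  ```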
  Taking an educated guess, I reverted the following qemu commits:

  commit 0e5035776df31380a44a1a851850d110b551ecb6
  Author: Marcelo Tosatti
  Date:   Tue Sep 3 18:55:16 2013 -0300

      fix steal time MSR vmsd callback to proper opaque type

      Convert steal time MSR vmsd callback pointer to proper X86CPU type.

      Signed-off-by: Marcelo Tosatti
      Signed-off-by: Paolo Bonzini

  commit 917367aa968fd4fef29d340e0c7ec8c608dffaab
  Author: Marcelo Tosatti
  Date:   Tue Feb 19 23:27:20 2013 -0300

      target-i386: kvm: save/restore steal time MSR

      Read and write steal time MSR, so that reporting is functional
      across migration.

      Signed-off-by: Marcelo Tosatti
      Signed-off-by: Gleb Natapov

  and the steal time jump on migration went away. However, steal time was
  not reported at all after migration, which is expected after reverting
  917367aa.

  So it seems that after 917367aa, the steal time MSR is correctly saved
  and copied to the receiving side, but then it is restored by the main
  thread (probably during cpu_synchronize_all_post_init()), causing the
  overflow when the vCPU threads are unpaused.

  To manage notifications about this bug go to:
  https://bugs.launchpad.net/qemu/+bug/1494350/+subscriptions