From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 15 Oct 2015 13:15:12 +0200 (CEST)
From: Alexandre DERUMIER
Message-ID: <43269660.84310434.1444907712286.JavaMail.zimbra@oxygem.tv>
In-Reply-To: <20151015101607.17529.2880.malone@chaenomeles.canonical.com>
References: <20150910143810.8783.51069.malonedeb@wampee.canonical.com>
 <20151015101607.17529.2880.malone@chaenomeles.canonical.com>
Subject: Re: [Qemu-devel] [Bug 1494350] Re: QEMU: causes vCPU steal time overflow on live migration
To: Bug 1494350 <1494350@bugs.launchpad.net>
Cc: qemu-devel

>> Hi,

Hi

>> I've seen the same issue with debian jessie.
>>
>> Compiled 4.2.3 from kernel.org with "make localyesconfig",
>> no problem any more

host kernel or guest kernel?
----- Original message -----
From: "Markus Breitegger" <1494350@bugs.launchpad.net>
To: "qemu-devel"
Sent: Thursday, 15 October 2015 12:16:07
Subject: [Qemu-devel] [Bug 1494350] Re: QEMU: causes vCPU steal time overflow on live migration

Hi,

I've seen the same issue with debian jessie.
Compiled 4.2.3 from kernel.org with "make localyesconfig",
no problem any more

--
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1494350

Title:
  QEMU: causes vCPU steal time overflow on live migration

Status in QEMU:
  New

Bug description:
  I'm pasting in text from Debian Bug 785557
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785557
  b/c I couldn't find this issue reported.

  It is present in QEMU 2.3, but I haven't tested later versions.
  Perhaps someone else will find this bug and confirm for later
  versions. (Or I will when I have time!)

  --------------------------------------------------------------------------------

  Hi,

  I'm trying to debug an issue we're having with some debian.org machines
  running in QEMU 2.1.2 instances (see [1] for more background). In short,
  after a live migration guests running Debian Jessie (linux 3.16) stop
  accounting CPU time properly.
  /proc/stat in the guest shows no increase in user and system time
  anymore (regardless of workload) and what stands out are extremely
  large values for steal time:

  % cat /proc/stat
  cpu  2400 0 1842 650879168 2579640 0 25 136562317270 0 0
  cpu0 1366 0 1028 161392988 1238598 0 11 383803090749 0 0
  cpu1 294 0 240 162582008 639105 0 8 39686436048 0 0
  cpu2 406 0 338 163331066 383867 0 4 333994238765 0 0
  cpu3 332 0 235 163573105 318069 0 1 1223752959076 0 0
  intr 355773871 33 10 0 0 0 0 3 0 1 0 0 36 144 0 0 1638612 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5001741 41 0 8516993 0
  3669582 0 0 0 0 0 0 0 0 0 [remaining interrupt counters are all zero,
  trimmed here for readability]
  ctxt 837862829
  btime 1431642967
  processes 8529939
  procs_running 1
  procs_blocked 0
  softirq 225193331 2 77532878 172 7250024 819289 0 54 33739135 176552 105675225

  Reading the memory pointed to by the steal time MSRs pre- and
  post-migration, I can see that post-migration the high bytes are set to
  0xff:

  (qemu) xp /8b 0x1fc0cfc0
  000000001fc0cfc0: 0x94 0x57 0x77 0xf5 0xff 0xff 0xff 0xff

  The "jump" in steal time happens when the guest is resumed on the
  receiving side.

  I've also been able to consistently reproduce this on a Ganeti cluster
  at work, using QEMU 2.1.3 and kernels 3.16 and 4.0 in the guests. The
  issue goes away if I disable the steal time MSR using `-cpu
  qemu64,-kvm_steal_time`.

  So, it looks to me as if the steal time MSR is not set/copied properly
  during live migration, although AFAICT this should be the case after
  917367aa968fd4fef29d340e0c7ec8c608dffaab.

  After investigating a bit more, it looks like the issue comes from an
  overflow in the kernel's accumulate_steal_time() (arch/x86/kvm/x86.c:2023):

  static void accumulate_steal_time(struct kvm_vcpu *vcpu)
  {
          u64 delta;

          if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
                  return;

          delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;

  Using systemtap with the attached script to trace KVM execution on the
  receiving host kernel, we can see that shortly before marking the vCPUs
  as runnable on a migrated KVM instance with 2 vCPUs, the following
  happens (** marks lines of interest):

  **  0 qemu-system-x86(18446): kvm_arch_vcpu_load: run_delay=7856949 ns steal=7856949 ns
      0 qemu-system-x86(18446): -> kvm_arch_vcpu_load
      0 vhost-18446(18447): -> kvm_arch_vcpu_should_kick
      5 vhost-18446(18447): <- kvm_arch_vcpu_should_kick
     23 qemu-system-x86(18446): <- kvm_arch_vcpu_load
      0 qemu-system-x86(18446): -> kvm_arch_vcpu_ioctl
      2 qemu-system-x86(18446): <- kvm_arch_vcpu_ioctl
      0 qemu-system-x86(18446): -> kvm_arch_vcpu_put
      2 qemu-system-x86(18446): -> kvm_put_guest_fpu
      3 qemu-system-x86(18446): <- kvm_put_guest_fpu
      4 qemu-system-x86(18446): <- kvm_arch_vcpu_put
  **  0 qemu-system-x86(18446): kvm_arch_vcpu_load: run_delay=7856949 ns steal=7856949 ns
      0 qemu-system-x86(18446): -> kvm_arch_vcpu_load
      1 qemu-system-x86(18446): <- kvm_arch_vcpu_load
      0 qemu-system-x86(18446): -> kvm_arch_vcpu_ioctl
      1 qemu-system-x86(18446): <- kvm_arch_vcpu_ioctl
      0 qemu-system-x86(18446): -> kvm_arch_vcpu_put
      1 qemu-system-x86(18446): -> kvm_put_guest_fpu
      2 qemu-system-x86(18446): <- kvm_put_guest_fpu
      3 qemu-system-x86(18446): <- kvm_arch_vcpu_put
  **  0 qemu-system-x86(18449): kvm_arch_vcpu_load: run_delay=40304 ns steal=7856949 ns
      0 qemu-system-x86(18449): -> kvm_arch_vcpu_load
  **  7 qemu-system-x86(18449): delta: 18446744073701734971 ns, steal=7856949 ns, run_delay=40304 ns
     10 qemu-system-x86(18449): <- kvm_arch_vcpu_load
  **  0 qemu-system-x86(18449): -> kvm_arch_vcpu_ioctl_run
      4 qemu-system-x86(18449): -> kvm_arch_vcpu_runnable
      6 qemu-system-x86(18449): <- kvm_arch_vcpu_runnable
  ...
      0 qemu-system-x86(18448): kvm_arch_vcpu_load: run_delay=0 ns steal=7856949 ns
      0 qemu-system-x86(18448): -> kvm_arch_vcpu_load
  ** 34 qemu-system-x86(18448): delta: 18446744073701694667 ns, steal=7856949 ns, run_delay=0 ns
     40 qemu-system-x86(18448): <- kvm_arch_vcpu_load
  **  0 qemu-system-x86(18448): -> kvm_arch_vcpu_ioctl_run
      5 qemu-system-x86(18448): -> kvm_arch_vcpu_runnable

  Now, what's really interesting is that current->sched_info.run_delay
  gets reset because the tasks (threads) using the vCPUs change, and thus
  have a different current->sched_info: it looks like task 18446 created
  the two vCPUs, and then they were handed over to 18448 and 18449
  respectively.
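  [Editorial note: the huge delta values in the trace are exactly what
  unsigned 64-bit wraparound predicts. When the new thread's run_delay
  (40304 ns) is smaller than the saved last_steal (7856949 ns), the
  subtraction in accumulate_steal_time() wraps modulo 2^64. A minimal
  standalone check, using the values from the trace above:]

  ```c
  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      /* Values observed in the kvm_arch_vcpu_load trace for task 18449. */
      uint64_t last_steal = 7856949; /* saved from task 18446's run_delay */
      uint64_t run_delay  = 40304;   /* task 18449's own run_delay */

      /* Same expression as the kernel's
       * delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
       * unsigned subtraction wraps modulo 2^64 when run_delay < last_steal. */
      uint64_t delta = run_delay - last_steal;

      printf("delta = %llu ns\n", (unsigned long long)delta);
      /* prints: delta = 18446744073701734971 ns -- the value in the trace */
      return 0;
  }
  ```

  [The same arithmetic reproduces the other vCPU's delta: run_delay=0
  minus last_steal=7856949 wraps to 18446744073701694667 ns.]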
  This is also verified by the fact that during the overflow, both vCPUs
  have the old steal time of the last vcpu_load of task 18446. However,
  according to Documentation/virtual/kvm/api.txt:

    - vcpu ioctls: These query and set attributes that control the
      operation of a single virtual cpu.

      Only run vcpu ioctls from the same thread that was used to create
      the vcpu.

  So it seems qemu is doing something that it shouldn't: calling vCPU
  ioctls from a thread that didn't create the vCPU. Note that this
  probably happens on every QEMU startup, but is not visible because the
  guest kernel zeroes out the steal time on boot.

  There are at least two ways to mitigate the issue without a kernel
  recompilation:

   - The first one is to disable the steal time propagation from host to
     guest by invoking qemu with `-cpu qemu64,-kvm_steal_time`. This will
     short-circuit accumulate_steal_time() due to (vcpu->arch.st.msr_val &
     KVM_MSR_ENABLED) and will completely disable steal time reporting in
     the guest, which may not be desired if people rely on it to detect
     CPU congestion.

   - The other one is using the following systemtap script to prevent the
     steal time counter from overflowing by dropping the problematic
     samples (WARNING: systemtap guru mode required, use at your own
     risk):

       probe module("kvm").statement("*@arch/x86/kvm/x86.c:2024") {
         if (@defined($delta) && $delta < 0) {
           printk(4, "kvm: steal time delta < 0, dropping")
           $delta = 0
         }
       }

  Note that not all *guests* handle this condition in the same way: 3.2
  guests still get the overflow in /proc/stat, but their scheduler
  continues to work as expected.
  3.16 guests OTOH go nuts once steal time overflows and stop
  accumulating system & user time, while entering an erratic state where
  steal time in /proc/stat is *decreasing* on every clock tick.

  --------------------------------------------

  Revised statement:

  > Now, what's really interesting is that current->sched_info.run_delay
  > gets reset because the tasks (threads) using the vCPUs change, and
  > thus have a different current->sched_info: it looks like task 18446
  > created the two vCPUs, and then they were handed over to 18448 and
  > 18449 respectively. This is also verified by the fact that during the
  > overflow, both vCPUs have the old steal time of the last vcpu_load of
  > task 18446. However, according to Documentation/virtual/kvm/api.txt:

  The above is not entirely accurate: the vCPUs were created by the
  threads that are used to run them (18448 and 18449 respectively), it's
  just that the main thread is issuing ioctls during initialization, as
  illustrated by the strace output on a different process:

  [ vCPU #0 thread creating vCPU #0 (fd 20) ]
  [pid 1861] ioctl(14, KVM_CREATE_VCPU, 0) = 20
  [pid 1861] ioctl(20, KVM_X86_SETUP_MCE, 0x7fbd3ca40cd8) = 0
  [pid 1861] ioctl(20, KVM_SET_CPUID2, 0x7fbd3ca40ce0) = 0
  [pid 1861] ioctl(20, KVM_SET_SIGNAL_MASK, 0x7fbd380008f0) = 0

  [ vCPU #1 thread creating vCPU #1 (fd 21) ]
  [pid 1862] ioctl(14, KVM_CREATE_VCPU, 0x1) = 21
  [pid 1862] ioctl(21, KVM_X86_SETUP_MCE, 0x7fbd37ffdcd8) = 0
  [pid 1862] ioctl(21, KVM_SET_CPUID2, 0x7fbd37ffdce0) = 0
  [pid 1862] ioctl(21, KVM_SET_SIGNAL_MASK, 0x7fbd300008f0) = 0

  [ Main thread calling kvm_arch_put_registers() on vCPU #0 ]
  [pid 1859] ioctl(20, KVM_SET_REGS, 0x7ffc98aac230) = 0
  [pid 1859] ioctl(20, KVM_SET_XSAVE or KVM_SIGNAL_MSI, 0x7fbd38001000) = 0
  [pid 1859] ioctl(20, KVM_PPC_ALLOCATE_HTAB or KVM_SET_XCRS, 0x7ffc98aac010) = 0
  [pid 1859] ioctl(20, KVM_SET_SREGS, 0x7ffc98aac050) = 0
  [pid 1859] ioctl(20, KVM_SET_MSRS, 0x7ffc98aab820) = 87
  [pid 1859] ioctl(20, KVM_SET_MP_STATE, 0x7ffc98aac230) = 0
  [pid 1859] ioctl(20, KVM_SET_LAPIC, 0x7ffc98aabd80) = 0
  [pid 1859] ioctl(20, KVM_SET_MSRS, 0x7ffc98aac1b0) = 1
  [pid 1859] ioctl(20, KVM_SET_PIT2 or KVM_SET_VCPU_EVENTS, 0x7ffc98aac1b0) = 0
  [pid 1859] ioctl(20, KVM_SET_DEBUGREGS or KVM_SET_TSC_KHZ, 0x7ffc98aac1b0) = 0

  [ Main thread calling kvm_arch_put_registers() on vCPU #1 ]
  [pid 1859] ioctl(21, KVM_SET_REGS, 0x7ffc98aac230) = 0
  [pid 1859] ioctl(21, KVM_SET_XSAVE or KVM_SIGNAL_MSI, 0x7fbd30001000) = 0
  [pid 1859] ioctl(21, KVM_PPC_ALLOCATE_HTAB or KVM_SET_XCRS, 0x7ffc98aac010) = 0
  [pid 1859] ioctl(21, KVM_SET_SREGS, 0x7ffc98aac050) = 0
  [pid 1859] ioctl(21, KVM_SET_MSRS, 0x7ffc98aab820) = 87
  [pid 1859] ioctl(21, KVM_SET_MP_STATE, 0x7ffc98aac230) = 0
  [pid 1859] ioctl(21, KVM_SET_LAPIC, 0x7ffc98aabd80) = 0
  [pid 1859] ioctl(21, KVM_SET_MSRS, 0x7ffc98aac1b0) = 1
  [pid 1859] ioctl(21, KVM_SET_PIT2 or KVM_SET_VCPU_EVENTS, 0x7ffc98aac1b0) = 0
  [pid 1859] ioctl(21, KVM_SET_DEBUGREGS or KVM_SET_TSC_KHZ, 0x7ffc98aac1b0) = 0

  Using systemtap again, I noticed that the main thread's run_delay is
  copied to last_steal only after a KVM_SET_MSRS ioctl which enables the
  steal time MSR is issued by the main thread (see linux
  3.16.7-ckt11-1/arch/x86/kvm/x86.c:2162).
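  [Editorial note: the ordering problem can be sketched in a few lines of
  illustrative C (a model, not kernel or QEMU code; names and values are
  taken from the trace earlier in this report). The KVM_SET_MSRS write
  that enables the steal-time MSR runs in the main thread's context, so
  last_steal is seeded from the main thread's run_delay; the first
  vcpu_load on the real vCPU thread then subtracts that larger value from
  its own run_delay and wraps. A signed check of the kind used in the
  systemtap workaround above would drop the bogus sample:]

  ```c
  #include <stdint.h>
  #include <stdio.h>

  /* Illustrative model only: per-thread scheduler wait time, in ns. */
  struct task { const char *comm; uint64_t run_delay; };

  int main(void)
  {
      struct task main_thread = { "qemu main",  7856949 }; /* issues KVM_SET_MSRS */
      struct task vcpu_thread = { "qemu vcpu",  40304   }; /* runs the guest */

      /* Step 1: steal-time MSR enabled from the main thread's context,
       * so last_steal picks up the *main* thread's run_delay. */
      uint64_t last_steal = main_thread.run_delay;

      /* Step 2: first vcpu_load on the vCPU thread computes the delta
       * against its own, smaller run_delay -- and wraps modulo 2^64. */
      uint64_t delta = vcpu_thread.run_delay - last_steal;
      printf("raw delta: %llu ns\n", (unsigned long long)delta);

      /* A signed-comparison guard, like the systemtap workaround,
       * drops the wrapped sample instead of accumulating it. */
      if ((int64_t)delta < 0)
          delta = 0;
      printf("guarded delta: %llu ns\n", (unsigned long long)delta);
      return 0;
  }
  ```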
  Taking an educated guess, I reverted the following qemu commits:

  commit 0e5035776df31380a44a1a851850d110b551ecb6
  Author: Marcelo Tosatti
  Date:   Tue Sep 3 18:55:16 2013 -0300

      fix steal time MSR vmsd callback to proper opaque type

      Convert steal time MSR vmsd callback pointer to proper X86CPU type.

      Signed-off-by: Marcelo Tosatti
      Signed-off-by: Paolo Bonzini

  commit 917367aa968fd4fef29d340e0c7ec8c608dffaab
  Author: Marcelo Tosatti
  Date:   Tue Feb 19 23:27:20 2013 -0300

      target-i386: kvm: save/restore steal time MSR

      Read and write steal time MSR, so that reporting is functional
      across migration.

      Signed-off-by: Marcelo Tosatti
      Signed-off-by: Gleb Natapov

  and the steal time jump on migration went away. However, steal time was
  not reported at all after migration, which is expected after reverting
  917367aa.

  So it seems that after 917367aa, the steal time MSR is correctly saved
  and copied to the receiving side, but then it is restored by the main
  thread (probably during cpu_synchronize_all_post_init()), causing the
  overflow when the vCPU threads are unpaused.

  To manage notifications about this bug go to:
  https://bugs.launchpad.net/qemu/+bug/1494350/+subscriptions