From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andre Przywara
Subject: Re: 2.6.35-rc1 regression with pvclock and smp guests
Date: Tue, 27 Jul 2010 15:48:35 +0200
Message-ID: <4C4EE3B3.2090900@amd.com>
References: <4C483F67.1010007@amd.com> <4C4BF96B.7010005@redhat.com> <4C4D4B8B.80006@amd.com> <4C4EAEFC.20207@redhat.com> <4C4EC7D1.6030708@amd.com> <4C4ECBC7.1070405@redhat.com> <4C4ECF2E.4070103@amd.com> <4C4ED257.40002@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: "glommer@redhat.com" , Zachary Amsden , KVM list
To: Avi Kivity
Return-path:
Received: from va3ehsobe001.messaging.microsoft.com ([216.32.180.11]:45812 "EHLO VA3EHSOBE001.bigfish.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751811Ab0G0Nv4 convert rfc822-to-8bit (ORCPT ); Tue, 27 Jul 2010 09:51:56 -0400
In-Reply-To: <4C4ED257.40002@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

Avi Kivity wrote:
> On 07/27/2010 03:21 PM, Andre Przywara wrote:
>> Avi Kivity wrote:
>>> On 07/27/2010 02:49 PM, Andre Przywara wrote:
>>>>> What is the guest executing when it hangs?
>>>> Both VCPUs are halted, the monitor and System.map tell me it's in
>>>> native_safe_halt().
>>>> The code sequence confirms this, it is an intentional sti;hlt
>>>> condition.
>>>> Using -smp 16 also shows that all 16 VCPUs are stuck.
>>>>
>>> Well, strange. The intent of that patch was to make the clock never
>>> go backwards. Perhaps the change made it go forwards by a large
>>> amount, and the guest is not hung, just waiting for some timer that
>>> is far in the future.
>>>
>>> Can you do something like
>>>
>>> -    if (ret < last)
>>> +    if (ret < last) {
>>> +        static u64 max_delta;
>>> +        if (last - ret > max_delta) {
>>> +            max_delta = last - ret;
>>> +            printk("advancing kvmclock by: %llx\n", max_delta);
>>> +        }
>>>          return last;
>>> +    }
>>>
>>> to see if this is happening?
>> No change, it still hangs.
>> I also don't see the printk.
>> The output with smp=1 is like this:
>> [    1.186549] ACPI: Power Button [PWRF]
>> [    1.189204] XENFS: not registering filesystem on non-xen platform
>> [    1.195001] Non-volatile memory driver v1.3
>> [    1.196358] Linux agpgart interface v0.103
>> [    1.197687] [drm] Initialized drm 1.1.0 20060810
>> [    1.198926] [drm:i915_init] *ERROR* drm/i915 can't work without
>> intel_agp module!
>> [    1.201213] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
>> ÿ[    1.460714] serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
>> [    1.463243] 00:06: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
>> [    1.467153] brd: module loaded
>> [    1.469245] loop: module loaded
>> With smp=2 the output stops just before the strange "y" character (I
>> guess it's ASCII 255), which I assume is an artifact of the serial
>> console.
>> As you can see from the timestamps, it takes some time between the last
>> shown line (1.201213) and the first missing one (1.460714).
>
> Weird. Maybe the clock goes crazy.
>
> Let's see if it jumps forward a lot:
>
>     } while (unlikely(last != ret));
> +
> +    {
> +        static u64 last_report;
> +        if (ret > last_report + 10000) {
> +            last_report = ret;
> +            printk("kvmclock: %llx\n", ret);
> +        }
> +
> +    }
>
>     return ret;
> }
>
> Worth updating the 'return last' to update ret and goto the new code, so
> we don't miss that path.
Did that. There is _a lot_ of output (about 350 lines per second via the
115k serial console), both with smp=1 and smp=2.
Most of the values differ by about 2,000,000 (ticks?) from the previous
one, but a handful of the differences are in the range of 20 million.
There is no difference between smp=2 and smp=1.
I also get some "BUG: recent printk recursion!" messages, and I don't see
any kernel boot progress beyond outputting the BogoMIPS value.
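
For reference, here is a rough user-space model of the two checks discussed above: the clamp that keeps the clock from going backwards, and the debug hunk that reports forward jumps. This is only a sketch, not the kernel code. The names clamp_monotonic, report_jump and struct jump_stats are made up for illustration, it is single-threaded (the real code has to cope with several VCPUs), and printf stands in for printk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the "last value handed out" kept by the guest. */
static uint64_t last_value;

/* Clamp: never return a value smaller than one already returned. */
static uint64_t clamp_monotonic(uint64_t raw)
{
    if (raw < last_value)
        return last_value;      /* clock would go backwards: reuse old value */
    last_value = raw;
    return raw;
}

/* Bookkeeping for the jump reporter (illustrative only). */
struct jump_stats {
    uint64_t last_report;       /* clock value at the previous report */
    uint64_t max_delta;         /* largest gap between two reports */
    unsigned reports;           /* number of lines that would be printed */
};

/*
 * Mirrors the debug hunk: report whenever the clock has advanced by more
 * than `threshold` since the previous report. Returns 1 if a line was
 * printed, 0 otherwise.
 */
static int report_jump(struct jump_stats *s, uint64_t now, uint64_t threshold)
{
    assert(s != NULL);

    if (now <= s->last_report + threshold)
        return 0;

    uint64_t delta = now - s->last_report;
    if (delta > s->max_delta)
        s->max_delta = delta;

    s->last_report = now;
    s->reports++;
    printf("kvmclock: %llx\n", (unsigned long long)now);
    return 1;
}
```

With a threshold of 10000, nearly every read of a fast-running clock produces a line, which would fit the flood seen on the console. If the values are nanoseconds (an assumption), a typical gap of 2,000,000 is about 2 ms between printed lines, roughly what a serial console passing about 350 lines per second would allow.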
BTW: I found two messages from your earlier debug statement:
[    0.000000] kvm-clock: cpu 0, msr 0:1ac0401, boot clock
[    0.000000] kvm-clock: cpu 0, msr 0:1e15401, primary cpu clock

Regards,
Andre.

-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712