From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andre Przywara Subject: Re: 2.6.35-rc1 regression with pvclock and smp guests Date: Mon, 2 Aug 2010 12:12:37 +0200 Message-ID: <4C569A15.6020900@amd.com> References: <4C483F67.1010007@amd.com> <4C4BF96B.7010005@redhat.com> <4C4D4B8B.80006@amd.com> <4C4EAEFC.20207@redhat.com> <4C4EC7D1.6030708@amd.com> <4C4ECBC7.1070405@redhat.com> <4C4ECF2E.4070103@amd.com> <4C4ED257.40002@redhat.com> <4C4EE3B3.2090900@amd <4C4EE61F.3070908@redhat.com> <4C4EF361.4010303@amd.com> <4C4F54F5.6080400@amd.com> <4C5021B3.1030000@amd.com> <4C53581E.2090208@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset="ISO-8859-1"; format=flowed Content-Transfer-Encoding: 7bit Cc: Avi Kivity , "glommer@redhat.com" , KVM list To: Zachary Amsden Return-path: Received: from tx2ehsobe003.messaging.microsoft.com ([65.55.88.13]:43693 "EHLO TX2EHSOBE006.bigfish.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751142Ab0HBKRC (ORCPT ); Mon, 2 Aug 2010 06:17:02 -0400 In-Reply-To: <4C53581E.2090208@redhat.com> Sender: kvm-owner@vger.kernel.org List-ID: Zachary Amsden wrote: > On 07/28/2010 02:25 AM, Andre Przywara wrote: >> Andre Przywara wrote: >>> Andre Przywara wrote: >>>> Avi Kivity wrote: >>>>> On 07/27/2010 04:48 PM, Andre Przywara wrote: >>>>>>> Wierd. Maybe the clock goes crazy. >>>>>>> >>>>>>> Let's see if it jumps forward alot: >>>>>>> >>>>>>> } while (unlikely(last != ret)); >>>>>>> + >>>>>>> + { >>>>>>> + static u64 last_report; >>>>>>> + if (ret > last_report + 10000) { >>>>>>> + last_report = ret; >>>>>>> + printk("kvmclock: %llx\n", ret); >>>>>>> + } >>>>>>> + >>>>>>> + } >>>>>>> >>>>>>> return ret; >>>>>>> } >>>>>>> >>>>>>> Worth updating the 'return last' to update ret and goto the new >>>>>>> code, so we don't miss that path. >>>>>> Did that. There is _a lot_ of output (about 350 lines per second >>>>>> via the 115k serial console), both with smp=1 and smp=2. >>>>>> The majority is differing about 2,000,000 (ticks?), but a handful >>>>>> of them are in the range of 20 million. >>>>> nanoseconds. So 2-20ms. Consistent with 350 lines/sec. >>>>> >>>>>> No difference between smp=2 and smp=1. >>>>>> I also get some "BUG: recent printk recursion!" and I don't see >>>>>> any kernel boot progress beyond outputting the BogoMIPS value. >>>>> Right, printk() wants the time too. >>>>> >>>>>> BTW: I found two message from your earlier debug statement: >>>>>> [ 0.000000] kvm-clock: cpu 0, msr 0:1ac0401, boot clock >>>>>> [ 0.000000] kvm-clock: cpu 0, msr 0:1e15401, primary cpu clock >>>>> Those are from kvmclock initialization, not from the older patch. >>>>> >>>>> I'm completely confused, everything seems to be in order. >>>>> >>>>> Let's see. if you s/return last/return ret/ in the original, does >>>>> this help things along? this makes pvclock drop the computation >>>>> and should be exactly the same as before the patch. >>>> Yes, this works, both smp version boot. I see a short very short >>>> break after the line in question, but then it proceeds well. >>>> Thanks for your help, now I got a much better insight into the >>>> issue. I will see if I can find something more. >>> Did some more investigations, some observations: >>> - The cmpxchg does not seem to be a problem, I didn't see the loop >>> iterated more than once. >>> - Turning off printk-timestamps makes the bug go away. But I guess it >>> is just hiding or deferring it, and it's no real workaround anyway. >>> - I instrumented the "if (ret < last) return last;" statement, when >>> the kernel hangs I get only printks from there, although it has hit >>> before: >>> ---------- >>> [ 0.820000] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled >>> [ 0.820000] returning last instead (cnt=19001) >>> [ 0.820000] returning last instead (cnt=20001) >>> The last line repeats forever with the same timestamp, the counter >>> (counting the number of "return last;") increments about 3500 >>> times/second. >>> >>> I will see if I find something more... >> Added some more instrumentation, seems like the values read from the >> pvclock is bogus *sometimes*: >> returning last instead (2778021535795841, cnt=1, diff=1389078312510470) >> This is from the first time the if-statement triggers. So I guess the >> value read is ridiculously far in the future (multiple days), so next >> calls to clocksource_read() will always return this bogus last value. >> This means that the clock does not make progress (for several days) >> and thus timing loops will never come to an end. I also instrumented >> the serial driver, the last thing I saw was autoconfig_irq, where >> obviously udelay() is called. >> >> Does that ring a bell with someone? >> >> I will now concentrate on the pvclock readout/HV write part to see >> which of the values used here are wrong. > > Have you gotten any further results on this? Somehow. I think my latest findings were more or less ghost bugs: since prinkts contain a timestamp they interfere with the actual code. The large gap I described above was only to be seen with these printks, it is more or less double the real value (which is my host's uptime). Sadly I cannot use debugfs to avoid the printks, since the kernel halts and I don't get to userland. On another try I managed to bisect the failure also in qemu-kvm. The bug triggers only with "ebc4f45 turn off kvmclock when resetting cpu" applied (_additionally_ to the kernel patch in question). When I comment out the call to kvm_reset_msrs() in the master branch, this also lets the bug vanish. > > I think the most likely explanation is that your host CPU has TSC out of > sync, and somehow this leaks over to pvclock. Am I correct that it > happens even with one guest VCPU? What if you disable secondary host CPUs? I tried several ways to pin VCPUs to different host CPUs (cores and sockets): both VCPUs on one core, both vCPUs on different cores on the same socket and both vCPUs to different sockets/nodes. That all did not make any difference, the kernel halted in either case. I also tried booting the host with maxcpus=1, the error was still the same: -smp 1 works, -smp 2 halts. Btw.: the host uses clocksource acpi_pm. Also I noticed that sometimes the guests gets very slow after having switched the clocksource to kvmclock, it then eventually halts at the mentioned line. Regards, Andre. -- Andre Przywara AMD-OSRC (Dresden) Tel: x29712