From mboxrd@z Thu Jan 1 00:00:00 1970 From: Zachary Amsden Subject: Re: 2.6.35-rc1 regression with pvclock and smp guests Date: Sat, 02 Oct 2010 13:42:04 -1000 Message-ID: <4CA7C34C.4040000@redhat.com> References: <4CA2F8A3.80400@redhat.com> <4CA30424.9030007@msgid.tls.msk.ru> <4CA30493.6090503@msgid.tls.msk.ru> <4CA392FE.5090009@xutrox.com> <4CA4427C.9090304@msgid.tls.msk.ru> <4CA45F7B.8050806@msgid.tls.msk.ru> <4CA4968F.9050402@redhat.com> <4CA4A8C5.3030407@msgid.tls.msk.ru> <4CA4AD87.8060502@redhat.com> <4CA4DBC8.6070606@xutrox.com> <20100930190507.GA1111@amt.cnet> <4CA51715.1070507@msgid.tls.msk.ru> <4CA51847.5060208@msgid.tls.msk.ru> <4CA6C4BB.5020004@redhat.com> <4CA6E0BF.90605@msgid.tls.msk.ru> <4CA75969.1080405@xutrox.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Cc: kvm@vger.kernel.org, Marcelo Tosatti , Michael Tokarev , Avi Kivity , Glauber Costa , Andre Przywara To: Arjan Koers <0h61vkll2ly8@xutrox.com> Return-path: Received: from mx1.redhat.com ([209.132.183.28]:34899 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751915Ab0JBXmP (ORCPT ); Sat, 2 Oct 2010 19:42:15 -0400 In-Reply-To: <4CA75969.1080405@xutrox.com> Sender: kvm-owner@vger.kernel.org List-ID: On 10/02/2010 06:10 AM, Arjan Koers wrote: > On 2010-10-02 09:35, Michael Tokarev wrote: > >> 02.10.2010 09:35, Zachary Amsden wrote: >> [] >> >>> Can you try this patch to see if it helps? I believe it is also safe >>> for Xen, but cc'ing to double check. >>> >> It makes no visible difference. >> >> For some reason one of my test guests - 2.6.35.6 32bit kernel - >> stopped booting completely, always handing at boot somewhere >> unless I disable printk.time. Here's the typical boot messages, >> up to the hang: >> >> [ 0.000000] Initializing cgroup subsys cpuset >> > ... > >> [ 0.259999] ata1: PATA max MWDMA2 cmd 0x1f0 ctl 0x3f6 bmdma 0xc000 irq 14 >> [ 0.259999] ata2: PATA max MWDMA2 cmd 0x170 ctl 0x376 bmdma 0xc008 irq 15 >> >> Note the time - it is constant after switching to kvmclock. >> > While CPU 1 is booting, pvclock_clocksource_read gets wrong data for that > CPU and returns a value that's far into the future. On subsequent calls, it > keeps returning that bogus 'last' value, because it has been made > to never go backwards in time. > > I'm pretty sure that your kernel will boot with this debug patch (for > 2.6.35.7). It doesn't fix the problem, but corrects things afterwards. > The patch sets the clock backwards if it detects that the previous > value was far into the future. It also modifies printk to display some > extra information. The DEBUG define was added to get extra calls to > printk's where things can go wrong. > > > > diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c > index 239427c..5eab569 100644 > --- a/arch/x86/kernel/pvclock.c > +++ b/arch/x86/kernel/pvclock.c > @@ -120,12 +120,15 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src) > > static atomic64_t last_value = ATOMIC64_INIT(0); > > +int pvclock_backwards = 0; > + > cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) > { > struct pvclock_shadow_time shadow; > unsigned version; > cycle_t ret, offset; > u64 last; > + bool backwards; > > do { > version = pvclock_get_time_values(&shadow, src); > @@ -153,13 +156,26 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) > * updating at the same time, and one of them could be slightly behind, > * making the assumption that last_value always go forward fail to hold. > */ > + backwards = false; > last = atomic64_read(&last_value); > do { > - if (ret< last) > - return last; > + if (ret< last) { > + if ( last - ret< 25000000 ) > + return last; > + else > + /* The clock will go backwards instead of being stuck at last value for a very long time > + * The return value of the previous call to pvclock_clocksource_read was most probably > + * very far into te future because the clock for that CPU hadn't been setup yet > + */ > + backwards = true; > + } > last = atomic64_cmpxchg(&last_value, last, ret); > } while (unlikely(last != ret)); > > + /* Increment outside of the while loop, because it always loops twice */ > + if (backwards) > + pvclock_backwards++; > + > return ret; > } > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > index 0bf2ece..d6dcd45 100644 > --- a/arch/x86/kernel/smpboot.c > +++ b/arch/x86/kernel/smpboot.c > @@ -1,3 +1,5 @@ > +#define DEBUG > + > /* > * x86 SMP booting functions > * > diff --git a/kernel/printk.c b/kernel/printk.c > index 444b770..9608bec 100644 > --- a/kernel/printk.c > +++ b/kernel/printk.c > @@ -687,6 +687,8 @@ static inline void printk_delay(void) > } > } > > +extern int pvclock_backwards; > + > asmlinkage int vprintk(const char *fmt, va_list args) > { > int printed_len = 0; > @@ -778,9 +780,13 @@ asmlinkage int vprintk(const char *fmt, va_list args) > unsigned long long t; > unsigned long nanosec_rem; > > + int pvclock_backwards_prev = pvclock_backwards; > t = cpu_clock(printk_cpu); > nanosec_rem = do_div(t, 1000000000); > - tlen = sprintf(tbuf, "[%5lu.%06lu] ", > + tlen = sprintf(tbuf, "[%d;%d/%d:%5lu.%06lu] ", > + printk_cpu, > + pvclock_backwards_prev, > + pvclock_backwards, > (unsigned long) t, > nanosec_rem / 1000); > > > > > Partial output on my machine, where the clock is set backwards 4 times: > ... > [0;0/0: 0.015662] CPU0: AMD Athlon(tm) II X2 240 Processor stepping 02 > [0;0/0: 0.124164] ++++++++++++++++++++=_---CPU UP 1 > [0;0/0: 0.124193] Booting Node 0, Processors #1 Ok. > [0;0/0: 0.124602] Setting warm reset code and vector. > [0;0/0: 0.124609] 1. > [0;0/0: 0.124610] 2. > [0;0/0: 0.124611] 3. > [0;0/0: 0.124624] Asserting INIT. > [0;0/0: 0.124634] Waiting for send to finish... > [0;0/0: 0.134508] Deasserting INIT. > [0;0/0: 0.134515] Waiting for send to finish... > [0;0/0: 0.134519] #startup loops: 2. > [0;0/0: 0.134521] Sending STARTUP #1. > [0;0/0: 0.134527] After apic_write. > [1;0/0: 0.008000] CPU#1 (phys ID: 1) waiting for CALLOUT > [0;0/1: 0.134838] Startup point 1. > [0;1/1: 0.134841] Waiting for send to finish... > [0;1/1: 0.135049] Sending STARTUP #2. > [0;1/1: 0.135055] After apic_write. > [0;1/1: 0.135359] Startup point 1. > [0;1/1: 0.135361] Waiting for send to finish... > [0;1/1: 0.135568] After Startup. > [0;1/1: 0.135569] Before Callout 1. > [0;1/1: 0.135571] After Callout 1. > [1;1/1: 0.008000] CALLIN, before setup_local_APIC(). > [1;2/2: 0.008000] Stack at about ffff88001f875f44 > [0;3/3: 0.136176] CPU1: has booted. > [1;3/3: 0.008000] kvm-clock: cpu 1, msr 0:1511c41, secondary cpu clock > [0;4/4: 0.136199] Brought up 2 CPUs > [0;4/4: 0.136201] Boot done. > [0;4/4: 0.136202] Before bogomips. > [0;4/4: 0.136204] Total of 2 processors activated (11198.56 BogoMIPS). > [0;4/4: 0.136205] Before bogocount - setting activated=1. > [1;4/4: 0.140208] x86 PAT enabled: cpu 1, old 0x0, new 0x7010600070106 > [0;4/4: 0.142577] NET: Registered protocol family 16 > [0;4/4: 0.144263] PCI: Using configuration type 1 for base access > [0;4/4: 0.144494] PCI: Using configuration type 1 for extended access > [0;4/4: 0.144938] mtrr: your CPUs had inconsistent variable MTRR settings > [0;4/4: 0.144938] mtrr: your CPUs had inconsistent MTRRdefType settings > [0;4/4: 0.144938] mtrr: probably your BIOS does not setup all CPUs. > [0;4/4: 0.148004] mtrr: corrected configuration. > [0;4/4: 0.156040] bio: create slab at 0 > [0;4/4: 0.156602] vgaarb: loaded > [0;4/4: 0.156602] PCI: Probing PCI hardware > [0;4/4: 0.156602] PCI: Probing PCI hardware (bus 00) > [0;4/4: 0.156703] pci 0000:00:01.1: reg 20: [io 0xc000-0xc00f] > [0;4/4: 0.160269] pci 0000:00:01.3: quirk: [io 0xb000-0xb03f] claimed by PIIX4 ACPI > [0;4/4: 0.161055] pci 0000:00:01.3: quirk: [io 0xb100-0xb10f] claimed by PIIX4 SMB > [0;4/4: 0.164064] pci 0000:00:02.0: reg 10: [mem 0xf0000000-0xf1ffffff pref] > [0;4/4: 0.164827] pci 0000:00:02.0: reg 14: [mem 0xf2000000-0xf2000fff] > [0;4/4: 0.169023] pci 0000:00:03.0: reg 10: [io 0xc020-0xc03f] > [0;4/4: 0.170052] pci 0000:00:03.0: reg 14: [mem 0xf2001000-0xf2001fff] > [0;4/4: 0.170381] pci 0000:00:04.0: reg 10: [io 0xc040-0xc05f] > [0;4/4: 0.170765] pci 0000:00:05.0: reg 10: [io 0xc080-0xc0bf] > [0;4/4: 0.171023] pci 0000:00:06.0: reg 10: [io 0xc0c0-0xc0ff] > [0;4/4: 0.172123] vgaarb: device added: PCI:0000:00:02.0,decodes=io+mem,owns=io+mem,locks=none > [0;4/4: 0.172971] pci 0000:00:01.0: PIIX/ICH IRQ router [8086:7000] > [0;4/4: 0.172971] PCI: pci_cache_line_size set to 64 bytes > [0;4/4: 0.172971] reserve RAM buffer: 000000000009bc00 - 000000000009ffff > [0;4/4: 0.172971] reserve RAM buffer: 000000001fffd000 - 000000001fffffff > [0;4/4: 0.176175] Switching to clocksource kvm-clock > [1;4/4: 0.212494] pci_bus 0000:00: resource 0 [io 0x0000-0xffff] > [1;4/4: 0.212500] pci_bus 0000:00: resource 1 [mem 0x00000000-0xffffffffffffffff] > [1;4/4: 0.212828] NET: Registered protocol family 2 > [1;4/4: 0.213783] IP route cache hash table entries: 4096 (order: 3, 32768 bytes) > ... > Umm... do you guys have this commit? This is supposed to address the issue where the guest keeps resetting the TSC. A guest which does that will break kvmclock. It only happens on SMP, and it's much worse on AMD CPUs... sound like your scenario. commit bd59fc8ff95126f27b7a0df1b6cc602aa428812d Author: Zachary Amsden Date: Thu Aug 19 22:07:26 2010 -1000 KVM: x86: Robust TSC compensation Make the match of TSC find TSC writes that are close to each other instead of perfectly identical; this allows the compensator to also work in migration / suspend scenarios. Signed-off-by: Zachary Amsden Signed-off-by: Marcelo Tosatti