xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
@ 2012-10-26 12:59 Ian Campbell
  2012-10-26 18:25 ` Mauro
       [not found] ` <CAE17a0WfpKHxN=3iF6F+P0nT4UbyF9Lbpm8R5+JJEquQOmudXw@mail.gmail.com>
  0 siblings, 2 replies; 32+ messages in thread
From: Ian Campbell @ 2012-10-26 12:59 UTC (permalink / raw)
  To: xen-devel; +Cc: 599161

[-- Attachment #1: Type: text/plain, Size: 446 bytes --]

Hi all,

I've BCC'd a number of people who have reported seeing this bug at
various times in the past.

If you can still repro I'd appreciate it if you could give the patch in
http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also attached) a go
and report back success/failure and the output of the debugging messages
produced.

Thanks,
Ian.

-- 
Ian Campbell
Current Noise: Death - Evil Dead

Executive ability is prominent in your make-up.

[-- Attachment #2: 00-tsc-debug --]
[-- Type: text/x-patch, Size: 1256 bytes --]

diff -r c1c549c4fe9e xen/arch/x86/time.c
--- a/xen/arch/x86/time.c	Mon Oct 15 16:51:44 2012 +0100
+++ b/xen/arch/x86/time.c	Wed Oct 17 17:13:22 2012 +0100
@@ -523,11 +523,12 @@ static s_time_t __read_platform_stime(u6
 static void plt_overflow(void *unused)
 {
     int i;
-    u64 count;
+    u64 count, old_stamp, tsc;
     s_time_t now, plt_now, plt_wrap;
 
     spin_lock_irq(&platform_timer_lock);
 
+    old_stamp = plt_stamp;
     count = plt_src.read_counter();
     plt_stamp64 += (count - plt_stamp) & plt_mask;
     plt_stamp = count;
@@ -540,6 +541,14 @@ static void plt_overflow(void *unused)
         plt_wrap = __read_platform_stime(plt_stamp64 + plt_mask + 1);
         if ( ABS(plt_wrap - now) > ABS(plt_now - now) )
             break;
+        rdtscll(tsc);
+        printk("XXX plt_overflow: plt_now=%"PRIx64" plt_wrap=%"PRIx64
+               " now=%"PRIx64" old_stamp=%"PRIx64" new_stamp=%"PRIx64
+               " plt_stamp64=%"PRIx64" plt_mask=%"PRIx64
+               " tsc=%"PRIx64" tsc_stamp=%"PRIx64"\n",
+               plt_now, plt_wrap, now, old_stamp, plt_stamp, plt_stamp64,
+               plt_mask, tsc, this_cpu(cpu_time).local_tsc_stamp);
+        break;
         plt_stamp64 += plt_mask + 1;
     }
     if ( i != 0 )

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-10-26 12:59 #599161: Xen debug patch for the "clock shifts by 50 minutes" bug Ian Campbell
@ 2012-10-26 18:25 ` Mauro
       [not found] ` <CAE17a0WfpKHxN=3iF6F+P0nT4UbyF9Lbpm8R5+JJEquQOmudXw@mail.gmail.com>
  1 sibling, 0 replies; 32+ messages in thread
From: Mauro @ 2012-10-26 18:25 UTC (permalink / raw)
  To: Ian Campbell; +Cc: 599161, xen-devel

On 26 October 2012 14:59, Ian Campbell <ijc@hellion.org.uk> wrote:
> Hi all,
>
> I've BCC'd a number of people who have reported seeing this bug at
> various times in the past.
>
> If you can still repro I'd appreciate it if you could give the patch in
> http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also attached) a go
> and report back success/failure and the output of the debugging messages
> produced.

Is that patch for amd64 architectures?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
       [not found] ` <CAE17a0WfpKHxN=3iF6F+P0nT4UbyF9Lbpm8R5+JJEquQOmudXw@mail.gmail.com>
@ 2012-10-26 18:40   ` Ian Campbell
  0 siblings, 0 replies; 32+ messages in thread
From: Ian Campbell @ 2012-10-26 18:40 UTC (permalink / raw)
  To: Mauro; +Cc: 599161@bugs.debian.org, xen-devel@lists.xen.org

On Fri, 2012-10-26 at 19:25 +0100, Mauro wrote:
> On 26 October 2012 14:59, Ian Campbell <ijc@hellion.org.uk> wrote:
> > Hi all,
> >
> > I've BCC'd a number of people who have reported seeing this bug at
> > various times in the past.
> >
> > If you can still repro I'd appreciate it if you could give the patch in
> > http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also attached) a go
> > and report back success/failure and the output of the debugging messages
> > produced.
> 
> Is that patch for amd64 architectures?

It is for 32 and 64 bit x86, so yes.

BTW, you were the original recipient of this patch in the thread linked
above.

Ian.
-- 
Ian Campbell


"I'd love to go out with you, but I'm having all my plants neutered."

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
@ 2012-11-07 10:10 Philippe.Simonet
  2012-11-07 13:22 ` Ian Campbell
  2012-11-07 17:07 ` Jan Beulich
  0 siblings, 2 replies; 32+ messages in thread
From: Philippe.Simonet @ 2012-11-07 10:10 UTC (permalink / raw)
  To: ijc, xen-devel; +Cc: 599161, mrsanna1

Hi Ian

i compiled  a patched hypervisor for Mauro, it is running since many days and the overflow occured, 
without clock jumps

> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> tsc_stamp=e3839fcb0273

(below is the complete xm dmesg output)

did that help you ? do you need more info ? 

thanks and regards

Philippe


> -----Original Message-----
> From: Mauro [mailto:mrsanna1@gmail.com]
> Sent: Wednesday, November 07, 2012 10:12 AM
> To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE
> Subject: Re: [Xen-devel] [Xen-users] Re: Xen 4 TSC problems
> 
> Hello, no news until now there aren't clock jumps.
> Here is xm dmesg:
> 
> xm dmesg
> (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4) (ultrotter@debian.org) (gcc
> version 4.4.5 (Debian 4.4.5-8) ) Mon Oct 29 14:42:12 CET 2012
> (XEN) Bootloader: GRUB 1.98+20100804-14+squeeze1
> (XEN) Command line: placeholder dom0_mem=3072M loglvl=warning
> guest_loglvl=warning
> (XEN) Video information:
> (XEN)  VGA is text mode 80x25, font 8x16
> (XEN)  VBE/DDC methods: V2; EDID transfer time: 2 seconds
> (XEN) Disc information:
> (XEN)  Found 2 MBR signatures
> (XEN)  Found 2 EDD information structures
> (XEN) Xen-e820 RAM map:
> (XEN)  0000000000000000 - 000000000009f400 (usable)
> (XEN)  000000000009f400 - 00000000000a0000 (reserved)
> (XEN)  00000000000f0000 - 0000000000100000 (reserved)
> (XEN)  0000000000100000 - 00000000cfd43000 (usable)
> (XEN)  00000000cfd43000 - 00000000cfd4c000 (ACPI data)
> (XEN)  00000000cfd4c000 - 00000000cfd4d000 (usable)
> (XEN)  00000000cfd4d000 - 00000000d0000000 (reserved)
> (XEN)  00000000e0000000 - 00000000f0000000 (reserved)
> (XEN)  00000000fec00000 - 00000000fed00000 (reserved)
> (XEN)  00000000fee00000 - 00000000fee10000 (reserved)
> (XEN)  00000000ffc00000 - 0000000100000000 (reserved)
> (XEN)  0000000100000000 - 000000102ffff000 (usable)
> (XEN) ACPI: RSDP 000F4F20, 0024 (r2 HP    )
> (XEN) ACPI: XSDT CFD43900, 007C (r1 HP     ProLiant        2         162E)
> (XEN) ACPI: FACP CFD439C0, 00F4 (r3 HP     ProLiant        2         162E)
> (XEN) ACPI: DSDT CFD43AC0, 30C9 (r1 HP         DSDT        1 INTL 20030228)
> (XEN) ACPI: FACS CFD43100, 0040
> (XEN) ACPI: SPCR CFD43140, 0050 (r1 HP     SPCRRBSU        1         162E)
> (XEN) ACPI: MCFG CFD431C0, 003C (r1 HP     ProLiant        1             0)
> (XEN) ACPI: HPET CFD43200, 0038 (r1 HP     ProLiant        2         162E)
> (XEN) ACPI: FFFF CFD43240, 0064 (r2 HP     P61             2         162E)
> (XEN) ACPI: SPMI CFD432C0, 0040 (r5 HP     ProLiant        1         162E)
> (XEN) ACPI: ERST CFD43300, 01D0 (r1 HP     ProLiant        1         162E)
> (XEN) ACPI: APIC CFD43500, 0176 (r1 HP     ProLiant        2             0)
> (XEN) ACPI: FFFF CFD43680, 0176 (r1 HP     ProLiant        1         162E)
> (XEN) ACPI: BERT CFD43800, 0030 (r1 HP     ProLiant        1         162E)
> (XEN) ACPI: HEST CFD43840, 00BC (r1 HP     ProLiant        1         162E)
> (XEN) System RAM: 65532MB (67105672kB)
> (XEN) Domain heap initialised
> (XEN) Processor #0 6:15 APIC version 20
> (XEN) Processor #8 6:15 APIC version 20
> (XEN) Processor #16 6:15 APIC version 20
> (XEN) Processor #24 6:15 APIC version 20
> (XEN) Processor #1 6:15 APIC version 20
> (XEN) Processor #9 6:15 APIC version 20
> (XEN) Processor #17 6:15 APIC version 20
> (XEN) Processor #25 6:15 APIC version 20
> (XEN) Processor #2 6:15 APIC version 20
> (XEN) Processor #10 6:15 APIC version 20
> (XEN) Processor #18 6:15 APIC version 20
> (XEN) Processor #26 6:15 APIC version 20
> (XEN) Processor #3 6:15 APIC version 20
> (XEN) Processor #11 6:15 APIC version 20
> (XEN) Processor #19 6:15 APIC version 20
> (XEN) Processor #27 6:15 APIC version 20
> (XEN) IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23
> (XEN) IOAPIC[1]: apic_id 2, version 32, address 0xfec80000, GSI 24-47
> (XEN) IOAPIC[2]: apic_id 3, version 32, address 0xfec81000, GSI 48-71
> (XEN) IOAPIC[3]: apic_id 4, version 32, address 0xfec81800, GSI 72-95
> (XEN) Enabling APIC mode:  Phys.  Using 4 I/O APICs
> (XEN) Using scheduler: SMP Credit Scheduler (credit)
> (XEN) Detected 2400.128 MHz processor.
> (XEN) Initing memory sharing.
> (XEN) VMX: Supported advanced features:
> (XEN)  - APIC MMIO access virtualisation
> (XEN)  - APIC TPR shadow
> (XEN)  - Virtual NMI
> (XEN)  - MSR direct-access bitmap
> (XEN) HVM: ASIDs disabled.
> (XEN) HVM: VMX enabled
> (XEN) I/O virtualisation disabled
> (XEN) Total of 16 processors activated.
> (XEN) ENABLING IO-APIC IRQs
> (XEN)  -> Using new ACK method
> (XEN) checking TSC synchronization across 16 CPUs:
> (XEN) CPU#14 had 3 usecs TSC skew, fixed it up.
> (XEN) Platform timer is 14.318MHz HPET
> (XEN) Allocated console ring of 32 KiB.
> (XEN) Brought up 16 CPUs
> (XEN) *** LOADING DOMAIN 0 ***
> (XEN)  Xen  kernel: 64-bit, lsb, compat32
> (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x1708000
> (XEN) PHYSICAL MEMORY ARRANGEMENT:
> (XEN)  Dom0 alloc.:   000000083c000000->0000000840000000 (770048 pages
> to be allocated)
> (XEN) VIRTUAL MEMORY ARRANGEMENT:
> (XEN)  Loaded kernel: ffffffff81000000->ffffffff81708000
> (XEN)  Init. ramdisk: ffffffff81708000->ffffffff81efb000
> (XEN)  Phys-Mach map: ffffffff81efb000->ffffffff824fb000
> (XEN)  Start info:    ffffffff824fb000->ffffffff824fb4b4
> (XEN)  Page tables:   ffffffff824fc000->ffffffff82513000
> (XEN)  Boot stack:    ffffffff82513000->ffffffff82514000
> (XEN)  TOTAL:         ffffffff80000000->ffffffff82800000
> (XEN)  ENTRY ADDRESS: ffffffff81531200
> (XEN) Dom0 has maximum 16 VCPUs
> (XEN) Scrubbing Free RAM:
> .....................................................................................................................
> .....................................................................................................................
> .....................................................................................................................
> .....................................................................................................................
> .....................................................................................................................
> ................................done.
> (XEN) Xen trace buffers: disabled
> (XEN) Std. Loglevel: Errors and warnings
> (XEN) Guest Loglevel: Errors and warnings
> (XEN) Xen is relinquishing VGA console.
> (XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch input to
> Xen)
> (XEN) Freed 176kB init memory.
> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> tsc_stamp=e3839fcb0273
> 
> 





> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
> bounces@lists.xen.org] On Behalf Of Ian Campbell
> Sent: Friday, October 26, 2012 3:00 PM
> To: xen-devel@lists.xen.org
> Cc: 599161@bugs.debian.org
> Subject: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50
> minutes" bug.
> 
> Hi all,
> 
> I've BCC'd a number of people who have reported seeing this bug at various
> times in the past.
> 
> If you can still repro I'd appreciate it if you could give the patch in
> http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also attached) a
> go and report back success/failure and the output of the debugging
> messages produced.
> 
> Thanks,
> Ian.
> 
> --
> Ian Campbell
> Current Noise: Death - Evil Dead
> 
> Executive ability is prominent in your make-up.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-07 10:10 Philippe.Simonet
@ 2012-11-07 13:22 ` Ian Campbell
  2012-11-07 17:40   ` Keir Fraser
  2012-11-07 17:07 ` Jan Beulich
  1 sibling, 1 reply; 32+ messages in thread
From: Ian Campbell @ 2012-11-07 13:22 UTC (permalink / raw)
  To: Philippe.Simonet; +Cc: 599161, Keir Fraser, mrsanna1, xen-devel

On Wed, 2012-11-07 at 10:10 +0000, Philippe.Simonet@swisscom.com wrote:
> Hi Ian

Thanks for doing this test.

> i compiled  a patched hypervisor for Mauro, it is running since many
> days and the overflow occured, without clock jumps

So just to be clear you saw this logging occur *without* the 50 minute
jump in time? That's good!

> > (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
> > now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> > plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> > tsc_stamp=e3839fcb0273
> 
> (below is the complete xm dmesg output)
> 
> did that help you ? do you need more info ? 

I'll leave this to Keir (who wrote the debugging patch) to answer but it
looks to me like it should be useful!

Thanks again.

Ian.

> thanks and regards
> 
> Philippe
> 
> 
> > -----Original Message-----
> > From: Mauro [mailto:mrsanna1@gmail.com]
> > Sent: Wednesday, November 07, 2012 10:12 AM
> > To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE
> > Subject: Re: [Xen-devel] [Xen-users] Re: Xen 4 TSC problems
> > 
> > Hello, no news until now there aren't clock jumps.
> > Here is xm dmesg:
> > 
> > xm dmesg
> > (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4) (ultrotter@debian.org) (gcc
> > version 4.4.5 (Debian 4.4.5-8) ) Mon Oct 29 14:42:12 CET 2012
> > (XEN) Bootloader: GRUB 1.98+20100804-14+squeeze1
> > (XEN) Command line: placeholder dom0_mem=3072M loglvl=warning
> > guest_loglvl=warning
> > (XEN) Video information:
> > (XEN)  VGA is text mode 80x25, font 8x16
> > (XEN)  VBE/DDC methods: V2; EDID transfer time: 2 seconds
> > (XEN) Disc information:
> > (XEN)  Found 2 MBR signatures
> > (XEN)  Found 2 EDD information structures
> > (XEN) Xen-e820 RAM map:
> > (XEN)  0000000000000000 - 000000000009f400 (usable)
> > (XEN)  000000000009f400 - 00000000000a0000 (reserved)
> > (XEN)  00000000000f0000 - 0000000000100000 (reserved)
> > (XEN)  0000000000100000 - 00000000cfd43000 (usable)
> > (XEN)  00000000cfd43000 - 00000000cfd4c000 (ACPI data)
> > (XEN)  00000000cfd4c000 - 00000000cfd4d000 (usable)
> > (XEN)  00000000cfd4d000 - 00000000d0000000 (reserved)
> > (XEN)  00000000e0000000 - 00000000f0000000 (reserved)
> > (XEN)  00000000fec00000 - 00000000fed00000 (reserved)
> > (XEN)  00000000fee00000 - 00000000fee10000 (reserved)
> > (XEN)  00000000ffc00000 - 0000000100000000 (reserved)
> > (XEN)  0000000100000000 - 000000102ffff000 (usable)
> > (XEN) ACPI: RSDP 000F4F20, 0024 (r2 HP    )
> > (XEN) ACPI: XSDT CFD43900, 007C (r1 HP     ProLiant        2         162E)
> > (XEN) ACPI: FACP CFD439C0, 00F4 (r3 HP     ProLiant        2         162E)
> > (XEN) ACPI: DSDT CFD43AC0, 30C9 (r1 HP         DSDT        1 INTL 20030228)
> > (XEN) ACPI: FACS CFD43100, 0040
> > (XEN) ACPI: SPCR CFD43140, 0050 (r1 HP     SPCRRBSU        1         162E)
> > (XEN) ACPI: MCFG CFD431C0, 003C (r1 HP     ProLiant        1             0)
> > (XEN) ACPI: HPET CFD43200, 0038 (r1 HP     ProLiant        2         162E)
> > (XEN) ACPI: FFFF CFD43240, 0064 (r2 HP     P61             2         162E)
> > (XEN) ACPI: SPMI CFD432C0, 0040 (r5 HP     ProLiant        1         162E)
> > (XEN) ACPI: ERST CFD43300, 01D0 (r1 HP     ProLiant        1         162E)
> > (XEN) ACPI: APIC CFD43500, 0176 (r1 HP     ProLiant        2             0)
> > (XEN) ACPI: FFFF CFD43680, 0176 (r1 HP     ProLiant        1         162E)
> > (XEN) ACPI: BERT CFD43800, 0030 (r1 HP     ProLiant        1         162E)
> > (XEN) ACPI: HEST CFD43840, 00BC (r1 HP     ProLiant        1         162E)
> > (XEN) System RAM: 65532MB (67105672kB)
> > (XEN) Domain heap initialised
> > (XEN) Processor #0 6:15 APIC version 20
> > (XEN) Processor #8 6:15 APIC version 20
> > (XEN) Processor #16 6:15 APIC version 20
> > (XEN) Processor #24 6:15 APIC version 20
> > (XEN) Processor #1 6:15 APIC version 20
> > (XEN) Processor #9 6:15 APIC version 20
> > (XEN) Processor #17 6:15 APIC version 20
> > (XEN) Processor #25 6:15 APIC version 20
> > (XEN) Processor #2 6:15 APIC version 20
> > (XEN) Processor #10 6:15 APIC version 20
> > (XEN) Processor #18 6:15 APIC version 20
> > (XEN) Processor #26 6:15 APIC version 20
> > (XEN) Processor #3 6:15 APIC version 20
> > (XEN) Processor #11 6:15 APIC version 20
> > (XEN) Processor #19 6:15 APIC version 20
> > (XEN) Processor #27 6:15 APIC version 20
> > (XEN) IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23
> > (XEN) IOAPIC[1]: apic_id 2, version 32, address 0xfec80000, GSI 24-47
> > (XEN) IOAPIC[2]: apic_id 3, version 32, address 0xfec81000, GSI 48-71
> > (XEN) IOAPIC[3]: apic_id 4, version 32, address 0xfec81800, GSI 72-95
> > (XEN) Enabling APIC mode:  Phys.  Using 4 I/O APICs
> > (XEN) Using scheduler: SMP Credit Scheduler (credit)
> > (XEN) Detected 2400.128 MHz processor.
> > (XEN) Initing memory sharing.
> > (XEN) VMX: Supported advanced features:
> > (XEN)  - APIC MMIO access virtualisation
> > (XEN)  - APIC TPR shadow
> > (XEN)  - Virtual NMI
> > (XEN)  - MSR direct-access bitmap
> > (XEN) HVM: ASIDs disabled.
> > (XEN) HVM: VMX enabled
> > (XEN) I/O virtualisation disabled
> > (XEN) Total of 16 processors activated.
> > (XEN) ENABLING IO-APIC IRQs
> > (XEN)  -> Using new ACK method
> > (XEN) checking TSC synchronization across 16 CPUs:
> > (XEN) CPU#14 had 3 usecs TSC skew, fixed it up.
> > (XEN) Platform timer is 14.318MHz HPET
> > (XEN) Allocated console ring of 32 KiB.
> > (XEN) Brought up 16 CPUs
> > (XEN) *** LOADING DOMAIN 0 ***
> > (XEN)  Xen  kernel: 64-bit, lsb, compat32
> > (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x1708000
> > (XEN) PHYSICAL MEMORY ARRANGEMENT:
> > (XEN)  Dom0 alloc.:   000000083c000000->0000000840000000 (770048 pages
> > to be allocated)
> > (XEN) VIRTUAL MEMORY ARRANGEMENT:
> > (XEN)  Loaded kernel: ffffffff81000000->ffffffff81708000
> > (XEN)  Init. ramdisk: ffffffff81708000->ffffffff81efb000
> > (XEN)  Phys-Mach map: ffffffff81efb000->ffffffff824fb000
> > (XEN)  Start info:    ffffffff824fb000->ffffffff824fb4b4
> > (XEN)  Page tables:   ffffffff824fc000->ffffffff82513000
> > (XEN)  Boot stack:    ffffffff82513000->ffffffff82514000
> > (XEN)  TOTAL:         ffffffff80000000->ffffffff82800000
> > (XEN)  ENTRY ADDRESS: ffffffff81531200
> > (XEN) Dom0 has maximum 16 VCPUs
> > (XEN) Scrubbing Free RAM:
> > .....................................................................................................................
> > .....................................................................................................................
> > .....................................................................................................................
> > .....................................................................................................................
> > .....................................................................................................................
> > ................................done.
> > (XEN) Xen trace buffers: disabled
> > (XEN) Std. Loglevel: Errors and warnings
> > (XEN) Guest Loglevel: Errors and warnings
> > (XEN) Xen is relinquishing VGA console.
> > (XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch input to
> > Xen)
> > (XEN) Freed 176kB init memory.
> > (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
> > now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> > plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> > tsc_stamp=e3839fcb0273
> > 
> > 
> 
> 
> 
> 
> 
> > -----Original Message-----
> > From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
> > bounces@lists.xen.org] On Behalf Of Ian Campbell
> > Sent: Friday, October 26, 2012 3:00 PM
> > To: xen-devel@lists.xen.org
> > Cc: 599161@bugs.debian.org
> > Subject: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50
> > minutes" bug.
> > 
> > Hi all,
> > 
> > I've BCC'd a number of people who have reported seeing this bug at various
> > times in the past.
> > 
> > If you can still repro I'd appreciate it if you could give the patch in
> > http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also attached) a
> > go and report back success/failure and the output of the debugging
> > messages produced.
> > 
> > Thanks,
> > Ian.
> > 
> > --
> > Ian Campbell
> > Current Noise: Death - Evil Dead
> > 
> > Executive ability is prominent in your make-up.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-07 10:10 Philippe.Simonet
  2012-11-07 13:22 ` Ian Campbell
@ 2012-11-07 17:07 ` Jan Beulich
  1 sibling, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2012-11-07 17:07 UTC (permalink / raw)
  To: Philippe.Simonet, Keir Fraser; +Cc: 599161, xen-devel, mrsanna1, Ian Campbell

>>> On 07.11.12 at 11:10, <Philippe.Simonet@swisscom.com> wrote:
> i compiled  a patched hypervisor for Mauro, it is running since many days 
> and the overflow occured, 
> without clock jumps
> 
>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
>> tsc_stamp=e3839fcb0273

i.e. we have (in order of time)

 plt_wrap=5ece12d09306
      now=5ece12d16292
  plt_now=5ece12d34128

which is exactly the inverse order of how things should be (now not
necessarily being in the middle). Nor should plt_now and plt_wrap be
that close together. So far I have no idea how this can be explained.

Jan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-07 13:22 ` Ian Campbell
@ 2012-11-07 17:40   ` Keir Fraser
  2012-11-08  9:39     ` Jan Beulich
                       ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Keir Fraser @ 2012-11-07 17:40 UTC (permalink / raw)
  To: Ian Campbell, Philippe.Simonet; +Cc: 599161, mrsanna1, xen-devel

On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote:

>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
>>> tsc_stamp=e3839fcb0273
>> 
>> (below is the complete xm dmesg output)
>> 
>> did that help you ? do you need more info ?
> 
> I'll leave this to Keir (who wrote the debugging patch) to answer but it
> looks to me like it should be useful!

I'm scratching my head. plt_wrap is earlier than plt_now, which should be
impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp. That
seems very very improbable!

I wonder whether the overflow handling should just be removed, or made
conditional on a command-line parameter, or on the 32-bit platform counter
being at least somewhat likely to overflow before a softirq occurs -- it
seems lots of systems are using 14MHz HPET, and that gives us a couple of
minutes for the plt_overflow softirq to do its work before overflow occurs.
I think we would notice that outage in other ways. :)

 -- Keir

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-07 17:40   ` Keir Fraser
@ 2012-11-08  9:39     ` Jan Beulich
  2012-11-08 10:38       ` Keir Fraser
                         ` (2 more replies)
  2012-11-08 11:43     ` Ian Campbell
  2012-11-08 14:20     ` Keir Fraser
  2 siblings, 3 replies; 32+ messages in thread
From: Jan Beulich @ 2012-11-08  9:39 UTC (permalink / raw)
  To: Philippe.Simonet, Keir Fraser; +Cc: 599161, xen-devel, mrsanna1, Ian Campbell

>>> On 07.11.12 at 18:40, Keir Fraser <keir@xen.org> wrote:
> On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote:
> 
>>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
>>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
>>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
>>>> tsc_stamp=e3839fcb0273
>>> 
>>> (below is the complete xm dmesg output)
>>> 
>>> did that help you ? do you need more info ?
>> 
>> I'll leave this to Keir (who wrote the debugging patch) to answer but it
>> looks to me like it should be useful!
> 
> I'm scratching my head. plt_wrap is earlier than plt_now, which should be
> impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp. That
> seems very very improbable!

Is it? My understanding was that plt_stamp64 is just a software
extension to the more narrow HW counter, and hence the low
plt_mask bits would always be expected to be identical.

The plt_wrap < plt_now thing of course is entirely unexplainable
to me too: Considering that plt_scale doesn't change at all post-
boot, apart from memory corruption I could only see an memory
access ordering problem to be the reason (platform_timer_stamp
and/or stime_platform_stamp changing despite platform_timer_lock
being held. So maybe taking a snapshot of all three static values
involved in the calculation in __read_platform_stime() between
acquiring the lock and the first call to __read_platform_stime(),
and printing them together with the "live" values in a second
printk() after the one your original patch added could rule that
out.

But the box doesn't even seem to be NUMA (of course it also
doesn't help that the log level was kept restricted - hint, hint,
Philippe), not does there appear to be any S3 cycle or pCPU
bring-up/-down in between...

Philippe, could you clarify again what CPU model(s) this is being
observed on (the long times between individual steps forward
with this problem perhaps warrant repeating the basics each
time, as it's otherwise quite cumbersome to always look up old
pieces of information).

> I wonder whether the overflow handling should just be removed, or made
> conditional on a command-line parameter, or on the 32-bit platform counter
> being at least somewhat likely to overflow before a softirq occurs -- it
> seems lots of systems are using 14MHz HPET, and that gives us a couple of
> minutes for the plt_overflow softirq to do its work before overflow occurs.
> I think we would notice that outage in other ways. :)

Iirc we added this for a good reason - to cover the, however
unlikely, event of Xen running for very long without preemption.
Presumably most of the cases got fixed meanwhile, and indeed
a wraparound time on the order of minutes should make this
superfluous, but as the case here shows that code did spot a
severe anomaly (whatever that may turn out to be).

Also recall that there are HPET implementations around that tick
at a much higher frequency than 14MHz.

So unless we finally reach the understanding that the code is
flawed, I would rather want to keep it.

Jan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-08  9:39     ` Jan Beulich
@ 2012-11-08 10:38       ` Keir Fraser
  2012-11-08 13:53         ` Jan Beulich
  2012-11-08 13:47       ` Philippe.Simonet
  2012-11-08 16:45       ` Tim Deegan
  2 siblings, 1 reply; 32+ messages in thread
From: Keir Fraser @ 2012-11-08 10:38 UTC (permalink / raw)
  To: Jan Beulich, Philippe.Simonet; +Cc: 599161, xen-devel, mrsanna1, Ian Campbell

On 08/11/2012 09:39, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
>>>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
>>>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
>>>>> tsc_stamp=e3839fcb0273
>>>> 
>>>> (below is the complete xm dmesg output)
>>>> 
>>>> did that help you ? do you need more info ?
>>> 
>>> I'll leave this to Keir (who wrote the debugging patch) to answer but it
>>> looks to me like it should be useful!
>> 
>> I'm scratching my head. plt_wrap is earlier than plt_now, which should be
>> impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp. That
>> seems very very improbable!
> 
> Is it? My understanding was that plt_stamp64 is just a software
> extension to the more narrow HW counter, and hence the low
> plt_mask bits would always be expected to be identical.

No, plt_stamp is simply the HW counter time at which plt_stamp64 was last
brought up to date. Hence plt_stamp64 is updated as:
 plt_stamp64 += (new_stamp - old_stamp) & plt_mask;

Hence why seeing plt_stamp64&plt_mask == new_stamp is very unexpected!

 -- Keir

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-07 17:40   ` Keir Fraser
  2012-11-08  9:39     ` Jan Beulich
@ 2012-11-08 11:43     ` Ian Campbell
  2012-11-08 14:20     ` Keir Fraser
  2 siblings, 0 replies; 32+ messages in thread
From: Ian Campbell @ 2012-11-08 11:43 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Philippe.Simonet@swisscom.com, 599161@bugs.debian.org,
	mrsanna1@gmail.com, xen-devel@lists.xen.org

On Wed, 2012-11-07 at 17:40 +0000, Keir Fraser wrote:
> On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote:
> 
> >>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
> >>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> >>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> >>> tsc_stamp=e3839fcb0273
> >> 
> >> (below is the complete xm dmesg output)
> >> 
> >> did that help you ? do you need more info ?
> > 
> > I'll leave this to Keir (who wrote the debugging patch) to answer but it
> > looks to me like it should be useful!
> 
> I'm scratching my head. plt_wrap is earlier than plt_now, which should be
> impossible.

impossible due to guarantees made by the h/w or by construction in Xen.
There appears to be a certain amount of hardware-specificness to the
issue -- so I'm wondering if maybe there are some platforms whose tsc is
not as monotonically increasing as it needs to be...

>  plt_stamp64 oddly has low 32 bits identical to new_stamp. That
> seems very very improbable!

Does this code run on all cpus or just one? Is it always the same one?

> I wonder whether the overflow handling should just be removed, or made
> conditional on a command-line parameter, or on the 32-bit platform counter
> being at least somewhat likely to overflow before a softirq occurs -- it
> seems lots of systems are using 14MHz HPET, and that gives us a couple of
> minutes for the plt_overflow softirq to do its work before overflow occurs.
> I think we would notice that outage in other ways. :)
> 
>  -- Keir
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
       [not found] <1352375011.12977.95.camel@hastur.hellion.org.uk>
@ 2012-11-08 12:54 ` Keir Fraser
  2012-11-08 14:28   ` Ian Campbell
  0 siblings, 1 reply; 32+ messages in thread
From: Keir Fraser @ 2012-11-08 12:54 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Philippe.Simonet@swisscom.com, 599161@bugs.debian.org,
	mrsanna1@gmail.com, xen-devel@lists.xen.org

On 08/11/2012 11:43, "Ian Campbell" <ijc@hellion.org.uk> wrote:

>>> I'll leave this to Keir (who wrote the debugging patch) to answer but it
>>> looks to me like it should be useful!
>> 
>> I'm scratching my head. plt_wrap is earlier than plt_now, which should be
>> impossible.
> 
> impossible due to guarantees made by the h/w or by construction in Xen.

That's a question, right? By construction in Xen.

> There appears to be a certain amount of hardware-specificness to the
> issue -- so I'm wondering if maybe there are some platforms whose tsc is
> not as monotonically increasing as it needs to be...

plt_* timestamps are not derived from TSC at all.

>>  plt_stamp64 oddly has low 32 bits identical to new_stamp. That
>> seems very very improbable!
> 
> Does this code run on all cpus or just one? Is it always the same one?

Always cpu0.

 -- Keir

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-08  9:39     ` Jan Beulich
  2012-11-08 10:38       ` Keir Fraser
@ 2012-11-08 13:47       ` Philippe.Simonet
  2012-11-08 14:29         ` Ian Campbell
                           ` (2 more replies)
  2012-11-08 16:45       ` Tim Deegan
  2 siblings, 3 replies; 32+ messages in thread
From: Philippe.Simonet @ 2012-11-08 13:47 UTC (permalink / raw)
  To: mrsanna1; +Cc: keir, 599161, xen-devel, JBeulich, ijc

Hi Mauro, 

that's a question for you : 

> Philippe, could you clarify again what CPU model(s) this is being observed on
> (the long times between individual steps forward with this problem perhaps
> warrant repeating the basics each time, as it's otherwise quite cumbersome
> to always look up old pieces of information).

can you provide this information ? 
	cat /proc/cpuinfo 	
	cat /proc/meminfo
	hardware information (manufacturer, model, urls, ...)

Thanks, Philippe


> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, November 08, 2012 10:40 AM
> To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE; Keir Fraser
> Cc: 599161@bugs.debian.org; mrsanna1@gmail.com; Ian Campbell; xen-
> devel@lists.xen.org
> Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50
> minutes" bug.
> 
> >>> On 07.11.12 at 18:40, Keir Fraser <keir@xen.org> wrote:
> > On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote:
> >
> >>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128
> plt_wrap=5ece12d09306
> >>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> >>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> >>>> tsc_stamp=e3839fcb0273
> >>>
> >>> (below is the complete xm dmesg output)
> >>>
> >>> did that help you ? do you need more info ?
> >>
> >> I'll leave this to Keir (who wrote the debugging patch) to answer but
> >> it looks to me like it should be useful!
> >
> > I'm scratching my head. plt_wrap is earlier than plt_now, which should
> > be impossible. plt_stamp64 oddly has low 32 bits identical to
> > new_stamp. That seems very very improbable!
> 
> Is it? My understanding was that plt_stamp64 is just a software extension to
> the more narrow HW counter, and hence the low plt_mask bits would always
> be expected to be identical.
> 
> The plt_wrap < plt_now thing of course is entirely unexplainable to me too:
> Considering that plt_scale doesn't change at all post- boot, apart from
> memory corruption I could only see an memory access ordering problem to
> be the reason (platform_timer_stamp and/or stime_platform_stamp
> changing despite platform_timer_lock being held. So maybe taking a
> snapshot of all three static values involved in the calculation in
> __read_platform_stime() between acquiring the lock and the first call to
> __read_platform_stime(), and printing them together with the "live" values
> in a second
> printk() after the one your original patch added could rule that out.
> 
> But the box doesn't even seem to be NUMA (of course it also doesn't help
> that the log level was kept restricted - hint, hint, Philippe), not does there
> appear to be any S3 cycle or pCPU bring-up/-down in between...
> 
> Philippe, could you clarify again what CPU model(s) this is being observed on
> (the long times between individual steps forward with this problem perhaps
> warrant repeating the basics each time, as it's otherwise quite cumbersome
> to always look up old pieces of information).
> 
> > I wonder whether the overflow handling should just be removed, or made
> > conditional on a command-line parameter, or on the 32-bit platform
> > counter being at least somewhat likely to overflow before a softirq
> > occurs -- it seems lots of systems are using 14MHz HPET, and that
> > gives us a couple of minutes for the plt_overflow softirq to do its work
> before overflow occurs.
> > I think we would notice that outage in other ways. :)
> 
> Iirc we added this for a good reason - to cover the, however unlikely, event
> of Xen running for very long without preemption.
> Presumably most of the cases got fixed meanwhile, and indeed a
> wraparound time on the order of minutes should make this superfluous, but
> as the case here shows that code did spot a severe anomaly (whatever that
> may turn out to be).
> 
> Also recall that there are HPET implementations around that tick at a much
> higher frequency than 14MHz.
> 
> So unless we finally reach the understanding that the code is flawed, I would
> rather want to keep it.
> 
> Jan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-08 10:38       ` Keir Fraser
@ 2012-11-08 13:53         ` Jan Beulich
  2012-11-08 14:04           ` Keir Fraser
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2012-11-08 13:53 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Philippe.Simonet, 599161, xen-devel, mrsanna1, Ian Campbell

>>> On 08.11.12 at 11:38, Keir Fraser <keir@xen.org> wrote:
> On 08/11/2012 09:39, "Jan Beulich" <JBeulich@suse.com> wrote:
> 
>>>>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
>>>>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
>>>>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
>>>>>> tsc_stamp=e3839fcb0273
>>>>> 
>>>>> (below is the complete xm dmesg output)
>>>>> 
>>>>> did that help you ? do you need more info ?
>>>> 
>>>> I'll leave this to Keir (who wrote the debugging patch) to answer but it
>>>> looks to me like it should be useful!
>>> 
>>> I'm scratching my head. plt_wrap is earlier than plt_now, which should be
>>> impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp. That
>>> seems very very improbable!
>> 
>> Is it? My understanding was that plt_stamp64 is just a software
>> extension to the more narrow HW counter, and hence the low
>> plt_mask bits would always be expected to be identical.
> 
> No, plt_stamp is simply the HW counter time at which plt_stamp64 was last
> brought up to date. Hence plt_stamp64 is updated as:
>  plt_stamp64 += (new_stamp - old_stamp) & plt_mask;

I concur: Given that what old_stamp is here was new_stamp for
the last update, we should simply have

stamp64 = s0 + (s1 - s0) + (s2 - s1) + ...

(of course with the mask applied on each addend), which (for the
low bits) is the same as just new_stamp.

Jan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-08 13:53         ` Jan Beulich
@ 2012-11-08 14:04           ` Keir Fraser
  0 siblings, 0 replies; 32+ messages in thread
From: Keir Fraser @ 2012-11-08 14:04 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Philippe.Simonet, 599161, xen-devel, mrsanna1, Ian Campbell

On 08/11/2012 13:53, "Jan Beulich" <JBeulich@suse.com> wrote:

>>> Is it? My understanding was that plt_stamp64 is just a software
>>> extension to the more narrow HW counter, and hence the low
>>> plt_mask bits would always be expected to be identical.
>> 
>> No, plt_stamp is simply the HW counter time at which plt_stamp64 was last
>> brought up to date. Hence plt_stamp64 is updated as:
>>  plt_stamp64 += (new_stamp - old_stamp) & plt_mask;
> 
> I concur

Well, no, you don't really. You're about to point out the flaw in my
reasoning...

> : Given that what old_stamp is here was new_stamp for
> the last update, we should simply have
> 
> stamp64 = s0 + (s1 - s0) + (s2 - s1) + ...
> 
> (of course with the mask applied on each addend), which (for the
> low bits) is the same as just new_stamp.

Very good point. Silly me. Then the observed value of plt_stamp64 makes
perfect sense.

 -- Keir

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-07 17:40   ` Keir Fraser
  2012-11-08  9:39     ` Jan Beulich
  2012-11-08 11:43     ` Ian Campbell
@ 2012-11-08 14:20     ` Keir Fraser
  2 siblings, 0 replies; 32+ messages in thread
From: Keir Fraser @ 2012-11-08 14:20 UTC (permalink / raw)
  To: Ian Campbell, Philippe.Simonet; +Cc: 599161, mrsanna1, xen-devel

On 07/11/2012 17:40, "Keir Fraser" <keir@xen.org> wrote:

> On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote:
> 
>>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
>>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
>>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
>>>> tsc_stamp=e3839fcb0273
>>> 
>>> (below is the complete xm dmesg output)
>>> 
>>> did that help you ? do you need more info ?
>> 
>> I'll leave this to Keir (who wrote the debugging patch) to answer but it
>> looks to me like it should be useful!
> 
> I'm scratching my head. plt_wrap is earlier than plt_now, which should be
> impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp. That
> seems very very improbable!

Jan has pointed out that the value of plt_stamp64 makes perfect sense, and
will in fact always have low 32 bits identical to new_stamp. At least that
is explained.

So, the question is then why plt_now (== __read_platform_stime(15b800366a5))
is greater than plt_wrap (== __read_platform_stime(15c800366a5)). Perhaps
the scale_delta() logic is failing for some reason, but we do use it a lot
elsewhere!

 -- Keir

> I wonder whether the overflow handling should just be removed, or made
> conditional on a command-line parameter, or on the 32-bit platform counter
> being at least somewhat likely to overflow before a softirq occurs -- it
> seems lots of systems are using 14MHz HPET, and that gives us a couple of
> minutes for the plt_overflow softirq to do its work before overflow occurs.
> I think we would notice that outage in other ways. :)
> 
>  -- Keir
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-08 12:54 ` Keir Fraser
@ 2012-11-08 14:28   ` Ian Campbell
  2012-11-08 14:34     ` Keir Fraser
  0 siblings, 1 reply; 32+ messages in thread
From: Ian Campbell @ 2012-11-08 14:28 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Philippe.Simonet@swisscom.com, 599161@bugs.debian.org,
	mrsanna1@gmail.com, xen-devel@lists.xen.org

On Thu, 2012-11-08 at 12:54 +0000, Keir Fraser wrote:
> On 08/11/2012 11:43, "Ian Campbell" <ijc@hellion.org.uk> wrote:
> 
> >>> I'll leave this to Keir (who wrote the debugging patch) to answer but it
> >>> looks to me like it should be useful!
> >> 
> >> I'm scratching my head. plt_wrap is earlier than plt_now, which should be
> >> impossible.
> > 
> > impossible due to guarantees made by the h/w or by construction in Xen.
> 
> That's a question, right? 

Yes, sorry.

> By construction in Xen.
> 
> > There appears to be a certain amount of hardware-specificness to the
> > issue -- so I'm wondering if maybe there are some platforms whose tsc is
> > not as monotonically increasing as it needs to be...
> 
> plt_* timestamps are not derived from TSC at all.

I see, rather it is derived from the platform_timesource which could be
HPET, pmtimer, pit etc but in this case (according to the provided xm
dmesg) appears to be a 14MHz HPET.

So I guess s/tsc/HPET/ in my original thought...

> >>  plt_stamp64 oddly has low 32 bits identical to new_stamp. That
> >> seems very very improbable!
> > 
> > Does this code run on all cpus or just one? Is it always the same one?
> 
> Always cpu0.

So it's not cross cpu drift then.

Ian.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-08 13:47       ` Philippe.Simonet
@ 2012-11-08 14:29         ` Ian Campbell
  2012-11-08 15:21         ` Mauro
       [not found]         ` <1352384968.12977.104.camel@hastur.hellion.org.uk>
  2 siblings, 0 replies; 32+ messages in thread
From: Ian Campbell @ 2012-11-08 14:29 UTC (permalink / raw)
  To: Philippe.Simonet; +Cc: 599161, keir, JBeulich, mrsanna1, xen-devel

On Thu, 2012-11-08 at 13:47 +0000, Philippe.Simonet@swisscom.com wrote:
> Hi Mauro, 
> 
> that's a question for you : 

I think Jan was asking for information relating to the system you saw
this on -- or are you working on the same systems as Mauro?

Of course additional information from Mauro would be useful too in order
to help spotting any patterns.

> > Philippe, could you clarify again what CPU model(s) this is being observed on
> > (the long times between individual steps forward with this problem perhaps
> > warrant repeating the basics each time, as it's otherwise quite cumbersome
> > to always look up old pieces of information).
> 
> can you provide this information ? 
> 	cat /proc/cpuinfo 	
> 	cat /proc/meminfo
> 	hardware information (manufacturer, model, urls, ...)
> 
> Thanks, Philippe
> 
> 
> > -----Original Message-----
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: Thursday, November 08, 2012 10:40 AM
> > To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE; Keir Fraser
> > Cc: 599161@bugs.debian.org; mrsanna1@gmail.com; Ian Campbell; xen-
> > devel@lists.xen.org
> > Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50
> > minutes" bug.
> > 
> > >>> On 07.11.12 at 18:40, Keir Fraser <keir@xen.org> wrote:
> > > On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote:
> > >
> > >>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128
> > plt_wrap=5ece12d09306
> > >>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> > >>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> > >>>> tsc_stamp=e3839fcb0273
> > >>>
> > >>> (below is the complete xm dmesg output)
> > >>>
> > >>> did that help you ? do you need more info ?
> > >>
> > >> I'll leave this to Keir (who wrote the debugging patch) to answer but
> > >> it looks to me like it should be useful!
> > >
> > > I'm scratching my head. plt_wrap is earlier than plt_now, which should
> > > be impossible. plt_stamp64 oddly has low 32 bits identical to
> > > new_stamp. That seems very very improbable!
> > 
> > Is it? My understanding was that plt_stamp64 is just a software extension to
> > the more narrow HW counter, and hence the low plt_mask bits would always
> > be expected to be identical.
> > 
> > The plt_wrap < plt_now thing of course is entirely unexplainable to me too:
> > Considering that plt_scale doesn't change at all post- boot, apart from
> > memory corruption I could only see an memory access ordering problem to
> > be the reason (platform_timer_stamp and/or stime_platform_stamp
> > changing despite platform_timer_lock being held. So maybe taking a
> > snapshot of all three static values involved in the calculation in
> > __read_platform_stime() between acquiring the lock and the first call to
> > __read_platform_stime(), and printing them together with the "live" values
> > in a second
> > printk() after the one your original patch added could rule that out.
> > 
> > But the box doesn't even seem to be NUMA (of course it also doesn't help
> > that the log level was kept restricted - hint, hint, Philippe), not does there
> > appear to be any S3 cycle or pCPU bring-up/-down in between...
> > 
> > Philippe, could you clarify again what CPU model(s) this is being observed on
> > (the long times between individual steps forward with this problem perhaps
> > warrant repeating the basics each time, as it's otherwise quite cumbersome
> > to always look up old pieces of information).
> > 
> > > I wonder whether the overflow handling should just be removed, or made
> > > conditional on a command-line parameter, or on the 32-bit platform
> > > counter being at least somewhat likely to overflow before a softirq
> > > occurs -- it seems lots of systems are using 14MHz HPET, and that
> > > gives us a couple of minutes for the plt_overflow softirq to do its work
> > before overflow occurs.
> > > I think we would notice that outage in other ways. :)
> > 
> > Iirc we added this for a good reason - to cover the, however unlikely, event
> > of Xen running for very long without preemption.
> > Presumably most of the cases got fixed meanwhile, and indeed a
> > wraparound time on the order of minutes should make this superfluous, but
> > as the case here shows that code did spot a severe anomaly (whatever that
> > may turn out to be).
> > 
> > Also recall that there are HPET implementations around that tick at a much
> > higher frequency than 14MHz.
> > 
> > So unless we finally reach the understanding that the code is flawed, I would
> > rather want to keep it.
> > 
> > Jan
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-08 14:28   ` Ian Campbell
@ 2012-11-08 14:34     ` Keir Fraser
  0 siblings, 0 replies; 32+ messages in thread
From: Keir Fraser @ 2012-11-08 14:34 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Philippe.Simonet@swisscom.com, 599161@bugs.debian.org,
	mrsanna1@gmail.com, xen-devel@lists.xen.org

On 08/11/2012 14:28, "Ian Campbell" <ijc@hellion.org.uk> wrote:

>>> There appears to be a certain amount of hardware-specificness to the
>>> issue -- so I'm wondering if maybe there are some platforms whose tsc is
>>> not as monotonically increasing as it needs to be...
>> 
>> plt_* timestamps are not derived from TSC at all.
> 
> I see, rather it is derived from the platform_timesource which could be
> HPET, pmtimer, pit etc but in this case (according to the provided xm
> dmesg) appears to be a 14MHz HPET.
> 
> So I guess s/tsc/HPET/ in my original thought...

In fact plt_now and plt_wrap are both derived from the same value of
plt_stamp64. One is derived from it directly, and the other from
plt_stamp64+plt_mask+1 (== plt_stamp64+(1<<32)).

 -- Keir

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-08 13:47       ` Philippe.Simonet
  2012-11-08 14:29         ` Ian Campbell
@ 2012-11-08 15:21         ` Mauro
       [not found]         ` <1352384968.12977.104.camel@hastur.hellion.org.uk>
  2 siblings, 0 replies; 32+ messages in thread
From: Mauro @ 2012-11-08 15:21 UTC (permalink / raw)
  To: Philippe.Simonet; +Cc: keir, 599161, xen-devel, JBeulich, ijc

On 8 November 2012 14:47,  <Philippe.Simonet@swisscom.com> wrote:
> Hi Mauro,
>
> that's a question for you :
>
>> Philippe, could you clarify again what CPU model(s) this is being observed on
>> (the long times between individual steps forward with this problem perhaps
>> warrant repeating the basics each time, as it's otherwise quite cumbersome
>> to always look up old pieces of information).
>
> can you provide this information ?
>         cat /proc/cpuinfo
>         cat /proc/meminfo
>         hardware information (manufacturer, model, urls, ...)
>

cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 4
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 5
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 6
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 8
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 9
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 10
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 11
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 12
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 13
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 14
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:


cat /proc/meminfo
MemTotal:        3127132 kB
MemFree:         2697364 kB
Buffers:           65816 kB
Cached:            62188 kB
SwapCached:            0 kB
Active:           117156 kB
Inactive:          46840 kB
Active(anon):      28668 kB
Inactive(anon):    16360 kB
Active(file):      88488 kB
Inactive(file):    30480 kB
Unevictable:       18944 kB
Mlocked:           18944 kB
SwapTotal:        974840 kB
SwapFree:         974840 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         54936 kB
Mapped:            14124 kB
Shmem:               508 kB
Slab:              66072 kB
SReclaimable:      19732 kB
SUnreclaim:        46340 kB
KernelStack:        4016 kB
PageTables:         4216 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     2538404 kB
Committed_AS:     202468 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      289904 kB
VmallocChunk:   34359384628 kB
HardwareCorrupted:     0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     3145728 kB
DirectMap2M:           0 kB


note that 3G is the mem reserved to dom0, the total amount of mem in
the machine is 64G.


The machine is a HP Proliant DL 580 G5,
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c01154573&lang=en&cc=us&taskId=&prodSeriesId=3454575&prodTypeId=15351

Thanks to all for the work.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-08  9:39     ` Jan Beulich
  2012-11-08 10:38       ` Keir Fraser
  2012-11-08 13:47       ` Philippe.Simonet
@ 2012-11-08 16:45       ` Tim Deegan
  2012-11-08 16:53         ` Keir Fraser
  2 siblings, 1 reply; 32+ messages in thread
From: Tim Deegan @ 2012-11-08 16:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Philippe.Simonet, Keir Fraser, mrsanna1, xen-devel, 599161,
	Ian Campbell

At 09:39 +0000 on 08 Nov (1352367592), Jan Beulich wrote:
> The plt_wrap < plt_now thing of course is entirely unexplainable
> to me too: Considering that plt_scale doesn't change at all post-
> boot, apart from memory corruption I could only see an memory
> access ordering problem to be the reason (platform_timer_stamp
> and/or stime_platform_stamp changing despite platform_timer_lock
> being held. So maybe taking a snapshot of all three static values
> involved in the calculation in __read_platform_stime() between
> acquiring the lock and the first call to __read_platform_stime(),
> and printing them together with the "live" values in a second
> printk() after the one your original patch added could rule that
> out.
>  
> But the box doesn't even seem to be NUMA (of course it also
> doesn't help that the log level was kept restricted - hint, hint,
> Philippe), not does there appear to be any S3 cycle or pCPU
> bring-up/-down in between...

S3 looks like it might be a culprit, since resume_platform_timer()
clobbers plt_stamp64 without taking the platform_timer_lock.  But both
the S3 resume code and the plt_overflow timer should only ever run on
CPU 0, so even that should be safe (unless continue_hypercall_on_cpu()
is broken...)

Definitely having loglvl=all would have helped here, to eliminate S3
from our enquiries.

> > I wonder whether the overflow handling should just be removed, or made
> > conditional on a command-line parameter, or on the 32-bit platform counter
> > being at least somewhat likely to overflow before a softirq occurs -- it
> > seems lots of systems are using 14MHz HPET, and that gives us a couple of
> > minutes for the plt_overflow softirq to do its work before overflow occurs.
> > I think we would notice that outage in other ways. :)
> 
> Iirc we added this for a good reason - to cover the, however
> unlikely, event of Xen running for very long without preemption.
> Presumably most of the cases got fixed meanwhile, and indeed
> a wraparound time on the order of minutes should make this
> superfluous, but as the case here shows that code did spot a
> severe anomaly (whatever that may turn out to be).

ISTR when this code went in we were dealing with a timer that had a
period of about 4 seconds (ACPI PMTIMER?).  It might well be OTT for the
HPET, but if there's something weird going on I'd like to track it down
while we have some sort of a handle on it.

Tim.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-08 16:45       ` Tim Deegan
@ 2012-11-08 16:53         ` Keir Fraser
  0 siblings, 0 replies; 32+ messages in thread
From: Keir Fraser @ 2012-11-08 16:53 UTC (permalink / raw)
  To: Tim Deegan, Jan Beulich
  Cc: Ian Campbell, Philippe.Simonet, 599161, mrsanna1, xen-devel

On 08/11/2012 16:45, "Tim Deegan" <tim@xen.org> wrote:

>>> I wonder whether the overflow handling should just be removed, or made
>>> conditional on a command-line parameter, or on the 32-bit platform counter
>>> being at least somewhat likely to overflow before a softirq occurs -- it
>>> seems lots of systems are using 14MHz HPET, and that gives us a couple of
>>> minutes for the plt_overflow softirq to do its work before overflow occurs.
>>> I think we would notice that outage in other ways. :)
>> 
>> Iirc we added this for a good reason - to cover the, however
>> unlikely, event of Xen running for very long without preemption.
>> Presumably most of the cases got fixed meanwhile, and indeed
>> a wraparound time on the order of minutes should make this
>> superfluous, but as the case here shows that code did spot a
>> severe anomaly (whatever that may turn out to be).
> 
> ISTR when this code went in we were dealing with a timer that had a
> period of about 4 seconds (ACPI PMTIMER?).  It might well be OTT for the
> HPET, but if there's something weird going on I'd like to track it down
> while we have some sort of a handle on it.

It must have been the PMTIMER. It's the only counter narrower than 32 bits
(legacy PIT we simulate as a 32-bit counter behind the scenes).

 -- Keir

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
       [not found]         ` <1352384968.12977.104.camel@hastur.hellion.org.uk>
@ 2012-11-09  9:05           ` Philippe.Simonet
       [not found]           ` <FF93AF260AC2BB499A119CC65B092CF73151901D@sg000713.corproot.net>
  1 sibling, 0 replies; 32+ messages in thread
From: Philippe.Simonet @ 2012-11-09  9:05 UTC (permalink / raw)
  To: ijc; +Cc: 599161, keir, JBeulich, mrsanna1, xen-devel


> -----Original Message-----
> From: Ian Campbell [mailto:ijc@hellion.org.uk]
> Sent: Thursday, November 08, 2012 3:29 PM
> To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE
> Cc: mrsanna1@gmail.com; 599161@bugs.debian.org; xen-
> devel@lists.xen.org; keir@xen.org; JBeulich@suse.com
> Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
> 
> 
> I think Jan was asking for information relating to the system you saw this on -
> - or are you working on the same systems as Mauro?

oops, excuse me, here is a description : I have the problem on 4 systems, all with same hardware.
the problem occured  on each system, 1 time each 2 month in average. since January 2012, I decided to reboot them all monthly, 
and the clock jump occurred only once in February ...

SYSTEM : 		HP ProLiant DL385 G7, with 2 * AMD Processor 6174 (12 cores) = 24 cores, 16 GB MEMORY
XEN    			 (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4) (ultrotter@debian.org) (gcc version 4.4.5 (Debian 4.4.5-8) ) Sat Sep  8 19:15:46 UTC 2012
DOM0			Linux 2.6.32-5-xen-amd64 #1 SMP Sun Sep 23 13:49:30 UTC 2012 x86_64 GNU/Linux
CPU 			
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 9
model name      : AMD Opteron(tm) Processor 6174
stepping        : 1
cpu MHz         : 3791872.477
cache size      : 512 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic mtrr mca cmov pat clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm pni cx16 popcnt hypervisor lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch nodeid_msr
bogomips        : 4400.17
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

> 
> Of course additional information from Mauro would be useful too in order to
> help spotting any patterns.
> 
> > > Philippe, could you clarify again what CPU model(s) this is being
> > > observed on (the long times between individual steps forward with
> > > this problem perhaps warrant repeating the basics each time, as it's
> > > otherwise quite cumbersome to always look up old pieces of
> information).
> >
> > can you provide this information ?
> > 	cat /proc/cpuinfo
> > 	cat /proc/meminfo
> > 	hardware information (manufacturer, model, urls, ...)
> >
> > Thanks, Philippe
> >
> >
> > > -----Original Message-----
> > > From: Jan Beulich [mailto:JBeulich@suse.com]
> > > Sent: Thursday, November 08, 2012 10:40 AM
> > > To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE; Keir Fraser
> > > Cc: 599161@bugs.debian.org; mrsanna1@gmail.com; Ian Campbell; xen-
> > > devel@lists.xen.org
> > > Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock
> > > shifts by 50 minutes" bug.
> > >
> > > >>> On 07.11.12 at 18:40, Keir Fraser <keir@xen.org> wrote:
> > > > On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote:
> > > >
> > > >>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128
> > > plt_wrap=5ece12d09306
> > > >>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> > > >>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> > > >>>> tsc_stamp=e3839fcb0273
> > > >>>
> > > >>> (below is the complete xm dmesg output)
> > > >>>
> > > >>> did that help you ? do you need more info ?
> > > >>
> > > >> I'll leave this to Keir (who wrote the debugging patch) to answer
> > > >> but it looks to me like it should be useful!
> > > >
> > > > I'm scratching my head. plt_wrap is earlier than plt_now, which
> > > > should be impossible. plt_stamp64 oddly has low 32 bits identical
> > > > to new_stamp. That seems very very improbable!
> > >
> > > Is it? My understanding was that plt_stamp64 is just a software
> > > extension to the more narrow HW counter, and hence the low plt_mask
> > > bits would always be expected to be identical.
> > >
> > > The plt_wrap < plt_now thing of course is entirely unexplainable to me
> too:
> > > Considering that plt_scale doesn't change at all post- boot, apart
> > > from memory corruption I could only see an memory access ordering
> > > problem to be the reason (platform_timer_stamp and/or
> > > stime_platform_stamp changing despite platform_timer_lock being
> > > held. So maybe taking a snapshot of all three static values involved
> > > in the calculation in
> > > __read_platform_stime() between acquiring the lock and the first
> > > call to __read_platform_stime(), and printing them together with the
> > > "live" values in a second
> > > printk() after the one your original patch added could rule that out.
> > >
> > > But the box doesn't even seem to be NUMA (of course it also doesn't
> > > help that the log level was kept restricted - hint, hint, Philippe),
> > > not does there appear to be any S3 cycle or pCPU bring-up/-down in
> between...
> > >
> > > Philippe, could you clarify again what CPU model(s) this is being
> > > observed on (the long times between individual steps forward with
> > > this problem perhaps warrant repeating the basics each time, as it's
> > > otherwise quite cumbersome to always look up old pieces of
> information).
> > >
> > > > I wonder whether the overflow handling should just be removed, or
> > > > made conditional on a command-line parameter, or on the 32-bit
> > > > platform counter being at least somewhat likely to overflow before
> > > > a softirq occurs -- it seems lots of systems are using 14MHz HPET,
> > > > and that gives us a couple of minutes for the plt_overflow softirq
> > > > to do its work
> > > before overflow occurs.
> > > > I think we would notice that outage in other ways. :)
> > >
> > > Iirc we added this for a good reason - to cover the, however
> > > unlikely, event of Xen running for very long without preemption.
> > > Presumably most of the cases got fixed meanwhile, and indeed a
> > > wraparound time on the order of minutes should make this
> > > superfluous, but as the case here shows that code did spot a severe
> > > anomaly (whatever that may turn out to be).
> > >
> > > Also recall that there are HPET implementations around that tick at
> > > a much higher frequency than 14MHz.
> > >
> > > So unless we finally reach the understanding that the code is
> > > flawed, I would rather want to keep it.
> > >
> > > Jan
> >
> >
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
       [not found]           ` <FF93AF260AC2BB499A119CC65B092CF73151901D@sg000713.corproot.net>
@ 2012-11-09  9:47             ` Jan Beulich
  2012-11-09  9:54               ` Mauro
  2012-11-13 10:14             ` Jan Beulich
  1 sibling, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2012-11-09  9:47 UTC (permalink / raw)
  To: Philippe.Simonet; +Cc: keir, 599161, xen-devel, mrsanna1, ijc

>>> On 09.11.12 at 10:05, <Philippe.Simonet@swisscom.com> wrote:
> oops, excuse me, here is a description : I have the problem on 4 systems, 
> all with same hardware.
> the problem occured  on each system, 1 time each 2 month in average. since 
> January 2012, I decided to reboot them all monthly, 
> and the clock jump occurred only once in February ...
> 
> SYSTEM : 		HP ProLiant DL385 G7, with 2 * AMD Processor 6174 (12 cores) = 24 
> cores, 16 GB MEMORY
> XEN    			 (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4) (ultrotter@debian.org) 
> (gcc version 4.4.5 (Debian 4.4.5-8) ) Sat Sep  8 19:15:46 UTC 2012
> DOM0			Linux 2.6.32-5-xen-amd64 #1 SMP Sun Sep 23 13:49:30 UTC 2012 x86_64 
> GNU/Linux
> CPU 			
> processor       : 0
> vendor_id       : AuthenticAMD
> cpu family      : 16
> model           : 9
> model name      : AMD Opteron(tm) Processor 6174

Huh - so we have the problem on even different vendor CPUs (as
Mauro's are Intel ones). But I take it that you haven't see an event
yet with the debugging patch?

Plus, what's puzzling me a little too - before the occurrence of the
event on Mauro's system, I was under the impression that this
requires quite a bit of uptime. Yet the event he observed occurred
early on the second day after boot afaict.

Jan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-09  9:47             ` Jan Beulich
@ 2012-11-09  9:54               ` Mauro
  0 siblings, 0 replies; 32+ messages in thread
From: Mauro @ 2012-11-09  9:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: keir, Philippe.Simonet, 599161, xen-devel, ijc

On 9 November 2012 10:47, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 09.11.12 at 10:05, <Philippe.Simonet@swisscom.com> wrote:
>> oops, excuse me, here is a description : I have the problem on 4 systems,
>> all with same hardware.
>> the problem occured  on each system, 1 time each 2 month in average. since
>> January 2012, I decided to reboot them all monthly,
>> and the clock jump occurred only once in February ...
>>
>> SYSTEM :              HP ProLiant DL385 G7, with 2 * AMD Processor 6174 (12 cores) = 24
>> cores, 16 GB MEMORY
>> XEN                            (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4) (ultrotter@debian.org)
>> (gcc version 4.4.5 (Debian 4.4.5-8) ) Sat Sep  8 19:15:46 UTC 2012
>> DOM0                  Linux 2.6.32-5-xen-amd64 #1 SMP Sun Sep 23 13:49:30 UTC 2012 x86_64
>> GNU/Linux
>> CPU
>> processor       : 0
>> vendor_id       : AuthenticAMD
>> cpu family      : 16
>> model           : 9
>> model name      : AMD Opteron(tm) Processor 6174
>
> Huh - so we have the problem on even different vendor CPUs (as
> Mauro's are Intel ones). But I take it that you haven't see an event
> yet with the debugging patch?
>
> Plus, what's puzzling me a little too - before the occurrence of the
> event on Mauro's system, I was under the impression that this
> requires quite a bit of uptime. Yet the event he observed occurred
> early on the second day after boot afaict.

Before the patch clock jumps on my systems occurred about once or twice at week.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
       [not found]           ` <FF93AF260AC2BB499A119CC65B092CF73151901D@sg000713.corproot.net>
  2012-11-09  9:47             ` Jan Beulich
@ 2012-11-13 10:14             ` Jan Beulich
  1 sibling, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2012-11-13 10:14 UTC (permalink / raw)
  To: mrsanna1, Philippe.Simonet; +Cc: keir, 599161, xen-devel, ijc

[-- Attachment #1: Type: text/plain, Size: 738 bytes --]

>>> On 09.11.12 at 10:05, <Philippe.Simonet@swisscom.com> wrote:

Since it looks like this got stalled again, attached is a slightly
extended version of Keir's debugging patch, allowing to rule out
any inconsistencies of the globals between the first and second
instances of the two invocations of __read_platform_stime().

Should the numbers printed turn out identical between the two
invocations and identical to the boot time determined values, then
I'm afraid I'm out of explanations as well as debugging suggestions.

Please remember to add "loglvl=all" to the hypervisor command
line.

The patch is against a 4.0.3 based tree I had still lying around, so
I hope it'll apply cleanly to your 4.0.1 based one.

Jan


[-- Attachment #2: 00-tsc-debug --]
[-- Type: application/octet-stream, Size: 2077 bytes --]

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -578,14 +578,23 @@ static s_time_t __read_platform_stime(u6
     return (stime_platform_stamp + scale_delta(diff, &plt_scale));
 }
 
+#define print_time_scale(prefix, scale, suffix) \
+    printk(prefix "%d:%08x" suffix, (scale)->shift, (scale)->mul_frac);
+
 static void plt_overflow(void *unused)
 {
     int i;
-    u64 count;
-    s_time_t now, plt_now, plt_wrap;
+    u64 count, old_stamp, tsc, pts;
+    s_time_t now, plt_now, plt_wrap, sps;
+    struct time_scale ps;
 
     spin_lock_irq(&platform_timer_lock);
 
+    pts = platform_timer_stamp;
+    sps = stime_platform_stamp;
+    ps = plt_scale;
+
+    old_stamp = plt_stamp;
     count = plt_src.read_counter();
     plt_stamp64 += (count - plt_stamp) & plt_mask;
     plt_stamp = count;
@@ -598,6 +607,19 @@ static void plt_overflow(void *unused)
         plt_wrap = __read_platform_stime(plt_stamp64 + plt_mask + 1);
         if ( ABS(plt_wrap - now) > ABS(plt_now - now) )
             break;
+        rdtscll(tsc);
+        printk("XXX plt_overflow: plt_now=%"PRIx64" plt_wrap=%"PRIx64
+               " now=%"PRIx64" old_stamp=%"PRIx64" new_stamp=%"PRIx64
+               " plt_stamp64=%"PRIx64" plt_mask=%"PRIx64
+               " tsc=%"PRIx64" tsc_stamp=%"PRIx64"\n",
+               plt_now, plt_wrap, now, old_stamp, plt_stamp, plt_stamp64,
+               plt_mask, tsc, this_cpu(cpu_time).local_tsc_stamp);
+        printk("XXX pts=%"PRIx64"/%"PRIx64" sps=%"PRIx64"/%"PRIx64"\n",
+               pts, platform_timer_stamp, sps, stime_platform_stamp);
+        print_time_scale("XXX ps=", &ps, "/");
+        print_time_scale("", &plt_scale, "\n");
+
+        break;
         plt_stamp64 += plt_mask + 1;
     }
     if ( i != 0 )
@@ -694,6 +716,7 @@ static void __init init_platform_timer(v
     plt_mask = (u64)~0ull >> (64 - pts->counter_bits);
 
     set_time_scale(&plt_scale, pts->frequency);
+    print_time_scale("plt_scale=", &plt_scale, "\n");
 
     plt_overflow_period = scale_delta(
         1ull << (pts->counter_bits-1), &plt_scale);

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
       [not found]   ` <CAE17a0Xit58cLiJ+pE7uKuCDq3Xn+7-5+iDpV2+5S+TX3yuT+Q@mail.gmail.com>
@ 2012-11-26 10:20     ` Jan Beulich
  2012-11-26 10:28       ` Mauro
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2012-11-26 10:20 UTC (permalink / raw)
  To: Mauro; +Cc: Philippe.Simonet, Keir Fraser, xen-devel, ijc

>>> On 24.11.12 at 13:06, Mauro <mrsanna1@gmail.com> wrote:
> (XEN) XXX plt_overflow: plt_now=569ed0c10fa3 plt_wrap=569ed0bd2ac3
> now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce
> plt_stamp64=13d800377ce plt_mask=ffffffff tsc=cfdf69d72c82
> tsc_stamp=cfdf69ccc91d
> (XEN) XXX pts=13d80036988/13d80036988 sps=569ed0bd2ac3/569ed0bd2ac3
> (XEN) XXX ps=7:8baebcb8/7:8baebcb8

I.e. plt_wrap == stime_platform_stamp, which would imply
plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which
visibly isn't the case.

Having to consider mis-compilation of some sort at this point,
can you please attach the xen-syms matching the xen.gz from
the run above?

The only other alternative would appear to be register or stack
corruption (which of the two might also be visible from the
xen-syms, so it would be needed for that determination too). As
the code in question runs with interrupts disabled, the only
sources for that would be NMI, MCE, or SMI. Adding logging of
instances of the former two would be possible, but otoh the
entry/exit code of them is uniform with other exceptions and
hence can pretty much be excluded to corrupt anything.

Jan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-26 10:20     ` Jan Beulich
@ 2012-11-26 10:28       ` Mauro
  2012-11-26 10:40         ` Ian Campbell
  0 siblings, 1 reply; 32+ messages in thread
From: Mauro @ 2012-11-26 10:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Philippe.Simonet, Keir Fraser, xen-devel, ijc

On 26 November 2012 11:20, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 24.11.12 at 13:06, Mauro <mrsanna1@gmail.com> wrote:
>> (XEN) XXX plt_overflow: plt_now=569ed0c10fa3 plt_wrap=569ed0bd2ac3
>> now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce
>> plt_stamp64=13d800377ce plt_mask=ffffffff tsc=cfdf69d72c82
>> tsc_stamp=cfdf69ccc91d
>> (XEN) XXX pts=13d80036988/13d80036988 sps=569ed0bd2ac3/569ed0bd2ac3
>> (XEN) XXX ps=7:8baebcb8/7:8baebcb8
>
> I.e. plt_wrap == stime_platform_stamp, which would imply
> plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which
> visibly isn't the case.
>
> Having to consider mis-compilation of some sort at this point,
> can you please attach the xen-syms matching the xen.gz from
> the run above?

how can I do?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-26 10:28       ` Mauro
@ 2012-11-26 10:40         ` Ian Campbell
  2012-11-26 12:18           ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Ian Campbell @ 2012-11-26 10:40 UTC (permalink / raw)
  To: Mauro; +Cc: Philippe.Simonet, Keir Fraser, Jan Beulich, xen-devel

On Mon, 2012-11-26 at 11:28 +0100, Mauro wrote:
> On 26 November 2012 11:20, Jan Beulich <JBeulich@suse.com> wrote:
> >>>> On 24.11.12 at 13:06, Mauro <mrsanna1@gmail.com> wrote:
> >> (XEN) XXX plt_overflow: plt_now=569ed0c10fa3 plt_wrap=569ed0bd2ac3
> >> now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce
> >> plt_stamp64=13d800377ce plt_mask=ffffffff tsc=cfdf69d72c82
> >> tsc_stamp=cfdf69ccc91d
> >> (XEN) XXX pts=13d80036988/13d80036988 sps=569ed0bd2ac3/569ed0bd2ac3
> >> (XEN) XXX ps=7:8baebcb8/7:8baebcb8
> >
> > I.e. plt_wrap == stime_platform_stamp, which would imply
> > plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which
> > visibly isn't the case.
> >
> > Having to consider mis-compilation of some sort at this point,
> > can you please attach the xen-syms matching the xen.gz from
> > the run above?
> 
> how can I do?

xen/xen-syms should be in the place where you built the hypervisor.

-- 
Ian Campbell

It is not doing the thing we like to do, but liking the thing we have to do,
that makes life blessed.
		-- Goethe

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-26 10:40         ` Ian Campbell
@ 2012-11-26 12:18           ` Jan Beulich
  2012-11-26 13:08             ` Philippe.Simonet
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2012-11-26 12:18 UTC (permalink / raw)
  To: Mauro; +Cc: Philippe.Simonet, Keir Fraser, xen-devel, Ian Campbell

>>> On 26.11.12 at 11:40, Ian Campbell <ijc@hellion.org.uk> wrote:
> On Mon, 2012-11-26 at 11:28 +0100, Mauro wrote:
>> On 26 November 2012 11:20, Jan Beulich <JBeulich@suse.com> wrote:
>> >>>> On 24.11.12 at 13:06, Mauro <mrsanna1@gmail.com> wrote:
>> >> (XEN) XXX plt_overflow: plt_now=569ed0c10fa3 plt_wrap=569ed0bd2ac3
>> >> now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce
>> >> plt_stamp64=13d800377ce plt_mask=ffffffff tsc=cfdf69d72c82
>> >> tsc_stamp=cfdf69ccc91d
>> >> (XEN) XXX pts=13d80036988/13d80036988 sps=569ed0bd2ac3/569ed0bd2ac3
>> >> (XEN) XXX ps=7:8baebcb8/7:8baebcb8
>> >
>> > I.e. plt_wrap == stime_platform_stamp, which would imply
>> > plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which
>> > visibly isn't the case.
>> >
>> > Having to consider mis-compilation of some sort at this point,
>> > can you please attach the xen-syms matching the xen.gz from
>> > the run above?
>> 
>> how can I do?
> 
> xen/xen-syms should be in the place where you built the hypervisor.

... or, if you didn't build the hypervisor yourself, you should check
the package you were handed.

Jan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-26 12:18           ` Jan Beulich
@ 2012-11-26 13:08             ` Philippe.Simonet
  2012-11-26 14:15               ` Sylvain Munaut
  2012-11-26 14:29               ` Jan Beulich
  0 siblings, 2 replies; 32+ messages in thread
From: Philippe.Simonet @ 2012-11-26 13:08 UTC (permalink / raw)
  To: JBeulich, mrsanna1; +Cc: keir.xen, xen-devel, ijc

Hi jan

I've done that for Mauro, here are the symbols : http://www.ncube.ch/xen-syms.gz

MANY thanks and regards

Philippe




> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, November 26, 2012 1:18 PM
> To: Mauro
> Cc: Keir Fraser; Ian Campbell; xen-devel; Simonet Philippe, ITS-OUS-OP-
> IFM-NW-IPE
> Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50
> minutes" bug.
> 
> >>> On 26.11.12 at 11:40, Ian Campbell <ijc@hellion.org.uk> wrote:
> > On Mon, 2012-11-26 at 11:28 +0100, Mauro wrote:
> >> On 26 November 2012 11:20, Jan Beulich <JBeulich@suse.com> wrote:
> >> >>>> On 24.11.12 at 13:06, Mauro <mrsanna1@gmail.com> wrote:
> >> >> (XEN) XXX plt_overflow: plt_now=569ed0c10fa3
> plt_wrap=569ed0bd2ac3
> >> >> now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce
> >> >> plt_stamp64=13d800377ce plt_mask=ffffffff tsc=cfdf69d72c82
> >> >> tsc_stamp=cfdf69ccc91d
> >> >> (XEN) XXX pts=13d80036988/13d80036988
> >> >> sps=569ed0bd2ac3/569ed0bd2ac3
> >> >> (XEN) XXX ps=7:8baebcb8/7:8baebcb8
> >> >
> >> > I.e. plt_wrap == stime_platform_stamp, which would imply
> >> > plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which visibly
> >> > isn't the case.
> >> >
> >> > Having to consider mis-compilation of some sort at this point, can
> >> > you please attach the xen-syms matching the xen.gz from the run
> >> > above?
> >>
> >> how can I do?
> >
> > xen/xen-syms should be in the place where you built the hypervisor.
> 
> ... or, if you didn't build the hypervisor yourself, you should check the
> package you were handed.
> 
> Jan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-26 13:08             ` Philippe.Simonet
@ 2012-11-26 14:15               ` Sylvain Munaut
  2012-11-26 14:29               ` Jan Beulich
  1 sibling, 0 replies; 32+ messages in thread
From: Sylvain Munaut @ 2012-11-26 14:15 UTC (permalink / raw)
  To: Philippe.Simonet; +Cc: ijc, keir.xen, mrsanna1, JBeulich, xen-devel

Hi,


I just wanted to say I've also been hitting this problem for years on
some servers.
That was originally under xen 4.0 when I originally installed the
machines, but even after update to 4.1.3 and more recently a 3.6.4
dom0 kernel, the problem is still present.

More specifically the two hw platforms I'm sure are concerned (hw +
cpu info + debug-key 's' info)

Sun Fire x2250:

    cpu family : 6
    model : 23
    model name : Intel(R) Xeon(R) CPU           L5420  @ 2.50GHz
    stepping : 6
    microcode : 0x60b

(XEN) TSC has constant rate, deep Cstates possible, so not reliable,
warp=3292 (count=1)
(XEN) dom1: mode=0,ofs=0x56eb699774,khz=2500128,inc=1,vtsc count:
1168090 kernel, 18678 user


Sun Fire x4150:

    cpu family : 6
    model : 23
    model name : Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz
    stepping : 10
    microcode : 0xa07

(XEN) TSC has constant rate, deep Cstates possible, so not reliable,
warp=3789 (count=1)
(XEN) dom2: mode=0,ofs=0x3a652aed7d,khz=2992564,inc=1,vtsc count:
4247396065 kernel, 712543 user
(XEN) dom3: mode=0,ofs=0x9eb4c44593,khz=2500147,inc=2,vtsc count:
1967559486 kernel, 192516 user


When the problem appears, I can also see the "(XEN) Platform timer
appears to have unexpectedly wrapped 10 or more times." message.

I have now deployed a freshly compiled 4.1.3 with the latest tsc debug
patched compiled and loglvl=all option to a test server with the same
hw where the problem happenned and I'll report if I see anything.

The problem is pretty rare so not sure when it will happen next ...


Cheers,

     Sylvain

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
  2012-11-26 13:08             ` Philippe.Simonet
  2012-11-26 14:15               ` Sylvain Munaut
@ 2012-11-26 14:29               ` Jan Beulich
  1 sibling, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2012-11-26 14:29 UTC (permalink / raw)
  To: mrsanna1, Philippe.Simonet; +Cc: keir.xen, xen-devel, ijc

>>> On 26.11.12 at 14:08, <Philippe.Simonet@swisscom.com> wrote:
> I've done that for Mauro, here are the symbols : 
> http://www.ncube.ch/xen-syms.gz 

That helped; I'll send a patch shortly.

Jan

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2012-11-26 14:29 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-10-26 12:59 #599161: Xen debug patch for the "clock shifts by 50 minutes" bug Ian Campbell
2012-10-26 18:25 ` Mauro
     [not found] ` <CAE17a0WfpKHxN=3iF6F+P0nT4UbyF9Lbpm8R5+JJEquQOmudXw@mail.gmail.com>
2012-10-26 18:40   ` Ian Campbell
  -- strict thread matches above, loose matches on Subject: below --
2012-11-07 10:10 Philippe.Simonet
2012-11-07 13:22 ` Ian Campbell
2012-11-07 17:40   ` Keir Fraser
2012-11-08  9:39     ` Jan Beulich
2012-11-08 10:38       ` Keir Fraser
2012-11-08 13:53         ` Jan Beulich
2012-11-08 14:04           ` Keir Fraser
2012-11-08 13:47       ` Philippe.Simonet
2012-11-08 14:29         ` Ian Campbell
2012-11-08 15:21         ` Mauro
     [not found]         ` <1352384968.12977.104.camel@hastur.hellion.org.uk>
2012-11-09  9:05           ` Philippe.Simonet
     [not found]           ` <FF93AF260AC2BB499A119CC65B092CF73151901D@sg000713.corproot.net>
2012-11-09  9:47             ` Jan Beulich
2012-11-09  9:54               ` Mauro
2012-11-13 10:14             ` Jan Beulich
2012-11-08 16:45       ` Tim Deegan
2012-11-08 16:53         ` Keir Fraser
2012-11-08 11:43     ` Ian Campbell
2012-11-08 14:20     ` Keir Fraser
2012-11-07 17:07 ` Jan Beulich
     [not found] <1352375011.12977.95.camel@hastur.hellion.org.uk>
2012-11-08 12:54 ` Keir Fraser
2012-11-08 14:28   ` Ian Campbell
2012-11-08 14:34     ` Keir Fraser
     [not found] <CAE17a0XdxhaoCZaDDT6RXm1DPnabf2hWXAnzU5NukHeXX45grA@mail.gmail.com>
     [not found] ` <CCCD845A.44DB0%keir.xen@gmail.com>
     [not found]   ` <CAE17a0Xit58cLiJ+pE7uKuCDq3Xn+7-5+iDpV2+5S+TX3yuT+Q@mail.gmail.com>
2012-11-26 10:20     ` Jan Beulich
2012-11-26 10:28       ` Mauro
2012-11-26 10:40         ` Ian Campbell
2012-11-26 12:18           ` Jan Beulich
2012-11-26 13:08             ` Philippe.Simonet
2012-11-26 14:15               ` Sylvain Munaut
2012-11-26 14:29               ` Jan Beulich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).