All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] x86: assorted time handling adjustments
@ 2026-01-06 13:57 Jan Beulich
  2026-01-06 13:58 ` [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() Jan Beulich
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Jan Beulich @ 2026-01-06 13:57 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

The first patch aims at addressing a recent bug report, but may also point
out further more or less related issues. The other patches do some tidying
found desirable while doing the investigation.

1: time: deal with negative deltas in get_s_time_fixed()
2: time: scale_delta() can be static
3: HVM: drop at_tsc parameter from ->set_tsc_offset() hook
4: time: gtsc_to_gtime() is HVM-only
5: time: pv_soft_rdtsc() is PV-only

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-06 13:57 [PATCH 0/5] x86: assorted time handling adjustments Jan Beulich
@ 2026-01-06 13:58 ` Jan Beulich
  2026-01-06 20:10   ` Антон Марков
  2026-01-14 17:49   ` Roger Pau Monné
  2026-01-06 13:58 ` [PATCH 2/5] x86/time: scale_delta() can be static Jan Beulich
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 32+ messages in thread
From: Jan Beulich @ 2026-01-06 13:58 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Roger Pau Monné,
	Антон Марков

Callers may pass in TSC values from before the local TSC stamp was last
updated (this would in particular be the case when the TSC was latched, a
time rendezvous occurs, and the latched value is used only afterwards).
scale_delta(), otoh, deals with unsigned values, and hence would treat
negative incoming deltas as huge positive values, yielding entirely bogus
return values.

Fixes: 88e64cb785c1 ("x86/HVM: use fixed TSC value when saving or restoring domain")
Reported-by: Антон Марков <akmarkov45@gmail.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
An alternative might be to have scale_delta() deal with signed deltas, yet
that seemed more risky to me.

There could likely be more Fixes: tags; the one used is the oldest
applicable one, from what I can tell.

A similar issue looks to exist in read_xen_timer() and its read_cycle()
helper, if we're scheduled out (and beck in) between reading of the TSC
and calculation of the delta (involving ->tsc_timestamp). Am I
overlooking anything there?

stime2tsc() guards against negative deltas by using 0 instead; I'm not
quite sure that's correct either.

amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
comment towards the TSC being "sane", but is that correct? Due to
TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
calling tsc_ticks2ns()?

A similar issue looks to exist in tsc_get_info(), again when rdtsc()
possibly returns a huge value due to TSC_ADJUST. Once again I wonder
whether we shouldn't subtract boot_tsc_stamp.

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1649,8 +1649,13 @@ s_time_t get_s_time_fixed(u64 at_tsc)
         tsc = at_tsc;
     else
         tsc = rdtsc_ordered();
-    delta = tsc - t->stamp.local_tsc;
-    return t->stamp.local_stime + scale_delta(delta, &t->tsc_scale);
+
+    if ( tsc >= t->stamp.local_tsc )
+        delta = scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale);
+    else
+        delta = -scale_delta(t->stamp.local_tsc - tsc, &t->tsc_scale);
+
+    return t->stamp.local_stime + delta;
 }
 
 s_time_t get_s_time(void)



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 2/5] x86/time: scale_delta() can be static
  2026-01-06 13:57 [PATCH 0/5] x86: assorted time handling adjustments Jan Beulich
  2026-01-06 13:58 ` [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() Jan Beulich
@ 2026-01-06 13:58 ` Jan Beulich
  2026-01-13 18:40   ` Roger Pau Monné
  2026-01-06 13:58 ` [PATCH 3/5] x86/HVM: drop at_tsc parameter from ->set_tsc_offset() hook Jan Beulich
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2026-01-06 13:58 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

It's used in time.c alone. Modernize types while there.

Amends: 5a82d598d2d ("viridian: unify time sources")
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/include/asm/time.h
+++ b/xen/arch/x86/include/asm/time.h
@@ -56,7 +56,6 @@ u64 stime2tsc(s_time_t stime);
 
 struct time_scale;
 void set_time_scale(struct time_scale *ts, u64 ticks_per_sec);
-u64 scale_delta(u64 delta, const struct time_scale *scale);
 
 /* Programmable Interval Timer (8254) */
 
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -131,9 +131,9 @@ static inline u32 mul_frac(u32 multiplic
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
  */
-u64 scale_delta(u64 delta, const struct time_scale *scale)
+static uint64_t scale_delta(uint64_t delta, const struct time_scale *scale)
 {
-    u64 product;
+    uint64_t product;
 
     if ( scale->shift < 0 )
         delta >>= -scale->shift;



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 3/5] x86/HVM: drop at_tsc parameter from ->set_tsc_offset() hook
  2026-01-06 13:57 [PATCH 0/5] x86: assorted time handling adjustments Jan Beulich
  2026-01-06 13:58 ` [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() Jan Beulich
  2026-01-06 13:58 ` [PATCH 2/5] x86/time: scale_delta() can be static Jan Beulich
@ 2026-01-06 13:58 ` Jan Beulich
  2026-01-14  9:40   ` Roger Pau Monné
  2026-01-06 13:59 ` [PATCH 4/5] x86/time: gtsc_to_gtime() is HVM-only Jan Beulich
  2026-01-06 14:00 ` [PATCH 5/5] x86/time: pv_soft_rdtsc() is PV-only Jan Beulich
  4 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2026-01-06 13:58 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

While the VMX hook never used the parameter, the SVM one lost its sole use
some time ago (while the original use of the parameter had gone away even
earlier).

Again modernize types while there.

Amends: 0cd50753eb40 ("nestedsvm: Disable TscRateMSR")
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hvm/domain.c
+++ b/xen/arch/x86/hvm/domain.c
@@ -312,8 +312,7 @@ int arch_set_info_hvm_guest(struct vcpu
     /* Sync AP's TSC with BSP's. */
     v->arch.hvm.cache_tsc_offset =
         d->vcpu[0]->arch.hvm.cache_tsc_offset;
-    hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset,
-                       d->arch.hvm.sync_tsc);
+    hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset);
 
     paging_update_paging_modes(v);
 
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -412,7 +412,7 @@ static void hvm_set_guest_tsc_fixed(stru
     delta_tsc = guest_tsc - tsc;
     v->arch.hvm.cache_tsc_offset = delta_tsc;
 
-    hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset, at_tsc);
+    hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset);
 }
 
 #define hvm_set_guest_tsc(v, t) hvm_set_guest_tsc_fixed(v, t, 0)
@@ -430,7 +430,7 @@ static void hvm_set_guest_tsc_msr(struct
 static void hvm_set_guest_tsc_adjust(struct vcpu *v, u64 tsc_adjust)
 {
     v->arch.hvm.cache_tsc_offset += tsc_adjust - v->arch.hvm.msr_tsc_adjust;
-    hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset, 0);
+    hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset);
     v->arch.hvm.msr_tsc_adjust = tsc_adjust;
     if ( v == current )
         update_vcpu_system_time(v);
@@ -4023,8 +4023,7 @@ void hvm_vcpu_reset_state(struct vcpu *v
     /* Sync AP's TSC with BSP's. */
     v->arch.hvm.cache_tsc_offset =
         v->domain->vcpu[0]->arch.hvm.cache_tsc_offset;
-    hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset,
-                       d->arch.hvm.sync_tsc);
+    hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset);
 
     v->arch.hvm.msr_tsc_adjust = 0;
 
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -779,7 +779,7 @@ static int cf_check svm_get_guest_pat(st
     return 1;
 }
 
-static void cf_check svm_set_tsc_offset(struct vcpu *v, u64 offset, u64 at_tsc)
+static void cf_check svm_set_tsc_offset(struct vcpu *v, uint64_t offset)
 {
     struct vmcb_struct *vmcb = v->arch.hvm.svm.vmcb;
     struct vmcb_struct *n1vmcb, *n2vmcb;
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1558,7 +1558,7 @@ static void cf_check vmx_handle_cd(struc
     }
 }
 
-static void cf_check vmx_set_tsc_offset(struct vcpu *v, u64 offset, u64 at_tsc)
+static void cf_check vmx_set_tsc_offset(struct vcpu *v, uint64_t offset)
 {
     vmx_vmcs_enter(v);
 
--- a/xen/arch/x86/hvm/vmx/vvmx.c
+++ b/xen/arch/x86/hvm/vmx/vvmx.c
@@ -1114,7 +1114,7 @@ static void load_shadow_guest_state(stru
             hvm_inject_hw_exception(X86_EXC_GP, 0);
     }
 
-    hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset, 0);
+    hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset);
 
     vvmcs_to_shadow_bulk(v, ARRAY_SIZE(vmentry_fields), vmentry_fields);
 
@@ -1330,7 +1330,7 @@ static void load_vvmcs_host_state(struct
             hvm_inject_hw_exception(X86_EXC_GP, 0);
     }
 
-    hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset, 0);
+    hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset);
 
     set_vvmcs(v, VM_ENTRY_INTR_INFO, 0);
 
--- a/xen/arch/x86/include/asm/hvm/hvm.h
+++ b/xen/arch/x86/include/asm/hvm/hvm.h
@@ -165,7 +165,7 @@ struct hvm_function_table {
     int  (*get_guest_pat)(struct vcpu *v, uint64_t *gpat);
     int  (*set_guest_pat)(struct vcpu *v, uint64_t gpat);
 
-    void (*set_tsc_offset)(struct vcpu *v, u64 offset, u64 at_tsc);
+    void (*set_tsc_offset)(struct vcpu *v, uint64_t offset);
 
     void (*inject_event)(const struct x86_event *event);
 
@@ -482,10 +482,9 @@ static inline void hvm_cpuid_policy_chan
     alternative_vcall(hvm_funcs.cpuid_policy_changed, v);
 }
 
-static inline void hvm_set_tsc_offset(struct vcpu *v, uint64_t offset,
-                                      uint64_t at_tsc)
+static inline void hvm_set_tsc_offset(struct vcpu *v, uint64_t offset)
 {
-    alternative_vcall(hvm_funcs.set_tsc_offset, v, offset, at_tsc);
+    alternative_vcall(hvm_funcs.set_tsc_offset, v, offset);
 }
 
 /*
@@ -847,7 +846,7 @@ static inline void hvm_sync_pir_to_irr(s
  */
 int hvm_guest_x86_mode(struct vcpu *v);
 void hvm_cpuid_policy_changed(struct vcpu *v);
-void hvm_set_tsc_offset(struct vcpu *v, uint64_t offset, uint64_t at_tsc);
+void hvm_set_tsc_offset(struct vcpu *v, uint64_t offset);
 
 /* End of prototype list */
 
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -2985,8 +2985,7 @@ int tsc_set_info(struct domain *d,
              */
             d->arch.hvm.sync_tsc = rdtsc();
             hvm_set_tsc_offset(d->vcpu[0],
-                               d->vcpu[0]->arch.hvm.cache_tsc_offset,
-                               d->arch.hvm.sync_tsc);
+                               d->vcpu[0]->arch.hvm.cache_tsc_offset);
         }
     }
 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 4/5] x86/time: gtsc_to_gtime() is HVM-only
  2026-01-06 13:57 [PATCH 0/5] x86: assorted time handling adjustments Jan Beulich
                   ` (2 preceding siblings ...)
  2026-01-06 13:58 ` [PATCH 3/5] x86/HVM: drop at_tsc parameter from ->set_tsc_offset() hook Jan Beulich
@ 2026-01-06 13:59 ` Jan Beulich
  2026-01-14  9:43   ` Roger Pau Monné
  2026-01-06 14:00 ` [PATCH 5/5] x86/time: pv_soft_rdtsc() is PV-only Jan Beulich
  4 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2026-01-06 13:59 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

Omit the function when HVM=n. With that the !HVM logic can also go away;
leave an assertion.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -2840,14 +2840,13 @@ uint64_t gtime_to_gtsc(const struct doma
     return scale_delta(time, &d->arch.ns_to_vtsc);
 }
 
+#ifdef CONFIG_HVM
 uint64_t gtsc_to_gtime(const struct domain *d, uint64_t tsc)
 {
-    u64 time = scale_delta(tsc, &d->arch.vtsc_to_ns);
-
-    if ( !is_hvm_domain(d) )
-        time += d->arch.vtsc_offset;
-    return time;
+    ASSERT(is_hvm_domain(d));
+    return scale_delta(tsc, &d->arch.vtsc_to_ns);
 }
+#endif /* CONFIG_HVM */
 
 uint64_t pv_soft_rdtsc(const struct vcpu *v, const struct cpu_user_regs *regs)
 {



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 5/5] x86/time: pv_soft_rdtsc() is PV-only
  2026-01-06 13:57 [PATCH 0/5] x86: assorted time handling adjustments Jan Beulich
                   ` (3 preceding siblings ...)
  2026-01-06 13:59 ` [PATCH 4/5] x86/time: gtsc_to_gtime() is HVM-only Jan Beulich
@ 2026-01-06 14:00 ` Jan Beulich
  2026-01-14  9:46   ` Roger Pau Monné
  4 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2026-01-06 14:00 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

Omit the function when PV=n, by moving it to the sole file using it, thus
allowing it to become static as well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/include/asm/time.h
+++ b/xen/arch/x86/include/asm/time.h
@@ -37,8 +37,6 @@ uint64_t cf_check acpi_pm_tick_to_ns(uin
 
 uint64_t tsc_ticks2ns(uint64_t ticks);
 
-struct cpu_user_regs;
-uint64_t pv_soft_rdtsc(const struct vcpu *v, const struct cpu_user_regs *regs);
 uint64_t gtime_to_gtsc(const struct domain *d, uint64_t time);
 uint64_t gtsc_to_gtime(const struct domain *d, uint64_t tsc);
 
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -874,6 +874,20 @@ static uint64_t guest_efer(const struct
     return val;
 }
 
+static uint64_t soft_rdtsc(
+    const struct vcpu *v, const struct cpu_user_regs *regs)
+{
+    s_time_t old, new, now = get_s_time();
+    struct domain *d = v->domain;
+
+    do {
+        old = d->arch.vtsc_last;
+        new = now > d->arch.vtsc_last ? now : old + 1;
+    } while ( cmpxchg(&d->arch.vtsc_last, old, new) != old );
+
+    return gtime_to_gtsc(d, new);
+}
+
 static int cf_check read_msr(
     unsigned int reg, uint64_t *val, struct x86_emulate_ctxt *ctxt)
 {
@@ -920,7 +934,7 @@ static int cf_check read_msr(
         return X86EMUL_OKAY;
 
     case MSR_IA32_TSC:
-        *val = currd->arch.vtsc ? pv_soft_rdtsc(curr, ctxt->regs) : rdtsc();
+        *val = currd->arch.vtsc ? soft_rdtsc(curr, ctxt->regs) : rdtsc();
         return X86EMUL_OKAY;
 
     case MSR_EFER:
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -2848,19 +2848,6 @@ uint64_t gtsc_to_gtime(const struct doma
 }
 #endif /* CONFIG_HVM */
 
-uint64_t pv_soft_rdtsc(const struct vcpu *v, const struct cpu_user_regs *regs)
-{
-    s_time_t old, new, now = get_s_time();
-    struct domain *d = v->domain;
-
-    do {
-        old = d->arch.vtsc_last;
-        new = now > d->arch.vtsc_last ? now : old + 1;
-    } while ( cmpxchg(&d->arch.vtsc_last, old, new) != old );
-
-    return gtime_to_gtsc(d, new);
-}
-
 bool clocksource_is_tsc(void)
 {
     return plt_src.read_counter == READ_TSC_POISON;



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-06 13:58 ` [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() Jan Beulich
@ 2026-01-06 20:10   ` Антон Марков
  2026-01-07  8:02     ` Jan Beulich
  2026-01-14 17:49   ` Roger Pau Monné
  1 sibling, 1 reply; 32+ messages in thread
From: Антон Марков @ 2026-01-06 20:10 UTC (permalink / raw)
  To: xen-devel; +Cc: jbeulich, andrew.cooper3, roger.pau

[-- Attachment #1: Type: text/plain, Size: 6534 bytes --]

Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt 
(xen/arch/x86/hvm/hvm.c ), it was easy to catch because 
process_pending_softirqs is frequently called there, which in turn 
processes softirqs from the timer (where the timestamp is updated). 
After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped 
reproducing under no load. However, when the number of vCPUs is 4 times 
greater than the number of CPUs (under heavy load), the problem rarely 
reoccurs (mostly during snapshot restores during 
process_pending_softirqs calls), and this is no longer a simple case. If 
get_s_time_fixed can indeed be interrupted during execution after 
rdtsc_ordered, then the current fix is ​​insufficient. It's necessary to 
atomically copy "t->stamp" to the stack using local_irq_disable and 
local_irq_enable (as in local_time_calibration), and then work with the 
copy, confident in its lifetime and immutability. But until 
get_s_time_fixed is proven to be interruptible, this is premature, so 
your fix is ​​sufficient. I think I need more information and testing to 
say more.

Regarding the other scale_delta calls, if they include values 
​​calculated from externally saved tsc values ​​that could have become 
stale during the process_pending_softirqs call, this definitely needs to 
be fixed.

Here's the problematic code from dwm.exe, if that helps:

void __fastcall 
CPartitionVerticalBlankScheduler::UpdateCurrentTime(CPartitionVerticalBlankScheduler 
*this) {
     uint64_t *xor_time_ptr = (uint64_t *)((char *)this + 0x3E40);
     CPartitionVerticalBlankScheduler *scheduler = this;
     LARGE_INTEGER *current_time_ptr = (LARGE_INTEGER *)((char *)this + 
0x3E30);
     uint64_t previous_time = *((_QWORD *)this + 0x7C6); // 0x3e30
     uint64_t address_high = ((_QWORD)this + 0x3E40) << 32;
     uint64_t xor_key = ((uint64_t)this + 0x3E40) | address_high;

     // validate internal state
     if ((previous_time ^ xor_key) != *((_QWORD *)this + 0x7C8)) { // 0x3e40
         char exception_record[0x98];
         memset(exception_record, 0, sizeof(exception_record));
         *(int *)exception_record = 0x88980080; // MILERR_INTERNALERROR

         uint64_t computed_xor = *xor_time_ptr ^ ((uint64_t)xor_time_ptr 
| address_high);
         int param_count = 4;
         int previous_time_high = (int)(previous_time >> 32);
         uint32_t previous_time_low = (uint32_t)previous_time;
         int computed_xor_high = (int)(computed_xor >> 32);
         uint32_t computed_xor_low = (uint32_t)computed_xor;

RaiseFailFastException((PEXCEPTION_RECORD)exception_record, nullptr, 0);
         previous_time = *((_QWORD *)scheduler + 1990);
     }

     // get current timestamp
     *((_QWORD *)scheduler + 0x7C7) = previous_time; // 0x3e38
     QueryPerformanceCounter(current_time_ptr);

     LARGE_INTEGER new_time = *current_time_ptr;
     uint64_t last_previous_time = *((_QWORD *)scheduler + 0x7C7); // 0x3e38

     // check if time go backward
     if (new_time.QuadPart < last_previous_time) {
         char exception_record[0x98];
         memset(exception_record, 0, sizeof(exception_record));
         *(int *)exception_record = 0x8898009B; // 
MILERR_QPC_TIME_WENT_BACKWARD

         int new_time_high = new_time.HighPart;
         uint32_t new_time_low = new_time.LowPart;
         int last_previous_high = (int)(last_previous_time >> 32);
         uint32_t last_previous_low = (uint32_t)last_previous_time;
         int frequency_high = g_qpcFrequency.HighPart;
         uint32_t frequency_low = g_qpcFrequency.LowPart;

         int64_t time_delta = last_previous_time - new_time.QuadPart;
         int64_t millis_delta = (1000 * time_delta) / 
g_qpcFrequency.QuadPart;
         int millis_high = (int)(millis_delta >> 32);
         uint32_t millis_low = (uint32_t)millis_delta;

         int param_count = 8;

RaiseFailFastException((PEXCEPTION_RECORD)exception_record, nullptr, 0);
         new_time = *(LARGE_INTEGER *)((char *)scheduler + 15920);
     }

     *xor_time_ptr = new_time.QuadPart ^ xor_key;
}

Thanks.

06.01.2026 16:58, Jan Beulich пишет:
> Callers may pass in TSC values from before the local TSC stamp was last
> updated (this would in particular be the case when the TSC was latched, a
> time rendezvous occurs, and the latched value is used only afterwards).
> scale_delta(), otoh, deals with unsigned values, and hence would treat
> negative incoming deltas as huge positive values, yielding entirely bogus
> return values.
>
> Fixes: 88e64cb785c1 ("x86/HVM: use fixed TSC value when saving or restoring domain")
> Reported-by: Антон Марков<akmarkov45@gmail.com>
> Signed-off-by: Jan Beulich<jbeulich@suse.com>
> ---
> An alternative might be to have scale_delta() deal with signed deltas, yet
> that seemed more risky to me.
>
> There could likely be more Fixes: tags; the one used is the oldest
> applicable one, from what I can tell.
>
> A similar issue looks to exist in read_xen_timer() and its read_cycle()
> helper, if we're scheduled out (and beck in) between reading of the TSC
> and calculation of the delta (involving ->tsc_timestamp). Am I
> overlooking anything there?
>
> stime2tsc() guards against negative deltas by using 0 instead; I'm not
> quite sure that's correct either.
>
> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
> comment towards the TSC being "sane", but is that correct? Due to
> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
> calling tsc_ticks2ns()?
>
> A similar issue looks to exist in tsc_get_info(), again when rdtsc()
> possibly returns a huge value due to TSC_ADJUST. Once again I wonder
> whether we shouldn't subtract boot_tsc_stamp.
>
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1649,8 +1649,13 @@ s_time_t get_s_time_fixed(u64 at_tsc)
>           tsc = at_tsc;
>       else
>           tsc = rdtsc_ordered();
> -    delta = tsc - t->stamp.local_tsc;
> -    return t->stamp.local_stime + scale_delta(delta, &t->tsc_scale);
> +
> +    if ( tsc >= t->stamp.local_tsc )
> +        delta = scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale);
> +    else
> +        delta = -scale_delta(t->stamp.local_tsc - tsc, &t->tsc_scale);
> +
> +    return t->stamp.local_stime + delta;
>   }
>   
>   s_time_t get_s_time(void)
>

[-- Attachment #2: Type: text/html, Size: 10522 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-06 20:10   ` Антон Марков
@ 2026-01-07  8:02     ` Jan Beulich
  2026-01-09 16:11       ` Anton Markov
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2026-01-07  8:02 UTC (permalink / raw)
  To: Антон Марков
  Cc: andrew.cooper3, roger.pau, xen-devel

On 06.01.2026 21:10, Антон Марков wrote:
> Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt 
> (xen/arch/x86/hvm/hvm.c ), it was easy to catch because 
> process_pending_softirqs is frequently called there, which in turn 
> processes softirqs from the timer (where the timestamp is updated). 
> After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped 
> reproducing under no load. However, when the number of vCPUs is 4 times 
> greater than the number of CPUs (under heavy load), the problem rarely 
> reoccurs (mostly during snapshot restores during 
> process_pending_softirqs calls), and this is no longer a simple case. If 
> get_s_time_fixed can indeed be interrupted during execution after 
> rdtsc_ordered, then the current fix is ​​insufficient. It's necessary to 
> atomically copy "t->stamp" to the stack using local_irq_disable and 
> local_irq_enable (as in local_time_calibration), and then work with the 
> copy, confident in its lifetime and immutability. But until 
> get_s_time_fixed is proven to be interruptible, this is premature, so 
> your fix is ​​sufficient. I think I need more information and testing to 
> say more.

While the cpu_calibration per-CPU variable is updated from IRQ context,
the cpu_time one isn't. Hence t->stamp's contents cannot change behind
the back of get_s_time_fixed(). I wonder whether ...

> Regarding the other scale_delta calls, if they include values 
​​> calculated from externally saved tsc values ​​that could have become 
> stale during the process_pending_softirqs call, this definitely needs to 
> be fixed.

... another similar issue (possibly one not included in the set of
remarks I have in the patch, as none of those look related to what you
describe) might be causing the remaining, more rare problems you say you
see. That set of remarks is actually a result of me going over all other
scale_delta() calls, but of course I may have got the analysis wrong.

As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy
load) - while I don't think we have a support statement for such upstream
(when probably we should), iirc for our (SUSE's) products we would
consider that unsupported. Just fyi.

Also, btw, please don't top-post.

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-07  8:02     ` Jan Beulich
@ 2026-01-09 16:11       ` Anton Markov
  2026-01-12 10:31         ` Anton Markov
  0 siblings, 1 reply; 32+ messages in thread
From: Anton Markov @ 2026-01-09 16:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, roger.pau, xen-devel

[-- Attachment #1: Type: text/plain, Size: 9698 bytes --]

You're right. These aren't interrupts in get_s_time_fixed, but rather a
cumulative error in the sequence due to integer rounding. I added logging
of the current local_stime to local_time_calibration and noticed that the
timestamp between cores is gradually increasing. If the server has been
running for weeks, this could be a very large value.


```

@@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void)

if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )

{

/* Atomically read cpu_calibration struct and write cpu_time struct. */

+ printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n",
smp_processor_id(), t->stamp.local_stime, c->local_stime,

+ t->last_seen_ns, t->last_seen_tsc);

local_irq_disable();

t->stamp = *c;

local_irq_enable();

```


2 hours of work:


```

(XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597
(8565145862216, 0)

(XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609
(8565145863957, 0)

(XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491
(8565145864800, 0)

(XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609
(8565145865372, 0)


8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag

```


6 hours of work:

```

(XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993
(22915730870665, 0)

(XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889
(22915730870693, 0)

(XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841
(22915730872231, 0)

(XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696
(22915730872096, 0)


22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag

```


Clarification, according to my xen settings:

```

ucode=scan dom0_mem=53923M,max:53923M dom0_max_vcpus=4-96 dom0_vcpus_pin=0
force-ept=1 ept=no-ad,no-pml hap_1gb=0 hap_2mb=0 altp2m=1
hpet=legacy-replacement smt=1 spec-ctrl=no gnttab_max_frames=512
cpufreq=xen:performance max_cstate=1 sched=credit sched-gran=cpu apicv=0
sched_credit_tslice_ms=5 sched_ratelimit_us=500

```


Processors I tested on:


```

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz


Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 cx16
sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm cpuid_fault
fsgsbase erms xsaveopt arch_capabilities

```

```

Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz x2 (NUMA)


Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 fma cx16
sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm
3dnowprefetch cpuid_fault ibpb fsgsbase bmi1 avx2 bmi2 erms rtm avx512f
avx512dq rdseed adx avx512ifma clflushopt clwb avx512cd sha_ni avx512bw
avx512vl xsaveopt xsavec xgetbv1 avx512vbmi avx512_vbmi2 gfni vaes
vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear
arch_capabilities

```


Next I moved the code to r3 to speed up playback:


```

#include <stdint.h>

#include <stdio.h>


static __inline__ unsigned long long rdtsc(void)

{

unsigned hi, lo;

__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));

return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );

}


#define MILLISECS(_ms) ((int64_t)((_ms) * 1000000ULL))


struct cpu_time_stamp {

uint64_t local_tsc;

int64_t local_stime;

int64_t master_stime;

};


struct time_scale {

int shift;

uint32_t mul_frac;

};



static inline uint32_t div_frac(uint32_t dividend, uint32_t divisor)

{

uint32_t quotient, remainder;

asm (

"divl %4"

: "=a" (quotient), "=d" (remainder)

: "0" (0), "1" (dividend), "r" (divisor) );

return quotient;

}



void set_time_scale(struct time_scale *ts, uint64_t ticks_per_sec)

{

uint64_t tps64 = ticks_per_sec;

uint32_t tps32;

int shift = 0;


while ( tps64 > (MILLISECS(1000)*2) )

{

tps64 >>= 1;

shift--;

}


tps32 = (uint32_t)tps64;

while ( tps32 <= (uint32_t)MILLISECS(1000) )

{

tps32 <<= 1;

shift++;

}


ts->mul_frac = div_frac(MILLISECS(1000), tps32);

ts->shift = shift;

}



uint64_t scale_delta(uint64_t delta, const struct time_scale *scale)

{

uint64_t product;


if ( scale->shift < 0 )

delta >>= -scale->shift;

else

delta <<= scale->shift;


asm (

"mulq %2 ; shrd $32,%1,%0"

: "=a" (product), "=d" (delta)

: "rm" (delta), "0" ((uint64_t)scale->mul_frac) );


return product;

}


#define _TS_MUL_FRAC_IDENTITY 0x80000000UL


static inline struct time_scale scale_reciprocal(struct time_scale scale)

{

struct time_scale reciprocal;

uint32_t dividend;


dividend = _TS_MUL_FRAC_IDENTITY;

reciprocal.shift = 1 - scale.shift;

while ( dividend >= scale.mul_frac )

{

dividend >>= 1;

reciprocal.shift++;

}


asm (

"divl %4"

: "=a" (reciprocal.mul_frac), "=d" (dividend)

: "0" (0), "1" (dividend), "r" (scale.mul_frac) );


return reciprocal;

}



int64_t get_s_time_fixed(const struct cpu_time_stamp *t, const struct
time_scale *ts, uint64_t at_tsc)

{

uint64_t tsc, delta;


if ( at_tsc )

tsc = at_tsc;

else

tsc = rdtsc();

delta = tsc - t->local_tsc;

return t->local_stime + scale_delta(delta, ts);

}


int main() {


struct cpu_time_stamp ct;

struct time_scale ts;

struct time_scale back;


uint64_t start = rdtsc();


set_time_scale(&ts, 3000000000);

back = scale_reciprocal(ts);


ct.local_tsc = start;

ct.local_stime = scale_delta(start, &ts);


while (1) {

uint64_t new_tsc = rdtsc();

ct.local_stime = get_s_time_fixed(&ct, &ts, new_tsc);

ct.local_tsc = new_tsc;


uint64_t tmp_tsc = rdtsc();

printf("%lu %lu\n", tmp_tsc, scale_delta(get_s_time_fixed(&ct, &ts,
tmp_tsc), &back));

}


return 0;

}

```


It's enough to run the loop for 10-20 seconds to see the problem.
scale_delta rounds the least significant bits during calculation, which
causes the error to accumulate, at different rates on different cores,
depending on the least significant bits at the time of calibration.


Now let's talk about why dwm reacts this way. When a snapshot is reversed,
last_guest_time in hvm_get_guest_time_fixed is set to 0, which doesn't
prevent time from flowing backwards. This means that cache_tsc_offset in
hvm_get_guest_tsc_fixed may be configured correctly on one physical core,
but due to shedding on a physical core with a lagging tsc, the guest may
occasionally see a tsc value lower than it saw before the snapshot was
reversed. If this happens to the dwm code, it terminates with an error.


A quick solution to this problem might be to save the last_seen_tsc
parameter in a snapshot for each core, followed by validation.


The correct solution is to remove the rounding of the least significant
bits from the sequence.

On Wed, Jan 7, 2026 at 11:02 AM Jan Beulich <jbeulich@suse.com> wrote:

> On 06.01.2026 21:10, Антон Марков wrote:
> > Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt
> > (xen/arch/x86/hvm/hvm.c ), it was easy to catch because
> > process_pending_softirqs is frequently called there, which in turn
> > processes softirqs from the timer (where the timestamp is updated).
> > After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped
> > reproducing under no load. However, when the number of vCPUs is 4 times
> > greater than the number of CPUs (under heavy load), the problem rarely
> > reoccurs (mostly during snapshot restores during
> > process_pending_softirqs calls), and this is no longer a simple case. If
> > get_s_time_fixed can indeed be interrupted during execution after
> > rdtsc_ordered, then the current fix is ​​insufficient. It's necessary to
> > atomically copy "t->stamp" to the stack using local_irq_disable and
> > local_irq_enable (as in local_time_calibration), and then work with the
> > copy, confident in its lifetime and immutability. But until
> > get_s_time_fixed is proven to be interruptible, this is premature, so
> > your fix is ​​sufficient. I think I need more information and testing to
> > say more.
>
> While the cpu_calibration per-CPU variable is updated from IRQ context,
> the cpu_time one isn't. Hence t->stamp's contents cannot change behind
> the back of get_s_time_fixed(). I wonder whether ...
>
> > Regarding the other scale_delta calls, if they include values
> ​​> calculated from externally saved tsc values ​​that could have become
> > stale during the process_pending_softirqs call, this definitely needs to
> > be fixed.
>
> ... another similar issue (possibly one not included in the set of
> remarks I have in the patch, as none of those look related to what you
> describe) might be causing the remaining, more rare problems you say you
> see. That set of remarks is actually a result of me going over all other
> scale_delta() calls, but of course I may have got the analysis wrong.
>
> As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy
> load) - while I don't think we have a support statement for such upstream
> (when probably we should), iirc for our (SUSE's) products we would
> consider that unsupported. Just fyi.
>
> Also, btw, please don't top-post.
>
> Jan
>

[-- Attachment #2: Type: text/html, Size: 27925 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-09 16:11       ` Anton Markov
@ 2026-01-12 10:31         ` Anton Markov
  2026-01-12 11:45           ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Anton Markov @ 2026-01-12 10:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, roger.pau, xen-devel

[-- Attachment #1: Type: text/plain, Size: 10788 bytes --]

Bit rounding isn't the main issue; the difference in ipi delivery to the
cores accumulates due to the ordering. Replacing get_s_time_fixed with
scale_delta in time_calibration_rendezvous_tail should be sufficient. This
configuration won't accumulate errors, but bit rounding can still cause a
significant difference from calibration to calibration, but it's not as
significant.

On Fri, Jan 9, 2026 at 7:11 PM Anton Markov <akmarkov45@gmail.com> wrote:

> You're right. These aren't interrupts in get_s_time_fixed, but rather a
> cumulative error in the sequence due to integer rounding. I added logging
> of the current local_stime to local_time_calibration and noticed that the
> timestamp between cores is gradually increasing. If the server has been
> running for weeks, this could be a very large value.
>
>
> ```
>
> @@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void)
>
> if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>
> {
>
> /* Atomically read cpu_calibration struct and write cpu_time struct. */
>
> + printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n",
> smp_processor_id(), t->stamp.local_stime, c->local_stime,
>
> + t->last_seen_ns, t->last_seen_tsc);
>
> local_irq_disable();
>
> t->stamp = *c;
>
> local_irq_enable();
>
> ```
>
>
> 2 hours of work:
>
>
> ```
>
> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597
> (8565145862216, 0)
>
> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609
> (8565145863957, 0)
>
> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491
> (8565145864800, 0)
>
> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609
> (8565145865372, 0)
>
>
> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag
>
> ```
>
>
> 6 hours of work:
>
> ```
>
> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993
> (22915730870665, 0)
>
> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889
> (22915730870693, 0)
>
> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841
> (22915730872231, 0)
>
> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696
> (22915730872096, 0)
>
>
> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag
>
> ```
>
>
> Clarification, according to my xen settings:
>
> ```
>
> ucode=scan dom0_mem=53923M,max:53923M dom0_max_vcpus=4-96 dom0_vcpus_pin=0
> force-ept=1 ept=no-ad,no-pml hap_1gb=0 hap_2mb=0 altp2m=1
> hpet=legacy-replacement smt=1 spec-ctrl=no gnttab_max_frames=512
> cpufreq=xen:performance max_cstate=1 sched=credit sched-gran=cpu apicv=0
> sched_credit_tslice_ms=5 sched_ratelimit_us=500
>
> ```
>
>
> Processors I tested on:
>
>
> ```
>
> Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
>
>
> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 cx16
> sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm cpuid_fault
> fsgsbase erms xsaveopt arch_capabilities
>
> ```
>
> ```
>
> Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz x2 (NUMA)
>
>
> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 fma cx16
> sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm
> 3dnowprefetch cpuid_fault ibpb fsgsbase bmi1 avx2 bmi2 erms rtm avx512f
> avx512dq rdseed adx avx512ifma clflushopt clwb avx512cd sha_ni avx512bw
> avx512vl xsaveopt xsavec xgetbv1 avx512vbmi avx512_vbmi2 gfni vaes
> vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear
> arch_capabilities
>
> ```
>
>
> Next I moved the code to r3 to speed up playback:
>
>
> ```
>
> #include <stdint.h>
>
> #include <stdio.h>
>
>
> static __inline__ unsigned long long rdtsc(void)
>
> {
>
> unsigned hi, lo;
>
> __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
>
> return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
>
> }
>
>
> #define MILLISECS(_ms) ((int64_t)((_ms) * 1000000ULL))
>
>
> struct cpu_time_stamp {
>
> uint64_t local_tsc;
>
> int64_t local_stime;
>
> int64_t master_stime;
>
> };
>
>
> struct time_scale {
>
> int shift;
>
> uint32_t mul_frac;
>
> };
>
>
>
> static inline uint32_t div_frac(uint32_t dividend, uint32_t divisor)
>
> {
>
> uint32_t quotient, remainder;
>
> asm (
>
> "divl %4"
>
> : "=a" (quotient), "=d" (remainder)
>
> : "0" (0), "1" (dividend), "r" (divisor) );
>
> return quotient;
>
> }
>
>
>
> void set_time_scale(struct time_scale *ts, uint64_t ticks_per_sec)
>
> {
>
> uint64_t tps64 = ticks_per_sec;
>
> uint32_t tps32;
>
> int shift = 0;
>
>
> while ( tps64 > (MILLISECS(1000)*2) )
>
> {
>
> tps64 >>= 1;
>
> shift--;
>
> }
>
>
> tps32 = (uint32_t)tps64;
>
> while ( tps32 <= (uint32_t)MILLISECS(1000) )
>
> {
>
> tps32 <<= 1;
>
> shift++;
>
> }
>
>
> ts->mul_frac = div_frac(MILLISECS(1000), tps32);
>
> ts->shift = shift;
>
> }
>
>
>
> uint64_t scale_delta(uint64_t delta, const struct time_scale *scale)
>
> {
>
> uint64_t product;
>
>
> if ( scale->shift < 0 )
>
> delta >>= -scale->shift;
>
> else
>
> delta <<= scale->shift;
>
>
> asm (
>
> "mulq %2 ; shrd $32,%1,%0"
>
> : "=a" (product), "=d" (delta)
>
> : "rm" (delta), "0" ((uint64_t)scale->mul_frac) );
>
>
> return product;
>
> }
>
>
> #define _TS_MUL_FRAC_IDENTITY 0x80000000UL
>
>
> static inline struct time_scale scale_reciprocal(struct time_scale scale)
>
> {
>
> struct time_scale reciprocal;
>
> uint32_t dividend;
>
>
> dividend = _TS_MUL_FRAC_IDENTITY;
>
> reciprocal.shift = 1 - scale.shift;
>
> while ( dividend >= scale.mul_frac )
>
> {
>
> dividend >>= 1;
>
> reciprocal.shift++;
>
> }
>
>
> asm (
>
> "divl %4"
>
> : "=a" (reciprocal.mul_frac), "=d" (dividend)
>
> : "0" (0), "1" (dividend), "r" (scale.mul_frac) );
>
>
> return reciprocal;
>
> }
>
>
>
> int64_t get_s_time_fixed(const struct cpu_time_stamp *t, const struct
> time_scale *ts, uint64_t at_tsc)
>
> {
>
> uint64_t tsc, delta;
>
>
> if ( at_tsc )
>
> tsc = at_tsc;
>
> else
>
> tsc = rdtsc();
>
> delta = tsc - t->local_tsc;
>
> return t->local_stime + scale_delta(delta, ts);
>
> }
>
>
> int main() {
>
>
> struct cpu_time_stamp ct;
>
> struct time_scale ts;
>
> struct time_scale back;
>
>
> uint64_t start = rdtsc();
>
>
> set_time_scale(&ts, 3000000000);
>
> back = scale_reciprocal(ts);
>
>
> ct.local_tsc = start;
>
> ct.local_stime = scale_delta(start, &ts);
>
>
> while (1) {
>
> uint64_t new_tsc = rdtsc();
>
> ct.local_stime = get_s_time_fixed(&ct, &ts, new_tsc);
>
> ct.local_tsc = new_tsc;
>
>
> uint64_t tmp_tsc = rdtsc();
>
> printf("%lu %lu\n", tmp_tsc, scale_delta(get_s_time_fixed(&ct, &ts,
> tmp_tsc), &back));
>
> }
>
>
> return 0;
>
> }
>
> ```
>
>
> It's enough to run the loop for 10-20 seconds to see the problem.
> scale_delta rounds the least significant bits during calculation, which
> causes the error to accumulate, at different rates on different cores,
> depending on the least significant bits at the time of calibration.
>
>
> Now let's talk about why dwm reacts this way. When a snapshot is reversed,
> last_guest_time in hvm_get_guest_time_fixed is set to 0, which doesn't
> prevent time from flowing backwards. This means that cache_tsc_offset in
> hvm_get_guest_tsc_fixed may be configured correctly on one physical core,
> but due to shedding on a physical core with a lagging tsc, the guest may
> occasionally see a tsc value lower than it saw before the snapshot was
> reversed. If this happens to the dwm code, it terminates with an error.
>
>
> A quick solution to this problem might be to save the last_seen_tsc
> parameter in a snapshot for each core, followed by validation.
>
>
> The correct solution is to remove the rounding of the least significant
> bits from the sequence.
>
> On Wed, Jan 7, 2026 at 11:02 AM Jan Beulich <jbeulich@suse.com> wrote:
>
>> On 06.01.2026 21:10, Антон Марков wrote:
>> > Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt
>> > (xen/arch/x86/hvm/hvm.c ), it was easy to catch because
>> > process_pending_softirqs is frequently called there, which in turn
>> > processes softirqs from the timer (where the timestamp is updated).
>> > After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped
>> > reproducing under no load. However, when the number of vCPUs is 4 times
>> > greater than the number of CPUs (under heavy load), the problem rarely
>> > reoccurs (mostly during snapshot restores during
>> > process_pending_softirqs calls), and this is no longer a simple case.
>> If
>> > get_s_time_fixed can indeed be interrupted during execution after
>> > rdtsc_ordered, then the current fix is insufficient. It's necessary to
>> > atomically copy "t->stamp" to the stack using local_irq_disable and
>> > local_irq_enable (as in local_time_calibration), and then work with the
>> > copy, confident in its lifetime and immutability. But until
>> > get_s_time_fixed is proven to be interruptible, this is premature, so
>> > your fix is sufficient. I think I need more information and testing to
>> > say more.
>>
>> While the cpu_calibration per-CPU variable is updated from IRQ context,
>> the cpu_time one isn't. Hence t->stamp's contents cannot change behind
>> the back of get_s_time_fixed(). I wonder whether ...
>>
>> > Regarding the other scale_delta calls, if they include values
>> > calculated from externally saved tsc values that could have become
>> > stale during the process_pending_softirqs call, this definitely needs
>> to
>> > be fixed.
>>
>> ... another similar issue (possibly one not included in the set of
>> remarks I have in the patch, as none of those look related to what you
>> describe) might be causing the remaining, more rare problems you say you
>> see. That set of remarks is actually a result of me going over all other
>> scale_delta() calls, but of course I may have got the analysis wrong.
>>
>> As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy
>> load) - while I don't think we have a support statement for such upstream
>> (when probably we should), iirc for our (SUSE's) products we would
>> consider that unsupported. Just fyi.
>>
>> Also, btw, please don't top-post.
>>
>> Jan
>>
>

[-- Attachment #2: Type: text/html, Size: 28969 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-12 10:31         ` Anton Markov
@ 2026-01-12 11:45           ` Jan Beulich
  2026-01-12 12:49             ` Anton Markov
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2026-01-12 11:45 UTC (permalink / raw)
  To: Anton Markov; +Cc: andrew.cooper3, roger.pau, xen-devel

On 12.01.2026 11:31, Anton Markov wrote:
> Bit rounding isn't the main issue; the difference in ipi delivery to the
> cores accumulates due to the ordering. Replacing get_s_time_fixed with
> scale_delta in time_calibration_rendezvous_tail should be sufficient. This
> configuration won't accumulate errors, but bit rounding can still cause a
> significant difference from calibration to calibration, but it's not as
> significant.

That invocation of get_s_time_fixed() reduces to scale_delta() (without
further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
it's not quite clear to me what change you are suggesting (that would
actually make a functional difference).

Btw, your prior response was too hard to properly read, due to excess blank
lines with at the same time squashed leading blanks. Together with your
apparent inability to avoid top-posting, I think you really want to adjust
your mail program's configuration.

Jan

> On Fri, Jan 9, 2026 at 7:11 PM Anton Markov <akmarkov45@gmail.com> wrote:
> 
>> You're right. These aren't interrupts in get_s_time_fixed, but rather a
>> cumulative error in the sequence due to integer rounding. I added logging
>> of the current local_stime to local_time_calibration and noticed that the
>> timestamp between cores is gradually increasing. If the server has been
>> running for weeks, this could be a very large value.
>>
>>
>> ```
>>
>> @@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void)
>>
>> if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>>
>> {
>>
>> /* Atomically read cpu_calibration struct and write cpu_time struct. */
>>
>> + printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n",
>> smp_processor_id(), t->stamp.local_stime, c->local_stime,
>>
>> + t->last_seen_ns, t->last_seen_tsc);
>>
>> local_irq_disable();
>>
>> t->stamp = *c;
>>
>> local_irq_enable();
>>
>> ```
>>
>>
>> 2 hours of work:
>>
>>
>> ```
>>
>> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597
>> (8565145862216, 0)
>>
>> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609
>> (8565145863957, 0)
>>
>> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491
>> (8565145864800, 0)
>>
>> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609
>> (8565145865372, 0)
>>
>>
>> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag
>>
>> ```
>>
>>
>> 6 hours of work:
>>
>> ```
>>
>> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993
>> (22915730870665, 0)
>>
>> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889
>> (22915730870693, 0)
>>
>> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841
>> (22915730872231, 0)
>>
>> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696
>> (22915730872096, 0)
>>
>>
>> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag
>>
>> ```
>>
>>
>> Clarification, according to my xen settings:
>>
>> ```
>>
>> ucode=scan dom0_mem=53923M,max:53923M dom0_max_vcpus=4-96 dom0_vcpus_pin=0
>> force-ept=1 ept=no-ad,no-pml hap_1gb=0 hap_2mb=0 altp2m=1
>> hpet=legacy-replacement smt=1 spec-ctrl=no gnttab_max_frames=512
>> cpufreq=xen:performance max_cstate=1 sched=credit sched-gran=cpu apicv=0
>> sched_credit_tslice_ms=5 sched_ratelimit_us=500
>>
>> ```
>>
>>
>> Processors I tested on:
>>
>>
>> ```
>>
>> Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
>>
>>
>> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
>> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
>> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 cx16
>> sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm cpuid_fault
>> fsgsbase erms xsaveopt arch_capabilities
>>
>> ```
>>
>> ```
>>
>> Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz x2 (NUMA)
>>
>>
>> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
>> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
>> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 fma cx16
>> sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm
>> 3dnowprefetch cpuid_fault ibpb fsgsbase bmi1 avx2 bmi2 erms rtm avx512f
>> avx512dq rdseed adx avx512ifma clflushopt clwb avx512cd sha_ni avx512bw
>> avx512vl xsaveopt xsavec xgetbv1 avx512vbmi avx512_vbmi2 gfni vaes
>> vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear
>> arch_capabilities
>>
>> ```
>>
>>
>> Next I moved the code to r3 to speed up playback:
>>
>>
>> ```
>>
>> #include <stdint.h>
>>
>> #include <stdio.h>
>>
>>
>> static __inline__ unsigned long long rdtsc(void)
>>
>> {
>>
>> unsigned hi, lo;
>>
>> __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
>>
>> return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
>>
>> }
>>
>>
>> #define MILLISECS(_ms) ((int64_t)((_ms) * 1000000ULL))
>>
>>
>> struct cpu_time_stamp {
>>
>> uint64_t local_tsc;
>>
>> int64_t local_stime;
>>
>> int64_t master_stime;
>>
>> };
>>
>>
>> struct time_scale {
>>
>> int shift;
>>
>> uint32_t mul_frac;
>>
>> };
>>
>>
>>
>> static inline uint32_t div_frac(uint32_t dividend, uint32_t divisor)
>>
>> {
>>
>> uint32_t quotient, remainder;
>>
>> asm (
>>
>> "divl %4"
>>
>> : "=a" (quotient), "=d" (remainder)
>>
>> : "0" (0), "1" (dividend), "r" (divisor) );
>>
>> return quotient;
>>
>> }
>>
>>
>>
>> void set_time_scale(struct time_scale *ts, uint64_t ticks_per_sec)
>>
>> {
>>
>> uint64_t tps64 = ticks_per_sec;
>>
>> uint32_t tps32;
>>
>> int shift = 0;
>>
>>
>> while ( tps64 > (MILLISECS(1000)*2) )
>>
>> {
>>
>> tps64 >>= 1;
>>
>> shift--;
>>
>> }
>>
>>
>> tps32 = (uint32_t)tps64;
>>
>> while ( tps32 <= (uint32_t)MILLISECS(1000) )
>>
>> {
>>
>> tps32 <<= 1;
>>
>> shift++;
>>
>> }
>>
>>
>> ts->mul_frac = div_frac(MILLISECS(1000), tps32);
>>
>> ts->shift = shift;
>>
>> }
>>
>>
>>
>> uint64_t scale_delta(uint64_t delta, const struct time_scale *scale)
>>
>> {
>>
>> uint64_t product;
>>
>>
>> if ( scale->shift < 0 )
>>
>> delta >>= -scale->shift;
>>
>> else
>>
>> delta <<= scale->shift;
>>
>>
>> asm (
>>
>> "mulq %2 ; shrd $32,%1,%0"
>>
>> : "=a" (product), "=d" (delta)
>>
>> : "rm" (delta), "0" ((uint64_t)scale->mul_frac) );
>>
>>
>> return product;
>>
>> }
>>
>>
>> #define _TS_MUL_FRAC_IDENTITY 0x80000000UL
>>
>>
>> static inline struct time_scale scale_reciprocal(struct time_scale scale)
>>
>> {
>>
>> struct time_scale reciprocal;
>>
>> uint32_t dividend;
>>
>>
>> dividend = _TS_MUL_FRAC_IDENTITY;
>>
>> reciprocal.shift = 1 - scale.shift;
>>
>> while ( dividend >= scale.mul_frac )
>>
>> {
>>
>> dividend >>= 1;
>>
>> reciprocal.shift++;
>>
>> }
>>
>>
>> asm (
>>
>> "divl %4"
>>
>> : "=a" (reciprocal.mul_frac), "=d" (dividend)
>>
>> : "0" (0), "1" (dividend), "r" (scale.mul_frac) );
>>
>>
>> return reciprocal;
>>
>> }
>>
>>
>>
>> int64_t get_s_time_fixed(const struct cpu_time_stamp *t, const struct
>> time_scale *ts, uint64_t at_tsc)
>>
>> {
>>
>> uint64_t tsc, delta;
>>
>>
>> if ( at_tsc )
>>
>> tsc = at_tsc;
>>
>> else
>>
>> tsc = rdtsc();
>>
>> delta = tsc - t->local_tsc;
>>
>> return t->local_stime + scale_delta(delta, ts);
>>
>> }
>>
>>
>> int main() {
>>
>>
>> struct cpu_time_stamp ct;
>>
>> struct time_scale ts;
>>
>> struct time_scale back;
>>
>>
>> uint64_t start = rdtsc();
>>
>>
>> set_time_scale(&ts, 3000000000);
>>
>> back = scale_reciprocal(ts);
>>
>>
>> ct.local_tsc = start;
>>
>> ct.local_stime = scale_delta(start, &ts);
>>
>>
>> while (1) {
>>
>> uint64_t new_tsc = rdtsc();
>>
>> ct.local_stime = get_s_time_fixed(&ct, &ts, new_tsc);
>>
>> ct.local_tsc = new_tsc;
>>
>>
>> uint64_t tmp_tsc = rdtsc();
>>
>> printf("%lu %lu\n", tmp_tsc, scale_delta(get_s_time_fixed(&ct, &ts,
>> tmp_tsc), &back));
>>
>> }
>>
>>
>> return 0;
>>
>> }
>>
>> ```
>>
>>
>> It's enough to run the loop for 10-20 seconds to see the problem.
>> scale_delta rounds the least significant bits during calculation, which
>> causes the error to accumulate, at different rates on different cores,
>> depending on the least significant bits at the time of calibration.
>>
>>
>> Now let's talk about why dwm reacts this way. When a snapshot is reversed,
>> last_guest_time in hvm_get_guest_time_fixed is set to 0, which doesn't
>> prevent time from flowing backwards. This means that cache_tsc_offset in
>> hvm_get_guest_tsc_fixed may be configured correctly on one physical core,
>> but due to shedding on a physical core with a lagging tsc, the guest may
>> occasionally see a tsc value lower than it saw before the snapshot was
>> reversed. If this happens to the dwm code, it terminates with an error.
>>
>>
>> A quick solution to this problem might be to save the last_seen_tsc
>> parameter in a snapshot for each core, followed by validation.
>>
>>
>> The correct solution is to remove the rounding of the least significant
>> bits from the sequence.
>>
>> On Wed, Jan 7, 2026 at 11:02 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>>> On 06.01.2026 21:10, Антон Марков wrote:
>>>> Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt
>>>> (xen/arch/x86/hvm/hvm.c ), it was easy to catch because
>>>> process_pending_softirqs is frequently called there, which in turn
>>>> processes softirqs from the timer (where the timestamp is updated).
>>>> After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped
>>>> reproducing under no load. However, when the number of vCPUs is 4 times
>>>> greater than the number of CPUs (under heavy load), the problem rarely
>>>> reoccurs (mostly during snapshot restores during
>>>> process_pending_softirqs calls), and this is no longer a simple case.
>>> If
>>>> get_s_time_fixed can indeed be interrupted during execution after
>>>> rdtsc_ordered, then the current fix is insufficient. It's necessary to
>>>> atomically copy "t->stamp" to the stack using local_irq_disable and
>>>> local_irq_enable (as in local_time_calibration), and then work with the
>>>> copy, confident in its lifetime and immutability. But until
>>>> get_s_time_fixed is proven to be interruptible, this is premature, so
>>>> your fix is sufficient. I think I need more information and testing to
>>>> say more.
>>>
>>> While the cpu_calibration per-CPU variable is updated from IRQ context,
>>> the cpu_time one isn't. Hence t->stamp's contents cannot change behind
>>> the back of get_s_time_fixed(). I wonder whether ...
>>>
>>>> Regarding the other scale_delta calls, if they include values
>>>> calculated from externally saved tsc values that could have become
>>>> stale during the process_pending_softirqs call, this definitely needs
>>> to
>>>> be fixed.
>>>
>>> ... another similar issue (possibly one not included in the set of
>>> remarks I have in the patch, as none of those look related to what you
>>> describe) might be causing the remaining, more rare problems you say you
>>> see. That set of remarks is actually a result of me going over all other
>>> scale_delta() calls, but of course I may have got the analysis wrong.
>>>
>>> As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy
>>> load) - while I don't think we have a support statement for such upstream
>>> (when probably we should), iirc for our (SUSE's) products we would
>>> consider that unsupported. Just fyi.
>>>
>>> Also, btw, please don't top-post.
>>>
>>> Jan
>>>
>>
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-12 11:45           ` Jan Beulich
@ 2026-01-12 12:49             ` Anton Markov
  2026-01-12 14:13               ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Anton Markov @ 2026-01-12 12:49 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, roger.pau, xen-devel

[-- Attachment #1: Type: text/plain, Size: 15599 bytes --]

>
> Btw, your prior response was too hard to properly read, due to excess blank
> lines with at the same time squashed leading blanks. Together with your
> apparent inability to avoid top-posting, I think you really want to adjust
> your mail program's configuration.

I suggest we skip the discussion of formatting and the number of spaces in
messages and instead focus on the topic of the thread. I have a very
difficult time troubleshooting difficult-to-reproduce bugs, and the fact
that their descriptions are difficult to read due to the number of spaces
is probably the least of the difficulties.

That invocation of get_s_time_fixed() reduces to scale_delta() (without
> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
> it's not quite clear to me what change you are suggesting (that would
> actually make a functional difference).

Replacing get_s_time_fixed with scale_delta will remove the calculation
dependency on the previous local_stime value, which accumulates lag between
cores. This is because: rdtsc_ordered is not called synchronously on the
cores, but by the difference offset by the ipi speed. Therefore, we get:

core0: current_rdtsc;
core1: current_rdtsc + ipi speed;
coreN: current_rdtsc + ipi speed * N;

Since ipi values are sent alternately in a loop to core0, in the version
with get_s_time_fixed, we get the following local_stime calculation format.

coreN: local_stime = local_stime + scale_delta((current_rdtsc + (ipi_speed
* N)) – local_rdtsc);

This means the time on each core will differ by ipi_speed * N. And since
we're using the values of the previous local_stime, the difference will
accumulate because the previous local_stime was also offset. In the version
with scale_delta, we get:

coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N));

This means there will still be a difference, but it won't accumulate, and
the offsets will remain within normal limits.

If it's still unclear: If your local_stime in get_s_time_fixed is offset
relative to other cores, then the fact that rdtsc_ordered and local_tsc are
not offset doesn't change anything, since you're using the delta relative
to local_stime.

core0_local_stime + (rdtsc_ordered() - local_tsc) != core1_local_stime +
(rdtsc_ordered() - local_tsc); // Even if rdtsc_ordered() and local_tsc are
equal across cores.

On 96-core configurations, up to a millisecond of latency can accumulate in
local_stime over a week of operation, and this is a significant
difference. This
is due to the fact that I use cpufreq=xen:performance max_cstate=1 ,
meaning that in my configuration, local_stime is never overwritten by
master_stime.

Thanks.

On Mon, Jan 12, 2026 at 2:45 PM Jan Beulich <jbeulich@suse.com> wrote:

> On 12.01.2026 11:31, Anton Markov wrote:
> > Bit rounding isn't the main issue; the difference in ipi delivery to the
> > cores accumulates due to the ordering. Replacing get_s_time_fixed with
> > scale_delta in time_calibration_rendezvous_tail should be sufficient.
> This
> > configuration won't accumulate errors, but bit rounding can still cause a
> > significant difference from calibration to calibration, but it's not as
> > significant.
>
> That invocation of get_s_time_fixed() reduces to scale_delta() (without
> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
> it's not quite clear to me what change you are suggesting (that would
> actually make a functional difference).
>
> Btw, your prior response was too hard to properly read, due to excess blank
> lines with at the same time squashed leading blanks. Together with your
> apparent inability to avoid top-posting, I think you really want to adjust
> your mail program's configuration.
>
> Jan
>
> > On Fri, Jan 9, 2026 at 7:11 PM Anton Markov <akmarkov45@gmail.com>
> wrote:
> >
> >> You're right. These aren't interrupts in get_s_time_fixed, but rather a
> >> cumulative error in the sequence due to integer rounding. I added
> logging
> >> of the current local_stime to local_time_calibration and noticed that
> the
> >> timestamp between cores is gradually increasing. If the server has been
> >> running for weeks, this could be a very large value.
> >>
> >>
> >> ```
> >>
> >> @@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void)
> >>
> >> if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
> >>
> >> {
> >>
> >> /* Atomically read cpu_calibration struct and write cpu_time struct. */
> >>
> >> + printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n",
> >> smp_processor_id(), t->stamp.local_stime, c->local_stime,
> >>
> >> + t->last_seen_ns, t->last_seen_tsc);
> >>
> >> local_irq_disable();
> >>
> >> t->stamp = *c;
> >>
> >> local_irq_enable();
> >>
> >> ```
> >>
> >>
> >> 2 hours of work:
> >>
> >>
> >> ```
> >>
> >> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597
> >> (8565145862216, 0)
> >>
> >> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609
> >> (8565145863957, 0)
> >>
> >> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491
> >> (8565145864800, 0)
> >>
> >> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609
> >> (8565145865372, 0)
> >>
> >>
> >> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag
> >>
> >> ```
> >>
> >>
> >> 6 hours of work:
> >>
> >> ```
> >>
> >> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993
> >> (22915730870665, 0)
> >>
> >> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889
> >> (22915730870693, 0)
> >>
> >> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841
> >> (22915730872231, 0)
> >>
> >> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696
> >> (22915730872096, 0)
> >>
> >>
> >> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag
> >>
> >> ```
> >>
> >>
> >> Clarification, according to my xen settings:
> >>
> >> ```
> >>
> >> ucode=scan dom0_mem=53923M,max:53923M dom0_max_vcpus=4-96
> dom0_vcpus_pin=0
> >> force-ept=1 ept=no-ad,no-pml hap_1gb=0 hap_2mb=0 altp2m=1
> >> hpet=legacy-replacement smt=1 spec-ctrl=no gnttab_max_frames=512
> >> cpufreq=xen:performance max_cstate=1 sched=credit sched-gran=cpu apicv=0
> >> sched_credit_tslice_ms=5 sched_ratelimit_us=500
> >>
> >> ```
> >>
> >>
> >> Processors I tested on:
> >>
> >>
> >> ```
> >>
> >> Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
> >>
> >>
> >> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
> >> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
> >> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 cx16
> >> sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm cpuid_fault
> >> fsgsbase erms xsaveopt arch_capabilities
> >>
> >> ```
> >>
> >> ```
> >>
> >> Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz x2 (NUMA)
> >>
> >>
> >> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
> >> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
> >> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 fma
> cx16
> >> sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm
> abm
> >> 3dnowprefetch cpuid_fault ibpb fsgsbase bmi1 avx2 bmi2 erms rtm avx512f
> >> avx512dq rdseed adx avx512ifma clflushopt clwb avx512cd sha_ni avx512bw
> >> avx512vl xsaveopt xsavec xgetbv1 avx512vbmi avx512_vbmi2 gfni vaes
> >> vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
> md_clear
> >> arch_capabilities
> >>
> >> ```
> >>
> >>
> >> Next I moved the code to r3 to speed up playback:
> >>
> >>
> >> ```
> >>
> >> #include <stdint.h>
> >>
> >> #include <stdio.h>
> >>
> >>
> >> static __inline__ unsigned long long rdtsc(void)
> >>
> >> {
> >>
> >> unsigned hi, lo;
> >>
> >> __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
> >>
> >> return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
> >>
> >> }
> >>
> >>
> >> #define MILLISECS(_ms) ((int64_t)((_ms) * 1000000ULL))
> >>
> >>
> >> struct cpu_time_stamp {
> >>
> >> uint64_t local_tsc;
> >>
> >> int64_t local_stime;
> >>
> >> int64_t master_stime;
> >>
> >> };
> >>
> >>
> >> struct time_scale {
> >>
> >> int shift;
> >>
> >> uint32_t mul_frac;
> >>
> >> };
> >>
> >>
> >>
> >> static inline uint32_t div_frac(uint32_t dividend, uint32_t divisor)
> >>
> >> {
> >>
> >> uint32_t quotient, remainder;
> >>
> >> asm (
> >>
> >> "divl %4"
> >>
> >> : "=a" (quotient), "=d" (remainder)
> >>
> >> : "0" (0), "1" (dividend), "r" (divisor) );
> >>
> >> return quotient;
> >>
> >> }
> >>
> >>
> >>
> >> void set_time_scale(struct time_scale *ts, uint64_t ticks_per_sec)
> >>
> >> {
> >>
> >> uint64_t tps64 = ticks_per_sec;
> >>
> >> uint32_t tps32;
> >>
> >> int shift = 0;
> >>
> >>
> >> while ( tps64 > (MILLISECS(1000)*2) )
> >>
> >> {
> >>
> >> tps64 >>= 1;
> >>
> >> shift--;
> >>
> >> }
> >>
> >>
> >> tps32 = (uint32_t)tps64;
> >>
> >> while ( tps32 <= (uint32_t)MILLISECS(1000) )
> >>
> >> {
> >>
> >> tps32 <<= 1;
> >>
> >> shift++;
> >>
> >> }
> >>
> >>
> >> ts->mul_frac = div_frac(MILLISECS(1000), tps32);
> >>
> >> ts->shift = shift;
> >>
> >> }
> >>
> >>
> >>
> >> uint64_t scale_delta(uint64_t delta, const struct time_scale *scale)
> >>
> >> {
> >>
> >> uint64_t product;
> >>
> >>
> >> if ( scale->shift < 0 )
> >>
> >> delta >>= -scale->shift;
> >>
> >> else
> >>
> >> delta <<= scale->shift;
> >>
> >>
> >> asm (
> >>
> >> "mulq %2 ; shrd $32,%1,%0"
> >>
> >> : "=a" (product), "=d" (delta)
> >>
> >> : "rm" (delta), "0" ((uint64_t)scale->mul_frac) );
> >>
> >>
> >> return product;
> >>
> >> }
> >>
> >>
> >> #define _TS_MUL_FRAC_IDENTITY 0x80000000UL
> >>
> >>
> >> static inline struct time_scale scale_reciprocal(struct time_scale
> scale)
> >>
> >> {
> >>
> >> struct time_scale reciprocal;
> >>
> >> uint32_t dividend;
> >>
> >>
> >> dividend = _TS_MUL_FRAC_IDENTITY;
> >>
> >> reciprocal.shift = 1 - scale.shift;
> >>
> >> while ( dividend >= scale.mul_frac )
> >>
> >> {
> >>
> >> dividend >>= 1;
> >>
> >> reciprocal.shift++;
> >>
> >> }
> >>
> >>
> >> asm (
> >>
> >> "divl %4"
> >>
> >> : "=a" (reciprocal.mul_frac), "=d" (dividend)
> >>
> >> : "0" (0), "1" (dividend), "r" (scale.mul_frac) );
> >>
> >>
> >> return reciprocal;
> >>
> >> }
> >>
> >>
> >>
> >> int64_t get_s_time_fixed(const struct cpu_time_stamp *t, const struct
> >> time_scale *ts, uint64_t at_tsc)
> >>
> >> {
> >>
> >> uint64_t tsc, delta;
> >>
> >>
> >> if ( at_tsc )
> >>
> >> tsc = at_tsc;
> >>
> >> else
> >>
> >> tsc = rdtsc();
> >>
> >> delta = tsc - t->local_tsc;
> >>
> >> return t->local_stime + scale_delta(delta, ts);
> >>
> >> }
> >>
> >>
> >> int main() {
> >>
> >>
> >> struct cpu_time_stamp ct;
> >>
> >> struct time_scale ts;
> >>
> >> struct time_scale back;
> >>
> >>
> >> uint64_t start = rdtsc();
> >>
> >>
> >> set_time_scale(&ts, 3000000000);
> >>
> >> back = scale_reciprocal(ts);
> >>
> >>
> >> ct.local_tsc = start;
> >>
> >> ct.local_stime = scale_delta(start, &ts);
> >>
> >>
> >> while (1) {
> >>
> >> uint64_t new_tsc = rdtsc();
> >>
> >> ct.local_stime = get_s_time_fixed(&ct, &ts, new_tsc);
> >>
> >> ct.local_tsc = new_tsc;
> >>
> >>
> >> uint64_t tmp_tsc = rdtsc();
> >>
> >> printf("%lu %lu\n", tmp_tsc, scale_delta(get_s_time_fixed(&ct, &ts,
> >> tmp_tsc), &back));
> >>
> >> }
> >>
> >>
> >> return 0;
> >>
> >> }
> >>
> >> ```
> >>
> >>
> >> It's enough to run the loop for 10-20 seconds to see the problem.
> >> scale_delta rounds the least significant bits during calculation, which
> >> causes the error to accumulate, at different rates on different cores,
> >> depending on the least significant bits at the time of calibration.
> >>
> >>
> >> Now let's talk about why dwm reacts this way. When a snapshot is
> reversed,
> >> last_guest_time in hvm_get_guest_time_fixed is set to 0, which doesn't
> >> prevent time from flowing backwards. This means that cache_tsc_offset in
> >> hvm_get_guest_tsc_fixed may be configured correctly on one physical
> core,
> >> but due to shedding on a physical core with a lagging tsc, the guest may
> >> occasionally see a tsc value lower than it saw before the snapshot was
> >> reversed. If this happens to the dwm code, it terminates with an error.
> >>
> >>
> >> A quick solution to this problem might be to save the last_seen_tsc
> >> parameter in a snapshot for each core, followed by validation.
> >>
> >>
> >> The correct solution is to remove the rounding of the least significant
> >> bits from the sequence.
> >>
> >> On Wed, Jan 7, 2026 at 11:02 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >>> On 06.01.2026 21:10, Антон Марков wrote:
> >>>> Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt
> >>>> (xen/arch/x86/hvm/hvm.c ), it was easy to catch because
> >>>> process_pending_softirqs is frequently called there, which in turn
> >>>> processes softirqs from the timer (where the timestamp is updated).
> >>>> After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped
> >>>> reproducing under no load. However, when the number of vCPUs is 4
> times
> >>>> greater than the number of CPUs (under heavy load), the problem rarely
> >>>> reoccurs (mostly during snapshot restores during
> >>>> process_pending_softirqs calls), and this is no longer a simple case.
> >>> If
> >>>> get_s_time_fixed can indeed be interrupted during execution after
> >>>> rdtsc_ordered, then the current fix is insufficient. It's necessary to
> >>>> atomically copy "t->stamp" to the stack using local_irq_disable and
> >>>> local_irq_enable (as in local_time_calibration), and then work with
> the
> >>>> copy, confident in its lifetime and immutability. But until
> >>>> get_s_time_fixed is proven to be interruptible, this is premature, so
> >>>> your fix is sufficient. I think I need more information and testing to
> >>>> say more.
> >>>
> >>> While the cpu_calibration per-CPU variable is updated from IRQ context,
> >>> the cpu_time one isn't. Hence t->stamp's contents cannot change behind
> >>> the back of get_s_time_fixed(). I wonder whether ...
> >>>
> >>>> Regarding the other scale_delta calls, if they include values
> >>>> calculated from externally saved tsc values that could have become
> >>>> stale during the process_pending_softirqs call, this definitely needs
> >>> to
> >>>> be fixed.
> >>>
> >>> ... another similar issue (possibly one not included in the set of
> >>> remarks I have in the patch, as none of those look related to what you
> >>> describe) might be causing the remaining, more rare problems you say
> you
> >>> see. That set of remarks is actually a result of me going over all
> other
> >>> scale_delta() calls, but of course I may have got the analysis wrong.
> >>>
> >>> As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy
> >>> load) - while I don't think we have a support statement for such
> upstream
> >>> (when probably we should), iirc for our (SUSE's) products we would
> >>> consider that unsupported. Just fyi.
> >>>
> >>> Also, btw, please don't top-post.
> >>>
> >>> Jan
> >>>
> >>
> >
>
>

[-- Attachment #2: Type: text/html, Size: 23118 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-12 12:49             ` Anton Markov
@ 2026-01-12 14:13               ` Jan Beulich
  2026-01-12 14:51                 ` Anton Markov
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2026-01-12 14:13 UTC (permalink / raw)
  To: Anton Markov; +Cc: andrew.cooper3, roger.pau, xen-devel

On 12.01.2026 13:49, Anton Markov wrote:
>> Btw, your prior response was too hard to properly read, due to excess blank
>> lines with at the same time squashed leading blanks. Together with your
>> apparent inability to avoid top-posting, I think you really want to adjust
>> your mail program's configuration.
> 
> I suggest we skip the discussion of formatting and the number of spaces in
> messages and instead focus on the topic of the thread. I have a very
> difficult time troubleshooting difficult-to-reproduce bugs, and the fact
> that their descriptions are difficult to read due to the number of spaces
> is probably the least of the difficulties.

Perhaps, yet it still makes dealing with things more difficult.

> That invocation of get_s_time_fixed() reduces to scale_delta() (without
>> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
>> it's not quite clear to me what change you are suggesting (that would
>> actually make a functional difference).
> 
> Replacing get_s_time_fixed with scale_delta will remove the calculation
> dependency on the previous local_stime value, which accumulates lag between
> cores. This is because: rdtsc_ordered is not called synchronously on the
> cores, but by the difference offset by the ipi speed. Therefore, we get:
> 
> core0: current_rdtsc;
> core1: current_rdtsc + ipi speed;
> coreN: current_rdtsc + ipi speed * N;

That's if IPIs are sent sequentially. In the most common case, they aren't,
though - we use the all-but-self shorthand.

Actually, even if IPIs are sent sequentially, I can't see where you spot
this effect: Both callers of time_calibration_rendezvous_tail() signal all
secondary CPUs to continue at the same time. Hence they'll all execute
time_calibration_rendezvous_tail() in parallel.

> Since ipi values are sent alternately in a loop to core0,

Are they? I fear I don't know which part of the code you're talking about.

> in the version
> with get_s_time_fixed, we get the following local_stime calculation format.
> 
> coreN: local_stime = local_stime + scale_delta((current_rdtsc + (ipi_speed
> * N)) – local_rdtsc);

One of the reasons we (iirc) don't do that is that since the scaling factor
is also slightly imprecise, we'd prefer to avoid scaling very big values.
IOW by changing as you suggest we'd trade one accumulating error for
another.

Jan

> This means the time on each core will differ by ipi_speed * N. And since
> we're using the values of the previous local_stime, the difference will
> accumulate because the previous local_stime was also offset. In the version
> with scale_delta, we get:
> 
> coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N));
> 
> This means there will still be a difference, but it won't accumulate, and
> the offsets will remain within normal limits.
> 
> If it's still unclear: If your local_stime in get_s_time_fixed is offset
> relative to other cores, then the fact that rdtsc_ordered and local_tsc are
> not offset doesn't change anything, since you're using the delta relative
> to local_stime.
> 
> core0_local_stime + (rdtsc_ordered() - local_tsc) != core1_local_stime +
> (rdtsc_ordered() - local_tsc); // Even if rdtsc_ordered() and local_tsc are
> equal across cores.
> 
> On 96-core configurations, up to a millisecond of latency can accumulate in
> local_stime over a week of operation, and this is a significant
> difference. This
> is due to the fact that I use cpufreq=xen:performance max_cstate=1 ,
> meaning that in my configuration, local_stime is never overwritten by
> master_stime.
> 
> Thanks.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-12 14:13               ` Jan Beulich
@ 2026-01-12 14:51                 ` Anton Markov
  2026-01-12 16:08                   ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Anton Markov @ 2026-01-12 14:51 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, roger.pau, xen-devel

[-- Attachment #1: Type: text/plain, Size: 6163 bytes --]

>
> Perhaps, yet it still makes dealing with things more difficult.

Sorry. I just spent too much time on this bug to stay in my mind.

That's if IPIs are sent sequentially. In the most common case, they aren't,
> though - we use the all-but-self shorthand.

Actually, even if IPIs are sent sequentially, I can't see where you spot
> this effect: Both callers of time_calibration_rendezvous_tail() signal all
> secondary CPUs to continue at the same time. Hence they'll all execute
> time_calibration_rendezvous_tail() in parallel.

In parallel, but with a slight delay.

Are they? I fear I don't know which part of the code you're talking about.

In the function "time_calibration" (xen/arch/x86/time.c) Sorry, I don't
take into account that you don't stay in context, being distracted by other
threads.

One of the reasons we (iirc) don't do that is that since the scaling factor
> is also slightly imprecise, we'd prefer to avoid scaling very big values.
> IOW by changing as you suggest we'd trade one accumulating error for
> another.

As I wrote above, there will be an error when using scale_delta, but it
won't accumulate between calls to time_calibration_rendezvous_tail. In the
current version, the old error (ipi lag + rounding error) persists due to
the use of the old local_stime in the get_s_time_fixed function, and it's
added to the new error and accumulates with each call.
If

c->local_stime = get_s_time_fixed(old_tsc ?: new_tsc);

replaced with:

c->local_stime = scale_delta(old_tsc ?: new_tsc);

Then we'll only be dealing with the error due to the current ipi lag and
rough rounding, within a single call.

If I understand you correctly, you fixed the rough rounding of scale_delta
by reducing the values using get_s_time_fixed . But the problem is, that
didn't help. The error now accumulates gradually because we're relying on
old calculations. Furthermore, while the old rounding error was limited,
now the error accumulates infinitely, albeit very slowly. If this is so,
then the solution to the problem becomes less obvious.

Roughly speaking, my servers start to go crazy after a week of continuous
operation, as the time lag between cores reaches 1 millisecond or more.

On Mon, Jan 12, 2026 at 5:13 PM Jan Beulich <jbeulich@suse.com> wrote:

> On 12.01.2026 13:49, Anton Markov wrote:
> >> Btw, your prior response was too hard to properly read, due to excess
> blank
> >> lines with at the same time squashed leading blanks. Together with your
> >> apparent inability to avoid top-posting, I think you really want to
> adjust
> >> your mail program's configuration.
> >
> > I suggest we skip the discussion of formatting and the number of spaces
> in
> > messages and instead focus on the topic of the thread. I have a very
> > difficult time troubleshooting difficult-to-reproduce bugs, and the fact
> > that their descriptions are difficult to read due to the number of spaces
> > is probably the least of the difficulties.
>
> Perhaps, yet it still makes dealing with things more difficult.
>
> > That invocation of get_s_time_fixed() reduces to scale_delta() (without
> >> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
> >> it's not quite clear to me what change you are suggesting (that would
> >> actually make a functional difference).
> >
> > Replacing get_s_time_fixed with scale_delta will remove the calculation
> > dependency on the previous local_stime value, which accumulates lag
> between
> > cores. This is because: rdtsc_ordered is not called synchronously on the
> > cores, but by the difference offset by the ipi speed. Therefore, we get:
> >
> > core0: current_rdtsc;
> > core1: current_rdtsc + ipi speed;
> > coreN: current_rdtsc + ipi speed * N;
>
> That's if IPIs are sent sequentially. In the most common case, they aren't,
> though - we use the all-but-self shorthand.
>
> Actually, even if IPIs are sent sequentially, I can't see where you spot
> this effect: Both callers of time_calibration_rendezvous_tail() signal all
> secondary CPUs to continue at the same time. Hence they'll all execute
> time_calibration_rendezvous_tail() in parallel.
>
> > Since ipi values are sent alternately in a loop to core0,
>
> Are they? I fear I don't know which part of the code you're talking about.
>
> > in the version
> > with get_s_time_fixed, we get the following local_stime calculation
> format.
> >
> > coreN: local_stime = local_stime + scale_delta((current_rdtsc +
> (ipi_speed
> > * N)) – local_rdtsc);
>
> One of the reasons we (iirc) don't do that is that since the scaling factor
> is also slightly imprecise, we'd prefer to avoid scaling very big values.
> IOW by changing as you suggest we'd trade one accumulating error for
> another.
>
> Jan
>
> > This means the time on each core will differ by ipi_speed * N. And since
> > we're using the values of the previous local_stime, the difference will
> > accumulate because the previous local_stime was also offset. In the
> version
> > with scale_delta, we get:
> >
> > coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N));
> >
> > This means there will still be a difference, but it won't accumulate, and
> > the offsets will remain within normal limits.
> >
> > If it's still unclear: If your local_stime in get_s_time_fixed is offset
> > relative to other cores, then the fact that rdtsc_ordered and local_tsc
> are
> > not offset doesn't change anything, since you're using the delta relative
> > to local_stime.
> >
> > core0_local_stime + (rdtsc_ordered() - local_tsc) != core1_local_stime +
> > (rdtsc_ordered() - local_tsc); // Even if rdtsc_ordered() and local_tsc
> are
> > equal across cores.
> >
> > On 96-core configurations, up to a millisecond of latency can accumulate
> in
> > local_stime over a week of operation, and this is a significant
> > difference. This
> > is due to the fact that I use cpufreq=xen:performance max_cstate=1 ,
> > meaning that in my configuration, local_stime is never overwritten by
> > master_stime.
> >
> > Thanks.
>

[-- Attachment #2: Type: text/html, Size: 9253 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-12 14:51                 ` Anton Markov
@ 2026-01-12 16:08                   ` Jan Beulich
  2026-01-12 16:41                     ` Anton Markov
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2026-01-12 16:08 UTC (permalink / raw)
  To: Anton Markov; +Cc: andrew.cooper3, roger.pau, xen-devel

On 12.01.2026 15:51, Anton Markov wrote:
> That's if IPIs are sent sequentially. In the most common case, they aren't,
>> though - we use the all-but-self shorthand.
> 
> Actually, even if IPIs are sent sequentially, I can't see where you spot
>> this effect: Both callers of time_calibration_rendezvous_tail() signal all
>> secondary CPUs to continue at the same time. Hence they'll all execute
>> time_calibration_rendezvous_tail() in parallel.
> 
> In parallel, but with a slight delay.
> 
> Are they? I fear I don't know which part of the code you're talking about.
> 
> In the function "time_calibration" (xen/arch/x86/time.c) Sorry, I don't
> take into account that you don't stay in context, being distracted by other
> threads.

That calls on_selected_cpus(), but send_IPI_mask() may then still resort to
all-but-self. In that case all IPIs are sent in one go.

Plus as said, how IPIs are sent doesn't matter for the invocation of
time_calibration_rendezvous_tail(). They'll all run at the same time, not
one after the other.

Since further down you build upon that "IPI lag", I fear we first need to
settle on this aspect of your theory.

Jan

> One of the reasons we (iirc) don't do that is that since the scaling factor
>> is also slightly imprecise, we'd prefer to avoid scaling very big values.
>> IOW by changing as you suggest we'd trade one accumulating error for
>> another.
> 
> As I wrote above, there will be an error when using scale_delta, but it
> won't accumulate between calls to time_calibration_rendezvous_tail. In the
> current version, the old error (ipi lag + rounding error) persists due to
> the use of the old local_stime in the get_s_time_fixed function, and it's
> added to the new error and accumulates with each call.
> If
> 
> c->local_stime = get_s_time_fixed(old_tsc ?: new_tsc);
> 
> replaced with:
> 
> c->local_stime = scale_delta(old_tsc ?: new_tsc);
> 
> Then we'll only be dealing with the error due to the current ipi lag and
> rough rounding, within a single call.
> 
> If I understand you correctly, you fixed the rough rounding of scale_delta
> by reducing the values using get_s_time_fixed . But the problem is, that
> didn't help. The error now accumulates gradually because we're relying on
> old calculations. Furthermore, while the old rounding error was limited,
> now the error accumulates infinitely, albeit very slowly. If this is so,
> then the solution to the problem becomes less obvious.
> 
> Roughly speaking, my servers start to go crazy after a week of continuous
> operation, as the time lag between cores reaches 1 millisecond or more.
> 
> On Mon, Jan 12, 2026 at 5:13 PM Jan Beulich <jbeulich@suse.com> wrote:
> 
>> On 12.01.2026 13:49, Anton Markov wrote:
>>>> Btw, your prior response was too hard to properly read, due to excess
>> blank
>>>> lines with at the same time squashed leading blanks. Together with your
>>>> apparent inability to avoid top-posting, I think you really want to
>> adjust
>>>> your mail program's configuration.
>>>
>>> I suggest we skip the discussion of formatting and the number of spaces
>> in
>>> messages and instead focus on the topic of the thread. I have a very
>>> difficult time troubleshooting difficult-to-reproduce bugs, and the fact
>>> that their descriptions are difficult to read due to the number of spaces
>>> is probably the least of the difficulties.
>>
>> Perhaps, yet it still makes dealing with things more difficult.
>>
>>> That invocation of get_s_time_fixed() reduces to scale_delta() (without
>>>> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
>>>> it's not quite clear to me what change you are suggesting (that would
>>>> actually make a functional difference).
>>>
>>> Replacing get_s_time_fixed with scale_delta will remove the calculation
>>> dependency on the previous local_stime value, which accumulates lag
>> between
>>> cores. This is because: rdtsc_ordered is not called synchronously on the
>>> cores, but by the difference offset by the ipi speed. Therefore, we get:
>>>
>>> core0: current_rdtsc;
>>> core1: current_rdtsc + ipi speed;
>>> coreN: current_rdtsc + ipi speed * N;
>>
>> That's if IPIs are sent sequentially. In the most common case, they aren't,
>> though - we use the all-but-self shorthand.
>>
>> Actually, even if IPIs are sent sequentially, I can't see where you spot
>> this effect: Both callers of time_calibration_rendezvous_tail() signal all
>> secondary CPUs to continue at the same time. Hence they'll all execute
>> time_calibration_rendezvous_tail() in parallel.
>>
>>> Since ipi values are sent alternately in a loop to core0,
>>
>> Are they? I fear I don't know which part of the code you're talking about.
>>
>>> in the version
>>> with get_s_time_fixed, we get the following local_stime calculation
>> format.
>>>
>>> coreN: local_stime = local_stime + scale_delta((current_rdtsc +
>> (ipi_speed
>>> * N)) – local_rdtsc);
>>
>> One of the reasons we (iirc) don't do that is that since the scaling factor
>> is also slightly imprecise, we'd prefer to avoid scaling very big values.
>> IOW by changing as you suggest we'd trade one accumulating error for
>> another.
>>
>> Jan
>>
>>> This means the time on each core will differ by ipi_speed * N. And since
>>> we're using the values of the previous local_stime, the difference will
>>> accumulate because the previous local_stime was also offset. In the
>> version
>>> with scale_delta, we get:
>>>
>>> coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N));
>>>
>>> This means there will still be a difference, but it won't accumulate, and
>>> the offsets will remain within normal limits.
>>>
>>> If it's still unclear: If your local_stime in get_s_time_fixed is offset
>>> relative to other cores, then the fact that rdtsc_ordered and local_tsc
>> are
>>> not offset doesn't change anything, since you're using the delta relative
>>> to local_stime.
>>>
>>> core0_local_stime + (rdtsc_ordered() - local_tsc) != core1_local_stime +
>>> (rdtsc_ordered() - local_tsc); // Even if rdtsc_ordered() and local_tsc
>> are
>>> equal across cores.
>>>
>>> On 96-core configurations, up to a millisecond of latency can accumulate
>> in
>>> local_stime over a week of operation, and this is a significant
>>> difference. This
>>> is due to the fact that I use cpufreq=xen:performance max_cstate=1 ,
>>> meaning that in my configuration, local_stime is never overwritten by
>>> master_stime.
>>>
>>> Thanks.
>>
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-12 16:08                   ` Jan Beulich
@ 2026-01-12 16:41                     ` Anton Markov
  2026-01-13 11:21                       ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Anton Markov @ 2026-01-12 16:41 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, roger.pau, xen-devel

[-- Attachment #1: Type: text/plain, Size: 10337 bytes --]

>
> That calls on_selected_cpus(), but send_IPI_mask() may then still resort to
> all-but-self. In that case all IPIs are sent in one go.

Plus as said, how IPIs are sent doesn't matter for the invocation of
> time_calibration_rendezvous_tail(). They'll all run at the same time, not
> one after the other.

At the hardware level, no one can guarantee that the processors will
simultaneously respond to the signal and execute your code nanosecond after
you send the ipi. Especially when we're talking about NUMA configurations. I'm
afraid the possible and impossible in the laws of physics is also beyond
the scope of this thread.

Since further down you build upon that "IPI lag", I fear we first need to
> settle on this aspect of your theory.

 I've already provided the delay logs. It's not hard for me to repeat.

The patch:

> @@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void)
>
> if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>
> {
>
> /* Atomically read cpu_calibration struct and write cpu_time struct. */
>
> + printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n",
> smp_processor_id(), t->stamp.local_stime, c->local_stime,
>
> + t->last_seen_ns, t->last_seen_tsc);
>
> local_irq_disable();
>
> t->stamp = *c;
>
> local_irq_enable();
>

 2 hours of work:

> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597
> (8565145862216, 0)
> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609
> (8565145863957, 0)
> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491
> (8565145864800, 0)
> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609
> (8565145865372, 0)
>
> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag
>

3 hours of work:

> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993
> (22915730870665, 0)
> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889
> (22915730870693, 0)
> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841
> (22915730872231, 0)
> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696
> (22915730872096, 0)
>
> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag
>

2-3 day of work:

> (XEN) update stime on time calibrate 0, 254477161980127 -> 254478162020920
> (254478162021549, 0)
> (XEN) update stime on time calibrate 2, 254477161977638 -> 254478162018429
> (254478162022187, 0)
> (XEN) update stime on time calibrate 1, 254477161978192 -> 254478162018972
> (254478162022776, 0)
> (XEN) update stime on time calibrate 3, 254477161976832 -> 254478162017636
> (254478162021394, 0)
>
> 254478162020920 - 254478162017636 = 3284 * 3 (3.00 GHz) = 9852 lag


 As you can see, the core lag is strictly determined by their sequence
number. I won't argue about what percentage of this delay is due to
rounding error and what percentage is due to ipi lag. To reproduce this,
simply add the patch (excluding t->last_seen_ns and t->last_seen_tsc ,
which were necessary for my own understanding). Then enable the hypervisor
with the settings cpufreq=xen:performance max_cstate=1 . Clocksource is
left at the default (i.e., hpet).

On Mon, Jan 12, 2026 at 7:08 PM Jan Beulich <jbeulich@suse.com> wrote:

> On 12.01.2026 15:51, Anton Markov wrote:
> > That's if IPIs are sent sequentially. In the most common case, they
> aren't,
> >> though - we use the all-but-self shorthand.
> >
> > Actually, even if IPIs are sent sequentially, I can't see where you spot
> >> this effect: Both callers of time_calibration_rendezvous_tail() signal
> all
> >> secondary CPUs to continue at the same time. Hence they'll all execute
> >> time_calibration_rendezvous_tail() in parallel.
> >
> > In parallel, but with a slight delay.
> >
> > Are they? I fear I don't know which part of the code you're talking
> about.
> >
> > In the function "time_calibration" (xen/arch/x86/time.c) Sorry, I don't
> > take into account that you don't stay in context, being distracted by
> other
> > threads.
>
> That calls on_selected_cpus(), but send_IPI_mask() may then still resort to
> all-but-self. In that case all IPIs are sent in one go.
>
> Plus as said, how IPIs are sent doesn't matter for the invocation of
> time_calibration_rendezvous_tail(). They'll all run at the same time, not
> one after the other.
>
> Since further down you build upon that "IPI lag", I fear we first need to
> settle on this aspect of your theory.
>
> Jan
>
> > One of the reasons we (iirc) don't do that is that since the scaling
> factor
> >> is also slightly imprecise, we'd prefer to avoid scaling very big
> values.
> >> IOW by changing as you suggest we'd trade one accumulating error for
> >> another.
> >
> > As I wrote above, there will be an error when using scale_delta, but it
> > won't accumulate between calls to time_calibration_rendezvous_tail. In
> the
> > current version, the old error (ipi lag + rounding error) persists due to
> > the use of the old local_stime in the get_s_time_fixed function, and it's
> > added to the new error and accumulates with each call.
> > If
> >
> > c->local_stime = get_s_time_fixed(old_tsc ?: new_tsc);
> >
> > replaced with:
> >
> > c->local_stime = scale_delta(old_tsc ?: new_tsc);
> >
> > Then we'll only be dealing with the error due to the current ipi lag and
> > rough rounding, within a single call.
> >
> > If I understand you correctly, you fixed the rough rounding of
> scale_delta
> > by reducing the values using get_s_time_fixed . But the problem is, that
> > didn't help. The error now accumulates gradually because we're relying on
> > old calculations. Furthermore, while the old rounding error was limited,
> > now the error accumulates infinitely, albeit very slowly. If this is so,
> > then the solution to the problem becomes less obvious.
> >
> > Roughly speaking, my servers start to go crazy after a week of continuous
> > operation, as the time lag between cores reaches 1 millisecond or more.
> >
> > On Mon, Jan 12, 2026 at 5:13 PM Jan Beulich <jbeulich@suse.com> wrote:
> >
> >> On 12.01.2026 13:49, Anton Markov wrote:
> >>>> Btw, your prior response was too hard to properly read, due to excess
> >> blank
> >>>> lines with at the same time squashed leading blanks. Together with
> your
> >>>> apparent inability to avoid top-posting, I think you really want to
> >> adjust
> >>>> your mail program's configuration.
> >>>
> >>> I suggest we skip the discussion of formatting and the number of spaces
> >> in
> >>> messages and instead focus on the topic of the thread. I have a very
> >>> difficult time troubleshooting difficult-to-reproduce bugs, and the
> fact
> >>> that their descriptions are difficult to read due to the number of
> spaces
> >>> is probably the least of the difficulties.
> >>
> >> Perhaps, yet it still makes dealing with things more difficult.
> >>
> >>> That invocation of get_s_time_fixed() reduces to scale_delta() (without
> >>>> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases.
> IOW
> >>>> it's not quite clear to me what change you are suggesting (that would
> >>>> actually make a functional difference).
> >>>
> >>> Replacing get_s_time_fixed with scale_delta will remove the calculation
> >>> dependency on the previous local_stime value, which accumulates lag
> >> between
> >>> cores. This is because: rdtsc_ordered is not called synchronously on
> the
> >>> cores, but by the difference offset by the ipi speed. Therefore, we
> get:
> >>>
> >>> core0: current_rdtsc;
> >>> core1: current_rdtsc + ipi speed;
> >>> coreN: current_rdtsc + ipi speed * N;
> >>
> >> That's if IPIs are sent sequentially. In the most common case, they
> aren't,
> >> though - we use the all-but-self shorthand.
> >>
> >> Actually, even if IPIs are sent sequentially, I can't see where you spot
> >> this effect: Both callers of time_calibration_rendezvous_tail() signal
> all
> >> secondary CPUs to continue at the same time. Hence they'll all execute
> >> time_calibration_rendezvous_tail() in parallel.
> >>
> >>> Since ipi values are sent alternately in a loop to core0,
> >>
> >> Are they? I fear I don't know which part of the code you're talking
> about.
> >>
> >>> in the version
> >>> with get_s_time_fixed, we get the following local_stime calculation
> >> format.
> >>>
> >>> coreN: local_stime = local_stime + scale_delta((current_rdtsc +
> >> (ipi_speed
> >>> * N)) – local_rdtsc);
> >>
> >> One of the reasons we (iirc) don't do that is that since the scaling
> factor
> >> is also slightly imprecise, we'd prefer to avoid scaling very big
> values.
> >> IOW by changing as you suggest we'd trade one accumulating error for
> >> another.
> >>
> >> Jan
> >>
> >>> This means the time on each core will differ by ipi_speed * N. And
> since
> >>> we're using the values of the previous local_stime, the difference will
> >>> accumulate because the previous local_stime was also offset. In the
> >> version
> >>> with scale_delta, we get:
> >>>
> >>> coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N));
> >>>
> >>> This means there will still be a difference, but it won't accumulate,
> and
> >>> the offsets will remain within normal limits.
> >>>
> >>> If it's still unclear: If your local_stime in get_s_time_fixed is
> offset
> >>> relative to other cores, then the fact that rdtsc_ordered and local_tsc
> >> are
> >>> not offset doesn't change anything, since you're using the delta
> relative
> >>> to local_stime.
> >>>
> >>> core0_local_stime + (rdtsc_ordered() - local_tsc) != core1_local_stime
> +
> >>> (rdtsc_ordered() - local_tsc); // Even if rdtsc_ordered() and local_tsc
> >> are
> >>> equal across cores.
> >>>
> >>> On 96-core configurations, up to a millisecond of latency can
> accumulate
> >> in
> >>> local_stime over a week of operation, and this is a significant
> >>> difference. This
> >>> is due to the fact that I use cpufreq=xen:performance max_cstate=1 ,
> >>> meaning that in my configuration, local_stime is never overwritten by
> >>> master_stime.
> >>>
> >>> Thanks.
> >>
> >
>
>

[-- Attachment #2: Type: text/html, Size: 15860 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-12 16:41                     ` Anton Markov
@ 2026-01-13 11:21                       ` Jan Beulich
  2026-01-13 12:35                         ` Anton Markov
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2026-01-13 11:21 UTC (permalink / raw)
  To: Anton Markov; +Cc: andrew.cooper3, roger.pau, xen-devel

On 12.01.2026 17:41, Anton Markov wrote:
>>
>> That calls on_selected_cpus(), but send_IPI_mask() may then still resort to
>> all-but-self. In that case all IPIs are sent in one go.
> 
> Plus as said, how IPIs are sent doesn't matter for the invocation of
>> time_calibration_rendezvous_tail(). They'll all run at the same time, not
>> one after the other.
> 
> At the hardware level, no one can guarantee that the processors will
> simultaneously respond to the signal and execute your code nanosecond after
> you send the ipi. Especially when we're talking about NUMA configurations. I'm
> afraid the possible and impossible in the laws of physics is also beyond
> the scope of this thread.

You did read my recurring explanation beyond the IPI sending, didn't you?
Of course IPI arrival may vary across cores / threads. Yet the term
"rendezvous" is used because CPUs having received the IPI are then held
in a waiting loop, until _all_ CPUs have made it there. Then CPU0
indicates to all of them simultaneously to move to the next step. There's
going to again be some variance (especially on NUMA, where the memory
write needs to propagate to all nodes), but at least within a single node
that should be pretty low. The main source of variance I would expect
there would by hyperthreads competing with one another in a single core.

> Since further down you build upon that "IPI lag", I fear we first need to
>> settle on this aspect of your theory.
> 
>  I've already provided the delay logs. It's not hard for me to repeat.

Sure, I don't doubt you make those observations. But we're still trying to
converge on a theory on what these may be caused by.

>  2 hours of work:
> 
>> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597
>> (8565145862216, 0)
>> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609
>> (8565145863957, 0)
>> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491
>> (8565145864800, 0)
>> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609
>> (8565145865372, 0)
>>
>> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag

The log entries aren't in CPU order, and CPUs 1 and 2 actually have
identical values on the rhs. That doesn't quite fit what you have said so
far. CPU3's value is also lower than CPU0's.

> 3 hours of work:
> 
>> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993
>> (22915730870665, 0)
>> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889
>> (22915730870693, 0)
>> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841
>> (22915730872231, 0)
>> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696
>> (22915730872096, 0)
>>
>> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag

While CPU numbers happen to be in sequence here, the rhs values aren't equally
ordered.

Also really here it is

22915730869696 - 22915730869993 = -297 * 3 (3.00 GHz) = 891 ahead

> 2-3 day of work:
> 
>> (XEN) update stime on time calibrate 0, 254477161980127 -> 254478162020920
>> (254478162021549, 0)
>> (XEN) update stime on time calibrate 2, 254477161977638 -> 254478162018429
>> (254478162022187, 0)
>> (XEN) update stime on time calibrate 1, 254477161978192 -> 254478162018972
>> (254478162022776, 0)
>> (XEN) update stime on time calibrate 3, 254477161976832 -> 254478162017636
>> (254478162021394, 0)
>>
>> 254478162020920 - 254478162017636 = 3284 * 3 (3.00 GHz) = 9852 lag

Similarly here. Yes, the gap increases, yet that's not a lag of CPU3 past
CPU0, but exactly the other way around.

>  As you can see, the core lag is strictly determined by their sequence
> number.

As per above - no, I don't think I can see that. Or maybe I'm misreading the
numbers as well as what you have been saying.

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-13 11:21                       ` Jan Beulich
@ 2026-01-13 12:35                         ` Anton Markov
  0 siblings, 0 replies; 32+ messages in thread
From: Anton Markov @ 2026-01-13 12:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, roger.pau, xen-devel

[-- Attachment #1: Type: text/plain, Size: 7598 bytes --]

You did read my recurring explanation beyond the IPI sending, didn't you?
Of course IPI arrival may vary across cores / threads. Yet the term
"rendezvous" is used because CPUs having received the IPI are then held in
a waiting loop, until _all_ CPUs have made it there. Then CPU0 indicates to
all of them simultaneously to move to the next step. There's going to again
be some variance (especially on NUMA, where the memory write needs to
propagate to all nodes), but at least within a single node that should be
pretty low. The main source of variance I would expect there would by
hyperthreads competing with one another in a single core.

Yes, I saw it. I'm not trying to dispute your understanding of the
situation. The difference may be small, but it adds up.

Sure, I don't doubt you make those observations. But we're still trying to
> converge on a theory on what these may be caused by.
>
I can't tell you exactly what the main cause of the delay is. I can only
list possible factors:
1. Rounding error, which varies for each core;
2. Delay in IPI delivery speed (even at the hardware level, signal
delivery shouldn't
happen simultaneously, and some cores may have interrupts disabled);
3. CPU frequency boost, which allows some cores to execute code faster. On
modern CPUs, this doesn't affect the TSC frequency, but the problem is that
the counter will be read at different times.
4. The initial difference in TSC counter values, which for cores within a
single CPU should be no more than 100 ns. In the case of NUMA, no more than
1000 ns;
I can't speak about percentages; I wasn't involved in CPU development, but
there are many reasons not to allow cores to compete in sequence increment
speed.

The log entries aren't in CPU order, and CPUs 1 and 2 actually have
> identical values on the rhs. That doesn't quite fit what you have said so
> far. CPU3's value is also lower than CPU0's.
> While CPU numbers happen to be in sequence here, the rhs values aren't
> equally
> ordered.
> Also really here it is
> 22915730869696 - 22915730869993 = -297 * 3 (3.00 GHz) = 891 ahead
> Similarly here. Yes, the gap increases, yet that's not a lag of CPU3 past
> CPU0, but exactly the other way around.
> As per above - no, I don't think I can see that. Or maybe I'm misreading
> the
> numbers as well as what you have been saying.
>
During the first few hours, the situation can be blurred due to possible
race conditions. After two days, it becomes more or less clear:

254478162020920 (core 0) > 254478162018972 (core 1) > 254478162018429 (core
2) > 254478162017636 (core 3)

The lower the core number, the more it pulls ahead. It's possible that this
is indeed related to which core is most heavily loaded (which one activates
CPU boost more often). In my configuration, the first three cores are
dedicated to DOM0, and the fourth is reserved for virtual machines. The
first core ends up being the most heavily loaded due to interrupt handling,
etc.
I can also add that after replacing get_s_time_fixed with scale_delta, the
difference stops accumulating. At this point, it's clear to me that the
problem is the use of previous last_stime values. The nature of this CPU
behavior is unlikely to be understood at the software level. Of course, all
the processors I tested on have the constant_tsc, nonstop_tsc, and
tsc_known_freq flags.


On Tue, Jan 13, 2026 at 2:21 PM Jan Beulich <jbeulich@suse.com> wrote:

> On 12.01.2026 17:41, Anton Markov wrote:
> >>
> >> That calls on_selected_cpus(), but send_IPI_mask() may then still
> resort to
> >> all-but-self. In that case all IPIs are sent in one go.
> >
> > Plus as said, how IPIs are sent doesn't matter for the invocation of
> >> time_calibration_rendezvous_tail(). They'll all run at the same time,
> not
> >> one after the other.
> >
> > At the hardware level, no one can guarantee that the processors will
> > simultaneously respond to the signal and execute your code nanosecond
> after
> > you send the ipi. Especially when we're talking about NUMA
> configurations. I'm
> > afraid the possible and impossible in the laws of physics is also beyond
> > the scope of this thread.
>
> You did read my recurring explanation beyond the IPI sending, didn't you?
> Of course IPI arrival may vary across cores / threads. Yet the term
> "rendezvous" is used because CPUs having received the IPI are then held
> in a waiting loop, until _all_ CPUs have made it there. Then CPU0
> indicates to all of them simultaneously to move to the next step. There's
> going to again be some variance (especially on NUMA, where the memory
> write needs to propagate to all nodes), but at least within a single node
> that should be pretty low. The main source of variance I would expect
> there would by hyperthreads competing with one another in a single core.
>
> > Since further down you build upon that "IPI lag", I fear we first need to
> >> settle on this aspect of your theory.
> >
> >  I've already provided the delay logs. It's not hard for me to repeat.
>
> Sure, I don't doubt you make those observations. But we're still trying to
> converge on a theory on what these may be caused by.
>
> >  2 hours of work:
> >
> >> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597
> >> (8565145862216, 0)
> >> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609
> >> (8565145863957, 0)
> >> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491
> >> (8565145864800, 0)
> >> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609
> >> (8565145865372, 0)
> >>
> >> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag
>
> The log entries aren't in CPU order, and CPUs 1 and 2 actually have
> identical values on the rhs. That doesn't quite fit what you have said so
> far. CPU3's value is also lower than CPU0's.
>
> > 3 hours of work:
> >
> >> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993
> >> (22915730870665, 0)
> >> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889
> >> (22915730870693, 0)
> >> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841
> >> (22915730872231, 0)
> >> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696
> >> (22915730872096, 0)
> >>
> >> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag
>
> While CPU numbers happen to be in sequence here, the rhs values aren't
> equally
> ordered.
>
> Also really here it is
>
> 22915730869696 - 22915730869993 = -297 * 3 (3.00 GHz) = 891 ahead
>
> > 2-3 day of work:
> >
> >> (XEN) update stime on time calibrate 0, 254477161980127 ->
> 254478162020920
> >> (254478162021549, 0)
> >> (XEN) update stime on time calibrate 2, 254477161977638 ->
> 254478162018429
> >> (254478162022187, 0)
> >> (XEN) update stime on time calibrate 1, 254477161978192 ->
> 254478162018972
> >> (254478162022776, 0)
> >> (XEN) update stime on time calibrate 3, 254477161976832 ->
> 254478162017636
> >> (254478162021394, 0)
> >>
> >> 254478162020920 - 254478162017636 = 3284 * 3 (3.00 GHz) = 9852 lag
>
> Similarly here. Yes, the gap increases, yet that's not a lag of CPU3 past
> CPU0, but exactly the other way around.
>
> >  As you can see, the core lag is strictly determined by their sequence
> > number.
>
> As per above - no, I don't think I can see that. Or maybe I'm misreading
> the
> numbers as well as what you have been saying.
>
> Jan
>

[-- Attachment #2: Type: text/html, Size: 12481 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/5] x86/time: scale_delta() can be static
  2026-01-06 13:58 ` [PATCH 2/5] x86/time: scale_delta() can be static Jan Beulich
@ 2026-01-13 18:40   ` Roger Pau Monné
  0 siblings, 0 replies; 32+ messages in thread
From: Roger Pau Monné @ 2026-01-13 18:40 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Tue, Jan 06, 2026 at 02:58:29PM +0100, Jan Beulich wrote:
> It's used in time.c alone. Modernize types while there.
> 
> Amends: 5a82d598d2d ("viridian: unify time sources")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] x86/HVM: drop at_tsc parameter from ->set_tsc_offset() hook
  2026-01-06 13:58 ` [PATCH 3/5] x86/HVM: drop at_tsc parameter from ->set_tsc_offset() hook Jan Beulich
@ 2026-01-14  9:40   ` Roger Pau Monné
  0 siblings, 0 replies; 32+ messages in thread
From: Roger Pau Monné @ 2026-01-14  9:40 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Tue, Jan 06, 2026 at 02:58:54PM +0100, Jan Beulich wrote:
> While the VMX hook never used the parameter, the SVM one lost its sole use
> some time ago (while the original use of the parameter had gone away even
> earlier).
> 
> Again modernize types while there.
> 
> Amends: 0cd50753eb40 ("nestedsvm: Disable TscRateMSR")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/5] x86/time: gtsc_to_gtime() is HVM-only
  2026-01-06 13:59 ` [PATCH 4/5] x86/time: gtsc_to_gtime() is HVM-only Jan Beulich
@ 2026-01-14  9:43   ` Roger Pau Monné
  2026-01-14  9:52     ` Roger Pau Monné
  0 siblings, 1 reply; 32+ messages in thread
From: Roger Pau Monné @ 2026-01-14  9:43 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Tue, Jan 06, 2026 at 02:59:43PM +0100, Jan Beulich wrote:
> Omit the function when HVM=n. With that the !HVM logic can also go away;
> leave an assertion.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -2840,14 +2840,13 @@ uint64_t gtime_to_gtsc(const struct doma
>      return scale_delta(time, &d->arch.ns_to_vtsc);
>  }
>  
> +#ifdef CONFIG_HVM
>  uint64_t gtsc_to_gtime(const struct domain *d, uint64_t tsc)
>  {
> -    u64 time = scale_delta(tsc, &d->arch.vtsc_to_ns);
> -
> -    if ( !is_hvm_domain(d) )
> -        time += d->arch.vtsc_offset;
> -    return time;
> +    ASSERT(is_hvm_domain(d));
> +    return scale_delta(tsc, &d->arch.vtsc_to_ns);
>  }
> +#endif /* CONFIG_HVM */

Seeing this is solely used by the vlapic code, did you consider
keeping scale_delta() non-static, and then moving gtsc_to_gtime() into
vlapic.c?

It would result in less ifdefery overall.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/5] x86/time: pv_soft_rdtsc() is PV-only
  2026-01-06 14:00 ` [PATCH 5/5] x86/time: pv_soft_rdtsc() is PV-only Jan Beulich
@ 2026-01-14  9:46   ` Roger Pau Monné
  0 siblings, 0 replies; 32+ messages in thread
From: Roger Pau Monné @ 2026-01-14  9:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Tue, Jan 06, 2026 at 03:00:21PM +0100, Jan Beulich wrote:
> Omit the function when PV=n, by moving it to the sole file using it, thus
> allowing it to become static as well.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/5] x86/time: gtsc_to_gtime() is HVM-only
  2026-01-14  9:43   ` Roger Pau Monné
@ 2026-01-14  9:52     ` Roger Pau Monné
  0 siblings, 0 replies; 32+ messages in thread
From: Roger Pau Monné @ 2026-01-14  9:52 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Wed, Jan 14, 2026 at 10:43:38AM +0100, Roger Pau Monné wrote:
> On Tue, Jan 06, 2026 at 02:59:43PM +0100, Jan Beulich wrote:
> > Omit the function when HVM=n. With that the !HVM logic can also go away;
> > leave an assertion.
> > 
> > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > --- a/xen/arch/x86/time.c
> > +++ b/xen/arch/x86/time.c
> > @@ -2840,14 +2840,13 @@ uint64_t gtime_to_gtsc(const struct doma
> >      return scale_delta(time, &d->arch.ns_to_vtsc);
> >  }
> >  
> > +#ifdef CONFIG_HVM
> >  uint64_t gtsc_to_gtime(const struct domain *d, uint64_t tsc)
> >  {
> > -    u64 time = scale_delta(tsc, &d->arch.vtsc_to_ns);
> > -
> > -    if ( !is_hvm_domain(d) )
> > -        time += d->arch.vtsc_offset;
> > -    return time;
> > +    ASSERT(is_hvm_domain(d));
> > +    return scale_delta(tsc, &d->arch.vtsc_to_ns);
> >  }
> > +#endif /* CONFIG_HVM */
> 
> Seeing this is solely used by the vlapic code, did you consider
> keeping scale_delta() non-static, and then moving gtsc_to_gtime() into
> vlapic.c?
> 
> It would result in less ifdefery overall.

I see now this might not be appropriate, gtsc_to_gtime() is the pair
to gtime_to_gtsc(), which is used in other places.  It looks like
gtsc_to_gtime() could also gain users in order parts of the code, and
hence making static to vlapic might not be the best approach.

I'm not overly happy with adding more ifdefary to time.c, and the
asymmetry this creates with PV guest not having a gtsc_to_gtime(), but
it's otherwise dead code, so:

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-06 13:58 ` [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() Jan Beulich
  2026-01-06 20:10   ` Антон Марков
@ 2026-01-14 17:49   ` Roger Pau Monné
  2026-01-15  8:00     ` Jan Beulich
  1 sibling, 1 reply; 32+ messages in thread
From: Roger Pau Monné @ 2026-01-14 17:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper,
	Антон Марков

On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
> Callers may pass in TSC values from before the local TSC stamp was last
> updated (this would in particular be the case when the TSC was latched, a
> time rendezvous occurs, and the latched value is used only afterwards).
> scale_delta(), otoh, deals with unsigned values, and hence would treat
> negative incoming deltas as huge positive values, yielding entirely bogus
> return values.
> 
> Fixes: 88e64cb785c1 ("x86/HVM: use fixed TSC value when saving or restoring domain")
> Reported-by: Антон Марков <akmarkov45@gmail.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> An alternative might be to have scale_delta() deal with signed deltas, yet
> that seemed more risky to me.

I won't go that route, the caller can figure out the signedness of the
result as needed.

> There could likely be more Fixes: tags; the one used is the oldest
> applicable one, from what I can tell.
> 
> A similar issue looks to exist in read_xen_timer() and its read_cycle()
> helper, if we're scheduled out (and beck in) between reading of the TSC
> and calculation of the delta (involving ->tsc_timestamp). Am I
> overlooking anything there?

If we are scheduled out in the middle, and the ->tsc_timestamp is
updated, ->version would also be bumped, and hence the loop will be
restarted.  I don't think there's an issue there.

> stime2tsc() guards against negative deltas by using 0 instead; I'm not
> quite sure that's correct either.

Hm, we should likely do the same for stime2tsc() that you do for
get_s_time_fixed().  Given the current callers I think we might be
safe, but it's a risk.

> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
> comment towards the TSC being "sane", but is that correct? Due to
> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
> calling tsc_ticks2ns()?

amd_check_erratum_1474() runs after early_time_init(), which would
have cleared any TSC_ADJUST offset AFAICT.  There's a note in the
initcall to that regard:

/*
 * Must be executed after early_time_init() for tsc_ticks2ns() to have been
 * calibrated.  That prevents us doing the check in init_amd().
 */
presmp_initcall(amd_check_erratum_1474);

> A similar issue looks to exist in tsc_get_info(), again when rdtsc()
> possibly returns a huge value due to TSC_ADJUST. Once again I wonder
> whether we shouldn't subtract boot_tsc_stamp.

I would expect tsc_get_info() to also get called exclusively after
early_time_init()?

> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1649,8 +1649,13 @@ s_time_t get_s_time_fixed(u64 at_tsc)
>          tsc = at_tsc;
>      else
>          tsc = rdtsc_ordered();
> -    delta = tsc - t->stamp.local_tsc;
> -    return t->stamp.local_stime + scale_delta(delta, &t->tsc_scale);
> +
> +    if ( tsc >= t->stamp.local_tsc )

Should we hint the compiler this is the likely path?

> +        delta = scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale);
> +    else
> +        delta = -scale_delta(t->stamp.local_tsc - tsc, &t->tsc_scale);
> +
> +    return t->stamp.local_stime + delta;

LGTM:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

However I see Anton still has concerns that this patch doesn't fully
fix the issue he reported, and I'm afraid I don't really understand
what's going on.  I will have to take a more detailed look at the
thread, or possibly attempt to reproduce myself.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-14 17:49   ` Roger Pau Monné
@ 2026-01-15  8:00     ` Jan Beulich
  2026-01-15  8:24       ` Roger Pau Monné
  2026-01-20  8:50       ` Jan Beulich
  0 siblings, 2 replies; 32+ messages in thread
From: Jan Beulich @ 2026-01-15  8:00 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper,
	Антон Марков

On 14.01.2026 18:49, Roger Pau Monné wrote:
> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
>> A similar issue looks to exist in read_xen_timer() and its read_cycle()
>> helper, if we're scheduled out (and beck in) between reading of the TSC
>> and calculation of the delta (involving ->tsc_timestamp). Am I
>> overlooking anything there?
> 
> If we are scheduled out in the middle, and the ->tsc_timestamp is
> updated, ->version would also be bumped, and hence the loop will be
> restarted.  I don't think there's an issue there.

Oh, indeed - I was too focused on the read_cycle() alone. That may go
wrong, but as you say the result then simply won't be used.

>> stime2tsc() guards against negative deltas by using 0 instead; I'm not
>> quite sure that's correct either.
> 
> Hm, we should likely do the same for stime2tsc() that you do for
> get_s_time_fixed().  Given the current callers I think we might be
> safe, but it's a risk.

Will do then.

>> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
>> comment towards the TSC being "sane", but is that correct? Due to
>> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
>> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
>> calling tsc_ticks2ns()?
> 
> amd_check_erratum_1474() runs after early_time_init(), which would
> have cleared any TSC_ADJUST offset AFAICT.  There's a note in the
> initcall to that regard:
> 
> /*
>  * Must be executed after early_time_init() for tsc_ticks2ns() to have been
>  * calibrated.  That prevents us doing the check in init_amd().
>  */
> presmp_initcall(amd_check_erratum_1474);

Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also
have played other games with MSR_TSC.

>> A similar issue looks to exist in tsc_get_info(), again when rdtsc()
>> possibly returns a huge value due to TSC_ADJUST. Once again I wonder
>> whether we shouldn't subtract boot_tsc_stamp.
> 
> I would expect tsc_get_info() to also get called exclusively after
> early_time_init()?

Same here then (obviously).

>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -1649,8 +1649,13 @@ s_time_t get_s_time_fixed(u64 at_tsc)
>>          tsc = at_tsc;
>>      else
>>          tsc = rdtsc_ordered();
>> -    delta = tsc - t->stamp.local_tsc;
>> -    return t->stamp.local_stime + scale_delta(delta, &t->tsc_scale);
>> +
>> +    if ( tsc >= t->stamp.local_tsc )
> 
> Should we hint the compiler this is the likely path?

I was wondering too, but was erring on the side of "no" primarily because
of Andrew's general preference towards no likely(). I'd be happy to add it
here.

>> +        delta = scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale);
>> +    else
>> +        delta = -scale_delta(t->stamp.local_tsc - tsc, &t->tsc_scale);
>> +
>> +    return t->stamp.local_stime + delta;
> 
> LGTM:
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> However I see Anton still has concerns that this patch doesn't fully
> fix the issue he reported, and I'm afraid I don't really understand
> what's going on.  I will have to take a more detailed look at the
> thread, or possibly attempt to reproduce myself.

Largely the same here.

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-15  8:00     ` Jan Beulich
@ 2026-01-15  8:24       ` Roger Pau Monné
  2026-01-15  8:49         ` Anton Markov
  2026-01-15 10:38         ` Jan Beulich
  2026-01-20  8:50       ` Jan Beulich
  1 sibling, 2 replies; 32+ messages in thread
From: Roger Pau Monné @ 2026-01-15  8:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper,
	Антон Марков

On Thu, Jan 15, 2026 at 09:00:07AM +0100, Jan Beulich wrote:
> On 14.01.2026 18:49, Roger Pau Monné wrote:
> > On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
> >> A similar issue looks to exist in read_xen_timer() and its read_cycle()
> >> helper, if we're scheduled out (and beck in) between reading of the TSC
> >> and calculation of the delta (involving ->tsc_timestamp). Am I
> >> overlooking anything there?
> > 
> > If we are scheduled out in the middle, and the ->tsc_timestamp is
> > updated, ->version would also be bumped, and hence the loop will be
> > restarted.  I don't think there's an issue there.
> 
> Oh, indeed - I was too focused on the read_cycle() alone. That may go
> wrong, but as you say the result then simply won't be used.
> 
> >> stime2tsc() guards against negative deltas by using 0 instead; I'm not
> >> quite sure that's correct either.
> > 
> > Hm, we should likely do the same for stime2tsc() that you do for
> > get_s_time_fixed().  Given the current callers I think we might be
> > safe, but it's a risk.
> 
> Will do then.
> 
> >> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
> >> comment towards the TSC being "sane", but is that correct? Due to
> >> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
> >> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
> >> calling tsc_ticks2ns()?
> > 
> > amd_check_erratum_1474() runs after early_time_init(), which would
> > have cleared any TSC_ADJUST offset AFAICT.  There's a note in the
> > initcall to that regard:
> > 
> > /*
> >  * Must be executed after early_time_init() for tsc_ticks2ns() to have been
> >  * calibrated.  That prevents us doing the check in init_amd().
> >  */
> > presmp_initcall(amd_check_erratum_1474);
> 
> Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also
> have played other games with MSR_TSC.

For amd_check_erratum_1474() we don't want to subtract boot_tsc_stamp,
otherwise when kexec'ed we won't be accounting properly for the time
since host startup, as subtracting boot_tsc_stamp would remove any
time consumed by a previously run OS.

> >> A similar issue looks to exist in tsc_get_info(), again when rdtsc()
> >> possibly returns a huge value due to TSC_ADJUST. Once again I wonder
> >> whether we shouldn't subtract boot_tsc_stamp.
> > 
> > I would expect tsc_get_info() to also get called exclusively after
> > early_time_init()?
> 
> Same here then (obviously).

For tsc_get_info() I think you are worried that the TSC might
overflow, and hence the calculation in scale_delta() would then be
skewed.  We must have other instances of this pattern however, what
about get_s_time_fixed(), I think it would also be affected?

Or maybe I'm not understanding the concern.  Given the proposed
scale_delta() logic, it won't be possible to distinguish rdtsc
overflowing from a value in the past.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-15  8:24       ` Roger Pau Monné
@ 2026-01-15  8:49         ` Anton Markov
  2026-01-15 10:38         ` Jan Beulich
  1 sibling, 0 replies; 32+ messages in thread
From: Anton Markov @ 2026-01-15  8:49 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, xen-devel@lists.xenproject.org, Andrew Cooper

[-- Attachment #1: Type: text/plain, Size: 267 bytes --]

Sorry. I mixed several threads into one. It wasn't immediately clear that
the second bug isn't related to handling negative values in scale_delta. It's
related to gradual time desynchronization between cores, so they should
probably be addressed separately.

Thanks.

[-- Attachment #2: Type: text/html, Size: 962 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-15  8:24       ` Roger Pau Monné
  2026-01-15  8:49         ` Anton Markov
@ 2026-01-15 10:38         ` Jan Beulich
  2026-01-15 11:48           ` Roger Pau Monné
  1 sibling, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2026-01-15 10:38 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper,
	Антон Марков

On 15.01.2026 09:24, Roger Pau Monné wrote:
> On Thu, Jan 15, 2026 at 09:00:07AM +0100, Jan Beulich wrote:
>> On 14.01.2026 18:49, Roger Pau Monné wrote:
>>> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
>>>> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
>>>> comment towards the TSC being "sane", but is that correct? Due to
>>>> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
>>>> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
>>>> calling tsc_ticks2ns()?
>>>
>>> amd_check_erratum_1474() runs after early_time_init(), which would
>>> have cleared any TSC_ADJUST offset AFAICT.  There's a note in the
>>> initcall to that regard:
>>>
>>> /*
>>>  * Must be executed after early_time_init() for tsc_ticks2ns() to have been
>>>  * calibrated.  That prevents us doing the check in init_amd().
>>>  */
>>> presmp_initcall(amd_check_erratum_1474);
>>
>> Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also
>> have played other games with MSR_TSC.
> 
> For amd_check_erratum_1474() we don't want to subtract boot_tsc_stamp,
> otherwise when kexec'ed we won't be accounting properly for the time
> since host startup, as subtracting boot_tsc_stamp would remove any
> time consumed by a previously run OS.

For both this and ...

>>>> A similar issue looks to exist in tsc_get_info(), again when rdtsc()
>>>> possibly returns a huge value due to TSC_ADJUST. Once again I wonder
>>>> whether we shouldn't subtract boot_tsc_stamp.
>>>
>>> I would expect tsc_get_info() to also get called exclusively after
>>> early_time_init()?
>>
>> Same here then (obviously).
> 
> For tsc_get_info() I think you are worried that the TSC might
> overflow, and hence the calculation in scale_delta() would then be
> skewed.  We must have other instances of this pattern however, what
> about get_s_time_fixed(), I think it would also be affected?
> 
> Or maybe I'm not understanding the concern.  Given the proposed
> scale_delta() logic, it won't be possible to distinguish rdtsc
> overflowing from a value in the past.

... this, my main point really is that scale_delta() (as its name says),
and hence also tsc_ticks2ns(), shouldn't be used on absolute counts, but
only deltas. (Yes, an absolute count can be viewed as delta from 0, but
that's correct only if we know the TSC started counting from 0 and was
never adjusted by some bias.)

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-15 10:38         ` Jan Beulich
@ 2026-01-15 11:48           ` Roger Pau Monné
  2026-01-15 12:04             ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Roger Pau Monné @ 2026-01-15 11:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper,
	Антон Марков

On Thu, Jan 15, 2026 at 11:38:10AM +0100, Jan Beulich wrote:
> On 15.01.2026 09:24, Roger Pau Monné wrote:
> > On Thu, Jan 15, 2026 at 09:00:07AM +0100, Jan Beulich wrote:
> >> On 14.01.2026 18:49, Roger Pau Monné wrote:
> >>> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
> >>>> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
> >>>> comment towards the TSC being "sane", but is that correct? Due to
> >>>> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
> >>>> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
> >>>> calling tsc_ticks2ns()?
> >>>
> >>> amd_check_erratum_1474() runs after early_time_init(), which would
> >>> have cleared any TSC_ADJUST offset AFAICT.  There's a note in the
> >>> initcall to that regard:
> >>>
> >>> /*
> >>>  * Must be executed after early_time_init() for tsc_ticks2ns() to have been
> >>>  * calibrated.  That prevents us doing the check in init_amd().
> >>>  */
> >>> presmp_initcall(amd_check_erratum_1474);
> >>
> >> Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also
> >> have played other games with MSR_TSC.
> > 
> > For amd_check_erratum_1474() we don't want to subtract boot_tsc_stamp,
> > otherwise when kexec'ed we won't be accounting properly for the time
> > since host startup, as subtracting boot_tsc_stamp would remove any
> > time consumed by a previously run OS.
> 
> For both this and ...
> 
> >>>> A similar issue looks to exist in tsc_get_info(), again when rdtsc()
> >>>> possibly returns a huge value due to TSC_ADJUST. Once again I wonder
> >>>> whether we shouldn't subtract boot_tsc_stamp.
> >>>
> >>> I would expect tsc_get_info() to also get called exclusively after
> >>> early_time_init()?
> >>
> >> Same here then (obviously).
> > 
> > For tsc_get_info() I think you are worried that the TSC might
> > overflow, and hence the calculation in scale_delta() would then be
> > skewed.  We must have other instances of this pattern however, what
> > about get_s_time_fixed(), I think it would also be affected?
> > 
> > Or maybe I'm not understanding the concern.  Given the proposed
> > scale_delta() logic, it won't be possible to distinguish rdtsc
> > overflowing from a value in the past.
> 
> ... this, my main point really is that scale_delta() (as its name says),
> and hence also tsc_ticks2ns(), shouldn't be used on absolute counts, but
> only deltas. (Yes, an absolute count can be viewed as delta from 0, but
> that's correct only if we know the TSC started counting from 0 and was
> never adjusted by some bias.)

Well amd_check_erratum_1474() does want the delta from 0 to the
current TSC, because that's the best? way to see when C6 needs to be
disabled.  Otherwise we just straight disable C6 on boot on affected
systems.

IMO the problem is not about using tsc_ticks2ns() or scale_delta()
with absolute or delta values, the issue if how the "delta" passed to
those functions is calculated.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-15 11:48           ` Roger Pau Monné
@ 2026-01-15 12:04             ` Jan Beulich
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2026-01-15 12:04 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper,
	Антон Марков

On 15.01.2026 12:48, Roger Pau Monné wrote:
> On Thu, Jan 15, 2026 at 11:38:10AM +0100, Jan Beulich wrote:
>> On 15.01.2026 09:24, Roger Pau Monné wrote:
>>> On Thu, Jan 15, 2026 at 09:00:07AM +0100, Jan Beulich wrote:
>>>> On 14.01.2026 18:49, Roger Pau Monné wrote:
>>>>> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
>>>>>> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a
>>>>>> comment towards the TSC being "sane", but is that correct? Due to
>>>>>> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then
>>>>>> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before
>>>>>> calling tsc_ticks2ns()?
>>>>>
>>>>> amd_check_erratum_1474() runs after early_time_init(), which would
>>>>> have cleared any TSC_ADJUST offset AFAICT.  There's a note in the
>>>>> initcall to that regard:
>>>>>
>>>>> /*
>>>>>  * Must be executed after early_time_init() for tsc_ticks2ns() to have been
>>>>>  * calibrated.  That prevents us doing the check in init_amd().
>>>>>  */
>>>>> presmp_initcall(amd_check_erratum_1474);
>>>>
>>>> Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also
>>>> have played other games with MSR_TSC.
>>>
>>> For amd_check_erratum_1474() we don't want to subtract boot_tsc_stamp,
>>> otherwise when kexec'ed we won't be accounting properly for the time
>>> since host startup, as subtracting boot_tsc_stamp would remove any
>>> time consumed by a previously run OS.
>>
>> For both this and ...
>>
>>>>>> A similar issue looks to exist in tsc_get_info(), again when rdtsc()
>>>>>> possibly returns a huge value due to TSC_ADJUST. Once again I wonder
>>>>>> whether we shouldn't subtract boot_tsc_stamp.
>>>>>
>>>>> I would expect tsc_get_info() to also get called exclusively after
>>>>> early_time_init()?
>>>>
>>>> Same here then (obviously).
>>>
>>> For tsc_get_info() I think you are worried that the TSC might
>>> overflow, and hence the calculation in scale_delta() would then be
>>> skewed.  We must have other instances of this pattern however, what
>>> about get_s_time_fixed(), I think it would also be affected?
>>>
>>> Or maybe I'm not understanding the concern.  Given the proposed
>>> scale_delta() logic, it won't be possible to distinguish rdtsc
>>> overflowing from a value in the past.
>>
>> ... this, my main point really is that scale_delta() (as its name says),
>> and hence also tsc_ticks2ns(), shouldn't be used on absolute counts, but
>> only deltas. (Yes, an absolute count can be viewed as delta from 0, but
>> that's correct only if we know the TSC started counting from 0 and was
>> never adjusted by some bias.)
> 
> Well amd_check_erratum_1474() does want the delta from 0 to the
> current TSC, because that's the best? way to see when C6 needs to be
> disabled.  Otherwise we just straight disable C6 on boot on affected
> systems.

I think that may be necessary when we don't know what was done to the TSC
before we took control of the system.

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-15  8:00     ` Jan Beulich
  2026-01-15  8:24       ` Roger Pau Monné
@ 2026-01-20  8:50       ` Jan Beulich
  2026-01-20  9:28         ` Jan Beulich
  1 sibling, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2026-01-20  8:50 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper,
	Антон Марков

On 15.01.2026 09:00, Jan Beulich wrote:
> On 14.01.2026 18:49, Roger Pau Monné wrote:
>> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
>>> stime2tsc() guards against negative deltas by using 0 instead; I'm not
>>> quite sure that's correct either.
>>
>> Hm, we should likely do the same for stime2tsc() that you do for
>> get_s_time_fixed().  Given the current callers I think we might be
>> safe, but it's a risk.
> 
> Will do then.

While doing so, I came to wonder if there isn't a reason for this "capping".
In local_time_calibration() we also have

    /* Local time warps forward if it lags behind master time. */
    if ( curr.local_stime < curr.master_stime )
        curr.local_stime = curr.master_stime;

Which for the use of stime2tsc() in cstate_restore_tsc() might mean that
indeed there is a worry of the delta being negative, and the desire to
"warp forward" in that case.

Whereas for the other use in reprogram_timer() capping at 0 isn't overly
useful. By the time the value is used, it is likely in the past anyway.

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
  2026-01-20  8:50       ` Jan Beulich
@ 2026-01-20  9:28         ` Jan Beulich
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2026-01-20  9:28 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper,
	Антон Марков

On 20.01.2026 09:50, Jan Beulich wrote:
> On 15.01.2026 09:00, Jan Beulich wrote:
>> On 14.01.2026 18:49, Roger Pau Monné wrote:
>>> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote:
>>>> stime2tsc() guards against negative deltas by using 0 instead; I'm not
>>>> quite sure that's correct either.
>>>
>>> Hm, we should likely do the same for stime2tsc() that you do for
>>> get_s_time_fixed().  Given the current callers I think we might be
>>> safe, but it's a risk.
>>
>> Will do then.
> 
> While doing so, I came to wonder if there isn't a reason for this "capping".
> In local_time_calibration() we also have
> 
>     /* Local time warps forward if it lags behind master time. */
>     if ( curr.local_stime < curr.master_stime )
>         curr.local_stime = curr.master_stime;
> 
> Which for the use of stime2tsc() in cstate_restore_tsc() might mean that
> indeed there is a worry of the delta being negative, and the desire to
> "warp forward" in that case.

Proposed new function implementation (easier to look at than the diff):

uint64_t stime2tsc(s_time_t stime)
{
    const struct cpu_time *t = &this_cpu(cpu_time);
    s_time_t stime_delta = stime - t->stamp.local_stime;
    int64_t delta = 0;

    /*
     * While for reprogram_timer() the capping at 0 isn't relevant (the returned
     * value is likely in the past anyway then, by the time it is used), for
     * cstate_restore_tsc() it is meaningful: We need to avoid moving the TSC
     * backwards (relative to when it may last have been read).
     */
    if ( stime_delta > 0 )
    {
        struct time_scale sys_to_tsc = scale_reciprocal(t->tsc_scale);

        delta = scale_delta(stime_delta, &sys_to_tsc);
    }

    return t->stamp.local_tsc + delta;
}

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2026-01-20  9:29 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-06 13:57 [PATCH 0/5] x86: assorted time handling adjustments Jan Beulich
2026-01-06 13:58 ` [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() Jan Beulich
2026-01-06 20:10   ` Антон Марков
2026-01-07  8:02     ` Jan Beulich
2026-01-09 16:11       ` Anton Markov
2026-01-12 10:31         ` Anton Markov
2026-01-12 11:45           ` Jan Beulich
2026-01-12 12:49             ` Anton Markov
2026-01-12 14:13               ` Jan Beulich
2026-01-12 14:51                 ` Anton Markov
2026-01-12 16:08                   ` Jan Beulich
2026-01-12 16:41                     ` Anton Markov
2026-01-13 11:21                       ` Jan Beulich
2026-01-13 12:35                         ` Anton Markov
2026-01-14 17:49   ` Roger Pau Monné
2026-01-15  8:00     ` Jan Beulich
2026-01-15  8:24       ` Roger Pau Monné
2026-01-15  8:49         ` Anton Markov
2026-01-15 10:38         ` Jan Beulich
2026-01-15 11:48           ` Roger Pau Monné
2026-01-15 12:04             ` Jan Beulich
2026-01-20  8:50       ` Jan Beulich
2026-01-20  9:28         ` Jan Beulich
2026-01-06 13:58 ` [PATCH 2/5] x86/time: scale_delta() can be static Jan Beulich
2026-01-13 18:40   ` Roger Pau Monné
2026-01-06 13:58 ` [PATCH 3/5] x86/HVM: drop at_tsc parameter from ->set_tsc_offset() hook Jan Beulich
2026-01-14  9:40   ` Roger Pau Monné
2026-01-06 13:59 ` [PATCH 4/5] x86/time: gtsc_to_gtime() is HVM-only Jan Beulich
2026-01-14  9:43   ` Roger Pau Monné
2026-01-14  9:52     ` Roger Pau Monné
2026-01-06 14:00 ` [PATCH 5/5] x86/time: pv_soft_rdtsc() is PV-only Jan Beulich
2026-01-14  9:46   ` Roger Pau Monné

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.