* [PATCH 0/5] x86: assorted time handling adjustments
@ 2026-01-06 13:57 Jan Beulich
2026-01-06 13:58 ` [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() Jan Beulich
` (4 more replies)
0 siblings, 5 replies; 32+ messages in thread
From: Jan Beulich @ 2026-01-06 13:57 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné
The first patch aims at addressing a recent bug report, but may also point
out further more or less related issues. The other patches do some tidying
found desirable while doing the investigation.
1: time: deal with negative deltas in get_s_time_fixed()
2: time: scale_delta() can be static
3: HVM: drop at_tsc parameter from ->set_tsc_offset() hook
4: time: gtsc_to_gtime() is HVM-only
5: time: pv_soft_rdtsc() is PV-only
Jan
^ permalink raw reply [flat|nested] 32+ messages in thread* [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-06 13:57 [PATCH 0/5] x86: assorted time handling adjustments Jan Beulich @ 2026-01-06 13:58 ` Jan Beulich 2026-01-06 20:10 ` Антон Марков 2026-01-14 17:49 ` Roger Pau Monné 2026-01-06 13:58 ` [PATCH 2/5] x86/time: scale_delta() can be static Jan Beulich ` (3 subsequent siblings) 4 siblings, 2 replies; 32+ messages in thread From: Jan Beulich @ 2026-01-06 13:58 UTC (permalink / raw) To: xen-devel@lists.xenproject.org Cc: Andrew Cooper, Roger Pau Monné, Антон Марков Callers may pass in TSC values from before the local TSC stamp was last updated (this would in particular be the case when the TSC was latched, a time rendezvous occurs, and the latched value is used only afterwards). scale_delta(), otoh, deals with unsigned values, and hence would treat negative incoming deltas as huge positive values, yielding entirely bogus return values. Fixes: 88e64cb785c1 ("x86/HVM: use fixed TSC value when saving or restoring domain") Reported-by: Антон Марков <akmarkov45@gmail.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> --- An alternative might be to have scale_delta() deal with signed deltas, yet that seemed more risky to me. There could likely be more Fixes: tags; the one used is the oldest applicable one, from what I can tell. A similar issue looks to exist in read_xen_timer() and its read_cycle() helper, if we're scheduled out (and beck in) between reading of the TSC and calculation of the delta (involving ->tsc_timestamp). Am I overlooking anything there? stime2tsc() guards against negative deltas by using 0 instead; I'm not quite sure that's correct either. amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a comment towards the TSC being "sane", but is that correct? Due to TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before calling tsc_ticks2ns()? A similar issue looks to exist in tsc_get_info(), again when rdtsc() possibly returns a huge value due to TSC_ADJUST. Once again I wonder whether we shouldn't subtract boot_tsc_stamp. --- a/xen/arch/x86/time.c +++ b/xen/arch/x86/time.c @@ -1649,8 +1649,13 @@ s_time_t get_s_time_fixed(u64 at_tsc) tsc = at_tsc; else tsc = rdtsc_ordered(); - delta = tsc - t->stamp.local_tsc; - return t->stamp.local_stime + scale_delta(delta, &t->tsc_scale); + + if ( tsc >= t->stamp.local_tsc ) + delta = scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale); + else + delta = -scale_delta(t->stamp.local_tsc - tsc, &t->tsc_scale); + + return t->stamp.local_stime + delta; } s_time_t get_s_time(void) ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-06 13:58 ` [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() Jan Beulich @ 2026-01-06 20:10 ` Антон Марков 2026-01-07 8:02 ` Jan Beulich 2026-01-14 17:49 ` Roger Pau Monné 1 sibling, 1 reply; 32+ messages in thread From: Антон Марков @ 2026-01-06 20:10 UTC (permalink / raw) To: xen-devel; +Cc: jbeulich, andrew.cooper3, roger.pau [-- Attachment #1: Type: text/plain, Size: 6534 bytes --] Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt (xen/arch/x86/hvm/hvm.c ), it was easy to catch because process_pending_softirqs is frequently called there, which in turn processes softirqs from the timer (where the timestamp is updated). After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped reproducing under no load. However, when the number of vCPUs is 4 times greater than the number of CPUs (under heavy load), the problem rarely reoccurs (mostly during snapshot restores during process_pending_softirqs calls), and this is no longer a simple case. If get_s_time_fixed can indeed be interrupted during execution after rdtsc_ordered, then the current fix is insufficient. It's necessary to atomically copy "t->stamp" to the stack using local_irq_disable and local_irq_enable (as in local_time_calibration), and then work with the copy, confident in its lifetime and immutability. But until get_s_time_fixed is proven to be interruptible, this is premature, so your fix is sufficient. I think I need more information and testing to say more. Regarding the other scale_delta calls, if they include values calculated from externally saved tsc values that could have become stale during the process_pending_softirqs call, this definitely needs to be fixed. Here's the problematic code from dwm.exe, if that helps: void __fastcall CPartitionVerticalBlankScheduler::UpdateCurrentTime(CPartitionVerticalBlankScheduler *this) { uint64_t *xor_time_ptr = (uint64_t *)((char *)this + 0x3E40); CPartitionVerticalBlankScheduler *scheduler = this; LARGE_INTEGER *current_time_ptr = (LARGE_INTEGER *)((char *)this + 0x3E30); uint64_t previous_time = *((_QWORD *)this + 0x7C6); // 0x3e30 uint64_t address_high = ((_QWORD)this + 0x3E40) << 32; uint64_t xor_key = ((uint64_t)this + 0x3E40) | address_high; // validate internal state if ((previous_time ^ xor_key) != *((_QWORD *)this + 0x7C8)) { // 0x3e40 char exception_record[0x98]; memset(exception_record, 0, sizeof(exception_record)); *(int *)exception_record = 0x88980080; // MILERR_INTERNALERROR uint64_t computed_xor = *xor_time_ptr ^ ((uint64_t)xor_time_ptr | address_high); int param_count = 4; int previous_time_high = (int)(previous_time >> 32); uint32_t previous_time_low = (uint32_t)previous_time; int computed_xor_high = (int)(computed_xor >> 32); uint32_t computed_xor_low = (uint32_t)computed_xor; RaiseFailFastException((PEXCEPTION_RECORD)exception_record, nullptr, 0); previous_time = *((_QWORD *)scheduler + 1990); } // get current timestamp *((_QWORD *)scheduler + 0x7C7) = previous_time; // 0x3e38 QueryPerformanceCounter(current_time_ptr); LARGE_INTEGER new_time = *current_time_ptr; uint64_t last_previous_time = *((_QWORD *)scheduler + 0x7C7); // 0x3e38 // check if time go backward if (new_time.QuadPart < last_previous_time) { char exception_record[0x98]; memset(exception_record, 0, sizeof(exception_record)); *(int *)exception_record = 0x8898009B; // MILERR_QPC_TIME_WENT_BACKWARD int new_time_high = new_time.HighPart; uint32_t new_time_low = new_time.LowPart; int last_previous_high = (int)(last_previous_time >> 32); uint32_t last_previous_low = (uint32_t)last_previous_time; int frequency_high = g_qpcFrequency.HighPart; uint32_t frequency_low = g_qpcFrequency.LowPart; int64_t time_delta = last_previous_time - new_time.QuadPart; int64_t millis_delta = (1000 * time_delta) / g_qpcFrequency.QuadPart; int millis_high = (int)(millis_delta >> 32); uint32_t millis_low = (uint32_t)millis_delta; int param_count = 8; RaiseFailFastException((PEXCEPTION_RECORD)exception_record, nullptr, 0); new_time = *(LARGE_INTEGER *)((char *)scheduler + 15920); } *xor_time_ptr = new_time.QuadPart ^ xor_key; } Thanks. 06.01.2026 16:58, Jan Beulich пишет: > Callers may pass in TSC values from before the local TSC stamp was last > updated (this would in particular be the case when the TSC was latched, a > time rendezvous occurs, and the latched value is used only afterwards). > scale_delta(), otoh, deals with unsigned values, and hence would treat > negative incoming deltas as huge positive values, yielding entirely bogus > return values. > > Fixes: 88e64cb785c1 ("x86/HVM: use fixed TSC value when saving or restoring domain") > Reported-by: Антон Марков<akmarkov45@gmail.com> > Signed-off-by: Jan Beulich<jbeulich@suse.com> > --- > An alternative might be to have scale_delta() deal with signed deltas, yet > that seemed more risky to me. > > There could likely be more Fixes: tags; the one used is the oldest > applicable one, from what I can tell. > > A similar issue looks to exist in read_xen_timer() and its read_cycle() > helper, if we're scheduled out (and beck in) between reading of the TSC > and calculation of the delta (involving ->tsc_timestamp). Am I > overlooking anything there? > > stime2tsc() guards against negative deltas by using 0 instead; I'm not > quite sure that's correct either. > > amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a > comment towards the TSC being "sane", but is that correct? Due to > TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then > wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before > calling tsc_ticks2ns()? > > A similar issue looks to exist in tsc_get_info(), again when rdtsc() > possibly returns a huge value due to TSC_ADJUST. Once again I wonder > whether we shouldn't subtract boot_tsc_stamp. > > --- a/xen/arch/x86/time.c > +++ b/xen/arch/x86/time.c > @@ -1649,8 +1649,13 @@ s_time_t get_s_time_fixed(u64 at_tsc) > tsc = at_tsc; > else > tsc = rdtsc_ordered(); > - delta = tsc - t->stamp.local_tsc; > - return t->stamp.local_stime + scale_delta(delta, &t->tsc_scale); > + > + if ( tsc >= t->stamp.local_tsc ) > + delta = scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale); > + else > + delta = -scale_delta(t->stamp.local_tsc - tsc, &t->tsc_scale); > + > + return t->stamp.local_stime + delta; > } > > s_time_t get_s_time(void) > [-- Attachment #2: Type: text/html, Size: 10522 bytes --] ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-06 20:10 ` Антон Марков @ 2026-01-07 8:02 ` Jan Beulich 2026-01-09 16:11 ` Anton Markov 0 siblings, 1 reply; 32+ messages in thread From: Jan Beulich @ 2026-01-07 8:02 UTC (permalink / raw) To: Антон Марков Cc: andrew.cooper3, roger.pau, xen-devel On 06.01.2026 21:10, Антон Марков wrote: > Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt > (xen/arch/x86/hvm/hvm.c ), it was easy to catch because > process_pending_softirqs is frequently called there, which in turn > processes softirqs from the timer (where the timestamp is updated). > After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped > reproducing under no load. However, when the number of vCPUs is 4 times > greater than the number of CPUs (under heavy load), the problem rarely > reoccurs (mostly during snapshot restores during > process_pending_softirqs calls), and this is no longer a simple case. If > get_s_time_fixed can indeed be interrupted during execution after > rdtsc_ordered, then the current fix is insufficient. It's necessary to > atomically copy "t->stamp" to the stack using local_irq_disable and > local_irq_enable (as in local_time_calibration), and then work with the > copy, confident in its lifetime and immutability. But until > get_s_time_fixed is proven to be interruptible, this is premature, so > your fix is sufficient. I think I need more information and testing to > say more. While the cpu_calibration per-CPU variable is updated from IRQ context, the cpu_time one isn't. Hence t->stamp's contents cannot change behind the back of get_s_time_fixed(). I wonder whether ... > Regarding the other scale_delta calls, if they include values > calculated from externally saved tsc values that could have become > stale during the process_pending_softirqs call, this definitely needs to > be fixed. ... another similar issue (possibly one not included in the set of remarks I have in the patch, as none of those look related to what you describe) might be causing the remaining, more rare problems you say you see. That set of remarks is actually a result of me going over all other scale_delta() calls, but of course I may have got the analysis wrong. As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy load) - while I don't think we have a support statement for such upstream (when probably we should), iirc for our (SUSE's) products we would consider that unsupported. Just fyi. Also, btw, please don't top-post. Jan ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-07 8:02 ` Jan Beulich @ 2026-01-09 16:11 ` Anton Markov 2026-01-12 10:31 ` Anton Markov 0 siblings, 1 reply; 32+ messages in thread From: Anton Markov @ 2026-01-09 16:11 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, roger.pau, xen-devel [-- Attachment #1: Type: text/plain, Size: 9698 bytes --] You're right. These aren't interrupts in get_s_time_fixed, but rather a cumulative error in the sequence due to integer rounding. I added logging of the current local_stime to local_time_calibration and noticed that the timestamp between cores is gradually increasing. If the server has been running for weeks, this could be a very large value. ``` @@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void) if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ) { /* Atomically read cpu_calibration struct and write cpu_time struct. */ + printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n", smp_processor_id(), t->stamp.local_stime, c->local_stime, + t->last_seen_ns, t->last_seen_tsc); local_irq_disable(); t->stamp = *c; local_irq_enable(); ``` 2 hours of work: ``` (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597 (8565145862216, 0) (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609 (8565145863957, 0) (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491 (8565145864800, 0) (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609 (8565145865372, 0) 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag ``` 6 hours of work: ``` (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993 (22915730870665, 0) (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889 (22915730870693, 0) (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841 (22915730872231, 0) (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696 (22915730872096, 0) 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag ``` Clarification, according to my xen settings: ``` ucode=scan dom0_mem=53923M,max:53923M dom0_max_vcpus=4-96 dom0_vcpus_pin=0 force-ept=1 ept=no-ad,no-pml hap_1gb=0 hap_2mb=0 altp2m=1 hpet=legacy-replacement smt=1 spec-ctrl=no gnttab_max_frames=512 cpufreq=xen:performance max_cstate=1 sched=credit sched-gran=cpu apicv=0 sched_credit_tslice_ms=5 sched_ratelimit_us=500 ``` Processors I tested on: ``` Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm cpuid_fault fsgsbase erms xsaveopt arch_capabilities ``` ``` Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz x2 (NUMA) Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ibpb fsgsbase bmi1 avx2 bmi2 erms rtm avx512f avx512dq rdseed adx avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 avx512vbmi avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear arch_capabilities ``` Next I moved the code to r3 to speed up playback: ``` #include <stdint.h> #include <stdio.h> static __inline__ unsigned long long rdtsc(void) { unsigned hi, lo; __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi)); return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 ); } #define MILLISECS(_ms) ((int64_t)((_ms) * 1000000ULL)) struct cpu_time_stamp { uint64_t local_tsc; int64_t local_stime; int64_t master_stime; }; struct time_scale { int shift; uint32_t mul_frac; }; static inline uint32_t div_frac(uint32_t dividend, uint32_t divisor) { uint32_t quotient, remainder; asm ( "divl %4" : "=a" (quotient), "=d" (remainder) : "0" (0), "1" (dividend), "r" (divisor) ); return quotient; } void set_time_scale(struct time_scale *ts, uint64_t ticks_per_sec) { uint64_t tps64 = ticks_per_sec; uint32_t tps32; int shift = 0; while ( tps64 > (MILLISECS(1000)*2) ) { tps64 >>= 1; shift--; } tps32 = (uint32_t)tps64; while ( tps32 <= (uint32_t)MILLISECS(1000) ) { tps32 <<= 1; shift++; } ts->mul_frac = div_frac(MILLISECS(1000), tps32); ts->shift = shift; } uint64_t scale_delta(uint64_t delta, const struct time_scale *scale) { uint64_t product; if ( scale->shift < 0 ) delta >>= -scale->shift; else delta <<= scale->shift; asm ( "mulq %2 ; shrd $32,%1,%0" : "=a" (product), "=d" (delta) : "rm" (delta), "0" ((uint64_t)scale->mul_frac) ); return product; } #define _TS_MUL_FRAC_IDENTITY 0x80000000UL static inline struct time_scale scale_reciprocal(struct time_scale scale) { struct time_scale reciprocal; uint32_t dividend; dividend = _TS_MUL_FRAC_IDENTITY; reciprocal.shift = 1 - scale.shift; while ( dividend >= scale.mul_frac ) { dividend >>= 1; reciprocal.shift++; } asm ( "divl %4" : "=a" (reciprocal.mul_frac), "=d" (dividend) : "0" (0), "1" (dividend), "r" (scale.mul_frac) ); return reciprocal; } int64_t get_s_time_fixed(const struct cpu_time_stamp *t, const struct time_scale *ts, uint64_t at_tsc) { uint64_t tsc, delta; if ( at_tsc ) tsc = at_tsc; else tsc = rdtsc(); delta = tsc - t->local_tsc; return t->local_stime + scale_delta(delta, ts); } int main() { struct cpu_time_stamp ct; struct time_scale ts; struct time_scale back; uint64_t start = rdtsc(); set_time_scale(&ts, 3000000000); back = scale_reciprocal(ts); ct.local_tsc = start; ct.local_stime = scale_delta(start, &ts); while (1) { uint64_t new_tsc = rdtsc(); ct.local_stime = get_s_time_fixed(&ct, &ts, new_tsc); ct.local_tsc = new_tsc; uint64_t tmp_tsc = rdtsc(); printf("%lu %lu\n", tmp_tsc, scale_delta(get_s_time_fixed(&ct, &ts, tmp_tsc), &back)); } return 0; } ``` It's enough to run the loop for 10-20 seconds to see the problem. scale_delta rounds the least significant bits during calculation, which causes the error to accumulate, at different rates on different cores, depending on the least significant bits at the time of calibration. Now let's talk about why dwm reacts this way. When a snapshot is reversed, last_guest_time in hvm_get_guest_time_fixed is set to 0, which doesn't prevent time from flowing backwards. This means that cache_tsc_offset in hvm_get_guest_tsc_fixed may be configured correctly on one physical core, but due to shedding on a physical core with a lagging tsc, the guest may occasionally see a tsc value lower than it saw before the snapshot was reversed. If this happens to the dwm code, it terminates with an error. A quick solution to this problem might be to save the last_seen_tsc parameter in a snapshot for each core, followed by validation. The correct solution is to remove the rounding of the least significant bits from the sequence. On Wed, Jan 7, 2026 at 11:02 AM Jan Beulich <jbeulich@suse.com> wrote: > On 06.01.2026 21:10, Антон Марков wrote: > > Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt > > (xen/arch/x86/hvm/hvm.c ), it was easy to catch because > > process_pending_softirqs is frequently called there, which in turn > > processes softirqs from the timer (where the timestamp is updated). > > After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped > > reproducing under no load. However, when the number of vCPUs is 4 times > > greater than the number of CPUs (under heavy load), the problem rarely > > reoccurs (mostly during snapshot restores during > > process_pending_softirqs calls), and this is no longer a simple case. If > > get_s_time_fixed can indeed be interrupted during execution after > > rdtsc_ordered, then the current fix is insufficient. It's necessary to > > atomically copy "t->stamp" to the stack using local_irq_disable and > > local_irq_enable (as in local_time_calibration), and then work with the > > copy, confident in its lifetime and immutability. But until > > get_s_time_fixed is proven to be interruptible, this is premature, so > > your fix is sufficient. I think I need more information and testing to > > say more. > > While the cpu_calibration per-CPU variable is updated from IRQ context, > the cpu_time one isn't. Hence t->stamp's contents cannot change behind > the back of get_s_time_fixed(). I wonder whether ... > > > Regarding the other scale_delta calls, if they include values > > calculated from externally saved tsc values that could have become > > stale during the process_pending_softirqs call, this definitely needs to > > be fixed. > > ... another similar issue (possibly one not included in the set of > remarks I have in the patch, as none of those look related to what you > describe) might be causing the remaining, more rare problems you say you > see. That set of remarks is actually a result of me going over all other > scale_delta() calls, but of course I may have got the analysis wrong. > > As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy > load) - while I don't think we have a support statement for such upstream > (when probably we should), iirc for our (SUSE's) products we would > consider that unsupported. Just fyi. > > Also, btw, please don't top-post. > > Jan > [-- Attachment #2: Type: text/html, Size: 27925 bytes --] ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-09 16:11 ` Anton Markov @ 2026-01-12 10:31 ` Anton Markov 2026-01-12 11:45 ` Jan Beulich 0 siblings, 1 reply; 32+ messages in thread From: Anton Markov @ 2026-01-12 10:31 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, roger.pau, xen-devel [-- Attachment #1: Type: text/plain, Size: 10788 bytes --] Bit rounding isn't the main issue; the difference in ipi delivery to the cores accumulates due to the ordering. Replacing get_s_time_fixed with scale_delta in time_calibration_rendezvous_tail should be sufficient. This configuration won't accumulate errors, but bit rounding can still cause a significant difference from calibration to calibration, but it's not as significant. On Fri, Jan 9, 2026 at 7:11 PM Anton Markov <akmarkov45@gmail.com> wrote: > You're right. These aren't interrupts in get_s_time_fixed, but rather a > cumulative error in the sequence due to integer rounding. I added logging > of the current local_stime to local_time_calibration and noticed that the > timestamp between cores is gradually increasing. If the server has been > running for weeks, this could be a very large value. > > > ``` > > @@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void) > > if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ) > > { > > /* Atomically read cpu_calibration struct and write cpu_time struct. */ > > + printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n", > smp_processor_id(), t->stamp.local_stime, c->local_stime, > > + t->last_seen_ns, t->last_seen_tsc); > > local_irq_disable(); > > t->stamp = *c; > > local_irq_enable(); > > ``` > > > 2 hours of work: > > > ``` > > (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597 > (8565145862216, 0) > > (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609 > (8565145863957, 0) > > (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491 > (8565145864800, 0) > > (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609 > (8565145865372, 0) > > > 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag > > ``` > > > 6 hours of work: > > ``` > > (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993 > (22915730870665, 0) > > (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889 > (22915730870693, 0) > > (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841 > (22915730872231, 0) > > (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696 > (22915730872096, 0) > > > 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag > > ``` > > > Clarification, according to my xen settings: > > ``` > > ucode=scan dom0_mem=53923M,max:53923M dom0_max_vcpus=4-96 dom0_vcpus_pin=0 > force-ept=1 ept=no-ad,no-pml hap_1gb=0 hap_2mb=0 altp2m=1 > hpet=legacy-replacement smt=1 spec-ctrl=no gnttab_max_frames=512 > cpufreq=xen:performance max_cstate=1 sched=credit sched-gran=cpu apicv=0 > sched_credit_tslice_ms=5 sched_ratelimit_us=500 > > ``` > > > Processors I tested on: > > > ``` > > Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz > > > Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx > fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl > nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 cx16 > sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm cpuid_fault > fsgsbase erms xsaveopt arch_capabilities > > ``` > > ``` > > Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz x2 (NUMA) > > > Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx > fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl > nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 fma cx16 > sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch cpuid_fault ibpb fsgsbase bmi1 avx2 bmi2 erms rtm avx512f > avx512dq rdseed adx avx512ifma clflushopt clwb avx512cd sha_ni avx512bw > avx512vl xsaveopt xsavec xgetbv1 avx512vbmi avx512_vbmi2 gfni vaes > vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear > arch_capabilities > > ``` > > > Next I moved the code to r3 to speed up playback: > > > ``` > > #include <stdint.h> > > #include <stdio.h> > > > static __inline__ unsigned long long rdtsc(void) > > { > > unsigned hi, lo; > > __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi)); > > return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 ); > > } > > > #define MILLISECS(_ms) ((int64_t)((_ms) * 1000000ULL)) > > > struct cpu_time_stamp { > > uint64_t local_tsc; > > int64_t local_stime; > > int64_t master_stime; > > }; > > > struct time_scale { > > int shift; > > uint32_t mul_frac; > > }; > > > > static inline uint32_t div_frac(uint32_t dividend, uint32_t divisor) > > { > > uint32_t quotient, remainder; > > asm ( > > "divl %4" > > : "=a" (quotient), "=d" (remainder) > > : "0" (0), "1" (dividend), "r" (divisor) ); > > return quotient; > > } > > > > void set_time_scale(struct time_scale *ts, uint64_t ticks_per_sec) > > { > > uint64_t tps64 = ticks_per_sec; > > uint32_t tps32; > > int shift = 0; > > > while ( tps64 > (MILLISECS(1000)*2) ) > > { > > tps64 >>= 1; > > shift--; > > } > > > tps32 = (uint32_t)tps64; > > while ( tps32 <= (uint32_t)MILLISECS(1000) ) > > { > > tps32 <<= 1; > > shift++; > > } > > > ts->mul_frac = div_frac(MILLISECS(1000), tps32); > > ts->shift = shift; > > } > > > > uint64_t scale_delta(uint64_t delta, const struct time_scale *scale) > > { > > uint64_t product; > > > if ( scale->shift < 0 ) > > delta >>= -scale->shift; > > else > > delta <<= scale->shift; > > > asm ( > > "mulq %2 ; shrd $32,%1,%0" > > : "=a" (product), "=d" (delta) > > : "rm" (delta), "0" ((uint64_t)scale->mul_frac) ); > > > return product; > > } > > > #define _TS_MUL_FRAC_IDENTITY 0x80000000UL > > > static inline struct time_scale scale_reciprocal(struct time_scale scale) > > { > > struct time_scale reciprocal; > > uint32_t dividend; > > > dividend = _TS_MUL_FRAC_IDENTITY; > > reciprocal.shift = 1 - scale.shift; > > while ( dividend >= scale.mul_frac ) > > { > > dividend >>= 1; > > reciprocal.shift++; > > } > > > asm ( > > "divl %4" > > : "=a" (reciprocal.mul_frac), "=d" (dividend) > > : "0" (0), "1" (dividend), "r" (scale.mul_frac) ); > > > return reciprocal; > > } > > > > int64_t get_s_time_fixed(const struct cpu_time_stamp *t, const struct > time_scale *ts, uint64_t at_tsc) > > { > > uint64_t tsc, delta; > > > if ( at_tsc ) > > tsc = at_tsc; > > else > > tsc = rdtsc(); > > delta = tsc - t->local_tsc; > > return t->local_stime + scale_delta(delta, ts); > > } > > > int main() { > > > struct cpu_time_stamp ct; > > struct time_scale ts; > > struct time_scale back; > > > uint64_t start = rdtsc(); > > > set_time_scale(&ts, 3000000000); > > back = scale_reciprocal(ts); > > > ct.local_tsc = start; > > ct.local_stime = scale_delta(start, &ts); > > > while (1) { > > uint64_t new_tsc = rdtsc(); > > ct.local_stime = get_s_time_fixed(&ct, &ts, new_tsc); > > ct.local_tsc = new_tsc; > > > uint64_t tmp_tsc = rdtsc(); > > printf("%lu %lu\n", tmp_tsc, scale_delta(get_s_time_fixed(&ct, &ts, > tmp_tsc), &back)); > > } > > > return 0; > > } > > ``` > > > It's enough to run the loop for 10-20 seconds to see the problem. > scale_delta rounds the least significant bits during calculation, which > causes the error to accumulate, at different rates on different cores, > depending on the least significant bits at the time of calibration. > > > Now let's talk about why dwm reacts this way. When a snapshot is reversed, > last_guest_time in hvm_get_guest_time_fixed is set to 0, which doesn't > prevent time from flowing backwards. This means that cache_tsc_offset in > hvm_get_guest_tsc_fixed may be configured correctly on one physical core, > but due to shedding on a physical core with a lagging tsc, the guest may > occasionally see a tsc value lower than it saw before the snapshot was > reversed. If this happens to the dwm code, it terminates with an error. > > > A quick solution to this problem might be to save the last_seen_tsc > parameter in a snapshot for each core, followed by validation. > > > The correct solution is to remove the rounding of the least significant > bits from the sequence. > > On Wed, Jan 7, 2026 at 11:02 AM Jan Beulich <jbeulich@suse.com> wrote: > >> On 06.01.2026 21:10, Антон Марков wrote: >> > Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt >> > (xen/arch/x86/hvm/hvm.c ), it was easy to catch because >> > process_pending_softirqs is frequently called there, which in turn >> > processes softirqs from the timer (where the timestamp is updated). >> > After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped >> > reproducing under no load. However, when the number of vCPUs is 4 times >> > greater than the number of CPUs (under heavy load), the problem rarely >> > reoccurs (mostly during snapshot restores during >> > process_pending_softirqs calls), and this is no longer a simple case. >> If >> > get_s_time_fixed can indeed be interrupted during execution after >> > rdtsc_ordered, then the current fix is insufficient. It's necessary to >> > atomically copy "t->stamp" to the stack using local_irq_disable and >> > local_irq_enable (as in local_time_calibration), and then work with the >> > copy, confident in its lifetime and immutability. But until >> > get_s_time_fixed is proven to be interruptible, this is premature, so >> > your fix is sufficient. I think I need more information and testing to >> > say more. >> >> While the cpu_calibration per-CPU variable is updated from IRQ context, >> the cpu_time one isn't. Hence t->stamp's contents cannot change behind >> the back of get_s_time_fixed(). I wonder whether ... >> >> > Regarding the other scale_delta calls, if they include values >> > calculated from externally saved tsc values that could have become >> > stale during the process_pending_softirqs call, this definitely needs >> to >> > be fixed. >> >> ... another similar issue (possibly one not included in the set of >> remarks I have in the patch, as none of those look related to what you >> describe) might be causing the remaining, more rare problems you say you >> see. That set of remarks is actually a result of me going over all other >> scale_delta() calls, but of course I may have got the analysis wrong. >> >> As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy >> load) - while I don't think we have a support statement for such upstream >> (when probably we should), iirc for our (SUSE's) products we would >> consider that unsupported. Just fyi. >> >> Also, btw, please don't top-post. >> >> Jan >> > [-- Attachment #2: Type: text/html, Size: 28969 bytes --] ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-12 10:31 ` Anton Markov @ 2026-01-12 11:45 ` Jan Beulich 2026-01-12 12:49 ` Anton Markov 0 siblings, 1 reply; 32+ messages in thread From: Jan Beulich @ 2026-01-12 11:45 UTC (permalink / raw) To: Anton Markov; +Cc: andrew.cooper3, roger.pau, xen-devel On 12.01.2026 11:31, Anton Markov wrote: > Bit rounding isn't the main issue; the difference in ipi delivery to the > cores accumulates due to the ordering. Replacing get_s_time_fixed with > scale_delta in time_calibration_rendezvous_tail should be sufficient. This > configuration won't accumulate errors, but bit rounding can still cause a > significant difference from calibration to calibration, but it's not as > significant. That invocation of get_s_time_fixed() reduces to scale_delta() (without further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW it's not quite clear to me what change you are suggesting (that would actually make a functional difference). Btw, your prior response was too hard to properly read, due to excess blank lines with at the same time squashed leading blanks. Together with your apparent inability to avoid top-posting, I think you really want to adjust your mail program's configuration. Jan > On Fri, Jan 9, 2026 at 7:11 PM Anton Markov <akmarkov45@gmail.com> wrote: > >> You're right. These aren't interrupts in get_s_time_fixed, but rather a >> cumulative error in the sequence due to integer rounding. I added logging >> of the current local_stime to local_time_calibration and noticed that the >> timestamp between cores is gradually increasing. If the server has been >> running for weeks, this could be a very large value. >> >> >> ``` >> >> @@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void) >> >> if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ) >> >> { >> >> /* Atomically read cpu_calibration struct and write cpu_time struct. */ >> >> + printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n", >> smp_processor_id(), t->stamp.local_stime, c->local_stime, >> >> + t->last_seen_ns, t->last_seen_tsc); >> >> local_irq_disable(); >> >> t->stamp = *c; >> >> local_irq_enable(); >> >> ``` >> >> >> 2 hours of work: >> >> >> ``` >> >> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597 >> (8565145862216, 0) >> >> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609 >> (8565145863957, 0) >> >> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491 >> (8565145864800, 0) >> >> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609 >> (8565145865372, 0) >> >> >> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag >> >> ``` >> >> >> 6 hours of work: >> >> ``` >> >> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993 >> (22915730870665, 0) >> >> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889 >> (22915730870693, 0) >> >> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841 >> (22915730872231, 0) >> >> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696 >> (22915730872096, 0) >> >> >> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag >> >> ``` >> >> >> Clarification, according to my xen settings: >> >> ``` >> >> ucode=scan dom0_mem=53923M,max:53923M dom0_max_vcpus=4-96 dom0_vcpus_pin=0 >> force-ept=1 ept=no-ad,no-pml hap_1gb=0 hap_2mb=0 altp2m=1 >> hpet=legacy-replacement smt=1 spec-ctrl=no gnttab_max_frames=512 >> cpufreq=xen:performance max_cstate=1 sched=credit sched-gran=cpu apicv=0 >> sched_credit_tslice_ms=5 sched_ratelimit_us=500 >> >> ``` >> >> >> Processors I tested on: >> >> >> ``` >> >> Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz >> >> >> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx >> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl >> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 cx16 >> sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm cpuid_fault >> fsgsbase erms xsaveopt arch_capabilities >> >> ``` >> >> ``` >> >> Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz x2 (NUMA) >> >> >> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx >> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl >> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 fma cx16 >> sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm >> 3dnowprefetch cpuid_fault ibpb fsgsbase bmi1 avx2 bmi2 erms rtm avx512f >> avx512dq rdseed adx avx512ifma clflushopt clwb avx512cd sha_ni avx512bw >> avx512vl xsaveopt xsavec xgetbv1 avx512vbmi avx512_vbmi2 gfni vaes >> vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear >> arch_capabilities >> >> ``` >> >> >> Next I moved the code to r3 to speed up playback: >> >> >> ``` >> >> #include <stdint.h> >> >> #include <stdio.h> >> >> >> static __inline__ unsigned long long rdtsc(void) >> >> { >> >> unsigned hi, lo; >> >> __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi)); >> >> return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 ); >> >> } >> >> >> #define MILLISECS(_ms) ((int64_t)((_ms) * 1000000ULL)) >> >> >> struct cpu_time_stamp { >> >> uint64_t local_tsc; >> >> int64_t local_stime; >> >> int64_t master_stime; >> >> }; >> >> >> struct time_scale { >> >> int shift; >> >> uint32_t mul_frac; >> >> }; >> >> >> >> static inline uint32_t div_frac(uint32_t dividend, uint32_t divisor) >> >> { >> >> uint32_t quotient, remainder; >> >> asm ( >> >> "divl %4" >> >> : "=a" (quotient), "=d" (remainder) >> >> : "0" (0), "1" (dividend), "r" (divisor) ); >> >> return quotient; >> >> } >> >> >> >> void set_time_scale(struct time_scale *ts, uint64_t ticks_per_sec) >> >> { >> >> uint64_t tps64 = ticks_per_sec; >> >> uint32_t tps32; >> >> int shift = 0; >> >> >> while ( tps64 > (MILLISECS(1000)*2) ) >> >> { >> >> tps64 >>= 1; >> >> shift--; >> >> } >> >> >> tps32 = (uint32_t)tps64; >> >> while ( tps32 <= (uint32_t)MILLISECS(1000) ) >> >> { >> >> tps32 <<= 1; >> >> shift++; >> >> } >> >> >> ts->mul_frac = div_frac(MILLISECS(1000), tps32); >> >> ts->shift = shift; >> >> } >> >> >> >> uint64_t scale_delta(uint64_t delta, const struct time_scale *scale) >> >> { >> >> uint64_t product; >> >> >> if ( scale->shift < 0 ) >> >> delta >>= -scale->shift; >> >> else >> >> delta <<= scale->shift; >> >> >> asm ( >> >> "mulq %2 ; shrd $32,%1,%0" >> >> : "=a" (product), "=d" (delta) >> >> : "rm" (delta), "0" ((uint64_t)scale->mul_frac) ); >> >> >> return product; >> >> } >> >> >> #define _TS_MUL_FRAC_IDENTITY 0x80000000UL >> >> >> static inline struct time_scale scale_reciprocal(struct time_scale scale) >> >> { >> >> struct time_scale reciprocal; >> >> uint32_t dividend; >> >> >> dividend = _TS_MUL_FRAC_IDENTITY; >> >> reciprocal.shift = 1 - scale.shift; >> >> while ( dividend >= scale.mul_frac ) >> >> { >> >> dividend >>= 1; >> >> reciprocal.shift++; >> >> } >> >> >> asm ( >> >> "divl %4" >> >> : "=a" (reciprocal.mul_frac), "=d" (dividend) >> >> : "0" (0), "1" (dividend), "r" (scale.mul_frac) ); >> >> >> return reciprocal; >> >> } >> >> >> >> int64_t get_s_time_fixed(const struct cpu_time_stamp *t, const struct >> time_scale *ts, uint64_t at_tsc) >> >> { >> >> uint64_t tsc, delta; >> >> >> if ( at_tsc ) >> >> tsc = at_tsc; >> >> else >> >> tsc = rdtsc(); >> >> delta = tsc - t->local_tsc; >> >> return t->local_stime + scale_delta(delta, ts); >> >> } >> >> >> int main() { >> >> >> struct cpu_time_stamp ct; >> >> struct time_scale ts; >> >> struct time_scale back; >> >> >> uint64_t start = rdtsc(); >> >> >> set_time_scale(&ts, 3000000000); >> >> back = scale_reciprocal(ts); >> >> >> ct.local_tsc = start; >> >> ct.local_stime = scale_delta(start, &ts); >> >> >> while (1) { >> >> uint64_t new_tsc = rdtsc(); >> >> ct.local_stime = get_s_time_fixed(&ct, &ts, new_tsc); >> >> ct.local_tsc = new_tsc; >> >> >> uint64_t tmp_tsc = rdtsc(); >> >> printf("%lu %lu\n", tmp_tsc, scale_delta(get_s_time_fixed(&ct, &ts, >> tmp_tsc), &back)); >> >> } >> >> >> return 0; >> >> } >> >> ``` >> >> >> It's enough to run the loop for 10-20 seconds to see the problem. >> scale_delta rounds the least significant bits during calculation, which >> causes the error to accumulate, at different rates on different cores, >> depending on the least significant bits at the time of calibration. >> >> >> Now let's talk about why dwm reacts this way. When a snapshot is reversed, >> last_guest_time in hvm_get_guest_time_fixed is set to 0, which doesn't >> prevent time from flowing backwards. This means that cache_tsc_offset in >> hvm_get_guest_tsc_fixed may be configured correctly on one physical core, >> but due to shedding on a physical core with a lagging tsc, the guest may >> occasionally see a tsc value lower than it saw before the snapshot was >> reversed. If this happens to the dwm code, it terminates with an error. >> >> >> A quick solution to this problem might be to save the last_seen_tsc >> parameter in a snapshot for each core, followed by validation. >> >> >> The correct solution is to remove the rounding of the least significant >> bits from the sequence. >> >> On Wed, Jan 7, 2026 at 11:02 AM Jan Beulich <jbeulich@suse.com> wrote: >> >>> On 06.01.2026 21:10, Антон Марков wrote: >>>> Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt >>>> (xen/arch/x86/hvm/hvm.c ), it was easy to catch because >>>> process_pending_softirqs is frequently called there, which in turn >>>> processes softirqs from the timer (where the timestamp is updated). >>>> After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped >>>> reproducing under no load. However, when the number of vCPUs is 4 times >>>> greater than the number of CPUs (under heavy load), the problem rarely >>>> reoccurs (mostly during snapshot restores during >>>> process_pending_softirqs calls), and this is no longer a simple case. >>> If >>>> get_s_time_fixed can indeed be interrupted during execution after >>>> rdtsc_ordered, then the current fix is insufficient. It's necessary to >>>> atomically copy "t->stamp" to the stack using local_irq_disable and >>>> local_irq_enable (as in local_time_calibration), and then work with the >>>> copy, confident in its lifetime and immutability. But until >>>> get_s_time_fixed is proven to be interruptible, this is premature, so >>>> your fix is sufficient. I think I need more information and testing to >>>> say more. >>> >>> While the cpu_calibration per-CPU variable is updated from IRQ context, >>> the cpu_time one isn't. Hence t->stamp's contents cannot change behind >>> the back of get_s_time_fixed(). I wonder whether ... >>> >>>> Regarding the other scale_delta calls, if they include values >>>> calculated from externally saved tsc values that could have become >>>> stale during the process_pending_softirqs call, this definitely needs >>> to >>>> be fixed. >>> >>> ... another similar issue (possibly one not included in the set of >>> remarks I have in the patch, as none of those look related to what you >>> describe) might be causing the remaining, more rare problems you say you >>> see. That set of remarks is actually a result of me going over all other >>> scale_delta() calls, but of course I may have got the analysis wrong. >>> >>> As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy >>> load) - while I don't think we have a support statement for such upstream >>> (when probably we should), iirc for our (SUSE's) products we would >>> consider that unsupported. Just fyi. >>> >>> Also, btw, please don't top-post. >>> >>> Jan >>> >> > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-12 11:45 ` Jan Beulich @ 2026-01-12 12:49 ` Anton Markov 2026-01-12 14:13 ` Jan Beulich 0 siblings, 1 reply; 32+ messages in thread From: Anton Markov @ 2026-01-12 12:49 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, roger.pau, xen-devel [-- Attachment #1: Type: text/plain, Size: 15599 bytes --] > > Btw, your prior response was too hard to properly read, due to excess blank > lines with at the same time squashed leading blanks. Together with your > apparent inability to avoid top-posting, I think you really want to adjust > your mail program's configuration. I suggest we skip the discussion of formatting and the number of spaces in messages and instead focus on the topic of the thread. I have a very difficult time troubleshooting difficult-to-reproduce bugs, and the fact that their descriptions are difficult to read due to the number of spaces is probably the least of the difficulties. That invocation of get_s_time_fixed() reduces to scale_delta() (without > further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW > it's not quite clear to me what change you are suggesting (that would > actually make a functional difference). Replacing get_s_time_fixed with scale_delta will remove the calculation dependency on the previous local_stime value, which accumulates lag between cores. This is because: rdtsc_ordered is not called synchronously on the cores, but by the difference offset by the ipi speed. Therefore, we get: core0: current_rdtsc; core1: current_rdtsc + ipi speed; coreN: current_rdtsc + ipi speed * N; Since ipi values are sent alternately in a loop to core0, in the version with get_s_time_fixed, we get the following local_stime calculation format. coreN: local_stime = local_stime + scale_delta((current_rdtsc + (ipi_speed * N)) – local_rdtsc); This means the time on each core will differ by ipi_speed * N. And since we're using the values of the previous local_stime, the difference will accumulate because the previous local_stime was also offset. In the version with scale_delta, we get: coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N)); This means there will still be a difference, but it won't accumulate, and the offsets will remain within normal limits. If it's still unclear: If your local_stime in get_s_time_fixed is offset relative to other cores, then the fact that rdtsc_ordered and local_tsc are not offset doesn't change anything, since you're using the delta relative to local_stime. core0_local_stime + (rdtsc_ordered() - local_tsc) != core1_local_stime + (rdtsc_ordered() - local_tsc); // Even if rdtsc_ordered() and local_tsc are equal across cores. On 96-core configurations, up to a millisecond of latency can accumulate in local_stime over a week of operation, and this is a significant difference. This is due to the fact that I use cpufreq=xen:performance max_cstate=1 , meaning that in my configuration, local_stime is never overwritten by master_stime. Thanks. On Mon, Jan 12, 2026 at 2:45 PM Jan Beulich <jbeulich@suse.com> wrote: > On 12.01.2026 11:31, Anton Markov wrote: > > Bit rounding isn't the main issue; the difference in ipi delivery to the > > cores accumulates due to the ordering. Replacing get_s_time_fixed with > > scale_delta in time_calibration_rendezvous_tail should be sufficient. > This > > configuration won't accumulate errors, but bit rounding can still cause a > > significant difference from calibration to calibration, but it's not as > > significant. > > That invocation of get_s_time_fixed() reduces to scale_delta() (without > further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW > it's not quite clear to me what change you are suggesting (that would > actually make a functional difference). > > Btw, your prior response was too hard to properly read, due to excess blank > lines with at the same time squashed leading blanks. Together with your > apparent inability to avoid top-posting, I think you really want to adjust > your mail program's configuration. > > Jan > > > On Fri, Jan 9, 2026 at 7:11 PM Anton Markov <akmarkov45@gmail.com> > wrote: > > > >> You're right. These aren't interrupts in get_s_time_fixed, but rather a > >> cumulative error in the sequence due to integer rounding. I added > logging > >> of the current local_stime to local_time_calibration and noticed that > the > >> timestamp between cores is gradually increasing. If the server has been > >> running for weeks, this could be a very large value. > >> > >> > >> ``` > >> > >> @@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void) > >> > >> if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ) > >> > >> { > >> > >> /* Atomically read cpu_calibration struct and write cpu_time struct. */ > >> > >> + printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n", > >> smp_processor_id(), t->stamp.local_stime, c->local_stime, > >> > >> + t->last_seen_ns, t->last_seen_tsc); > >> > >> local_irq_disable(); > >> > >> t->stamp = *c; > >> > >> local_irq_enable(); > >> > >> ``` > >> > >> > >> 2 hours of work: > >> > >> > >> ``` > >> > >> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597 > >> (8565145862216, 0) > >> > >> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609 > >> (8565145863957, 0) > >> > >> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491 > >> (8565145864800, 0) > >> > >> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609 > >> (8565145865372, 0) > >> > >> > >> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag > >> > >> ``` > >> > >> > >> 6 hours of work: > >> > >> ``` > >> > >> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993 > >> (22915730870665, 0) > >> > >> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889 > >> (22915730870693, 0) > >> > >> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841 > >> (22915730872231, 0) > >> > >> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696 > >> (22915730872096, 0) > >> > >> > >> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag > >> > >> ``` > >> > >> > >> Clarification, according to my xen settings: > >> > >> ``` > >> > >> ucode=scan dom0_mem=53923M,max:53923M dom0_max_vcpus=4-96 > dom0_vcpus_pin=0 > >> force-ept=1 ept=no-ad,no-pml hap_1gb=0 hap_2mb=0 altp2m=1 > >> hpet=legacy-replacement smt=1 spec-ctrl=no gnttab_max_frames=512 > >> cpufreq=xen:performance max_cstate=1 sched=credit sched-gran=cpu apicv=0 > >> sched_credit_tslice_ms=5 sched_ratelimit_us=500 > >> > >> ``` > >> > >> > >> Processors I tested on: > >> > >> > >> ``` > >> > >> Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz > >> > >> > >> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx > >> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl > >> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 cx16 > >> sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm cpuid_fault > >> fsgsbase erms xsaveopt arch_capabilities > >> > >> ``` > >> > >> ``` > >> > >> Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz x2 (NUMA) > >> > >> > >> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx > >> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl > >> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 fma > cx16 > >> sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm > abm > >> 3dnowprefetch cpuid_fault ibpb fsgsbase bmi1 avx2 bmi2 erms rtm avx512f > >> avx512dq rdseed adx avx512ifma clflushopt clwb avx512cd sha_ni avx512bw > >> avx512vl xsaveopt xsavec xgetbv1 avx512vbmi avx512_vbmi2 gfni vaes > >> vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm > md_clear > >> arch_capabilities > >> > >> ``` > >> > >> > >> Next I moved the code to r3 to speed up playback: > >> > >> > >> ``` > >> > >> #include <stdint.h> > >> > >> #include <stdio.h> > >> > >> > >> static __inline__ unsigned long long rdtsc(void) > >> > >> { > >> > >> unsigned hi, lo; > >> > >> __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi)); > >> > >> return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 ); > >> > >> } > >> > >> > >> #define MILLISECS(_ms) ((int64_t)((_ms) * 1000000ULL)) > >> > >> > >> struct cpu_time_stamp { > >> > >> uint64_t local_tsc; > >> > >> int64_t local_stime; > >> > >> int64_t master_stime; > >> > >> }; > >> > >> > >> struct time_scale { > >> > >> int shift; > >> > >> uint32_t mul_frac; > >> > >> }; > >> > >> > >> > >> static inline uint32_t div_frac(uint32_t dividend, uint32_t divisor) > >> > >> { > >> > >> uint32_t quotient, remainder; > >> > >> asm ( > >> > >> "divl %4" > >> > >> : "=a" (quotient), "=d" (remainder) > >> > >> : "0" (0), "1" (dividend), "r" (divisor) ); > >> > >> return quotient; > >> > >> } > >> > >> > >> > >> void set_time_scale(struct time_scale *ts, uint64_t ticks_per_sec) > >> > >> { > >> > >> uint64_t tps64 = ticks_per_sec; > >> > >> uint32_t tps32; > >> > >> int shift = 0; > >> > >> > >> while ( tps64 > (MILLISECS(1000)*2) ) > >> > >> { > >> > >> tps64 >>= 1; > >> > >> shift--; > >> > >> } > >> > >> > >> tps32 = (uint32_t)tps64; > >> > >> while ( tps32 <= (uint32_t)MILLISECS(1000) ) > >> > >> { > >> > >> tps32 <<= 1; > >> > >> shift++; > >> > >> } > >> > >> > >> ts->mul_frac = div_frac(MILLISECS(1000), tps32); > >> > >> ts->shift = shift; > >> > >> } > >> > >> > >> > >> uint64_t scale_delta(uint64_t delta, const struct time_scale *scale) > >> > >> { > >> > >> uint64_t product; > >> > >> > >> if ( scale->shift < 0 ) > >> > >> delta >>= -scale->shift; > >> > >> else > >> > >> delta <<= scale->shift; > >> > >> > >> asm ( > >> > >> "mulq %2 ; shrd $32,%1,%0" > >> > >> : "=a" (product), "=d" (delta) > >> > >> : "rm" (delta), "0" ((uint64_t)scale->mul_frac) ); > >> > >> > >> return product; > >> > >> } > >> > >> > >> #define _TS_MUL_FRAC_IDENTITY 0x80000000UL > >> > >> > >> static inline struct time_scale scale_reciprocal(struct time_scale > scale) > >> > >> { > >> > >> struct time_scale reciprocal; > >> > >> uint32_t dividend; > >> > >> > >> dividend = _TS_MUL_FRAC_IDENTITY; > >> > >> reciprocal.shift = 1 - scale.shift; > >> > >> while ( dividend >= scale.mul_frac ) > >> > >> { > >> > >> dividend >>= 1; > >> > >> reciprocal.shift++; > >> > >> } > >> > >> > >> asm ( > >> > >> "divl %4" > >> > >> : "=a" (reciprocal.mul_frac), "=d" (dividend) > >> > >> : "0" (0), "1" (dividend), "r" (scale.mul_frac) ); > >> > >> > >> return reciprocal; > >> > >> } > >> > >> > >> > >> int64_t get_s_time_fixed(const struct cpu_time_stamp *t, const struct > >> time_scale *ts, uint64_t at_tsc) > >> > >> { > >> > >> uint64_t tsc, delta; > >> > >> > >> if ( at_tsc ) > >> > >> tsc = at_tsc; > >> > >> else > >> > >> tsc = rdtsc(); > >> > >> delta = tsc - t->local_tsc; > >> > >> return t->local_stime + scale_delta(delta, ts); > >> > >> } > >> > >> > >> int main() { > >> > >> > >> struct cpu_time_stamp ct; > >> > >> struct time_scale ts; > >> > >> struct time_scale back; > >> > >> > >> uint64_t start = rdtsc(); > >> > >> > >> set_time_scale(&ts, 3000000000); > >> > >> back = scale_reciprocal(ts); > >> > >> > >> ct.local_tsc = start; > >> > >> ct.local_stime = scale_delta(start, &ts); > >> > >> > >> while (1) { > >> > >> uint64_t new_tsc = rdtsc(); > >> > >> ct.local_stime = get_s_time_fixed(&ct, &ts, new_tsc); > >> > >> ct.local_tsc = new_tsc; > >> > >> > >> uint64_t tmp_tsc = rdtsc(); > >> > >> printf("%lu %lu\n", tmp_tsc, scale_delta(get_s_time_fixed(&ct, &ts, > >> tmp_tsc), &back)); > >> > >> } > >> > >> > >> return 0; > >> > >> } > >> > >> ``` > >> > >> > >> It's enough to run the loop for 10-20 seconds to see the problem. > >> scale_delta rounds the least significant bits during calculation, which > >> causes the error to accumulate, at different rates on different cores, > >> depending on the least significant bits at the time of calibration. > >> > >> > >> Now let's talk about why dwm reacts this way. When a snapshot is > reversed, > >> last_guest_time in hvm_get_guest_time_fixed is set to 0, which doesn't > >> prevent time from flowing backwards. This means that cache_tsc_offset in > >> hvm_get_guest_tsc_fixed may be configured correctly on one physical > core, > >> but due to shedding on a physical core with a lagging tsc, the guest may > >> occasionally see a tsc value lower than it saw before the snapshot was > >> reversed. If this happens to the dwm code, it terminates with an error. > >> > >> > >> A quick solution to this problem might be to save the last_seen_tsc > >> parameter in a snapshot for each core, followed by validation. > >> > >> > >> The correct solution is to remove the rounding of the least significant > >> bits from the sequence. > >> > >> On Wed, Jan 7, 2026 at 11:02 AM Jan Beulich <jbeulich@suse.com> wrote: > >> > >>> On 06.01.2026 21:10, Антон Марков wrote: > >>>> Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt > >>>> (xen/arch/x86/hvm/hvm.c ), it was easy to catch because > >>>> process_pending_softirqs is frequently called there, which in turn > >>>> processes softirqs from the timer (where the timestamp is updated). > >>>> After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped > >>>> reproducing under no load. However, when the number of vCPUs is 4 > times > >>>> greater than the number of CPUs (under heavy load), the problem rarely > >>>> reoccurs (mostly during snapshot restores during > >>>> process_pending_softirqs calls), and this is no longer a simple case. > >>> If > >>>> get_s_time_fixed can indeed be interrupted during execution after > >>>> rdtsc_ordered, then the current fix is insufficient. It's necessary to > >>>> atomically copy "t->stamp" to the stack using local_irq_disable and > >>>> local_irq_enable (as in local_time_calibration), and then work with > the > >>>> copy, confident in its lifetime and immutability. But until > >>>> get_s_time_fixed is proven to be interruptible, this is premature, so > >>>> your fix is sufficient. I think I need more information and testing to > >>>> say more. > >>> > >>> While the cpu_calibration per-CPU variable is updated from IRQ context, > >>> the cpu_time one isn't. Hence t->stamp's contents cannot change behind > >>> the back of get_s_time_fixed(). I wonder whether ... > >>> > >>>> Regarding the other scale_delta calls, if they include values > >>>> calculated from externally saved tsc values that could have become > >>>> stale during the process_pending_softirqs call, this definitely needs > >>> to > >>>> be fixed. > >>> > >>> ... another similar issue (possibly one not included in the set of > >>> remarks I have in the patch, as none of those look related to what you > >>> describe) might be causing the remaining, more rare problems you say > you > >>> see. That set of remarks is actually a result of me going over all > other > >>> scale_delta() calls, but of course I may have got the analysis wrong. > >>> > >>> As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy > >>> load) - while I don't think we have a support statement for such > upstream > >>> (when probably we should), iirc for our (SUSE's) products we would > >>> consider that unsupported. Just fyi. > >>> > >>> Also, btw, please don't top-post. > >>> > >>> Jan > >>> > >> > > > > [-- Attachment #2: Type: text/html, Size: 23118 bytes --] ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-12 12:49 ` Anton Markov @ 2026-01-12 14:13 ` Jan Beulich 2026-01-12 14:51 ` Anton Markov 0 siblings, 1 reply; 32+ messages in thread From: Jan Beulich @ 2026-01-12 14:13 UTC (permalink / raw) To: Anton Markov; +Cc: andrew.cooper3, roger.pau, xen-devel On 12.01.2026 13:49, Anton Markov wrote: >> Btw, your prior response was too hard to properly read, due to excess blank >> lines with at the same time squashed leading blanks. Together with your >> apparent inability to avoid top-posting, I think you really want to adjust >> your mail program's configuration. > > I suggest we skip the discussion of formatting and the number of spaces in > messages and instead focus on the topic of the thread. I have a very > difficult time troubleshooting difficult-to-reproduce bugs, and the fact > that their descriptions are difficult to read due to the number of spaces > is probably the least of the difficulties. Perhaps, yet it still makes dealing with things more difficult. > That invocation of get_s_time_fixed() reduces to scale_delta() (without >> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW >> it's not quite clear to me what change you are suggesting (that would >> actually make a functional difference). > > Replacing get_s_time_fixed with scale_delta will remove the calculation > dependency on the previous local_stime value, which accumulates lag between > cores. This is because: rdtsc_ordered is not called synchronously on the > cores, but by the difference offset by the ipi speed. Therefore, we get: > > core0: current_rdtsc; > core1: current_rdtsc + ipi speed; > coreN: current_rdtsc + ipi speed * N; That's if IPIs are sent sequentially. In the most common case, they aren't, though - we use the all-but-self shorthand. Actually, even if IPIs are sent sequentially, I can't see where you spot this effect: Both callers of time_calibration_rendezvous_tail() signal all secondary CPUs to continue at the same time. Hence they'll all execute time_calibration_rendezvous_tail() in parallel. > Since ipi values are sent alternately in a loop to core0, Are they? I fear I don't know which part of the code you're talking about. > in the version > with get_s_time_fixed, we get the following local_stime calculation format. > > coreN: local_stime = local_stime + scale_delta((current_rdtsc + (ipi_speed > * N)) – local_rdtsc); One of the reasons we (iirc) don't do that is that since the scaling factor is also slightly imprecise, we'd prefer to avoid scaling very big values. IOW by changing as you suggest we'd trade one accumulating error for another. Jan > This means the time on each core will differ by ipi_speed * N. And since > we're using the values of the previous local_stime, the difference will > accumulate because the previous local_stime was also offset. In the version > with scale_delta, we get: > > coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N)); > > This means there will still be a difference, but it won't accumulate, and > the offsets will remain within normal limits. > > If it's still unclear: If your local_stime in get_s_time_fixed is offset > relative to other cores, then the fact that rdtsc_ordered and local_tsc are > not offset doesn't change anything, since you're using the delta relative > to local_stime. > > core0_local_stime + (rdtsc_ordered() - local_tsc) != core1_local_stime + > (rdtsc_ordered() - local_tsc); // Even if rdtsc_ordered() and local_tsc are > equal across cores. > > On 96-core configurations, up to a millisecond of latency can accumulate in > local_stime over a week of operation, and this is a significant > difference. This > is due to the fact that I use cpufreq=xen:performance max_cstate=1 , > meaning that in my configuration, local_stime is never overwritten by > master_stime. > > Thanks. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-12 14:13 ` Jan Beulich @ 2026-01-12 14:51 ` Anton Markov 2026-01-12 16:08 ` Jan Beulich 0 siblings, 1 reply; 32+ messages in thread From: Anton Markov @ 2026-01-12 14:51 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, roger.pau, xen-devel [-- Attachment #1: Type: text/plain, Size: 6163 bytes --] > > Perhaps, yet it still makes dealing with things more difficult. Sorry. I just spent too much time on this bug to stay in my mind. That's if IPIs are sent sequentially. In the most common case, they aren't, > though - we use the all-but-self shorthand. Actually, even if IPIs are sent sequentially, I can't see where you spot > this effect: Both callers of time_calibration_rendezvous_tail() signal all > secondary CPUs to continue at the same time. Hence they'll all execute > time_calibration_rendezvous_tail() in parallel. In parallel, but with a slight delay. Are they? I fear I don't know which part of the code you're talking about. In the function "time_calibration" (xen/arch/x86/time.c) Sorry, I don't take into account that you don't stay in context, being distracted by other threads. One of the reasons we (iirc) don't do that is that since the scaling factor > is also slightly imprecise, we'd prefer to avoid scaling very big values. > IOW by changing as you suggest we'd trade one accumulating error for > another. As I wrote above, there will be an error when using scale_delta, but it won't accumulate between calls to time_calibration_rendezvous_tail. In the current version, the old error (ipi lag + rounding error) persists due to the use of the old local_stime in the get_s_time_fixed function, and it's added to the new error and accumulates with each call. If c->local_stime = get_s_time_fixed(old_tsc ?: new_tsc); replaced with: c->local_stime = scale_delta(old_tsc ?: new_tsc); Then we'll only be dealing with the error due to the current ipi lag and rough rounding, within a single call. If I understand you correctly, you fixed the rough rounding of scale_delta by reducing the values using get_s_time_fixed . But the problem is, that didn't help. The error now accumulates gradually because we're relying on old calculations. Furthermore, while the old rounding error was limited, now the error accumulates infinitely, albeit very slowly. If this is so, then the solution to the problem becomes less obvious. Roughly speaking, my servers start to go crazy after a week of continuous operation, as the time lag between cores reaches 1 millisecond or more. On Mon, Jan 12, 2026 at 5:13 PM Jan Beulich <jbeulich@suse.com> wrote: > On 12.01.2026 13:49, Anton Markov wrote: > >> Btw, your prior response was too hard to properly read, due to excess > blank > >> lines with at the same time squashed leading blanks. Together with your > >> apparent inability to avoid top-posting, I think you really want to > adjust > >> your mail program's configuration. > > > > I suggest we skip the discussion of formatting and the number of spaces > in > > messages and instead focus on the topic of the thread. I have a very > > difficult time troubleshooting difficult-to-reproduce bugs, and the fact > > that their descriptions are difficult to read due to the number of spaces > > is probably the least of the difficulties. > > Perhaps, yet it still makes dealing with things more difficult. > > > That invocation of get_s_time_fixed() reduces to scale_delta() (without > >> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW > >> it's not quite clear to me what change you are suggesting (that would > >> actually make a functional difference). > > > > Replacing get_s_time_fixed with scale_delta will remove the calculation > > dependency on the previous local_stime value, which accumulates lag > between > > cores. This is because: rdtsc_ordered is not called synchronously on the > > cores, but by the difference offset by the ipi speed. Therefore, we get: > > > > core0: current_rdtsc; > > core1: current_rdtsc + ipi speed; > > coreN: current_rdtsc + ipi speed * N; > > That's if IPIs are sent sequentially. In the most common case, they aren't, > though - we use the all-but-self shorthand. > > Actually, even if IPIs are sent sequentially, I can't see where you spot > this effect: Both callers of time_calibration_rendezvous_tail() signal all > secondary CPUs to continue at the same time. Hence they'll all execute > time_calibration_rendezvous_tail() in parallel. > > > Since ipi values are sent alternately in a loop to core0, > > Are they? I fear I don't know which part of the code you're talking about. > > > in the version > > with get_s_time_fixed, we get the following local_stime calculation > format. > > > > coreN: local_stime = local_stime + scale_delta((current_rdtsc + > (ipi_speed > > * N)) – local_rdtsc); > > One of the reasons we (iirc) don't do that is that since the scaling factor > is also slightly imprecise, we'd prefer to avoid scaling very big values. > IOW by changing as you suggest we'd trade one accumulating error for > another. > > Jan > > > This means the time on each core will differ by ipi_speed * N. And since > > we're using the values of the previous local_stime, the difference will > > accumulate because the previous local_stime was also offset. In the > version > > with scale_delta, we get: > > > > coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N)); > > > > This means there will still be a difference, but it won't accumulate, and > > the offsets will remain within normal limits. > > > > If it's still unclear: If your local_stime in get_s_time_fixed is offset > > relative to other cores, then the fact that rdtsc_ordered and local_tsc > are > > not offset doesn't change anything, since you're using the delta relative > > to local_stime. > > > > core0_local_stime + (rdtsc_ordered() - local_tsc) != core1_local_stime + > > (rdtsc_ordered() - local_tsc); // Even if rdtsc_ordered() and local_tsc > are > > equal across cores. > > > > On 96-core configurations, up to a millisecond of latency can accumulate > in > > local_stime over a week of operation, and this is a significant > > difference. This > > is due to the fact that I use cpufreq=xen:performance max_cstate=1 , > > meaning that in my configuration, local_stime is never overwritten by > > master_stime. > > > > Thanks. > [-- Attachment #2: Type: text/html, Size: 9253 bytes --] ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-12 14:51 ` Anton Markov @ 2026-01-12 16:08 ` Jan Beulich 2026-01-12 16:41 ` Anton Markov 0 siblings, 1 reply; 32+ messages in thread From: Jan Beulich @ 2026-01-12 16:08 UTC (permalink / raw) To: Anton Markov; +Cc: andrew.cooper3, roger.pau, xen-devel On 12.01.2026 15:51, Anton Markov wrote: > That's if IPIs are sent sequentially. In the most common case, they aren't, >> though - we use the all-but-self shorthand. > > Actually, even if IPIs are sent sequentially, I can't see where you spot >> this effect: Both callers of time_calibration_rendezvous_tail() signal all >> secondary CPUs to continue at the same time. Hence they'll all execute >> time_calibration_rendezvous_tail() in parallel. > > In parallel, but with a slight delay. > > Are they? I fear I don't know which part of the code you're talking about. > > In the function "time_calibration" (xen/arch/x86/time.c) Sorry, I don't > take into account that you don't stay in context, being distracted by other > threads. That calls on_selected_cpus(), but send_IPI_mask() may then still resort to all-but-self. In that case all IPIs are sent in one go. Plus as said, how IPIs are sent doesn't matter for the invocation of time_calibration_rendezvous_tail(). They'll all run at the same time, not one after the other. Since further down you build upon that "IPI lag", I fear we first need to settle on this aspect of your theory. Jan > One of the reasons we (iirc) don't do that is that since the scaling factor >> is also slightly imprecise, we'd prefer to avoid scaling very big values. >> IOW by changing as you suggest we'd trade one accumulating error for >> another. > > As I wrote above, there will be an error when using scale_delta, but it > won't accumulate between calls to time_calibration_rendezvous_tail. In the > current version, the old error (ipi lag + rounding error) persists due to > the use of the old local_stime in the get_s_time_fixed function, and it's > added to the new error and accumulates with each call. > If > > c->local_stime = get_s_time_fixed(old_tsc ?: new_tsc); > > replaced with: > > c->local_stime = scale_delta(old_tsc ?: new_tsc); > > Then we'll only be dealing with the error due to the current ipi lag and > rough rounding, within a single call. > > If I understand you correctly, you fixed the rough rounding of scale_delta > by reducing the values using get_s_time_fixed . But the problem is, that > didn't help. The error now accumulates gradually because we're relying on > old calculations. Furthermore, while the old rounding error was limited, > now the error accumulates infinitely, albeit very slowly. If this is so, > then the solution to the problem becomes less obvious. > > Roughly speaking, my servers start to go crazy after a week of continuous > operation, as the time lag between cores reaches 1 millisecond or more. > > On Mon, Jan 12, 2026 at 5:13 PM Jan Beulich <jbeulich@suse.com> wrote: > >> On 12.01.2026 13:49, Anton Markov wrote: >>>> Btw, your prior response was too hard to properly read, due to excess >> blank >>>> lines with at the same time squashed leading blanks. Together with your >>>> apparent inability to avoid top-posting, I think you really want to >> adjust >>>> your mail program's configuration. >>> >>> I suggest we skip the discussion of formatting and the number of spaces >> in >>> messages and instead focus on the topic of the thread. I have a very >>> difficult time troubleshooting difficult-to-reproduce bugs, and the fact >>> that their descriptions are difficult to read due to the number of spaces >>> is probably the least of the difficulties. >> >> Perhaps, yet it still makes dealing with things more difficult. >> >>> That invocation of get_s_time_fixed() reduces to scale_delta() (without >>>> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW >>>> it's not quite clear to me what change you are suggesting (that would >>>> actually make a functional difference). >>> >>> Replacing get_s_time_fixed with scale_delta will remove the calculation >>> dependency on the previous local_stime value, which accumulates lag >> between >>> cores. This is because: rdtsc_ordered is not called synchronously on the >>> cores, but by the difference offset by the ipi speed. Therefore, we get: >>> >>> core0: current_rdtsc; >>> core1: current_rdtsc + ipi speed; >>> coreN: current_rdtsc + ipi speed * N; >> >> That's if IPIs are sent sequentially. In the most common case, they aren't, >> though - we use the all-but-self shorthand. >> >> Actually, even if IPIs are sent sequentially, I can't see where you spot >> this effect: Both callers of time_calibration_rendezvous_tail() signal all >> secondary CPUs to continue at the same time. Hence they'll all execute >> time_calibration_rendezvous_tail() in parallel. >> >>> Since ipi values are sent alternately in a loop to core0, >> >> Are they? I fear I don't know which part of the code you're talking about. >> >>> in the version >>> with get_s_time_fixed, we get the following local_stime calculation >> format. >>> >>> coreN: local_stime = local_stime + scale_delta((current_rdtsc + >> (ipi_speed >>> * N)) – local_rdtsc); >> >> One of the reasons we (iirc) don't do that is that since the scaling factor >> is also slightly imprecise, we'd prefer to avoid scaling very big values. >> IOW by changing as you suggest we'd trade one accumulating error for >> another. >> >> Jan >> >>> This means the time on each core will differ by ipi_speed * N. And since >>> we're using the values of the previous local_stime, the difference will >>> accumulate because the previous local_stime was also offset. In the >> version >>> with scale_delta, we get: >>> >>> coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N)); >>> >>> This means there will still be a difference, but it won't accumulate, and >>> the offsets will remain within normal limits. >>> >>> If it's still unclear: If your local_stime in get_s_time_fixed is offset >>> relative to other cores, then the fact that rdtsc_ordered and local_tsc >> are >>> not offset doesn't change anything, since you're using the delta relative >>> to local_stime. >>> >>> core0_local_stime + (rdtsc_ordered() - local_tsc) != core1_local_stime + >>> (rdtsc_ordered() - local_tsc); // Even if rdtsc_ordered() and local_tsc >> are >>> equal across cores. >>> >>> On 96-core configurations, up to a millisecond of latency can accumulate >> in >>> local_stime over a week of operation, and this is a significant >>> difference. This >>> is due to the fact that I use cpufreq=xen:performance max_cstate=1 , >>> meaning that in my configuration, local_stime is never overwritten by >>> master_stime. >>> >>> Thanks. >> > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-12 16:08 ` Jan Beulich @ 2026-01-12 16:41 ` Anton Markov 2026-01-13 11:21 ` Jan Beulich 0 siblings, 1 reply; 32+ messages in thread From: Anton Markov @ 2026-01-12 16:41 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, roger.pau, xen-devel [-- Attachment #1: Type: text/plain, Size: 10337 bytes --] > > That calls on_selected_cpus(), but send_IPI_mask() may then still resort to > all-but-self. In that case all IPIs are sent in one go. Plus as said, how IPIs are sent doesn't matter for the invocation of > time_calibration_rendezvous_tail(). They'll all run at the same time, not > one after the other. At the hardware level, no one can guarantee that the processors will simultaneously respond to the signal and execute your code nanosecond after you send the ipi. Especially when we're talking about NUMA configurations. I'm afraid the possible and impossible in the laws of physics is also beyond the scope of this thread. Since further down you build upon that "IPI lag", I fear we first need to > settle on this aspect of your theory. I've already provided the delay logs. It's not hard for me to repeat. The patch: > @@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void) > > if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ) > > { > > /* Atomically read cpu_calibration struct and write cpu_time struct. */ > > + printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n", > smp_processor_id(), t->stamp.local_stime, c->local_stime, > > + t->last_seen_ns, t->last_seen_tsc); > > local_irq_disable(); > > t->stamp = *c; > > local_irq_enable(); > 2 hours of work: > (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597 > (8565145862216, 0) > (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609 > (8565145863957, 0) > (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491 > (8565145864800, 0) > (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609 > (8565145865372, 0) > > 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag > 3 hours of work: > (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993 > (22915730870665, 0) > (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889 > (22915730870693, 0) > (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841 > (22915730872231, 0) > (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696 > (22915730872096, 0) > > 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag > 2-3 day of work: > (XEN) update stime on time calibrate 0, 254477161980127 -> 254478162020920 > (254478162021549, 0) > (XEN) update stime on time calibrate 2, 254477161977638 -> 254478162018429 > (254478162022187, 0) > (XEN) update stime on time calibrate 1, 254477161978192 -> 254478162018972 > (254478162022776, 0) > (XEN) update stime on time calibrate 3, 254477161976832 -> 254478162017636 > (254478162021394, 0) > > 254478162020920 - 254478162017636 = 3284 * 3 (3.00 GHz) = 9852 lag As you can see, the core lag is strictly determined by their sequence number. I won't argue about what percentage of this delay is due to rounding error and what percentage is due to ipi lag. To reproduce this, simply add the patch (excluding t->last_seen_ns and t->last_seen_tsc , which were necessary for my own understanding). Then enable the hypervisor with the settings cpufreq=xen:performance max_cstate=1 . Clocksource is left at the default (i.e., hpet). On Mon, Jan 12, 2026 at 7:08 PM Jan Beulich <jbeulich@suse.com> wrote: > On 12.01.2026 15:51, Anton Markov wrote: > > That's if IPIs are sent sequentially. In the most common case, they > aren't, > >> though - we use the all-but-self shorthand. > > > > Actually, even if IPIs are sent sequentially, I can't see where you spot > >> this effect: Both callers of time_calibration_rendezvous_tail() signal > all > >> secondary CPUs to continue at the same time. Hence they'll all execute > >> time_calibration_rendezvous_tail() in parallel. > > > > In parallel, but with a slight delay. > > > > Are they? I fear I don't know which part of the code you're talking > about. > > > > In the function "time_calibration" (xen/arch/x86/time.c) Sorry, I don't > > take into account that you don't stay in context, being distracted by > other > > threads. > > That calls on_selected_cpus(), but send_IPI_mask() may then still resort to > all-but-self. In that case all IPIs are sent in one go. > > Plus as said, how IPIs are sent doesn't matter for the invocation of > time_calibration_rendezvous_tail(). They'll all run at the same time, not > one after the other. > > Since further down you build upon that "IPI lag", I fear we first need to > settle on this aspect of your theory. > > Jan > > > One of the reasons we (iirc) don't do that is that since the scaling > factor > >> is also slightly imprecise, we'd prefer to avoid scaling very big > values. > >> IOW by changing as you suggest we'd trade one accumulating error for > >> another. > > > > As I wrote above, there will be an error when using scale_delta, but it > > won't accumulate between calls to time_calibration_rendezvous_tail. In > the > > current version, the old error (ipi lag + rounding error) persists due to > > the use of the old local_stime in the get_s_time_fixed function, and it's > > added to the new error and accumulates with each call. > > If > > > > c->local_stime = get_s_time_fixed(old_tsc ?: new_tsc); > > > > replaced with: > > > > c->local_stime = scale_delta(old_tsc ?: new_tsc); > > > > Then we'll only be dealing with the error due to the current ipi lag and > > rough rounding, within a single call. > > > > If I understand you correctly, you fixed the rough rounding of > scale_delta > > by reducing the values using get_s_time_fixed . But the problem is, that > > didn't help. The error now accumulates gradually because we're relying on > > old calculations. Furthermore, while the old rounding error was limited, > > now the error accumulates infinitely, albeit very slowly. If this is so, > > then the solution to the problem becomes less obvious. > > > > Roughly speaking, my servers start to go crazy after a week of continuous > > operation, as the time lag between cores reaches 1 millisecond or more. > > > > On Mon, Jan 12, 2026 at 5:13 PM Jan Beulich <jbeulich@suse.com> wrote: > > > >> On 12.01.2026 13:49, Anton Markov wrote: > >>>> Btw, your prior response was too hard to properly read, due to excess > >> blank > >>>> lines with at the same time squashed leading blanks. Together with > your > >>>> apparent inability to avoid top-posting, I think you really want to > >> adjust > >>>> your mail program's configuration. > >>> > >>> I suggest we skip the discussion of formatting and the number of spaces > >> in > >>> messages and instead focus on the topic of the thread. I have a very > >>> difficult time troubleshooting difficult-to-reproduce bugs, and the > fact > >>> that their descriptions are difficult to read due to the number of > spaces > >>> is probably the least of the difficulties. > >> > >> Perhaps, yet it still makes dealing with things more difficult. > >> > >>> That invocation of get_s_time_fixed() reduces to scale_delta() (without > >>>> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. > IOW > >>>> it's not quite clear to me what change you are suggesting (that would > >>>> actually make a functional difference). > >>> > >>> Replacing get_s_time_fixed with scale_delta will remove the calculation > >>> dependency on the previous local_stime value, which accumulates lag > >> between > >>> cores. This is because: rdtsc_ordered is not called synchronously on > the > >>> cores, but by the difference offset by the ipi speed. Therefore, we > get: > >>> > >>> core0: current_rdtsc; > >>> core1: current_rdtsc + ipi speed; > >>> coreN: current_rdtsc + ipi speed * N; > >> > >> That's if IPIs are sent sequentially. In the most common case, they > aren't, > >> though - we use the all-but-self shorthand. > >> > >> Actually, even if IPIs are sent sequentially, I can't see where you spot > >> this effect: Both callers of time_calibration_rendezvous_tail() signal > all > >> secondary CPUs to continue at the same time. Hence they'll all execute > >> time_calibration_rendezvous_tail() in parallel. > >> > >>> Since ipi values are sent alternately in a loop to core0, > >> > >> Are they? I fear I don't know which part of the code you're talking > about. > >> > >>> in the version > >>> with get_s_time_fixed, we get the following local_stime calculation > >> format. > >>> > >>> coreN: local_stime = local_stime + scale_delta((current_rdtsc + > >> (ipi_speed > >>> * N)) – local_rdtsc); > >> > >> One of the reasons we (iirc) don't do that is that since the scaling > factor > >> is also slightly imprecise, we'd prefer to avoid scaling very big > values. > >> IOW by changing as you suggest we'd trade one accumulating error for > >> another. > >> > >> Jan > >> > >>> This means the time on each core will differ by ipi_speed * N. And > since > >>> we're using the values of the previous local_stime, the difference will > >>> accumulate because the previous local_stime was also offset. In the > >> version > >>> with scale_delta, we get: > >>> > >>> coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N)); > >>> > >>> This means there will still be a difference, but it won't accumulate, > and > >>> the offsets will remain within normal limits. > >>> > >>> If it's still unclear: If your local_stime in get_s_time_fixed is > offset > >>> relative to other cores, then the fact that rdtsc_ordered and local_tsc > >> are > >>> not offset doesn't change anything, since you're using the delta > relative > >>> to local_stime. > >>> > >>> core0_local_stime + (rdtsc_ordered() - local_tsc) != core1_local_stime > + > >>> (rdtsc_ordered() - local_tsc); // Even if rdtsc_ordered() and local_tsc > >> are > >>> equal across cores. > >>> > >>> On 96-core configurations, up to a millisecond of latency can > accumulate > >> in > >>> local_stime over a week of operation, and this is a significant > >>> difference. This > >>> is due to the fact that I use cpufreq=xen:performance max_cstate=1 , > >>> meaning that in my configuration, local_stime is never overwritten by > >>> master_stime. > >>> > >>> Thanks. > >> > > > > [-- Attachment #2: Type: text/html, Size: 15860 bytes --] ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-12 16:41 ` Anton Markov @ 2026-01-13 11:21 ` Jan Beulich 2026-01-13 12:35 ` Anton Markov 0 siblings, 1 reply; 32+ messages in thread From: Jan Beulich @ 2026-01-13 11:21 UTC (permalink / raw) To: Anton Markov; +Cc: andrew.cooper3, roger.pau, xen-devel On 12.01.2026 17:41, Anton Markov wrote: >> >> That calls on_selected_cpus(), but send_IPI_mask() may then still resort to >> all-but-self. In that case all IPIs are sent in one go. > > Plus as said, how IPIs are sent doesn't matter for the invocation of >> time_calibration_rendezvous_tail(). They'll all run at the same time, not >> one after the other. > > At the hardware level, no one can guarantee that the processors will > simultaneously respond to the signal and execute your code nanosecond after > you send the ipi. Especially when we're talking about NUMA configurations. I'm > afraid the possible and impossible in the laws of physics is also beyond > the scope of this thread. You did read my recurring explanation beyond the IPI sending, didn't you? Of course IPI arrival may vary across cores / threads. Yet the term "rendezvous" is used because CPUs having received the IPI are then held in a waiting loop, until _all_ CPUs have made it there. Then CPU0 indicates to all of them simultaneously to move to the next step. There's going to again be some variance (especially on NUMA, where the memory write needs to propagate to all nodes), but at least within a single node that should be pretty low. The main source of variance I would expect there would by hyperthreads competing with one another in a single core. > Since further down you build upon that "IPI lag", I fear we first need to >> settle on this aspect of your theory. > > I've already provided the delay logs. It's not hard for me to repeat. Sure, I don't doubt you make those observations. But we're still trying to converge on a theory on what these may be caused by. > 2 hours of work: > >> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597 >> (8565145862216, 0) >> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609 >> (8565145863957, 0) >> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491 >> (8565145864800, 0) >> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609 >> (8565145865372, 0) >> >> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag The log entries aren't in CPU order, and CPUs 1 and 2 actually have identical values on the rhs. That doesn't quite fit what you have said so far. CPU3's value is also lower than CPU0's. > 3 hours of work: > >> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993 >> (22915730870665, 0) >> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889 >> (22915730870693, 0) >> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841 >> (22915730872231, 0) >> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696 >> (22915730872096, 0) >> >> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag While CPU numbers happen to be in sequence here, the rhs values aren't equally ordered. Also really here it is 22915730869696 - 22915730869993 = -297 * 3 (3.00 GHz) = 891 ahead > 2-3 day of work: > >> (XEN) update stime on time calibrate 0, 254477161980127 -> 254478162020920 >> (254478162021549, 0) >> (XEN) update stime on time calibrate 2, 254477161977638 -> 254478162018429 >> (254478162022187, 0) >> (XEN) update stime on time calibrate 1, 254477161978192 -> 254478162018972 >> (254478162022776, 0) >> (XEN) update stime on time calibrate 3, 254477161976832 -> 254478162017636 >> (254478162021394, 0) >> >> 254478162020920 - 254478162017636 = 3284 * 3 (3.00 GHz) = 9852 lag Similarly here. Yes, the gap increases, yet that's not a lag of CPU3 past CPU0, but exactly the other way around. > As you can see, the core lag is strictly determined by their sequence > number. As per above - no, I don't think I can see that. Or maybe I'm misreading the numbers as well as what you have been saying. Jan ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-13 11:21 ` Jan Beulich @ 2026-01-13 12:35 ` Anton Markov 0 siblings, 0 replies; 32+ messages in thread From: Anton Markov @ 2026-01-13 12:35 UTC (permalink / raw) To: Jan Beulich; +Cc: andrew.cooper3, roger.pau, xen-devel [-- Attachment #1: Type: text/plain, Size: 7598 bytes --] You did read my recurring explanation beyond the IPI sending, didn't you? Of course IPI arrival may vary across cores / threads. Yet the term "rendezvous" is used because CPUs having received the IPI are then held in a waiting loop, until _all_ CPUs have made it there. Then CPU0 indicates to all of them simultaneously to move to the next step. There's going to again be some variance (especially on NUMA, where the memory write needs to propagate to all nodes), but at least within a single node that should be pretty low. The main source of variance I would expect there would by hyperthreads competing with one another in a single core. Yes, I saw it. I'm not trying to dispute your understanding of the situation. The difference may be small, but it adds up. Sure, I don't doubt you make those observations. But we're still trying to > converge on a theory on what these may be caused by. > I can't tell you exactly what the main cause of the delay is. I can only list possible factors: 1. Rounding error, which varies for each core; 2. Delay in IPI delivery speed (even at the hardware level, signal delivery shouldn't happen simultaneously, and some cores may have interrupts disabled); 3. CPU frequency boost, which allows some cores to execute code faster. On modern CPUs, this doesn't affect the TSC frequency, but the problem is that the counter will be read at different times. 4. The initial difference in TSC counter values, which for cores within a single CPU should be no more than 100 ns. In the case of NUMA, no more than 1000 ns; I can't speak about percentages; I wasn't involved in CPU development, but there are many reasons not to allow cores to compete in sequence increment speed. The log entries aren't in CPU order, and CPUs 1 and 2 actually have > identical values on the rhs. That doesn't quite fit what you have said so > far. CPU3's value is also lower than CPU0's. > While CPU numbers happen to be in sequence here, the rhs values aren't > equally > ordered. > Also really here it is > 22915730869696 - 22915730869993 = -297 * 3 (3.00 GHz) = 891 ahead > Similarly here. Yes, the gap increases, yet that's not a lag of CPU3 past > CPU0, but exactly the other way around. > As per above - no, I don't think I can see that. Or maybe I'm misreading > the > numbers as well as what you have been saying. > During the first few hours, the situation can be blurred due to possible race conditions. After two days, it becomes more or less clear: 254478162020920 (core 0) > 254478162018972 (core 1) > 254478162018429 (core 2) > 254478162017636 (core 3) The lower the core number, the more it pulls ahead. It's possible that this is indeed related to which core is most heavily loaded (which one activates CPU boost more often). In my configuration, the first three cores are dedicated to DOM0, and the fourth is reserved for virtual machines. The first core ends up being the most heavily loaded due to interrupt handling, etc. I can also add that after replacing get_s_time_fixed with scale_delta, the difference stops accumulating. At this point, it's clear to me that the problem is the use of previous last_stime values. The nature of this CPU behavior is unlikely to be understood at the software level. Of course, all the processors I tested on have the constant_tsc, nonstop_tsc, and tsc_known_freq flags. On Tue, Jan 13, 2026 at 2:21 PM Jan Beulich <jbeulich@suse.com> wrote: > On 12.01.2026 17:41, Anton Markov wrote: > >> > >> That calls on_selected_cpus(), but send_IPI_mask() may then still > resort to > >> all-but-self. In that case all IPIs are sent in one go. > > > > Plus as said, how IPIs are sent doesn't matter for the invocation of > >> time_calibration_rendezvous_tail(). They'll all run at the same time, > not > >> one after the other. > > > > At the hardware level, no one can guarantee that the processors will > > simultaneously respond to the signal and execute your code nanosecond > after > > you send the ipi. Especially when we're talking about NUMA > configurations. I'm > > afraid the possible and impossible in the laws of physics is also beyond > > the scope of this thread. > > You did read my recurring explanation beyond the IPI sending, didn't you? > Of course IPI arrival may vary across cores / threads. Yet the term > "rendezvous" is used because CPUs having received the IPI are then held > in a waiting loop, until _all_ CPUs have made it there. Then CPU0 > indicates to all of them simultaneously to move to the next step. There's > going to again be some variance (especially on NUMA, where the memory > write needs to propagate to all nodes), but at least within a single node > that should be pretty low. The main source of variance I would expect > there would by hyperthreads competing with one another in a single core. > > > Since further down you build upon that "IPI lag", I fear we first need to > >> settle on this aspect of your theory. > > > > I've already provided the delay logs. It's not hard for me to repeat. > > Sure, I don't doubt you make those observations. But we're still trying to > converge on a theory on what these may be caused by. > > > 2 hours of work: > > > >> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597 > >> (8565145862216, 0) > >> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609 > >> (8565145863957, 0) > >> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491 > >> (8565145864800, 0) > >> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609 > >> (8565145865372, 0) > >> > >> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag > > The log entries aren't in CPU order, and CPUs 1 and 2 actually have > identical values on the rhs. That doesn't quite fit what you have said so > far. CPU3's value is also lower than CPU0's. > > > 3 hours of work: > > > >> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993 > >> (22915730870665, 0) > >> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889 > >> (22915730870693, 0) > >> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841 > >> (22915730872231, 0) > >> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696 > >> (22915730872096, 0) > >> > >> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag > > While CPU numbers happen to be in sequence here, the rhs values aren't > equally > ordered. > > Also really here it is > > 22915730869696 - 22915730869993 = -297 * 3 (3.00 GHz) = 891 ahead > > > 2-3 day of work: > > > >> (XEN) update stime on time calibrate 0, 254477161980127 -> > 254478162020920 > >> (254478162021549, 0) > >> (XEN) update stime on time calibrate 2, 254477161977638 -> > 254478162018429 > >> (254478162022187, 0) > >> (XEN) update stime on time calibrate 1, 254477161978192 -> > 254478162018972 > >> (254478162022776, 0) > >> (XEN) update stime on time calibrate 3, 254477161976832 -> > 254478162017636 > >> (254478162021394, 0) > >> > >> 254478162020920 - 254478162017636 = 3284 * 3 (3.00 GHz) = 9852 lag > > Similarly here. Yes, the gap increases, yet that's not a lag of CPU3 past > CPU0, but exactly the other way around. > > > As you can see, the core lag is strictly determined by their sequence > > number. > > As per above - no, I don't think I can see that. Or maybe I'm misreading > the > numbers as well as what you have been saying. > > Jan > [-- Attachment #2: Type: text/html, Size: 12481 bytes --] ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-06 13:58 ` [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() Jan Beulich 2026-01-06 20:10 ` Антон Марков @ 2026-01-14 17:49 ` Roger Pau Monné 2026-01-15 8:00 ` Jan Beulich 1 sibling, 1 reply; 32+ messages in thread From: Roger Pau Monné @ 2026-01-14 17:49 UTC (permalink / raw) To: Jan Beulich Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Антон Марков On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote: > Callers may pass in TSC values from before the local TSC stamp was last > updated (this would in particular be the case when the TSC was latched, a > time rendezvous occurs, and the latched value is used only afterwards). > scale_delta(), otoh, deals with unsigned values, and hence would treat > negative incoming deltas as huge positive values, yielding entirely bogus > return values. > > Fixes: 88e64cb785c1 ("x86/HVM: use fixed TSC value when saving or restoring domain") > Reported-by: Антон Марков <akmarkov45@gmail.com> > Signed-off-by: Jan Beulich <jbeulich@suse.com> > --- > An alternative might be to have scale_delta() deal with signed deltas, yet > that seemed more risky to me. I won't go that route, the caller can figure out the signedness of the result as needed. > There could likely be more Fixes: tags; the one used is the oldest > applicable one, from what I can tell. > > A similar issue looks to exist in read_xen_timer() and its read_cycle() > helper, if we're scheduled out (and beck in) between reading of the TSC > and calculation of the delta (involving ->tsc_timestamp). Am I > overlooking anything there? If we are scheduled out in the middle, and the ->tsc_timestamp is updated, ->version would also be bumped, and hence the loop will be restarted. I don't think there's an issue there. > stime2tsc() guards against negative deltas by using 0 instead; I'm not > quite sure that's correct either. Hm, we should likely do the same for stime2tsc() that you do for get_s_time_fixed(). Given the current callers I think we might be safe, but it's a risk. > amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a > comment towards the TSC being "sane", but is that correct? Due to > TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then > wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before > calling tsc_ticks2ns()? amd_check_erratum_1474() runs after early_time_init(), which would have cleared any TSC_ADJUST offset AFAICT. There's a note in the initcall to that regard: /* * Must be executed after early_time_init() for tsc_ticks2ns() to have been * calibrated. That prevents us doing the check in init_amd(). */ presmp_initcall(amd_check_erratum_1474); > A similar issue looks to exist in tsc_get_info(), again when rdtsc() > possibly returns a huge value due to TSC_ADJUST. Once again I wonder > whether we shouldn't subtract boot_tsc_stamp. I would expect tsc_get_info() to also get called exclusively after early_time_init()? > --- a/xen/arch/x86/time.c > +++ b/xen/arch/x86/time.c > @@ -1649,8 +1649,13 @@ s_time_t get_s_time_fixed(u64 at_tsc) > tsc = at_tsc; > else > tsc = rdtsc_ordered(); > - delta = tsc - t->stamp.local_tsc; > - return t->stamp.local_stime + scale_delta(delta, &t->tsc_scale); > + > + if ( tsc >= t->stamp.local_tsc ) Should we hint the compiler this is the likely path? > + delta = scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale); > + else > + delta = -scale_delta(t->stamp.local_tsc - tsc, &t->tsc_scale); > + > + return t->stamp.local_stime + delta; LGTM: Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> However I see Anton still has concerns that this patch doesn't fully fix the issue he reported, and I'm afraid I don't really understand what's going on. I will have to take a more detailed look at the thread, or possibly attempt to reproduce myself. Thanks, Roger. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-14 17:49 ` Roger Pau Monné @ 2026-01-15 8:00 ` Jan Beulich 2026-01-15 8:24 ` Roger Pau Monné 2026-01-20 8:50 ` Jan Beulich 0 siblings, 2 replies; 32+ messages in thread From: Jan Beulich @ 2026-01-15 8:00 UTC (permalink / raw) To: Roger Pau Monné Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Антон Марков On 14.01.2026 18:49, Roger Pau Monné wrote: > On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote: >> A similar issue looks to exist in read_xen_timer() and its read_cycle() >> helper, if we're scheduled out (and beck in) between reading of the TSC >> and calculation of the delta (involving ->tsc_timestamp). Am I >> overlooking anything there? > > If we are scheduled out in the middle, and the ->tsc_timestamp is > updated, ->version would also be bumped, and hence the loop will be > restarted. I don't think there's an issue there. Oh, indeed - I was too focused on the read_cycle() alone. That may go wrong, but as you say the result then simply won't be used. >> stime2tsc() guards against negative deltas by using 0 instead; I'm not >> quite sure that's correct either. > > Hm, we should likely do the same for stime2tsc() that you do for > get_s_time_fixed(). Given the current callers I think we might be > safe, but it's a risk. Will do then. >> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a >> comment towards the TSC being "sane", but is that correct? Due to >> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then >> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before >> calling tsc_ticks2ns()? > > amd_check_erratum_1474() runs after early_time_init(), which would > have cleared any TSC_ADJUST offset AFAICT. There's a note in the > initcall to that regard: > > /* > * Must be executed after early_time_init() for tsc_ticks2ns() to have been > * calibrated. That prevents us doing the check in init_amd(). > */ > presmp_initcall(amd_check_erratum_1474); Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also have played other games with MSR_TSC. >> A similar issue looks to exist in tsc_get_info(), again when rdtsc() >> possibly returns a huge value due to TSC_ADJUST. Once again I wonder >> whether we shouldn't subtract boot_tsc_stamp. > > I would expect tsc_get_info() to also get called exclusively after > early_time_init()? Same here then (obviously). >> --- a/xen/arch/x86/time.c >> +++ b/xen/arch/x86/time.c >> @@ -1649,8 +1649,13 @@ s_time_t get_s_time_fixed(u64 at_tsc) >> tsc = at_tsc; >> else >> tsc = rdtsc_ordered(); >> - delta = tsc - t->stamp.local_tsc; >> - return t->stamp.local_stime + scale_delta(delta, &t->tsc_scale); >> + >> + if ( tsc >= t->stamp.local_tsc ) > > Should we hint the compiler this is the likely path? I was wondering too, but was erring on the side of "no" primarily because of Andrew's general preference towards no likely(). I'd be happy to add it here. >> + delta = scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale); >> + else >> + delta = -scale_delta(t->stamp.local_tsc - tsc, &t->tsc_scale); >> + >> + return t->stamp.local_stime + delta; > > LGTM: > > Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Thanks. > However I see Anton still has concerns that this patch doesn't fully > fix the issue he reported, and I'm afraid I don't really understand > what's going on. I will have to take a more detailed look at the > thread, or possibly attempt to reproduce myself. Largely the same here. Jan ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-15 8:00 ` Jan Beulich @ 2026-01-15 8:24 ` Roger Pau Monné 2026-01-15 8:49 ` Anton Markov 2026-01-15 10:38 ` Jan Beulich 2026-01-20 8:50 ` Jan Beulich 1 sibling, 2 replies; 32+ messages in thread From: Roger Pau Monné @ 2026-01-15 8:24 UTC (permalink / raw) To: Jan Beulich Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Антон Марков On Thu, Jan 15, 2026 at 09:00:07AM +0100, Jan Beulich wrote: > On 14.01.2026 18:49, Roger Pau Monné wrote: > > On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote: > >> A similar issue looks to exist in read_xen_timer() and its read_cycle() > >> helper, if we're scheduled out (and beck in) between reading of the TSC > >> and calculation of the delta (involving ->tsc_timestamp). Am I > >> overlooking anything there? > > > > If we are scheduled out in the middle, and the ->tsc_timestamp is > > updated, ->version would also be bumped, and hence the loop will be > > restarted. I don't think there's an issue there. > > Oh, indeed - I was too focused on the read_cycle() alone. That may go > wrong, but as you say the result then simply won't be used. > > >> stime2tsc() guards against negative deltas by using 0 instead; I'm not > >> quite sure that's correct either. > > > > Hm, we should likely do the same for stime2tsc() that you do for > > get_s_time_fixed(). Given the current callers I think we might be > > safe, but it's a risk. > > Will do then. > > >> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a > >> comment towards the TSC being "sane", but is that correct? Due to > >> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then > >> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before > >> calling tsc_ticks2ns()? > > > > amd_check_erratum_1474() runs after early_time_init(), which would > > have cleared any TSC_ADJUST offset AFAICT. There's a note in the > > initcall to that regard: > > > > /* > > * Must be executed after early_time_init() for tsc_ticks2ns() to have been > > * calibrated. That prevents us doing the check in init_amd(). > > */ > > presmp_initcall(amd_check_erratum_1474); > > Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also > have played other games with MSR_TSC. For amd_check_erratum_1474() we don't want to subtract boot_tsc_stamp, otherwise when kexec'ed we won't be accounting properly for the time since host startup, as subtracting boot_tsc_stamp would remove any time consumed by a previously run OS. > >> A similar issue looks to exist in tsc_get_info(), again when rdtsc() > >> possibly returns a huge value due to TSC_ADJUST. Once again I wonder > >> whether we shouldn't subtract boot_tsc_stamp. > > > > I would expect tsc_get_info() to also get called exclusively after > > early_time_init()? > > Same here then (obviously). For tsc_get_info() I think you are worried that the TSC might overflow, and hence the calculation in scale_delta() would then be skewed. We must have other instances of this pattern however, what about get_s_time_fixed(), I think it would also be affected? Or maybe I'm not understanding the concern. Given the proposed scale_delta() logic, it won't be possible to distinguish rdtsc overflowing from a value in the past. Thanks, Roger. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-15 8:24 ` Roger Pau Monné @ 2026-01-15 8:49 ` Anton Markov 2026-01-15 10:38 ` Jan Beulich 1 sibling, 0 replies; 32+ messages in thread From: Anton Markov @ 2026-01-15 8:49 UTC (permalink / raw) To: Roger Pau Monné Cc: Jan Beulich, xen-devel@lists.xenproject.org, Andrew Cooper [-- Attachment #1: Type: text/plain, Size: 267 bytes --] Sorry. I mixed several threads into one. It wasn't immediately clear that the second bug isn't related to handling negative values in scale_delta. It's related to gradual time desynchronization between cores, so they should probably be addressed separately. Thanks. [-- Attachment #2: Type: text/html, Size: 962 bytes --] ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-15 8:24 ` Roger Pau Monné 2026-01-15 8:49 ` Anton Markov @ 2026-01-15 10:38 ` Jan Beulich 2026-01-15 11:48 ` Roger Pau Monné 1 sibling, 1 reply; 32+ messages in thread From: Jan Beulich @ 2026-01-15 10:38 UTC (permalink / raw) To: Roger Pau Monné Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Антон Марков On 15.01.2026 09:24, Roger Pau Monné wrote: > On Thu, Jan 15, 2026 at 09:00:07AM +0100, Jan Beulich wrote: >> On 14.01.2026 18:49, Roger Pau Monné wrote: >>> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote: >>>> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a >>>> comment towards the TSC being "sane", but is that correct? Due to >>>> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then >>>> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before >>>> calling tsc_ticks2ns()? >>> >>> amd_check_erratum_1474() runs after early_time_init(), which would >>> have cleared any TSC_ADJUST offset AFAICT. There's a note in the >>> initcall to that regard: >>> >>> /* >>> * Must be executed after early_time_init() for tsc_ticks2ns() to have been >>> * calibrated. That prevents us doing the check in init_amd(). >>> */ >>> presmp_initcall(amd_check_erratum_1474); >> >> Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also >> have played other games with MSR_TSC. > > For amd_check_erratum_1474() we don't want to subtract boot_tsc_stamp, > otherwise when kexec'ed we won't be accounting properly for the time > since host startup, as subtracting boot_tsc_stamp would remove any > time consumed by a previously run OS. For both this and ... >>>> A similar issue looks to exist in tsc_get_info(), again when rdtsc() >>>> possibly returns a huge value due to TSC_ADJUST. Once again I wonder >>>> whether we shouldn't subtract boot_tsc_stamp. >>> >>> I would expect tsc_get_info() to also get called exclusively after >>> early_time_init()? >> >> Same here then (obviously). > > For tsc_get_info() I think you are worried that the TSC might > overflow, and hence the calculation in scale_delta() would then be > skewed. We must have other instances of this pattern however, what > about get_s_time_fixed(), I think it would also be affected? > > Or maybe I'm not understanding the concern. Given the proposed > scale_delta() logic, it won't be possible to distinguish rdtsc > overflowing from a value in the past. ... this, my main point really is that scale_delta() (as its name says), and hence also tsc_ticks2ns(), shouldn't be used on absolute counts, but only deltas. (Yes, an absolute count can be viewed as delta from 0, but that's correct only if we know the TSC started counting from 0 and was never adjusted by some bias.) Jan ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-15 10:38 ` Jan Beulich @ 2026-01-15 11:48 ` Roger Pau Monné 2026-01-15 12:04 ` Jan Beulich 0 siblings, 1 reply; 32+ messages in thread From: Roger Pau Monné @ 2026-01-15 11:48 UTC (permalink / raw) To: Jan Beulich Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Антон Марков On Thu, Jan 15, 2026 at 11:38:10AM +0100, Jan Beulich wrote: > On 15.01.2026 09:24, Roger Pau Monné wrote: > > On Thu, Jan 15, 2026 at 09:00:07AM +0100, Jan Beulich wrote: > >> On 14.01.2026 18:49, Roger Pau Monné wrote: > >>> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote: > >>>> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a > >>>> comment towards the TSC being "sane", but is that correct? Due to > >>>> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then > >>>> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before > >>>> calling tsc_ticks2ns()? > >>> > >>> amd_check_erratum_1474() runs after early_time_init(), which would > >>> have cleared any TSC_ADJUST offset AFAICT. There's a note in the > >>> initcall to that regard: > >>> > >>> /* > >>> * Must be executed after early_time_init() for tsc_ticks2ns() to have been > >>> * calibrated. That prevents us doing the check in init_amd(). > >>> */ > >>> presmp_initcall(amd_check_erratum_1474); > >> > >> Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also > >> have played other games with MSR_TSC. > > > > For amd_check_erratum_1474() we don't want to subtract boot_tsc_stamp, > > otherwise when kexec'ed we won't be accounting properly for the time > > since host startup, as subtracting boot_tsc_stamp would remove any > > time consumed by a previously run OS. > > For both this and ... > > >>>> A similar issue looks to exist in tsc_get_info(), again when rdtsc() > >>>> possibly returns a huge value due to TSC_ADJUST. Once again I wonder > >>>> whether we shouldn't subtract boot_tsc_stamp. > >>> > >>> I would expect tsc_get_info() to also get called exclusively after > >>> early_time_init()? > >> > >> Same here then (obviously). > > > > For tsc_get_info() I think you are worried that the TSC might > > overflow, and hence the calculation in scale_delta() would then be > > skewed. We must have other instances of this pattern however, what > > about get_s_time_fixed(), I think it would also be affected? > > > > Or maybe I'm not understanding the concern. Given the proposed > > scale_delta() logic, it won't be possible to distinguish rdtsc > > overflowing from a value in the past. > > ... this, my main point really is that scale_delta() (as its name says), > and hence also tsc_ticks2ns(), shouldn't be used on absolute counts, but > only deltas. (Yes, an absolute count can be viewed as delta from 0, but > that's correct only if we know the TSC started counting from 0 and was > never adjusted by some bias.) Well amd_check_erratum_1474() does want the delta from 0 to the current TSC, because that's the best? way to see when C6 needs to be disabled. Otherwise we just straight disable C6 on boot on affected systems. IMO the problem is not about using tsc_ticks2ns() or scale_delta() with absolute or delta values, the issue if how the "delta" passed to those functions is calculated. Thanks, Roger. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-15 11:48 ` Roger Pau Monné @ 2026-01-15 12:04 ` Jan Beulich 0 siblings, 0 replies; 32+ messages in thread From: Jan Beulich @ 2026-01-15 12:04 UTC (permalink / raw) To: Roger Pau Monné Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Антон Марков On 15.01.2026 12:48, Roger Pau Monné wrote: > On Thu, Jan 15, 2026 at 11:38:10AM +0100, Jan Beulich wrote: >> On 15.01.2026 09:24, Roger Pau Monné wrote: >>> On Thu, Jan 15, 2026 at 09:00:07AM +0100, Jan Beulich wrote: >>>> On 14.01.2026 18:49, Roger Pau Monné wrote: >>>>> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote: >>>>>> amd_check_erratum_1474() (next to its call to tsc_ticks2ns()) has a >>>>>> comment towards the TSC being "sane", but is that correct? Due to >>>>>> TSC_ADJUST, rdtsc() may well return a huge value (and the TSC would then >>>>>> wrap through 0 at some point). Shouldn't we subtract boot_tsc_stamp before >>>>>> calling tsc_ticks2ns()? >>>>> >>>>> amd_check_erratum_1474() runs after early_time_init(), which would >>>>> have cleared any TSC_ADJUST offset AFAICT. There's a note in the >>>>> initcall to that regard: >>>>> >>>>> /* >>>>> * Must be executed after early_time_init() for tsc_ticks2ns() to have been >>>>> * calibrated. That prevents us doing the check in init_amd(). >>>>> */ >>>>> presmp_initcall(amd_check_erratum_1474); >>>> >>>> Hmm, I should have written "Due to e.g. TSC_ADJUST". Firmware may also >>>> have played other games with MSR_TSC. >>> >>> For amd_check_erratum_1474() we don't want to subtract boot_tsc_stamp, >>> otherwise when kexec'ed we won't be accounting properly for the time >>> since host startup, as subtracting boot_tsc_stamp would remove any >>> time consumed by a previously run OS. >> >> For both this and ... >> >>>>>> A similar issue looks to exist in tsc_get_info(), again when rdtsc() >>>>>> possibly returns a huge value due to TSC_ADJUST. Once again I wonder >>>>>> whether we shouldn't subtract boot_tsc_stamp. >>>>> >>>>> I would expect tsc_get_info() to also get called exclusively after >>>>> early_time_init()? >>>> >>>> Same here then (obviously). >>> >>> For tsc_get_info() I think you are worried that the TSC might >>> overflow, and hence the calculation in scale_delta() would then be >>> skewed. We must have other instances of this pattern however, what >>> about get_s_time_fixed(), I think it would also be affected? >>> >>> Or maybe I'm not understanding the concern. Given the proposed >>> scale_delta() logic, it won't be possible to distinguish rdtsc >>> overflowing from a value in the past. >> >> ... this, my main point really is that scale_delta() (as its name says), >> and hence also tsc_ticks2ns(), shouldn't be used on absolute counts, but >> only deltas. (Yes, an absolute count can be viewed as delta from 0, but >> that's correct only if we know the TSC started counting from 0 and was >> never adjusted by some bias.) > > Well amd_check_erratum_1474() does want the delta from 0 to the > current TSC, because that's the best? way to see when C6 needs to be > disabled. Otherwise we just straight disable C6 on boot on affected > systems. I think that may be necessary when we don't know what was done to the TSC before we took control of the system. Jan ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-15 8:00 ` Jan Beulich 2026-01-15 8:24 ` Roger Pau Monné @ 2026-01-20 8:50 ` Jan Beulich 2026-01-20 9:28 ` Jan Beulich 1 sibling, 1 reply; 32+ messages in thread From: Jan Beulich @ 2026-01-20 8:50 UTC (permalink / raw) To: Roger Pau Monné Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Антон Марков On 15.01.2026 09:00, Jan Beulich wrote: > On 14.01.2026 18:49, Roger Pau Monné wrote: >> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote: >>> stime2tsc() guards against negative deltas by using 0 instead; I'm not >>> quite sure that's correct either. >> >> Hm, we should likely do the same for stime2tsc() that you do for >> get_s_time_fixed(). Given the current callers I think we might be >> safe, but it's a risk. > > Will do then. While doing so, I came to wonder if there isn't a reason for this "capping". In local_time_calibration() we also have /* Local time warps forward if it lags behind master time. */ if ( curr.local_stime < curr.master_stime ) curr.local_stime = curr.master_stime; Which for the use of stime2tsc() in cstate_restore_tsc() might mean that indeed there is a worry of the delta being negative, and the desire to "warp forward" in that case. Whereas for the other use in reprogram_timer() capping at 0 isn't overly useful. By the time the value is used, it is likely in the past anyway. Jan ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() 2026-01-20 8:50 ` Jan Beulich @ 2026-01-20 9:28 ` Jan Beulich 0 siblings, 0 replies; 32+ messages in thread From: Jan Beulich @ 2026-01-20 9:28 UTC (permalink / raw) To: Roger Pau Monné Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Антон Марков On 20.01.2026 09:50, Jan Beulich wrote: > On 15.01.2026 09:00, Jan Beulich wrote: >> On 14.01.2026 18:49, Roger Pau Monné wrote: >>> On Tue, Jan 06, 2026 at 02:58:11PM +0100, Jan Beulich wrote: >>>> stime2tsc() guards against negative deltas by using 0 instead; I'm not >>>> quite sure that's correct either. >>> >>> Hm, we should likely do the same for stime2tsc() that you do for >>> get_s_time_fixed(). Given the current callers I think we might be >>> safe, but it's a risk. >> >> Will do then. > > While doing so, I came to wonder if there isn't a reason for this "capping". > In local_time_calibration() we also have > > /* Local time warps forward if it lags behind master time. */ > if ( curr.local_stime < curr.master_stime ) > curr.local_stime = curr.master_stime; > > Which for the use of stime2tsc() in cstate_restore_tsc() might mean that > indeed there is a worry of the delta being negative, and the desire to > "warp forward" in that case. Proposed new function implementation (easier to look at than the diff): uint64_t stime2tsc(s_time_t stime) { const struct cpu_time *t = &this_cpu(cpu_time); s_time_t stime_delta = stime - t->stamp.local_stime; int64_t delta = 0; /* * While for reprogram_timer() the capping at 0 isn't relevant (the returned * value is likely in the past anyway then, by the time it is used), for * cstate_restore_tsc() it is meaningful: We need to avoid moving the TSC * backwards (relative to when it may last have been read). */ if ( stime_delta > 0 ) { struct time_scale sys_to_tsc = scale_reciprocal(t->tsc_scale); delta = scale_delta(stime_delta, &sys_to_tsc); } return t->stamp.local_tsc + delta; } Jan ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 2/5] x86/time: scale_delta() can be static 2026-01-06 13:57 [PATCH 0/5] x86: assorted time handling adjustments Jan Beulich 2026-01-06 13:58 ` [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() Jan Beulich @ 2026-01-06 13:58 ` Jan Beulich 2026-01-13 18:40 ` Roger Pau Monné 2026-01-06 13:58 ` [PATCH 3/5] x86/HVM: drop at_tsc parameter from ->set_tsc_offset() hook Jan Beulich ` (2 subsequent siblings) 4 siblings, 1 reply; 32+ messages in thread From: Jan Beulich @ 2026-01-06 13:58 UTC (permalink / raw) To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné It's used in time.c alone. Modernize types while there. Amends: 5a82d598d2d ("viridian: unify time sources") Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/include/asm/time.h +++ b/xen/arch/x86/include/asm/time.h @@ -56,7 +56,6 @@ u64 stime2tsc(s_time_t stime); struct time_scale; void set_time_scale(struct time_scale *ts, u64 ticks_per_sec); -u64 scale_delta(u64 delta, const struct time_scale *scale); /* Programmable Interval Timer (8254) */ --- a/xen/arch/x86/time.c +++ b/xen/arch/x86/time.c @@ -131,9 +131,9 @@ static inline u32 mul_frac(u32 multiplic * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction, * yielding a 64-bit result. */ -u64 scale_delta(u64 delta, const struct time_scale *scale) +static uint64_t scale_delta(uint64_t delta, const struct time_scale *scale) { - u64 product; + uint64_t product; if ( scale->shift < 0 ) delta >>= -scale->shift; ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/5] x86/time: scale_delta() can be static 2026-01-06 13:58 ` [PATCH 2/5] x86/time: scale_delta() can be static Jan Beulich @ 2026-01-13 18:40 ` Roger Pau Monné 0 siblings, 0 replies; 32+ messages in thread From: Roger Pau Monné @ 2026-01-13 18:40 UTC (permalink / raw) To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper On Tue, Jan 06, 2026 at 02:58:29PM +0100, Jan Beulich wrote: > It's used in time.c alone. Modernize types while there. > > Amends: 5a82d598d2d ("viridian: unify time sources") > Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Thanks, Roger. ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 3/5] x86/HVM: drop at_tsc parameter from ->set_tsc_offset() hook 2026-01-06 13:57 [PATCH 0/5] x86: assorted time handling adjustments Jan Beulich 2026-01-06 13:58 ` [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() Jan Beulich 2026-01-06 13:58 ` [PATCH 2/5] x86/time: scale_delta() can be static Jan Beulich @ 2026-01-06 13:58 ` Jan Beulich 2026-01-14 9:40 ` Roger Pau Monné 2026-01-06 13:59 ` [PATCH 4/5] x86/time: gtsc_to_gtime() is HVM-only Jan Beulich 2026-01-06 14:00 ` [PATCH 5/5] x86/time: pv_soft_rdtsc() is PV-only Jan Beulich 4 siblings, 1 reply; 32+ messages in thread From: Jan Beulich @ 2026-01-06 13:58 UTC (permalink / raw) To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné While the VMX hook never used the parameter, the SVM one lost its sole use some time ago (while the original use of the parameter had gone away even earlier). Again modernize types while there. Amends: 0cd50753eb40 ("nestedsvm: Disable TscRateMSR") Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/hvm/domain.c +++ b/xen/arch/x86/hvm/domain.c @@ -312,8 +312,7 @@ int arch_set_info_hvm_guest(struct vcpu /* Sync AP's TSC with BSP's. */ v->arch.hvm.cache_tsc_offset = d->vcpu[0]->arch.hvm.cache_tsc_offset; - hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset, - d->arch.hvm.sync_tsc); + hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset); paging_update_paging_modes(v); --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -412,7 +412,7 @@ static void hvm_set_guest_tsc_fixed(stru delta_tsc = guest_tsc - tsc; v->arch.hvm.cache_tsc_offset = delta_tsc; - hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset, at_tsc); + hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset); } #define hvm_set_guest_tsc(v, t) hvm_set_guest_tsc_fixed(v, t, 0) @@ -430,7 +430,7 @@ static void hvm_set_guest_tsc_msr(struct static void hvm_set_guest_tsc_adjust(struct vcpu *v, u64 tsc_adjust) { v->arch.hvm.cache_tsc_offset += tsc_adjust - v->arch.hvm.msr_tsc_adjust; - hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset, 0); + hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset); v->arch.hvm.msr_tsc_adjust = tsc_adjust; if ( v == current ) update_vcpu_system_time(v); @@ -4023,8 +4023,7 @@ void hvm_vcpu_reset_state(struct vcpu *v /* Sync AP's TSC with BSP's. */ v->arch.hvm.cache_tsc_offset = v->domain->vcpu[0]->arch.hvm.cache_tsc_offset; - hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset, - d->arch.hvm.sync_tsc); + hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset); v->arch.hvm.msr_tsc_adjust = 0; --- a/xen/arch/x86/hvm/svm/svm.c +++ b/xen/arch/x86/hvm/svm/svm.c @@ -779,7 +779,7 @@ static int cf_check svm_get_guest_pat(st return 1; } -static void cf_check svm_set_tsc_offset(struct vcpu *v, u64 offset, u64 at_tsc) +static void cf_check svm_set_tsc_offset(struct vcpu *v, uint64_t offset) { struct vmcb_struct *vmcb = v->arch.hvm.svm.vmcb; struct vmcb_struct *n1vmcb, *n2vmcb; --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -1558,7 +1558,7 @@ static void cf_check vmx_handle_cd(struc } } -static void cf_check vmx_set_tsc_offset(struct vcpu *v, u64 offset, u64 at_tsc) +static void cf_check vmx_set_tsc_offset(struct vcpu *v, uint64_t offset) { vmx_vmcs_enter(v); --- a/xen/arch/x86/hvm/vmx/vvmx.c +++ b/xen/arch/x86/hvm/vmx/vvmx.c @@ -1114,7 +1114,7 @@ static void load_shadow_guest_state(stru hvm_inject_hw_exception(X86_EXC_GP, 0); } - hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset, 0); + hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset); vvmcs_to_shadow_bulk(v, ARRAY_SIZE(vmentry_fields), vmentry_fields); @@ -1330,7 +1330,7 @@ static void load_vvmcs_host_state(struct hvm_inject_hw_exception(X86_EXC_GP, 0); } - hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset, 0); + hvm_set_tsc_offset(v, v->arch.hvm.cache_tsc_offset); set_vvmcs(v, VM_ENTRY_INTR_INFO, 0); --- a/xen/arch/x86/include/asm/hvm/hvm.h +++ b/xen/arch/x86/include/asm/hvm/hvm.h @@ -165,7 +165,7 @@ struct hvm_function_table { int (*get_guest_pat)(struct vcpu *v, uint64_t *gpat); int (*set_guest_pat)(struct vcpu *v, uint64_t gpat); - void (*set_tsc_offset)(struct vcpu *v, u64 offset, u64 at_tsc); + void (*set_tsc_offset)(struct vcpu *v, uint64_t offset); void (*inject_event)(const struct x86_event *event); @@ -482,10 +482,9 @@ static inline void hvm_cpuid_policy_chan alternative_vcall(hvm_funcs.cpuid_policy_changed, v); } -static inline void hvm_set_tsc_offset(struct vcpu *v, uint64_t offset, - uint64_t at_tsc) +static inline void hvm_set_tsc_offset(struct vcpu *v, uint64_t offset) { - alternative_vcall(hvm_funcs.set_tsc_offset, v, offset, at_tsc); + alternative_vcall(hvm_funcs.set_tsc_offset, v, offset); } /* @@ -847,7 +846,7 @@ static inline void hvm_sync_pir_to_irr(s */ int hvm_guest_x86_mode(struct vcpu *v); void hvm_cpuid_policy_changed(struct vcpu *v); -void hvm_set_tsc_offset(struct vcpu *v, uint64_t offset, uint64_t at_tsc); +void hvm_set_tsc_offset(struct vcpu *v, uint64_t offset); /* End of prototype list */ --- a/xen/arch/x86/time.c +++ b/xen/arch/x86/time.c @@ -2985,8 +2985,7 @@ int tsc_set_info(struct domain *d, */ d->arch.hvm.sync_tsc = rdtsc(); hvm_set_tsc_offset(d->vcpu[0], - d->vcpu[0]->arch.hvm.cache_tsc_offset, - d->arch.hvm.sync_tsc); + d->vcpu[0]->arch.hvm.cache_tsc_offset); } } ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 3/5] x86/HVM: drop at_tsc parameter from ->set_tsc_offset() hook 2026-01-06 13:58 ` [PATCH 3/5] x86/HVM: drop at_tsc parameter from ->set_tsc_offset() hook Jan Beulich @ 2026-01-14 9:40 ` Roger Pau Monné 0 siblings, 0 replies; 32+ messages in thread From: Roger Pau Monné @ 2026-01-14 9:40 UTC (permalink / raw) To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper On Tue, Jan 06, 2026 at 02:58:54PM +0100, Jan Beulich wrote: > While the VMX hook never used the parameter, the SVM one lost its sole use > some time ago (while the original use of the parameter had gone away even > earlier). > > Again modernize types while there. > > Amends: 0cd50753eb40 ("nestedsvm: Disable TscRateMSR") > Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Thanks, Roger. ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 4/5] x86/time: gtsc_to_gtime() is HVM-only 2026-01-06 13:57 [PATCH 0/5] x86: assorted time handling adjustments Jan Beulich ` (2 preceding siblings ...) 2026-01-06 13:58 ` [PATCH 3/5] x86/HVM: drop at_tsc parameter from ->set_tsc_offset() hook Jan Beulich @ 2026-01-06 13:59 ` Jan Beulich 2026-01-14 9:43 ` Roger Pau Monné 2026-01-06 14:00 ` [PATCH 5/5] x86/time: pv_soft_rdtsc() is PV-only Jan Beulich 4 siblings, 1 reply; 32+ messages in thread From: Jan Beulich @ 2026-01-06 13:59 UTC (permalink / raw) To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné Omit the function when HVM=n. With that the !HVM logic can also go away; leave an assertion. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/time.c +++ b/xen/arch/x86/time.c @@ -2840,14 +2840,13 @@ uint64_t gtime_to_gtsc(const struct doma return scale_delta(time, &d->arch.ns_to_vtsc); } +#ifdef CONFIG_HVM uint64_t gtsc_to_gtime(const struct domain *d, uint64_t tsc) { - u64 time = scale_delta(tsc, &d->arch.vtsc_to_ns); - - if ( !is_hvm_domain(d) ) - time += d->arch.vtsc_offset; - return time; + ASSERT(is_hvm_domain(d)); + return scale_delta(tsc, &d->arch.vtsc_to_ns); } +#endif /* CONFIG_HVM */ uint64_t pv_soft_rdtsc(const struct vcpu *v, const struct cpu_user_regs *regs) { ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 4/5] x86/time: gtsc_to_gtime() is HVM-only 2026-01-06 13:59 ` [PATCH 4/5] x86/time: gtsc_to_gtime() is HVM-only Jan Beulich @ 2026-01-14 9:43 ` Roger Pau Monné 2026-01-14 9:52 ` Roger Pau Monné 0 siblings, 1 reply; 32+ messages in thread From: Roger Pau Monné @ 2026-01-14 9:43 UTC (permalink / raw) To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper On Tue, Jan 06, 2026 at 02:59:43PM +0100, Jan Beulich wrote: > Omit the function when HVM=n. With that the !HVM logic can also go away; > leave an assertion. > > Signed-off-by: Jan Beulich <jbeulich@suse.com> > > --- a/xen/arch/x86/time.c > +++ b/xen/arch/x86/time.c > @@ -2840,14 +2840,13 @@ uint64_t gtime_to_gtsc(const struct doma > return scale_delta(time, &d->arch.ns_to_vtsc); > } > > +#ifdef CONFIG_HVM > uint64_t gtsc_to_gtime(const struct domain *d, uint64_t tsc) > { > - u64 time = scale_delta(tsc, &d->arch.vtsc_to_ns); > - > - if ( !is_hvm_domain(d) ) > - time += d->arch.vtsc_offset; > - return time; > + ASSERT(is_hvm_domain(d)); > + return scale_delta(tsc, &d->arch.vtsc_to_ns); > } > +#endif /* CONFIG_HVM */ Seeing this is solely used by the vlapic code, did you consider keeping scale_delta() non-static, and then moving gtsc_to_gtime() into vlapic.c? It would result in less ifdefery overall. Thanks, Roger. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 4/5] x86/time: gtsc_to_gtime() is HVM-only 2026-01-14 9:43 ` Roger Pau Monné @ 2026-01-14 9:52 ` Roger Pau Monné 0 siblings, 0 replies; 32+ messages in thread From: Roger Pau Monné @ 2026-01-14 9:52 UTC (permalink / raw) To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper On Wed, Jan 14, 2026 at 10:43:38AM +0100, Roger Pau Monné wrote: > On Tue, Jan 06, 2026 at 02:59:43PM +0100, Jan Beulich wrote: > > Omit the function when HVM=n. With that the !HVM logic can also go away; > > leave an assertion. > > > > Signed-off-by: Jan Beulich <jbeulich@suse.com> > > > > --- a/xen/arch/x86/time.c > > +++ b/xen/arch/x86/time.c > > @@ -2840,14 +2840,13 @@ uint64_t gtime_to_gtsc(const struct doma > > return scale_delta(time, &d->arch.ns_to_vtsc); > > } > > > > +#ifdef CONFIG_HVM > > uint64_t gtsc_to_gtime(const struct domain *d, uint64_t tsc) > > { > > - u64 time = scale_delta(tsc, &d->arch.vtsc_to_ns); > > - > > - if ( !is_hvm_domain(d) ) > > - time += d->arch.vtsc_offset; > > - return time; > > + ASSERT(is_hvm_domain(d)); > > + return scale_delta(tsc, &d->arch.vtsc_to_ns); > > } > > +#endif /* CONFIG_HVM */ > > Seeing this is solely used by the vlapic code, did you consider > keeping scale_delta() non-static, and then moving gtsc_to_gtime() into > vlapic.c? > > It would result in less ifdefery overall. I see now this might not be appropriate, gtsc_to_gtime() is the pair to gtime_to_gtsc(), which is used in other places. It looks like gtsc_to_gtime() could also gain users in order parts of the code, and hence making static to vlapic might not be the best approach. I'm not overly happy with adding more ifdefary to time.c, and the asymmetry this creates with PV guest not having a gtsc_to_gtime(), but it's otherwise dead code, so: Acked-by: Roger Pau Monné <roger.pau@citrix.com> Thanks, Roger. ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 5/5] x86/time: pv_soft_rdtsc() is PV-only 2026-01-06 13:57 [PATCH 0/5] x86: assorted time handling adjustments Jan Beulich ` (3 preceding siblings ...) 2026-01-06 13:59 ` [PATCH 4/5] x86/time: gtsc_to_gtime() is HVM-only Jan Beulich @ 2026-01-06 14:00 ` Jan Beulich 2026-01-14 9:46 ` Roger Pau Monné 4 siblings, 1 reply; 32+ messages in thread From: Jan Beulich @ 2026-01-06 14:00 UTC (permalink / raw) To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné Omit the function when PV=n, by moving it to the sole file using it, thus allowing it to become static as well. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/include/asm/time.h +++ b/xen/arch/x86/include/asm/time.h @@ -37,8 +37,6 @@ uint64_t cf_check acpi_pm_tick_to_ns(uin uint64_t tsc_ticks2ns(uint64_t ticks); -struct cpu_user_regs; -uint64_t pv_soft_rdtsc(const struct vcpu *v, const struct cpu_user_regs *regs); uint64_t gtime_to_gtsc(const struct domain *d, uint64_t time); uint64_t gtsc_to_gtime(const struct domain *d, uint64_t tsc); --- a/xen/arch/x86/pv/emul-priv-op.c +++ b/xen/arch/x86/pv/emul-priv-op.c @@ -874,6 +874,20 @@ static uint64_t guest_efer(const struct return val; } +static uint64_t soft_rdtsc( + const struct vcpu *v, const struct cpu_user_regs *regs) +{ + s_time_t old, new, now = get_s_time(); + struct domain *d = v->domain; + + do { + old = d->arch.vtsc_last; + new = now > d->arch.vtsc_last ? now : old + 1; + } while ( cmpxchg(&d->arch.vtsc_last, old, new) != old ); + + return gtime_to_gtsc(d, new); +} + static int cf_check read_msr( unsigned int reg, uint64_t *val, struct x86_emulate_ctxt *ctxt) { @@ -920,7 +934,7 @@ static int cf_check read_msr( return X86EMUL_OKAY; case MSR_IA32_TSC: - *val = currd->arch.vtsc ? pv_soft_rdtsc(curr, ctxt->regs) : rdtsc(); + *val = currd->arch.vtsc ? soft_rdtsc(curr, ctxt->regs) : rdtsc(); return X86EMUL_OKAY; case MSR_EFER: --- a/xen/arch/x86/time.c +++ b/xen/arch/x86/time.c @@ -2848,19 +2848,6 @@ uint64_t gtsc_to_gtime(const struct doma } #endif /* CONFIG_HVM */ -uint64_t pv_soft_rdtsc(const struct vcpu *v, const struct cpu_user_regs *regs) -{ - s_time_t old, new, now = get_s_time(); - struct domain *d = v->domain; - - do { - old = d->arch.vtsc_last; - new = now > d->arch.vtsc_last ? now : old + 1; - } while ( cmpxchg(&d->arch.vtsc_last, old, new) != old ); - - return gtime_to_gtsc(d, new); -} - bool clocksource_is_tsc(void) { return plt_src.read_counter == READ_TSC_POISON; ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 5/5] x86/time: pv_soft_rdtsc() is PV-only 2026-01-06 14:00 ` [PATCH 5/5] x86/time: pv_soft_rdtsc() is PV-only Jan Beulich @ 2026-01-14 9:46 ` Roger Pau Monné 0 siblings, 0 replies; 32+ messages in thread From: Roger Pau Monné @ 2026-01-14 9:46 UTC (permalink / raw) To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper On Tue, Jan 06, 2026 at 03:00:21PM +0100, Jan Beulich wrote: > Omit the function when PV=n, by moving it to the sole file using it, thus > allowing it to become static as well. > > Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Thanks, Roger. ^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2026-01-20 9:29 UTC | newest] Thread overview: 32+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-01-06 13:57 [PATCH 0/5] x86: assorted time handling adjustments Jan Beulich 2026-01-06 13:58 ` [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed() Jan Beulich 2026-01-06 20:10 ` Антон Марков 2026-01-07 8:02 ` Jan Beulich 2026-01-09 16:11 ` Anton Markov 2026-01-12 10:31 ` Anton Markov 2026-01-12 11:45 ` Jan Beulich 2026-01-12 12:49 ` Anton Markov 2026-01-12 14:13 ` Jan Beulich 2026-01-12 14:51 ` Anton Markov 2026-01-12 16:08 ` Jan Beulich 2026-01-12 16:41 ` Anton Markov 2026-01-13 11:21 ` Jan Beulich 2026-01-13 12:35 ` Anton Markov 2026-01-14 17:49 ` Roger Pau Monné 2026-01-15 8:00 ` Jan Beulich 2026-01-15 8:24 ` Roger Pau Monné 2026-01-15 8:49 ` Anton Markov 2026-01-15 10:38 ` Jan Beulich 2026-01-15 11:48 ` Roger Pau Monné 2026-01-15 12:04 ` Jan Beulich 2026-01-20 8:50 ` Jan Beulich 2026-01-20 9:28 ` Jan Beulich 2026-01-06 13:58 ` [PATCH 2/5] x86/time: scale_delta() can be static Jan Beulich 2026-01-13 18:40 ` Roger Pau Monné 2026-01-06 13:58 ` [PATCH 3/5] x86/HVM: drop at_tsc parameter from ->set_tsc_offset() hook Jan Beulich 2026-01-14 9:40 ` Roger Pau Monné 2026-01-06 13:59 ` [PATCH 4/5] x86/time: gtsc_to_gtime() is HVM-only Jan Beulich 2026-01-14 9:43 ` Roger Pau Monné 2026-01-14 9:52 ` Roger Pau Monné 2026-01-06 14:00 ` [PATCH 5/5] x86/time: pv_soft_rdtsc() is PV-only Jan Beulich 2026-01-14 9:46 ` Roger Pau Monné
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.