[PATCH RESEND] avoid hv timer fallback to sw timer if delay exceeds period

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH RESEND] avoid hv timer fallback to sw timer if delay exceeds period
@ 2025-10-13 12:51 fuqiang wang
  2025-10-13 23:29 ` Sean Christopherson
  0 siblings, 1 reply; 5+ messages in thread
From: fuqiang wang @ 2025-10-13 12:51 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Maxim Levitsky, kvm, linux-kernel
  Cc: fuqiang wang, yu chen, dongxu zhang

When the guest uses the APIC periodic timer, if the delay exceeds the
period, the delta will be negative. nsec_to_cycles() may then convert this
delta into an absolute value larger than guest_l1_tsc, resulting in a
negative tscdeadline. Since the hv timer supports a maximum bit width of
cpu_preemption_timer_multi + 32, this causes the hv timer setup to fail and
switch to the sw timer.

Moreover, due to the commit 98c25ead5eda ("KVM: VMX: Move preemption timer
<=> hrtimer dance to common x86"), if the guest is using the sw timer
before blocking, it will continue to use the sw timer after being woken up,
and will not switch back to the hv timer until the relevant APIC timer
register is reprogrammed.  Since the periodic timer does not require
frequent APIC timer register programming, the guest may continue to use the
software timer for an extended period.

The reproduction steps and patch verification results at link [1].

[1]: https://github.com/cai-fuqiang/kernel_test/tree/master/period_timer_test

Fixes: 98c25ead5eda ("KVM: VMX: Move preemption timer <=> hrtimer dance to common x86")
Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com>
---
 arch/x86/kvm/lapic.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5fc437341e03..afd349f4d933 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2036,6 +2036,9 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
 	u64 tscl = rdtsc();
 	ktime_t delta;

+	u64 delta_cycles_u;
+	u64 delta_cycles_s;
+
 	/*
 	 * Synchronize both deadlines to the same time source or
 	 * differences in the periods (caused by differences in the
@@ -2047,8 +2050,11 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
 		ktime_add_ns(apic->lapic_timer.target_expiration,
 				apic->lapic_timer.period);
 	delta = ktime_sub(apic->lapic_timer.target_expiration, now);
+	delta_cycles_u = nsec_to_cycles(apic->vcpu, abs(delta));
+	delta_cycles_s = delta > 0 ? delta_cycles_u : -delta_cycles_u;
+
 	apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
-		nsec_to_cycles(apic->vcpu, delta);
+		delta_cycles_s;
 }

 static void start_sw_period(struct kvm_lapic *apic)
-- 
2.47.0

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH RESEND] avoid hv timer fallback to sw timer if delay exceeds period
  2025-10-13 12:51 [PATCH RESEND] avoid hv timer fallback to sw timer if delay exceeds period fuqiang wang
@ 2025-10-13 23:29 ` Sean Christopherson
  2025-10-17 12:21   ` fuqiang wang
  0 siblings, 1 reply; 5+ messages in thread
From: Sean Christopherson @ 2025-10-13 23:29 UTC (permalink / raw)
  To: fuqiang wang
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Maxim Levitsky, kvm,
	linux-kernel, yu chen, dongxu zhang

On Wed, Oct 01, 2025, fuqiang wang wrote:
> When the guest uses the APIC periodic timer, if the delay exceeds the
> period, the delta will be negative. 

IIUC, by "delay" you mean the time it takes for KVM to get (back) to
advance_periodic_target_expiration().  If that's correct, I think it would be
clearer to word this as:

  When the guest uses the APIC periodic timer, if the next period has already
  expired, e.g. due to the period being smaller than the delay in processing
  the timer, the delta will be negative.

> nsec_to_cycles() may then convert this
> delta into an absolute value larger than guest_l1_tsc, resulting in a
> negative tscdeadline. Since the hv timer supports a maximum bit width of
> cpu_preemption_timer_multi + 32, this causes the hv timer setup to fail and
> switch to the sw timer.
> 
> Moreover, due to the commit 98c25ead5eda ("KVM: VMX: Move preemption timer
> <=> hrtimer dance to common x86"), if the guest is using the sw timer
> before blocking, it will continue to use the sw timer after being woken up,
> and will not switch back to the hv timer until the relevant APIC timer
> register is reprogrammed.  Since the periodic timer does not require
> frequent APIC timer register programming, the guest may continue to use the
> software timer for an extended period.
> 
> The reproduction steps and patch verification results at link [1].
> 
> [1]: https://github.com/cai-fuqiang/kernel_test/tree/master/period_timer_test
> 
> Fixes: 98c25ead5eda ("KVM: VMX: Move preemption timer <=> hrtimer dance to common x86")

I'm pretty sure this should be:

  Fixes: d8f2f498d9ed ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode")

because that's where the bug with tsdeadline (incorrectly) wrapping was introduced.
The aforementioned commit exacerbated (and likely exposed?) the bug, but AFAICT
that commit itself didn't introduce any bugs (related to this issue).

> Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com>
> ---
>  arch/x86/kvm/lapic.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 5fc437341e03..afd349f4d933 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2036,6 +2036,9 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
>  	u64 tscl = rdtsc();
>  	ktime_t delta;
>  
> +	u64 delta_cycles_u;
> +	u64 delta_cycles_s;
> +
>  	/*
>  	 * Synchronize both deadlines to the same time source or
>  	 * differences in the periods (caused by differences in the
> @@ -2047,8 +2050,11 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
>  		ktime_add_ns(apic->lapic_timer.target_expiration,
>  				apic->lapic_timer.period);
>  	delta = ktime_sub(apic->lapic_timer.target_expiration, now);
> +	delta_cycles_u = nsec_to_cycles(apic->vcpu, abs(delta));
> +	delta_cycles_s = delta > 0 ? delta_cycles_u : -delta_cycles_u;
> +
>  	apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
> -		nsec_to_cycles(apic->vcpu, delta);
> +		delta_cycles_s;

This isn't quite correct either.  E.g. if delta is negative while L1 TSC is tiny,
then subtracting the delta will incorrectly result the deadline wrapping too.
Very, very, theoretically, L1 TSC could even be '0', e.g. due to a weird offset
for L1, so I don't think subtracting is ever safe.  Heh, of course we're hosed
in that case no matter what since KVM treats tscdeadline==0 as "not programmed".

Anyways, can't we just skip adding negative value?  Whether or not the TSC deadline
has expired is mostly a boolean value; for the vast majority of code it doesn't
matter exactly when the timer expired.

The only code that cares is __kvm_wait_lapic_expire(), and the only downside to
setting tscdeadline=L1.TSC is that adjust_lapic_timer_advance() won't adjust as
aggressively as it probably should.

Ha!  That's essentially what update_target_expiration() already does:

	now = ktime_get();
	remaining = ktime_sub(apic->lapic_timer.target_expiration, now);
	if (ktime_to_ns(remaining) < 0)
		remaining = 0;

E.g.

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0ae7f913d782..2fb03a8a9ae9 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2131,18 +2131,26 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
        ktime_t delta;
 
        /*
-        * Synchronize both deadlines to the same time source or
-        * differences in the periods (caused by differences in the
-        * underlying clocks or numerical approximation errors) will
-        * cause the two to drift apart over time as the errors
-        * accumulate.
+        * Use kernel time as the time source for both deadlines so that they
+        * stay synchronized.  Computing each deadline independently will cause
+        * the two deadlines to drift apart over time as differences in the
+        * periods accumulate, e.g. due to differences in the underlying clocks
+        * or numerical approximation errors.
         */
        apic->lapic_timer.target_expiration =
                ktime_add_ns(apic->lapic_timer.target_expiration,
                                apic->lapic_timer.period);
        delta = ktime_sub(apic->lapic_timer.target_expiration, now);
-       apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
-               nsec_to_cycles(apic->vcpu, delta);
+
+       /*
+        * Don't adjust the TSC deadline if the next period has already expired,
+        * e.g. due to software overhead resulting in delays larger than the
+        * period.  Blindly adding a negative delta could cause the deadline to
+        * become excessively large due to the deadline being an unsigned value.
+        */
+       apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl);
+       if (delta > 0)
+               apic->lapic_timer.tscdeadline += nsec_to_cycles(apic->vcpu, delta);
 }
 
 static void start_sw_period(struct kvm_lapic *apic)

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH RESEND] avoid hv timer fallback to sw timer if delay exceeds period
  2025-10-13 23:29 ` Sean Christopherson
@ 2025-10-17 12:21   ` fuqiang wang
  2025-10-17 15:59     ` Sean Christopherson
  0 siblings, 1 reply; 5+ messages in thread
From: fuqiang wang @ 2025-10-17 12:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Maxim Levitsky, kvm,
	linux-kernel, yu chen, dongxu zhang



On 10/14/25 7:29 AM, Sean Christopherson wrote:
> On Wed, Oct 01, 2025, fuqiang wang wrote:
>> When the guest uses the APIC periodic timer, if the delay exceeds the
>> period, the delta will be negative.
> 
> IIUC, by "delay" you mean the time it takes for KVM to get (back) to
> advance_periodic_target_expiration().  If that's correct, I think it would be
> clearer to word this as:
> 
>    When the guest uses the APIC periodic timer, if the next period has already
>    expired, e.g. due to the period being smaller than the delay in processing
>    the timer, the delta will be negative.
> 

Indeed, this explanation is much clearer. I will adopt it in the next version
patch.

>> nsec_to_cycles() may then convert this
>> delta into an absolute value larger than guest_l1_tsc, resulting in a
>> negative tscdeadline. Since the hv timer supports a maximum bit width of
>> cpu_preemption_timer_multi + 32, this causes the hv timer setup to fail and
>> switch to the sw timer.
>>
>> Moreover, due to the commit 98c25ead5eda ("KVM: VMX: Move preemption timer
>> <=> hrtimer dance to common x86"), if the guest is using the sw timer
>> before blocking, it will continue to use the sw timer after being woken up,
>> and will not switch back to the hv timer until the relevant APIC timer
>> register is reprogrammed.  Since the periodic timer does not require
>> frequent APIC timer register programming, the guest may continue to use the
>> software timer for an extended period.
>>
>> The reproduction steps and patch verification results at link [1].
>>
>> [1]: https://github.com/cai-fuqiang/kernel_test/tree/master/period_timer_test
>>
>> Fixes: 98c25ead5eda ("KVM: VMX: Move preemption timer <=> hrtimer dance to common x86")

Yes, it's better, I will fix it.

> 
> I'm pretty sure this should be:
> 
>    Fixes: d8f2f498d9ed ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode")
> 
> because that's where the bug with tsdeadline (incorrectly) wrapping was introduced.
> The aforementioned commit exacerbated (and likely exposed?) the bug, but AFAICT
> that commit itself didn't introduce any bugs (related to this issue).
> 
>> Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com>
>> ---
>>   arch/x86/kvm/lapic.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 5fc437341e03..afd349f4d933 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -2036,6 +2036,9 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
>>   	u64 tscl = rdtsc();
>>   	ktime_t delta;
>>   
>> +	u64 delta_cycles_u;
>> +	u64 delta_cycles_s;
>> +
>>   	/*
>>   	 * Synchronize both deadlines to the same time source or
>>   	 * differences in the periods (caused by differences in the
>> @@ -2047,8 +2050,11 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
>>   		ktime_add_ns(apic->lapic_timer.target_expiration,
>>   				apic->lapic_timer.period);
>>   	delta = ktime_sub(apic->lapic_timer.target_expiration, now);
>> +	delta_cycles_u = nsec_to_cycles(apic->vcpu, abs(delta));
>> +	delta_cycles_s = delta > 0 ? delta_cycles_u : -delta_cycles_u;
>> +
>>   	apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
>> -		nsec_to_cycles(apic->vcpu, delta);
>> +		delta_cycles_s;
> 
> This isn't quite correct either.  E.g. if delta is negative while L1 TSC is tiny,
> then subtracting the delta will incorrectly result the deadline wrapping too.
> Very, very, theoretically, L1 TSC could even be '0', e.g. due to a weird offset
> for L1, so I don't think subtracting is ever safe.  Heh, of course we're hosed
> in that case no matter what since KVM treats tscdeadline==0 as "not programmed".
> 
> Anyways, can't we just skip adding negative value?  Whether or not the TSC deadline
> has expired is mostly a boolean value; for the vast majority of code it doesn't
> matter exactly when the timer expired.

I don’t think this patch will cause tscdeadline to wrap around to zero, because
the system is unlikely to start a timer when the guest tsc is zero and have it
expire immediately. However, keeping the logic to skip adding negative values is
a good idea, seems like there’s no downside.

> 
> The only code that cares is __kvm_wait_lapic_expire(), and the only downside to
> setting tscdeadline=L1.TSC is that adjust_lapic_timer_advance() won't adjust as
> aggressively as it probably should.

I am not sure which type of timers should use the "advanced tscdeadline hrtimer
expiration feature".

I list the history of this feature.

1. Marcelo first introduce this feature, only support the tscdeadline sw timer.
2. Yunhong introduce vmx preemption timer(hv), only support for tscdeadline.
3. Liwanpeng extend the hv timer to oneshot and period timers.
4. Liwanpeng extend this feature to hv timer.
5. Sean and liwanpeng fix some BUG extend this feature to hv period/oneshot timer.

[1] d0659d946be0("KVM: x86: add option to advance tscdeadline hrtimer expiration")
     Marcelo Tosatti     Dec 16 2014
[2] ce7a058a2117("KVM: x86: support using the vmx preemption timer for tsc deadline timer")
     Yunhong Jiang       Jun 13 2016
[3] 8003c9ae204e("KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer support")
     liwanpeng           Oct 24 2016
[4] c5ce8235cffa("KVM: VMX: Optimize tscdeadline timer latency")
     liwanpeng           May 29 2018
[5] ee66e453db13("KVM: lapic: Busy wait for timer to expire when using hv_timer")
     Sean Christopherson Apr 16 2019

     d981dd15498b("KVM: LAPIC: Accurately guarantee busy wait for timer to expire when using hv_timer")
     liwanpeng           Apr 28 2021

Now, timers supported for this feature includes:
- sw: tscdeadline
- hv: all (tscdeadline, oneshot, period)

====
IMHO
====

1. for period timer
===================

I think for periodic timers emulation, the expiration time is already adjusted
to compensate for the delays introduced by timer emulation, so don't need this
feature to adjust again. But after use the feature, the first timer expiration
may be relatively accurate.

E.g., At time 0, start a periodic task (period: 10,000 ns) with a simulated
delay of 100 ns.

With this feature enabled and reasonably accurate prediction, the expiration
time set seen by the guest are: 10000, 20000, 30000...

With this feature not enabled, the expiration times set: 10100, 20100, 30100...

But IMHO, for periodic timers, accuracy of the period seems to be the main
concern, because it does not frequently start and stop. The incorrect period
caused by the first timer expiration can be ignored.

2. for oneshot timer
====================

In [1], Marcelo treated oneshot and tscdeadline differently. Shouldn’t the
behavior of these two timers be similar? Unlike periodic timers, both oneshot
and tscdeadline timers set a specific expiration time, and what the guest cares
about is whether the expiration time is accurate. Moreover, this feature is
mainly intended to mitigate the latency introduced by timer virtualization.
Since software timers have higher latency compared to hardware virtual timers,
the need for this feature is actually more urgent for software timers. However,
in the current implementation, the feature is applied to hv oneshot/period
timers, but not to sw oneshot/period timers.

===============
Summary of IMHO
===============

The feature should be applied to the following timer types:
sw/hv tscdeadline and sw/hv oneshot

should not be applied to:
sw/hv period

However, so far I haven’t found any discussion on the mailing lists regarding
the commits summarized above. Please let me know if I’ve overlooked something.

> 
> Ha!  That's essentially what update_target_expiration() already does:
> 
> 	now = ktime_get();
> 	remaining = ktime_sub(apic->lapic_timer.target_expiration, now);
> 	if (ktime_to_ns(remaining) < 0)
> 		remaining = 0;
> 
> E.g.
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 0ae7f913d782..2fb03a8a9ae9 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2131,18 +2131,26 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
>          ktime_t delta;
>   
>          /*
> -        * Synchronize both deadlines to the same time source or
> -        * differences in the periods (caused by differences in the
> -        * underlying clocks or numerical approximation errors) will
> -        * cause the two to drift apart over time as the errors
> -        * accumulate.
> +        * Use kernel time as the time source for both deadlines so that they
> +        * stay synchronized.  Computing each deadline independently will cause
> +        * the two deadlines to drift apart over time as differences in the
> +        * periods accumulate, e.g. due to differences in the underlying clocks
> +        * or numerical approximation errors.
>           */
>          apic->lapic_timer.target_expiration =
>                  ktime_add_ns(apic->lapic_timer.target_expiration,
>                                  apic->lapic_timer.period);
>          delta = ktime_sub(apic->lapic_timer.target_expiration, now);
> -       apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
> -               nsec_to_cycles(apic->vcpu, delta);
> +
> +       /*
> +        * Don't adjust the TSC deadline if the next period has already expired,
> +        * e.g. due to software overhead resulting in delays larger than the
> +        * period.  Blindly adding a negative delta could cause the deadline to
> +        * become excessively large due to the deadline being an unsigned value.
> +        */
> +       apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl);
> +       if (delta > 0)
> +               apic->lapic_timer.tscdeadline += nsec_to_cycles(apic->vcpu, delta);
>   }
>   
>   static void start_sw_period(struct kvm_lapic *apic)

Thank you for your patient explanations and for helping me make the revisions. I
will update this in the next patch version.

Regards,
fuqiang

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH RESEND] avoid hv timer fallback to sw timer if delay exceeds period
  2025-10-17 12:21   ` fuqiang wang
@ 2025-10-17 15:59     ` Sean Christopherson
  2025-10-21 15:37       ` fuqiang wang
  0 siblings, 1 reply; 5+ messages in thread
From: Sean Christopherson @ 2025-10-17 15:59 UTC (permalink / raw)
  To: fuqiang wang
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Maxim Levitsky, kvm,
	linux-kernel, yu chen, dongxu zhang

On Fri, Oct 17, 2025, fuqiang wang wrote:
> On 10/14/25 7:29 AM, Sean Christopherson wrote:
> > On Wed, Oct 01, 2025, fuqiang wang wrote:
> > The only code that cares is __kvm_wait_lapic_expire(), and the only downside to
> > setting tscdeadline=L1.TSC is that adjust_lapic_timer_advance() won't adjust as
> > aggressively as it probably should.
> 
> I am not sure which type of timers should use the "advanced tscdeadline hrtimer
> expiration feature".
> 
> I list the history of this feature.
> 
> 1. Marcelo first introduce this feature, only support the tscdeadline sw timer.
> 2. Yunhong introduce vmx preemption timer(hv), only support for tscdeadline.
> 3. Liwanpeng extend the hv timer to oneshot and period timers.
> 4. Liwanpeng extend this feature to hv timer.
> 5. Sean and liwanpeng fix some BUG extend this feature to hv period/oneshot timer.
> 
> [1] d0659d946be0("KVM: x86: add option to advance tscdeadline hrtimer expiration")
>     Marcelo Tosatti     Dec 16 2014
> [2] ce7a058a2117("KVM: x86: support using the vmx preemption timer for tsc deadline timer")
>     Yunhong Jiang       Jun 13 2016
> [3] 8003c9ae204e("KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer support")
>     liwanpeng           Oct 24 2016
> [4] c5ce8235cffa("KVM: VMX: Optimize tscdeadline timer latency")
>     liwanpeng           May 29 2018
> [5] ee66e453db13("KVM: lapic: Busy wait for timer to expire when using hv_timer")
>     Sean Christopherson Apr 16 2019
> 
>     d981dd15498b("KVM: LAPIC: Accurately guarantee busy wait for timer to expire when using hv_timer")
>     liwanpeng           Apr 28 2021
> 
> Now, timers supported for this feature includes:
> - sw: tscdeadline
> - hv: all (tscdeadline, oneshot, period)
> 
> ====
> IMHO
> ====
> 
> 1. for period timer
> ===================
> 
> I think for periodic timers emulation, the expiration time is already adjusted
> to compensate for the delays introduced by timer emulation, so don't need this
> feature to adjust again. But after use the feature, the first timer expiration
> may be relatively accurate.
> 
> E.g., At time 0, start a periodic task (period: 10,000 ns) with a simulated
> delay of 100 ns.
> 
> With this feature enabled and reasonably accurate prediction, the expiration
> time set seen by the guest are: 10000, 20000, 30000...
> 
> With this feature not enabled, the expiration times set: 10100, 20100, 30100...
> 
> But IMHO, for periodic timers, accuracy of the period seems to be the main
> concern, because it does not frequently start and stop. The incorrect period
> caused by the first timer expiration can be ignored.

I agree it's superfluous, but applying the advancement also does no harm, and
avoiding it would be moreeffort than simply letting KVM predict the first expiration.

> 2. for oneshot timer
> ====================
> 
> In [1], Marcelo treated oneshot and tscdeadline differently. Shouldn’t the
> behavior of these two timers be similar?

Yes, but they aren't identical, and so supporting both would require additional
code, complexity, and testing.

> Unlike periodic timers, both oneshot and tscdeadline timers set a specific
> expiration time, and what the guest cares about is whether the expiration
> time is accurate. Moreover, this feature is mainly intended to mitigate the
> latency introduced by timer virtualization.  Since software timers have
> higher latency compared to hardware virtual timers, the need for this feature
> is actually more urgent for software timers.

Yep.

> However, in the current implementation, the feature is applied to hv
> oneshot/period timers, but not to sw oneshot/period timers.
>
> ===============
> Summary of IMHO
> ===============
> 
> The feature should be applied to the following timer types:
> sw/hv tscdeadline and sw/hv oneshot

In a perfect world, probably?  But I don't know that it's worth changing at this
time.  Much of this is balancing complexity with benefit, though it's also most
definitely a reflection of the initial implementation.

KVM unconditionally emulates TSC-deadline mode, and AFAIK every real-world kernel
prefers TSC-deadline over one-shot, and so in practice the benefits of applying
the advancement to one-shot hrtimers.  That was also the way the world was headed
back when Marcelo first implemented the support.  I don't know for sure why the
initial implementation targeted only TSC-deadline mode, but I think it's safe to
assume that the use case Marcelo was targeting exclusively used TSC-deadline.

I'm not entirely opposed to playing the advancement games with one-shot hrtimers,
but it's also not clear to me that it's worth doing.  E.g. supporting one-shot
hrtimers would likely require a bit of extra complexity to juggle the different
time domains.  And if the only use cases that are truly sensitive to timer
programming latency exclusively use TSC-deadline mode (because one-shot mode is
inherently "fuzzy"), then any amount of extra complexity is effectively dead weight.

> should not be applied to:
> sw/hv period

I wouldn't say "should not be applied to", I think it's more "doesn't provide much
benefit to".

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH RESEND] avoid hv timer fallback to sw timer if delay exceeds period
  2025-10-17 15:59     ` Sean Christopherson
@ 2025-10-21 15:37       ` fuqiang wang
  0 siblings, 0 replies; 5+ messages in thread
From: fuqiang wang @ 2025-10-21 15:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Maxim Levitsky, kvm,
	linux-kernel, yu chen, dongxu zhang

On 10/17/25 11:59 PM, Sean Christopherson wrote:

>> ====
>> IMHO
>> ====
>>
>> 1. for period timer
>> ===================
>>
>> I think for periodic timers emulation, the expiration time is already adjusted
>> to compensate for the delays introduced by timer emulation, so don't need this
>> feature to adjust again. But after use the feature, the first timer expiration
>> may be relatively accurate.
>>
>> E.g., At time 0, start a periodic task (period: 10,000 ns) with a simulated
>> delay of 100 ns.
>>
>> With this feature enabled and reasonably accurate prediction, the expiration
>> time set seen by the guest are: 10000, 20000, 30000...
>>
>> With this feature not enabled, the expiration times set: 10100, 20100, 30100...
>>
>> But IMHO, for periodic timers, accuracy of the period seems to be the main
>> concern, because it does not frequently start and stop. The incorrect period
>> caused by the first timer expiration can be ignored.
>
> I agree it's superfluous, but applying the advancement also does no harm, and
> avoiding it would be moreeffort than simply letting KVM predict the first expiration.
>

Yes, that’s indeed the case.

> KVM unconditionally emulates TSC-deadline mode, and AFAIK every real-world kernel
> prefers TSC-deadline over one-shot, and so in practice the benefits of applying
> the advancement to one-shot hrtimers.  That was also the way the world was headed
> back when Marcelo first implemented the support.  I don't know for sure why the
> initial implementation targeted only TSC-deadline mode, but I think it's safe to
> assume that the use case Marcelo was targeting exclusively used TSC-deadline.

Yes, it appears that focusing on TSC-deadline emulation fits the current use
cases.

>
> I'm not entirely opposed to playing the advancement games with one-shot hrtimers,
> but it's also not clear to me that it's worth doing.  E.g. supporting one-shot
> hrtimers would likely require a bit of extra complexity to juggle the different
> time domains.  And if the only use cases that are truly sensitive to timer
> programming latency exclusively use TSC-deadline mode (because one-shot mode is
> inherently "fuzzy"), then any amount of extra complexity is effectively dead weight.
>
>> should not be applied to:
>> sw/hv period
>
> I wouldn't say "should not be applied to", I think it's more "doesn't provide much
> benefit to".

Thanks again for your clear explanation and insights. This really helped me
understand the design choices better. :)

Regards,
fuqiang

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-10-21 15:38 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-13 12:51 [PATCH RESEND] avoid hv timer fallback to sw timer if delay exceeds period fuqiang wang
2025-10-13 23:29 ` Sean Christopherson
2025-10-17 12:21   ` fuqiang wang
2025-10-17 15:59     ` Sean Christopherson
2025-10-21 15:37       ` fuqiang wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox