* [PATCH v5 0/1] KVM: x86: fix some kvm period timer BUG
@ 2025-11-07 3:47 fuqiang wang
2025-11-07 3:48 ` [PATCH v5 1/1] KVM: x86: Fix VM hard lockup after prolonged suspend with periodic HV timer fuqiang wang
2025-11-11 0:28 ` [PATCH v5 0/1] KVM: x86: fix some kvm period timer BUG Sean Christopherson
0 siblings, 2 replies; 6+ messages in thread
From: fuqiang wang @ 2025-11-07 3:47 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Marcelo Tosatti,
H . Peter Anvin, Maxim Levitsky, kvm, linux-kernel
Cc: fuqiang wang, yu chen, dongxu zhang
This patch fixes two issues with the periodic timer:
====================================================================
issue 1: avoid hv timer fallback to sw timer if delay exceeds period
====================================================================
When the guest uses the APIC periodic timer, the next period may have already
expired, e.g. due to the period being smaller than the delay in processing the
timer, in which case the delta will be negative. nsec_to_cycles() may then
convert this delta into an unsigned value whose magnitude is larger than
guest_l1_tsc, resulting in a negative tscdeadline. Since the hv timer supports
a maximum bit width of cpu_preemption_timer_multi + 32, this causes the hv
timer setup to fail and KVM to switch to the sw timer.
Moreover, due to commit 98c25ead5eda ("KVM: VMX: Move preemption timer
<=> hrtimer dance to common x86"), if the guest is using the sw timer
before blocking, it will continue to use the sw timer after being woken up,
and will not switch back to the hv timer until the relevant APIC timer
register is reprogrammed. Since the periodic timer does not require
frequent APIC timer register programming, the guest may continue to use the
software timer for an extended period.
Link [1] reproduces this issue with a kernel module. The module creates a
periodic hrtimer and adds a certain delay in its callback, making the delay
longer than the KVM periodic timer period.
======================================================================
issue 2: VM hard lockup after prolonged suspend with periodic HV timer
======================================================================
Resuming a virtual machine after it has been suspended for a long time may
trigger a hard lockup.
The main reason is that the KVM periodic HV timer only advances during the
VM-exit “VMX-preemption timer expired” event and when the vCPU is
suspended or returns to user space for other reasons, the KVM timer stops
advancing. Since the periodic timer expiration callback advances the timer
by only one period per invocation, the callback must be executed many times
to catch the expiration up to the current time.
Due to issue 1, the KVM periodic HV timer will switch to the software timer,
and these catch-ups will all be executed within a single clock interrupt.
If this process lasts long enough, it can easily lead to a hard lockup.
One of our Windows virtual machines in the production environment triggered
this case:
NMI watchdog: Watchdog detected hard LOCKUP on cpu 45
...
RIP: 0010:advance_periodic_target_expiration+0x4d/0x80 [kvm]
...
RSP: 0018:ff4f88f5d98d8ef0 EFLAGS: 00000046
RAX: fff0103f91be678e RBX: fff0103f91be678e RCX: 00843a7d9e127bcc
RDX: 0000000000000002 RSI: 0052ca4003697505 RDI: ff440d5bfbdbd500
RBP: ff440d5956f99200 R08: ff2ff2a42deb6a84 R09: 000000000002a6c0
R10: 0122d794016332b3 R11: 0000000000000000 R12: ff440db1af39cfc0
R13: ff440db1af39cfc0 R14: ffffffffc0d4a560 R15: ff440db1af39d0f8
FS: 00007f04a6ffd700(0000) GS:ff440db1af380000(0000) knlGS:000000e38a3b8000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000d5651feff8 CR3: 000000684e038002 CR4: 0000000000773ee0
PKRU: 55555554
Call Trace:
<IRQ>
apic_timer_fn+0x31/0x50 [kvm]
__hrtimer_run_queues+0x100/0x280
hrtimer_interrupt+0x100/0x210
? ttwu_do_wakeup+0x19/0x160
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
In link [2], Marcelo also reported this issue, but I don't think his
reproducer can still trigger it. Because of commit [3], as long as the KVM
timer is running, target_expiration will keep catching up to now (unless every
single delay from timer virtualization is longer than the period, which is a
pretty extreme case). This patch is based on the patch in link [2], but with
one difference: in link [2], target_expiration is updated to "now - period"
(I'm not sure why it doesn't just catch up to now -- maybe I'm missing
something?), whereas this patch sets target_expiration to catch up to now,
just like how update_target_expiration handles the remainder.
Link [4] provides details of the hard lockup as well as how to reproduce the
KVM timer stoppage by pausing the virtual machine.
=================================
Fix both issues in a single patch
=================================
In versions v2 and v3, I split these two issues into two separate patches.
However, this caused patch 2 to revert some of the changes made by patch 1.
In v4, I merged the two patches into one and tried to describe both issues in
the commit message, but I did not do it well. In this version, I have included
more details in the commit message and the cover letter.
Changes in v5:
- Add more details to the commit message and cover letter.
- link to v4: https://lore.kernel.org/all/20251105135340.33335-1-fuqiang.wng@gmail.com/
Changes in v4:
- Merge the two patches into one
- link to v3: https://lore.kernel.org/all/20251022150055.2531-1-fuqiang.wng@gmail.com/
Changes in v3:
- Fix: make the advanced SW timer (hrtimer) expiration catch up to the
  current time.
- Optimize the commit message of patch 2
- link to v2: https://lore.kernel.org/all/20251021154052.17132-1-fuqiang.wng@gmail.com/
Changes in v2:
- Added a bugfix for the hard lockup
- link to v1: https://lore.kernel.org/all/20251013125117.87739-1-fuqiang.wng@gmail.com/
[1]: https://github.com/cai-fuqiang/kernel_test/tree/master/period_timer_test
[2]: https://lore.kernel.org/kvm/YgahsSubOgFtyorl@fuller.cnet/
[3]: commit d8f2f498d9ed ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode")
[4]: https://github.com/cai-fuqiang/md/tree/master/case/intel_kvm_period_timer
fuqiang wang (1):
KVM: x86: Fix VM hard lockup after prolonged suspend with periodic HV
timer
arch/x86/kvm/lapic.c | 32 ++++++++++++++++++++++++--------
1 file changed, 24 insertions(+), 8 deletions(-)
--
2.47.0
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH v5 1/1] KVM: x86: Fix VM hard lockup after prolonged suspend with periodic HV timer
2025-11-07 3:47 [PATCH v5 0/1] KVM: x86: fix some kvm period timer BUG fuqiang wang
@ 2025-11-07 3:48 ` fuqiang wang
2025-11-11 0:46 ` Sean Christopherson
2025-11-11 0:28 ` [PATCH v5 0/1] KVM: x86: fix some kvm period timer BUG Sean Christopherson
1 sibling, 1 reply; 6+ messages in thread
From: fuqiang wang @ 2025-11-07 3:48 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Marcelo Tosatti,
H . Peter Anvin, Maxim Levitsky, kvm, linux-kernel
Cc: fuqiang wang, yu chen, dongxu zhang
When a VM is suspended while using the periodic HV timer, the KVM timer
also ceases to advance. After the VM resumes from a prolonged suspend,
there will be a huge gap between target_expiration and the current time.
Because target_expiration is incremented by only one period on each KVM
timer expiration, this leads to a series of KVM timer expirations occurring
rapidly after the VM resumes.
More critically, when the VM triggers its first periodic HV timer expiration
after resuming, advance_periodic_target_expiration() advances
target_expiration by one period, but it will still be earlier than the
current time (now). As a result, delta may be calculated as a negative
value. nsec_to_cycles() then converts this delta into an unsigned value whose
magnitude is larger than guest_l1_tsc, resulting in a negative tscdeadline.
Since the hv timer supports a maximum bit width of cpu_preemption_timer_multi
+ 32, this causes the hv timer setup to fail and switch to the sw timer.
After switching to the software timer, periodic timer expiration callbacks
may be executed consecutively within a single clock interrupt handler, with
interrupts disabled until target_expiration is advanced to now. If this
situation persists for an extended period, it could result in a hard
lockup.
Here is a stack trace from a Windows VM that encountered a hard lockup
after resuming from a long suspend.
NMI watchdog: Watchdog detected hard LOCKUP on cpu 45
...
RIP: 0010:advance_periodic_target_expiration+0x4d/0x80 [kvm]
...
RSP: 0018:ff4f88f5d98d8ef0 EFLAGS: 00000046
RAX: fff0103f91be678e RBX: fff0103f91be678e RCX: 00843a7d9e127bcc
RDX: 0000000000000002 RSI: 0052ca4003697505 RDI: ff440d5bfbdbd500
RBP: ff440d5956f99200 R08: ff2ff2a42deb6a84 R09: 000000000002a6c0
R10: 0122d794016332b3 R11: 0000000000000000 R12: ff440db1af39cfc0
R13: ff440db1af39cfc0 R14: ffffffffc0d4a560 R15: ff440db1af39d0f8
FS: 00007f04a6ffd700(0000) GS:ff440db1af380000(0000) knlGS:000000e38a3b8000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000d5651feff8 CR3: 000000684e038002 CR4: 0000000000773ee0
PKRU: 55555554
Call Trace:
<IRQ>
apic_timer_fn+0x31/0x50 [kvm]
__hrtimer_run_queues+0x100/0x280
hrtimer_interrupt+0x100/0x210
? ttwu_do_wakeup+0x19/0x160
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
Moreover, if the suspend duration of the virtual machine is not long enough
to trigger a hard lockup in this scenario, then due to commit 98c25ead5eda
("KVM: VMX: Move preemption timer <=> hrtimer dance to common x86"), if the
guest is using the sw timer before blocking, it will continue to use the sw
timer after being woken up, and will not switch back to the hv timer until
the relevant APIC timer register is reprogrammed. Since the periodic timer
does not require frequent APIC timer register programming, the guest may
continue to use the software timer for an extended period.
This patch makes the following modification: When handling KVM periodic
timer expiration, if we find that the advanced target_expiration is still
less than now, we set target_expiration directly to now (just like how
update_target_expiration handles the remaining).
Fixes: d8f2f498d9ed ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode")
Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com>
---
arch/x86/kvm/lapic.c | 32 ++++++++++++++++++++++++--------
1 file changed, 24 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0ae7f913d782..bc082271c81c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2131,18 +2131,34 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
ktime_t delta;
/*
- * Synchronize both deadlines to the same time source or
- * differences in the periods (caused by differences in the
- * underlying clocks or numerical approximation errors) will
- * cause the two to drift apart over time as the errors
- * accumulate.
+ * Use kernel time as the time source for both deadlines so that they
+ * stay synchronized. Computing each deadline independently will cause
+ * the two deadlines to drift apart over time as differences in the
+ * periods accumulate, e.g. due to differences in the underlying clocks
+ * or numerical approximation errors.
*/
apic->lapic_timer.target_expiration =
ktime_add_ns(apic->lapic_timer.target_expiration,
apic->lapic_timer.period);
+
+ /*
+ * When the VM is suspended, the hv timer also stops advancing. After
+ * it is resumed, this may result in a large delta. If the
+ * target_expiration only advances by one period each time, it will
+ * cause KVM to frequently handle timer expirations.
+ */
+ if (apic->lapic_timer.period > 0 &&
+ ktime_before(apic->lapic_timer.target_expiration, now))
+ apic->lapic_timer.target_expiration = now;
+
delta = ktime_sub(apic->lapic_timer.target_expiration, now);
- apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
- nsec_to_cycles(apic->vcpu, delta);
+ apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl);
+ /*
+ * Note: delta must not be negative. Otherwise, blindly adding a
+ * negative delta could cause the deadline to become excessively large
+ * due to the tscdeadline being an unsigned value.
+ */
+ apic->lapic_timer.tscdeadline += nsec_to_cycles(apic->vcpu, delta);
}
static void start_sw_period(struct kvm_lapic *apic)
@@ -2972,7 +2988,7 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
if (lapic_is_periodic(apic)) {
advance_periodic_target_expiration(apic);
- hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
+ hrtimer_set_expires(&ktimer->timer, ktimer->target_expiration);
return HRTIMER_RESTART;
} else
return HRTIMER_NORESTART;
--
2.47.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH v5 0/1] KVM: x86: fix some kvm period timer BUG
2025-11-07 3:47 [PATCH v5 0/1] KVM: x86: fix some kvm period timer BUG fuqiang wang
2025-11-07 3:48 ` [PATCH v5 1/1] KVM: x86: Fix VM hard lockup after prolonged suspend with periodic HV timer fuqiang wang
@ 2025-11-11 0:28 ` Sean Christopherson
2025-11-11 8:14 ` fuqiang wang
1 sibling, 1 reply; 6+ messages in thread
From: Sean Christopherson @ 2025-11-11 0:28 UTC (permalink / raw)
To: fuqiang wang
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Marcelo Tosatti, H . Peter Anvin,
Maxim Levitsky, kvm, linux-kernel, yu chen, dongxu zhang
On Fri, Nov 07, 2025, fuqiang wang wrote:
> =================================
> Fix both issues in a single patch
> =================================
>
> In versions v2 and v3, I split these two issues into two separate patches
> for fixing. However, this caused patch 2 to revert some of the changes made
> by patch 1.
FWIW, my initial reaction was that I liked splitting this into two patches better,
but after digging through all the angles of this for a few hours, I agree that it's
better to skip the "don't let the delta go negative" patch, because that patch
really only addresses a symptom that shouldn't happen in the first place.
> In patch 4, I attempted to merge the two patches into one and tried to
> describe both issues in the commit message, but I did not do it well. In
> this version, I have included more details in the commit message and the
> cover letter.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v5 1/1] KVM: x86: Fix VM hard lockup after prolonged suspend with periodic HV timer
2025-11-07 3:48 ` [PATCH v5 1/1] KVM: x86: Fix VM hard lockup after prolonged suspend with periodic HV timer fuqiang wang
@ 2025-11-11 0:46 ` Sean Christopherson
2025-11-11 8:17 ` fuqiang wang
0 siblings, 1 reply; 6+ messages in thread
From: Sean Christopherson @ 2025-11-11 0:46 UTC (permalink / raw)
To: fuqiang wang
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Marcelo Tosatti, H . Peter Anvin,
Maxim Levitsky, kvm, linux-kernel, yu chen, dongxu zhang
On Fri, Nov 07, 2025, fuqiang wang wrote:
> When a VM is suspended while using the periodic HV timer, the KVM timer
> also ceases to advance. After the VM resumes from a prolonged suspend,
Maybe it's because I've been dealing with a variety of different suspend+resume
issues, but I find it confusing to talk about suspend+resume, partly because it's
not immediately clear if you're talking about _host_ suspend+resume versus guest
suspend+resume. And suspend+resume isn't the only way this issue could be hit,
e.g. userspace could simply stop running a VM for whatever reason and get the
same result.
I even think we should downplay the HV timer / VMX Preemption Timer angle to
some extent, because restarting the hrtimer if it expires while the vCPU is in
userspace is wasteful and unnecessary. And if we ever "fix" that, e.g. maybe by
not restarting the hrtimer if vcpu->wants_to_run is false, then the hard lockup
could be hittable even without the HV timer.
> there will be a huge gap between target_expiration and the current time.
> Because target_expiration is incremented by only one period on each KVM
> timer expiration, this leads to a series of KVM timer expirations occurring
> rapidly after the VM resumes.
>
> More critically, when the VM first triggers a periodic HV timer expiration
> after resuming, executing advance_periodic_target_expiration() advance
> target_expiration by one period, but it will still be earlier than the
> current time (now). As a result, delta may be calculated as a negative
> value. Subsequently, nsec_to_cycles() convert this delta into an absolute
> value larger than guest_l1_tsc, resulting in a negative tscdeadline. Since
> the hv timer supports a maximum bit width of cpu_preemption_timer_multi +
> 32, this causes the hv timer setup to fail and switch to the sw timer.
>
> After switching to the software timer, periodic timer expiration callbacks
> may be executed consecutively within a single clock interrupt handler, with
> interrupts disabled until target_expiration is advanced to now. If this
> situation persists for an extended period, it could result in a hard
> lockup.
>
> Here is a stack trace from a Windows VM that encountered a hard lockup
> after resuming from a long suspend.
>
> NMI watchdog: Watchdog detected hard LOCKUP on cpu 45
> ...
> RIP: 0010:advance_periodic_target_expiration+0x4d/0x80 [kvm]
> ...
> RSP: 0018:ff4f88f5d98d8ef0 EFLAGS: 00000046
> RAX: fff0103f91be678e RBX: fff0103f91be678e RCX: 00843a7d9e127bcc
> RDX: 0000000000000002 RSI: 0052ca4003697505 RDI: ff440d5bfbdbd500
> RBP: ff440d5956f99200 R08: ff2ff2a42deb6a84 R09: 000000000002a6c0
> R10: 0122d794016332b3 R11: 0000000000000000 R12: ff440db1af39cfc0
> R13: ff440db1af39cfc0 R14: ffffffffc0d4a560 R15: ff440db1af39d0f8
> FS: 00007f04a6ffd700(0000) GS:ff440db1af380000(0000) knlGS:000000e38a3b8000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000d5651feff8 CR3: 000000684e038002 CR4: 0000000000773ee0
> PKRU: 55555554
> Call Trace:
> <IRQ>
> apic_timer_fn+0x31/0x50 [kvm]
> __hrtimer_run_queues+0x100/0x280
> hrtimer_interrupt+0x100/0x210
> ? ttwu_do_wakeup+0x19/0x160
> smp_apic_timer_interrupt+0x6a/0x130
> apic_timer_interrupt+0xf/0x20
> </IRQ>
>
> Moreover, if the suspend duration of the virtual machine is not long enough
> to trigger a hard lockup in this scenario, due to the commit 98c25ead5eda
> ("KVM: VMX: Move preemption timer <=> hrtimer dance to common x86"), if the
> guest is using the sw timer before blocking, it will continue to use the sw
> timer after being woken up, and will not switch back to the hv timer until
> the relevant APIC timer register is reprogrammed. Since the periodic timer
> does not require frequent APIC timer register programming, the guest may
> continue to use the software timer for an extended period.
>
> This patch makes the following modification: When handling KVM periodic
> timer expiration, if we find that the advanced target_expiration is still
> less than now, we set target_expiration directly to now (just like how
> update_target_expiration handles the remaining).
Please don't use "this patch" (it's superfluous since this is obviously a patch),
or pronouns like "we" and "us", as pronouns are often ambiguous in the world of
KVM. E.g. "we" might mean "we the cloud provider", "we the VMM", "we as KVM",
"we as the host kernel", etc.
> Fixes: d8f2f498d9ed ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode")
> Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com>
> ---
> arch/x86/kvm/lapic.c | 32 ++++++++++++++++++++++++--------
> 1 file changed, 24 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 0ae7f913d782..bc082271c81c 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2131,18 +2131,34 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
> ktime_t delta;
>
> /*
> - * Synchronize both deadlines to the same time source or
> - * differences in the periods (caused by differences in the
> - * underlying clocks or numerical approximation errors) will
> - * cause the two to drift apart over time as the errors
> - * accumulate.
> + * Use kernel time as the time source for both deadlines so that they
> + * stay synchronized. Computing each deadline independently will cause
> + * the two deadlines to drift apart over time as differences in the
> + * periods accumulate, e.g. due to differences in the underlying clocks
> + * or numerical approximation errors.
> */
> apic->lapic_timer.target_expiration =
> ktime_add_ns(apic->lapic_timer.target_expiration,
> apic->lapic_timer.period);
> +
> + /*
> + * When the vm is suspend, the hv timer also stops advancing. After it
> + * is resumed, this may result in a large delta. If the
This definitely shouldn't talk about suspend+resume, because that's fully a VMM
concept and isn't the _only_ way to end up with a huge delta.
> + * target_expiration only advances by one period each time, it will
> + * cause KVM to frequently handle timer expirations.
It's also worth calling out that KVM will only deliver a single IRQ to the guest,
i.e. restarting the timer over and over (and over and over) is completely useless.
*Except* if KVM is posting IRQs across CPUs, e.g. running hrtimers on housekeeping
CPUs, in which case KVM really will deliver an interrupt storm to the guest.
But I don't want to try and special case that mode, because there's no guarantee
KVM can post a timer, and I don't want to have arbitrarily different guest-visible
behavior based on a fairly obscure module param (and other settings). So unless
I too am missing something, setting the target expiration to "now" seems like the
way to go.
> + */
> + if (apic->lapic_timer.period > 0 &&
apic->lapic_timer.period is guaranteed to be non-zero here. kvm_lapic_expired_hv_timer()
and start_sw_period() check it explicitly, and apic_timer_fn() should be unreachable
with period==0 (KVM is supposed to cancel the hrtimer before changing the period).
I do think it's worth hardening apic_timer_fn() though, but in a separate prep
patch.
> + ktime_before(apic->lapic_timer.target_expiration, now))
> + apic->lapic_timer.target_expiration = now;
> +
> delta = ktime_sub(apic->lapic_timer.target_expiration, now);
> - apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
> - nsec_to_cycles(apic->vcpu, delta);
> + apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl);
> + /*
> + * Note: delta must not be negative. Otherwise, blindly adding a
> + * negative delta could cause the deadline to become excessively large
> + * due to the tscdeadline being an unsigned value.
> + */
Rather than say "delta must not be negative", instead call out that the delta
_can't_ be negative thanks to the above calculation (which is also why moving
the "period > 0" check to the caller is important). And I think it's worth
putting this comment above the calculation of "delta" to preserve the existing
code that does "tscdead = l1_tsc + delta" (and it yields a smaller diff).
> + apic->lapic_timer.tscdeadline += nsec_to_cycles(apic->vcpu, delta);
> }
>
> static void start_sw_period(struct kvm_lapic *apic)
> @@ -2972,7 +2988,7 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
>
> if (lapic_is_periodic(apic)) {
> advance_periodic_target_expiration(apic);
> - hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
> + hrtimer_set_expires(&ktimer->timer, ktimer->target_expiration);
This can also go in a separate prep patch. Logically it stands on its own (use
the computed expiration instead of doing yet another calculation), and in case
something goes sideways, it'll give us another bisection point.
Lastly, adding a cleanup patch on top to capture apic->lapic_timer in a local
variable would make this code easier to read (I don't want to say "easy" to
read, but at least easier :-D).
All in all, I ended up with the below. I'll post a v6 tomorrow (I've got the
tweaks made locally) after testing. Thanks much!
static void advance_periodic_target_expiration(struct kvm_lapic *apic)
{
struct kvm_timer *ktimer = &apic->lapic_timer;
ktime_t now = ktime_get();
u64 tscl = rdtsc();
ktime_t delta;
/*
* Use kernel time as the time source for both the hrtimer deadline and
* TSC-based deadline so that they stay synchronized. Computing each
* deadline independently will cause the two deadlines to drift apart
* over time as differences in the periods accumulate, e.g. due to
* differences in the underlying clocks or numerical approximation errors.
*/
ktimer->target_expiration = ktime_add_ns(ktimer->target_expiration,
ktimer->period);
/*
* If the new expiration is in the past, e.g. because userspace stopped
* running the VM for an extended duration, then force the expiration
* to "now" and don't try to play catch-up with the missed events. KVM
* will only deliver a single interrupt regardless of how many events
* are pending, i.e. restarting the timer with an expiration in the
* past will do nothing more than waste host cycles, and can even lead
* to a hard lockup in extreme cases.
*/
if (ktime_before(ktimer->target_expiration, now))
ktimer->target_expiration = now;
/*
* Note, ensuring the expiration isn't in the past also prevents delta
* from going negative, which could cause the TSC deadline to become
* excessively large due to it being an unsigned value.
*/
delta = ktime_sub(ktimer->target_expiration, now);
ktimer->tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
nsec_to_cycles(apic->vcpu, delta);
}
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v5 0/1] KVM: x86: fix some kvm period timer BUG
2025-11-11 0:28 ` [PATCH v5 0/1] KVM: x86: fix some kvm period timer BUG Sean Christopherson
@ 2025-11-11 8:14 ` fuqiang wang
0 siblings, 0 replies; 6+ messages in thread
From: fuqiang wang @ 2025-11-11 8:14 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Marcelo Tosatti, H . Peter Anvin,
Maxim Levitsky, kvm, linux-kernel, yu chen, dongxu zhang
On 11/11/25 8:28 AM, Sean Christopherson wrote:
> On Fri, Nov 07, 2025, fuqiang wang wrote:
>> =================================
>> Fix both issues in a single patch
>> =================================
>>
>> In versions v2 and v3, I split these two issues into two separate patches
>> for fixing. However, this caused patch 2 to revert some of the changes made
>> by patch 1.
>
> FWIW, my initial reaction was that I liked splitting this into two patches better,
> but after digging through all the angles of this for a few hours, I agree that it's
> better to skip the "don't let the delta go negative" patch, because that patch
> really only addresses a symptom that shouldn't happen in the first place.
>
>> In patch 4, I attempted to merge the two patches into one and tried to
>> describe both issues in the commit message, but I did not do it well. In
>> this version, I have included more details in the commit message and the
>> cover letter.
Yes, the reason I posted the “don’t let the delta go negative” patch
separately in v1 was that I hadn’t yet identified the root cause of the
hard lockup at that time...
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v5 1/1] KVM: x86: Fix VM hard lockup after prolonged suspend with periodic HV timer
2025-11-11 0:46 ` Sean Christopherson
@ 2025-11-11 8:17 ` fuqiang wang
0 siblings, 0 replies; 6+ messages in thread
From: fuqiang wang @ 2025-11-11 8:17 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Marcelo Tosatti, H . Peter Anvin,
Maxim Levitsky, kvm, linux-kernel, yu chen, dongxu zhang
On 11/11/25 8:46 AM, Sean Christopherson wrote:
> On Fri, Nov 07, 2025, fuqiang wang wrote:
> > When a VM is suspended while using the periodic HV timer, the KVM timer
> > also ceases to advance. After the VM resumes from a prolonged suspend,
>
> Maybe it's because I've been dealing with a variety of different suspend+resume
> issue, but I find it confusing to talk about suspend+resume, partly because it's
> not immediately clear if you're talking about _host_ suspend+resume versus guest
> suspend+resume.
Yes, using guest suspend+resume is clearer.
> And suspend+resume isn't the only way this issue could be hit,
> e.g. userspace could simply stop running a VM for whatever reason and get the
> same result.
Yes, it just requires staying in user space for a rather long time (I haven’t
tested it specifically; it might take several days or even dozens of days). For
a comment or commit message in KVM, it is indeed better to make the
description more general, rather than restricting it to specific scenarios.
>
> I even think we should downplay the HV timer / VMX Preemption Timer angle to
> some extent, because restarting the hrtimer if it expires while the vCPU is in
> userspace is wasteful and unnecessary. And if we ever "fix" that, e.g. maybe by
> not restarting the hrtimer if vcpu->wants_to_run is false, then the hard lockup
> could be hittable even without the HV timer.
Yes, I also noticed this issue, but initially I didn’t have a good idea for how
to fix it, because KVM can’t determine how long the vCPU will stay in
userspace. If the guest is in user space only briefly, not restarting the KVM
software periodic timer is unreasonable. Using vcpu->wants_to_run seems good.
However, I am not very familiar with KVM_CAP_IMMEDIATE_EXIT. I will review its
history and try to run some tests.
> > This patch makes the following modification: When handling KVM periodic
> > timer expiration, if we find that the advanced target_expiration is still
> > less than now, we set target_expiration directly to now (just like how
> > update_target_expiration handles the remaining).
>
> Please don't use "this patch" (it's superfluous since this is obvioulsy a patch),
> or pronouns like "we" and "us" as pronouns are often ambiguous in the world of
> KVM. E.g. "we" might me "we the cloud provider", "we the VMM", "we as KVM", "we
> as the host kernel", etc.
Thank you for your patience. This is a mistake I shouldn’t have made, and I
will thoroughly review the code submission guidelines for the kernel and
KVM-x86 provided in the following links:
[1] https://docs.kernel.org/process/maintainer-kvm-x86.html#comments
[2] https://docs.kernel.org/process/submitting-patches.html#describe-your-changes
>
> > Fixes: d8f2f498d9ed ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode")
> > Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com>
> > ---
> > arch/x86/kvm/lapic.c | 32 ++++++++++++++++++++++++--------
> > 1 file changed, 24 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 0ae7f913d782..bc082271c81c 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2131,18 +2131,34 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
> > ktime_t delta;
> >
> > /*
> > - * Synchronize both deadlines to the same time source or
> > - * differences in the periods (caused by differences in the
> > - * underlying clocks or numerical approximation errors) will
> > - * cause the two to drift apart over time as the errors
> > - * accumulate.
> > + * Use kernel time as the time source for both deadlines so that they
> > + * stay synchronized. Computing each deadline independently will cause
> > + * the two deadlines to drift apart over time as differences in the
> > + * periods accumulate, e.g. due to differences in the underlying clocks
> > + * or numerical approximation errors.
> > */
> > apic->lapic_timer.target_expiration =
> > ktime_add_ns(apic->lapic_timer.target_expiration,
> > apic->lapic_timer.period);
> > +
> > + /*
> > + * When the vm is suspend, the hv timer also stops advancing. After it
> > + * is resumed, this may result in a large delta. If the
>
> This definitely shouldn't talk about suspend+resume, because that's fully a VMM
> concept and isn't the _only_ way to end up with a huge delta.
Yes, it's better!
>
> > + * target_expiration only advances by one period each time, it will
> > + * cause KVM to frequently handle timer expirations.
>
> It's also worth calling out that KVM will only deliver a single IRQ to the guest,
> i.e. restarting the timer over and over (and over and over) is completely useless.
> *Except* if KVM is posting IRQs across CPUs, e.g. running hrtimers on housekeeping
> CPUs, in which case KVM really will deliver an interrupt storm to the guest.
>
> But I don't want to try and special case that mode, because there's no guarantee
> KVM can post a timer, and I don't want to have arbitrarily different guest-visible
> behavior based on a fairly obscure module param (and other settings). So unless
> I too am missing something, setting the target expiration to "now" seems like the
> way to go.
Yep, I also think it’s better to handle this in a simple and consistent way.
Here are my thoughts:
There are two possible reasons why the new target_expiration might lag behind:
1. The KVM timer was stopped.
2. The period is smaller than the delay in processing the timer.
In the first case, the timer shouldn't be counting down because the vCPU is
stopped, so the LAPIC current count should stay as it was before stopping.
That would mean recording the delta between target_expiration and "now" before
stopping, and restoring it when the timer is restarted later.
***
In the second case, from the guest's perspective the timer source is simply not
very accurate and cannot deliver interrupts at precise times. In d8f2f498d9ed
("x86/kvm: fix LAPIC timer drift when guest uses periodic mode"), David made the
KVM periodic timer more accurate by adjusting the next target_expiration; as a
result, over a relatively long duration, the guest observes more accurate timer
interrupt counts and intervals.
However, that approach doesn't apply when the target_expiration is already
significantly behind. In that situation, even though the total number of timer
interrupts over a long duration may be correct, the interrupts themselves are
triggered too close together. The guest would prefer to see a regular timer
period rather than an almost frantic burst of timer interrupts. And as you
mentioned, if KVM is not posting timer IRQs across CPUs, the guest won't even
have a chance to see the interrupt storm, or even two consecutive interrupts.
***
Given all the situations above, I don't think it's worth adding a lot of
complexity for this; setting the target expiration to "now" is the simple and
consistent way to handle it.
> > + */
> > + if (apic->lapic_timer.period > 0 &&
>
> apic->lapic_timer.period is guaranteed to be zero here. kvm_lapic_expired_hv_timer()
> and start_sw_period() check it explicitly, and apic_timer_fn() should be unreachable
> with period==0 (KVM is supposed to cancel the hrtimer before changing the period).
>
Yes, that was my mistake. Thank you for pointing it out.
> I do think it's worth hardening apic_timer_fn() though, but in a separate prep
> patch.
>
> > + ktime_before(apic->lapic_timer.target_expiration, now))
> > + apic->lapic_timer.target_expiration = now;
> > +
> > delta = ktime_sub(apic->lapic_timer.target_expiration, now);
> > - apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
> > - nsec_to_cycles(apic->vcpu, delta);
> > + apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl);
> > + /*
> > + * Note: delta must not be negative. Otherwise, blindly adding a
> > + * negative delta could cause the deadline to become excessively large
> > + * due to the tscdeadline being an unsigned value.
> > + */
>
> Rather than say "delta must not be negative", instead call out that the delta
> _can't_ be negative thanks to the above calculation (which is also why moving
> the "period > 0" check to the caller is important). And I think it's worth
> putting this comment above the calculation of "delta" to preserve the existing
> code that does "tscdead = l1_tsc + delta" (and it yields a smaller diff).
Yes, it's better.
> > + apic->lapic_timer.tscdeadline += nsec_to_cycles(apic->vcpu, delta);
> > }
> >
> > static void start_sw_period(struct kvm_lapic *apic)
> > @@ -2972,7 +2988,7 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
> >
> > if (lapic_is_periodic(apic)) {
> > advance_periodic_target_expiration(apic);
> > - hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
> > + hrtimer_set_expires(&ktimer->timer, ktimer->target_expiration);
>
> This can also go in a separate prep patch. Logically it stands on its own (use
> the computed expiration instead of doing yet another calculation), and in case
> something goes sideways, it'll give us another bisection point.
>
> Lastly, adding a cleanup patch on top to capture apic->lapic_timer in a local
> variable would make this code easier to read (I don't want to say "easy" to
> read, but at least easier :-D).
Yes, it's better.
>
> All in all, I ended up with the below. I'll post a v6 tomorrow (I've got the
> tweaks made locally) after testing. Thanks much!
>
> static void advance_periodic_target_expiration(struct kvm_lapic *apic)
> {
> struct kvm_timer *ktimer = &apic->lapic_timer;
> ktime_t now = ktime_get();
> u64 tscl = rdtsc();
> ktime_t delta;
>
> /*
> * Use kernel time as the time source for both the hrtimer deadline and
> * TSC-based deadline so that they stay synchronized. Computing each
> * deadline independently will cause the two deadlines to drift apart
> * over time as differences in the periods accumulate, e.g. due to
> * differences in the underlying clocks or numerical approximation errors.
> */
> ktimer->target_expiration = ktime_add_ns(ktimer->target_expiration,
> ktimer->period);
>
> /*
> * If the new expiration is in the past, e.g. because userspace stopped
> * running the VM for an extended duration, then force the expiration
> * to "now" and don't try to play catch-up with the missed events. KVM
> * will only deliver a single interrupt regardless of how many events
> * are pending, i.e. restarting the timer with an expiration in the
> * past will do nothing more than waste host cycles, and can even lead
> * to a hard lockup in extreme cases.
> */
This comment is a significant improvement over v5...
> if (ktime_before(ktimer->target_expiration, now))
> ktimer->target_expiration = now;
>
> /*
> * Note, ensuring the expiration isn't in the past also prevents delta
> * from going negative, which could cause the TSC deadline to become
> * excessively large due to it being an unsigned value.
> */
> delta = ktime_sub(ktimer->target_expiration, now);
> ktimer->tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
> nsec_to_cycles(apic->vcpu, delta);
> }
I really appreciate your patient corrections and advice. They have helped me
learn how to submit a good patch, and I will follow these conventions going
forward.
Regards,
fuqiang