* Broken udelay() on KVM host with a vcpu loaded @ 2026-02-10 12:27 Quentin Perret 2026-02-10 12:52 ` Marc Zyngier 0 siblings, 1 reply; 9+ messages in thread From: Quentin Perret @ 2026-02-10 12:27 UTC (permalink / raw) To: kvmarm Cc: maz, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will Hi all, I have just received a report from a partner of udelay misbehaving when running on the host whilst a vCPU is loaded. This hardware has FEAT_WFxT and uses the matching implementation of udelay. Interestingly, WFIT triggers using CNTVCT_EL0 unconditionally, but with KVM the host/guest switch for that happens from the preempt notifiers/vcpu_put which aren't invoked when e.g. handling an IRQ. Meanwhile, udelay reads the arch timer to set the waiting time for WFIT using an absolute value, and that gets compared to CNTVCT_EL0 which in the aforementioned IRQ-with-vCPU-loaded case uses the _guest's_ CNTVCT_EL0. I can think of two approaches to address the problem: 1. have KVM context switch cntvoff proactively prior to re-enabling preemption when handling a guest exit; 2. modify the WFIT-based udelay implementation to read from CNTVCT_EL0 instead of the arch_timer to be a bit more self-consistent; Other ideas welcome! Thanks, Quentin ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-10 12:27 Broken udelay() on KVM host with a vcpu loaded Quentin Perret @ 2026-02-10 12:52 ` Marc Zyngier 2026-02-10 15:34 ` Will Deacon 0 siblings, 1 reply; 9+ messages in thread From: Marc Zyngier @ 2026-02-10 12:52 UTC (permalink / raw) To: Quentin Perret Cc: kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will On Tue, 10 Feb 2026 12:27:48 +0000, Quentin Perret <qperret@google.com> wrote: > > Hi all, > > I have just received a report from a partner of udelay misbehaving when > running on the host whilst a vCPU is loaded. This hardware has FEAT_WFxT > and uses the matching implementation of udelay. Interestingly, WFIT > triggers using CNTVCT_EL0 unconditionally, but with KVM the host/guest > switch for that happens from the preempt notifiers/vcpu_put which aren't > invoked when e.g. handling an IRQ. Meanwhile, udelay reads the arch > timer to set the waiting time for WFIT using an absolute value, and that > gets compared to CNTVCT_EL0 which in the aforementioned > IRQ-with-vCPU-loaded case uses the _guest's_ CNTVCT_EL0. Well, the underlying issue is that get_cycles(), as used by __delay(), is *either* using CNTVCT_EL0 (when booted at EL1) or CNTPCT_EL0 (when booted at EL2). > > I can think of two approaches to address the problem: > 1. have KVM context switch cntvoff proactively prior to re-enabling > preemption when handling a guest exit; > 2. modify the WFIT-based udelay implementation to read from CNTVCT_EL0 > instead of the arch_timer to be a bit more self-consistent; > > Other ideas welcome! (1) is a real nightmare, and would force a complete redesign of the life cycle of guest timers (switching from load/put to enter/exit for the context switch, but only on !VHE). I'd rather avoid that, as this is a pretty large performance penalty. (2) is much more palatable, and easily hacked, see below. Can you please give it a go? Thanks, M. 
From b1b45d591aed3e5276ff857dbc6cfa3bce181766 Mon Sep 17 00:00:00 2001 From: Marc Zyngier <maz@kernel.org> Date: Tue, 10 Feb 2026 12:43:07 +0000 Subject: [PATCH] arm64: Force the use of CNTVCT_EL0 in __delay() Quentin reports an interesting problem with the use of WFxT in __delay() when a vcpu is loaded and KVM is *not* in VHE mode. In this case, CNTVOFF_EL2 is set to a non-zero value to reflect the state of the guest virtual counter. At the same time, __delay() is using get_cycles() to read the counter value, which is indirected to reading CNTPCT_EL0. The core of the issue is that WFxT is using the *virtual* counter, while the kernel is using the physical counter, and that the offset introduces a really bad discrepancy between the two. Fix this by forcing the use of CNTVCT_EL0, making __delay() consistent irrespective of the value of CNTVOFF_EL2. Reported-by: Quentin Perret <qperret@google.com> Fixes: 7d26b0516a0df ("arm64: Use WFxT for __delay() when possible") Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/ktosachvft2cgqd5qkukn275ugmhy6xrhxur4zqpdxlfr3qh5h@o3zrfnsq63od Cc: stable@vger.kernel.org --- arch/arm64/lib/delay.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c index cb2062e7e2340..26a39bb301ef6 100644 --- a/arch/arm64/lib/delay.c +++ b/arch/arm64/lib/delay.c @@ -23,9 +23,16 @@ static inline unsigned long xloops_to_cycles(unsigned long xloops) return (xloops * loops_per_jiffy * HZ) >> 32; } +/* + * Force the use of CNTVCT_EL0 in order to have the same base as + * WFxT. This avoids some annoying issues when CNTVOFF_EL2 is not + * reset to 0 on a KVM host until we do a vcpu_put() on the vcpu. 
+ */ +#define __delay_cycles() __arch_counter_get_cntvct_stable() + void __delay(unsigned long cycles) { - cycles_t start = get_cycles(); + cycles_t start = __delay_cycles(); if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) { u64 end = start + cycles; @@ -35,17 +42,17 @@ void __delay(unsigned long cycles) * early, use a WFET loop to complete the delay. */ wfit(end); - while ((get_cycles() - start) < cycles) + while ((__delay_cycles() - start) < cycles) wfet(end); } else if (arch_timer_evtstrm_available()) { const cycles_t timer_evt_period = USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US); - while ((get_cycles() - start + timer_evt_period) < cycles) + while ((__delay_cycles() - start + timer_evt_period) < cycles) wfe(); } - while ((get_cycles() - start) < cycles) + while ((__delay_cycles() - start) < cycles) cpu_relax(); } EXPORT_SYMBOL(__delay); -- 2.47.3 -- Without deviation from the norm, progress is not possible. ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-10 12:52 ` Marc Zyngier @ 2026-02-10 15:34 ` Will Deacon 2026-02-10 15:58 ` Quentin Perret 2026-02-10 19:46 ` Marc Zyngier 0 siblings, 2 replies; 9+ messages in thread From: Will Deacon @ 2026-02-10 15:34 UTC (permalink / raw) To: Marc Zyngier Cc: Quentin Perret, kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas On Tue, Feb 10, 2026 at 12:52:21PM +0000, Marc Zyngier wrote: > On Tue, 10 Feb 2026 12:27:48 +0000, > Quentin Perret <qperret@google.com> wrote: > > > > Hi all, > > > > I have just received a report from a partner of udelay misbehaving when > > running on the host whilst a vCPU is loaded. This hardware has FEAT_WFxT > > and uses the matching implementation of udelay. Interestingly, WFIT > > triggers using CNTVCT_EL0 unconditionally, but with KVM the host/guest > > switch for that happens from the preempt notifiers/vcpu_put which aren't > > invoked when e.g. handling an IRQ. Meanwhile, udelay reads the arch > > timer to set the waiting time for WFIT using an absolute value, and that > > gets compared to CNTVCT_EL0 which in the aforementioned > > IRQ-with-vCPU-loaded case uses the _guest's_ CNTVCT_EL0. > > Well, the underlying issue is that get_cycles(), as used by __delay(), > is *either* using CNTVCT_EL0 (when booted at EL1) or CNTPCT_EL0 (when > booted at EL2). > > > > > > I can think of two approaches to address the problem: > > 1. have KVM context switch cntvoff proactively prior to re-enabling > > preemption when handling a guest exit; > > 2. modify the WFIT-based udelay implementation to read from CNTVCT_EL0 > > instead of the arch_timer to be a bit more self-consistent; > > > > Other ideas welcome! I suppose a third option would be to avoid WFxT when we're on a CPU with a vCPU loaded on it? The per-cpu state is grotty, mind. A bigger hammer would be to use WFxT only when using VHE. 
> (1) is a real nightmare, and would force a complete redesign of the > life cycle of guest timers (switching from load/put to enter/exit for > the context switch, but only on !VHE). I'd rather avoid that, as this > is a pretty large performance penalty. > > (2) is much more palatable, and easily hacked, see below. Can you > please give it a go? > > Thanks, > > M. > > From b1b45d591aed3e5276ff857dbc6cfa3bce181766 Mon Sep 17 00:00:00 2001 > From: Marc Zyngier <maz@kernel.org> > Date: Tue, 10 Feb 2026 12:43:07 +0000 > Subject: [PATCH] arm64: Force the use of CNTVCT_EL0 in __delay() > > Quentin reports an interesting problem with the use of WFxT in __delay() > when a vcpu is loaded and KVM is *not* in VHE mode. > > In this case, CNTVOFF_EL2 is set to a non-zero value to reflect the > state of the guest virtual counter. At the same time, __delay() is > using get_cycles() to read the counter value, which is indirected to > reading CNTPCT_EL0. > > The core of the issue is that WFxT is using the *virtual* counter, > while the kernel is using the physical counter, and that the offset > introduces a really bad discrepancy between the two. > > Fix this by forcing the use of CNTVCT_EL0, making __delay() consistent > irrespective of the value of CNTVOFF_EL2. 
> > Reported-by: Quentin Perret <qperret@google.com> > Fixes: 7d26b0516a0df ("arm64: Use WFxT for __delay() when possible") > Signed-off-by: Marc Zyngier <maz@kernel.org> > Link: https://lore.kernel.org/r/ktosachvft2cgqd5qkukn275ugmhy6xrhxur4zqpdxlfr3qh5h@o3zrfnsq63od > Cc: stable@vger.kernel.org > --- > arch/arm64/lib/delay.c | 15 +++++++++++---- > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c > index cb2062e7e2340..26a39bb301ef6 100644 > --- a/arch/arm64/lib/delay.c > +++ b/arch/arm64/lib/delay.c > @@ -23,9 +23,16 @@ static inline unsigned long xloops_to_cycles(unsigned long xloops) > return (xloops * loops_per_jiffy * HZ) >> 32; > } > > +/* > + * Force the use of CNTVCT_EL0 in order to have the same base as > + * WFxT. This avoids some annoying issues when CNTVOFF_EL2 is not > + * reset to 0 on a KVM host until we do a vcpu_put() on the vcpu. > + */ > +#define __delay_cycles() __arch_counter_get_cntvct_stable() > + > void __delay(unsigned long cycles) > { > - cycles_t start = get_cycles(); > + cycles_t start = __delay_cycles(); > > if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) { > u64 end = start + cycles; > @@ -35,17 +42,17 @@ void __delay(unsigned long cycles) > * early, use a WFET loop to complete the delay. > */ > wfit(end); > - while ((get_cycles() - start) < cycles) > + while ((__delay_cycles() - start) < cycles) > wfet(end); > } else if (arch_timer_evtstrm_available()) { > const cycles_t timer_evt_period = > USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US); > > - while ((get_cycles() - start + timer_evt_period) < cycles) > + while ((__delay_cycles() - start + timer_evt_period) < cycles) > wfe(); > } > > - while ((get_cycles() - start) < cycles) > + while ((__delay_cycles() - start) < cycles) > cpu_relax(); I can't put my finger on a specific bug, but it does feel pretty scary to use the virtual counter in the host while running with a guest voffset. Is the offset userspace controllable? 
If so, can we guarantee that we're ok with overflow and is it ok if we get preempted in the middle of the loop? It's making my head hurt! The smp_cond_load_{relaxed,acquire}_timeout() series from Ankur looks like it has the same issue (it uses arch_timer_read_counter()). Will ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-10 15:34 ` Will Deacon @ 2026-02-10 15:58 ` Quentin Perret 2026-02-10 19:54 ` Marc Zyngier 2026-02-10 19:46 ` Marc Zyngier 1 sibling, 1 reply; 9+ messages in thread From: Quentin Perret @ 2026-02-10 15:58 UTC (permalink / raw) To: Will Deacon Cc: Marc Zyngier, kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas On Tuesday 10 Feb 2026 at 15:34:22 (+0000), Will Deacon wrote: > On Tue, Feb 10, 2026 at 12:52:21PM +0000, Marc Zyngier wrote: > > On Tue, 10 Feb 2026 12:27:48 +0000, > > Quentin Perret <qperret@google.com> wrote: > > > > > > Hi all, > > > > > > I have just received a report from a partner of udelay misbehaving when > > > running on the host whilst a vCPU is loaded. This hardware has FEAT_WFxT > > > and uses the matching implementation of udelay. Interestingly, WFIT > > > triggers using CNTVCT_EL0 unconditionally, but with KVM the host/guest > > > switch for that happens from the preempt notifiers/vcpu_put which aren't > > > invoked when e.g. handling an IRQ. Meanwhile, udelay reads the arch > > > timer to set the waiting time for WFIT using an absolute value, and that > > > gets compared to CNTVCT_EL0 which in the aforementioned > > > IRQ-with-vCPU-loaded case uses the _guest's_ CNTVCT_EL0. > > > > Well, the underlying issue is that get_cycles(), as used by __delay(), > > is *either* using CNTVCT_EL0 (when booted at EL1) or CNTPCT_EL0 (when > > booted at EL2). > > > > > > > > I can think of two approaches to address the problem: > > > 1. have KVM context switch cntvoff proactively prior to re-enabling > > > preemption when handling a guest exit; > > > 2. modify the WFIT-based udelay implementation to read from CNTVCT_EL0 > > > instead of the arch_timer to be a bit more self-consistent; > > > > > > Other ideas welcome! > > I suppose a third option would be to avoid WFxT when we're on a CPU > with a vCPU loaded on it? The per-cpu state is grotty, mind. 
It's one of the rare cases where we could look at that per-cpu data in preemptible context, because the preempt notifiers would move that state along with the task if it were migrated to a different CPU, so that's nice. > A bigger hammer would be to use WFxT only when using VHE. > > > (1) is a real nightmare, and would force a complete redesign of the > > life cycle of guest timers (switching from load/put to enter/exit for > > the context switch, but only on !VHE). I'd rather avoid that, as this > > is a pretty large performance penalty. > > > > (2) is much more palatable, and easily hacked, see below. Can you > > please give it a go? > > > > Thanks, > > > > M. > > > > From b1b45d591aed3e5276ff857dbc6cfa3bce181766 Mon Sep 17 00:00:00 2001 > > From: Marc Zyngier <maz@kernel.org> > > Date: Tue, 10 Feb 2026 12:43:07 +0000 > > Subject: [PATCH] arm64: Force the use of CNTVCT_EL0 in __delay() > > > > Quentin reports an interesting problem with the use of WFxT in __delay() > > when a vcpu is loaded and KVM is *not* in VHE mode. > > > > In this case, CNTVOFF_EL2 is set to a non-zero value to reflect the > > state of the guest virtual counter. At the same time, __delay() is > > using get_cycles() to read the counter value, which is indirected to > > reading CNTPCT_EL0. > > > > The core of the issue is that WFxT is using the *virtual* counter, > > while the kernel is using the physical counter, and that the offset > > introduces a really bad discrepancy between the two. > > > > Fix this by forcing the use of CNTVCT_EL0, making __delay() consistent > > irrespective of the value of CNTVOFF_EL2. 
> > > > Reported-by: Quentin Perret <qperret@google.com> > > Fixes: 7d26b0516a0df ("arm64: Use WFxT for __delay() when possible") > > Signed-off-by: Marc Zyngier <maz@kernel.org> > > Link: https://lore.kernel.org/r/ktosachvft2cgqd5qkukn275ugmhy6xrhxur4zqpdxlfr3qh5h@o3zrfnsq63od > > Cc: stable@vger.kernel.org > > --- > > arch/arm64/lib/delay.c | 15 +++++++++++---- > > 1 file changed, 11 insertions(+), 4 deletions(-) > > > > diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c > > index cb2062e7e2340..26a39bb301ef6 100644 > > --- a/arch/arm64/lib/delay.c > > +++ b/arch/arm64/lib/delay.c > > @@ -23,9 +23,16 @@ static inline unsigned long xloops_to_cycles(unsigned long xloops) > > return (xloops * loops_per_jiffy * HZ) >> 32; > > } > > > > +/* > > + * Force the use of CNTVCT_EL0 in order to have the same base as > > + * WFxT. This avoids some annoying issues when CNTVOFF_EL2 is not > > + * reset to 0 on a KVM host until we do a vcpu_put() on the vcpu. > > + */ > > +#define __delay_cycles() __arch_counter_get_cntvct_stable() > > + > > void __delay(unsigned long cycles) > > { > > - cycles_t start = get_cycles(); > > + cycles_t start = __delay_cycles(); > > > > if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) { > > u64 end = start + cycles; > > @@ -35,17 +42,17 @@ void __delay(unsigned long cycles) > > * early, use a WFET loop to complete the delay. 
> > */ > > wfit(end); > > - while ((get_cycles() - start) < cycles) > > + while ((__delay_cycles() - start) < cycles) > > wfet(end); > > } else if (arch_timer_evtstrm_available()) { > > const cycles_t timer_evt_period = > > USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US); > > > > - while ((get_cycles() - start + timer_evt_period) < cycles) > > + while ((__delay_cycles() - start + timer_evt_period) < cycles) > > wfe(); > > } > > > > - while ((get_cycles() - start) < cycles) > > + while ((__delay_cycles() - start) < cycles) > > cpu_relax(); > > I can't put my finger on a specific bug, but it does feel pretty scary > to use the virtual counter in the host while running with a guest > voffset. Is the offset userspace controllable? If so, can we guarantee > that we're ok with overflow and is it ok if we get preempted in the > middle of the loop? It's making my head hurt! Ouch, it does seem that the SET_ONE_REG stuff allows messing with that value _out of vcpu context_, so yeah userspace could change the value while a vcpu thread is preempted in the middle of a udelay loop ... > The smp_cond_load_{relaxed,acquire}_timeout() series from Ankur looks > like it has the same issue (it uses arch_timer_read_counter()). > > Will ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-10 15:58 ` Quentin Perret @ 2026-02-10 19:54 ` Marc Zyngier 2026-02-13 11:50 ` Will Deacon 2026-02-13 14:05 ` Quentin Perret 0 siblings, 2 replies; 9+ messages in thread From: Marc Zyngier @ 2026-02-10 19:54 UTC (permalink / raw) To: Quentin Perret Cc: Will Deacon, kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas On Tue, 10 Feb 2026 15:58:14 +0000, Quentin Perret <qperret@google.com> wrote: > > Ouch, it does seem that the SET_ONE_REG stuff allows messing with that > value _out of vcpu context_, so yeah userspace could change the value > while a vcpu thread is preempted in the middle of a udelay loop ... I don't think so. You can only do that on the vcpu fd, and if the vcpu is loaded, it means that you are already holding the vcpu mutex. And if you do that from the vcpu thread, then you already have done a vcpu_put(). M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-10 19:54 ` Marc Zyngier @ 2026-02-13 11:50 ` Will Deacon 2026-02-13 13:52 ` Marc Zyngier 2026-02-13 14:05 ` Quentin Perret 1 sibling, 1 reply; 9+ messages in thread From: Will Deacon @ 2026-02-13 11:50 UTC (permalink / raw) To: Marc Zyngier Cc: Quentin Perret, kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas On Tue, Feb 10, 2026 at 07:54:36PM +0000, Marc Zyngier wrote: > On Tue, 10 Feb 2026 15:58:14 +0000, > Quentin Perret <qperret@google.com> wrote: > > > > Ouch, it does seem that the SET_ONE_REG stuff allows messing with that > > value _out of vcpu context_, so yeah userspace could change the value > > while a vcpu thread is preempted in the middle of a udelay loop ... > > I don't think so. You can only do that on the vcpu fd, and if the vcpu > is loaded, it means that you are already holding the vcpu mutex. And > if you do that from the vcpu thread, then you already have done a > vcpu_put(). Just working this through, I think you're right that the vCPU mutex saves us here (thanks!), although that's because it's held across the duration of the KVM_RUN ioctl() and not because of the vcpu_put() (which would run off the back of the preempt notifier if we were preempted anyway). It would be good to capture that in a comment if we go with your approach of using the virtual counter on the host. There's also a VM ioctl for messing with the offset (KVM_ARM_SET_COUNTER_OFFSET) but that uses kvm_trylock_all_vcpus() so it's fine. Will ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-13 11:50 ` Will Deacon @ 2026-02-13 13:52 ` Marc Zyngier 0 siblings, 0 replies; 9+ messages in thread From: Marc Zyngier @ 2026-02-13 13:52 UTC (permalink / raw) To: Will Deacon Cc: Quentin Perret, kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas On Fri, 13 Feb 2026 11:50:17 +0000, Will Deacon <will@kernel.org> wrote: > > On Tue, Feb 10, 2026 at 07:54:36PM +0000, Marc Zyngier wrote: > > On Tue, 10 Feb 2026 15:58:14 +0000, > > Quentin Perret <qperret@google.com> wrote: > > > > > > Ouch, it does seem that the SET_ONE_REG stuff allows messing with that > > > value _out of vcpu context_, so yeah userspace could change the value > > > while a vcpu thread is preempted in the middle of a udelay loop ... > > > > I don't think so. You can only do that on the vcpu fd, and if the vcpu > > is loaded, it means that you are already holding the vcpu mutex. And > > if you do that from the vcpu thread, then you already have done a > > vcpu_put(). > > Just working this through, I think you're right that the vCPU mutex > saves us here (thanks!), although that's because it's held across the > duration of the KVM_RUN ioctl() and not because of the vcpu_put() (which > would run off the back of the preempt notifier if we were preempted > anyway). Duh, yes. Sorry about the confusing explanation. It was perfectly clear in my head, I swear! > It would be good to capture that in a comment if we go with your > approach of using the virtual counter on the host. Sure, I'll add a comment to that effect and post it shortly. Thanks, M. -- Jazz isn't dead. It just smells funny. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-10 19:54 ` Marc Zyngier 2026-02-13 11:50 ` Will Deacon @ 2026-02-13 14:05 ` Quentin Perret 1 sibling, 0 replies; 9+ messages in thread From: Quentin Perret @ 2026-02-13 14:05 UTC (permalink / raw) To: Marc Zyngier Cc: Will Deacon, kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, hyesoo.yu On Tuesday 10 Feb 2026 at 19:54:36 (+0000), Marc Zyngier wrote: > On Tue, 10 Feb 2026 15:58:14 +0000, > Quentin Perret <qperret@google.com> wrote: > > > > Ouch, it does seem that the SET_ONE_REG stuff allows messing with that > > value _out of vcpu context_, so yeah userspace could change the value > > while a vcpu thread is preempted in the middle of a udelay loop ... > > I don't think so. You can only do that on the vcpu fd, and if the vcpu > is loaded, it means that you are already holding the vcpu mutex. And > if you do that from the vcpu thread, then you already have done a > vcpu_put(). Aha, I had missed the locking here -- reasoning about SET_ONE_REG being issued on a running vCPU was 'interesting' :-). Reviewed-by: Quentin Perret <qperret@google.com> And also, I'm only relaying a report that we got from a partner, but we might want to credit the original reporter: Reported-by: Hyesoo Yu <hyesoo.yu@samsung.com> Thanks for the patch! Quentin ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-10 15:34 ` Will Deacon 2026-02-10 15:58 ` Quentin Perret @ 2026-02-10 19:46 ` Marc Zyngier 1 sibling, 0 replies; 9+ messages in thread From: Marc Zyngier @ 2026-02-10 19:46 UTC (permalink / raw) To: Will Deacon Cc: Quentin Perret, kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas On Tue, 10 Feb 2026 15:34:22 +0000, Will Deacon <will@kernel.org> wrote: > > On Tue, Feb 10, 2026 at 12:52:21PM +0000, Marc Zyngier wrote: > > On Tue, 10 Feb 2026 12:27:48 +0000, > > Quentin Perret <qperret@google.com> wrote: > > > > > > Hi all, > > > > > > I have just received a report from a partner of udelay misbehaving when > > > running on the host whilst a vCPU is loaded. This hardware has FEAT_WFxT > > > and uses the matching implementation of udelay. Interestingly, WFIT > > > triggers using CNTVCT_EL0 unconditionally, but with KVM the host/guest > > > switch for that happens from the preempt notifiers/vcpu_put which aren't > > > invoked when e.g. handling an IRQ. Meanwhile, udelay reads the arch > > > timer to set the waiting time for WFIT using an absolute value, and that > > > gets compared to CNTVCT_EL0 which in the aforementioned > > > IRQ-with-vCPU-loaded case uses the _guest's_ CNTVCT_EL0. > > > > Well, the underlying issue is that get_cycles(), as used by __delay(), > > is *either* using CNTVCT_EL0 (when booted at EL1) or CNTPCT_EL0 (when > > booted at EL2). > > > > > > > > I can think of two approaches to address the problem: > > > 1. have KVM context switch cntvoff proactively prior to re-enabling > > > preemption when handling a guest exit; > > > 2. modify the WFIT-based udelay implementation to read from CNTVCT_EL0 > > > instead of the arch_timer to be a bit more self-consistent; > > > > > > Other ideas welcome! > > I suppose a third option would be to avoid WFxT when we're on a CPU > with a vCPU loaded on it? The per-cpu state is grotty, mind. 
> A bigger > hammer would be to use WFxT only when using VHE. Hmmm. Sure, but that's rather ugly. > > > (1) is a real nightmare, and would force a complete redesign of the > > life cycle of guest timers (switching from load/put to enter/exit for > > the context switch, but only on !VHE). I'd rather avoid that, as this > > is a pretty large performance penalty. > > > > (2) is much more palatable, and easily hacked, see below. Can you > > please give it a go? > > > > Thanks, > > > > M. > > > > From b1b45d591aed3e5276ff857dbc6cfa3bce181766 Mon Sep 17 00:00:00 2001 > > From: Marc Zyngier <maz@kernel.org> > > Date: Tue, 10 Feb 2026 12:43:07 +0000 > > Subject: [PATCH] arm64: Force the use of CNTVCT_EL0 in __delay() > > > > Quentin reports an interesting problem with the use of WFxT in __delay() > > when a vcpu is loaded and KVM is *not* in VHE mode. > > > > In this case, CNTVOFF_EL2 is set to a non-zero value to reflect the > > state of the guest virtual counter. At the same time, __delay() is > > using get_cycles() to read the counter value, which is indirected to > > reading CNTPCT_EL0. > > > > The core of the issue is that WFxT is using the *virtual* counter, > > while the kernel is using the physical counter, and that the offset > > introduces a really bad discrepancy between the two. > > > > Fix this by forcing the use of CNTVCT_EL0, making __delay() consistent > > irrespective of the value of CNTVOFF_EL2. 
> > > > Reported-by: Quentin Perret <qperret@google.com> > > Fixes: 7d26b0516a0df ("arm64: Use WFxT for __delay() when possible") > > Signed-off-by: Marc Zyngier <maz@kernel.org> > > Link: https://lore.kernel.org/r/ktosachvft2cgqd5qkukn275ugmhy6xrhxur4zqpdxlfr3qh5h@o3zrfnsq63od > > Cc: stable@vger.kernel.org > > --- > > arch/arm64/lib/delay.c | 15 +++++++++++---- > > 1 file changed, 11 insertions(+), 4 deletions(-) > > > > diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c > > index cb2062e7e2340..26a39bb301ef6 100644 > > --- a/arch/arm64/lib/delay.c > > +++ b/arch/arm64/lib/delay.c > > @@ -23,9 +23,16 @@ static inline unsigned long xloops_to_cycles(unsigned long xloops) > > return (xloops * loops_per_jiffy * HZ) >> 32; > > } > > > > +/* > > + * Force the use of CNTVCT_EL0 in order to have the same base as > > + * WFxT. This avoids some annoying issues when CNTVOFF_EL2 is not > > + * reset to 0 on a KVM host until we do a vcpu_put() on the vcpu. > > + */ > > +#define __delay_cycles() __arch_counter_get_cntvct_stable() > > + > > void __delay(unsigned long cycles) > > { > > - cycles_t start = get_cycles(); > > + cycles_t start = __delay_cycles(); > > > > if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) { > > u64 end = start + cycles; > > @@ -35,17 +42,17 @@ void __delay(unsigned long cycles) > > * early, use a WFET loop to complete the delay. 
> > */ > > wfit(end); > > - while ((get_cycles() - start) < cycles) > > + while ((__delay_cycles() - start) < cycles) > > wfet(end); > > } else if (arch_timer_evtstrm_available()) { > > const cycles_t timer_evt_period = > > USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US); > > > > - while ((get_cycles() - start + timer_evt_period) < cycles) > > + while ((__delay_cycles() - start + timer_evt_period) < cycles) > > wfe(); > > } > > > > - while ((get_cycles() - start) < cycles) > > + while ((__delay_cycles() - start) < cycles) > > cpu_relax(); > > I can't put my finger on a specific bug, but it does feel pretty scary > to use the virtual counter in the host while running with a guest > voffset. Is the offset userspace controllable? It is, but not while the vcpu is loaded. You need a full put/load sequence to get there (that's the only way to set CNTVOFF_EL2 from EL1). And since the vcpu is loaded, no other thread can interact with the vcpu fd (the vcpu mutex is held). > If so, can we guarantee > that we're ok with overflow and is it ok if we get preempted in the > middle of the loop? It's making my head hurt! Again, I don't see that it can happen. > > The smp_cond_load_{relaxed,acquire}_timeout() series from Ankur looks > like it has the same issue (it uses arch_timer_read_counter()). I haven't reviewed it in a long while, but if it is using WFxT, it is likely broken the same way. Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 9+ messages in thread