All of lore.kernel.org
 help / color / mirror / Atom feed
From: Marc Zyngier <maz@kernel.org>
To: Will Deacon <will@kernel.org>
Cc: Quentin Perret <qperret@google.com>,
	kvmarm@lists.linux.dev, oupton@kernel.org, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com
Subject: Re: Broken udelay() on KVM host with a vcpu loaded
Date: Tue, 10 Feb 2026 19:46:53 +0000	[thread overview]
Message-ID: <86pl6cbcuq.wl-maz@kernel.org> (raw)
In-Reply-To: <aYtP_rgolLfm_UDD@willie-the-truck>

On Tue, 10 Feb 2026 15:34:22 +0000,
Will Deacon <will@kernel.org> wrote:
> 
> On Tue, Feb 10, 2026 at 12:52:21PM +0000, Marc Zyngier wrote:
> > On Tue, 10 Feb 2026 12:27:48 +0000,
> > Quentin Perret <qperret@google.com> wrote:
> > > 
> > > Hi all,
> > > 
> > > I have just received a report from a partner of udelay misbehaving when
> > > running on the host whilst a vCPU is loaded. This hardware has FEAT_WFxT
> > > and uses the matching implementation of udelay. Interestingly, WFIT
> > > triggers using CNTVCT_EL0 unconditionally, but with KVM the host/guest
> > > switch for that happens from the preempt notifiers/vcpu_put which aren't
> > > invoked when e.g. handling an IRQ. Interestingly, udelay reads the arch
> > > timer to set the waiting time for WFIT using an absolute value, and that
> > > gets compared to CNTVCT_EL0 which in the aforementioned
> > > IRQ-with-vCPU-loaded case uses the _guest's_ CNTVCT_EL0.
> > 
> > Well, the underlying issue is that get_cycle(), as used by  __delay(),
> > is *either* using CNTVCT_EL0 (when booted at EL1) or CNTPCT_EL0 (when
> > booted at EL2).
> > 
> > > 
> > > I can think of two approaches to address the problem:
> > >   1. have KVM context switch cntvoff proactively prior to re-enabling
> > >   preemption when handling a guest exit;
> > >   2. modify the WFIT-based udelay implementation to read from CNTVCT_EL0
> > >   instead of the arch_timer to be a bit more self-consitent;
> > > 
> > > Other ideas welcome!
> 
> I suppose a third option would be to avoid WFxT when we're on a CPU
> with a vCPU loaded on it? The per-cpu state is grotty, mind. A bigger
> hammer would be to use WFxT only when using VHE.

Hmmm. Sure, but that's rather ugly.

> 
> > (1) is a real nightmare, and would force a complete redesign of the
> > life cycle of guest timers (switching from load/put to enter/exit for
> > the context switch, but only on !VHE). I'd rather avoid that, as this
> > is a pretty large performance penalty.
> > 
> > (2) is much more palatable, and easily hacked, see below. Can you
> > please five it a go?
> > 
> > Thanks,
> > 
> > 	M.
> > 
> > From b1b45d591aed3e5276ff857dbc6cfa3bce181766 Mon Sep 17 00:00:00 2001
> > From: Marc Zyngier <maz@kernel.org>
> > Date: Tue, 10 Feb 2026 12:43:07 +0000
> > Subject: [PATCH] arm64: Force the use of CNTVCT_EL0 in __delay()
> > 
> > Quentin reports an interesting problem with the use of WFxT in __delay()
> > when a vcpu is loaded and that KVM is *not* in VHE mode.
> > 
> > In this case, CNTVOFF_EL2 is set to a non-zero value to reflect the
> > state of the guest virtual counter. At the same time, __delay() is
> > using get_cycles() to read the counter value, which is indirected to
> > reading CNTPCT_EL0.
> > 
> > The core of the issue is that WFxT is using the *virtual* counter,
> > while the kernel is using the physical counter, and that the offset
> > introduces a really bad discrepancy between the two.
> > 
> > Fix this by forcing the use of CNTVCT_EL0, making __delay() consistent
> > irrespective of the value of CNTVOFF_EL2.
> > 
> > Reported-by: Quentin Perret <qperret@google.com>
> > Fixes: 7d26b0516a0df ("arm64: Use WFxT for __delay() when possible")
> > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > Link: https://lore.kernel.org/r/ktosachvft2cgqd5qkukn275ugmhy6xrhxur4zqpdxlfr3qh5h@o3zrfnsq63od
> > Cc: stable@vger.kernel.org
> > ---
> >  arch/arm64/lib/delay.c | 15 +++++++++++----
> >  1 file changed, 11 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c
> > index cb2062e7e2340..26a39bb301ef6 100644
> > --- a/arch/arm64/lib/delay.c
> > +++ b/arch/arm64/lib/delay.c
> > @@ -23,9 +23,16 @@ static inline unsigned long xloops_to_cycles(unsigned long xloops)
> >  	return (xloops * loops_per_jiffy * HZ) >> 32;
> >  }
> >  
> > +/*
> > + * Force the use of CNTVCT_EL0 in order to have the same base as
> > + * WFxT. This avoids some annoying issues when CNTVOFF_EL2 is not
> > + * reset 0 on a KVM host until we do a vcpu_put() on the vcpu.
> > + */
> > +#define __delay_cycles()	__arch_counter_get_cntvct_stable()
> > +
> >  void __delay(unsigned long cycles)
> >  {
> > -	cycles_t start = get_cycles();
> > +	cycles_t start = __delay_cycles();
> >  
> >  	if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) {
> >  		u64 end = start + cycles;
> > @@ -35,17 +42,17 @@ void __delay(unsigned long cycles)
> >  		 * early, use a WFET loop to complete the delay.
> >  		 */
> >  		wfit(end);
> > -		while ((get_cycles() - start) < cycles)
> > +		while ((__delay_cycles() - start) < cycles)
> >  			wfet(end);
> >  	} else 	if (arch_timer_evtstrm_available()) {
> >  		const cycles_t timer_evt_period =
> >  			USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);
> >  
> > -		while ((get_cycles() - start + timer_evt_period) < cycles)
> > +		while ((__delay_cycles() - start + timer_evt_period) < cycles)
> >  			wfe();
> >  	}
> >  
> > -	while ((get_cycles() - start) < cycles)
> > +	while ((__delay_cycles() - start) < cycles)
> >  		cpu_relax();
> 
> I can't put my finger on a specific bug, but it does feel pretty scary
> to use the virtual counter in the host while running with a guest
> voffset. Is the offset userspace controllable?

It is, but not while the vcpu is loaded. You need a full put/load
sequence to get there (that's the only way to set CNTVOFF_EL2 from
EL1).  And since the vcpu is loaded, no other thread can interact with
the vcpu fd (the vcpu mutex is held).

> If so, can we guarantee
> that we're ok with overflow and is it ok if we get preempted in the
> middle of the loop? It's making my head hurt!

Again, I don't see that it can happen.

> 
> The smp_cond_load_{relaxed,acquire}_timeout() series from Ankur looks
> like it has the same issue (it uses arch_timer_read_counter()).

I haven't reviewed it in a long while, but if it using WFxT, it is
likely broken the same way.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

      parent reply	other threads:[~2026-02-10 19:46 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-10 12:27 Broken udelay() on KVM host with a vcpu loaded Quentin Perret
2026-02-10 12:52 ` Marc Zyngier
2026-02-10 15:34   ` Will Deacon
2026-02-10 15:58     ` Quentin Perret
2026-02-10 19:54       ` Marc Zyngier
2026-02-13 11:50         ` Will Deacon
2026-02-13 13:52           ` Marc Zyngier
2026-02-13 14:05         ` Quentin Perret
2026-02-10 19:46     ` Marc Zyngier [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=86pl6cbcuq.wl-maz@kernel.org \
    --to=maz@kernel.org \
    --cc=catalin.marinas@arm.com \
    --cc=joey.gouly@arm.com \
    --cc=kvmarm@lists.linux.dev \
    --cc=oupton@kernel.org \
    --cc=qperret@google.com \
    --cc=suzuki.poulose@arm.com \
    --cc=will@kernel.org \
    --cc=yuzenghui@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.