From: Marc Zyngier <maz@kernel.org>
To: Quentin Perret <qperret@google.com>
Cc: kvmarm@lists.linux.dev, oupton@kernel.org, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org
Subject: Re: Broken udelay() on KVM host with a vcpu loaded
Date: Tue, 10 Feb 2026 12:52:21 +0000
Message-ID: <86qzqsbw1m.wl-maz@kernel.org>
In-Reply-To: <ktosachvft2cgqd5qkukn275ugmhy6xrhxur4zqpdxlfr3qh5h@o3zrfnsq63od>
On Tue, 10 Feb 2026 12:27:48 +0000,
Quentin Perret <qperret@google.com> wrote:
>
> Hi all,
>
> I have just received a report from a partner of udelay misbehaving when
> running on the host whilst a vCPU is loaded. This hardware has FEAT_WFxT
> and uses the matching implementation of udelay. Interestingly, WFIT
> triggers using CNTVCT_EL0 unconditionally, but with KVM the host/guest
> switch for that happens from the preempt notifiers/vcpu_put which aren't
> invoked when e.g. handling an IRQ. Interestingly, udelay reads the arch
> timer to set the waiting time for WFIT using an absolute value, and that
> gets compared to CNTVCT_EL0 which in the aforementioned
> IRQ-with-vCPU-loaded case uses the _guest's_ CNTVCT_EL0.

Well, the underlying issue is that get_cycles(), as used by __delay(),
is *either* using CNTVCT_EL0 (when booted at EL1) or CNTPCT_EL0 (when
booted at EL2).

>
> I can think of two approaches to address the problem:
> 1. have KVM context switch cntvoff proactively prior to re-enabling
> preemption when handling a guest exit;
> 2. modify the WFIT-based udelay implementation to read from CNTVCT_EL0
> instead of the arch_timer to be a bit more self-consistent;
>
> Other ideas welcome!

(1) is a real nightmare, and would force a complete redesign of the
life cycle of guest timers (switching from load/put to enter/exit for
the context switch, but only on !VHE). I'd rather avoid that, as it
would come with a pretty large performance penalty.

(2) is much more palatable, and easily hacked, see below. Can you
please give it a go?

Thanks,

	M.

From b1b45d591aed3e5276ff857dbc6cfa3bce181766 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz@kernel.org>
Date: Tue, 10 Feb 2026 12:43:07 +0000
Subject: [PATCH] arm64: Force the use of CNTVCT_EL0 in __delay()

Quentin reports an interesting problem with the use of WFxT in __delay()
when a vcpu is loaded and KVM is *not* in VHE mode.

In this case, CNTVOFF_EL2 is set to a non-zero value to reflect the
state of the guest virtual counter. At the same time, __delay() is
using get_cycles() to read the counter value, which is indirected to
reading CNTPCT_EL0.

The core of the issue is that WFxT is using the *virtual* counter,
while the kernel is using the physical counter, and that the offset
introduces a really bad discrepancy between the two.

Fix this by forcing the use of CNTVCT_EL0, making __delay() consistent
irrespective of the value of CNTVOFF_EL2.

Reported-by: Quentin Perret <qperret@google.com>
Fixes: 7d26b0516a0df ("arm64: Use WFxT for __delay() when possible")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/ktosachvft2cgqd5qkukn275ugmhy6xrhxur4zqpdxlfr3qh5h@o3zrfnsq63od
Cc: stable@vger.kernel.org
---
 arch/arm64/lib/delay.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c
index cb2062e7e2340..26a39bb301ef6 100644
--- a/arch/arm64/lib/delay.c
+++ b/arch/arm64/lib/delay.c
@@ -23,9 +23,16 @@ static inline unsigned long xloops_to_cycles(unsigned long xloops)
 	return (xloops * loops_per_jiffy * HZ) >> 32;
 }

+/*
+ * Force the use of CNTVCT_EL0 in order to have the same base as
+ * WFxT. This avoids some annoying issues when CNTVOFF_EL2 is not
+ * reset to 0 on a KVM host until we do a vcpu_put() on the vcpu.
+ */
+#define __delay_cycles()	__arch_counter_get_cntvct_stable()
+
 void __delay(unsigned long cycles)
 {
-	cycles_t start = get_cycles();
+	cycles_t start = __delay_cycles();

 	if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) {
 		u64 end = start + cycles;
@@ -35,17 +42,17 @@ void __delay(unsigned long cycles)
 		 * early, use a WFET loop to complete the delay.
 		 */
 		wfit(end);
-		while ((get_cycles() - start) < cycles)
+		while ((__delay_cycles() - start) < cycles)
 			wfet(end);
 	} else if (arch_timer_evtstrm_available()) {
 		const cycles_t timer_evt_period =
 			USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);

-		while ((get_cycles() - start + timer_evt_period) < cycles)
+		while ((__delay_cycles() - start + timer_evt_period) < cycles)
 			wfe();
 	}

-	while ((get_cycles() - start) < cycles)
+	while ((__delay_cycles() - start) < cycles)
 		cpu_relax();
 }
 EXPORT_SYMBOL(__delay);
--
2.47.3
--
Without deviation from the norm, progress is not possible.