* Broken udelay() on KVM host with a vcpu loaded @ 2026-02-10 12:27 Quentin Perret 2026-02-10 12:52 ` Marc Zyngier 0 siblings, 1 reply; 9+ messages in thread From: Quentin Perret @ 2026-02-10 12:27 UTC (permalink / raw) To: kvmarm Cc: maz, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will Hi all, I have just received a report from a partner of udelay misbehaving when running on the host whilst a vCPU is loaded. This hardware has FEAT_WFxT and uses the matching implementation of udelay. Interestingly, WFIT triggers using CNTVCT_EL0 unconditionally, but with KVM the host/guest switch for that happens from the preempt notifiers/vcpu_put which aren't invoked when e.g. handling an IRQ. Meanwhile, udelay reads the arch timer to set the waiting time for WFIT using an absolute value, and that gets compared to CNTVCT_EL0 which in the aforementioned IRQ-with-vCPU-loaded case uses the _guest's_ CNTVCT_EL0. I can think of two approaches to address the problem: 1. have KVM context switch cntvoff proactively prior to re-enabling preemption when handling a guest exit; 2. modify the WFIT-based udelay implementation to read from CNTVCT_EL0 instead of the arch_timer to be a bit more self-consistent; Other ideas welcome! Thanks, Quentin ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-10 12:27 Broken udelay() on KVM host with a vcpu loaded Quentin Perret @ 2026-02-10 12:52 ` Marc Zyngier 2026-02-10 15:34 ` Will Deacon 0 siblings, 1 reply; 9+ messages in thread From: Marc Zyngier @ 2026-02-10 12:52 UTC (permalink / raw) To: Quentin Perret Cc: kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will On Tue, 10 Feb 2026 12:27:48 +0000, Quentin Perret <qperret@google.com> wrote: > > Hi all, > > I have just received a report from a partner of udelay misbehaving when > running on the host whilst a vCPU is loaded. This hardware has FEAT_WFxT > and uses the matching implementation of udelay. Interestingly, WFIT > triggers using CNTVCT_EL0 unconditionally, but with KVM the host/guest > switch for that happens from the preempt notifiers/vcpu_put which aren't > invoked when e.g. handling an IRQ. Meanwhile, udelay reads the arch > timer to set the waiting time for WFIT using an absolute value, and that > gets compared to CNTVCT_EL0 which in the aforementioned > IRQ-with-vCPU-loaded case uses the _guest's_ CNTVCT_EL0. Well, the underlying issue is that get_cycles(), as used by __delay(), is *either* using CNTVCT_EL0 (when booted at EL1) or CNTPCT_EL0 (when booted at EL2). > > I can think of two approaches to address the problem: > 1. have KVM context switch cntvoff proactively prior to re-enabling > preemption when handling a guest exit; > 2. modify the WFIT-based udelay implementation to read from CNTVCT_EL0 > instead of the arch_timer to be a bit more self-consistent; > > Other ideas welcome! (1) is a real nightmare, and would force a complete redesign of the life cycle of guest timers (switching from load/put to enter/exit for the context switch, but only on !VHE). I'd rather avoid that, as this is a pretty large performance penalty. (2) is much more palatable, and easily hacked, see below. Can you please give it a go? Thanks, M. 
From b1b45d591aed3e5276ff857dbc6cfa3bce181766 Mon Sep 17 00:00:00 2001 From: Marc Zyngier <maz@kernel.org> Date: Tue, 10 Feb 2026 12:43:07 +0000 Subject: [PATCH] arm64: Force the use of CNTVCT_EL0 in __delay() Quentin reports an interesting problem with the use of WFxT in __delay() when a vcpu is loaded and KVM is *not* in VHE mode. In this case, CNTVOFF_EL2 is set to a non-zero value to reflect the state of the guest virtual counter. At the same time, __delay() is using get_cycles() to read the counter value, which is indirected to reading CNTPCT_EL0. The core of the issue is that WFxT is using the *virtual* counter, while the kernel is using the physical counter, and that the offset introduces a really bad discrepancy between the two. Fix this by forcing the use of CNTVCT_EL0, making __delay() consistent irrespective of the value of CNTVOFF_EL2. Reported-by: Quentin Perret <qperret@google.com> Fixes: 7d26b0516a0df ("arm64: Use WFxT for __delay() when possible") Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/ktosachvft2cgqd5qkukn275ugmhy6xrhxur4zqpdxlfr3qh5h@o3zrfnsq63od Cc: stable@vger.kernel.org --- arch/arm64/lib/delay.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c index cb2062e7e2340..26a39bb301ef6 100644 --- a/arch/arm64/lib/delay.c +++ b/arch/arm64/lib/delay.c @@ -23,9 +23,16 @@ static inline unsigned long xloops_to_cycles(unsigned long xloops) return (xloops * loops_per_jiffy * HZ) >> 32; } +/* + * Force the use of CNTVCT_EL0 in order to have the same base as + * WFxT. This avoids some annoying issues when CNTVOFF_EL2 is not + * reset to 0 on a KVM host until we do a vcpu_put() on the vcpu. 
+ */ +#define __delay_cycles() __arch_counter_get_cntvct_stable() + void __delay(unsigned long cycles) { - cycles_t start = get_cycles(); + cycles_t start = __delay_cycles(); if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) { u64 end = start + cycles; @@ -35,17 +42,17 @@ void __delay(unsigned long cycles) * early, use a WFET loop to complete the delay. */ wfit(end); - while ((get_cycles() - start) < cycles) + while ((__delay_cycles() - start) < cycles) wfet(end); } else if (arch_timer_evtstrm_available()) { const cycles_t timer_evt_period = USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US); - while ((get_cycles() - start + timer_evt_period) < cycles) + while ((__delay_cycles() - start + timer_evt_period) < cycles) wfe(); } - while ((get_cycles() - start) < cycles) + while ((__delay_cycles() - start) < cycles) cpu_relax(); } EXPORT_SYMBOL(__delay); -- 2.47.3 -- Without deviation from the norm, progress is not possible. ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-10 12:52 ` Marc Zyngier @ 2026-02-10 15:34 ` Will Deacon 2026-02-10 15:58 ` Quentin Perret 2026-02-10 19:46 ` Marc Zyngier 0 siblings, 2 replies; 9+ messages in thread From: Will Deacon @ 2026-02-10 15:34 UTC (permalink / raw) To: Marc Zyngier Cc: Quentin Perret, kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas On Tue, Feb 10, 2026 at 12:52:21PM +0000, Marc Zyngier wrote: > On Tue, 10 Feb 2026 12:27:48 +0000, > Quentin Perret <qperret@google.com> wrote: > > > > Hi all, > > > > I have just received a report from a partner of udelay misbehaving when > > running on the host whilst a vCPU is loaded. This hardware has FEAT_WFxT > > and uses the matching implementation of udelay. Interestingly, WFIT > > triggers using CNTVCT_EL0 unconditionally, but with KVM the host/guest > > switch for that happens from the preempt notifiers/vcpu_put which aren't > > invoked when e.g. handling an IRQ. Meanwhile, udelay reads the arch > > timer to set the waiting time for WFIT using an absolute value, and that > > gets compared to CNTVCT_EL0 which in the aforementioned > > IRQ-with-vCPU-loaded case uses the _guest's_ CNTVCT_EL0. > > Well, the underlying issue is that get_cycles(), as used by __delay(), > is *either* using CNTVCT_EL0 (when booted at EL1) or CNTPCT_EL0 (when > booted at EL2). > > > > > > I can think of two approaches to address the problem: > > 1. have KVM context switch cntvoff proactively prior to re-enabling > > preemption when handling a guest exit; > > 2. modify the WFIT-based udelay implementation to read from CNTVCT_EL0 > > instead of the arch_timer to be a bit more self-consistent; > > > > Other ideas welcome! I suppose a third option would be to avoid WFxT when we're on a CPU with a vCPU loaded on it? The per-cpu state is grotty, mind. A bigger hammer would be to use WFxT only when using VHE. 
> (1) is a real nightmare, and would force a complete redesign of the > life cycle of guest timers (switching from load/put to enter/exit for > the context switch, but only on !VHE). I'd rather avoid that, as this > is a pretty large performance penalty. > > (2) is much more palatable, and easily hacked, see below. Can you > please give it a go? > > Thanks, > > M. > > From b1b45d591aed3e5276ff857dbc6cfa3bce181766 Mon Sep 17 00:00:00 2001 > From: Marc Zyngier <maz@kernel.org> > Date: Tue, 10 Feb 2026 12:43:07 +0000 > Subject: [PATCH] arm64: Force the use of CNTVCT_EL0 in __delay() > > Quentin reports an interesting problem with the use of WFxT in __delay() > when a vcpu is loaded and KVM is *not* in VHE mode. > > In this case, CNTVOFF_EL2 is set to a non-zero value to reflect the > state of the guest virtual counter. At the same time, __delay() is > using get_cycles() to read the counter value, which is indirected to > reading CNTPCT_EL0. > > The core of the issue is that WFxT is using the *virtual* counter, > while the kernel is using the physical counter, and that the offset > introduces a really bad discrepancy between the two. > > Fix this by forcing the use of CNTVCT_EL0, making __delay() consistent > irrespective of the value of CNTVOFF_EL2. 
> > Reported-by: Quentin Perret <qperret@google.com> > Fixes: 7d26b0516a0df ("arm64: Use WFxT for __delay() when possible") > Signed-off-by: Marc Zyngier <maz@kernel.org> > Link: https://lore.kernel.org/r/ktosachvft2cgqd5qkukn275ugmhy6xrhxur4zqpdxlfr3qh5h@o3zrfnsq63od > Cc: stable@vger.kernel.org > --- > arch/arm64/lib/delay.c | 15 +++++++++++---- > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c > index cb2062e7e2340..26a39bb301ef6 100644 > --- a/arch/arm64/lib/delay.c > +++ b/arch/arm64/lib/delay.c > @@ -23,9 +23,16 @@ static inline unsigned long xloops_to_cycles(unsigned long xloops) > return (xloops * loops_per_jiffy * HZ) >> 32; > } > > +/* > + * Force the use of CNTVCT_EL0 in order to have the same base as > + * WFxT. This avoids some annoying issues when CNTVOFF_EL2 is not > + * reset to 0 on a KVM host until we do a vcpu_put() on the vcpu. > + */ > +#define __delay_cycles() __arch_counter_get_cntvct_stable() > + > void __delay(unsigned long cycles) > { > - cycles_t start = get_cycles(); > + cycles_t start = __delay_cycles(); > > if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) { > u64 end = start + cycles; > @@ -35,17 +42,17 @@ void __delay(unsigned long cycles) > * early, use a WFET loop to complete the delay. > */ > wfit(end); > - while ((get_cycles() - start) < cycles) > + while ((__delay_cycles() - start) < cycles) > wfet(end); > } else if (arch_timer_evtstrm_available()) { > const cycles_t timer_evt_period = > USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US); > > - while ((get_cycles() - start + timer_evt_period) < cycles) > + while ((__delay_cycles() - start + timer_evt_period) < cycles) > wfe(); > } > > - while ((get_cycles() - start) < cycles) > + while ((__delay_cycles() - start) < cycles) > cpu_relax(); I can't put my finger on a specific bug, but it does feel pretty scary to use the virtual counter in the host while running with a guest voffset. Is the offset userspace controllable? 
If so, can we guarantee that we're ok with overflow and is it ok if we get preempted in the middle of the loop? It's making my head hurt! The smp_cond_load_{relaxed,acquire}_timeout() series from Ankur looks like it has the same issue (it uses arch_timer_read_counter()). Will ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-10 15:34 ` Will Deacon @ 2026-02-10 15:58 ` Quentin Perret 2026-02-10 19:54 ` Marc Zyngier 2026-02-10 19:46 ` Marc Zyngier 1 sibling, 1 reply; 9+ messages in thread From: Quentin Perret @ 2026-02-10 15:58 UTC (permalink / raw) To: Will Deacon Cc: Marc Zyngier, kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas On Tuesday 10 Feb 2026 at 15:34:22 (+0000), Will Deacon wrote: > On Tue, Feb 10, 2026 at 12:52:21PM +0000, Marc Zyngier wrote: > > On Tue, 10 Feb 2026 12:27:48 +0000, > > Quentin Perret <qperret@google.com> wrote: > > > > > > Hi all, > > > > > > I have just received a report from a partner of udelay misbehaving when > > > running on the host whilst a vCPU is loaded. This hardware has FEAT_WFxT > > > and uses the matching implementation of udelay. Interestingly, WFIT > > > triggers using CNTVCT_EL0 unconditionally, but with KVM the host/guest > > > switch for that happens from the preempt notifiers/vcpu_put which aren't > > > invoked when e.g. handling an IRQ. Meanwhile, udelay reads the arch > > > timer to set the waiting time for WFIT using an absolute value, and that > > > gets compared to CNTVCT_EL0 which in the aforementioned > > > IRQ-with-vCPU-loaded case uses the _guest's_ CNTVCT_EL0. > > > > Well, the underlying issue is that get_cycles(), as used by __delay(), > > is *either* using CNTVCT_EL0 (when booted at EL1) or CNTPCT_EL0 (when > > booted at EL2). > > > > > > > > I can think of two approaches to address the problem: > > > 1. have KVM context switch cntvoff proactively prior to re-enabling > > > preemption when handling a guest exit; > > > 2. modify the WFIT-based udelay implementation to read from CNTVCT_EL0 > > > instead of the arch_timer to be a bit more self-consistent; > > > > > > Other ideas welcome! > > I suppose a third option would be to avoid WFxT when we're on a CPU > with a vCPU loaded on it? The per-cpu state is grotty, mind. 
It's one of the rare cases where we could look at that per-cpu data in preemptible context, because the preempt notifiers would move that state along with the task if it were migrated to a different CPU, so that's nice. > A bigger hammer would be to use WFxT only when using VHE. > > > (1) is a real nightmare, and would force a complete redesign of the > > life cycle of guest timers (switching from load/put to enter/exit for > > the context switch, but only on !VHE). I'd rather avoid that, as this > > is a pretty large performance penalty. > > > > (2) is much more palatable, and easily hacked, see below. Can you > > please give it a go? > > > > Thanks, > > > > M. > > > > From b1b45d591aed3e5276ff857dbc6cfa3bce181766 Mon Sep 17 00:00:00 2001 > > From: Marc Zyngier <maz@kernel.org> > > Date: Tue, 10 Feb 2026 12:43:07 +0000 > > Subject: [PATCH] arm64: Force the use of CNTVCT_EL0 in __delay() > > > > Quentin reports an interesting problem with the use of WFxT in __delay() > > when a vcpu is loaded and KVM is *not* in VHE mode. > > > > In this case, CNTVOFF_EL2 is set to a non-zero value to reflect the > > state of the guest virtual counter. At the same time, __delay() is > > using get_cycles() to read the counter value, which is indirected to > > reading CNTPCT_EL0. > > > > The core of the issue is that WFxT is using the *virtual* counter, > > while the kernel is using the physical counter, and that the offset > > introduces a really bad discrepancy between the two. > > > > Fix this by forcing the use of CNTVCT_EL0, making __delay() consistent > > irrespective of the value of CNTVOFF_EL2. 
> > > > Reported-by: Quentin Perret <qperret@google.com> > > Fixes: 7d26b0516a0df ("arm64: Use WFxT for __delay() when possible") > > Signed-off-by: Marc Zyngier <maz@kernel.org> > > Link: https://lore.kernel.org/r/ktosachvft2cgqd5qkukn275ugmhy6xrhxur4zqpdxlfr3qh5h@o3zrfnsq63od > > Cc: stable@vger.kernel.org > > --- > > arch/arm64/lib/delay.c | 15 +++++++++++---- > > 1 file changed, 11 insertions(+), 4 deletions(-) > > > > diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c > > index cb2062e7e2340..26a39bb301ef6 100644 > > --- a/arch/arm64/lib/delay.c > > +++ b/arch/arm64/lib/delay.c > > @@ -23,9 +23,16 @@ static inline unsigned long xloops_to_cycles(unsigned long xloops) > > return (xloops * loops_per_jiffy * HZ) >> 32; > > } > > > > +/* > > + * Force the use of CNTVCT_EL0 in order to have the same base as > > + * WFxT. This avoids some annoying issues when CNTVOFF_EL2 is not > > + * reset to 0 on a KVM host until we do a vcpu_put() on the vcpu. > > + */ > > +#define __delay_cycles() __arch_counter_get_cntvct_stable() > > + > > void __delay(unsigned long cycles) > > { > > - cycles_t start = get_cycles(); > > + cycles_t start = __delay_cycles(); > > > > if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) { > > u64 end = start + cycles; > > @@ -35,17 +42,17 @@ void __delay(unsigned long cycles) > > * early, use a WFET loop to complete the delay. 
> > */ > > wfit(end); > > - while ((get_cycles() - start) < cycles) > > + while ((__delay_cycles() - start) < cycles) > > wfet(end); > > } else if (arch_timer_evtstrm_available()) { > > const cycles_t timer_evt_period = > > USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US); > > > > - while ((get_cycles() - start + timer_evt_period) < cycles) > > + while ((__delay_cycles() - start + timer_evt_period) < cycles) > > wfe(); > > } > > > > - while ((get_cycles() - start) < cycles) > > + while ((__delay_cycles() - start) < cycles) > > cpu_relax(); > > I can't put my finger on a specific bug, but it does feel pretty scary > to use the virtual counter in the host while running with a guest > voffset. Is the offset userspace controllable? If so, can we guarantee > that we're ok with overflow and is it ok if we get preempted in the > middle of the loop? It's making my head hurt! Ouch, it does seem that the SET_ONE_REG stuff allows messing with that value _out of vcpu context_, so yeah userspace could change the value while a vcpu thread is preempted in the middle of a udelay loop ... > The smp_cond_load_{relaxed,acquire}_timeout() series from Ankur looks > like it has the same issue (it uses arch_timer_read_counter()). > > Will ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-10 15:58 ` Quentin Perret @ 2026-02-10 19:54 ` Marc Zyngier 2026-02-13 11:50 ` Will Deacon 2026-02-13 14:05 ` Quentin Perret 0 siblings, 2 replies; 9+ messages in thread From: Marc Zyngier @ 2026-02-10 19:54 UTC (permalink / raw) To: Quentin Perret Cc: Will Deacon, kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas On Tue, 10 Feb 2026 15:58:14 +0000, Quentin Perret <qperret@google.com> wrote: > > Ouch, it does seem that the SET_ONE_REG stuff allows messing with that > value _out of vcpu context_, so yeah userspace could change the value > while a vcpu thread is preempted in the middle of a udelay loop ... I don't think so. You can only do that on the vcpu fd, and if the vcpu is loaded, it means that you are already holding the vcpu mutex. And if you do that from the vcpu thread, then you already have done a vcpu_put(). M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-10 19:54 ` Marc Zyngier @ 2026-02-13 11:50 ` Will Deacon 2026-02-13 13:52 ` Marc Zyngier 2026-02-13 14:05 ` Quentin Perret 1 sibling, 1 reply; 9+ messages in thread From: Will Deacon @ 2026-02-13 11:50 UTC (permalink / raw) To: Marc Zyngier Cc: Quentin Perret, kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas On Tue, Feb 10, 2026 at 07:54:36PM +0000, Marc Zyngier wrote: > On Tue, 10 Feb 2026 15:58:14 +0000, > Quentin Perret <qperret@google.com> wrote: > > > > Ouch, it does seem that the SET_ONE_REG stuff allows messing with that > > value _out of vcpu context_, so yeah userspace could change the value > > while a vcpu thread is preempted in the middle of a udelay loop ... > > I don't think so. You can only do that on the vcpu fd, and if the vcpu > is loaded, it means that you are already holding the vcpu mutex. And > if you do that from the vcpu thread, then you already have done a > vcpu_put(). Just working this through, I think you're right that the vCPU mutex saves us here (thanks!), although that's because it's held across the duration of the KVM_RUN ioctl() and not because of the vcpu_put() (which would run off the back of the preempt notifier if we were preempted anyway). It would be good to capture that in a comment if we go with your approach of using the virtual counter on the host. There's also a VM ioctl for messing with the offset (KVM_ARM_SET_COUNTER_OFFSET) but that uses kvm_trylock_all_vcpus() so it's fine. Will ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-13 11:50 ` Will Deacon @ 2026-02-13 13:52 ` Marc Zyngier 0 siblings, 0 replies; 9+ messages in thread From: Marc Zyngier @ 2026-02-13 13:52 UTC (permalink / raw) To: Will Deacon Cc: Quentin Perret, kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas On Fri, 13 Feb 2026 11:50:17 +0000, Will Deacon <will@kernel.org> wrote: > > On Tue, Feb 10, 2026 at 07:54:36PM +0000, Marc Zyngier wrote: > > On Tue, 10 Feb 2026 15:58:14 +0000, > > Quentin Perret <qperret@google.com> wrote: > > > > > > Ouch, it does seem that the SET_ONE_REG stuff allows messing with that > > > value _out of vcpu context_, so yeah userspace could change the value > > > while a vcpu thread is preempted in the middle of a udelay loop ... > > > > I don't think so. You can only do that on the vcpu fd, and if the vcpu > > is loaded, it means that you are already holding the vcpu mutex. And > > if you do that from the vcpu thread, then you already have done a > > vcpu_put(). > > Just working this through, I think you're right that the vCPU mutex > saves us here (thanks!), although that's because it's held across the > duration of the KVM_RUN ioctl() and not because of the vcpu_put() (which > would run off the back of the preempt notifier if we were preempted > anyway). Duh, yes. Sorry about the confusing explanation. It was perfectly clear in my head, I swear! > It would be good to capture that in a comment if we go with your > approach of using the virtual counter on the host. Sure, I'll add a comment to that effect and post it shortly. Thanks, M. -- Jazz isn't dead. It just smells funny. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-10 19:54 ` Marc Zyngier 2026-02-13 11:50 ` Will Deacon @ 2026-02-13 14:05 ` Quentin Perret 1 sibling, 0 replies; 9+ messages in thread From: Quentin Perret @ 2026-02-13 14:05 UTC (permalink / raw) To: Marc Zyngier Cc: Will Deacon, kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, hyesoo.yu On Tuesday 10 Feb 2026 at 19:54:36 (+0000), Marc Zyngier wrote: > On Tue, 10 Feb 2026 15:58:14 +0000, > Quentin Perret <qperret@google.com> wrote: > > > > Ouch, it does seem that the SET_ONE_REG stuff allows messing with that > > value _out of vcpu context_, so yeah userspace could change the value > > while a vcpu thread is preempted in the middle of a udelay loop ... > > I don't think so. You can only do that on the vcpu fd, and if the vcpu > is loaded, it means that you are already holding the vcpu mutex. And > if you do that from the vcpu thread, then you already have done a > vcpu_put(). Aha, I had missed the locking here -- reasoning about SET_ONE_REG being issued on a running vCPU was 'interesting' :-). Reviewed-by: Quentin Perret <qperret@google.com> And also, I'm only relaying a report that we got from a partner, but we might want to credit the original reporter: Reported-by: Hyesoo Yu <hyesoo.yu@samsung.com> Thanks for the patch! Quentin ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Broken udelay() on KVM host with a vcpu loaded 2026-02-10 15:34 ` Will Deacon 2026-02-10 15:58 ` Quentin Perret @ 2026-02-10 19:46 ` Marc Zyngier 1 sibling, 0 replies; 9+ messages in thread From: Marc Zyngier @ 2026-02-10 19:46 UTC (permalink / raw) To: Will Deacon Cc: Quentin Perret, kvmarm, oupton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas On Tue, 10 Feb 2026 15:34:22 +0000, Will Deacon <will@kernel.org> wrote: > > On Tue, Feb 10, 2026 at 12:52:21PM +0000, Marc Zyngier wrote: > > On Tue, 10 Feb 2026 12:27:48 +0000, > > Quentin Perret <qperret@google.com> wrote: > > > > > > Hi all, > > > > > > I have just received a report from a partner of udelay misbehaving when > > > running on the host whilst a vCPU is loaded. This hardware has FEAT_WFxT > > > and uses the matching implementation of udelay. Interestingly, WFIT > > > triggers using CNTVCT_EL0 unconditionally, but with KVM the host/guest > > > switch for that happens from the preempt notifiers/vcpu_put which aren't > > > invoked when e.g. handling an IRQ. Meanwhile, udelay reads the arch > > > timer to set the waiting time for WFIT using an absolute value, and that > > > gets compared to CNTVCT_EL0 which in the aforementioned > > > IRQ-with-vCPU-loaded case uses the _guest's_ CNTVCT_EL0. > > > > Well, the underlying issue is that get_cycles(), as used by __delay(), > > is *either* using CNTVCT_EL0 (when booted at EL1) or CNTPCT_EL0 (when > > booted at EL2). > > > > > > > > I can think of two approaches to address the problem: > > > 1. have KVM context switch cntvoff proactively prior to re-enabling > > > preemption when handling a guest exit; > > > 2. modify the WFIT-based udelay implementation to read from CNTVCT_EL0 > > > instead of the arch_timer to be a bit more self-consistent; > > > > > > Other ideas welcome! > > I suppose a third option would be to avoid WFxT when we're on a CPU > with a vCPU loaded on it? The per-cpu state is grotty, mind. 
> A bigger > hammer would be to use WFxT only when using VHE. Hmmm. Sure, but that's rather ugly. > > > (1) is a real nightmare, and would force a complete redesign of the > > life cycle of guest timers (switching from load/put to enter/exit for > > the context switch, but only on !VHE). I'd rather avoid that, as this > > is a pretty large performance penalty. > > > > (2) is much more palatable, and easily hacked, see below. Can you > > please give it a go? > > > > Thanks, > > > > M. > > > > From b1b45d591aed3e5276ff857dbc6cfa3bce181766 Mon Sep 17 00:00:00 2001 > > From: Marc Zyngier <maz@kernel.org> > > Date: Tue, 10 Feb 2026 12:43:07 +0000 > > Subject: [PATCH] arm64: Force the use of CNTVCT_EL0 in __delay() > > > > Quentin reports an interesting problem with the use of WFxT in __delay() > > when a vcpu is loaded and KVM is *not* in VHE mode. > > > > In this case, CNTVOFF_EL2 is set to a non-zero value to reflect the > > state of the guest virtual counter. At the same time, __delay() is > > using get_cycles() to read the counter value, which is indirected to > > reading CNTPCT_EL0. > > > > The core of the issue is that WFxT is using the *virtual* counter, > > while the kernel is using the physical counter, and that the offset > > introduces a really bad discrepancy between the two. > > > > Fix this by forcing the use of CNTVCT_EL0, making __delay() consistent > > irrespective of the value of CNTVOFF_EL2. 
> > > > Reported-by: Quentin Perret <qperret@google.com> > > Fixes: 7d26b0516a0df ("arm64: Use WFxT for __delay() when possible") > > Signed-off-by: Marc Zyngier <maz@kernel.org> > > Link: https://lore.kernel.org/r/ktosachvft2cgqd5qkukn275ugmhy6xrhxur4zqpdxlfr3qh5h@o3zrfnsq63od > > Cc: stable@vger.kernel.org > > --- > > arch/arm64/lib/delay.c | 15 +++++++++++---- > > 1 file changed, 11 insertions(+), 4 deletions(-) > > > > diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c > > index cb2062e7e2340..26a39bb301ef6 100644 > > --- a/arch/arm64/lib/delay.c > > +++ b/arch/arm64/lib/delay.c > > @@ -23,9 +23,16 @@ static inline unsigned long xloops_to_cycles(unsigned long xloops) > > return (xloops * loops_per_jiffy * HZ) >> 32; > > } > > > > +/* > > + * Force the use of CNTVCT_EL0 in order to have the same base as > > + * WFxT. This avoids some annoying issues when CNTVOFF_EL2 is not > > + * reset to 0 on a KVM host until we do a vcpu_put() on the vcpu. > > + */ > > +#define __delay_cycles() __arch_counter_get_cntvct_stable() > > + > > void __delay(unsigned long cycles) > > { > > - cycles_t start = get_cycles(); > > + cycles_t start = __delay_cycles(); > > > > if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) { > > u64 end = start + cycles; > > @@ -35,17 +42,17 @@ void __delay(unsigned long cycles) > > * early, use a WFET loop to complete the delay. 
> > */ > > wfit(end); > > - while ((get_cycles() - start) < cycles) > > + while ((__delay_cycles() - start) < cycles) > > wfet(end); > > } else if (arch_timer_evtstrm_available()) { > > const cycles_t timer_evt_period = > > USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US); > > > > - while ((get_cycles() - start + timer_evt_period) < cycles) > > + while ((__delay_cycles() - start + timer_evt_period) < cycles) > > wfe(); > > } > > > > - while ((get_cycles() - start) < cycles) > > + while ((__delay_cycles() - start) < cycles) > > cpu_relax(); > > I can't put my finger on a specific bug, but it does feel pretty scary > to use the virtual counter in the host while running with a guest > voffset. Is the offset userspace controllable? It is, but not while the vcpu is loaded. You need a full put/load sequence to get there (that's the only way to set CNTVOFF_EL2 from EL1). And since the vcpu is loaded, no other thread can interact with the vcpu fd (the vcpu mutex is held). > If so, can we guarantee > that we're ok with overflow and is it ok if we get preempted in the > middle of the loop? It's making my head hurt! Again, I don't see that it can happen. > > The smp_cond_load_{relaxed,acquire}_timeout() series from Ankur looks > like it has the same issue (it uses arch_timer_read_counter()). I haven't reviewed it in a long while, but if it is using WFxT, it is likely broken the same way. Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 9+ messages in thread