[PATCH v4 0/8] x86/HPET: tidying / improvements

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v4 0/8] x86/HPET: tidying / improvements
@ 2025-11-17 14:35 Jan Beulich
  2025-11-17 14:37 ` [PATCH v4 1/8] x86/HPET: avoid indirect call to event handler Jan Beulich
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Jan Beulich @ 2025-11-17 14:35 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

v4 is what is left from the broadcast-IRQ series, plus yet another two new
changes. Logically the changes are largely independent, and can likely go in
in about any order.

1: avoid indirect call to event handler
2: make another channel flags update atomic
3: move legacy tick IRQ count adjustment
4: reduce hpet_next_event() call sites
5: drop "long timeout" handling from reprogram_hpet_evt_channel()
6: simplify "expire" check a little in reprogram_hpet_evt_channel()
7: drop .set_affinity hook
8: don't arbitrarily cap delta in reprogram_hpet_evt_channel()

Jan


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 1/8] x86/HPET: avoid indirect call to event handler
  2025-11-17 14:35 [PATCH v4 0/8] x86/HPET: tidying / improvements Jan Beulich
@ 2025-11-17 14:37 ` Jan Beulich
  2026-01-21 16:19   ` Roger Pau Monné
  2025-11-17 14:37 ` [PATCH v4 2/8] x86/HPET: make another channel flags update atomic Jan Beulich
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2025-11-17 14:37 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

It's only ever handle_hpet_broadcast() that's used. While we now don't
enable IRQs right away, still play safe and convert the function pointer
to a boolean, to make sure no calls occur too early.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Re-base over changes earlier in the series.

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -40,7 +40,7 @@ struct hpet_event_channel
     s_time_t      next_event;
     cpumask_var_t cpumask;
     spinlock_t    lock;
-    void          (*event_handler)(struct hpet_event_channel *ch);
+    bool          event_handler;
 
     unsigned int idx;   /* physical channel idx */
     unsigned int cpu;   /* msi target */
@@ -194,7 +194,7 @@ static void evt_do_broadcast(cpumask_t *
        cpumask_raise_softirq(mask, TIMER_SOFTIRQ);
 }
 
-static void cf_check handle_hpet_broadcast(struct hpet_event_channel *ch)
+static void handle_hpet_broadcast(struct hpet_event_channel *ch)
 {
     cpumask_t *scratch = this_cpu(hpet_scratch_cpumask);
     s_time_t now, next_event;
@@ -250,7 +250,7 @@ static void cf_check hpet_interrupt_hand
         return;
     }
 
-    ch->event_handler(ch);
+    handle_hpet_broadcast(ch);
 }
 
 static void hpet_enable_channel(struct hpet_event_channel *ch)
@@ -657,7 +657,7 @@ void __init hpet_broadcast_init(void)
         hpet_events[i].next_event = STIME_MAX;
         spin_lock_init(&hpet_events[i].lock);
         smp_wmb();
-        hpet_events[i].event_handler = handle_hpet_broadcast;
+        hpet_events[i].event_handler = true;
 
         hpet_events[i].msi.msi_attrib.maskbit = 1;
         hpet_events[i].msi.msi_attrib.pos = MSI_TYPE_HPET;
@@ -814,7 +814,9 @@ int hpet_legacy_irq_tick(void)
          (hpet_events->flags & (HPET_EVT_DISABLE|HPET_EVT_LEGACY)) !=
          HPET_EVT_LEGACY )
         return 0;
-    hpet_events->event_handler(hpet_events);
+
+    handle_hpet_broadcast(hpet_events);
+
     return 1;
 }
 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 2/8] x86/HPET: make another channel flags update atomic
  2025-11-17 14:35 [PATCH v4 0/8] x86/HPET: tidying / improvements Jan Beulich
  2025-11-17 14:37 ` [PATCH v4 1/8] x86/HPET: avoid indirect call to event handler Jan Beulich
@ 2025-11-17 14:37 ` Jan Beulich
  2026-01-21 17:55   ` Roger Pau Monné
  2025-11-17 14:37 ` [PATCH v4 3/8] x86/HPET: move legacy tick IRQ count adjustment Jan Beulich
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2025-11-17 14:37 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

Unlike the setting of HPET_EVT_LEGACY in hpet_broadcast_init(), the
setting of HPET_EVT_DISABLE in hpet_disable_legacy_broadcast() isn't init-
only and hence can race other flag manipulation (not all of which occur
while holding the channel's lock). While possibly any such updates would
only ever occur when HPET_EVT_LEGACY isn't set in the first place, this
doesn't look straightforward to prove, so better be on the safe side.

Fixes: d09486dba36a ("cpuidle: Enable hpet broadcast by default")
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -729,7 +729,7 @@ void hpet_disable_legacy_broadcast(void)
 
     spin_lock_irqsave(&hpet_events->lock, flags);
 
-    hpet_events->flags |= HPET_EVT_DISABLE;
+    set_bit(HPET_EVT_DISABLE_BIT, &hpet_events->flags);
 
     /* disable HPET T0 */
     cfg = hpet_read32(HPET_Tn_CFG(0));



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 3/8] x86/HPET: move legacy tick IRQ count adjustment
  2025-11-17 14:35 [PATCH v4 0/8] x86/HPET: tidying / improvements Jan Beulich
  2025-11-17 14:37 ` [PATCH v4 1/8] x86/HPET: avoid indirect call to event handler Jan Beulich
  2025-11-17 14:37 ` [PATCH v4 2/8] x86/HPET: make another channel flags update atomic Jan Beulich
@ 2025-11-17 14:37 ` Jan Beulich
  2026-01-22  8:50   ` Roger Pau Monné
  2025-11-17 14:38 ` [PATCH v4 4/8] x86/HPET: reduce hpet_next_event() call sites Jan Beulich
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2025-11-17 14:37 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

If already we play with the IRQ count, we should do so only if we actually
"consume" the interrupt; normal timer IRQs should not have any adjustment
done.

Fixes: 353533232730 ("cpuidle: fix the menu governor to enhance IO performance")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
_Why_ we do these adjustments (also elsewhere) I don't really know.

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -808,13 +808,13 @@ int hpet_broadcast_is_available(void)
 
 int hpet_legacy_irq_tick(void)
 {
-    this_cpu(irq_count)--;
-
     if ( !hpet_events ||
          (hpet_events->flags & (HPET_EVT_DISABLE|HPET_EVT_LEGACY)) !=
          HPET_EVT_LEGACY )
         return 0;
 
+    this_cpu(irq_count)--;
+
     handle_hpet_broadcast(hpet_events);
 
     return 1;



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 4/8] x86/HPET: reduce hpet_next_event() call sites
  2025-11-17 14:35 [PATCH v4 0/8] x86/HPET: tidying / improvements Jan Beulich
                   ` (2 preceding siblings ...)
  2025-11-17 14:37 ` [PATCH v4 3/8] x86/HPET: move legacy tick IRQ count adjustment Jan Beulich
@ 2025-11-17 14:38 ` Jan Beulich
  2026-01-22  9:00   ` Roger Pau Monné
  2025-11-17 14:39 ` [PATCH v4 5/8] x86/HPET: drop "long timeout" handling from reprogram_hpet_evt_channel() Jan Beulich
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2025-11-17 14:38 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

I'm surprised gcc doesn't manage to do that: At least in debug builds two
call sites exist, just like source code has it. That's not necessary
though - by using do/while we can reduce this to a single call site. Then
the function will be inlined.

While improving code gen, also switch the function's 2nd parameter to
unsigned.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Oddly enough the CDQE is replaced by an entirely unnecessary 32-bit MOV of
a register to itself (i.e. zero-extending to 64 bits), as that's
immediately preceded by a 32-bit ADD targeting the same register.

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -124,7 +124,7 @@ static inline unsigned long ns2ticks(uns
     return (unsigned long) tmp;
 }
 
-static int hpet_next_event(unsigned long delta, int timer)
+static int hpet_next_event(unsigned long delta, unsigned int timer)
 {
     uint32_t cnt, cmp;
     unsigned long flags;
@@ -173,12 +173,10 @@ static int reprogram_hpet_evt_channel(
     delta = max_t(int64_t, delta, MIN_DELTA_NS);
     delta = ns2ticks(delta, ch->shift, ch->mult);
 
-    ret = hpet_next_event(delta, ch->idx);
-    while ( ret && force )
-    {
-        delta += delta;
+    do {
         ret = hpet_next_event(delta, ch->idx);
-    }
+        delta += delta;
+    } while ( ret && force );
 
     return ret;
 }



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 5/8] x86/HPET: drop "long timeout" handling from reprogram_hpet_evt_channel()
  2025-11-17 14:35 [PATCH v4 0/8] x86/HPET: tidying / improvements Jan Beulich
                   ` (3 preceding siblings ...)
  2025-11-17 14:38 ` [PATCH v4 4/8] x86/HPET: reduce hpet_next_event() call sites Jan Beulich
@ 2025-11-17 14:39 ` Jan Beulich
  2026-01-22  9:03   ` Roger Pau Monné
  2025-11-17 14:39 ` [PATCH v4 6/8] x86/HPET: simplify "expire" check a little in reprogram_hpet_evt_channel() Jan Beulich
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2025-11-17 14:39 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

Neither caller passes STIME_MAX, so (bogusly) handling the case isn't
necessary.

"Bogusly" because with 32-bit counters, writing 0 means on average half
the wrapping period until an interrupt would be raised, while of course
in extreme cases an interrupt would be raised almost right away.

Amends: aa42fc0e9cd9 ("cpuidle: remove hpet access in hpet_broadcast_exit")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Drop the code instead of adjusting it.

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -162,13 +162,6 @@ static int reprogram_hpet_evt_channel(
 
     ch->next_event = expire;
 
-    if ( expire == STIME_MAX )
-    {
-        /* We assume it will take a long time for the timer to wrap. */
-        hpet_write32(0, HPET_Tn_CMP(ch->idx));
-        return 0;
-    }
-
     delta = min_t(int64_t, delta, MAX_DELTA_NS);
     delta = max_t(int64_t, delta, MIN_DELTA_NS);
     delta = ns2ticks(delta, ch->shift, ch->mult);



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 6/8] x86/HPET: simplify "expire" check a little in reprogram_hpet_evt_channel()
  2025-11-17 14:35 [PATCH v4 0/8] x86/HPET: tidying / improvements Jan Beulich
                   ` (4 preceding siblings ...)
  2025-11-17 14:39 ` [PATCH v4 5/8] x86/HPET: drop "long timeout" handling from reprogram_hpet_evt_channel() Jan Beulich
@ 2025-11-17 14:39 ` Jan Beulich
  2026-01-22  9:18   ` Roger Pau Monné
  2025-11-17 14:39 ` [PATCH v4 7/8] x86/HPET: drop .set_affinity hook Jan Beulich
  2025-11-17 14:40 ` [PATCH v4 8/8] x86/HPET: don't arbitrarily cap delta in reprogram_hpet_evt_channel() Jan Beulich
  7 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2025-11-17 14:39 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

When this was added, the log message was updated correctly, but the zero
case was needlessly checked separately: hpet_broadcast_enter() had a zero
check added at the same time, while handle_hpet_broadcast() can't possibly
pass 0 here anyway.

Fixes: 7145897cfb81 ("cpuidle: Fix for timer_deadline==0 case")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -147,10 +147,10 @@ static int reprogram_hpet_evt_channel(
     int64_t delta;
     int ret;
 
-    if ( (ch->flags & HPET_EVT_DISABLE) || (expire == 0) )
+    if ( ch->flags & HPET_EVT_DISABLE )
         return 0;
 
-    if ( unlikely(expire < 0) )
+    if ( unlikely(expire <= 0) )
     {
         printk(KERN_DEBUG "reprogram: expire <= 0\n");
         return -ETIME;



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 7/8] x86/HPET: drop .set_affinity hook
  2025-11-17 14:35 [PATCH v4 0/8] x86/HPET: tidying / improvements Jan Beulich
                   ` (5 preceding siblings ...)
  2025-11-17 14:39 ` [PATCH v4 6/8] x86/HPET: simplify "expire" check a little in reprogram_hpet_evt_channel() Jan Beulich
@ 2025-11-17 14:39 ` Jan Beulich
  2026-01-22 10:05   ` Roger Pau Monné
  2025-11-17 14:40 ` [PATCH v4 8/8] x86/HPET: don't arbitrarily cap delta in reprogram_hpet_evt_channel() Jan Beulich
  7 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2025-11-17 14:39 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

No IRQ balancing is supposed to be happening on the broadcast IRQs. The
only entity responsible for fiddling with the CPU affinities is
set_channel_irq_affinity(). They shouldn't even be fiddled with when
offlining a CPU: A CPU going down can't at the same time be idle. Some
properties (->arch.cpu_mask in particular) may transiently reference an
offline CPU, but that'll be adjusted as soon as a channel goes into active
use again.

Along with adjusting fixup_irqs() (in a more general way, i.e. covering all
vectors which are marked in use globally), also adjust section placement of
used_vectors.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -294,22 +294,6 @@ static int hpet_msi_write(struct hpet_ev
 
 #define hpet_msi_shutdown hpet_msi_mask
 
-static void cf_check hpet_msi_set_affinity(
-    struct irq_desc *desc, const cpumask_t *mask)
-{
-    struct hpet_event_channel *ch = desc->action->dev_id;
-    struct msi_msg msg = ch->msi.msg;
-
-    /* This really is only for dump_irqs(). */
-    cpumask_copy(desc->arch.cpu_mask, mask);
-
-    msg.dest32 = cpu_mask_to_apicid(mask);
-    msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
-    msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32);
-    if ( msg.dest32 != ch->msi.msg.dest32 )
-        hpet_msi_write(ch, &msg);
-}
-
 /*
  * IRQ Chip for MSI HPET Devices,
  */
@@ -321,7 +305,6 @@ static hw_irq_controller hpet_msi_type =
     .disable    = hpet_msi_mask,
     .ack        = irq_actor_none,
     .end        = end_nonmaskable_irq,
-    .set_affinity   = hpet_msi_set_affinity,
 };
 
 static int __hpet_setup_msi_irq(struct irq_desc *desc)
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -51,7 +51,7 @@ static vmask_t global_used_vector_map;
 
 struct irq_desc __read_mostly *irq_desc = NULL;
 
-static DECLARE_BITMAP(used_vectors, X86_IDT_VECTORS);
+static DECLARE_BITMAP(used_vectors, X86_IDT_VECTORS) __ro_after_init;
 
 static DEFINE_SPINLOCK(vector_lock);
 
@@ -2630,13 +2630,17 @@ void fixup_irqs(void)
         spin_lock(&desc->lock);
 
         vector = irq_to_vector(irq);
-        if ( vector >= FIRST_HIPRIORITY_VECTOR &&
-             vector <= LAST_HIPRIORITY_VECTOR &&
-             desc->handler == &no_irq_type )
+        if ( (vector >= FIRST_HIPRIORITY_VECTOR &&
+              vector <= LAST_HIPRIORITY_VECTOR &&
+              desc->handler == &no_irq_type) ||
+             test_bit(vector, used_vectors) )
         {
             /*
              * This can in particular happen when parking secondary threads
              * during boot and when the serial console wants to use a PCI IRQ.
+             *
+             * Globally used vectors (like the HPET broadcast IRQ ones), need
+             * to be left alone in any event.
              */
             spin_unlock(&desc->lock);
             continue;



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 8/8] x86/HPET: don't arbitrarily cap delta in reprogram_hpet_evt_channel()
  2025-11-17 14:35 [PATCH v4 0/8] x86/HPET: tidying / improvements Jan Beulich
                   ` (6 preceding siblings ...)
  2025-11-17 14:39 ` [PATCH v4 7/8] x86/HPET: drop .set_affinity hook Jan Beulich
@ 2025-11-17 14:40 ` Jan Beulich
  2026-01-22 10:23   ` Roger Pau Monné
  7 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2025-11-17 14:40 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

There's no reason to set an arbitrary upper bound of 10 seconds. We can
simply set the comparator such that it'll take a whole cycle through all
32-bit values until the next interrupt would be raised. (For an extremely
fast-running HPET [400 MHz and up] 10 seconds would also be too long.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -23,7 +23,6 @@
 #include <asm/irq-vectors.h>
 #include <asm/msi.h>
 
-#define MAX_DELTA_NS MILLISECS(10*1000)
 #define MIN_DELTA_NS MICROSECS(20)
 
 #define HPET_EVT_USED_BIT    0
@@ -162,10 +161,15 @@ static int reprogram_hpet_evt_channel(
 
     ch->next_event = expire;
 
-    delta = min_t(int64_t, delta, MAX_DELTA_NS);
     delta = max_t(int64_t, delta, MIN_DELTA_NS);
     delta = ns2ticks(delta, ch->shift, ch->mult);
 
+    if ( delta > UINT32_MAX )
+    {
+        hpet_write32(hpet_read32(HPET_COUNTER), HPET_Tn_CMP(ch->idx));
+        return 0;
+    }
+
     do {
         ret = hpet_next_event(delta, ch->idx);
         delta += delta;



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 1/8] x86/HPET: avoid indirect call to event handler
  2025-11-17 14:37 ` [PATCH v4 1/8] x86/HPET: avoid indirect call to event handler Jan Beulich
@ 2026-01-21 16:19   ` Roger Pau Monné
  0 siblings, 0 replies; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-21 16:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Mon, Nov 17, 2025 at 03:37:04PM +0100, Jan Beulich wrote:
> It's only ever handle_hpet_broadcast() that's used. While we now don't
> enable IRQs right away, still play safe and convert the function pointer
> to a boolean, to make sure no calls occur too early.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> v2: Re-base over changes earlier in the series.
> 
> --- a/xen/arch/x86/hpet.c
> +++ b/xen/arch/x86/hpet.c
> @@ -40,7 +40,7 @@ struct hpet_event_channel
>      s_time_t      next_event;
>      cpumask_var_t cpumask;
>      spinlock_t    lock;
> -    void          (*event_handler)(struct hpet_event_channel *ch);
> +    bool          event_handler;

It would be nice to also get rid of this field, but I don't see any
other input that we could use to ensure the channel is ready to
receive interrupts.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 2/8] x86/HPET: make another channel flags update atomic
  2025-11-17 14:37 ` [PATCH v4 2/8] x86/HPET: make another channel flags update atomic Jan Beulich
@ 2026-01-21 17:55   ` Roger Pau Monné
  0 siblings, 0 replies; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-21 17:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Mon, Nov 17, 2025 at 03:37:21PM +0100, Jan Beulich wrote:
> Unlike the setting of HPET_EVT_LEGACY in hpet_broadcast_init(), the
> setting of HPET_EVT_DISABLE in hpet_disable_legacy_broadcast() isn't init-
> only and hence can race other flag manipulation (not all of which occur
> while holding the channel's lock). While possibly any such updates would
> only ever occur when HPET_EVT_LEGACY isn't set in the first place, this
> doesn't look straightforward to prove, so better be on the safe side.
> 
> Fixes: d09486dba36a ("cpuidle: Enable hpet broadcast by default")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

I also think it's quite likely that hpet_disable_legacy_broadcast() is
not possible to be used concurrently while the setting of the other
flags, that seem to happen in broadcast mode only, but better be on
the safe side.

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 3/8] x86/HPET: move legacy tick IRQ count adjustment
  2025-11-17 14:37 ` [PATCH v4 3/8] x86/HPET: move legacy tick IRQ count adjustment Jan Beulich
@ 2026-01-22  8:50   ` Roger Pau Monné
  2026-01-22 10:31     ` Jan Beulich
  0 siblings, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-22  8:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Mon, Nov 17, 2025 at 03:37:45PM +0100, Jan Beulich wrote:
> If already we play with the IRQ count, we should do so only if we actually
> "consume" the interrupt; normal timer IRQs should not have any adjustment
> done.
> 
> Fixes: 353533232730 ("cpuidle: fix the menu governor to enhance IO performance")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> _Why_ we do these adjustments (also elsewhere) I don't really know.

I think I have an idea of what's going on here.  This accounting is
used by the idle governor to decide when to go idle.  On Linux (where
the code is imported from) the governor took into account the inflight
IO request state.  However that's not available to Xen and instead
they decided to mimic the tracking of the IO activity by counting
interrupts.  I bet then realized the timer interrupt would "skew"
those results and make it look like there's IO activity when the
system is otherwise mostly idle.

> 
> --- a/xen/arch/x86/hpet.c
> +++ b/xen/arch/x86/hpet.c
> @@ -808,13 +808,13 @@ int hpet_broadcast_is_available(void)
>  
>  int hpet_legacy_irq_tick(void)
>  {
> -    this_cpu(irq_count)--;

I think you want to pull this decrease into timer_interrupt() itself,
so it does the decrease unconditionally of whether the interrupt is a
legacy HPET one or from the PIT?

By gating the decrease on the interrupt having been originated from
the HPET you completely avoid the decrease in the PIT case AFAICT.

> -
>      if ( !hpet_events ||
>           (hpet_events->flags & (HPET_EVT_DISABLE|HPET_EVT_LEGACY)) !=
>           HPET_EVT_LEGACY )
>          return 0;
>  
> +    this_cpu(irq_count)--;

Also in hpet_interrupt_handler() we might consider only doing the
decrease after we ensure it's not a spurious interrupt?  We don't seem
to decrease irq_count for spurious interrupts elsewhere.

Thanks, Roger.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 4/8] x86/HPET: reduce hpet_next_event() call sites
  2025-11-17 14:38 ` [PATCH v4 4/8] x86/HPET: reduce hpet_next_event() call sites Jan Beulich
@ 2026-01-22  9:00   ` Roger Pau Monné
  0 siblings, 0 replies; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-22  9:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Mon, Nov 17, 2025 at 03:38:43PM +0100, Jan Beulich wrote:
> I'm surprised gcc doesn't manage to do that: At least in debug builds two
> call sites exist, just like source code has it. That's not necessary
> though - by using do/while we can reduce this to a single call site. Then
> the function will be inlined.
> 
> While improving code gen, also switch the function's 2nd parameter to
> unsigned.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 5/8] x86/HPET: drop "long timeout" handling from reprogram_hpet_evt_channel()
  2025-11-17 14:39 ` [PATCH v4 5/8] x86/HPET: drop "long timeout" handling from reprogram_hpet_evt_channel() Jan Beulich
@ 2026-01-22  9:03   ` Roger Pau Monné
  2026-01-22  9:23     ` Jan Beulich
  0 siblings, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-22  9:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Mon, Nov 17, 2025 at 03:39:03PM +0100, Jan Beulich wrote:
> Neither caller passes STIME_MAX, so (bogusly) handling the case isn't
> necessary.
> 
> "Bogusly" because with 32-bit counters, writing 0 means on average half
> the wrapping period until an interrupt would be raised, while of course
> in extreme cases an interrupt would be raised almost right away.
> 
> Amends: aa42fc0e9cd9 ("cpuidle: remove hpet access in hpet_broadcast_exit")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> v3: Drop the code instead of adjusting it.
> 
> --- a/xen/arch/x86/hpet.c
> +++ b/xen/arch/x86/hpet.c
> @@ -162,13 +162,6 @@ static int reprogram_hpet_evt_channel(
>  
>      ch->next_event = expire;
>  
> -    if ( expire == STIME_MAX )
> -    {
> -        /* We assume it will take a long time for the timer to wrap. */
> -        hpet_write32(0, HPET_Tn_CMP(ch->idx));
> -        return 0;
> -    }

I wouldn't mind if you replaced this with an ASSERT(expire != STIME_MAX);

Thanks, Roger.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 6/8] x86/HPET: simplify "expire" check a little in reprogram_hpet_evt_channel()
  2025-11-17 14:39 ` [PATCH v4 6/8] x86/HPET: simplify "expire" check a little in reprogram_hpet_evt_channel() Jan Beulich
@ 2026-01-22  9:18   ` Roger Pau Monné
  2026-01-22  9:28     ` Jan Beulich
  0 siblings, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-22  9:18 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Mon, Nov 17, 2025 at 03:39:30PM +0100, Jan Beulich wrote:
> When this was added, the log message was updated correctly, but the zero
> case was needlessly checked separately: hpet_broadcast_enter() had a zero
> check added at the same time, while handle_hpet_broadcast() can't possibly
> pass 0 here anyway.
> 
> Fixes: 7145897cfb81 ("cpuidle: Fix for timer_deadline==0 case")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Similar to the previous commit, I wonder whether it would make sense
to add an ASSERT_UNREACHABLE() if that error path is not reachable
given the logic in the callers.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 5/8] x86/HPET: drop "long timeout" handling from reprogram_hpet_evt_channel()
  2026-01-22  9:03   ` Roger Pau Monné
@ 2026-01-22  9:23     ` Jan Beulich
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Beulich @ 2026-01-22  9:23 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On 22.01.2026 10:03, Roger Pau Monné wrote:
> On Mon, Nov 17, 2025 at 03:39:03PM +0100, Jan Beulich wrote:
>> Neither caller passes STIME_MAX, so (bogusly) handling the case isn't
>> necessary.
>>
>> "Bogusly" because with 32-bit counters, writing 0 means on average half
>> the wrapping period until an interrupt would be raised, while of course
>> in extreme cases an interrupt would be raised almost right away.
>>
>> Amends: aa42fc0e9cd9 ("cpuidle: remove hpet access in hpet_broadcast_exit")
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> --- a/xen/arch/x86/hpet.c
>> +++ b/xen/arch/x86/hpet.c
>> @@ -162,13 +162,6 @@ static int reprogram_hpet_evt_channel(
>>  
>>      ch->next_event = expire;
>>  
>> -    if ( expire == STIME_MAX )
>> -    {
>> -        /* We assume it will take a long time for the timer to wrap. */
>> -        hpet_write32(0, HPET_Tn_CMP(ch->idx));
>> -        return 0;
>> -    }
> 
> I wouldn't mind if you replaced this with an ASSERT(expire != STIME_MAX);

Hmm, yes, can do.

Jan


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 6/8] x86/HPET: simplify "expire" check a little in reprogram_hpet_evt_channel()
  2026-01-22  9:18   ` Roger Pau Monné
@ 2026-01-22  9:28     ` Jan Beulich
  2026-01-22 10:10       ` Roger Pau Monné
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2026-01-22  9:28 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On 22.01.2026 10:18, Roger Pau Monné wrote:
> On Mon, Nov 17, 2025 at 03:39:30PM +0100, Jan Beulich wrote:
>> When this was added, the log message was updated correctly, but the zero
>> case was needlessly checked separately: hpet_broadcast_enter() had a zero
>> check added at the same time, while handle_hpet_broadcast() can't possibly
>> pass 0 here anyway.
>>
>> Fixes: 7145897cfb81 ("cpuidle: Fix for timer_deadline==0 case")
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> Similar to the previous commit, I wonder whether it would make sense
> to add an ASSERT_UNREACHABLE() if that error path is not reachable
> given the logic in the callers.

That would mean

    if ( unlikely(expire < 0) )
    {
        printk(KERN_DEBUG "reprogram: expire <= 0\n");
        return -ETIME;
    }

    if ( unlikely(expire == 0) )
    {
        ASSERT_UNREACHABLE();
        return -ETIME;
    }

which I fear I don't like (for going too far). Even

    if ( unlikely(expire <= 0) )
    {
        printk(KERN_DEBUG "reprogram: expire <= 0\n");
        ASSERT(expire);
        return -ETIME;
    }

I'd be uncertain about, as that needlessly gives 0 a meaning that isn't
required anymore in this function.

Jan


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 7/8] x86/HPET: drop .set_affinity hook
  2025-11-17 14:39 ` [PATCH v4 7/8] x86/HPET: drop .set_affinity hook Jan Beulich
@ 2026-01-22 10:05   ` Roger Pau Monné
  0 siblings, 0 replies; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-22 10:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Mon, Nov 17, 2025 at 03:39:50PM +0100, Jan Beulich wrote:
> No IRQ balancing is supposed to be happening on the broadcast IRQs. The
> only entity responsible for fiddling with the CPU affinities is
> set_channel_irq_affinity(). They shouldn't even be fiddled with when
> offlining a CPU: A CPU going down can't at the same time be idle. Some
> properties (->arch.cpu_mask in particular) may transiently reference an
> offline CPU, but that'll be adjusted as soon as a channel goes into active
> use again.

Hm, we where likely pointlessly migrating the HPET vector in
fixup_irqs() previous to this change, but such movement shouldn't be
fatal, just pointless.

> Along with adjusting fixup_irqs() (in a more general way, i.e. covering all
> vectors which are marked in use globally), also adjust section placement of
> used_vectors.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 6/8] x86/HPET: simplify "expire" check a little in reprogram_hpet_evt_channel()
  2026-01-22  9:28     ` Jan Beulich
@ 2026-01-22 10:10       ` Roger Pau Monné
  2026-01-22 10:15         ` Jan Beulich
  0 siblings, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-22 10:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Thu, Jan 22, 2026 at 10:28:51AM +0100, Jan Beulich wrote:
> On 22.01.2026 10:18, Roger Pau Monné wrote:
> > On Mon, Nov 17, 2025 at 03:39:30PM +0100, Jan Beulich wrote:
> >> When this was added, the log message was updated correctly, but the zero
> >> case was needlessly checked separately: hpet_broadcast_enter() had a zero
> >> check added at the same time, while handle_hpet_broadcast() can't possibly
> >> pass 0 here anyway.
> >>
> >> Fixes: 7145897cfb81 ("cpuidle: Fix for timer_deadline==0 case")
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Acked-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks.
> 
> > Similar to the previous commit, I wonder whether it would make sense
> > to add an ASSERT_UNREACHABLE() if that error path is not reachable
> > given the logic in the callers.
> 
> That would mean
> 
>     if ( unlikely(expire < 0) )
>     {
>         printk(KERN_DEBUG "reprogram: expire <= 0\n");
>         return -ETIME;
>     }
> 
>     if ( unlikely(expire == 0) )
>     {
>         ASSERT_UNREACHABLE();
>         return -ETIME;
>     }
> 
> which I fear I don't like (for going too far). Even
> 
>     if ( unlikely(expire <= 0) )
>     {
>         printk(KERN_DEBUG "reprogram: expire <= 0\n");
>         ASSERT(expire);
>         return -ETIME;
>     }
> 
> I'd be uncertain about, as that needlessly gives 0 a meaning that isn't
> required anymore in this function.

Hm, OK, I was under the impression that both < 0 and 0 should never be
passed by the callers.  If expire == 0 is a possible input then I
don't think the ASSERT() is that helpful.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 6/8] x86/HPET: simplify "expire" check a little in reprogram_hpet_evt_channel()
  2026-01-22 10:10       ` Roger Pau Monné
@ 2026-01-22 10:15         ` Jan Beulich
  2026-01-22 11:30           ` Roger Pau Monné
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2026-01-22 10:15 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On 22.01.2026 11:10, Roger Pau Monné wrote:
> On Thu, Jan 22, 2026 at 10:28:51AM +0100, Jan Beulich wrote:
>> On 22.01.2026 10:18, Roger Pau Monné wrote:
>>> On Mon, Nov 17, 2025 at 03:39:30PM +0100, Jan Beulich wrote:
>>>> When this was added, the log message was updated correctly, but the zero
>>>> case was needlessly checked separately: hpet_broadcast_enter() had a zero
>>>> check added at the same time, while handle_hpet_broadcast() can't possibly
>>>> pass 0 here anyway.
>>>>
>>>> Fixes: 7145897cfb81 ("cpuidle: Fix for timer_deadline==0 case")
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>
>>> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
>>
>> Thanks.
>>
>>> Similar to the previous commit, I wonder whether it would make sense
>>> to add an ASSERT_UNREACHABLE() if that error path is not reachable
>>> given the logic in the callers.
>>
>> That would mean
>>
>>     if ( unlikely(expire < 0) )
>>     {
>>         printk(KERN_DEBUG "reprogram: expire <= 0\n");
>>         return -ETIME;
>>     }
>>
>>     if ( unlikely(expire == 0) )
>>     {
>>         ASSERT_UNREACHABLE();
>>         return -ETIME;
>>     }
>>
>> which I fear I don't like (for going too far). Even
>>
>>     if ( unlikely(expire <= 0) )
>>     {
>>         printk(KERN_DEBUG "reprogram: expire <= 0\n");
>>         ASSERT(expire);
>>         return -ETIME;
>>     }
>>
>> I'd be uncertain about, as that needlessly gives 0 a meaning that isn't
>> required anymore in this function.
> 
> Hm, OK, I was under the impression that both < 0 and 0 should never be
> passed by the callers.  If expire == 0 is a possible input then I
> don't think the ASSERT() is that helpful.

Oh, so you were after

    if ( unlikely(expire <= 0) )
    {
        printk(KERN_DEBUG "reprogram: expire <= 0\n");
        ASSERT_UNREACHABLE();
        return -ETIME;
    }

(perhaps even with the printk() dropped)? That I could buy off on, as NOW()
is expected to only ever return valid (positive) s_time_t values.

Jan


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 8/8] x86/HPET: don't arbitrarily cap delta in reprogram_hpet_evt_channel()
  2025-11-17 14:40 ` [PATCH v4 8/8] x86/HPET: don't arbitrarily cap delta in reprogram_hpet_evt_channel() Jan Beulich
@ 2026-01-22 10:23   ` Roger Pau Monné
  2026-01-22 10:35     ` Jan Beulich
  0 siblings, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-22 10:23 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Mon, Nov 17, 2025 at 03:40:08PM +0100, Jan Beulich wrote:
> There's no reason to set an arbitrary upper bound of 10 seconds. We can
> simply set the comparator such that it'll take a whole cycle through all
> 32-bit values until the next interrupt would be raised. (For an extremely
> fast-running HPET [400 MHz and up] 10 seconds would also be too long.)
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v4: New.
> 
> --- a/xen/arch/x86/hpet.c
> +++ b/xen/arch/x86/hpet.c
> @@ -23,7 +23,6 @@
>  #include <asm/irq-vectors.h>
>  #include <asm/msi.h>
>  
> -#define MAX_DELTA_NS MILLISECS(10*1000)
>  #define MIN_DELTA_NS MICROSECS(20)
>  
>  #define HPET_EVT_USED_BIT    0
> @@ -162,10 +161,15 @@ static int reprogram_hpet_evt_channel(
>  
>      ch->next_event = expire;
>  
> -    delta = min_t(int64_t, delta, MAX_DELTA_NS);
>      delta = max_t(int64_t, delta, MIN_DELTA_NS);
>      delta = ns2ticks(delta, ch->shift, ch->mult);
>  
> +    if ( delta > UINT32_MAX )
> +    {
> +        hpet_write32(hpet_read32(HPET_COUNTER), HPET_Tn_CMP(ch->idx));

Should Xen disable interrupts around this call to avoid unexpected
latency between the counter read and the comparator write?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 3/8] x86/HPET: move legacy tick IRQ count adjustment
  2026-01-22  8:50   ` Roger Pau Monné
@ 2026-01-22 10:31     ` Jan Beulich
  2026-01-22 11:21       ` Roger Pau Monné
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2026-01-22 10:31 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On 22.01.2026 09:50, Roger Pau Monné wrote:
> On Mon, Nov 17, 2025 at 03:37:45PM +0100, Jan Beulich wrote:
>> If already we play with the IRQ count, we should do so only if we actually
>> "consume" the interrupt; normal timer IRQs should not have any adjustment
>> done.
>>
>> Fixes: 353533232730 ("cpuidle: fix the menu governor to enhance IO performance")
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> _Why_ we do these adjustments (also elsewhere) I don't really know.
> 
> I think I have an idea of what's going on here.  This accounting is
> used by the idle governor to decide when to go idle.  On Linux (where
> the code is imported from) the governor took into account the inflight
> IO request state.  However that's not available to Xen and instead
> they decided to mimic the tracking of the IO activity by counting
> interrupts.  I bet then realized the timer interrupt would "skew"
> those results and make it look like there's IO activity when the
> system is otherwise mostly idle.

Hmm, yes, that sounds pretty plausible. Except for one aspect: Why would
it be I/O that the governor would care about? It wants to judge by the
system being busy, and timer interrupts generally are an indication of
busyness. Just not broadcast ones. Hence ...

>> --- a/xen/arch/x86/hpet.c
>> +++ b/xen/arch/x86/hpet.c
>> @@ -808,13 +808,13 @@ int hpet_broadcast_is_available(void)
>>  
>>  int hpet_legacy_irq_tick(void)
>>  {
>> -    this_cpu(irq_count)--;
> 
> I think you want to pull this decrease into timer_interrupt() itself,
> so it does the decrease unconditionally of whether the interrupt is a
> legacy HPET one or from the PIT?

... I think moving to timer_interrupt() would actually be wrong.

> By gating the decrease on the interrupt having been originated from
> the HPET you completely avoid the decrease in the PIT case AFAICT.
> 
>> -
>>      if ( !hpet_events ||
>>           (hpet_events->flags & (HPET_EVT_DISABLE|HPET_EVT_LEGACY)) !=
>>           HPET_EVT_LEGACY )
>>          return 0;
>>  
>> +    this_cpu(irq_count)--;
> 
> Also in hpet_interrupt_handler() we might consider only doing the
> decrease after we ensure it's not a spurious interrupt?  We don't seem
> to decrease irq_count for spurious interrupts elsewhere.

Even a spurious interrupt is only an idle management auxiliary one (i.e.
really an artifact thereof). It doesn't hint at the system being busy.

Jan


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 8/8] x86/HPET: don't arbitrarily cap delta in reprogram_hpet_evt_channel()
  2026-01-22 10:23   ` Roger Pau Monné
@ 2026-01-22 10:35     ` Jan Beulich
  2026-01-22 11:29       ` Roger Pau Monné
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2026-01-22 10:35 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On 22.01.2026 11:23, Roger Pau Monné wrote:
> On Mon, Nov 17, 2025 at 03:40:08PM +0100, Jan Beulich wrote:
>> @@ -162,10 +161,15 @@ static int reprogram_hpet_evt_channel(
>>  
>>      ch->next_event = expire;
>>  
>> -    delta = min_t(int64_t, delta, MAX_DELTA_NS);
>>      delta = max_t(int64_t, delta, MIN_DELTA_NS);
>>      delta = ns2ticks(delta, ch->shift, ch->mult);
>>  
>> +    if ( delta > UINT32_MAX )
>> +    {
>> +        hpet_write32(hpet_read32(HPET_COUNTER), HPET_Tn_CMP(ch->idx));
> 
> Should Xen disable interrupts around this call to avoid unexpected
> latency between the counter read and the comparator write?

Such latency could then still arise, due NMI or SMI. What's your underlying
concern here?

Jan


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 3/8] x86/HPET: move legacy tick IRQ count adjustment
  2026-01-22 10:31     ` Jan Beulich
@ 2026-01-22 11:21       ` Roger Pau Monné
  0 siblings, 0 replies; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-22 11:21 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Thu, Jan 22, 2026 at 11:31:52AM +0100, Jan Beulich wrote:
> On 22.01.2026 09:50, Roger Pau Monné wrote:
> > On Mon, Nov 17, 2025 at 03:37:45PM +0100, Jan Beulich wrote:
> >> If already we play with the IRQ count, we should do so only if we actually
> >> "consume" the interrupt; normal timer IRQs should not have any adjustment
> >> done.
> >>
> >> Fixes: 353533232730 ("cpuidle: fix the menu governor to enhance IO performance")
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> _Why_ we do these adjustments (also elsewhere) I don't really know.
> > 
> > I think I have an idea of what's going on here.  This accounting is
> > used by the idle governor to decide when to go idle.  On Linux (where
> > the code is imported from) the governor took into account the inflight
> > IO request state.  However that's not available to Xen and instead
> > they decided to mimic the tracking of the IO activity by counting
> > interrupts.  I bet then realized the timer interrupt would "skew"
> > those results and make it look like there's IO activity when the
> > system is otherwise mostly idle.
> 
> Hmm, yes, that sounds pretty plausible. Except for one aspect: Why would
> it be I/O that the governor would care about?

This is all hypothetical, I don't know the real reasons.  I think they
aimed to avoid putting the system in deep idle states if there's IO
gong on, regardless of whether the CPU is otherwise idle.  Putting the
system in those deeper idle states would also increase interrupt
latency.

I'm not arguing the initial purpose was correct, just attempting to
make sense of all of this.

> It wants to judge by the
> system being busy, and timer interrupts generally are an indication of
> busyness. Just not broadcast ones. Hence ...
> 
> >> --- a/xen/arch/x86/hpet.c
> >> +++ b/xen/arch/x86/hpet.c
> >> @@ -808,13 +808,13 @@ int hpet_broadcast_is_available(void)
> >>  
> >>  int hpet_legacy_irq_tick(void)
> >>  {
> >> -    this_cpu(irq_count)--;
> > 
> > I think you want to pull this decrease into timer_interrupt() itself,
> > so it does the decrease unconditionally of whether the interrupt is a
> > legacy HPET one or from the PIT?
> 
> ... I think moving to timer_interrupt() would actually be wrong.

Hm, I see.  It's only HPET broadcast we want to avoid accounting for.

> > By gating the decrease on the interrupt having been originated from
> > the HPET you completely avoid the decrease in the PIT case AFAICT.
> > 
> >> -
> >>      if ( !hpet_events ||
> >>           (hpet_events->flags & (HPET_EVT_DISABLE|HPET_EVT_LEGACY)) !=
> >>           HPET_EVT_LEGACY )
> >>          return 0;
> >>  
> >> +    this_cpu(irq_count)--;
> > 
> > Also in hpet_interrupt_handler() we might consider only doing the
> > decrease after we ensure it's not a spurious interrupt?  We don't seem
> > to decrease irq_count for spurious interrupts elsewhere.
> 
> Even a spurious interrupt is only an idle management auxiliary one (i.e.
> really an artifact thereof). It doesn't hint at the system being busy.

Right, I was mislead and somehow assumed the intent was to avoid this
counting for all timer interrupts.  Instead is just the HPET broadcast
that not accounted for.

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 8/8] x86/HPET: don't arbitrarily cap delta in reprogram_hpet_evt_channel()
  2026-01-22 10:35     ` Jan Beulich
@ 2026-01-22 11:29       ` Roger Pau Monné
  0 siblings, 0 replies; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-22 11:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Thu, Jan 22, 2026 at 11:35:06AM +0100, Jan Beulich wrote:
> On 22.01.2026 11:23, Roger Pau Monné wrote:
> > On Mon, Nov 17, 2025 at 03:40:08PM +0100, Jan Beulich wrote:
> >> @@ -162,10 +161,15 @@ static int reprogram_hpet_evt_channel(
> >>  
> >>      ch->next_event = expire;
> >>  
> >> -    delta = min_t(int64_t, delta, MAX_DELTA_NS);
> >>      delta = max_t(int64_t, delta, MIN_DELTA_NS);
> >>      delta = ns2ticks(delta, ch->shift, ch->mult);
> >>  
> >> +    if ( delta > UINT32_MAX )
> >> +    {
> >> +        hpet_write32(hpet_read32(HPET_COUNTER), HPET_Tn_CMP(ch->idx));
> > 
> > Should Xen disable interrupts around this call to avoid unexpected
> > latency between the counter read and the comparator write?
> 
> Such latency could then still arise, due NMI or SMI. What's your underlying
> concern here?

For NMI or SMI there isn't much we can do.  I guess this is much less
of a concern here than it is in hpet_next_event(), given the next
event is expected to be after a full HPET counter period.  One of
those events taking a full HPET counter period overlap would make a
lot of others things explode.

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 6/8] x86/HPET: simplify "expire" check a little in reprogram_hpet_evt_channel()
  2026-01-22 10:15         ` Jan Beulich
@ 2026-01-22 11:30           ` Roger Pau Monné
  2026-01-22 12:50             ` Jan Beulich
  0 siblings, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-22 11:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On Thu, Jan 22, 2026 at 11:15:49AM +0100, Jan Beulich wrote:
> On 22.01.2026 11:10, Roger Pau Monné wrote:
> > On Thu, Jan 22, 2026 at 10:28:51AM +0100, Jan Beulich wrote:
> >> On 22.01.2026 10:18, Roger Pau Monné wrote:
> >>> On Mon, Nov 17, 2025 at 03:39:30PM +0100, Jan Beulich wrote:
> >>>> When this was added, the log message was updated correctly, but the zero
> >>>> case was needlessly checked separately: hpet_broadcast_enter() had a zero
> >>>> check added at the same time, while handle_hpet_broadcast() can't possibly
> >>>> pass 0 here anyway.
> >>>>
> >>>> Fixes: 7145897cfb81 ("cpuidle: Fix for timer_deadline==0 case")
> >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>>
> >>> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
> >>
> >> Thanks.
> >>
> >>> Similar to the previous commit, I wonder whether it would make sense
> >>> to add an ASSERT_UNREACHABLE() if that error path is not reachable
> >>> given the logic in the callers.
> >>
> >> That would mean
> >>
> >>     if ( unlikely(expire < 0) )
> >>     {
> >>         printk(KERN_DEBUG "reprogram: expire <= 0\n");
> >>         return -ETIME;
> >>     }
> >>
> >>     if ( unlikely(expire == 0) )
> >>     {
> >>         ASSERT_UNREACHABLE();
> >>         return -ETIME;
> >>     }
> >>
> >> which I fear I don't like (for going too far). Even
> >>
> >>     if ( unlikely(expire <= 0) )
> >>     {
> >>         printk(KERN_DEBUG "reprogram: expire <= 0\n");
> >>         ASSERT(expire);
> >>         return -ETIME;
> >>     }
> >>
> >> I'd be uncertain about, as that needlessly gives 0 a meaning that isn't
> >> required anymore in this function.
> > 
> > Hm, OK, I was under the impression that both < 0 and 0 should never be
> > passed by the callers.  If expire == 0 is a possible input then I
> > don't think the ASSERT() is that helpful.
> 
> Oh, so you were after
> 
>     if ( unlikely(expire <= 0) )
>     {
>         printk(KERN_DEBUG "reprogram: expire <= 0\n");
>         ASSERT_UNREACHABLE();
>         return -ETIME;
>     }
> 
> (perhaps even with the printk() dropped)? That I could buy off on, as NOW()
> is expected to only ever return valid (positive) s_time_t values.

Yes, that's what I was thinking off, but your previous reply made me
think there are possible cases where expire < 0 gets passed to the
function?

If that's not the case, adding the ASSERT_UNREACHABLE() would be my
preference.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 6/8] x86/HPET: simplify "expire" check a little in reprogram_hpet_evt_channel()
  2026-01-22 11:30           ` Roger Pau Monné
@ 2026-01-22 12:50             ` Jan Beulich
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Beulich @ 2026-01-22 12:50 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper

On 22.01.2026 12:30, Roger Pau Monné wrote:
> On Thu, Jan 22, 2026 at 11:15:49AM +0100, Jan Beulich wrote:
>> On 22.01.2026 11:10, Roger Pau Monné wrote:
>>> On Thu, Jan 22, 2026 at 10:28:51AM +0100, Jan Beulich wrote:
>>>> On 22.01.2026 10:18, Roger Pau Monné wrote:
>>>>> On Mon, Nov 17, 2025 at 03:39:30PM +0100, Jan Beulich wrote:
>>>>>> When this was added, the log message was updated correctly, but the zero
>>>>>> case was needlessly checked separately: hpet_broadcast_enter() had a zero
>>>>>> check added at the same time, while handle_hpet_broadcast() can't possibly
>>>>>> pass 0 here anyway.
>>>>>>
>>>>>> Fixes: 7145897cfb81 ("cpuidle: Fix for timer_deadline==0 case")
>>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>>
>>>>> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
>>>>
>>>> Thanks.
>>>>
>>>>> Similar to the previous commit, I wonder whether it would make sense
>>>>> to add an ASSERT_UNREACHABLE() if that error path is not reachable
>>>>> given the logic in the callers.
>>>>
>>>> That would mean
>>>>
>>>>     if ( unlikely(expire < 0) )
>>>>     {
>>>>         printk(KERN_DEBUG "reprogram: expire <= 0\n");
>>>>         return -ETIME;
>>>>     }
>>>>
>>>>     if ( unlikely(expire == 0) )
>>>>     {
>>>>         ASSERT_UNREACHABLE();
>>>>         return -ETIME;
>>>>     }
>>>>
>>>> which I fear I don't like (for going too far). Even
>>>>
>>>>     if ( unlikely(expire <= 0) )
>>>>     {
>>>>         printk(KERN_DEBUG "reprogram: expire <= 0\n");
>>>>         ASSERT(expire);
>>>>         return -ETIME;
>>>>     }
>>>>
>>>> I'd be uncertain about, as that needlessly gives 0 a meaning that isn't
>>>> required anymore in this function.
>>>
>>> Hm, OK, I was under the impression that both < 0 and 0 should never be
>>> passed by the callers.  If expire == 0 is a possible input then I
>>> don't think the ASSERT() is that helpful.
>>
>> Oh, so you were after
>>
>>     if ( unlikely(expire <= 0) )
>>     {
>>         printk(KERN_DEBUG "reprogram: expire <= 0\n");
>>         ASSERT_UNREACHABLE();
>>         return -ETIME;
>>     }
>>
>> (perhaps even with the printk() dropped)? That I could buy off on, as NOW()
>> is expected to only ever return valid (positive) s_time_t values.
> 
> Yes, that's what I was thinking off, but your previous reply made me
> think there are possible cases where expire < 0 gets passed to the
> function?

No, I don't think there are any.

> If that's not the case, adding the ASSERT_UNREACHABLE() would be my
> preference.

Okay, that's what I'll commit then.

Jan


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2026-01-22 12:51 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-17 14:35 [PATCH v4 0/8] x86/HPET: tidying / improvements Jan Beulich
2025-11-17 14:37 ` [PATCH v4 1/8] x86/HPET: avoid indirect call to event handler Jan Beulich
2026-01-21 16:19   ` Roger Pau Monné
2025-11-17 14:37 ` [PATCH v4 2/8] x86/HPET: make another channel flags update atomic Jan Beulich
2026-01-21 17:55   ` Roger Pau Monné
2025-11-17 14:37 ` [PATCH v4 3/8] x86/HPET: move legacy tick IRQ count adjustment Jan Beulich
2026-01-22  8:50   ` Roger Pau Monné
2026-01-22 10:31     ` Jan Beulich
2026-01-22 11:21       ` Roger Pau Monné
2025-11-17 14:38 ` [PATCH v4 4/8] x86/HPET: reduce hpet_next_event() call sites Jan Beulich
2026-01-22  9:00   ` Roger Pau Monné
2025-11-17 14:39 ` [PATCH v4 5/8] x86/HPET: drop "long timeout" handling from reprogram_hpet_evt_channel() Jan Beulich
2026-01-22  9:03   ` Roger Pau Monné
2026-01-22  9:23     ` Jan Beulich
2025-11-17 14:39 ` [PATCH v4 6/8] x86/HPET: simplify "expire" check a little in reprogram_hpet_evt_channel() Jan Beulich
2026-01-22  9:18   ` Roger Pau Monné
2026-01-22  9:28     ` Jan Beulich
2026-01-22 10:10       ` Roger Pau Monné
2026-01-22 10:15         ` Jan Beulich
2026-01-22 11:30           ` Roger Pau Monné
2026-01-22 12:50             ` Jan Beulich
2025-11-17 14:39 ` [PATCH v4 7/8] x86/HPET: drop .set_affinity hook Jan Beulich
2026-01-22 10:05   ` Roger Pau Monné
2025-11-17 14:40 ` [PATCH v4 8/8] x86/HPET: don't arbitrarily cap delta in reprogram_hpet_evt_channel() Jan Beulich
2026-01-22 10:23   ` Roger Pau Monné
2026-01-22 10:35     ` Jan Beulich
2026-01-22 11:29       ` Roger Pau Monné

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.