[PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements
@ 2025-10-16  7:30 Jan Beulich
  2025-10-16  7:31 ` [PATCH for-4.21 01/10] x86/HPET: limit channel changes Jan Beulich
                   ` (11 more replies)
  0 siblings, 12 replies; 41+ messages in thread
From: Jan Beulich @ 2025-10-16  7:30 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko

While 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt processing")
helped quite a bit, nested interrupts could still occur. First and foremost
as a result from IRQ migration (where we don't have any control over the
vectors chosen). Hence besides reducing the number of IRQs that can be raised
(first two patches) and possibly the number of invocations of
handle_hpet_broadcast() from the IRQ handler (optional patch 4), the main
goal here is to eliminate the potential for nested IRQs (patch 3). These
patches are imo 4.21 candidates (with patch 4 being questionable altogether;
see there). Patches 5 and onwards likely aren't important enough anymore at
this point of the release cycle, even if those with a Fixes: tag would likely
end up being backported later on.

The one related thing I haven't been able to find a viable solution for is
the elimination of the cpumask_t local variable in handle_hpet_broadcast().
That'll get in the way of possible future increases of the NR_CPUS upper
bound: Much like right now a single level of nesting is already too much,
if the limit was doubled even a single IRQ would end up consuming too much
stack space (together with cpumask_raise_softirq() also having such a
variable). Yet further doubling would not allow any such stack variables
anymore.

01: limit channel changes
02: disable unused channels
03: use single, global, low-priority vector for broadcast IRQ
04: ignore "stale" IRQs
05: avoid indirect call to event handler
06: make another channel flags update atomic
07: move legacy tick IRQ count adjustment
08: shrink IRQ-descriptor locked region in set_channel_irq_affinity()
09: reduce hpet_next_event() call sites
10: don't use hardcoded 0 for "long timeout"

Jan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH for-4.21 01/10] x86/HPET: limit channel changes
  2025-10-16  7:30 [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Jan Beulich
@ 2025-10-16  7:31 ` Jan Beulich
  2025-10-16 10:24   ` Roger Pau Monné
  2025-10-17  9:23   ` Roger Pau Monné
  2025-10-16  7:31 ` [PATCH for-4.21 02/10] x86/HPET: disable unused channels Jan Beulich
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 41+ messages in thread
From: Jan Beulich @ 2025-10-16  7:31 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko

Despite 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt
processing") we can still observe nested invocations of
hpet_interrupt_handler(). This is, afaict, a result of previously used
channels retaining their IRQ affinity until some other CPU re-uses them.
Such nesting is increasingly problematic with higher CPU counts, as both
handle_hpet_broadcast() and cpumask_raise_softirq() have a cpumask_t local
variable. IOW already a single level of nesting may require more stack
space (2 times above 4k) than we have available (8k), when NR_CPUS=16383
(the maximum value presently possible).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Whether this is still worthwhile with "x86/HPET: use single, global, low-
priority vector for broadcast IRQ" isn't quite clear to me.

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -442,6 +442,8 @@ static void __init hpet_fsb_cap_lookup(v
            num_hpets_used, num_chs);
 }
 
+static DEFINE_PER_CPU(struct hpet_event_channel *, lru_channel);
+
 static struct hpet_event_channel *hpet_get_channel(unsigned int cpu)
 {
     static unsigned int next_channel;
@@ -454,9 +456,21 @@ static struct hpet_event_channel *hpet_g
     if ( num_hpets_used >= nr_cpu_ids )
         return &hpet_events[cpu];
 
+    /*
+     * Try the least recently used channel first.  It may still have its IRQ's
+     * affinity set to the desired CPU.  This way we also limit having multiple
+     * of our IRQs raised on the same CPU, in possibly a nested manner.
+     */
+    ch = per_cpu(lru_channel, cpu);
+    if ( ch && !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) )
+    {
+        ch->cpu = cpu;
+        return ch;
+    }
+
+    /* Then look for an unused channel. */
     next = arch_fetch_and_add(&next_channel, 1) % num_hpets_used;
 
-    /* try unused channel first */
     for ( i = next; i < next + num_hpets_used; i++ )
     {
         ch = &hpet_events[i % num_hpets_used];
@@ -479,6 +493,8 @@ static void set_channel_irq_affinity(str
 {
     struct irq_desc *desc = irq_to_desc(ch->msi.irq);
 
+    per_cpu(lru_channel, ch->cpu) = ch;
+
     ASSERT(!local_irq_is_enabled());
     spin_lock(&desc->lock);
     hpet_msi_mask(desc);



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 01/10] x86/HPET: limit channel changes
  2025-10-16  7:31 ` [PATCH for-4.21 01/10] x86/HPET: limit channel changes Jan Beulich
@ 2025-10-16 10:24   ` Roger Pau Monné
  2025-10-16 11:47     ` Jan Beulich
  2025-10-17  9:23   ` Roger Pau Monné
  1 sibling, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-16 10:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Thu, Oct 16, 2025 at 09:31:21AM +0200, Jan Beulich wrote:
> Despite 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt
> processing") we can still observe nested invocations of
> hpet_interrupt_handler(). This is, afaict, a result of previously used
> channels retaining their IRQ affinity until some other CPU re-uses them.

But the underlying problem here is not so much the affinity itself,
but the fact that the channel is not stopped after firing?

> Such nesting is increasingly problematic with higher CPU counts, as both
> handle_hpet_broadcast() and cpumask_raise_softirq() have a cpumask_t local
> variable. IOW already a single level of nesting may require more stack
> space (2 times above 4k) than we have available (8k), when NR_CPUS=16383
> (the maximum value presently possible).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Whether this is still worthwhile with "x86/HPET: use single, global, low-
> priority vector for broadcast IRQ" isn't quite clear to me.
> 
> --- a/xen/arch/x86/hpet.c
> +++ b/xen/arch/x86/hpet.c
> @@ -442,6 +442,8 @@ static void __init hpet_fsb_cap_lookup(v
>             num_hpets_used, num_chs);
>  }
>  
> +static DEFINE_PER_CPU(struct hpet_event_channel *, lru_channel);
> +
>  static struct hpet_event_channel *hpet_get_channel(unsigned int cpu)
>  {
>      static unsigned int next_channel;
> @@ -454,9 +456,21 @@ static struct hpet_event_channel *hpet_g
>      if ( num_hpets_used >= nr_cpu_ids )
>          return &hpet_events[cpu];
>  
> +    /*
> +     * Try the least recently used channel first.  It may still have its IRQ's
> +     * affinity set to the desired CPU.  This way we also limit having multiple
> +     * of our IRQs raised on the same CPU, in possibly a nested manner.
> +     */
> +    ch = per_cpu(lru_channel, cpu);
> +    if ( ch && !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) )
> +    {
> +        ch->cpu = cpu;
> +        return ch;
> +    }
> +
> +    /* Then look for an unused channel. */
>      next = arch_fetch_and_add(&next_channel, 1) % num_hpets_used;
>  
> -    /* try unused channel first */
>      for ( i = next; i < next + num_hpets_used; i++ )
>      {
>          ch = &hpet_events[i % num_hpets_used];
> @@ -479,6 +493,8 @@ static void set_channel_irq_affinity(str
>  {
>      struct irq_desc *desc = irq_to_desc(ch->msi.irq);
>  
> +    per_cpu(lru_channel, ch->cpu) = ch;
> +
>      ASSERT(!local_irq_is_enabled());
>      spin_lock(&desc->lock);
>      hpet_msi_mask(desc);

Maybe I'm missing the point here, but you are resetting the MSI
affinity anyway here, so there isn't much point in attempting to
re-use the same channel when Xen still unconditionally goes through the
process of setting the affinity anyway?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 01/10] x86/HPET: limit channel changes
  2025-10-16 10:24   ` Roger Pau Monné
@ 2025-10-16 11:47     ` Jan Beulich
  2025-10-16 15:07       ` Roger Pau Monné
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2025-10-16 11:47 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On 16.10.2025 12:24, Roger Pau Monné wrote:
> On Thu, Oct 16, 2025 at 09:31:21AM +0200, Jan Beulich wrote:
>> Despite 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt
>> processing") we can still observe nested invocations of
>> hpet_interrupt_handler(). This is, afaict, a result of previously used
>> channels retaining their IRQ affinity until some other CPU re-uses them.
> 
> But the underlying problem here is not so much the affinity itself,
> but the fact that the channel is not stopped after firing?

(when being detached, that is) That's the main problem here, yes. A minor
benefit is to avoid the MMIO write in hpet_msi_set_affinity(). See also
below.

Further, even when mask while detaching, the issue would re-surface after
unmasking; it's just that the window then is smaller.

>> @@ -454,9 +456,21 @@ static struct hpet_event_channel *hpet_g
>>      if ( num_hpets_used >= nr_cpu_ids )
>>          return &hpet_events[cpu];
>>  
>> +    /*
>> +     * Try the least recently used channel first.  It may still have its IRQ's
>> +     * affinity set to the desired CPU.  This way we also limit having multiple
>> +     * of our IRQs raised on the same CPU, in possibly a nested manner.
>> +     */
>> +    ch = per_cpu(lru_channel, cpu);
>> +    if ( ch && !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) )
>> +    {
>> +        ch->cpu = cpu;
>> +        return ch;
>> +    }
>> +
>> +    /* Then look for an unused channel. */
>>      next = arch_fetch_and_add(&next_channel, 1) % num_hpets_used;
>>  
>> -    /* try unused channel first */
>>      for ( i = next; i < next + num_hpets_used; i++ )
>>      {
>>          ch = &hpet_events[i % num_hpets_used];
>> @@ -479,6 +493,8 @@ static void set_channel_irq_affinity(str
>>  {
>>      struct irq_desc *desc = irq_to_desc(ch->msi.irq);
>>  
>> +    per_cpu(lru_channel, ch->cpu) = ch;
>> +
>>      ASSERT(!local_irq_is_enabled());
>>      spin_lock(&desc->lock);
>>      hpet_msi_mask(desc);
> 
> Maybe I'm missing the point here, but you are resetting the MSI
> affinity anyway here, so there isn't much point in attempting to
> re-use the same channel when Xen still unconditionally goes through the
> process of setting the affinity anyway?

While still using normal IRQs, there's still a benefit: We can re-use the
same vector (as staying on the same CPU), and hence we save an IRQ
migration (being the main source of nested IRQs according to my
observations).

We could actually do even better, by avoiding the mask/unmask pair there,
which would avoid triggering the "immediate" IRQ that I (for now) see as
the only explanation of the large amount of "early" IRQs that I observe
on (at least) Intel hardware. That would require doing the msg.dest32
check earlier, but otherwise looks feasible. (Actually, the unmask would
still be necessary, in case we're called with the channel already masked.)

Jan


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 01/10] x86/HPET: limit channel changes
  2025-10-16 11:47     ` Jan Beulich
@ 2025-10-16 15:07       ` Roger Pau Monné
  2025-10-16 15:16         ` Jan Beulich
  0 siblings, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-16 15:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Thu, Oct 16, 2025 at 01:47:38PM +0200, Jan Beulich wrote:
> On 16.10.2025 12:24, Roger Pau Monné wrote:
> > On Thu, Oct 16, 2025 at 09:31:21AM +0200, Jan Beulich wrote:
> >> Despite 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt
> >> processing") we can still observe nested invocations of
> >> hpet_interrupt_handler(). This is, afaict, a result of previously used
> >> channels retaining their IRQ affinity until some other CPU re-uses them.
> > 
> > But the underlying problem here is not so much the affinity itself,
> > but the fact that the channel is not stopped after firing?
> 
> (when being detached, that is) That's the main problem here, yes. A minor
> benefit is to avoid the MMIO write in hpet_msi_set_affinity(). See also
> below.
> 
> Further, even when mask while detaching, the issue would re-surface after
> unmasking; it's just that the window then is smaller.

Yeah, it could trigger after unmasking, but the window is smaller
there, as after enabling the comparator will get updated to the new
deadline.

> >> @@ -454,9 +456,21 @@ static struct hpet_event_channel *hpet_g
> >>      if ( num_hpets_used >= nr_cpu_ids )
> >>          return &hpet_events[cpu];
> >>  
> >> +    /*
> >> +     * Try the least recently used channel first.  It may still have its IRQ's
> >> +     * affinity set to the desired CPU.  This way we also limit having multiple
> >> +     * of our IRQs raised on the same CPU, in possibly a nested manner.
> >> +     */
> >> +    ch = per_cpu(lru_channel, cpu);
> >> +    if ( ch && !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) )
> >> +    {
> >> +        ch->cpu = cpu;
> >> +        return ch;
> >> +    }
> >> +
> >> +    /* Then look for an unused channel. */
> >>      next = arch_fetch_and_add(&next_channel, 1) % num_hpets_used;
> >>  
> >> -    /* try unused channel first */
> >>      for ( i = next; i < next + num_hpets_used; i++ )
> >>      {
> >>          ch = &hpet_events[i % num_hpets_used];
> >> @@ -479,6 +493,8 @@ static void set_channel_irq_affinity(str
> >>  {
> >>      struct irq_desc *desc = irq_to_desc(ch->msi.irq);
> >>  
> >> +    per_cpu(lru_channel, ch->cpu) = ch;
> >> +
> >>      ASSERT(!local_irq_is_enabled());
> >>      spin_lock(&desc->lock);
> >>      hpet_msi_mask(desc);
> > 
> > Maybe I'm missing the point here, but you are resetting the MSI
> > affinity anyway here, so there isn't much point in attempting to
> > re-use the same channel when Xen still unconditionally goes through the
> > process of setting the affinity anyway?
> 
> While still using normal IRQs, there's still a benefit: We can re-use the
> same vector (as staying on the same CPU), and hence we save an IRQ
> migration (being the main source of nested IRQs according to my
> observations).

Hm, I see.  You short-circuit all the logic in _assign_irq_vector().

> We could actually do even better, by avoiding the mask/unmask pair there,
> which would avoid triggering the "immediate" IRQ that I (for now) see as
> the only explanation of the large amount of "early" IRQs that I observe
> on (at least) Intel hardware. That would require doing the msg.dest32
> check earlier, but otherwise looks feasible. (Actually, the unmask would
> still be necessary, in case we're called with the channel already masked.)

Checking with .dest32 seems a bit crude, I would possibly prefer to
slightly modify hpet_attach_channel() to notice when ch->cpu == cpu
and avoid the call to set_channel_irq_affinity()?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 01/10] x86/HPET: limit channel changes
  2025-10-16 15:07       ` Roger Pau Monné
@ 2025-10-16 15:16         ` Jan Beulich
  2025-10-16 15:25           ` Roger Pau Monné
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2025-10-16 15:16 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On 16.10.2025 17:07, Roger Pau Monné wrote:
> On Thu, Oct 16, 2025 at 01:47:38PM +0200, Jan Beulich wrote:
>> On 16.10.2025 12:24, Roger Pau Monné wrote:
>>> On Thu, Oct 16, 2025 at 09:31:21AM +0200, Jan Beulich wrote:
>>>> @@ -454,9 +456,21 @@ static struct hpet_event_channel *hpet_g
>>>>      if ( num_hpets_used >= nr_cpu_ids )
>>>>          return &hpet_events[cpu];
>>>>  
>>>> +    /*
>>>> +     * Try the least recently used channel first.  It may still have its IRQ's
>>>> +     * affinity set to the desired CPU.  This way we also limit having multiple
>>>> +     * of our IRQs raised on the same CPU, in possibly a nested manner.
>>>> +     */
>>>> +    ch = per_cpu(lru_channel, cpu);
>>>> +    if ( ch && !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) )
>>>> +    {
>>>> +        ch->cpu = cpu;
>>>> +        return ch;
>>>> +    }
>>>> +
>>>> +    /* Then look for an unused channel. */
>>>>      next = arch_fetch_and_add(&next_channel, 1) % num_hpets_used;
>>>>  
>>>> -    /* try unused channel first */
>>>>      for ( i = next; i < next + num_hpets_used; i++ )
>>>>      {
>>>>          ch = &hpet_events[i % num_hpets_used];
>>>> @@ -479,6 +493,8 @@ static void set_channel_irq_affinity(str
>>>>  {
>>>>      struct irq_desc *desc = irq_to_desc(ch->msi.irq);
>>>>  
>>>> +    per_cpu(lru_channel, ch->cpu) = ch;
>>>> +
>>>>      ASSERT(!local_irq_is_enabled());
>>>>      spin_lock(&desc->lock);
>>>>      hpet_msi_mask(desc);
>>>
>>> Maybe I'm missing the point here, but you are resetting the MSI
>>> affinity anyway here, so there isn't much point in attempting to
>>> re-use the same channel when Xen still unconditionally goes through the
>>> process of setting the affinity anyway?
>>
>> While still using normal IRQs, there's still a benefit: We can re-use the
>> same vector (as staying on the same CPU), and hence we save an IRQ
>> migration (being the main source of nested IRQs according to my
>> observations).
> 
> Hm, I see.  You short-circuit all the logic in _assign_irq_vector().
> 
>> We could actually do even better, by avoiding the mask/unmask pair there,
>> which would avoid triggering the "immediate" IRQ that I (for now) see as
>> the only explanation of the large amount of "early" IRQs that I observe
>> on (at least) Intel hardware. That would require doing the msg.dest32
>> check earlier, but otherwise looks feasible. (Actually, the unmask would
>> still be necessary, in case we're called with the channel already masked.)
> 
> Checking with .dest32 seems a bit crude, I would possibly prefer to
> slightly modify hpet_attach_channel() to notice when ch->cpu == cpu
> and avoid the call to set_channel_irq_affinity()?

That would be an always-false condition, wouldn't it? "attach" and "detach"
are used strictly in pairs, and after "detach" ch->cpu != cpu.

Jan


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 01/10] x86/HPET: limit channel changes
  2025-10-16 15:16         ` Jan Beulich
@ 2025-10-16 15:25           ` Roger Pau Monné
  0 siblings, 0 replies; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-16 15:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Thu, Oct 16, 2025 at 05:16:07PM +0200, Jan Beulich wrote:
> On 16.10.2025 17:07, Roger Pau Monné wrote:
> > On Thu, Oct 16, 2025 at 01:47:38PM +0200, Jan Beulich wrote:
> >> On 16.10.2025 12:24, Roger Pau Monné wrote:
> >>> On Thu, Oct 16, 2025 at 09:31:21AM +0200, Jan Beulich wrote:
> >>>> @@ -454,9 +456,21 @@ static struct hpet_event_channel *hpet_g
> >>>>      if ( num_hpets_used >= nr_cpu_ids )
> >>>>          return &hpet_events[cpu];
> >>>>  
> >>>> +    /*
> >>>> +     * Try the least recently used channel first.  It may still have its IRQ's
> >>>> +     * affinity set to the desired CPU.  This way we also limit having multiple
> >>>> +     * of our IRQs raised on the same CPU, in possibly a nested manner.
> >>>> +     */
> >>>> +    ch = per_cpu(lru_channel, cpu);
> >>>> +    if ( ch && !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) )
> >>>> +    {
> >>>> +        ch->cpu = cpu;
> >>>> +        return ch;
> >>>> +    }
> >>>> +
> >>>> +    /* Then look for an unused channel. */
> >>>>      next = arch_fetch_and_add(&next_channel, 1) % num_hpets_used;
> >>>>  
> >>>> -    /* try unused channel first */
> >>>>      for ( i = next; i < next + num_hpets_used; i++ )
> >>>>      {
> >>>>          ch = &hpet_events[i % num_hpets_used];
> >>>> @@ -479,6 +493,8 @@ static void set_channel_irq_affinity(str
> >>>>  {
> >>>>      struct irq_desc *desc = irq_to_desc(ch->msi.irq);
> >>>>  
> >>>> +    per_cpu(lru_channel, ch->cpu) = ch;
> >>>> +
> >>>>      ASSERT(!local_irq_is_enabled());
> >>>>      spin_lock(&desc->lock);
> >>>>      hpet_msi_mask(desc);
> >>>
> >>> Maybe I'm missing the point here, but you are resetting the MSI
> >>> affinity anyway here, so there isn't much point in attempting to
> >>> re-use the same channel when Xen still unconditionally goes through the
> >>> process of setting the affinity anyway?
> >>
> >> While still using normal IRQs, there's still a benefit: We can re-use the
> >> same vector (as staying on the same CPU), and hence we save an IRQ
> >> migration (being the main source of nested IRQs according to my
> >> observations).
> > 
> > Hm, I see.  You short-circuit all the logic in _assign_irq_vector().
> > 
> >> We could actually do even better, by avoiding the mask/unmask pair there,
> >> which would avoid triggering the "immediate" IRQ that I (for now) see as
> >> the only explanation of the large amount of "early" IRQs that I observe
> >> on (at least) Intel hardware. That would require doing the msg.dest32
> >> check earlier, but otherwise looks feasible. (Actually, the unmask would
> >> still be necessary, in case we're called with the channel already masked.)
> > 
> > Checking with .dest32 seems a bit crude, I would possibly prefer to
> > slightly modify hpet_attach_channel() to notice when ch->cpu == cpu
> > and avoid the call to set_channel_irq_affinity()?
> 
> That would be an always-false condition, wouldn't it? "attach" and "detach"
> are used strictly in pairs, and after "detach" ch->cpu != cpu.

I see, we set ch->cpu = -1 if the channel is really detached as
opposed to migrated to a different CPU.  I haven't looked, but I
assume leaving the previous CPU in ch->cpu would cause issues
elsewhere.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 01/10] x86/HPET: limit channel changes
  2025-10-16  7:31 ` [PATCH for-4.21 01/10] x86/HPET: limit channel changes Jan Beulich
  2025-10-16 10:24   ` Roger Pau Monné
@ 2025-10-17  9:23   ` Roger Pau Monné
  2025-10-17  9:55     ` Jan Beulich
  1 sibling, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-17  9:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Thu, Oct 16, 2025 at 09:31:21AM +0200, Jan Beulich wrote:
> Despite 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt
> processing") we can still observe nested invocations of
> hpet_interrupt_handler(). This is, afaict, a result of previously used
> channels retaining their IRQ affinity until some other CPU re-uses them.
> Such nesting is increasingly problematic with higher CPU counts, as both
> handle_hpet_broadcast() and cpumask_raise_softirq() have a cpumask_t local
> variable. IOW already a single level of nesting may require more stack
> space (2 times above 4k) than we have available (8k), when NR_CPUS=16383
> (the maximum value presently possible).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Whether this is still worthwhile with "x86/HPET: use single, global, low-
> priority vector for broadcast IRQ" isn't quite clear to me.

Seeing the rest of the series, I don't think this is necessary
anymore?  Also the comment you here is made stale by the patch that
uses a global vector.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 01/10] x86/HPET: limit channel changes
  2025-10-17  9:23   ` Roger Pau Monné
@ 2025-10-17  9:55     ` Jan Beulich
  0 siblings, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2025-10-17  9:55 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On 17.10.2025 11:23, Roger Pau Monné wrote:
> On Thu, Oct 16, 2025 at 09:31:21AM +0200, Jan Beulich wrote:
>> Despite 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt
>> processing") we can still observe nested invocations of
>> hpet_interrupt_handler(). This is, afaict, a result of previously used
>> channels retaining their IRQ affinity until some other CPU re-uses them.
>> Such nesting is increasingly problematic with higher CPU counts, as both
>> handle_hpet_broadcast() and cpumask_raise_softirq() have a cpumask_t local
>> variable. IOW already a single level of nesting may require more stack
>> space (2 times above 4k) than we have available (8k), when NR_CPUS=16383
>> (the maximum value presently possible).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> Whether this is still worthwhile with "x86/HPET: use single, global, low-
>> priority vector for broadcast IRQ" isn't quite clear to me.
> 
> Seeing the rest of the series, I don't think this is necessary
> anymore?  Also the comment you here is made stale by the patch that
> uses a global vector.

Right now I'm not quite sure, hence the remark and the patch being part of
the series. If I re-work patch 3 to avoid the mask/unmask upon affinity
changes, I think the one here can indeed be dropped.

Jan


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH for-4.21 02/10] x86/HPET: disable unused channels
  2025-10-16  7:30 [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Jan Beulich
  2025-10-16  7:31 ` [PATCH for-4.21 01/10] x86/HPET: limit channel changes Jan Beulich
@ 2025-10-16  7:31 ` Jan Beulich
  2025-10-16 11:42   ` Roger Pau Monné
  2025-10-16 16:31   ` Roger Pau Monné
  2025-10-16  7:32 ` [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ Jan Beulich
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 41+ messages in thread
From: Jan Beulich @ 2025-10-16  7:31 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko

Keeping channels enabled when they're unused is only causing problems:
Extra interrupts harm performance, and extra nested interrupts could even
have caused worse problems.

Note that no explicit "enable" is necessary - that's implicitly done by
set_channel_irq_affinity() once the channel goes into use again.

Along with disabling the counter, also "clear" the channel's "next event",
for it to be properly written by whatever the next user is going to want
(possibly avoiding too early an IRQ).

Further, along the same lines, don't enable channels early when starting
up an IRQ. This similarly should happen no earlier than from
set_channel_irq_affinity() (here: once a channel goes into use the very
first time). This eliminates a single instance of

(XEN) [VT-D]INTR-REMAP: Request device [0000:00:1f.0] fault index 0
(XEN) [VT-D]INTR-REMAP: reason 25 - Blocked a compatibility format interrupt request

during boot. (Why exactly there's only one instance, when we use multiple
counters and hence multiple IRQs, I can't tell. My understanding would be
that this was due to __hpet_setup_msi_irq() being called only after
request_irq() [and hence the .startup handler], yet that should have
affected all channels.)

Fixes: 3ba523ff957c ("CPUIDLE: enable MSI capable HPET for timer broadcast")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
A window still remains for IRQs to be caused by stale comparator values:
hpet_attach_channel() is called ahead of reprogram_hpet_evt_channel().
Should we also write the comparator to "far into the future"?

Furthermore this prolongues the window until "old" vectors may be released
again, as this way we potentially (and intentionally) delay the ocurrence
of the next IRQ for the channel in question. (This issue will disappear
once we switch to a fixed, global vector.)

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -262,10 +262,9 @@ static void cf_check hpet_msi_unmask(str
     ch->msi.msi_attrib.host_masked = 0;
 }
 
-static void cf_check hpet_msi_mask(struct irq_desc *desc)
+static void hpet_disable_channel(struct hpet_event_channel *ch)
 {
     u32 cfg;
-    struct hpet_event_channel *ch = desc->action->dev_id;
 
     cfg = hpet_read32(HPET_Tn_CFG(ch->idx));
     cfg &= ~HPET_TN_ENABLE;
@@ -273,6 +272,11 @@ static void cf_check hpet_msi_mask(struc
     ch->msi.msi_attrib.host_masked = 1;
 }
 
+static void cf_check hpet_msi_mask(struct irq_desc *desc)
+{
+    hpet_disable_channel(desc->action->dev_id);
+}
+
 static int hpet_msi_write(struct hpet_event_channel *ch, struct msi_msg *msg)
 {
     ch->msi.msg = *msg;
@@ -295,12 +299,6 @@ static int hpet_msi_write(struct hpet_ev
     return 0;
 }
 
-static unsigned int cf_check hpet_msi_startup(struct irq_desc *desc)
-{
-    hpet_msi_unmask(desc);
-    return 0;
-}
-
 #define hpet_msi_shutdown hpet_msi_mask
 
 static void cf_check hpet_msi_set_affinity(
@@ -326,7 +324,7 @@ static void cf_check hpet_msi_set_affini
  */
 static hw_irq_controller hpet_msi_type = {
     .typename   = "HPET-MSI",
-    .startup    = hpet_msi_startup,
+    .startup    = irq_startup_none,
     .shutdown   = hpet_msi_shutdown,
     .enable	    = hpet_msi_unmask,
     .disable    = hpet_msi_mask,
@@ -542,6 +540,8 @@ static void hpet_detach_channel(unsigned
         spin_unlock_irq(&ch->lock);
     else if ( (next = cpumask_first(ch->cpumask)) >= nr_cpu_ids )
     {
+        hpet_disable_channel(ch);
+        ch->next_event = STIME_MAX;
         ch->cpu = -1;
         clear_bit(HPET_EVT_USED_BIT, &ch->flags);
         spin_unlock_irq(&ch->lock);



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 02/10] x86/HPET: disable unused channels
  2025-10-16  7:31 ` [PATCH for-4.21 02/10] x86/HPET: disable unused channels Jan Beulich
@ 2025-10-16 11:42   ` Roger Pau Monné
  2025-10-16 11:57     ` Jan Beulich
  2025-10-16 16:31   ` Roger Pau Monné
  1 sibling, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-16 11:42 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Thu, Oct 16, 2025 at 09:31:42AM +0200, Jan Beulich wrote:
> Keeping channels enabled when they're unused is only causing problems:
> Extra interrupts harm performance, and extra nested interrupts could even
> have caused worse problems.
> 
> Note that no explicit "enable" is necessary - that's implicitly done by
> set_channel_irq_affinity() once the channel goes into use again.
> 
> Along with disabling the counter, also "clear" the channel's "next event",
> for it to be properly written by whatever the next user is going to want
> (possibly avoiding too early an IRQ).
> 
> Further, along the same lines, don't enable channels early when starting
> up an IRQ. This similarly should happen no earlier than from
> set_channel_irq_affinity() (here: once a channel goes into use the very
> first time). This eliminates a single instance of
> 
> (XEN) [VT-D]INTR-REMAP: Request device [0000:00:1f.0] fault index 0
> (XEN) [VT-D]INTR-REMAP: reason 25 - Blocked a compatibility format interrupt request
> 
> during boot. (Why exactly there's only one instance, when we use multiple
> counters and hence multiple IRQs, I can't tell. My understanding would be
> that this was due to __hpet_setup_msi_irq() being called only after
> request_irq() [and hence the .startup handler], yet that should have
> affected all channels.)
> 
> Fixes: 3ba523ff957c ("CPUIDLE: enable MSI capable HPET for timer broadcast")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> A window still remains for IRQs to be caused by stale comparator values:
> hpet_attach_channel() is called ahead of reprogram_hpet_evt_channel().
> Should we also write the comparator to "far into the future"?

It might be helpful to reprogram the comparator as far ahead as
possible in hpet_attach_channel() ahead of enabling it, or
alternatively in hpet_detach_channel().

> Furthermore this prolongues the window until "old" vectors may be released
> again, as this way we potentially (and intentionally) delay the ocurrence
> of the next IRQ for the channel in question. (This issue will disappear
> once we switch to a fixed, global vector.)
> 
> --- a/xen/arch/x86/hpet.c
> +++ b/xen/arch/x86/hpet.c
> @@ -262,10 +262,9 @@ static void cf_check hpet_msi_unmask(str
>      ch->msi.msi_attrib.host_masked = 0;
>  }
>  
> -static void cf_check hpet_msi_mask(struct irq_desc *desc)
> +static void hpet_disable_channel(struct hpet_event_channel *ch)
>  {
>      u32 cfg;
> -    struct hpet_event_channel *ch = desc->action->dev_id;
>  
>      cfg = hpet_read32(HPET_Tn_CFG(ch->idx));
>      cfg &= ~HPET_TN_ENABLE;
> @@ -273,6 +272,11 @@ static void cf_check hpet_msi_mask(struc
>      ch->msi.msi_attrib.host_masked = 1;
>  }
>  
> +static void cf_check hpet_msi_mask(struct irq_desc *desc)
> +{
> +    hpet_disable_channel(desc->action->dev_id);
> +}
> +
>  static int hpet_msi_write(struct hpet_event_channel *ch, struct msi_msg *msg)
>  {
>      ch->msi.msg = *msg;
> @@ -295,12 +299,6 @@ static int hpet_msi_write(struct hpet_ev
>      return 0;
>  }
>  
> -static unsigned int cf_check hpet_msi_startup(struct irq_desc *desc)
> -{
> -    hpet_msi_unmask(desc);
> -    return 0;
> -}
> -
>  #define hpet_msi_shutdown hpet_msi_mask
>  
>  static void cf_check hpet_msi_set_affinity(
> @@ -326,7 +324,7 @@ static void cf_check hpet_msi_set_affini
>   */
>  static hw_irq_controller hpet_msi_type = {
>      .typename   = "HPET-MSI",
> -    .startup    = hpet_msi_startup,
> +    .startup    = irq_startup_none,
>      .shutdown   = hpet_msi_shutdown,
>      .enable	    = hpet_msi_unmask,
>      .disable    = hpet_msi_mask,
> @@ -542,6 +540,8 @@ static void hpet_detach_channel(unsigned
>          spin_unlock_irq(&ch->lock);
>      else if ( (next = cpumask_first(ch->cpumask)) >= nr_cpu_ids )
>      {
> +        hpet_disable_channel(ch);
> +        ch->next_event = STIME_MAX;
>          ch->cpu = -1;
>          clear_bit(HPET_EVT_USED_BIT, &ch->flags);
>          spin_unlock_irq(&ch->lock);

I'm a bit confused with what the HPET code does here (don't know
enough about it, and there are no comments).  Why is the timer rotated
to a CPU in ch->cpumask once disabled, instead of just being plain
disabled?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 02/10] x86/HPET: disable unused channels
  2025-10-16 11:42   ` Roger Pau Monné
@ 2025-10-16 11:57     ` Jan Beulich
  2025-10-16 15:34       ` Roger Pau Monné
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2025-10-16 11:57 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On 16.10.2025 13:42, Roger Pau Monné wrote:
> On Thu, Oct 16, 2025 at 09:31:42AM +0200, Jan Beulich wrote:
>> Keeping channels enabled when they're unused is only causing problems:
>> Extra interrupts harm performance, and extra nested interrupts could even
>> have caused worse problems.
>>
>> Note that no explicit "enable" is necessary - that's implicitly done by
>> set_channel_irq_affinity() once the channel goes into use again.
>>
>> Along with disabling the counter, also "clear" the channel's "next event",
>> for it to be properly written by whatever the next user is going to want
>> (possibly avoiding too early an IRQ).
>>
>> Further, along the same lines, don't enable channels early when starting
>> up an IRQ. This similarly should happen no earlier than from
>> set_channel_irq_affinity() (here: once a channel goes into use the very
>> first time). This eliminates a single instance of
>>
>> (XEN) [VT-D]INTR-REMAP: Request device [0000:00:1f.0] fault index 0
>> (XEN) [VT-D]INTR-REMAP: reason 25 - Blocked a compatibility format interrupt request
>>
>> during boot. (Why exactly there's only one instance, when we use multiple
>> counters and hence multiple IRQs, I can't tell. My understanding would be
>> that this was due to __hpet_setup_msi_irq() being called only after
>> request_irq() [and hence the .startup handler], yet that should have
>> affected all channels.)
>>
>> Fixes: 3ba523ff957c ("CPUIDLE: enable MSI capable HPET for timer broadcast")
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> A window still remains for IRQs to be caused by stale comparator values:
>> hpet_attach_channel() is called ahead of reprogram_hpet_evt_channel().
>> Should we also write the comparator to "far into the future"?
> 
> It might be helpful to reprogram the comparator as far ahead as
> possible in hpet_attach_channel() ahead of enabling it, or
> alternatively in hpet_detach_channel().

The downside is yet another (slow) MMIO access. Hence why I didn't make
such a change right away. Plus I wasn't quite sure about the locking there:
Imo if we did so, it would be better if the lock wasn't dropped
intermediately.

>> @@ -542,6 +540,8 @@ static void hpet_detach_channel(unsigned
>>          spin_unlock_irq(&ch->lock);
>>      else if ( (next = cpumask_first(ch->cpumask)) >= nr_cpu_ids )
>>      {
>> +        hpet_disable_channel(ch);
>> +        ch->next_event = STIME_MAX;
>>          ch->cpu = -1;
>>          clear_bit(HPET_EVT_USED_BIT, &ch->flags);
>>          spin_unlock_irq(&ch->lock);
> 
> I'm a bit confused with what the HPET code does here (don't know
> enough about it, and there are no comments).  Why is the timer rotated
> to a CPU in ch->cpumask once disabled, instead of just being plain
> disabled?

Because it will still be needed by the other CPUs that the channel is
shared with.

Jan


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 02/10] x86/HPET: disable unused channels
  2025-10-16 11:57     ` Jan Beulich
@ 2025-10-16 15:34       ` Roger Pau Monné
  2025-10-16 15:55         ` Jan Beulich
  0 siblings, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-16 15:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Thu, Oct 16, 2025 at 01:57:41PM +0200, Jan Beulich wrote:
> On 16.10.2025 13:42, Roger Pau Monné wrote:
> > On Thu, Oct 16, 2025 at 09:31:42AM +0200, Jan Beulich wrote:
> >> Keeping channels enabled when they're unused is only causing problems:
> >> Extra interrupts harm performance, and extra nested interrupts could even
> >> have caused worse problems.
> >>
> >> Note that no explicit "enable" is necessary - that's implicitly done by
> >> set_channel_irq_affinity() once the channel goes into use again.
> >>
> >> Along with disabling the counter, also "clear" the channel's "next event",
> >> for it to be properly written by whatever the next user is going to want
> >> (possibly avoiding too early an IRQ).
> >>
> >> Further, along the same lines, don't enable channels early when starting
> >> up an IRQ. This similarly should happen no earlier than from
> >> set_channel_irq_affinity() (here: once a channel goes into use the very
> >> first time). This eliminates a single instance of
> >>
> >> (XEN) [VT-D]INTR-REMAP: Request device [0000:00:1f.0] fault index 0
> >> (XEN) [VT-D]INTR-REMAP: reason 25 - Blocked a compatibility format interrupt request
> >>
> >> during boot. (Why exactly there's only one instance, when we use multiple
> >> counters and hence multiple IRQs, I can't tell. My understanding would be
> >> that this was due to __hpet_setup_msi_irq() being called only after
> >> request_irq() [and hence the .startup handler], yet that should have
> >> affected all channels.)
> >>
> >> Fixes: 3ba523ff957c ("CPUIDLE: enable MSI capable HPET for timer broadcast")
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> A window still remains for IRQs to be caused by stale comparator values:
> >> hpet_attach_channel() is called ahead of reprogram_hpet_evt_channel().
> >> Should we also write the comparator to "far into the future"?
> > 
> > It might be helpful to reprogram the comparator as far ahead as
> > possible in hpet_attach_channel() ahead of enabling it, or
> > alternatively in hpet_detach_channel().
> 
> The downside is yet another (slow) MMIO access. Hence why I didn't make
> such a change right away. Plus I wasn't quite sure about the locking there:
> Imo if we did so, it would be better if the lock wasn't dropped
> intermediately.
> 
> >> @@ -542,6 +540,8 @@ static void hpet_detach_channel(unsigned
> >>          spin_unlock_irq(&ch->lock);
> >>      else if ( (next = cpumask_first(ch->cpumask)) >= nr_cpu_ids )
> >>      {
> >> +        hpet_disable_channel(ch);
> >> +        ch->next_event = STIME_MAX;
> >>          ch->cpu = -1;
> >>          clear_bit(HPET_EVT_USED_BIT, &ch->flags);
> >>          spin_unlock_irq(&ch->lock);
> > 
> > I'm a bit confused with what the HPET code does here (don't know
> > enough about it, and there are no comments).  Why is the timer rotated
> > to a CPU in ch->cpumask once disabled, instead of just being plain
> > disabled?
> 
> Because it will still be needed by the other CPUs that the channel is
> shared with.

Yeah, missed that part, the channel is migrated to a different CPU.  I
wonder however: since an active channel can be migrated around between
CPUs, isn't there a risk of the timer firing just in the middle of
migration (when interrupt generation is disabled), and hence Xen
possibly missing a deadline?

In hpet_broadcast_exit() we need to check whether the timer has
expired after the migration, and manually trigger a broadcast if
needed.  This also risks doing to broadcasts also back-to-back, but
it's the only option I see to avoid missing a deadline.

Maybe there's something I'm missing, this is all fairly complex.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 02/10] x86/HPET: disable unused channels
  2025-10-16 15:34       ` Roger Pau Monné
@ 2025-10-16 15:55         ` Jan Beulich
  2025-10-16 16:28           ` Roger Pau Monné
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2025-10-16 15:55 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On 16.10.2025 17:34, Roger Pau Monné wrote:
> On Thu, Oct 16, 2025 at 01:57:41PM +0200, Jan Beulich wrote:
>> On 16.10.2025 13:42, Roger Pau Monné wrote:
>>> On Thu, Oct 16, 2025 at 09:31:42AM +0200, Jan Beulich wrote:
>>>> Keeping channels enabled when they're unused is only causing problems:
>>>> Extra interrupts harm performance, and extra nested interrupts could even
>>>> have caused worse problems.
>>>>
>>>> Note that no explicit "enable" is necessary - that's implicitly done by
>>>> set_channel_irq_affinity() once the channel goes into use again.
>>>>
>>>> Along with disabling the counter, also "clear" the channel's "next event",
>>>> for it to be properly written by whatever the next user is going to want
>>>> (possibly avoiding too early an IRQ).
>>>>
>>>> Further, along the same lines, don't enable channels early when starting
>>>> up an IRQ. This similarly should happen no earlier than from
>>>> set_channel_irq_affinity() (here: once a channel goes into use the very
>>>> first time). This eliminates a single instance of
>>>>
>>>> (XEN) [VT-D]INTR-REMAP: Request device [0000:00:1f.0] fault index 0
>>>> (XEN) [VT-D]INTR-REMAP: reason 25 - Blocked a compatibility format interrupt request
>>>>
>>>> during boot. (Why exactly there's only one instance, when we use multiple
>>>> counters and hence multiple IRQs, I can't tell. My understanding would be
>>>> that this was due to __hpet_setup_msi_irq() being called only after
>>>> request_irq() [and hence the .startup handler], yet that should have
>>>> affected all channels.)
>>>>
>>>> Fixes: 3ba523ff957c ("CPUIDLE: enable MSI capable HPET for timer broadcast")
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> A window still remains for IRQs to be caused by stale comparator values:
>>>> hpet_attach_channel() is called ahead of reprogram_hpet_evt_channel().
>>>> Should we also write the comparator to "far into the future"?
>>>
>>> It might be helpful to reprogram the comparator as far ahead as
>>> possible in hpet_attach_channel() ahead of enabling it, or
>>> alternatively in hpet_detach_channel().
>>
>> The downside is yet another (slow) MMIO access. Hence why I didn't make
>> such a change right away. Plus I wasn't quite sure about the locking there:
>> Imo if we did so, it would be better if the lock wasn't dropped
>> intermediately.
>>
>>>> @@ -542,6 +540,8 @@ static void hpet_detach_channel(unsigned
>>>>          spin_unlock_irq(&ch->lock);
>>>>      else if ( (next = cpumask_first(ch->cpumask)) >= nr_cpu_ids )
>>>>      {
>>>> +        hpet_disable_channel(ch);
>>>> +        ch->next_event = STIME_MAX;
>>>>          ch->cpu = -1;
>>>>          clear_bit(HPET_EVT_USED_BIT, &ch->flags);
>>>>          spin_unlock_irq(&ch->lock);
>>>
>>> I'm a bit confused with what the HPET code does here (don't know
>>> enough about it, and there are no comments).  Why is the timer rotated
>>> to a CPU in ch->cpumask once disabled, instead of just being plain
>>> disabled?
>>
>> Because it will still be needed by the other CPUs that the channel is
>> shared with.
> 
> Yeah, missed that part, the channel is migrated to a different CPU.  I
> wonder however: since an active channel can be migrated around between
> CPUs, isn't there a risk of the timer firing just in the middle of
> migration (when interrupt generation is disabled), and hence Xen
> possibly missing a deadline?
> 
> In hpet_broadcast_exit() we need to check whether the timer has
> expired after the migration, and manually trigger a broadcast if
> needed.  This also risks doing to broadcasts also back-to-back, but
> it's the only option I see to avoid missing a deadline.
> 
> Maybe there's something I'm missing, this is all fairly complex.

set_channel_irq_affinity() invokes handle_hpet_broadcast() to cover that
case.

Jan


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 02/10] x86/HPET: disable unused channels
  2025-10-16 15:55         ` Jan Beulich
@ 2025-10-16 16:28           ` Roger Pau Monné
  0 siblings, 0 replies; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-16 16:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Thu, Oct 16, 2025 at 05:55:30PM +0200, Jan Beulich wrote:
> On 16.10.2025 17:34, Roger Pau Monné wrote:
> > On Thu, Oct 16, 2025 at 01:57:41PM +0200, Jan Beulich wrote:
> >> On 16.10.2025 13:42, Roger Pau Monné wrote:
> >>> On Thu, Oct 16, 2025 at 09:31:42AM +0200, Jan Beulich wrote:
> >>>> Keeping channels enabled when they're unused is only causing problems:
> >>>> Extra interrupts harm performance, and extra nested interrupts could even
> >>>> have caused worse problems.
> >>>>
> >>>> Note that no explicit "enable" is necessary - that's implicitly done by
> >>>> set_channel_irq_affinity() once the channel goes into use again.
> >>>>
> >>>> Along with disabling the counter, also "clear" the channel's "next event",
> >>>> for it to be properly written by whatever the next user is going to want
> >>>> (possibly avoiding too early an IRQ).
> >>>>
> >>>> Further, along the same lines, don't enable channels early when starting
> >>>> up an IRQ. This similarly should happen no earlier than from
> >>>> set_channel_irq_affinity() (here: once a channel goes into use the very
> >>>> first time). This eliminates a single instance of
> >>>>
> >>>> (XEN) [VT-D]INTR-REMAP: Request device [0000:00:1f.0] fault index 0
> >>>> (XEN) [VT-D]INTR-REMAP: reason 25 - Blocked a compatibility format interrupt request
> >>>>
> >>>> during boot. (Why exactly there's only one instance, when we use multiple
> >>>> counters and hence multiple IRQs, I can't tell. My understanding would be
> >>>> that this was due to __hpet_setup_msi_irq() being called only after
> >>>> request_irq() [and hence the .startup handler], yet that should have
> >>>> affected all channels.)
> >>>>
> >>>> Fixes: 3ba523ff957c ("CPUIDLE: enable MSI capable HPET for timer broadcast")
> >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>>> ---
> >>>> A window still remains for IRQs to be caused by stale comparator values:
> >>>> hpet_attach_channel() is called ahead of reprogram_hpet_evt_channel().
> >>>> Should we also write the comparator to "far into the future"?
> >>>
> >>> It might be helpful to reprogram the comparator as far ahead as
> >>> possible in hpet_attach_channel() ahead of enabling it, or
> >>> alternatively in hpet_detach_channel().
> >>
> >> The downside is yet another (slow) MMIO access. Hence why I didn't make
> >> such a change right away. Plus I wasn't quite sure about the locking there:
> >> Imo if we did so, it would be better if the lock wasn't dropped
> >> intermediately.
> >>
> >>>> @@ -542,6 +540,8 @@ static void hpet_detach_channel(unsigned
> >>>>          spin_unlock_irq(&ch->lock);
> >>>>      else if ( (next = cpumask_first(ch->cpumask)) >= nr_cpu_ids )
> >>>>      {
> >>>> +        hpet_disable_channel(ch);
> >>>> +        ch->next_event = STIME_MAX;
> >>>>          ch->cpu = -1;
> >>>>          clear_bit(HPET_EVT_USED_BIT, &ch->flags);
> >>>>          spin_unlock_irq(&ch->lock);
> >>>
> >>> I'm a bit confused with what the HPET code does here (don't know
> >>> enough about it, and there are no comments).  Why is the timer rotated
> >>> to a CPU in ch->cpumask once disabled, instead of just being plain
> >>> disabled?
> >>
> >> Because it will still be needed by the other CPUs that the channel is
> >> shared with.
> > 
> > Yeah, missed that part, the channel is migrated to a different CPU.  I
> > wonder however: since an active channel can be migrated around between
> > CPUs, isn't there a risk of the timer firing just in the middle of
> > migration (when interrupt generation is disabled), and hence Xen
> > possibly missing a deadline?
> > 
> > In hpet_broadcast_exit() we need to check whether the timer has
> > expired after the migration, and manually trigger a broadcast if
> > needed.  This also risks doing to broadcasts also back-to-back, but
> > it's the only option I see to avoid missing a deadline.
> > 
> > Maybe there's something I'm missing, this is all fairly complex.
> 
> set_channel_irq_affinity() invokes handle_hpet_broadcast() to cover that
> case.

Oh, indeed, sorry for the fuzz then.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 02/10] x86/HPET: disable unused channels
  2025-10-16  7:31 ` [PATCH for-4.21 02/10] x86/HPET: disable unused channels Jan Beulich
  2025-10-16 11:42   ` Roger Pau Monné
@ 2025-10-16 16:31   ` Roger Pau Monné
  2025-10-17  6:08     ` Jan Beulich
  1 sibling, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-16 16:31 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Thu, Oct 16, 2025 at 09:31:42AM +0200, Jan Beulich wrote:
> Keeping channels enabled when they're unused is only causing problems:
> Extra interrupts harm performance, and extra nested interrupts could even
> have caused worse problems.
> 
> Note that no explicit "enable" is necessary - that's implicitly done by
> set_channel_irq_affinity() once the channel goes into use again.
> 
> Along with disabling the counter, also "clear" the channel's "next event",
> for it to be properly written by whatever the next user is going to want
> (possibly avoiding too early an IRQ).
> 
> Further, along the same lines, don't enable channels early when starting
> up an IRQ. This similarly should happen no earlier than from
> set_channel_irq_affinity() (here: once a channel goes into use the very
> first time). This eliminates a single instance of
> 
> (XEN) [VT-D]INTR-REMAP: Request device [0000:00:1f.0] fault index 0
> (XEN) [VT-D]INTR-REMAP: reason 25 - Blocked a compatibility format interrupt request
> 
> during boot. (Why exactly there's only one instance, when we use multiple
> counters and hence multiple IRQs, I can't tell. My understanding would be
> that this was due to __hpet_setup_msi_irq() being called only after
> request_irq() [and hence the .startup handler], yet that should have
> affected all channels.)
> 
> Fixes: 3ba523ff957c ("CPUIDLE: enable MSI capable HPET for timer broadcast")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> A window still remains for IRQs to be caused by stale comparator values:
> hpet_attach_channel() is called ahead of reprogram_hpet_evt_channel().
> Should we also write the comparator to "far into the future"?

I think we can possibly live with this to avoid doing an extra MMIO
access?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 02/10] x86/HPET: disable unused channels
  2025-10-16 16:31   ` Roger Pau Monné
@ 2025-10-17  6:08     ` Jan Beulich
  2025-10-17  6:10       ` Jan Beulich
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2025-10-17  6:08 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On 16.10.2025 18:31, Roger Pau Monné wrote:
> On Thu, Oct 16, 2025 at 09:31:42AM +0200, Jan Beulich wrote:
>> Keeping channels enabled when they're unused is only causing problems:
>> Extra interrupts harm performance, and extra nested interrupts could even
>> have caused worse problems.
>>
>> Note that no explicit "enable" is necessary - that's implicitly done by
>> set_channel_irq_affinity() once the channel goes into use again.
>>
>> Along with disabling the counter, also "clear" the channel's "next event",
>> for it to be properly written by whatever the next user is going to want
>> (possibly avoiding too early an IRQ).
>>
>> Further, along the same lines, don't enable channels early when starting
>> up an IRQ. This similarly should happen no earlier than from
>> set_channel_irq_affinity() (here: once a channel goes into use the very
>> first time). This eliminates a single instance of
>>
>> (XEN) [VT-D]INTR-REMAP: Request device [0000:00:1f.0] fault index 0
>> (XEN) [VT-D]INTR-REMAP: reason 25 - Blocked a compatibility format interrupt request
>>
>> during boot. (Why exactly there's only one instance, when we use multiple
>> counters and hence multiple IRQs, I can't tell. My understanding would be
>> that this was due to __hpet_setup_msi_irq() being called only after
>> request_irq() [and hence the .startup handler], yet that should have
>> affected all channels.)
>>
>> Fixes: 3ba523ff957c ("CPUIDLE: enable MSI capable HPET for timer broadcast")
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, but I realized I want to make one further change here: I want to clear
the pointer when handing off the channel to another CPU while detaching. That
is the one point where we clearly know the affinity moves off of the CPU that
is recording the channel as least recently used one. Are you happy for me to
keep the R-b with that extra change?

>> ---
>> A window still remains for IRQs to be caused by stale comparator values:
>> hpet_attach_channel() is called ahead of reprogram_hpet_evt_channel().
>> Should we also write the comparator to "far into the future"?
> 
> I think we can possibly live with this to avoid doing an extra MMIO
> access?

I think we can; which one's more beneficial I simply don't know.

Jan


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 02/10] x86/HPET: disable unused channels
  2025-10-17  6:08     ` Jan Beulich
@ 2025-10-17  6:10       ` Jan Beulich
  0 siblings, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2025-10-17  6:10 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On 17.10.2025 08:08, Jan Beulich wrote:
> On 16.10.2025 18:31, Roger Pau Monné wrote:
>> On Thu, Oct 16, 2025 at 09:31:42AM +0200, Jan Beulich wrote:
>>> Keeping channels enabled when they're unused is only causing problems:
>>> Extra interrupts harm performance, and extra nested interrupts could even
>>> have caused worse problems.
>>>
>>> Note that no explicit "enable" is necessary - that's implicitly done by
>>> set_channel_irq_affinity() once the channel goes into use again.
>>>
>>> Along with disabling the counter, also "clear" the channel's "next event",
>>> for it to be properly written by whatever the next user is going to want
>>> (possibly avoiding too early an IRQ).
>>>
>>> Further, along the same lines, don't enable channels early when starting
>>> up an IRQ. This similarly should happen no earlier than from
>>> set_channel_irq_affinity() (here: once a channel goes into use the very
>>> first time). This eliminates a single instance of
>>>
>>> (XEN) [VT-D]INTR-REMAP: Request device [0000:00:1f.0] fault index 0
>>> (XEN) [VT-D]INTR-REMAP: reason 25 - Blocked a compatibility format interrupt request
>>>
>>> during boot. (Why exactly there's only one instance, when we use multiple
>>> counters and hence multiple IRQs, I can't tell. My understanding would be
>>> that this was due to __hpet_setup_msi_irq() being called only after
>>> request_irq() [and hence the .startup handler], yet that should have
>>> affected all channels.)
>>>
>>> Fixes: 3ba523ff957c ("CPUIDLE: enable MSI capable HPET for timer broadcast")
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks, but I realized I want to make one further change here: I want to clear
> the pointer when handing off the channel to another CPU while detaching. That
> is the one point where we clearly know the affinity moves off of the CPU that
> is recording the channel as least recently used one. Are you happy for me to
> keep the R-b with that extra change?

Oh, that was wrong here. That's a change I'm meaning to make to patch 1.

Jan


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ
  2025-10-16  7:30 [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Jan Beulich
  2025-10-16  7:31 ` [PATCH for-4.21 01/10] x86/HPET: limit channel changes Jan Beulich
  2025-10-16  7:31 ` [PATCH for-4.21 02/10] x86/HPET: disable unused channels Jan Beulich
@ 2025-10-16  7:32 ` Jan Beulich
  2025-10-16 16:27   ` Roger Pau Monné
  2025-10-16 17:01   ` Andrew Cooper
  2025-10-16  7:32 ` [PATCH for-4.21 04/10] x86/HPET: ignore "stale" IRQs Jan Beulich
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 41+ messages in thread
From: Jan Beulich @ 2025-10-16  7:32 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko

Using dynamically allocated / maintained vectors has several downsides:
- possible nesting of IRQs due to the effects of IRQ migration,
- reduction of vectors available for devices,
- IRQs not moving as intended if there's shortage of vectors,
- higher runtime overhead.

As the vector also doesn't need to be of any priority (first and foremost
it really shouldn't be of higher or same priority as the timer IRQ, as
that raises TIMER_SOFTIRQ anyway), avoid any "ordinary" vectors altogther
and use a vector from the 0x10...0x1f exception vector space. Exception vs
interrupt can easily be distinguished by checking for the presence of an
error code.

Fixes: 996576b965cc ("xen: allow up to 16383 cpus")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
This is an alternative proposal to
https://lists.xen.org/archives/html/xen-devel/2014-03/msg00399.html.

The Fixes: tag indicates where the problem got signficantly worse; in
principle it was there already before (crashing at perhaps 6 or 7 levels
of nested IRQs).

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -9,17 +9,19 @@
 #include <xen/timer.h>
 #include <xen/smp.h>
 #include <xen/softirq.h>
+#include <xen/cpuidle.h>
 #include <xen/irq.h>
 #include <xen/numa.h>
 #include <xen/param.h>
 #include <xen/sched.h>
 
 #include <asm/apic.h>
-#include <asm/fixmap.h>
 #include <asm/div64.h>
+#include <asm/fixmap.h>
+#include <asm/genapic.h>
 #include <asm/hpet.h>
+#include <asm/irq-vectors.h>
 #include <asm/msi.h>
-#include <xen/cpuidle.h>
 
 #define MAX_DELTA_NS MILLISECS(10*1000)
 #define MIN_DELTA_NS MICROSECS(20)
@@ -307,15 +309,13 @@ static void cf_check hpet_msi_set_affini
     struct hpet_event_channel *ch = desc->action->dev_id;
     struct msi_msg msg = ch->msi.msg;
 
-    msg.dest32 = set_desc_affinity(desc, mask);
-    if ( msg.dest32 == BAD_APICID )
-        return;
+    /* This really is only for dump_irqs(). */
+    cpumask_copy(desc->arch.cpu_mask, mask);
 
-    msg.data &= ~MSI_DATA_VECTOR_MASK;
-    msg.data |= MSI_DATA_VECTOR(desc->arch.vector);
+    msg.dest32 = cpu_mask_to_apicid(mask);
     msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
     msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32);
-    if ( msg.data != ch->msi.msg.data || msg.dest32 != ch->msi.msg.dest32 )
+    if ( msg.dest32 != ch->msi.msg.dest32 )
         hpet_msi_write(ch, &msg);
 }
 
@@ -328,7 +328,7 @@ static hw_irq_controller hpet_msi_type =
     .shutdown   = hpet_msi_shutdown,
     .enable	    = hpet_msi_unmask,
     .disable    = hpet_msi_mask,
-    .ack        = ack_nonmaskable_msi_irq,
+    .ack        = irq_actor_none,
     .end        = end_nonmaskable_irq,
     .set_affinity   = hpet_msi_set_affinity,
 };
@@ -347,6 +347,12 @@ static int __init hpet_setup_msi_irq(str
     u32 cfg = hpet_read32(HPET_Tn_CFG(ch->idx));
     irq_desc_t *desc = irq_to_desc(ch->msi.irq);
 
+    clear_irq_vector(ch->msi.irq);
+    ret = bind_irq_vector(ch->msi.irq, HPET_BROADCAST_VECTOR, &cpu_online_map);
+    if ( ret )
+        return ret;
+    cpumask_setall(desc->affinity);
+
     if ( iommu_intremap != iommu_intremap_off )
     {
         ch->msi.hpet_id = hpet_blockid;
@@ -457,7 +463,7 @@ static struct hpet_event_channel *hpet_g
     /*
      * Try the least recently used channel first.  It may still have its IRQ's
      * affinity set to the desired CPU.  This way we also limit having multiple
-     * of our IRQs raised on the same CPU, in possibly a nested manner.
+     * of our IRQs raised on the same CPU.
      */
     ch = per_cpu(lru_channel, cpu);
     if ( ch && !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) )
@@ -497,6 +503,7 @@ static void set_channel_irq_affinity(str
     spin_lock(&desc->lock);
     hpet_msi_mask(desc);
     hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
+    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
     hpet_msi_unmask(desc);
     spin_unlock(&desc->lock);
 
--- a/xen/arch/x86/include/asm/irq-vectors.h
+++ b/xen/arch/x86/include/asm/irq-vectors.h
@@ -18,6 +18,15 @@
 /* IRQ0 (timer) is statically allocated but must be high priority. */
 #define IRQ0_VECTOR             0xf0
 
+/*
+ * Low-priority (for now statically allocated) vectors, sharing entry
+ * points with exceptions in the 0x10 ... 0x1f range, as long as the
+ * respective exception has an error code.
+ */
+#define FIRST_LOPRIORITY_VECTOR 0x10
+#define HPET_BROADCAST_VECTOR   X86_EXC_AC
+#define LAST_LOPRIORITY_VECTOR  0x1f
+
 /* Legacy PIC uses vectors 0x20-0x2f. */
 #define FIRST_LEGACY_VECTOR     FIRST_DYNAMIC_VECTOR
 #define LAST_LEGACY_VECTOR      (FIRST_LEGACY_VECTOR + 0xf)
@@ -40,7 +49,7 @@
 /* There's no IRQ2 at the PIC. */
 #define IRQ_MOVE_CLEANUP_VECTOR (FIRST_LEGACY_VECTOR + 2)
 
-#define FIRST_IRQ_VECTOR        FIRST_DYNAMIC_VECTOR
+#define FIRST_IRQ_VECTOR        FIRST_LOPRIORITY_VECTOR
 #define LAST_IRQ_VECTOR         LAST_HIPRIORITY_VECTOR
 
 #endif /* _ASM_IRQ_VECTORS_H */
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -755,8 +755,9 @@ void setup_vector_irq(unsigned int cpu)
         if ( !irq_desc_initialized(desc) )
             continue;
         vector = irq_to_vector(irq);
-        if ( vector >= FIRST_HIPRIORITY_VECTOR &&
-             vector <= LAST_HIPRIORITY_VECTOR )
+        if ( vector <= (vector >= FIRST_HIPRIORITY_VECTOR
+                        ? LAST_HIPRIORITY_VECTOR
+                        : LAST_LOPRIORITY_VECTOR) )
             cpumask_set_cpu(cpu, desc->arch.cpu_mask);
         else if ( !cpumask_test_cpu(cpu, desc->arch.cpu_mask) )
             continue;
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -158,7 +158,7 @@ void msi_compose_msg(unsigned vector, co
 {
     memset(msg, 0, sizeof(*msg));
 
-    if ( vector < FIRST_DYNAMIC_VECTOR )
+    if ( vector < FIRST_LOPRIORITY_VECTOR )
         return;
 
     if ( cpu_mask )
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -1045,7 +1045,13 @@ END(entry_GP)
 
 FUNC(entry_AC)
         ENDBR64
+        /* #AC shares its entry point with the HPET broadcast interrupt. */
+        test  $8, %spl
+        jz    .Lac
+        push  $0
+.Lac:
         movb  $X86_EXC_AC, EFRAME_entry_vector(%rsp)
+        jnz   common_interrupt
         jmp   handle_exception
 END(entry_AC)
 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ
  2025-10-16  7:32 ` [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ Jan Beulich
@ 2025-10-16 16:27   ` Roger Pau Monné
  2025-10-17  7:15     ` Jan Beulich
  2025-10-16 17:01   ` Andrew Cooper
  1 sibling, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-16 16:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Thu, Oct 16, 2025 at 09:32:04AM +0200, Jan Beulich wrote:
> Using dynamically allocated / maintained vectors has several downsides:
> - possible nesting of IRQs due to the effects of IRQ migration,
> - reduction of vectors available for devices,
> - IRQs not moving as intended if there's shortage of vectors,
> - higher runtime overhead.
> 
> As the vector also doesn't need to be of any priority (first and foremost
> it really shouldn't be of higher or same priority as the timer IRQ, as
> that raises TIMER_SOFTIRQ anyway), avoid any "ordinary" vectors altogther
> and use a vector from the 0x10...0x1f exception vector space. Exception vs
> interrupt can easily be distinguished by checking for the presence of an
> error code.
> 
> Fixes: 996576b965cc ("xen: allow up to 16383 cpus")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> This is an alternative proposal to
> https://lists.xen.org/archives/html/xen-devel/2014-03/msg00399.html.
> 
> The Fixes: tag indicates where the problem got signficantly worse; in
> principle it was there already before (crashing at perhaps 6 or 7 levels
> of nested IRQs).
> 
> --- a/xen/arch/x86/hpet.c
> +++ b/xen/arch/x86/hpet.c
> @@ -9,17 +9,19 @@
>  #include <xen/timer.h>
>  #include <xen/smp.h>
>  #include <xen/softirq.h>
> +#include <xen/cpuidle.h>
>  #include <xen/irq.h>
>  #include <xen/numa.h>
>  #include <xen/param.h>
>  #include <xen/sched.h>
>  
>  #include <asm/apic.h>
> -#include <asm/fixmap.h>
>  #include <asm/div64.h>
> +#include <asm/fixmap.h>
> +#include <asm/genapic.h>
>  #include <asm/hpet.h>
> +#include <asm/irq-vectors.h>
>  #include <asm/msi.h>
> -#include <xen/cpuidle.h>
>  
>  #define MAX_DELTA_NS MILLISECS(10*1000)
>  #define MIN_DELTA_NS MICROSECS(20)
> @@ -307,15 +309,13 @@ static void cf_check hpet_msi_set_affini
>      struct hpet_event_channel *ch = desc->action->dev_id;
>      struct msi_msg msg = ch->msi.msg;
>  
> -    msg.dest32 = set_desc_affinity(desc, mask);
> -    if ( msg.dest32 == BAD_APICID )
> -        return;
> +    /* This really is only for dump_irqs(). */
> +    cpumask_copy(desc->arch.cpu_mask, mask);

If you no longer call set_desc_affinity(), could you adjust the second
parameter of hpet_msi_set_affinity() to be unsigned int cpu instead of
a cpumask?

And here just clear desc->arch.cpu_mask and set the passed CPU.

>  
> -    msg.data &= ~MSI_DATA_VECTOR_MASK;
> -    msg.data |= MSI_DATA_VECTOR(desc->arch.vector);
> +    msg.dest32 = cpu_mask_to_apicid(mask);

And here you can just use cpu_physical_id().

>      msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
>      msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32);
> -    if ( msg.data != ch->msi.msg.data || msg.dest32 != ch->msi.msg.dest32 )
> +    if ( msg.dest32 != ch->msi.msg.dest32 )
>          hpet_msi_write(ch, &msg);

A further note here, which ties to my comment on the previous patch
about loosing the interrupt during the masked window.  If the vector
is the same across all CPUs, we no longer need to update the MSI data
field, just the address one, which can be done atomically.  We also
have signaling from the IOMMU whether the MSI fields need writing.

We can avoid the masking, and the possible drop of interrupts.

>  }
>  
> @@ -328,7 +328,7 @@ static hw_irq_controller hpet_msi_type =
>      .shutdown   = hpet_msi_shutdown,
>      .enable	    = hpet_msi_unmask,
>      .disable    = hpet_msi_mask,
> -    .ack        = ack_nonmaskable_msi_irq,
> +    .ack        = irq_actor_none,
>      .end        = end_nonmaskable_irq,
>      .set_affinity   = hpet_msi_set_affinity,
>  };
> @@ -347,6 +347,12 @@ static int __init hpet_setup_msi_irq(str
>      u32 cfg = hpet_read32(HPET_Tn_CFG(ch->idx));
>      irq_desc_t *desc = irq_to_desc(ch->msi.irq);
>  
> +    clear_irq_vector(ch->msi.irq);
> +    ret = bind_irq_vector(ch->msi.irq, HPET_BROADCAST_VECTOR, &cpu_online_map);
> +    if ( ret )
> +        return ret;
> +    cpumask_setall(desc->affinity);
> +
>      if ( iommu_intremap != iommu_intremap_off )
>      {
>          ch->msi.hpet_id = hpet_blockid;
> @@ -457,7 +463,7 @@ static struct hpet_event_channel *hpet_g
>      /*
>       * Try the least recently used channel first.  It may still have its IRQ's
>       * affinity set to the desired CPU.  This way we also limit having multiple
> -     * of our IRQs raised on the same CPU, in possibly a nested manner.
> +     * of our IRQs raised on the same CPU.
>       */
>      ch = per_cpu(lru_channel, cpu);
>      if ( ch && !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) )
> @@ -497,6 +503,7 @@ static void set_channel_irq_affinity(str
>      spin_lock(&desc->lock);
>      hpet_msi_mask(desc);
>      hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
> +    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;

I would set the vector table ahead of setting the affinity, in case we
can drop the mask calls around this block of code.

I also wonder, do you really need the bind_irq_vector() if you
manually set the affinity afterwards, and the vector table plus
desc->arch.cpu_mask are also set here?

>      hpet_msi_unmask(desc);
>      spin_unlock(&desc->lock);
>  
> --- a/xen/arch/x86/include/asm/irq-vectors.h
> +++ b/xen/arch/x86/include/asm/irq-vectors.h
> @@ -18,6 +18,15 @@
>  /* IRQ0 (timer) is statically allocated but must be high priority. */
>  #define IRQ0_VECTOR             0xf0
>  
> +/*
> + * Low-priority (for now statically allocated) vectors, sharing entry
> + * points with exceptions in the 0x10 ... 0x1f range, as long as the
> + * respective exception has an error code.
> + */
> +#define FIRST_LOPRIORITY_VECTOR 0x10
> +#define HPET_BROADCAST_VECTOR   X86_EXC_AC
> +#define LAST_LOPRIORITY_VECTOR  0x1f

I wonder if it won't be clearer to simply reserve a vector if the HPET
is used, instead of hijacking the AC one.  It's one vector less, but
arguably now that we unconditionally use physical destination mode our
pool of vectors has expanded considerably.

> +
>  /* Legacy PIC uses vectors 0x20-0x2f. */
>  #define FIRST_LEGACY_VECTOR     FIRST_DYNAMIC_VECTOR
>  #define LAST_LEGACY_VECTOR      (FIRST_LEGACY_VECTOR + 0xf)
> @@ -40,7 +49,7 @@
>  /* There's no IRQ2 at the PIC. */
>  #define IRQ_MOVE_CLEANUP_VECTOR (FIRST_LEGACY_VECTOR + 2)
>  
> -#define FIRST_IRQ_VECTOR        FIRST_DYNAMIC_VECTOR
> +#define FIRST_IRQ_VECTOR        FIRST_LOPRIORITY_VECTOR
>  #define LAST_IRQ_VECTOR         LAST_HIPRIORITY_VECTOR
>  
>  #endif /* _ASM_IRQ_VECTORS_H */
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -755,8 +755,9 @@ void setup_vector_irq(unsigned int cpu)
>          if ( !irq_desc_initialized(desc) )
>              continue;
>          vector = irq_to_vector(irq);
> -        if ( vector >= FIRST_HIPRIORITY_VECTOR &&
> -             vector <= LAST_HIPRIORITY_VECTOR )
> +        if ( vector <= (vector >= FIRST_HIPRIORITY_VECTOR
> +                        ? LAST_HIPRIORITY_VECTOR
> +                        : LAST_LOPRIORITY_VECTOR) )
>              cpumask_set_cpu(cpu, desc->arch.cpu_mask);

I think this is wrong.  The low priority vector used by the HPET will
only target a single CPU at a time, and hence adding extra CPUs to
that mask as part of AP bringup is not correct.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ
  2025-10-16 16:27   ` Roger Pau Monné
@ 2025-10-17  7:15     ` Jan Beulich
  2025-10-17  8:20       ` Roger Pau Monné
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2025-10-17  7:15 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On 16.10.2025 18:27, Roger Pau Monné wrote:
> On Thu, Oct 16, 2025 at 09:32:04AM +0200, Jan Beulich wrote:
>> @@ -307,15 +309,13 @@ static void cf_check hpet_msi_set_affini
>>      struct hpet_event_channel *ch = desc->action->dev_id;
>>      struct msi_msg msg = ch->msi.msg;
>>  
>> -    msg.dest32 = set_desc_affinity(desc, mask);
>> -    if ( msg.dest32 == BAD_APICID )
>> -        return;
>> +    /* This really is only for dump_irqs(). */
>> +    cpumask_copy(desc->arch.cpu_mask, mask);
> 
> If you no longer call set_desc_affinity(), could you adjust the second
> parameter of hpet_msi_set_affinity() to be unsigned int cpu instead of
> a cpumask?

Looks like I could, yes. But then we need to split the function, as it's
also used as the .set_affinity hook.

> And here just clear desc->arch.cpu_mask and set the passed CPU.

Which would still better be a cpumask_copy(), just given cpumask_of(cpu)
as input.

>> -    msg.data &= ~MSI_DATA_VECTOR_MASK;
>> -    msg.data |= MSI_DATA_VECTOR(desc->arch.vector);
>> +    msg.dest32 = cpu_mask_to_apicid(mask);
> 
> And here you can just use cpu_physical_id().

Right. All of which (up to here; but see below) perhaps better a separate,
follow-on cleanup change.

>>      msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
>>      msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32);
>> -    if ( msg.data != ch->msi.msg.data || msg.dest32 != ch->msi.msg.dest32 )
>> +    if ( msg.dest32 != ch->msi.msg.dest32 )
>>          hpet_msi_write(ch, &msg);
> 
> A further note here, which ties to my comment on the previous patch
> about loosing the interrupt during the masked window.  If the vector
> is the same across all CPUs, we no longer need to update the MSI data
> field, just the address one, which can be done atomically.  We also
> have signaling from the IOMMU whether the MSI fields need writing.

Hmm, yes, we can leverage that, as long as we're willing to make assumptions
here about what exactly iommu_update_ire_from_msi() does: We'd then rely on
not only the original (untranslated) msg->data not changing, but also the
translated one. That looks to hold for both Intel and AMD, but it's still
something we want to be sure we actually want to make the code dependent
upon. (I'm intending to at least add an assertion to that effect.)

> We can avoid the masking, and the possible drop of interrupts.

Hmm, right. There's nothing wrong with the caller relying on the write
being atomic now. (Really, continuing to use hpet_msi_write() wouldn't
be a problem, as re-writing the low half of HPET_Tn_ROUTE() with the
same value is going to be benign. Unless of course that write was the
source of the extra IRQs I'm seeing.)

Taking together with what you said further up, having
set_channel_irq_affinity() no longer use hpet_msi_set_affinity() as it
is to ...

>> @@ -328,7 +328,7 @@ static hw_irq_controller hpet_msi_type =
>>      .shutdown   = hpet_msi_shutdown,
>>      .enable	    = hpet_msi_unmask,
>>      .disable    = hpet_msi_mask,
>> -    .ack        = ack_nonmaskable_msi_irq,
>> +    .ack        = irq_actor_none,
>>      .end        = end_nonmaskable_irq,
>>      .set_affinity   = hpet_msi_set_affinity,

... satisfy the use here would then probably be desirable right away.
The little bit that's left of hpet_msi_set_affinity() would then be
open-coded in set_channel_irq_affinity().

Getting rid of the masking would (hopefully) also get rid of the stray
IRQs that I'm observing, assuming my guessing towards the reason there
is correct.

>> @@ -497,6 +503,7 @@ static void set_channel_irq_affinity(str
>>      spin_lock(&desc->lock);
>>      hpet_msi_mask(desc);
>>      hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
>> +    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
> 
> I would set the vector table ahead of setting the affinity, in case we
> can drop the mask calls around this block of code.

Isn't there a problematic window either way round? I can make the change,
but I don't see that addressing anything. The new comparator value will
be written later anyway, and interrupts up to that point aren't of any
interest anyway. I.e. it doesn't matter which of the CPUs gets to handle
them.

> I also wonder, do you really need the bind_irq_vector() if you
> manually set the affinity afterwards, and the vector table plus
> desc->arch.cpu_mask are also set here?

At the very least I'd then also need to open-code the setting of
desc->arch.vector and desc->arch.used. Possibly also the setting of the
bit in desc->arch.used_vectors. And strictly speaking also the
trace_irq_mask() invocation.

>> --- a/xen/arch/x86/include/asm/irq-vectors.h
>> +++ b/xen/arch/x86/include/asm/irq-vectors.h
>> @@ -18,6 +18,15 @@
>>  /* IRQ0 (timer) is statically allocated but must be high priority. */
>>  #define IRQ0_VECTOR             0xf0
>>  
>> +/*
>> + * Low-priority (for now statically allocated) vectors, sharing entry
>> + * points with exceptions in the 0x10 ... 0x1f range, as long as the
>> + * respective exception has an error code.
>> + */
>> +#define FIRST_LOPRIORITY_VECTOR 0x10
>> +#define HPET_BROADCAST_VECTOR   X86_EXC_AC
>> +#define LAST_LOPRIORITY_VECTOR  0x1f
> 
> I wonder if it won't be clearer to simply reserve a vector if the HPET
> is used, instead of hijacking the AC one.  It's one vector less, but
> arguably now that we unconditionally use physical destination mode our
> pool of vectors has expanded considerably.

Well, I'd really like to avoid consuming an otherwise usable vector, if
at all possible (as per Andrew's FRED plans, that won't be possible
there anymore then).

>> --- a/xen/arch/x86/irq.c
>> +++ b/xen/arch/x86/irq.c
>> @@ -755,8 +755,9 @@ void setup_vector_irq(unsigned int cpu)
>>          if ( !irq_desc_initialized(desc) )
>>              continue;
>>          vector = irq_to_vector(irq);
>> -        if ( vector >= FIRST_HIPRIORITY_VECTOR &&
>> -             vector <= LAST_HIPRIORITY_VECTOR )
>> +        if ( vector <= (vector >= FIRST_HIPRIORITY_VECTOR
>> +                        ? LAST_HIPRIORITY_VECTOR
>> +                        : LAST_LOPRIORITY_VECTOR) )
>>              cpumask_set_cpu(cpu, desc->arch.cpu_mask);
> 
> I think this is wrong.  The low priority vector used by the HPET will
> only target a single CPU at a time, and hence adding extra CPUs to
> that mask as part of AP bringup is not correct.

I'm not sure about "wrong". It's not strictly necessary for the HPET one,
I expect, but it's generally what would be necessary. For the HPET one,
hpet_msi_set_affinity() replaces the value anyway. (I can add a sentence
to this effect to the description, if that helps.)

Jan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ
  2025-10-17  7:15     ` Jan Beulich
@ 2025-10-17  8:20       ` Roger Pau Monné
  2025-10-20  5:53         ` Jan Beulich
  0 siblings, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-17  8:20 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Fri, Oct 17, 2025 at 09:15:08AM +0200, Jan Beulich wrote:
> On 16.10.2025 18:27, Roger Pau Monné wrote:
> > On Thu, Oct 16, 2025 at 09:32:04AM +0200, Jan Beulich wrote:
> >> @@ -307,15 +309,13 @@ static void cf_check hpet_msi_set_affini
> >>      struct hpet_event_channel *ch = desc->action->dev_id;
> >>      struct msi_msg msg = ch->msi.msg;
> >>  
> >> -    msg.dest32 = set_desc_affinity(desc, mask);
> >> -    if ( msg.dest32 == BAD_APICID )
> >> -        return;
> >> +    /* This really is only for dump_irqs(). */
> >> +    cpumask_copy(desc->arch.cpu_mask, mask);
> > 
> > If you no longer call set_desc_affinity(), could you adjust the second
> > parameter of hpet_msi_set_affinity() to be unsigned int cpu instead of
> > a cpumask?
> 
> Looks like I could, yes. But then we need to split the function, as it's
> also used as the .set_affinity hook.

I see, I wasn't taking that into account.

> > And here just clear desc->arch.cpu_mask and set the passed CPU.
> 
> Which would still better be a cpumask_copy(), just given cpumask_of(cpu)
> as input.

As is it, yes.

> >> -    msg.data &= ~MSI_DATA_VECTOR_MASK;
> >> -    msg.data |= MSI_DATA_VECTOR(desc->arch.vector);
> >> +    msg.dest32 = cpu_mask_to_apicid(mask);
> > 
> > And here you can just use cpu_physical_id().
> 
> Right. All of which (up to here; but see below) perhaps better a separate,
> follow-on cleanup change.

Yes, it's too much fuss, and I also have plans in that area to deal
with it myself anyway.  Just wanted to avoid changing this now to be
changed again.  But it's too unrelated to put in this change.

> >>      msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
> >>      msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32);
> >> -    if ( msg.data != ch->msi.msg.data || msg.dest32 != ch->msi.msg.dest32 )
> >> +    if ( msg.dest32 != ch->msi.msg.dest32 )
> >>          hpet_msi_write(ch, &msg);
> > 
> > A further note here, which ties to my comment on the previous patch
> > about loosing the interrupt during the masked window.  If the vector
> > is the same across all CPUs, we no longer need to update the MSI data
> > field, just the address one, which can be done atomically.  We also
> > have signaling from the IOMMU whether the MSI fields need writing.
> 
> Hmm, yes, we can leverage that, as long as we're willing to make assumptions
> here about what exactly iommu_update_ire_from_msi() does: We'd then rely on
> not only the original (untranslated) msg->data not changing, but also the
> translated one. That looks to hold for both Intel and AMD, but it's still
> something we want to be sure we actually want to make the code dependent
> upon. (I'm intending to at least add an assertion to that effect.)

We could still mask when needed, but the masking would be
conditionally done in hpet_msi_write().

It seems however this might be better done as a followup change.

> > We can avoid the masking, and the possible drop of interrupts.
> 
> Hmm, right. There's nothing wrong with the caller relying on the write
> being atomic now. (Really, continuing to use hpet_msi_write() wouldn't
> be a problem, as re-writing the low half of HPET_Tn_ROUTE() with the
> same value is going to be benign. Unless of course that write was the
> source of the extra IRQs I'm seeing.)

Oh, yes, that's right, we don't even need to avoid the write.

> Taking together with what you said further up, having
> set_channel_irq_affinity() no longer use hpet_msi_set_affinity() as it
> is to ...
> 
> >> @@ -328,7 +328,7 @@ static hw_irq_controller hpet_msi_type =
> >>      .shutdown   = hpet_msi_shutdown,
> >>      .enable	    = hpet_msi_unmask,
> >>      .disable    = hpet_msi_mask,
> >> -    .ack        = ack_nonmaskable_msi_irq,
> >> +    .ack        = irq_actor_none,
> >>      .end        = end_nonmaskable_irq,
> >>      .set_affinity   = hpet_msi_set_affinity,
> 
> ... satisfy the use here would then probably be desirable right away.
> The little bit that's left of hpet_msi_set_affinity() would then be
> open-coded in set_channel_irq_affinity().

As you see fit, I'm not going to insist if the changes become too
unrelated to the fix itself.  Can always be done as a followup patch,
specially taking into account we are in hard code freeze.

> Getting rid of the masking would (hopefully) also get rid of the stray
> IRQs that I'm observing, assuming my guessing towards the reason there
> is correct.
> 
> >> @@ -497,6 +503,7 @@ static void set_channel_irq_affinity(str
> >>      spin_lock(&desc->lock);
> >>      hpet_msi_mask(desc);
> >>      hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
> >> +    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
> > 
> > I would set the vector table ahead of setting the affinity, in case we
> > can drop the mask calls around this block of code.
> 
> Isn't there a problematic window either way round? I can make the change,
> but I don't see that addressing anything. The new comparator value will
> be written later anyway, and interrupts up to that point aren't of any
> interest anyway. I.e. it doesn't matter which of the CPUs gets to handle
> them.

It's preferable to get a silent stray interrupt (if the per-cpu vector
table is correctly setup), rather than to get a message from Xen that
an unknown vector has been received?

If a vector is injected ahead of vector_irq being set Xen would
complain in do_IRQ() that that's no handler for such vector.

> > I also wonder, do you really need the bind_irq_vector() if you
> > manually set the affinity afterwards, and the vector table plus
> > desc->arch.cpu_mask are also set here?
> 
> At the very least I'd then also need to open-code the setting of
> desc->arch.vector and desc->arch.used. Possibly also the setting of the
> bit in desc->arch.used_vectors. And strictly speaking also the
> trace_irq_mask() invocation.

Let's keep it as-is.

> >> --- a/xen/arch/x86/include/asm/irq-vectors.h
> >> +++ b/xen/arch/x86/include/asm/irq-vectors.h
> >> @@ -18,6 +18,15 @@
> >>  /* IRQ0 (timer) is statically allocated but must be high priority. */
> >>  #define IRQ0_VECTOR             0xf0
> >>  
> >> +/*
> >> + * Low-priority (for now statically allocated) vectors, sharing entry
> >> + * points with exceptions in the 0x10 ... 0x1f range, as long as the
> >> + * respective exception has an error code.
> >> + */
> >> +#define FIRST_LOPRIORITY_VECTOR 0x10
> >> +#define HPET_BROADCAST_VECTOR   X86_EXC_AC
> >> +#define LAST_LOPRIORITY_VECTOR  0x1f
> > 
> > I wonder if it won't be clearer to simply reserve a vector if the HPET
> > is used, instead of hijacking the AC one.  It's one vector less, but
> > arguably now that we unconditionally use physical destination mode our
> > pool of vectors has expanded considerably.
> 
> Well, I'd really like to avoid consuming an otherwise usable vector, if
> at all possible (as per Andrew's FRED plans, that won't be possible
> there anymore then).

If re-using the AC vector is not possible with FRED we might want to
do this uniformly and always consume a vector then?

> >> --- a/xen/arch/x86/irq.c
> >> +++ b/xen/arch/x86/irq.c
> >> @@ -755,8 +755,9 @@ void setup_vector_irq(unsigned int cpu)
> >>          if ( !irq_desc_initialized(desc) )
> >>              continue;
> >>          vector = irq_to_vector(irq);
> >> -        if ( vector >= FIRST_HIPRIORITY_VECTOR &&
> >> -             vector <= LAST_HIPRIORITY_VECTOR )
> >> +        if ( vector <= (vector >= FIRST_HIPRIORITY_VECTOR
> >> +                        ? LAST_HIPRIORITY_VECTOR
> >> +                        : LAST_LOPRIORITY_VECTOR) )
> >>              cpumask_set_cpu(cpu, desc->arch.cpu_mask);
> > 
> > I think this is wrong.  The low priority vector used by the HPET will
> > only target a single CPU at a time, and hence adding extra CPUs to
> > that mask as part of AP bringup is not correct.
> 
> I'm not sure about "wrong". It's not strictly necessary for the HPET one,
> I expect, but it's generally what would be necessary. For the HPET one,
> hpet_msi_set_affinity() replaces the value anyway. (I can add a sentence
> to this effect to the description, if that helps.)

I do think it's wrong, it's just not harmful per-se apart from showing
up in the output of dump_irqs().  The value in desc->arch.cpu_mask
should be the CPU that's the destination of the interrupt.  In this
case, the HPET interrupt does have a single destination at a give
time, and adding another one will make the output of dump_irqs() show
two destinations, when the interrupt will target a single interrupt.

If anything you should add the CPU to the affinity set
(desc->affinity), but that's not needed since you already init the
affinity mask with cpumask_setall().

FWIW, I'm working on tentatively getting rid of the
desc->arch.{cpu,old_cpu,pending}_mask fields and converting them to
plain unsigned ints after we have dropped logical interrupt delivery
for external interrupts.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ
  2025-10-17  8:20       ` Roger Pau Monné
@ 2025-10-20  5:53         ` Jan Beulich
  2025-10-20 15:49           ` Roger Pau Monné
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2025-10-20  5:53 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On 17.10.2025 10:20, Roger Pau Monné wrote:
> On Fri, Oct 17, 2025 at 09:15:08AM +0200, Jan Beulich wrote:
>> On 16.10.2025 18:27, Roger Pau Monné wrote:
>>> On Thu, Oct 16, 2025 at 09:32:04AM +0200, Jan Beulich wrote:
>>>> @@ -497,6 +503,7 @@ static void set_channel_irq_affinity(str
>>>>      spin_lock(&desc->lock);
>>>>      hpet_msi_mask(desc);
>>>>      hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
>>>> +    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
>>>
>>> I would set the vector table ahead of setting the affinity, in case we
>>> can drop the mask calls around this block of code.
>>
>> Isn't there a problematic window either way round? I can make the change,
>> but I don't see that addressing anything. The new comparator value will
>> be written later anyway, and interrupts up to that point aren't of any
>> interest anyway. I.e. it doesn't matter which of the CPUs gets to handle
>> them.
> 
> It's preferable to get a silent stray interrupt (if the per-cpu vector
> table is correctly setup), rather than to get a message from Xen that
> an unknown vector has been received?
> 
> If a vector is injected ahead of vector_irq being set Xen would
> complain in do_IRQ() that that's no handler for such vector.

As of now, setup_vector_irq() makes sure the field isn't uninitialized
(i.e. left at INT_MIN). With that change dropped (see below), there
would indeed be such a risk (on the first instance on each CPU).

>>>> --- a/xen/arch/x86/include/asm/irq-vectors.h
>>>> +++ b/xen/arch/x86/include/asm/irq-vectors.h
>>>> @@ -18,6 +18,15 @@
>>>>  /* IRQ0 (timer) is statically allocated but must be high priority. */
>>>>  #define IRQ0_VECTOR             0xf0
>>>>  
>>>> +/*
>>>> + * Low-priority (for now statically allocated) vectors, sharing entry
>>>> + * points with exceptions in the 0x10 ... 0x1f range, as long as the
>>>> + * respective exception has an error code.
>>>> + */
>>>> +#define FIRST_LOPRIORITY_VECTOR 0x10
>>>> +#define HPET_BROADCAST_VECTOR   X86_EXC_AC
>>>> +#define LAST_LOPRIORITY_VECTOR  0x1f
>>>
>>> I wonder if it won't be clearer to simply reserve a vector if the HPET
>>> is used, instead of hijacking the AC one.  It's one vector less, but
>>> arguably now that we unconditionally use physical destination mode our
>>> pool of vectors has expanded considerably.
>>
>> Well, I'd really like to avoid consuming an otherwise usable vector, if
>> at all possible (as per Andrew's FRED plans, that won't be possible
>> there anymore then).
> 
> If re-using the AC vector is not possible with FRED we might want to
> do this uniformly and always consume a vector then?

Right now it saves us a vector. I'd leave the FRED side for when that's
going to be implemented. Which - aiui - isn't going to be straightforward
anyway, due to (at least) requirements around IRQ_MOVE_CLEANUP_VECTOR.

>>>> --- a/xen/arch/x86/irq.c
>>>> +++ b/xen/arch/x86/irq.c
>>>> @@ -755,8 +755,9 @@ void setup_vector_irq(unsigned int cpu)
>>>>          if ( !irq_desc_initialized(desc) )
>>>>              continue;
>>>>          vector = irq_to_vector(irq);
>>>> -        if ( vector >= FIRST_HIPRIORITY_VECTOR &&
>>>> -             vector <= LAST_HIPRIORITY_VECTOR )
>>>> +        if ( vector <= (vector >= FIRST_HIPRIORITY_VECTOR
>>>> +                        ? LAST_HIPRIORITY_VECTOR
>>>> +                        : LAST_LOPRIORITY_VECTOR) )
>>>>              cpumask_set_cpu(cpu, desc->arch.cpu_mask);
>>>
>>> I think this is wrong.  The low priority vector used by the HPET will
>>> only target a single CPU at a time, and hence adding extra CPUs to
>>> that mask as part of AP bringup is not correct.
>>
>> I'm not sure about "wrong". It's not strictly necessary for the HPET one,
>> I expect, but it's generally what would be necessary. For the HPET one,
>> hpet_msi_set_affinity() replaces the value anyway. (I can add a sentence
>> to this effect to the description, if that helps.)
> 
> I do think it's wrong, it's just not harmful per-se apart from showing
> up in the output of dump_irqs().  The value in desc->arch.cpu_mask
> should be the CPU that's the destination of the interrupt.  In this
> case, the HPET interrupt does have a single destination at a give
> time, and adding another one will make the output of dump_irqs() show
> two destinations, when the interrupt will target a single interrupt.

Just that as soon as the interrupt is actually in use, what is done
here doesn't matter anymore.

I continue to think the change is correct for the general case: I'd
expect these special vectors to normally (just not here) be used as
"direct APIC vectors", in which case the IRQ does have multiple
destinations.

Problem is - if I don't make this change, I still expect I ought to
make _some_ change here, as the following "else if()" might be getting
in the way. Then again the vector_irq[] assignment also isn't strictly
needed this early, as set_channel_irq_affinity() deals with that
anyway.

Bottom line - I guess I'll drop this change, realizing that adding
something here later on may then be harder to understand.

> If anything you should add the CPU to the affinity set
> (desc->affinity), but that's not needed since you already init the
> affinity mask with cpumask_setall().

Indeed.

> FWIW, I'm working on tentatively getting rid of the
> desc->arch.{cpu,old_cpu,pending}_mask fields and converting them to
> plain unsigned ints after we have dropped logical interrupt delivery
> for external interrupts.

I'm aware, yes. And I realize this change - for the HPET case - would
be getting in the way (to some degree).

Jan


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ
  2025-10-20  5:53         ` Jan Beulich
@ 2025-10-20 15:49           ` Roger Pau Monné
  2025-10-20 16:05             ` Jan Beulich
  0 siblings, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-20 15:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Mon, Oct 20, 2025 at 07:53:51AM +0200, Jan Beulich wrote:
> On 17.10.2025 10:20, Roger Pau Monné wrote:
> > On Fri, Oct 17, 2025 at 09:15:08AM +0200, Jan Beulich wrote:
> >> On 16.10.2025 18:27, Roger Pau Monné wrote:
> >>> On Thu, Oct 16, 2025 at 09:32:04AM +0200, Jan Beulich wrote:
> >>>> @@ -497,6 +503,7 @@ static void set_channel_irq_affinity(str
> >>>>      spin_lock(&desc->lock);
> >>>>      hpet_msi_mask(desc);
> >>>>      hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
> >>>> +    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
> >>>
> >>> I would set the vector table ahead of setting the affinity, in case we
> >>> can drop the mask calls around this block of code.
> >>
> >> Isn't there a problematic window either way round? I can make the change,
> >> but I don't see that addressing anything. The new comparator value will
> >> be written later anyway, and interrupts up to that point aren't of any
> >> interest anyway. I.e. it doesn't matter which of the CPUs gets to handle
> >> them.
> > 
> > It's preferable to get a silent stray interrupt (if the per-cpu vector
> > table is correctly setup), rather than to get a message from Xen that
> > an unknown vector has been received?
> > 
> > If a vector is injected ahead of vector_irq being set Xen would
> > complain in do_IRQ() that that's no handler for such vector.
> 
> As of now, setup_vector_irq() makes sure the field isn't uninitialized
> (i.e. left at INT_MIN). With that change dropped (see below), there
> would indeed be such a risk (on the first instance on each CPU).
> 
> >>>> --- a/xen/arch/x86/irq.c
> >>>> +++ b/xen/arch/x86/irq.c
> >>>> @@ -755,8 +755,9 @@ void setup_vector_irq(unsigned int cpu)
> >>>>          if ( !irq_desc_initialized(desc) )
> >>>>              continue;
> >>>>          vector = irq_to_vector(irq);
> >>>> -        if ( vector >= FIRST_HIPRIORITY_VECTOR &&
> >>>> -             vector <= LAST_HIPRIORITY_VECTOR )
> >>>> +        if ( vector <= (vector >= FIRST_HIPRIORITY_VECTOR
> >>>> +                        ? LAST_HIPRIORITY_VECTOR
> >>>> +                        : LAST_LOPRIORITY_VECTOR) )
> >>>>              cpumask_set_cpu(cpu, desc->arch.cpu_mask);
> >>>
> >>> I think this is wrong.  The low priority vector used by the HPET will
> >>> only target a single CPU at a time, and hence adding extra CPUs to
> >>> that mask as part of AP bringup is not correct.
> >>
> >> I'm not sure about "wrong". It's not strictly necessary for the HPET one,
> >> I expect, but it's generally what would be necessary. For the HPET one,
> >> hpet_msi_set_affinity() replaces the value anyway. (I can add a sentence
> >> to this effect to the description, if that helps.)
> > 
> > I do think it's wrong, it's just not harmful per-se apart from showing
> > up in the output of dump_irqs().  The value in desc->arch.cpu_mask
> > should be the CPU that's the destination of the interrupt.  In this
> > case, the HPET interrupt does have a single destination at a give
> > time, and adding another one will make the output of dump_irqs() show
> > two destinations, when the interrupt will target a single interrupt.
> 
> Just that as soon as the interrupt is actually in use, what is done
> here doesn't matter anymore.
> 
> I continue to think the change is correct for the general case: I'd
> expect these special vectors to normally (just not here) be used as
> "direct APIC vectors", in which case the IRQ does have multiple
> destinations.

I think it depends on the usage of the vector.  There are indeed
vectors that are active on all CPUs at the same time (like the current
hi priority ones).  However in the case of the HPET vector that's not
the case, it targets a single CPU specifically.

I think it would be best if vectors that are used on all CPUs at the
same time are initialized using cpumask_all or cpumask_setall(), and
avoid having to add a new bit every time a CPU is started.  It's fine
for cpu_mask to contain offline CPUs.

> Problem is - if I don't make this change, I still expect I ought to
> make _some_ change here, as the following "else if()" might be getting
> in the way. Then again the vector_irq[] assignment also isn't strictly
> needed this early, as set_channel_irq_affinity() deals with that
> anyway.
> 
> Bottom line - I guess I'll drop this change, realizing that adding
> something here later on may then be harder to understand.

I have plans to change this soonish hopefully, which I expect will
make this clearer.

> > If anything you should add the CPU to the affinity set
> > (desc->affinity), but that's not needed since you already init the
> > affinity mask with cpumask_setall().
> 
> Indeed.
> 
> > FWIW, I'm working on tentatively getting rid of the
> > desc->arch.{cpu,old_cpu,pending}_mask fields and converting them to
> > plain unsigned ints after we have dropped logical interrupt delivery
> > for external interrupts.
> 
> I'm aware, yes. And I realize this change - for the HPET case - would
> be getting in the way (to some degree).

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ
  2025-10-20 15:49           ` Roger Pau Monné
@ 2025-10-20 16:05             ` Jan Beulich
  2025-10-21  8:37               ` Roger Pau Monné
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2025-10-20 16:05 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On 20.10.2025 17:49, Roger Pau Monné wrote:
> On Mon, Oct 20, 2025 at 07:53:51AM +0200, Jan Beulich wrote:
>> On 17.10.2025 10:20, Roger Pau Monné wrote:
>>> On Fri, Oct 17, 2025 at 09:15:08AM +0200, Jan Beulich wrote:
>>>> On 16.10.2025 18:27, Roger Pau Monné wrote:
>>>>> On Thu, Oct 16, 2025 at 09:32:04AM +0200, Jan Beulich wrote:
>>>>>> @@ -497,6 +503,7 @@ static void set_channel_irq_affinity(str
>>>>>>      spin_lock(&desc->lock);
>>>>>>      hpet_msi_mask(desc);
>>>>>>      hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
>>>>>> +    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
>>>>>
>>>>> I would set the vector table ahead of setting the affinity, in case we
>>>>> can drop the mask calls around this block of code.
>>>>
>>>> Isn't there a problematic window either way round? I can make the change,
>>>> but I don't see that addressing anything. The new comparator value will
>>>> be written later anyway, and interrupts up to that point aren't of any
>>>> interest anyway. I.e. it doesn't matter which of the CPUs gets to handle
>>>> them.
>>>
>>> It's preferable to get a silent stray interrupt (if the per-cpu vector
>>> table is correctly setup), rather than to get a message from Xen that
>>> an unknown vector has been received?
>>>
>>> If a vector is injected ahead of vector_irq being set Xen would
>>> complain in do_IRQ() that that's no handler for such vector.
>>
>> As of now, setup_vector_irq() makes sure the field isn't uninitialized
>> (i.e. left at INT_MIN). With that change dropped (see below), there
>> would indeed be such a risk (on the first instance on each CPU).
>>
>>>>>> --- a/xen/arch/x86/irq.c
>>>>>> +++ b/xen/arch/x86/irq.c
>>>>>> @@ -755,8 +755,9 @@ void setup_vector_irq(unsigned int cpu)
>>>>>>          if ( !irq_desc_initialized(desc) )
>>>>>>              continue;
>>>>>>          vector = irq_to_vector(irq);
>>>>>> -        if ( vector >= FIRST_HIPRIORITY_VECTOR &&
>>>>>> -             vector <= LAST_HIPRIORITY_VECTOR )
>>>>>> +        if ( vector <= (vector >= FIRST_HIPRIORITY_VECTOR
>>>>>> +                        ? LAST_HIPRIORITY_VECTOR
>>>>>> +                        : LAST_LOPRIORITY_VECTOR) )
>>>>>>              cpumask_set_cpu(cpu, desc->arch.cpu_mask);
>>>>>
>>>>> I think this is wrong.  The low priority vector used by the HPET will
>>>>> only target a single CPU at a time, and hence adding extra CPUs to
>>>>> that mask as part of AP bringup is not correct.
>>>>
>>>> I'm not sure about "wrong". It's not strictly necessary for the HPET one,
>>>> I expect, but it's generally what would be necessary. For the HPET one,
>>>> hpet_msi_set_affinity() replaces the value anyway. (I can add a sentence
>>>> to this effect to the description, if that helps.)
>>>
>>> I do think it's wrong, it's just not harmful per-se apart from showing
>>> up in the output of dump_irqs().  The value in desc->arch.cpu_mask
>>> should be the CPU that's the destination of the interrupt.  In this
>>> case, the HPET interrupt does have a single destination at a give
>>> time, and adding another one will make the output of dump_irqs() show
>>> two destinations, when the interrupt will target a single interrupt.
>>
>> Just that as soon as the interrupt is actually in use, what is done
>> here doesn't matter anymore.
>>
>> I continue to think the change is correct for the general case: I'd
>> expect these special vectors to normally (just not here) be used as
>> "direct APIC vectors", in which case the IRQ does have multiple
>> destinations.
> 
> I think it depends on the usage of the vector.  There are indeed
> vectors that are active on all CPUs at the same time (like the current
> hi priority ones).  However in the case of the HPET vector that's not
> the case, it targets a single CPU specifically.
> 
> I think it would be best if vectors that are used on all CPUs at the
> same time are initialized using cpumask_all or cpumask_setall(), and
> avoid having to add a new bit every time a CPU is started.  It's fine
> for cpu_mask to contain offline CPUs.

I don't think so. There may be less dependencies now, but look at e.g.
the check in _bind_irq_vector(). Or this loop

            for_each_cpu(cpu, desc->arch.cpu_mask)
                per_cpu(vector_irq, cpu)[desc->arch.vector] = irq;

in _assign_irq_vector() (that may be fine because of how the mask is
set just before the loop, but the loop itself very much assumes no
offline CPUs in there). The most problematic example may be in
fixup_irqs(), where cpumask_any(desc->arch.cpu_mask) is used.

Jan


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ
  2025-10-20 16:05             ` Jan Beulich
@ 2025-10-21  8:37               ` Roger Pau Monné
  0 siblings, 0 replies; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-21  8:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Mon, Oct 20, 2025 at 06:05:04PM +0200, Jan Beulich wrote:
> On 20.10.2025 17:49, Roger Pau Monné wrote:
> > On Mon, Oct 20, 2025 at 07:53:51AM +0200, Jan Beulich wrote:
> >> On 17.10.2025 10:20, Roger Pau Monné wrote:
> >>> On Fri, Oct 17, 2025 at 09:15:08AM +0200, Jan Beulich wrote:
> >>>> On 16.10.2025 18:27, Roger Pau Monné wrote:
> >>>>> On Thu, Oct 16, 2025 at 09:32:04AM +0200, Jan Beulich wrote:
> >>>>>> @@ -497,6 +503,7 @@ static void set_channel_irq_affinity(str
> >>>>>>      spin_lock(&desc->lock);
> >>>>>>      hpet_msi_mask(desc);
> >>>>>>      hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
> >>>>>> +    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
> >>>>>
> >>>>> I would set the vector table ahead of setting the affinity, in case we
> >>>>> can drop the mask calls around this block of code.
> >>>>
> >>>> Isn't there a problematic window either way round? I can make the change,
> >>>> but I don't see that addressing anything. The new comparator value will
> >>>> be written later anyway, and interrupts up to that point aren't of any
> >>>> interest anyway. I.e. it doesn't matter which of the CPUs gets to handle
> >>>> them.
> >>>
> >>> It's preferable to get a silent stray interrupt (if the per-cpu vector
> >>> table is correctly setup), rather than to get a message from Xen that
> >>> an unknown vector has been received?
> >>>
> >>> If a vector is injected ahead of vector_irq being set Xen would
> >>> complain in do_IRQ() that that's no handler for such vector.
> >>
> >> As of now, setup_vector_irq() makes sure the field isn't uninitialized
> >> (i.e. left at INT_MIN). With that change dropped (see below), there
> >> would indeed be such a risk (on the first instance on each CPU).
> >>
> >>>>>> --- a/xen/arch/x86/irq.c
> >>>>>> +++ b/xen/arch/x86/irq.c
> >>>>>> @@ -755,8 +755,9 @@ void setup_vector_irq(unsigned int cpu)
> >>>>>>          if ( !irq_desc_initialized(desc) )
> >>>>>>              continue;
> >>>>>>          vector = irq_to_vector(irq);
> >>>>>> -        if ( vector >= FIRST_HIPRIORITY_VECTOR &&
> >>>>>> -             vector <= LAST_HIPRIORITY_VECTOR )
> >>>>>> +        if ( vector <= (vector >= FIRST_HIPRIORITY_VECTOR
> >>>>>> +                        ? LAST_HIPRIORITY_VECTOR
> >>>>>> +                        : LAST_LOPRIORITY_VECTOR) )
> >>>>>>              cpumask_set_cpu(cpu, desc->arch.cpu_mask);
> >>>>>
> >>>>> I think this is wrong.  The low priority vector used by the HPET will
> >>>>> only target a single CPU at a time, and hence adding extra CPUs to
> >>>>> that mask as part of AP bringup is not correct.
> >>>>
> >>>> I'm not sure about "wrong". It's not strictly necessary for the HPET one,
> >>>> I expect, but it's generally what would be necessary. For the HPET one,
> >>>> hpet_msi_set_affinity() replaces the value anyway. (I can add a sentence
> >>>> to this effect to the description, if that helps.)
> >>>
> >>> I do think it's wrong, it's just not harmful per-se apart from showing
> >>> up in the output of dump_irqs().  The value in desc->arch.cpu_mask
> >>> should be the CPU that's the destination of the interrupt.  In this
> >>> case, the HPET interrupt does have a single destination at a give
> >>> time, and adding another one will make the output of dump_irqs() show
> >>> two destinations, when the interrupt will target a single interrupt.
> >>
> >> Just that as soon as the interrupt is actually in use, what is done
> >> here doesn't matter anymore.
> >>
> >> I continue to think the change is correct for the general case: I'd
> >> expect these special vectors to normally (just not here) be used as
> >> "direct APIC vectors", in which case the IRQ does have multiple
> >> destinations.
> > 
> > I think it depends on the usage of the vector.  There are indeed
> > vectors that are active on all CPUs at the same time (like the current
> > hi priority ones).  However in the case of the HPET vector that's not
> > the case, it targets a single CPU specifically.
> > 
> > I think it would be best if vectors that are used on all CPUs at the
> > same time are initialized using cpumask_all or cpumask_setall(), and
> > avoid having to add a new bit every time a CPU is started.  It's fine
> > for cpu_mask to contain offline CPUs.
> 
> I don't think so. There may be less dependencies now, but look at e.g.
> the check in _bind_irq_vector(). Or this loop
> 
>             for_each_cpu(cpu, desc->arch.cpu_mask)
>                 per_cpu(vector_irq, cpu)[desc->arch.vector] = irq;
> 
> in _assign_irq_vector() (that may be fine because of how the mask is
> set just before the loop, but the loop itself very much assumes no
> offline CPUs in there). The most problematic example may be in
> fixup_irqs(), where cpumask_any(desc->arch.cpu_mask) is used.

Then it looks like the comment ahead of the field declaration in irq.h
is wrong:

        /*
         * Except for high priority interrupts @cpu_mask may have bits set for
         * offline CPUs.  Consumers need to be careful to mask this down to
         * online ones as necessary.  There is supposed to always be a non-
         * empty intersection with cpu_online_map.
         */

I realize now the comment says "Except for high priority", but we
don't seem to make a such differentiation in most of the code (like
fixup_irqs()).

Hopefully this will be way more simple if I can get rid of the
cpumasks in arch_irq_desc.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ
  2025-10-16  7:32 ` [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ Jan Beulich
  2025-10-16 16:27   ` Roger Pau Monné
@ 2025-10-16 17:01   ` Andrew Cooper
  2025-10-17  6:23     ` Jan Beulich
  1 sibling, 1 reply; 41+ messages in thread
From: Andrew Cooper @ 2025-10-16 17:01 UTC (permalink / raw)
  To: Jan Beulich, xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko

On 16/10/2025 8:32 am, Jan Beulich wrote:
> Using dynamically allocated / maintained vectors has several downsides:
> - possible nesting of IRQs due to the effects of IRQ migration,
> - reduction of vectors available for devices,
> - IRQs not moving as intended if there's shortage of vectors,
> - higher runtime overhead.
>
> As the vector also doesn't need to be of any priority (first and foremost
> it really shouldn't be of higher or same priority as the timer IRQ, as
> that raises TIMER_SOFTIRQ anyway), avoid any "ordinary" vectors altogther
> and use a vector from the 0x10...0x1f exception vector space. Exception vs
> interrupt can easily be distinguished by checking for the presence of an
> error code.
>
> Fixes: 996576b965cc ("xen: allow up to 16383 cpus")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

This is so cunning that it took me a while to figure out how this even
functioned, given no apparent change to TPR.

Having this behaviour under FRED is easy.  In fact, allowing the use of
vectors 0x10-0x1f under FRED is one "extra" I haven't gotten around to
doing yet.

But, the problem it introduces under IDT is that for all the other
reserved exceptions, we'll panic if we see them.  That was the point of
setting TPR to 0x10 originally.

~Andrew


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ
  2025-10-16 17:01   ` Andrew Cooper
@ 2025-10-17  6:23     ` Jan Beulich
  0 siblings, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2025-10-17  6:23 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné, Oleksii Kurochko,
	xen-devel@lists.xenproject.org

On 16.10.2025 19:01, Andrew Cooper wrote:
> On 16/10/2025 8:32 am, Jan Beulich wrote:
>> Using dynamically allocated / maintained vectors has several downsides:
>> - possible nesting of IRQs due to the effects of IRQ migration,
>> - reduction of vectors available for devices,
>> - IRQs not moving as intended if there's shortage of vectors,
>> - higher runtime overhead.
>>
>> As the vector also doesn't need to be of any priority (first and foremost
>> it really shouldn't be of higher or same priority as the timer IRQ, as
>> that raises TIMER_SOFTIRQ anyway), avoid any "ordinary" vectors altogther
>> and use a vector from the 0x10...0x1f exception vector space. Exception vs
>> interrupt can easily be distinguished by checking for the presence of an
>> error code.
>>
>> Fixes: 996576b965cc ("xen: allow up to 16383 cpus")
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> This is so cunning that it took me a while to figure out how this even
> functioned, given no apparent change to TPR.

Yeah, the TPR part took me a while to figure out. I was scratching my head
as to why I wasn't seeing any interrupts, until I figured I need to make
that adjustment to FIRST_IRQ_VECTOR.

> Having this behaviour under FRED is easy.  In fact, allowing the use of
> vectors 0x10-0x1f under FRED is one "extra" I haven't gotten around to
> doing yet.

Ah yes.

> But, the problem it introduces under IDT is that for all the other
> reserved exceptions, we'll panic if we see them.  That was the point of
> setting TPR to 0x10 originally.

Which would be a clear indication of something being wrong, rather than
things misbehaving (likely) entirely silently. So more of a benefit than
a downside?

Of course, if we were afraid of other vectors getting signaled, other
entry points may need amending similarly to what I do to entry_AC(). It
then isn't quite clear to me what, if anything, to do to entry points of
exceptions coming without error code.

Jan


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH for-4.21 04/10] x86/HPET: ignore "stale" IRQs
  2025-10-16  7:30 [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Jan Beulich
                   ` (2 preceding siblings ...)
  2025-10-16  7:32 ` [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ Jan Beulich
@ 2025-10-16  7:32 ` Jan Beulich
  2025-10-17  9:19   ` Roger Pau Monné
  2025-10-16  7:32 ` [PATCH 05/10] x86/HPET: avoid indirect call to event handler Jan Beulich
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2025-10-16  7:32 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Roger Pau Monné, Oleksii Kurochko

Following hpet_detach_channel(), IRQs may still occur: Ones may already
be in flight (both from before and after the last IRQ migration, i.e. on
possibly two distinct vectors targeting two different CPUs), and new ones
may still be raised as long as the channel is enabled.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: Previously, when we still used "normal" IRQs, this would only work
     correctly if there's no shortage of vectors, i.e. if the original
     hpet_msi_set_affinity()'s call to set_desc_affinity() would succeed.
     With that changed the patch may not actually be of much help anymore
     (hence I've also dropped the Fixes: tag again); it was pretty useful
     prior to that.

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -241,8 +241,9 @@ again:
 static void cf_check hpet_interrupt_handler(int irq, void *data)
 {
     struct hpet_event_channel *ch = data;
+    unsigned int cpu = smp_processor_id();
 
-    this_cpu(irq_count)--;
+    per_cpu(irq_count, cpu)--;
 
     if ( !ch->event_handler )
     {
@@ -250,6 +251,9 @@ static void cf_check hpet_interrupt_hand
         return;
     }
 
+    if ( ch->cpu != cpu )
+        return;
+
     ch->event_handler(ch);
 }
 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 04/10] x86/HPET: ignore "stale" IRQs
  2025-10-16  7:32 ` [PATCH for-4.21 04/10] x86/HPET: ignore "stale" IRQs Jan Beulich
@ 2025-10-17  9:19   ` Roger Pau Monné
  2025-10-17  9:57     ` Jan Beulich
  0 siblings, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-17  9:19 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Thu, Oct 16, 2025 at 09:32:29AM +0200, Jan Beulich wrote:
> Following hpet_detach_channel(), IRQs may still occur: Ones may already
> be in flight (both from before and after the last IRQ migration, i.e. on
> possibly two distinct vectors targeting two different CPUs), and new ones
> may still be raised as long as the channel is enabled.

Description would need to be adjusted if nothing else, as it speaks
about two distinct vectors which is no longer possible after patch 3.

> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

With the above adjusted:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 04/10] x86/HPET: ignore "stale" IRQs
  2025-10-17  9:19   ` Roger Pau Monné
@ 2025-10-17  9:57     ` Jan Beulich
  2025-10-17 12:13       ` Roger Pau Monné
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2025-10-17  9:57 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On 17.10.2025 11:19, Roger Pau Monné wrote:
> On Thu, Oct 16, 2025 at 09:32:29AM +0200, Jan Beulich wrote:
>> Following hpet_detach_channel(), IRQs may still occur: Ones may already
>> be in flight (both from before and after the last IRQ migration, i.e. on
>> possibly two distinct vectors targeting two different CPUs), and new ones
>> may still be raised as long as the channel is enabled.
> 
> Description would need to be adjusted if nothing else, as it speaks
> about two distinct vectors which is no longer possible after patch 3.

Oh, right - dropping the parenthesized part.

>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> With the above adjusted:
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, but as per the RFC remark: Do you then think this is actually
worthwhile despite what patch 3 does?

Jan


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 04/10] x86/HPET: ignore "stale" IRQs
  2025-10-17  9:57     ` Jan Beulich
@ 2025-10-17 12:13       ` Roger Pau Monné
  0 siblings, 0 replies; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-17 12:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Fri, Oct 17, 2025 at 11:57:28AM +0200, Jan Beulich wrote:
> On 17.10.2025 11:19, Roger Pau Monné wrote:
> > On Thu, Oct 16, 2025 at 09:32:29AM +0200, Jan Beulich wrote:
> >> Following hpet_detach_channel(), IRQs may still occur: Ones may already
> >> be in flight (both from before and after the last IRQ migration, i.e. on
> >> possibly two distinct vectors targeting two different CPUs), and new ones
> >> may still be raised as long as the channel is enabled.
> > 
> > Description would need to be adjusted if nothing else, as it speaks
> > about two distinct vectors which is no longer possible after patch 3.
> 
> Oh, right - dropping the parenthesized part.
> 
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > With the above adjusted:
> > 
> > Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks, but as per the RFC remark: Do you then think this is actually
> worthwhile despite what patch 3 does?

I don't think it will hurt, but I also think this can possibly wait
after the code freeze, I don't see much win from doing it, and there's
a risk we might get it wrong.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 05/10] x86/HPET: avoid indirect call to event handler
  2025-10-16  7:30 [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Jan Beulich
                   ` (3 preceding siblings ...)
  2025-10-16  7:32 ` [PATCH for-4.21 04/10] x86/HPET: ignore "stale" IRQs Jan Beulich
@ 2025-10-16  7:32 ` Jan Beulich
  2025-10-16  7:33 ` [PATCH 06/10] x86/HPET: make another channel flags update atomic Jan Beulich
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2025-10-16  7:32 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

It's only ever handle_hpet_broadcast() that's used. While we now don't
enable IRQs right away, still play safe and convert the function pointer
to a boolean, to make sure no calls occur too early.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -40,7 +40,7 @@ struct hpet_event_channel
     s_time_t      next_event;
     cpumask_var_t cpumask;
     spinlock_t    lock;
-    void          (*event_handler)(struct hpet_event_channel *ch);
+    bool          event_handler;
 
     unsigned int idx;   /* physical channel idx */
     unsigned int cpu;   /* msi target */
@@ -194,7 +194,7 @@ static void evt_do_broadcast(cpumask_t *
        cpumask_raise_softirq(mask, TIMER_SOFTIRQ);
 }
 
-static void cf_check handle_hpet_broadcast(struct hpet_event_channel *ch)
+static void handle_hpet_broadcast(struct hpet_event_channel *ch)
 {
     cpumask_t mask;
     s_time_t now, next_event;
@@ -254,7 +254,7 @@ static void cf_check hpet_interrupt_hand
     if ( ch->cpu != cpu )
         return;
 
-    ch->event_handler(ch);
+    handle_hpet_broadcast(ch);
 }
 
 static void cf_check hpet_msi_unmask(struct irq_desc *desc)
@@ -515,7 +515,7 @@ static void set_channel_irq_affinity(str
 
     /* We may have missed an interrupt due to the temporary masking. */
     if ( ch->event_handler && ch->next_event < NOW() )
-        ch->event_handler(ch);
+        handle_hpet_broadcast(ch);
 }
 
 static void hpet_attach_channel(unsigned int cpu,
@@ -643,7 +643,7 @@ void __init hpet_broadcast_init(void)
         hpet_events[i].next_event = STIME_MAX;
         spin_lock_init(&hpet_events[i].lock);
         smp_wmb();
-        hpet_events[i].event_handler = handle_hpet_broadcast;
+        hpet_events[i].event_handler = true;
 
         hpet_events[i].msi.msi_attrib.maskbit = 1;
         hpet_events[i].msi.msi_attrib.pos = MSI_TYPE_HPET;
@@ -794,7 +794,9 @@ int hpet_legacy_irq_tick(void)
          (hpet_events->flags & (HPET_EVT_DISABLE|HPET_EVT_LEGACY)) !=
          HPET_EVT_LEGACY )
         return 0;
-    hpet_events->event_handler(hpet_events);
+
+    handle_hpet_broadcast(hpet_events);
+
     return 1;
 }
 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 06/10] x86/HPET: make another channel flags update atomic
  2025-10-16  7:30 [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Jan Beulich
                   ` (4 preceding siblings ...)
  2025-10-16  7:32 ` [PATCH 05/10] x86/HPET: avoid indirect call to event handler Jan Beulich
@ 2025-10-16  7:33 ` Jan Beulich
  2025-10-16  7:33 ` [PATCH 07/10] x86/HPET: move legacy tick IRQ count adjustment Jan Beulich
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2025-10-16  7:33 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

Unlike the setting of HPET_EVT_LEGACY in hpet_broadcast_init(), the
setting of HPET_EVT_DISABLE in hpet_disable_legacy_broadcast() isn't init-
only and hence can race other flag manipulation (not all of which occur
while holding the channel's lock). While possibly any such updates would
only ever occur when HPET_EVT_LEGACY isn't set in the first place, this
doesn't look straightforward to prove, so better be on the safe side.

Fixes: d09486dba36a ("cpuidle: Enable hpet broadcast by default")
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -709,7 +709,7 @@ void hpet_disable_legacy_broadcast(void)
 
     spin_lock_irqsave(&hpet_events->lock, flags);
 
-    hpet_events->flags |= HPET_EVT_DISABLE;
+    set_bit(HPET_EVT_DISABLE_BIT, &hpet_events->flags);
 
     /* disable HPET T0 */
     cfg = hpet_read32(HPET_Tn_CFG(0));



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 07/10] x86/HPET: move legacy tick IRQ count adjustment
  2025-10-16  7:30 [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Jan Beulich
                   ` (5 preceding siblings ...)
  2025-10-16  7:33 ` [PATCH 06/10] x86/HPET: make another channel flags update atomic Jan Beulich
@ 2025-10-16  7:33 ` Jan Beulich
  2025-10-16  7:34 ` [PATCH 08/10] x86/HPET: shrink IRQ-descriptor locked region in set_channel_irq_affinity() Jan Beulich
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2025-10-16  7:33 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

If already we play with the IRQ count, we should do so only if we actually
"consume" the interrupt; normal timer IRQs should not have any adjustment
done.

Fixes: 353533232730 ("cpuidle: fix the menu governor to enhance IO performance")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
_Why_ we do these adjustments (also elsewhere) I don't reeally know.

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -788,13 +788,13 @@ int hpet_broadcast_is_available(void)
 
 int hpet_legacy_irq_tick(void)
 {
-    this_cpu(irq_count)--;
-
     if ( !hpet_events ||
          (hpet_events->flags & (HPET_EVT_DISABLE|HPET_EVT_LEGACY)) !=
          HPET_EVT_LEGACY )
         return 0;
 
+    this_cpu(irq_count)--;
+
     handle_hpet_broadcast(hpet_events);
 
     return 1;



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 08/10] x86/HPET: shrink IRQ-descriptor locked region in set_channel_irq_affinity()
  2025-10-16  7:30 [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Jan Beulich
                   ` (6 preceding siblings ...)
  2025-10-16  7:33 ` [PATCH 07/10] x86/HPET: move legacy tick IRQ count adjustment Jan Beulich
@ 2025-10-16  7:34 ` Jan Beulich
  2025-10-16  7:34 ` [PATCH 09/10] x86/HPET: reduce hpet_next_event() call sites Jan Beulich
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2025-10-16  7:34 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

Along the lines of hpet_disable_channel(), split out hpet_enable_channel()
as well, to then use both functions from set_channel_irq_affinity().

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -257,10 +257,9 @@ static void cf_check hpet_interrupt_hand
     handle_hpet_broadcast(ch);
 }
 
-static void cf_check hpet_msi_unmask(struct irq_desc *desc)
+static void hpet_enable_channel(struct hpet_event_channel *ch)
 {
     u32 cfg;
-    struct hpet_event_channel *ch = desc->action->dev_id;
 
     cfg = hpet_read32(HPET_Tn_CFG(ch->idx));
     cfg |= HPET_TN_ENABLE;
@@ -268,6 +267,11 @@ static void cf_check hpet_msi_unmask(str
     ch->msi.msi_attrib.host_masked = 0;
 }
 
+static void cf_check hpet_msi_unmask(struct irq_desc *desc)
+{
+    hpet_enable_channel(desc->action->dev_id);
+}
+
 static void hpet_disable_channel(struct hpet_event_channel *ch)
 {
     u32 cfg;
@@ -503,14 +507,15 @@ static void set_channel_irq_affinity(str
 
     per_cpu(lru_channel, ch->cpu) = ch;
 
+    hpet_disable_channel(ch);
+
     ASSERT(!local_irq_is_enabled());
     spin_lock(&desc->lock);
-    hpet_msi_mask(desc);
     hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
     per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
-    hpet_msi_unmask(desc);
     spin_unlock(&desc->lock);
 
+    hpet_enable_channel(ch);
     spin_unlock(&ch->lock);
 
     /* We may have missed an interrupt due to the temporary masking. */



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 09/10] x86/HPET: reduce hpet_next_event() call sites
  2025-10-16  7:30 [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Jan Beulich
                   ` (7 preceding siblings ...)
  2025-10-16  7:34 ` [PATCH 08/10] x86/HPET: shrink IRQ-descriptor locked region in set_channel_irq_affinity() Jan Beulich
@ 2025-10-16  7:34 ` Jan Beulich
  2025-10-16  7:35 ` [PATCH 10/10] x86/HPET: don't use hardcoded 0 for "long timeout" Jan Beulich
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2025-10-16  7:34 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

I'm surprised gcc doesn't manage to do that: At least in debug builds two
call sites exist, just like source code has it. That's not necessary
though - by using do/while we can reduce this to a single call site. Then
the function will be inlined.

While improving code gen, also switch the function's 2nd parameter to
unsigned.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Oddly enough the CDQE is replaced by an entirely unnecessary 32-bit MOV of
a register to itself (i.e. zero-extending to 64 bits), as that's
immediately preceded by a 32-bit ADD targeting the same register.

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -124,7 +124,7 @@ static inline unsigned long ns2ticks(uns
     return (unsigned long) tmp;
 }
 
-static int hpet_next_event(unsigned long delta, int timer)
+static int hpet_next_event(unsigned long delta, unsigned int timer)
 {
     uint32_t cnt, cmp;
     unsigned long flags;
@@ -173,12 +173,10 @@ static int reprogram_hpet_evt_channel(
     delta = max_t(int64_t, delta, MIN_DELTA_NS);
     delta = ns2ticks(delta, ch->shift, ch->mult);
 
-    ret = hpet_next_event(delta, ch->idx);
-    while ( ret && force )
-    {
-        delta += delta;
+    do {
         ret = hpet_next_event(delta, ch->idx);
-    }
+        delta += delta;
+    } while ( ret && force );
 
     return ret;
 }



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 10/10] x86/HPET: don't use hardcoded 0 for "long timeout"
  2025-10-16  7:30 [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Jan Beulich
                   ` (8 preceding siblings ...)
  2025-10-16  7:34 ` [PATCH 09/10] x86/HPET: reduce hpet_next_event() call sites Jan Beulich
@ 2025-10-16  7:35 ` Jan Beulich
  2025-10-16 10:05 ` [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Roger Pau Monné
  2025-10-17 16:03 ` Oleksii Kurochko
  11 siblings, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2025-10-16  7:35 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: Andrew Cooper, Roger Pau Monné

With 32-bit counters, writing 0 means on average half the wrapping period
until an interrupt would be raised. Yet of course in extreme cases an
interrupt would be raised almost right away. Write the present counter
value instead, to make the timeout predicatbly a full wrapping period.

Fixes: e862b83e8433 ("CPUIDLE: Avoid remnant HPET intr while force hpetbroadcast")
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -165,7 +165,7 @@ static int reprogram_hpet_evt_channel(
     if ( expire == STIME_MAX )
     {
         /* We assume it will take a long time for the timer to wrap. */
-        hpet_write32(0, HPET_Tn_CMP(ch->idx));
+        hpet_write32(hpet_read32(HPET_COUNTER), HPET_Tn_CMP(ch->idx));
         return 0;
     }
 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements
  2025-10-16  7:30 [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Jan Beulich
                   ` (9 preceding siblings ...)
  2025-10-16  7:35 ` [PATCH 10/10] x86/HPET: don't use hardcoded 0 for "long timeout" Jan Beulich
@ 2025-10-16 10:05 ` Roger Pau Monné
  2025-10-16 10:41   ` Jan Beulich
  2025-10-17 16:03 ` Oleksii Kurochko
  11 siblings, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2025-10-16 10:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On Thu, Oct 16, 2025 at 09:30:28AM +0200, Jan Beulich wrote:
> While 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt processing")
> helped quite a bit, nested interrupts could still occur. First and foremost
> as a result from IRQ migration (where we don't have any control over the
> vectors chosen). Hence besides reducing the number of IRQs that can be raised
> (first two patches) and possibly the number of invocations of
> handle_hpet_broadcast() from the IRQ handler (optional patch 4), the main
> goal here is to eliminate the potential for nested IRQs (patch 3). These
> patches are imo 4.21 candidates (with patch 4 being questionable altogether;
> see there). Patches 5 and onwards likely aren't important enough anymore at
> this point of the release cycle, even if those with a Fixes: tag would likely
> end up being backported later on.
> 
> The one related thing I haven't been able to find a viable solution for is
> the elimination of the cpumask_t local variable in handle_hpet_broadcast().
> That'll get in the way of possible future increases of the NR_CPUS upper
> bound: Much like right now a single level of nesting is already too much,
> if the limit was doubled even a single IRQ would end up consuming too much
> stack space (together with cpumask_raise_softirq() also having such a
> variable). Yet further doubling would not allow any such stack variables
> anymore.

If there's no nesting anymore, you could introduce a per-CPU cpumask_t
to use in the context of handle_hpet_broadcast()?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements
  2025-10-16 10:05 ` [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Roger Pau Monné
@ 2025-10-16 10:41   ` Jan Beulich
  0 siblings, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2025-10-16 10:41 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Oleksii Kurochko

On 16.10.2025 12:05, Roger Pau Monné wrote:
> On Thu, Oct 16, 2025 at 09:30:28AM +0200, Jan Beulich wrote:
>> While 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt processing")
>> helped quite a bit, nested interrupts could still occur. First and foremost
>> as a result from IRQ migration (where we don't have any control over the
>> vectors chosen). Hence besides reducing the number of IRQs that can be raised
>> (first two patches) and possibly the number of invocations of
>> handle_hpet_broadcast() from the IRQ handler (optional patch 4), the main
>> goal here is to eliminate the potential for nested IRQs (patch 3). These
>> patches are imo 4.21 candidates (with patch 4 being questionable altogether;
>> see there). Patches 5 and onwards likely aren't important enough anymore at
>> this point of the release cycle, even if those with a Fixes: tag would likely
>> end up being backported later on.
>>
>> The one related thing I haven't been able to find a viable solution for is
>> the elimination of the cpumask_t local variable in handle_hpet_broadcast().
>> That'll get in the way of possible future increases of the NR_CPUS upper
>> bound: Much like right now a single level of nesting is already too much,
>> if the limit was doubled even a single IRQ would end up consuming too much
>> stack space (together with cpumask_raise_softirq() also having such a
>> variable). Yet further doubling would not allow any such stack variables
>> anymore.
> 
> If there's no nesting anymore, you could introduce a per-CPU cpumask_t
> to use in the context of handle_hpet_broadcast()?

Not quite, as there are other callers, too. But yes, possibly other callers
could be dealt with differently.

Jan


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements
  2025-10-16  7:30 [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Jan Beulich
                   ` (10 preceding siblings ...)
  2025-10-16 10:05 ` [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Roger Pau Monné
@ 2025-10-17 16:03 ` Oleksii Kurochko
  11 siblings, 0 replies; 41+ messages in thread
From: Oleksii Kurochko @ 2025-10-17 16:03 UTC (permalink / raw)
  To: Jan Beulich, xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Roger Pau Monné

[-- Attachment #1: Type: text/plain, Size: 2038 bytes --]


On 10/16/25 9:30 AM, Jan Beulich wrote:
> While 1db7829e5657 ("x86/hpet: do local APIC EOI after interrupt processing")
> helped quite a bit, nested interrupts could still occur. First and foremost
> as a result from IRQ migration (where we don't have any control over the
> vectors chosen). Hence besides reducing the number of IRQs that can be raised
> (first two patches) and possibly the number of invocations of
> handle_hpet_broadcast() from the IRQ handler (optional patch 4), the main
> goal here is to eliminate the potential for nested IRQs (patch 3). These
> patches are imo 4.21 candidates (with patch 4 being questionable altogether;
> see there).

I am not sure that patch 4 is very useful at the current stage.
The first three patches look useful enough for now, so:
  Releaase-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>

~ Oleksii

> Patches 5 and onwards likely aren't important enough anymore at
> this point of the release cycle, even if those with a Fixes: tag would likely
> end up being backported later on.
>
> The one related thing I haven't been able to find a viable solution for is
> the elimination of the cpumask_t local variable in handle_hpet_broadcast().
> That'll get in the way of possible future increases of the NR_CPUS upper
> bound: Much like right now a single level of nesting is already too much,
> if the limit was doubled even a single IRQ would end up consuming too much
> stack space (together with cpumask_raise_softirq() also having such a
> variable). Yet further doubling would not allow any such stack variables
> anymore.
>
> 01: limit channel changes
> 02: disable unused channels
> 03: use single, global, low-priority vector for broadcast IRQ
> 04: ignore "stale" IRQs
> 05: avoid indirect call to event handler
> 06: make another channel flags update atomic
> 07: move legacy tick IRQ count adjustment
> 08: shrink IRQ-descriptor locked region in set_channel_irq_affinity()
> 09: reduce hpet_next_event() call sites
> 10: don't use hardcoded 0 for "long timeout"
>
> Jan

[-- Attachment #2: Type: text/html, Size: 2598 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2025-10-21  8:37 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-16  7:30 [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Jan Beulich
2025-10-16  7:31 ` [PATCH for-4.21 01/10] x86/HPET: limit channel changes Jan Beulich
2025-10-16 10:24   ` Roger Pau Monné
2025-10-16 11:47     ` Jan Beulich
2025-10-16 15:07       ` Roger Pau Monné
2025-10-16 15:16         ` Jan Beulich
2025-10-16 15:25           ` Roger Pau Monné
2025-10-17  9:23   ` Roger Pau Monné
2025-10-17  9:55     ` Jan Beulich
2025-10-16  7:31 ` [PATCH for-4.21 02/10] x86/HPET: disable unused channels Jan Beulich
2025-10-16 11:42   ` Roger Pau Monné
2025-10-16 11:57     ` Jan Beulich
2025-10-16 15:34       ` Roger Pau Monné
2025-10-16 15:55         ` Jan Beulich
2025-10-16 16:28           ` Roger Pau Monné
2025-10-16 16:31   ` Roger Pau Monné
2025-10-17  6:08     ` Jan Beulich
2025-10-17  6:10       ` Jan Beulich
2025-10-16  7:32 ` [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ Jan Beulich
2025-10-16 16:27   ` Roger Pau Monné
2025-10-17  7:15     ` Jan Beulich
2025-10-17  8:20       ` Roger Pau Monné
2025-10-20  5:53         ` Jan Beulich
2025-10-20 15:49           ` Roger Pau Monné
2025-10-20 16:05             ` Jan Beulich
2025-10-21  8:37               ` Roger Pau Monné
2025-10-16 17:01   ` Andrew Cooper
2025-10-17  6:23     ` Jan Beulich
2025-10-16  7:32 ` [PATCH for-4.21 04/10] x86/HPET: ignore "stale" IRQs Jan Beulich
2025-10-17  9:19   ` Roger Pau Monné
2025-10-17  9:57     ` Jan Beulich
2025-10-17 12:13       ` Roger Pau Monné
2025-10-16  7:32 ` [PATCH 05/10] x86/HPET: avoid indirect call to event handler Jan Beulich
2025-10-16  7:33 ` [PATCH 06/10] x86/HPET: make another channel flags update atomic Jan Beulich
2025-10-16  7:33 ` [PATCH 07/10] x86/HPET: move legacy tick IRQ count adjustment Jan Beulich
2025-10-16  7:34 ` [PATCH 08/10] x86/HPET: shrink IRQ-descriptor locked region in set_channel_irq_affinity() Jan Beulich
2025-10-16  7:34 ` [PATCH 09/10] x86/HPET: reduce hpet_next_event() call sites Jan Beulich
2025-10-16  7:35 ` [PATCH 10/10] x86/HPET: don't use hardcoded 0 for "long timeout" Jan Beulich
2025-10-16 10:05 ` [PATCH for-4.21 00/10] x86/HPET: broadcast IRQ and other improvements Roger Pau Monné
2025-10-16 10:41   ` Jan Beulich
2025-10-17 16:03 ` Oleksii Kurochko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.