public inbox for linux-rt-users@vger.kernel.org
 help / color / mirror / Atom feed
* Control of IRQ Affinities from Userspace
@ 2025-10-30 14:20 Preclik, Tobias
  2025-11-03 15:53 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 25+ messages in thread
From: Preclik, Tobias @ 2025-10-30 14:20 UTC (permalink / raw)
  To: linux-rt-users@vger.kernel.org

Dear Linux RT Experts,

The Linux kernel exposes two userspace interfaces in procfs to control
IRQ affinities [1]. First, /proc/irq/default_smp_affinity specifies
the default affinity mask, which is applied as the initial affinity
mask for newly registered IRQs. Second,
/proc/irq/${IRQ_NO}/smp_affinity can be used to specify an affinity
mask for a given IRQ.
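
For illustration, both interfaces take the same hex mask format, which
can be produced with a small helper (the IRQ number below is a
placeholder; writing requires root):

```shell
# cpus_to_mask: turn a comma-separated CPU list into the hex mask
# format expected by /proc/irq/*/smp_affinity (bit N set = CPU N).
cpus_to_mask() {
    local mask=0 cpu
    for cpu in ${1//,/ }; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# Usage (root required; "125" is a placeholder IRQ number):
#   cpus_to_mask 2,3 > /proc/irq/default_smp_affinity  # default for new IRQs
#   cpus_to_mask 1   > /proc/irq/125/smp_affinity      # pin one IRQ to CPU 1
#   cat /proc/irq/125/smp_affinity_list                # list form, e.g. "1"
```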

For tuning the I/O path of RT applications, we use the second
interface to relocate IRQs to cores dedicated to running RT
applications. However, we have observed that certain situations, such
as interface bring-up or loading BPF/XDP programs, can cause the IRQ
affinity mask to be lost. Specifically, some network drivers,
particularly those based on stmmac, ignore the IRQ affinity mask set
from userspace and overwrite it with decisions from IRQ rebalancing
[2]. This driver behavior prevents consistent control of IRQ
affinities from userspace, impacting the tuning of the I/O path for
RT applications.

I would greatly appreciate any comments or guidance on this issue.

Best regards,
Tobias Preclik

[1] https://docs.kernel.org/core-api/irq/irq-affinity.html
[2]
https://elixir.bootlin.com/linux/v6.17.4/source/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c#L3835

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-10-30 14:20 Control of IRQ Affinities from Userspace Preclik, Tobias
@ 2025-11-03 15:53 ` Sebastian Andrzej Siewior
  2025-11-03 17:12   ` Florian Bezdeka
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-11-03 15:53 UTC (permalink / raw)
  To: Preclik, Tobias; +Cc: linux-rt-users@vger.kernel.org

On 2025-10-30 14:20:01 [+0000], Preclik, Tobias wrote:
> Dear Linux RT Experts,
> 
> the Linux kernel exposes two userspace interfaces to control IRQ
> affinities in the procfs [1]. First, /proc/irq/default_smp_affinity
> specifies the default affinity mask which gets applied as initial
> affinity mask for newly registering IRQs. Second,
> /proc/irq/${IRQ_NO}/smp_affinity which can be used to specify an
> affinity mask for a given IRQ.
> 
> For tuning the I/O path of RT applications we use the second interface
> to relocate IRQs to cores dedicated to run RT applications. However, we
> have observed that certain situations, such as interface bring-up or
> loading BPF/XDP programs, can cause the IRQ affinity mask to be lost.
> Specifically, some network drivers, particularly those based on stmmac,
> ignore the IRQ affinity mask set from userspace and overwrite it with
> decisions from IRQ rebalancing [2]. This driver behavior prevents
> consistent control of IRQ affinities from userspace, impacting the
> tuning of the I/O path for RT applications.
> 
> I would greatly appreciate any comments or guidance on this issue.

The usage of irq_set_affinity_hint() is not uncommon within
networking drivers. It is already a pity if the request happens on
each ifdown/up, but in this case it happens each time you add/remove
an XDP program.

But the interrupt should be managed by the kernel. Looking at
irq_do_set_affinity() there is:
|         if (irqd_affinity_is_managed(data) &&
|             housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
|                 const struct cpumask *hk_mask;
|
|                 hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
|
|                 cpumask_and(tmp_mask, mask, hk_mask);
|                 if (!cpumask_intersects(tmp_mask, cpu_online_mask))
|                         prog_mask = mask;
|                 else
|                         prog_mask = tmp_mask;
|         } else {

so if the IRQ is managed and you have IRQ isolation enabled then it
should exclude the isolated CPUs. Would that work?
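
For reference, that branch is only taken when managed-IRQ
housekeeping is configured on the kernel command line, e.g. (the CPU
list is illustrative):

```
# Kernel command line fragment: keep managed IRQs off the isolated
# CPUs 2-3 by restricting them to the housekeeping CPUs.
isolcpus=managed_irq,domain,2-3
```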

> Best regards,
> Tobias Preclik

Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-11-03 15:53 ` Sebastian Andrzej Siewior
@ 2025-11-03 17:12   ` Florian Bezdeka
  2025-11-05 13:11     ` Preclik, Tobias
  2025-11-11 13:58     ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 25+ messages in thread
From: Florian Bezdeka @ 2025-11-03 17:12 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Preclik, Tobias
  Cc: linux-rt-users@vger.kernel.org, Jan Kiszka

On Mon, 2025-11-03 at 16:53 +0100, Sebastian Andrzej Siewior wrote:
> On 2025-10-30 14:20:01 [+0000], Preclik, Tobias wrote:
> > Dear Linux RT Experts,
> > 
> > the Linux kernel exposes two userspace interfaces to control IRQ
> > affinities in the procfs [1]. First, /proc/irq/default_smp_affinity
> > specifies the default affinity mask which gets applied as initial
> > affinity mask for newly registering IRQs. Second,
> > /proc/irq/${IRQ_NO}/smp_affinity which can be used to specify an
> > affinity mask for a given IRQ.
> > 
> > For tuning the I/O path of RT applications we use the second interface
> > to relocate IRQs to cores dedicated to run RT applications. However, we
> > have observed that certain situations, such as interface bring-up or
> > loading BPF/XDP programs, can cause the IRQ affinity mask to be lost.
> > Specifically, some network drivers, particularly those based on stmmac,
> > ignore the IRQ affinity mask set from userspace and overwrite it with
> > decisions from IRQ rebalancing [2]. This driver behavior prevents
> > consistent control of IRQ affinities from userspace, impacting the
> > tuning of the I/O path for RT applications.
> > 
> > I would greatly appreciate any comments or guidance on this issue.
> 
> The usage of irq_set_affinity_hint() is not uncommon within the
> networking drivers. It is probably a pity if the request happens on each
> ifdown/ up but in this case it happens each time you add/ remove a XDP
> program.
> 
> But the interrupt should be managed by the kernel. Looking at
> irq_do_set_affinity() there is:
> >          if (irqd_affinity_is_managed(data) &&
> >              housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
> >                  const struct cpumask *hk_mask;
> > 
> >                  hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
> > 
> >                  cpumask_and(tmp_mask, mask, hk_mask);
> >                  if (!cpumask_intersects(tmp_mask, cpu_online_mask))
> >                          prog_mask = mask;
> >                  else
> >                          prog_mask = tmp_mask;
> >          } else {
> 
> so if the IRQ is managed and you have IRQ isolation enabled then it
> should exclude the non-isolated CPUs. Would that work?

Let me jump in and add some thoughts and results we got while
analyzing this issue:

What stmmac (and some other drivers) are trying to achieve here is
some kind of handcrafted IRQ balancing, like the good old irqbalanced
did in the past from usermode. It turns out that the situation around
IRQ balancing is a bit inconsistent. Some IRQ chips (like the APIC on
x86) do that "automatically" at driver level, many others don't. So
drivers end up fiddling with affinities.

We can nicely tune the IRQs and affected affinities that have been
requested during system boot. Tools like tuned can configure them
using the APIs Tobias described. IRQs that are requested/set up after
boot, during runtime, are kind of "problematic" for us, as there is
no API that informs about new IRQs. We would have to rescan /proc.
But even if there were such an API, it would be too late: the IRQ
might have fired already.

Once an affinity has been set (e.g. by tuned), this affinity is
restored when the IRQ comes back after a link up/down or BPF load.
But the situation on the system might have changed in the meantime.
Even the default affinity could be different now. In the case of
stmmac - and probably many more drivers - the default affinity is not
taken into account anymore. The previous affinity is restored
unconditionally.

I tried to modify stmmac and let it evaluate the default affinity
while doing the IRQ balancing dance. That turned out to work in the
end, but each line violated several coding/style/abstraction rules.
There is no API at driver level to read the current default affinity
- or I missed it. I could send that hack out as an RFC if requested.
Just let me know.

Thinking more about this problem - and trying to abstract it in a
generalized way - triggered some ideas about "IRQ namespaces",
similar to what we have for CPUs/memory/... in the cgroup world.
Devices, or classes of devices, could be moved into namespaces
instead of being configured one by one. Thoughts welcome. The main
challenge here is that we are not thinking about RT vs. non-RT; it is
more about multiple RT applications running in parallel, well
isolated from each other and from the non-RT world.

Florian

> 
> > Best regards,
> > Tobias Preclik
> 
> Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-11-03 17:12   ` Florian Bezdeka
@ 2025-11-05 13:11     ` Preclik, Tobias
  2025-11-05 13:18       ` Preclik, Tobias
  2025-11-11 14:34       ` bigeasy
  2025-11-11 13:58     ` Sebastian Andrzej Siewior
  1 sibling, 2 replies; 25+ messages in thread
From: Preclik, Tobias @ 2025-11-05 13:11 UTC (permalink / raw)
  To: Bezdeka, Florian, bigeasy@linutronix.de
  Cc: linux-rt-users@vger.kernel.org, Kiszka, Jan

On Mon, 2025-11-03 at 18:12 +0100, Florian Bezdeka wrote:
> On Mon, 2025-11-03 at 16:53 +0100, Sebastian Andrzej Siewior wrote:
> > The usage of irq_set_affinity_hint() is not uncommon within the
> > networking drivers. It is probably a pity if the request happens on
> > each
> > ifdown/ up but in this case it happens each time you add/ remove a
> > XDP
> > program.
> > 
> > But the interrupt should be managed by the kernel. Looking at
> > irq_do_set_affinity() there is:
> > >          if (irqd_affinity_is_managed(data) &&
> > >              housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
> > >                  const struct cpumask *hk_mask;
> > > 
> > >                  hk_mask =
> > > housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
> > > 
> > >                  cpumask_and(tmp_mask, mask, hk_mask);
> > >                  if (!cpumask_intersects(tmp_mask,
> > > cpu_online_mask))
> > >                          prog_mask = mask;
> > >                  else
> > >                          prog_mask = tmp_mask;
> > >          } else {
> > 
> > so if the IRQ is managed and you have IRQ isolation enabled then it
> > should exclude the non-isolated CPUs. Would that work?

For that code path to be taken, we would have to specify isolcpus and
managed_irq on the kernel command line. However, we already migrated
away from the deprecated isolcpus parameter. Additionally, we need to
change the interrupt affinities dynamically while in operation. For
that, we have to rely on setting the interrupt affinities in procfs.

> I'm trying to jump in and adding some thoughts and results we got
> while
> analyzing this issue:
> 
> What stmmac (and some more drivers) are trying to achieve here is
> some
> kind of handcrafted IRQ balancing, like the good old irqbalanced did
> in
> the past from usermode. Turns out that the situation about IRQ
> balancing
> is a bit inconsistent. Some IRQ chips (like the APIC on x86 do that
> "automatically" on driver level, many others don't. So drivers end up
> fiddling with affinities.
> 
> We can nicely tune IRQs and affected affinities that that have been
> requested during system boot. Tools like tuned can configure them
> using
> the APIs Tobias described. IRQs that are requested / setup after
> boot,
> during runtime, are kind of "problematic" for us, as there is no API
> that informs about new IRQ. We would have to rescan /proc. But even
> if
> there would be such an API: That would be too late. The IRQ might
> have
> fired already.
> 
> Once an affinity has been set (e.g. by tuned) this affinity is being
> restored when the IRQ comes back after a link up/down or bpf load.
> But:
> It might have happened that the situation on the system has changed.
> Even the default affinity could be different now. In case of the
> stmmac
> - and probably way more drivers - the default affinity is not taken
> into
> account anymore. The previous affinity is being restored
> unconditionally.

Thanks for sharing the details, Florian. The discovery of newly
registered/requested IRQs in userland is indeed an additional
problem. Still, I would say that we can work around it by polling
procfs and setting the default interrupt affinity appropriately
(given that it is not ignored by the drivers, of course). RT
applications might have to accept an initial delay until the
interrupts on their I/O path are detected and properly tuned.
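
As a rough sketch of such a polling workaround (interface name, CPU
choice and interval are illustrative; writing the affinity requires
root):

```shell
# List the IRQ numbers belonging to a network interface by scanning
# /proc/interrupts. The second argument overrides the scanned file,
# which is only useful for testing.
iface_irqs() {
    awk -v ifc="$1" '$0 ~ ifc { sub(":", "", $1); print $1 }' \
        "${2:-/proc/interrupts}"
}

# Polling loop (illustrative): re-pin any IRQ of eno2 to CPU 2.
# while true; do
#     for irq in $(iface_irqs eno2); do
#         echo 2 > /proc/irq/${irq}/smp_affinity_list
#     done
#     sleep 1
# done
```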

> I tried to modify stmmac and let it evaluate the default affinity
> while
> doing the IRQ balancing dance. That turned out to be working at the
> end,
> but each line violated several coding/style/abstraction rules. There
> is
> no API at driver level to read the current default affinity - or I
> missed it. I could sent that hack out as RFC if requested. Just let
> me
> know.

If drivers at least respect the default affinity when rebalancing
interrupts, we can avoid interfering with RT applications. However,
we must also ensure that specific interrupts which are in use by RT
applications on their I/O path remain on the application core.
Meaning, the IRQ rebalancing should also respect the interrupt
affinities set from userspace via /proc/irq/${IRQ_NO}/smp_affinity.

Here are steps to explore the behavior and reproduce the issue.

$ iface=eno2
$ ethtool -i ${iface}
driver: st_gmac
version: 6.18.0-rc2+
firmware-version: 
expansion-rom-version: 
bus-info: 0000:00:1d.2
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

$ ip link show ${iface}
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT group default qlen 1000
    link/ether 30:2f:1e:7b:a5:3d brd ff:ff:ff:ff:ff:ff
    altname enp0s29f2

$ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print $1}'
| sed 's/://'); do cat /proc/irq/${irq}/smp_affinity_list; done

$ # NOTE: interrupts are requested only after interface bring-up

$ echo 3 > /proc/irq/default_smp_affinity 

$ ip link set dev ${iface} up
$ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print $1}'
| sed 's/://'); do cat /proc/irq/${irq}/smp_affinity_list; done
0
0
1
1
2
2
3
3
0
0
1
1
2
2
3
3
0-1
0-1
0-1
0-1

$ # NOTE: All rx and tx interrupts of the interface are load balanced
to all online cpus in a round-robin fashion by setting the irq
affinities and ignoring the default affinity.

$ # Let's explicitly set the IRQ affinities:

$ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print $1}'
| sed 's/://'); do echo 2 > /proc/irq/${irq}/smp_affinity_list; done

$ # We trigger the IRQ rebalancing of the stmmac-based driver:

$ ip link set dev ${iface} down
$ ip link set dev ${iface} up

$ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print $1}'
| sed 's/://'); do cat /proc/irq/${irq}/smp_affinity_list; done
0
0
1
1
2
2
3
3
0
0
1
1
2
2
3
3
2
2
2
2

$ # NOTE: All rx and tx interrupts are load balanced again to all
online cpus ignoring the default interrupt affinity as well as ignoring
and overwriting the explicitly set interrupt affinities from userspace.

Tobias

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-11-05 13:11     ` Preclik, Tobias
@ 2025-11-05 13:18       ` Preclik, Tobias
  2025-11-11 14:35         ` bigeasy
  2025-11-11 14:34       ` bigeasy
  1 sibling, 1 reply; 25+ messages in thread
From: Preclik, Tobias @ 2025-11-05 13:18 UTC (permalink / raw)
  To: Bezdeka, Florian, bigeasy@linutronix.de
  Cc: linux-rt-users@vger.kernel.org, Kiszka, Jan

On Wed, 2025-11-05 at 13:11 +0000, Preclik, Tobias wrote:
> On Mon, 2025-11-03 at 18:12 +0100, Florian Bezdeka wrote:
> > On Mon, 2025-11-03 at 16:53 +0100, Sebastian Andrzej Siewior wrote:
> > > The usage of irq_set_affinity_hint() is not uncommon within the
> > > networking drivers. It is probably a pity if the request happens
> > > on
> > > each
> > > ifdown/ up but in this case it happens each time you add/ remove
> > > a
> > > XDP
> > > program.
> > > 
> > > But the interrupt should be managed by the kernel. Looking at
> > > irq_do_set_affinity() there is:
> > > >          if (irqd_affinity_is_managed(data) &&
> > > >              housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
> > > >                  const struct cpumask *hk_mask;
> > > > 
> > > >                  hk_mask =
> > > > housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
> > > > 
> > > >                  cpumask_and(tmp_mask, mask, hk_mask);
> > > >                  if (!cpumask_intersects(tmp_mask,
> > > > cpu_online_mask))
> > > >                          prog_mask = mask;
> > > >                  else
> > > >                          prog_mask = tmp_mask;
> > > >          } else {
> > > 
> > > so if the IRQ is managed and you have IRQ isolation enabled then
> > > it
> > > should exclude the non-isolated CPUs. Would that work?
> 
> For that code path to be taken we would have to specify isolcpus and
> managed_irq on the kernel command-line. However, we already migrated
> away from the deprecated isolcpus parameter. Additionally, we need to
> dynamically change the interrupt affinities while in operation. For
> that matter we have to rely on setting the interrupt affinities in
> procfs.
> 
> > I'm trying to jump in and adding some thoughts and results we got
> > while
> > analyzing this issue:
> > 
> > What stmmac (and some more drivers) are trying to achieve here is
> > some
> > kind of handcrafted IRQ balancing, like the good old irqbalanced
> > did
> > in
> > the past from usermode. Turns out that the situation about IRQ
> > balancing
> > is a bit inconsistent. Some IRQ chips (like the APIC on x86 do that
> > "automatically" on driver level, many others don't. So drivers end
> > up
> > fiddling with affinities.
> > 
> > We can nicely tune IRQs and affected affinities that that have been
> > requested during system boot. Tools like tuned can configure them
> > using
> > the APIs Tobias described. IRQs that are requested / setup after
> > boot,
> > during runtime, are kind of "problematic" for us, as there is no
> > API
> > that informs about new IRQ. We would have to rescan /proc. But even
> > if
> > there would be such an API: That would be too late. The IRQ might
> > have
> > fired already.
> > 
> > Once an affinity has been set (e.g. by tuned) this affinity is
> > being
> > restored when the IRQ comes back after a link up/down or bpf load.
> > But:
> > It might have happened that the situation on the system has
> > changed.
> > Even the default affinity could be different now. In case of the
> > stmmac
> > - and probably way more drivers - the default affinity is not taken
> > into
> > account anymore. The previous affinity is being restored
> > unconditionally.
> 
> Thanks for sharing the details Florian. The discovery of newly
> registered/requested IRQs in userland is indeed an additional
> problem.
> Still I would say that we can work around it by polling procfs and
> setting the default interrupt affinity appropriately (given that it
> is
> not ignored by the drivers of course). RT applications might have to
> accept an initial delay until the interrupts on their I/O path are
> detected and properly tuned.
> 
> > I tried to modify stmmac and let it evaluate the default affinity
> > while
> > doing the IRQ balancing dance. That turned out to be working at the
> > end,
> > but each line violated several coding/style/abstraction rules.
> > There
> > is
> > no API at driver level to read the current default affinity - or I
> > missed it. I could sent that hack out as RFC if requested. Just let
> > me
> > know.
> 
> When drivers at least respect the default affinity when rebalancing
> interrupts we can avoid interfering with RT applications. However, we
> must also ensure that specific interrupts which are in use by RT
> applications on their I/O path must remain on the application core.
> Meaning the IRQ rebalancing should also respect the interrupt
> affinities set from userspace via /proc/irq/${IRQ_NO}/smp_affinity.
> 
> Here are steps to explore the behavior and reproduce the issue.
> 
> $ iface=eno2
> $ ethtool -i ${iface}
> driver: st_gmac
> version: 6.18.0-rc2+
> firmware-version: 
> expansion-rom-version: 
> bus-info: 0000:00:1d.2
> supports-statistics: yes
> supports-test: no
> supports-eeprom-access: no
> supports-register-dump: yes
> supports-priv-flags: no
> 
> $ ip link show ${iface}
> 3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
> DEFAULT group default qlen 1000
>     link/ether 30:2f:1e:7b:a5:3d brd ff:ff:ff:ff:ff:ff
>     altname enp0s29f2
> 
> $ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print
> $1}'
> | sed 's/://'); do cat /proc/irq/${irq}/smp_affinity_list; done
> 
> $ # NOTE: interrupts are requested only after interface bring-up
> 
> $ echo 3 > /proc/irq/default_smp_affinity 
> 
> $ ip link set dev ${iface} up
> $ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print
> $1}'
> | sed 's/://'); do cat /proc/irq/${irq}/smp_affinity_list; done
> 0
> 0
> 1
> 1
> 2
> 2
> 3
> 3
> 0
> 0
> 1
> 1
> 2
> 2
> 3
> 3
> 0-1
> 0-1
> 0-1
> 0-1
> 
> $ # NOTE: All rx and tx interrupts of the interface are load balanced
> to all online cpus in a round-robin fashion by setting the irq
> affinities and ignoring the default affinity.
> 
> $ # Let's explicitly set the IRQ affinities:
> 
> $ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print
> $1}'
> | sed 's/://'); do echo 2 > /proc/irq/${irq}/smp_affinity_list; done
> 
> $ # We trigger the IRQ rebalancing of the stmmac-based driver:
> 
> $ ip link set dev ${iface} down
> $ ip link set dev ${iface} up
> 
> $ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print
> $1}'
> | sed 's/://'); do cat /proc/irq/${irq}/smp_affinity_list; done
> 0
> 0
> 1
> 1
> 2
> 2
> 3
> 3
> 0
> 0
> 1
> 1
> 2
> 2
> 3
> 3
> 2
> 2
> 2
> 2
> 
> $ # NOTE: All rx and tx interrupts are load balanced again to all
> online cpus ignoring the default interrupt affinity as well as
> ignoring
> and overwriting the explicitly set interrupt affinities from
> userspace.

The conclusion got lost:

Other drivers, like for example igb, respect the interrupt affinities
(both the default and the per-IRQ affinities). This leads me to
believe that the IRQ rebalancing in the drivers should only affect
the effective interrupt affinities. This is admittedly more involved
than it appears at first, because the interface interrupts would have
to be balanced subject to multiple (potentially totally different)
cpusets.

Tobias

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-11-03 17:12   ` Florian Bezdeka
  2025-11-05 13:11     ` Preclik, Tobias
@ 2025-11-11 13:58     ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-11-11 13:58 UTC (permalink / raw)
  To: Florian Bezdeka
  Cc: Preclik, Tobias, linux-rt-users@vger.kernel.org, Jan Kiszka

On 2025-11-03 18:12:48 [+0100], Florian Bezdeka wrote:
> I'm trying to jump in and adding some thoughts and results we got while
> analyzing this issue:
> 
> What stmmac (and some more drivers) are trying to achieve here is some
> kind of handcrafted IRQ balancing, like the good old irqbalanced did in
> the past from usermode. Turns out that the situation about IRQ balancing
> is a bit inconsistent. Some IRQ chips (like the APIC on x86 do that
> "automatically" on driver level, many others don't. So drivers end up
> fiddling with affinities.

Doing it once during startup is probably okay. The problem is that it
forgets everything while it removes the IRQ and requests it again
during down/up. I guess this is simpler because the number of
interrupts can change if the networking queues have been changed, and
this path is probably also invoked in that case.

> We can nicely tune IRQs and affected affinities that that have been
> requested during system boot. Tools like tuned can configure them using
> the APIs Tobias described. IRQs that are requested / setup after boot,
> during runtime, are kind of "problematic" for us, as there is no API
> that informs about new IRQ. We would have to rescan /proc. But even if
> there would be such an API: That would be too late. The IRQ might have
> fired already.
> 
> Once an affinity has been set (e.g. by tuned) this affinity is being
> restored when the IRQ comes back after a link up/down or bpf load. But:
> It might have happened that the situation on the system has changed.
> Even the default affinity could be different now. In case of the stmmac
> - and probably way more drivers - the default affinity is not taken into
> account anymore. The previous affinity is being restored
> unconditionally.
> 
> I tried to modify stmmac and let it evaluate the default affinity while
> doing the IRQ balancing dance. That turned out to be working at the end,
> but each line violated several coding/style/abstraction rules. There is
> no API at driver level to read the current default affinity - or I
> missed it. I could sent that hack out as RFC if requested. Just let me
> know.

Several drivers tune the affinity based on what they think is best.
The usual approach is to start with the current CPU and increment the
CPU with each queue. This is not unique to networking but also
happens with storage.

But we do have the "managed API" already.

> Thinking more about this problem - and trying to abstract that in a
> generalized way - triggered some ideas about "IRQ namespaces", similar
> to what we have for CPUs/Memory/... in the cgroup world. Devices, or
> classes of devices could be moved into namespaces, instead of
> configuring them one by one. Thoughts welcome. The main challenge here
> is that we do not think about rt vs. non-rt. It's more about multiple RT
> applications running in parallel, well isolated from each other and the
> non-rt world.

The excluded "affinity" would be a good place to start. So if you
have 16 CPUs but declare only two CPUs as housekeeping, it would make
sense to limit it to two interrupts if possible. Otherwise shuffle
them among the two available CPUs.
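
Sketched in shell (names purely illustrative; the commented-out write
is the part that needs root):

```shell
# Distribute a list of IRQs round-robin over a small housekeeping CPU
# list, instead of over all online CPUs.
spread_irqs() {
    local -a cpus=($1)
    local i=0 irq
    for irq in $2; do
        echo "irq $irq -> cpu ${cpus[i % ${#cpus[@]}]}"
        # echo "${cpus[i % ${#cpus[@]}]}" > /proc/irq/${irq}/smp_affinity_list
        i=$((i + 1))
    done
}
```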

> Florian
Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-11-05 13:11     ` Preclik, Tobias
  2025-11-05 13:18       ` Preclik, Tobias
@ 2025-11-11 14:34       ` bigeasy
  2025-11-21 13:25         ` Preclik, Tobias
  1 sibling, 1 reply; 25+ messages in thread
From: bigeasy @ 2025-11-11 14:34 UTC (permalink / raw)
  To: Preclik, Tobias
  Cc: Bezdeka, Florian, linux-rt-users@vger.kernel.org, Kiszka, Jan

On 2025-11-05 13:11:29 [+0000], Preclik, Tobias wrote:
> > > so if the IRQ is managed and you have IRQ isolation enabled then it
> > > should exclude the non-isolated CPUs. Would that work?
> 
> For that code path to be taken we would have to specify isolcpus and
> managed_irq on the kernel command-line. However, we already migrated
> away from the deprecated isolcpus parameter. Additionally, we need to
> dynamically change the interrupt affinities while in operation. For
> that matter we have to rely on setting the interrupt affinities in
> procfs.

I would be careful with the term "deprecated" here. The functionality
is not deprecated, just the interface is. The CPU affinity has been
migrated to a cgroup-based interface. If the matching IRQ affinity
part is missing, then it should be added rather than avoiding the
whole "affinity is managed" interface, since this looks as if it was
meant for your use case.

…

> Thanks for sharing the details Florian. The discovery of newly
> registered/requested IRQs in userland is indeed an additional problem.
> Still I would say that we can work around it by polling procfs and
> setting the default interrupt affinity appropriately (given that it is
> not ignored by the drivers of course). RT applications might have to
> accept an initial delay until the interrupts on their I/O path are
> detected and properly tuned.

This depends on how you sell it. The initial setup of the
application/system might take a small hit until everything is ready
and set up for your workload. I don't know how changing the XDP
program fits into this. Usually I would expect that you set it up
once and use it. However, if switching the XDP program makes sense
within your real-time workload, then maybe the affinity should not be
randomly assigned.

> > I tried to modify stmmac and let it evaluate the default affinity
> > while
> > doing the IRQ balancing dance. That turned out to be working at the
> > end,
> > but each line violated several coding/style/abstraction rules. There
> > is
> > no API at driver level to read the current default affinity - or I
> > missed it. I could sent that hack out as RFC if requested. Just let
> > me
> > know.
> 
> When drivers at least respect the default affinity when rebalancing
> interrupts we can avoid interfering with RT applications. However, we
> must also ensure that specific interrupts which are in use by RT
> applications on their I/O path must remain on the application core.
> Meaning the IRQ rebalancing should also respect the interrupt
> affinities set from userspace via /proc/irq/${IRQ_NO}/smp_affinity.

If free_irq() removes all information about the IRQ, then it might
also lose the configured smp_affinity.

> Here are steps to explore the behavior and reproduce the issue.
> online cpus ignoring the default interrupt affinity as well as ignoring
> and overwriting the explicitly set interrupt affinities from userspace.

I just compared with igb, and there the affinity mask survives. So it
is just this driver that is doing things differently. The igb driver
also removes all interrupts on down. The affinity remains after
changing the number of queues (which changes the number of interrupts
in use).

> Tobias

Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-11-05 13:18       ` Preclik, Tobias
@ 2025-11-11 14:35         ` bigeasy
  0 siblings, 0 replies; 25+ messages in thread
From: bigeasy @ 2025-11-11 14:35 UTC (permalink / raw)
  To: Preclik, Tobias
  Cc: Bezdeka, Florian, linux-rt-users@vger.kernel.org, Kiszka, Jan

On 2025-11-05 13:18:16 [+0000], Preclik, Tobias wrote:
> The conclusion got lost:
> 
> Other drivers like for example igb respect the interrupt affinities
> (both default and per-irq affinities). This leads me to believe that
> the irq rebalancing in the drivers should only affect the effective
> interrupt affinities. This admittedly is more involved than it appears
> at first because the interface interrupts would have to be balanced
> subject to multiple (potentially totally different) cpusets.

Exactly. Maybe it would work to align the driver with what igb does.

> Tobias

Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-11-11 14:34       ` bigeasy
@ 2025-11-21 13:25         ` Preclik, Tobias
  2025-11-24  9:59           ` bigeasy
  0 siblings, 1 reply; 25+ messages in thread
From: Preclik, Tobias @ 2025-11-21 13:25 UTC (permalink / raw)
  To: bigeasy@linutronix.de
  Cc: Bezdeka, Florian, linux-rt-users@vger.kernel.org, Kiszka, Jan

On Tue, 2025-11-11 at 15:34 +0100, bigeasy@linutronix.de wrote:
> On 2025-11-05 13:11:29 [+0000], Preclik, Tobias wrote:
> > > > so if the IRQ is managed and you have IRQ isolation enabled then
> > > > it should exclude the non-isolated CPUs. Would that work?
> > 
> > For that code path to be taken we would have to specify isolcpus
> > and managed_irq on the kernel command-line. However, we already
> > migrated away from the deprecated isolcpus parameter. Additionally,
> > we need to dynamically change the interrupt affinities while in
> > operation. For that matter we have to rely on setting the interrupt
> > affinities in procfs.
> 
> I would be careful with the term "deprecated" here. The functionality
> is not deprecated, just the interface is. The CPU affinity has been
> migrated to a cgroup-based interface. If the matching irq affinity is
> missing then it should be added rather than avoiding the whole
> "affinity is managed" interface, since this looks as if it was meant
> for your use case.
> 

As you point out, the isolcpus interface is deprecated, and it seems
there is no way to translate the managed_irq flag of the isolcpus
interface into the cgroups-based interface. My understanding is that
the managed_irq flag ensures that managed interrupts stay off isolated
cores. Whether an interrupt is managed or not depends on how the driver
allocates the interrupt. Managed interrupts explicitly exclude control
from userspace [1]:

> The affinity of managed interrupts is handled by the kernel and
> cannot be changed via the /proc/irq/* interfaces.

I am not aware of any net/ethernet driver allocating managed
interrupts. Mellanox even reverted the introduction of managed
interrupts in mlx5 for exactly this reason [2]. But NVMe and multiple
SCSI drivers allocate managed interrupts.

So in essence this uncovers more limitations:

- managed interrupts cannot be confined to housekeeping cores without
the deprecated isolcpus parameter
- managed interrupts cannot be relocated to a subset of isolated cores
from userspace (not so relevant for now since we are mostly interested
in relocating network interrupts to application cores)

Drivers allocating unmanaged interrupts are still subject to the
limitations I described initially: some drivers like stmmac rebalance
IRQs in certain situations, such as interface bring-up and XDP program
loading, ignoring and overwriting any affinities set from userspace.
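
For reference, the procfs knob in question can be exercised as sketched
below. This is illustrative only: the IRQ number (123), interface name
(eth0) and CPU (2) are placeholders, and the commented commands need
root on a machine with an affected NIC.

```python
def cpu_to_mask(cpu):
    """Hex affinity mask for a single CPU, in the format that
    /proc/irq/<N>/smp_affinity expects."""
    return format(1 << cpu, 'x')

print(cpu_to_mask(2))  # -> 4 (only bit 2, i.e. CPU 2, set)

# Example steps on real hardware (root required; values are placeholders):
#   echo 4 > /proc/irq/123/smp_affinity
#   cat /proc/irq/123/effective_affinity       # now CPU 2
#   ip link set eth0 down; ip link set eth0 up # or attach an XDP program
#   cat /proc/irq/123/effective_affinity       # driver-chosen CPU again
```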

> > Thanks for sharing the details Florian. The discovery of newly
> > registered/requested IRQs in userland is indeed an additional
> > problem. Still I would say that we can work around it by polling
> > procfs and setting the default interrupt affinity appropriately
> > (given that it is not ignored by the drivers of course). RT
> > applications might have to accept an initial delay until the
> > interrupts on their I/O path are detected and properly tuned.
> 
> This depends on how you sell it. The initial setup of the
> application/system might take a small hit until everything is ready
> and set up for your workload. I don't know how changing the XDP
> program fits into this. Usually I would expect that you set it up once
> and use it. However, if switching the XDP program makes sense within
> your real-time workload then maybe the affinity should not be randomly
> assigned.

Consider an edge platform where operators are free to start, stop and
upgrade rt and non-rt applications at will. XDP programs can be loaded
at any time, and loading an XDP program should not rebalance the
interrupts, ignoring (and overwriting) the affinities set by the edge
platform.
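
The polling workaround from the quoted paragraph could be sketched
roughly as follows. The sample strings stand in for /proc/interrupts;
in a real tuner the loop would re-read that file and write the desired
mask to /proc/irq/<n>/smp_affinity for every newly seen IRQ (left as a
comment since it needs root and real IRQ numbers).

```python
import re

def parse_irq_numbers(interrupts_text):
    """Return the set of numeric IRQs from /proc/interrupts-style text."""
    irqs = set()
    for line in interrupts_text.splitlines():
        m = re.match(r'\s*(\d+):', line)
        if m:
            irqs.add(int(m.group(1)))
    return irqs

def new_irqs(before, after):
    """IRQs present in 'after' but not in 'before'."""
    return after - before

# In a real tuner: before = parse_irq_numbers(open('/proc/interrupts').read())
# and for each new IRQ n: write the mask to /proc/irq/<n>/smp_affinity.
sample_before = "           CPU0\n  25:  10  IR-PCI-MSI  eth0-rx-0\n"
sample_after = sample_before + "  26:   0  IR-PCI-MSI  eth0-tx-0\n"
print(new_irqs(parse_irq_numbers(sample_before),
               parse_irq_numbers(sample_after)))  # -> {26}
```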

> > > I tried to modify stmmac and let it evaluate the default affinity
> > > while doing the IRQ balancing dance. That turned out to be working
> > > in the end, but each line violated several coding/style/abstraction
> > > rules. There is no API at driver level to read the current default
> > > affinity - or I missed it. I could send that hack out as RFC if
> > > requested. Just let me know.
> > 
> > When drivers at least respect the default affinity when rebalancing
> > interrupts we can avoid interfering with RT applications. However,
> > we must also ensure that specific interrupts which are in use by RT
> > applications on their I/O path remain on the application core.
> > Meaning the IRQ rebalancing should also respect the interrupt
> > affinities set from userspace via /proc/irq/${IRQ_NO}/smp_affinity.
> 
> If free_irq() removes all information about the IRQ then it might
> also lose the configured smp_affinity.
> 
> > Here are steps to explore the behavior and reproduce the issue.
> …
> > online cpus ignoring the default interrupt affinity as well as
> > ignoring and overwriting the explicitly set interrupt affinities
> > from userspace.
> 
> I just compared with igb and there the affinity mask survives. So it
> is just this driver that is doing things differently. The igb driver
> also removes all interrupts on down. The affinity remains after
> changing the number of queues (which changes the number of used
> interrupts).

igb does not apply driver-based irq balancing (no calls to
irq_set_affinity_hint) and does not allocate managed interrupts, and
therefore the affinities set from userspace remain in effect.
Affinities usually stay in effect (for unmanaged interrupts without
driver-based irq balancing) even after freeing and re-requesting an irq
handler. So everything works as I would expect in this case.

> > The conclusion got lost:
> >
> > Other drivers like for example igb respect the interrupt affinities
> > (both default and per-irq affinities). This leads me to believe that
> > the irq rebalancing in the drivers should only affect the effective
> > interrupt affinities. This admittedly is more involved than it
> > appears at first because the interface interrupts would have to be
> > balanced subject to multiple (potentially totally different)
> > cpusets.
> 
> Exactly. Maybe it would work to align the driver with what igb does.

Currently, stmmac sets IRQ affinity and hints for all IRQ
configurations. But on x86 systems with an IOAPIC, MSI-X vectors should
be balanced automatically. If we remove the driver-based irq balancing
then other architectures would not necessarily balance the interrupts
anymore and would suffer a performance impact. Maybe driver-based irq
balancing could be deactivated whenever the underlying system is
capable of balancing the interrupts itself? That would of course only
reduce the number of affected systems.
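
For illustration, the modification described in the quoted paragraph
above presumably amounts to something like the following pseudocode. It
is only a sketch of the idea, not valid driver code: as noted in the
quote, irq_default_affinity lives in kernel/irq/ and is not reachable
through any driver-level API, which is exactly the missing piece.

```
/* Pseudocode: let the driver's spreading policy respect the default
 * affinity. Not compilable - irq_default_affinity is not exported. */
for each queue i:
        desired = cpumask_of(i % num_online_cpus());  /* spreading policy */
        cpumask_and(mask, desired, irq_default_affinity);
        if (!cpumask_empty(mask))
                irq_set_affinity_hint(queue_irq[i], mask);
        /* else: leave the already-configured affinity untouched */
```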

In general I lack information on when drivers should (or are allowed
to) balance interrupts at driver level and whether smp_affinity may be
ignored and overwritten in that case. All documentation I have found so
far remains rather unspecific.

Tobias

[1]
https://www.kernel.org/doc/html/v6.1/admin-guide/kernel-parameters.html
[2]
https://github.com/torvalds/linux/commit/ef8c063cf88e1a3d99ab4ada1cbab5ba7248a4f2


* Re: Control of IRQ Affinities from Userspace
  2025-11-21 13:25         ` Preclik, Tobias
@ 2025-11-24  9:59           ` bigeasy
  2025-11-25 11:32             ` Florian Bezdeka
  0 siblings, 1 reply; 25+ messages in thread
From: bigeasy @ 2025-11-24  9:59 UTC (permalink / raw)
  To: Preclik, Tobias
  Cc: Bezdeka, Florian, linux-rt-users@vger.kernel.org, Kiszka, Jan,
	Frederic Weisbecker

On 2025-11-21 13:25:09 [+0000], Preclik, Tobias wrote:
> > I would be careful with the term "deprecated" here. The
> > functionality is not deprecated, just the interface is. The CPU
> > affinity has been migrated to a cgroup-based interface. If the
> > matching irq affinity is missing then it should be added rather than
> > avoiding the whole "affinity is managed" interface, since this looks
> > as if it was meant for your use case.
> > 
> 
> As you point out, the isolcpus interface is deprecated, and it seems
> there is no way to translate the managed_irq flag of the isolcpus
> interface into the cgroups-based interface. My understanding is that

I did not point out anything. I just suggested testing whether this
option works for you and, if it does, checking whether there is a
matching configuration knob in the cpusets/cgroups interface. As per
   https://www.suse.com/c/cpu-isolation-practical-example-part-5/

in "3.2) Isolcpus" Frederic says that the options should be used if the
kernel/application "haven't been built with cpusets/cgroups support".
So it seems that this bit is either missing in the other interface or
hard to find.

…
> > > The conclusion got lost:
> > >
> > > Other drivers like for example igb respect the interrupt
> > > affinities (both default and per-irq affinities). This leads me to
> > > believe that the irq rebalancing in the drivers should only affect
> > > the effective interrupt affinities. This admittedly is more
> > > involved than it appears at first because the interface interrupts
> > > would have to be balanced subject to multiple (potentially totally
> > > different) cpusets.
> > 
> > Exactly. Maybe it would work to align the driver with what igb does.
> 
> Currently, stmmac sets IRQ affinity and hints for all IRQ
> configurations. But on x86 systems with an IOAPIC, MSI-X vectors
> should be balanced automatically. If we remove the driver-based irq
> balancing then other architectures would not necessarily balance the
> interrupts anymore and would suffer a performance impact. Maybe
> driver-based irq balancing could be deactivated whenever the
> underlying system is capable of balancing the interrupts itself? That
> would of course only reduce the number of affected systems.
> 
> In general I lack information on when drivers should (or are allowed
> to) balance interrupts at driver level and whether smp_affinity may
> be ignored and overwritten in that case. All documentation I have
> found so far remains rather unspecific.

It seems that if you exclude certain CPUs from getting interrupt
handling then it should work fine. Then the driver would only balance
the interrupts among the CPUs that are left.

> Tobias

Sebastian


* Re: Control of IRQ Affinities from Userspace
  2025-11-24  9:59           ` bigeasy
@ 2025-11-25 11:32             ` Florian Bezdeka
  2025-11-25 11:50               ` bigeasy
  2025-11-26 15:24               ` Frederic Weisbecker
  0 siblings, 2 replies; 25+ messages in thread
From: Florian Bezdeka @ 2025-11-25 11:32 UTC (permalink / raw)
  To: bigeasy@linutronix.de, Preclik, Tobias, Frederic Weisbecker
  Cc: linux-rt-users@vger.kernel.org, Kiszka, Jan

On Mon, 2025-11-24 at 10:59 +0100, bigeasy@linutronix.de wrote:
> On 2025-11-21 13:25:09 [+0000], Preclik, Tobias wrote:
> > > I would be careful with the term "deprecated" here. The
> > > functionality is not deprecated, just the interface is. The CPU
> > > affinity has been migrated to a cgroup-based interface. If the
> > > matching irq affinity is missing then it should be added rather
> > > than avoiding the whole "affinity is managed" interface, since
> > > this looks as if it was meant for your use case.
> > > 
> > 
> > As you point out, the isolcpus interface is deprecated, and it
> > seems there is no way to translate the managed_irq flag of the
> > isolcpus interface into the cgroups-based interface. My
> > understanding is that
> 
> I did not point out anything. I just suggested testing whether this
> option works for you and, if it does, checking whether there is a
> matching configuration knob in the cpusets/cgroups interface. As per
>    https://www.suse.com/c/cpu-isolation-practical-example-part-5/
> 
> in "3.2) Isolcpus" Frederic says that the options should be used if
> the kernel/application "haven't been built with cpusets/cgroups
> support". So it seems that this bit is either missing in the other
> interface or hard to find.

In case that was still unclear: we're using the dynamic system
configuration features provided by cpusets/cgroups. No isolcpus= on the
kernel cmdline anymore. With that, all applications are built around
cgroups. There is some userspace tooling around that takes care of
proper system configuration / RT isolation.

> 
> …
> > > > The conclusion got lost:
> > > >
> > > > Other drivers like for example igb respect the interrupt
> > > > affinities (both default and per-irq affinities). This leads me
> > > > to believe that the irq rebalancing in the drivers should only
> > > > affect the effective interrupt affinities. This admittedly is
> > > > more involved than it appears at first because the interface
> > > > interrupts would have to be balanced subject to multiple
> > > > (potentially totally different) cpusets.
> > > 
> > > Exactly. Maybe it would work to align the driver with what igb does.
> > 
> > Currently, stmmac sets IRQ affinity and hints for all IRQ
> > configurations. But on x86 systems with an IOAPIC, MSI-X vectors
> > should be balanced automatically. If we remove the driver-based irq
> > balancing then other architectures would not necessarily balance the
> > interrupts anymore and would suffer a performance impact. Maybe
> > driver-based irq balancing could be deactivated whenever the
> > underlying system is capable of balancing the interrupts itself?
> > That would of course only reduce the number of affected systems.
> > 
> > In general I lack information on when drivers should (or are
> > allowed to) balance interrupts at driver level and whether
> > smp_affinity may be ignored and overwritten in that case. All
> > documentation I have found so far remains rather unspecific.
> 
> It seems that if you exclude certain CPUs from getting interrupt
> handling then it should work fine. Then the driver would only balance
> the interrupts among the CPUs that are left.

Sebastian, what exactly do you mean by "exclude certain CPUs from
getting interrupt handling"? I mean, that is what we do by configuring
the /proc/<irq>/smp_affinity_list interface.

The point here is that drivers (like the stmmac, storage, ...) simply
ignore everything that was configured by userspace. As soon as one of
the dynamic events (link up/down, bpf loading) occurs, they destroy the
current RT-aware system configuration.

I was not successful in finding an API that would allow the driver(s) to
do better. The default affinity (/proc/irq/default_smp_affinity) - as an
example - is not visible from outside the IRQ core.

The managed IRQ infrastructure that you mentioned seems coupled to the
interfaces behind CONFIG_CPU_ISOLATION, which appear to be "static",
i.e. configured at boot time. Is that understanding correct? That would
not be flexible enough, as we don't know the system configuration at
boot time.

As we now have Frederic with us: Frederic, are there any plans to extend
the housekeeping API to deal with cpuset creation? Not sure if that
would be possible as it's hard to say if the newly created cpuset is
targeting isolation or housekeeping...

The question from Tobias in other words: What are drivers allowed to do?
Are they free in choosing/configuring a (performance optimized) IRQ
handling? If so, how can we prevent them from touching RT isolated CPUs?

Do we consider issues like "driver ignoring the default IRQ affinity"
as driver bugs? Is there some consensus in the community? Background
here is the argument for why we would have to touch / fix those
drivers.

To sum up: 
- The IRQ balancing issue is not limited to a single driver / subsystem
- The managed IRQ infrastructure seems very "static" so insufficient for
  this problem. In addition we would have to migrate all affected
  drivers to the managed IRQ infrastructure first.

We would love to hear further thoughts / ideas / comments about this
problem. We're highly interested in fixing this issue properly.

Thanks for all the comments so far!

Florian



* Re: Control of IRQ Affinities from Userspace
  2025-11-25 11:32             ` Florian Bezdeka
@ 2025-11-25 11:50               ` bigeasy
  2025-11-25 14:36                 ` Florian Bezdeka
  2025-11-26 15:31                 ` Frederic Weisbecker
  2025-11-26 15:24               ` Frederic Weisbecker
  1 sibling, 2 replies; 25+ messages in thread
From: bigeasy @ 2025-11-25 11:50 UTC (permalink / raw)
  To: Florian Bezdeka
  Cc: Preclik, Tobias, Frederic Weisbecker,
	linux-rt-users@vger.kernel.org, Kiszka, Jan

On 2025-11-25 12:32:39 [+0100], Florian Bezdeka wrote:
> > It seems that if you exclude certain CPUs from getting interrupt
> > handling then it should work fine. Then the driver would only balance
> > the interrupts among the CPUs that are left.
> 
> Sebastian, what exactly do you mean by "exclude certain CPUs from
> getting interrupt handling"? I mean, that is what we do by configuring
> the /proc/<irq>/smp_affinity_list interface.

Step #1
- figure out if isolcpus= is restricting the affinity of requested
  interrupts to housekeeping CPUs only

Step #2
- Yes
   => look for the matching knob in cgroup interface
      Knob found?
      - Yes
        => Use knob.
      - No.
        => Add knob.
- No
  This should be added as it breaks the expectation of an isolated
  system.

I *think* the driver should request as many interrupts as there are
available CPUs in the system to handle them. The number of available
CPUs/ CPU mask should be a configurable knob for the user. Using the
housekeeping CPUs as a default mask seems reasonable.
The question is what should happen if the mask changes at runtime. Maybe
a device needs to reconfigure, maybe just move the interrupt away.
But this should also affect NOHZ_FULL workloads.

> To sum up: 
> - The IRQ balancing issue is not limited to a single driver / subsystem
> - The managed IRQ infrastructure seems very "static" so insufficient for
>   this problem. In addition we would have to migrate all affected
>   drivers to the managed IRQ infrastructure first.
> 
> We would love to hear further thoughts / ideas / comments about this
> problem. We're highly interested in fixing this issue properly.

If the "managed IRQ infrastructure" would help here then why not. Maybe
Frederic has some insight here.

> Thanks for all the comments so far!
> 
> Florian

Sebastian


* Re: Control of IRQ Affinities from Userspace
  2025-11-25 11:50               ` bigeasy
@ 2025-11-25 14:36                 ` Florian Bezdeka
  2025-11-25 16:31                   ` Thomas Gleixner
  2025-11-26 15:31                 ` Frederic Weisbecker
  1 sibling, 1 reply; 25+ messages in thread
From: Florian Bezdeka @ 2025-11-25 14:36 UTC (permalink / raw)
  To: bigeasy@linutronix.de
  Cc: Preclik, Tobias, Frederic Weisbecker,
	linux-rt-users@vger.kernel.org, Kiszka, Jan

On Tue, 2025-11-25 at 12:50 +0100, bigeasy@linutronix.de wrote:
> On 2025-11-25 12:32:39 [+0100], Florian Bezdeka wrote:
> > > It seems that if you exclude certain CPUs from getting interrupt
> > > handling then it should work fine. Then the driver would only balance
> > > the interrupts among the CPUs that are left.
> > 
> > Sebastian, what exactly do you mean by "exclude certain CPUs from
> > getting interrupt handling"? I mean, that is what we do by configuring
> > the /proc/<irq>/smp_affinity_list interface.
> 
> Step #1
> - figure out if isolcpus= is restricting the affinity of requested
>   interrupts to housekeeping CPUs only

This question cannot be answered with yes/no. It depends. Affinities
are based on the default_smp_affinity during creation. But as it turned
out there are drivers that overwrite those affinities after IRQ
creation.

> 
> Step #2
> - Yes
>    => look for the matching knob in cgroup interface
>       Knob found?
>       - Yes
>         => Use knob.
>       - No.
>         => Add knob.
> - No
>   This should be added as it breaks the expectation of an isolated
>   system.

Assuming that "No" is the answer, as there are a couple of drivers
leading to this situation, the question now is: how do we prevent this
violation of the isolated system configuration?

> 
> I *think* the driver should request as many interrupts as there are
> available CPUs in the system to handle them. 
> 

That does not match how networking (and some storage) drivers are
designed. Those drivers are usually HW queue centric. A driver is
setting up an IRQ per queue pair (TX/RX). The number of HW queues is
defined by the hardware and is decoupled from any CPU count.

To optimize performance, drivers may spread / balance the IRQs /
queues over the available CPUs and, while doing so, might ignore any
previous RT configuration. Again: the performance optimization is
valid, but how can we prevent it from violating RT settings?

> The number of available
> CPUs/ CPU mask should be a configurable knob for the user.
> 

The user normally configures the number of HW queues that the NIC
should use, in most cases in combination with some HW packet filters to
achieve the best packet separation. IMHO the user should not have to
deal with any (additional) CPU mask on that level. RT tuning will /
should handle that.

> Using the
> housekeeping CPUs as a default mask seems reasonable.
> The question is what should happen if the mask changes at runtime. Maybe
> a device needs to reconfigure, maybe just move the interrupt away.
> But this should also affect NOHZ_FULL workloads.
> 
> > To sum up: 
> > - The IRQ balancing issue is not limited to a single driver / subsystem
> > - The managed IRQ infrastructure seems very "static" so insufficient for
> >   this problem. In addition we would have to migrate all affected
> >   drivers to the managed IRQ infrastructure first.
> > 
> > We would love to hear further thoughts / ideas / comments about this
> > problem. We're highly interested in fixing this issue properly.
> 
> If the "managed IRQ infrastructure" would help here then why not. Maybe
> Frederic has some insight here.

I currently can't see how this could help. 

That looks like dead code to me. I started in irq_do_set_affinity() -
which checks for managed IRQs - but I could not find any caller of
irq_create_affinity_masks() - that is where the managed flag is set -
which is actually in use. The road seems to end in
devm_platform_get_irqs_affinity(), which has no in-tree user.

> 
> > Thanks for all the comments so far!
> > 
> > Florian
> 
> Sebastian


* Re: Control of IRQ Affinities from Userspace
  2025-11-25 14:36                 ` Florian Bezdeka
@ 2025-11-25 16:31                   ` Thomas Gleixner
  2025-11-26  9:20                     ` Florian Bezdeka
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2025-11-25 16:31 UTC (permalink / raw)
  To: Florian Bezdeka, bigeasy@linutronix.de
  Cc: Preclik, Tobias, Frederic Weisbecker,
	linux-rt-users@vger.kernel.org, Kiszka, Jan

On Tue, Nov 25 2025 at 15:36, Florian Bezdeka wrote:
> On Tue, 2025-11-25 at 12:50 +0100, bigeasy@linutronix.de wrote:
>> On 2025-11-25 12:32:39 [+0100], Florian Bezdeka wrote:
>> > > It seems that if you exclude certain CPUs from getting interrupt
>> > > handling then it should work fine. Then the driver would only balance
>> > > the interrupts among the CPUs that are left.
>> > 
>> > Sebastian, what exactly do you mean by "exclude certain CPUs from
>> > getting interrupt handling"? I mean, that is what we do by configuring
>> > the /proc/<irq>/smp_affinity_list interface.
>> 
>> Step #1
>> - figure out if isolcpus= is restricting the affinity of requested
>>   interrupts to housekeeping CPUs only
>
> This question cannot be answered with yes/no. It depends. Affinities
> are based on the default_smp_affinity during creation. But as it turned
> out there are drivers that overwrite those affinities after IRQ
> creation.

Which ones?

>> I *think* the driver should request as many interrupts as there are
>> available CPUs in the system to handle them. 
>> 
>
> That does not match how networking (and some storage) drivers are
> designed. Those drivers are usually HW queue centric. A driver is
> setting up an IRQ per queue pair (TX/RX). The number of HW queues is
> defined by the hardware and is decoupled from any CPU count.
>
> To optimize performance, drivers may spread / balance the IRQs / queues
> over available CPUs and while doing so might ignore any previous RT
> configuration. Again: The performance optimization is valid, but how
> could we prevent violating RT settings?

That spreading happens, and it depends on how it is grouped and how
that matches your isolation requirements. NVMe certainly allocates a
queue per CPU if there are enough available, and those won't disturb
your RT-isolated CPUs as long as nothing issues I/O on those CPUs.

Networking is a different story, but networking does not use managed
interrupts (except for one driver) and you can move them away from your
isolated CPUs after the device is set up.

There have been discussions about how to keep interrupts off isolated
CPUs by default, but I don't know where this stands. Frederic?

>> The number of available
>> CPUs/ CPU mask should be a configurable knob for the user.
>> 
> The user normally configures the number of HW queues that the NIC should
> use. In most cases in combination with some HW packet filters to achieve
> best packet separation. IMHO the user should not have to deal with any
> (additional) CPU mask on that level. RT tuning will / should handle
> that.

How so. The kernel magically knows what the user wants?

>> Using the
>> housekeeping CPUs as a default mask seems reasonable.
>> The question is what should happen if the mask changes at runtime. Maybe
>> a device needs to reconfigure, maybe just move the interrupt away.
>> But this should also affect NOHZ_FULL workloads.
>> 
>> > To sum up: 
>> > - The IRQ balancing issue is not limited to a single driver / subsystem
>> > - The managed IRQ infrastructure seems very "static" so insufficient for
>> >   this problem. In addition we would have to migrate all affected
>> >   drivers to the managed IRQ infrastructure first.
>> > 
>> > We would love to hear further thoughts / ideas / comments about this
>> > problem. We're highly interested in fixing this issue properly.
>> 
>> If the "managed IRQ infrastructure" would help here then why not. Maybe
>> Frederic has some insight here.
>
> I currently can't see how this could help. 
>
> That looks like dead code to me. I started in irq_do_set_affinity() -
> which checks for managed IRQs - but I could not find any user of
> irq_create_affinity_masks() - that is where the managed flag is set -
> that is actually being used. The road seems dead in
> devm_platform_get_irqs_affinity() which has no in-tree user.

# git grep -nH irq_create_affinity_masks drivers/
drivers/base/platform.c:424:    desc = irq_create_affinity_masks(nvec, affd);
drivers/pci/msi/api.c:289:                              irq_create_affinity_masks(1, affd);
drivers/pci/msi/msi.c:405:              affd ? irq_create_affinity_masks(nvec, affd) : NULL;
drivers/pci/msi/msi.c:695:              affd ? irq_create_affinity_masks(nvec, affd) : NULL;

These three PCI ones are all going through pci_alloc_irq_vectors_affinity()

# git grep -nH pci_alloc_irq_vectors_affinity drivers/

drivers/net/ethernet/wangxun/libwx/wx_lib.c:1867:       nvecs = pci_alloc_irq_vectors_affinity(wx->pdev, nvecs,
drivers/nvme/host/pci.c:2659:   return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, flags,
drivers/scsi/be2iscsi/be_main.c:3585:           if (pci_alloc_irq_vectors_affinity(phba->pcidev, 2, nvec,
drivers/scsi/csiostor/csio_isr.c:520:   cnt = pci_alloc_irq_vectors_affinity(hw->pdev, min, cnt,
drivers/scsi/hisi_sas/hisi_sas_v3_hw.c:2611:    vectors = pci_alloc_irq_vectors_affinity(pdev,
drivers/scsi/megaraid/megaraid_sas_base.c:5943: i = pci_alloc_irq_vectors_affinity(instance->pdev,
drivers/scsi/mpi3mr/mpi3mr_fw.c:862:            retval = pci_alloc_irq_vectors_affinity(mrioc->pdev,
drivers/scsi/mpt3sas/mpt3sas_base.c:3390:       i = pci_alloc_irq_vectors_affinity(ioc->pdev,
drivers/scsi/pm8001/pm8001_init.c:982:          rc = pci_alloc_irq_vectors_affinity(
drivers/scsi/qla2xxx/qla_isr.c:4539:            ret = pci_alloc_irq_vectors_affinity(ha->pdev, min_vecs,
drivers/virtio/virtio_pci_common.c:160: err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors,

Not such a dead road :)

Thanks,

        tglx




* Re: Control of IRQ Affinities from Userspace
  2025-11-25 16:31                   ` Thomas Gleixner
@ 2025-11-26  9:20                     ` Florian Bezdeka
  2025-11-26 14:26                       ` Thomas Gleixner
  2025-11-26 15:45                       ` Frederic Weisbecker
  0 siblings, 2 replies; 25+ messages in thread
From: Florian Bezdeka @ 2025-11-26  9:20 UTC (permalink / raw)
  To: Thomas Gleixner, bigeasy@linutronix.de
  Cc: Preclik, Tobias, Frederic Weisbecker,
	linux-rt-users@vger.kernel.org, Kiszka, Jan

On Tue, 2025-11-25 at 17:31 +0100, Thomas Gleixner wrote:
> On Tue, Nov 25 2025 at 15:36, Florian Bezdeka wrote:
> > On Tue, 2025-11-25 at 12:50 +0100, bigeasy@linutronix.de wrote:
> > > On 2025-11-25 12:32:39 [+0100], Florian Bezdeka wrote:
> > > > > It seems that if you exclude certain CPUs from getting interrupt
> > > > > handling then it should work fine. Then the driver would only balance
> > > > > the interrupts among the CPUs that are left.
> > > > 
> > > > Sebastian, what exactly do you mean by "exclude certain CPUs from
> > > > getting interrupt handling"? I mean, that is what we do by configuring
> > > > the /proc/<irq>/smp_affinity_list interface.
> > > 
> > > Step #1
> > > - figure out if isolcpus= is restricting the affinity of requested
> > >   interrupts to housekeeping CPUs only
> > 
> > This question cannot be answered with yes/no. It depends. Affinities
> > are based on the default_smp_affinity during creation. But as it turned
> > out there are drivers that overwrite those affinities after IRQ
> > creation.
> 
> Which ones?

The problem at hand is the stmmac
(drivers/net/ethernet/stmicro/stmmac/stmmac_main), which is pulled in by
a couple of others like the dwmac_intel in our case.

Searching for the problematic irq_set_affinity_hint() call in
drivers/net/ethernet reveals more "affected" drivers, but each of them
requires some double-checking. In some cases the supplied cpumask or
the "time of the call" might be OKish.

> 
> > > I *think* the driver should request as many interrupts as there are
> > > available CPUs in the system to handle them. 
> > > 
> > 
> > That does not match how networking (and some storage) drivers are
> > designed. Those drivers are usually HW queue centric. A driver is
> > setting up an IRQ per queue pair (TX/RX). The number of HW queues is
> > defined by the hardware and is decoupled from any CPU count.
> > 
> > To optimize performance, drivers may spread / balance the IRQs / queues
> > over available CPUs and while doing so might ignore any previous RT
> > configuration. Again: The performance optimization is valid, but how
> > could we prevent violating RT settings?
> 
> That spreading happens, and it depends on how it is grouped and how
> that matches your isolation requirements. NVMe certainly allocates a
> queue per CPU if there are enough available, and those won't disturb
> your RT-isolated CPUs as long as nothing issues I/O on those CPUs.
> 
> Networking is a different story, but networking does not use managed
> interrupts (except for one driver) and you can move them away from your
> isolated CPUs after the device is set up.

That's one key point here. The pattern implemented in network drivers
seems to be that drivers register/free IRQs on interface up/down events
and when loading bpf programs (necessary for AF_XDP).

Such events can happen at any time - even after the initial "move
everything away from my isolated CPUs" has completed - and you end up
with a broken RT system configuration.

Even (RT) applications might trigger such an event, for example by
loading an AF_XDP BPF program. Applications can "destroy" the isolated
environment this way.
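To make the failure mode concrete: after such an event the only remedy
today is to re-apply the affinity from userspace. A minimal sketch of
that re-pinning step (the IRQ number and the helper names are mine, and
the /proc write obviously needs root):

```python
import os

def cpulist_to_mask(cpulist: str) -> int:
    """Parse a cpulist like "2,4-5" into a bitmask (bit n = CPU n)."""
    mask = 0
    for part in cpulist.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            for cpu in range(lo, hi + 1):
                mask |= 1 << cpu
        else:
            mask |= 1 << int(part)
    return mask

def mask_to_proc_hex(mask: int) -> str:
    """Format a bitmask the way /proc/irq/*/smp_affinity expects:
    comma-separated 32-bit hex words, most significant word first."""
    words = []
    while True:
        words.append(f"{mask & 0xffffffff:08x}")
        mask >>= 32
        if mask == 0:
            break
    return ",".join(reversed(words))

def pin_irq(irq: int, cpulist: str) -> None:
    """Re-apply an affinity that a driver may have overwritten."""
    path = f"/proc/irq/{irq}/smp_affinity"
    if os.path.exists(path):  # no-op outside a real /proc
        with open(path, "w") as f:
            f.write(mask_to_proc_hex(cpulist_to_mask(cpulist)))

print(mask_to_proc_hex(cpulist_to_mask("2,4-5")))  # 00000034
```

The catch is that userspace has no notification hook for "driver reset
my affinity", so this has to be re-run after every link/XDP event.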

> 
> There have been discussions how to keep interrupts by default off from
> isolated CPUs, but I don't know where this stands. Frederic?

Would love to hear more here.

> 
> > > The number of available
> > > CPUs/ CPU mask should be a configure knob by the user. 
> > > 
> > The user normally configures the number of HW queues that the NIC should
> > use. In most cases in combination with some HW packet filters to achieve
> > best packet separation. IMHO the user should not have to deal with any
> > (additional) CPU mask on that level. RT tuning will / should handle
> > that.
> 
> How so. The kernel magically knows what the user wants?

We can already identify the IRQs that belong to a certain hardware queue
(NAPI threads) and move them when necessary. I just wanted to express
that there might not be a need for an additional API / cpumask.
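For reference, the identification part is just a scan of
/proc/interrupts for the interface name. A sketch (the "eth0:rx-0" /
"eth0:tx-0" vector naming is an assumption; drivers name their vectors
differently):

```python
def irqs_for_interface(proc_interrupts: str, ifname: str) -> list[int]:
    """Return the IRQ numbers whose action column mentions ifname."""
    irqs = []
    for line in proc_interrupts.splitlines():
        head, _, _ = line.partition(":")
        if not head.strip().isdigit():
            continue  # skip the CPU header and NMI/LOC/... rows
        if ifname in line:
            irqs.append(int(head))
    return irqs

# Example content in the shape of /proc/interrupts:
sample = """\
           CPU0       CPU1
 30:       1024          0   GICv3  eth0:rx-0
 31:          7          0   GICv3  eth0:tx-0
 32:          0        512   GICv3  mmc0
"""
print(irqs_for_interface(sample, "eth0"))  # [30, 31]
```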

> 
> > > Using the
> > > housekeeping CPUs as a default mask seems reasonable.
> > > The question is what should happen if the mask changes at runtime. Maybe
> > > a device needs to reconfigure, maybe just move the interrupt away.
> > > But this should also affect NOHZ_FULL workloads.
> > > 
> > > > To sum up: 
> > > > - The IRQ balancing issue is not limited to a single driver / subsystem
> > > > - The managed IRQ infrastructure seems very "static" so insufficient for
> > > >   this problem. In addition we would have to migrate all affected
> > > >   drivers to the managed IRQ infrastructure first.
> > > > 
> > > > We would love to hear further thoughts / ideas / comments about this
> > > > problem. We're highly interested in fixing this issue properly.
> > > 
> > > If the "managed IRQ infrastructure" would help here then why not. Maybe
> > > Frederic has some insight here.
> > 
> > I currently can't see how this could help. 
> > 
> > That looks like dead code to me. I started in irq_do_set_affinity() -
> > which checks for managed IRQs - but I could not find any user of
> > irq_create_affinity_masks() - that is where the managed flag is set -
> > that is actually being used. The road seems dead in
> > devm_platform_get_irqs_affinity() which has no in-tree user.
> 
> # git grep -nH irq_create_affinity_masks drivers/
> drivers/base/platform.c:424:    desc = irq_create_affinity_masks(nvec, affd);
> drivers/pci/msi/api.c:289:                              irq_create_affinity_masks(1, affd);
> drivers/pci/msi/msi.c:405:              affd ? irq_create_affinity_masks(nvec, affd) : NULL;
> drivers/pci/msi/msi.c:695:              affd ? irq_create_affinity_masks(nvec, affd) : NULL;
> 
> These three PCI ones are all going through pci_alloc_irq_vectors_affinity()
> 
> # git grep -nH pci_alloc_irq_vectors_affinity drivers/
> 
> drivers/net/ethernet/wangxun/libwx/wx_lib.c:1867:       nvecs = pci_alloc_irq_vectors_affinity(wx->pdev, nvecs,
> drivers/nvme/host/pci.c:2659:   return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, flags,
> drivers/scsi/be2iscsi/be_main.c:3585:           if (pci_alloc_irq_vectors_affinity(phba->pcidev, 2, nvec,
> drivers/scsi/csiostor/csio_isr.c:520:   cnt = pci_alloc_irq_vectors_affinity(hw->pdev, min, cnt,
> drivers/scsi/hisi_sas/hisi_sas_v3_hw.c:2611:    vectors = pci_alloc_irq_vectors_affinity(pdev,
> drivers/scsi/megaraid/megaraid_sas_base.c:5943: i = pci_alloc_irq_vectors_affinity(instance->pdev,
> drivers/scsi/mpi3mr/mpi3mr_fw.c:862:            retval = pci_alloc_irq_vectors_affinity(mrioc->pdev,
> drivers/scsi/mpt3sas/mpt3sas_base.c:3390:       i = pci_alloc_irq_vectors_affinity(ioc->pdev,
> drivers/scsi/pm8001/pm8001_init.c:982:          rc = pci_alloc_irq_vectors_affinity(
> drivers/scsi/qla2xxx/qla_isr.c:4539:            ret = pci_alloc_irq_vectors_affinity(ha->pdev, min_vecs,
> drivers/virtio/virtio_pci_common.c:160: err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors,
> 
> Not a so dead road :)

Grml... I definitely fat-fingered the query. Anyway, this housekeeping
API still seems very boot-time oriented to me. Can't see yet where these
housekeeping cpumasks are filled up by the cgroup/cpuset infrastructure.

cgroups/cpusets don't care about "isolation" in that sense yet. It's
just about cpumasks for compute. Am I missing something?


Florian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-11-26  9:20                     ` Florian Bezdeka
@ 2025-11-26 14:26                       ` Thomas Gleixner
  2025-11-26 15:07                         ` Florian Bezdeka
  2025-11-26 15:45                       ` Frederic Weisbecker
  1 sibling, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2025-11-26 14:26 UTC (permalink / raw)
  To: Florian Bezdeka, bigeasy@linutronix.de
  Cc: Preclik, Tobias, Frederic Weisbecker,
	linux-rt-users@vger.kernel.org, Kiszka, Jan, Waiman Long,
	Gabriele Monaco

On Wed, Nov 26 2025 at 10:20, Florian Bezdeka wrote:
> On Tue, 2025-11-25 at 17:31 +0100, Thomas Gleixner wrote:
>> > This question can not be answered with yes/no. It depends. Affinities
>> > are based on the default_smp_affinity during creation. But as it turned
>> > out there are drivers that overwrite those affinities after IRQ
>> > creation.
>> 
>> Which ones?
>
> The problem at hand is the stmmac
> (drivers/net/ethernet/stmicro/stmmac/stmmac_main), which is pulled in by
> a couple of others like the dwmac_intel in our case.
>
> Searching for the problematic irq_set_affinity_hint() call in
> drivers/net/ethernet reveals more "affected" drivers but each of them
> requires some double checking. In some cases the supplied cpumask or the
> "time of the call" might be OKish.

The question is whether that affinity hint has a functional requirement
to be applied or not. I don't think so because those interrupts can be
moved by userspace as it sees fit.

So it's easy enough to make this "set" part conditional and restrict it
to some TBD mask (housekeeping, default ...) under some isolation magic.

>> > The user normally configures the number of HW queues that the NIC should
>> > use. In most cases in combination with some HW packet filters to achieve
>> > best packet separation. IMHO the user should not have to deal with any
>> > (additional) CPU mask on that level. RT tuning will / should handle
>> > that.
>> 
>> How so. The kernel magically knows what the user wants?
>
> We can already identify the IRQs that belong to a certain hardware queue
> (NAPI threads) and move it when necessary. I just wanted to express that
> there might not be a need for an additional API / cpumask.

Ok.

>> Not a so dead road :)
>
> Grml... I definitely fat-fingered the query. Anyway, this housekeeping
> API still seems very boot-time oriented to me. Can't see yet where this
> housekeeping cpumasks are filled up by the cgroup/cpuset infrastructure.
>
> cgroups/cpusets don't care about "isolation" in that sense yet. It's
> just about cpumasks for compute. Am I missing something?

There is work in progress on various related ends:

  https://lore.kernel.org/all/20251105210348.35256-1-frederic@kernel.org
  https://lore.kernel.org/all/20251120145653.296659-1-gmonaco@redhat.com
  https://lore.kernel.org/all/20251105043848.382703-1-longman@redhat.com
  https://lore.kernel.org/all/20251121143500.42111-3-frederic@kernel.org

There is certainly more going on it that area, but those were the bits
and pieces I could remember from the top of my head. Waiman and Gabriele
should be able to fill in some blanks, but that discussion should move
to LKML then.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-11-26 14:26                       ` Thomas Gleixner
@ 2025-11-26 15:07                         ` Florian Bezdeka
  2025-11-26 19:15                           ` Thomas Gleixner
  0 siblings, 1 reply; 25+ messages in thread
From: Florian Bezdeka @ 2025-11-26 15:07 UTC (permalink / raw)
  To: Thomas Gleixner, bigeasy@linutronix.de
  Cc: Preclik, Tobias, Frederic Weisbecker,
	linux-rt-users@vger.kernel.org, Kiszka, Jan, Waiman Long,
	Gabriele Monaco

On Wed Nov 26, 2025 at 3:26 PM CET, Thomas Gleixner wrote:
> On Wed, Nov 26 2025 at 10:20, Florian Bezdeka wrote:
>> On Tue, 2025-11-25 at 17:31 +0100, Thomas Gleixner wrote:
>>> > This question can not be answered with yes/no. It depends. Affinities
>>> > are based on the default_smp_affinity during creation. But as it turned
>>> > out there are drivers that overwrite those affinities after IRQ
>>> > creation.
>>> 
>>> Which ones?
>>
>> The problem at hand is the stmmac
>> (drivers/net/ethernet/stmicro/stmmac/stmmac_main), which is pulled in by
>> a couple of others like the dwmac_intel in our case.
>>
>> Searching for the problematic irq_set_affinity_hint() call in
>> drivers/net/ethernet reveals more "affected" drivers but each of them
>> requires some double checking. In some cases the supplied cpumask or the
>> "time of the call" might be OKish.
>
> The question is whether that affinity hint has a functional requirement
> to be applied or not. I don't think so because those interrupts can be
> moved by userspace as it sees fit.

The background seems to be performance. Those NICs support link speeds
up to (or even above) 2.5 Gbit/s. It seems hard to fully utilize the
link when all queues are routed - IRQ wise - to a single core.

This is now the point where the IRQ chip matters. Some (like the APIC
on x86) have IRQ balancing implemented in SW, while others don't have
that. So the driver does it manually, ignoring all the RT settings.

>
> So it's easy enough to make this "set" part conditional and restrict it
> to some TBD mask (housekeeping, default ...) under some isolation magic.
>

For now I would be happy if I could modify the stmmac in a way that its
balancing takes the default affinity into account. I couldn't find any
available API that allows me to do so from a module.

Are there any strong reasons for not exporting the default affinity from
the IRQ core? Read-only would be enough.

>>> > The user normally configures the number of HW queues that the NIC should
>>> > use. In most cases in combination with some HW packet filters to achieve
>>> > best packet separation. IMHO the user should not have to deal with any
>>> > (additional) CPU mask on that level. RT tuning will / should handle
>>> > that.
>>> 
>>> How so. The kernel magically knows what the user wants?
>>
>> We can already identify the IRQs that belong to a certain hardware queue
>> (NAPI threads) and move it when necessary. I just wanted to express that
>> there might not be a need for an additional API / cpumask.
>
> Ok.
>
>>> Not a so dead road :)
>>
>> Grml... I definitely fat-fingered the query. Anyway, this housekeeping
>> API still seems very boot-time oriented to me. Can't see yet where this
>> housekeeping cpumasks are filled up by the cgroup/cpuset infrastructure.
>>
>> cgroups/cpusets don't care about "isolation" in that sense yet. It's
>> just about cpumasks for compute. Am I missing something?
>
> There is work in progress on various related ends:
>
>   https://lore.kernel.org/all/20251105210348.35256-1-frederic@kernel.org
>   https://lore.kernel.org/all/20251120145653.296659-1-gmonaco@redhat.com
>   https://lore.kernel.org/all/20251105043848.382703-1-longman@redhat.com
>   https://lore.kernel.org/all/20251121143500.42111-3-frederic@kernel.org
>
> There is certainly more going on it that area, but those were the bits
> and pieces I could remember from the top of my head. Waiman and Gabriele
> should be able to fill in some blanks, but that discussion should move
> to LKML then.

Thanks for the pointers! I was aware of most of them. This still seems
very static, as the housekeeping cpumasks are filled at boot from kernel
cmdline arguments.

In addition I'm quite sure that the housekeeping infrastructure would
not help in the area of networking, as hardly any driver (except one) is
based on the managed IRQ API.

Florian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-11-25 11:32             ` Florian Bezdeka
  2025-11-25 11:50               ` bigeasy
@ 2025-11-26 15:24               ` Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2025-11-26 15:24 UTC (permalink / raw)
  To: Florian Bezdeka
  Cc: bigeasy@linutronix.de, Preclik, Tobias,
	linux-rt-users@vger.kernel.org, Kiszka, Jan

Le Tue, Nov 25, 2025 at 12:32:39PM +0100, Florian Bezdeka a écrit :
> On Mon, 2025-11-24 at 10:59 +0100, bigeasy@linutronix.de wrote:
> > On 2025-11-21 13:25:09 [+0000], Preclik, Tobias wrote:
> > > > I would be careful with the deprecated term here. The functionality
> > > > is
> > > > not deprecated just the interface is. The CPU affinity has been
> > > > migrated
> > > > a cgroup based interface. If the matching irq affinity is missing
> > > > then
> > > > it should be added rather than avoiding the whole "affinity is
> > > > managed"
> > > > interface since this looks as it has been meant for your use case.
> > > > 
> > > 
> > > As you point out isolcpus interface is deprecated and it seems there
> > > exists no way to translate the managed_irq flag of the isolcpus
> > > interface into the cgroups based interface. My understanding is that
> > 
> > I did not point out anything. I just suggested to test whether this
> > option is working for you and if it does, check if there is matching
> > configuration knob in the cpusets/cgroups interface. As per 
> >    https://www.suse.com/c/cpu-isolation-practical-example-part-5/
> > 
> > in "3.2) Isolcpus" Frederic says that the options should be used if the
> > kernel/ application "haven't been built with cpusets/cgroups support".
> > So it seems that this bit is either missing in the other interface or
> > hard to find.
> 
> In case that was still unclear: We're using the dynamic system
> > configuration features provided by cpusets/cgroups. No isolcpus= on the
> kernel cmdline anymore. With that all applications are build around
> cgroups. There is some userspace tooling around that takes care of
> proper system configuration / RT isolation.
> 
> > 
> > …
> > > > > The conclusion got lost:
> > > > > 
> > > > > Other drivers like for example igb respect the interrupt affinities
> > > > > (both default and per-irq affinities). This leads me to believe
> > > > that
> > > > > the irq rebalancing in the drivers should only affect the effective
> > > > > interrupt affinities. This admittedly is more involved than it
> > > > appears
> > > > > at first because the interface interrupts would have to be balanced
> > > > > subject to multiple (potentially totally different) cpusets.
> > > > 
> > > > Exactly. Maybe it would work to align the driver with what igb does.
> > > 
> > > Currently, stmmac sets IRQ affinity and hints for all IRQ
> > > configurations. But on x86 systems with IOAPIC MSI-X vectors should be
> > > automatically balanced. If we remove the driver-based irq balancing
> > > then other architectures would not necessarily balance the interrupts
> > > anymore and would be impacted in terms of performance. Maybe driver-
> > > based irq balancing could be deactivated whenever the underlying system
> > > is capable of balancing them? That would of course only reduced the
> > > number of affected systems.
> > > 
> > > In general I lack information when drivers should (or are allowed to)
> > > balance interrupts on driver level and whether smp_affinity is allowed
> > > to be ignored and overwritten in that case. All documentation I have
> > > found so far remains rather unspecific.
> > 
> > It seems that if you exclude certain CPUs from getting interrupt
> > handling then it should work fine. Then the driver would only balance
> > the interrupts among the CPUs that are left.
> 
> Sebastian, what exactly do you mean by "exclude certain CPUs from
> getting interrupt handling"? I mean, that is what we do by configuring
> the /proc/<irq>/smp_affinity_list interface.
> 
> The point here is that drivers (like the stmmac, storage, ...) simply
> ignore everything that was configured by userspace. As soon as one of
> the dynamic events (link up/down, bpf loading) occurs they destroy the
> current RT aware system configuration.
> 
> I was not successful in finding an API that would allow the driver(s) to
> do better. The default affinity (/proc/irq/default_smp_affinity) - as an
> example - is not visible from outside the IRQ core.
> 
> The managed IRQ infrastructure that you mentioned seems coupled with the
> interfaces behind CONFIG_CPU_ISOLATION which seems to be "static", so
> configured at boot time. Is that understanding correct? That would not
> be flexible enough as we don't know the system configuration at boot
> time.
> 
> As we now have Frederic with us: Frederic, are there any plans to extend
> the housekeeping API to deal with cpuset creation? Not sure if that
> would be possible as it's hard to say if the newly created cpuset is
> targeting isolation or housekeeping...

I'm not sure what you mean by that. But HK_TYPE_DOMAIN will soon include
both isolcpus and cpuset isolated partitions. And the next step is to be
able to create nohz_full/isolated cpuset partitions.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-11-25 11:50               ` bigeasy
  2025-11-25 14:36                 ` Florian Bezdeka
@ 2025-11-26 15:31                 ` Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2025-11-26 15:31 UTC (permalink / raw)
  To: bigeasy@linutronix.de
  Cc: Florian Bezdeka, Preclik, Tobias, linux-rt-users@vger.kernel.org,
	Kiszka, Jan

Le Tue, Nov 25, 2025 at 12:50:08PM +0100, bigeasy@linutronix.de a écrit :
> On 2025-11-25 12:32:39 [+0100], Florian Bezdeka wrote:
> > > It seems that if you exclude certain CPUs from getting interrupt
> > > handling then it should work fine. Then the driver would only balance
> > > the interrupts among the CPUs that are left.
> > 
> > Sebastian, what exactly do you mean by "exclude certain CPUs from
> > getting interrupt handling"? I mean, that is what we do by configuring
> > the /proc/<irq>/smp_affinity_list interface.
> 
> Step #1
> - figure out if isolcpus= is restricting the affinity of requested
>   interrupts to housekeeping CPUs only
> 
> Step #2
> - Yes
>    => look for the matching knob in cgroup interface
>       Knob found?
>       - Yes
>         => Use knob.
>       - No.
>         => Add knob.
> - No
>   This should be added as it breaks the expectation of an isolated
>   system.
> 
> I *think* the driver should request as many interrupts as there are
> available CPUs in the system to handle them. The number of available
> CPUs / CPU mask should be a configuration knob for the user. Using the
> housekeeping CPUs as a default mask seems reasonable.
> The question is what should happen if the mask changes at runtime. Maybe
> a device needs to reconfigure, maybe just move the interrupt away.
> But this should also affect NOHZ_FULL workloads.

Right now, you still need to change by hand the affinity of an IRQ through
/proc to match a new isolated cpuset partition. But ideally this should be
automatically handled by cpuset. If someone wants to tackle that, it would
be greatly appreciated.

As for those IRQs whose affinity can only be controlled by
isolcpus=managed_irq, this is more complicated but probably not infeasible.

> 
> > To sum up: 
> > - The IRQ balancing issue is not limited to a single driver / subsystem
> > - The managed IRQ infrastructure seems very "static" so insufficient for
> >   this problem. In addition we would have to migrate all affected
> >   drivers to the managed IRQ infrastructure first.
> > 
> > We would love to hear further thoughts / ideas / comments about this
> > problem. We're highly interested in fixing this issue properly.
> 
> If the "managed IRQ infrastructure" would help here then why not. Maybe
> Frederic has some insight here.

Not really, but being able to change managed_irq affinities at runtime
would certainly be welcome. I fear my first visit to the genirq subsystem
is only one month old though, and I lack the cycles to dive further there
right now.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-11-26  9:20                     ` Florian Bezdeka
  2025-11-26 14:26                       ` Thomas Gleixner
@ 2025-11-26 15:45                       ` Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2025-11-26 15:45 UTC (permalink / raw)
  To: Florian Bezdeka
  Cc: Thomas Gleixner, bigeasy@linutronix.de, Preclik, Tobias,
	linux-rt-users@vger.kernel.org, Kiszka, Jan

Le Wed, Nov 26, 2025 at 10:20:53AM +0100, Florian Bezdeka a écrit :
> > drivers/net/ethernet/wangxun/libwx/wx_lib.c:1867:       nvecs = pci_alloc_irq_vectors_affinity(wx->pdev, nvecs,
> > drivers/nvme/host/pci.c:2659:   return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, flags,
> > drivers/scsi/be2iscsi/be_main.c:3585:           if (pci_alloc_irq_vectors_affinity(phba->pcidev, 2, nvec,
> > drivers/scsi/csiostor/csio_isr.c:520:   cnt = pci_alloc_irq_vectors_affinity(hw->pdev, min, cnt,
> > drivers/scsi/hisi_sas/hisi_sas_v3_hw.c:2611:    vectors = pci_alloc_irq_vectors_affinity(pdev,
> > drivers/scsi/megaraid/megaraid_sas_base.c:5943: i = pci_alloc_irq_vectors_affinity(instance->pdev,
> > drivers/scsi/mpi3mr/mpi3mr_fw.c:862:            retval = pci_alloc_irq_vectors_affinity(mrioc->pdev,
> > drivers/scsi/mpt3sas/mpt3sas_base.c:3390:       i = pci_alloc_irq_vectors_affinity(ioc->pdev,
> > drivers/scsi/pm8001/pm8001_init.c:982:          rc = pci_alloc_irq_vectors_affinity(
> > drivers/scsi/qla2xxx/qla_isr.c:4539:            ret = pci_alloc_irq_vectors_affinity(ha->pdev, min_vecs,
> > drivers/virtio/virtio_pci_common.c:160: err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors,
> > 
> > Not a so dead road :)
> 
> Grml... I definitely fat-fingered the query. Anyway, this housekeeping
> API still seems very boot-time oriented to me. Can't see yet where this
> housekeeping cpumasks are filled up by the cgroup/cpuset infrastructure.

It's on the way:
https://lore.kernel.org/all/20251105210348.35256-1-frederic@kernel.org/

If all goes well, not for the upcoming merge window but the next one.

> cgroups/cpusets don't care about "isolation" in that sense yet. It's
> just about cpumasks for compute. Am I missing something?

It's a bit more than just scheduler domain isolation. It also handles
kthreads and workqueues. It's also going to handle unbound timers (on the way
to the upcoming merge window).

Thanks.

> 
> 
> Florian

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-11-26 15:07                         ` Florian Bezdeka
@ 2025-11-26 19:15                           ` Thomas Gleixner
  2025-11-27 14:06                             ` Preclik, Tobias
  2025-11-27 14:52                             ` Florian Bezdeka
  0 siblings, 2 replies; 25+ messages in thread
From: Thomas Gleixner @ 2025-11-26 19:15 UTC (permalink / raw)
  To: Florian Bezdeka, bigeasy@linutronix.de
  Cc: Preclik, Tobias, Frederic Weisbecker,
	linux-rt-users@vger.kernel.org, Kiszka, Jan, Waiman Long,
	Gabriele Monaco

On Wed, Nov 26 2025 at 16:07, Florian Bezdeka wrote:
> On Wed Nov 26, 2025 at 3:26 PM CET, Thomas Gleixner wrote:
>> The question is whether that affinity hint has a functional requirement
>> to be applied or not. I don't think so because those interrupts can be
>> moved by userspace as it sees fit.
>
> The background seems performance. Those NICs support link speeds up to
> (or even above) 2.5Gbit/s. Seems it's hard to fully utilize the link
> when all queues are routed - IRQ wise - to a single core.
>
> This is now the point where the IRQ chips matters. Some (like APIC for
> x86) have the IRQ balancing implemented in SW, while others don't have
> that. So the driver does that manually by ignoring all the RT settings.

Hardware interrupt balancing never worked right :)

APIC "supports" it in logical/cluster mode, but in fact 99% of the
interrupts ended up on the lowest APIC in the logical/cluster mask. So
we gave up on it because the benefit was close to zero and the
complexity for multi-CPU affinity management with the limited vector
space was just not worth it. In high performance setups the interrupts
were anyway steered to a single CPU by the admin or irqbalanced :)

ARM64 would support that too IIRC, but they decided to avoid the whole
multi-CPU affinity mess as well :)

>> So it's easy enough to make this "set" part conditional and restrict it
>> to some TBD mask (housekeeping, default ...) under some isolation magic.
>>
>
> For now I would be happy if I could modify the stmmac in a way that its
> balancing takes the default affinity into account. I couldn't find any
> available API that allows me to do so from a module.
>
> Are there any strong reasons for not exporting the default affinity from
> the IRQ core? Read-only would be enough.

Default affinity is yet another piece which is disconnected from all the
other isolation mechanics. So we are not exporting it for some quick and
dirty hack. You can do that of course in your own kernel, but please
don't send the result to my inbox :)

> In addition I'm quite sure that the housekeeping infrastructure would
> not help in the area of networking as nobody (except one driver) is
> based on the managed IRQ API.

Managed interrupts are not user steerable and due to their strict
CPU/CPU-group relationship they are not required to be steerable. NVMe
et al. have a strict command/response on the same queue scheme, which is
obviously most efficient when you have per CPU queues. The nice thing
about that concept is that the queues are only active (and having
interrupts) when an application on a given CPU issues a R/W operation.

Networking does not have that by default as its strategy of routing
packets to queues is way more complicated and can be affected by
hardware filtering etc.

But why can't housekeeping help in general and why do you want to hack
around the problem in random drivers?

What's wrong with providing a new irq_set_affinity_hint_xxx() variant
which takes an additional queue number as argument and lets it do:

    if (isolate) {
        weight = cpumask_weight(housekeeping);
        qnr %= weight;
        cpu = cpumask_nth(qnr, housekeeping);
        mask = cpumask_of(cpu);
    }
    return irq_set_affinity_hint(mask);
    
or something like that. From a quick glance over the drivers this could
maybe be based on a queue number alone as most drivers do:

      mask = cpumask_of(qnr % num_online_cpus());

or something daft like that, which is obviously broken, but who cares.
So that would become:

    if (isolate) {
        weight = cpumask_weight(housekeeping);
        qnr %= weight;
        cpu = cpumask_nth(qnr, housekeeping);
    } else {
        guard(cpus_read_lock)();
        qnr %= num_online_cpus();
        cpu = cpumask_nth(qnr, cpu_online_mask);
    }
    	
    return irq_set_affinity_hint(cpumask_of(cpu));

See?

That lets userspace still override the hint but does at least initial
spreading within the housekeeping mask. Which ever mask that is out of
the zoo of masks you best debate with Frederic. :)
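For anyone following along, a userspace model of that spreading
arithmetic: queue number qnr is mapped onto the (qnr % weight)-th set
bit of the housekeeping mask, mirroring cpumask_weight()/cpumask_nth().
The helper names and example masks are mine, purely for illustration:

```python
def mask_cpus(mask: int) -> list[int]:
    """CPUs present in a bitmask, ascending (bit n = CPU n)."""
    return [n for n in range(mask.bit_length()) if mask & (1 << n)]

def queue_to_cpu(qnr: int, housekeeping_mask: int) -> int:
    """Spread queue qnr over the housekeeping CPUs round-robin."""
    cpus = mask_cpus(housekeeping_mask)  # len(cpus) == cpumask_weight()
    return cpus[qnr % len(cpus)]         # cpumask_nth(qnr % weight, hk)

# Housekeeping CPUs 0-1 on an 8-CPU box (CPUs 2-7 isolated): mask 0b11.
for q in range(4):
    print(q, queue_to_cpu(q, 0b11))  # queues alternate between CPU 0 and 1
```

That initial spread stays within housekeeping, while userspace remains
free to override the hint per IRQ afterwards.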

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-11-26 19:15                           ` Thomas Gleixner
@ 2025-11-27 14:06                             ` Preclik, Tobias
  2025-11-27 14:52                             ` Florian Bezdeka
  1 sibling, 0 replies; 25+ messages in thread
From: Preclik, Tobias @ 2025-11-27 14:06 UTC (permalink / raw)
  To: tglx@linutronix.de, Bezdeka, Florian, bigeasy@linutronix.de
  Cc: frederic@kernel.org, linux-rt-users@vger.kernel.org, Kiszka, Jan,
	longman@redhat.com, gmonaco@redhat.com

On Wed, 2025-11-26 at 20:15 +0100, Thomas Gleixner wrote:
> On Wed, Nov 26 2025 at 16:07, Florian Bezdeka wrote:
> > On Wed Nov 26, 2025 at 3:26 PM CET, Thomas Gleixner wrote:
> > > The question is whether that affinity hint has a functional
> > > requirement
> > > to be applied or not. I don't think so because those interrupts
> > > can be
> > > moved by userspace as it sees fit.
> > 
> > The background seems performance. Those NICs support link speeds up
> > to
> > (or even above) 2.5Gbit/s. Seems it's hard to fully utilize the
> > link
> > when all queues are routed - IRQ wise - to a single core.
> > 
> > This is now the point where the IRQ chips matters. Some (like APIC
> > for
> > x86) have the IRQ balancing implemented in SW, while others don't
> > have
> > that. So the driver does that manually by ignoring all the RT
> > settings.
> 
> Hardware interrupt balancing never worked right :)
> 
> APIC "supports" it in logical/cluster mode, but in fact 99% of the
> interrupts ended up on the lowest APIC in the logical/cluster mask.
> So
> we gave up on it because the benefit was close to zero and the
> complexity for multi-CPU affinity management with the limited vector
> space was just not worth it. In high performance setups the
> interrupts
> were anyway steered to a single CPU by the admin or irqbalanced :)
> 
> ARM64 would support that too IIRC, but they decided to avoid the
> whole
> multi-CPU affinity mess as well :)
> 
> > > So it's easy enough to make this "set" part conditional and
> > > restrict it
> > > to some TBD mask (housekeeping, default ...) under some isolation
> > > magic.
> > > 
> > 
> > For now I would be happy if I could modify the stmmac in a way that
> > its
> > balancing takes the default affinity into account. I couldn't find
> > any
> > available API that allows me to do so from a module.
> > 
> > Are there any strong reasons for not exporting the default affinity
> > from
> > the IRQ core? Read-only would be enough.
> 
> Default affinity is yet another piece which is disconnected from all
> the
> other isolation mechanics. So we are not exporting it for some quick
> and
> dirty hack. You can do that of course in your own kernel, but please
> don't send the result to my inbox :)
> 
> > In addition I'm quite sure that the housekeeping infrastructure
> > would
> > not help in the area of networking as nobody (except one driver) is
> > based on the managed IRQ API.
> 
> Managed interrupts are not user steerable and due to their strict
> CPU/CPUgroup relationship they are not required to be steerable. NVME
> &
> al have a strict command/response on the same queue scheme, which is
> obviously most efficient when you have per CPU queues. The nice thing
> about that concept is that the queues are only active (and having
> interrupts) when an application on a given CPU issues a R/W
> operation.
> 
> Networking does not have that by default as their strategy of routing
> packages to queues is way more complicated and can be affected by
> hardware filtering etc.
> 
> But why can't housekeeping help in general and why do you want to
> hack
> around the problem in random drivers?

I think a dynamic set of housekeeping CPUs would partly help, just like
we use default_smp_affinity/smp_affinity today to steer interrupts to
housekeeping CPUs. But stmmac disregards the affinities set from
userspace when balancing interrupts on ifup/XDP program load. We thus
can neither pin interrupts of these network interfaces to isolated
cores (when used for real-time communication) nor restrict them to
housekeeping CPUs to protect RT workloads. I shared some repro steps at
the beginning of this thread:

https://lore.kernel.org/linux-rt-users/20251111135835.EXCy4ajR@linutronix.de/T/#m0a23cbedada4ceb0a61ad3c7ea81a150c7578ec8


> What's wrong with providing a new irq_set_affinity_hint_xxx() variant
> which takes an additional queue number as argument and let that do:
> 
>     if (isolate) {
>         weight = cpumask_weight(housekeeping);
>         qnr %= weight;
>         cpu = cpumask_nth(qnr, housekeeping);
>         mask = cpumask_of(cpu);
>     }
>     return irq_set_affinity_hint(mask);
> 
> or something like that. From a quick glance over the drivers this
> could maybe be based on a queue number alone, as most drivers do:
> 
>       mask = cpumask_of(qnr % num_online_cpus());
> 
> or something daft like that, which is obviously broken, but who
> cares. So that would become:
> 
>     if (isolate) {
>         weight = cpumask_weight(housekeeping);
>         qnr %= weight;
>         cpu = cpumask_nth(qnr, housekeeping);
>     } else {
>         guard(cpus_read_lock)();
>         qnr %= num_online_cpus();
>         cpu = cpumask_nth(qnr, cpu_online_mask);
>     }
> 
>     return irq_set_affinity_hint(cpumask_of(cpu));
> 
> See?
> 
> That lets userspace still override the hint but does at least initial
> spreading within the housekeeping mask. Whichever mask that is out of
> the zoo of masks, you best debate with Frederic. :)

With the proposed change stmmac would still ignore and overwrite the
smp_affinity set from userspace (by ultimately calling
irq_set_affinity_(and_)hint) in seemingly random situations (like XDP
program load), and we would thus lose the userspace pinning of
interrupts to RT application cores. In my opinion the IRQ balancing in
the driver should only influence the effective IRQ affinity. How about
splitting smp_affinity into a userspace mask and a kernel mask? This
would allow us to select the appropriate effective mask without
discarding userspace intent. Something like this:

if (smp_affinity is set from userspace) {
        // Do not balance on driver level if the affinity was set
        // from userspace.
        mask = smp_affinity;
} else if (default_smp_affinity is set from userspace) {
        // If smp_affinity is not set from userspace then balance
        // on default_smp_affinity.
        weight = cpumask_weight(default_smp_affinity);
        qnr %= weight;
        cpu = cpumask_nth(qnr, default_smp_affinity);
        mask = cpumask_of(cpu);
} else if (housekeeping active) {
        // If default_smp_affinity is not set from userspace either
        // then balance on the housekeeping CPUs.
        weight = cpumask_weight(housekeeping);
        qnr %= weight;
        cpu = cpumask_nth(qnr, housekeeping);
        mask = cpumask_of(cpu);
} else {
        // If housekeeping CPUs are not set either then fall back
        // to balancing on the online CPUs.
        guard(cpus_read_lock)();
        qnr %= num_online_cpus();
        cpu = cpumask_nth(qnr, cpu_online_mask);
        mask = cpumask_of(cpu);
}

// Do not overwrite smp_affinity here:
smp_balanced_affinity = mask;

// smp_balanced_affinity takes into account both the affinities set
// from userspace and the driver balancing. Setting smp_affinity in
// procfs should be directly reflected in smp_balanced_affinity, which
// should then be taken into account when deriving the effective
// affinity.

That way (given some more modifications) the affinities from userspace
would stay intact and we could properly steer (non-managed) interrupts
to housekeeping and non-housekeeping CPUs, just as we do today with
other drivers that do not perform IRQ balancing. Does that make sense
on your side?

Best,
Tobias


* Re: Control of IRQ Affinities from Userspace
  2025-11-26 19:15                           ` Thomas Gleixner
  2025-11-27 14:06                             ` Preclik, Tobias
@ 2025-11-27 14:52                             ` Florian Bezdeka
  2025-11-27 18:09                               ` Thomas Gleixner
  1 sibling, 1 reply; 25+ messages in thread
From: Florian Bezdeka @ 2025-11-27 14:52 UTC (permalink / raw)
  To: Thomas Gleixner, bigeasy@linutronix.de
  Cc: Preclik, Tobias, Frederic Weisbecker,
	linux-rt-users@vger.kernel.org, Kiszka, Jan, Waiman Long,
	Gabriele Monaco

On Wed Nov 26, 2025 at 8:15 PM CET, Thomas Gleixner wrote:
>>
>> Are there any strong reasons for not exporting the default affinity from
>> the IRQ core? Read-only would be enough.
>
> Default affinity is yet another piece which is disconnected from all the
> other isolation mechanics. So we are not exporting it for some quick and
> dirty hack. You can do that of course in your own kernel, but please
> don't send the result to my inbox :)

Well, that's the reason for this discussion here: Upstream first.

>
>> In addition I'm quite sure that the housekeeping infrastructure would
>> not help in the area of networking as nobody (except one driver) is
>> based on the managed IRQ API.
>
> Managed interrupts are not user steerable and due to their strict
> CPU/CPUgroup relationship they are not required to be steerable. NVME
> et al. have a strict command/response on the same queue scheme, which is
> obviously most efficient when you have per CPU queues. The nice thing
> about that concept is that the queues are only active (and having
> interrupts) when an application on a given CPU issues a R/W operation.
>
> Networking does not have that by default as its strategy of routing
> packets to queues is way more complicated and can be affected by
> hardware filtering etc.
>
> But why can't housekeeping help in general and why do you want to hack
> around the problem in random drivers?

No, that's not what I want. I'm highly interested in solving this
problem properly. Just trying to collect all the information at the
moment. I'm quite sure there is still something around that I did not
take into account yet.

>
> What's wrong with providing a new irq_set_affinity_hint_xxx() variant
> which takes an additional queue number as argument and let that do:
>
>     if (isolate) {
>         weight = cpumask_weight(housekeeping);
>         qnr %= weight;
>         cpu = cpumask_nth(qnr, housekeeping);
>         mask = cpumask_of(cpu);
>     }
>     return irq_set_affinity_hint(mask);
>     
> or something like that. From a quick glance over the drivers this could
> maybe be based on a queue number alone as most drivers do:
>
>       mask = cpumask_of(qnr % num_online_cpus());
>
> or something daft like that, which is obviously broken, but who cares.
> So that would become:
>
>     if (isolate) {
>         weight = cpumask_weight(housekeeping);
>         qnr %= weight;
>         cpu = cpumask_nth(qnr, housekeeping);
>     } else {
>         guard(cpus_read_lock)();
>         qnr %= num_online_cpus();
>         cpu = cpumask_nth(qnr, cpu_online_mask);
>     }
>     	
>     return irq_set_affinity_hint(cpumask_of(cpu));
>
> See?

That is close to a RFC that I was already preparing, until I realized
that it would only solve one part of the problem.

Part one: Get rid of unwanted IRQ traffic on my isolated cores. That
part would be covered as the balancing would be limited to !RT cores.
Fine.

Part two: In case the device is actually being used by an RT application
and allowed to run on isolated cores (userspace has properly configured
that upfront) we would get the opposite after loading a BPF: IRQs are
now configured wrong.

>
> That lets userspace still override the hint but does at least initial
> spreading within the housekeeping mask. Whichever mask that is out of
> the zoo of masks you best debate with Frederic. :)
>

Choosing the right mask is key. The right mask depends on the usage of
the device. Some devices (or maybe even just some queues) should be
limited to !RT CPUs, while others should explicitly run within an
isolated cpuset.

When I'm getting this right, the work from Frederic will bring in the
"isolated flag" for cpusets. That seems great preparation work. In
addition we would need something like a mapping between devices (or
queues maybe indirectly via IRQs) and cgroup/cpusets.

Have there been thoughts around a cpuset.interrupts API - or something
similar - already?

Best regards,
Florian


* Re: Control of IRQ Affinities from Userspace
  2025-11-27 14:52                             ` Florian Bezdeka
@ 2025-11-27 18:09                               ` Thomas Gleixner
  2025-11-28  7:33                                 ` Florian Bezdeka
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2025-11-27 18:09 UTC (permalink / raw)
  To: Florian Bezdeka, bigeasy@linutronix.de
  Cc: Preclik, Tobias, Frederic Weisbecker,
	linux-rt-users@vger.kernel.org, Kiszka, Jan, Waiman Long,
	Gabriele Monaco

On Thu, Nov 27 2025 at 15:52, Florian Bezdeka wrote:
> On Wed Nov 26, 2025 at 8:15 PM CET, Thomas Gleixner wrote:
>> So that would become:
>>
>>     if (isolate) {
>>         weight = cpumask_weight(housekeeping);
>>         qnr %= weight;
>>         cpu = cpumask_nth(qnr, housekeeping);
>>     } else {
>>         guard(cpus_read_lock)();
>>         qnr %= num_online_cpus();
>>         cpu = cpumask_nth(qnr, cpu_online_mask);
>>     }
>>     	
>>     return irq_set_affinity_hint(cpumask_of(cpu));
>>
>> See?
>
> That is close to a RFC that I was already preparing, until I realized
> that it would only solve one part of the problem.
>
> Part one: Get rid of unwanted IRQ traffic on my isolated cores. That
> part would be covered as the balancing would be limited to !RT cores.
> Fine.
>
> Part two: In case the device is actually being used by an RT application
> and allowed to run on isolated cores (userspace has properly configured
> that upfront) we would get the opposite after loading a BPF: IRQs are
> now configured wrong.

I just went and looked at that stmmac driver once more. The way it
sets up those affinity hints is actually stupid and leads exactly to the
effects you describe.

The hints should be set exactly once, when MSI is enabled and the
interrupts are allocated, and not after request_irq().

So the first request_irq() will use that hinted affinity.  In case that
user space changed the affinity, the setting is preserved across a
free_irq()/request_irq() sequence unless all CPUs in the affinity mask
have gone offline.

That preservation was explicitly added on request of networking people,
but then someone got it wrong and that request_irq()/set_hint() sequence
started a Copy&Pasta spreading disease. Oh well...

So yes, you have to fix that driver and do the affinity hint business
right after pci_alloc_irq_vectors() and clear it when the driver shuts
down. Looking at intel_eth_pci_remove(), that's another trainwreck as it
does not do any PCI related cleanup despite claiming so....

But the more I look at that whole hint usage, the more I'm convinced
that it is in most cases actively wrong. It only makes really sense when
there is an actual 1:1 relationship of queues to CPUs like in the NVME
case.

I'm pretty sure by now that this is in most cases used to ensure that
the interrupts are spread out properly. But that spreading is only done
to ensure that not all interrupts end up on CPU0 or whatever the
architecture specific interrupt management decides to do. x86 used to
prefer CPU0, but nowadays it tries to spread it across CPUs within the
provided affinity mask. Not perfect but better than before :)

So the right thing here is to expand the functionality of
irq_calc_affinity_vectors() and group_cpus_evenly() to:

     1) Take isolation masks into account (opt-in and/or system wide
        knob)

     2) Do the spreading over the interrupt sets without setting
        the managed bit in the mask descriptor.

Then use pci_alloc_irq_vectors_affinity(), which does the spreading and
assigns the resulting affinities during interrupt descriptor allocation.

With that the whole hint business can be removed because it has zero
value after the initial setup.

But that's a discussion to be had on LKML/netdev and not on the RT devel
list.

>> That lets userspace still override the hint but does at least initial
>> spreading within the housekeeping mask. Which ever mask that is out of
>> the zoo of masks you best debate with Frederic. :)
>>
> Choosing the right mask is key. The right mask depends on the usage of
> the device. Some devices (or maybe even just some queues) should be
> limited to !RT CPUs, while others should explicitly run within a
> isolated cpuset.

You can't know that upfront. That's a policy decision and user space has
to make it.

What the kernel can do is to take isolation into account when doing the
initial setup. Though that needs a lot of thought and presumably an
opt-in knob:

   Depending on your isolation constraints there might only be a single
   housekeeping CPU, which means depending on the number of devices and
   their queue/interrupt requirements that single CPU might run into
   vector exhaustion pretty fast.

> When I'm getting this right, the work from Frederic will bring in the
> "isolated flag" for cpusets. That seems great preparation work. In
> addition we would need something like a mapping between devices (or
> queues maybe indirectly via IRQs) and cgroup/cpusets.
>
> Have there been thoughts around a cpuset.interrupts API - or something
> similar - already?

There was some mumbling about propagating isolation into the interrupt
world, but as far as I can tell there is no plan or idea how that should
look like. But that's again a discussion to be held on LKML.

Thanks,

        tglx


* Re: Control of IRQ Affinities from Userspace
  2025-11-27 18:09                               ` Thomas Gleixner
@ 2025-11-28  7:33                                 ` Florian Bezdeka
  0 siblings, 0 replies; 25+ messages in thread
From: Florian Bezdeka @ 2025-11-28  7:33 UTC (permalink / raw)
  To: Thomas Gleixner, Florian Bezdeka, bigeasy@linutronix.de
  Cc: Preclik, Tobias, Frederic Weisbecker,
	linux-rt-users@vger.kernel.org, Kiszka, Jan, Waiman Long,
	Gabriele Monaco

On Thu Nov 27, 2025 at 7:09 PM CET, Thomas Gleixner wrote:
> On Thu, Nov 27 2025 at 15:52, Florian Bezdeka wrote:
>> On Wed Nov 26, 2025 at 8:15 PM CET, Thomas Gleixner wrote:
>
> I just went and looked at that stmmac driver once more. The way it
> sets up those affinity hints is actually stupid and leads exactly to the
> effects you describe.
>
> The hints should be set exactly once, when MSI is enabled and the
> interrupts are allocated, and not after request_irq().
>
> So the first request_irq() will use that hinted affinity.  In case that
> user space changed the affinity, the setting is preserved across a
> free_irq()/request_irq() sequence unless all CPUs in the affinity mask
> have gone offline.
>
> That preservation was explicitly added on request of networking people,
> but then someone got it wrong and that request_irq()/set_hint() sequence
> started a Copy&Pasta spreading disease. Oh well...
>
> So yes, you have to fix that driver and do the affinity hint business
> right after pci_alloc_irq_vectors() and clear it when the driver shuts
> down. Looking at intel_eth_pci_remove(), that's another trainwreck as it
> does not do any PCI related cleanup despite claiming so....
>
> But the more I look at that whole hint usage, the more I'm convinced
> that it is in most cases actively wrong. It only makes really sense when
> there is an actual 1:1 relationship of queues to CPUs like in the NVME
> case.
>
> I'm pretty sure by now that this is in most cases used to ensure that
> the interrupts are spread out properly. But that spreading is only done
> to ensure that not all interrupts end up on CPU0 or whatever the
> architecture specific interrupt management decides to do. x86 used to
> prefer CPU0, but nowadays it tries to spread it across CPUs within the
> provided affinity mask. Not perfect but better than before :)
>
> So the right thing here is to expand the functionality of
> irq_calc_affinity_vectors() and group_cpus_evenly() to:
>
>      1) Take isolation masks into account (opt-in and/or system wide
>         knob)
>
>      2) Do the spreading over the interrupt sets without setting
>         the managed bit in the mask descriptor.
>
> Then use pci_alloc_irq_vectors_affinity(), which does the spreading and
> assigns the resulting affinities during interrupt descriptor allocation.
>
> With that the whole hint business can be removed because it has zero
> value after the initial setup.
>
> But that's a discussion to be had on LKML/netdev and not on the RT devel
> list.

Thanks Thomas, this is now going into the expected direction. Let me
look at that proposal in more detail. Once I have tested that and it
works we will migrate the discussion to LKML/netdev.

>> Choosing the right mask is key. The right mask depends on the usage of
>> the device. Some devices (or maybe even just some queues) should be
>> limited to !RT CPUs, while others should explicitly run within a
>> isolated cpuset.
>
> You can't know that upfront. That's a policy decision and user space has
> to make it.
>
> What the kernel can do is to take isolation into account when doing the
> initial setup. Though that needs a lot of thought and presumably an
> opt-in knob:
>
>    Depending on your isolation constraints there might only be a single
>    housekeeping CPU, which means depending on the number of devices and
>    their queue/interrupt requirements that single CPU might run into
>    vector exhaustion pretty fast.

This is definitely something that we should keep an eye on.

>
>> When I'm getting this right, the work from Frederic will bring in the
>> "isolated flag" for cpusets. That seems great preparation work. In
>> addition we would need something like a mapping between devices (or
>> queues maybe indirectly via IRQs) and cgroup/cpusets.
>>
>> Have there been thoughts around a cpuset.interrupts API - or something
>> similar - already?
>
> There was some mumbling about propagating isolation into the interrupt
> world, but as far as I can tell there is no plan or idea how that should
> look like. But that's again a discussion to be held on LKML.

We have /proc/irq/default_smp_affinity, which is already available but,
as it seems, decoupled from Frederic's cgroup/cpuset isolation work.
Let me check how those settings could be merged in the end.

Thanks a lot Thomas!

Best regards,
Florian


end of thread, other threads:[~2025-11-28  7:33 UTC | newest]

Thread overview: 25+ messages
2025-10-30 14:20 Control of IRQ Affinities from Userspace Preclik, Tobias
2025-11-03 15:53 ` Sebastian Andrzej Siewior
2025-11-03 17:12   ` Florian Bezdeka
2025-11-05 13:11     ` Preclik, Tobias
2025-11-05 13:18       ` Preclik, Tobias
2025-11-11 14:35         ` bigeasy
2025-11-11 14:34       ` bigeasy
2025-11-21 13:25         ` Preclik, Tobias
2025-11-24  9:59           ` bigeasy
2025-11-25 11:32             ` Florian Bezdeka
2025-11-25 11:50               ` bigeasy
2025-11-25 14:36                 ` Florian Bezdeka
2025-11-25 16:31                   ` Thomas Gleixner
2025-11-26  9:20                     ` Florian Bezdeka
2025-11-26 14:26                       ` Thomas Gleixner
2025-11-26 15:07                         ` Florian Bezdeka
2025-11-26 19:15                           ` Thomas Gleixner
2025-11-27 14:06                             ` Preclik, Tobias
2025-11-27 14:52                             ` Florian Bezdeka
2025-11-27 18:09                               ` Thomas Gleixner
2025-11-28  7:33                                 ` Florian Bezdeka
2025-11-26 15:45                       ` Frederic Weisbecker
2025-11-26 15:31                 ` Frederic Weisbecker
2025-11-26 15:24               ` Frederic Weisbecker
2025-11-11 13:58     ` Sebastian Andrzej Siewior
