* Control of IRQ Affinities from Userspace
@ 2025-10-30 14:20 Preclik, Tobias
2025-11-03 15:53 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 25+ messages in thread
From: Preclik, Tobias @ 2025-10-30 14:20 UTC (permalink / raw)
To: linux-rt-users@vger.kernel.org
Dear Linux RT Experts,
the Linux kernel exposes two userspace interfaces to control IRQ
affinities in the procfs [1]. First, /proc/irq/default_smp_affinity
specifies the default affinity mask, which is applied as the initial
affinity mask for newly registered IRQs. Second,
/proc/irq/${IRQ_NO}/smp_affinity, which can be used to specify an
affinity mask for a given IRQ.
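
For illustration, smp_affinity takes a hexadecimal CPU mask while the
companion file smp_affinity_list takes a CPU list; the conversion
between the two formats can be sketched as follows (illustrative Python
helper of ours, not a kernel interface):

```python
# Illustrative only: how the hex mask in /proc/irq/*/smp_affinity maps
# to the CPU list format of /proc/irq/*/smp_affinity_list.

def mask_to_cpulist(hex_mask: str) -> str:
    """Convert an smp_affinity hex mask (e.g. '3' or '0000000f') to
    smp_affinity_list format (e.g. '0-1' or '0-3')."""
    bits = int(hex_mask.replace(",", ""), 16)   # masks may contain commas
    cpus = [i for i in range(bits.bit_length()) if bits >> i & 1]
    ranges, start, prev = [], None, None
    for c in cpus:
        if start is None:
            start = prev = c
        elif c == prev + 1:
            prev = c
        else:
            ranges.append((start, prev))
            start = prev = c
    if start is not None:
        ranges.append((start, prev))
    return ",".join(f"{a}-{b}" if a != b else f"{a}" for a, b in ranges)

print(mask_to_cpulist("3"))   # mask 0x3 -> CPUs "0-1"
print(mask_to_cpulist("5"))   # mask 0x5 -> CPUs "0,2"
```

So, for example, `echo 3 > /proc/irq/default_smp_affinity` targets CPUs
0 and 1.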
For tuning the I/O path of RT applications we use the second interface
to relocate IRQs to cores dedicated to run RT applications. However, we
have observed that certain situations, such as interface bring-up or
loading BPF/XDP programs, can cause the IRQ affinity mask to be lost.
Specifically, some network drivers, particularly those based on stmmac,
ignore the IRQ affinity mask set from userspace and overwrite it with
decisions from IRQ rebalancing [2]. This driver behavior prevents
consistent control of IRQ affinities from userspace, impacting the
tuning of the I/O path for RT applications.
I would greatly appreciate any comments or guidance on this issue.
Best regards,
Tobias Preclik
[1] https://docs.kernel.org/core-api/irq/irq-affinity.html
[2] https://elixir.bootlin.com/linux/v6.17.4/source/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c#L3835
^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Control of IRQ Affinities from Userspace
  2025-10-30 14:20 Control of IRQ Affinities from Userspace Preclik, Tobias
@ 2025-11-03 15:53 ` Sebastian Andrzej Siewior
  2025-11-03 17:12   ` Florian Bezdeka
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-11-03 15:53 UTC (permalink / raw)
To: Preclik, Tobias; +Cc: linux-rt-users@vger.kernel.org

On 2025-10-30 14:20:01 [+0000], Preclik, Tobias wrote:
> Dear Linux RT Experts,
>
> the Linux kernel exposes two userspace interfaces to control IRQ
> affinities in the procfs [1]. First, /proc/irq/default_smp_affinity
> specifies the default affinity mask which gets applied as initial
> affinity mask for newly registering IRQs. Second,
> /proc/irq/${IRQ_NO}/smp_affinity which can be used to specify an
> affinity mask for a given IRQ.
>
> For tuning the I/O path of RT applications we use the second interface
> to relocate IRQs to cores dedicated to run RT applications. However, we
> have observed that certain situations, such as interface bring-up or
> loading BPF/XDP programs, can cause the IRQ affinity mask to be lost.
> Specifically, some network drivers, particularly those based on stmmac,
> ignore the IRQ affinity mask set from userspace and overwrite it with
> decisions from IRQ rebalancing [2]. This driver behavior prevents
> consistent control of IRQ affinities from userspace, impacting the
> tuning of the I/O path for RT applications.
>
> I would greatly appreciate any comments or guidance on this issue.

The usage of irq_set_affinity_hint() is not uncommon within the
networking drivers. It is probably a pity if the request happens on
each ifdown/up, but in this case it happens each time you add/remove an
XDP program.

But the interrupt should be managed by the kernel. Looking at
irq_do_set_affinity() there is:

|	if (irqd_affinity_is_managed(data) &&
|	    housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
|		const struct cpumask *hk_mask;
|
|		hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
|
|		cpumask_and(tmp_mask, mask, hk_mask);
|		if (!cpumask_intersects(tmp_mask, cpu_online_mask))
|			prog_mask = mask;
|		else
|			prog_mask = tmp_mask;
|	} else {

so if the IRQ is managed and you have IRQ isolation enabled then it
should exclude the non-isolated CPUs. Would that work?

> Best regards,
> Tobias Preclik

Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread
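
The fallback in the irq_do_set_affinity() branch quoted above can be
paraphrased with plain CPU sets (an illustrative sketch of the quoted
logic, not a reimplementation of the kernel code):

```python
# Sketch of the quoted irq_do_set_affinity() branch: for a managed IRQ
# with HK_TYPE_MANAGED_IRQ isolation enabled, the requested mask is
# intersected with the housekeeping mask; the kernel only falls back to
# the full requested mask if that intersection has no online CPU.

def pick_prog_mask(mask: set, hk_mask: set, online: set) -> set:
    tmp = mask & hk_mask          # cpumask_and(tmp_mask, mask, hk_mask)
    if not (tmp & online):        # !cpumask_intersects(tmp_mask, cpu_online_mask)
        return mask               # prog_mask = mask
    return tmp                    # prog_mask = tmp_mask

# CPUs 2-3 isolated, housekeeping on 0-1, request spans 0-3:
print(pick_prog_mask({0, 1, 2, 3}, hk_mask={0, 1}, online={0, 1, 2, 3}))  # {0, 1}
# If no housekeeping CPU in the request is online, the full request is used:
print(pick_prog_mask({2, 3}, hk_mask={0, 1}, online={2, 3}))  # {2, 3}
```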
* Re: Control of IRQ Affinities from Userspace
  2025-11-03 15:53 ` Sebastian Andrzej Siewior
@ 2025-11-03 17:12   ` Florian Bezdeka
  2025-11-05 13:11     ` Preclik, Tobias
  2025-11-11 13:58     ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 25+ messages in thread
From: Florian Bezdeka @ 2025-11-03 17:12 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Preclik, Tobias
Cc: linux-rt-users@vger.kernel.org, Jan Kiszka

On Mon, 2025-11-03 at 16:53 +0100, Sebastian Andrzej Siewior wrote:
> On 2025-10-30 14:20:01 [+0000], Preclik, Tobias wrote:
> > Dear Linux RT Experts,
> >
> > the Linux kernel exposes two userspace interfaces to control IRQ
> > affinities in the procfs [1]. First, /proc/irq/default_smp_affinity
> > specifies the default affinity mask which gets applied as initial
> > affinity mask for newly registering IRQs. Second,
> > /proc/irq/${IRQ_NO}/smp_affinity which can be used to specify an
> > affinity mask for a given IRQ.
> >
> > For tuning the I/O path of RT applications we use the second interface
> > to relocate IRQs to cores dedicated to run RT applications. However, we
> > have observed that certain situations, such as interface bring-up or
> > loading BPF/XDP programs, can cause the IRQ affinity mask to be lost.
> > Specifically, some network drivers, particularly those based on stmmac,
> > ignore the IRQ affinity mask set from userspace and overwrite it with
> > decisions from IRQ rebalancing [2]. This driver behavior prevents
> > consistent control of IRQ affinities from userspace, impacting the
> > tuning of the I/O path for RT applications.
> >
> > I would greatly appreciate any comments or guidance on this issue.
>
> The usage of irq_set_affinity_hint() is not uncommon within the
> networking drivers. It is probably a pity if the request happens on
> each ifdown/up, but in this case it happens each time you add/remove
> an XDP program.
>
> But the interrupt should be managed by the kernel. Looking at
> irq_do_set_affinity() there is:
> >	if (irqd_affinity_is_managed(data) &&
> >	    housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
> >		const struct cpumask *hk_mask;
> >
> >		hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
> >
> >		cpumask_and(tmp_mask, mask, hk_mask);
> >		if (!cpumask_intersects(tmp_mask, cpu_online_mask))
> >			prog_mask = mask;
> >		else
> >			prog_mask = tmp_mask;
> >	} else {
>
> so if the IRQ is managed and you have IRQ isolation enabled then it
> should exclude the non-isolated CPUs. Would that work?

I'm trying to jump in and add some thoughts and results we got while
analyzing this issue:

What stmmac (and some more drivers) are trying to achieve here is some
kind of handcrafted IRQ balancing, like the good old irqbalanced did in
the past from usermode. It turns out that the situation around IRQ
balancing is a bit inconsistent. Some IRQ chips (like the APIC on x86)
do that "automatically" on driver level, many others don't. So drivers
end up fiddling with affinities.

We can nicely tune IRQs and the affected affinities that have been
requested during system boot. Tools like tuned can configure them using
the APIs Tobias described. IRQs that are requested/set up after boot,
during runtime, are kind of "problematic" for us, as there is no API
that informs about new IRQs. We would have to rescan /proc. But even if
there were such an API: it would be too late. The IRQ might have fired
already.

Once an affinity has been set (e.g. by tuned) this affinity is restored
when the IRQ comes back after a link up/down or bpf load. But: the
situation on the system might have changed in the meantime. Even the
default affinity could be different now. In case of stmmac - and
probably way more drivers - the default affinity is not taken into
account anymore. The previous affinity is restored unconditionally.

I tried to modify stmmac and let it evaluate the default affinity while
doing the IRQ balancing dance. That turned out to work in the end, but
each line violated several coding/style/abstraction rules. There is no
API at driver level to read the current default affinity - or I missed
it. I could send that hack out as an RFC if requested. Just let me
know.

Thinking more about this problem - and trying to abstract it in a
generalized way - triggered some ideas about "IRQ namespaces", similar
to what we have for CPUs/memory/... in the cgroup world. Devices, or
classes of devices, could be moved into namespaces instead of being
configured one by one. Thoughts welcome.

The main challenge here is that we do not think about rt vs. non-rt.
It's more about multiple RT applications running in parallel, well
isolated from each other and from the non-rt world.

Florian

> > Best regards,
> > Tobias Preclik
>
> Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace
  2025-11-03 17:12 ` Florian Bezdeka
@ 2025-11-05 13:11   ` Preclik, Tobias
  2025-11-05 13:18     ` Preclik, Tobias
  2025-11-11 14:34     ` bigeasy
  1 sibling, 2 replies; 25+ messages in thread
From: Preclik, Tobias @ 2025-11-05 13:11 UTC (permalink / raw)
To: Bezdeka, Florian, bigeasy@linutronix.de
Cc: linux-rt-users@vger.kernel.org, Kiszka, Jan

On Mon, 2025-11-03 at 18:12 +0100, Florian Bezdeka wrote:
> On Mon, 2025-11-03 at 16:53 +0100, Sebastian Andrzej Siewior wrote:
> > The usage of irq_set_affinity_hint() is not uncommon within the
> > networking drivers. It is probably a pity if the request happens on
> > each ifdown/up, but in this case it happens each time you
> > add/remove an XDP program.
> >
> > But the interrupt should be managed by the kernel. Looking at
> > irq_do_set_affinity() there is:
> > >	if (irqd_affinity_is_managed(data) &&
> > >	    housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
> > >		const struct cpumask *hk_mask;
> > >
> > >		hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
> > >
> > >		cpumask_and(tmp_mask, mask, hk_mask);
> > >		if (!cpumask_intersects(tmp_mask, cpu_online_mask))
> > >			prog_mask = mask;
> > >		else
> > >			prog_mask = tmp_mask;
> > >	} else {
> >
> > so if the IRQ is managed and you have IRQ isolation enabled then it
> > should exclude the non-isolated CPUs. Would that work?

For that code path to be taken we would have to specify isolcpus and
managed_irq on the kernel command line. However, we already migrated
away from the deprecated isolcpus parameter. Additionally, we need to
dynamically change the interrupt affinities while in operation. For
that matter we have to rely on setting the interrupt affinities in
procfs.

> I'm trying to jump in and add some thoughts and results we got while
> analyzing this issue:
>
> What stmmac (and some more drivers) are trying to achieve here is
> some kind of handcrafted IRQ balancing, like the good old irqbalanced
> did in the past from usermode. It turns out that the situation around
> IRQ balancing is a bit inconsistent. Some IRQ chips (like the APIC on
> x86) do that "automatically" on driver level, many others don't. So
> drivers end up fiddling with affinities.
>
> We can nicely tune IRQs and the affected affinities that have been
> requested during system boot. Tools like tuned can configure them
> using the APIs Tobias described. IRQs that are requested/set up after
> boot, during runtime, are kind of "problematic" for us, as there is
> no API that informs about new IRQs. We would have to rescan /proc.
> But even if there were such an API: it would be too late. The IRQ
> might have fired already.
>
> Once an affinity has been set (e.g. by tuned) this affinity is
> restored when the IRQ comes back after a link up/down or bpf load.
> But: the situation on the system might have changed in the meantime.
> Even the default affinity could be different now. In case of stmmac -
> and probably way more drivers - the default affinity is not taken
> into account anymore. The previous affinity is restored
> unconditionally.

Thanks for sharing the details, Florian. The discovery of newly
registered/requested IRQs in userland is indeed an additional problem.
Still, I would say that we can work around it by polling procfs and
setting the default interrupt affinity appropriately (given that it is
not ignored by the drivers, of course). RT applications might have to
accept an initial delay until the interrupts on their I/O path are
detected and properly tuned.

> I tried to modify stmmac and let it evaluate the default affinity
> while doing the IRQ balancing dance. That turned out to work in the
> end, but each line violated several coding/style/abstraction rules.
> There is no API at driver level to read the current default affinity
> - or I missed it. I could send that hack out as an RFC if requested.
> Just let me know.

When drivers at least respect the default affinity when rebalancing
interrupts we can avoid interfering with RT applications. However, we
must also ensure that specific interrupts which are in use by RT
applications on their I/O path remain on the application core. Meaning
the IRQ rebalancing should also respect the interrupt affinities set
from userspace via /proc/irq/${IRQ_NO}/smp_affinity.

Here are steps to explore the behavior and reproduce the issue.

$ iface=eno2
$ ethtool -i ${iface}
driver: st_gmac
version: 6.18.0-rc2+
firmware-version:
expansion-rom-version:
bus-info: 0000:00:1d.2
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

$ ip link show ${iface}
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 30:2f:1e:7b:a5:3d brd ff:ff:ff:ff:ff:ff
    altname enp0s29f2

$ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print $1}' | sed 's/://'); do cat /proc/irq/${irq}/smp_affinity_list; done

$ # NOTE: interrupts are requested only after interface bring-up

$ echo 3 > /proc/irq/default_smp_affinity

$ ip link set dev ${iface} up
$ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print $1}' | sed 's/://'); do cat /proc/irq/${irq}/smp_affinity_list; done
0
0
1
1
2
2
3
3
0
0
1
1
2
2
3
3
0-1
0-1
0-1
0-1

$ # NOTE: All rx and tx interrupts of the interface are load balanced
$ # to all online cpus in a round-robin fashion by setting the irq
$ # affinities and ignoring the default affinity.

$ # Let's explicitly set the IRQ affinities:

$ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print $1}' | sed 's/://'); do echo 2 > /proc/irq/${irq}/smp_affinity_list; done

$ # We trigger the IRQ rebalancing of the stmmac-based driver:

$ ip link set dev ${iface} down
$ ip link set dev ${iface} up

$ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print $1}' | sed 's/://'); do cat /proc/irq/${irq}/smp_affinity_list; done
0
0
1
1
2
2
3
3
0
0
1
1
2
2
3
3
2
2
2
2

$ # NOTE: All rx and tx interrupts are load balanced again to all
$ # online cpus ignoring the default interrupt affinity as well as
$ # ignoring and overwriting the explicitly set interrupt affinities
$ # from userspace.

Tobias

^ permalink raw reply	[flat|nested] 25+ messages in thread
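
As a side note, the grep/awk/sed pipeline used in the reproduction
above can also be expressed as a small parser over the text of
/proc/interrupts (illustrative sketch of ours; the sample layout and
IRQ numbers below are made up):

```python
# Illustrative equivalent of:
#   cat /proc/interrupts | grep ${iface} | awk '{print $1}' | sed 's/://'
# Extracts the IRQ numbers belonging to a given interface from
# /proc/interrupts-style text.

def iface_irqs(interrupts_text: str, iface: str) -> list[int]:
    irqs = []
    for line in interrupts_text.splitlines():
        if iface in line:
            first = line.split()[0]          # e.g. "127:"
            irqs.append(int(first.rstrip(":")))
    return irqs

# Hypothetical /proc/interrupts excerpt for demonstration:
sample = """\
           CPU0       CPU1
 127:        512          0  PCI-MSI  eno2-rx-0
 128:          3        498  PCI-MSI  eno2-tx-0
 129:          0          1  PCI-MSI  enp1s0
"""
print(iface_irqs(sample, "eno2"))  # [127, 128]
```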
* Re: Control of IRQ Affinities from Userspace
  2025-11-05 13:11 ` Preclik, Tobias
@ 2025-11-05 13:18   ` Preclik, Tobias
  2025-11-11 14:35     ` bigeasy
  0 siblings, 1 reply; 25+ messages in thread
From: Preclik, Tobias @ 2025-11-05 13:18 UTC (permalink / raw)
To: Bezdeka, Florian, bigeasy@linutronix.de
Cc: linux-rt-users@vger.kernel.org, Kiszka, Jan

On Wed, 2025-11-05 at 13:11 +0000, Preclik, Tobias wrote:
> On Mon, 2025-11-03 at 18:12 +0100, Florian Bezdeka wrote:
> > On Mon, 2025-11-03 at 16:53 +0100, Sebastian Andrzej Siewior wrote:
> > > The usage of irq_set_affinity_hint() is not uncommon within the
> > > networking drivers. It is probably a pity if the request happens
> > > on each ifdown/up, but in this case it happens each time you
> > > add/remove an XDP program.
> > >
> > > But the interrupt should be managed by the kernel. Looking at
> > > irq_do_set_affinity() there is:
> > > >	if (irqd_affinity_is_managed(data) &&
> > > >	    housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
> > > >		const struct cpumask *hk_mask;
> > > >
> > > >		hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
> > > >
> > > >		cpumask_and(tmp_mask, mask, hk_mask);
> > > >		if (!cpumask_intersects(tmp_mask, cpu_online_mask))
> > > >			prog_mask = mask;
> > > >		else
> > > >			prog_mask = tmp_mask;
> > > >	} else {
> > >
> > > so if the IRQ is managed and you have IRQ isolation enabled then
> > > it should exclude the non-isolated CPUs. Would that work?
>
> For that code path to be taken we would have to specify isolcpus and
> managed_irq on the kernel command line. However, we already migrated
> away from the deprecated isolcpus parameter. Additionally, we need to
> dynamically change the interrupt affinities while in operation. For
> that matter we have to rely on setting the interrupt affinities in
> procfs.
>
> > I'm trying to jump in and add some thoughts and results we got
> > while analyzing this issue:
> >
> > What stmmac (and some more drivers) are trying to achieve here is
> > some kind of handcrafted IRQ balancing, like the good old
> > irqbalanced did in the past from usermode. It turns out that the
> > situation around IRQ balancing is a bit inconsistent. Some IRQ
> > chips (like the APIC on x86) do that "automatically" on driver
> > level, many others don't. So drivers end up fiddling with
> > affinities.
> >
> > We can nicely tune IRQs and the affected affinities that have been
> > requested during system boot. Tools like tuned can configure them
> > using the APIs Tobias described. IRQs that are requested/set up
> > after boot, during runtime, are kind of "problematic" for us, as
> > there is no API that informs about new IRQs. We would have to
> > rescan /proc. But even if there were such an API: it would be too
> > late. The IRQ might have fired already.
> >
> > Once an affinity has been set (e.g. by tuned) this affinity is
> > restored when the IRQ comes back after a link up/down or bpf load.
> > But: the situation on the system might have changed in the
> > meantime. Even the default affinity could be different now. In case
> > of stmmac - and probably way more drivers - the default affinity is
> > not taken into account anymore. The previous affinity is restored
> > unconditionally.
>
> Thanks for sharing the details, Florian. The discovery of newly
> registered/requested IRQs in userland is indeed an additional
> problem. Still, I would say that we can work around it by polling
> procfs and setting the default interrupt affinity appropriately
> (given that it is not ignored by the drivers, of course). RT
> applications might have to accept an initial delay until the
> interrupts on their I/O path are detected and properly tuned.
>
> > I tried to modify stmmac and let it evaluate the default affinity
> > while doing the IRQ balancing dance. That turned out to work in the
> > end, but each line violated several coding/style/abstraction rules.
> > There is no API at driver level to read the current default
> > affinity - or I missed it. I could send that hack out as an RFC if
> > requested. Just let me know.
>
> When drivers at least respect the default affinity when rebalancing
> interrupts we can avoid interfering with RT applications. However, we
> must also ensure that specific interrupts which are in use by RT
> applications on their I/O path remain on the application core.
> Meaning the IRQ rebalancing should also respect the interrupt
> affinities set from userspace via /proc/irq/${IRQ_NO}/smp_affinity.
>
> Here are steps to explore the behavior and reproduce the issue.
>
> $ iface=eno2
> $ ethtool -i ${iface}
> driver: st_gmac
> version: 6.18.0-rc2+
> firmware-version:
> expansion-rom-version:
> bus-info: 0000:00:1d.2
> supports-statistics: yes
> supports-test: no
> supports-eeprom-access: no
> supports-register-dump: yes
> supports-priv-flags: no
>
> $ ip link show ${iface}
> 3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>     link/ether 30:2f:1e:7b:a5:3d brd ff:ff:ff:ff:ff:ff
>     altname enp0s29f2
>
> $ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print $1}' | sed 's/://'); do cat /proc/irq/${irq}/smp_affinity_list; done
>
> $ # NOTE: interrupts are requested only after interface bring-up
>
> $ echo 3 > /proc/irq/default_smp_affinity
>
> $ ip link set dev ${iface} up
> $ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print $1}' | sed 's/://'); do cat /proc/irq/${irq}/smp_affinity_list; done
> 0
> 0
> 1
> 1
> 2
> 2
> 3
> 3
> 0
> 0
> 1
> 1
> 2
> 2
> 3
> 3
> 0-1
> 0-1
> 0-1
> 0-1
>
> $ # NOTE: All rx and tx interrupts of the interface are load balanced
> $ # to all online cpus in a round-robin fashion by setting the irq
> $ # affinities and ignoring the default affinity.
>
> $ # Let's explicitly set the IRQ affinities:
>
> $ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print $1}' | sed 's/://'); do echo 2 > /proc/irq/${irq}/smp_affinity_list; done
>
> $ # We trigger the IRQ rebalancing of the stmmac-based driver:
>
> $ ip link set dev ${iface} down
> $ ip link set dev ${iface} up
>
> $ for irq in $(cat /proc/interrupts | grep ${iface} | awk '{print $1}' | sed 's/://'); do cat /proc/irq/${irq}/smp_affinity_list; done
> 0
> 0
> 1
> 1
> 2
> 2
> 3
> 3
> 0
> 0
> 1
> 1
> 2
> 2
> 3
> 3
> 2
> 2
> 2
> 2
>
> $ # NOTE: All rx and tx interrupts are load balanced again to all
> $ # online cpus ignoring the default interrupt affinity as well as
> $ # ignoring and overwriting the explicitly set interrupt affinities
> $ # from userspace.

The conclusion got lost:

Other drivers like for example igb respect the interrupt affinities
(both default and per-irq affinities). This leads me to believe that
the irq rebalancing in the drivers should only affect the effective
interrupt affinities. This is admittedly more involved than it appears
at first because the interface interrupts would have to be balanced
subject to multiple (potentially totally different) cpusets.

Tobias

^ permalink raw reply	[flat|nested] 25+ messages in thread
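
The proposal that driver-level rebalancing only affect the effective
affinities could be sketched roughly as follows: spread queue IRQs
across CPUs, but only within each IRQ's userspace-configured mask
(purely illustrative sketch; the function names and policy are ours,
not an existing kernel API):

```python
# Illustrative policy: balance queue IRQs round-robin, but confined to
# each IRQ's userspace-allowed CPU mask, instead of spreading them over
# all online CPUs and discarding the configured affinity.

def balance(irq_allowed_masks: list[set[int]]) -> list[int]:
    """For each IRQ, pick the least-loaded CPU from its allowed mask,
    so queues sharing a mask get spread out within that mask."""
    placed: list[int] = []
    use_count: dict[int, int] = {}
    for allowed in irq_allowed_masks:
        cpu = min(sorted(allowed), key=lambda c: use_count.get(c, 0))
        use_count[cpu] = use_count.get(cpu, 0) + 1
        placed.append(cpu)
    return placed

# Four queue IRQs pinned to CPUs {2,3} from userspace stay within {2,3}
# instead of being scattered over CPUs 0-3:
print(balance([{2, 3}, {2, 3}, {2, 3}, {2, 3}]))  # [2, 3, 2, 3]
```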
* Re: Control of IRQ Affinities from Userspace
  2025-11-05 13:18 ` Preclik, Tobias
@ 2025-11-11 14:35   ` bigeasy
  0 siblings, 0 replies; 25+ messages in thread
From: bigeasy @ 2025-11-11 14:35 UTC (permalink / raw)
To: Preclik, Tobias
Cc: Bezdeka, Florian, linux-rt-users@vger.kernel.org, Kiszka, Jan

On 2025-11-05 13:18:16 [+0000], Preclik, Tobias wrote:
> The conclusion got lost:
>
> Other drivers like for example igb respect the interrupt affinities
> (both default and per-irq affinities). This leads me to believe that
> the irq rebalancing in the drivers should only affect the effective
> interrupt affinities. This admittedly is more involved than it
> appears at first because the interface interrupts would have to be
> balanced subject to multiple (potentially totally different) cpusets.

Exactly. Maybe it would work to align the driver with what igb does.

> Tobias

Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace
  2025-11-05 13:11 ` Preclik, Tobias
  2025-11-05 13:18   ` Preclik, Tobias
@ 2025-11-11 14:34   ` bigeasy
  2025-11-21 13:25     ` Preclik, Tobias
  1 sibling, 1 reply; 25+ messages in thread
From: bigeasy @ 2025-11-11 14:34 UTC (permalink / raw)
To: Preclik, Tobias
Cc: Bezdeka, Florian, linux-rt-users@vger.kernel.org, Kiszka, Jan

On 2025-11-05 13:11:29 [+0000], Preclik, Tobias wrote:
> > > so if the IRQ is managed and you have IRQ isolation enabled then
> > > it should exclude the non-isolated CPUs. Would that work?
>
> For that code path to be taken we would have to specify isolcpus and
> managed_irq on the kernel command line. However, we already migrated
> away from the deprecated isolcpus parameter. Additionally, we need to
> dynamically change the interrupt affinities while in operation. For
> that matter we have to rely on setting the interrupt affinities in
> procfs.

I would be careful with the deprecated term here. The functionality is
not deprecated, just the interface is. The CPU affinity has been
migrated to a cgroup based interface. If the matching irq affinity is
missing then it should be added rather than avoiding the whole
"affinity is managed" interface, since this looks as if it has been
meant for your use case.

…

> Thanks for sharing the details, Florian. The discovery of newly
> registered/requested IRQs in userland is indeed an additional
> problem. Still, I would say that we can work around it by polling
> procfs and setting the default interrupt affinity appropriately
> (given that it is not ignored by the drivers, of course). RT
> applications might have to accept an initial delay until the
> interrupts on their I/O path are detected and properly tuned.

This depends how you sell it. The initial setup of the application/
system might take a small hit until everything is ready and set up for
your workload. I don't know how changing the XDP program fits into
this. Usually I would expect that you set it up once and use it.
However, if switching the XDP program makes sense within your
real-time workload then maybe the affinity should not be randomly
assigned.

> > I tried to modify stmmac and let it evaluate the default affinity
> > while doing the IRQ balancing dance. That turned out to work in the
> > end, but each line violated several coding/style/abstraction rules.
> > There is no API at driver level to read the current default
> > affinity - or I missed it. I could send that hack out as an RFC if
> > requested. Just let me know.
>
> When drivers at least respect the default affinity when rebalancing
> interrupts we can avoid interfering with RT applications. However, we
> must also ensure that specific interrupts which are in use by RT
> applications on their I/O path remain on the application core.
> Meaning the IRQ rebalancing should also respect the interrupt
> affinities set from userspace via /proc/irq/${IRQ_NO}/smp_affinity.

If free_irq() removes all information about the IRQ then it might also
lose the configured smp_affinity.

> Here are steps to explore the behavior and reproduce the issue.
…
> online cpus ignoring the default interrupt affinity as well as
> ignoring and overwriting the explicitly set interrupt affinities from
> userspace.

I just compared with igb and here the affinity mask survives. So it is
just this driver that is doing things differently. The igb also
removes all interrupts on down. The affinity remains after changing
the number of queues (which changes the number of used interrupts).

> Tobias

Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace 2025-11-11 14:34 ` bigeasy @ 2025-11-21 13:25 ` Preclik, Tobias 2025-11-24 9:59 ` bigeasy 0 siblings, 1 reply; 25+ messages in thread From: Preclik, Tobias @ 2025-11-21 13:25 UTC (permalink / raw) To: bigeasy@linutronix.de Cc: Bezdeka, Florian, linux-rt-users@vger.kernel.org, Kiszka, Jan On Tue, 2025-11-11 at 15:34 +0100, bigeasy@linutronix.de wrote: > On 2025-11-05 13:11:29 [+0000], Preclik, Tobias wrote: > > > > so if the IRQ is managed and you have IRQ isolation enabled > > > > then it > > > > should exclude the non-isolated CPUs. Would that work? > > > > For that code path to be taken we would have to specify isolcpus > > and > > managed_irq on the kernel command-line. However, we already > > migrated > > away from the deprecated isolcpus parameter. Additionally, we need > > to > > dynamically change the interrupt affinities while in operation. For > > that matter we have to rely on setting the interrupt affinities in > > procfs. > > I would be careful with the deprecated term here. The functionality > is > not deprecated just the interface is. The CPU affinity has been > migrated > a cgroup based interface. If the matching irq affinity is missing > then > it should be added rather than avoiding the whole "affinity is > managed" > interface since this looks as it has been meant for your use case. > As you point out isolcpus interface is deprecated and it seems there exists no way to translate the managed_irq flag of the isolcpus interface into the cgroups based interface. My understanding is that the managed_irq flag ensures that managed interrupts stay off isolated cores. Whether an interrupt is managed or not depends on how the driver allocates the interrupt. Managed interrupts explicitly exclude control from userspace [1]: > The affinity of managed interrupts is handled by the kernel and > cannot be changed via the /proc/irq/* interfaces. I am not aware of any net/ethernet driver allocating managed interrupts. 
Mellanox even reverted the introduction of managed interrupts in mlx5 exactly for this reason [2]. But NVME and multiple SCSI drivers allocate managed interrupts. So in essence this uncovers more limitations: - managed interrupts cannot be confined to housekeeping cores without the deprecated isolcpus parameter - managed interrupts cannot be relocated to a subset of isolated cores from userspace (not so relevant for now since we are mostly interested to relocate network interrupts to application cores) Drivers allocating unmanaged interrupts are still subject to the limitations I described initially (some drivers like stmmac rebalance IRQs in certain situations like interface bring-up and XDP program loading ignoring and overwriting any affinities set from userspace). > > Thanks for sharing the details Florian. The discovery of newly > > registered/requested IRQs in userland is indeed an additional > > problem. > > Still I would say that we can work around it by polling procfs and > > setting the default interrupt affinity appropriately (given that it > > is > > not ignored by the drivers of course). RT applications might have > > to > > accept an initial delay until the interrupts on their I/O path are > > detected and properly tuned. > > This depends how you sell it. The initial setup of the application/ > system might take a small hit until everything is ready and setup for > your workload. I don't know how changing the XDP program fits into > this. > Usually I would expect that you setup it once and use it. However if > switching the XDP program makes sense within your real time workload > then maybe the affinity should not be randomly assigned. Consider an edge platform where operators are free to start, stop and upgrade rt and non-rt applications at their will. XDP programs can be loaded at any time and loading of XDP programs should not rebalance interrupts ignoring (and overwriting) the affinities set from the edge platform. 
> > > I tried to modify stmmac and let it evaluate the default affinity > > > while > > > doing the IRQ balancing dance. That turned out to be working at > > > the > > > end, > > > but each line violated several coding/style/abstraction rules. > > > There > > > is > > > no API at driver level to read the current default affinity - or > > > I > > > missed it. I could sent that hack out as RFC if requested. Just > > > let > > > me > > > know. > > > > When drivers at least respect the default affinity when rebalancing > > interrupts we can avoid interfering with RT applications. However, > > we > > must also ensure that specific interrupts which are in use by RT > > applications on their I/O path must remain on the application core. > > Meaning the IRQ rebalancing should also respect the interrupt > > affinities set from userspace via /proc/irq/${IRQ_NO}/smp_affinity. > > If free_irq() removes all information about the IRQ then it might > also > lose the configured smp_affinity. > > > Here are steps to explore the behavior and reproduce the issue. > … > > online cpus ignoring the default interrupt affinity as well as > > ignoring > > and overwriting the explicitly set interrupt affinities from > > userspace. > > I just compared with igb and here the affinity mask survives. So it > is > just this driver that is doing things different. The igb also removes > all interrupts on down. The affinity remains after changing the > number > of queues (which changes the number of used interrupts). igb does not apply driver-based irq balancing (no calls to irq_set_affinity_hint) and does not allocate managed interrupts and therefore the affinities set from userspace are in effect. Affinities are usually (unmanaged interrupts, no driver-based irq balancing) still in effect after freeing and rerequesting an irq handler. So everything works as I would expect it in this case. 
> > The conclusion got lost:
> >
> > Other drivers like for example igb respect the interrupt
> > affinities (both default and per-irq affinities). This leads me to
> > believe that the irq rebalancing in the drivers should only affect
> > the effective interrupt affinities. This admittedly is more
> > involved than it appears at first because the interface interrupts
> > would have to be balanced subject to multiple (potentially totally
> > different) cpusets.
>
> Exactly. Maybe it would work to align the driver with what igb does.

Currently, stmmac sets IRQ affinity and hints for all IRQ
configurations. But on x86 systems with IOAPIC, MSI-X vectors should
be automatically balanced. If we remove the driver-based irq balancing
then other architectures would not necessarily balance the interrupts
anymore and would be impacted in terms of performance. Maybe
driver-based irq balancing could be deactivated whenever the
underlying system is capable of balancing them? That would of course
only reduce the number of affected systems.

In general I lack information on when drivers should (or are allowed
to) balance interrupts at driver level and whether smp_affinity is
allowed to be ignored and overwritten in that case. All documentation
I have found so far remains rather unspecific.

Tobias

[1] https://www.kernel.org/doc/html/v6.1/admin-guide/kernel-parameters.html
[2] https://github.com/torvalds/linux/commit/ef8c063cf88e1a3d99ab4ada1cbab5ba7248a4f2

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace
  2025-11-21 13:25         ` Preclik, Tobias
@ 2025-11-24  9:59           ` bigeasy
  2025-11-25 11:32             ` Florian Bezdeka
  0 siblings, 1 reply; 25+ messages in thread
From: bigeasy @ 2025-11-24  9:59 UTC (permalink / raw)
  To: Preclik, Tobias
  Cc: Bezdeka, Florian, linux-rt-users@vger.kernel.org, Kiszka, Jan,
	Frederic Weisbecker

On 2025-11-21 13:25:09 [+0000], Preclik, Tobias wrote:
> > I would be careful with the deprecated term here. The functionality
> > is not deprecated just the interface is. The CPU affinity has been
> > migrated a cgroup based interface. If the matching irq affinity is
> > missing then it should be added rather than avoiding the whole
> > "affinity is managed" interface since this looks as it has been
> > meant for your use case.
>
> As you point out isolcpus interface is deprecated and it seems there
> exists no way to translate the managed_irq flag of the isolcpus
> interface into the cgroups based interface. My understanding is that

I did not point out anything. I just suggested to test whether this
option is working for you and if it does, check if there is a matching
configuration knob in the cpusets/cgroups interface. As per
	https://www.suse.com/c/cpu-isolation-practical-example-part-5/

in "3.2) Isolcpus" Frederic says that the options should be used if
the kernel/ application "haven't been built with cpusets/cgroups
support". So it seems that this bit is either missing in the other
interface or hard to find.

…

> > > The conclusion got lost:
> > >
> > > Other drivers like for example igb respect the interrupt
> > > affinities (both default and per-irq affinities). This leads me
> > > to believe that the irq rebalancing in the drivers should only
> > > affect the effective interrupt affinities. This admittedly is
> > > more involved than it appears at first because the interface
> > > interrupts would have to be balanced subject to multiple
> > > (potentially totally different) cpusets.
> >
> > Exactly. Maybe it would work to align the driver with what igb
> > does.
>
> Currently, stmmac sets IRQ affinity and hints for all IRQ
> configurations. But on x86 systems with IOAPIC, MSI-X vectors should
> be automatically balanced. If we remove the driver-based irq
> balancing then other architectures would not necessarily balance the
> interrupts anymore and would be impacted in terms of performance.
> Maybe driver-based irq balancing could be deactivated whenever the
> underlying system is capable of balancing them? That would of course
> only reduce the number of affected systems.
>
> In general I lack information on when drivers should (or are allowed
> to) balance interrupts at driver level and whether smp_affinity is
> allowed to be ignored and overwritten in that case. All
> documentation I have found so far remains rather unspecific.

It seems that if you exclude certain CPUs from getting interrupt
handling then it should work fine. Then the driver would only balance
the interrupts among the CPUs that are left.

> Tobias

Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace
  2025-11-24  9:59           ` bigeasy
@ 2025-11-25 11:32             ` Florian Bezdeka
  2025-11-25 11:50               ` bigeasy
  2025-11-26 15:24               ` Frederic Weisbecker
  0 siblings, 2 replies; 25+ messages in thread
From: Florian Bezdeka @ 2025-11-25 11:32 UTC (permalink / raw)
  To: bigeasy@linutronix.de, Preclik, Tobias, Frederic Weisbecker
  Cc: linux-rt-users@vger.kernel.org, Kiszka, Jan

On Mon, 2025-11-24 at 10:59 +0100, bigeasy@linutronix.de wrote:
> On 2025-11-21 13:25:09 [+0000], Preclik, Tobias wrote:
> > > I would be careful with the deprecated term here. The
> > > functionality is not deprecated just the interface is. The CPU
> > > affinity has been migrated a cgroup based interface. If the
> > > matching irq affinity is missing then it should be added rather
> > > than avoiding the whole "affinity is managed" interface since
> > > this looks as it has been meant for your use case.
> >
> > As you point out isolcpus interface is deprecated and it seems
> > there exists no way to translate the managed_irq flag of the
> > isolcpus interface into the cgroups based interface. My
> > understanding is that
>
> I did not point out anything. I just suggested to test whether this
> option is working for you and if it does, check if there is a
> matching configuration knob in the cpusets/cgroups interface. As per
> 	https://www.suse.com/c/cpu-isolation-practical-example-part-5/
>
> in "3.2) Isolcpus" Frederic says that the options should be used if
> the kernel/ application "haven't been built with cpusets/cgroups
> support". So it seems that this bit is either missing in the other
> interface or hard to find.

In case that was still unclear: We're using the dynamic system
configuration features provided by cpusets/cgroups. No isolcpus= on
the kernel cmdline anymore. With that all applications are built
around cgroups. There is some userspace tooling around that takes care
of proper system configuration / RT isolation.
>
> …
>
> > > > The conclusion got lost:
> > > >
> > > > Other drivers like for example igb respect the interrupt
> > > > affinities (both default and per-irq affinities). This leads me
> > > > to believe that the irq rebalancing in the drivers should only
> > > > affect the effective interrupt affinities. This admittedly is
> > > > more involved than it appears at first because the interface
> > > > interrupts would have to be balanced subject to multiple
> > > > (potentially totally different) cpusets.
> > >
> > > Exactly. Maybe it would work to align the driver with what igb
> > > does.
> >
> > Currently, stmmac sets IRQ affinity and hints for all IRQ
> > configurations. But on x86 systems with IOAPIC, MSI-X vectors
> > should be automatically balanced. If we remove the driver-based
> > irq balancing then other architectures would not necessarily
> > balance the interrupts anymore and would be impacted in terms of
> > performance. Maybe driver-based irq balancing could be deactivated
> > whenever the underlying system is capable of balancing them? That
> > would of course only reduce the number of affected systems.
> >
> > In general I lack information on when drivers should (or are
> > allowed to) balance interrupts at driver level and whether
> > smp_affinity is allowed to be ignored and overwritten in that
> > case. All documentation I have found so far remains rather
> > unspecific.
>
> It seems that if you exclude certain CPUs from getting interrupt
> handling then it should work fine. Then the driver would only
> balance the interrupts among the CPUs that are left.

Sebastian, what exactly do you mean by "exclude certain CPUs from
getting interrupt handling"? I mean, that is what we do by configuring
the /proc/<irq>/smp_affinity_list interface. The point here is that
drivers (like the stmmac, storage, ...) simply ignore everything that
was configured by userspace.
As soon as one of the dynamic events (link up/down, bpf loading)
occurs they destroy the current RT-aware system configuration.

I was not successful in finding an API that would allow the driver(s)
to do better. The default affinity (/proc/irq/default_smp_affinity) -
as an example - is not visible from outside the IRQ core.

The managed IRQ infrastructure that you mentioned seems coupled with
the interfaces behind CONFIG_CPU_ISOLATION, which seems to be
"static", so configured at boot time. Is that understanding correct?
That would not be flexible enough as we don't know the system
configuration at boot time.

As we now have Frederic with us: Frederic, are there any plans to
extend the housekeeping API to deal with cpuset creation? Not sure if
that would be possible as it's hard to say if the newly created cpuset
is targeting isolation or housekeeping...

The question from Tobias in other words: What are drivers allowed to
do? Are they free in choosing/configuring a (performance optimized)
IRQ handling? If so, how can we prevent them from touching RT isolated
CPUs? Do we consider issues like "driver ignoring the default IRQ
affinity" as driver bugs? Is there some consensus in the community?
Background here is the argumentation why we would have to touch / fix
those drivers.

To sum up:
- The IRQ balancing issue is not limited to a single driver /
  subsystem
- The managed IRQ infrastructure seems very "static" and thus
  insufficient for this problem. In addition we would have to migrate
  all affected drivers to the managed IRQ infrastructure first.

We would love to hear further thoughts / ideas / comments about this
problem. We're highly interested in fixing this issue properly.

Thanks for all the comments so far!

Florian

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace
  2025-11-25 11:32             ` Florian Bezdeka
@ 2025-11-25 11:50               ` bigeasy
  2025-11-25 14:36                 ` Florian Bezdeka
  2025-11-26 15:31                 ` Frederic Weisbecker
  1 sibling, 2 replies; 25+ messages in thread
From: bigeasy @ 2025-11-25 11:50 UTC (permalink / raw)
  To: Florian Bezdeka
  Cc: Preclik, Tobias, Frederic Weisbecker,
	linux-rt-users@vger.kernel.org, Kiszka, Jan

On 2025-11-25 12:32:39 [+0100], Florian Bezdeka wrote:
> > It seems that if you exclude certain CPUs from getting interrupt
> > handling then it should work fine. Then the driver would only
> > balance the interrupts among the CPUs that are left.
>
> Sebastian, what exactly do you mean by "exclude certain CPUs from
> getting interrupt handling"? I mean, that is what we do by
> configuring the /proc/<irq>/smp_affinity_list interface.

Step #1
- figure out if isolcpus= is restricting the affinity of requested
  interrupts to housekeeping CPUs only

Step #2
- Yes
  => look for the matching knob in cgroup interface
     Knob found?
     - Yes
       => Use knob.
     - No.
       => Add knob.
- No
  This should be added as it breaks the expectation of an isolated
  system.

I *think* the driver should request as many interrupts as there are
available CPUs in the system to handle them. The number of available
CPUs/ CPU mask should be a configure knob by the user. Using the
housekeeping CPUs as a default mask seems reasonable.
The question is what should happen if the mask changes at runtime.
Maybe a device needs to reconfigure, maybe just move the interrupt
away. But this should also affect NOHZ_FULL workloads.

> To sum up:
> - The IRQ balancing issue is not limited to a single driver /
>   subsystem
> - The managed IRQ infrastructure seems very "static" so insufficient
>   for this problem. In addition we would have to migrate all
>   affected drivers to the managed IRQ infrastructure first.
>
> We would love to hear further thoughts / ideas / comments about this
> problem.
> We're highly interested in fixing this issue properly.

If the "managed IRQ infrastructure" would help here then why not.
Maybe Frederic has some insight here.

> Thanks for all the comments so far!
>
> Florian

Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace
  2025-11-25 11:50               ` bigeasy
@ 2025-11-25 14:36                 ` Florian Bezdeka
  2025-11-25 16:31                   ` Thomas Gleixner
  1 sibling, 1 reply; 25+ messages in thread
From: Florian Bezdeka @ 2025-11-25 14:36 UTC (permalink / raw)
  To: bigeasy@linutronix.de
  Cc: Preclik, Tobias, Frederic Weisbecker,
	linux-rt-users@vger.kernel.org, Kiszka, Jan

On Tue, 2025-11-25 at 12:50 +0100, bigeasy@linutronix.de wrote:
> On 2025-11-25 12:32:39 [+0100], Florian Bezdeka wrote:
> > > It seems that if you exclude certain CPUs from getting interrupt
> > > handling then it should work fine. Then the driver would only
> > > balance the interrupts among the CPUs that are left.
> >
> > Sebastian, what exactly do you mean by "exclude certain CPUs from
> > getting interrupt handling"? I mean, that is what we do by
> > configuring the /proc/<irq>/smp_affinity_list interface.
>
> Step #1
> - figure out if isolcpus= is restricting the affinity of requested
>   interrupts to housekeeping CPUs only

This question cannot be answered with yes/no. It depends. Affinities
are based on the default_smp_affinity during creation. But as it
turned out there are drivers that overwrite those affinities after IRQ
creation.

> Step #2
> - Yes
>   => look for the matching knob in cgroup interface
>      Knob found?
>      - Yes
>        => Use knob.
>      - No.
>        => Add knob.
> - No
>   This should be added as it breaks the expectation of an isolated
>   system.

Assuming that No is the answer as there are a couple of drivers
leading to this situation. The question is now: How do we prevent this
violation of the isolated system configuration?

> I *think* the driver should request as many interrupts as there are
> available CPUs in the system to handle them.

That does not match how networking (and some storage) drivers are
designed. Those drivers are usually HW queue centric. A driver is
setting up an IRQ per queue pair (TX/RX).
The number of HW queues is defined by the hardware and is decoupled from any CPU count. To optimize performance, drivers may spread / balance the IRQs / queues over available CPUs and while doing so might ignore any previous RT configuration. Again: The performance optimization is valid, but how could we prevent violating RT settings? > The number of available > CPUs/ CPU mask should be a configure knob by the user. > The user normally configures the number of HW queues that the NIC should use. In most cases in combination with some HW packet filters to achieve best packet separation. IMHO the user should not have to deal with any (additional) CPU mask on that level. RT tuning will / should handle that. > Using the > housekeeping CPUs as a default mask seems reasonable. > The question is what should happen if the mask changes at runtime. Maybe > a device needs to reconfigure, maybe just move the interrupt away. > But this should also affect NOHZ_FULL workloads. > > > To sum up: > > - The IRQ balancing issue is not limited to a single driver / subsystem > > - The managed IRQ infrastructure seems very "static" so insufficient for > > this problem. In addition we would have to migrate all affected > > drivers to the managed IRQ infrastructure first. > > > > We would love to hear further thoughts / ideas / comments about this > > problem. We're highly interested in fixing this issue properly. > > If the "managed IRQ infrastructure" would help here then why not. Maybe > Frederic has some insight here. I currently can't see how this could help. That looks like dead code to me. I started in irq_do_set_affinity() - which checks for managed IRQs - but I could not find any user of irq_create_affinity_masks() - that is where the managed flag is set - that is actually being used. The road seems dead in devm_platform_get_irqs_affinity() which has no in-tree user. > > > Thanks for all the comments so far! 
> > > > Florian > > Sebastian ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace 2025-11-25 14:36 ` Florian Bezdeka @ 2025-11-25 16:31 ` Thomas Gleixner 2025-11-26 9:20 ` Florian Bezdeka 0 siblings, 1 reply; 25+ messages in thread From: Thomas Gleixner @ 2025-11-25 16:31 UTC (permalink / raw) To: Florian Bezdeka, bigeasy@linutronix.de Cc: Preclik, Tobias, Frederic Weisbecker, linux-rt-users@vger.kernel.org, Kiszka, Jan On Tue, Nov 25 2025 at 15:36, Florian Bezdeka wrote: > On Tue, 2025-11-25 at 12:50 +0100, bigeasy@linutronix.de wrote: >> On 2025-11-25 12:32:39 [+0100], Florian Bezdeka wrote: >> > > It seems that if you exclude certain CPUs from getting interrupt >> > > handling than it should work fine. Then the driver would only balance >> > > the interrupts among the CPUs that are left. >> > >> > Sebastian, what exactly do you mean by "exclude certain CPUs from >> > getting interrupt handling"? I mean, that is what we do by configuring >> > the /proc/<irq>/smp_affinity_list interface. >> >> Step #1 >> - figure out if isolcpus= is restricting the affinity of requested >> interrupts to housekeeping CPUs only > > This question can not be answered with yes/no. It depends. Affinities > are based on the default_smp_affinity during creation. But as it turned > out there are drivers that overwrite those affinities after IRQ > creation. Which ones? >> I *think* the driver should request as many interrupts as there are >> available CPUs in the system to handle them. >> > > That does not match how networking (and some storage) drivers are > designed. Those drivers are usually HW queue centric. A driver is > setting up a IRQ per queue pair (TX/RX). The number of HW queues is > defined by the hardware and is decoupled from any CPU count. > > To optimize performance, drivers may spread / balance the IRQs / queues > over available CPUs and while doing so might ignore any previous RT > configuration. Again: The performance optimization is valid, but how > could we prevent violating RT settings? 
That spreading happens and it depends on how it is grouped and how
that matches your isolation requirements. NVME certainly allocates a
queue per CPU if there are enough available and those won't disturb
your RT isolated CPUs as long as nothing issues I/O on those CPUs.

Networking is a different story, but networking does not use managed
interrupts (except for one driver) and you can move them away from
your isolated CPUs after the device is set up.

There have been discussions about how to keep interrupts off isolated
CPUs by default, but I don't know where this stands. Frederic?

>> The number of available
>> CPUs/ CPU mask should be a configure knob by the user.
>>
> The user normally configures the number of HW queues that the NIC
> should use. In most cases in combination with some HW packet filters
> to achieve best packet separation. IMHO the user should not have to
> deal with any (additional) CPU mask on that level. RT tuning will /
> should handle that.

How so. The kernel magically knows what the user wants?

>> Using the
>> housekeeping CPUs as a default mask seems reasonable.
>> The question is what should happen if the mask changes at runtime.
>> Maybe a device needs to reconfigure, maybe just move the interrupt
>> away. But this should also affect NOHZ_FULL workloads.
>>
>> > To sum up:
>> > - The IRQ balancing issue is not limited to a single driver /
>> >   subsystem
>> > - The managed IRQ infrastructure seems very "static" so
>> >   insufficient for this problem. In addition we would have to
>> >   migrate all affected drivers to the managed IRQ infrastructure
>> >   first.
>> >
>> > We would love to hear further thoughts / ideas / comments about
>> > this problem. We're highly interested in fixing this issue
>> > properly.
>>
>> If the "managed IRQ infrastructure" would help here then why not.
>> Maybe Frederic has some insight here.
>
> I currently can't see how this could help.
>
> That looks like dead code to me.
I started in irq_do_set_affinity() - > which checks for managed IRQs - but I could not find any user of > irq_create_affinity_masks() - that is where the managed flag is set - > that is actually being used. The road seems dead in > devm_platform_get_irqs_affinity() which has no in-tree user. # git grep -nH irq_create_affinity_masks drivers/ drivers/base/platform.c:424: desc = irq_create_affinity_masks(nvec, affd); drivers/pci/msi/api.c:289: irq_create_affinity_masks(1, affd); drivers/pci/msi/msi.c:405: affd ? irq_create_affinity_masks(nvec, affd) : NULL; drivers/pci/msi/msi.c:695: affd ? irq_create_affinity_masks(nvec, affd) : NULL; These three PCI ones are all going through pci_alloc_irq_vectors_affinity() # git grep -nH pci_alloc_irq_vectors_affinity drivers/ drivers/net/ethernet/wangxun/libwx/wx_lib.c:1867: nvecs = pci_alloc_irq_vectors_affinity(wx->pdev, nvecs, drivers/nvme/host/pci.c:2659: return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, flags, drivers/scsi/be2iscsi/be_main.c:3585: if (pci_alloc_irq_vectors_affinity(phba->pcidev, 2, nvec, drivers/scsi/csiostor/csio_isr.c:520: cnt = pci_alloc_irq_vectors_affinity(hw->pdev, min, cnt, drivers/scsi/hisi_sas/hisi_sas_v3_hw.c:2611: vectors = pci_alloc_irq_vectors_affinity(pdev, drivers/scsi/megaraid/megaraid_sas_base.c:5943: i = pci_alloc_irq_vectors_affinity(instance->pdev, drivers/scsi/mpi3mr/mpi3mr_fw.c:862: retval = pci_alloc_irq_vectors_affinity(mrioc->pdev, drivers/scsi/mpt3sas/mpt3sas_base.c:3390: i = pci_alloc_irq_vectors_affinity(ioc->pdev, drivers/scsi/pm8001/pm8001_init.c:982: rc = pci_alloc_irq_vectors_affinity( drivers/scsi/qla2xxx/qla_isr.c:4539: ret = pci_alloc_irq_vectors_affinity(ha->pdev, min_vecs, drivers/virtio/virtio_pci_common.c:160: err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors, Not a so dead road :) Thanks, tglx ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace 2025-11-25 16:31 ` Thomas Gleixner @ 2025-11-26 9:20 ` Florian Bezdeka 2025-11-26 14:26 ` Thomas Gleixner 2025-11-26 15:45 ` Frederic Weisbecker 0 siblings, 2 replies; 25+ messages in thread From: Florian Bezdeka @ 2025-11-26 9:20 UTC (permalink / raw) To: Thomas Gleixner, bigeasy@linutronix.de Cc: Preclik, Tobias, Frederic Weisbecker, linux-rt-users@vger.kernel.org, Kiszka, Jan On Tue, 2025-11-25 at 17:31 +0100, Thomas Gleixner wrote: > On Tue, Nov 25 2025 at 15:36, Florian Bezdeka wrote: > > On Tue, 2025-11-25 at 12:50 +0100, bigeasy@linutronix.de wrote: > > > On 2025-11-25 12:32:39 [+0100], Florian Bezdeka wrote: > > > > > It seems that if you exclude certain CPUs from getting interrupt > > > > > handling than it should work fine. Then the driver would only balance > > > > > the interrupts among the CPUs that are left. > > > > > > > > Sebastian, what exactly do you mean by "exclude certain CPUs from > > > > getting interrupt handling"? I mean, that is what we do by configuring > > > > the /proc/<irq>/smp_affinity_list interface. > > > > > > Step #1 > > > - figure out if isolcpus= is restricting the affinity of requested > > > interrupts to housekeeping CPUs only > > > > This question can not be answered with yes/no. It depends. Affinities > > are based on the default_smp_affinity during creation. But as it turned > > out there are drivers that overwrite those affinities after IRQ > > creation. > > Which ones? The problem at hand is the stmmac (drivers/net/ethernet/stmicro/stmmac/stmmac_main), which is pulled in by a couple of others like the dwmac_intel in our case. Searching for the problematic irq_set_affinity_hint() call in drivers/net/ethernet reveals more "affected" drivers but each of them requires some double checking. In some cases the supplied cpumask or the "time of the call" might be OKish. 
> > > I *think* the driver should request as many interrupts as there
> > > are available CPUs in the system to handle them.
> >
> > That does not match how networking (and some storage) drivers are
> > designed. Those drivers are usually HW queue centric. A driver is
> > setting up an IRQ per queue pair (TX/RX). The number of HW queues
> > is defined by the hardware and is decoupled from any CPU count.
> >
> > To optimize performance, drivers may spread / balance the IRQs /
> > queues over available CPUs and while doing so might ignore any
> > previous RT configuration. Again: The performance optimization is
> > valid, but how could we prevent violating RT settings?
>
> That spreading happens and it depends on how it is grouped and how
> that matches your isolation requirements. NVME certainly allocates a
> queue per CPU if there are enough available and those won't disturb
> your RT isolated CPUs as long as nothing issues I/O on those CPUs.
>
> Networking is a different story, but networking does not use managed
> interrupts (except for one driver) and you can move them away from
> your isolated CPUs after the device is set up.

That's one key point here. The pattern implemented in network drivers
seems to be that drivers register/free IRQs on interface up/down
events and for loading bpf programs (necessary for AF_XDP). Such
events can happen at any time - even after the initial "move
everything away from my isolated CPUs" has completed - and you end up
with a broken RT system configuration.

Even (RT) applications might trigger such an event. An example would
be loading such an AF_XDP bpf program. Applications can "destroy" the
isolated environment this way.

> There have been discussions about how to keep interrupts off
> isolated CPUs by default, but I don't know where this stands.
> Frederic?

Would love to hear more here.

> > > The number of available
> > > CPUs/ CPU mask should be a configure knob by the user.
> > > > > The user normally configures the number of HW queues that the NIC should > > use. In most cases in combination with some HW packet filters to achieve > > best packet separation. IMHO the user should not have to deal with any > > (additional) CPU mask on that level. RT tuning will / should handle > > that. > > How so. The kernel magically knows what the user wants? We can already identify the IRQs that belong to a certain hardware queue (NAPI threads) and move it when necessary. I just wanted to express that there might not be a need for an additional API / cpumask. > > > > Using the > > > housekeeping CPUs as a default mask seems reasonable. > > > The question is what should happen if the mask changes at runtime. Maybe > > > a device needs to reconfigure, maybe just move the interrupt away. > > > But this should also affect NOHZ_FULL workloads. > > > > > > > To sum up: > > > > - The IRQ balancing issue is not limited to a single driver / subsystem > > > > - The managed IRQ infrastructure seems very "static" so insufficient for > > > > this problem. In addition we would have to migrate all affected > > > > drivers to the managed IRQ infrastructure first. > > > > > > > > We would love to hear further thoughts / ideas / comments about this > > > > problem. We're highly interested in fixing this issue properly. > > > > > > If the "managed IRQ infrastructure" would help here then why not. Maybe > > > Frederic has some insight here. > > > > I currently can't see how this could help. > > > > That looks like dead code to me. I started in irq_do_set_affinity() - > > which checks for managed IRQs - but I could not find any user of > > irq_create_affinity_masks() - that is where the managed flag is set - > > that is actually being used. The road seems dead in > > devm_platform_get_irqs_affinity() which has no in-tree user. 
> > # git grep -nH irq_create_affinity_masks drivers/ > drivers/base/platform.c:424: desc = irq_create_affinity_masks(nvec, affd); > drivers/pci/msi/api.c:289: irq_create_affinity_masks(1, affd); > drivers/pci/msi/msi.c:405: affd ? irq_create_affinity_masks(nvec, affd) : NULL; > drivers/pci/msi/msi.c:695: affd ? irq_create_affinity_masks(nvec, affd) : NULL; > > These three PCI ones are all going through pci_alloc_irq_vectors_affinity() > > # git grep -nH pci_alloc_irq_vectors_affinity drivers/ > > drivers/net/ethernet/wangxun/libwx/wx_lib.c:1867: nvecs = pci_alloc_irq_vectors_affinity(wx->pdev, nvecs, > drivers/nvme/host/pci.c:2659: return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, flags, > drivers/scsi/be2iscsi/be_main.c:3585: if (pci_alloc_irq_vectors_affinity(phba->pcidev, 2, nvec, > drivers/scsi/csiostor/csio_isr.c:520: cnt = pci_alloc_irq_vectors_affinity(hw->pdev, min, cnt, > drivers/scsi/hisi_sas/hisi_sas_v3_hw.c:2611: vectors = pci_alloc_irq_vectors_affinity(pdev, > drivers/scsi/megaraid/megaraid_sas_base.c:5943: i = pci_alloc_irq_vectors_affinity(instance->pdev, > drivers/scsi/mpi3mr/mpi3mr_fw.c:862: retval = pci_alloc_irq_vectors_affinity(mrioc->pdev, > drivers/scsi/mpt3sas/mpt3sas_base.c:3390: i = pci_alloc_irq_vectors_affinity(ioc->pdev, > drivers/scsi/pm8001/pm8001_init.c:982: rc = pci_alloc_irq_vectors_affinity( > drivers/scsi/qla2xxx/qla_isr.c:4539: ret = pci_alloc_irq_vectors_affinity(ha->pdev, min_vecs, > drivers/virtio/virtio_pci_common.c:160: err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors, > > Not a so dead road :) Grml... I definitely fat-fingered the query. Anyway, this housekeeping API still seems very boot-time oriented to me. Can't see yet where this housekeeping cpumasks are filled up by the cgroup/cpuset infrastructure. cgroups/cpusets don't care about "isolation" in that sense yet. It's just about cpumasks for compute. Am I missing something? Florian ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace 2025-11-26 9:20 ` Florian Bezdeka @ 2025-11-26 14:26 ` Thomas Gleixner 2025-11-26 15:07 ` Florian Bezdeka 2025-11-26 15:45 ` Frederic Weisbecker 1 sibling, 1 reply; 25+ messages in thread From: Thomas Gleixner @ 2025-11-26 14:26 UTC (permalink / raw) To: Florian Bezdeka, bigeasy@linutronix.de Cc: Preclik, Tobias, Frederic Weisbecker, linux-rt-users@vger.kernel.org, Kiszka, Jan, Waiman Long, Gabriele Monaco On Wed, Nov 26 2025 at 10:20, Florian Bezdeka wrote: > On Tue, 2025-11-25 at 17:31 +0100, Thomas Gleixner wrote: >> > This question can not be answered with yes/no. It depends. Affinities >> > are based on the default_smp_affinity during creation. But as it turned >> > out there are drivers that overwrite those affinities after IRQ >> > creation. >> >> Which ones? > > The problem at hand is the stmmac > (drivers/net/ethernet/stmicro/stmmac/stmmac_main), which is pulled in by > a couple of others like the dwmac_intel in our case. > > Searching for the problematic irq_set_affinity_hint() call in > drivers/net/ethernet reveals more "affected" drivers but each of them > requires some double checking. In some cases the supplied cpumask or the > "time of the call" might be OKish. The question is whether that affinity hint has a functional requirement to be applied or not. I don't think so because those interrupts can be moved by userspace as it sees fit. So it's easy enough to make this "set" part conditional and restrict it to some TBD mask (housekeeping, default ...) under some isolation magic. >> > The user normally configures the number of HW queues that the NIC should >> > use. In most cases in combination with some HW packet filters to achieve >> > best packet separation. IMHO the user should not have to deal with any >> > (additional) CPU mask on that level. RT tuning will / should handle >> > that. >> >> How so. The kernel magically knows what the user wants? 
>
> We can already identify the IRQs that belong to a certain hardware
> queue (NAPI threads) and move it when necessary. I just wanted to
> express that there might not be a need for an additional API /
> cpumask.

Ok.

>> Not a so dead road :)
>
> Grml... I definitely fat-fingered the query. Anyway, this
> housekeeping API still seems very boot-time oriented to me. Can't
> see yet where this housekeeping cpumasks are filled up by the
> cgroup/cpuset infrastructure.
>
> cgroups/cpusets don't care about "isolation" in that sense yet. It's
> just about cpumasks for compute. Am I missing something?

There is work in progress on various related ends:

  https://lore.kernel.org/all/20251105210348.35256-1-frederic@kernel.org
  https://lore.kernel.org/all/20251120145653.296659-1-gmonaco@redhat.com
  https://lore.kernel.org/all/20251105043848.382703-1-longman@redhat.com
  https://lore.kernel.org/all/20251121143500.42111-3-frederic@kernel.org

There is certainly more going on in that area, but those were the bits
and pieces I could remember from the top of my head. Waiman and
Gabriele should be able to fill in some blanks, but that discussion
should move to LKML then.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace
  2025-11-26 14:26 ` Thomas Gleixner
@ 2025-11-26 15:07   ` Florian Bezdeka
  2025-11-26 19:15     ` Thomas Gleixner
  0 siblings, 1 reply; 25+ messages in thread
From: Florian Bezdeka @ 2025-11-26 15:07 UTC (permalink / raw)
To: Thomas Gleixner, bigeasy@linutronix.de
Cc: Preclik, Tobias, Frederic Weisbecker, linux-rt-users@vger.kernel.org,
    Kiszka, Jan, Waiman Long, Gabriele Monaco

On Wed Nov 26, 2025 at 3:26 PM CET, Thomas Gleixner wrote:
> On Wed, Nov 26 2025 at 10:20, Florian Bezdeka wrote:
>> On Tue, 2025-11-25 at 17:31 +0100, Thomas Gleixner wrote:
>>> > This question can not be answered with yes/no. It depends. Affinities
>>> > are based on the default_smp_affinity during creation. But as it turned
>>> > out there are drivers that overwrite those affinities after IRQ
>>> > creation.
>>>
>>> Which ones?
>>
>> The problem at hand is the stmmac
>> (drivers/net/ethernet/stmicro/stmmac/stmmac_main), which is pulled in by
>> a couple of others like the dwmac_intel in our case.
>>
>> Searching for the problematic irq_set_affinity_hint() call in
>> drivers/net/ethernet reveals more "affected" drivers but each of them
>> requires some double checking. In some cases the supplied cpumask or the
>> "time of the call" might be OKish.
>
> The question is whether that affinity hint has a functional requirement
> to be applied or not. I don't think so because those interrupts can be
> moved by userspace as it sees fit.

The background seems to be performance. Those NICs support link speeds
up to (or even above) 2.5 Gbit/s. It seems hard to fully utilize the
link when all queues are routed - IRQ wise - to a single core.

This is now the point where the IRQ chip matters. Some (like the APIC
on x86) have IRQ balancing implemented in SW, while others don't have
that. So the driver does that manually by ignoring all the RT settings.

> So it's easy enough to make this "set" part conditional and restrict it
> to some TBD mask (housekeeping, default ...) under some isolation magic.

For now I would be happy if I could modify the stmmac in a way that its
balancing takes the default affinity into account. I couldn't find any
available API that allows me to do so from a module.

Are there any strong reasons for not exporting the default affinity from
the IRQ core? Read-only would be enough.

>>> > The user normally configures the number of HW queues that the NIC should
>>> > use. In most cases in combination with some HW packet filters to achieve
>>> > best packet separation. IMHO the user should not have to deal with any
>>> > (additional) CPU mask on that level. RT tuning will / should handle
>>> > that.
>>>
>>> How so. The kernel magically knows what the user wants?
>>
>> We can already identify the IRQs that belong to a certain hardware queue
>> (NAPI threads) and move it when necessary. I just wanted to express that
>> there might not be a need for an additional API / cpumask.
>
> Ok.
>
>>> Not a so dead road :)
>>
>> Grml... I definitely fat-fingered the query. Anyway, this housekeeping
>> API still seems very boot-time oriented to me. Can't see yet where these
>> housekeeping cpumasks are filled up by the cgroup/cpuset infrastructure.
>>
>> cgroups/cpusets don't care about "isolation" in that sense yet. It's
>> just about cpumasks for compute. Am I missing something?
>
> There is work in progress on various related ends:
>
>   https://lore.kernel.org/all/20251105210348.35256-1-frederic@kernel.org
>   https://lore.kernel.org/all/20251120145653.296659-1-gmonaco@redhat.com
>   https://lore.kernel.org/all/20251105043848.382703-1-longman@redhat.com
>   https://lore.kernel.org/all/20251121143500.42111-3-frederic@kernel.org
>
> There is certainly more going on in that area, but those were the bits
> and pieces I could remember off the top of my head. Waiman and Gabriele
> should be able to fill in some blanks, but that discussion should move
> to LKML then.

Thanks for the pointers! I was aware of most of them. This still seems
very static as the housekeeping cpumasks are filled up on boot with
kernel cmdline arguments.

In addition I'm quite sure that the housekeeping infrastructure would
not help in the area of networking as nobody (except one driver) is
based on the managed IRQ API.

Florian

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace
  2025-11-26 15:07 ` Florian Bezdeka
@ 2025-11-26 19:15   ` Thomas Gleixner
  2025-11-27 14:06     ` Preclik, Tobias
  2025-11-27 14:52     ` Florian Bezdeka
  1 sibling, 2 replies; 25+ messages in thread
From: Thomas Gleixner @ 2025-11-26 19:15 UTC (permalink / raw)
To: Florian Bezdeka, bigeasy@linutronix.de
Cc: Preclik, Tobias, Frederic Weisbecker, linux-rt-users@vger.kernel.org,
    Kiszka, Jan, Waiman Long, Gabriele Monaco

On Wed, Nov 26 2025 at 16:07, Florian Bezdeka wrote:
> On Wed Nov 26, 2025 at 3:26 PM CET, Thomas Gleixner wrote:
>> The question is whether that affinity hint has a functional requirement
>> to be applied or not. I don't think so because those interrupts can be
>> moved by userspace as it sees fit.
>
> The background seems to be performance. Those NICs support link speeds
> up to (or even above) 2.5 Gbit/s. It seems hard to fully utilize the
> link when all queues are routed - IRQ wise - to a single core.
>
> This is now the point where the IRQ chip matters. Some (like the APIC
> on x86) have IRQ balancing implemented in SW, while others don't have
> that. So the driver does that manually by ignoring all the RT settings.

Hardware interrupt balancing never worked right :)

APIC "supports" it in logical/cluster mode, but in fact 99% of the
interrupts ended up on the lowest APIC in the logical/cluster mask. So
we gave up on it because the benefit was close to zero and the
complexity for multi-CPU affinity management with the limited vector
space was just not worth it. In high performance setups the interrupts
were anyway steered to a single CPU by the admin or irqbalanced :)

ARM64 would support that too IIRC, but they decided to avoid the whole
multi-CPU affinity mess as well :)

>> So it's easy enough to make this "set" part conditional and restrict it
>> to some TBD mask (housekeeping, default ...) under some isolation magic.
>
> For now I would be happy if I could modify the stmmac in a way that its
> balancing takes the default affinity into account. I couldn't find any
> available API that allows me to do so from a module.
>
> Are there any strong reasons for not exporting the default affinity from
> the IRQ core? Read-only would be enough.

Default affinity is yet another piece which is disconnected from all the
other isolation mechanics. So we are not exporting it for some quick and
dirty hack. You can do that of course in your own kernel, but please
don't send the result to my inbox :)

> In addition I'm quite sure that the housekeeping infrastructure would
> not help in the area of networking as nobody (except one driver) is
> based on the managed IRQ API.

Managed interrupts are not user steerable and due to their strict
CPU/CPUgroup relationship they are not required to be steerable. NVME &
al have a strict command/response on the same queue scheme, which is
obviously most efficient when you have per CPU queues. The nice thing
about that concept is that the queues are only active (and having
interrupts) when an application on a given CPU issues a R/W operation.

Networking does not have that by default as its strategy of routing
packets to queues is way more complicated and can be affected by
hardware filtering etc.

But why can't housekeeping help in general and why do you want to hack
around the problem in random drivers?

What's wrong with providing a new irq_set_affinity_hint_xxx() variant
which takes an additional queue number as argument and let that do:

	if (isolate) {
		weight = cpumask_weight(housekeeping);
		qnr %= weight;
		cpu = cpumask_nth(qnr, housekeeping);
		mask = cpumask_of(cpu);
	}
	return irq_set_affinity_hint(mask);

or something like that. From a quick glance over the drivers this could
maybe be based on a queue number alone as most drivers do:

	mask = cpumask_of(qnr % num_online_cpus());

or something daft like that, which is obviously broken, but who cares.
So that would become:

	if (isolate) {
		weight = cpumask_weight(housekeeping);
		qnr %= weight;
		cpu = cpumask_nth(qnr, housekeeping);
	} else {
		guard(cpus_read_lock)();
		qnr %= num_online_cpus();
		cpu = cpumask_nth(qnr, cpu_online_mask);
	}

	return irq_set_affinity_hint(cpumask_of(cpu));

See?

That lets userspace still override the hint but does at least initial
spreading within the housekeeping mask. Whichever mask that is out of
the zoo of masks you best debate with Frederic. :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace
  2025-11-26 19:15 ` Thomas Gleixner
@ 2025-11-27 14:06   ` Preclik, Tobias
  2025-11-27 14:52   ` Florian Bezdeka
  1 sibling, 0 replies; 25+ messages in thread
From: Preclik, Tobias @ 2025-11-27 14:06 UTC (permalink / raw)
To: tglx@linutronix.de, Bezdeka, Florian, bigeasy@linutronix.de
Cc: frederic@kernel.org, linux-rt-users@vger.kernel.org, Kiszka, Jan,
    longman@redhat.com, gmonaco@redhat.com

On Wed, 2025-11-26 at 20:15 +0100, Thomas Gleixner wrote:
> On Wed Nov 26 2025 at 16:07, Florian Bezdeka wrote:
> > On Wed Nov 26, 2025 at 3:26 PM CET, Thomas Gleixner wrote:
> > > The question is whether that affinity hint has a functional
> > > requirement to be applied or not. I don't think so because those
> > > interrupts can be moved by userspace as it sees fit.
> >
> > The background seems to be performance. Those NICs support link speeds
> > up to (or even above) 2.5 Gbit/s. It seems hard to fully utilize the
> > link when all queues are routed - IRQ wise - to a single core.
> >
> > This is now the point where the IRQ chip matters. Some (like the APIC
> > on x86) have IRQ balancing implemented in SW, while others don't have
> > that. So the driver does that manually by ignoring all the RT settings.
>
> Hardware interrupt balancing never worked right :)
>
> APIC "supports" it in logical/cluster mode, but in fact 99% of the
> interrupts ended up on the lowest APIC in the logical/cluster mask. So
> we gave up on it because the benefit was close to zero and the
> complexity for multi-CPU affinity management with the limited vector
> space was just not worth it. In high performance setups the interrupts
> were anyway steered to a single CPU by the admin or irqbalanced :)
>
> ARM64 would support that too IIRC, but they decided to avoid the whole
> multi-CPU affinity mess as well :)
>
> > > So it's easy enough to make this "set" part conditional and restrict
> > > it to some TBD mask (housekeeping, default ...) under some isolation
> > > magic.
> >
> > For now I would be happy if I could modify the stmmac in a way that its
> > balancing takes the default affinity into account. I couldn't find any
> > available API that allows me to do so from a module.
> >
> > Are there any strong reasons for not exporting the default affinity
> > from the IRQ core? Read-only would be enough.
>
> Default affinity is yet another piece which is disconnected from all the
> other isolation mechanics. So we are not exporting it for some quick and
> dirty hack. You can do that of course in your own kernel, but please
> don't send the result to my inbox :)
>
> > In addition I'm quite sure that the housekeeping infrastructure would
> > not help in the area of networking as nobody (except one driver) is
> > based on the managed IRQ API.
>
> Managed interrupts are not user steerable and due to their strict
> CPU/CPUgroup relationship they are not required to be steerable. NVME &
> al have a strict command/response on the same queue scheme, which is
> obviously most efficient when you have per CPU queues. The nice thing
> about that concept is that the queues are only active (and having
> interrupts) when an application on a given CPU issues a R/W operation.
>
> Networking does not have that by default as its strategy of routing
> packets to queues is way more complicated and can be affected by
> hardware filtering etc.
>
> But why can't housekeeping help in general and why do you want to hack
> around the problem in random drivers?

I think a dynamic set of housekeeping cpus would partly help. Just like
we use default_smp_affinity/smp_affinity today to steer interrupts to
housekeeping cpus. But stmmac disregards the affinities set from
userspace when balancing interrupts on ifup/xdp program load. We thus
can neither affine interrupts of these network interfaces to isolated
cores (when used for real-time communication) nor can we restrict them
to housekeeping to protect rt workloads.

I shared some repro steps in the beginning of this thread:
https://lore.kernel.org/linux-rt-users/20251111135835.EXCy4ajR@linutronix.de/T/#m0a23cbedada4ceb0a61ad3c7ea81a150c7578ec8

> What's wrong with providing a new irq_set_affinity_hint_xxx() variant
> which takes an additional queue number as argument and let that do:
>
>	if (isolate) {
>		weight = cpumask_weight(housekeeping);
>		qnr %= weight;
>		cpu = cpumask_nth(qnr, housekeeping);
>		mask = cpumask_of(cpu);
>	}
>	return irq_set_affinity_hint(mask);
>
> or something like that. From a quick glance over the drivers this could
> maybe be based on a queue number alone as most drivers do:
>
>	mask = cpumask_of(qnr % num_online_cpus());
>
> or something daft like that, which is obviously broken, but who cares.
> So that would become:
>
>	if (isolate) {
>		weight = cpumask_weight(housekeeping);
>		qnr %= weight;
>		cpu = cpumask_nth(qnr, housekeeping);
>	} else {
>		guard(cpus_read_lock)();
>		qnr %= num_online_cpus();
>		cpu = cpumask_nth(qnr, cpu_online_mask);
>	}
>
>	return irq_set_affinity_hint(cpumask_of(cpu));
>
> See?
>
> That lets userspace still override the hint but does at least initial
> spreading within the housekeeping mask. Whichever mask that is out of
> the zoo of masks you best debate with Frederic. :)

In the proposed change stmmac would still ignore and overwrite
smp_affinity set from userspace (by calling irq_set_affinity_(and_)hint
in the end) in seemingly random situations (like xdp program load) and
we would thus lose userspace affining of interrupts to RT application
cores.

The irq balancing in the driver should in my opinion only influence the
effective irq affinity. How about splitting smp_affinity into a
userspace and a kernel mask? This would allow us to select the
appropriate effective mask without discarding userspace intents.
Something like this:

	if (smp_affinity is set from userspace) {
		/* Do not balance on driver level if affinity was set
		 * from userspace. */
		mask = smp_affinity;
	} else if (default_smp_affinity is set from userspace) {
		/* If smp_affinity is not set from userspace then
		 * balance on default_smp_affinity. */
		weight = cpumask_weight(default_smp_affinity);
		qnr %= weight;
		cpu = cpumask_nth(qnr, default_smp_affinity);
		mask = cpumask_of(cpu);
	} else if (housekeeping active) {
		/* If default_smp_affinity is not set from userspace
		 * either, balance on the housekeeping cpus. */
		weight = cpumask_weight(housekeeping);
		qnr %= weight;
		cpu = cpumask_nth(qnr, housekeeping);
		mask = cpumask_of(cpu);
	} else {
		/* If housekeeping cpus are not set either, fall back
		 * to balancing on the online cpus. */
		guard(cpus_read_lock)();
		qnr %= num_online_cpus();
		cpu = cpumask_nth(qnr, cpu_online_mask);
		mask = cpumask_of(cpu);
	}

	/* Do not overwrite smp_affinity here: */
	smp_balanced_affinity = mask;

	/*
	 * smp_balanced_affinity takes into account all affinities set
	 * from userspace and driver balancing. Setting smp_affinity in
	 * procfs should be directly reflected in smp_balanced_affinity.
	 * smp_balanced_affinity should then be taken into account when
	 * deriving the effective affinity.
	 */

That way (given some more modifications) the affinities from userspace
would stay intact and we could properly steer (not managed) interrupts
to housekeeping and non-housekeeping cpus as we do it today with other
drivers not performing irq balancing.

Does that make sense on your side?

Best,
Tobias

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace
  2025-11-26 19:15 ` Thomas Gleixner
  2025-11-27 14:06   ` Preclik, Tobias
@ 2025-11-27 14:52   ` Florian Bezdeka
  2025-11-27 18:09     ` Thomas Gleixner
  1 sibling, 1 reply; 25+ messages in thread
From: Florian Bezdeka @ 2025-11-27 14:52 UTC (permalink / raw)
To: Thomas Gleixner, bigeasy@linutronix.de
Cc: Preclik, Tobias, Frederic Weisbecker, linux-rt-users@vger.kernel.org,
    Kiszka, Jan, Waiman Long, Gabriele Monaco

On Wed Nov 26, 2025 at 8:15 PM CET, Thomas Gleixner wrote:
>> Are there any strong reasons for not exporting the default affinity
>> from the IRQ core? Read-only would be enough.
>
> Default affinity is yet another piece which is disconnected from all the
> other isolation mechanics. So we are not exporting it for some quick and
> dirty hack. You can do that of course in your own kernel, but please
> don't send the result to my inbox :)

Well, that's the reason for this discussion here: Upstream first.

>> In addition I'm quite sure that the housekeeping infrastructure would
>> not help in the area of networking as nobody (except one driver) is
>> based on the managed IRQ API.
>
> Managed interrupts are not user steerable and due to their strict
> CPU/CPUgroup relationship they are not required to be steerable. NVME &
> al have a strict command/response on the same queue scheme, which is
> obviously most efficient when you have per CPU queues. The nice thing
> about that concept is that the queues are only active (and having
> interrupts) when an application on a given CPU issues a R/W operation.
>
> Networking does not have that by default as its strategy of routing
> packets to queues is way more complicated and can be affected by
> hardware filtering etc.
>
> But why can't housekeeping help in general and why do you want to hack
> around the problem in random drivers?

No, that's not what I want. I'm highly interested in solving this
problem properly. Just trying to collect all the information at the
moment. I'm quite sure there is still something around that I did not
take into account yet.

> What's wrong with providing a new irq_set_affinity_hint_xxx() variant
> which takes an additional queue number as argument and let that do:
>
>	if (isolate) {
>		weight = cpumask_weight(housekeeping);
>		qnr %= weight;
>		cpu = cpumask_nth(qnr, housekeeping);
>		mask = cpumask_of(cpu);
>	}
>	return irq_set_affinity_hint(mask);
>
> or something like that. From a quick glance over the drivers this could
> maybe be based on a queue number alone as most drivers do:
>
>	mask = cpumask_of(qnr % num_online_cpus());
>
> or something daft like that, which is obviously broken, but who cares.
> So that would become:
>
>	if (isolate) {
>		weight = cpumask_weight(housekeeping);
>		qnr %= weight;
>		cpu = cpumask_nth(qnr, housekeeping);
>	} else {
>		guard(cpus_read_lock)();
>		qnr %= num_online_cpus();
>		cpu = cpumask_nth(qnr, cpu_online_mask);
>	}
>
>	return irq_set_affinity_hint(cpumask_of(cpu));
>
> See?

That is close to an RFC that I was already preparing, until I realized
that it would only solve one part of the problem.

Part one: Get rid of unwanted IRQ traffic on my isolated cores. That
part would be covered as the balancing would be limited to !RT cores.
Fine.

Part two: In case the device is actually being used by an RT application
and allowed to run on isolated cores (userspace has properly configured
that upfront) we would get the opposite after loading a BPF: IRQs are
now configured wrong.

> That lets userspace still override the hint but does at least initial
> spreading within the housekeeping mask. Whichever mask that is out of
> the zoo of masks you best debate with Frederic. :)

Choosing the right mask is key. The right mask depends on the usage of
the device. Some devices (or maybe even just some queues) should be
limited to !RT CPUs, while others should explicitly run within an
isolated cpuset.

If I'm getting this right, the work from Frederic will bring in the
"isolated flag" for cpusets. That seems great preparation work. In
addition we would need something like a mapping between devices (or
queues maybe indirectly via IRQs) and cgroup/cpusets.

Have there been thoughts around a cpuset.interrupts API - or something
similar - already?

Best regards,
Florian

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace
  2025-11-27 14:52 ` Florian Bezdeka
@ 2025-11-27 18:09   ` Thomas Gleixner
  2025-11-28  7:33     ` Florian Bezdeka
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2025-11-27 18:09 UTC (permalink / raw)
To: Florian Bezdeka, bigeasy@linutronix.de
Cc: Preclik, Tobias, Frederic Weisbecker, linux-rt-users@vger.kernel.org,
    Kiszka, Jan, Waiman Long, Gabriele Monaco

On Thu, Nov 27 2025 at 15:52, Florian Bezdeka wrote:
> On Wed Nov 26, 2025 at 8:15 PM CET, Thomas Gleixner wrote:
>> So that would become:
>>
>>	if (isolate) {
>>		weight = cpumask_weight(housekeeping);
>>		qnr %= weight;
>>		cpu = cpumask_nth(qnr, housekeeping);
>>	} else {
>>		guard(cpus_read_lock)();
>>		qnr %= num_online_cpus();
>>		cpu = cpumask_nth(qnr, cpu_online_mask);
>>	}
>>
>>	return irq_set_affinity_hint(cpumask_of(cpu));
>>
>> See?
>
> That is close to an RFC that I was already preparing, until I realized
> that it would only solve one part of the problem.
>
> Part one: Get rid of unwanted IRQ traffic on my isolated cores. That
> part would be covered as the balancing would be limited to !RT cores.
> Fine.
>
> Part two: In case the device is actually being used by an RT application
> and allowed to run on isolated cores (userspace has properly configured
> that upfront) we would get the opposite after loading a BPF: IRQs are
> now configured wrong.

I just went and looked at that stmmac driver once more. The way it sets
up those affinity hints is actually stupid and leads exactly to the
effects you describe.

The hints should be set exactly once, when MSI is enabled and the
interrupts are allocated, and not after request_irq().

So the first request_irq() will use that hinted affinity. In case
userspace changed the affinity, the setting is preserved across a
free_irq()/request_irq() sequence unless all CPUs in the affinity mask
have gone offline.

That preservation was explicitly added on request of networking people,
but then someone got it wrong and that request_irq()/set_hint() sequence
started a Copy&Pasta spreading disease. Oh well...

So yes, you have to fix that driver and do the affinity hint business
right after pci_alloc_irq_vectors() and clear it when the driver shuts
down. Looking at intel_eth_pci_remove(), that's another trainwreck as it
does not do any PCI related cleanup despite claiming so....

But the more I look at that whole hint usage, the more I'm convinced
that it is in most cases actively wrong. It only really makes sense when
there is an actual 1:1 relationship of queues to CPUs like in the NVME
case.

I'm pretty sure by now that this is in most cases used to ensure that
the interrupts are spread out properly. But that spreading is only done
to ensure that not all interrupts end up on CPU0 or whatever the
architecture specific interrupt management decides to do. x86 used to
prefer CPU0, but nowadays it tries to spread interrupts across CPUs
within the provided affinity mask. Not perfect but better than before :)

So the right thing here is to expand the functionality of
irq_calc_affinity_vectors() and group_cpus_evenly() to:

 1) Take isolation masks into account (opt-in and/or system wide knob)

 2) Do the spreading over the interrupt sets without setting the
    managed bit in the mask descriptor.

Then use pci_alloc_irq_vectors_affinity(), which does the spreading and
assigns the resulting affinities during interrupt descriptor allocation.

With that the whole hint business can be removed because it has zero
value after the initial setup.

But that's a discussion to be had on LKML/netdev and not on the RT devel
list.

>> That lets userspace still override the hint but does at least initial
>> spreading within the housekeeping mask. Whichever mask that is out of
>> the zoo of masks you best debate with Frederic. :)
>
> Choosing the right mask is key. The right mask depends on the usage of
> the device. Some devices (or maybe even just some queues) should be
> limited to !RT CPUs, while others should explicitly run within an
> isolated cpuset.

You can't know that upfront. That's a policy decision and userspace has
to make it.

What the kernel can do is to take isolation into account when doing the
initial setup. Though that needs a lot of thought and presumably an
opt-in knob:

Depending on your isolation constraints there might only be a single
housekeeping CPU, which means, depending on the number of devices and
their queue/interrupt requirements, that single CPU might run into
vector exhaustion pretty fast.

> If I'm getting this right, the work from Frederic will bring in the
> "isolated flag" for cpusets. That seems great preparation work. In
> addition we would need something like a mapping between devices (or
> queues maybe indirectly via IRQs) and cgroup/cpusets.
>
> Have there been thoughts around a cpuset.interrupts API - or something
> similar - already?

There was some mumbling about propagating isolation into the interrupt
world, but as far as I can tell there is no plan or idea how that should
look like. But that's again a discussion to be held on LKML.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace
  2025-11-27 18:09 ` Thomas Gleixner
@ 2025-11-28  7:33   ` Florian Bezdeka
  0 siblings, 0 replies; 25+ messages in thread
From: Florian Bezdeka @ 2025-11-28  7:33 UTC (permalink / raw)
To: Thomas Gleixner, Florian Bezdeka, bigeasy@linutronix.de
Cc: Preclik, Tobias, Frederic Weisbecker, linux-rt-users@vger.kernel.org,
    Kiszka, Jan, Waiman Long, Gabriele Monaco

On Thu Nov 27, 2025 at 7:09 PM CET, Thomas Gleixner wrote:
> On Thu, Nov 27 2025 at 15:52, Florian Bezdeka wrote:
>> On Wed Nov 26, 2025 at 8:15 PM CET, Thomas Gleixner wrote:
>
> I just went and looked at that stmmac driver once more. The way it sets
> up those affinity hints is actually stupid and leads exactly to the
> effects you describe.
>
> The hints should be set exactly once, when MSI is enabled and the
> interrupts are allocated, and not after request_irq().
>
> So the first request_irq() will use that hinted affinity. In case
> userspace changed the affinity, the setting is preserved across a
> free_irq()/request_irq() sequence unless all CPUs in the affinity mask
> have gone offline.
>
> That preservation was explicitly added on request of networking people,
> but then someone got it wrong and that request_irq()/set_hint() sequence
> started a Copy&Pasta spreading disease. Oh well...
>
> So yes, you have to fix that driver and do the affinity hint business
> right after pci_alloc_irq_vectors() and clear it when the driver shuts
> down. Looking at intel_eth_pci_remove(), that's another trainwreck as it
> does not do any PCI related cleanup despite claiming so....
>
> But the more I look at that whole hint usage, the more I'm convinced
> that it is in most cases actively wrong. It only really makes sense when
> there is an actual 1:1 relationship of queues to CPUs like in the NVME
> case.
>
> I'm pretty sure by now that this is in most cases used to ensure that
> the interrupts are spread out properly. But that spreading is only done
> to ensure that not all interrupts end up on CPU0 or whatever the
> architecture specific interrupt management decides to do. x86 used to
> prefer CPU0, but nowadays it tries to spread interrupts across CPUs
> within the provided affinity mask. Not perfect but better than before :)
>
> So the right thing here is to expand the functionality of
> irq_calc_affinity_vectors() and group_cpus_evenly() to:
>
>  1) Take isolation masks into account (opt-in and/or system wide knob)
>
>  2) Do the spreading over the interrupt sets without setting the
>     managed bit in the mask descriptor.
>
> Then use pci_alloc_irq_vectors_affinity(), which does the spreading and
> assigns the resulting affinities during interrupt descriptor allocation.
>
> With that the whole hint business can be removed because it has zero
> value after the initial setup.
>
> But that's a discussion to be had on LKML/netdev and not on the RT devel
> list.

Thanks Thomas, this is now going in the expected direction. Let me look
at that proposal in more detail. Once I have tested that and it works we
will migrate the discussion to LKML/netdev.

>> Choosing the right mask is key. The right mask depends on the usage of
>> the device. Some devices (or maybe even just some queues) should be
>> limited to !RT CPUs, while others should explicitly run within an
>> isolated cpuset.
>
> You can't know that upfront. That's a policy decision and userspace has
> to make it.
>
> What the kernel can do is to take isolation into account when doing the
> initial setup. Though that needs a lot of thought and presumably an
> opt-in knob:
>
> Depending on your isolation constraints there might only be a single
> housekeeping CPU, which means, depending on the number of devices and
> their queue/interrupt requirements, that single CPU might run into
> vector exhaustion pretty fast.

This is definitely something that we should keep an eye on.

>> If I'm getting this right, the work from Frederic will bring in the
>> "isolated flag" for cpusets. That seems great preparation work. In
>> addition we would need something like a mapping between devices (or
>> queues maybe indirectly via IRQs) and cgroup/cpusets.
>>
>> Have there been thoughts around a cpuset.interrupts API - or something
>> similar - already?
>
> There was some mumbling about propagating isolation into the interrupt
> world, but as far as I can tell there is no plan or idea how that should
> look like. But that's again a discussion to be held on LKML.

We have the /proc/irq/default_smp_affinity which is already available
but decoupled from the cgroup/cpuset isolation feature of Frederic - as
it seems. Let me check how those settings could be merged at the end.

Thanks a lot Thomas!

Best regards,
Florian

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Control of IRQ Affinities from Userspace 2025-11-26 9:20 ` Florian Bezdeka 2025-11-26 14:26 ` Thomas Gleixner @ 2025-11-26 15:45 ` Frederic Weisbecker 1 sibling, 0 replies; 25+ messages in thread From: Frederic Weisbecker @ 2025-11-26 15:45 UTC (permalink / raw) To: Florian Bezdeka Cc: Thomas Gleixner, bigeasy@linutronix.de, Preclik, Tobias, linux-rt-users@vger.kernel.org, Kiszka, Jan Le Wed, Nov 26, 2025 at 10:20:53AM +0100, Florian Bezdeka a écrit : > > drivers/net/ethernet/wangxun/libwx/wx_lib.c:1867: nvecs = pci_alloc_irq_vectors_affinity(wx->pdev, nvecs, > > drivers/nvme/host/pci.c:2659: return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, flags, > > drivers/scsi/be2iscsi/be_main.c:3585: if (pci_alloc_irq_vectors_affinity(phba->pcidev, 2, nvec, > > drivers/scsi/csiostor/csio_isr.c:520: cnt = pci_alloc_irq_vectors_affinity(hw->pdev, min, cnt, > > drivers/scsi/hisi_sas/hisi_sas_v3_hw.c:2611: vectors = pci_alloc_irq_vectors_affinity(pdev, > > drivers/scsi/megaraid/megaraid_sas_base.c:5943: i = pci_alloc_irq_vectors_affinity(instance->pdev, > > drivers/scsi/mpi3mr/mpi3mr_fw.c:862: retval = pci_alloc_irq_vectors_affinity(mrioc->pdev, > > drivers/scsi/mpt3sas/mpt3sas_base.c:3390: i = pci_alloc_irq_vectors_affinity(ioc->pdev, > > drivers/scsi/pm8001/pm8001_init.c:982: rc = pci_alloc_irq_vectors_affinity( > > drivers/scsi/qla2xxx/qla_isr.c:4539: ret = pci_alloc_irq_vectors_affinity(ha->pdev, min_vecs, > > drivers/virtio/virtio_pci_common.c:160: err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors, > > > > Not a so dead road :) > > Grml... I definitely fat-fingered the query. Anyway, this housekeeping > API still seems very boot-time oriented to me. Can't see yet where this > housekeeping cpumasks are filled up by the cgroup/cpuset infrastructure. It's on the way: https://lore.kernel.org/all/20251105210348.35256-1-frederic@kernel.org/ If all goes well, not for the upcoming merge window but the next one. 
> cgroups/cpusets don't care about "isolation" in that sense yet. It's
> just about cpumasks for compute. Am I missing something?

It's a bit more than just scheduler domain isolation. It also handles
kthreads and workqueues. It's also going to handle unbound timers (on
the way to the upcoming merge window).

Thanks.

> Florian

-- 
Frederic Weisbecker
SUSE Labs
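Editorial aside: the "very boot-time oriented" limitation discussed
above is visible in how the housekeeping set for managed interrupts is
configured today - on the kernel command line, with the documented
isolcpus flags. A fragment for a hypothetical 8-CPU machine that
isolates CPUs 4-7 (being a boot parameter, it cannot follow cpuset
changes at runtime):

```
isolcpus=managed_irq,domain,4-7
```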
* Re: Control of IRQ Affinities from Userspace
  2025-11-25 11:50 ` bigeasy
  2025-11-25 14:36 ` Florian Bezdeka
@ 2025-11-26 15:31 ` Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2025-11-26 15:31 UTC (permalink / raw)
To: bigeasy@linutronix.de
Cc: Florian Bezdeka, Preclik, Tobias, linux-rt-users@vger.kernel.org,
    Kiszka, Jan

On Tue, Nov 25, 2025 at 12:50:08PM +0100, bigeasy@linutronix.de wrote:
> On 2025-11-25 12:32:39 [+0100], Florian Bezdeka wrote:
> > > It seems that if you exclude certain CPUs from getting interrupt
> > > handling then it should work fine. Then the driver would only
> > > balance the interrupts among the CPUs that are left.
> >
> > Sebastian, what exactly do you mean by "exclude certain CPUs from
> > getting interrupt handling"? I mean, that is what we do by
> > configuring the /proc/<irq>/smp_affinity_list interface.
>
> Step #1
> - figure out if isolcpus= is restricting the affinity of requested
>   interrupts to housekeeping CPUs only
>
> Step #2
> - Yes
>   => look for the matching knob in the cgroup interface
>      Knob found?
>      - Yes
>        => Use knob.
>      - No
>        => Add knob.
> - No
>   This should be added as it breaks the expectation of an isolated
>   system.
>
> I *think* the driver should request as many interrupts as there are
> available CPUs in the system to handle them. The number of available
> CPUs / the CPU mask should be a configurable knob for the user. Using
> the housekeeping CPUs as a default mask seems reasonable.
> The question is what should happen if the mask changes at runtime.
> Maybe a device needs to reconfigure, maybe just move the interrupt
> away. But this should also affect NOHZ_FULL workloads.

Right now, you still need to change the affinity of an IRQ by hand
through /proc to match a new isolated cpuset partition. But ideally this
should be automatically handled by cpuset. If someone wants to tackle
that, it would be greatly appreciated.
As for those IRQs whose affinity can only be controlled by
isolcpus=managed_irq, this is more complicated, but probably not
unfeasible.

> > To sum up:
> > - The IRQ balancing issue is not limited to a single driver /
> >   subsystem
> > - The managed IRQ infrastructure seems very "static", so insufficient
> >   for this problem. In addition we would have to migrate all affected
> >   drivers to the managed IRQ infrastructure first.
> >
> > We would love to hear further thoughts / ideas / comments about this
> > problem. We're highly interested in fixing this issue properly.
>
> If the "managed IRQ infrastructure" would help here then why not. Maybe
> Frederic has some insight here.

Not really, but being able to change managed_irq affinities at runtime
would certainly be welcome. I fear my first visit to the genirq
subsystem is only one month old, though, and I miss the cycles to dive
further there right now.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
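Editorial aside: Sebastian's step #1 above boils down to plain bitmask
arithmetic. A toy sketch of computing the housekeeping mask (the CPU
count and the isolated set are assumptions chosen for illustration;
nothing here touches /proc):

```shell
# Assumed layout: 8 CPUs, CPUs 4-7 isolated for RT work.
nr_cpus=8
isolated_mask=$(( 0xf0 ))                  # CPUs 4-7
all_mask=$(( (1 << nr_cpus) - 1 ))         # 0xff, every CPU
hk_mask=$(( all_mask & ~isolated_mask ))   # housekeeping = the rest
printf 'housekeeping mask: %x\n' "$hk_mask"   # CPUs 0-3 -> f

# Applying it as the default for newly requested IRQs would then be
# something like (requires root, intentionally not run here):
#   printf '%x' "$hk_mask" > /proc/irq/default_smp_affinity
```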
* Re: Control of IRQ Affinities from Userspace
  2025-11-25 11:32 ` Florian Bezdeka
  2025-11-25 11:50 ` bigeasy
@ 2025-11-26 15:24 ` Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2025-11-26 15:24 UTC (permalink / raw)
To: Florian Bezdeka
Cc: bigeasy@linutronix.de, Preclik, Tobias,
    linux-rt-users@vger.kernel.org, Kiszka, Jan

On Tue, Nov 25, 2025 at 12:32:39PM +0100, Florian Bezdeka wrote:
> On Mon, 2025-11-24 at 10:59 +0100, bigeasy@linutronix.de wrote:
> > On 2025-11-21 13:25:09 [+0000], Preclik, Tobias wrote:
> > > > I would be careful with the deprecated term here. The
> > > > functionality is not deprecated, just the interface is. The CPU
> > > > affinity has been migrated to a cgroup based interface. If the
> > > > matching irq affinity is missing then it should be added rather
> > > > than avoiding the whole "affinity is managed" interface, since
> > > > this looks as if it has been meant for your use case.
> > >
> > > As you point out the isolcpus interface is deprecated and it seems
> > > there exists no way to translate the managed_irq flag of the
> > > isolcpus interface into the cgroups based interface. My
> > > understanding is that
> >
> > I did not point out anything. I just suggested to test whether this
> > option is working for you and if it does, check if there is a
> > matching configuration knob in the cpusets/cgroups interface. As per
> >
> > https://www.suse.com/c/cpu-isolation-practical-example-part-5/
> >
> > in "3.2) Isolcpus" Frederic says that the options should be used if
> > the kernel/application "haven't been built with cpusets/cgroups
> > support". So it seems that this bit is either missing in the other
> > interface or hard to find.
>
> In case that was still unclear: We're using the dynamic system
> configuration features provided by cpusets/cgroups. No isolcpus= on
> the kernel cmdline anymore. With that all applications are built
> around cgroups.
> There is some userspace tooling around that takes care of proper
> system configuration / RT isolation.
>
> > > …
> > > > > The conclusion got lost:
> > > > >
> > > > > Other drivers like for example igb respect the interrupt
> > > > > affinities (both default and per-irq affinities). This leads me
> > > > > to believe that the irq rebalancing in the drivers should only
> > > > > affect the effective interrupt affinities. This admittedly is
> > > > > more involved than it appears at first because the interface
> > > > > interrupts would have to be balanced subject to multiple
> > > > > (potentially totally different) cpusets.
> > > >
> > > > Exactly. Maybe it would work to align the driver with what igb
> > > > does.
> > >
> > > Currently, stmmac sets IRQ affinity and hints for all IRQ
> > > configurations. But on x86 systems with IOAPIC, MSI-X vectors
> > > should be automatically balanced. If we remove the driver-based
> > > irq balancing then other architectures would not necessarily
> > > balance the interrupts anymore and would be impacted in terms of
> > > performance. Maybe driver-based irq balancing could be deactivated
> > > whenever the underlying system is capable of balancing them? That
> > > would of course only reduce the number of affected systems.
> > >
> > > In general I lack information on when drivers should (or are
> > > allowed to) balance interrupts on driver level and whether
> > > smp_affinity is allowed to be ignored and overwritten in that
> > > case. All documentation I have found so far remains rather
> > > unspecific.
> >
> > It seems that if you exclude certain CPUs from getting interrupt
> > handling then it should work fine. Then the driver would only
> > balance the interrupts among the CPUs that are left.
>
> Sebastian, what exactly do you mean by "exclude certain CPUs from
> getting interrupt handling"? I mean, that is what we do by configuring
> the /proc/<irq>/smp_affinity_list interface.
> The point here is that drivers (like the stmmac, storage, ...) simply
> ignore everything that was configured by userspace. As soon as one of
> the dynamic events (link up/down, bpf loading) occurs they destroy the
> current RT-aware system configuration.
>
> I was not successful in finding an API that would allow the driver(s)
> to do better. The default affinity (/proc/irq/default_smp_affinity) -
> as an example - is not visible from outside the IRQ core.
>
> The managed IRQ infrastructure that you mentioned seems coupled with
> the interfaces behind CONFIG_CPU_ISOLATION, which seems to be
> "static", so configured at boot time. Is that understanding correct?
> That would not be flexible enough as we don't know the system
> configuration at boot time.
>
> As we now have Frederic with us: Frederic, are there any plans to
> extend the housekeeping API to deal with cpuset creation? Not sure if
> that would be possible as it's hard to say if the newly created cpuset
> is targeting isolation or housekeeping...

I'm not sure what you mean by that. But HK_TYPE_DOMAIN will soon include
both isolcpus and cpuset isolated partitions. And the next step is to be
able to create nohz_full/isolated cpuset partitions.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
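Editorial aside: if drivers balanced only among "the CPUs that are
left", the spreading itself reduces to a modulo walk over the
housekeeping set. A toy model of that policy (the housekeeping CPUs and
the queue count are assumptions for illustration; bash arrays):

```shell
# Assumed: CPUs 0 and 1 are housekeeping, the NIC has 4 queues.
hk_cpus=(0 1)
for q in 0 1 2 3; do
    # Walk the housekeeping set round-robin, one CPU per queue.
    cpu=${hk_cpus[q % ${#hk_cpus[@]}]}
    echo "queue $q -> cpu $cpu"
done
```

With the assumed values this assigns queues 0 and 2 to CPU 0, and
queues 1 and 3 to CPU 1 - the isolated CPUs never appear.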
* Re: Control of IRQ Affinities from Userspace
  2025-11-03 17:12 ` Florian Bezdeka
  2025-11-05 13:11 ` Preclik, Tobias
@ 2025-11-11 13:58 ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-11-11 13:58 UTC (permalink / raw)
To: Florian Bezdeka
Cc: Preclik, Tobias, linux-rt-users@vger.kernel.org, Jan Kiszka

On 2025-11-03 18:12:48 [+0100], Florian Bezdeka wrote:
> I'm trying to jump in and adding some thoughts and results we got
> while analyzing this issue:
>
> What stmmac (and some more drivers) are trying to achieve here is some
> kind of handcrafted IRQ balancing, like the good old irqbalanced did
> in the past from usermode. Turns out that the situation about IRQ
> balancing is a bit inconsistent. Some IRQ chips (like the APIC on x86)
> do that "automatically" on driver level, many others don't. So drivers
> end up fiddling with affinities.

Doing it once during startup is probably okay. The problem is probably
that it forgets everything while it removes the IRQ and requests it
again during down/up. I guess this is simpler because the number of
interrupts can change if the networking queues have been changed. And
this is probably also invoked in that case.

> We can nicely tune IRQs and the affected affinities that have been
> requested during system boot. Tools like tuned can configure them
> using the APIs Tobias described. IRQs that are requested / set up
> after boot, during runtime, are kind of "problematic" for us, as
> there is no API that informs about a new IRQ. We would have to rescan
> /proc. But even if there would be such an API: That would be too
> late. The IRQ might have fired already.
>
> Once an affinity has been set (e.g. by tuned) this affinity is being
> restored when the IRQ comes back after a link up/down or bpf load.
> But: It might have happened that the situation on the system has
> changed. Even the default affinity could be different now.
> In case of the stmmac - and probably way more drivers - the default
> affinity is not taken into account anymore. The previous affinity is
> being restored unconditionally.
>
> I tried to modify stmmac and let it evaluate the default affinity
> while doing the IRQ balancing dance. That turned out to be working in
> the end, but each line violated several coding/style/abstraction
> rules. There is no API at driver level to read the current default
> affinity - or I missed it. I could send that hack out as an RFC if
> requested. Just let me know.

Several drivers tune the affinity based on what they think is best. The
usual is we start with the current CPU and increment the CPU with each
queue. This is not unique to networking but also happens with storage.
But we do have the "managed API" already.

> Thinking more about this problem - and trying to abstract that in a
> generalized way - triggered some ideas about "IRQ namespaces", similar
> to what we have for CPUs/memory/... in the cgroup world. Devices, or
> classes of devices, could be moved into namespaces instead of
> configuring them one by one. Thoughts welcome. The main challenge here
> is that we do not think about rt vs. non-rt. It's more about multiple
> RT applications running in parallel, well isolated from each other and
> the non-rt world.

The excluded "affinity" would be a good place to start. So if you have
16 CPUs but declare only two CPUs as housekeeping, it would make sense
to limit it to two interrupts if possible. Otherwise shuffle them among
the two available CPUs.

> Florian

Sebastian
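Editorial aside: one policy that would address the "restored
unconditionally" behavior described above is to intersect the
remembered mask with the current default before re-applying it, falling
back to the default when the intersection is empty. A toy sketch with
assumed mask values (no /proc access, pure arithmetic):

```shell
saved=$(( 0x0c ))        # mask a tool like tuned set before the ifdown
default=$(( 0x03 ))      # default_smp_affinity as it stands now
eff=$(( saved & default ))
if [ "$eff" -eq 0 ]; then
    eff=$default         # remembered CPUs are gone -> use the default
fi
printf '%x\n' "$eff"     # prints 3
```

Here the saved CPUs 2-3 no longer appear in the default, so the sketch
falls back to CPUs 0-1 instead of blindly restoring the stale mask.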
end of thread, other threads: [~2025-11-28  7:33 UTC | newest]

Thread overview: 25+ messages -- links below jump to the message on this page:
2025-10-30 14:20 Control of IRQ Affinities from Userspace Preclik, Tobias
2025-11-03 15:53 ` Sebastian Andrzej Siewior
2025-11-03 17:12 ` Florian Bezdeka
2025-11-05 13:11 ` Preclik, Tobias
2025-11-05 13:18 ` Preclik, Tobias
2025-11-11 14:35 ` bigeasy
2025-11-11 14:34 ` bigeasy
2025-11-21 13:25 ` Preclik, Tobias
2025-11-24  9:59 ` bigeasy
2025-11-25 11:32 ` Florian Bezdeka
2025-11-25 11:50 ` bigeasy
2025-11-25 14:36 ` Florian Bezdeka
2025-11-25 16:31 ` Thomas Gleixner
2025-11-26  9:20 ` Florian Bezdeka
2025-11-26 14:26 ` Thomas Gleixner
2025-11-26 15:07 ` Florian Bezdeka
2025-11-26 19:15 ` Thomas Gleixner
2025-11-27 14:06 ` Preclik, Tobias
2025-11-27 14:52 ` Florian Bezdeka
2025-11-27 18:09 ` Thomas Gleixner
2025-11-28  7:33 ` Florian Bezdeka
2025-11-26 15:45 ` Frederic Weisbecker
2025-11-26 15:31 ` Frederic Weisbecker
2025-11-26 15:24 ` Frederic Weisbecker
2025-11-11 13:58 ` Sebastian Andrzej Siewior