From: Thomas Gleixner <tglx@linutronix.de>
To: Florian Bezdeka <florian.bezdeka@siemens.com>,
"bigeasy@linutronix.de" <bigeasy@linutronix.de>
Cc: "Preclik, Tobias" <tobias.preclik@siemens.com>,
Frederic Weisbecker <frederic@kernel.org>,
"linux-rt-users@vger.kernel.org" <linux-rt-users@vger.kernel.org>,
"Kiszka, Jan" <jan.kiszka@siemens.com>,
Waiman Long <longman@redhat.com>,
Gabriele Monaco <gmonaco@redhat.com>
Subject: Re: Control of IRQ Affinities from Userspace
Date: Thu, 27 Nov 2025 19:09:52 +0100
Message-ID: <87v7ivfitr.ffs@tglx>
In-Reply-To: <DEJK91DAS7P0.1UN9SHE15VZRK@siemens.com>
On Thu, Nov 27 2025 at 15:52, Florian Bezdeka wrote:
> On Wed Nov 26, 2025 at 8:15 PM CET, Thomas Gleixner wrote:
>> So that would become:
>>
>> if (isolate) {
>> weight = cpumask_weight(housekeeping);
>> qnr %= weight;
>> cpu = cpumask_nth(qnr, housekeeping);
>> } else {
>> guard(cpus_read_lock)();
>> qnr %= num_online_cpus();
>> cpu = cpumask_nth(qnr, cpu_online_mask);
>> }
>>
>> return irq_set_affinity_hint(cpumask_of(cpu));
>>
>> See?
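
For illustration, the quoted selection logic can be modeled in plain
Python, with cpumask_weight()/cpumask_nth() reduced to operations over a
sorted list of CPU ids; the helper name pick_cpu below is made up for
the sketch, it is not kernel API:

```python
def pick_cpu(qnr, housekeeping, online, isolate):
    """Round-robin a queue number onto a CPU set, mirroring the
    'qnr %= weight; cpu = cpumask_nth(qnr, mask)' pattern quoted
    above. Masks are modeled as sets of CPU ids."""
    mask = sorted(housekeeping) if isolate else sorted(online)
    return mask[qnr % len(mask)]

# Queues 0..5 over housekeeping CPUs {0, 1}, with 4 CPUs online:
print([pick_cpu(q, {0, 1}, {0, 1, 2, 3}, True) for q in range(6)])
# → [0, 1, 0, 1, 0, 1]
```
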
>
> That is close to an RFC that I was already preparing, until I realized
> that it would only solve one part of the problem.
>
> Part one: Get rid of unwanted IRQ traffic on my isolated cores. That
> part would be covered as the balancing would be limited to !RT cores.
> Fine.
>
> Part two: In case the device is actually being used by an RT application
> and allowed to run on isolated cores (userspace has properly configured
> that upfront), we would get the opposite after loading a BPF program:
> IRQs are now configured wrong.

I just went and looked at that stmmac driver once more. The way it sets
up those affinity hints is actually stupid and leads exactly to the
effects you describe.

The hints should be set exactly once, when MSI is enabled and the
interrupts are allocated, and not after request_irq().

So the first request_irq() will use that hinted affinity. In case user
space changed the affinity, the setting is preserved across a
free_irq()/request_irq() sequence unless all CPUs in the affinity mask
have gone offline.

That preservation was explicitly added at the request of networking
people, but then someone got it wrong and that request_irq()/set_hint()
sequence started a Copy&Pasta spreading disease. Oh well...

So yes, you have to fix that driver and do the affinity hint business
right after pci_alloc_irq_vectors(), and clear it when the driver shuts
down. Looking at intel_eth_pci_remove(), that's another trainwreck, as
it does not do any PCI related cleanup despite claiming to...

But the more I look at that whole hint usage, the more I'm convinced
that it is in most cases actively wrong. It only really makes sense when
there is an actual 1:1 relationship of queues to CPUs, as in the NVMe
case.

I'm pretty sure by now that this is in most cases used to ensure that
the interrupts are spread out properly. But that spreading is only done
to ensure that not all interrupts end up on CPU0 or wherever the
architecture specific interrupt management decides to put them. x86 used
to prefer CPU0, but nowadays it tries to spread them across CPUs within
the provided affinity mask. Not perfect, but better than before :)

So the right thing here is to expand the functionality of
irq_calc_affinity_vectors() and group_cpus_evenly() to:
1) Take isolation masks into account (opt-in and/or system wide
knob)
2) Do the spreading over the interrupt sets without setting
the managed bit in the mask descriptor.
Then use pci_alloc_irq_vectors_affinity(), which does the spreading and
assigns the resulting affinities during interrupt descriptor allocation.
With that the whole hint business can be removed because it has zero
value after the initial setup.
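
A toy model of that spreading, assuming point 1) is handled by dropping
isolated CPUs from the usable set up front; this illustrates even group
spreading only, it is not the actual group_cpus_evenly() algorithm:

```python
def spread_evenly(numgrps, usable_cpus):
    """Split the usable CPUs into numgrps groups whose sizes differ
    by at most one, lowest CPU ids first."""
    cpus = sorted(usable_cpus)
    base, extra = divmod(len(cpus), numgrps)
    groups, pos = [], 0
    for g in range(numgrps):
        size = base + (1 if g < extra else 0)
        groups.append(cpus[pos:pos + size])
        pos += size
    return groups

# 3 queues over housekeeping CPUs {0, 1, 4, 5} (CPUs 2, 3 isolated):
print(spread_evenly(3, {0, 1, 4, 5}))
# → [[0, 1], [4], [5]]
```
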
But that's a discussion to be had on LKML/netdev and not on the RT devel
list.
>> That lets userspace still override the hint but does at least initial
>> spreading within the housekeeping mask. Which ever mask that is out of
>> the zoo of masks you best debate with Frederic. :)
>>
> Choosing the right mask is key. The right mask depends on the usage of
> the device. Some devices (or maybe even just some queues) should be
> limited to !RT CPUs, while others should explicitly run within a
> isolated cpuset.

You can't know that upfront. That's a policy decision and user space has
to make it.

What the kernel can do is take isolation into account when doing the
initial setup. Though that needs a lot of thought and presumably an
opt-in knob:

Depending on your isolation constraints there might be only a single
housekeeping CPU, which means that, depending on the number of devices
and their queue/interrupt requirements, that single CPU might run into
vector exhaustion pretty fast.
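
Back-of-the-envelope: assuming on the order of 200 usable vectors per
x86 CPU (an assumed budget, the real number varies), a handful of
multi-queue devices confined to one housekeeping CPU is already over the
limit:

```python
VECTORS_PER_CPU = 200  # assumed budget; the real x86 per-CPU number varies

def vectors_needed(devices, queues_per_device):
    # one interrupt vector per queue, all targeting the single
    # housekeeping CPU
    return devices * queues_per_device

demand = vectors_needed(devices=8, queues_per_device=32)
print(demand, demand > VECTORS_PER_CPU)
# → 256 True
```
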
> If I'm getting this right, the work from Frederic will bring in the
> "isolated flag" for cpusets. That seems like great preparation work. In
> addition we would need something like a mapping between devices (or
> queues, maybe indirectly via IRQs) and cgroups/cpusets.
>
> Have there been thoughts around a cpuset.interrupts API - or something
> similar - already?

There was some mumbling about propagating isolation into the interrupt
world, but as far as I can tell there is no plan or idea what that
should look like. But that's again a discussion to be held on LKML.

Thanks,
tglx