From: Thomas Gleixner <tglx@linutronix.de>
To: Florian Bezdeka <florian.bezdeka@siemens.com>,
"bigeasy@linutronix.de" <bigeasy@linutronix.de>
Cc: "Preclik, Tobias" <tobias.preclik@siemens.com>,
Frederic Weisbecker <frederic@kernel.org>,
"linux-rt-users@vger.kernel.org" <linux-rt-users@vger.kernel.org>,
"Kiszka, Jan" <jan.kiszka@siemens.com>
Subject: Re: Control of IRQ Affinities from Userspace
Date: Tue, 25 Nov 2025 17:31:47 +0100
Message-ID: <87tsyigjkc.ffs@tglx>
In-Reply-To: <767a8c7c1c88d930c5e7d7b39e7081c3cb39a08c.camel@siemens.com>
On Tue, Nov 25 2025 at 15:36, Florian Bezdeka wrote:
> On Tue, 2025-11-25 at 12:50 +0100, bigeasy@linutronix.de wrote:
>> On 2025-11-25 12:32:39 [+0100], Florian Bezdeka wrote:
>> > > It seems that if you exclude certain CPUs from getting interrupt
>> > > handling then it should work fine. Then the driver would only balance
>> > > the interrupts among the CPUs that are left.
>> >
>> > Sebastian, what exactly do you mean by "exclude certain CPUs from
>> > getting interrupt handling"? I mean, that is what we do by configuring
>> > the /proc/<irq>/smp_affinity_list interface.
>>
>> Step #1
>> - figure out if isolcpus= is restricting the affinity of requested
>> interrupts to housekeeping CPUs only
>
> This question cannot be answered with a simple yes/no; it depends.
> Affinities are based on default_smp_affinity at creation time. But as it
> turned out, there are drivers that overwrite those affinities after IRQ
> creation.
Which ones?
>> I *think* the driver should request as many interrupts as there are
>> available CPUs in the system to handle them.
>>
>
> That does not match how networking (and some storage) drivers are
> designed. Those drivers are usually HW queue centric: a driver sets up
> an IRQ per queue pair (TX/RX). The number of HW queues is defined by
> the hardware and is decoupled from any CPU count.
>
> To optimize performance, drivers may spread / balance the IRQs / queues
> over the available CPUs and, while doing so, might ignore any previous
> RT configuration. Again: the performance optimization is valid, but how
> could we prevent violating RT settings?
That spreading happens, and whether it is a problem depends on how the
queues are grouped and how that matches your isolation requirements. NVME
certainly allocates a queue per CPU if there are enough available, and
those won't disturb your RT isolated CPUs as long as nothing issues I/O
on those CPUs.
Networking is a different story, but networking does not use managed
interrupts (except for one driver) and you can move them away from your
isolated CPUs after the device is set up.
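The post-setup move described above can be scripted from userspace through
/proc/irq/*/smp_affinity_list. A minimal sketch follows; the isolated set
"2-3" is an assumed example (match it to your isolcpus= setting), and the
helper names are illustrative, not kernel interfaces. Writes to managed or
immovable interrupts fail with EIO/EPERM and are simply skipped:

```python
#!/usr/bin/env python3
"""Sketch: repin movable IRQs onto housekeeping CPUs after device setup."""
import glob
import os

def parse_cpulist(s):
    """Expand a kernel cpulist like '0-2,5' into the set {0, 1, 2, 5}."""
    cpus = set()
    for part in s.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def format_cpulist(cpus):
    """Collapse a set of CPUs back into kernel cpulist syntax."""
    out, cpus = [], sorted(cpus)
    i = 0
    while i < len(cpus):
        j = i
        while j + 1 < len(cpus) and cpus[j + 1] == cpus[j] + 1:
            j += 1
        out.append(str(cpus[i]) if i == j else f"{cpus[i]}-{cpus[j]}")
        i = j + 1
    return ",".join(out)

def repin_irqs(isolated):
    """Point every movable IRQ at the complement of the isolated set."""
    housekeeping = format_cpulist(
        set(range(os.cpu_count())) - parse_cpulist(isolated))
    for path in glob.glob("/proc/irq/[0-9]*/smp_affinity_list"):
        try:
            with open(path, "w") as f:
                f.write(housekeeping)
        except OSError:
            pass  # managed or immovable interrupt: leave it alone

if __name__ == "__main__":
    repin_irqs("2-3")  # assumed example: isolcpus=2-3
```

This is exactly what tools like tuned or irqbalance's banned-CPU mask do
under the hood; it only helps for interrupts whose affinity is writable,
which is tglx's point about managed interrupts being excluded.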
There have been discussions how to keep interrupts by default off from
isolated CPUs, but I don't know where this stands. Frederic?
>> The number of available
>> CPUs/ CPU mask should be a configure knob by the user.
>>
> The user normally configures the number of HW queues that the NIC should
> use. In most cases in combination with some HW packet filters to achieve
> best packet separation. IMHO the user should not have to deal with any
> (additional) CPU mask on that level. RT tuning will / should handle
> that.
How so? The kernel magically knows what the user wants?
>> Using the
>> housekeeping CPUs as a default mask seems reasonable.
>> The question is what should happen if the mask changes at runtime. Maybe
>> a device needs to reconfigure, maybe just move the interrupt away.
>> But this should also affect NOHZ_FULL workloads.
>>
>> > To sum up:
>> > - The IRQ balancing issue is not limited to a single driver / subsystem
>> > - The managed IRQ infrastructure seems very "static", and thus
>> >   insufficient for this problem. In addition, we would have to migrate
>> >   all affected drivers to the managed IRQ infrastructure first.
>> >
>> > We would love to hear further thoughts / ideas / comments about this
>> > problem. We're highly interested in fixing this issue properly.
>>
>> If the "managed IRQ infrastructure" would help here, then why not use
>> it. Maybe Frederic has some insight here.
>
> I currently can't see how this could help.
>
> That looks like dead code to me. I started at irq_do_set_affinity() -
> which checks for managed IRQs - but I could not find any caller of
> irq_create_affinity_masks() - that is where the managed flag is set -
> which is actually used. The road seems dead at
> devm_platform_get_irqs_affinity(), which has no in-tree user.
# git grep -nH irq_create_affinity_masks drivers/
drivers/base/platform.c:424: desc = irq_create_affinity_masks(nvec, affd);
drivers/pci/msi/api.c:289: irq_create_affinity_masks(1, affd);
drivers/pci/msi/msi.c:405: affd ? irq_create_affinity_masks(nvec, affd) : NULL;
drivers/pci/msi/msi.c:695: affd ? irq_create_affinity_masks(nvec, affd) : NULL;
These three PCI ones are all going through pci_alloc_irq_vectors_affinity()
# git grep -nH pci_alloc_irq_vectors_affinity drivers/
drivers/net/ethernet/wangxun/libwx/wx_lib.c:1867: nvecs = pci_alloc_irq_vectors_affinity(wx->pdev, nvecs,
drivers/nvme/host/pci.c:2659: return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, flags,
drivers/scsi/be2iscsi/be_main.c:3585: if (pci_alloc_irq_vectors_affinity(phba->pcidev, 2, nvec,
drivers/scsi/csiostor/csio_isr.c:520: cnt = pci_alloc_irq_vectors_affinity(hw->pdev, min, cnt,
drivers/scsi/hisi_sas/hisi_sas_v3_hw.c:2611: vectors = pci_alloc_irq_vectors_affinity(pdev,
drivers/scsi/megaraid/megaraid_sas_base.c:5943: i = pci_alloc_irq_vectors_affinity(instance->pdev,
drivers/scsi/mpi3mr/mpi3mr_fw.c:862: retval = pci_alloc_irq_vectors_affinity(mrioc->pdev,
drivers/scsi/mpt3sas/mpt3sas_base.c:3390: i = pci_alloc_irq_vectors_affinity(ioc->pdev,
drivers/scsi/pm8001/pm8001_init.c:982: rc = pci_alloc_irq_vectors_affinity(
drivers/scsi/qla2xxx/qla_isr.c:4539: ret = pci_alloc_irq_vectors_affinity(ha->pdev, min_vecs,
drivers/virtio/virtio_pci_common.c:160: err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors,
Not such a dead road :)
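For illustration, the spreading that irq_create_affinity_masks() performs
for those callers can be modeled in a few lines. This is a toy userspace
approximation of the idea (each vector gets a contiguous, roughly equal
share of the CPUs), not the kernel's actual algorithm, which additionally
groups CPUs by NUMA node and handles pre/post vectors; the function name
is made up for the sketch:

```python
def spread_vectors(nvec, cpus):
    """Distribute nvec interrupt vectors over the given CPUs so each
    vector owns a contiguous, roughly equal block of CPUs.
    Assumes nvec <= len(cpus), as with per-queue-pair vectors."""
    masks = [[] for _ in range(nvec)]
    for i, cpu in enumerate(cpus):
        # Map CPU index i onto vector floor(i * nvec / ncpus):
        # the first ceil(ncpus/nvec) CPUs land on vector 0, and so on.
        masks[i * nvec // len(cpus)].append(cpu)
    return masks
```

The resulting per-vector masks are what the kernel pins the managed
interrupts to, which is why they cannot be moved from userspace afterwards:
the mapping from queue to CPU block is fixed at allocation time.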
Thanks,
tglx
Thread overview: 25+ messages
2025-10-30 14:20 Control of IRQ Affinities from Userspace Preclik, Tobias
2025-11-03 15:53 ` Sebastian Andrzej Siewior
2025-11-03 17:12 ` Florian Bezdeka
2025-11-05 13:11 ` Preclik, Tobias
2025-11-05 13:18 ` Preclik, Tobias
2025-11-11 14:35 ` bigeasy
2025-11-11 14:34 ` bigeasy
2025-11-21 13:25 ` Preclik, Tobias
2025-11-24 9:59 ` bigeasy
2025-11-25 11:32 ` Florian Bezdeka
2025-11-25 11:50 ` bigeasy
2025-11-25 14:36 ` Florian Bezdeka
2025-11-25 16:31 ` Thomas Gleixner [this message]
2025-11-26 9:20 ` Florian Bezdeka
2025-11-26 14:26 ` Thomas Gleixner
2025-11-26 15:07 ` Florian Bezdeka
2025-11-26 19:15 ` Thomas Gleixner
2025-11-27 14:06 ` Preclik, Tobias
2025-11-27 14:52 ` Florian Bezdeka
2025-11-27 18:09 ` Thomas Gleixner
2025-11-28 7:33 ` Florian Bezdeka
2025-11-26 15:45 ` Frederic Weisbecker
2025-11-26 15:31 ` Frederic Weisbecker
2025-11-26 15:24 ` Frederic Weisbecker
2025-11-11 13:58 ` Sebastian Andrzej Siewior