public inbox for linux-rt-users@vger.kernel.org
From: Thomas Gleixner <tglx@linutronix.de>
To: Florian Bezdeka <florian.bezdeka@siemens.com>,
	"bigeasy@linutronix.de" <bigeasy@linutronix.de>
Cc: "Preclik, Tobias" <tobias.preclik@siemens.com>,
	Frederic Weisbecker <frederic@kernel.org>,
	"linux-rt-users@vger.kernel.org" <linux-rt-users@vger.kernel.org>,
	"Kiszka, Jan" <jan.kiszka@siemens.com>,
	Waiman Long <longman@redhat.com>,
	Gabriele Monaco <gmonaco@redhat.com>
Subject: Re: Control of IRQ Affinities from Userspace
Date: Wed, 26 Nov 2025 20:15:57 +0100	[thread overview]
Message-ID: <877bvchafm.ffs@tglx> (raw)
In-Reply-To: <DEIPY3BS8AHL.1YJ1980URTLYH@siemens.com>

On Wed, Nov 26 2025 at 16:07, Florian Bezdeka wrote:
> On Wed Nov 26, 2025 at 3:26 PM CET, Thomas Gleixner wrote:
>> The question is whether that affinity hint has a functional requirement
>> to be applied or not. I don't think so because those interrupts can be
>> moved by userspace as it sees fit.
>
> The background seems to be performance. Those NICs support link speeds
> up to (or even above) 2.5 Gbit/s. It seems hard to fully utilize the
> link when all queues are routed - IRQ wise - to a single core.
>
> This is now the point where the IRQ chip matters. Some (like the APIC
> on x86) have IRQ balancing implemented in software, while others don't
> have that. So the driver does it manually by ignoring all the RT
> settings.

Hardware interrupt balancing never worked right :)

APIC "supports" it in logical/cluster mode, but in fact 99% of the
interrupts ended up on the lowest APIC in the logical/cluster mask. So
we gave up on it because the benefit was close to zero and the
complexity for multi-CPU affinity management with the limited vector
space was just not worth it. In high performance setups the interrupts
were steered to a single CPU by the admin or irqbalanced anyway :)

ARM64 would support that too IIRC, but they decided to avoid the whole
multi-CPU affinity mess as well :)

>> So it's easy enough to make this "set" part conditional and restrict it
>> to some TBD mask (housekeeping, default ...) under some isolation magic.
>>
>
> For now I would be happy if I could modify the stmmac driver in a way
> that its balancing takes the default affinity into account. I couldn't
> find any available API that allows me to do so from a module.
>
> Are there any strong reasons for not exporting the default affinity from
> the IRQ core? Read-only would be enough.

Default affinity is yet another piece which is disconnected from all the
other isolation mechanics. So we are not exporting it for some quick and
dirty hack. You can do that of course in your own kernel, but please
don't send the result to my inbox :)

> In addition I'm quite sure that the housekeeping infrastructure would
> not help in the area of networking, as hardly any driver (except one)
> is based on the managed IRQ API.

Managed interrupts are not user steerable, and due to their strict
CPU/CPU-group relationship they are not required to be steerable. NVMe
et al. have a strict command/response-on-the-same-queue scheme, which is
obviously most efficient when you have per-CPU queues. The nice thing
about that concept is that a queue is only active (and raising
interrupts) when an application on the given CPU issues a R/W operation.

Networking does not have that by default as its strategy of routing
packets to queues is way more complicated and can be affected by
hardware filtering etc.

But why can't housekeeping help in general and why do you want to hack
around the problem in random drivers?

What's wrong with providing a new irq_set_affinity_hint_xxx() variant
which takes an additional queue number as argument and lets that do:

    if (isolate) {
        weight = cpumask_weight(housekeeping);
        qnr %= weight;
        cpu = cpumask_nth(qnr, housekeeping);
        mask = cpumask_of(cpu);
    }
    return irq_set_affinity_hint(mask);
    
or something like that. From a quick glance over the drivers this could
maybe be based on a queue number alone as most drivers do:

      mask = cpumask_of(qnr % num_online_cpus());

or something daft like that, which is obviously broken, but who cares.
So that would become:

    if (isolate) {
        weight = cpumask_weight(housekeeping);
        qnr %= weight;
        cpu = cpumask_nth(qnr, housekeeping);
    } else {
        guard(cpus_read_lock)();
        qnr %= num_online_cpus();
        cpu = cpumask_nth(qnr, cpu_online_mask);
    }

    return irq_set_affinity_hint(cpumask_of(cpu));

See?

That lets userspace still override the hint, but at least does the
initial spreading within the housekeeping mask. Whichever mask out of
the zoo of masks that is, you'd best debate with Frederic. :)

Thanks,

        tglx


Thread overview: 25+ messages
2025-10-30 14:20 Control of IRQ Affinities from Userspace Preclik, Tobias
2025-11-03 15:53 ` Sebastian Andrzej Siewior
2025-11-03 17:12   ` Florian Bezdeka
2025-11-05 13:11     ` Preclik, Tobias
2025-11-05 13:18       ` Preclik, Tobias
2025-11-11 14:35         ` bigeasy
2025-11-11 14:34       ` bigeasy
2025-11-21 13:25         ` Preclik, Tobias
2025-11-24  9:59           ` bigeasy
2025-11-25 11:32             ` Florian Bezdeka
2025-11-25 11:50               ` bigeasy
2025-11-25 14:36                 ` Florian Bezdeka
2025-11-25 16:31                   ` Thomas Gleixner
2025-11-26  9:20                     ` Florian Bezdeka
2025-11-26 14:26                       ` Thomas Gleixner
2025-11-26 15:07                         ` Florian Bezdeka
2025-11-26 19:15                           ` Thomas Gleixner [this message]
2025-11-27 14:06                             ` Preclik, Tobias
2025-11-27 14:52                             ` Florian Bezdeka
2025-11-27 18:09                               ` Thomas Gleixner
2025-11-28  7:33                                 ` Florian Bezdeka
2025-11-26 15:45                       ` Frederic Weisbecker
2025-11-26 15:31                 ` Frederic Weisbecker
2025-11-26 15:24               ` Frederic Weisbecker
2025-11-11 13:58     ` Sebastian Andrzej Siewior
