* IRQ thread timeouts and affinity
From: Thierry Reding @ 2025-10-09 11:38 UTC
To: Thomas Gleixner, Marc Zyngier
Cc: linux-tegra, linux-arm-kernel, linux-kernel

Hi Thomas, Marc, all,

Apologies up front for the length of this. There are a lot of details
that I want to share in order to, hopefully, make this as clear as
possible.

We've been running into an issue on some systems (NVIDIA Grace chips)
where either during boot or at runtime, CPU 0 can be under very high
load and cause some IRQ thread functions to be delayed to a point where
we encounter the timeout in the work submission parts of the driver.

Specifically this happens for the Tegra QSPI controller driver found
in drivers/spi/spi-tegra210-quad.c. This driver uses an IRQ thread to
wait for and process "transfer ready" interrupts (which need to run
DMA transfers or copy from the hardware FIFOs using PIO to get the
SPI transfer data). Under heavy load, we've seen the IRQ thread run
with up to multiple seconds of delay.

One solution that we've tried is to move parts of the IRQ handler into
the hard IRQ portion, and we observed that the interrupt is always seen
within the expected period of time. However, the IRQ thread still runs
very late in those cases.

To mitigate this, we're currently trying to gracefully recover on
timeout by checking the hardware state and processing as if no timeout
happened. This needs special care because eventually the IRQ thread
will run and try to process a SPI transfer that's already been
processed. It also isn't optimal because of, well, the timeout.

These devices have a *lot* of CPUs, and usually only CPU 0 tends to be
clogged (during boot); fio-based stress tests at runtime can also
trigger this case if they happen to run on CPU 0.

One workaround that has proven to work is to change the affinity of the
QSPI interrupt to whatever the current CPU is at probe time. That only
works as long as that CPU doesn't happen to be CPU 0, obviously. It
also doesn't work if we end up stress-testing the selected CPU at
runtime, so it's ultimately just a way of reducing the likelihood, not
of avoiding the problems entirely.

Which brings me to the actual question: what is the right way to solve
this? I had, maybe naively, assumed that the default CPU affinity, which
includes all available CPUs, would be sufficient to have interrupts
balanced across all of those CPUs, but that doesn't appear to be the
case. At least not with the GIC (v3) driver, which selects one CPU
(CPU 0 in this particular case) from the affinity mask to set the
"effective affinity", which then dictates where IRQs are handled and
where the corresponding IRQ thread function is run.

One potential solution I see is to avoid threaded IRQs for this because
they will cause all of the interrupts to be processed on CPU 0 by
default. A viable alternative would be to use work queues, which, to my
understanding, can (will?) be scheduled more flexibly.

Alternatively, would it be possible (and make sense) to make the IRQ
core code schedule threads across more CPUs? Is there a particular
reason that the IRQ thread runs on the same CPU that services the IRQ?

Maybe another way would be to "reserve" CPU 0 for the type of core OS
driver like QSPI (the TPM is connected to this controller) and make
sure all CPU-intensive tasks do not run on that CPU?

I know that things like irqbalance and taskset exist to solve some of
these problems, but they do not work when we hit these cases at boot
time.

Any other solutions that I haven't thought of?

Thanks,
Thierry
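As a concrete illustration of the probe-time affinity workaround mentioned
above, the relevant calls could look roughly like the sketch below. This is
not code from spi-tegra210-quad.c; the function name and surrounding
structure are invented, and only the affinity-hint call at the end is the
point of interest. devm_request_threaded_irq() and irq_set_affinity_hint()
are existing kernel APIs.

#include <linux/cpumask.h>
#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/platform_device.h>
#include <linux/smp.h>

/*
 * Sketch of the probe-time workaround: request the threaded IRQ as usual,
 * then steer it to whatever CPU probe happens to be running on, so that
 * the handler and its IRQ thread avoid the default effective affinity
 * (typically CPU 0 with the GICv3 driver).
 */
static int example_qspi_setup_irq(struct platform_device *pdev,
				  unsigned int irq, irq_handler_t handler,
				  irq_handler_t thread_fn, void *data)
{
	unsigned int cpu;
	int err;

	err = devm_request_threaded_irq(&pdev->dev, irq, handler, thread_fn,
					IRQF_ONESHOT, dev_name(&pdev->dev),
					data);
	if (err)
		return err;

	/* Snapshot the current CPU; the exact choice is only a heuristic. */
	cpu = get_cpu();
	put_cpu();

	/*
	 * Best effort: this only reduces the likelihood of sharing a CPU
	 * with a heavy workload. It does not help if the chosen CPU is
	 * CPU 0 or becomes loaded later.
	 */
	irq_set_affinity_hint(irq, cpumask_of(cpu));

	return 0;
}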
* Re: IRQ thread timeouts and affinity 2025-10-09 11:38 IRQ thread timeouts and affinity Thierry Reding @ 2025-10-09 14:30 ` Marc Zyngier 2025-10-09 16:05 ` Thierry Reding 2025-10-16 18:53 ` Thomas Gleixner 1 sibling, 1 reply; 16+ messages in thread From: Marc Zyngier @ 2025-10-09 14:30 UTC (permalink / raw) To: Thierry Reding Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel Hi Thierry, On Thu, 09 Oct 2025 12:38:55 +0100, Thierry Reding <thierry.reding@gmail.com> wrote: > > Which brings me to the actual question: what is the right way to solve > this? I had, maybe naively, assumed that the default CPU affinity, which > includes all available CPUs, would be sufficient to have interrupts > balanced across all of those CPUs, but that doesn't appear to be the > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0 > in this particular case) from the affinity mask to set the "effective > affinity", which then dictates where IRQs are handled and where the > corresponding IRQ thread function is run. There's a (GIC-specific) answer to that, and that's the "1 of N" distribution model. The problem is that it is a massive headache (it completely breaks with per-CPU context). We could try and hack this in somehow, but defining a reasonable API is complicated. The set of CPUs receiving 1:N interrupts is a *global* set, which means you cannot have one interrupt targeting CPUs 0-1, and another targeting CPUs 2-3. You can only have a single set for all 1:N interrupts. How would you define such a set in a platform agnostic manner so that a random driver could use this? I definitely don't want to have a GIC-specific API. Overall, there is quite a lot of work to be done in this space: the machine I'm typing this from doesn't have affinity control *at all*. Any interrupt can target any CPU, and if Linux doesn't expect that, tough. Don't even think of managed interrupts on that sort of systems... M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IRQ thread timeouts and affinity 2025-10-09 14:30 ` Marc Zyngier @ 2025-10-09 16:05 ` Thierry Reding 2025-10-09 17:04 ` Marc Zyngier 0 siblings, 1 reply; 16+ messages in thread From: Thierry Reding @ 2025-10-09 16:05 UTC (permalink / raw) To: Marc Zyngier; +Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 3587 bytes --] On Thu, Oct 09, 2025 at 03:30:56PM +0100, Marc Zyngier wrote: > Hi Thierry, > > On Thu, 09 Oct 2025 12:38:55 +0100, > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > Which brings me to the actual question: what is the right way to solve > > this? I had, maybe naively, assumed that the default CPU affinity, which > > includes all available CPUs, would be sufficient to have interrupts > > balanced across all of those CPUs, but that doesn't appear to be the > > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0 > > in this particular case) from the affinity mask to set the "effective > > affinity", which then dictates where IRQs are handled and where the > > corresponding IRQ thread function is run. > > There's a (GIC-specific) answer to that, and that's the "1 of N" > distribution model. The problem is that it is a massive headache (it > completely breaks with per-CPU context). Heh, that started out as a very promising first paragraph but turned ugly very quickly... =) > We could try and hack this in somehow, but defining a reasonable API > is complicated. The set of CPUs receiving 1:N interrupts is a *global* > set, which means you cannot have one interrupt targeting CPUs 0-1, and > another targeting CPUs 2-3. You can only have a single set for all 1:N > interrupts. How would you define such a set in a platform agnostic > manner so that a random driver could use this? I definitely don't want > to have a GIC-specific API. I see. I've been thinking that maybe the only way to solve this is using some sort of policy. A very simple policy might be: use CPU 0 as the "default" interrupt (much like it is now) because like you said there might be assumptions built-in that break when the interrupt is scheduled elsewhere. But then let individual drivers opt into the 1:N set, which would perhaps span all available CPUs but the first one. From an API PoV this would just be a flag that's passed to request_irq() (or one of its derivatives). > Overall, there is quite a lot of work to be done in this space: the > machine I'm typing this from doesn't have affinity control *at > all*. Any interrupt can target any CPU, Well, that actually sounds pretty nice for the use-case that we have... > and if Linux doesn't expect > that, tough. ... but yeah, it may also break things. > Don't even think of managed interrupts on that sort of > systems... I've seen some of the hardware drivers on the Grace devices distribute interrupts across multiple CPUs, but they do so via managed interrupts and multiple queues. I was trying to think if maybe that could be used for cases like QSPI as well. It's similar to just using a fixed CPU affinity, so it's hardly a great solution. I also didn't see anything outside of network and PCI use this (there's one exception in SATA), so I don't know if it's something that just isn't a good idea outside of multi-queue devices or if simply nobody has considered it. irqbalance sounds like it would work to avoid the worst, and it has built-in support to exclude certain CPUs from the balancing set. 
At the same time this seems like something that the kernel would be much better equipped to handle than a userspace daemon. Has anyone ever attempted to create an irqbalance but within the kernel? I should probably go look at how this works on x86 or PowerPC systems. I keep thinking that this cannot be a new problem, so other solutions must already exist. Thierry [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
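For illustration, the flag-based opt-in floated above might look something
like this from a driver's point of view. IRQF_SPREAD does not exist in
mainline and is invented purely for this sketch; devm_request_threaded_irq()
and IRQF_ONESHOT are the existing API.

#include <linux/device.h>
#include <linux/interrupt.h>

/*
 * IRQF_SPREAD is hypothetical: it stands in for a flag with which a driver
 * would declare that any CPU in the affinity mask may service this
 * interrupt and run its thread, while everything else keeps the current
 * single-CPU default.
 */
#define IRQF_SPREAD	0x01000000	/* hypothetical, not a real flag */

static int example_request_spread_irq(struct device *dev, unsigned int irq,
				      irq_handler_t handler,
				      irq_handler_t thread_fn, void *data)
{
	return devm_request_threaded_irq(dev, irq, handler, thread_fn,
					 IRQF_ONESHOT | IRQF_SPREAD,
					 dev_name(dev), data);
}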
* Re: IRQ thread timeouts and affinity 2025-10-09 16:05 ` Thierry Reding @ 2025-10-09 17:04 ` Marc Zyngier 2025-10-09 18:11 ` Marc Zyngier 0 siblings, 1 reply; 16+ messages in thread From: Marc Zyngier @ 2025-10-09 17:04 UTC (permalink / raw) To: Thierry Reding Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel On Thu, 09 Oct 2025 17:05:15 +0100, Thierry Reding <thierry.reding@gmail.com> wrote: > > [1 <text/plain; us-ascii (quoted-printable)>] > On Thu, Oct 09, 2025 at 03:30:56PM +0100, Marc Zyngier wrote: > > Hi Thierry, > > > > On Thu, 09 Oct 2025 12:38:55 +0100, > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > Which brings me to the actual question: what is the right way to solve > > > this? I had, maybe naively, assumed that the default CPU affinity, which > > > includes all available CPUs, would be sufficient to have interrupts > > > balanced across all of those CPUs, but that doesn't appear to be the > > > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0 > > > in this particular case) from the affinity mask to set the "effective > > > affinity", which then dictates where IRQs are handled and where the > > > corresponding IRQ thread function is run. > > > > There's a (GIC-specific) answer to that, and that's the "1 of N" > > distribution model. The problem is that it is a massive headache (it > > completely breaks with per-CPU context). > > Heh, that started out as a very promising first paragraph but turned > ugly very quickly... =) > > > We could try and hack this in somehow, but defining a reasonable API > > is complicated. The set of CPUs receiving 1:N interrupts is a *global* > > set, which means you cannot have one interrupt targeting CPUs 0-1, and > > another targeting CPUs 2-3. You can only have a single set for all 1:N > > interrupts. How would you define such a set in a platform agnostic > > manner so that a random driver could use this? I definitely don't want > > to have a GIC-specific API. > > I see. I've been thinking that maybe the only way to solve this is using > some sort of policy. A very simple policy might be: use CPU 0 as the > "default" interrupt (much like it is now) because like you said there > might be assumptions built-in that break when the interrupt is scheduled > elsewhere. But then let individual drivers opt into the 1:N set, which > would perhaps span all available CPUs but the first one. From an API PoV > this would just be a flag that's passed to request_irq() (or one of its > derivatives). The $10k question is how do you pick the victim CPUs? I can't see how to do it in a reasonable way unless we decide that interrupts that have an affinity matching cpu_possible_mask are 1:N. And then we're left with wondering what to do about CPU hotplug. > > > Overall, there is quite a lot of work to be done in this space: the > > machine I'm typing this from doesn't have affinity control *at > > all*. Any interrupt can target any CPU, > > Well, that actually sounds pretty nice for the use-case that we have... > > > and if Linux doesn't expect > > that, tough. > > ... but yeah, it may also break things. Yeah. With GICv3, only SPIs can be 1:N, but on this (fruity) box, even MSIs can be arbitrarily moved from one CPU to another. This is a ticking bomb. I'll see if I can squeeze out some time to look into this -- no promises though. M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IRQ thread timeouts and affinity 2025-10-09 17:04 ` Marc Zyngier @ 2025-10-09 18:11 ` Marc Zyngier 2025-10-10 13:50 ` Thierry Reding 0 siblings, 1 reply; 16+ messages in thread From: Marc Zyngier @ 2025-10-09 18:11 UTC (permalink / raw) To: Thierry Reding Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel On Thu, 09 Oct 2025 18:04:58 +0100, Marc Zyngier <maz@kernel.org> wrote: > > On Thu, 09 Oct 2025 17:05:15 +0100, > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > [1 <text/plain; us-ascii (quoted-printable)>] > > On Thu, Oct 09, 2025 at 03:30:56PM +0100, Marc Zyngier wrote: > > > Hi Thierry, > > > > > > On Thu, 09 Oct 2025 12:38:55 +0100, > > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > > > Which brings me to the actual question: what is the right way to solve > > > > this? I had, maybe naively, assumed that the default CPU affinity, which > > > > includes all available CPUs, would be sufficient to have interrupts > > > > balanced across all of those CPUs, but that doesn't appear to be the > > > > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0 > > > > in this particular case) from the affinity mask to set the "effective > > > > affinity", which then dictates where IRQs are handled and where the > > > > corresponding IRQ thread function is run. > > > > > > There's a (GIC-specific) answer to that, and that's the "1 of N" > > > distribution model. The problem is that it is a massive headache (it > > > completely breaks with per-CPU context). > > > > Heh, that started out as a very promising first paragraph but turned > > ugly very quickly... =) > > > > > We could try and hack this in somehow, but defining a reasonable API > > > is complicated. The set of CPUs receiving 1:N interrupts is a *global* > > > set, which means you cannot have one interrupt targeting CPUs 0-1, and > > > another targeting CPUs 2-3. You can only have a single set for all 1:N > > > interrupts. How would you define such a set in a platform agnostic > > > manner so that a random driver could use this? I definitely don't want > > > to have a GIC-specific API. > > > > I see. I've been thinking that maybe the only way to solve this is using > > some sort of policy. A very simple policy might be: use CPU 0 as the > > "default" interrupt (much like it is now) because like you said there > > might be assumptions built-in that break when the interrupt is scheduled > > elsewhere. But then let individual drivers opt into the 1:N set, which > > would perhaps span all available CPUs but the first one. From an API PoV > > this would just be a flag that's passed to request_irq() (or one of its > > derivatives). > > The $10k question is how do you pick the victim CPUs? I can't see how > to do it in a reasonable way unless we decide that interrupts that > have an affinity matching cpu_possible_mask are 1:N. And then we're > left with wondering what to do about CPU hotplug. For fun and giggles, here's the result of a 5 minute hack. It enables 1:N distribution on SPIs that have an "all cpus" affinity. It works on one machine, doesn't on another -- no idea why yet. YMMV. This is of course conditioned on your favourite HW supporting the 1:N feature, and it is likely that things will catch fire quickly. It will probably make your overall interrupt latency *worse*, but maybe less variable. Let me know. M. 
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index dbeb85677b08c..ab32339b32719 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -67,6 +67,7 @@ struct gic_chip_data {
 	u32			nr_redist_regions;
 	u64			flags;
 	bool			has_rss;
+	bool			has_oon;
 	unsigned int		ppi_nr;
 	struct partition_desc	**ppi_descs;
 };
@@ -1173,9 +1174,10 @@ static void gic_update_rdist_properties(void)
 	gic_iterate_rdists(__gic_update_rdist_properties);
 	if (WARN_ON(gic_data.ppi_nr == UINT_MAX))
 		gic_data.ppi_nr = 0;
-	pr_info("GICv3 features: %d PPIs%s%s\n",
+	pr_info("GICv3 features: %d PPIs%s%s%s\n",
 		gic_data.ppi_nr,
 		gic_data.has_rss ? ", RSS" : "",
+		gic_data.has_oon ? ", 1:N" : "",
 		gic_data.rdists.has_direct_lpi ? ", DirectLPI" : "");
 
 	if (gic_data.rdists.has_vlpis)
@@ -1481,6 +1483,7 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 	u32 offset, index;
 	void __iomem *reg;
 	int enabled;
+	bool oon;
 	u64 val;
 
 	if (force)
@@ -1488,6 +1491,8 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 	else
 		cpu = cpumask_any_and(mask_val, cpu_online_mask);
 
+	oon = gic_data.has_oon && cpumask_equal(mask_val, cpu_possible_mask);
+
 	if (cpu >= nr_cpu_ids)
 		return -EINVAL;
 
@@ -1501,7 +1506,7 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 
 	offset = convert_offset_index(d, GICD_IROUTER, &index);
 	reg = gic_dist_base(d) + offset + (index * 8);
-	val = gic_cpu_to_affinity(cpu);
+	val = oon ? GICD_IROUTER_SPI_MODE_ANY : gic_cpu_to_affinity(cpu);
 
 	gic_write_irouter(val, reg);
 
@@ -1512,7 +1517,7 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 	if (enabled)
 		gic_unmask_irq(d);
 
-	irq_data_update_effective_affinity(d, cpumask_of(cpu));
+	irq_data_update_effective_affinity(d, oon ? cpu_possible_mask : cpumask_of(cpu));
 
 	return IRQ_SET_MASK_OK_DONE;
 }
@@ -2114,6 +2119,7 @@ static int __init gic_init_bases(phys_addr_t dist_phys_base,
 	irq_domain_update_bus_token(gic_data.domain, DOMAIN_BUS_WIRED);
 
 	gic_data.has_rss = !!(typer & GICD_TYPER_RSS);
+	gic_data.has_oon = !(typer & GICD_TYPER_No1N);
 
 	if (typer & GICD_TYPER_MBIS) {
 		err = mbi_init(handle, gic_data.domain);
diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
index 70c0948f978eb..ffbfc1c8d1934 100644
--- a/include/linux/irqchip/arm-gic-v3.h
+++ b/include/linux/irqchip/arm-gic-v3.h
@@ -80,6 +80,7 @@
 #define GICD_CTLR_ENABLE_SS_G0		(1U << 0)
 
 #define GICD_TYPER_RSS			(1U << 26)
+#define GICD_TYPER_No1N			(1U << 25)
 #define GICD_TYPER_LPIS			(1U << 17)
 #define GICD_TYPER_MBIS			(1U << 16)
 #define GICD_TYPER_ESPI			(1U << 8)

--
Without deviation from the norm, progress is not possible.
* Re: IRQ thread timeouts and affinity 2025-10-09 18:11 ` Marc Zyngier @ 2025-10-10 13:50 ` Thierry Reding 2025-10-10 14:18 ` Marc Zyngier 0 siblings, 1 reply; 16+ messages in thread From: Thierry Reding @ 2025-10-10 13:50 UTC (permalink / raw) To: Marc Zyngier; +Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 4814 bytes --] On Thu, Oct 09, 2025 at 07:11:20PM +0100, Marc Zyngier wrote: > On Thu, 09 Oct 2025 18:04:58 +0100, > Marc Zyngier <maz@kernel.org> wrote: > > > > On Thu, 09 Oct 2025 17:05:15 +0100, > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > [1 <text/plain; us-ascii (quoted-printable)>] > > > On Thu, Oct 09, 2025 at 03:30:56PM +0100, Marc Zyngier wrote: > > > > Hi Thierry, > > > > > > > > On Thu, 09 Oct 2025 12:38:55 +0100, > > > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > > > > > Which brings me to the actual question: what is the right way to solve > > > > > this? I had, maybe naively, assumed that the default CPU affinity, which > > > > > includes all available CPUs, would be sufficient to have interrupts > > > > > balanced across all of those CPUs, but that doesn't appear to be the > > > > > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0 > > > > > in this particular case) from the affinity mask to set the "effective > > > > > affinity", which then dictates where IRQs are handled and where the > > > > > corresponding IRQ thread function is run. > > > > > > > > There's a (GIC-specific) answer to that, and that's the "1 of N" > > > > distribution model. The problem is that it is a massive headache (it > > > > completely breaks with per-CPU context). > > > > > > Heh, that started out as a very promising first paragraph but turned > > > ugly very quickly... =) > > > > > > > We could try and hack this in somehow, but defining a reasonable API > > > > is complicated. The set of CPUs receiving 1:N interrupts is a *global* > > > > set, which means you cannot have one interrupt targeting CPUs 0-1, and > > > > another targeting CPUs 2-3. You can only have a single set for all 1:N > > > > interrupts. How would you define such a set in a platform agnostic > > > > manner so that a random driver could use this? I definitely don't want > > > > to have a GIC-specific API. > > > > > > I see. I've been thinking that maybe the only way to solve this is using > > > some sort of policy. A very simple policy might be: use CPU 0 as the > > > "default" interrupt (much like it is now) because like you said there > > > might be assumptions built-in that break when the interrupt is scheduled > > > elsewhere. But then let individual drivers opt into the 1:N set, which > > > would perhaps span all available CPUs but the first one. From an API PoV > > > this would just be a flag that's passed to request_irq() (or one of its > > > derivatives). > > > > The $10k question is how do you pick the victim CPUs? I can't see how > > to do it in a reasonable way unless we decide that interrupts that > > have an affinity matching cpu_possible_mask are 1:N. And then we're > > left with wondering what to do about CPU hotplug. > > For fun and giggles, here's the result of a 5 minute hack. It enables > 1:N distribution on SPIs that have an "all cpus" affinity. It works on > one machine, doesn't on another -- no idea why yet. YMMV. > > This is of course conditioned on your favourite HW supporting the 1:N > feature, and it is likely that things will catch fire quickly. 
> It will probably make your overall interrupt latency *worse*, but maybe
> less variable. Let me know.

You might be onto something here. Mind you, I've only done very limited
testing, but the system does boot and the QSPI-related timeouts are gone
completely.

Here are some snippets from the boot log that might be interesting:

[    0.000000] GICv3: GIC: Using split EOI/Deactivate mode
[    0.000000] GIC: enabling workaround for GICv3: NVIDIA erratum T241-FABRIC-4
[    0.000000] GIC: enabling workaround for GICv3: ARM64 erratum 2941627
[    0.000000] GICv3: 960 SPIs implemented
[    0.000000] GICv3: 320 Extended SPIs implemented
[    0.000000] Root IRQ handler: gic_handle_irq
[    0.000000] GICv3: GICv3 features: 16 PPIs, 1:N
[    0.000000] GICv3: CPU0: found redistributor 20000 region 0:0x0000000022100000
[...]
[    0.000000] GICv3: using LPI property table @0x0000000101500000
[    0.000000] GICv3: CPU0: using allocated LPI pending table @0x0000000101540000
[...]

There's a bunch of ITS info that I dropped, as well as the same
redistributor and LPI property table block for each of the 288 CPUs.

/proc/interrupts is much too big to paste here, but it looks like the
QSPI interrupts now end up evenly distributed across the first 72 CPUs
in this system. Not sure why 72, but possibly because this is a 4 NUMA
node system with 72 CPUs each, so the CPU mask might've been restricted
to just the first node.

On the face of it this looks quite promising. Where do we go from here?
Any areas that we need to test more exhaustively to see if this breaks?

Thierry
* Re: IRQ thread timeouts and affinity 2025-10-10 13:50 ` Thierry Reding @ 2025-10-10 14:18 ` Marc Zyngier 2025-10-10 14:38 ` Jon Hunter 2025-10-10 15:03 ` Thierry Reding 0 siblings, 2 replies; 16+ messages in thread From: Marc Zyngier @ 2025-10-10 14:18 UTC (permalink / raw) To: Thierry Reding Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel On Fri, 10 Oct 2025 14:50:57 +0100, Thierry Reding <thierry.reding@gmail.com> wrote: > > On Thu, Oct 09, 2025 at 07:11:20PM +0100, Marc Zyngier wrote: > > On Thu, 09 Oct 2025 18:04:58 +0100, > > Marc Zyngier <maz@kernel.org> wrote: > > > > > > On Thu, 09 Oct 2025 17:05:15 +0100, > > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > > > [1 <text/plain; us-ascii (quoted-printable)>] > > > > On Thu, Oct 09, 2025 at 03:30:56PM +0100, Marc Zyngier wrote: > > > > > Hi Thierry, > > > > > > > > > > On Thu, 09 Oct 2025 12:38:55 +0100, > > > > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > > > > > > > Which brings me to the actual question: what is the right way to solve > > > > > > this? I had, maybe naively, assumed that the default CPU affinity, which > > > > > > includes all available CPUs, would be sufficient to have interrupts > > > > > > balanced across all of those CPUs, but that doesn't appear to be the > > > > > > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0 > > > > > > in this particular case) from the affinity mask to set the "effective > > > > > > affinity", which then dictates where IRQs are handled and where the > > > > > > corresponding IRQ thread function is run. > > > > > > > > > > There's a (GIC-specific) answer to that, and that's the "1 of N" > > > > > distribution model. The problem is that it is a massive headache (it > > > > > completely breaks with per-CPU context). > > > > > > > > Heh, that started out as a very promising first paragraph but turned > > > > ugly very quickly... =) > > > > > > > > > We could try and hack this in somehow, but defining a reasonable API > > > > > is complicated. The set of CPUs receiving 1:N interrupts is a *global* > > > > > set, which means you cannot have one interrupt targeting CPUs 0-1, and > > > > > another targeting CPUs 2-3. You can only have a single set for all 1:N > > > > > interrupts. How would you define such a set in a platform agnostic > > > > > manner so that a random driver could use this? I definitely don't want > > > > > to have a GIC-specific API. > > > > > > > > I see. I've been thinking that maybe the only way to solve this is using > > > > some sort of policy. A very simple policy might be: use CPU 0 as the > > > > "default" interrupt (much like it is now) because like you said there > > > > might be assumptions built-in that break when the interrupt is scheduled > > > > elsewhere. But then let individual drivers opt into the 1:N set, which > > > > would perhaps span all available CPUs but the first one. From an API PoV > > > > this would just be a flag that's passed to request_irq() (or one of its > > > > derivatives). > > > > > > The $10k question is how do you pick the victim CPUs? I can't see how > > > to do it in a reasonable way unless we decide that interrupts that > > > have an affinity matching cpu_possible_mask are 1:N. And then we're > > > left with wondering what to do about CPU hotplug. > > > > For fun and giggles, here's the result of a 5 minute hack. It enables > > 1:N distribution on SPIs that have an "all cpus" affinity. It works on > > one machine, doesn't on another -- no idea why yet. YMMV. 
> >
> > This is of course conditioned on your favourite HW supporting the 1:N
> > feature, and it is likely that things will catch fire quickly. It will
> > probably make your overall interrupt latency *worse*, but maybe less
> > variable. Let me know.
>
> You might be onto something here. Mind you, I've only done very limited
> testing, but the system does boot and the QSPI-related timeouts are gone
> completely.

Hey, progress.

> Here are some snippets from the boot log that might be interesting:
>
> [    0.000000] GICv3: GIC: Using split EOI/Deactivate mode
> [    0.000000] GIC: enabling workaround for GICv3: NVIDIA erratum T241-FABRIC-4
> [    0.000000] GIC: enabling workaround for GICv3: ARM64 erratum 2941627
> [    0.000000] GICv3: 960 SPIs implemented
> [    0.000000] GICv3: 320 Extended SPIs implemented
> [    0.000000] Root IRQ handler: gic_handle_irq
> [    0.000000] GICv3: GICv3 features: 16 PPIs, 1:N
> [    0.000000] GICv3: CPU0: found redistributor 20000 region 0:0x0000000022100000
> [...]
> [    0.000000] GICv3: using LPI property table @0x0000000101500000
> [    0.000000] GICv3: CPU0: using allocated LPI pending table @0x0000000101540000
> [...]
>
> There's a bunch of ITS info that I dropped, as well as the same
> redistributor and LPI property table block for each of the 288 CPUs.
>
> /proc/interrupts is much too big to paste here, but it looks like the
> QSPI interrupts now end up evenly distributed across the first 72 CPUs
> in this system. Not sure why 72, but possibly because this is a 4 NUMA
> node system with 72 CPUs each, so the CPU mask might've been restricted
> to just the first node.

It could well be that your firmware sets GICR_CTLR.DPG1NS on the 3
other nodes, and the patch I gave you doesn't try to change that.
Check with [1], which does the right thing on that front (it fixed a
similar problem on my slightly more modest 12 CPU machine).

> On the face of it this looks quite promising. Where do we go from here?

For a start, you really should consider sending me one of these
machines. I have plans for it ;-)

> Any areas that we need to test more exhaustively to see if this breaks?

CPU hotplug is the main area of concern, and I'm pretty sure it breaks
this distribution mechanism (or the other way around). Another thing
is that if firmware isn't aware that 1:N interrupts can (or should)
wake up a CPU from sleep, bad things will happen. Given that nobody
uses 1:N, you can bet that any bit of privileged SW (TF-A, hypervisors)
is likely to be buggy (I've already spotted bugs in KVM around this).

The other concern is the shape of the API we would expose to drivers,
because I'm not sure we want this sort of "scatter-gun" approach for
all SPIs, and I don't know how that translates to other architectures.

Thomas should probably weigh in here.

Thanks,

	M.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=irq/gicv3-1ofN&id=5856e2eb479fc41ea60e76440f768079a1a21a36

--
Without deviation from the norm, progress is not possible.
* Re: IRQ thread timeouts and affinity 2025-10-10 14:18 ` Marc Zyngier @ 2025-10-10 14:38 ` Jon Hunter 2025-10-10 14:54 ` Thierry Reding 2025-10-10 15:03 ` Thierry Reding 1 sibling, 1 reply; 16+ messages in thread From: Jon Hunter @ 2025-10-10 14:38 UTC (permalink / raw) To: Marc Zyngier, Thierry Reding Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel On 10/10/2025 15:18, Marc Zyngier wrote: ... > CPU hotplug is the main area of concern, and I'm pretty sure it breaks > this distribution mechanism (or the other way around). Another thing > is that if firmware isn't aware that 1:N interrupts can (or should) > wake-up a CPU from sleep, bad things will happen. Given that nobody > uses 1:N, you can bet that any bit of privileged SW (TF-A, > hypervisors) is likely to be buggy (I've already spotted bugs in KVM > around this). Thierry, do we ever hotplug CPUs on this device? If not, I am wondering if something like this, for now, could only be enabled for devices that don't hotplug CPUs. Maybe tied to the kernel config (ie. CONFIG_HOTPLUG_CPU)? Just a thought ... Jon -- nvpublic ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IRQ thread timeouts and affinity 2025-10-10 14:38 ` Jon Hunter @ 2025-10-10 14:54 ` Thierry Reding 2025-10-10 15:52 ` Jon Hunter 0 siblings, 1 reply; 16+ messages in thread From: Thierry Reding @ 2025-10-10 14:54 UTC (permalink / raw) To: Jon Hunter Cc: Marc Zyngier, Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1045 bytes --] On Fri, Oct 10, 2025 at 03:38:59PM +0100, Jon Hunter wrote: > > On 10/10/2025 15:18, Marc Zyngier wrote: > > ... > > > CPU hotplug is the main area of concern, and I'm pretty sure it breaks > > this distribution mechanism (or the other way around). Another thing > > is that if firmware isn't aware that 1:N interrupts can (or should) > > wake-up a CPU from sleep, bad things will happen. Given that nobody > > uses 1:N, you can bet that any bit of privileged SW (TF-A, > > hypervisors) is likely to be buggy (I've already spotted bugs in KVM > > around this). > > Thierry, do we ever hotplug CPUs on this device? If not, I am wondering if > something like this, for now, could only be enabled for devices that don't > hotplug CPUs. Maybe tied to the kernel config (ie. CONFIG_HOTPLUG_CPU)? Just > a thought ... I've only had limited exposure to this, so I don't know all of the use- cases. People can buy these devices and do anything they want with it, so I think we have to account for the general case. Thierry [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IRQ thread timeouts and affinity
From: Jon Hunter @ 2025-10-10 15:52 UTC
To: Thierry Reding
Cc: Marc Zyngier, Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel

On 10/10/2025 15:54, Thierry Reding wrote:
> On Fri, Oct 10, 2025 at 03:38:59PM +0100, Jon Hunter wrote:
>>
>> On 10/10/2025 15:18, Marc Zyngier wrote:
>>
>> ...
>>
>>> CPU hotplug is the main area of concern, and I'm pretty sure it breaks
>>> this distribution mechanism (or the other way around). Another thing
>>> is that if firmware isn't aware that 1:N interrupts can (or should)
>>> wake-up a CPU from sleep, bad things will happen. Given that nobody
>>> uses 1:N, you can bet that any bit of privileged SW (TF-A,
>>> hypervisors) is likely to be buggy (I've already spotted bugs in KVM
>>> around this).
>>
>> Thierry, do we ever hotplug CPUs on this device? If not, I am wondering if
>> something like this, for now, could only be enabled for devices that don't
>> hotplug CPUs. Maybe tied to the kernel config (ie. CONFIG_HOTPLUG_CPU)?
>> Just a thought ...
>
> I've only had limited exposure to this, so I don't know all of the
> use-cases. People can buy these devices and do anything they want with
> it, so I think we have to account for the general case.

Yes, but the point I was trying to make is that you can prevent this
from being used if CPU hotplug is enabled in the kernel, and initially
limit it to configurations where this feature would/could be enabled. So
you take CPU hotplug out of the equation (initially). Of course someone
can hack the kernel and do what they want, but there is nothing you can
do about that.

Jon

--
nvpublic
* Re: IRQ thread timeouts and affinity 2025-10-10 14:18 ` Marc Zyngier 2025-10-10 14:38 ` Jon Hunter @ 2025-10-10 15:03 ` Thierry Reding 2025-10-11 10:00 ` Marc Zyngier 1 sibling, 1 reply; 16+ messages in thread From: Thierry Reding @ 2025-10-10 15:03 UTC (permalink / raw) To: Marc Zyngier; +Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 6858 bytes --] On Fri, Oct 10, 2025 at 03:18:13PM +0100, Marc Zyngier wrote: > On Fri, 10 Oct 2025 14:50:57 +0100, > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > On Thu, Oct 09, 2025 at 07:11:20PM +0100, Marc Zyngier wrote: > > > On Thu, 09 Oct 2025 18:04:58 +0100, > > > Marc Zyngier <maz@kernel.org> wrote: > > > > > > > > On Thu, 09 Oct 2025 17:05:15 +0100, > > > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > > > > > [1 <text/plain; us-ascii (quoted-printable)>] > > > > > On Thu, Oct 09, 2025 at 03:30:56PM +0100, Marc Zyngier wrote: > > > > > > Hi Thierry, > > > > > > > > > > > > On Thu, 09 Oct 2025 12:38:55 +0100, > > > > > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > > > > > > > > > Which brings me to the actual question: what is the right way to solve > > > > > > > this? I had, maybe naively, assumed that the default CPU affinity, which > > > > > > > includes all available CPUs, would be sufficient to have interrupts > > > > > > > balanced across all of those CPUs, but that doesn't appear to be the > > > > > > > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0 > > > > > > > in this particular case) from the affinity mask to set the "effective > > > > > > > affinity", which then dictates where IRQs are handled and where the > > > > > > > corresponding IRQ thread function is run. > > > > > > > > > > > > There's a (GIC-specific) answer to that, and that's the "1 of N" > > > > > > distribution model. The problem is that it is a massive headache (it > > > > > > completely breaks with per-CPU context). > > > > > > > > > > Heh, that started out as a very promising first paragraph but turned > > > > > ugly very quickly... =) > > > > > > > > > > > We could try and hack this in somehow, but defining a reasonable API > > > > > > is complicated. The set of CPUs receiving 1:N interrupts is a *global* > > > > > > set, which means you cannot have one interrupt targeting CPUs 0-1, and > > > > > > another targeting CPUs 2-3. You can only have a single set for all 1:N > > > > > > interrupts. How would you define such a set in a platform agnostic > > > > > > manner so that a random driver could use this? I definitely don't want > > > > > > to have a GIC-specific API. > > > > > > > > > > I see. I've been thinking that maybe the only way to solve this is using > > > > > some sort of policy. A very simple policy might be: use CPU 0 as the > > > > > "default" interrupt (much like it is now) because like you said there > > > > > might be assumptions built-in that break when the interrupt is scheduled > > > > > elsewhere. But then let individual drivers opt into the 1:N set, which > > > > > would perhaps span all available CPUs but the first one. From an API PoV > > > > > this would just be a flag that's passed to request_irq() (or one of its > > > > > derivatives). > > > > > > > > The $10k question is how do you pick the victim CPUs? I can't see how > > > > to do it in a reasonable way unless we decide that interrupts that > > > > have an affinity matching cpu_possible_mask are 1:N. 
And then we're > > > > left with wondering what to do about CPU hotplug. > > > > > > For fun and giggles, here's the result of a 5 minute hack. It enables > > > 1:N distribution on SPIs that have an "all cpus" affinity. It works on > > > one machine, doesn't on another -- no idea why yet. YMMV. > > > > > > This is of course conditioned on your favourite HW supporting the 1:N > > > feature, and it is likely that things will catch fire quickly. It will > > > probably make your overall interrupt latency *worse*, but maybe less > > > variable. Let me know. > > > > You might be onto something here. Mind you, I've only done very limited > > testing, but the system does boot and the QSPI related timeouts are gone > > completely. > > Hey, progress. > > > Here's some snippets from the boot log that might be interesting: > > > > [ 0.000000] GICv3: GIC: Using split EOI/Deactivate mode > > [ 0.000000] GIC: enabling workaround for GICv3: NVIDIA erratum T241-FABRIC-4 > > [ 0.000000] GIC: enabling workaround for GICv3: ARM64 erratum 2941627 > > [ 0.000000] GICv3: 960 SPIs implemented > > [ 0.000000] GICv3: 320 Extended SPIs implemented > > [ 0.000000] Root IRQ handler: gic_handle_irq > > [ 0.000000] GICv3: GICv3 features: 16 PPIs, 1:N > > [ 0.000000] GICv3: CPU0: found redistributor 20000 region 0:0x0000000022100000 > > [...] > > [ 0.000000] GICv3: using LPI property table @0x0000000101500000 > > [ 0.000000] GICv3: CPU0: using allocated LPI pending table @0x0000000101540000 > > [...] > > > > There's a bunch of ITS info that I dropped, as well as the same > > redistributor and LPI property table block for each of the 288 CPUs. > > > > /proc/interrupts is much too big to paste here, but it looks like the > > QSPI interrupts now end up evenly distributed across the first 72 CPUs > > in this system. Not sure why 72, but possibly because this is a 4 NUMA > > node system with 72 CPUs each, so the CPU mask might've been restricted > > to just the first node. > > It could well be that your firmware sets GICR_CTLR.DPG1NS on the 3 > other nodes, and the patch I gave you doesn't try to change that. > Check with [1], which does the right thing on that front (it fixed a > similar problem on my slightly more modest 12 CPU machine). > > > On the face of it this looks quite promising. Where do we go from here? > > For a start, you really should consider sending me one of these > machines. I have plans for it ;-) I'm quite happy with someone else hosting this device, I don't think the electrical installation at home could handle it. It has proven to be quite well suited for kernel builds... > > Any areas that we need to test more exhaustively to see if this breaks? > > CPU hotplug is the main area of concern, and I'm pretty sure it breaks > this distribution mechanism (or the other way around). Another thing > is that if firmware isn't aware that 1:N interrupts can (or should) > wake-up a CPU from sleep, bad things will happen. Given that nobody > uses 1:N, you can bet that any bit of privileged SW (TF-A, > hypervisors) is likely to be buggy (I've already spotted bugs in KVM > around this). Okay, I can find out if CPU hotplug is a common use-case on these devices, or if we can run some tests with that. > The other concern is the shape of the API we would expose to drivers, > because I'm not sure we want this sort of "scatter-gun" approach for > all SPIs, and I don't know how that translates to other architectures. > > Thomas should probably weight in here. 
Yes, it would be interesting to understand how we can make use of this in a more generic way. Thierry [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IRQ thread timeouts and affinity 2025-10-10 15:03 ` Thierry Reding @ 2025-10-11 10:00 ` Marc Zyngier 2025-10-14 10:50 ` Thierry Reding 0 siblings, 1 reply; 16+ messages in thread From: Marc Zyngier @ 2025-10-11 10:00 UTC (permalink / raw) To: Thierry Reding Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel On Fri, 10 Oct 2025 16:03:01 +0100, Thierry Reding <thierry.reding@gmail.com> wrote: > > On Fri, Oct 10, 2025 at 03:18:13PM +0100, Marc Zyngier wrote: > > > > CPU hotplug is the main area of concern, and I'm pretty sure it breaks > > this distribution mechanism (or the other way around). Another thing > > is that if firmware isn't aware that 1:N interrupts can (or should) > > wake-up a CPU from sleep, bad things will happen. Given that nobody > > uses 1:N, you can bet that any bit of privileged SW (TF-A, > > hypervisors) is likely to be buggy (I've already spotted bugs in KVM > > around this). > > Okay, I can find out if CPU hotplug is a common use-case on these > devices, or if we can run some tests with that. It's not so much whether CPU hotplug is of any use to your particular box, but whether this has any detrimental impact on *any* machine doing CPU hotplug. To be clear, this stuff doesn't go in if something breaks, no matter how small. M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IRQ thread timeouts and affinity
From: Thierry Reding @ 2025-10-14 10:50 UTC
To: Marc Zyngier
Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel

On Sat, Oct 11, 2025 at 11:00:11AM +0100, Marc Zyngier wrote:
> On Fri, 10 Oct 2025 16:03:01 +0100,
> Thierry Reding <thierry.reding@gmail.com> wrote:
> >
> > On Fri, Oct 10, 2025 at 03:18:13PM +0100, Marc Zyngier wrote:
> > >
> > > CPU hotplug is the main area of concern, and I'm pretty sure it breaks
> > > this distribution mechanism (or the other way around). Another thing
> > > is that if firmware isn't aware that 1:N interrupts can (or should)
> > > wake-up a CPU from sleep, bad things will happen. Given that nobody
> > > uses 1:N, you can bet that any bit of privileged SW (TF-A,
> > > hypervisors) is likely to be buggy (I've already spotted bugs in KVM
> > > around this).
> >
> > Okay, I can find out if CPU hotplug is a common use-case on these
> > devices, or if we can run some tests with that.
>
> It's not so much whether CPU hotplug is of any use to your particular
> box, but whether this has any detrimental impact on *any* machine
> doing CPU hotplug.
>
> To be clear, this stuff doesn't go in if something breaks, no matter
> how small.

Of course. I do want to find a way to move forward with this, so I'm
trying to find ways to check what impact this would have in conjunction
with CPU hotplug.

I've done some minimal testing on a Tegra264 device where we have fewer
CPUs. With your patch applied, I see that most interrupts are nicely
distributed across CPUs. I'm going to use the serial interrupt as an
example since it reliably triggers when I test on a system. Here's an
extract after boot:

# cat /proc/interrupts
          CPU0  CPU1  CPU2  CPU3  CPU4  CPU5  CPU6  CPU7
 25:        42    44    41    29    37    36    39    36  GICv3 547 Level  c4e0000.serial

I then took CPU 1 offline:

# echo 0 > /sys/devices/system/cpu/cpu1/online

After that it looks like the GIC automatically reverts to using the
first CPU, since after a little while:

# cat /proc/interrupts
          CPU0  CPU2  CPU3  CPU4  CPU5  CPU6  CPU7
 25:       186    66    52    64    58    67    62  GICv3 547 Level  c4e0000.serial

The interrupt count for CPUs 2-7 no longer increments after taking CPU 1
offline. Interestingly, bringing CPU 1 back online doesn't have an
impact, so it doesn't go back to enabling 1:N mode.

Nothing did seem to break. Obviously this doesn't show anything about
performance yet, but it looks like at least things don't crash and burn.

Anything else that you think I can test? Do we have a way of restoring
1:N when all CPUs are back online?

Thierry
* Re: IRQ thread timeouts and affinity 2025-10-14 10:50 ` Thierry Reding @ 2025-10-14 11:08 ` Thierry Reding 2025-10-14 17:46 ` Marc Zyngier 0 siblings, 1 reply; 16+ messages in thread From: Thierry Reding @ 2025-10-14 11:08 UTC (permalink / raw) To: Marc Zyngier; +Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 2783 bytes --] On Tue, Oct 14, 2025 at 12:50:18PM +0200, Thierry Reding wrote: > On Sat, Oct 11, 2025 at 11:00:11AM +0100, Marc Zyngier wrote: > > On Fri, 10 Oct 2025 16:03:01 +0100, > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > On Fri, Oct 10, 2025 at 03:18:13PM +0100, Marc Zyngier wrote: > > > > > > > > CPU hotplug is the main area of concern, and I'm pretty sure it breaks > > > > this distribution mechanism (or the other way around). Another thing > > > > is that if firmware isn't aware that 1:N interrupts can (or should) > > > > wake-up a CPU from sleep, bad things will happen. Given that nobody > > > > uses 1:N, you can bet that any bit of privileged SW (TF-A, > > > > hypervisors) is likely to be buggy (I've already spotted bugs in KVM > > > > around this). > > > > > > Okay, I can find out if CPU hotplug is a common use-case on these > > > devices, or if we can run some tests with that. > > > > It's not so much whether CPU hotplug is of any use to your particular > > box, but whether this has any detrimental impact on *any* machine > > doing CPU hotplug. > > > > To be clear, this stuff doesn't go in if something breaks, no matter > > how small. > > Of course. I do want to find a way to move forward with this, so I'm > trying to find ways to check what impact this would have in conjunction > with CPU hotplug. > > I've done some minimal testing on a Tegra264 device where we have less > CPUs. With your patch applied, I see that most interrupts are nicely > distributed across CPUs. I'm going to use the serial interrupt as an > example since it reliably triggers when I test on a system. Here's an > extract after boot: > > # cat /proc/interrupts > CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 > 25: 42 44 41 29 37 36 39 36 GICv3 547 Level c4e0000.serial > > I then took CPU 1 offline: > > # echo 0 > /sys/devices/system/cpu/cpu1/online > > After that it looks like the GIC automatically reverts to using the > first CPU, since after a little while: > > # cat /proc/interrupts > CPU0 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 > 25: 186 66 52 64 58 67 62 GICv3 547 Level c4e0000.serial > > The interrupt count for CPUs 2-7 no longer increments after taking CPU 1 > offline. Interestingly, bringing CPU 1 back online doesn't have an > impact, so it doesn't go back to enabling 1:N mode. Looks like that is because gic_set_affinity() gets called with the new CPU mask when the CPU goes offline, but it's *not* called when the CPU comes back online. Thierry [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
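One way to poke at this from the driver side, purely as an experiment,
would be a CPU hotplug callback that rewrites the affinity with the full
mask whenever a CPU comes online, forcing gic_set_affinity() to run again.
This is a sketch under the assumption that the experimental GICv3 patch
above is applied; example_irq, example_cpu_online and the state name string
are invented, while cpuhp_setup_state(), CPUHP_AP_ONLINE_DYN and
irq_set_affinity_hint() are existing kernel APIs. It is not the approach
proposed in this thread, only an illustration of the missing re-trigger.

#include <linux/cpuhotplug.h>
#include <linux/cpumask.h>
#include <linux/interrupt.h>

/* Interrupt that should stay spread across all CPUs (example value). */
static unsigned int example_irq;

/*
 * Rewrite the affinity with the full mask whenever a CPU comes online.
 * With the experimental GICv3 patch, an affinity equal to
 * cpu_possible_mask is what enables 1:N routing, so reapplying it here
 * undoes the fallback to a single target CPU that happens when a CPU is
 * taken offline.
 */
static int example_cpu_online(unsigned int cpu)
{
	return irq_set_affinity_hint(example_irq, cpu_possible_mask);
}

static int example_register_online_callback(void)
{
	int ret;

	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "example/irq-1ofN:online",
				example_cpu_online, NULL);
	return ret < 0 ? ret : 0;
}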
* Re: IRQ thread timeouts and affinity
From: Marc Zyngier @ 2025-10-14 17:46 UTC
To: Thierry Reding
Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel

On Tue, 14 Oct 2025 12:08:22 +0100,
Thierry Reding <thierry.reding@gmail.com> wrote:
> [...]
> > The interrupt count for CPUs 2-7 no longer increments after taking CPU 1
> > offline. Interestingly, bringing CPU 1 back online doesn't have an
> > impact, so it doesn't go back to enabling 1:N mode.
>
> Looks like that is because gic_set_affinity() gets called with the new
> CPU mask when the CPU goes offline, but it's *not* called when the CPU
> comes back online.

Indeed, because there is no need to change the affinity as far as the
kernel is concerned -- the interrupt is on an online CPU and all is
well.

I think that's the point where a per-interrupt flag (let's call it
IRQ_BCAST for the sake of argument) is required to decide what to do.
Ideally, IRQ_BCAST would replace any notion of affinity, and you'd get
the scatter-gun behaviour all the time. Which means no adjustment to
the affinity on a CPU going offline (everything still works).

But that assumes a bunch of other things:

- when going offline, at least DPG1NS gets set to make sure this CPU is
  not a target anymore if not going completely dead (still running
  secure code, for example). The kernel could do it, but...

- when going idle, should this CPU still be a target of 1:N interrupts?
  That's a firmware decision that could severely impact power on
  battery-bound machines if not carefully managed...

- and should a CPU wake up from such an interrupt? Again, that's a
  firmware decision, and I don't know how existing implementations deal
  with that stuff.

Someone needs to investigate these things, and work out all of the
above. That will give us a set of conditions under which we could do
something.

	M.

--
Jazz isn't dead. It just smells funny.
* Re: IRQ thread timeouts and affinity
From: Thomas Gleixner @ 2025-10-16 18:53 UTC
To: Thierry Reding, Marc Zyngier
Cc: linux-tegra, linux-arm-kernel, linux-kernel

On Thu, Oct 09 2025 at 13:38, Thierry Reding wrote:
> We've been running into an issue on some systems (NVIDIA Grace chips)
> where either during boot or at runtime, CPU 0 can be under very high
> load and cause some IRQ thread functions to be delayed to a point where
> we encounter the timeout in the work submission parts of the driver.
>
> Specifically this happens for the Tegra QSPI controller driver found
> in drivers/spi/spi-tegra210-quad.c. This driver uses an IRQ thread to
> wait for and process "transfer ready" interrupts (which need to run
> DMA transfers or copy from the hardware FIFOs using PIO to get the
> SPI transfer data). Under heavy load, we've seen the IRQ thread run
> with up to multiple seconds of delay.

If the interrupt thread, which runs with SCHED_FIFO, is delayed for
multiple seconds, then there is something seriously wrong to begin
with. You fail to explain how that happens in the first place. Heavy
load is not really a good explanation for that.

> Alternatively, would it be possible (and make sense) to make the IRQ
> core code schedule threads across more CPUs? Is there a particular
> reason that the IRQ thread runs on the same CPU that services the IRQ?

Locality. Also remote wakeups are way more expensive than local
wakeups. Though there is no actual hard requirement to force it onto
the same CPU.

What could be done is to have a flag which binds the thread to the real
affinity mask instead of the effective affinity mask so it can be
scheduled freely. Needs some thoughts, but should work.

> Maybe another way would be to "reserve" CPU 0 for the type of core OS
> driver like QSPI (the TPM is connected to this controller) and make sure
> all CPU intensive tasks do not run on that CPU?
>
> I know that things like irqbalance and taskset exist to solve some of
> these problems, but they do not work when we hit these cases at boot
> time.

I'm still completely failing to see how you end up with multiple
seconds of delay for that thread, especially during boot. What exactly
keeps it from getting scheduled?

Thanks,

	tglx
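A conceptual sketch of the flag Thomas describes is shown below. This is
not the actual kernel/irq/manage.c code, and IRQF_THREAD_AFFINITY_FULL is
an invented name; struct irq_desc, struct irqaction and
irq_data_get_effective_affinity_mask() are real kernel interfaces. It only
illustrates choosing between the full affinity mask and the effective
affinity mask when deciding where the IRQ thread may run.

#include <linux/interrupt.h>
#include <linux/irq.h>
#include <linux/irqdesc.h>

#define IRQF_THREAD_AFFINITY_FULL	0x01000000	/* invented name */

/*
 * Pick the mask the IRQ thread would be bound to. Today the thread
 * follows the effective affinity (a single CPU on GICv3); with the
 * hypothetical flag set it would follow the full affinity mask instead,
 * so the scheduler could place it on any CPU in that mask.
 */
static const struct cpumask *
example_irq_thread_mask(struct irq_desc *desc, struct irqaction *action)
{
	if (action->flags & IRQF_THREAD_AFFINITY_FULL)
		return desc->irq_common_data.affinity;

	return irq_data_get_effective_affinity_mask(&desc->irq_data);
}

/* The binding itself would then be set_cpus_allowed_ptr(current, mask). */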