* IRQ thread timeouts and affinity
From: Thierry Reding @ 2025-10-09 11:38 UTC
To: Thomas Gleixner, Marc Zyngier
Cc: linux-tegra, linux-arm-kernel, linux-kernel

Hi Thomas, Marc, all,

Apologies up front for the length of this. There are a lot of details
that I want to share in order to, hopefully, make this as clear as
possible.

We've been running into an issue on some systems (NVIDIA Grace chips)
where either during boot or at runtime, CPU 0 can be under very high
load and cause some IRQ thread functions to be delayed to a point where
we encounter the timeout in the work submission parts of the driver.

Specifically this happens for the Tegra QSPI controller driver found
in drivers/spi/spi-tegra210-quad.c. This driver uses an IRQ thread to
wait for and process "transfer ready" interrupts (which need to run
DMA transfers or copy from the hardware FIFOs using PIO to get the
SPI transfer data). Under heavy load, we've seen the IRQ thread run
with up to multiple seconds of delay.

One solution that we've tried is to move parts of the IRQ handler into
the hard IRQ portion, and we observed that the interrupt is always seen
within the expected period of time. However, the IRQ thread still runs
very late in those cases.

To mitigate this, we're currently trying to gracefully recover on
timeout by checking the hardware state and processing as if no timeout
happened. This needs special care because eventually the IRQ thread
will run and try to process a SPI transfer that's already been
processed. It also isn't optimal because of, well, the timeout.

These devices have a *lot* of CPUs, and usually only CPU 0 tends to be
clogged (during boot); fio-based stress tests at runtime can also
trigger this case if they happen to run on CPU 0.

One workaround that has proven to work is to change the affinity of the
QSPI interrupt to whatever the current CPU is at probe time. That only
works as long as that CPU doesn't happen to be CPU 0, obviously. It
also doesn't work if we end up stress-testing the selected CPU at
runtime, so it's ultimately just a way of reducing the likelihood, not
of avoiding the problems entirely.

Which brings me to the actual question: what is the right way to solve
this? I had, maybe naively, assumed that the default CPU affinity, which
includes all available CPUs, would be sufficient to have interrupts
balanced across all of those CPUs, but that doesn't appear to be the
case. At least not with the GIC (v3) driver, which selects one CPU
(CPU 0 in this particular case) from the affinity mask to set the
"effective affinity", which then dictates where IRQs are handled and
where the corresponding IRQ thread function is run.

One potential solution I see is to avoid threaded IRQs for this because
they will cause all of the interrupts to be processed on CPU 0 by
default. A viable alternative would be to use work queues, which, to my
understanding, can (will?) be scheduled more flexibly.

Alternatively, would it be possible (and make sense) to make the IRQ
core code schedule threads across more CPUs? Is there a particular
reason that the IRQ thread runs on the same CPU that services the IRQ?

Maybe another way would be to "reserve" CPU 0 for the type of core OS
driver like QSPI (the TPM is connected to this controller) and make
sure all CPU-intensive tasks do not run on that CPU?

I know that things like irqbalance and taskset exist to solve some of
these problems, but they do not work when we hit these cases at boot
time.

Any other solutions that I haven't thought of?

Thanks,
Thierry
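As a concrete illustration of the probe-time affinity workaround mentioned
above, the relevant calls could look roughly like the sketch below. This is
not code from spi-tegra210-quad.c; the function name and surrounding
structure are invented, and only the affinity-hint call at the end is the
point of interest. devm_request_threaded_irq() and irq_set_affinity_hint()
are existing kernel APIs.

#include <linux/cpumask.h>
#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/platform_device.h>
#include <linux/smp.h>

/*
 * Sketch of the probe-time workaround: request the threaded IRQ as usual,
 * then steer it to whatever CPU probe happens to be running on, so that
 * the handler and its IRQ thread avoid the default effective affinity
 * (typically CPU 0 with the GICv3 driver).
 */
static int example_qspi_setup_irq(struct platform_device *pdev,
				  unsigned int irq, irq_handler_t handler,
				  irq_handler_t thread_fn, void *data)
{
	unsigned int cpu;
	int err;

	err = devm_request_threaded_irq(&pdev->dev, irq, handler, thread_fn,
					IRQF_ONESHOT, dev_name(&pdev->dev),
					data);
	if (err)
		return err;

	/* Snapshot the current CPU; the exact choice is only a heuristic. */
	cpu = get_cpu();
	put_cpu();

	/*
	 * Best effort: this only reduces the likelihood of sharing a CPU
	 * with a heavy workload. It does not help if the chosen CPU is
	 * CPU 0 or becomes loaded later.
	 */
	irq_set_affinity_hint(irq, cpumask_of(cpu));

	return 0;
}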
* Re: IRQ thread timeouts and affinity 2025-10-09 11:38 IRQ thread timeouts and affinity Thierry Reding @ 2025-10-09 14:30 ` Marc Zyngier 2025-10-09 16:05 ` Thierry Reding 2025-10-16 18:53 ` Thomas Gleixner 1 sibling, 1 reply; 16+ messages in thread From: Marc Zyngier @ 2025-10-09 14:30 UTC (permalink / raw) To: Thierry Reding Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel Hi Thierry, On Thu, 09 Oct 2025 12:38:55 +0100, Thierry Reding <thierry.reding@gmail.com> wrote: > > Which brings me to the actual question: what is the right way to solve > this? I had, maybe naively, assumed that the default CPU affinity, which > includes all available CPUs, would be sufficient to have interrupts > balanced across all of those CPUs, but that doesn't appear to be the > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0 > in this particular case) from the affinity mask to set the "effective > affinity", which then dictates where IRQs are handled and where the > corresponding IRQ thread function is run. There's a (GIC-specific) answer to that, and that's the "1 of N" distribution model. The problem is that it is a massive headache (it completely breaks with per-CPU context). We could try and hack this in somehow, but defining a reasonable API is complicated. The set of CPUs receiving 1:N interrupts is a *global* set, which means you cannot have one interrupt targeting CPUs 0-1, and another targeting CPUs 2-3. You can only have a single set for all 1:N interrupts. How would you define such a set in a platform agnostic manner so that a random driver could use this? I definitely don't want to have a GIC-specific API. Overall, there is quite a lot of work to be done in this space: the machine I'm typing this from doesn't have affinity control *at all*. Any interrupt can target any CPU, and if Linux doesn't expect that, tough. Don't even think of managed interrupts on that sort of systems... M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IRQ thread timeouts and affinity 2025-10-09 14:30 ` Marc Zyngier @ 2025-10-09 16:05 ` Thierry Reding 2025-10-09 17:04 ` Marc Zyngier 0 siblings, 1 reply; 16+ messages in thread From: Thierry Reding @ 2025-10-09 16:05 UTC (permalink / raw) To: Marc Zyngier; +Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 3587 bytes --] On Thu, Oct 09, 2025 at 03:30:56PM +0100, Marc Zyngier wrote: > Hi Thierry, > > On Thu, 09 Oct 2025 12:38:55 +0100, > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > Which brings me to the actual question: what is the right way to solve > > this? I had, maybe naively, assumed that the default CPU affinity, which > > includes all available CPUs, would be sufficient to have interrupts > > balanced across all of those CPUs, but that doesn't appear to be the > > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0 > > in this particular case) from the affinity mask to set the "effective > > affinity", which then dictates where IRQs are handled and where the > > corresponding IRQ thread function is run. > > There's a (GIC-specific) answer to that, and that's the "1 of N" > distribution model. The problem is that it is a massive headache (it > completely breaks with per-CPU context). Heh, that started out as a very promising first paragraph but turned ugly very quickly... =) > We could try and hack this in somehow, but defining a reasonable API > is complicated. The set of CPUs receiving 1:N interrupts is a *global* > set, which means you cannot have one interrupt targeting CPUs 0-1, and > another targeting CPUs 2-3. You can only have a single set for all 1:N > interrupts. How would you define such a set in a platform agnostic > manner so that a random driver could use this? I definitely don't want > to have a GIC-specific API. I see. I've been thinking that maybe the only way to solve this is using some sort of policy. A very simple policy might be: use CPU 0 as the "default" interrupt (much like it is now) because like you said there might be assumptions built-in that break when the interrupt is scheduled elsewhere. But then let individual drivers opt into the 1:N set, which would perhaps span all available CPUs but the first one. From an API PoV this would just be a flag that's passed to request_irq() (or one of its derivatives). > Overall, there is quite a lot of work to be done in this space: the > machine I'm typing this from doesn't have affinity control *at > all*. Any interrupt can target any CPU, Well, that actually sounds pretty nice for the use-case that we have... > and if Linux doesn't expect > that, tough. ... but yeah, it may also break things. > Don't even think of managed interrupts on that sort of > systems... I've seen some of the hardware drivers on the Grace devices distribute interrupts across multiple CPUs, but they do so via managed interrupts and multiple queues. I was trying to think if maybe that could be used for cases like QSPI as well. It's similar to just using a fixed CPU affinity, so it's hardly a great solution. I also didn't see anything outside of network and PCI use this (there's one exception in SATA), so I don't know if it's something that just isn't a good idea outside of multi-queue devices or if simply nobody has considered it. irqbalance sounds like it would work to avoid the worst, and it has built-in support to exclude certain CPUs from the balancing set. 
At the same time this seems like something that the kernel would be much better equipped to handle than a userspace daemon. Has anyone ever attempted to create an irqbalance but within the kernel? I should probably go look at how this works on x86 or PowerPC systems. I keep thinking that this cannot be a new problem, so other solutions must already exist. Thierry [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
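For illustration, the flag-based opt-in floated above might look something
like this from a driver's point of view. IRQF_SPREAD does not exist in
mainline and is invented purely for this sketch; devm_request_threaded_irq()
and IRQF_ONESHOT are the existing API.

#include <linux/device.h>
#include <linux/interrupt.h>

/*
 * IRQF_SPREAD is hypothetical: it stands in for a flag with which a driver
 * would declare that any CPU in the affinity mask may service this
 * interrupt and run its thread, while everything else keeps the current
 * single-CPU default.
 */
#define IRQF_SPREAD	0x01000000	/* hypothetical, not a real flag */

static int example_request_spread_irq(struct device *dev, unsigned int irq,
				      irq_handler_t handler,
				      irq_handler_t thread_fn, void *data)
{
	return devm_request_threaded_irq(dev, irq, handler, thread_fn,
					 IRQF_ONESHOT | IRQF_SPREAD,
					 dev_name(dev), data);
}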
* Re: IRQ thread timeouts and affinity 2025-10-09 16:05 ` Thierry Reding @ 2025-10-09 17:04 ` Marc Zyngier 2025-10-09 18:11 ` Marc Zyngier 0 siblings, 1 reply; 16+ messages in thread From: Marc Zyngier @ 2025-10-09 17:04 UTC (permalink / raw) To: Thierry Reding Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel On Thu, 09 Oct 2025 17:05:15 +0100, Thierry Reding <thierry.reding@gmail.com> wrote: > > [1 <text/plain; us-ascii (quoted-printable)>] > On Thu, Oct 09, 2025 at 03:30:56PM +0100, Marc Zyngier wrote: > > Hi Thierry, > > > > On Thu, 09 Oct 2025 12:38:55 +0100, > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > Which brings me to the actual question: what is the right way to solve > > > this? I had, maybe naively, assumed that the default CPU affinity, which > > > includes all available CPUs, would be sufficient to have interrupts > > > balanced across all of those CPUs, but that doesn't appear to be the > > > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0 > > > in this particular case) from the affinity mask to set the "effective > > > affinity", which then dictates where IRQs are handled and where the > > > corresponding IRQ thread function is run. > > > > There's a (GIC-specific) answer to that, and that's the "1 of N" > > distribution model. The problem is that it is a massive headache (it > > completely breaks with per-CPU context). > > Heh, that started out as a very promising first paragraph but turned > ugly very quickly... =) > > > We could try and hack this in somehow, but defining a reasonable API > > is complicated. The set of CPUs receiving 1:N interrupts is a *global* > > set, which means you cannot have one interrupt targeting CPUs 0-1, and > > another targeting CPUs 2-3. You can only have a single set for all 1:N > > interrupts. How would you define such a set in a platform agnostic > > manner so that a random driver could use this? I definitely don't want > > to have a GIC-specific API. > > I see. I've been thinking that maybe the only way to solve this is using > some sort of policy. A very simple policy might be: use CPU 0 as the > "default" interrupt (much like it is now) because like you said there > might be assumptions built-in that break when the interrupt is scheduled > elsewhere. But then let individual drivers opt into the 1:N set, which > would perhaps span all available CPUs but the first one. From an API PoV > this would just be a flag that's passed to request_irq() (or one of its > derivatives). The $10k question is how do you pick the victim CPUs? I can't see how to do it in a reasonable way unless we decide that interrupts that have an affinity matching cpu_possible_mask are 1:N. And then we're left with wondering what to do about CPU hotplug. > > > Overall, there is quite a lot of work to be done in this space: the > > machine I'm typing this from doesn't have affinity control *at > > all*. Any interrupt can target any CPU, > > Well, that actually sounds pretty nice for the use-case that we have... > > > and if Linux doesn't expect > > that, tough. > > ... but yeah, it may also break things. Yeah. With GICv3, only SPIs can be 1:N, but on this (fruity) box, even MSIs can be arbitrarily moved from one CPU to another. This is a ticking bomb. I'll see if I can squeeze out some time to look into this -- no promises though. M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IRQ thread timeouts and affinity 2025-10-09 17:04 ` Marc Zyngier @ 2025-10-09 18:11 ` Marc Zyngier 2025-10-10 13:50 ` Thierry Reding 0 siblings, 1 reply; 16+ messages in thread From: Marc Zyngier @ 2025-10-09 18:11 UTC (permalink / raw) To: Thierry Reding Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel On Thu, 09 Oct 2025 18:04:58 +0100, Marc Zyngier <maz@kernel.org> wrote: > > On Thu, 09 Oct 2025 17:05:15 +0100, > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > [1 <text/plain; us-ascii (quoted-printable)>] > > On Thu, Oct 09, 2025 at 03:30:56PM +0100, Marc Zyngier wrote: > > > Hi Thierry, > > > > > > On Thu, 09 Oct 2025 12:38:55 +0100, > > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > > > Which brings me to the actual question: what is the right way to solve > > > > this? I had, maybe naively, assumed that the default CPU affinity, which > > > > includes all available CPUs, would be sufficient to have interrupts > > > > balanced across all of those CPUs, but that doesn't appear to be the > > > > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0 > > > > in this particular case) from the affinity mask to set the "effective > > > > affinity", which then dictates where IRQs are handled and where the > > > > corresponding IRQ thread function is run. > > > > > > There's a (GIC-specific) answer to that, and that's the "1 of N" > > > distribution model. The problem is that it is a massive headache (it > > > completely breaks with per-CPU context). > > > > Heh, that started out as a very promising first paragraph but turned > > ugly very quickly... =) > > > > > We could try and hack this in somehow, but defining a reasonable API > > > is complicated. The set of CPUs receiving 1:N interrupts is a *global* > > > set, which means you cannot have one interrupt targeting CPUs 0-1, and > > > another targeting CPUs 2-3. You can only have a single set for all 1:N > > > interrupts. How would you define such a set in a platform agnostic > > > manner so that a random driver could use this? I definitely don't want > > > to have a GIC-specific API. > > > > I see. I've been thinking that maybe the only way to solve this is using > > some sort of policy. A very simple policy might be: use CPU 0 as the > > "default" interrupt (much like it is now) because like you said there > > might be assumptions built-in that break when the interrupt is scheduled > > elsewhere. But then let individual drivers opt into the 1:N set, which > > would perhaps span all available CPUs but the first one. From an API PoV > > this would just be a flag that's passed to request_irq() (or one of its > > derivatives). > > The $10k question is how do you pick the victim CPUs? I can't see how > to do it in a reasonable way unless we decide that interrupts that > have an affinity matching cpu_possible_mask are 1:N. And then we're > left with wondering what to do about CPU hotplug. For fun and giggles, here's the result of a 5 minute hack. It enables 1:N distribution on SPIs that have an "all cpus" affinity. It works on one machine, doesn't on another -- no idea why yet. YMMV. This is of course conditioned on your favourite HW supporting the 1:N feature, and it is likely that things will catch fire quickly. It will probably make your overall interrupt latency *worse*, but maybe less variable. Let me know. M. 
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index dbeb85677b08c..ab32339b32719 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -67,6 +67,7 @@ struct gic_chip_data {
 	u32			nr_redist_regions;
 	u64			flags;
 	bool			has_rss;
+	bool			has_oon;
 	unsigned int		ppi_nr;
 	struct partition_desc	**ppi_descs;
 };
@@ -1173,9 +1174,10 @@ static void gic_update_rdist_properties(void)
 	gic_iterate_rdists(__gic_update_rdist_properties);
 	if (WARN_ON(gic_data.ppi_nr == UINT_MAX))
 		gic_data.ppi_nr = 0;
-	pr_info("GICv3 features: %d PPIs%s%s\n",
+	pr_info("GICv3 features: %d PPIs%s%s%s\n",
 		gic_data.ppi_nr,
 		gic_data.has_rss ? ", RSS" : "",
+		gic_data.has_oon ? ", 1:N" : "",
 		gic_data.rdists.has_direct_lpi ? ", DirectLPI" : "");
 
 	if (gic_data.rdists.has_vlpis)
@@ -1481,6 +1483,7 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 	u32 offset, index;
 	void __iomem *reg;
 	int enabled;
+	bool oon;
 	u64 val;
 
 	if (force)
@@ -1488,6 +1491,8 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 	else
 		cpu = cpumask_any_and(mask_val, cpu_online_mask);
 
+	oon = gic_data.has_oon && cpumask_equal(mask_val, cpu_possible_mask);
+
 	if (cpu >= nr_cpu_ids)
 		return -EINVAL;
 
@@ -1501,7 +1506,7 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 
 	offset = convert_offset_index(d, GICD_IROUTER, &index);
 	reg = gic_dist_base(d) + offset + (index * 8);
-	val = gic_cpu_to_affinity(cpu);
+	val = oon ? GICD_IROUTER_SPI_MODE_ANY : gic_cpu_to_affinity(cpu);
 
 	gic_write_irouter(val, reg);
 
@@ -1512,7 +1517,7 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 	if (enabled)
 		gic_unmask_irq(d);
 
-	irq_data_update_effective_affinity(d, cpumask_of(cpu));
+	irq_data_update_effective_affinity(d, oon ? cpu_possible_mask : cpumask_of(cpu));
 
 	return IRQ_SET_MASK_OK_DONE;
 }
@@ -2114,6 +2119,7 @@ static int __init gic_init_bases(phys_addr_t dist_phys_base,
 	irq_domain_update_bus_token(gic_data.domain, DOMAIN_BUS_WIRED);
 
 	gic_data.has_rss = !!(typer & GICD_TYPER_RSS);
+	gic_data.has_oon = !(typer & GICD_TYPER_No1N);
 
 	if (typer & GICD_TYPER_MBIS) {
 		err = mbi_init(handle, gic_data.domain);
diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
index 70c0948f978eb..ffbfc1c8d1934 100644
--- a/include/linux/irqchip/arm-gic-v3.h
+++ b/include/linux/irqchip/arm-gic-v3.h
@@ -80,6 +80,7 @@
 #define GICD_CTLR_ENABLE_SS_G0		(1U << 0)
 
 #define GICD_TYPER_RSS			(1U << 26)
+#define GICD_TYPER_No1N			(1U << 25)
 #define GICD_TYPER_LPIS			(1U << 17)
 #define GICD_TYPER_MBIS			(1U << 16)
 #define GICD_TYPER_ESPI			(1U << 8)

--
Without deviation from the norm, progress is not possible.
* Re: IRQ thread timeouts and affinity 2025-10-09 18:11 ` Marc Zyngier @ 2025-10-10 13:50 ` Thierry Reding 2025-10-10 14:18 ` Marc Zyngier 0 siblings, 1 reply; 16+ messages in thread From: Thierry Reding @ 2025-10-10 13:50 UTC (permalink / raw) To: Marc Zyngier; +Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 4814 bytes --] On Thu, Oct 09, 2025 at 07:11:20PM +0100, Marc Zyngier wrote: > On Thu, 09 Oct 2025 18:04:58 +0100, > Marc Zyngier <maz@kernel.org> wrote: > > > > On Thu, 09 Oct 2025 17:05:15 +0100, > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > [1 <text/plain; us-ascii (quoted-printable)>] > > > On Thu, Oct 09, 2025 at 03:30:56PM +0100, Marc Zyngier wrote: > > > > Hi Thierry, > > > > > > > > On Thu, 09 Oct 2025 12:38:55 +0100, > > > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > > > > > Which brings me to the actual question: what is the right way to solve > > > > > this? I had, maybe naively, assumed that the default CPU affinity, which > > > > > includes all available CPUs, would be sufficient to have interrupts > > > > > balanced across all of those CPUs, but that doesn't appear to be the > > > > > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0 > > > > > in this particular case) from the affinity mask to set the "effective > > > > > affinity", which then dictates where IRQs are handled and where the > > > > > corresponding IRQ thread function is run. > > > > > > > > There's a (GIC-specific) answer to that, and that's the "1 of N" > > > > distribution model. The problem is that it is a massive headache (it > > > > completely breaks with per-CPU context). > > > > > > Heh, that started out as a very promising first paragraph but turned > > > ugly very quickly... =) > > > > > > > We could try and hack this in somehow, but defining a reasonable API > > > > is complicated. The set of CPUs receiving 1:N interrupts is a *global* > > > > set, which means you cannot have one interrupt targeting CPUs 0-1, and > > > > another targeting CPUs 2-3. You can only have a single set for all 1:N > > > > interrupts. How would you define such a set in a platform agnostic > > > > manner so that a random driver could use this? I definitely don't want > > > > to have a GIC-specific API. > > > > > > I see. I've been thinking that maybe the only way to solve this is using > > > some sort of policy. A very simple policy might be: use CPU 0 as the > > > "default" interrupt (much like it is now) because like you said there > > > might be assumptions built-in that break when the interrupt is scheduled > > > elsewhere. But then let individual drivers opt into the 1:N set, which > > > would perhaps span all available CPUs but the first one. From an API PoV > > > this would just be a flag that's passed to request_irq() (or one of its > > > derivatives). > > > > The $10k question is how do you pick the victim CPUs? I can't see how > > to do it in a reasonable way unless we decide that interrupts that > > have an affinity matching cpu_possible_mask are 1:N. And then we're > > left with wondering what to do about CPU hotplug. > > For fun and giggles, here's the result of a 5 minute hack. It enables > 1:N distribution on SPIs that have an "all cpus" affinity. It works on > one machine, doesn't on another -- no idea why yet. YMMV. > > This is of course conditioned on your favourite HW supporting the 1:N > feature, and it is likely that things will catch fire quickly. 
> It will probably make your overall interrupt latency *worse*, but maybe
> less variable. Let me know.

You might be onto something here. Mind you, I've only done very limited
testing, but the system does boot and the QSPI-related timeouts are gone
completely.

Here are some snippets from the boot log that might be interesting:

[    0.000000] GICv3: GIC: Using split EOI/Deactivate mode
[    0.000000] GIC: enabling workaround for GICv3: NVIDIA erratum T241-FABRIC-4
[    0.000000] GIC: enabling workaround for GICv3: ARM64 erratum 2941627
[    0.000000] GICv3: 960 SPIs implemented
[    0.000000] GICv3: 320 Extended SPIs implemented
[    0.000000] Root IRQ handler: gic_handle_irq
[    0.000000] GICv3: GICv3 features: 16 PPIs, 1:N
[    0.000000] GICv3: CPU0: found redistributor 20000 region 0:0x0000000022100000
[...]
[    0.000000] GICv3: using LPI property table @0x0000000101500000
[    0.000000] GICv3: CPU0: using allocated LPI pending table @0x0000000101540000
[...]

There's a bunch of ITS info that I dropped, as well as the same
redistributor and LPI property table block for each of the 288 CPUs.

/proc/interrupts is much too big to paste here, but it looks like the
QSPI interrupts now end up evenly distributed across the first 72 CPUs
in this system. Not sure why 72, but possibly because this is a 4 NUMA
node system with 72 CPUs each, so the CPU mask might've been restricted
to just the first node.

On the face of it this looks quite promising. Where do we go from here?
Any areas that we need to test more exhaustively to see if this breaks?

Thierry
* Re: IRQ thread timeouts and affinity 2025-10-10 13:50 ` Thierry Reding @ 2025-10-10 14:18 ` Marc Zyngier 2025-10-10 14:38 ` Jon Hunter 2025-10-10 15:03 ` Thierry Reding 0 siblings, 2 replies; 16+ messages in thread From: Marc Zyngier @ 2025-10-10 14:18 UTC (permalink / raw) To: Thierry Reding Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel On Fri, 10 Oct 2025 14:50:57 +0100, Thierry Reding <thierry.reding@gmail.com> wrote: > > On Thu, Oct 09, 2025 at 07:11:20PM +0100, Marc Zyngier wrote: > > On Thu, 09 Oct 2025 18:04:58 +0100, > > Marc Zyngier <maz@kernel.org> wrote: > > > > > > On Thu, 09 Oct 2025 17:05:15 +0100, > > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > > > [1 <text/plain; us-ascii (quoted-printable)>] > > > > On Thu, Oct 09, 2025 at 03:30:56PM +0100, Marc Zyngier wrote: > > > > > Hi Thierry, > > > > > > > > > > On Thu, 09 Oct 2025 12:38:55 +0100, > > > > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > > > > > > > Which brings me to the actual question: what is the right way to solve > > > > > > this? I had, maybe naively, assumed that the default CPU affinity, which > > > > > > includes all available CPUs, would be sufficient to have interrupts > > > > > > balanced across all of those CPUs, but that doesn't appear to be the > > > > > > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0 > > > > > > in this particular case) from the affinity mask to set the "effective > > > > > > affinity", which then dictates where IRQs are handled and where the > > > > > > corresponding IRQ thread function is run. > > > > > > > > > > There's a (GIC-specific) answer to that, and that's the "1 of N" > > > > > distribution model. The problem is that it is a massive headache (it > > > > > completely breaks with per-CPU context). > > > > > > > > Heh, that started out as a very promising first paragraph but turned > > > > ugly very quickly... =) > > > > > > > > > We could try and hack this in somehow, but defining a reasonable API > > > > > is complicated. The set of CPUs receiving 1:N interrupts is a *global* > > > > > set, which means you cannot have one interrupt targeting CPUs 0-1, and > > > > > another targeting CPUs 2-3. You can only have a single set for all 1:N > > > > > interrupts. How would you define such a set in a platform agnostic > > > > > manner so that a random driver could use this? I definitely don't want > > > > > to have a GIC-specific API. > > > > > > > > I see. I've been thinking that maybe the only way to solve this is using > > > > some sort of policy. A very simple policy might be: use CPU 0 as the > > > > "default" interrupt (much like it is now) because like you said there > > > > might be assumptions built-in that break when the interrupt is scheduled > > > > elsewhere. But then let individual drivers opt into the 1:N set, which > > > > would perhaps span all available CPUs but the first one. From an API PoV > > > > this would just be a flag that's passed to request_irq() (or one of its > > > > derivatives). > > > > > > The $10k question is how do you pick the victim CPUs? I can't see how > > > to do it in a reasonable way unless we decide that interrupts that > > > have an affinity matching cpu_possible_mask are 1:N. And then we're > > > left with wondering what to do about CPU hotplug. > > > > For fun and giggles, here's the result of a 5 minute hack. It enables > > 1:N distribution on SPIs that have an "all cpus" affinity. It works on > > one machine, doesn't on another -- no idea why yet. YMMV. 
> >
> > This is of course conditioned on your favourite HW supporting the 1:N
> > feature, and it is likely that things will catch fire quickly. It will
> > probably make your overall interrupt latency *worse*, but maybe less
> > variable. Let me know.
>
> You might be onto something here. Mind you, I've only done very limited
> testing, but the system does boot and the QSPI-related timeouts are gone
> completely.

Hey, progress.

> Here are some snippets from the boot log that might be interesting:
>
> [    0.000000] GICv3: GIC: Using split EOI/Deactivate mode
> [    0.000000] GIC: enabling workaround for GICv3: NVIDIA erratum T241-FABRIC-4
> [    0.000000] GIC: enabling workaround for GICv3: ARM64 erratum 2941627
> [    0.000000] GICv3: 960 SPIs implemented
> [    0.000000] GICv3: 320 Extended SPIs implemented
> [    0.000000] Root IRQ handler: gic_handle_irq
> [    0.000000] GICv3: GICv3 features: 16 PPIs, 1:N
> [    0.000000] GICv3: CPU0: found redistributor 20000 region 0:0x0000000022100000
> [...]
> [    0.000000] GICv3: using LPI property table @0x0000000101500000
> [    0.000000] GICv3: CPU0: using allocated LPI pending table @0x0000000101540000
> [...]
>
> There's a bunch of ITS info that I dropped, as well as the same
> redistributor and LPI property table block for each of the 288 CPUs.
>
> /proc/interrupts is much too big to paste here, but it looks like the
> QSPI interrupts now end up evenly distributed across the first 72 CPUs
> in this system. Not sure why 72, but possibly because this is a 4 NUMA
> node system with 72 CPUs each, so the CPU mask might've been restricted
> to just the first node.

It could well be that your firmware sets GICR_CTLR.DPG1NS on the 3
other nodes, and the patch I gave you doesn't try to change that.
Check with [1], which does the right thing on that front (it fixed a
similar problem on my slightly more modest 12 CPU machine).

> On the face of it this looks quite promising. Where do we go from here?

For a start, you really should consider sending me one of these
machines. I have plans for it ;-)

> Any areas that we need to test more exhaustively to see if this breaks?

CPU hotplug is the main area of concern, and I'm pretty sure it breaks
this distribution mechanism (or the other way around). Another thing
is that if firmware isn't aware that 1:N interrupts can (or should)
wake up a CPU from sleep, bad things will happen. Given that nobody
uses 1:N, you can bet that any bit of privileged SW (TF-A, hypervisors)
is likely to be buggy (I've already spotted bugs in KVM around this).

The other concern is the shape of the API we would expose to drivers,
because I'm not sure we want this sort of "scatter-gun" approach for
all SPIs, and I don't know how that translates to other architectures.

Thomas should probably weigh in here.

Thanks,

	M.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=irq/gicv3-1ofN&id=5856e2eb479fc41ea60e76440f768079a1a21a36

--
Without deviation from the norm, progress is not possible.
* Re: IRQ thread timeouts and affinity 2025-10-10 14:18 ` Marc Zyngier @ 2025-10-10 14:38 ` Jon Hunter 2025-10-10 14:54 ` Thierry Reding 2025-10-10 15:03 ` Thierry Reding 1 sibling, 1 reply; 16+ messages in thread From: Jon Hunter @ 2025-10-10 14:38 UTC (permalink / raw) To: Marc Zyngier, Thierry Reding Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel On 10/10/2025 15:18, Marc Zyngier wrote: ... > CPU hotplug is the main area of concern, and I'm pretty sure it breaks > this distribution mechanism (or the other way around). Another thing > is that if firmware isn't aware that 1:N interrupts can (or should) > wake-up a CPU from sleep, bad things will happen. Given that nobody > uses 1:N, you can bet that any bit of privileged SW (TF-A, > hypervisors) is likely to be buggy (I've already spotted bugs in KVM > around this). Thierry, do we ever hotplug CPUs on this device? If not, I am wondering if something like this, for now, could only be enabled for devices that don't hotplug CPUs. Maybe tied to the kernel config (ie. CONFIG_HOTPLUG_CPU)? Just a thought ... Jon -- nvpublic ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IRQ thread timeouts and affinity 2025-10-10 14:38 ` Jon Hunter @ 2025-10-10 14:54 ` Thierry Reding 2025-10-10 15:52 ` Jon Hunter 0 siblings, 1 reply; 16+ messages in thread From: Thierry Reding @ 2025-10-10 14:54 UTC (permalink / raw) To: Jon Hunter Cc: Marc Zyngier, Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1045 bytes --] On Fri, Oct 10, 2025 at 03:38:59PM +0100, Jon Hunter wrote: > > On 10/10/2025 15:18, Marc Zyngier wrote: > > ... > > > CPU hotplug is the main area of concern, and I'm pretty sure it breaks > > this distribution mechanism (or the other way around). Another thing > > is that if firmware isn't aware that 1:N interrupts can (or should) > > wake-up a CPU from sleep, bad things will happen. Given that nobody > > uses 1:N, you can bet that any bit of privileged SW (TF-A, > > hypervisors) is likely to be buggy (I've already spotted bugs in KVM > > around this). > > Thierry, do we ever hotplug CPUs on this device? If not, I am wondering if > something like this, for now, could only be enabled for devices that don't > hotplug CPUs. Maybe tied to the kernel config (ie. CONFIG_HOTPLUG_CPU)? Just > a thought ... I've only had limited exposure to this, so I don't know all of the use- cases. People can buy these devices and do anything they want with it, so I think we have to account for the general case. Thierry [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IRQ thread timeouts and affinity
From: Jon Hunter @ 2025-10-10 15:52 UTC
To: Thierry Reding
Cc: Marc Zyngier, Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel

On 10/10/2025 15:54, Thierry Reding wrote:
> On Fri, Oct 10, 2025 at 03:38:59PM +0100, Jon Hunter wrote:
>>
>> On 10/10/2025 15:18, Marc Zyngier wrote:
>>
>> ...
>>
>>> CPU hotplug is the main area of concern, and I'm pretty sure it breaks
>>> this distribution mechanism (or the other way around). Another thing
>>> is that if firmware isn't aware that 1:N interrupts can (or should)
>>> wake-up a CPU from sleep, bad things will happen. Given that nobody
>>> uses 1:N, you can bet that any bit of privileged SW (TF-A,
>>> hypervisors) is likely to be buggy (I've already spotted bugs in KVM
>>> around this).
>>
>> Thierry, do we ever hotplug CPUs on this device? If not, I am wondering if
>> something like this, for now, could only be enabled for devices that don't
>> hotplug CPUs. Maybe tied to the kernel config (ie. CONFIG_HOTPLUG_CPU)?
>> Just a thought ...
>
> I've only had limited exposure to this, so I don't know all of the
> use-cases. People can buy these devices and do anything they want with
> it, so I think we have to account for the general case.

Yes, but the point I was trying to make is that you can prevent this
from being used if CPU hotplug is enabled in the kernel, and initially
limit it to configurations where this feature would/could be enabled. So
you take CPU hotplug out of the equation (initially). Of course someone
can hack the kernel and do what they want, but there is nothing you can
do about that.

Jon

--
nvpublic
* Re: IRQ thread timeouts and affinity 2025-10-10 14:18 ` Marc Zyngier 2025-10-10 14:38 ` Jon Hunter @ 2025-10-10 15:03 ` Thierry Reding 2025-10-11 10:00 ` Marc Zyngier 1 sibling, 1 reply; 16+ messages in thread From: Thierry Reding @ 2025-10-10 15:03 UTC (permalink / raw) To: Marc Zyngier; +Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 6858 bytes --] On Fri, Oct 10, 2025 at 03:18:13PM +0100, Marc Zyngier wrote: > On Fri, 10 Oct 2025 14:50:57 +0100, > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > On Thu, Oct 09, 2025 at 07:11:20PM +0100, Marc Zyngier wrote: > > > On Thu, 09 Oct 2025 18:04:58 +0100, > > > Marc Zyngier <maz@kernel.org> wrote: > > > > > > > > On Thu, 09 Oct 2025 17:05:15 +0100, > > > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > > > > > [1 <text/plain; us-ascii (quoted-printable)>] > > > > > On Thu, Oct 09, 2025 at 03:30:56PM +0100, Marc Zyngier wrote: > > > > > > Hi Thierry, > > > > > > > > > > > > On Thu, 09 Oct 2025 12:38:55 +0100, > > > > > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > > > > > > > > > Which brings me to the actual question: what is the right way to solve > > > > > > > this? I had, maybe naively, assumed that the default CPU affinity, which > > > > > > > includes all available CPUs, would be sufficient to have interrupts > > > > > > > balanced across all of those CPUs, but that doesn't appear to be the > > > > > > > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0 > > > > > > > in this particular case) from the affinity mask to set the "effective > > > > > > > affinity", which then dictates where IRQs are handled and where the > > > > > > > corresponding IRQ thread function is run. > > > > > > > > > > > > There's a (GIC-specific) answer to that, and that's the "1 of N" > > > > > > distribution model. The problem is that it is a massive headache (it > > > > > > completely breaks with per-CPU context). > > > > > > > > > > Heh, that started out as a very promising first paragraph but turned > > > > > ugly very quickly... =) > > > > > > > > > > > We could try and hack this in somehow, but defining a reasonable API > > > > > > is complicated. The set of CPUs receiving 1:N interrupts is a *global* > > > > > > set, which means you cannot have one interrupt targeting CPUs 0-1, and > > > > > > another targeting CPUs 2-3. You can only have a single set for all 1:N > > > > > > interrupts. How would you define such a set in a platform agnostic > > > > > > manner so that a random driver could use this? I definitely don't want > > > > > > to have a GIC-specific API. > > > > > > > > > > I see. I've been thinking that maybe the only way to solve this is using > > > > > some sort of policy. A very simple policy might be: use CPU 0 as the > > > > > "default" interrupt (much like it is now) because like you said there > > > > > might be assumptions built-in that break when the interrupt is scheduled > > > > > elsewhere. But then let individual drivers opt into the 1:N set, which > > > > > would perhaps span all available CPUs but the first one. From an API PoV > > > > > this would just be a flag that's passed to request_irq() (or one of its > > > > > derivatives). > > > > > > > > The $10k question is how do you pick the victim CPUs? I can't see how > > > > to do it in a reasonable way unless we decide that interrupts that > > > > have an affinity matching cpu_possible_mask are 1:N. 
And then we're > > > > left with wondering what to do about CPU hotplug. > > > > > > For fun and giggles, here's the result of a 5 minute hack. It enables > > > 1:N distribution on SPIs that have an "all cpus" affinity. It works on > > > one machine, doesn't on another -- no idea why yet. YMMV. > > > > > > This is of course conditioned on your favourite HW supporting the 1:N > > > feature, and it is likely that things will catch fire quickly. It will > > > probably make your overall interrupt latency *worse*, but maybe less > > > variable. Let me know. > > > > You might be onto something here. Mind you, I've only done very limited > > testing, but the system does boot and the QSPI related timeouts are gone > > completely. > > Hey, progress. > > > Here's some snippets from the boot log that might be interesting: > > > > [ 0.000000] GICv3: GIC: Using split EOI/Deactivate mode > > [ 0.000000] GIC: enabling workaround for GICv3: NVIDIA erratum T241-FABRIC-4 > > [ 0.000000] GIC: enabling workaround for GICv3: ARM64 erratum 2941627 > > [ 0.000000] GICv3: 960 SPIs implemented > > [ 0.000000] GICv3: 320 Extended SPIs implemented > > [ 0.000000] Root IRQ handler: gic_handle_irq > > [ 0.000000] GICv3: GICv3 features: 16 PPIs, 1:N > > [ 0.000000] GICv3: CPU0: found redistributor 20000 region 0:0x0000000022100000 > > [...] > > [ 0.000000] GICv3: using LPI property table @0x0000000101500000 > > [ 0.000000] GICv3: CPU0: using allocated LPI pending table @0x0000000101540000 > > [...] > > > > There's a bunch of ITS info that I dropped, as well as the same > > redistributor and LPI property table block for each of the 288 CPUs. > > > > /proc/interrupts is much too big to paste here, but it looks like the > > QSPI interrupts now end up evenly distributed across the first 72 CPUs > > in this system. Not sure why 72, but possibly because this is a 4 NUMA > > node system with 72 CPUs each, so the CPU mask might've been restricted > > to just the first node. > > It could well be that your firmware sets GICR_CTLR.DPG1NS on the 3 > other nodes, and the patch I gave you doesn't try to change that. > Check with [1], which does the right thing on that front (it fixed a > similar problem on my slightly more modest 12 CPU machine). > > > On the face of it this looks quite promising. Where do we go from here? > > For a start, you really should consider sending me one of these > machines. I have plans for it ;-) I'm quite happy with someone else hosting this device, I don't think the electrical installation at home could handle it. It has proven to be quite well suited for kernel builds... > > Any areas that we need to test more exhaustively to see if this breaks? > > CPU hotplug is the main area of concern, and I'm pretty sure it breaks > this distribution mechanism (or the other way around). Another thing > is that if firmware isn't aware that 1:N interrupts can (or should) > wake-up a CPU from sleep, bad things will happen. Given that nobody > uses 1:N, you can bet that any bit of privileged SW (TF-A, > hypervisors) is likely to be buggy (I've already spotted bugs in KVM > around this). Okay, I can find out if CPU hotplug is a common use-case on these devices, or if we can run some tests with that. > The other concern is the shape of the API we would expose to drivers, > because I'm not sure we want this sort of "scatter-gun" approach for > all SPIs, and I don't know how that translates to other architectures. > > Thomas should probably weight in here. 
Yes, it would be interesting to understand how we can make use of this in a more generic way. Thierry [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IRQ thread timeouts and affinity 2025-10-10 15:03 ` Thierry Reding @ 2025-10-11 10:00 ` Marc Zyngier 2025-10-14 10:50 ` Thierry Reding 0 siblings, 1 reply; 16+ messages in thread From: Marc Zyngier @ 2025-10-11 10:00 UTC (permalink / raw) To: Thierry Reding Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel On Fri, 10 Oct 2025 16:03:01 +0100, Thierry Reding <thierry.reding@gmail.com> wrote: > > On Fri, Oct 10, 2025 at 03:18:13PM +0100, Marc Zyngier wrote: > > > > CPU hotplug is the main area of concern, and I'm pretty sure it breaks > > this distribution mechanism (or the other way around). Another thing > > is that if firmware isn't aware that 1:N interrupts can (or should) > > wake-up a CPU from sleep, bad things will happen. Given that nobody > > uses 1:N, you can bet that any bit of privileged SW (TF-A, > > hypervisors) is likely to be buggy (I've already spotted bugs in KVM > > around this). > > Okay, I can find out if CPU hotplug is a common use-case on these > devices, or if we can run some tests with that. It's not so much whether CPU hotplug is of any use to your particular box, but whether this has any detrimental impact on *any* machine doing CPU hotplug. To be clear, this stuff doesn't go in if something breaks, no matter how small. M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IRQ thread timeouts and affinity
From: Thierry Reding @ 2025-10-14 10:50 UTC
To: Marc Zyngier
Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel

On Sat, Oct 11, 2025 at 11:00:11AM +0100, Marc Zyngier wrote:
> On Fri, 10 Oct 2025 16:03:01 +0100,
> Thierry Reding <thierry.reding@gmail.com> wrote:
> >
> > On Fri, Oct 10, 2025 at 03:18:13PM +0100, Marc Zyngier wrote:
> > >
> > > CPU hotplug is the main area of concern, and I'm pretty sure it breaks
> > > this distribution mechanism (or the other way around). Another thing
> > > is that if firmware isn't aware that 1:N interrupts can (or should)
> > > wake-up a CPU from sleep, bad things will happen. Given that nobody
> > > uses 1:N, you can bet that any bit of privileged SW (TF-A,
> > > hypervisors) is likely to be buggy (I've already spotted bugs in KVM
> > > around this).
> >
> > Okay, I can find out if CPU hotplug is a common use-case on these
> > devices, or if we can run some tests with that.
>
> It's not so much whether CPU hotplug is of any use to your particular
> box, but whether this has any detrimental impact on *any* machine
> doing CPU hotplug.
>
> To be clear, this stuff doesn't go in if something breaks, no matter
> how small.

Of course. I do want to find a way to move forward with this, so I'm
trying to find ways to check what impact this would have in conjunction
with CPU hotplug.

I've done some minimal testing on a Tegra264 device where we have fewer
CPUs. With your patch applied, I see that most interrupts are nicely
distributed across CPUs. I'm going to use the serial interrupt as an
example since it reliably triggers when I test on a system. Here's an
extract after boot:

# cat /proc/interrupts
          CPU0  CPU1  CPU2  CPU3  CPU4  CPU5  CPU6  CPU7
 25:        42    44    41    29    37    36    39    36  GICv3 547 Level  c4e0000.serial

I then took CPU 1 offline:

# echo 0 > /sys/devices/system/cpu/cpu1/online

After that it looks like the GIC automatically reverts to using the
first CPU, since after a little while:

# cat /proc/interrupts
          CPU0  CPU2  CPU3  CPU4  CPU5  CPU6  CPU7
 25:       186    66    52    64    58    67    62  GICv3 547 Level  c4e0000.serial

The interrupt count for CPUs 2-7 no longer increments after taking CPU 1
offline. Interestingly, bringing CPU 1 back online doesn't have an
impact, so it doesn't go back to enabling 1:N mode.

Nothing did seem to break. Obviously this doesn't show anything about
performance yet, but it looks like at least things don't crash and burn.

Anything else that you think I can test? Do we have a way of restoring
1:N when all CPUs are back online?

Thierry
* Re: IRQ thread timeouts and affinity 2025-10-14 10:50 ` Thierry Reding @ 2025-10-14 11:08 ` Thierry Reding 2025-10-14 17:46 ` Marc Zyngier 0 siblings, 1 reply; 16+ messages in thread From: Thierry Reding @ 2025-10-14 11:08 UTC (permalink / raw) To: Marc Zyngier; +Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 2783 bytes --] On Tue, Oct 14, 2025 at 12:50:18PM +0200, Thierry Reding wrote: > On Sat, Oct 11, 2025 at 11:00:11AM +0100, Marc Zyngier wrote: > > On Fri, 10 Oct 2025 16:03:01 +0100, > > Thierry Reding <thierry.reding@gmail.com> wrote: > > > > > > On Fri, Oct 10, 2025 at 03:18:13PM +0100, Marc Zyngier wrote: > > > > > > > > CPU hotplug is the main area of concern, and I'm pretty sure it breaks > > > > this distribution mechanism (or the other way around). Another thing > > > > is that if firmware isn't aware that 1:N interrupts can (or should) > > > > wake-up a CPU from sleep, bad things will happen. Given that nobody > > > > uses 1:N, you can bet that any bit of privileged SW (TF-A, > > > > hypervisors) is likely to be buggy (I've already spotted bugs in KVM > > > > around this). > > > > > > Okay, I can find out if CPU hotplug is a common use-case on these > > > devices, or if we can run some tests with that. > > > > It's not so much whether CPU hotplug is of any use to your particular > > box, but whether this has any detrimental impact on *any* machine > > doing CPU hotplug. > > > > To be clear, this stuff doesn't go in if something breaks, no matter > > how small. > > Of course. I do want to find a way to move forward with this, so I'm > trying to find ways to check what impact this would have in conjunction > with CPU hotplug. > > I've done some minimal testing on a Tegra264 device where we have less > CPUs. With your patch applied, I see that most interrupts are nicely > distributed across CPUs. I'm going to use the serial interrupt as an > example since it reliably triggers when I test on a system. Here's an > extract after boot: > > # cat /proc/interrupts > CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 > 25: 42 44 41 29 37 36 39 36 GICv3 547 Level c4e0000.serial > > I then took CPU 1 offline: > > # echo 0 > /sys/devices/system/cpu/cpu1/online > > After that it looks like the GIC automatically reverts to using the > first CPU, since after a little while: > > # cat /proc/interrupts > CPU0 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 > 25: 186 66 52 64 58 67 62 GICv3 547 Level c4e0000.serial > > The interrupt count for CPUs 2-7 no longer increments after taking CPU 1 > offline. Interestingly, bringing CPU 1 back online doesn't have an > impact, so it doesn't go back to enabling 1:N mode. Looks like that is because gic_set_affinity() gets called with the new CPU mask when the CPU goes offline, but it's *not* called when the CPU comes back online. Thierry [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
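One way to poke at this from the driver side, purely as an experiment,
would be a CPU hotplug callback that rewrites the affinity with the full
mask whenever a CPU comes online, forcing gic_set_affinity() to run again.
This is a sketch under the assumption that the experimental GICv3 patch
above is applied; example_irq, example_cpu_online and the state name string
are invented, while cpuhp_setup_state(), CPUHP_AP_ONLINE_DYN and
irq_set_affinity_hint() are existing kernel APIs. It is not the approach
proposed in this thread, only an illustration of the missing re-trigger.

#include <linux/cpuhotplug.h>
#include <linux/cpumask.h>
#include <linux/interrupt.h>

/* Interrupt that should stay spread across all CPUs (example value). */
static unsigned int example_irq;

/*
 * Rewrite the affinity with the full mask whenever a CPU comes online.
 * With the experimental GICv3 patch, an affinity equal to
 * cpu_possible_mask is what enables 1:N routing, so reapplying it here
 * undoes the fallback to a single target CPU that happens when a CPU is
 * taken offline.
 */
static int example_cpu_online(unsigned int cpu)
{
	return irq_set_affinity_hint(example_irq, cpu_possible_mask);
}

static int example_register_online_callback(void)
{
	int ret;

	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "example/irq-1ofN:online",
				example_cpu_online, NULL);
	return ret < 0 ? ret : 0;
}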
* Re: IRQ thread timeouts and affinity
From: Marc Zyngier @ 2025-10-14 17:46 UTC
To: Thierry Reding
Cc: Thomas Gleixner, linux-tegra, linux-arm-kernel, linux-kernel

On Tue, 14 Oct 2025 12:08:22 +0100,
Thierry Reding <thierry.reding@gmail.com> wrote:
> [...]
> > The interrupt count for CPUs 2-7 no longer increments after taking CPU 1
> > offline. Interestingly, bringing CPU 1 back online doesn't have an
> > impact, so it doesn't go back to enabling 1:N mode.
>
> Looks like that is because gic_set_affinity() gets called with the new
> CPU mask when the CPU goes offline, but it's *not* called when the CPU
> comes back online.

Indeed, because there is no need to change the affinity as far as the
kernel is concerned -- the interrupt is on an online CPU and all is
well.

I think that's the point where a per-interrupt flag (let's call it
IRQ_BCAST for the sake of argument) is required to decide what to do.
Ideally, IRQ_BCAST would replace any notion of affinity, and you'd get
the scatter-gun behaviour all the time. Which means no adjustment to
the affinity on a CPU going offline (everything still works).

But that assumes a bunch of other things:

- when going offline, at least DPG1NS gets set to make sure this CPU is
  not a target anymore if not going completely dead (still running
  secure code, for example). The kernel could do it, but...

- when going idle, should this CPU still be a target of 1:N interrupts?
  That's a firmware decision that could severely impact power on
  battery-bound machines if not carefully managed...

- and should a CPU wake up from such an interrupt? Again, that's a
  firmware decision, and I don't know how existing implementations deal
  with that stuff.

Someone needs to investigate these things, and work out all of the
above. That will give us a set of conditions under which we could do
something.

	M.

--
Jazz isn't dead. It just smells funny.
* Re: IRQ thread timeouts and affinity
From: Thomas Gleixner @ 2025-10-16 18:53 UTC
To: Thierry Reding, Marc Zyngier
Cc: linux-tegra, linux-arm-kernel, linux-kernel

On Thu, Oct 09 2025 at 13:38, Thierry Reding wrote:
> We've been running into an issue on some systems (NVIDIA Grace chips)
> where either during boot or at runtime, CPU 0 can be under very high
> load and cause some IRQ thread functions to be delayed to a point where
> we encounter the timeout in the work submission parts of the driver.
>
> Specifically this happens for the Tegra QSPI controller driver found
> in drivers/spi/spi-tegra210-quad.c. This driver uses an IRQ thread to
> wait for and process "transfer ready" interrupts (which need to run
> DMA transfers or copy from the hardware FIFOs using PIO to get the
> SPI transfer data). Under heavy load, we've seen the IRQ thread run
> with up to multiple seconds of delay.

If the interrupt thread, which runs with SCHED_FIFO, is delayed for
multiple seconds, then there is something seriously wrong to begin
with. You fail to explain how that happens in the first place. Heavy
load is not really a good explanation for that.

> Alternatively, would it be possible (and make sense) to make the IRQ
> core code schedule threads across more CPUs? Is there a particular
> reason that the IRQ thread runs on the same CPU that services the IRQ?

Locality. Also remote wakeups are way more expensive than local
wakeups. Though there is no actual hard requirement to force it onto
the same CPU.

What could be done is to have a flag which binds the thread to the real
affinity mask instead of the effective affinity mask so it can be
scheduled freely. Needs some thoughts, but should work.

> Maybe another way would be to "reserve" CPU 0 for the type of core OS
> driver like QSPI (the TPM is connected to this controller) and make sure
> all CPU intensive tasks do not run on that CPU?
>
> I know that things like irqbalance and taskset exist to solve some of
> these problems, but they do not work when we hit these cases at boot
> time.

I'm still completely failing to see how you end up with multiple
seconds of delay for that thread, especially during boot. What exactly
keeps it from getting scheduled?

Thanks,

	tglx
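A conceptual sketch of the flag Thomas describes is shown below. This is
not the actual kernel/irq/manage.c code, and IRQF_THREAD_AFFINITY_FULL is
an invented name; struct irq_desc, struct irqaction and
irq_data_get_effective_affinity_mask() are real kernel interfaces. It only
illustrates choosing between the full affinity mask and the effective
affinity mask when deciding where the IRQ thread may run.

#include <linux/interrupt.h>
#include <linux/irq.h>
#include <linux/irqdesc.h>

#define IRQF_THREAD_AFFINITY_FULL	0x01000000	/* invented name */

/*
 * Pick the mask the IRQ thread would be bound to. Today the thread
 * follows the effective affinity (a single CPU on GICv3); with the
 * hypothetical flag set it would follow the full affinity mask instead,
 * so the scheduler could place it on any CPU in that mask.
 */
static const struct cpumask *
example_irq_thread_mask(struct irq_desc *desc, struct irqaction *action)
{
	if (action->flags & IRQF_THREAD_AFFINITY_FULL)
		return desc->irq_common_data.affinity;

	return irq_data_get_effective_affinity_mask(&desc->irq_data);
}

/* The binding itself would then be set_cpus_allowed_ptr(current, mask). */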