* [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table()
@ 2026-01-07 21:53 Waiman Long
2026-01-08 8:26 ` Marc Zyngier
0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2026-01-07 21:53 UTC (permalink / raw)
To: Marc Zyngier, Thomas Gleixner, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt
Cc: linux-arm-kernel, linux-kernel, linux-rt-devel, Waiman Long
When running a PREEMPT_RT debug kernel on a 2-socket Grace arm64 system,
the following bug report was produced at bootup time.
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/72
preempt_count: 1, expected: 0
RCU nest depth: 1, expected: 1
:
CPU: 72 UID: 0 PID: 0 Comm: swapper/72 Tainted: G W 6.19.0-rc4-test+ #4 PREEMPT_{RT,(full)}
Tainted: [W]=WARN
Call trace:
:
rt_spin_lock+0xe4/0x408
rmqueue_bulk+0x48/0x1de8
__rmqueue_pcplist+0x410/0x650
rmqueue.constprop.0+0x6a8/0x2b50
get_page_from_freelist+0x3c0/0xe68
__alloc_frozen_pages_noprof+0x1dc/0x348
alloc_pages_mpol+0xe4/0x2f8
alloc_frozen_pages_noprof+0x124/0x190
allocate_slab+0x2f0/0x438
new_slab+0x4c/0x80
___slab_alloc+0x410/0x798
__slab_alloc.constprop.0+0x88/0x1e0
__kmalloc_cache_noprof+0x2dc/0x4b0
allocate_vpe_l1_table+0x114/0x788
its_cpu_init_lpis+0x344/0x790
its_cpu_init+0x60/0x220
gic_starting_cpu+0x64/0xe8
cpuhp_invoke_callback+0x438/0x6d8
__cpuhp_invoke_callback_range+0xd8/0x1f8
notify_cpu_starting+0x11c/0x178
secondary_start_kernel+0xc8/0x188
__secondary_switched+0xc0/0xc8
This is due to the fact that allocate_vpe_l1_table() will call
kzalloc() to allocate a cpumask_t when the first CPU of the
second node of the 72-cpu Grace system is being brought up from the
CPUHP_AP_MIPS_GIC_TIMER_STARTING state inside the starting section of
the CPU hotplug bringup pipeline where interrupts are disabled. This is
an atomic context where sleeping is not allowed and acquiring a sleeping
rt_spin_lock within kzalloc() may lead to a system hang in case there
is lock contention.
To work around this issue, a static buffer is used for cpumask
allocation when running a PREEMPT_RT kernel via the newly introduced
vpe_alloc_cpumask() helper. The static buffer is currently set to be
4 kbytes in size. As only one cpumask is needed per node, the current
size should be big enough as long as (cpumask_size() * nr_node_ids)
is not bigger than 4k.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 drivers/irqchip/irq-gic-v3-its.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index ada585bfa451..9185785524dc 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -2896,6 +2896,30 @@ static bool allocate_vpe_l2_table(int cpu, u32 id)
 	return true;
 }
 
+static void *vpe_alloc_cpumask(void)
+{
+	/*
+	 * With PREEMPT_RT kernel, we can't call any k*alloc() APIs as they
+	 * may acquire a sleeping rt_spin_lock in an atomic context. So use
+	 * a pre-allocated buffer instead.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+		static unsigned long mask_buf[512];
+		static atomic_t alloc_idx;
+		int idx, mask_size = cpumask_size();
+		int nr_cpumasks = sizeof(mask_buf)/mask_size;
+
+		/*
+		 * Fetch an allocation index and if it points to a buffer within
+		 * mask_buf[], return that. Fall back to kzalloc() otherwise.
+		 */
+		idx = atomic_fetch_inc(&alloc_idx);
+		if (idx < nr_cpumasks)
+			return &mask_buf[idx * mask_size/sizeof(long)];
+	}
+	return kzalloc(sizeof(cpumask_t), GFP_ATOMIC);
+}
+
 static int allocate_vpe_l1_table(void)
 {
 	void __iomem *vlpi_base = gic_data_rdist_vlpi_base();
@@ -2927,7 +2951,7 @@ static int allocate_vpe_l1_table(void)
 	if (val & GICR_VPROPBASER_4_1_VALID)
 		goto out;
 
-	gic_data_rdist()->vpe_table_mask = kzalloc(sizeof(cpumask_t), GFP_ATOMIC);
+	gic_data_rdist()->vpe_table_mask = vpe_alloc_cpumask();
 	if (!gic_data_rdist()->vpe_table_mask)
 		return -ENOMEM;
--
2.52.0
^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table()
  2026-01-07 21:53 [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() Waiman Long
@ 2026-01-08  8:26 ` Marc Zyngier
  2026-01-08 22:11   ` Thomas Gleixner
  2026-01-10 21:47   ` Waiman Long
  0 siblings, 2 replies; 24+ messages in thread
From: Marc Zyngier @ 2026-01-08  8:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel

On Wed, 07 Jan 2026 21:53:53 +0000,
Waiman Long <longman@redhat.com> wrote:
>
> When running a PREEMPT_RT debug kernel on a 2-socket Grace arm64 system,
> the following bug report was produced at bootup time.
>
> BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
> in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/72
> preempt_count: 1, expected: 0
> RCU nest depth: 1, expected: 1
>  :
> CPU: 72 UID: 0 PID: 0 Comm: swapper/72 Tainted: G W 6.19.0-rc4-test+ #4 PREEMPT_{RT,(full)}
> Tainted: [W]=WARN
> Call trace:
>  :
> rt_spin_lock+0xe4/0x408
> rmqueue_bulk+0x48/0x1de8
> __rmqueue_pcplist+0x410/0x650
> rmqueue.constprop.0+0x6a8/0x2b50
> get_page_from_freelist+0x3c0/0xe68
> __alloc_frozen_pages_noprof+0x1dc/0x348
> alloc_pages_mpol+0xe4/0x2f8
> alloc_frozen_pages_noprof+0x124/0x190
> allocate_slab+0x2f0/0x438
> new_slab+0x4c/0x80
> ___slab_alloc+0x410/0x798
> __slab_alloc.constprop.0+0x88/0x1e0
> __kmalloc_cache_noprof+0x2dc/0x4b0
> allocate_vpe_l1_table+0x114/0x788
> its_cpu_init_lpis+0x344/0x790
> its_cpu_init+0x60/0x220
> gic_starting_cpu+0x64/0xe8
> cpuhp_invoke_callback+0x438/0x6d8
> __cpuhp_invoke_callback_range+0xd8/0x1f8
> notify_cpu_starting+0x11c/0x178
> secondary_start_kernel+0xc8/0x188
> __secondary_switched+0xc0/0xc8
>
> This is due to the fact that allocate_vpe_l1_table() will call
> kzalloc() to allocate a cpumask_t when the first CPU of the
> second node of the 72-cpu Grace system is being brought up from the
> CPUHP_AP_MIPS_GIC_TIMER_STARTING state inside the starting section of

Surely *not* that particular state.

> the CPU hotplug bringup pipeline where interrupts are disabled. This is
> an atomic context where sleeping is not allowed and acquiring a sleeping
> rt_spin_lock within kzalloc() may lead to a system hang in case there
> is lock contention.
>
> To work around this issue, a static buffer is used for cpumask
> allocation when running a PREEMPT_RT kernel via the newly introduced
> vpe_alloc_cpumask() helper. The static buffer is currently set to be
> 4 kbytes in size. As only one cpumask is needed per node, the current
> size should be big enough as long as (cpumask_size() * nr_node_ids)
> is not bigger than 4k.

What role does the node play here? The GIC topology has nothing to do
with NUMA. It may be true on your particular toy, but that's
definitely not true architecturally. You could, at worse, end-up with
one such cpumask per *CPU*. That'd be a braindead system, but this
code is written to support the architecture, not any particular
implementation.

>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  drivers/irqchip/irq-gic-v3-its.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index ada585bfa451..9185785524dc 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -2896,6 +2896,30 @@ static bool allocate_vpe_l2_table(int cpu, u32 id)
>  	return true;
>  }
>
> +static void *vpe_alloc_cpumask(void)
> +{
> +	/*
> +	 * With PREEMPT_RT kernel, we can't call any k*alloc() APIs as they
> +	 * may acquire a sleeping rt_spin_lock in an atomic context. So use
> +	 * a pre-allocated buffer instead.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> +		static unsigned long mask_buf[512];
> +		static atomic_t alloc_idx;
> +		int idx, mask_size = cpumask_size();
> +		int nr_cpumasks = sizeof(mask_buf)/mask_size;
> +
> +		/*
> +		 * Fetch an allocation index and if it points to a buffer within
> +		 * mask_buf[], return that. Fall back to kzalloc() otherwise.
> +		 */
> +		idx = atomic_fetch_inc(&alloc_idx);
> +		if (idx < nr_cpumasks)
> +			return &mask_buf[idx * mask_size/sizeof(long)];
> +	}

Err, no. That's horrible. I can see three ways to address this in a
more appealing way:

- you give RT a generic allocator that works for (small) atomic
  allocations. I appreciate that's not easy, and even probably
  contrary to the RT goals. But I'm also pretty sure that the GIC code
  is not the only pile of crap being caught doing that.

- you pre-compute upfront how many cpumasks you are going to require,
  based on the actual GIC topology. You do that on CPU0, outside of
  the hotplug constraints, and allocate what you need. This is
  difficult as you need to ensure the RD<->CPU matching without the
  CPUs having booted, which means wading through the DT/ACPI gunk to
  try and guess what you have.

- you delay the allocation of L1 tables to a context where you can
  perform allocations, and before we have a chance of running a guest
  on this CPU. That's probably the simplest option (though dealing
  with late onlining while guests are already running could be
  interesting...).

But I'm always going to say no to something that is a poor hack and
ultimately falling back to the same broken behaviour.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table()
  2026-01-08  8:26 ` Marc Zyngier
@ 2026-01-08 22:11   ` Thomas Gleixner
  2026-01-09 16:13     ` Marc Zyngier
  2026-01-10 21:47   ` Waiman Long
  1 sibling, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2026-01-08 22:11 UTC (permalink / raw)
  To: Marc Zyngier, Waiman Long
  Cc: Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	linux-arm-kernel, linux-kernel, linux-rt-devel

On Thu, Jan 08 2026 at 08:26, Marc Zyngier wrote:
> Err, no. That's horrible. I can see three ways to address this in a
> more appealing way:
>
> - you give RT a generic allocator that works for (small) atomic
>   allocations. I appreciate that's not easy, and even probably
>   contrary to the RT goals. But I'm also pretty sure that the GIC code
>   is not the only pile of crap being caught doing that.
>
> - you pre-compute upfront how many cpumasks you are going to require,
>   based on the actual GIC topology. You do that on CPU0, outside of
>   the hotplug constraints, and allocate what you need. This is
>   difficult as you need to ensure the RD<->CPU matching without the
>   CPUs having booted, which means wading through the DT/ACPI gunk to
>   try and guess what you have.
>
> - you delay the allocation of L1 tables to a context where you can
>   perform allocations, and before we have a chance of running a guest
>   on this CPU. That's probably the simplest option (though dealing
>   with late onlining while guests are already running could be
>   interesting...).

At the point where a CPU is brought up, the topology should be known
already, which means this can be allocated on the control CPU _before_
the new CPU comes up, no?

Thanks,

        tglx

* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table()
  2026-01-08 22:11   ` Thomas Gleixner
@ 2026-01-09 16:13     ` Marc Zyngier
  2026-01-11  9:39       ` Thomas Gleixner
  0 siblings, 1 reply; 24+ messages in thread
From: Marc Zyngier @ 2026-01-09 16:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Waiman Long, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel

On Thu, 08 Jan 2026 22:11:33 +0000,
Thomas Gleixner <tglx@kernel.org> wrote:
>
> On Thu, Jan 08 2026 at 08:26, Marc Zyngier wrote:
> > Err, no. That's horrible. I can see three ways to address this in a
> > more appealing way:
> >
> > - you give RT a generic allocator that works for (small) atomic
> >   allocations. I appreciate that's not easy, and even probably
> >   contrary to the RT goals. But I'm also pretty sure that the GIC code
> >   is not the only pile of crap being caught doing that.
> >
> > - you pre-compute upfront how many cpumasks you are going to require,
> >   based on the actual GIC topology. You do that on CPU0, outside of
> >   the hotplug constraints, and allocate what you need. This is
> >   difficult as you need to ensure the RD<->CPU matching without the
> >   CPUs having booted, which means wading through the DT/ACPI gunk to
> >   try and guess what you have.
> >
> > - you delay the allocation of L1 tables to a context where you can
> >   perform allocations, and before we have a chance of running a guest
> >   on this CPU. That's probably the simplest option (though dealing
> >   with late onlining while guests are already running could be
> >   interesting...).
>
> At the point where a CPU is brought up, the topology should be known
> already, which means this can be allocated on the control CPU _before_
> the new CPU comes up, no?

No. Each CPU finds *itself* in the forest of redistributors, and from
there tries to find whether it has some shared resource with a CPU
that has booted before it. That's because firmware is absolutely awful
and can't present a consistent view of the system.

Anyway, I expect it could be solved by moving this part of the init to
an ONLINE HP callback.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table()
  2026-01-09 16:13     ` Marc Zyngier
@ 2026-01-11  9:39       ` Thomas Gleixner
  2026-01-11 10:38         ` Marc Zyngier
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2026-01-11 9:39 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Waiman Long, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel

On Fri, Jan 09 2026 at 16:13, Marc Zyngier wrote:
> On Thu, 08 Jan 2026 22:11:33 +0000,
> Thomas Gleixner <tglx@kernel.org> wrote:
>> At the point where a CPU is brought up, the topology should be known
>> already, which means this can be allocated on the control CPU _before_
>> the new CPU comes up, no?
>
> No. Each CPU finds *itself* in the forest of redistributors, and from
> there tries to find whether it has some shared resource with a CPU
> that has booted before it. That's because firmware is absolutely awful
> and can't present a consistent view of the system.

Groan....

> Anyway, I expect it could be solved by moving this part of the init to
> an ONLINE HP callback.

Which needs to be before CPUHP_AP_IRQ_AFFINITY_ONLINE, but even that
might be too late because there are callbacks in the STARTING section,
i.e. timer, perf, which might rely on interrupts being accessible.

Also that patch seems to be incomplete because there is another
allocation further down in allocate_vpe_l1_table()....

Thanks,

        tglx

* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table()
  2026-01-11  9:39       ` Thomas Gleixner
@ 2026-01-11 10:38         ` Marc Zyngier
  2026-01-11 16:20           ` Thomas Gleixner
  2026-01-11 23:02           ` Waiman Long
  0 siblings, 2 replies; 24+ messages in thread
From: Marc Zyngier @ 2026-01-11 10:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Waiman Long, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel

On Sun, 11 Jan 2026 09:39:07 +0000,
Thomas Gleixner <tglx@kernel.org> wrote:
>
> On Fri, Jan 09 2026 at 16:13, Marc Zyngier wrote:
> > On Thu, 08 Jan 2026 22:11:33 +0000,
> > Thomas Gleixner <tglx@kernel.org> wrote:
> >> At the point where a CPU is brought up, the topology should be known
> >> already, which means this can be allocated on the control CPU _before_
> >> the new CPU comes up, no?
> >
> > No. Each CPU finds *itself* in the forest of redistributors, and from
> > there tries to find whether it has some shared resource with a CPU
> > that has booted before it. That's because firmware is absolutely awful
> > and can't present a consistent view of the system.
>
> Groan....
>
> > Anyway, I expect it could be solved by moving this part of the init to
> > an ONLINE HP callback.
>
> Which needs to be before CPUHP_AP_IRQ_AFFINITY_ONLINE, but even that
> might be too late because there are callbacks in the STARTING section,
> i.e. timer, perf, which might rely on interrupts being accessible.

Nah. This stuff is only for direct injection of vLPIs into guests, so
as long as this is done before we can schedule a vcpu on this physical
CPU, we're good. No physical interrupt is concerned with this code.

> Also that patch seems to be incomplete because there is another
> allocation further down in allocate_vpe_l1_table()....

Yeah, I wondered why page allocation wasn't affected by this issue,
but didn't try to find out.

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table()
  2026-01-11 10:38         ` Marc Zyngier
@ 2026-01-11 16:20           ` Thomas Gleixner
  2026-01-12 11:20             ` Marc Zyngier
  2026-01-11 23:02           ` Waiman Long
  1 sibling, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2026-01-11 16:20 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Waiman Long, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel

On Sun, Jan 11 2026 at 10:38, Marc Zyngier wrote:
> On Sun, 11 Jan 2026 09:39:07 +0000,
> Thomas Gleixner <tglx@kernel.org> wrote:
>>
>> On Fri, Jan 09 2026 at 16:13, Marc Zyngier wrote:
>> > On Thu, 08 Jan 2026 22:11:33 +0000,
>> > Thomas Gleixner <tglx@kernel.org> wrote:
>> >> At the point where a CPU is brought up, the topology should be known
>> >> already, which means this can be allocated on the control CPU _before_
>> >> the new CPU comes up, no?
>> >
>> > No. Each CPU finds *itself* in the forest of redistributors, and from
>> > there tries to find whether it has some shared resource with a CPU
>> > that has booted before it. That's because firmware is absolutely awful
>> > and can't present a consistent view of the system.
>>
>> Groan....
>>
>> > Anyway, I expect it could be solved by moving this part of the init to
>> > an ONLINE HP callback.
>>
>> Which needs to be before CPUHP_AP_IRQ_AFFINITY_ONLINE, but even that
>> might be too late because there are callbacks in the STARTING section,
>> i.e. timer, perf, which might rely on interrupts being accessible.
>
> Nah. This stuff is only for direct injection of vLPIs into guests, so
> as long as this is done before we can schedule a vcpu on this physical
> CPU, we're good. No physical interrupt is concerned with this code.

That's fine then. vCPUs are considered "user-space" tasks and can't be
scheduled before CPUHP_AP_ACTIVE sets the CPU active for the scheduler.

Thanks,

        tglx

* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table()
  2026-01-11 16:20           ` Thomas Gleixner
@ 2026-01-12 11:20             ` Marc Zyngier
  2026-01-12 14:08               ` Sebastian Andrzej Siewior
  2026-01-21  8:38               ` Marc Zyngier
  0 siblings, 2 replies; 24+ messages in thread
From: Marc Zyngier @ 2026-01-12 11:20 UTC (permalink / raw)
  To: Waiman Long, Thomas Gleixner
  Cc: Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	linux-arm-kernel, linux-kernel, linux-rt-devel

On Sun, 11 Jan 2026 16:20:45 +0000,
Thomas Gleixner <tglx@kernel.org> wrote:
>
> On Sun, Jan 11 2026 at 10:38, Marc Zyngier wrote:
> > On Sun, 11 Jan 2026 09:39:07 +0000,
> > Thomas Gleixner <tglx@kernel.org> wrote:
> >>
> >> On Fri, Jan 09 2026 at 16:13, Marc Zyngier wrote:
> >> > On Thu, 08 Jan 2026 22:11:33 +0000,
> >> > Thomas Gleixner <tglx@kernel.org> wrote:
> >> >> At the point where a CPU is brought up, the topology should be known
> >> >> already, which means this can be allocated on the control CPU _before_
> >> >> the new CPU comes up, no?
> >> >
> >> > No. Each CPU finds *itself* in the forest of redistributors, and from
> >> > there tries to find whether it has some shared resource with a CPU
> >> > that has booted before it. That's because firmware is absolutely awful
> >> > and can't present a consistent view of the system.
> >>
> >> Groan....
> >>
> >> > Anyway, I expect it could be solved by moving this part of the init to
> >> > an ONLINE HP callback.
> >>
> >> Which needs to be before CPUHP_AP_IRQ_AFFINITY_ONLINE, but even that
> >> might be too late because there are callbacks in the STARTING section,
> >> i.e. timer, perf, which might rely on interrupts being accessible.
> >
> > Nah. This stuff is only for direct injection of vLPIs into guests, so
> > as long as this is done before we can schedule a vcpu on this physical
> > CPU, we're good. No physical interrupt is concerned with this code.
>
> That's fine then. vCPUs are considered "user-space" tasks and can't be
> scheduled before CPUHP_AP_ACTIVE sets the CPU active for the scheduler.

Waiman, can you please give the following hack a go on your box? The
machines I have are thankfully limited to a single ITS group, so I
can't directly reproduce your issue.

Thanks,

	M.

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index ada585bfa4517..20967000f2348 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -2896,7 +2896,7 @@ static bool allocate_vpe_l2_table(int cpu, u32 id)
 	return true;
 }
 
-static int allocate_vpe_l1_table(void)
+static int allocate_vpe_l1_table(unsigned int cpu)
 {
 	void __iomem *vlpi_base = gic_data_rdist_vlpi_base();
 	u64 val, gpsz, npg, pa;
@@ -3012,10 +3012,11 @@ static int allocate_vpe_l1_table(void)
 
 out:
 	gicr_write_vpropbaser(val, vlpi_base + GICR_VPROPBASER);
-	cpumask_set_cpu(smp_processor_id(), gic_data_rdist()->vpe_table_mask);
+	cpumask_set_cpu(cpu, gic_data_rdist()->vpe_table_mask);
+	dsb(sy);
 
 	pr_debug("CPU%d: VPROPBASER = %llx %*pbl\n",
-		 smp_processor_id(), val,
+		 cpu, val,
 		 cpumask_pr_args(gic_data_rdist()->vpe_table_mask));
 
 	return 0;
@@ -3264,15 +3265,9 @@ static void its_cpu_init_lpis(void)
 		val = its_clear_vpend_valid(vlpi_base, 0, 0);
 	}
 
-	if (allocate_vpe_l1_table()) {
-		/*
-		 * If the allocation has failed, we're in massive trouble.
-		 * Disable direct injection, and pray that no VM was
-		 * already running...
-		 */
-		gic_rdists->has_rvpeid = false;
-		gic_rdists->has_vlpis = false;
-	}
+	if (smp_processor_id() == 0)
+		cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "irqchip/arm/gicv3:vpe",
+				  allocate_vpe_l1_table, NULL);
 
 	/* Make sure the GIC has seen the above */
 	dsb(sy);

-- 
Without deviation from the norm, progress is not possible.

* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table()
  2026-01-12 11:20             ` Marc Zyngier
@ 2026-01-12 14:08               ` Sebastian Andrzej Siewior
  2026-01-12 14:38                 ` Marc Zyngier
  0 siblings, 1 reply; 24+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-01-12 14:08 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Waiman Long, Thomas Gleixner, Clark Williams, Steven Rostedt,
	linux-arm-kernel, linux-kernel, linux-rt-devel

On 2026-01-12 11:20:07 [+0000], Marc Zyngier wrote:
> On Sun, 11 Jan 2026 16:20:45 +0000,
> Thomas Gleixner <tglx@kernel.org> wrote:
> >
> > On Sun, Jan 11 2026 at 10:38, Marc Zyngier wrote:
> > > On Sun, 11 Jan 2026 09:39:07 +0000,
> > > Thomas Gleixner <tglx@kernel.org> wrote:
> > >>
> > >> On Fri, Jan 09 2026 at 16:13, Marc Zyngier wrote:
> > >> > On Thu, 08 Jan 2026 22:11:33 +0000,
> > >> > Thomas Gleixner <tglx@kernel.org> wrote:
> > >> >> At the point where a CPU is brought up, the topology should be known
> > >> >> already, which means this can be allocated on the control CPU _before_
> > >> >> the new CPU comes up, no?
> > >> >
> > >> > No. Each CPU finds *itself* in the forest of redistributors, and from
> > >> > there tries to find whether it has some shared resource with a CPU
> > >> > that has booted before it. That's because firmware is absolutely awful
> > >> > and can't present a consistent view of the system.
> > >>
> > >> Groan....
> > >>
> > >> > Anyway, I expect it could be solved by moving this part of the init to
> > >> > an ONLINE HP callback.
> > >>
> > >> Which needs to be before CPUHP_AP_IRQ_AFFINITY_ONLINE, but even that
> > >> might be too late because there are callbacks in the STARTING section,
> > >> i.e. timer, perf, which might rely on interrupts being accessible.
> > >
> > > Nah. This stuff is only for direct injection of vLPIs into guests, so
> > > as long as this is done before we can schedule a vcpu on this physical
> > > CPU, we're good. No physical interrupt is concerned with this code.
> >
> > That's fine then. vCPUs are considered "user-space" tasks and can't be
> > scheduled before CPUHP_AP_ACTIVE sets the CPU active for the scheduler.
>
> Waiman, can you please give the following hack a go on your box? The
> machines I have are thankfully limited to a single ITS group, so I
> can't directly reproduce your issue.
>
> Thanks,
>
> 	M.
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index ada585bfa4517..20967000f2348 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -2896,7 +2896,7 @@ static bool allocate_vpe_l2_table(int cpu, u32 id)
>  	return true;
>  }
>  
> -static int allocate_vpe_l1_table(void)
> +static int allocate_vpe_l1_table(unsigned int cpu)
>  {
>  	void __iomem *vlpi_base = gic_data_rdist_vlpi_base();
>  	u64 val, gpsz, npg, pa;
> @@ -3012,10 +3012,11 @@ static int allocate_vpe_l1_table(void)
>  
>  out:
>  	gicr_write_vpropbaser(val, vlpi_base + GICR_VPROPBASER);
> -	cpumask_set_cpu(smp_processor_id(), gic_data_rdist()->vpe_table_mask);
> +	cpumask_set_cpu(cpu, gic_data_rdist()->vpe_table_mask);
> +	dsb(sy);
>  
>  	pr_debug("CPU%d: VPROPBASER = %llx %*pbl\n",
> -		 smp_processor_id(), val,
> +		 cpu, val,
>  		 cpumask_pr_args(gic_data_rdist()->vpe_table_mask));
>  
>  	return 0;
> @@ -3264,15 +3265,9 @@ static void its_cpu_init_lpis(void)
>  		val = its_clear_vpend_valid(vlpi_base, 0, 0);
>  	}
>  
> -	if (allocate_vpe_l1_table()) {
> -		/*
> -		 * If the allocation has failed, we're in massive trouble.
> -		 * Disable direct injection, and pray that no VM was
> -		 * already running...
> -		 */
> -		gic_rdists->has_rvpeid = false;
> -		gic_rdists->has_vlpis = false;
> -	}
> +	if (smp_processor_id() == 0)
> +		cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "irqchip/arm/gicv3:vpe",
> +				  allocate_vpe_l1_table, NULL);

If you move it to the online state then you could also
s/GFP_ATOMIC/GFP_KERNEL.

Also previously you checked the error code and set has_rvpeid, has_vlpis
on failure. Now you should do the same in case of a failure during
registration.
This also happens on CPU hotplug and I don't see how you avoid a
second allocation. But I also don't understand why this registration
happens on CPU0. It might be just a test patch…

>  
>  	/* Make sure the GIC has seen the above */
>  	dsb(sy);

Sebastian

* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table()
  2026-01-12 14:08               ` Sebastian Andrzej Siewior
@ 2026-01-12 14:38                 ` Marc Zyngier
  0 siblings, 0 replies; 24+ messages in thread
From: Marc Zyngier @ 2026-01-12 14:38 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Waiman Long, Thomas Gleixner, Clark Williams, Steven Rostedt,
	linux-arm-kernel, linux-kernel, linux-rt-devel

On Mon, 12 Jan 2026 14:08:37 +0000,
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>
> On 2026-01-12 11:20:07 [+0000], Marc Zyngier wrote:
> > On Sun, 11 Jan 2026 16:20:45 +0000,
> > Thomas Gleixner <tglx@kernel.org> wrote:
> > >
> > > On Sun, Jan 11 2026 at 10:38, Marc Zyngier wrote:
> > > > On Sun, 11 Jan 2026 09:39:07 +0000,
> > > > Thomas Gleixner <tglx@kernel.org> wrote:
> > > >>
> > > >> On Fri, Jan 09 2026 at 16:13, Marc Zyngier wrote:
> > > >> > On Thu, 08 Jan 2026 22:11:33 +0000,
> > > >> > Thomas Gleixner <tglx@kernel.org> wrote:
> > > >> >> At the point where a CPU is brought up, the topology should be known
> > > >> >> already, which means this can be allocated on the control CPU _before_
> > > >> >> the new CPU comes up, no?
> > > >> >
> > > >> > No. Each CPU finds *itself* in the forest of redistributors, and from
> > > >> > there tries to find whether it has some shared resource with a CPU
> > > >> > that has booted before it. That's because firmware is absolutely awful
> > > >> > and can't present a consistent view of the system.
> > > >>
> > > >> Groan....
> > > >>
> > > >> > Anyway, I expect it could be solved by moving this part of the init to
> > > >> > an ONLINE HP callback.
> > > >>
> > > >> Which needs to be before CPUHP_AP_IRQ_AFFINITY_ONLINE, but even that
> > > >> might be too late because there are callbacks in the STARTING section,
> > > >> i.e. timer, perf, which might rely on interrupts being accessible.
> > > >
> > > > Nah. This stuff is only for direct injection of vLPIs into guests, so
> > > > as long as this is done before we can schedule a vcpu on this physical
> > > > CPU, we're good. No physical interrupt is concerned with this code.
> > >
> > > That's fine then. vCPUs are considered "user-space" tasks and can't be
> > > scheduled before CPUHP_AP_ACTIVE sets the CPU active for the scheduler.
> >
> > Waiman, can you please give the following hack a go on your box? The
> > machines I have are thankfully limited to a single ITS group, so I
> > can't directly reproduce your issue.
> >
> > Thanks,
> >
> > 	M.
> >
> > diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> > index ada585bfa4517..20967000f2348 100644
> > --- a/drivers/irqchip/irq-gic-v3-its.c
> > +++ b/drivers/irqchip/irq-gic-v3-its.c
> > @@ -2896,7 +2896,7 @@ static bool allocate_vpe_l2_table(int cpu, u32 id)
> >  	return true;
> >  }
> >  
> > -static int allocate_vpe_l1_table(void)
> > +static int allocate_vpe_l1_table(unsigned int cpu)
> >  {
> >  	void __iomem *vlpi_base = gic_data_rdist_vlpi_base();
> >  	u64 val, gpsz, npg, pa;
> > @@ -3012,10 +3012,11 @@ static int allocate_vpe_l1_table(void)
> >  
> >  out:
> >  	gicr_write_vpropbaser(val, vlpi_base + GICR_VPROPBASER);
> > -	cpumask_set_cpu(smp_processor_id(), gic_data_rdist()->vpe_table_mask);
> > +	cpumask_set_cpu(cpu, gic_data_rdist()->vpe_table_mask);
> > +	dsb(sy);
> >  
> >  	pr_debug("CPU%d: VPROPBASER = %llx %*pbl\n",
> > -		 smp_processor_id(), val,
> > +		 cpu, val,
> >  		 cpumask_pr_args(gic_data_rdist()->vpe_table_mask));
> >  
> >  	return 0;
> > @@ -3264,15 +3265,9 @@ static void its_cpu_init_lpis(void)
> >  		val = its_clear_vpend_valid(vlpi_base, 0, 0);
> >  	}
> >  
> > -	if (allocate_vpe_l1_table()) {
> > -		/*
> > -		 * If the allocation has failed, we're in massive trouble.
> > -		 * Disable direct injection, and pray that no VM was
> > -		 * already running...
> > -		 */
> > -		gic_rdists->has_rvpeid = false;
> > -		gic_rdists->has_vlpis = false;
> > -	}
> > +	if (smp_processor_id() == 0)
> > +		cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "irqchip/arm/gicv3:vpe",
> > +				  allocate_vpe_l1_table, NULL);
>
> If you move it to the online state then you could also
> s/GFP_ATOMIC/GFP_KERNEL.
>
> Also previously you checked the error code and set has_rvpeid, has_vlpis
> on failure. Now you should do the same in case of a failure during
> registration.
> This also happens on CPU hotplug and I don't see how you avoid a
> second allocation. But I also don't understand why this registration
> happens on CPU0. It might be just a test patch…

It's just a test hack. There are way more things that would need to
change in order to cope with moving this to CPUHP, but I want
confirmation that this indeed solves the original issue before I
start breaking more things.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() 2026-01-12 11:20 ` Marc Zyngier 2026-01-12 14:08 ` Sebastian Andrzej Siewior @ 2026-01-21 8:38 ` Marc Zyngier 2026-01-21 16:48 ` Waiman Long 2026-01-21 20:41 ` Waiman Long 1 sibling, 2 replies; 24+ messages in thread From: Marc Zyngier @ 2026-01-21 8:38 UTC (permalink / raw) To: Waiman Long, Thomas Gleixner Cc: Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel On Mon, 12 Jan 2026 11:20:07 +0000, Marc Zyngier <maz@kernel.org> wrote: > > On Sun, 11 Jan 2026 16:20:45 +0000, > Thomas Gleixner <tglx@kernel.org> wrote: > > > > On Sun, Jan 11 2026 at 10:38, Marc Zyngier wrote: > > > On Sun, 11 Jan 2026 09:39:07 +0000, > > > Thomas Gleixner <tglx@kernel.org> wrote: > > >> > > >> On Fri, Jan 09 2026 at 16:13, Marc Zyngier wrote: > > >> > On Thu, 08 Jan 2026 22:11:33 +0000, > > >> > Thomas Gleixner <tglx@kernel.org> wrote: > > >> >> At the point where a CPU is brought up, the topology should be known > > >> >> already, which means this can be allocated on the control CPU _before_ > > >> >> the new CPU comes up, no? > > >> > > > >> > No. Each CPU finds *itself* in the forest of redistributors, and from > > >> > there tries to find whether it has some shared resource with a CPU > > >> > that has booted before it. That's because firmware is absolutely awful > > >> > and can't present a consistent view of the system. > > >> > > >> Groan.... > > >> > > >> > Anyway, I expect it could be solved by moving this part of the init to > > >> > an ONLINE HP callback. > > >> > > >> Which needs to be before CPUHP_AP_IRQ_AFFINITY_ONLINE, but even that > > >> might be to late because there are callbacks in the STARTING section, > > >> i.e. timer, perf, which might rely on interrupts being accessible. > > > > > > Nah. 
This stuff is only for direct injection of vLPIs into guests, so > > > as long as this is done before we can schedule a vcpu on this physical > > > CPU, we're good. No physical interrupt is concerned with this code. > > > > That's fine then. vCPUs are considered "user-space" tasks and can't be > > scheduled before CPUHP_AP_ACTIVE sets the CPU active for the scheduler. > > Waiman, can you please give the following hack a go on your box? The > machines I have are thankfully limited to a single ITS group, so I > can't directly reproduce your issue. Have you managed to try this hack? I may be able to spend some time addressing the issue in the next cycle if I have an indication that I'm on the right track. Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() 2026-01-21 8:38 ` Marc Zyngier @ 2026-01-21 16:48 ` Waiman Long 2026-01-21 20:41 ` Waiman Long 1 sibling, 0 replies; 24+ messages in thread From: Waiman Long @ 2026-01-21 16:48 UTC (permalink / raw) To: Marc Zyngier, Thomas Gleixner Cc: Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel On 1/21/26 3:38 AM, Marc Zyngier wrote: > On Mon, 12 Jan 2026 11:20:07 +0000, > Marc Zyngier <maz@kernel.org> wrote: >> On Sun, 11 Jan 2026 16:20:45 +0000, >> Thomas Gleixner <tglx@kernel.org> wrote: >>> On Sun, Jan 11 2026 at 10:38, Marc Zyngier wrote: >>>> On Sun, 11 Jan 2026 09:39:07 +0000, >>>> Thomas Gleixner <tglx@kernel.org> wrote: >>>>> On Fri, Jan 09 2026 at 16:13, Marc Zyngier wrote: >>>>>> On Thu, 08 Jan 2026 22:11:33 +0000, >>>>>> Thomas Gleixner <tglx@kernel.org> wrote: >>>>>>> At the point where a CPU is brought up, the topology should be known >>>>>>> already, which means this can be allocated on the control CPU _before_ >>>>>>> the new CPU comes up, no? >>>>>> No. Each CPU finds *itself* in the forest of redistributors, and from >>>>>> there tries to find whether it has some shared resource with a CPU >>>>>> that has booted before it. That's because firmware is absolutely awful >>>>>> and can't present a consistent view of the system. >>>>> Groan.... >>>>> >>>>>> Anyway, I expect it could be solved by moving this part of the init to >>>>>> an ONLINE HP callback. >>>>> Which needs to be before CPUHP_AP_IRQ_AFFINITY_ONLINE, but even that >>>>> might be to late because there are callbacks in the STARTING section, >>>>> i.e. timer, perf, which might rely on interrupts being accessible. >>>> Nah. This stuff is only for direct injection of vLPIs into guests, so >>>> as long as this is done before we can schedule a vcpu on this physical >>>> CPU, we're good. No physical interrupt is concerned with this code. >>> That's fine then. 
vCPUs are considered "user-space" tasks and can't be >>> scheduled before CPUHP_AP_ACTIVE sets the CPU active for the scheduler. >> Waiman, can you please give the following hack a go on your box? The >> machines I have are thankfully limited to a single ITS group, so I >> can't directly reproduce your issue. > Have you managed to try this hack? I may be able to spend some time > addressing the issue in the next cycle if I have an indication that > I'm on the right track. I am sorry that I was busy working on other stuff. Will try out the hack today and report back ASAP. Cheers, Longman ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() 2026-01-21 8:38 ` Marc Zyngier 2026-01-21 16:48 ` Waiman Long @ 2026-01-21 20:41 ` Waiman Long [not found] ` <70dbf293-dd5b-4d77-b653-8f8c09129723@redhat.com> 1 sibling, 1 reply; 24+ messages in thread From: Waiman Long @ 2026-01-21 20:41 UTC (permalink / raw) To: Marc Zyngier, Thomas Gleixner Cc: Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel On 1/21/26 3:38 AM, Marc Zyngier wrote: > On Mon, 12 Jan 2026 11:20:07 +0000, > Marc Zyngier <maz@kernel.org> wrote: >> On Sun, 11 Jan 2026 16:20:45 +0000, >> Thomas Gleixner <tglx@kernel.org> wrote: >>> On Sun, Jan 11 2026 at 10:38, Marc Zyngier wrote: >>>> On Sun, 11 Jan 2026 09:39:07 +0000, >>>> Thomas Gleixner <tglx@kernel.org> wrote: >>>>> On Fri, Jan 09 2026 at 16:13, Marc Zyngier wrote: >>>>>> On Thu, 08 Jan 2026 22:11:33 +0000, >>>>>> Thomas Gleixner <tglx@kernel.org> wrote: >>>>>>> At the point where a CPU is brought up, the topology should be known >>>>>>> already, which means this can be allocated on the control CPU _before_ >>>>>>> the new CPU comes up, no? >>>>>> No. Each CPU finds *itself* in the forest of redistributors, and from >>>>>> there tries to find whether it has some shared resource with a CPU >>>>>> that has booted before it. That's because firmware is absolutely awful >>>>>> and can't present a consistent view of the system. >>>>> Groan.... >>>>> >>>>>> Anyway, I expect it could be solved by moving this part of the init to >>>>>> an ONLINE HP callback. >>>>> Which needs to be before CPUHP_AP_IRQ_AFFINITY_ONLINE, but even that >>>>> might be to late because there are callbacks in the STARTING section, >>>>> i.e. timer, perf, which might rely on interrupts being accessible. >>>> Nah. This stuff is only for direct injection of vLPIs into guests, so >>>> as long as this is done before we can schedule a vcpu on this physical >>>> CPU, we're good. 
No physical interrupt is concerned with this code. >>> That's fine then. vCPUs are considered "user-space" tasks and can't be >>> scheduled before CPUHP_AP_ACTIVE sets the CPU active for the scheduler. >> Waiman, can you please give the following hack a go on your box? The >> machines I have are thankfully limited to a single ITS group, so I >> can't directly reproduce your issue. > Have you managed to try this hack? I may be able to spend some time > addressing the issue in the next cycle if I have an indication that > I'm on the right track. Yes, I have tried out your hack patch and the 2-socket Grace test system booted up without producing any bug report for an RT debug kernel. I will try out your official patch once it comes out. So moving the memory allocation to a later part of the hotplug bringup pipeline where sleeping is allowed should work. Cheers, Longman ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <70dbf293-dd5b-4d77-b653-8f8c09129723@redhat.com>]
* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() [not found] ` <70dbf293-dd5b-4d77-b653-8f8c09129723@redhat.com> @ 2026-03-09 19:06 ` Waiman Long 2026-03-10 8:12 ` Marc Zyngier 0 siblings, 1 reply; 24+ messages in thread From: Waiman Long @ 2026-03-09 19:06 UTC (permalink / raw) To: Marc Zyngier, Thomas Gleixner Cc: Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel On 1/21/26 10:49 PM, Waiman Long wrote: > > On 1/21/26 3:41 PM, Waiman Long wrote: >> >>>> Waiman, can you please give the following hack a go on your box? The >>>> machines I have are thankfully limited to a single ITS group, so I >>>> can't directly reproduce your issue. >>> Have you managed to try this hack? I may be able to spend some time >>> addressing the issue in the next cycle if I have an indication that >>> I'm on the right track. >> >> Yes, I have tried out your hack patch and the 2-socket Grace test >> system booted up without producing any bug report for a RT debug >> kernel. I will try out your official patch once it come out. So >> moving the memory allocation to a later part of the hotplug bringup >> pipeline where sleeping is allowed should work. > > Attaching the dmesg log for your further investigation. Ping, Are you planning to send out an official patch soon? Thanks, Longman ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() 2026-03-09 19:06 ` Waiman Long @ 2026-03-10 8:12 ` Marc Zyngier 0 siblings, 0 replies; 24+ messages in thread From: Marc Zyngier @ 2026-03-10 8:12 UTC (permalink / raw) To: Waiman Long Cc: Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel On Mon, 09 Mar 2026 19:06:04 +0000, Waiman Long <longman@redhat.com> wrote: > > On 1/21/26 10:49 PM, Waiman Long wrote: > > > > On 1/21/26 3:41 PM, Waiman Long wrote: > >> > >>>> Waiman, can you please give the following hack a go on your box? The > >>>> machines I have are thankfully limited to a single ITS group, so I > >>>> can't directly reproduce your issue. > >>> Have you managed to try this hack? I may be able to spend some time > >>> addressing the issue in the next cycle if I have an indication that > >>> I'm on the right track. > >> > >> Yes, I have tried out your hack patch and the 2-socket Grace test > >> system booted up without producing any bug report for a RT debug > >> kernel. I will try out your official patch once it come out. So > >> moving the memory allocation to a later part of the hotplug bringup > >> pipeline where sleeping is allowed should work. > > > > Attaching the dmesg log for your further investigation. > > Ping, > > Are you planning to send out an official patch soon? Soon? On a geological scale, certainly. On a more practical scale, when I get time, which hasn't happened so far in this cycle ($WORK gets, unsurprisingly, in the way of solving problems I don't have). If that's not soon enough, feel free to expand the hack I posted to include all boot-time tables. Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() 2026-01-11 10:38 ` Marc Zyngier 2026-01-11 16:20 ` Thomas Gleixner @ 2026-01-11 23:02 ` Waiman Long 2026-01-12 15:09 ` Thomas Gleixner 1 sibling, 1 reply; 24+ messages in thread From: Waiman Long @ 2026-01-11 23:02 UTC (permalink / raw) To: Marc Zyngier, Thomas Gleixner Cc: Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel On 1/11/26 5:38 AM, Marc Zyngier wrote: >> Also that patch seems to be incomplete because there is another >> allocation further down in allocate_vpe_l1_table().... > Yeah, I wondered why page allocation wasn't affected by this issue, > but didn't try to find out. The use of GFP_ATOMIC flag in the page allocation request may help it to dip into the reserved area and avoid taking any spinlock. In my own test, just removing the kzalloc() call is enough to avoid any invalid context warning. In the page allocation code, there is a zone lock and a per_cpu_pages lock. They were not acquired in my particular test case, though further investigation may be needed to make sure it is really safe. Cheers, Longman ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() 2026-01-11 23:02 ` Waiman Long @ 2026-01-12 15:09 ` Thomas Gleixner 2026-01-12 17:14 ` Waiman Long 0 siblings, 1 reply; 24+ messages in thread From: Thomas Gleixner @ 2026-01-12 15:09 UTC (permalink / raw) To: Waiman Long, Marc Zyngier Cc: Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel On Sun, Jan 11 2026 at 18:02, Waiman Long wrote: > On 1/11/26 5:38 AM, Marc Zyngier wrote: >>> Also that patch seems to be incomplete because there is another >>> allocation further down in allocate_vpe_l1_table().... >> Yeah, I wondered why page allocation wasn't affected by this issue, >> but didn't try to find out. > > The use of GFP_ATOMIC flag in the page allocation request may help it to > dip into the reserved area and avoid taking any spinlock. In my own > test, just removing the kzalloc() call is enough to avoid any invalid > context warning. In the page allocation code, there is a zone lock and a > per_cpu_pages lock. They were not acquired in my particular test case, > though further investigation may be needed to make sure it is really safe. They might be acquired though. Only alloc_pages_nolock() guarantees that no lock is taken IIRC. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() 2026-01-12 15:09 ` Thomas Gleixner @ 2026-01-12 17:14 ` Waiman Long 2026-01-13 11:55 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 24+ messages in thread From: Waiman Long @ 2026-01-12 17:14 UTC (permalink / raw) To: Thomas Gleixner, Waiman Long, Marc Zyngier Cc: Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel On 1/12/26 10:09 AM, Thomas Gleixner wrote: > On Sun, Jan 11 2026 at 18:02, Waiman Long wrote: >> On 1/11/26 5:38 AM, Marc Zyngier wrote: >>>> Also that patch seems to be incomplete because there is another >>>> allocation further down in allocate_vpe_l1_table().... >>> Yeah, I wondered why page allocation wasn't affected by this issue, >>> but didn't try to find out. >> The use of GFP_ATOMIC flag in the page allocation request may help it to >> dip into the reserved area and avoid taking any spinlock. In my own >> test, just removing the kzalloc() call is enough to avoid any invalid >> context warning. In the page allocation code, there is a zone lock and a >> per_cpu_pages lock. They were not acquired in my particular test case, >> though further investigation may be needed to make sure it is really safe. > They might be acquired though. Only alloc_pages_nolock() guarantees that > no lock is taken IIRC. Thanks for the suggestion. I will look into using that for page allocation. I had actually attempted to use kmalloc_nolock() to replace kzalloc() initially. Even though it removed the call to rmqueue(), there were other spinlocks in the slub code that were still being acquired, like the local_lock() or the spinlock in the get_random() code. So I gave up using that. Anyway, kmalloc_nolock() doesn't seem to be fully working yet. Cheers, Longman ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() 2026-01-12 17:14 ` Waiman Long @ 2026-01-13 11:55 ` Sebastian Andrzej Siewior 2026-01-13 23:25 ` Alexei Starovoitov 2026-01-14 17:59 ` Vlastimil Babka 0 siblings, 2 replies; 24+ messages in thread From: Sebastian Andrzej Siewior @ 2026-01-13 11:55 UTC (permalink / raw) To: Waiman Long Cc: Thomas Gleixner, Marc Zyngier, Clark Williams, Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel, Vlastimil Babka, Alexei Starovoitov On 2026-01-12 12:14:30 [-0500], Waiman Long wrote: > On 1/12/26 10:09 AM, Thomas Gleixner wrote: > > They might be acquired though. Only alloc_pages_nolock() guarantees that > > no lock is taken IIRC. > > Thanks for the suggestion. I will look into using that for page allocation. > I had actually attempt to use kmalloc_nolock() to replace kzalloc() > initially. Even though it removed the call to rmqueue(), but there were > other spinlocks in the slub code that were still being acquired like the > local_lock() or the spinlock in the get_random() code. So I gave up using > that. Anyway, kmalloc_nolock() doesn't seem to be fully working yet. with kmalloc_nolock() you have to be able to deal with a NULL pointer. Looking at kmalloc_nolock(), it has this (in_nmi() || in_hardirq()) check on PREEMPT_RT. The reasoning was unconditional raw_spinlock_t locking and bad lock-owner recording for hardirq. There was a trylock path for local_lock to make it work from atomic context. But from what I can tell this goes kmalloc_nolock_noprof() -> __slab_alloc_node() -> __slab_alloc() -> ___slab_alloc() -> local_lock_cpu_slab() The last one does local_lock_irqsave() on PREEMPT_RT which does a spin_lock(). That means atomic context is not possible. Where did I make a wrong turn? Or did this change recently? I do remember that Alexei reworked parts of the allocator to make the local_lock based trylock allocation work. 
> Cheers, > Longman Sebastian ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() 2026-01-13 11:55 ` Sebastian Andrzej Siewior @ 2026-01-13 23:25 ` Alexei Starovoitov 2026-01-14 16:01 ` Sebastian Andrzej Siewior 2026-01-14 17:59 ` Vlastimil Babka 1 sibling, 1 reply; 24+ messages in thread From: Alexei Starovoitov @ 2026-01-13 23:25 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Waiman Long, Thomas Gleixner, Marc Zyngier, Clark Williams, Steven Rostedt, linux-arm-kernel, LKML, linux-rt-devel, Vlastimil Babka, Alexei Starovoitov On Tue, Jan 13, 2026 at 3:55 AM Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > > On 2026-01-12 12:14:30 [-0500], Waiman Long wrote: > > On 1/12/26 10:09 AM, Thomas Gleixner wrote: > > > They might be acquired though. Only alloc_pages_nolock() guarantees that > > > no lock is taken IIRC. > > > > Thanks for the suggestion. I will look into using that for page allocation. > > I had actually attempt to use kmalloc_nolock() to replace kzalloc() > > initially. Even though it removed the call to rmqueue(), but there were > > other spinlocks in the slub code that were still being acquired like the > > local_lock() or the spinlock in the get_random() code. So I gave up using > > that. Anyway, kmalloc_nolock() doesn't seem to be fully working yet. > > with kmalloc_nolock() you have to be able to deal with a NULL pointer. > Looking at kmalloc_nolock(), it has this (in_nmi() || in_hardirq()) > check on PREEMPT_RT. The reasoning was unconditional raw_spinlock_t > locking and bad lock-owner recording for hardirq. > There was a trylock path for local_lock to make it work from atomic > context. But from what I can tell this goes > kmalloc_nolock_noprof() -> __slab_alloc_node() -> __slab_alloc() -> > ___slab_alloc() -> local_lock_cpu_slab() > > The last one does local_lock_irqsave() on PREEMPT_RT which does a > spin_lock(). That means atomic context is not possible. Where did I make > a wrong turn? Or did this change recently? 
I do remember that Alexei > reworked parts of the allocator to make the local_lock based trylock > allocation work. Are you forgetting about local_lock_is_locked() in __slab_alloc() ? With sheaves the whole thing will be very different. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() 2026-01-13 23:25 ` Alexei Starovoitov @ 2026-01-14 16:01 ` Sebastian Andrzej Siewior 0 siblings, 0 replies; 24+ messages in thread From: Sebastian Andrzej Siewior @ 2026-01-14 16:01 UTC (permalink / raw) To: Alexei Starovoitov Cc: Waiman Long, Thomas Gleixner, Marc Zyngier, Clark Williams, Steven Rostedt, linux-arm-kernel, LKML, linux-rt-devel, Vlastimil Babka, Alexei Starovoitov On 2026-01-13 15:25:26 [-0800], Alexei Starovoitov wrote: > On Tue, Jan 13, 2026 at 3:55 AM Sebastian Andrzej Siewior > <bigeasy@linutronix.de> wrote: > > The last one does local_lock_irqsave() on PREEMPT_RT which does a > > spin_lock(). That means atomic context is not possible. Where did I make > > a wrong turn? Or did this change recently? I do remember that Alexei > > reworked parts of the allocator to make the local_lock based trylock > > allocation work. > > Are you forgetting about local_lock_is_locked() in __slab_alloc() ? Yeah but this just checks it. Further down the road there is local_lock_cpu_slab() for the allocation and there is no try-lock on RT. > With sheaves the whole thing will be very different. Yes. Sebastian ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() 2026-01-13 11:55 ` Sebastian Andrzej Siewior 2026-01-13 23:25 ` Alexei Starovoitov @ 2026-01-14 17:59 ` Vlastimil Babka 2026-01-21 16:37 ` Waiman Long 1 sibling, 1 reply; 24+ messages in thread From: Vlastimil Babka @ 2026-01-14 17:59 UTC (permalink / raw) To: Sebastian Andrzej Siewior, Waiman Long Cc: Thomas Gleixner, Marc Zyngier, Clark Williams, Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel, Alexei Starovoitov On 1/13/26 12:55, Sebastian Andrzej Siewior wrote: > On 2026-01-12 12:14:30 [-0500], Waiman Long wrote: >> On 1/12/26 10:09 AM, Thomas Gleixner wrote: >> > They might be acquired though. Only alloc_pages_nolock() guarantees that >> > no lock is taken IIRC. >> >> Thanks for the suggestion. I will look into using that for page allocation. >> I had actually attempt to use kmalloc_nolock() to replace kzalloc() >> initially. Even though it removed the call to rmqueue(), but there were >> other spinlocks in the slub code that were still being acquired like the >> local_lock() or the spinlock in the get_random() code. So I gave up using Hmm if get_random() code takes a spinlock, we have an unsolved incompatibility with kmalloc_nolock() and CONFIG_SLAB_FREELIST_RANDOM. >> that. Anyway, kmalloc_nolock() doesn't seem to be fully working yet. > > with kmalloc_nolock() you have to be able to deal with a NULL pointer. Yes. So even after we fix the current problems with incompatible context, I think kmalloc_nolock() would still be a bad fit for hw bringup code that should not really fail. Because the possibility of failure will always exist. The BPF use case that motivated it is quite different. > Looking at kmalloc_nolock(), it has this (in_nmi() || in_hardirq()) > check on PREEMPT_RT. The reasoning was unconditional raw_spinlock_t > locking and bad lock-owner recording for hardirq. > There was a trylock path for local_lock to make it work from atomic > context. 
But from what I can tell this goes > kmalloc_nolock_noprof() -> __slab_alloc_node() -> __slab_alloc() -> > ___slab_alloc() -> local_lock_cpu_slab() > > The last one does local_lock_irqsave() on PREEMPT_RT which does a > spin_lock(). That means atomic context is not possible. Where did I make > a wrong turn? Or did this change recently? I do remember that Alexei > reworked parts of the allocator to make the local_lock based trylock > allocation work. > >> Cheers, >> Longman > > Sebastian ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() 2026-01-14 17:59 ` Vlastimil Babka @ 2026-01-21 16:37 ` Waiman Long 0 siblings, 0 replies; 24+ messages in thread From: Waiman Long @ 2026-01-21 16:37 UTC (permalink / raw) To: Vlastimil Babka, Sebastian Andrzej Siewior, Waiman Long Cc: Thomas Gleixner, Marc Zyngier, Clark Williams, Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel, Alexei Starovoitov On 1/14/26 12:59 PM, Vlastimil Babka wrote: > On 1/13/26 12:55, Sebastian Andrzej Siewior wrote: >> On 2026-01-12 12:14:30 [-0500], Waiman Long wrote: >>> On 1/12/26 10:09 AM, Thomas Gleixner wrote: >>>> They might be acquired though. Only alloc_pages_nolock() guarantees that >>>> no lock is taken IIRC. >>> Thanks for the suggestion. I will look into using that for page allocation. >>> I had actually attempt to use kmalloc_nolock() to replace kzalloc() >>> initially. Even though it removed the call to rmqueue(), but there were >>> other spinlocks in the slub code that were still being acquired like the >>> local_lock() or the spinlock in the get_random() code. So I gave up using > Hmm if get_random() code takes a spinlock, we have an unsolved > incompatibility with kmalloc_nolock() and CONFIG_SLAB_FREELIST_RANDOM. > >>> that. Anyway, kmalloc_nolock() doesn't seem to be fully working yet. >> with kmalloc_nolock() you have to be able to deal with a NULL pointer. > Yes. So even after we fix the current problems with incompatible context, I > think kmalloc_nolock() would still be a bad fit for hw bringup code that > should not really fail. Because the possibility of failure will always > exist. The BPF use case that motivated it is quite different. Yes, it is an issue too that kmalloc_nolock() may fail. If that happens, we don't have another good alternative. Cheers, Longman ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() 2026-01-08 8:26 ` Marc Zyngier 2026-01-08 22:11 ` Thomas Gleixner @ 2026-01-10 21:47 ` Waiman Long 1 sibling, 0 replies; 24+ messages in thread From: Waiman Long @ 2026-01-10 21:47 UTC (permalink / raw) To: Marc Zyngier Cc: Thomas Gleixner, Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt, linux-arm-kernel, linux-kernel, linux-rt-devel On 1/8/26 3:26 AM, Marc Zyngier wrote: > On Wed, 07 Jan 2026 21:53:53 +0000, > Waiman Long <longman@redhat.com> wrote: >> When running a PREEMPT_RT debug kernel on a 2-socket Grace arm64 system, >> the following bug report was produced at bootup time. >> >> BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 >> in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/72 >> preempt_count: 1, expected: 0 >> RCU nest depth: 1, expected: 1 >> : >> CPU: 72 UID: 0 PID: 0 Comm: swapper/72 Tainted: G W 6.19.0-rc4-test+ #4 PREEMPT_{RT,(full)} >> Tainted: [W]=WARN >> Call trace: >> : >> rt_spin_lock+0xe4/0x408 >> rmqueue_bulk+0x48/0x1de8 >> __rmqueue_pcplist+0x410/0x650 >> rmqueue.constprop.0+0x6a8/0x2b50 >> get_page_from_freelist+0x3c0/0xe68 >> __alloc_frozen_pages_noprof+0x1dc/0x348 >> alloc_pages_mpol+0xe4/0x2f8 >> alloc_frozen_pages_noprof+0x124/0x190 >> allocate_slab+0x2f0/0x438 >> new_slab+0x4c/0x80 >> ___slab_alloc+0x410/0x798 >> __slab_alloc.constprop.0+0x88/0x1e0 >> __kmalloc_cache_noprof+0x2dc/0x4b0 >> allocate_vpe_l1_table+0x114/0x788 >> its_cpu_init_lpis+0x344/0x790 >> its_cpu_init+0x60/0x220 >> gic_starting_cpu+0x64/0xe8 >> cpuhp_invoke_callback+0x438/0x6d8 >> __cpuhp_invoke_callback_range+0xd8/0x1f8 >> notify_cpu_starting+0x11c/0x178 >> secondary_start_kernel+0xc8/0x188 >> __secondary_switched+0xc0/0xc8 >> >> This is due to the fact that allocate_vpe_l1_table() will call >> kzalloc() to allocate a cpumask_t when the first CPU of the >> second node of the 72-cpu Grace system is being 
called from the >> CPUHP_AP_MIPS_GIC_TIMER_STARTING state inside the starting section of > Surely *not* that particular state. My mistake, it should be CPUHP_AP_IRQ_GIC_STARTING. There are three static gic_starting_cpu() functions that confuse me. >> the CPU hotplug bringup pipeline where interrupt is disabled. This is an >> atomic context where sleeping is not allowed and acquiring a sleeping >> rt_spin_lock within kzalloc() may lead to system hang in case there is >> a lock contention. >> >> To work around this issue, a static buffer is used for cpumask >> allocation when running a PREEMPT_RT kernel via the newly introduced >> vpe_alloc_cpumask() helper. The static buffer is currently set to be >> 4 kbytes in size. As only one cpumask is needed per node, the current >> size should be big enough as long as (cpumask_size() * nr_node_ids) >> is not bigger than 4k. > What role does the node play here? The GIC topology has nothing to do > with NUMA. It may be true on your particular toy, but that's > definitely not true architecturally. You could, at worse, end-up with > one such cpumask per *CPU*. That'd be a braindead system, but this > code is written to support the architecture, not any particular > implementation. > It is just what I have observed on the hardware that I used for reproducing the problem. I agree that it may be different in other arm64 CPUs. 
>> Signed-off-by: Waiman Long <longman@redhat.com> >> --- >> drivers/irqchip/irq-gic-v3-its.c | 26 +++++++++++++++++++++++++- >> 1 file changed, 25 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c >> index ada585bfa451..9185785524dc 100644 >> --- a/drivers/irqchip/irq-gic-v3-its.c >> +++ b/drivers/irqchip/irq-gic-v3-its.c >> @@ -2896,6 +2896,30 @@ static bool allocate_vpe_l2_table(int cpu, u32 id) >> return true; >> } >> >> +static void *vpe_alloc_cpumask(void) >> +{ >> + /* >> + * With PREEMPT_RT kernel, we can't call any k*alloc() APIs as they >> + * may acquire a sleeping rt_spin_lock in an atomic context. So use >> + * a pre-allocated buffer instead. >> + */ >> + if (IS_ENABLED(CONFIG_PREEMPT_RT)) { >> + static unsigned long mask_buf[512]; >> + static atomic_t alloc_idx; >> + int idx, mask_size = cpumask_size(); >> + int nr_cpumasks = sizeof(mask_buf)/mask_size; >> + >> + /* >> + * Fetch an allocation index and if it points to a buffer within >> + * mask_buf[], return that. Fall back to kzalloc() otherwise. >> + */ >> + idx = atomic_fetch_inc(&alloc_idx); >> + if (idx < nr_cpumasks) >> + return &mask_buf[idx * mask_size/sizeof(long)]; >> + } > Err, no. That's horrible. I can see three ways to address this in a > more appealing way: > > - you give RT a generic allocator that works for (small) atomic > allocations. I appreciate that's not easy, and even probably > contrary to the RT goals. But I'm also pretty sure that the GIC code > is not the only pile of crap being caught doing that. > > - you pre-compute upfront how many cpumasks you are going to require, > based on the actual GIC topology. You do that on CPU0, outside of > the hotplug constraints, and allocate what you need. This is > difficult as you need to ensure the RD<->CPU matching without the > CPUs having booted, which means wading through the DT/ACPI gunk to > try and guess what you have. 
> > - you delay the allocation of L1 tables to a context where you can > perform allocations, and before we have a chance of running a guest > on this CPU. That's probably the simplest option (though dealing > with late onlining while guests are already running could be > interesting...). > > But I'm always going to say no to something that is a poor hack and > ultimately falling back to the same broken behaviour. Thanks for the suggestion. I will try the first alternative of a more generic memory allocator. Cheers, Longman > > Thanks, > > M. > ^ permalink raw reply [flat|nested] 24+ messages in thread
Thread overview: 24+ messages
2026-01-07 21:53 [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table() Waiman Long
2026-01-08 8:26 ` Marc Zyngier
2026-01-08 22:11 ` Thomas Gleixner
2026-01-09 16:13 ` Marc Zyngier
2026-01-11 9:39 ` Thomas Gleixner
2026-01-11 10:38 ` Marc Zyngier
2026-01-11 16:20 ` Thomas Gleixner
2026-01-12 11:20 ` Marc Zyngier
2026-01-12 14:08 ` Sebastian Andrzej Siewior
2026-01-12 14:38 ` Marc Zyngier
2026-01-21 8:38 ` Marc Zyngier
2026-01-21 16:48 ` Waiman Long
2026-01-21 20:41 ` Waiman Long
[not found] ` <70dbf293-dd5b-4d77-b653-8f8c09129723@redhat.com>
2026-03-09 19:06 ` Waiman Long
2026-03-10 8:12 ` Marc Zyngier
2026-01-11 23:02 ` Waiman Long
2026-01-12 15:09 ` Thomas Gleixner
2026-01-12 17:14 ` Waiman Long
2026-01-13 11:55 ` Sebastian Andrzej Siewior
2026-01-13 23:25 ` Alexei Starovoitov
2026-01-14 16:01 ` Sebastian Andrzej Siewior
2026-01-14 17:59 ` Vlastimil Babka
2026-01-21 16:37 ` Waiman Long
2026-01-10 21:47 ` Waiman Long