From: Marc Zyngier <maz@kernel.org>
To: Waiman Long <longman@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
Clark Williams <clrkwllms@kernel.org>,
Steven Rostedt <rostedt@goodmis.org>,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-rt-devel@lists.linux.dev
Subject: Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table()
Date: Thu, 08 Jan 2026 08:26:13 +0000
Message-ID: <864iowmrx6.wl-maz@kernel.org>
In-Reply-To: <20260107215353.75612-1-longman@redhat.com>
On Wed, 07 Jan 2026 21:53:53 +0000,
Waiman Long <longman@redhat.com> wrote:
>
> When running a PREEMPT_RT debug kernel on a 2-socket Grace arm64 system,
> the following bug report was produced at bootup time.
>
> BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
> in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/72
> preempt_count: 1, expected: 0
> RCU nest depth: 1, expected: 1
> :
> CPU: 72 UID: 0 PID: 0 Comm: swapper/72 Tainted: G W 6.19.0-rc4-test+ #4 PREEMPT_{RT,(full)}
> Tainted: [W]=WARN
> Call trace:
> :
> rt_spin_lock+0xe4/0x408
> rmqueue_bulk+0x48/0x1de8
> __rmqueue_pcplist+0x410/0x650
> rmqueue.constprop.0+0x6a8/0x2b50
> get_page_from_freelist+0x3c0/0xe68
> __alloc_frozen_pages_noprof+0x1dc/0x348
> alloc_pages_mpol+0xe4/0x2f8
> alloc_frozen_pages_noprof+0x124/0x190
> allocate_slab+0x2f0/0x438
> new_slab+0x4c/0x80
> ___slab_alloc+0x410/0x798
> __slab_alloc.constprop.0+0x88/0x1e0
> __kmalloc_cache_noprof+0x2dc/0x4b0
> allocate_vpe_l1_table+0x114/0x788
> its_cpu_init_lpis+0x344/0x790
> its_cpu_init+0x60/0x220
> gic_starting_cpu+0x64/0xe8
> cpuhp_invoke_callback+0x438/0x6d8
> __cpuhp_invoke_callback_range+0xd8/0x1f8
> notify_cpu_starting+0x11c/0x178
> secondary_start_kernel+0xc8/0x188
> __secondary_switched+0xc0/0xc8
>
> This is due to the fact that allocate_vpe_l1_table() will call
> kzalloc() to allocate a cpumask_t when the first CPU of the
> second node of the 72-cpu Grace system is being called from the
> CPUHP_AP_MIPS_GIC_TIMER_STARTING state inside the starting section of
Surely *not* that particular state.
> the CPU hotplug bringup pipeline, where interrupts are disabled. This is
> an atomic context where sleeping is not allowed, and acquiring a sleeping
> rt_spin_lock within kzalloc() may lead to a system hang if there is
> lock contention.
>
> To work around this issue, a static buffer is used for cpumask
> allocation when running a PREEMPT_RT kernel via the newly introduced
> vpe_alloc_cpumask() helper. The static buffer is currently set to be
> 4 kbytes in size. As only one cpumask is needed per node, the current
> size should be big enough as long as (cpumask_size() * nr_node_ids)
> is not bigger than 4k.
What role does the node play here? The GIC topology has nothing to do
with NUMA. It may be true on your particular toy, but that's
definitely not true architecturally. You could, at worst, end up with
one such cpumask per *CPU*. That'd be a braindead system, but this
code is written to support the architecture, not any particular
implementation.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> drivers/irqchip/irq-gic-v3-its.c | 26 +++++++++++++++++++++++++-
> 1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index ada585bfa451..9185785524dc 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -2896,6 +2896,30 @@ static bool allocate_vpe_l2_table(int cpu, u32 id)
> return true;
> }
>
> +static void *vpe_alloc_cpumask(void)
> +{
> + /*
> + * With PREEMPT_RT kernel, we can't call any k*alloc() APIs as they
> + * may acquire a sleeping rt_spin_lock in an atomic context. So use
> + * a pre-allocated buffer instead.
> + */
> + if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> + static unsigned long mask_buf[512];
> + static atomic_t alloc_idx;
> + int idx, mask_size = cpumask_size();
> + int nr_cpumasks = sizeof(mask_buf)/mask_size;
> +
> + /*
> + * Fetch an allocation index and if it points to a buffer within
> + * mask_buf[], return that. Fall back to kzalloc() otherwise.
> + */
> + idx = atomic_fetch_inc(&alloc_idx);
> + if (idx < nr_cpumasks)
> + return &mask_buf[idx * mask_size/sizeof(long)];
> + }
Err, no. That's horrible. I can see three ways to address this in a
more appealing way:
- you give RT a generic allocator that works for (small) atomic
allocations. I appreciate that's not easy, and even probably
contrary to the RT goals. But I'm also pretty sure that the GIC code
is not the only pile of crap being caught doing that.
- you pre-compute upfront how many cpumasks you are going to require,
based on the actual GIC topology. You do that on CPU0, outside of
the hotplug constraints, and allocate what you need. This is
difficult as you need to ensure the RD<->CPU matching without the
CPUs having booted, which means wading through the DT/ACPI gunk to
try and guess what you have.
- you delay the allocation of L1 tables to a context where you can
perform allocations, and before we have a chance of running a guest
on this CPU. That's probably the simplest option (though dealing
with late onlining while guests are already running could be
interesting...).
But I'm always going to say no to something that is a poor hack and
that ultimately falls back to the same broken behaviour.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.