From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 08 Jan 2026 08:26:13 +0000
Message-ID: <864iowmrx6.wl-maz@kernel.org>
From: Marc Zyngier <maz@kernel.org>
To: Waiman Long <longman@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Clark Williams <clrkwllms@kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org,
	linux-rt-devel@lists.linux.dev
Subject: Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table()
In-Reply-To: <20260107215353.75612-1-longman@redhat.com>
References: <20260107215353.75612-1-longman@redhat.com>
Content-Type: text/plain; charset=US-ASCII

On Wed, 07 Jan 2026 21:53:53 +0000,
Waiman Long <longman@redhat.com> wrote:
>
> When running a PREEMPT_RT debug kernel on a 2-socket Grace arm64 system,
> the following bug report was produced at bootup time.
>
> BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
> in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/72
> preempt_count: 1, expected: 0
> RCU nest depth: 1, expected: 1
>   :
> CPU: 72 UID: 0 PID: 0 Comm: swapper/72 Tainted: G W 6.19.0-rc4-test+ #4 PREEMPT_{RT,(full)}
> Tainted: [W]=WARN
> Call trace:
>   :
>  rt_spin_lock+0xe4/0x408
>  rmqueue_bulk+0x48/0x1de8
>  __rmqueue_pcplist+0x410/0x650
>  rmqueue.constprop.0+0x6a8/0x2b50
>  get_page_from_freelist+0x3c0/0xe68
>  __alloc_frozen_pages_noprof+0x1dc/0x348
>  alloc_pages_mpol+0xe4/0x2f8
>  alloc_frozen_pages_noprof+0x124/0x190
>  allocate_slab+0x2f0/0x438
>  new_slab+0x4c/0x80
>  ___slab_alloc+0x410/0x798
>  __slab_alloc.constprop.0+0x88/0x1e0
>  __kmalloc_cache_noprof+0x2dc/0x4b0
>  allocate_vpe_l1_table+0x114/0x788
>  its_cpu_init_lpis+0x344/0x790
>  its_cpu_init+0x60/0x220
>  gic_starting_cpu+0x64/0xe8
>  cpuhp_invoke_callback+0x438/0x6d8
>  __cpuhp_invoke_callback_range+0xd8/0x1f8
>  notify_cpu_starting+0x11c/0x178
>  secondary_start_kernel+0xc8/0x188
>  __secondary_switched+0xc0/0xc8
>
> This is due to the fact that allocate_vpe_l1_table() will call
> kzalloc() to allocate a cpumask_t when the first CPU of the
> second node of the 72-cpu Grace system is being called from the
> CPUHP_AP_MIPS_GIC_TIMER_STARTING state inside the starting section of

Surely *not* that particular state.

> the CPU hotplug bringup pipeline where interrupt is disabled. This is an
> atomic context where sleeping is not allowed and acquiring a sleeping
> rt_spin_lock within kzalloc() may lead to system hang in case there is
> a lock contention.
>
> To work around this issue, a static buffer is used for cpumask
> allocation when running a PREEMPT_RT kernel via the newly introduced
> vpe_alloc_cpumask() helper. The static buffer is currently set to be
> 4 kbytes in size.
> As only one cpumask is needed per node, the current
> size should be big enough as long as (cpumask_size() * nr_node_ids)
> is not bigger than 4k.

What role does the node play here? The GIC topology has nothing to do
with NUMA. It may be true on your particular toy, but that's definitely
not true architecturally. You could, at worst, end up with one such
cpumask per *CPU*. That'd be a braindead system, but this code is
written to support the architecture, not any particular implementation.

>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  drivers/irqchip/irq-gic-v3-its.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index ada585bfa451..9185785524dc 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -2896,6 +2896,30 @@ static bool allocate_vpe_l2_table(int cpu, u32 id)
>  	return true;
>  }
>
> +static void *vpe_alloc_cpumask(void)
> +{
> +	/*
> +	 * With PREEMPT_RT kernel, we can't call any k*alloc() APIs as they
> +	 * may acquire a sleeping rt_spin_lock in an atomic context. So use
> +	 * a pre-allocated buffer instead.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> +		static unsigned long mask_buf[512];
> +		static atomic_t alloc_idx;
> +		int idx, mask_size = cpumask_size();
> +		int nr_cpumasks = sizeof(mask_buf)/mask_size;
> +
> +		/*
> +		 * Fetch an allocation index and if it points to a buffer within
> +		 * mask_buf[], return that. Fall back to kzalloc() otherwise.
> +		 */
> +		idx = atomic_fetch_inc(&alloc_idx);
> +		if (idx < nr_cpumasks)
> +			return &mask_buf[idx * mask_size/sizeof(long)];
> +	}

Err, no. That's horrible.

I can see three ways to address this in a more appealing way:

- you give RT a generic allocator that works for (small) atomic
  allocations. I appreciate that's not easy, and even probably contrary
  to the RT goals.
  But I'm also pretty sure that the GIC code is not the only pile of
  crap being caught doing that.

- you pre-compute upfront how many cpumasks you are going to require,
  based on the actual GIC topology. You do that on CPU0, outside of the
  hotplug constraints, and allocate what you need. This is difficult as
  you need to ensure the RD<->CPU matching without the CPUs having
  booted, which means wading through the DT/ACPI gunk to try and guess
  what you have.

- you delay the allocation of L1 tables to a context where you can
  perform allocations, and before we have a chance of running a guest
  on this CPU. That's probably the simplest option (though dealing with
  late onlining while guests are already running could be
  interesting...).

But I'm always going to say no to something that is a poor hack and
ultimately falls back to the same broken behaviour.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.