From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
	willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
	ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
	Rik van Riel, Rik van Riel
Subject: [RFC PATCH 03/45] mm: page_alloc: use trylock for PCP lock in free path to avoid lock inversion
Date: Thu, 30 Apr 2026 16:20:32 -0400
Message-ID: <20260430202233.111010-4-riel@surriel.com>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260430202233.111010-1-riel@surriel.com>
References: <20260430202233.111010-1-riel@surriel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

From: Rik van Riel

The per-cpu pageblock buddy allocator changed __free_frozen_pages() and
free_unref_folios() to use a blocking spin_lock_irqsave() for the PCP
lock when in_task(), rather than mainline's unconditional trylock via
pcp_spin_trylock().

This breaks a mainline invariant: the allocation path in
rmqueue_pcplist() acquires pcp->lock via pcp_spin_trylock(), which on
SMP does preempt_disable() + spin_trylock() without disabling IRQs.
This means the alloc path holds pcp->lock with interrupts enabled.

The resulting ABBA deadlock scenario:

  CPU0 (alloc path):
    pcp_spin_trylock() acquires pcp->lock (IRQs ON)
    -> hardirq fires while lock is held
    -> IRQ handler takes xa_lock (e.g. __folio_end_writeback -> xa_lock)

  CPU1 (free path):
    xa_lock held (e.g. slab -> stack_depot_free)
    -> __free_frozen_pages()
    -> spin_lock_irqsave(&pcp->lock) BLOCKS
    -> waits for CPU0

CPU0 cannot release pcp->lock because it is stuck in hardirq waiting
for xa_lock held by CPU1. Deadlock.

The key insight is that pcp_trylock_prepare() is a no-op on SMP, so
pcp_spin_trylock() does not save/restore IRQs. Any lock taken in
hardirq context that is also held across __free_frozen_pages() creates
this ABBA potential.

Fix by always using spin_trylock_irqsave() for the PCP lock, falling
back to free_one_page() (zone buddy) when the trylock fails. This
restores the mainline invariant of never blocking on PCP lock
acquisition in the free path.

Signed-off-by: Rik van Riel
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
 mm/page_alloc.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0aa39fa2f61..d98eab3e288e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3262,13 +3262,15 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
 
 	cache_cpu = raw_smp_processor_id();
 	pcp = per_cpu_ptr(zone->per_cpu_pageset, cache_cpu);
-	if (unlikely(fpi_flags & FPI_TRYLOCK) || !in_task()) {
-		if (!spin_trylock_irqsave(&pcp->lock, UP_flags)) {
-			free_one_page(zone, page, pfn, order, fpi_flags);
-			return;
-		}
-	} else {
-		spin_lock_irqsave(&pcp->lock, UP_flags);
+	/*
+	 * Always use trylock: callers may hold locks (e.g. xa_lock via
+	 * slab/stack_depot) that are also taken in hardirq context, and
+	 * pcp->lock is acquired with IRQs enabled on the allocation side.
+	 * A blocking lock here would create an ABBA deadlock potential.
+	 */
+	if (!spin_trylock_irqsave(&pcp->lock, UP_flags)) {
+		free_one_page(zone, page, pfn, order, fpi_flags);
+		return;
 	}
 
 	if (unlikely(pcp->flags & PCPF_CPU_DEAD)) {
-- 
2.52.0
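
For readers who want to see the trylock-with-fallback shape outside the kernel, here is a minimal userspace sketch. It is not kernel code and not part of the patch: pthread mutexes stand in for pcp->lock and the zone lock, and every name in it (struct page_cache, struct zone_pool, pool_free_one(), cache_free_one()) is hypothetical. It only illustrates the control flow of never blocking on the cache lock and diverting to the shared pool when the trylock fails. It does not model IRQ state or hardirq context.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these names are kernel APIs. */
struct page_cache {
	pthread_mutex_t lock;	/* plays the role of pcp->lock */
	int count;		/* pages currently held in the cache */
};

struct zone_pool {
	pthread_mutex_t lock;	/* plays the role of the zone lock */
	int count;		/* pages currently held in the shared pool */
};

/* Fallback path, standing in for free_one_page(): return the page to the pool. */
static void pool_free_one(struct zone_pool *zone)
{
	pthread_mutex_lock(&zone->lock);
	zone->count++;
	pthread_mutex_unlock(&zone->lock);
}

/*
 * Free one page without ever blocking on cache->lock. If the trylock
 * fails, the page bypasses the cache and goes to the shared pool.
 * Returns true if the page landed in the cache, false if it was diverted.
 */
static bool cache_free_one(struct page_cache *cache, struct zone_pool *zone)
{
	assert(cache != NULL && zone != NULL);

	if (pthread_mutex_trylock(&cache->lock) != 0) {
		pool_free_one(zone);
		return false;
	}

	cache->count++;
	pthread_mutex_unlock(&cache->lock);
	return true;
}
```

In this sketch a caller that already holds some other lock can call cache_free_one() without risking a wait on the cache lock. The cost of a failed trylock is only that the page skips the cache.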