From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
 willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
 ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
 Rik van Riel, Rik van Riel
Subject: [RFC PATCH 37/45] mm/slub: kvmalloc - add __GFP_NORETRY to large-kmalloc attempt
Date: Thu, 30 Apr 2026 16:21:06 -0400
Message-ID: <20260430202233.111010-38-riel@surriel.com>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260430202233.111010-1-riel@surriel.com>
References: <20260430202233.111010-1-riel@surriel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Rik van Riel

kvmalloc's contract is "try contiguous physical memory first; fall back
to vmalloc on failure." For size > PAGE_SIZE, kmalloc_gfp_adjust already
strips __GFP_DIRECT_RECLAIM and adds __GFP_NOWARN to make the kmalloc
attempt non-disruptive.
But the page allocator's atomic-allocation retry chain in
get_page_from_freelist (no __GFP_DIRECT_RECLAIM path) progressively
relaxes ALLOC_NOFRAGMENT, first adding ALLOC_NOFRAG_TAINTED_OK and then
dropping ALLOC_NOFRAGMENT entirely, because atomic allocations have no
slowpath escape and need every chance to succeed.

For kvmalloc-large, this is wrong: there IS a slowpath escape (the
vmalloc fallback). Tainting a previously-clean superpageblock (SPB) to
satisfy the kmalloc attempt costs more than letting it fail and calling
vmalloc: the SPB stays tainted for the rest of the workload's lifetime,
blocking 1 GiB hugepage allocation from that region.

Add __GFP_NORETRY in the same conditional that strips
__GFP_DIRECT_RECLAIM. The page allocator's NORETRY-skip exit
(mm/page_alloc.c) treats this as the documented "caller has a fallback"
signal and returns NULL immediately instead of relaxing
ALLOC_NOFRAGMENT. kvmalloc then runs its existing vmalloc fallback as
designed.

kvmalloc's documented contract already disallows callers passing
__GFP_NORETRY directly (see the comment block above
__kvmalloc_node_noprof), so adding it internally cannot surprise any
existing caller.

Observed on a 247 GB devvm running the page-superblock v18 series: a
`below` process reading a /proc/sys file via kvmalloc(buf, GFP_USER)
tainted a fresh clean SPB at boot+~47 min via __kmalloc_large_node →
alloc_pages_mpol. A tls-cert-validator did the same a minute later.
Both were "best effort" allocations with vmalloc as their existing
fallback; they should not have been tainting clean SPBs.

Signed-off-by: Rik van Riel
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
 mm/slub.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 2b2d33cc735c..fa422d245a53 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -6703,13 +6703,24 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
 	 * However make sure that larger requests are not too disruptive - i.e.
 	 * do not direct reclaim unless physically continuous memory is preferred
 	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to
-	 * start working in the background
+	 * start working in the background.
+	 *
+	 * Also signal __GFP_NORETRY: the vmalloc fallback IS our retry path,
+	 * so the page allocator should not go to extreme lengths (e.g.
+	 * tainting a previously-clean superpageblock from the page-superblock
+	 * series) just to satisfy the kmalloc attempt. The atomic-allocation
+	 * relaxation logic in get_page_from_freelist treats __GFP_NORETRY as
+	 * "caller has a fallback" and returns NULL early instead of dropping
+	 * ALLOC_NOFRAGMENT. kvmalloc's documented contract already disallows
+	 * callers passing __GFP_NORETRY directly, so adding it here is safe.
 	 */
 	if (size > PAGE_SIZE) {
 		flags |= __GFP_NOWARN;
 
-		if (!(flags & __GFP_RETRY_MAYFAIL))
+		if (!(flags & __GFP_RETRY_MAYFAIL)) {
 			flags &= ~__GFP_DIRECT_RECLAIM;
+			flags |= __GFP_NORETRY;
+		}
 
 		/* nofail semantic is implemented by the vmalloc fallback */
 		flags &= ~__GFP_NOFAIL;
-- 
2.52.0
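
For readers who want to poke at the flag logic outside the kernel, the
adjustment in the hunk above can be modelled in userspace. This is only a
sketch under stated assumptions: the DEMO_* bit values, DEMO_PAGE_SIZE,
demo_gfp_t, demo_kmalloc_gfp_adjust() and demo_self_check() are
hypothetical names invented for illustration. They are not the kernel's
definitions, and the model says nothing about how the page allocator
reacts to the flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real __GFP_* bits are defined by the kernel. */
typedef unsigned int demo_gfp_t;

#define DEMO_GFP_DIRECT_RECLAIM (1u << 0)
#define DEMO_GFP_KSWAPD_RECLAIM (1u << 1)
#define DEMO_GFP_NOWARN         (1u << 2)
#define DEMO_GFP_RETRY_MAYFAIL  (1u << 3)
#define DEMO_GFP_NOFAIL         (1u << 4)
#define DEMO_GFP_NORETRY        (1u << 5)
#define DEMO_PAGE_SIZE          ((size_t)4096) /* assumed page size */

/* Mirrors the patched conditional: only requests larger than a page change. */
static demo_gfp_t demo_kmalloc_gfp_adjust(demo_gfp_t flags, size_t size)
{
	if (size > DEMO_PAGE_SIZE) {
		flags |= DEMO_GFP_NOWARN;

		if (!(flags & DEMO_GFP_RETRY_MAYFAIL)) {
			/* No direct reclaim, and tell the allocator to give up early. */
			flags &= ~DEMO_GFP_DIRECT_RECLAIM;
			flags |= DEMO_GFP_NORETRY;
		}

		/* In this model the fallback path provides the nofail semantic. */
		flags &= ~DEMO_GFP_NOFAIL;
	}

	return flags;
}

/* Exercises the three cases the commit message distinguishes. */
static int demo_self_check(void)
{
	demo_gfp_t in = DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM |
			DEMO_GFP_NOFAIL;
	demo_gfp_t out;

	/* A request of at most one page is passed through untouched. */
	assert(demo_kmalloc_gfp_adjust(in, DEMO_PAGE_SIZE) == in);

	/* Large request: direct reclaim and nofail dropped, NORETRY added. */
	out = demo_kmalloc_gfp_adjust(in, 2 * DEMO_PAGE_SIZE);
	assert(out == (DEMO_GFP_KSWAPD_RECLAIM | DEMO_GFP_NOWARN |
		       DEMO_GFP_NORETRY));

	/* RETRY_MAYFAIL callers keep direct reclaim and do not get NORETRY. */
	out = demo_kmalloc_gfp_adjust(in | DEMO_GFP_RETRY_MAYFAIL,
				      2 * DEMO_PAGE_SIZE);
	assert(out & DEMO_GFP_DIRECT_RECLAIM);
	assert(!(out & DEMO_GFP_NORETRY));

	return 0;
}
```

Calling demo_self_check() from a small main() is enough to run the three
cases. The kswapd bit is left alone in every case, matching the comment
in the hunk that background reclaim is still kicked.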