Date: Tue, 23 Jun 2026 09:10:43 +1000
From: Dave Chinner
To: Matthew Brost
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	Andrew Morton, Dave Chinner, Qi Zheng, Roman Gushchin, Muchun Song,
	David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett",
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Kairui Song, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu
Subject: Re: [PATCH v6 1/2] mm: Introduce opportunistic_compaction concept
	to vmscan and shrinkers
References: <20260617032218.1165929-1-matthew.brost@intel.com>
	<20260617032218.1165929-2-matthew.brost@intel.com>
In-Reply-To: <20260617032218.1165929-2-matthew.brost@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jun 16, 2026 at 08:22:17PM -0700, Matthew Brost wrote:
> High-order allocations using __GFP_NORETRY or __GFP_RETRY_MAYFAIL
> are often opportunistic attempts to satisfy fragmentation-sensitive
> allocations rather than indications of severe memory pressure. In these
> cases, reclaim may invoke shrinkers that aggressively destroy working
> sets even though reclaim is unlikely to materially improve the
> allocation outcome.
>
> Some shrinkers manage expensive backing or migration operations where
> reclaim can result in substantial working set disruption despite the
> system having sufficient free memory overall. This is particularly
> visible in fragmentation-heavy workloads where reclaim repeatedly tears
> down active state while kswapd attempts to satisfy higher-order
> allocations.
>
> Introduce an opportunistic_compaction hint in shrink_control that allows
> kswapd to communicate when reclaim originates from a high-order
> allocation context that may be fragmentation driven rather than true
> memory pressure. Shrinkers may use this hint to avoid destructive
> working set reclaim while still participating normally during order-0
> or stronger reclaim conditions.

To be honest, this seems like another "push a hint through to the XE
shrinker" mechanism under a different name.

You seem so focused on fixing the XE reproducer that the -systemic
problem- that -any- high-order folio demand causes is not being
acknowledged.

e.g. we use high-order folios extensively in the page cache these
days, and there are -many- cases where memory compaction driven by
high-order demand causes significant page cache performance
regressions.

To date, every single person who has wanted to fix the problem they
are seeing has effectively attempted to -turn off compaction- via
GFP flags. I've even done that myself inside XFS to work around
kvmalloc() issues with a lack of GFP_NOFAIL support and doing costly
high order allocations that fail and trigger compaction before
falling back to vmalloc().

However, these issues have since been fixed in the kvmalloc() code,
such that it now does the right thing for most calling contexts
(i.e. tries high-order kmalloc() without triggering compaction, then
falls back to GFP_NOFAIL vmalloc()). This has made kvmalloc() more
performant and better behaved for -all users-, not just XFS.

This is not sustainable - we need compaction to be robust and
performant in the face of high-order folio demands, regardless of
what subsystem is generating the demand.

So with that in mind, let me paraphrase the comment in the second
patch in the Xe shrinker implementation:

"Shrinker reclaim is based on implementation specific object sizes
so it is unlikely to ever achieve contiguous page reclaim in a
manner that will measurably improve compaction rates."

You also say:

> No functional changes are introduced for existing shrinkers.

Consider how many shrinkable caches the general statement above
applies to, and then think about the fundamental impedance mismatch
between the affected shrinkable caches and what this patch actually
fixes.

For example, what happens to slab-based caches if the XE cache is
being excessively reclaimed under high-order page demand? e.g. the
slab-based cache may have tens of objects per page and holds a
system-level performance critical working set of objects. How do
these caches handle the excessive reclaim demand being generated by
compaction thrashing?

Yup, they don't.

In the case of filesystem caches, the "reclaim and repopulate"
pattern you describe causing the XE perf problems causes internal
slab cache fragmentation. Not only does this not improve compaction
rates, it also results in more memory fragmentation because slab
pages get pinned by a small number of long lived objects and they
won't get freed until the cache is largely emptied.

IOWs, things get -even worse- from a memory fragmentation POV when
compaction thrashing causes the working set of a
high-object-count-per-page slab cache to thrash....

This isn't isolated to individual subsystem thrashing. If we run a
file-based workload that generates high-order folio demand and hence
compaction (e.g. copy tens of GB of files between two XFS, ext4 or
btrfs filesystems), that will -also- trash the Xe working set via
the shrinker being hammered by memory compaction trying to free up
contiguous pages for the page cache.
Similarly, if we run a Xe workload that generates sustained high
order folio demand, that will trash the working set in the dentry
and inode caches and any other shrinkable slab-based cache.

Hence the abstracted case of the problem we need to solve is this:
shrinker reclaim based on x-byte objects is extremely unlikely to
achieve contiguous page reclaim in a manner that will measurably
improve compaction rates.

This is a problem that has to be addressed at the high level
infrastructure level, not worked around by individual shrinkers.

IMO, compaction shouldn't trigger shrinkers unless the shrinkers are
specifically flagged as being able to release contiguous pages of
memory in short order. I don't think there are very many shrinkable
caches that even hold a significant quantity of objects larger than
a single page, so it's clearly questionable as to whether compaction
based reclaim should run shrinker reclaim to begin with.

i.e. a subsystem that can track high order folios in a shrinkable
cache should probably have a "->compaction_scan()" method that is
run directly from compaction context to try to free high order
folios. This provides a direct opt-in mechanism for a subsystem, and
it allows subsystems that can track low- and high-order objects
independently to efficiently free objects in a way that will help
improve compaction rates without impacting the entire working set of
objects in the cache.

IOWs, this patch to inform kswapd about its trigger (doesn't it
already have a "reason" parameter, though?) is likely a necessary
part of the solution - we don't want kswapd running shrinkers if it
has been triggered to reclaim pages for compaction. This patch would
allow kswapd to elide normal shrinker passes when it has been woken
purely for compaction purposes.
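
As a rough illustration of the opt-in idea, here is a userspace model
of the dispatch. It is only a sketch, not kernel code: struct
model_shrinker, enum reclaim_reason, run_shrinkers() and the helper
callbacks are made-up names for this example, and only the
"->compaction_scan()" method name comes from the proposal above. It
assumes a cache can count its high-order objects separately from its
small ones.

```c
/*
 * Userspace model only. None of these types or functions are
 * existing kernel interfaces; all names are hypothetical.
 */
#include <assert.h>
#include <stddef.h>

enum reclaim_reason { RECLAIM_MEMORY_PRESSURE, RECLAIM_COMPACTION };

struct model_shrinker {
	const char *name;
	unsigned long nr_small;	/* cached objects smaller than a page */
	unsigned long nr_large;	/* high-order objects, tracked separately */
	/* Normal pass: reclaims from the whole working set. */
	unsigned long (*scan)(struct model_shrinker *s, unsigned long nr);
	/* Opt-in pass: frees high-order objects only; NULL if unsupported. */
	unsigned long (*compaction_scan)(struct model_shrinker *s,
					 unsigned long nr);
};

/* Remove up to @want objects from @count and return how many were taken. */
static unsigned long take(unsigned long *count, unsigned long want)
{
	unsigned long n = want < *count ? want : *count;

	*count -= n;
	return n;
}

/* Example normal pass: small objects first, then large ones. */
static unsigned long generic_scan(struct model_shrinker *s, unsigned long nr)
{
	unsigned long freed = take(&s->nr_small, nr);

	freed += take(&s->nr_large, nr - freed);
	return freed;
}

/* Example opt-in pass: never touches the small-object working set. */
static unsigned long large_only_scan(struct model_shrinker *s,
				     unsigned long nr)
{
	return take(&s->nr_large, nr);
}

/*
 * For compaction-driven reclaim, only shrinkers that opted in are
 * called, and only through ->compaction_scan(). Every other cache is
 * skipped, so its working set is left alone. For ordinary memory
 * pressure, every shrinker gets its normal pass.
 */
static unsigned long run_shrinkers(struct model_shrinker *v, size_t n,
				   enum reclaim_reason reason,
				   unsigned long nr_to_scan)
{
	unsigned long freed = 0;

	for (size_t i = 0; i < n; i++) {
		struct model_shrinker *s = &v[i];

		if (reason == RECLAIM_COMPACTION) {
			if (s->compaction_scan)
				freed += s->compaction_scan(s, nr_to_scan);
			continue;
		}
		assert(s->scan != NULL);
		freed += s->scan(s, nr_to_scan);
	}
	return freed;
}
```

In this model, a dentry-like cache with no ->compaction_scan() is left
untouched by a RECLAIM_COMPACTION pass, while a cache that tracks
high-order objects gives up only those objects. Both still take part
in a RECLAIM_MEMORY_PRESSURE pass as usual.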

Given that the compaction code would be running the high-order
reclaim capable shrinkers itself, this would avoid trashing the
working set of most shrinkable caches -by default- under high order
allocation demand....

-Dave.
-- 
Dave Chinner
dgc@kernel.org