From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <291406b26b8badf2e565996515931d9ebe50208f.camel@linux.intel.com>
Subject: Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()
From: Thomas Hellström
To: Matthew Brost
Cc: "David Hildenbrand (Arm)", intel-xe@lists.freedesktop.org,
 dri-devel@lists.freedesktop.org, Andrew Morton, Lorenzo Stoakes, "Liam R.
 Howlett", Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner
Date: Fri, 24 Apr 2026 09:05:16 +0200
References: <20260423055656.1696379-1-matthew.brost@intel.com>
 <20260423055656.1696379-2-matthew.brost@intel.com>
 <76191a17-18bf-4e9b-9ab5-dc9a48abfabb@kernel.org>
Organization: Intel Sweden AB, Registration Number: 556189-6027
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0

On Thu, 2026-04-23 at 15:21 -0700, Matthew Brost wrote:
> On Thu, Apr 23, 2026 at 12:08:36PM -0700, Matthew Brost wrote:
> > On Thu, Apr 23, 2026 at 01:27:11PM +0200, Thomas Hellström wrote:
> > > On Thu, 2026-04-23 at 12:27 +0200, David Hildenbrand (Arm) wrote:
> > > > On 4/23/26 07:56, Matthew Brost wrote:
> > > > > Introduce zone_appears_fragmented() as a lightweight helper to allow
> > > > > subsystems to make coarse decisions about reclaim behavior in the
> > > > > presence of likely fragmentation.
> > > > >
> > > > > The helper implements a simple heuristic: if the number of free pages
> > > > > in a zone exceeds twice the high watermark, the zone is considered to
> > > > > have ample free memory and allocation failures are more likely due to
> > > > > fragmentation than overall memory pressure.
> > > > >
> > > > > This is intentionally imprecise and is not meant to replace the core
> > > > > MM compaction or fragmentation accounting logic.
> > > > > Instead, it provides a cheap signal for callers (e.g., shrinkers) that
> > > > > wish to avoid overly aggressive reclaim when sufficient free memory
> > > > > exists but high-order allocations may still fail.
> > > > >
> > > > > No functional changes; this is a preparatory helper for future users.
> > > > >
> > > > > Cc: Thomas Hellström
> > > > > Cc: Andrew Morton
> > > > > Cc: David Hildenbrand
> > > > > Cc: Lorenzo Stoakes
> > > > > Cc: "Liam R. Howlett"
> > > > > Cc: Vlastimil Babka
> > > > > Cc: Mike Rapoport
> > > > > Cc: Suren Baghdasaryan
> > > > > Cc: Michal Hocko
> > > > > Cc: linux-mm@kvack.org
> > > > > Cc: linux-kernel@vger.kernel.org
> > > > > Signed-off-by: Matthew Brost
> > > > > ---
> > > > >  include/linux/vmstat.h | 13 +++++++++++++
> > > > >  1 file changed, 13 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > > > > index 3c9c266cf782..568d9f4f1a1f 100644
> > > > > --- a/include/linux/vmstat.h
> > > > > +++ b/include/linux/vmstat.h
> > > > > @@ -483,6 +483,19 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
> > > > >  	return vmstat_text[item];
> > > > >  }
> > > > >
> > > > > +static inline bool zone_appears_fragmented(struct zone *zone)
> > > > > +{
> > > >
> > > > "zone_likely_fragmented" or "zone_maybe_fragmented" might be clearer,
> > > > depending on the actual semantics.
> > > >
> > > > > +	/*
> > > > > +	 * Simple heuristic: if the number of free pages is more than twice the
> > > > > +	 * high watermark, this strongly suggests that the zone is heavily
> > > > > +	 * fragmented when called from a shrinker.
> > > > > +	 */
> > > >
> > > > I'll cc some more people. But the "when called from a shrinker" bit is
> > > > concerning.
> > > > Are there additional semantics that should be expressed in the
> > > > function name, for example?
> > > >
> > > > Something that implies that this function only gives you a reasonable
> > > > answer in a certain context.
> > >
> > > I think that test would not be relevant for cgroup-aware shrinking.
> > >
> > > What about trying to pass something in the struct shrink_control? Like
> > > if we pass the struct scan_control's order field also in struct
> >
> > If the order were included in shrink_control, it is about 95% certain
> > that this change would allow TTM / Xe to break the problematic kswapd
> > feedback loop. This may also better express the intent of the problem
> > we are trying to fix here.
> >
> > For reference, the cover letter [1] details the problem.
> >
> > Any guidance from the core MM folks would be appreciated. Would adding
> > the order to shrink_control be an acceptable solution?
> >
> > Matt
> >
> > [1] https://patchwork.freedesktop.org/series/165330/
> >
> > > shrink_control, really expensive shrinkers could duck reclaim attempts
> > > from higher-order allocations that may fail anyway:
> > >
> > >      if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
> > >           (sc->gfp_mask & (__GFP_NORETRY | __GFP_RETRY_MAYFAIL)) &&
> > >           !(sc->gfp_mask & __GFP_NOFAIL))
>
> It doesn't look like __GFP_NORETRY, __GFP_RETRY_MAYFAIL, __GFP_NOFAIL
> make it to the sc->gfp_mask flags from the caller and get into the
> kswapd loop...

Perhaps that's because they mostly (only?) make sense from direct
reclaim? It looks like the trace is from kswapd.

Another metric to weigh in is perhaps the scan_control::priority field.
From my understanding it is progressively decreased towards 0, with 0
indicating the most urgent shrinking.

Thanks,
Thomas

>
> [  394.049058] xe_shrinker_scan: no skip order=9, gfp=0x0000000000000cc0
> [  394.049061] CPU: 2 UID: 0 PID: 110 Comm: kswapd0 Not tainted 7.0.0-xe+ #355 PREEMPT(full)
> [  394.049062] Hardware name: Intel Corporation Panther Lake Client Platform/PTL-UH LP5 T3 RVP1, BIOS PTLPFWI1.R00.3332.D05.2509011438 09/01/2025
> [  394.049063] Call Trace:
> [  394.049065]
> [  394.049066]  dump_stack_lvl+0x55/0x70
> [  394.049073]  xe_shrinker_scan+0x274/0x280 [xe]
> [  394.049181]  do_shrink_slab+0x132/0x360
> [  394.049184]  shrink_slab+0xf0/0x3e0
> [  394.049186]  shrink_node+0x2bd/0x800
> [  394.049188]  balance_pgdat+0x323/0x760
> [  394.049189]  kswapd+0x1c3/0x340
> [  394.049190]  ? __pfx_autoremove_wake_function+0x10/0x10
> [  394.049193]  ? __pfx_kswapd+0x10/0x10
> [  394.049194]  kthread+0xdf/0x120
> [  394.049196]  ? __pfx_kthread+0x10/0x10
> [  394.049197]  ret_from_fork+0x1d0/0x220
> [  394.049200]  ? __pfx_kthread+0x10/0x10
> [  394.049200]  ret_from_fork_asm+0x1a/0x30
> [  394.049202]
>
> Will look into whether this is fixable, but again any core MM guidance
> would be helpful.
>
> Matt
>
> > >           return SHRINK_STOP;
> > >
> > > Possibly exposed as an inline helper in the shrinker interface?
> > >
> > > /Thomas
> > >
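
To make the two signals discussed above concrete, here is a minimal
userspace sketch, not kernel code and not a proposed patch. It models
(a) the "free pages exceed twice the high watermark" heuristic from the
quoted patch and (b) a shrinker that ducks costly-order reclaim unless
the scan priority has become urgent. Everything prefixed model_ or
MODEL_ is hypothetical. In particular, the order and priority fields in
the control structure are assumptions taken from this thread (struct
shrink_control does not carry them today), and the urgency threshold
passed by the caller is an arbitrary illustrative value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; the values mirror the kernel definitions. */
#define MODEL_COSTLY_ORDER 3      /* PAGE_ALLOC_COSTLY_ORDER */
#define MODEL_SHRINK_STOP  (~0UL) /* SHRINK_STOP */

/* Hypothetical reduction of a zone to the two counters the heuristic reads. */
struct model_zone {
	unsigned long free_pages;
	unsigned long high_wmark;
};

/* Hypothetical shrink_control carrying the extra fields discussed here. */
struct model_shrink_control {
	int order;    /* assumption: allocation order that triggered reclaim */
	int priority; /* assumption: scan priority, 0 is the most urgent */
};

/* Heuristic from the quoted patch: free pages above twice the high watermark. */
bool model_zone_appears_fragmented(const struct model_zone *zone)
{
	assert(zone != NULL);
	return zone->free_pages > 2 * zone->high_wmark;
}

/*
 * Duck reclaim for costly-order requests while the priority is still
 * less urgent than urgent_priority (an arbitrary cut-off chosen by the
 * caller for illustration).
 */
bool model_should_duck(const struct model_shrink_control *sc,
		       int urgent_priority)
{
	assert(sc != NULL);
	return sc->order > MODEL_COSTLY_ORDER &&
	       sc->priority > urgent_priority;
}

/* Scan callback model: report MODEL_SHRINK_STOP when ducking. */
unsigned long model_scan(const struct model_shrink_control *sc,
			 unsigned long nr_to_scan, int urgent_priority)
{
	if (model_should_duck(sc, urgent_priority))
		return MODEL_SHRINK_STOP;
	return nr_to_scan; /* pretend every requested object was freed */
}
```

With an urgency cut-off of 2, an order-9 request at priority 12 is
skipped, while the same request at priority 0 is scanned. Unlike the
gfp_mask test quoted above, this does not depend on
__GFP_NORETRY / __GFP_RETRY_MAYFAIL reaching the shrinker, which the
trace suggests does not happen on the kswapd path.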