Date: Mon, 25 May 2026 12:02:57 +0200
Subject: Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
To: "JP Kobryn (Meta)", akpm@linux-foundation.org, surenb@google.com,
 mhocko@suse.com, jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
 linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, kernel-team@meta.com
References: <20260519200851.141955-1-jp.kobryn@linux.dev>
From: "Vlastimil Babka (SUSE)"
In-Reply-To: <20260519200851.141955-1-jp.kobryn@linux.dev>

On 5/19/26 22:08, JP Kobryn (Meta) wrote:
> compact_gap() returns 2 << order, which is used as watermark headroom in
> __compaction_suitable() and as a reclaim target in kswapd. The computed
> value scales exponentially by order. For order-9 THP allocations this
> evaluates to 1024 pages, but the compaction free scanner's working set is
> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating free
> pages once it matches the migration batch. The current gap over-reserves by
> 32x.
>
> On fragmented production hosts, kswapd will try and reclaim up to the gap,
> but it only reaches that threshold 18% of the time, causing reclaim to
> continue a majority of the time.

But doesn't that mean there's genuine memory pressure? We're effectively
raising the high watermark by 4 MB, but if processes are continuously
allocating, we'd be reclaiming without the gap as well? Unless the workload
is sized to fit without the gap.

> The over-sized gap also causes 46% of
> order-9 compaction suitability checks to fail unnecessarily - the zone has
> sufficient free pages for the scanner to operate, but not enough to clear
> the inflated threshold.
>
> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
> with the scanner's actual capacity. Orders 0-4 are unaffected since their
> gap is <= 32.
>
> A/B test on ~100 instagram production hosts (64GB, 60s measurement):

What was the base kernel version?

> Unpatched (43 hosts)
>   pgscan_kswapd (mean/host): ~1.6M
>   reclaim efficiency (steal/scan): 83.8%
>   compaction success (success/stall): 2.1%
>   THP success (alloc/alloc+fallback): 4.9%
>   forced lru_add_drain (mean/host): ~107K
>
> Patched (59 hosts)
>   pgscan_kswapd (mean/host): ~449K

Did the extra reclaim just disappear because we allow the allocations to use
4 MB more memory? Or did it shift to direct reclaim?

>   reclaim efficiency (steal/scan): 91.0%
>   compaction success (success/stall): 28.3%

Is this compaction success per compaction stall or per alloc stall?

>   THP success (alloc/alloc+fallback): 17.2%

Weird that things would improve that much. I would expect the free memory
to just stabilize around the lower gap and then behave similarly. Are we
missing something here?

>   forced lru_add_drain (mean/host): ~64K
>
> Signed-off-by: JP Kobryn (Meta)
> ---
>  include/linux/compaction.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 173d9c07a8952..09aea63b8a89d 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -2,6 +2,8 @@
>  #ifndef _LINUX_COMPACTION_H
>  #define _LINUX_COMPACTION_H
>
> +#include
> +
>  /*
>   * Determines how hard direct compaction should try to succeed.
>   * Lower value means higher priority, analogically to reclaim priority.
> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned int order)
>  	 * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>  	 * that the migrate scanner can have isolated on migrate list, and free
>  	 * scanner is only invoked when the number of isolated free pages is
> -	 * lower than that. But it's not worth to complicate the formula here
> -	 * as a bigger gap for higher orders than strictly necessary can also
> -	 * improve chances of compaction success.
> +	 * lower than that.
>  	 */
> -	return 2UL << order;
> +	return min(2UL << order, COMPACT_CLUSTER_MAX);

Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?

>  }
>
>  static inline int current_is_kcompactd(void)
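
For reference, the arithmetic behind the numbers in this thread can be
sketched in userspace C. This is only an illustration, not kernel code:
COMPACT_CLUSTER_MAX = 32 is taken from the patch description, the 4 KiB base
page size is an assumption, and gap_uncapped(), gap_capped() and
print_gap_table() are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed values: 32 comes from the patch description; a 4 KiB base
 * page is an assumption used only for the size column. */
#define COMPACT_CLUSTER_MAX 32UL
#define PAGE_SIZE_KIB 4UL

/* Hypothetical helper: the current formula quoted in the patch. */
unsigned long gap_uncapped(unsigned int order)
{
	/* Guard against shifting past the width of unsigned long. */
	assert(order < 8 * sizeof(unsigned long) - 1);
	return 2UL << order;
}

/* Hypothetical helper: the same formula clamped to an arbitrary cap,
 * e.g. COMPACT_CLUSTER_MAX (the patch) or 2 * COMPACT_CLUSTER_MAX. */
unsigned long gap_capped(unsigned int order, unsigned long cap)
{
	unsigned long gap = gap_uncapped(order);

	return gap < cap ? gap : cap;
}

/* Print the gap in pages for orders 0..9 under each variant. */
void print_gap_table(void)
{
	unsigned int order;

	printf("order  uncapped  cap=32  cap=64  uncapped KiB\n");
	for (order = 0; order <= 9; order++)
		printf("%5u  %8lu  %6lu  %6lu  %12lu\n",
		       order,
		       gap_uncapped(order),
		       gap_capped(order, COMPACT_CLUSTER_MAX),
		       gap_capped(order, 2 * COMPACT_CLUSTER_MAX),
		       gap_uncapped(order) * PAGE_SIZE_KIB);
}
```

Calling print_gap_table() from a main() prints one row per order. At order 9
the uncapped gap is 1024 pages (4096 KiB at the assumed page size), against
32 pages with the cap from the patch and 64 pages with a 2x cap.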