Message-ID: <97c2501d-6d11-4a73-adab-dd06b7de54a3@linux.dev>
Date: Tue, 26 May 2026 19:33:52 -0700
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
From: JP Kobryn
Subject: Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
To: Johannes Weiner
Cc: akpm@linux-foundation.org, vbabka@kernel.org, surenb@google.com,
 mhocko@suse.com, jackmanb@google.com, ziy@nvidia.com, linux-mm@kvack.org,
 usama.arif@linux.dev, kirill@shutemov.name, willy@infradead.org,
 linux-kernel@vger.kernel.org, kernel-team@meta.com
References: <20260519012532.272770-1-jp.kobryn@linux.dev>
Content-Language: en-US
In-Reply-To:
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Migadu-Flow: FLOW_OUT

On 5/19/26 1:28 PM, Johannes Weiner wrote:
> On Mon, May 18, 2026 at 06:25:32PM -0700, JP Kobryn (Meta) wrote:
>> We're seeing a pattern in production where 2MB THP order-9 allocations are
>> failing due to fragmentation and triggering reclaim on systems with plenty
>> of free memory. Over time, the success rate of these THP allocations do not
>> increase at all.
>>
>> Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
>> indicated the given zone had sufficient free pages for order-9 allocations,
>> yet they were going unused. Drilling down into the zone and inspecting
>> /proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
>> zone's HighAtomic bucket (while zero were present in Movable).
>> THP is unable to draw blocks from HighAtomic since that bucket is not in the
>> fallback list.
>>
>> The heuristic for reserving pageblocks in HighAtomic is that any atomic
>> allocation greater than order-0 will result in the full pageblock being
>> captured. This means that an order-1 atomic allocation will over-reserve by
>> 256x, a full 512 pageblock.
>>
>> Gate the reservation on order. Skip for allocations at or below
>> PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
>> reserving entire pageblocks, and significantly helps when THP is in use on
>> a fragmented but otherwise healthy system.
>>
>> Testing was performed using an A/B instagram workload receiving prod
>> traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
>> several gains:
>>
>> Unpatched
>>   HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
>>     ...all order-9 blocks in HighAtomic
>>   THP success rate: 1-6%
>>   Compaction success rate: 0-2%
>>   pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
>>   Atomic order-4+ allocations: 0
>>
>> Patched
>>   HighAtomic pageblocks per host: 1
>>   THP success rate: 44-78%
>>   Compaction success rate: 24-47%
>>   pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
>>   Atomic order-4+ allocations: 0
>
> This is an interesting patch. A couple of thoughts:
>
> 1. You disabled the highatomic reserve for this workload and it didn't
>    seem to matter. Presumably

> 2. Maxing out the reserves is odd. ALLOC_HIGHATOMIC allocations will
>    try reserved space first, and I'd expect things that are commonly
>    highatomic to be short-lived. Why don't we stop with a couple of
>    claimed highatomic blocks that get continuously recycled?

Even though they may be short-lived, the data shows the volume of
allocations is steady enough to keep the reserves maxed out.

> 3. The impact on THP and compaction success rate is pretty
>    extreme. How can 1% of memory throw such a wrench into the gears?

Looking at the pre-patched high atomic pageblock counts, that's ~300
pageblocks that could've been used for THPs. They become usable after
the patch.

> Have you tried this with other workloads?

No, but the pre-patch symptoms will show up on workloads where net
allocs are frequent enough to keep the high atomic pageblock count up.
Memory size of hosts involved is a factor as well since it's possible
for a majority of order-9 pages to be stuck in high atomic.
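
For anyone who wants to check a host for the same pattern, below is a rough Python sketch, not part of the patch. It reproduces the 256x over-reservation arithmetic from the changelog, models the order gate as described there (this is an illustration, not the kernel code), and sums free blocks per migrate type from /proc/pagetypeinfo text. The function names and the SAMPLE text are hypothetical, and the sample counts are made up for illustration. It assumes 4K pages with order-9 pageblocks and PAGE_ALLOC_COSTLY_ORDER == 3, and that the "Free pages count per migrate type" rows have the layout shown in SAMPLE.

```python
import re

PAGEBLOCK_ORDER = 9          # assumption: 2MB pageblocks with 4K pages
PAGE_ALLOC_COSTLY_ORDER = 3  # assumption: value defined in mainline mmzone.h

# Hypothetical pagetypeinfo excerpt; the counts are illustrative only.
SAMPLE = """\
Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone   Normal, type    Unmovable     12      8      3      1      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Movable    940    410    120     30      6      2      0      0      0      0      0
Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0    310      0
"""

ROW = re.compile(r"^Node\s+\d+, zone\s+\S+, type\s+(\S+)\s+(.*)$")


def over_reserve_factor(order, pageblock_order=PAGEBLOCK_ORDER):
    """Pages reserved per page requested when a whole pageblock is claimed."""
    return (1 << pageblock_order) // (1 << order)


def reserves_pageblock(order, patched):
    """Toy model of the gate described in the changelog (not kernel code)."""
    threshold = PAGE_ALLOC_COSTLY_ORDER if patched else 0
    return order > threshold


def free_blocks_by_type(pagetypeinfo_text, order):
    """Sum free blocks of the given order per migrate type, across zones."""
    totals = {}
    for line in pagetypeinfo_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header lines and the "Number of blocks" table
        migratetype, counts = match.group(1), match.group(2).split()
        if order < len(counts):
            # Some kernels cap large counts and print them as ">N".
            value = int(counts[order].lstrip(">"))
            totals[migratetype] = totals.get(migratetype, 0) + value
    return totals


if __name__ == "__main__":
    print(over_reserve_factor(1))
    print(free_blocks_by_type(SAMPLE, PAGEBLOCK_ORDER))
```

On a live system, the text passed to free_blocks_by_type() would come from reading /proc/pagetypeinfo (typically root-only) instead of SAMPLE. A large order-9 count under HighAtomic next to zero under Movable is the signature described above.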