Linux-mm Archive on lore.kernel.org
From: JP Kobryn <jp.kobryn@linux.dev>
To: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>
Cc: akpm@linux-foundation.org, surenb@google.com, mhocko@suse.com,
	jackmanb@google.com, ziy@nvidia.com, linux-mm@kvack.org,
	usama.arif@linux.dev, kirill@shutemov.name, willy@infradead.org,
	linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
Date: Tue, 26 May 2026 22:57:58 -0700
Message-ID: <7a906c76-6dd9-4bd6-8bab-cb69eb0a3db6@linux.dev>
In-Reply-To: <c467595c-31a4-4d23-abee-0fdf90f8d505@kernel.org>

On 5/25/26 2:11 AM, Vlastimil Babka (SUSE) wrote:
> On 5/19/26 22:28, Johannes Weiner wrote:
>> On Mon, May 18, 2026 at 06:25:32PM -0700, JP Kobryn (Meta) wrote:
>>> We're seeing a pattern in production where 2MB THP order-9 allocations
>>> are failing due to fragmentation and triggering reclaim on systems with
>>> plenty of free memory. Over time, the success rate of these THP
>>> allocations does not increase at all.
>>>
>>> Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on
>>> compaction_suitable() indicated the given zone had sufficient free pages
>>> for order-9 allocations, yet they were going unused. Drilling down into
>>> the zone and inspecting /proc/pagetypeinfo revealed why. Order-9 blocks
>>> were accumulating in the zone's HighAtomic bucket (while zero were
>>> present in Movable). THP is unable to draw blocks from HighAtomic since
>>> that bucket is not in the fallback list.
>>>
>>> The heuristic for reserving pageblocks in HighAtomic is that any atomic
>>> allocation greater than order-0 will result in the full pageblock being
>>> captured. This means that an order-1 atomic allocation will over-reserve
>>> by 256x, a full 512-page pageblock.
>>>
>>> Gate the reservation on order. Skip for allocations at or below
>>> PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
>>> reserving entire pageblocks, and significantly helps when THP is in use
>>> on a fragmented but otherwise healthy system.
>>>
>>> Testing was performed using an A/B Instagram workload receiving prod
>>> traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
>>> several gains:
>>>
>>> Unpatched
>>> HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
>>> ...all order-9 blocks in HighAtomic
>>> THP success rate: 1-6%
>>> Compaction success rate: 0-2%
>>> pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
>>> Atomic order-4+ allocations: 0
>>>
>>> Patched
>>> HighAtomic pageblocks per host: 1
>>> THP success rate: 44-78%
>>> Compaction success rate: 24-47%
>>> pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
>>> Atomic order-4+ allocations: 0
>> This is an interesting patch. A couple of thoughts:
>>
>> 1. You disabled the highatomic reserve for this workload and it didn't
>> seem to matter. Presumably <costly orders don't need the protection.
>>
>> 2. Maxing out the reserves is odd. ALLOC_HIGHATOMIC allocations will
>> try reserved space first,
> Hmm, but if the allocation succeeds before entering slowpath,
> ALLOC_NON_BLOCK won't be set.
> But reserving another block should mean we already exhausted the
> reserved ones.
> Unreserving is only done when direct reclaim made some progress but failed
> to produce a page. But if it works, or kswapd does the job, we won't
> enter it?

There was just no real pressure to invoke the unreserving. Let me know
if I'm misunderstanding the question.

>> and I'd expect things that are commonly
>> highatomic to be short-lived. Why don't we stop with a couple of
>> claimed highatomic blocks that get continuously recycled?
> Maybe it's some big burst of highatomic allocations that leads to the
> reservations and then they stay around "forever"?

I should add to the changelog the missing info that high-frequency
networking allocations are responsible for these high atomic reservations.
Even though the allocations are not necessarily long-lived, the
pageblocks remain high atomic.

> If that's the case I think we should be perhaps looking at the unreserving
> being done more proactively, rather than limiting things to costly order.

What are your thoughts if we instead look at it as: should we be reserving
full pageblocks for small allocations?

It seems to come down to whether we want the disproportionate protection
of full pageblocks (below costly order) for high atomic allocs vs letting
them coalesce in the buddy path. Is the data not enough to justify the
latter?
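
To put a number on "disproportionate", here is a rough userspace sketch
(not kernel code) of how much one atomic allocation over-reserves when it
captures a whole pageblock, together with the kind of order gate the patch
describes. The function names and SKETCH_* constants are made up for this
example; the values assume order-9 pageblocks (2MB with 4K pages) and a
PAGE_ALLOC_COSTLY_ORDER of 3.

```c
#include <assert.h>

/* Assumptions for illustration only: order-9 pageblocks, costly order 3. */
#define SKETCH_PAGEBLOCK_ORDER 9
#define SKETCH_COSTLY_ORDER    3

/*
 * Ratio between the pages captured (one full pageblock) and the pages the
 * allocation actually asked for (1 << order).
 */
unsigned long overreserve_factor(unsigned int order)
{
    assert(order <= SKETCH_PAGEBLOCK_ORDER);
    return (1UL << SKETCH_PAGEBLOCK_ORDER) / (1UL << order);
}

/*
 * Hypothetical gate in the spirit of the patch: only allocations above the
 * costly order would trigger a highatomic pageblock reservation.
 */
int should_reserve_highatomic(unsigned int order)
{
    return order > SKETCH_COSTLY_ORDER;
}
```

Under these assumptions an order-1 request uses 2 of the 512 pages it would
reserve, which is the 256x figure from the changelog, and the gate would skip
the reservation for orders 0 through 3.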

>> 3. The impact on THP and compaction success rate is pretty
>> extreme. How can 1% of memory throw such a wrench into the gears?
> Maybe if ~all free memory is in the highatomic blocks, compaction can't be
> effective much. Or some suitability check somewhere in reclaim+compaction
> wrongly assumes the highatomic blocks are usable, so it won't do the work.

I could be missing something, but I spent some time tonight looking into
this and didn't find an issue in the compaction/reclaim suitability path.

__compaction_suitable() calls __zone_watermark_ok(), and that path
subtracts free MIGRATE_HIGHATOMIC pages from usable free memory for
callers without reserve access:

  /*
   * If the caller does not have rights to reserves below the min
   * watermark then subtract the free pages reserved for highatomic.
   */
  if (likely(!(alloc_flags & ALLOC_RESERVES)))
      unusable_free += READ_ONCE(z->nr_free_highatomic);

So free highatomic pages are removed from the usable free count there.

Also, the suitable-free-block check in __zone_watermark_ok() only treats
MIGRATE_HIGHATOMIC as usable when alloc_flags includes
ALLOC_HIGHATOMIC (or ALLOC_OOM). __compaction_suitable() passes
ALLOC_CMA here (not ALLOC_HIGHATOMIC), so I don't think compaction is
incorrectly treating free highatomic blocks as usable.
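
As a rough model of the two checks described above, the userspace sketch
below mimics only the accounting. struct sk_zone, the SK_* flag values and
both helpers are hypothetical stand-ins, and the real __zone_watermark_ok()
considers more than is shown here (for example the requested order, lowmem
reserves and CMA pages).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; not the kernel's values. */
#define SK_ALLOC_RESERVES   0x1u
#define SK_ALLOC_HIGHATOMIC 0x2u
#define SK_ALLOC_OOM        0x4u

struct sk_zone {
    long free_pages;         /* all free pages in the zone */
    long nr_free_highatomic; /* free pages held in highatomic pageblocks */
};

/* Free pages a caller may count on: no reserve access, no highatomic pages. */
long sk_usable_free(const struct sk_zone *z, unsigned int alloc_flags)
{
    long unusable_free = 0;

    assert(z);
    if (!(alloc_flags & SK_ALLOC_RESERVES))
        unusable_free += z->nr_free_highatomic;
    return z->free_pages - unusable_free;
}

/* A free highatomic block only counts for HIGHATOMIC or OOM callers. */
bool sk_highatomic_block_usable(unsigned int alloc_flags,
                                bool highatomic_list_empty)
{
    return (alloc_flags & (SK_ALLOC_HIGHATOMIC | SK_ALLOC_OOM)) &&
           !highatomic_list_empty;
}
```

In this model a caller passing neither reserve nor highatomic flags, as the
compaction suitability check does, sees the highatomic pages removed from its
usable count and never treats a free highatomic block as a match.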

The only caveat I noticed is the fragmentation accounting side:
fill_contig_page_info() / fragmentation_index() appear to count
free_area[order].nr_free across migratetypes, so the fragmentation score
may look better than it really is. But that seems adjacent to this patch.
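
The caveat can be illustrated with a small userspace sketch. It is not the
kernel's fill_contig_page_info(); the enum, the per-migratetype array and both
helpers are hypothetical, and only show how a free-block count summed across
migratetypes can differ from what a non-atomic caller such as THP could
actually draw from.

```c
#include <assert.h>

/* Hypothetical, reduced set of migratetypes for this sketch. */
enum sk_migratetype {
    SK_UNMOVABLE,
    SK_MOVABLE,
    SK_RECLAIMABLE,
    SK_HIGHATOMIC,
    SK_NR_TYPES
};

/* Count free blocks of one order across every migratetype. */
unsigned long sk_free_blocks_all(const unsigned long per_type[SK_NR_TYPES])
{
    unsigned long total = 0;

    for (int t = 0; t < SK_NR_TYPES; t++)
        total += per_type[t];
    return total;
}

/* Same count, but leaving out blocks parked in the highatomic list. */
unsigned long sk_free_blocks_usable(const unsigned long per_type[SK_NR_TYPES])
{
    return sk_free_blocks_all(per_type) - per_type[SK_HIGHATOMIC];
}
```

With every free order-9 block sitting in HighAtomic and none in Movable, as
on the unpatched hosts, the first count is non-zero while the second is zero.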

I think though that by the time we consider reclaim or compaction we're
dealing with the aftermath. The patch prevents the problem from occurring
up front.
