Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Matthew Auld <matthew.auld@intel.com>
To: Arunpravin Paneer Selvam <arunpravin.paneerselvam@amd.com>,
	christian.koenig@amd.com, dri-devel@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	amd-gfx@lists.freedesktop.org
Cc: alexander.deucher@amd.com
Subject: Re: [PATCH v4 1/2] gpu/buddy: replace dual-tree/force_merge with decoupled clear tracker
Date: Wed, 10 Jun 2026 10:19:44 +0100	[thread overview]
Message-ID: <3ba98ec2-ea1f-4074-b1cc-456fca283ef8@intel.com> (raw)
In-Reply-To: <9b0add60-9bca-44dc-a95d-be289ea2d3c1@amd.com>

On 01/06/2026 11:51, Arunpravin Paneer Selvam wrote:
> 
> 
> On 5/29/2026 11:11 PM, Matthew Auld wrote:
>> Hi,
>>
>> On 27/05/2026 12:29, Arunpravin Paneer Selvam wrote:
>>> The current buddy allocator maintains separate clear_tree[] and
>>> dirty_tree[] rbtrees per order, preventing coalescing between cleared
>>> and dirty buddies. Under mixed workloads, this creates a merge barrier:
>>> adjacent buddies frequently end up split across trees, forcing reliance
>>> on __force_merge() during allocation.
>>>
>>> __force_merge() performs an O(N x max_order) scan under the VRAM manager
>>> lock, leading to allocation stalls and failures for large contiguous
>>> requests even when sufficient total free memory is available.
>>
>> So is this contig with non power-of-two sizes?
> Both power-of-two and non-power-of-two contiguous requests are affected 
> - in either case, the required higher-order block can't form when its 
> lower-order buddies are separated by clear/dirty state across the dual 
> trees. But the core issue we are seeing is VRAM fragmentation caused by 
> massive small allocations (e.g., thousands of 4 KiB–8 KiB buffers) that 
> end up split across clear and dirty trees, preventing buddy coalescing. 
> This leads to allocation failures and OOM in later workloads even when 
> sufficient total free VRAM is available.
>>
>> Do we know if we could force_merge everything in one go or somehow be 
>> more aggressive and do more than needed now, at the first sign of 
>> contention here, instead of doing it piecemeal? Downside would be 
>> losing more of the clear tracking, when this happens, but more re- 
>> merging.
>>
>> Could we have another per-order list, of all blocks that we failed to 
>> merge, when we did the free step? When doing the force merge step, we 
>> maybe don't need to search blindly and can focus instead on the stuff 
>> tracked in those lists? Maybe it doesn't need to be a list, but could 
>> be another rb-tree?
>>
>> We know the size of the total allocation, if we trigger force_merge, 
>> could we try to merge enough in one go for the entire allocation, 
>> instead of restarting the entire thing on the next iteration? Would 
>> that help at all?
>>
>> But I guess these are more for the stalling side, and won't help much 
>> with the contig angle?
> The memory is highly fragmented into mostly 4 KiB chunks and small 
> scattered blocks across the dual trees, so although total free memory 
> exists, it is split into low-order fragments. The workload then requests 
> very large contiguous allocations (tens of GBs, e.g., ~64 GiB), which 
> fail with OOM because the allocator cannot form sufficiently large high- 
> order blocks from the fragmented space. We could go with more aggressive 
> merging or merge-in-one-go approaches, but this might waste more cleared 
> memory. I think fundamentally the buddy allocator should be allowed to 
> merge unconditionally - the single-tree approach with unconditional 
> coalescing would improve the fragmentation and benefit contiguous 
> allocations along with addressing the stalling and latency issues.
>>
>> For the extent idea, is there any merit in maybe doing this for all 
>> contig blobs, and not just cleared stuff? Or is the workload you are 
>> seeing only benefit users that want cleared stuff? Wondering if this 
>> would benefit all users that want contig? Like if we hypothetically 
>> kept clear and dirty separate, like we do now, but with an improved 
>> force_merge, and then have extent tracking for all contig blobs and 
>> replace the try_harder stuff? When you do a contig alloc, the 
>> individual clear/dirty is still all there within the range, so you can 
>> skip re-clearing in some cases. I guess downside is overall more fuzzy 
>> contig + clear/free path, but I guess you would never get allocation 
>> failures, when there is sufficient contig space?
> Yes, extending extent tracking to all contig allocations has merit, but 
> the core problem remains - with the dual-tree design, we still need 
> force_merge to undo the clear/dirty split before those extents can form. 
> In cases like heavy small-allocation workloads (thousands of 4 KiB 
> buffers) running first, the memory ends up massively fragmented across 
> both trees. When a very large contiguous allocation (e.g., ~64 GiB) 
> comes in later, the allocator fails with OOM even though sufficient 
> total free memory exists, because the extent tracker can't find a 
> contiguous range that was never allowed to merge in the first place. I 
> think the dirty/clear split is fundamentally the problem - allowing the 
> buddy allocator to merge unconditionally removes this barrier, and the 
> clear tracker can then be layered on top as an optimization without 
> blocking coalescing.
>>
>>>
>>> Solution
>>>
>>> Replace the dual-tree design with:
>>> - A single free_tree[order] rbtree for dirty and mixed free blocks
>>>    (fully cleared free blocks float outside this tree)
>>> - A lightweight out-of-band clear tracker (gpu_clear_tracker)
>>>
>>> Fully cleared free blocks are tracked outside the buddy trees using an
>>> augmented interval rbtree, enabling O(log E) lookup of the largest
>>> cleared extents.
>>>
>>> Buddy coalescing is now unconditional in __gpu_buddy_free(), regardless
>>> of clear/dirty state. This removes the merge barrier and eliminates the
>>> need for __force_merge().
>>>
>>> Benefits
>>>
>>> - Correct high-order allocations after mixed clear/dirty workloads
>>> - Elimination of O(N x max_order) merge cost from the allocation path
>>> - O(log E) cleared-extent lookup replacing O(N) scans
>>> - Predictable allocation latency under fragmentation
>>> - Reduced complexity with a single tree per order
>>
>> Since there is no separate tracking for dirty stuff, is the non- 
>> cleared alloc path a bit more "fuzzy" now, with it potentially 
>> stealing cleared memory, or is it the same behaviour still?
> Right, on v4, the dirty and mixed (partially cleared) blocks are 
> allocated for the non-cleared alloc path, which can end up stealing 
> cleared memory. On v5, I plan to address this with a three-tier dirty 
> allocation fallback: dirty → mixed → clear, driven by rbtree augment 
> bits (subtree_has_dirty, subtree_has_mixed), each pass O(log N). The 
> split-descent also applies the same preference at every level when 
> carving a higher-order block, so cleared memory is preserved as much as 
> possible and only used as a last resort.
> Thoughts ?

No objections from me. Do you want me to still look at v4 in depth, or 
wait for v5? I only really looked at this from high level.

>>
>> For drivers that don't use free tracking, is there some benefit? Are 
>> there any downsides there? I assume that clear tracker is always empty.
> Correct, for drivers that don't clear memory, the clear tracker is 
> always empty and they simply allocate from the free_tree[]. Benefits:
> 
> Single tree per order instead of dual trees (fewer rbtree operations)
> No force_merge path at all (unconditional coalescing at free time)
> Simpler code path overall
> 
> No real downsides - the clear tracker adds zero overhead when empty, and 
> the augment bits would simply show all blocks as dirty, so the walk 
> degenerates to a normal rbtree lookup with no extra cost.
> 
> Regards,
> Arun.
> 
> 


  parent reply	other threads:[~2026-06-10  9:19 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-27 11:29 [PATCH v4 1/2] gpu/buddy: replace dual-tree/force_merge with decoupled clear tracker Arunpravin Paneer Selvam
2026-05-27 11:29 ` [PATCH v4 2/2] gpu/tests/buddy: add clear-tracker allocation latency benchmarks Arunpravin Paneer Selvam
2026-05-27 14:30 ` ✗ CI.checkpatch: warning for series starting with [v4,1/2] gpu/buddy: replace dual-tree/force_merge with decoupled clear tracker Patchwork
2026-05-27 14:32 ` ✓ CI.KUnit: success " Patchwork
2026-05-27 15:24 ` ✓ Xe.CI.BAT: " Patchwork
2026-05-27 19:27 ` ✗ Xe.CI.FULL: failure " Patchwork
2026-05-28 12:33 ` [PATCH v4 1/2] " Arunpravin Paneer Selvam
2026-05-29 17:41 ` Matthew Auld
2026-06-01 10:51   ` Arunpravin Paneer Selvam
2026-06-10  6:27     ` Arunpravin Paneer Selvam
2026-06-10  9:19     ` Matthew Auld [this message]
2026-06-10 13:06       ` Arunpravin Paneer Selvam

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3ba98ec2-ea1f-4074-b1cc-456fca283ef8@intel.com \
    --to=matthew.auld@intel.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=arunpravin.paneerselvam@amd.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=intel-xe@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox