* [PATCH] Docs/mm: document Page Allocation
@ 2026-03-14 15:25 Kit Dallege
2026-03-15 20:36 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 3+ messages in thread
From: Kit Dallege @ 2026-03-14 15:25 UTC (permalink / raw)
To: akpm, david, corbet; +Cc: linux-mm, linux-doc, Kit Dallege
Fill in the page_allocation.rst stub created in commit 481cc97349d6
("mm,doc: Add new documentation structure") as part of
the structured memory management documentation following
Mel Gorman's book outline.
Signed-off-by: Kit Dallege <xaum.io@gmail.com>
---
Documentation/mm/page_allocation.rst | 219 +++++++++++++++++++++++++++
1 file changed, 219 insertions(+)
diff --git a/Documentation/mm/page_allocation.rst b/Documentation/mm/page_allocation.rst
index d9b4495561f1..4d0c1f2db9af 100644
--- a/Documentation/mm/page_allocation.rst
+++ b/Documentation/mm/page_allocation.rst
@@ -3,3 +3,222 @@
===============
Page Allocation
===============
+
+The page allocator is the kernel's primary interface for obtaining and
+releasing physical page frames. It is built on the buddy algorithm and
+implemented in ``mm/page_alloc.c``.
+
+.. contents:: :local:
+
+Buddy Allocator
+===============
+
+Free pages are grouped by order (power-of-two size) in per-zone
+``free_area`` arrays, where order 0 is a single page and the maximum is
+``MAX_PAGE_ORDER``. To satisfy an allocation of order N, the allocator
+looks for a free block of that order. If none is available, it splits a
+higher-order block in half repeatedly until one of the right size is
+produced. When a page is freed, the allocator checks whether its "buddy"
+(the adjacent block of the same order) is also free; if so, the two are
+merged into a block of the next higher order. This coalescing continues
+as high as possible, rebuilding large contiguous blocks over time.
+
+Migratetypes
+============
+
+Each pageblock (typically 2MB on x86) carries a migratetype tag that
+describes the kind of allocations it serves:
+
+- **MIGRATE_UNMOVABLE**: kernel allocations that cannot be relocated
+ (slab objects, page tables).
+- **MIGRATE_MOVABLE**: user pages and other content that can be migrated
+ or reclaimed (used by compaction and memory hot-remove).
+- **MIGRATE_RECLAIMABLE**: caches that can be dropped under pressure
+ (page cache, dentries).
+- **MIGRATE_CMA**: reserved for the contiguous memory allocator;
+ behaves as movable when not in use by CMA.
+- **MIGRATE_ISOLATE**: temporarily prevents allocation from a range,
+ used during compaction and memory hot-remove.
+
+When a free list for the requested migratetype is empty, the allocator
+falls back to other types in a defined order. It may also "steal" an
+entire pageblock from another migratetype if it needs to take pages from
+it, changing the pageblock's tag to reduce future fragmentation. This
+fallback and stealing logic is a key mechanism for balancing fragmentation
+against allocation success.
+
+Per-CPU Pagesets
+================
+
+Most order-0 allocations are served from per-CPU page lists (PCP) rather
+than the global ``free_area``. This avoids taking the zone lock on the
+common path, which is critical for scalability on large systems.
+
+Each CPU maintains lists of free pages grouped by migratetype. Pages are
+moved between the per-CPU lists and the buddy in batches. The batch size
+and high watermark for each per-CPU list are tuned based on zone size and
+the number of CPUs.
+
+When a per-CPU list is empty, a batch of pages is taken from the buddy.
+When it exceeds its high watermark, excess pages are returned.
+``lru_add_drain()`` and ``drain_all_pages()`` flush per-CPU lists when
+the system needs an accurate count of free pages, such as during memory
+hot-remove.
+
+GFP Flags
+=========
+
+Every allocation request carries a set of GFP (Get Free Pages) flags,
+defined in ``include/linux/gfp.h``, that describe what the allocator is
+allowed to do:
+
+Zone selection
+ ``__GFP_DMA``, ``__GFP_DMA32``, ``__GFP_HIGHMEM``, ``__GFP_MOVABLE``
+ select the highest zone the allocation may use. ``gfp_zone()`` maps
+ flags to a zone type; the allocator then scans the zonelist from that
+ zone downward.
+
+Reclaim and compaction
+ ``__GFP_DIRECT_RECLAIM`` allows the allocator to invoke direct reclaim.
+ ``__GFP_KSWAPD_RECLAIM`` allows it to wake kswapd. Together these form
+ ``GFP_KERNEL``, the most common flag combination.
+
+Retry behavior
+ ``__GFP_NORETRY`` gives up after one attempt at reclaim.
+ ``__GFP_RETRY_MAYFAIL`` retries as long as progress is being made.
+ ``__GFP_NOFAIL`` never fails — the allocator retries indefinitely,
+ which is appropriate only for small allocations in contexts that
+ cannot handle failure.
+
+Migratetype
+ ``__GFP_MOVABLE`` and ``__GFP_RECLAIMABLE`` select the migratetype.
+ ``gfp_migratetype()`` maps flags to the appropriate type.
+
+Allocation Path
+===============
+
+Fast path
+---------
+
+``get_page_from_freelist()`` is the fast path. It walks the zonelist
+(an ordered list of zones across all nodes, starting with the preferred
+node) looking for a zone with enough free pages above its watermarks.
+When it finds one, it pulls a page from the per-CPU list or buddy.
+
+The fast path also checks NUMA locality, cpuset constraints, and memory
+cgroup limits. If no zone can satisfy the request, control passes to
+the slow path.
+
+Slow path
+---------
+
+``__alloc_pages_slowpath()`` engages increasingly aggressive measures:
+
+1. Wake kswapd to begin background reclaim.
+2. Attempt direct reclaim — the allocating task itself reclaims pages.
+3. Attempt direct compaction — migrate pages to create contiguous blocks
+ (for high-order allocations).
+4. Retry with lowered watermarks if progress was made.
+5. As a last resort, invoke the OOM killer (see Documentation/mm/oom.rst).
+
+Each step may succeed, in which case the allocation is retried. The
+``__GFP_NORETRY``, ``__GFP_RETRY_MAYFAIL``, and ``__GFP_NOFAIL`` flags
+control how far down this chain the allocator goes.
+
+Watermarks
+==========
+
+Each zone maintains min, low, high, and promo watermarks that govern
+reclaim behavior:
+
+- **min**: below this level, only emergency allocations (those with
+ ``__GFP_MEMALLOC`` or from the OOM victim) can proceed. Direct reclaim
+ may be triggered.
+- **low**: when free pages drop below this level, kswapd is woken to
+ begin background reclaim.
+- **high**: kswapd stops reclaiming when free pages reach this level.
+ The zone is considered "balanced."
+- **promo**: used for NUMA memory tiering; controls when kswapd stops
+ reclaiming when tier promotion is enabled.
+
+The min watermark is derived from ``vm.min_free_kbytes``. The distance
+between watermarks is scaled by ``vm.watermark_scale_factor``.
+
+Watermark boosting temporarily raises watermarks after a pageblock is
+stolen from a different migratetype, increasing reclaim pressure to
+recover from the fragmentation event.
+
+High-Atomic Reserves
+--------------------
+
+The allocator reserves a small number of high-order pageblocks for atomic
+(non-sleeping) allocations. When a high-order atomic allocation succeeds
+from unreserved memory, the containing pageblock is moved to the reserve.
+When memory pressure is high, unreserved pageblocks are released back to
+the general pool.
+
+Compaction
+==========
+
+Memory compaction (``mm/compaction.c``) creates contiguous free blocks for
+high-order allocations by relocating movable pages. It runs two scanners
+across a zone: one walks from the bottom to find movable in-use pages, the
+other walks from the top to find free pages. Movable pages are migrated
+to the free locations, consolidating free space in the middle.
+
+Sync modes
+----------
+
+Compaction operates in three modes:
+
+- **ASYNC**: skips pages that require blocking to isolate or migrate.
+ Used in the allocation fast path and by kcompactd.
+- **SYNC_LIGHT**: allows some blocking but skips pages under writeback.
+- **SYNC**: allows full blocking. Used when direct compaction is the
+ last option before OOM.
+
+Deferral
+--------
+
+When compaction fails for a given order in a zone, it is deferred for an
+exponentially increasing number of attempts to avoid wasting CPU on zones
+that are too fragmented. A successful high-order allocation resets the
+deferral.
+
+kcompactd
+---------
+
+Each node has a kcompactd kernel thread that performs background
+compaction. It is woken when kswapd finishes reclaiming but high-order
+allocations are still failing due to fragmentation. kcompactd runs at
+low priority to avoid interfering with foreground work.
+
+Capture Control
+---------------
+
+During direct compaction, the allocator uses a capture mechanism: when
+compaction frees a block of the right order, the allocation can claim it
+immediately rather than racing with other allocators on the free list.
+
+Page Isolation
+==============
+
+``mm/page_isolation.c`` supports marking pageblocks as ``MIGRATE_ISOLATE``
+to prevent new allocations from those ranges. Existing free pages are
+moved out; the caller then migrates all in-use pages away. Once the range
+is fully evacuated, it can be used for a contiguous allocation or taken
+offline.
+
+This mechanism is used by:
+
+- **CMA** (contiguous memory allocator): reserves regions at boot for
+ device drivers that need physically contiguous buffers. The reserved
+ pages serve normal movable allocations until a CMA allocation claims
+ the range.
+- **Memory hot-remove**: isolates a memory block before offlining it.
+- **alloc_contig_range()**: general-purpose contiguous allocation used
+ by gigantic huge pages and other subsystems.
+
+The isolation process must handle pageblocks that straddle the requested
+range boundaries, compound pages (huge pages, THP) that overlap the
+boundary, and unmovable pages that prevent evacuation.
--
2.53.0
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH] Docs/mm: document Page Allocation
2026-03-14 15:25 [PATCH] Docs/mm: document Page Allocation Kit Dallege
@ 2026-03-15 20:36 ` Lorenzo Stoakes (Oracle)
2026-03-16 12:52 ` Vlastimil Babka (SUSE)
0 siblings, 1 reply; 3+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-15 20:36 UTC (permalink / raw)
To: Kit Dallege; +Cc: akpm, david, corbet, linux-mm, linux-doc, Vlastimil Babka
NAK.
Because AI slop obviously, please don't send this kind of stuff.
This time I will +cc the page alloc maintainer for you, who I am sure will
be overjoyed by this...
Same reasons as the rest, I'm already annoyed you didn't bother to put this
in a series even. If you didn't lazily get Claude to do everything, you're
doing a very good job at seeming like you did.
On Sat, Mar 14, 2026 at 04:25:30PM +0100, Kit Dallege wrote:
> Fill in the page_allocation.rst stub created in commit 481cc97349d6
> ("mm,doc: Add new documentation structure") as part of
> the structured memory management documentation following
> Mel Gorman's book outline.
>
> Signed-off-by: Kit Dallege <xaum.io@gmail.com>
> ---
> Documentation/mm/page_allocation.rst | 219 +++++++++++++++++++++++++++
> 1 file changed, 219 insertions(+)
>
> diff --git a/Documentation/mm/page_allocation.rst b/Documentation/mm/page_allocation.rst
> index d9b4495561f1..4d0c1f2db9af 100644
> --- a/Documentation/mm/page_allocation.rst
> +++ b/Documentation/mm/page_allocation.rst
> @@ -3,3 +3,222 @@
> ===============
> Page Allocation
> ===============
> +
> +The page allocator is the kernel's primary interface for obtaining and
> +releasing physical page frames. It is built on the buddy algorithm and
> +implemented in ``mm/page_alloc.c``
Page frames... Which are? And how is that useful
Primary? So what's the secondary, or tertiary, etc. 'interface' for
'obtaining' and 'releasing' (hint: we use other words for these) 'page
frames' (hint: we don't really refer to these as page frames) ?
.
> +
> +.. contents:: :local:
> +
> +Buddy Allocator
> +===============
> +
> +Free pages are grouped by order (power-of-two size) in per-zone
> +``free_area`` arrays, where order 0 is a single page and the maximum is
> +``MAX_PAGE_ORDER``. To satisfy an allocation of order N, the allocator
> +looks for a free block of that order. If none is available, it splits
Nope. Alot are in PCP lists. Claude mentions this below but you're already
hand waving in a way that's actively unhelpful.
> +higher-order block in half repeatedly until one of the right size is
> +produced. When a page is freed, the allocator checks whether its "buddy"
> +(the adjacent block of the same order) is also free; if so, the two are
> +merged into a block of the next higher order. This coalescing continues
> +as high as possible, rebuilding large contiguous blocks over time.
This is such a useless abbreviated description of the buddy allocator as to
be frankly worthless.
> +
> +Migratetypes
> +============
> +
> +Each pageblock (typically 2MB on x86) carries a migratetype tag that
Ha! You (read; Claude) don't even define what a pageblock is or what it's
for... as if people ought to 'just know', somehow...
Again, you're proactively wasting our time with this, it's not wanted or
helpful.
And etc. etc. etc. the whole document would need to be thrown away and
rewritten, and we could choose to do that ourselves without your 'help',
thanks.
> +describes the kind of allocations it serves:
> +
> +- **MIGRATE_UNMOVABLE**: kernel allocations that cannot be relocated
> + (slab objects, page tables).
> +- **MIGRATE_MOVABLE**: user pages and other content that can be migrated
> + or reclaimed (used by compaction and memory hot-remove).
> +- **MIGRATE_RECLAIMABLE**: caches that can be dropped under pressure
> + (page cache, dentries).
> +- **MIGRATE_CMA**: reserved for the contiguous memory allocator;
> + behaves as movable when not in use by CMA.
> +- **MIGRATE_ISOLATE**: temporarily prevents allocation from a range,
> + used during compaction and memory hot-remove.
> +
> +When a free list for the requested migratetype is empty, the allocator
> +falls back to other types in a defined order. It may also "steal" an
> +entire pageblock from another migratetype if it needs to take pages from
> +it, changing the pageblock's tag to reduce future fragmentation. This
> +fallback and stealing logic is a key mechanism for balancing fragmentation
> +against allocation success.
> +
> +Per-CPU Pagesets
> +================
> +
> +Most order-0 allocations are served from per-CPU page lists (PCP) rather
> +than the global ``free_area``. This avoids taking the zone lock on the
> +common path, which is critical for scalability on large systems.
> +
> +Each CPU maintains lists of free pages grouped by migratetype. Pages are
> +moved between the per-CPU lists and the buddy in batches. The batch size
> +and high watermark for each per-CPU list are tuned based on zone size and
> +the number of CPUs.
> +
> +When a per-CPU list is empty, a batch of pages is taken from the buddy.
> +When it exceeds its high watermark, excess pages are returned.
> +``lru_add_drain()`` and ``drain_all_pages()`` flush per-CPU lists when
> +the system needs an accurate count of free pages, such as during memory
> +hot-remove.
> +
> +GFP Flags
> +=========
> +
> +Every allocation request carries a set of GFP (Get Free Pages) flags,
> +defined in ``include/linux/gfp.h``, that describe what the allocator is
> +allowed to do:
> +
> +Zone selection
> + ``__GFP_DMA``, ``__GFP_DMA32``, ``__GFP_HIGHMEM``, ``__GFP_MOVABLE``
> + select the highest zone the allocation may use. ``gfp_zone()`` maps
> + flags to a zone type; the allocator then scans the zonelist from that
> + zone downward.
> +
> +Reclaim and compaction
> + ``__GFP_DIRECT_RECLAIM`` allows the allocator to invoke direct reclaim.
> + ``__GFP_KSWAPD_RECLAIM`` allows it to wake kswapd. Together these form
> + ``GFP_KERNEL``, the most common flag combination.
> +
> +Retry behavior
> + ``__GFP_NORETRY`` gives up after one attempt at reclaim.
> + ``__GFP_RETRY_MAYFAIL`` retries as long as progress is being made.
> + ``__GFP_NOFAIL`` never fails — the allocator retries indefinitely,
> + which is appropriate only for small allocations in contexts that
> + cannot handle failure.
> +
> +Migratetype
> + ``__GFP_MOVABLE`` and ``__GFP_RECLAIMABLE`` select the migratetype.
> + ``gfp_migratetype()`` maps flags to the appropriate type.
> +
> +Allocation Path
> +===============
> +
> +Fast path
> +---------
> +
> +``get_page_from_freelist()`` is the fast path. It walks the zonelist
> +(an ordered list of zones across all nodes, starting with the preferred
> +node) looking for a zone with enough free pages above its watermarks.
> +When it finds one, it pulls a page from the per-CPU list or buddy.
> +
> +The fast path also checks NUMA locality, cpuset constraints, and memory
> +cgroup limits. If no zone can satisfy the request, control passes to
> +the slow path.
> +
> +Slow path
> +---------
> +
> +``__alloc_pages_slowpath()`` engages increasingly aggressive measures:
> +
> +1. Wake kswapd to begin background reclaim.
> +2. Attempt direct reclaim — the allocating task itself reclaims pages.
> +3. Attempt direct compaction — migrate pages to create contiguous blocks
> + (for high-order allocations).
> +4. Retry with lowered watermarks if progress was made.
> +5. As a last resort, invoke the OOM killer (see Documentation/mm/oom.rst).
> +
> +Each step may succeed, in which case the allocation is retried. The
> +``__GFP_NORETRY``, ``__GFP_RETRY_MAYFAIL``, and ``__GFP_NOFAIL`` flags
> +control how far down this chain the allocator goes.
> +
> +Watermarks
> +==========
> +
> +Each zone maintains min, low, high, and promo watermarks that govern
> +reclaim behavior:
> +
> +- **min**: below this level, only emergency allocations (those with
> + ``__GFP_MEMALLOC`` or from the OOM victim) can proceed. Direct reclaim
> + may be triggered.
> +- **low**: when free pages drop below this level, kswapd is woken to
> + begin background reclaim.
> +- **high**: kswapd stops reclaiming when free pages reach this level.
> + The zone is considered "balanced."
> +- **promo**: used for NUMA memory tiering; controls when kswapd stops
> + reclaiming when tier promotion is enabled.
> +
> +The min watermark is derived from ``vm.min_free_kbytes``. The distance
> +between watermarks is scaled by ``vm.watermark_scale_factor``.
> +
> +Watermark boosting temporarily raises watermarks after a pageblock is
> +stolen from a different migratetype, increasing reclaim pressure to
> +recover from the fragmentation event.
> +
> +High-Atomic Reserves
> +--------------------
> +
> +The allocator reserves a small number of high-order pageblocks for atomic
> +(non-sleeping) allocations. When a high-order atomic allocation succeeds
> +from unreserved memory, the containing pageblock is moved to the reserve.
> +When memory pressure is high, unreserved pageblocks are released back to
> +the general pool.
> +
> +Compaction
> +==========
> +
> +Memory compaction (``mm/compaction.c``) creates contiguous free blocks for
> +high-order allocations by relocating movable pages. It runs two scanners
> +across a zone: one walks from the bottom to find movable in-use pages, the
> +other walks from the top to find free pages. Movable pages are migrated
> +to the free locations, consolidating free space in the middle.
> +
> +Sync modes
> +----------
> +
> +Compaction operates in three modes:
> +
> +- **ASYNC**: skips pages that require blocking to isolate or migrate.
> + Used in the allocation fast path and by kcompactd.
> +- **SYNC_LIGHT**: allows some blocking but skips pages under writeback.
> +- **SYNC**: allows full blocking. Used when direct compaction is the
> + last option before OOM.
> +
> +Deferral
> +--------
> +
> +When compaction fails for a given order in a zone, it is deferred for an
> +exponentially increasing number of attempts to avoid wasting CPU on zones
> +that are too fragmented. A successful high-order allocation resets the
> +deferral.
> +
> +kcompactd
> +---------
> +
> +Each node has a kcompactd kernel thread that performs background
> +compaction. It is woken when kswapd finishes reclaiming but high-order
> +allocations are still failing due to fragmentation. kcompactd runs at
> +low priority to avoid interfering with foreground work.
> +
> +Capture Control
> +---------------
> +
> +During direct compaction, the allocator uses a capture mechanism: when
> +compaction frees a block of the right order, the allocation can claim it
> +immediately rather than racing with other allocators on the free list.
> +
> +Page Isolation
> +==============
> +
> +``mm/page_isolation.c`` supports marking pageblocks as ``MIGRATE_ISOLATE``
> +to prevent new allocations from those ranges. Existing free pages are
> +moved out; the caller then migrates all in-use pages away. Once the range
> +is fully evacuated, it can be used for a contiguous allocation or taken
> +offline.
> +
> +This mechanism is used by:
> +
> +- **CMA** (contiguous memory allocator): reserves regions at boot for
> + device drivers that need physically contiguous buffers. The reserved
> + pages serve normal movable allocations until a CMA allocation claims
> + the range.
> +- **Memory hot-remove**: isolates a memory block before offlining it.
> +- **alloc_contig_range()**: general-purpose contiguous allocation used
> + by gigantic huge pages and other subsystems.
> +
> +The isolation process must handle pageblocks that straddle the requested
> +range boundaries, compound pages (huge pages, THP) that overlap the
> +boundary, and unmovable pages that prevent evacuation.
> --
> 2.53.0
>
>
>
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH] Docs/mm: document Page Allocation
2026-03-15 20:36 ` Lorenzo Stoakes (Oracle)
@ 2026-03-16 12:52 ` Vlastimil Babka (SUSE)
0 siblings, 0 replies; 3+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-16 12:52 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle), Kit Dallege
Cc: akpm, david, corbet, linux-mm, linux-doc
On 3/15/26 21:36, Lorenzo Stoakes (Oracle) wrote:
> NAK.
>
> Because AI slop obviously, please don't send this kind of stuff.
>
> This time I will +cc the page alloc maintainer for you, who I am sure will
> be overjoyed by this...
Thanks, Lorenzo. Indeed I don't think this is useful addition and agree with
your NAK.
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2026-03-16 12:52 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-14 15:25 [PATCH] Docs/mm: document Page Allocation Kit Dallege
2026-03-15 20:36 ` Lorenzo Stoakes (Oracle)
2026-03-16 12:52 ` Vlastimil Babka (SUSE)
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox