public inbox for linux-mm@kvack.org
From: Kit Dallege <xaum.io@gmail.com>
To: akpm@linux-foundation.org, david@kernel.org, corbet@lwn.net
Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org,
	Kit Dallege <xaum.io@gmail.com>
Subject: [PATCH] Docs/mm: document Page Allocation
Date: Sat, 14 Mar 2026 16:25:30 +0100	[thread overview]
Message-ID: <20260314152530.100357-1-xaum.io@gmail.com> (raw)

Fill in the page_allocation.rst stub created in commit 481cc97349d6
("mm,doc: Add new documentation structure") as part of
the structured memory management documentation following
Mel Gorman's book outline.

Signed-off-by: Kit Dallege <xaum.io@gmail.com>
---
 Documentation/mm/page_allocation.rst | 219 +++++++++++++++++++++++++++
 1 file changed, 219 insertions(+)

diff --git a/Documentation/mm/page_allocation.rst b/Documentation/mm/page_allocation.rst
index d9b4495561f1..4d0c1f2db9af 100644
--- a/Documentation/mm/page_allocation.rst
+++ b/Documentation/mm/page_allocation.rst
@@ -3,3 +3,222 @@
 ===============
 Page Allocation
 ===============
+
+The page allocator is the kernel's primary interface for obtaining and
+releasing physical page frames.  It is built on the buddy algorithm and
+implemented in ``mm/page_alloc.c``.
+
+.. contents:: :local:
+
+Buddy Allocator
+===============
+
+Free pages are grouped by order (power-of-two size) in per-zone
+``free_area`` arrays, where order 0 is a single page and the maximum is
+``MAX_PAGE_ORDER``.  To satisfy an allocation of order N, the allocator
+looks for a free block of that order.  If none is available, it splits a
+higher-order block in half repeatedly until one of the right size is
+produced.  When a page is freed, the allocator checks whether its "buddy"
+(the adjacent block of the same order) is also free; if so, the two are
+merged into a block of the next higher order.  This coalescing continues
+as high as possible, rebuilding large contiguous blocks over time.
+
+Migratetypes
+============
+
+Each pageblock (typically 2 MiB on x86) carries a migratetype tag that
+describes the kind of allocations it serves:
+
+- **MIGRATE_UNMOVABLE**: kernel allocations that cannot be relocated
+  (slab objects, page tables).
+- **MIGRATE_MOVABLE**: pages that can be migrated or reclaimed, such as
+  user anonymous memory and page cache (used by compaction and memory
+  hot-remove).
+- **MIGRATE_RECLAIMABLE**: kernel allocations that cannot be moved but
+  can be freed under pressure, such as reclaimable slab caches holding
+  dentries and inodes.
+- **MIGRATE_CMA**: reserved for the contiguous memory allocator;
+  behaves as movable when not in use by CMA.
+- **MIGRATE_ISOLATE**: temporarily prevents allocation from a range,
+  used by ``alloc_contig_range()`` and memory hot-remove.
+
+When the free lists for the requested migratetype are empty, the
+allocator falls back to other types in a defined order.  Rather than
+taking single pages, it may "steal" the entire surrounding pageblock and
+retag it with the requesting migratetype, so that allocations of
+different types do not interleave within one block.  This fallback and
+stealing logic is the key mechanism for balancing fragmentation against
+allocation success.
+
+Per-CPU Pagesets
+================
+
+Most order-0 allocations are served from per-CPU page lists (PCP) rather
+than the global ``free_area``.  This avoids taking the zone lock on the
+common path, which is critical for scalability on large systems.
+
+Each CPU maintains lists of free pages grouped by migratetype.  Pages are
+moved between the per-CPU lists and the buddy in batches.  The batch size
+and high watermark for each per-CPU list are tuned based on zone size and
+the number of CPUs.
+
+When a per-CPU list is empty, a batch of pages is taken from the buddy.
+When it exceeds its high watermark, a batch of excess pages is returned.
+``drain_all_pages()`` flushes the per-CPU free lists on every CPU when
+the pages must be visible to the buddy allocator, for example during
+memory hot-remove or page isolation.
+
+GFP Flags
+=========
+
+Every allocation request carries a set of GFP (Get Free Pages) flags,
+defined in ``include/linux/gfp.h``, that describe what the allocator is
+allowed to do:
+
+Zone selection
+  ``__GFP_DMA``, ``__GFP_DMA32``, ``__GFP_HIGHMEM``, ``__GFP_MOVABLE``
+  select the highest zone the allocation may use.  ``gfp_zone()`` maps
+  flags to a zone type; the allocator then scans the zonelist from that
+  zone downward.
+
+Reclaim and compaction
+  ``__GFP_DIRECT_RECLAIM`` allows the allocator to invoke direct reclaim.
+  ``__GFP_KSWAPD_RECLAIM`` allows it to wake kswapd.  Together they form
+  ``__GFP_RECLAIM``, which combined with ``__GFP_IO`` and ``__GFP_FS``
+  yields ``GFP_KERNEL``, the most common flag combination.
+
+Retry behavior
+  ``__GFP_NORETRY`` gives up after one attempt at reclaim.
+  ``__GFP_RETRY_MAYFAIL`` retries as long as progress is being made.
+  ``__GFP_NOFAIL`` cannot fail: the allocator retries indefinitely,
+  which is appropriate only for small allocations in contexts that
+  cannot handle failure.
+
+Migratetype
+  ``__GFP_MOVABLE`` and ``__GFP_RECLAIMABLE`` select the migratetype.
+  ``gfp_migratetype()`` maps flags to the appropriate type.
+
+Allocation Path
+===============
+
+Fast path
+---------
+
+``get_page_from_freelist()`` is the fast path.  It walks the zonelist
+(an ordered list of zones across all nodes, starting with the preferred
+node) looking for a zone with enough free pages above its watermarks.
+When it finds one, it pulls a page from the per-CPU list or buddy.
+
+The fast path also enforces NUMA locality and cpuset constraints, and
+skips zones that have exceeded their dirty-page limit.  If no zone can
+satisfy the request, control passes to the slow path.
+
+Slow path
+---------
+
+``__alloc_pages_slowpath()`` engages increasingly aggressive measures:
+
+1. Wake kswapd to begin background reclaim.
+2. Attempt direct reclaim: the allocating task itself reclaims pages.
+3. Attempt direct compaction: migrate pages to create contiguous
+   blocks (for high-order allocations).
+4. Retry with lowered watermarks if progress was made.
+5. As a last resort, invoke the OOM killer (see Documentation/mm/oom.rst).
+
+Each step may succeed, in which case the allocation is retried.  The
+``__GFP_NORETRY``, ``__GFP_RETRY_MAYFAIL``, and ``__GFP_NOFAIL`` flags
+control how far down this chain the allocator goes.
+
+Watermarks
+==========
+
+Each zone maintains min, low, high, and promo watermarks that govern
+reclaim behavior:
+
+- **min**: below this level, only emergency allocations (those with
+  ``__GFP_MEMALLOC`` or from an OOM victim) may dip into the reserves;
+  ordinary allocations must enter direct reclaim before they can
+  proceed.
+- **low**: when free pages drop below this level, kswapd is woken to
+  begin background reclaim.
+- **high**: kswapd stops reclaiming when free pages reach this level.
+  The zone is considered "balanced."
+- **promo**: used by NUMA memory tiering; when page promotion is
+  enabled, kswapd on a top-tier node reclaims up to this higher level
+  so that hot pages can be promoted into the node.
+
+The min watermark is derived from ``vm.min_free_kbytes``.  The distance
+between watermarks is scaled by ``vm.watermark_scale_factor``.
+
+Watermark boosting temporarily raises watermarks after a pageblock is
+stolen from a different migratetype, increasing reclaim pressure to
+recover from the fragmentation event.
+
+High-Atomic Reserves
+--------------------
+
+The allocator reserves a small number of high-order pageblocks for atomic
+(non-sleeping) allocations.  When a high-order atomic allocation succeeds
+from unreserved memory, the containing pageblock may be added to the
+reserve.  When allocations start failing under memory pressure, reserved
+pageblocks are released back to the general pool.
+
+Compaction
+==========
+
+Memory compaction (``mm/compaction.c``) creates contiguous free blocks for
+high-order allocations by relocating movable pages.  It runs two scanners
+across a zone: a migration scanner walks up from the bottom collecting
+movable in-use pages, and a free scanner walks down from the top
+collecting free pages.  In-use pages are migrated into the free pages,
+so contiguous free space accumulates behind the migration scanner until
+the two scanners meet.
+
+Sync modes
+----------
+
+Compaction operates in three modes:
+
+- **ASYNC**: skips pages that require blocking to isolate or migrate.
+  Used for the first direct-compaction attempt in the allocation path.
+- **SYNC_LIGHT**: allows some blocking but skips pages under writeback.
+  Used by kcompactd and by subsequent direct-compaction attempts.
+- **SYNC**: allows full blocking.  Used by ``alloc_contig_range()`` and
+  the ``/proc/sys/vm/compact_memory`` trigger.
+
+Deferral
+--------
+
+When compaction fails for a given order in a zone, it is deferred for an
+exponentially increasing number of attempts to avoid wasting CPU on zones
+that are too fragmented.  A successful high-order allocation resets the
+deferral.
+
+kcompactd
+---------
+
+Each node has a kcompactd kernel thread that performs background
+compaction.  It is woken when kswapd finishes reclaiming but high-order
+allocations are still failing due to fragmentation.  kcompactd runs at
+low priority to avoid interfering with foreground work.
+
+Capture Control
+---------------
+
+During direct compaction, the allocator uses a capture mechanism: when
+compaction frees a block of the right order, the allocation can claim it
+immediately rather than racing with other allocators on the free list.
+
+Page Isolation
+==============
+
+``mm/page_isolation.c`` supports marking pageblocks as ``MIGRATE_ISOLATE``
+to prevent new allocations from those ranges.  Existing free pages are
+moved out; the caller then migrates all in-use pages away.  Once the range
+is fully evacuated, it can be used for a contiguous allocation or taken
+offline.
+
+This mechanism is used by:
+
+- **CMA** (contiguous memory allocator): reserves regions at boot for
+  device drivers that need physically contiguous buffers.  The reserved
+  pages serve normal movable allocations until a CMA allocation claims
+  the range.
+- **Memory hot-remove**: isolates a memory block before offlining it.
+- **alloc_contig_range()**: general-purpose contiguous allocation used
+  by gigantic huge pages and other subsystems.
+
+The isolation process must handle pageblocks that straddle the requested
+range boundaries, compound pages (huge pages, THP) that overlap the
+boundary, and unmovable pages that prevent evacuation.
-- 
2.53.0




Thread overview: 3+ messages
2026-03-14 15:25 Kit Dallege [this message]
2026-03-15 20:36 ` [PATCH] Docs/mm: document Page Allocation Lorenzo Stoakes (Oracle)
2026-03-16 12:52   ` Vlastimil Babka (SUSE)
