* [PATCH] Docs/mm: document Page Reclaim
@ 2026-03-14 15:25 Kit Dallege
From: Kit Dallege @ 2026-03-14 15:25 UTC (permalink / raw)
To: akpm, david, corbet; +Cc: linux-mm, linux-doc, Kit Dallege
Fill in the page_reclaim.rst stub created in commit 481cc97349d6
("mm,doc: Add new documentation structure") as part of
the structured memory management documentation following
Mel Gorman's book outline.
Signed-off-by: Kit Dallege <xaum.io@gmail.com>
---
Documentation/mm/page_reclaim.rst | 164 ++++++++++++++++++++++++++++++
1 file changed, 164 insertions(+)
diff --git a/Documentation/mm/page_reclaim.rst b/Documentation/mm/page_reclaim.rst
index 50a30b7f8ac3..bfa53bee98c2 100644
--- a/Documentation/mm/page_reclaim.rst
+++ b/Documentation/mm/page_reclaim.rst
@@ -3,3 +3,167 @@
============
Page Reclaim
============
+
+Page reclaim frees memory by evicting pages whose contents can be recovered
+later from backing storage. A *clean* file-backed page is identical to its
+on-disk copy and can simply be dropped; a *dirty* page has been modified
+since it was read and must be written back before it can be freed.
+Anonymous pages have no backing file, so they are written to swap space
+instead. The bulk of the implementation is in ``mm/vmscan.c``.
+
+.. contents::
+   :local:
+
+When Reclaim Runs
+=================
+
+Reclaim is triggered in two ways:
+
+- **kswapd**: a per-node kernel thread that runs in the background when
+ free pages in any zone drop below the low watermark. It reclaims until
+ free pages reach the high watermark, then sleeps.
+
+- **Direct reclaim**: when an allocation cannot be satisfied even after
+ kswapd has been woken, the allocating task reclaims pages synchronously
+ in its own context. This adds latency to the allocation but is necessary
+ when background reclaim cannot keep up.
+
+Reclaim Priority
+================
+
+The reclaim path operates at decreasing priority levels, from
+``DEF_PRIORITY`` (12) down to 0. The number of pages scanned at each level
+is roughly the LRU list size shifted right by the priority, so at the
+default priority only 1/4096th of each list is considered, while at
+priority 0 the entire list is scanned.
+
+If a full scan at priority 0 still does not free enough memory, the OOM
+killer is invoked (see Documentation/mm/oom.rst). This escalation
+prevents the system from spinning indefinitely in reclaim.
+
+Scan Control
+============
+
+Each reclaim invocation is parameterized by a ``struct scan_control`` that
+captures the allocation context: which GFP flags were used, how many pages
+are needed, which node or memory cgroup to reclaim from, and whether
+writeback or swap are allowed. The struct is threaded through the entire
+reclaim stack, ensuring consistent policy at every level.
+
+LRU Lists
+=========
+
+Each ``lruvec`` (one per node, or per node and memory cgroup combination)
+maintains lists of pages ordered by access recency.
+
+Classic LRU
+-----------
+
+The classic scheme uses four LRU lists per lruvec: active and inactive for
+both anonymous and file-backed pages. This approximates a second-chance
+(clock) algorithm:
+
+- Pages start on the inactive list when first allocated.
+- If accessed again while on the inactive list, they are promoted to the
+ active list.
+- Reclaim scans the inactive list and evicts pages that have not been
+ recently accessed.
+- To prevent the active list from growing without bound, pages are
+ periodically demoted from active to inactive.
+
+The split between anonymous and file-backed lists allows the reclaim path
+to balance eviction pressure between the two types based on their relative
+cost. Swapping anonymous pages is generally more expensive than dropping
+clean file pages, so the scanner adjusts the ratio using IO cost
+accounting and the ``vm.swappiness`` tunable.
+
+Multi-Gen LRU
+-------------
+
+The multi-gen LRU is an alternative reclaim algorithm that groups pages
+into generations by access time rather than a simple active/inactive
+split. It is documented separately in Documentation/mm/multigen_lru.rst.
+
+LRU Batching
+------------
+
+To avoid taking the lruvec lock on every page access, LRU operations are
+batched per-CPU (``mm/swap.c``). Functions like ``folio_add_lru()`` and
+``folio_mark_accessed()`` queue pages into per-CPU folio batches that are
+drained to the actual LRU lists periodically or when the batch is full.
+This batching is critical for scalability on systems with many CPUs.
+
+Reclaiming Pages
+================
+
+The core reclaim loop (``shrink_node()``) divides its work between the LRU
+lists (page cache and anonymous pages) and slab caches. For each lruvec,
+it scans the inactive LRU lists, deciding the fate of each page:
+
+- **Clean file pages** can be dropped immediately — they can be re-read
+ from disk.
+- **Dirty file pages** are queued for writeback. Reclaim typically skips
+ them and returns later, but under severe pressure it may wait for
+ writeback to complete.
+- **Anonymous pages** are swapped out if swap space is available and
+ ``vm.swappiness`` allows it.
+- **Mapped pages** must first be unmapped from every page table that
+  references them, with the corresponding TLB invalidation. The rmap
+  (reverse mapping) system is used to find and remove all page table
+  entries pointing to the page.
+- **Unevictable pages** (e.g. memory locked with ``mlock()``) are skipped
+  entirely. See Documentation/mm/unevictable-lru.rst.
+
+Memory Cgroup Reclaim
+---------------------
+
+When memory cgroup limits are exceeded, reclaim targets only the pages
+belonging to that cgroup. Each memory cgroup has its own lruvec per node,
+so the scanner can isolate its pages without disturbing the rest of the
+system. ``try_to_free_mem_cgroup_pages()`` is the entry point for
+cgroup-scoped reclaim.
+
+NUMA Demotion
+-------------
+
+On systems with tiered memory (e.g., fast DRAM and slower persistent
+memory), reclaim can demote pages to a slower tier instead of evicting
+them. This keeps the data in memory but frees the faster tier for
+actively accessed pages.
+
+Shrinkers
+=========
+
+Besides page cache and anonymous pages, kernel caches (dentries, inodes,
+and driver-specific caches) are reclaimed through the shrinker interface
+(``mm/shrinker.c``). A shrinker registers two callbacks:
+
+- ``count_objects()``: report how many objects are reclaimable.
+- ``scan_objects()``: free up to a requested number of objects.
+
+The reclaim path calls all registered shrinkers proportionally to the
+amount of reclaimable memory they report. Shrinkers are NUMA-aware: on
+NUMA systems, each shrinker is called with the node being reclaimed so it
+can prioritize freeing objects local to that node.
+
+Per-memcg shrinker tracking uses bitmap arrays (``shrinker_info``) so that
+the reclaim path only invokes shrinkers that actually have objects in the
+target cgroup, avoiding unnecessary work when there are many cgroups.
+
+Working Set Detection
+=====================
+
+When a page is evicted, a compact shadow entry is stored in its place in
+the page cache or swap cache. The shadow records the eviction timestamp
+(in terms of the lruvec's nonresident age counter) and the cgroup and
+node that owned the page.
+
+If the page is faulted back in (a "refault"), the shadow entry allows the
+kernel to compute the *refault distance* — how many other pages were
+activated or evicted between this page's eviction and its refault. If the
+refault distance is shorter than the size of the inactive list, the page
+was part of the active working set and is immediately activated rather
+than placed on the inactive list. This reduces thrashing by protecting
+frequently accessed pages that would otherwise be repeatedly evicted and
+refaulted.
+
+Shadow entries consume a small amount of memory. To prevent them from
+accumulating indefinitely, a dedicated shrinker reclaims shadow entries
+from page cache XArray nodes that contain only shadows and no actual
+pages.
+
+This logic is implemented in ``mm/workingset.c``.
--
2.53.0
* Re: [PATCH] Docs/mm: document Page Reclaim
From: Lorenzo Stoakes (Oracle) @ 2026-03-15 20:25 UTC (permalink / raw)
To: Kit Dallege; +Cc: akpm, david, corbet, linux-mm, linux-doc
NAK because clearly AI slop, again.
(side note - 'page' reclaim is a misnomer now, we should just call this doc
reclaim - we reclaim folios not pages :)
Anyway, again, you've not bothered finding out who maintains reclaim, I just
looked and it took me 10 seconds:
MEMORY MANAGEMENT - RECLAIM
M: Andrew Morton <akpm@linux-foundation.org>
M: Johannes Weiner <hannes@cmpxchg.org>
R: David Hildenbrand <david@kernel.org> <- by chance you have David :)
R: Michal Hocko <mhocko@kernel.org>
R: Qi Zheng <zhengqi.arch@bytedance.com>
R: Shakeel Butt <shakeel.butt@linux.dev>
R: Lorenzo Stoakes <ljs@kernel.org>
L: linux-mm@kvack.org
S: Maintained
F: mm/vmscan.c
F: mm/workingset.c
You've not even done that, let alone thought to cc anybody on that list; 5
minutes glancing over the mailing list would tell you this is common
courtesy.
The documentation is useless hand-waving that maintainers would have to
essentially rewrite for you on 'review'.
This is not a good use of maintainer time, and we don't want stuff we could
generate ourselves.
On Sat, Mar 14, 2026 at 04:25:34PM +0100, Kit Dallege wrote:
> Fill in the page_reclaim.rst stub created in commit 481cc97349d6
> ("mm,doc: Add new documentation structure") as part of
> the structured memory management documentation following
> Mel Gorman's book outline.
You've also, again, used a copy/pasted, meaningless, worthless commit message - 5
minutes glancing through the linux-mm list would tell you what we expect.
I mean I say 'you', this was Claude surely?
>
> Signed-off-by: Kit Dallege <xaum.io@gmail.com>
> ---
> Documentation/mm/page_reclaim.rst | 164 ++++++++++++++++++++++++++++++
> 1 file changed, 164 insertions(+)
>
> diff --git a/Documentation/mm/page_reclaim.rst b/Documentation/mm/page_reclaim.rst
> index 50a30b7f8ac3..bfa53bee98c2 100644
> --- a/Documentation/mm/page_reclaim.rst
> +++ b/Documentation/mm/page_reclaim.rst
> @@ -3,3 +3,167 @@
> ============
> Page Reclaim
> ============
> +
> +Page reclaim frees memory by evicting pages that can be reloaded from disk
> +or regenerated. File-backed pages are dropped (clean) or written back
Or regenerated?... This isn't doctor who?
> +(dirty); anonymous pages are swapped out. The bulk of the implementation
> +is in ``mm/vmscan.c``.
Yeah let's not bother discuss what clean or dirty means, or why that matters, or
anything useful...
etc.
> [snip remainder of quoted patch]