* [PATCH] Docs/mm: document Page Reclaim
@ 2026-03-14 15:25 Kit Dallege
2026-03-15 20:25 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 2+ messages in thread
From: Kit Dallege @ 2026-03-14 15:25 UTC (permalink / raw)
To: akpm, david, corbet; +Cc: linux-mm, linux-doc, Kit Dallege
Fill in the page_reclaim.rst stub created in commit 481cc97349d6
("mm,doc: Add new documentation structure") as part of
the structured memory management documentation following
Mel Gorman's book outline.
Signed-off-by: Kit Dallege <xaum.io@gmail.com>
---
Documentation/mm/page_reclaim.rst | 164 ++++++++++++++++++++++++++++++
1 file changed, 164 insertions(+)
diff --git a/Documentation/mm/page_reclaim.rst b/Documentation/mm/page_reclaim.rst
index 50a30b7f8ac3..bfa53bee98c2 100644
--- a/Documentation/mm/page_reclaim.rst
+++ b/Documentation/mm/page_reclaim.rst
@@ -3,3 +3,167 @@
============
Page Reclaim
============
+
+Page reclaim frees memory by evicting pages that can be reloaded from disk
+or regenerated. File-backed pages are dropped (clean) or written back
+(dirty); anonymous pages are swapped out. The bulk of the implementation
+is in ``mm/vmscan.c``.
+
+.. contents:: :local:
+
+When Reclaim Runs
+=================
+
+Reclaim is triggered in two ways:
+
+- **kswapd**: a per-node kernel thread that runs in the background when
+ free pages in any zone drop below the low watermark. It reclaims until
+ free pages reach the high watermark, then sleeps.
+
+- **Direct reclaim**: when an allocation cannot be satisfied even after
+ kswapd has been woken, the allocating task reclaims pages synchronously
+ in its own context. This adds latency to the allocation but is necessary
+ when background reclaim cannot keep up.
+
+Reclaim Priority
+================
+
+The reclaim path operates at decreasing priority levels (from
+``DEF_PRIORITY`` down to 0). At each level, a larger fraction of the LRU
+lists is scanned. At the default priority, only 1/4096th of pages are
+considered; at priority 0, the entire list is scanned.
+
+If a full scan at priority 0 still does not free enough memory, the OOM
+killer is invoked (see Documentation/mm/oom.rst). This escalation
+prevents the system from spinning indefinitely in reclaim.
+
+Scan Control
+============
+
+Each reclaim invocation is parameterized by a ``struct scan_control`` that
+captures the allocation context: which GFP flags were used, how many pages
+are needed, which node or memory cgroup to reclaim from, and whether
+writeback or swap are allowed. This struct threads through the entire
+reclaim stack, ensuring consistent policy at every level.
+
+LRU Lists
+=========
+
+Each ``lruvec`` (one per node, or per node and memory cgroup combination)
+maintains lists of pages ordered by access recency.
+
+Classic LRU
+-----------
+
+The classic scheme uses four LRU lists per lruvec: active and inactive for
+both anonymous and file-backed pages. This approximates a second-chance
+(clock) algorithm:
+
+- Pages start on the inactive list when first allocated.
+- If accessed again while on the inactive list, they are promoted to the
+ active list.
+- Reclaim scans the inactive list and evicts pages that have not been
+ recently accessed.
+- To prevent the active list from growing without bound, pages are
+ periodically demoted from active to inactive.
+
+The split between anonymous and file-backed lists allows the reclaim path
+to balance eviction pressure between the two types based on their relative
+cost. Swapping anonymous pages is generally more expensive than dropping
+clean file pages, so the scanner adjusts the ratio using IO cost
+accounting and the ``vm.swappiness`` tunable.
+
+Multi-Gen LRU
+-------------
+
+The multi-gen LRU is an alternative reclaim algorithm that groups pages
+into generations by access time rather than a simple active/inactive
+split. It is documented separately in Documentation/mm/multigen_lru.rst.
+
+LRU Batching
+------------
+
+To avoid taking the lruvec lock on every page access, LRU operations are
+batched per-CPU (``mm/swap.c``). Functions like ``folio_add_lru()`` and
+``folio_mark_accessed()`` queue pages into per-CPU folio batches that are
+drained to the actual LRU lists periodically or when the batch is full.
+This batching is critical for scalability on systems with many CPUs.
+
+Reclaiming Pages
+================
+
+The core reclaim loop (``shrink_node()``) divides its work between page
+cache / anonymous pages and slab caches. For each lruvec, it scans the
+inactive LRU lists, evaluating each page:
+
+- **Clean file pages** can be dropped immediately — they can be re-read
+ from disk.
+- **Dirty file pages** are queued for writeback. Reclaim typically skips
+ them and returns later, but under severe pressure it may wait for
+ writeback to complete.
+- **Anonymous pages** are swapped out if swap space is available and
+ ``vm.swappiness`` allows it.
+- **Mapped pages** require TLB invalidation (unmapping) before they can
+ be freed. The rmap (reverse mapping) system is used to find and
+ remove all page table entries pointing to the page.
+- **Unevictable pages** (locked with ``mlock()``) are skipped entirely.
+ See Documentation/mm/unevictable-lru.rst.
+
+Memory Cgroup Reclaim
+---------------------
+
+When memory cgroup limits are exceeded, reclaim targets only the pages
+belonging to that cgroup. Each memory cgroup has its own lruvec per node,
+so the scanner can isolate its pages without disturbing the rest of the
+system. ``try_to_free_mem_cgroup_pages()`` is the entry point for
+cgroup-scoped reclaim.
+
+NUMA Demotion
+-------------
+
+On systems with tiered memory (e.g., fast DRAM and slower persistent
+memory), reclaim can demote pages to a slower tier instead of evicting
+them. This keeps the data in memory but frees the faster tier for
+actively accessed pages.
+
+Shrinkers
+=========
+
+Besides page cache and anonymous pages, kernel caches (dentries, inodes,
+and driver-specific caches) are reclaimed through the shrinker interface
+(``mm/shrinker.c``). A shrinker registers two callbacks:
+
+- ``count_objects()``: report how many objects are reclaimable.
+- ``scan_objects()``: free up to a requested number of objects.
+
+The reclaim path calls all registered shrinkers proportionally to the
+amount of reclaimable memory they report. Shrinkers are NUMA-aware: on
+NUMA systems, each shrinker is called with the node being reclaimed so it
+can prioritize freeing objects local to that node.
+
+Per-memcg shrinker tracking uses bitmap arrays (``shrinker_info``) so that
+the reclaim path only invokes shrinkers that actually have objects in the
+target cgroup, avoiding unnecessary work when there are many cgroups.
+
+Working Set Detection
+=====================
+
+When a page is evicted, a compact shadow entry is stored in its place in
+the page cache or swap cache. The shadow records the eviction timestamp
+(in terms of the lruvec's nonresident age counter) and the cgroup and
+node that owned the page.
+
+If the page is faulted back in (a "refault"), the shadow entry allows the
+kernel to compute the *refault distance* — how many other pages were
+activated or evicted between this page's eviction and its refault. If the
+refault distance is shorter than the size of the inactive list, the page
+was part of the active working set and is immediately activated rather
+than placed on the inactive list. This reduces thrashing by protecting
+frequently accessed pages that would otherwise be repeatedly evicted and
+refaulted.
+
+Shadow entries consume a small amount of memory. To prevent them from
+accumulating indefinitely, a shrinker reclaims shadow entries from page
+cache radix tree nodes that contain only shadows and no actual pages.
+
+This logic is implemented in ``mm/workingset.c``.
--
2.53.0
^ permalink raw reply related [flat|nested] 2+ messages in thread
* Re: [PATCH] Docs/mm: document Page Reclaim
2026-03-14 15:25 [PATCH] Docs/mm: document Page Reclaim Kit Dallege
@ 2026-03-15 20:25 ` Lorenzo Stoakes (Oracle)
0 siblings, 0 replies; 2+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-15 20:25 UTC (permalink / raw)
To: Kit Dallege; +Cc: akpm, david, corbet, linux-mm, linux-doc
NAK because clearly AI slop, again.
(side note - 'page' reclaim is a misnomer now, we should just call this doc
reclaim - we reclaim folios not pages :)
Anway, again, you've not bothered finding out who maintains reclaim, I just
looked and it took me 10 seconds:
MEMORY MANAGEMENT - RECLAIM
M: Andrew Morton <akpm@linux-foundation.org>
M: Johannes Weiner <hannes@cmpxchg.org>
R: David Hildenbrand <david@kernel.org> <- by chance you have David :)
R: Michal Hocko <mhocko@kernel.org>
R: Qi Zheng <zhengqi.arch@bytedance.com>
R: Shakeel Butt <shakeel.butt@linux.dev>
R: Lorenzo Stoakes <ljs@kernel.org>
L: linux-mm@kvack.org
S: Maintained
F: mm/vmscan.c
F: mm/workingset.c
You've not even done that, let alone thought to cc- anybody on that list, 5
minutes glancing over the mailing list would tell you this is is common
courtesy.
The documentation is useless hand-waving that maintainers would have to
essentially rewrite for you on 'review'.
This is not a good use of maintainer time, and we don't want stuff we could
generate ourselves.
On Sat, Mar 14, 2026 at 04:25:34PM +0100, Kit Dallege wrote:
> Fill in the page_reclaim.rst stub created in commit 481cc97349d6
> ("mm,doc: Add new documentation structure") as part of
> the structured memory management documentation following
> Mel Gorman's book outline.
You've also, again, used a copy/paste meaningless, worthless commit message - 5
minutes glacing through the linux-mm list would tell you what we expect.
I mean I say 'you', this was Claude surely?
>
> Signed-off-by: Kit Dallege <xaum.io@gmail.com>
> ---
> Documentation/mm/page_reclaim.rst | 164 ++++++++++++++++++++++++++++++
> 1 file changed, 164 insertions(+)
>
> diff --git a/Documentation/mm/page_reclaim.rst b/Documentation/mm/page_reclaim.rst
> index 50a30b7f8ac3..bfa53bee98c2 100644
> --- a/Documentation/mm/page_reclaim.rst
> +++ b/Documentation/mm/page_reclaim.rst
> @@ -3,3 +3,167 @@
> ============
> Page Reclaim
> ============
> +
> +Page reclaim frees memory by evicting pages that can be reloaded from disk
> +or regenerated. File-backed pages are dropped (clean) or written back
Or regenerated?... This isn't doctor who?
> +(dirty); anonymous pages are swapped out. The bulk of the implementation
> +is in ``mm/vmscan.c``.
Yeah let's not bother discuss what clean or dirty means, or why that matters, or
anything useful...
etc.
> +
> +.. contents:: :local:
> +
> +When Reclaim Runs
> +=================
> +
> +Reclaim is triggered in two ways:
> +
> +- **kswapd**: a per-node kernel thread that runs in the background when
> + free pages in any zone drop below the low watermark. It reclaims until
> + free pages reach the high watermark, then sleeps.
> +
> +- **Direct reclaim**: when an allocation cannot be satisfied even after
> + kswapd has been woken, the allocating task reclaims pages synchronously
> + in its own context. This adds latency to the allocation but is necessary
> + when background reclaim cannot keep up.
> +
> +Reclaim Priority
> +================
> +
> +The reclaim path operates at decreasing priority levels (from
> +``DEF_PRIORITY`` down to 0). At each level, a larger fraction of the LRU
> +lists is scanned. At the default priority, only 1/4096th of pages are
> +considered; at priority 0, the entire list is scanned.
> +
> +If a full scan at priority 0 still does not free enough memory, the OOM
> +killer is invoked (see Documentation/mm/oom.rst). This escalation
> +prevents the system from spinning indefinitely in reclaim.
> +
> +Scan Control
> +============
> +
> +Each reclaim invocation is parameterized by a ``struct scan_control`` that
> +captures the allocation context: which GFP flags were used, how many pages
> +are needed, which node or memory cgroup to reclaim from, and whether
> +writeback or swap are allowed. This struct threads through the entire
> +reclaim stack, ensuring consistent policy at every level.
> +
> +LRU Lists
> +=========
> +
> +Each ``lruvec`` (one per node, or per node and memory cgroup combination)
> +maintains lists of pages ordered by access recency.
> +
> +Classic LRU
> +-----------
> +
> +The classic scheme uses four LRU lists per lruvec: active and inactive for
> +both anonymous and file-backed pages. This approximates a second-chance
> +(clock) algorithm:
> +
> +- Pages start on the inactive list when first allocated.
> +- If accessed again while on the inactive list, they are promoted to the
> + active list.
> +- Reclaim scans the inactive list and evicts pages that have not been
> + recently accessed.
> +- To prevent the active list from growing without bound, pages are
> + periodically demoted from active to inactive.
> +
> +The split between anonymous and file-backed lists allows the reclaim path
> +to balance eviction pressure between the two types based on their relative
> +cost. Swapping anonymous pages is generally more expensive than dropping
> +clean file pages, so the scanner adjusts the ratio using IO cost
> +accounting and the ``vm.swappiness`` tunable.
> +
> +Multi-Gen LRU
> +-------------
> +
> +The multi-gen LRU is an alternative reclaim algorithm that groups pages
> +into generations by access time rather than a simple active/inactive
> +split. It is documented separately in Documentation/mm/multigen_lru.rst.
> +
> +LRU Batching
> +------------
> +
> +To avoid taking the lruvec lock on every page access, LRU operations are
> +batched per-CPU (``mm/swap.c``). Functions like ``folio_add_lru()`` and
> +``folio_mark_accessed()`` queue pages into per-CPU folio batches that are
> +drained to the actual LRU lists periodically or when the batch is full.
> +This batching is critical for scalability on systems with many CPUs.
> +
> +Reclaiming Pages
> +================
> +
> +The core reclaim loop (``shrink_node()``) divides its work between page
> +cache / anonymous pages and slab caches. For each lruvec, it scans the
> +inactive LRU lists, evaluating each page:
> +
> +- **Clean file pages** can be dropped immediately — they can be re-read
> + from disk.
> +- **Dirty file pages** are queued for writeback. Reclaim typically skips
> + them and returns later, but under severe pressure it may wait for
> + writeback to complete.
> +- **Anonymous pages** are swapped out if swap space is available and
> + ``vm.swappiness`` allows it.
> +- **Mapped pages** require TLB invalidation (unmapping) before they can
> + be freed. The rmap (reverse mapping) system is used to find and
> + remove all page table entries pointing to the page.
> +- **Unevictable pages** (locked with ``mlock()``) are skipped entirely.
> + See Documentation/mm/unevictable-lru.rst.
> +
> +Memory Cgroup Reclaim
> +---------------------
> +
> +When memory cgroup limits are exceeded, reclaim targets only the pages
> +belonging to that cgroup. Each memory cgroup has its own lruvec per node,
> +so the scanner can isolate its pages without disturbing the rest of the
> +system. ``try_to_free_mem_cgroup_pages()`` is the entry point for
> +cgroup-scoped reclaim.
> +
> +NUMA Demotion
> +-------------
> +
> +On systems with tiered memory (e.g., fast DRAM and slower persistent
> +memory), reclaim can demote pages to a slower tier instead of evicting
> +them. This keeps the data in memory but frees the faster tier for
> +actively accessed pages.
> +
> +Shrinkers
> +=========
> +
> +Besides page cache and anonymous pages, kernel caches (dentries, inodes,
> +and driver-specific caches) are reclaimed through the shrinker interface
> +(``mm/shrinker.c``). A shrinker registers two callbacks:
> +
> +- ``count_objects()``: report how many objects are reclaimable.
> +- ``scan_objects()``: free up to a requested number of objects.
> +
> +The reclaim path calls all registered shrinkers proportionally to the
> +amount of reclaimable memory they report. Shrinkers are NUMA-aware: on
> +NUMA systems, each shrinker is called with the node being reclaimed so it
> +can prioritize freeing objects local to that node.
> +
> +Per-memcg shrinker tracking uses bitmap arrays (``shrinker_info``) so that
> +the reclaim path only invokes shrinkers that actually have objects in the
> +target cgroup, avoiding unnecessary work when there are many cgroups.
> +
> +Working Set Detection
> +=====================
> +
> +When a page is evicted, a compact shadow entry is stored in its place in
> +the page cache or swap cache. The shadow records the eviction timestamp
> +(in terms of the lruvec's nonresident age counter) and the cgroup and
> +node that owned the page.
> +
> +If the page is faulted back in (a "refault"), the shadow entry allows the
> +kernel to compute the *refault distance* — how many other pages were
> +activated or evicted between this page's eviction and its refault. If the
> +refault distance is shorter than the size of the inactive list, the page
> +was part of the active working set and is immediately activated rather
> +than placed on the inactive list. This reduces thrashing by protecting
> +frequently accessed pages that would otherwise be repeatedly evicted and
> +refaulted.
> +
> +Shadow entries consume a small amount of memory. To prevent them from
> +accumulating indefinitely, a shrinker reclaims shadow entries from page
> +cache radix tree nodes that contain only shadows and no actual pages.
> +
> +This logic is implemented in ``mm/workingset.c``.
> --
> 2.53.0
>
>
>
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2026-03-15 20:25 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-14 15:25 [PATCH] Docs/mm: document Page Reclaim Kit Dallege
2026-03-15 20:25 ` Lorenzo Stoakes (Oracle)
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox