From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
Rik van Riel <riel@meta.com>, Rik van Riel <riel@surriel.com>
Subject: [RFC PATCH 34/45] mm: page_reporting: walk per-superpageblock free lists
Date: Thu, 30 Apr 2026 16:21:03 -0400
Message-ID: <20260430202233.111010-35-riel@surriel.com>
In-Reply-To: <20260430202233.111010-1-riel@surriel.com>

From: Rik van Riel <riel@meta.com>

After the SPB rework, free pages live on per-superpageblock free lists
(zone->superpageblocks[i].free_area[order].free_list[mt]) rather than
on a single zone-level list. page_reporting_cycle() was still walking
the now-empty zone-level list, so virtio-balloon free page reporting
silently became a no-op on systems with superpageblocks: no pages were
ever isolated, no MADV_DONTNEED hints reached the host, and any guest
memory backing balloon-eligible pages stayed resident on the host.

Refactor the per-list walk into page_reporting_cycle_list() taking an
explicit list_head and a pointer to the shared budget, then have
page_reporting_cycle() iterate every SPB in the zone for the requested
(order, mt). The budget is shared across the whole walk so a fragmented
zone does not multiply the rate limit. The zone-level shadow nr_free
(maintained by __add_to_free_list / __del_page_from_free_list) is used
both for the early-out and for the budget total; that shadow already
sums all SPBs.

Hold the memory hotplug read lock around the SPB walk.
resize_zone_superpageblocks() swaps zone->superpageblocks under
zone->lock and immediately kvfree()s the old array with no RCU grace
period. The helper drops zone->lock during prdev->report() (which can
sleep) and resumes operating on a list_head pointer that lives inside
an SPB; without get_online_mems(), that pointer can become a dangling
reference if hotplug runs in the unlock window.

The zone-level fallback path is retained for zones whose SPB array has
not yet been allocated (e.g. unpopulated hotplug zones).

Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_reporting.c | 149 ++++++++++++++++++++++++++------------------
1 file changed, 90 insertions(+), 59 deletions(-)
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index f0042d5743af..81f903caec22 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -6,6 +6,7 @@
#include <linux/export.h>
#include <linux/module.h>
#include <linux/delay.h>
+#include <linux/memory_hotplug.h>
#include <linux/scatterlist.h>
#include "page_reporting.h"
@@ -138,116 +139,68 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
}
/*
- * The page reporting cycle consists of 4 stages, fill, report, drain, and
- * idle. We will cycle through the first 3 stages until we cannot obtain a
- * full scatterlist of pages, in that case we will switch to idle.
+ * Walk a single free_list (zone-level or per-superpageblock), pulling
+ * unreported pages into the scatterlist and calling prdev->report() each
+ * time the scatterlist fills. Updates *budget and *offset across calls so
+ * the caller can spread one budget across multiple lists (e.g. one per SPB).
*/
static int
-page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
- unsigned int order, unsigned int mt,
- struct scatterlist *sgl, unsigned int *offset)
+page_reporting_cycle_list(struct page_reporting_dev_info *prdev,
+ struct zone *zone, struct list_head *list,
+ unsigned int order, struct scatterlist *sgl,
+ unsigned int *offset, long *budget)
{
- struct free_area *area = &zone->free_area[order];
- struct list_head *list = &area->free_list[mt];
unsigned int page_len = PAGE_SIZE << order;
struct page *page, *next;
- long budget;
int err = 0;
- /*
- * Perform early check, if free area is empty there is
- * nothing to process so we can skip this free_list.
- */
if (list_empty(list))
- return err;
+ return 0;
spin_lock_irq(&zone->lock);
- /*
- * Limit how many calls we will be making to the page reporting
- * device for this list. By doing this we avoid processing any
- * given list for too long.
- *
- * The current value used allows us enough calls to process over a
- * sixteenth of the current list plus one additional call to handle
- * any pages that may have already been present from the previous
- * list processed. This should result in us reporting all pages on
- * an idle system in about 30 seconds.
- *
- * The division here should be cheap since PAGE_REPORTING_CAPACITY
- * should always be a power of 2.
- */
- budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
-
- /* loop through free list adding unreported pages to sg list */
list_for_each_entry_safe(page, next, list, lru) {
- /* We are going to skip over the reported pages. */
if (PageReported(page))
continue;
- /*
- * If we fully consumed our budget then update our
- * state to indicate that we are requesting additional
- * processing and exit this list.
- */
- if (budget < 0) {
+ if (*budget < 0) {
atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED);
next = page;
break;
}
- /* Attempt to pull page from list and place in scatterlist */
if (*offset) {
if (!__isolate_free_page(page, order)) {
next = page;
break;
}
- /* Add page to scatter list */
--(*offset);
sg_set_page(&sgl[*offset], page, page_len, 0);
continue;
}
- /*
- * Make the first non-reported page in the free list
- * the new head of the free list before we release the
- * zone lock.
- */
if (!list_is_first(&page->lru, list))
list_rotate_to_front(&page->lru, list);
- /* release lock before waiting on report processing */
spin_unlock_irq(&zone->lock);
- /* begin processing pages in local list */
err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
- /* reset offset since the full list was reported */
*offset = PAGE_REPORTING_CAPACITY;
+ (*budget)--;
- /* update budget to reflect call to report function */
- budget--;
-
- /* reacquire zone lock and resume processing */
spin_lock_irq(&zone->lock);
- /* flush reported pages from the sg list */
page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
- /*
- * Reset next to first entry, the old next isn't valid
- * since we dropped the lock to report the pages
- */
next = list_first_entry(list, struct page, lru);
- /* exit on error */
if (err)
break;
}
- /* Rotate any leftover pages to the head of the freelist */
if (!list_entry_is_head(next, list, lru) && !list_is_first(&next->lru, list))
list_rotate_to_front(&next->lru, list);
@@ -256,6 +209,84 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
return err;
}
+/*
+ * The page reporting cycle consists of 4 stages, fill, report, drain, and
+ * idle. We will cycle through the first 3 stages until we cannot obtain a
+ * full scatterlist of pages, in that case we will switch to idle.
+ *
+ * With superpageblocks, free pages live on per-SPB free_lists rather than a
+ * single zone-level list, so the cycle iterates every SPB for the requested
+ * (order, mt). The budget is shared across the entire walk so that
+ * fragmented zones do not produce a budget multiplier.
+ */
+static int
+page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
+ unsigned int order, unsigned int mt,
+ struct scatterlist *sgl, unsigned int *offset)
+{
+ long budget;
+ int err = 0;
+
+ /*
+ * Early exit if the per-zone shadow says there is nothing free at
+ * this order in any SPB. Avoids touching every SPB's list head.
+ */
+ if (!data_race(zone->free_area[order].nr_free))
+ return 0;
+
+ /*
+ * Limit how many calls we will be making to the page reporting
+ * device. By doing this we avoid processing any given (order, mt)
+ * for too long.
+ *
+ * The current value used allows us enough calls to process over a
+ * sixteenth of the current free pool plus one additional call to
+ * handle any pages that may have already been present from the
+ * previous list processed. This should result in us reporting all
+ * pages on an idle system in about 30 seconds.
+ *
+ * The division here should be cheap since PAGE_REPORTING_CAPACITY
+ * should always be a power of 2.
+ */
+ budget = DIV_ROUND_UP(data_race(zone->free_area[order].nr_free),
+ PAGE_REPORTING_CAPACITY * 16);
+
+ /*
+ * Block memory hotplug for the SPB walk. resize_zone_superpageblocks()
+ * swaps zone->superpageblocks under zone->lock and immediately
+ * kvfree()s the old array, with no RCU grace period. The helper drops
+ * zone->lock during prdev->report() and resumes using a list_head
+ * pointer into an SPB; without holding mem_hotplug_lock for read,
+ * that pointer can become a dangling reference into freed memory.
+ */
+ get_online_mems();
+
+ if (zone->nr_superpageblocks) {
+ unsigned long sb_idx, nr_sbs = zone->nr_superpageblocks;
+
+ for (sb_idx = 0; sb_idx < nr_sbs; sb_idx++) {
+ struct list_head *list =
+ &zone->superpageblocks[sb_idx].free_area[order].free_list[mt];
+
+ err = page_reporting_cycle_list(prdev, zone, list,
+ order, sgl, offset,
+ &budget);
+ if (err || budget < 0)
+ break;
+ }
+ } else {
+ /* No SPBs (e.g. unpopulated zone); fall back to zone-level list. */
+ struct list_head *list = &zone->free_area[order].free_list[mt];
+
+ err = page_reporting_cycle_list(prdev, zone, list, order,
+ sgl, offset, &budget);
+ }
+
+ put_online_mems();
+
+ return err;
+}
+
static int
page_reporting_process_zone(struct page_reporting_dev_info *prdev,
struct scatterlist *sgl, struct zone *zone)
--
2.52.0