From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
	willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
	ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
	fvdl@google.com, Rik van Riel <riel@surriel.com>
Subject: [RFC PATCH 29/40] mm: page_reporting: walk per-superpageblock free lists
Date: Wed, 20 May 2026 10:59:35 -0400
Message-ID: <20260520150018.2491267-30-riel@surriel.com>
In-Reply-To: <20260520150018.2491267-1-riel@surriel.com>

After the superpageblock (SPB) rework, free pages live on
per-superpageblock free lists
(zone->superpageblocks[i].free_area[order].free_list[mt]) rather than
on a single zone-level list. page_reporting_cycle() was still walking
the now-empty zone-level list, so virtio-balloon free page reporting
silently became a no-op on systems with superpageblocks: no pages were
ever isolated, no MADV_DONTNEED hints reached the host, and the host
memory backing balloon-eligible guest pages stayed resident.
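
To make the failure mode concrete, here is a minimal user-space sketch
of the layout described above. It is a toy model, not kernel code: the
toy_* names, the two array sizes and the per-list length counters are
all assumptions made for illustration, and each free list is reduced to
its length.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_ORDERS	11	/* assumed; stands in for NR_PAGE_ORDERS */
#define TOY_NR_MT	6	/* assumed; stands in for MIGRATE_TYPES */

/* Toy free_area: each free_list is reduced to its length. */
struct toy_free_area {
	unsigned long free_list_len[TOY_NR_MT];
	unsigned long nr_free;
};

struct toy_spb {
	struct toy_free_area free_area[TOY_NR_ORDERS];
};

struct toy_zone {
	/* Zone level: lists stay empty, nr_free is the shadow sum. */
	struct toy_free_area free_area[TOY_NR_ORDERS];
	struct toy_spb *superpageblocks;
	unsigned long nr_superpageblocks;
};

/* Pages the old zone-level walk could find for (order, mt). */
static unsigned long toy_zone_walk(const struct toy_zone *z,
				   unsigned int order, unsigned int mt)
{
	assert(order < TOY_NR_ORDERS && mt < TOY_NR_MT);
	return z->free_area[order].free_list_len[mt];
}

/* Pages a walk over every SPB finds for the same (order, mt). */
static unsigned long toy_spb_walk(const struct toy_zone *z,
				  unsigned int order, unsigned int mt)
{
	unsigned long i, found = 0;

	assert(order < TOY_NR_ORDERS && mt < TOY_NR_MT);
	for (i = 0; i < z->nr_superpageblocks; i++)
		found += z->superpageblocks[i].free_area[order].free_list_len[mt];
	return found;
}

/* Two SPBs holding 3 and 5 free order-9 pages of migratetype 1. */
static struct toy_spb toy_demo_spbs[2] = {
	[0] = { .free_area = { [9] = { .free_list_len = { [1] = 3 }, .nr_free = 3 } } },
	[1] = { .free_area = { [9] = { .free_list_len = { [1] = 5 }, .nr_free = 5 } } },
};

static struct toy_zone toy_demo_zone = {
	.free_area = { [9] = { .nr_free = 8 } },
	.superpageblocks = toy_demo_spbs,
	.nr_superpageblocks = 2,
};
```

In this model toy_zone_walk(&toy_demo_zone, 9, 1) finds nothing even
though the zone-level nr_free is 8, while toy_spb_walk() finds all 8
pages.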

Refactor the per-list walk into page_reporting_cycle_list(), which
takes an explicit list_head and a pointer to the shared budget, then
have page_reporting_cycle() iterate over every SPB in the zone for the
requested (order, mt). The budget is shared across the whole walk so
that a fragmented zone does not multiply the rate limit. The zone-level
shadow nr_free (maintained by __add_to_free_list() and
__del_page_from_free_list()) is used both for the early-out and for the
budget total; that shadow already sums all SPBs.
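
The effect of sharing one budget can be sketched in user space as
follows. This is a simplified model under stated assumptions, not the
kernel implementation: the toy_* names are hypothetical, TOY_CAPACITY
is assumed to match PAGE_REPORTING_CAPACITY, every list is given the
same length, and isolation failures and report() errors are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to match PAGE_REPORTING_CAPACITY in mm/page_reporting.h. */
#define TOY_CAPACITY 32

struct toy_walk {
	unsigned int offset;	/* free scatterlist slots left */
	long budget;		/* report() calls still allowed */
	long reports;		/* report() calls made so far */
};

/* Same rounding as DIV_ROUND_UP(nr_free, CAPACITY * 16). */
static long toy_budget(long zone_nr_free)
{
	long chunk = TOY_CAPACITY * 16L;

	assert(zone_nr_free >= 0);
	return (zone_nr_free + chunk - 1) / chunk;
}

/* Toy stand-in for page_reporting_cycle_list(): walk one list. */
static void toy_walk_list(long nr_unreported, struct toy_walk *w)
{
	while (nr_unreported > 0) {
		if (w->budget < 0)
			return;		/* budget used up, stop here */
		if (w->offset) {
			w->offset--;	/* page goes into the scatterlist */
			nr_unreported--;
			continue;
		}
		w->reports++;		/* scatterlist full: one report() call */
		w->offset = TOY_CAPACITY;
		w->budget--;
	}
}

/* Toy stand-in for page_reporting_cycle(): one budget for all lists. */
static long toy_cycle(size_t nr_lists, long pages_per_list)
{
	struct toy_walk w = {
		.offset = TOY_CAPACITY,
		.budget = toy_budget((long)nr_lists * pages_per_list),
		.reports = 0,
	};
	size_t i;

	for (i = 0; i < nr_lists; i++) {
		toy_walk_list(pages_per_list, &w);
		if (w.budget < 0)
			break;
	}
	return w.reports;
}
```

With four lists of 1024 pages each, toy_budget() yields 8 and
toy_cycle(4, 1024) stops after 9 report() calls, all of them on the
first list: the budget of 8 plus the one extra call that the
"budget < 0" check permits. The number of lists does not raise that
ceiling.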

Hold the memory hotplug read lock around the SPB walk.
resize_zone_superpageblocks() swaps zone->superpageblocks under
zone->lock and immediately kvfree()s the old array with no RCU grace
period. page_reporting_cycle_list() drops zone->lock during
prdev->report() (which can sleep) and then resumes operating on a
list_head pointer that lives inside an SPB; without get_online_mems(),
that pointer can be left dangling if hotplug runs in the unlock window.
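
The race can be illustrated with a single-threaded toy model. Everything
below is an assumption made for illustration: the toy_* names are
hypothetical, a reader count stands in for mem_hotplug_lock, a
generation counter stands in for "the array the walker points into",
and the toy resize refuses to run while a reader is active, where a real
hotplug writer would block instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_zone {
	int *spbs;			/* stands in for zone->superpageblocks */
	size_t nr_spbs;
	unsigned long generation;	/* bumped each time spbs is replaced */
	int hotplug_readers;		/* stands in for mem_hotplug_lock readers */
};

static void toy_get_online_mems(struct toy_zone *z)
{
	z->hotplug_readers++;
}

static void toy_put_online_mems(struct toy_zone *z)
{
	assert(z->hotplug_readers > 0);
	z->hotplug_readers--;
}

/* Toy resize: swap in a new array and free the old one immediately. */
static bool toy_resize(struct toy_zone *z, size_t new_nr)
{
	int *new_spbs;

	if (z->hotplug_readers > 0)
		return false;	/* a real writer would wait here */

	new_spbs = calloc(new_nr, sizeof(*new_spbs));
	assert(new_spbs);
	free(z->spbs);		/* old array is gone, no grace period */
	z->spbs = new_spbs;
	z->nr_spbs = new_nr;
	z->generation++;
	return true;
}

/* Returns true if the walker's view of the array is still valid. */
static bool toy_walker_pointer_valid(bool hold_hotplug_lock)
{
	struct toy_zone z = { .nr_spbs = 2 };
	unsigned long gen_at_start;
	bool valid;

	z.spbs = calloc(z.nr_spbs, sizeof(*z.spbs));
	assert(z.spbs);

	if (hold_hotplug_lock)
		toy_get_online_mems(&z);

	/* Walker takes a pointer into z.spbs at this generation. */
	gen_at_start = z.generation;

	/* zone->lock is dropped for report(); hotplug tries to resize. */
	toy_resize(&z, 4);

	valid = (gen_at_start == z.generation);

	if (hold_hotplug_lock)
		toy_put_online_mems(&z);
	free(z.spbs);
	return valid;
}
```

In this model toy_walker_pointer_valid(true) holds, because the resize
cannot proceed during the walk, while toy_walker_pointer_valid(false)
does not: the array was replaced and freed underneath the walker.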

The zone-level fallback path is retained for zones whose SPB array has
not yet been allocated (e.g. unpopulated hotplug zones).

Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
 mm/page_reporting.c | 149 ++++++++++++++++++++++++++------------------
 1 file changed, 90 insertions(+), 59 deletions(-)

diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 7418f2e500bb..836d97879b8d 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -6,6 +6,7 @@
 #include <linux/export.h>
 #include <linux/module.h>
 #include <linux/delay.h>
+#include <linux/memory_hotplug.h>
 #include <linux/scatterlist.h>

 #include "page_reporting.h"
@@ -138,116 +139,68 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
 }

 /*
- * The page reporting cycle consists of 4 stages, fill, report, drain, and
- * idle. We will cycle through the first 3 stages until we cannot obtain a
- * full scatterlist of pages, in that case we will switch to idle.
+ * Walk a single free_list (zone-level or per-superpageblock), pulling
+ * unreported pages into the scatterlist and calling prdev->report() each
+ * time the scatterlist fills. Updates *budget and *offset across calls so
+ * the caller can spread one budget across multiple lists (e.g. one per SPB).
  */
 static int
-page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
-		     unsigned int order, unsigned int mt,
-		     struct scatterlist *sgl, unsigned int *offset)
+page_reporting_cycle_list(struct page_reporting_dev_info *prdev,
+			  struct zone *zone, struct list_head *list,
+			  unsigned int order, struct scatterlist *sgl,
+			  unsigned int *offset, long *budget)
 {
-	struct free_area *area = &zone->free_area[order];
-	struct list_head *list = &area->free_list[mt];
 	unsigned int page_len = PAGE_SIZE << order;
 	struct page *page, *next;
-	long budget;
 	int err = 0;

-	/*
-	 * Perform early check, if free area is empty there is
-	 * nothing to process so we can skip this free_list.
-	 */
 	if (list_empty(list))
-		return err;
+		return 0;

 	spin_lock_irq(&zone->lock);

-	/*
-	 * Limit how many calls we will be making to the page reporting
-	 * device for this list. By doing this we avoid processing any
-	 * given list for too long.
-	 *
-	 * The current value used allows us enough calls to process over a
-	 * sixteenth of the current list plus one additional call to handle
-	 * any pages that may have already been present from the previous
-	 * list processed. This should result in us reporting all pages on
-	 * an idle system in about 30 seconds.
-	 *
-	 * The division here should be cheap since PAGE_REPORTING_CAPACITY
-	 * should always be a power of 2.
-	 */
-	budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
-
-	/* loop through free list adding unreported pages to sg list */
 	list_for_each_entry_safe(page, next, list, lru) {
-		/* We are going to skip over the reported pages. */
 		if (PageReported(page))
 			continue;

-		/*
-		 * If we fully consumed our budget then update our
-		 * state to indicate that we are requesting additional
-		 * processing and exit this list.
-		 */
-		if (budget < 0) {
+		if (*budget < 0) {
 			atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED);
 			next = page;
 			break;
 		}

-		/* Attempt to pull page from list and place in scatterlist */
 		if (*offset) {
 			if (!__isolate_free_page(page, order)) {
 				next = page;
 				break;
 			}

-			/* Add page to scatter list */
 			--(*offset);
 			sg_set_page(&sgl[*offset], page, page_len, 0);

 			continue;
 		}

-		/*
-		 * Make the first non-reported page in the free list
-		 * the new head of the free list before we release the
-		 * zone lock.
-		 */
 		if (!list_is_first(&page->lru, list))
 			list_rotate_to_front(&page->lru, list);

-		/* release lock before waiting on report processing */
 		spin_unlock_irq(&zone->lock);

-		/* begin processing pages in local list */
 		err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);

-		/* reset offset since the full list was reported */
 		*offset = PAGE_REPORTING_CAPACITY;
+		(*budget)--;

-		/* update budget to reflect call to report function */
-		budget--;
-
-		/* reacquire zone lock and resume processing */
 		spin_lock_irq(&zone->lock);

-		/* flush reported pages from the sg list */
 		page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);

-		/*
-		 * Reset next to first entry, the old next isn't valid
-		 * since we dropped the lock to report the pages
-		 */
 		next = list_first_entry(list, struct page, lru);

-		/* exit on error */
 		if (err)
 			break;
 	}

-	/* Rotate any leftover pages to the head of the freelist */
 	if (!list_entry_is_head(next, list, lru) && !list_is_first(&next->lru, list))
 		list_rotate_to_front(&next->lru, list);

@@ -256,6 +209,84 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
 	return err;
 }

+/*
+ * The page reporting cycle consists of 4 stages, fill, report, drain, and
+ * idle. We will cycle through the first 3 stages until we cannot obtain a
+ * full scatterlist of pages, in that case we will switch to idle.
+ *
+ * With superpageblocks, free pages live on per-SPB free_lists rather than a
+ * single zone-level list, so the cycle iterates every SPB for the requested
+ * (order, mt). The budget is shared across the entire walk so that
+ * fragmented zones do not produce a budget multiplier.
+ */
+static int
+page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
+		     unsigned int order, unsigned int mt,
+		     struct scatterlist *sgl, unsigned int *offset)
+{
+	long budget;
+	int err = 0;
+
+	/*
+	 * Early exit if the per-zone shadow says there is nothing free at
+	 * this order in any SPB. Avoids touching every SPB's list head.
+	 */
+	if (!data_race(zone->free_area[order].nr_free))
+		return 0;
+
+	/*
+	 * Limit how many calls we will be making to the page reporting
+	 * device. By doing this we avoid processing any given (order, mt)
+	 * for too long.
+	 *
+	 * The current value used allows us enough calls to process over a
+	 * sixteenth of the current free pool plus one additional call to
+	 * handle any pages that may have already been present from the
+	 * previous list processed. This should result in us reporting all
+	 * pages on an idle system in about 30 seconds.
+	 *
+	 * The division here should be cheap since PAGE_REPORTING_CAPACITY
+	 * should always be a power of 2.
+	 */
+	budget = DIV_ROUND_UP(data_race(zone->free_area[order].nr_free),
+			      PAGE_REPORTING_CAPACITY * 16);
+
+	/*
+	 * Block memory hotplug for the SPB walk. resize_zone_superpageblocks()
+	 * swaps zone->superpageblocks under zone->lock and immediately
+	 * kvfree()s the old array, with no RCU grace period. The helper drops
+	 * zone->lock during prdev->report() and resumes using a list_head
+	 * pointer into an SPB; without holding mem_hotplug_lock for read,
+	 * that pointer can become a dangling reference into freed memory.
+	 */
+	get_online_mems();
+
+	if (zone->nr_superpageblocks) {
+		unsigned long sb_idx, nr_sbs = zone->nr_superpageblocks;
+
+		for (sb_idx = 0; sb_idx < nr_sbs; sb_idx++) {
+			struct list_head *list =
+				&zone->superpageblocks[sb_idx].free_area[order].free_list[mt];
+
+			err = page_reporting_cycle_list(prdev, zone, list,
+							order, sgl, offset,
+							&budget);
+			if (err || budget < 0)
+				break;
+		}
+	} else {
+		/* No SPBs (e.g. unpopulated zone); fall back to zone-level list. */
+		struct list_head *list = &zone->free_area[order].free_list[mt];
+
+		err = page_reporting_cycle_list(prdev, zone, list, order,
+						sgl, offset, &budget);
+	}
+
+	put_online_mems();
+
+	return err;
+}
+
 static int
 page_reporting_process_zone(struct page_reporting_dev_info *prdev,
 			    struct scatterlist *sgl, struct zone *zone)
--
2.54.0