From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
fvdl@google.com, Rik van Riel <riel@surriel.com>
Subject: [RFC PATCH 29/40] mm: page_reporting: walk per-superpageblock free lists
Date: Wed, 20 May 2026 10:59:35 -0400 [thread overview]
Message-ID: <20260520150018.2491267-30-riel@surriel.com> (raw)
In-Reply-To: <20260520150018.2491267-1-riel@surriel.com>
After the SPB rework, free pages live on per-superpageblock free lists
(zone->superpageblocks[i].free_area[order].free_list[mt]) rather than
on a single zone-level list. page_reporting_cycle() was still walking
the now-empty zone-level list, so virtio-balloon free page reporting
silently became a no-op on systems with superpageblocks: no pages were
ever isolated, no MADV_DONTNEED hints reached the host, and any guest
memory backing balloon-eligible pages stayed resident on the host.
Refactor the per-list walk into page_reporting_cycle_list() taking an
explicit list_head and a pointer to the shared budget, then have
page_reporting_cycle() iterate every SPB in the zone for the requested
(order, mt). The budget is shared across the whole walk so a fragmented
zone does not multiply the rate-limit. The zone-level shadow nr_free
(maintained by __add_to_free_list / __del_page_from_free_list) is used
both for the early-out and for the budget total; that shadow already
sums all SPBs.
Hold the memory hotplug read lock around the SPB walk.
resize_zone_superpageblocks() swaps zone->superpageblocks under
zone->lock and immediately kvfree()s the old array with no RCU grace
period. The helper drops zone->lock during prdev->report() (which can
sleep) and resumes operating on a list_head pointer that lives inside
an SPB; without get_online_mems(), that pointer can become a dangling
reference if hotplug runs in the unlock window.
The zone-level fallback path is retained for zones whose SPB array has
not yet been allocated (e.g. unpopulated hotplug zones).
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_reporting.c | 149 ++++++++++++++++++++++++++------------------
1 file changed, 90 insertions(+), 59 deletions(-)
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 7418f2e500bb..836d97879b8d 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -6,6 +6,7 @@
#include <linux/export.h>
#include <linux/module.h>
#include <linux/delay.h>
+#include <linux/memory_hotplug.h>
#include <linux/scatterlist.h>
#include "page_reporting.h"
@@ -138,116 +139,68 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
}
/*
- * The page reporting cycle consists of 4 stages, fill, report, drain, and
- * idle. We will cycle through the first 3 stages until we cannot obtain a
- * full scatterlist of pages, in that case we will switch to idle.
+ * Walk a single free_list (zone-level or per-superpageblock), pulling
+ * unreported pages into the scatterlist and calling prdev->report() each
+ * time the scatterlist fills. Updates *budget and *offset across calls so
+ * the caller can spread one budget across multiple lists (e.g. one per SPB).
*/
static int
-page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
- unsigned int order, unsigned int mt,
- struct scatterlist *sgl, unsigned int *offset)
+page_reporting_cycle_list(struct page_reporting_dev_info *prdev,
+ struct zone *zone, struct list_head *list,
+ unsigned int order, struct scatterlist *sgl,
+ unsigned int *offset, long *budget)
{
- struct free_area *area = &zone->free_area[order];
- struct list_head *list = &area->free_list[mt];
unsigned int page_len = PAGE_SIZE << order;
struct page *page, *next;
- long budget;
int err = 0;
- /*
- * Perform early check, if free area is empty there is
- * nothing to process so we can skip this free_list.
- */
if (list_empty(list))
- return err;
+ return 0;
spin_lock_irq(&zone->lock);
- /*
- * Limit how many calls we will be making to the page reporting
- * device for this list. By doing this we avoid processing any
- * given list for too long.
- *
- * The current value used allows us enough calls to process over a
- * sixteenth of the current list plus one additional call to handle
- * any pages that may have already been present from the previous
- * list processed. This should result in us reporting all pages on
- * an idle system in about 30 seconds.
- *
- * The division here should be cheap since PAGE_REPORTING_CAPACITY
- * should always be a power of 2.
- */
- budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
-
- /* loop through free list adding unreported pages to sg list */
list_for_each_entry_safe(page, next, list, lru) {
- /* We are going to skip over the reported pages. */
if (PageReported(page))
continue;
- /*
- * If we fully consumed our budget then update our
- * state to indicate that we are requesting additional
- * processing and exit this list.
- */
- if (budget < 0) {
+ if (*budget < 0) {
atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED);
next = page;
break;
}
- /* Attempt to pull page from list and place in scatterlist */
if (*offset) {
if (!__isolate_free_page(page, order)) {
next = page;
break;
}
- /* Add page to scatter list */
--(*offset);
sg_set_page(&sgl[*offset], page, page_len, 0);
continue;
}
- /*
- * Make the first non-reported page in the free list
- * the new head of the free list before we release the
- * zone lock.
- */
if (!list_is_first(&page->lru, list))
list_rotate_to_front(&page->lru, list);
- /* release lock before waiting on report processing */
spin_unlock_irq(&zone->lock);
- /* begin processing pages in local list */
err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
- /* reset offset since the full list was reported */
*offset = PAGE_REPORTING_CAPACITY;
+ (*budget)--;
- /* update budget to reflect call to report function */
- budget--;
-
- /* reacquire zone lock and resume processing */
spin_lock_irq(&zone->lock);
- /* flush reported pages from the sg list */
page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
- /*
- * Reset next to first entry, the old next isn't valid
- * since we dropped the lock to report the pages
- */
next = list_first_entry(list, struct page, lru);
- /* exit on error */
if (err)
break;
}
- /* Rotate any leftover pages to the head of the freelist */
if (!list_entry_is_head(next, list, lru) && !list_is_first(&next->lru, list))
list_rotate_to_front(&next->lru, list);
@@ -256,6 +209,84 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
return err;
}
+/*
+ * The page reporting cycle consists of 4 stages, fill, report, drain, and
+ * idle. We will cycle through the first 3 stages until we cannot obtain a
+ * full scatterlist of pages, in that case we will switch to idle.
+ *
+ * With superpageblocks, free pages live on per-SPB free_lists rather than a
+ * single zone-level list, so the cycle iterates every SPB for the requested
+ * (order, mt). The budget is shared across the entire walk so that
+ * fragmented zones do not produce a budget multiplier.
+ */
+static int
+page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
+ unsigned int order, unsigned int mt,
+ struct scatterlist *sgl, unsigned int *offset)
+{
+ long budget;
+ int err = 0;
+
+ /*
+ * Early exit if the per-zone shadow says there is nothing free at
+ * this order in any SPB. Avoids touching every SPB's list head.
+ */
+ if (!data_race(zone->free_area[order].nr_free))
+ return 0;
+
+ /*
+ * Limit how many calls we will be making to the page reporting
+ * device. By doing this we avoid processing any given (order, mt)
+ * for too long.
+ *
+ * The current value used allows us enough calls to process over a
+ * sixteenth of the current free pool plus one additional call to
+ * handle any pages that may have already been present from the
+ * previous list processed. This should result in us reporting all
+ * pages on an idle system in about 30 seconds.
+ *
+ * The division here should be cheap since PAGE_REPORTING_CAPACITY
+ * should always be a power of 2.
+ */
+ budget = DIV_ROUND_UP(data_race(zone->free_area[order].nr_free),
+ PAGE_REPORTING_CAPACITY * 16);
+
+ /*
+ * Block memory hotplug for the SPB walk. resize_zone_superpageblocks()
+ * swaps zone->superpageblocks under zone->lock and immediately
+ * kvfree()s the old array, with no RCU grace period. The helper drops
+ * zone->lock during prdev->report() and resumes using a list_head
+ * pointer into an SPB; without holding mem_hotplug_lock for read,
+ * that pointer can become a dangling reference into freed memory.
+ */
+ get_online_mems();
+
+ if (zone->nr_superpageblocks) {
+ unsigned long sb_idx, nr_sbs = zone->nr_superpageblocks;
+
+ for (sb_idx = 0; sb_idx < nr_sbs; sb_idx++) {
+ struct list_head *list =
+ &zone->superpageblocks[sb_idx].free_area[order].free_list[mt];
+
+ err = page_reporting_cycle_list(prdev, zone, list,
+ order, sgl, offset,
+ &budget);
+ if (err || budget < 0)
+ break;
+ }
+ } else {
+ /* No SPBs (e.g. unpopulated zone); fall back to zone-level list. */
+ struct list_head *list = &zone->free_area[order].free_list[mt];
+
+ err = page_reporting_cycle_list(prdev, zone, list, order,
+ sgl, offset, &budget);
+ }
+
+ put_online_mems();
+
+ return err;
+}
+
static int
page_reporting_process_zone(struct page_reporting_dev_info *prdev,
struct scatterlist *sgl, struct zone *zone)
--
2.54.0
next prev parent reply other threads:[~2026-05-20 15:01 UTC|newest]
Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-20 14:59 [RFC PATCH 00/40] mm: reliable 1GB page allocation Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 01/40] mm: page_alloc: replace pageblock_flags bitmap with struct pageblock_data Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 02/40] mm: page_alloc: per-cpu pageblock buddy allocator Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 03/40] mm: page_alloc: split-path PCP free with local-trylock + remote-llist Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 04/40] mm: mm_init: fix zone assignment for pages in unavailable ranges Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 05/40] mm: page_alloc: remove watermark boost mechanism Rik van Riel
2026-05-26 14:02 ` Usama Arif
2026-05-27 15:41 ` Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 06/40] mm: page_alloc: async evacuation of stolen movable pageblocks Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 07/40] mm: page_alloc: track actual page contents in pageblock flags Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 08/40] mm: page_alloc: superpageblock metadata for 1GB anti-fragmentation Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 09/40] mm: page_alloc: support superpageblock resize for memory hotplug Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 10/40] mm: page_alloc: add superpageblock fullness lists for allocation steering Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 11/40] mm: page_alloc: steer pageblock stealing to tainted superpageblocks Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 12/40] mm: page_alloc: steer movable allocations to fullest clean superpageblocks Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 13/40] mm: page_alloc: extract claim_whole_block from try_to_claim_block Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 14/40] mm: page_alloc: add per-superpageblock free lists Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 15/40] mm: page_alloc: add background superpageblock defragmentation worker Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 16/40] mm: compaction: walk per-superpageblock free lists for migration targets Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 17/40] mm: page_alloc: superpageblock-aware contiguous and higher order allocation Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 18/40] mm: page_alloc: prevent atomic allocations from tainting clean SPBs Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 19/40] mm: page_alloc: aggressively pack non-movable allocs in tainted SPBs on large systems Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 20/40] mm: page_alloc: prefer reclaim over tainting clean superpageblocks Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 21/40] mm: page_alloc: adopt partial pageblocks from tainted superpageblocks Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 22/40] mm: page_alloc: add CONFIG_DEBUG_VM sanity checks for SPB counters Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 23/40] mm: page_alloc: targeted evacuation and dynamic reserves for tainted SPBs Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 24/40] mm: page_alloc: prevent UNMOVABLE/RECLAIMABLE mixing in pageblocks Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 25/40] mm: trigger deferred SPB evac when atomic allocs would taint a clean SPB Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 26/40] mm: page_alloc: refuse fragmenting fallback for callers with cheap fallback Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 27/40] mm: page_alloc: cross-migratetype buddy borrow within tainted SPBs Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 28/40] mm: page_alloc: drive slab shrink from SPB anti-fragmentation pressure Rik van Riel
2026-05-20 14:59 ` Rik van Riel [this message]
2026-05-20 14:59 ` [RFC PATCH 30/40] mm: show_mem: collect migratetype letters from per-superpageblock lists Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 31/40] mm: page_alloc: per-(zone, order, mt) PASS_1 hint cache Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 32/40] mm: debug: prevent infinite recursion in dump_page() with CMA Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 33/40] PM: hibernate: walk per-superpageblock free lists in mark_free_pages Rik van Riel
2026-05-20 18:19 ` Rafael J. Wysocki
2026-05-20 14:59 ` [RFC PATCH 34/40] btrfs: allocate eb-attached btree pages as movable Rik van Riel
2026-05-20 17:47 ` Boris Burkov
2026-05-23 15:58 ` David Sterba
2026-05-24 1:43 ` Rik van Riel
2026-05-24 19:59 ` Matthew Wilcox
2026-05-25 6:57 ` Christoph Hellwig
2026-05-20 14:59 ` [RFC PATCH 35/40] mm: page_alloc: refuse best-effort high-order allocs servable at lower orders Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 36/40] mm: page_alloc: set ALLOC_NOFRAGMENT on alloc_frozen_pages_nolock_noprof Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 37/40] mm: page_alloc: move spb_get_category and spb_tainted_reserve to mmzone.h Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 38/40] mm: compaction: skip empty tainted superpageblocks as migration source Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 39/40] mm: compaction: respect tainted SPB reserve in destination selection Rik van Riel
2026-05-20 14:59 ` [RFC PATCH 40/40] mm: page_alloc: SPB tracepoint instrumentation [DO-NOT-MERGE] Rik van Riel
2026-05-21 5:09 ` kernel test robot
2026-05-21 7:39 ` [syzbot ci] Re: mm: reliable 1GB page allocation syzbot ci
2026-05-22 11:02 ` [RFC PATCH 00/40] " Usama Arif
2026-05-22 13:55 ` Rik van Riel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260520150018.2491267-30-riel@surriel.com \
--to=riel@surriel.com \
--cc=david@kernel.org \
--cc=fvdl@google.com \
--cc=hannes@cmpxchg.org \
--cc=kernel-team@meta.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=surenb@google.com \
--cc=usama.arif@linux.dev \
--cc=willy@infradead.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.