* [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages
@ 2026-04-26 21:47 Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 01/22] mm: move vma_alloc_folio to page_alloc.c Michael S. Tsirkin
` (21 more replies)
0 siblings, 22 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:47 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization
When a guest reports free pages to the hypervisor via virtio-balloon's
free page reporting, the host typically zeros those pages when reclaiming
their backing memory (e.g., via MADV_DONTNEED on anonymous mappings).
When the guest later reallocates those pages, the kernel zeros them
again, redundantly.
Further, on architectures with aliasing caches, upstream with init_on_alloc
double-zeros user pages: once via kernel_init_pages() in
post_alloc_hook, and again via clear_user_highpage() at the
callsite (because user_alloc_needs_zeroing() returns true).
This series eliminates that double-zeroing by moving the zeroing
into post_alloc_hook() and by propagating the "host
already zeroed this page" information through the buddy allocator.
For the reporting part, I am working on the virtio spec now, so I am
sending this out for early feedback. In particular:
- is the mm zeroing rework acceptable?
- is the sysfs testing hook for flushing acceptable?
- the first 12 patches, including the fix for the init_on_alloc double
  zeroing, are an independently mergeable mm rework -
  are they deemed a desirable rework, and should I post them
  separately for inclusion?
Thanks in advance.
Still an RFC as virtio bits need work, but I would very much like
to get a general agreement on mm bits first, so we don't add
a spec for something we can't then use.
-------
Performance with THP enabled on a 2GB VM, 1 vCPU, allocating
256MB of anonymous pages:
metric        baseline        optimized       delta
task-clock    175 +- 10 ms    40 +- 9 ms      -77%
cache-misses  924K +- 323K    287K +- 93K     -69%
instructions  15.3M +- 634K   13.5M +- 337K   -12%
With hugetlb surplus pages:
metric        baseline        optimized       delta
task-clock    169 +- 9 ms     49 +- 19 ms     -71%
cache-misses  1.24M +- 222K   316K +- 114K    -75%
instructions  17.3M +- 1.23M  15.0M +- 604K   -13%
Notes:
- The virtio_balloon module parameter (18/22) is a testing hack.
A proper virtio feature flag is needed before merging.
- Patch 19/22 adds a sysfs flush trigger for deterministic testing
(avoids waiting for the 2-second reporting delay).
- When host_zeroes_pages is set, callers skip folio_zero_user() for
pages known to be zeroed by the host. This is safe on all
architectures because the hypervisor invalidates guest cache lines
when reclaiming page backing (MADV_DONTNEED).
Two flags track known-zero pages:
PG_zeroed (aliased to PG_private) marks buddy allocator pages that
are known to contain all zeros -- either because the host zeroed
them during page reporting, or because they were freed via the
balloon deflate path. It lives on free-list pages and is consumed
by post_alloc_hook() on allocation.
HPG_zeroed (stored in hugetlb folio->private bits) serves the same
purpose for hugetlb pool pages, which are kept in a pool and may
be zeroed long after buddy allocation, so PG_zeroed (consumed at
allocation time) cannot track their state.
- PG_zeroed is aliased to PG_private. It is excluded from
PAGE_FLAGS_CHECK_AT_PREP because it must survive on free-list pages
until post_alloc_hook() consumes and clears it. Is this acceptable,
or should a different bit be used?
- As noted above, on architectures with aliasing caches, upstream with
  init_on_alloc double-zeros user pages. Our patches eliminate this by
  zeroing once via folio_zero_user() in post_alloc_hook. Not yet
  performance-tested on aliasing hardware.
PG_zeroed lifecycle:
Sets PG_zeroed:
- page_reporting_drain: on reported pages when host zeroes them
- __free_pages_ok / __free_frozen_pages: when FPI_ZEROED is set
(balloon deflate path)
- buddy merge: on merged page if both buddies were zeroed
- expand(): propagate to split-off buddy sub-pages
Clears PG_zeroed:
- buddy merge: clear both pages before merge, then conditionally
re-set on merged head if both were zeroed
- post_alloc_hook: clear on head page after consuming the hint
HPG_zeroed lifecycle (hugetlb pool pages, stored in folio->private):
Sets HPG_zeroed:
- alloc_surplus_hugetlb_folio: after buddy allocation with
__GFP_ZERO, mark pool page as known-zero
Clears HPG_zeroed:
- free_huge_folio: page was mapped to userspace, no longer
known-zero when it returns to the pool
- alloc_hugetlb_folio / alloc_hugetlb_folio_reserve: clear
after reporting to caller via bool *zeroed output (consumed)
- The optimization is most effective with THP, where entire 2MB
pages are allocated directly from reported order-9+ buddy pages.
Without THP, only ~21% of order-0 allocations come from reported
pages due to low-order fragmentation.
- Persistent hugetlb pool pages are not covered: when freed by
userspace they return to the hugetlb free pool, not the buddy
allocator, so they are never reported to the host. Surplus
hugetlb pages are allocated from buddy and do benefit.
Test program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif
#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000
#endif
int main(int argc, char **argv)
{
unsigned long size;
int flags = MAP_PRIVATE | MAP_ANONYMOUS;
void *p;
int r;
if (argc < 2) {
fprintf(stderr, "usage: %s <size_mb> [huge]\n", argv[0]);
return 1;
}
size = atol(argv[1]) * 1024UL * 1024;
if (argc >= 3 && strcmp(argv[2], "huge") == 0)
flags |= MAP_HUGETLB;
p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return 1;
}
r = madvise(p, size, MADV_POPULATE_WRITE);
if (r) {
perror("madvise");
return 1;
}
munmap(p, size);
return 0;
}
Test script (bench.sh):
#!/bin/bash
# Usage: bench.sh <size_mb> <mode> <iterations> [huge]
# mode 0 = baseline, mode 1 = skip zeroing
SZ=${1:-256}; MODE=${2:-0}; ITER=${3:-10}; HUGE=${4:-}
FLUSH=/sys/module/page_reporting/parameters/flush
PERF_DATA=/tmp/perf-$MODE.csv
rmmod virtio_balloon 2>/dev/null
insmod virtio_balloon.ko host_zeroes_pages=$MODE
echo 512 > $FLUSH
[ "$HUGE" = "huge" ] && echo $((SZ/2)) > /proc/sys/vm/nr_overcommit_hugepages
rm -f $PERF_DATA
echo "=== sz=${SZ}MB mode=$MODE iter=$ITER $HUGE ==="
for i in $(seq 1 $ITER); do
echo 3 > /proc/sys/vm/drop_caches
echo 512 > $FLUSH
perf stat -e task-clock,instructions,cache-misses \
-x, -o $PERF_DATA --append -- ./alloc_once $SZ $HUGE
done
[ "$HUGE" = "huge" ] && echo 0 > /proc/sys/vm/nr_overcommit_hugepages
rmmod virtio_balloon
awk -F, '/^#/||/^$/{next}{v=$1+0;e=$3;gsub(/ /,"",e);s[e]+=v;ss[e]+=v*v;n[e]++}
END{for(e in s){a=s[e]/n[e];d=sqrt(ss[e]/n[e]-a*a);printf " %-16s %10.0f +- %8.0f (n=%d)\n",e,a,d,n[e]}}' $PERF_DATA
Compile and run:
gcc -static -O2 -o alloc_once alloc_once.c
bash bench.sh 256 0 10 # baseline (regular pages)
bash bench.sh 256 1 10 # optimized (regular pages)
bash bench.sh 256 0 10 huge # baseline (hugetlb surplus)
bash bench.sh 256 1 10 huge # optimized (hugetlb surplus)
Changes since v3 (addressing review by Gregory Price and David Hildenbrand):
- Keep user_addr threading internal: public APIs (__alloc_pages,
__folio_alloc, folio_alloc_mpol) are unchanged. Only internal
functions (__alloc_frozen_pages_noprof, __alloc_pages_mpol) carry
user_addr. This eliminates all API churn for external callers.
- Add vma_alloc_folio_user_addr() (2/22) to separate NUMA policy
address from the zeroing hint address. Fixes NUMA interleave
index corruption when passing unaligned fault address for
higher-order allocations.
- Add per-page zeroed_bitmap to page_reporting_dev_info (17/22).
The driver's report() callback manages the bitmap. Drain
checks it gated by the host_zeroes_pages static key. This
matches the proposed virtio balloon extension at
https://lore.kernel.org/all/cover.1776874126.git.mst@redhat.com/
- Clear PG_zeroed in __isolate_free_page() to prevent the aliased
PG_private flag from leaking to compaction/alloc_contig paths.
- Do not exclude PG_zeroed from PAGE_FLAGS_CHECK_AT_PREP macro.
Instead, __free_pages_prepare() clears it (preventing filesystem
PG_private leaking as false PG_zeroed), and FPI_ZEROED sets it
after prepare. Only buddy merge assertion is relaxed.
- Initialize alloc_context.user_addr in alloc_pages_bulk_noprof.
Changes since v2 (addressing review by Gregory Price and David Hildenbrand):
- v2 used pghint_t / vma_alloc_folio_hints API. v3 switches to
threading user_addr through the page allocator and using __GFP_ZERO,
so post_alloc_hook() can use folio_zero_user() for cache-friendly
zeroing when the user fault address is known.
- Exclude __PG_ZEROED from PAGE_FLAGS_CHECK_AT_PREP macro definition
instead of runtime masking in __free_one_page.
- Drop redundant page_poisoning_enabled() check from mm core free
path -- already guarded at feature negotiation time in
virtio_balloon_validate. The balloon driver keeps its own
page_poisoning_enabled_static() check as defense in depth.
- Split free_frozen_pages_zeroed and put_page_zeroed into separate
patches. David Hildenbrand indicated he intends to rework balloon
pages to be frozen (no refcount), at which point put_page_zeroed
(21/22) can be dropped and the balloon can call
free_frozen_pages_zeroed directly.
- Use HPG_zeroed flag (in hugetlb folio->private) for hugetlb pool
pages instead of PG_zeroed, since pool pages are zeroed long after
buddy allocation and PG_zeroed is consumed at allocation time.
- syzbot CI found a PF_NO_COMPOUND BUG in the v2 pghint_t approach
where __ClearPageZeroed was called on compound hugetlb pages in
free_huge_folio. The v3 HPG_zeroed approach avoids this.
- Remove redundant arch vma_alloc_zeroed_movable_folio overrides
on x86, s390, m68k, and alpha (12/22). Suggested by David
Hildenbrand.
- Updated benchmarking script to compute per-run avg +- stddev
via awk on CSV output.
Changes v1->v2:
- Replaced __GFP_PREZEROED with PG_zeroed page flag (aliased PG_private)
- Added pghint_t type and vma_alloc_folio_hints() API
- Track PG_zeroed across buddy merges and splits
- Added post_alloc_hook integration (single consume/clear point)
- Added hugetlb support (pool pages + memfd)
- Added page_reporting flush parameter for deterministic testing
- Added free_frozen_pages_hint/put_page_hint for balloon deflate path
- Added try_to_claim_block PG_zeroed preservation
- Updated perf numbers with per-iteration flush methodology
Written with assistance from Claude (claude-opus-4-6).
Reviewed by cursor-agent (GPT-5.4-xhigh).
Everything was read manually; the patchset split and commit logs were
edited by hand.
Michael S. Tsirkin (22):
mm: move vma_alloc_folio to page_alloc.c
mm: add vma_alloc_folio_user_addr
mm: thread user_addr through page allocator for cache-friendly zeroing
mm: add folio_zero_user stub for configs without THP/HUGETLBFS
mm: page_alloc: move prep_compound_page before post_alloc_hook
mm: use folio_zero_user for user pages in post_alloc_hook
mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio
mm: use __GFP_ZERO in alloc_anon_folio
mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd
mm: hugetlb: use __GFP_ZERO and skip zeroing for zeroed pages
mm: memfd: skip zeroing for zeroed hugetlb pool pages
mm: remove arch vma_alloc_zeroed_movable_folio overrides
mm: page_alloc: propagate PageReported flag across buddy splits
mm: page_reporting: skip redundant zeroing of host-zeroed reported
pages
mm: page_alloc: clear PG_zeroed on buddy merge if not both zero
mm: page_alloc: preserve PG_zeroed in page_del_and_expand
mm: page_reporting: add per-page zeroed bitmap for host feedback
virtio_balloon: a hack to enable host-zeroed page optimization
mm: page_reporting: add flush parameter with page budget
mm: add free_frozen_pages_zeroed
mm: add put_page_zeroed and folio_put_zeroed
virtio_balloon: mark deflated pages as zeroed
arch/alpha/include/asm/page.h | 3 -
arch/m68k/include/asm/page_no.h | 3 -
arch/s390/include/asm/page.h | 3 -
arch/x86/include/asm/page.h | 3 -
drivers/virtio/virtio_balloon.c | 17 ++-
fs/hugetlbfs/inode.c | 10 +-
include/linux/gfp.h | 22 ++--
include/linux/highmem.h | 9 +-
include/linux/hugetlb.h | 14 ++-
include/linux/mm.h | 15 +++
include/linux/page-flags.h | 9 ++
include/linux/page_reporting.h | 10 ++
mm/compaction.c | 7 +-
mm/huge_memory.c | 18 +--
mm/hugetlb.c | 101 +++++++++++-----
mm/internal.h | 11 +-
mm/memfd.c | 17 ++-
mm/memory.c | 17 +--
mm/mempolicy.c | 59 ++++-----
mm/page_alloc.c | 204 ++++++++++++++++++++++++++------
mm/page_reporting.c | 62 +++++++++-
mm/page_reporting.h | 12 ++
mm/slub.c | 4 +-
mm/swap.c | 18 ++-
24 files changed, 479 insertions(+), 169 deletions(-)
--
MST
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH RFC v4 01/22] mm: move vma_alloc_folio to page_alloc.c
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
@ 2026-04-26 21:47 ` Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 02/22] mm: add vma_alloc_folio_user_addr Michael S. Tsirkin
` (20 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:47 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Johannes Weiner, Zi Yan, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Matthew Brost, Joshua Hahn, Rakie Kim,
Byungchul Park, Ying Huang, Alistair Popple
Move vma_alloc_folio_noprof() from an inline in gfp.h (for !NUMA)
and mempolicy.c (for NUMA) to page_alloc.c. The declaration is
moved outside the #ifdef CONFIG_NUMA block so both configs use
the same real function.
On NUMA, it calls the mempolicy allocation path as before.
On !NUMA, it calls folio_alloc_noprof() directly.
This prepares for a subsequent patch that will thread user_addr
through the allocator: having vma_alloc_folio in page_alloc.c
means user_addr can be passed to the internal allocation path
without changing public API signatures or duplicating plumbing
in both gfp.h and mempolicy.c.
No functional change.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
include/linux/gfp.h | 9 ++-------
mm/mempolicy.c | 17 -----------------
mm/page_alloc.c | 28 ++++++++++++++++++++++++++++
3 files changed, 30 insertions(+), 24 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 51ef13ed756e..7ccbda35b9ad 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -318,13 +318,13 @@ static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask,
#define alloc_pages_node(...) alloc_hooks(alloc_pages_node_noprof(__VA_ARGS__))
+struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
+ struct vm_area_struct *vma, unsigned long addr);
#ifdef CONFIG_NUMA
struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order);
struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order);
struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
struct mempolicy *mpol, pgoff_t ilx, int nid);
-struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
- unsigned long addr);
#else
static inline struct page *alloc_pages_noprof(gfp_t gfp_mask, unsigned int order)
{
@@ -339,11 +339,6 @@ static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int orde
{
return folio_alloc_noprof(gfp, order);
}
-static inline struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
- struct vm_area_struct *vma, unsigned long addr)
-{
- return folio_alloc_noprof(gfp, order);
-}
#endif
#define alloc_pages(...) alloc_hooks(alloc_pages_noprof(__VA_ARGS__))
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0e5175f1c767..f0f85c89da82 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2524,23 +2524,6 @@ struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
*
* Return: The folio on success or NULL if allocation fails.
*/
-struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
- unsigned long addr)
-{
- struct mempolicy *pol;
- pgoff_t ilx;
- struct folio *folio;
-
- if (vma->vm_flags & VM_DROPPABLE)
- gfp |= __GFP_NOWARN;
-
- pol = get_vma_policy(vma, addr, order, &ilx);
- folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
- mpol_cond_put(pol);
- return folio;
-}
-EXPORT_SYMBOL(vma_alloc_folio_noprof);
-
struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned order)
{
struct mempolicy *pol = &default_policy;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2d4b6f1a554e..0e6ec7310087 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5297,6 +5297,34 @@ struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_
}
EXPORT_SYMBOL(__folio_alloc_noprof);
+#ifdef CONFIG_NUMA
+struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+ struct mempolicy *pol;
+ pgoff_t ilx;
+ struct folio *folio;
+
+ if (vma->vm_flags & VM_DROPPABLE)
+ gfp |= __GFP_NOWARN;
+
+ pol = get_vma_policy(vma, addr, order, &ilx);
+ folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
+ mpol_cond_put(pol);
+ return folio;
+}
+#else
+struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+ if (vma->vm_flags & VM_DROPPABLE)
+ gfp |= __GFP_NOWARN;
+
+ return folio_alloc_noprof(gfp, order);
+}
+#endif
+EXPORT_SYMBOL(vma_alloc_folio_noprof);
+
/*
* Common helper functions. Never use with __GFP_HIGHMEM because the returned
* address cannot represent highmem pages. Use alloc_pages and then kmap if
--
MST
* [PATCH RFC v4 02/22] mm: add vma_alloc_folio_user_addr
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 01/22] mm: move vma_alloc_folio to page_alloc.c Michael S. Tsirkin
@ 2026-04-26 21:47 ` Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 03/22] mm: thread user_addr through page allocator for cache-friendly zeroing Michael S. Tsirkin
` (19 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:47 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Johannes Weiner, Zi Yan, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport
Add vma_alloc_folio_user_addr(), which will be used in follow-up
patches. It takes a separate user_addr parameter for the
cache-friendly zeroing hint, independent of the
addr used for NUMA policy lookup.
The NUMA interleave index is computed from
(addr - vma->vm_start) >> (PAGE_SHIFT + order), so addr must be
folio-aligned for correct NUMA placement. But the zeroing hint
wants the exact fault address for cache locality.
vma_alloc_folio() becomes a thin wrapper that passes addr for both.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
include/linux/gfp.h | 4 ++++
mm/page_alloc.c | 17 +++++++++++++----
2 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 7ccbda35b9ad..7069b810f171 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -320,6 +320,9 @@ static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask,
struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
struct vm_area_struct *vma, unsigned long addr);
+struct folio *vma_alloc_folio_user_addr_noprof(gfp_t gfp, int order,
+ struct vm_area_struct *vma, unsigned long addr,
+ unsigned long user_addr);
#ifdef CONFIG_NUMA
struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order);
struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order);
@@ -345,6 +348,7 @@ static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int orde
#define folio_alloc(...) alloc_hooks(folio_alloc_noprof(__VA_ARGS__))
#define folio_alloc_mpol(...) alloc_hooks(folio_alloc_mpol_noprof(__VA_ARGS__))
#define vma_alloc_folio(...) alloc_hooks(vma_alloc_folio_noprof(__VA_ARGS__))
+#define vma_alloc_folio_user_addr(...) alloc_hooks(vma_alloc_folio_user_addr_noprof(__VA_ARGS__))
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e6ec7310087..6d31a5c99e93 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5298,8 +5298,9 @@ struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_
EXPORT_SYMBOL(__folio_alloc_noprof);
#ifdef CONFIG_NUMA
-struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
- struct vm_area_struct *vma, unsigned long addr)
+struct folio *vma_alloc_folio_user_addr_noprof(gfp_t gfp, int order,
+ struct vm_area_struct *vma, unsigned long addr,
+ unsigned long user_addr)
{
struct mempolicy *pol;
pgoff_t ilx;
@@ -5314,8 +5315,9 @@ struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
return folio;
}
#else
-struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
- struct vm_area_struct *vma, unsigned long addr)
+struct folio *vma_alloc_folio_user_addr_noprof(gfp_t gfp, int order,
+ struct vm_area_struct *vma, unsigned long addr,
+ unsigned long user_addr)
{
if (vma->vm_flags & VM_DROPPABLE)
gfp |= __GFP_NOWARN;
@@ -5323,6 +5325,13 @@ struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
return folio_alloc_noprof(gfp, order);
}
#endif
+EXPORT_SYMBOL(vma_alloc_folio_user_addr_noprof);
+
+struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+ return vma_alloc_folio_user_addr_noprof(gfp, order, vma, addr, addr);
+}
EXPORT_SYMBOL(vma_alloc_folio_noprof);
/*
--
MST
* [PATCH RFC v4 03/22] mm: thread user_addr through page allocator for cache-friendly zeroing
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 01/22] mm: move vma_alloc_folio to page_alloc.c Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 02/22] mm: add vma_alloc_folio_user_addr Michael S. Tsirkin
@ 2026-04-26 21:47 ` Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 04/22] mm: add folio_zero_user stub for configs without THP/HUGETLBFS Michael S. Tsirkin
` (18 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:47 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Johannes Weiner, Zi Yan, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Muchun Song, Oscar Salvador, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo
Thread a user virtual address from vma_alloc_folio() down through
the page allocator to post_alloc_hook(). This is plumbing preparation
for a subsequent patch that will use user_addr to call folio_zero_user()
for cache-friendly zeroing of user pages.
The user_addr is stored in struct alloc_context and flows through:
vma_alloc_folio -> folio_alloc_mpol -> __alloc_pages_mpol ->
__alloc_frozen_pages -> get_page_from_freelist -> prep_new_page ->
post_alloc_hook
user_addr is threaded through internal APIs only
(__alloc_frozen_pages_noprof, __alloc_pages_mpol). Public APIs
(__alloc_pages, __folio_alloc, folio_alloc_mpol) are unchanged.
USER_ADDR_NONE ((unsigned long)-1) is used for non-user
allocations, since address 0 is a valid userspace mapping.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
include/linux/gfp.h | 8 +++++++-
mm/compaction.c | 5 ++---
mm/hugetlb.c | 36 ++++++++++++++++++++----------------
mm/internal.h | 12 +++++++++---
mm/mempolicy.c | 42 +++++++++++++++++++++++++++++++-----------
mm/page_alloc.c | 44 +++++++++++++++++++++++++++++---------------
mm/slub.c | 4 ++--
7 files changed, 100 insertions(+), 51 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 7069b810f171..e275cc80e19e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -226,6 +226,12 @@ static inline void arch_free_page(struct page *page, int order) { }
static inline void arch_alloc_page(struct page *page, int order) { }
#endif
+/*
+ * Sentinel for user_addr: indicates a non-user allocation.
+ * Cannot use 0 because address 0 is a valid userspace mapping.
+ */
+#define USER_ADDR_NONE ((unsigned long)-1)
+
struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
#define __alloc_pages(...) alloc_hooks(__alloc_pages_noprof(__VA_ARGS__))
@@ -340,7 +346,7 @@ static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
struct mempolicy *mpol, pgoff_t ilx, int nid)
{
- return folio_alloc_noprof(gfp, order);
+ return __folio_alloc_noprof(gfp, order, numa_node_id(), NULL);
}
#endif
diff --git a/mm/compaction.c b/mm/compaction.c
index 1e8f8eca318c..c1039a9373e5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -82,7 +82,7 @@ static inline bool is_via_compact_memory(int order) { return false; }
static struct page *mark_allocated_noprof(struct page *page, unsigned int order, gfp_t gfp_flags)
{
- post_alloc_hook(page, order, __GFP_MOVABLE);
+ post_alloc_hook(page, order, __GFP_MOVABLE, USER_ADDR_NONE);
set_page_refcounted(page);
return page;
}
@@ -1832,8 +1832,7 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
set_page_private(&freepage[size], start_order);
}
dst = (struct folio *)freepage;
-
- post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
+ post_alloc_hook(&dst->page, order, __GFP_MOVABLE, USER_ADDR_NONE);
set_page_refcounted(&dst->page);
if (order)
prep_compound_page(&dst->page, order);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0beb6e22bc26..de8361b503d2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1842,7 +1842,8 @@ struct address_space *hugetlb_folio_mapping_lock_write(struct folio *folio)
}
static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
- int nid, nodemask_t *nmask, nodemask_t *node_alloc_noretry)
+ int nid, nodemask_t *nmask, nodemask_t *node_alloc_noretry,
+ unsigned long addr)
{
struct folio *folio;
bool alloc_try_hard = true;
@@ -1859,7 +1860,7 @@ static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
if (alloc_try_hard)
gfp_mask |= __GFP_RETRY_MAYFAIL;
- folio = (struct folio *)__alloc_frozen_pages(gfp_mask, order, nid, nmask);
+ folio = (struct folio *)__alloc_frozen_pages(gfp_mask, order, nid, nmask, addr);
/*
* If we did not specify __GFP_RETRY_MAYFAIL, but still got a
@@ -1888,7 +1889,7 @@ static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
gfp_t gfp_mask, int nid, nodemask_t *nmask,
- nodemask_t *node_alloc_noretry)
+ nodemask_t *node_alloc_noretry, unsigned long addr)
{
struct folio *folio;
int order = huge_page_order(h);
@@ -1900,7 +1901,7 @@ static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
folio = alloc_gigantic_frozen_folio(order, gfp_mask, nid, nmask);
else
folio = alloc_buddy_frozen_folio(order, gfp_mask, nid, nmask,
- node_alloc_noretry);
+ node_alloc_noretry, addr);
if (folio)
init_new_hugetlb_folio(folio);
return folio;
@@ -1914,11 +1915,12 @@ static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
* pages is zero, and the accounting must be done in the caller.
*/
static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
- gfp_t gfp_mask, int nid, nodemask_t *nmask)
+ gfp_t gfp_mask, int nid, nodemask_t *nmask,
+ unsigned long addr)
{
struct folio *folio;
- folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL);
+ folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL, addr);
if (folio)
hugetlb_vmemmap_optimize_folio(h, folio);
return folio;
@@ -1958,7 +1960,7 @@ static struct folio *alloc_pool_huge_folio(struct hstate *h,
struct folio *folio;
folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
- nodes_allowed, node_alloc_noretry);
+ nodes_allowed, node_alloc_noretry, USER_ADDR_NONE);
if (folio)
return folio;
}
@@ -2127,7 +2129,8 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn, unsigned long end_pfn)
* Allocates a fresh surplus page from the page allocator.
*/
static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
- gfp_t gfp_mask, int nid, nodemask_t *nmask)
+ gfp_t gfp_mask, int nid, nodemask_t *nmask,
+ unsigned long addr)
{
struct folio *folio = NULL;
@@ -2139,7 +2142,7 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
goto out_unlock;
spin_unlock_irq(&hugetlb_lock);
- folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask);
+ folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, addr);
if (!folio)
return NULL;
@@ -2182,7 +2185,7 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
if (hstate_is_gigantic(h))
return NULL;
- folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask);
+ folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, USER_ADDR_NONE);
if (!folio)
return NULL;
@@ -2218,14 +2221,14 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
if (mpol_is_preferred_many(mpol)) {
gfp_t gfp = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
- folio = alloc_surplus_hugetlb_folio(h, gfp, nid, nodemask);
+ folio = alloc_surplus_hugetlb_folio(h, gfp, nid, nodemask, addr);
/* Fallback to all nodes if page==NULL */
nodemask = NULL;
}
if (!folio)
- folio = alloc_surplus_hugetlb_folio(h, gfp_mask, nid, nodemask);
+ folio = alloc_surplus_hugetlb_folio(h, gfp_mask, nid, nodemask, addr);
mpol_cond_put(mpol);
return folio;
}
@@ -2332,7 +2335,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
* down the road to pick the current node if that is the case.
*/
folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
- NUMA_NO_NODE, &alloc_nodemask);
+ NUMA_NO_NODE, &alloc_nodemask,
+ USER_ADDR_NONE);
if (!folio) {
alloc_ok = false;
break;
@@ -2738,7 +2742,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
spin_unlock_irq(&hugetlb_lock);
gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
new_folio = alloc_fresh_hugetlb_folio(h, gfp_mask,
- nid, NULL);
+ nid, NULL, USER_ADDR_NONE);
if (!new_folio)
return -ENOMEM;
goto retry;
@@ -3434,13 +3438,13 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid,
- &node_states[N_MEMORY], NULL);
+ &node_states[N_MEMORY], NULL, USER_ADDR_NONE);
if (!folio && !list_empty(&folio_list) &&
hugetlb_vmemmap_optimizable_size(h)) {
prep_and_add_allocated_folios(h, &folio_list);
INIT_LIST_HEAD(&folio_list);
folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid,
- &node_states[N_MEMORY], NULL);
+ &node_states[N_MEMORY], NULL, USER_ADDR_NONE);
}
if (!folio)
break;
diff --git a/mm/internal.h b/mm/internal.h
index cb0af847d7d9..8e4616e42b4a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -672,6 +672,7 @@ struct alloc_context {
*/
enum zone_type highest_zoneidx;
bool spread_dirty_pages;
+ unsigned long user_addr;
};
/*
@@ -887,24 +888,29 @@ static inline void prep_compound_tail(struct page *head, int tail_idx)
set_page_private(p, 0);
}
-void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
+void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
+ unsigned long user_addr);
extern bool free_pages_prepare(struct page *page, unsigned int order);
extern int user_min_free_kbytes;
struct page *__alloc_frozen_pages_noprof(gfp_t, unsigned int order, int nid,
- nodemask_t *);
+ nodemask_t *, unsigned long user_addr);
#define __alloc_frozen_pages(...) \
alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
void free_frozen_pages(struct page *page, unsigned int order);
+void free_frozen_pages_zeroed(struct page *page, unsigned int order);
void free_unref_folios(struct folio_batch *fbatch);
#ifdef CONFIG_NUMA
struct page *alloc_frozen_pages_noprof(gfp_t, unsigned int order);
+struct folio *folio_alloc_mpol_user_noprof(gfp_t gfp, unsigned int order,
+ struct mempolicy *pol, pgoff_t ilx, int nid,
+ unsigned long user_addr);
#else
static inline struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order)
{
- return __alloc_frozen_pages_noprof(gfp, order, numa_node_id(), NULL);
+ return __alloc_frozen_pages_noprof(gfp, order, numa_node_id(), NULL, USER_ADDR_NONE);
}
#endif
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f0f85c89da82..06403a3812b4 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2406,7 +2406,8 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk,
}
static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
- int nid, nodemask_t *nodemask)
+ int nid, nodemask_t *nodemask,
+ unsigned long user_addr)
{
struct page *page;
gfp_t preferred_gfp;
@@ -2419,9 +2420,11 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
*/
preferred_gfp = gfp | __GFP_NOWARN;
preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
- page = __alloc_frozen_pages_noprof(preferred_gfp, order, nid, nodemask);
+ page = __alloc_frozen_pages_noprof(preferred_gfp, order, nid,
+ nodemask, user_addr);
if (!page)
- page = __alloc_frozen_pages_noprof(gfp, order, nid, NULL);
+ page = __alloc_frozen_pages_noprof(gfp, order, nid, NULL,
+ user_addr);
return page;
}
@@ -2436,8 +2439,9 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
*
* Return: The page on success or NULL if allocation fails.
*/
-static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
- struct mempolicy *pol, pgoff_t ilx, int nid)
+static struct page *__alloc_pages_mpol(gfp_t gfp, unsigned int order,
+ struct mempolicy *pol, pgoff_t ilx, int nid,
+ unsigned long user_addr)
{
nodemask_t *nodemask;
struct page *page;
@@ -2445,7 +2449,8 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
nodemask = policy_nodemask(gfp, pol, ilx, &nid);
if (pol->mode == MPOL_PREFERRED_MANY)
- return alloc_pages_preferred_many(gfp, order, nid, nodemask);
+ return alloc_pages_preferred_many(gfp, order, nid, nodemask,
+ user_addr);
if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
/* filter "hugepage" allocation, unless from alloc_pages() */
@@ -2469,7 +2474,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
*/
page = __alloc_frozen_pages_noprof(
gfp | __GFP_THISNODE | __GFP_NORETRY, order,
- nid, NULL);
+ nid, NULL, user_addr);
if (page || !(gfp & __GFP_DIRECT_RECLAIM))
return page;
/*
@@ -2481,7 +2486,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
}
}
- page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask);
+ page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask, user_addr);
if (unlikely(pol->mode == MPOL_INTERLEAVE ||
pol->mode == MPOL_WEIGHTED_INTERLEAVE) && page) {
@@ -2497,11 +2502,18 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
return page;
}
-struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
+static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
struct mempolicy *pol, pgoff_t ilx, int nid)
{
- struct page *page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
- ilx, nid);
+ return __alloc_pages_mpol(gfp, order, pol, ilx, nid, USER_ADDR_NONE);
+}
+
+struct folio *folio_alloc_mpol_user_noprof(gfp_t gfp, unsigned int order,
+ struct mempolicy *pol, pgoff_t ilx, int nid,
+ unsigned long user_addr)
+{
+ struct page *page = __alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
+ ilx, nid, user_addr);
if (!page)
return NULL;
@@ -2509,6 +2521,14 @@ struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
return page_rmappable_folio(page);
}
+struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
+ struct mempolicy *pol, pgoff_t ilx, int nid)
+{
+ return folio_alloc_mpol_user_noprof(gfp, order, pol, ilx, nid,
+ USER_ADDR_NONE);
+}
+EXPORT_SYMBOL(folio_alloc_mpol_noprof);
+
/**
* vma_alloc_folio - Allocate a folio for a VMA.
* @gfp: GFP flags.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d31a5c99e93..c19eaf76607c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1837,7 +1837,7 @@ static inline bool should_skip_init(gfp_t flags)
}
inline void post_alloc_hook(struct page *page, unsigned int order,
- gfp_t gfp_flags)
+ gfp_t gfp_flags, unsigned long user_addr)
{
bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
!should_skip_init(gfp_flags);
@@ -1892,9 +1892,10 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
}
static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
- unsigned int alloc_flags)
+ unsigned int alloc_flags,
+ unsigned long user_addr)
{
- post_alloc_hook(page, order, gfp_flags);
+ post_alloc_hook(page, order, gfp_flags, user_addr);
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
@@ -3959,7 +3960,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
page = rmqueue(zonelist_zone(ac->preferred_zoneref), zone, order,
gfp_mask, alloc_flags, ac->migratetype);
if (page) {
- prep_new_page(page, order, gfp_mask, alloc_flags);
+ prep_new_page(page, order, gfp_mask, alloc_flags,
+ ac->user_addr);
/*
* If this is a high-order atomic allocation then check
@@ -4194,7 +4196,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
/* Prep a captured page if available */
if (page)
- prep_new_page(page, order, gfp_mask, alloc_flags);
+ prep_new_page(page, order, gfp_mask, alloc_flags,
+ ac->user_addr);
/* Try get a page from the freelist if available */
if (!page)
@@ -5072,7 +5075,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
struct zoneref *z;
struct per_cpu_pages *pcp;
struct list_head *pcp_list;
- struct alloc_context ac;
+ struct alloc_context ac = { .user_addr = USER_ADDR_NONE };
gfp_t alloc_gfp;
unsigned int alloc_flags = ALLOC_WMARK_LOW;
int nr_populated = 0, nr_account = 0;
@@ -5187,7 +5190,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
}
nr_account++;
- prep_new_page(page, 0, gfp, 0);
+ prep_new_page(page, 0, gfp, 0, USER_ADDR_NONE);
set_page_refcounted(page);
page_array[nr_populated++] = page;
}
@@ -5212,12 +5215,13 @@ EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
* This is the 'heart' of the zoned buddy allocator.
*/
struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
- int preferred_nid, nodemask_t *nodemask)
+ int preferred_nid, nodemask_t *nodemask,
+ unsigned long user_addr)
{
struct page *page;
unsigned int alloc_flags = ALLOC_WMARK_LOW;
gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
- struct alloc_context ac = { };
+ struct alloc_context ac = { .user_addr = user_addr };
/*
* There are several places where we assume that the order value is sane
@@ -5278,10 +5282,12 @@ EXPORT_SYMBOL(__alloc_frozen_pages_noprof);
struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
int preferred_nid, nodemask_t *nodemask)
{
struct page *page;
- page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
+ page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid,
+ nodemask, USER_ADDR_NONE);
if (page)
set_page_refcounted(page);
return page;
@@ -5310,7 +5316,8 @@ struct folio *vma_alloc_folio_user_addr_noprof(gfp_t gfp, int order,
gfp |= __GFP_NOWARN;
pol = get_vma_policy(vma, addr, order, &ilx);
- folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
+ folio = folio_alloc_mpol_user_noprof(gfp, order, pol, ilx,
+ numa_node_id(), user_addr);
mpol_cond_put(pol);
return folio;
}
@@ -5319,10 +5326,17 @@ struct folio *vma_alloc_folio_user_addr_noprof(gfp_t gfp, int order,
struct vm_area_struct *vma, unsigned long addr,
unsigned long user_addr)
{
+ struct page *page;
+
if (vma->vm_flags & VM_DROPPABLE)
gfp |= __GFP_NOWARN;
- return folio_alloc_noprof(gfp, order);
+ page = __alloc_frozen_pages_noprof(gfp | __GFP_COMP, order,
+ numa_node_id(), NULL, user_addr);
+ if (!page)
+ return NULL;
+ set_page_refcounted(page);
+ return page_rmappable_folio(page);
}
#endif
EXPORT_SYMBOL(vma_alloc_folio_user_addr_noprof);
@@ -6947,7 +6961,7 @@ static void split_free_frozen_pages(struct list_head *list, gfp_t gfp_mask)
list_for_each_entry_safe(page, next, &list[order], lru) {
int i;
- post_alloc_hook(page, order, gfp_mask);
+ post_alloc_hook(page, order, gfp_mask, USER_ADDR_NONE);
if (!order)
continue;
@@ -7153,7 +7167,7 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
struct page *head = pfn_to_page(start);
check_new_pages(head, order);
- prep_new_page(head, order, gfp_mask, 0);
+ prep_new_page(head, order, gfp_mask, 0, USER_ADDR_NONE);
} else {
ret = -EINVAL;
WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
@@ -7818,7 +7832,7 @@ struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned
gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC | __GFP_COMP
| gfp_flags;
unsigned int alloc_flags = ALLOC_TRYLOCK;
- struct alloc_context ac = { };
+ struct alloc_context ac = { .user_addr = USER_ADDR_NONE };
struct page *page;
VM_WARN_ON_ONCE(gfp_flags & ~__GFP_ACCOUNT);
diff --git a/mm/slub.c b/mm/slub.c
index 0c906fefc31b..fc8f998a0fe1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3266,7 +3266,7 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
else if (node == NUMA_NO_NODE)
page = alloc_frozen_pages(flags, order);
else
- page = __alloc_frozen_pages(flags, order, node, NULL);
+ page = __alloc_frozen_pages(flags, order, node, NULL, USER_ADDR_NONE);
if (!page)
return NULL;
@@ -5178,7 +5178,7 @@ static void *___kmalloc_large_node(size_t size, gfp_t flags, int node)
if (node == NUMA_NO_NODE)
page = alloc_frozen_pages_noprof(flags, order);
else
- page = __alloc_frozen_pages_noprof(flags, order, node, NULL);
+ page = __alloc_frozen_pages_noprof(flags, order, node, NULL, USER_ADDR_NONE);
if (page) {
ptr = page_address(page);
--
MST
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH RFC v4 04/22] mm: add folio_zero_user stub for configs without THP/HUGETLBFS
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (2 preceding siblings ...)
2026-04-26 21:47 ` [PATCH RFC v4 03/22] mm: thread user_addr through page allocator for cache-friendly zeroing Michael S. Tsirkin
@ 2026-04-26 21:47 ` Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 05/22] mm: page_alloc: move prep_compound_page before post_alloc_hook Michael S. Tsirkin
` (17 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:47 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport
folio_zero_user() is defined in mm/memory.c under
CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS. A subsequent patch
will call it from post_alloc_hook() for all user page zeroing, so
configs without THP or HUGETLBFS will need a stub.
Add a macro in the #else branch that falls back to
clear_user_highpages(), which handles cache aliasing correctly on
VIPT architectures and is always available via highmem.h.
Without THP/HUGETLBFS, only order-0 user pages are allocated, so
the locality optimization in the real folio_zero_user() (zero near
the faulting address last) is not needed.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
include/linux/mm.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5be3d8a8f806..541d36e5e420 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4718,6 +4718,9 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
}
+#else /* !CONFIG_TRANSPARENT_HUGEPAGE && !CONFIG_HUGETLBFS */
+#define folio_zero_user(folio, addr_hint) \
+ clear_user_highpages(&(folio)->page, (addr_hint), folio_nr_pages(folio))
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
#if MAX_NUMNODES > 1
--
MST
* [PATCH RFC v4 05/22] mm: page_alloc: move prep_compound_page before post_alloc_hook
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (3 preceding siblings ...)
2026-04-26 21:47 ` [PATCH RFC v4 04/22] mm: add folio_zero_user stub for configs without THP/HUGETLBFS Michael S. Tsirkin
@ 2026-04-26 21:47 ` Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 06/22] mm: use folio_zero_user for user pages in post_alloc_hook Michael S. Tsirkin
` (16 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:47 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Johannes Weiner, Zi Yan
Move prep_compound_page() before post_alloc_hook() in prep_new_page().
The next patch adds a folio_zero_user() call to post_alloc_hook(),
which uses folio_nr_pages() to determine how many pages to zero.
Without compound metadata set up first, folio_nr_pages() returns 1
for higher-order allocations, so only the first page would be zeroed.
All other operations in post_alloc_hook() (arch_alloc_page, KASAN,
debug, page owner, etc.) use raw page pointers with explicit order
counts and are unaffected by this reordering.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
mm/page_alloc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c19eaf76607c..04c03c56abec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1895,11 +1895,11 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
unsigned int alloc_flags,
unsigned long user_addr)
{
- post_alloc_hook(page, order, gfp_flags, user_addr);
-
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
+ post_alloc_hook(page, order, gfp_flags, user_addr);
+
/*
* page is set pfmemalloc when ALLOC_NO_WATERMARKS was necessary to
* allocate the page. The expectation is that the caller is taking
--
MST
* [PATCH RFC v4 06/22] mm: use folio_zero_user for user pages in post_alloc_hook
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (4 preceding siblings ...)
2026-04-26 21:47 ` [PATCH RFC v4 05/22] mm: page_alloc: move prep_compound_page before post_alloc_hook Michael S. Tsirkin
@ 2026-04-26 21:47 ` Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 07/22] mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio Michael S. Tsirkin
` (15 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:47 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Johannes Weiner, Zi Yan
When post_alloc_hook() needs to zero a page for an explicit
__GFP_ZERO allocation and user_addr is set, use folio_zero_user()
instead of kernel_init_pages(). This zeros near the faulting
address last, keeping those cachelines hot for the impending
user access.
folio_zero_user() is only used for explicit __GFP_ZERO, not for
init_on_alloc. On architectures with virtually-indexed caches
(e.g., ARM), clear_user_highpage() performs per-line cache
operations; using it for init_on_alloc would add overhead that
kernel_init_pages() avoids (the page fault path flushes the
cache at PTE installation time regardless).
No functional change yet: current callers do not pass __GFP_ZERO
for user pages (they zero at the callsite instead). Subsequent
patches will convert them.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
mm/page_alloc.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 04c03c56abec..7791bc1eeefa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1882,9 +1882,20 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
for (i = 0; i != 1 << order; ++i)
page_kasan_tag_reset(page + i);
}
- /* If memory is still not initialized, initialize it now. */
- if (init)
- kernel_init_pages(page, 1 << order);
+ /*
+ * If memory is still not initialized, initialize it now.
+ * When __GFP_ZERO was explicitly requested and user_addr is set,
+ * use folio_zero_user() which zeros near the faulting address
+ * last, keeping those cachelines hot. For init_on_alloc, use
+ * kernel_init_pages() to avoid unnecessary cache flush overhead
+ * on architectures with virtually-indexed caches.
+ */
+ if (init) {
+ if ((gfp_flags & __GFP_ZERO) && user_addr != USER_ADDR_NONE)
+ folio_zero_user(page_folio(page), user_addr);
+ else
+ kernel_init_pages(page, 1 << order);
+ }
set_page_owner(page, order, gfp_flags);
page_table_check_alloc(page, order);
--
MST
* [PATCH RFC v4 07/22] mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (5 preceding siblings ...)
2026-04-26 21:47 ` [PATCH RFC v4 06/22] mm: use folio_zero_user for user pages in post_alloc_hook Michael S. Tsirkin
@ 2026-04-26 21:47 ` Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 08/22] mm: use __GFP_ZERO in alloc_anon_folio Michael S. Tsirkin
` (14 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:47 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport
Now that post_alloc_hook() handles cache-friendly user page
zeroing via folio_zero_user(), convert vma_alloc_zeroed_movable_folio()
to pass __GFP_ZERO instead of zeroing at the callsite.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
include/linux/highmem.h | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index af03db851a1d..ffa683f64f1d 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -320,13 +320,8 @@ static inline
struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
unsigned long vaddr)
{
- struct folio *folio;
-
- folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vaddr);
- if (folio && user_alloc_needs_zeroing())
- clear_user_highpage(&folio->page, vaddr);
-
- return folio;
+ return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
+ 0, vma, vaddr);
}
#endif
--
MST
* [PATCH RFC v4 08/22] mm: use __GFP_ZERO in alloc_anon_folio
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (6 preceding siblings ...)
2026-04-26 21:47 ` [PATCH RFC v4 07/22] mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio Michael S. Tsirkin
@ 2026-04-26 21:47 ` Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 09/22] mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd Michael S. Tsirkin
` (13 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:47 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport
Convert alloc_anon_folio() to pass __GFP_ZERO instead of zeroing at the callsite.
Use vma_alloc_folio_user_addr() to pass the folio-aligned address
for NUMA policy and the raw vmf->address for cache-friendly zeroing.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
mm/memory.c | 17 +++++------------
1 file changed, 5 insertions(+), 12 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 07778814b4a8..83ec73791fae 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4662,7 +4662,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
gfp = vma_thp_gfp_mask(vma);
while (orders) {
addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
- folio = vma_alloc_folio(gfp, order, vma, addr);
+ folio = vma_alloc_folio_user_addr(gfp, order, vma, addr,
+ vmf->address);
if (folio) {
if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
gfp, entry))
@@ -5176,10 +5177,11 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
goto fallback;
/* Try allocating the highest of the remaining orders. */
- gfp = vma_thp_gfp_mask(vma);
+ gfp = vma_thp_gfp_mask(vma) | __GFP_ZERO;
while (orders) {
addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
- folio = vma_alloc_folio(gfp, order, vma, addr);
+ folio = vma_alloc_folio_user_addr(gfp, order, vma, addr,
+ vmf->address);
if (folio) {
if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
@@ -5187,15 +5189,6 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
goto next;
}
folio_throttle_swaprate(folio, gfp);
- /*
- * When a folio is not zeroed during allocation
- * (__GFP_ZERO not used) or user folios require special
- * handling, folio_zero_user() is used to make sure
- * that the page corresponding to the faulting address
- * will be hot in the cache after zeroing.
- */
- if (user_alloc_needs_zeroing())
- folio_zero_user(folio, vmf->address);
return folio;
}
next:
--
MST
* [PATCH RFC v4 09/22] mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (7 preceding siblings ...)
2026-04-26 21:47 ` [PATCH RFC v4 08/22] mm: use __GFP_ZERO in alloc_anon_folio Michael S. Tsirkin
@ 2026-04-26 21:47 ` Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 10/22] mm: hugetlb: use __GFP_ZERO and skip zeroing for zeroed pages Michael S. Tsirkin
` (12 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:47 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang
Convert vma_alloc_anon_folio_pmd() to pass __GFP_ZERO instead of
zeroing at the callsite.
Use vma_alloc_folio_user_addr() to pass the PMD-aligned haddr
for NUMA policy and the raw vmf->address for cache-friendly zeroing.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
mm/huge_memory.c | 18 +++++-------------
1 file changed, 5 insertions(+), 13 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e2746ea74ad..752f0b2e5bac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1254,13 +1254,13 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
- unsigned long addr)
+ unsigned long haddr, unsigned long user_addr)
{
- gfp_t gfp = vma_thp_gfp_mask(vma);
+ gfp_t gfp = vma_thp_gfp_mask(vma) | __GFP_ZERO;
const int order = HPAGE_PMD_ORDER;
struct folio *folio;
- folio = vma_alloc_folio(gfp, order, vma, addr & HPAGE_PMD_MASK);
+ folio = vma_alloc_folio_user_addr(gfp, order, vma, haddr, user_addr);
if (unlikely(!folio)) {
count_vm_event(THP_FAULT_FALLBACK);
@@ -1279,14 +1279,6 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
}
folio_throttle_swaprate(folio, gfp);
- /*
- * When a folio is not zeroed during allocation (__GFP_ZERO not used)
- * or user folios require special handling, folio_zero_user() is used to
- * make sure that the page corresponding to the faulting address will be
- * hot in the cache after zeroing.
- */
- if (user_alloc_needs_zeroing())
- folio_zero_user(folio, addr);
/*
* The memory barrier inside __folio_mark_uptodate makes sure that
* folio_zero_user writes become visible before the set_pmd_at()
@@ -1328,7 +1320,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
pgtable_t pgtable;
vm_fault_t ret = 0;
- folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
+ folio = vma_alloc_anon_folio_pmd(vma, haddr, vmf->address);
if (unlikely(!folio))
return VM_FAULT_FALLBACK;
@@ -2033,7 +2025,7 @@ static vm_fault_t do_huge_zero_wp_pmd(struct vm_fault *vmf)
struct folio *folio;
vm_fault_t ret = 0;
- folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
+ folio = vma_alloc_anon_folio_pmd(vma, haddr, vmf->address);
if (unlikely(!folio))
return VM_FAULT_FALLBACK;
--
MST
* [PATCH RFC v4 10/22] mm: hugetlb: use __GFP_ZERO and skip zeroing for zeroed pages
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (8 preceding siblings ...)
2026-04-26 21:47 ` [PATCH RFC v4 09/22] mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd Michael S. Tsirkin
@ 2026-04-26 21:48 ` Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 11/22] mm: memfd: skip zeroing for zeroed hugetlb pool pages Michael S. Tsirkin
` (11 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:48 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Muchun Song, Oscar Salvador
Convert the hugetlb fault and fallocate paths to use __GFP_ZERO.
For pages allocated from the buddy allocator, post_alloc_hook()
handles zeroing (and skips it when the host has already zeroed
the page).
Hugetlb surplus pages need special handling because they can be
pre-allocated into the pool during mmap (by hugetlb_acct_memory)
before any page fault. Pool pages are kept around and may need
zeroing long after buddy allocation, so PG_zeroed (consumed at
allocation time) cannot track their state.
Add a bool *zeroed output parameter to alloc_hugetlb_folio()
so callers know whether the page needs zeroing. Buddy-allocated
pages are always zeroed (by post_alloc_hook()). Pool
pages use a new HPG_zeroed flag to track whether the page is
known-zero (freshly buddy-allocated, never mapped to userspace).
The flag is set in alloc_surplus_hugetlb_folio() after buddy
allocation and cleared in free_huge_folio() when a user-mapped
page returns to the pool.
Callers that do not need zeroing (CoW, migration) pass NULL for
zeroed and 0 for gfp.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
fs/hugetlbfs/inode.c | 10 ++++++--
include/linux/hugetlb.h | 8 ++++--
mm/hugetlb.c | 54 ++++++++++++++++++++++++++++++++---------
3 files changed, 56 insertions(+), 16 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3f70c47981de..d5d570d6eff4 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -822,14 +822,20 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
* folios in these areas, we need to consume the reserves
* to keep reservation accounting consistent.
*/
- folio = alloc_hugetlb_folio(&pseudo_vma, addr, false);
+ {
+ bool zeroed;
+
+ folio = alloc_hugetlb_folio(&pseudo_vma, addr, false,
+ __GFP_ZERO, &zeroed);
if (IS_ERR(folio)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
error = PTR_ERR(folio);
goto out;
}
- folio_zero_user(folio, addr);
+ if (!zeroed)
+ folio_zero_user(folio, addr);
__folio_mark_uptodate(folio);
+ }
error = hugetlb_add_to_page_cache(folio, mapping, index);
if (unlikely(error)) {
restore_reserve_on_error(h, &pseudo_vma, addr, folio);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 65910437be1c..094714c607f9 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -598,6 +598,7 @@ enum hugetlb_page_flags {
HPG_vmemmap_optimized,
HPG_raw_hwp_unreliable,
HPG_cma,
+ HPG_zeroed,
__NR_HPAGEFLAGS,
};
@@ -658,6 +659,7 @@ HPAGEFLAG(Freed, freed)
HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)
HPAGEFLAG(Cma, cma)
+HPAGEFLAG(Zeroed, zeroed)
#ifdef CONFIG_HUGETLB_PAGE
@@ -705,7 +707,8 @@ int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list);
int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
void wait_for_freed_hugetlb_folios(void);
struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
- unsigned long addr, bool cow_from_owner);
+ unsigned long addr, bool cow_from_owner,
+ gfp_t gfp, bool *zeroed);
struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
nodemask_t *nmask, gfp_t gfp_mask,
bool allow_alloc_fallback);
@@ -1117,7 +1120,8 @@ static inline void wait_for_freed_hugetlb_folios(void)
static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
unsigned long addr,
- bool cow_from_owner)
+ bool cow_from_owner,
+ gfp_t gfp, bool *zeroed)
{
return NULL;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index de8361b503d2..4f0ed01f5b13 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1744,6 +1744,9 @@ void free_huge_folio(struct folio *folio)
int nid = folio_nid(folio);
struct hugepage_subpool *spool = hugetlb_folio_subpool(folio);
bool restore_reserve;
unsigned long flags;
+
+ /* Page was mapped to userspace; no longer known-zero */
+ folio_clear_hugetlb_zeroed(folio);
VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
@@ -2146,6 +2149,10 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
if (!folio)
return NULL;
+ /* Mark as known-zero only if __GFP_ZERO was requested */
+ if (gfp_mask & __GFP_ZERO)
+ folio_set_hugetlb_zeroed(folio);
+
spin_lock_irq(&hugetlb_lock);
/*
* nr_huge_pages needs to be adjusted within the same lock cycle
@@ -2209,11 +2216,11 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
*/
static
struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
- struct vm_area_struct *vma, unsigned long addr)
+ struct vm_area_struct *vma, unsigned long addr, gfp_t gfp)
{
struct folio *folio = NULL;
struct mempolicy *mpol;
- gfp_t gfp_mask = htlb_alloc_mask(h);
+ gfp_t gfp_mask = htlb_alloc_mask(h) | gfp;
int nid;
nodemask_t *nodemask;
@@ -2910,7 +2917,8 @@ typedef enum {
* When it's set, the allocation will bypass all vma level reservations.
*/
struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
- unsigned long addr, bool cow_from_owner)
+ unsigned long addr, bool cow_from_owner,
+ gfp_t gfp, bool *zeroed)
{
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
@@ -2919,7 +2927,9 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
map_chg_state map_chg;
int ret, idx;
struct hugetlb_cgroup *h_cg = NULL;
- gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+ bool from_pool;
+
+ gfp |= htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
idx = hstate_index(h);
@@ -2987,13 +2997,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
if (!folio) {
spin_unlock_irq(&hugetlb_lock);
- folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
+ folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr, gfp);
if (!folio)
goto out_uncharge_cgroup;
spin_lock_irq(&hugetlb_lock);
list_add(&folio->lru, &h->hugepage_activelist);
folio_ref_unfreeze(folio, 1);
- /* Fall through */
+ from_pool = false;
+ } else {
+ from_pool = true;
}
/*
@@ -3016,6 +3028,14 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
spin_unlock_irq(&hugetlb_lock);
+ if (zeroed) {
+ if (from_pool)
+ *zeroed = folio_test_hugetlb_zeroed(folio);
+ else
+ *zeroed = true; /* buddy-allocated, zeroed by post_alloc_hook */
+ folio_clear_hugetlb_zeroed(folio);
+ }
+
hugetlb_set_folio_subpool(folio, spool);
if (map_chg != MAP_CHG_ENFORCED) {
@@ -5004,7 +5024,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
/* Do not use reserve as it's private owned */
- new_folio = alloc_hugetlb_folio(dst_vma, addr, false);
+ new_folio = alloc_hugetlb_folio(dst_vma, addr, false, 0, NULL);
if (IS_ERR(new_folio)) {
folio_put(pte_folio);
ret = PTR_ERR(new_folio);
@@ -5533,7 +5553,7 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
* be acquired again before returning to the caller, as expected.
*/
spin_unlock(vmf->ptl);
- new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner);
+ new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner, 0, NULL);
if (IS_ERR(new_folio)) {
/*
@@ -5793,7 +5813,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
goto out;
}
- folio = alloc_hugetlb_folio(vma, vmf->address, false);
+ {
+ bool zeroed;
+
+ folio = alloc_hugetlb_folio(vma, vmf->address, false,
+ __GFP_ZERO, &zeroed);
if (IS_ERR(folio)) {
/*
* Returning error will result in faulting task being
@@ -5813,9 +5837,15 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
ret = 0;
goto out;
}
- folio_zero_user(folio, vmf->real_address);
+ /*
+ * Buddy-allocated pages are zeroed in post_alloc_hook().
+ * Pool pages bypass the allocator, zero them here.
+ */
+ if (!zeroed)
+ folio_zero_user(folio, vmf->real_address);
__folio_mark_uptodate(folio);
new_folio = true;
+ }
if (vma->vm_flags & VM_MAYSHARE) {
int err = hugetlb_add_to_page_cache(folio, mapping,
@@ -6252,7 +6282,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
goto out;
}
- folio = alloc_hugetlb_folio(dst_vma, dst_addr, false);
+ folio = alloc_hugetlb_folio(dst_vma, dst_addr, false, 0, NULL);
if (IS_ERR(folio)) {
pte_t *actual_pte = hugetlb_walk(dst_vma, dst_addr, PMD_SIZE);
if (actual_pte) {
@@ -6299,7 +6329,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
goto out;
}
- folio = alloc_hugetlb_folio(dst_vma, dst_addr, false);
+ folio = alloc_hugetlb_folio(dst_vma, dst_addr, false, 0, NULL);
if (IS_ERR(folio)) {
folio_put(*foliop);
ret = -ENOMEM;
--
MST
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH RFC v4 11/22] mm: memfd: skip zeroing for zeroed hugetlb pool pages
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (9 preceding siblings ...)
2026-04-26 21:48 ` [PATCH RFC v4 10/22] mm: hugetlb: use __GFP_ZERO and skip zeroing for zeroed pages Michael S. Tsirkin
@ 2026-04-26 21:48 ` Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 12/22] mm: remove arch vma_alloc_zeroed_movable_folio overrides Michael S. Tsirkin
` (10 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:48 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Muchun Song, Oscar Salvador, Hugh Dickins, Baolin Wang
gather_surplus_pages() pre-allocates hugetlb pages into the pool
during mmap. Pass __GFP_ZERO so the buddy allocator zeros these
pages, and alloc_surplus_hugetlb_folio() marks them with HPG_zeroed.
Add a bool *zeroed output parameter to alloc_hugetlb_folio_reserve()
so callers can check whether the pool page is known-zero. memfd's
memfd_alloc_folio() uses this to skip the explicit folio_zero_user()
when the page is already zero.
This avoids redundant zeroing for memfd hugetlb pages that were
pre-allocated into the pool and never mapped to userspace.
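The caller-side pattern can be sketched as a userspace model (illustrative names, not kernel API; `known_zero` stands in for HPG_zeroed):

```c
#include <stdbool.h>
#include <string.h>

/* Userspace model of the memfd_alloc_folio() change: the allocator
 * reports via *zeroed whether the pool page is already known-zero,
 * and the caller only pays for an explicit zeroing pass when it is
 * not. Illustrative sketch, not kernel code. */
struct pool_page {
	char data[16];
	bool known_zero;	/* models HPG_zeroed */
	int zero_passes;	/* counts explicit zeroing, for illustration */
};

/* Models alloc_hugetlb_folio_reserve(): hand out the flag, then
 * clear it, as folio_test/clear_hugetlb_zeroed() do, so stale state
 * cannot be trusted after the page is handed to a user. */
static struct pool_page *pool_alloc_reserve(struct pool_page *p, bool *zeroed)
{
	*zeroed = p->known_zero;
	p->known_zero = false;
	return p;
}

/* Models the memfd consumer: zero only when not already zero. */
static void memfd_get_page(struct pool_page *p)
{
	bool zeroed;

	pool_alloc_reserve(p, &zeroed);
	if (!zeroed) {
		memset(p->data, 0, sizeof(p->data));
		p->zero_passes++;
	}
}
```

A page pre-allocated with __GFP_ZERO and never mapped skips the second pass; a recycled page still gets zeroed.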
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
include/linux/hugetlb.h | 6 ++++--
mm/hugetlb.c | 11 +++++++++--
mm/memfd.c | 17 +++++++++++------
3 files changed, 24 insertions(+), 10 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 094714c607f9..93bb06a33f57 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -713,7 +713,8 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
nodemask_t *nmask, gfp_t gfp_mask,
bool allow_alloc_fallback);
struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
- nodemask_t *nmask, gfp_t gfp_mask);
+ nodemask_t *nmask, gfp_t gfp_mask,
+ bool *zeroed);
int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping,
pgoff_t idx);
@@ -1128,7 +1129,8 @@ static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
static inline struct folio *
alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
- nodemask_t *nmask, gfp_t gfp_mask)
+ nodemask_t *nmask, gfp_t gfp_mask,
+ bool *zeroed)
{
return NULL;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4f0ed01f5b13..f02583b9faab 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2241,7 +2241,7 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
}
struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
- nodemask_t *nmask, gfp_t gfp_mask)
+ nodemask_t *nmask, gfp_t gfp_mask, bool *zeroed)
{
struct folio *folio;
@@ -2257,6 +2257,12 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
h->resv_huge_pages--;
spin_unlock_irq(&hugetlb_lock);
+
+ if (zeroed && folio) {
+ *zeroed = folio_test_hugetlb_zeroed(folio);
+ folio_clear_hugetlb_zeroed(folio);
+ }
+
return folio;
}
@@ -2341,7 +2347,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
* It is okay to use NUMA_NO_NODE because we use numa_mem_id()
* down the road to pick the current node if that is the case.
*/
- folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
+ folio = alloc_surplus_hugetlb_folio(h,
+ htlb_alloc_mask(h) | __GFP_ZERO,
NUMA_NO_NODE, &alloc_nodemask,
USER_ADDR_NONE);
if (!folio) {
diff --git a/mm/memfd.c b/mm/memfd.c
index 919c2a53eb96..b9b44ed54db5 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -90,20 +90,24 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
if (nr_resv < 0)
return ERR_PTR(nr_resv);
+ {
+ bool zeroed;
+
folio = alloc_hugetlb_folio_reserve(h,
numa_node_id(),
NULL,
- gfp_mask);
+ gfp_mask,
+ &zeroed);
if (folio) {
u32 hash;
/*
- * Zero the folio to prevent information leaks to userspace.
- * Use folio_zero_user() which is optimized for huge/gigantic
- * pages. Pass 0 as addr_hint since this is not a faulting path
- * and we don't have a user virtual address yet.
+ * Zero the folio to prevent information leaks to
+ * userspace. Skip if the pool page is known-zero
+ * (HPG_zeroed set during pool pre-allocation).
*/
- folio_zero_user(folio, 0);
+ if (!zeroed)
+ folio_zero_user(folio, 0);
/*
* Mark the folio uptodate before adding to page cache,
@@ -139,6 +143,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
hugetlb_unreserve_pages(inode, idx, idx + 1, 0);
return ERR_PTR(err);
}
+ }
#endif
return shmem_read_folio(memfd->f_mapping, idx);
}
--
MST
* [PATCH RFC v4 12/22] mm: remove arch vma_alloc_zeroed_movable_folio overrides
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (10 preceding siblings ...)
2026-04-26 21:48 ` [PATCH RFC v4 11/22] mm: memfd: skip zeroing for zeroed hugetlb pool pages Michael S. Tsirkin
@ 2026-04-26 21:48 ` Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 13/22] mm: page_alloc: propagate PageReported flag across buddy splits Michael S. Tsirkin
` (9 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:48 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Magnus Lindholm, Greg Ungerer, Geert Uytterhoeven,
Richard Henderson, Matt Turner, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-alpha, linux-m68k, linux-s390
Now that the generic vma_alloc_zeroed_movable_folio() uses
__GFP_ZERO, the arch-specific macros on alpha, m68k, s390, and
x86 that did the same thing are redundant. Remove them.
arm64 is not affected: it has a real function override that
handles MTE tag zeroing, not just __GFP_ZERO.
Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: Magnus Lindholm <linmag7@gmail.com>
Acked-by: Greg Ungerer <gerg@linux-m68k.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
arch/alpha/include/asm/page.h | 3 ---
arch/m68k/include/asm/page_no.h | 3 ---
arch/s390/include/asm/page.h | 3 ---
arch/x86/include/asm/page.h | 3 ---
4 files changed, 12 deletions(-)
diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index 59d01f9b77f6..4327029cd660 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -12,9 +12,6 @@
extern void clear_page(void *page);
-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
- vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
-
extern void copy_page(void * _to, void * _from);
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h
index d2532bc407ef..f511b763a235 100644
--- a/arch/m68k/include/asm/page_no.h
+++ b/arch/m68k/include/asm/page_no.h
@@ -12,9 +12,6 @@ extern unsigned long memory_end;
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
- vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
-
#define __pa(vaddr) ((unsigned long)(vaddr))
#define __va(paddr) ((void *)((unsigned long)(paddr)))
diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index f339258135f7..04020a19a5cf 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -67,9 +67,6 @@ static inline void copy_page(void *to, void *from)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
- vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
-
#ifdef CONFIG_STRICT_MM_TYPECHECKS
#define STRICT_MM_TYPECHECKS
#endif
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 416dc88e35c1..92fa975b46f3 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -28,9 +28,6 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
copy_page(to, from);
}
-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
- vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
-
#ifndef __pa
#define __pa(x) __phys_addr((unsigned long)(x))
#endif
--
MST
* [PATCH RFC v4 13/22] mm: page_alloc: propagate PageReported flag across buddy splits
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (11 preceding siblings ...)
2026-04-26 21:48 ` [PATCH RFC v4 12/22] mm: remove arch vma_alloc_zeroed_movable_folio overrides Michael S. Tsirkin
@ 2026-04-26 21:48 ` Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 14/22] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (8 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:48 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Johannes Weiner, Zi Yan
When a reported free page is split via expand() to satisfy a
smaller allocation, the sub-pages placed back on the free lists
lose the PageReported flag. This means they will be unnecessarily
re-reported to the hypervisor in the next reporting cycle, wasting
work.
Propagate the PageReported flag to sub-pages during expand() so
that they are recognized as already-reported.
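The split rule can be sketched as a userspace model (illustrative, not kernel code): each sub-block returned to the free lists during a split inherits the parent's reported state.

```c
#include <stdbool.h>

/* Userspace model of the expand() change: splitting a reported
 * order-`high` block down to order-`low` places one buddy of each
 * order high-1 .. low back on the free lists, and each inherits the
 * reported flag, since the host already knows about the whole
 * parent range. Illustrative sketch, not kernel code. */
#define MAX_ORDER_MODEL 16

struct free_block {
	int order;
	bool reported;
};

/* Returns the number of sub-blocks produced; fills blocks[]. */
static int split_block(int low, int high, bool parent_reported,
		       struct free_block blocks[MAX_ORDER_MODEL])
{
	int n = 0;

	for (int order = high - 1; order >= low; order--) {
		blocks[n].order = order;
		blocks[n].reported = parent_reported;
		n++;
	}
	return n;
}
```

Splitting an order-3 reported block to order 0 yields three reported sub-blocks (orders 2, 1, 0) that the reporting cycle will now skip.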
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
mm/page_alloc.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7791bc1eeefa..ca4f9c0948af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1730,7 +1730,7 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
* -- nyc
*/
static inline unsigned int expand(struct zone *zone, struct page *page, int low,
- int high, int migratetype)
+ int high, int migratetype, bool reported)
{
unsigned int size = 1 << high;
unsigned int nr_added = 0;
@@ -1752,6 +1752,15 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
__add_to_free_list(&page[size], zone, high, migratetype, false);
set_buddy_order(&page[size], high);
nr_added += size;
+
+ /*
+ * The parent page has been reported to the host. The
+ * sub-pages are part of the same reported block, so mark
+ * them reported too. This avoids re-reporting pages that
+ * the host already knows about.
+ */
+ if (reported)
+ __SetPageReported(&page[size]);
}
return nr_added;
@@ -1762,9 +1771,10 @@ static __always_inline void page_del_and_expand(struct zone *zone,
int high, int migratetype)
{
int nr_pages = 1 << high;
+ bool was_reported = page_reported(page);
__del_page_from_free_list(page, zone, high, migratetype);
- nr_pages -= expand(zone, page, low, high, migratetype);
+ nr_pages -= expand(zone, page, low, high, migratetype, was_reported);
account_freepages(zone, -nr_pages, migratetype);
}
@@ -2334,7 +2344,8 @@ try_to_claim_block(struct zone *zone, struct page *page,
del_page_from_free_list(page, zone, current_order, block_type);
change_pageblock_range(page, current_order, start_type);
- nr_added = expand(zone, page, order, current_order, start_type);
+ nr_added = expand(zone, page, order, current_order, start_type,
+ false);
account_freepages(zone, nr_added, start_type);
return page;
}
--
MST
* [PATCH RFC v4 14/22] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (12 preceding siblings ...)
2026-04-26 21:48 ` [PATCH RFC v4 13/22] mm: page_alloc: propagate PageReported flag across buddy splits Michael S. Tsirkin
@ 2026-04-26 21:48 ` Michael S. Tsirkin
2026-04-27 15:13 ` Zi Yan
2026-04-26 21:48 ` [PATCH RFC v4 15/22] mm: page_alloc: clear PG_zeroed on buddy merge if not both zero Michael S. Tsirkin
` (7 subsequent siblings)
21 siblings, 1 reply; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:48 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Johannes Weiner,
Zi Yan
When a guest reports free pages to the hypervisor via the page reporting
framework (used by virtio-balloon and hv_balloon), the host typically
zeros those pages when reclaiming their backing memory. However, when
those pages are later allocated in the guest, post_alloc_hook()
zeros them again whenever __GFP_ZERO is set, even though they are
already zero. This double-zeroing is wasteful, especially for large
pages.
Avoid redundant zeroing:
- Add a host_zeroes_pages flag to page_reporting_dev_info, allowing
drivers to declare that their host zeros reported pages on reclaim.
A static key (page_reporting_host_zeroes) gates the fast path.
- Add PG_zeroed page flag (sharing PG_private bit) to mark pages
that have been zeroed by the host. Set it on reported pages during
allocation from the buddy in page_del_and_expand().
- Thread the zeroed bool through rmqueue -> prep_new_page ->
post_alloc_hook, where it skips redundant zeroing for __GFP_ZERO
allocations.
No driver sets host_zeroes_pages yet; a follow-up patch to
virtio_balloon is needed to opt in.
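The init decision added to post_alloc_hook() reduces to a small predicate, sketched here as a userspace model (illustrative, not kernel code):

```c
#include <stdbool.h>

/* Userspace model of the post_alloc_hook() change: memory
 * initialization is skipped when the page is already host-zeroed,
 * unless memory tags must still be zeroed, since the host knows
 * nothing about tags. Illustrative sketch, not kernel code. */
static bool need_init(bool want_init, bool zeroed, bool zero_tags)
{
	bool init = want_init;

	/* Mirrors: if (zeroed && init && !zero_tags) init = false; */
	if (zeroed && init && !zero_tags)
		init = false;
	return init;
}
```

Only the (host-zeroed, no-tags) case skips initialization; tagged allocations and non-reported pages still take the normal path.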
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
include/linux/page-flags.h | 9 +++++
include/linux/page_reporting.h | 3 ++
mm/compaction.c | 6 ++--
mm/internal.h | 2 +-
mm/page_alloc.c | 66 +++++++++++++++++++++++-----------
mm/page_reporting.c | 14 +++++++-
mm/page_reporting.h | 12 +++++++
7 files changed, 87 insertions(+), 25 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f7a0e4af0c73..eef2499cba8b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -135,6 +135,8 @@ enum pageflags {
PG_swapcache = PG_owner_priv_1, /* Swap page: swp_entry_t in private */
/* Some filesystems */
PG_checked = PG_owner_priv_1,
+ /* Page contents are known to be zero */
+ PG_zeroed = PG_private,
/*
* Depending on the way an anonymous folio can be mapped into a page
@@ -679,6 +681,13 @@ FOLIO_TEST_CLEAR_FLAG_FALSE(young)
FOLIO_FLAG_FALSE(idle)
#endif
+/*
+ * PageZeroed() tracks pages known to be zero. The allocator
+ * uses this to skip redundant zeroing in post_alloc_hook().
+ */
+__PAGEFLAG(Zeroed, zeroed, PF_NO_COMPOUND)
+#define __PG_ZEROED (1UL << PG_zeroed)
+
/*
* PageReported() is used to track reported free pages within the Buddy
* allocator. We can use the non-atomic version of the test and set
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index fe648dfa3a7c..10faadfeb4fb 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -13,6 +13,9 @@ struct page_reporting_dev_info {
int (*report)(struct page_reporting_dev_info *prdev,
struct scatterlist *sg, unsigned int nents);
+ /* If true, host zeros reported pages on reclaim */
+ bool host_zeroes_pages;
+
/* work struct for processing reports */
struct delayed_work work;
diff --git a/mm/compaction.c b/mm/compaction.c
index c1039a9373e5..61209cd408ea 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -82,7 +82,8 @@ static inline bool is_via_compact_memory(int order) { return false; }
static struct page *mark_allocated_noprof(struct page *page, unsigned int order, gfp_t gfp_flags)
{
- post_alloc_hook(page, order, __GFP_MOVABLE, USER_ADDR_NONE);
+ __ClearPageZeroed(page);
+ post_alloc_hook(page, order, __GFP_MOVABLE, false, USER_ADDR_NONE);
set_page_refcounted(page);
return page;
}
@@ -1832,7 +1833,8 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
set_page_private(&freepage[size], start_order);
}
dst = (struct folio *)freepage;
- post_alloc_hook(&dst->page, order, __GFP_MOVABLE, USER_ADDR_NONE);
+ __ClearPageZeroed(&dst->page);
+ post_alloc_hook(&dst->page, order, __GFP_MOVABLE, false, USER_ADDR_NONE);
set_page_refcounted(&dst->page);
if (order)
prep_compound_page(&dst->page, order);
diff --git a/mm/internal.h b/mm/internal.h
index 8e4616e42b4a..0600d824ba03 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -889,7 +889,7 @@ static inline void prep_compound_tail(struct page *head, int tail_idx)
}
void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
- unsigned long user_addr);
+ bool zeroed, unsigned long user_addr);
extern bool free_pages_prepare(struct page *page, unsigned int order);
extern int user_min_free_kbytes;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ca4f9c0948af..eff01a819744 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1774,6 +1774,7 @@ static __always_inline void page_del_and_expand(struct zone *zone,
bool was_reported = page_reported(page);
__del_page_from_free_list(page, zone, high, migratetype);
+
nr_pages -= expand(zone, page, low, high, migratetype, was_reported);
account_freepages(zone, -nr_pages, migratetype);
}
@@ -1846,8 +1847,10 @@ static inline bool should_skip_init(gfp_t flags)
return (flags & __GFP_SKIP_ZERO);
}
+
inline void post_alloc_hook(struct page *page, unsigned int order,
- gfp_t gfp_flags, unsigned long user_addr)
+ gfp_t gfp_flags, bool zeroed,
+ unsigned long user_addr)
{
bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
!should_skip_init(gfp_flags);
@@ -1856,6 +1859,14 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
set_page_private(page, 0);
+ /*
+ * If the page is zeroed, skip memory initialization.
+ * We still need to handle tag zeroing separately since the host
+ * does not know about memory tags.
+ */
+ if (zeroed && init && !zero_tags)
+ init = false;
+
arch_alloc_page(page, order);
debug_pagealloc_map_pages(page, 1 << order);
@@ -1913,13 +1924,13 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
}
static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
- unsigned int alloc_flags,
- unsigned long user_addr)
+ unsigned int alloc_flags, bool zeroed,
+ unsigned long user_addr)
{
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
- post_alloc_hook(page, order, gfp_flags, user_addr);
+ post_alloc_hook(page, order, gfp_flags, zeroed, user_addr);
/*
* page is set pfmemalloc when ALLOC_NO_WATERMARKS was necessary to
@@ -3189,6 +3200,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
}
del_page_from_free_list(page, zone, order, mt);
+ __ClearPageZeroed(page);
/*
* Set the pageblock if the isolated page is at least half of a
@@ -3261,7 +3273,7 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
static __always_inline
struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
unsigned int order, unsigned int alloc_flags,
- int migratetype)
+ int migratetype, bool *zeroed)
{
struct page *page;
unsigned long flags;
@@ -3296,6 +3308,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
}
}
spin_unlock_irqrestore(&zone->lock, flags);
+ *zeroed = PageZeroed(page);
+ __ClearPageZeroed(page);
} while (check_new_pages(page, order));
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3357,10 +3371,9 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
/* Remove page from the per-cpu list, caller must protect the list */
static inline
struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
- int migratetype,
- unsigned int alloc_flags,
+ int migratetype, unsigned int alloc_flags,
struct per_cpu_pages *pcp,
- struct list_head *list)
+ struct list_head *list, bool *zeroed)
{
struct page *page;
@@ -3381,6 +3394,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
page = list_first_entry(list, struct page, pcp_list);
list_del(&page->pcp_list);
pcp->count -= 1 << order;
+ *zeroed = PageZeroed(page);
+ __ClearPageZeroed(page);
} while (check_new_pages(page, order));
return page;
@@ -3389,7 +3404,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
/* Lock and remove page from the per-cpu list */
static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct zone *zone, unsigned int order,
- int migratetype, unsigned int alloc_flags)
+ int migratetype, unsigned int alloc_flags,
+ bool *zeroed)
{
struct per_cpu_pages *pcp;
struct list_head *list;
@@ -3408,7 +3424,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
*/
pcp->free_count >>= 1;
list = &pcp->lists[order_to_pindex(migratetype, order)];
- page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
+ page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags,
+ pcp, list, zeroed);
pcp_spin_unlock(pcp, UP_flags);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3433,19 +3450,19 @@ static inline
struct page *rmqueue(struct zone *preferred_zone,
struct zone *zone, unsigned int order,
gfp_t gfp_flags, unsigned int alloc_flags,
- int migratetype)
+ int migratetype, bool *zeroed)
{
struct page *page;
if (likely(pcp_allowed_order(order))) {
page = rmqueue_pcplist(preferred_zone, zone, order,
- migratetype, alloc_flags);
+ migratetype, alloc_flags, zeroed);
if (likely(page))
goto out;
}
page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
- migratetype);
+ migratetype, zeroed);
out:
/* Separate test+clear to avoid unnecessary atomics */
@@ -3836,6 +3853,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
struct pglist_data *last_pgdat = NULL;
bool last_pgdat_dirty_ok = false;
bool no_fallback;
+ bool zeroed;
bool skip_kswapd_nodes = nr_online_nodes > 1;
bool skipped_kswapd_nodes = false;
@@ -3980,10 +3998,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
try_this_zone:
page = rmqueue(zonelist_zone(ac->preferred_zoneref), zone, order,
- gfp_mask, alloc_flags, ac->migratetype);
+ gfp_mask, alloc_flags, ac->migratetype,
+ &zeroed);
if (page) {
prep_new_page(page, order, gfp_mask, alloc_flags,
- ac->user_addr);
+ zeroed, ac->user_addr);
/*
* If this is a high-order atomic allocation then check
@@ -4217,9 +4236,11 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
count_vm_event(COMPACTSTALL);
/* Prep a captured page if available */
- if (page)
- prep_new_page(page, order, gfp_mask, alloc_flags,
+ if (page) {
+ __ClearPageZeroed(page);
+ prep_new_page(page, order, gfp_mask, alloc_flags, false,
ac->user_addr);
+ }
/* Try get a page from the freelist if available */
if (!page)
@@ -5193,6 +5214,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
/* Attempt the batch allocation */
pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
while (nr_populated < nr_pages) {
+ bool zeroed = false;
/* Skip existing pages */
if (page_array[nr_populated]) {
@@ -5201,7 +5223,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
}
page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
- pcp, pcp_list);
+ pcp, pcp_list, &zeroed);
if (unlikely(!page)) {
/* Try and allocate at least one page */
if (!nr_account) {
@@ -5212,7 +5234,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
}
nr_account++;
- prep_new_page(page, 0, gfp, 0, USER_ADDR_NONE);
+ prep_new_page(page, 0, gfp, 0, zeroed, USER_ADDR_NONE);
set_page_refcounted(page);
page_array[nr_populated++] = page;
}
@@ -6983,7 +7005,8 @@ static void split_free_frozen_pages(struct list_head *list, gfp_t gfp_mask)
list_for_each_entry_safe(page, next, &list[order], lru) {
int i;
- post_alloc_hook(page, order, gfp_mask, USER_ADDR_NONE);
+ __ClearPageZeroed(page);
+ post_alloc_hook(page, order, gfp_mask, false, USER_ADDR_NONE);
if (!order)
continue;
@@ -7188,8 +7211,9 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
} else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) {
struct page *head = pfn_to_page(start);
+ __ClearPageZeroed(head);
check_new_pages(head, order);
- prep_new_page(head, order, gfp_mask, 0, USER_ADDR_NONE);
+ prep_new_page(head, order, gfp_mask, 0, false, USER_ADDR_NONE);
} else {
ret = -EINVAL;
WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index f0042d5743af..6177d2413743 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -50,6 +50,8 @@ EXPORT_SYMBOL_GPL(page_reporting_order);
#define PAGE_REPORTING_DELAY (2 * HZ)
static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
+DEFINE_STATIC_KEY_FALSE(page_reporting_host_zeroes);
+
enum {
PAGE_REPORTING_IDLE = 0,
PAGE_REPORTING_REQUESTED,
@@ -129,8 +131,11 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
* report on the new larger page when we make our way
* up to that higher order.
*/
- if (PageBuddy(page) && buddy_order(page) == order)
+ if (PageBuddy(page) && buddy_order(page) == order) {
__SetPageReported(page);
+ if (page_reporting_host_zeroes_pages())
+ __SetPageZeroed(page);
+ }
} while ((sg = sg_next(sg)));
/* reinitialize scatterlist now that it is empty */
@@ -386,6 +391,10 @@ int page_reporting_register(struct page_reporting_dev_info *prdev)
/* Assign device to allow notifications */
rcu_assign_pointer(pr_dev_info, prdev);
+ /* enable zeroed page optimization if host zeroes reported pages */
+ if (prdev->host_zeroes_pages)
+ static_branch_enable(&page_reporting_host_zeroes);
+
/* enable page reporting notification */
if (!static_key_enabled(&page_reporting_enabled)) {
static_branch_enable(&page_reporting_enabled);
@@ -410,6 +419,9 @@ void page_reporting_unregister(struct page_reporting_dev_info *prdev)
/* Flush any existing work, and lock it out */
cancel_delayed_work_sync(&prdev->work);
+
+ if (prdev->host_zeroes_pages)
+ static_branch_disable(&page_reporting_host_zeroes);
}
mutex_unlock(&page_reporting_mutex);
diff --git a/mm/page_reporting.h b/mm/page_reporting.h
index c51dbc228b94..736ea7b37e9e 100644
--- a/mm/page_reporting.h
+++ b/mm/page_reporting.h
@@ -15,6 +15,13 @@ DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
extern unsigned int page_reporting_order;
void __page_reporting_notify(void);
+DECLARE_STATIC_KEY_FALSE(page_reporting_host_zeroes);
+
+static inline bool page_reporting_host_zeroes_pages(void)
+{
+ return static_branch_unlikely(&page_reporting_host_zeroes);
+}
+
static inline bool page_reported(struct page *page)
{
return static_branch_unlikely(&page_reporting_enabled) &&
@@ -46,6 +53,11 @@ static inline void page_reporting_notify_free(unsigned int order)
#else /* CONFIG_PAGE_REPORTING */
#define page_reported(_page) false
+static inline bool page_reporting_host_zeroes_pages(void)
+{
+ return false;
+}
+
static inline void page_reporting_notify_free(unsigned int order)
{
}
--
MST
* [PATCH RFC v4 15/22] mm: page_alloc: clear PG_zeroed on buddy merge if not both zero
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (13 preceding siblings ...)
2026-04-26 21:48 ` [PATCH RFC v4 14/22] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
@ 2026-04-26 21:48 ` Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 16/22] mm: page_alloc: preserve PG_zeroed in page_del_and_expand Michael S. Tsirkin
` (6 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:48 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Johannes Weiner, Zi Yan
When two buddy pages merge in __free_one_page(), preserve
PG_zeroed on the merged page only if both buddies have the
flag set. Otherwise clear it.
Without this, a zeroed page (freed via free_frozen_pages_zeroed
from balloon deflate) could merge with a non-zero buddy. The merged
page would inherit PG_zeroed, and a later __GFP_ZERO allocation
would skip zeroing stale data in the non-zero half.
The page reporting path is not affected: it sets PG_zeroed during
allocation (page_del_and_expand), not on free list pages.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
mm/page_alloc.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eff01a819744..1183ef3e91c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -984,10 +984,14 @@ static inline void __free_one_page(struct page *page,
unsigned long buddy_pfn = 0;
unsigned long combined_pfn;
struct page *buddy;
+ bool buddy_zeroed;
+ bool page_zeroed;
bool to_tail;
VM_BUG_ON(!zone_is_initialized(zone));
- VM_BUG_ON_PAGE(page->flags.f & PAGE_FLAGS_CHECK_AT_PREP, page);
+ /* PG_zeroed (aliased to PG_private) is valid on free-list pages */
+ VM_BUG_ON_PAGE(page->flags.f &
+ (PAGE_FLAGS_CHECK_AT_PREP & ~__PG_ZEROED), page);
VM_BUG_ON(migratetype == -1);
VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
@@ -1022,6 +1026,8 @@ static inline void __free_one_page(struct page *page,
goto done_merging;
}
+ buddy_zeroed = PageZeroed(buddy);
+
/*
* Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
* merge with it and move up one order.
@@ -1040,10 +1046,17 @@ static inline void __free_one_page(struct page *page,
change_pageblock_range(buddy, order, migratetype);
}
+ page_zeroed = PageZeroed(page);
+ __ClearPageZeroed(page);
+ __ClearPageZeroed(buddy);
+
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
pfn = combined_pfn;
order++;
+
+ if (page_zeroed && buddy_zeroed)
+ __SetPageZeroed(page);
}
done_merging:
--
MST
* [PATCH RFC v4 16/22] mm: page_alloc: preserve PG_zeroed in page_del_and_expand
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (14 preceding siblings ...)
2026-04-26 21:48 ` [PATCH RFC v4 15/22] mm: page_alloc: clear PG_zeroed on buddy merge if not both zero Michael S. Tsirkin
@ 2026-04-26 21:48 ` Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 17/22] mm: page_reporting: add per-page zeroed bitmap for host feedback Michael S. Tsirkin
` (5 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:48 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Johannes Weiner, Zi Yan
Don't unconditionally clear PG_zeroed for non-reported pages in
page_del_and_expand(). Pages freed via free_frozen_pages_zeroed
(balloon deflate) already have the flag set and should keep it
through buddy allocation, not just PCP reuse.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
mm/page_alloc.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1183ef3e91c9..1169714406e7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1743,7 +1743,8 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
* -- nyc
*/
static inline unsigned int expand(struct zone *zone, struct page *page, int low,
- int high, int migratetype, bool reported)
+ int high, int migratetype, bool reported,
+ bool zeroed)
{
unsigned int size = 1 << high;
unsigned int nr_added = 0;
@@ -1774,6 +1775,8 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
*/
if (reported)
__SetPageReported(&page[size]);
+ if (zeroed)
+ __SetPageZeroed(&page[size]);
}
return nr_added;
@@ -1785,10 +1788,12 @@ static __always_inline void page_del_and_expand(struct zone *zone,
{
int nr_pages = 1 << high;
bool was_reported = page_reported(page);
+ bool was_zeroed = PageZeroed(page);
__del_page_from_free_list(page, zone, high, migratetype);
- nr_pages -= expand(zone, page, low, high, migratetype, was_reported);
+ nr_pages -= expand(zone, page, low, high, migratetype, was_reported,
+ was_zeroed);
account_freepages(zone, -nr_pages, migratetype);
}
@@ -2365,11 +2370,13 @@ try_to_claim_block(struct zone *zone, struct page *page,
/* Take ownership for orders >= pageblock_order */
if (current_order >= pageblock_order) {
unsigned int nr_added;
+ bool was_reported = page_reported(page);
+ bool was_zeroed = PageZeroed(page);
del_page_from_free_list(page, zone, current_order, block_type);
change_pageblock_range(page, current_order, start_type);
nr_added = expand(zone, page, order, current_order, start_type,
- false);
+ was_reported, was_zeroed);
account_freepages(zone, nr_added, start_type);
return page;
}
--
MST
* [PATCH RFC v4 17/22] mm: page_reporting: add per-page zeroed bitmap for host feedback
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (15 preceding siblings ...)
2026-04-26 21:48 ` [PATCH RFC v4 16/22] mm: page_alloc: preserve PG_zeroed in page_del_and_expand Michael S. Tsirkin
@ 2026-04-26 21:48 ` Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 18/22] virtio_balloon: a hack to enable host-zeroed page optimization Michael S. Tsirkin
` (4 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:48 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Johannes Weiner,
Zi Yan
The host may skip zeroing some reported pages (e.g., due to alignment
constraints or bounce buffer fallback in QEMU). Currently, when
host_zeroes_pages is set, all reported pages are unconditionally
marked PG_zeroed -- even ones the host did not actually zero.
Add a zeroed_bitmap to page_reporting_dev_info that the report()
callback can use to indicate which pages were actually zeroed.
Before calling report(), the framework initializes the bitmap:
all bits set if host_zeroes_pages (optimistic default), all clear
otherwise. The driver's report() can then clear bits for pages
that were not zeroed.
page_reporting_drain() checks the bitmap per-page instead of the
global host_zeroes_pages flag.
No driver uses per-page feedback yet; the bitmap is pre-filled
based on host_zeroes_pages, so behavior is unchanged.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
include/linux/page_reporting.h | 7 +++++++
mm/page_reporting.c | 8 ++++++--
2 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index 10faadfeb4fb..5b63de21949c 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -16,6 +16,13 @@ struct page_reporting_dev_info {
/* If true, host zeros reported pages on reclaim */
bool host_zeroes_pages;
+ /*
+ * Per-page zeroed status, indexed by scatterlist position.
+ * The driver's report() callback must clear the bitmap,
+ * then set bits for pages that were actually zeroed.
+ */
+ DECLARE_BITMAP(zeroed_bitmap, PAGE_REPORTING_CAPACITY);
+
/* work struct for processing reports */
struct delayed_work work;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 6177d2413743..61f7f08d02d4 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -108,6 +108,7 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
struct scatterlist *sgl, unsigned int nents, bool reported)
{
struct scatterlist *sg = sgl;
+ unsigned int i = 0;
/*
* Drain the now reported pages back into their respective
@@ -122,7 +123,7 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
/* If the pages were not reported due to error skip flagging */
if (!reported)
- continue;
+ goto next;
/*
* If page was not commingled with another page we can
@@ -133,9 +134,12 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
*/
if (PageBuddy(page) && buddy_order(page) == order) {
__SetPageReported(page);
- if (page_reporting_host_zeroes_pages())
+ if (page_reporting_host_zeroes_pages() &&
+ test_bit(i, prdev->zeroed_bitmap))
__SetPageZeroed(page);
}
+next:
+ i++;
} while ((sg = sg_next(sg)));
/* reinitialize scatterlist now that it is empty */
--
MST
* [PATCH RFC v4 18/22] virtio_balloon: a hack to enable host-zeroed page optimization
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (16 preceding siblings ...)
2026-04-26 21:48 ` [PATCH RFC v4 17/22] mm: page_reporting: add per-page zeroed bitmap for host feedback Michael S. Tsirkin
@ 2026-04-26 21:48 ` Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 19/22] mm: page_reporting: add flush parameter with page budget Michael S. Tsirkin
` (3 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:48 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Xuan Zhuo, Eugenio Pérez
Add a module parameter host_zeroes_pages to opt in to the zeroed
page optimization. A proper virtio feature flag is needed before
this can be merged.
insmod virtio_balloon.ko host_zeroes_pages=1
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
drivers/virtio/virtio_balloon.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index d1fbc8fe8470..8c15530b51b3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -19,6 +19,11 @@
#include <linux/mm.h>
#include <linux/page_reporting.h>
+static bool host_zeroes_pages;
+module_param(host_zeroes_pages, bool, 0444);
+MODULE_PARM_DESC(host_zeroes_pages,
+ "Host zeroes reported pages, skip guest re-zeroing");
+
/*
* Balloon device works in 4K page units. So each page is pointed to by
* multiple balloon pages. All memory counters in this driver are in balloon
@@ -204,6 +209,8 @@ static int virtballoon_free_page_report(struct page_reporting_dev_info *pr_dev_i
struct virtqueue *vq = vb->reporting_vq;
unsigned int unused, err;
+ bitmap_zero(pr_dev_info->zeroed_bitmap, nents);
+
/* We should always be able to add these buffers to an empty queue. */
err = virtqueue_add_inbuf(vq, sg, nents, vb, GFP_NOWAIT);
@@ -220,6 +227,9 @@ static int virtballoon_free_page_report(struct page_reporting_dev_info *pr_dev_i
/* When host has read buffer, this completes via balloon_ack */
wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
+ if (host_zeroes_pages)
+ bitmap_fill(pr_dev_info->zeroed_bitmap, nents);
+
return 0;
}
@@ -1039,6 +1049,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
vb->pr_dev_info.order = 5;
#endif
+ /* TODO: needs a virtio feature flag */
+ vb->pr_dev_info.host_zeroes_pages = host_zeroes_pages;
err = page_reporting_register(&vb->pr_dev_info);
if (err)
goto out_unregister_oom;
--
MST
* [PATCH RFC v4 19/22] mm: page_reporting: add flush parameter with page budget
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (17 preceding siblings ...)
2026-04-26 21:48 ` [PATCH RFC v4 18/22] virtio_balloon: a hack to enable host-zeroed page optimization Michael S. Tsirkin
@ 2026-04-26 21:48 ` Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 20/22] mm: add free_frozen_pages_zeroed Michael S. Tsirkin
` (2 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:48 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Johannes Weiner, Zi Yan
Add a write-only module parameter 'flush' that triggers immediate
page reporting. The value specifies approximately how many pages
(at page_reporting_order) to report. The flush loops through
reporting cycles, each processing up to PAGE_REPORTING_CAPACITY
pages, until the budget is exhausted, all pages are reported, or
a signal is pending.
This is useful when a lot of memory is freed quickly and a
single reporting cycle cannot process all free pages due to
internal budget limits.
echo 512 > /sys/module/page_reporting/parameters/flush
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
mm/page_reporting.c | 42 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 61f7f08d02d4..05bdcf89f3e3 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -358,6 +358,48 @@ static void page_reporting_process(struct work_struct *work)
static DEFINE_MUTEX(page_reporting_mutex);
DEFINE_STATIC_KEY_FALSE(page_reporting_enabled);
+static int page_reporting_flush_set(const char *val,
+ const struct kernel_param *kp)
+{
+ struct page_reporting_dev_info *prdev;
+ unsigned int budget;
+ int err;
+
+ err = kstrtouint(val, 0, &budget);
+ if (err)
+ return err;
+ if (!budget)
+ return 0;
+
+ mutex_lock(&page_reporting_mutex);
+ prdev = rcu_dereference_protected(pr_dev_info,
+ lockdep_is_held(&page_reporting_mutex));
+ if (prdev) {
+ unsigned int reported;
+
+ for (reported = 0; reported < budget;
+ reported += PAGE_REPORTING_CAPACITY) {
+ flush_delayed_work(&prdev->work);
+ __page_reporting_request(prdev);
+ flush_delayed_work(&prdev->work);
+ if (atomic_read(&prdev->state) == PAGE_REPORTING_IDLE)
+ break;
+ if (signal_pending(current))
+ break;
+ }
+ }
+ mutex_unlock(&page_reporting_mutex);
+ return 0;
+}
+
+static const struct kernel_param_ops flush_ops = {
+ .set = page_reporting_flush_set,
+ .get = param_get_uint,
+};
+static unsigned int page_reporting_flush;
+module_param_cb(flush, &flush_ops, &page_reporting_flush, 0200);
+MODULE_PARM_DESC(flush, "Report up to N pages at page_reporting_order");
+
int page_reporting_register(struct page_reporting_dev_info *prdev)
{
int err = 0;
--
MST
* [PATCH RFC v4 20/22] mm: add free_frozen_pages_zeroed
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (18 preceding siblings ...)
2026-04-26 21:48 ` [PATCH RFC v4 19/22] mm: page_reporting: add flush parameter with page budget Michael S. Tsirkin
@ 2026-04-26 21:48 ` Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 21/22] mm: add put_page_zeroed and folio_put_zeroed Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 22/22] virtio_balloon: mark deflated pages as zeroed Michael S. Tsirkin
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:48 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Johannes Weiner,
Zi Yan
Add free_frozen_pages_zeroed(page, order) to free a frozen page
while marking it as zeroed, so the next allocation can skip
redundant zeroing.
An FPI_ZEROED internal flag carries the hint through the free path.
PageZeroed is set after __free_pages_prepare() clears all flags,
so the hint survives on the free list.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
include/linux/gfp.h | 1 +
mm/internal.h | 1 -
mm/page_alloc.c | 21 ++++++++++++++++++++-
3 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index e275cc80e19e..766b1c7f0731 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -394,6 +394,7 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
extern void __free_pages(struct page *page, unsigned int order);
extern void free_pages_nolock(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
+void free_frozen_pages_zeroed(struct page *page, unsigned int order);
#define __free_page(page) __free_pages((page), 0)
#define free_page(addr) free_pages((addr), 0)
diff --git a/mm/internal.h b/mm/internal.h
index 0600d824ba03..60b983872d51 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -899,7 +899,6 @@ struct page *__alloc_frozen_pages_noprof(gfp_t, unsigned int order, int nid,
#define __alloc_frozen_pages(...) \
alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
void free_frozen_pages(struct page *page, unsigned int order);
-void free_frozen_pages_zeroed(struct page *page, unsigned int order);
void free_unref_folios(struct folio_batch *fbatch);
#ifdef CONFIG_NUMA
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1169714406e7..981ddf3566b5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -90,6 +90,13 @@ typedef int __bitwise fpi_t;
/* Free the page without taking locks. Rely on trylock only. */
#define FPI_TRYLOCK ((__force fpi_t)BIT(2))
+/*
+ * The page contents are known to be zero (e.g., the host zeroed them
+ * during balloon deflate). Set PageZeroed after free so the next
+ * allocation can skip redundant zeroing.
+ */
+#define FPI_ZEROED ((__force fpi_t)BIT(3))
+
/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1624,8 +1631,11 @@ static void __free_pages_ok(struct page *page, unsigned int order,
unsigned long pfn = page_to_pfn(page);
struct zone *zone = page_zone(page);
- if (__free_pages_prepare(page, order, fpi_flags))
+ if (__free_pages_prepare(page, order, fpi_flags)) {
+ if (fpi_flags & FPI_ZEROED)
+ __SetPageZeroed(page);
free_one_page(zone, page, pfn, order, fpi_flags);
+ }
}
void __meminit __free_pages_core(struct page *page, unsigned int order,
@@ -3032,6 +3042,9 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
if (!__free_pages_prepare(page, order, fpi_flags))
return;
+ if (fpi_flags & FPI_ZEROED)
+ __SetPageZeroed(page);
+
/*
* We only track unmovable, reclaimable and movable on pcp lists.
* Place ISOLATE pages on the isolated list because they are being
@@ -3070,6 +3083,12 @@ void free_frozen_pages(struct page *page, unsigned int order)
__free_frozen_pages(page, order, FPI_NONE);
}
+void free_frozen_pages_zeroed(struct page *page, unsigned int order)
+{
+ __free_frozen_pages(page, order, FPI_ZEROED);
+}
+EXPORT_SYMBOL(free_frozen_pages_zeroed);
+
void free_frozen_pages_nolock(struct page *page, unsigned int order)
{
__free_frozen_pages(page, order, FPI_TRYLOCK);
--
MST
* [PATCH RFC v4 21/22] mm: add put_page_zeroed and folio_put_zeroed
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (19 preceding siblings ...)
2026-04-26 21:48 ` [PATCH RFC v4 20/22] mm: add free_frozen_pages_zeroed Michael S. Tsirkin
@ 2026-04-26 21:48 ` Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 22/22] virtio_balloon: mark deflated pages as zeroed Michael S. Tsirkin
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:48 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu
Add put_page_zeroed() / folio_put_zeroed() for callers that hold
a reference to a page known to be zeroed.
If this drops the last reference, the page goes through
__folio_put_zeroed() which calls free_frozen_pages_zeroed() so
the zeroed hint is preserved. If someone else still holds a
reference, the hint is simply lost -- this is best-effort.
This is useful for balloon drivers during deflation: the host
has already zeroed the pages, and the balloon is typically the
sole owner. But if the page happens to be shared, silently
dropping the hint is safe and avoids the need for callers to
check the refcount.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
include/linux/mm.h | 12 ++++++++++++
mm/swap.c | 18 ++++++++++++++++--
2 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 541d36e5e420..1c6bf82a967a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1640,6 +1640,7 @@ static inline struct folio *virt_to_folio(const void *x)
}
void __folio_put(struct folio *folio);
+void __folio_put_zeroed(struct folio *folio);
void split_page(struct page *page, unsigned int order);
void folio_copy(struct folio *dst, struct folio *src);
@@ -1817,6 +1818,17 @@ static inline void folio_put(struct folio *folio)
__folio_put(folio);
}
+static inline void folio_put_zeroed(struct folio *folio)
+{
+ if (folio_put_testzero(folio))
+ __folio_put_zeroed(folio);
+}
+
+static inline void put_page_zeroed(struct page *page)
+{
+ folio_put_zeroed(page_folio(page));
+}
+
/**
* folio_put_refs - Reduce the reference count on a folio.
* @folio: The folio.
diff --git a/mm/swap.c b/mm/swap.c
index bb19ccbece46..5d05a463b46a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -94,7 +94,7 @@ static void page_cache_release(struct folio *folio)
unlock_page_lruvec_irqrestore(lruvec, flags);
}
-void __folio_put(struct folio *folio)
+static void ___folio_put(struct folio *folio, bool zeroed)
{
if (unlikely(folio_is_zone_device(folio))) {
free_zone_device_folio(folio);
@@ -109,10 +109,24 @@ void __folio_put(struct folio *folio)
page_cache_release(folio);
folio_unqueue_deferred_split(folio);
mem_cgroup_uncharge(folio);
- free_frozen_pages(&folio->page, folio_order(folio));
+ if (zeroed)
+ free_frozen_pages_zeroed(&folio->page, folio_order(folio));
+ else
+ free_frozen_pages(&folio->page, folio_order(folio));
+}
+
+void __folio_put(struct folio *folio)
+{
+ ___folio_put(folio, false);
}
EXPORT_SYMBOL(__folio_put);
+void __folio_put_zeroed(struct folio *folio)
+{
+ ___folio_put(folio, true);
+}
+EXPORT_SYMBOL(__folio_put_zeroed);
+
typedef void (*move_fn_t)(struct lruvec *lruvec, struct folio *folio);
static void lru_add(struct lruvec *lruvec, struct folio *folio)
--
MST
* [PATCH RFC v4 22/22] virtio_balloon: mark deflated pages as zeroed
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
` (20 preceding siblings ...)
2026-04-26 21:48 ` [PATCH RFC v4 21/22] mm: add put_page_zeroed and folio_put_zeroed Michael S. Tsirkin
@ 2026-04-26 21:48 ` Michael S. Tsirkin
21 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-26 21:48 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Xuan Zhuo, Eugenio Pérez
When host_zeroes_pages is set, the host has zeroed the balloon
pages on reclaim. Use put_page_zeroed() during deflation so
the freed pages are marked as zeroed in the buddy allocator,
allowing the next allocation to skip redundant zeroing.
put_page_zeroed() is best-effort: if the balloon is the sole
holder (the common case), the zeroed hint reaches the buddy
allocator via free_frozen_pages_zeroed(). If someone else
holds a reference, the hint is silently lost.
Once balloon pages are converted to frozen pages (no refcount),
this can switch to free_frozen_pages_zeroed() directly.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
drivers/virtio/virtio_balloon.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 8c15530b51b3..cdc3c960397d 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -304,7 +304,10 @@ static void release_pages_balloon(struct virtio_balloon *vb,
list_for_each_entry_safe(page, next, pages, lru) {
list_del(&page->lru);
- put_page(page); /* balloon reference */
+ if (host_zeroes_pages && !page_poisoning_enabled_static())
+ put_page_zeroed(page);
+ else
+ put_page(page); /* balloon reference */
}
}
--
MST
* Re: [PATCH RFC v4 14/22] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
2026-04-26 21:48 ` [PATCH RFC v4 14/22] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
@ 2026-04-27 15:13 ` Zi Yan
2026-04-27 15:18 ` Michael S. Tsirkin
2026-04-27 15:43 ` David Hildenbrand (Arm)
0 siblings, 2 replies; 26+ messages in thread
From: Zi Yan @ 2026-04-27 15:13 UTC (permalink / raw)
To: Michael S. Tsirkin, linux-kernel
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Johannes Weiner,
Zi Yan, Matthew Wilcox
On Sun Apr 26, 2026 at 5:48 PM EDT, Michael S. Tsirkin wrote:
> When a guest reports free pages to the hypervisor via the page reporting
> framework (used by virtio-balloon and hv_balloon), the host typically
> zeros those pages when reclaiming their backing memory. However, when
> those pages are later allocated in the guest, post_alloc_hook()
> unconditionally zeros them again if __GFP_ZERO is set. This
> double-zeroing is wasteful, especially for large pages.
>
> Avoid redundant zeroing:
>
> - Add a host_zeroes_pages flag to page_reporting_dev_info, allowing
> drivers to declare that their host zeros reported pages on reclaim.
> A static key (page_reporting_host_zeroes) gates the fast path.
>
> - Add PG_zeroed page flag (sharing PG_private bit) to mark pages
> that have been zeroed by the host. Set it on reported pages during
> allocation from the buddy in page_del_and_expand().
>
> - Thread the zeroed bool through rmqueue -> prep_new_page ->
> post_alloc_hook, where it skips redundant zeroing for __GFP_ZERO
> allocations.
>
> No driver sets host_zeroes_pages yet; a follow-up patch to
> virtio_balloon is needed to opt in.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh
> ---
> include/linux/page-flags.h | 9 +++++
> include/linux/page_reporting.h | 3 ++
> mm/compaction.c | 6 ++--
> mm/internal.h | 2 +-
> mm/page_alloc.c | 66 +++++++++++++++++++++++-----------
> mm/page_reporting.c | 14 +++++++-
> mm/page_reporting.h | 12 +++++++
> 7 files changed, 87 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index f7a0e4af0c73..eef2499cba8b 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -135,6 +135,8 @@ enum pageflags {
> PG_swapcache = PG_owner_priv_1, /* Swap page: swp_entry_t in private */
> /* Some filesystems */
> PG_checked = PG_owner_priv_1,
> + /* Page contents are known to be zero */
> + PG_zeroed = PG_private,
+willy,
I was discussing with willy and David about removing PG_private and
repurposing it to PG_folio to identify folios. IIUC, PG_zeroed is only
set for PageBuddy, so it should not be an issue to set it for allocated
pages for folio identification. Let me know if I get it wrong.
Thanks.
--
Best Regards,
Yan, Zi
* Re: [PATCH RFC v4 14/22] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
2026-04-27 15:13 ` Zi Yan
@ 2026-04-27 15:18 ` Michael S. Tsirkin
2026-04-27 15:43 ` David Hildenbrand (Arm)
1 sibling, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2026-04-27 15:18 UTC (permalink / raw)
To: Zi Yan
Cc: linux-kernel, Andrew Morton, David Hildenbrand, Vlastimil Babka,
Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Jason Wang,
Andrea Arcangeli, Gregory Price, linux-mm, virtualization,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Johannes Weiner,
Matthew Wilcox
On Mon, Apr 27, 2026 at 11:13:44AM -0400, Zi Yan wrote:
> On Sun Apr 26, 2026 at 5:48 PM EDT, Michael S. Tsirkin wrote:
> > When a guest reports free pages to the hypervisor via the page reporting
> > framework (used by virtio-balloon and hv_balloon), the host typically
> > zeros those pages when reclaiming their backing memory. However, when
> > those pages are later allocated in the guest, post_alloc_hook()
> > unconditionally zeros them again if __GFP_ZERO is set. This
> > double-zeroing is wasteful, especially for large pages.
> >
> > Avoid redundant zeroing:
> >
> > - Add a host_zeroes_pages flag to page_reporting_dev_info, allowing
> > drivers to declare that their host zeros reported pages on reclaim.
> > A static key (page_reporting_host_zeroes) gates the fast path.
> >
> > - Add PG_zeroed page flag (sharing PG_private bit) to mark pages
> > that have been zeroed by the host. Set it on reported pages during
> > allocation from the buddy in page_del_and_expand().
> >
> > - Thread the zeroed bool through rmqueue -> prep_new_page ->
> > post_alloc_hook, where it skips redundant zeroing for __GFP_ZERO
> > allocations.
> >
> > No driver sets host_zeroes_pages yet; a follow-up patch to
> > virtio_balloon is needed to opt in.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Assisted-by: Claude:claude-opus-4-6
> > Assisted-by: cursor-agent:GPT-5.4-xhigh
> > ---
> > include/linux/page-flags.h | 9 +++++
> > include/linux/page_reporting.h | 3 ++
> > mm/compaction.c | 6 ++--
> > mm/internal.h | 2 +-
> > mm/page_alloc.c | 66 +++++++++++++++++++++++-----------
> > mm/page_reporting.c | 14 +++++++-
> > mm/page_reporting.h | 12 +++++++
> > 7 files changed, 87 insertions(+), 25 deletions(-)
> >
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index f7a0e4af0c73..eef2499cba8b 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -135,6 +135,8 @@ enum pageflags {
> > PG_swapcache = PG_owner_priv_1, /* Swap page: swp_entry_t in private */
> > /* Some filesystems */
> > PG_checked = PG_owner_priv_1,
> > + /* Page contents are known to be zero */
> > + PG_zeroed = PG_private,
>
> +willy,
>
> I was discussing with willy and David about removing PG_private and
> repurposing it to PG_folio to identify folios. IIUC, PG_zeroed is only
> set for PageBuddy, so it should not be an issue to set it for allocated
> pages for folio identification. Let me know if I get it wrong.
>
> Thanks.
>
> --
> Best Regards,
> Yan, Zi
Exactly. Trivial to switch to another bit, of course - most are
unused in the buddy.
--
MST
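[Editorial note: the fast path described in the commit message quoted above can be modeled as a small, self-contained userspace sketch. This is a hypothetical simplification for illustration only; the names (host_zeroes_pages, PG_zeroed, post_alloc_hook, page_del_and_expand) follow the commit message, but the real kernel signatures, the static-key machinery, and the page-flags layout all differ.]

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace model of the mechanism in the commit message;
 * real kernel code differs. */

#define PG_zeroed (1u << 0)     /* stands in for the bit shared with PG_private */

struct page {
	unsigned int flags;
	char data[4096];
};

/* Models the page_reporting_host_zeroes static key: set when a driver
 * declares host_zeroes_pages in page_reporting_dev_info. */
static bool host_zeroes_pages;

static int memsets_done;        /* counts actual zeroing work, for the demo */

/* Models page_del_and_expand() tagging a reported page on allocation. */
static void mark_reported_zeroed(struct page *page)
{
	if (host_zeroes_pages)
		page->flags |= PG_zeroed;
}

/* Models post_alloc_hook(): skip the redundant memset for __GFP_ZERO
 * allocations when the host already zeroed the page. */
static void post_alloc_hook(struct page *page, bool want_zero)
{
	bool zeroed = page->flags & PG_zeroed;

	page->flags &= ~PG_zeroed;      /* the flag never leaves the allocator */
	if (want_zero && !zeroed) {
		memset(page->data, 0, sizeof(page->data));
		memsets_done++;
	}
}
```

With the key off, every __GFP_ZERO-style allocation pays for a memset; with it on, pages tagged at buddy removal skip it, which is the double-zeroing the series eliminates.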
* Re: [PATCH RFC v4 14/22] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
2026-04-27 15:13 ` Zi Yan
2026-04-27 15:18 ` Michael S. Tsirkin
@ 2026-04-27 15:43 ` David Hildenbrand (Arm)
1 sibling, 0 replies; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-27 15:43 UTC (permalink / raw)
To: Zi Yan, Michael S. Tsirkin, linux-kernel
Cc: Andrew Morton, Vlastimil Babka, Brendan Jackman, Michal Hocko,
Suren Baghdasaryan, Jason Wang, Andrea Arcangeli, Gregory Price,
linux-mm, virtualization, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Johannes Weiner, Matthew Wilcox
On 4/27/26 17:13, Zi Yan wrote:
> On Sun Apr 26, 2026 at 5:48 PM EDT, Michael S. Tsirkin wrote:
>> When a guest reports free pages to the hypervisor via the page reporting
>> framework (used by virtio-balloon and hv_balloon), the host typically
>> zeros those pages when reclaiming their backing memory. However, when
>> those pages are later allocated in the guest, post_alloc_hook()
>> unconditionally zeros them again if __GFP_ZERO is set. This
>> double-zeroing is wasteful, especially for large pages.
>>
>> Avoid redundant zeroing:
>>
>> - Add a host_zeroes_pages flag to page_reporting_dev_info, allowing
>> drivers to declare that their host zeros reported pages on reclaim.
>> A static key (page_reporting_host_zeroes) gates the fast path.
>>
>> - Add PG_zeroed page flag (sharing PG_private bit) to mark pages
>> that have been zeroed by the host. Set it on reported pages during
>> allocation from the buddy in page_del_and_expand().
>>
>> - Thread the zeroed bool through rmqueue -> prep_new_page ->
>> post_alloc_hook, where it skips redundant zeroing for __GFP_ZERO
>> allocations.
>>
>> No driver sets host_zeroes_pages yet; a follow-up patch to
>> virtio_balloon is needed to opt in.
>>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> Assisted-by: Claude:claude-opus-4-6
>> Assisted-by: cursor-agent:GPT-5.4-xhigh
>> ---
>> include/linux/page-flags.h | 9 +++++
>> include/linux/page_reporting.h | 3 ++
>> mm/compaction.c | 6 ++--
>> mm/internal.h | 2 +-
>> mm/page_alloc.c | 66 +++++++++++++++++++++++-----------
>> mm/page_reporting.c | 14 +++++++-
>> mm/page_reporting.h | 12 +++++++
>> 7 files changed, 87 insertions(+), 25 deletions(-)
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index f7a0e4af0c73..eef2499cba8b 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -135,6 +135,8 @@ enum pageflags {
>> PG_swapcache = PG_owner_priv_1, /* Swap page: swp_entry_t in private */
>> /* Some filesystems */
>> PG_checked = PG_owner_priv_1,
>> + /* Page contents are known to be zero */
>> + PG_zeroed = PG_private,
>
> +willy,
>
> I was discussing with willy and David about removing PG_private and
> repurposing it to PG_folio to identify folios. IIUC, PG_zeroed is only
> set for PageBuddy, so it should not be an issue to set it for allocated
> pages for folio identification. Let me know if I get it wrong.
Right, we can keep using that flag here. Whenever we leave the buddy it gets
cleared. So as part of your work, simply let the buddy use the renamed flag.
--
Cheers,
David
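[Editorial note: David's remark that the flag is cleared whenever a page leaves the buddy pairs with patch 15 in the overview below ("clear PG_zeroed on buddy merge if not both zero"). A hypothetical userspace model of that merge rule, for illustration only; real kernel code differs.]

```c
#include <assert.h>

/* Hypothetical model of the buddy-merge rule: a merged pair is only
 * known-zero if both halves were, otherwise the flag must be dropped. */

#define PG_zeroed 0x1u

static unsigned int merged_zeroed_flag(unsigned int lo_flags,
				       unsigned int hi_flags)
{
	return (lo_flags & hi_flags) & PG_zeroed;
}
```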
end of thread [~2026-04-27 16:05 UTC | newest]
Thread overview: 26+ messages
2026-04-26 21:47 [PATCH RFC v4 00/22] mm/virtio: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 01/22] mm: move vma_alloc_folio to page_alloc.c Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 02/22] mm: add vma_alloc_folio_user_addr Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 03/22] mm: thread user_addr through page allocator for cache-friendly zeroing Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 04/22] mm: add folio_zero_user stub for configs without THP/HUGETLBFS Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 05/22] mm: page_alloc: move prep_compound_page before post_alloc_hook Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 06/22] mm: use folio_zero_user for user pages in post_alloc_hook Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 07/22] mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 08/22] mm: use __GFP_ZERO in alloc_anon_folio Michael S. Tsirkin
2026-04-26 21:47 ` [PATCH RFC v4 09/22] mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 10/22] mm: hugetlb: use __GFP_ZERO and skip zeroing for zeroed pages Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 11/22] mm: memfd: skip zeroing for zeroed hugetlb pool pages Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 12/22] mm: remove arch vma_alloc_zeroed_movable_folio overrides Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 13/22] mm: page_alloc: propagate PageReported flag across buddy splits Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 14/22] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
2026-04-27 15:13 ` Zi Yan
2026-04-27 15:18 ` Michael S. Tsirkin
2026-04-27 15:43 ` David Hildenbrand (Arm)
2026-04-26 21:48 ` [PATCH RFC v4 15/22] mm: page_alloc: clear PG_zeroed on buddy merge if not both zero Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 16/22] mm: page_alloc: preserve PG_zeroed in page_del_and_expand Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 17/22] mm: page_reporting: add per-page zeroed bitmap for host feedback Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 18/22] virtio_balloon: a hack to enable host-zeroed page optimization Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 19/22] mm: page_reporting: add flush parameter with page budget Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 20/22] mm: add free_frozen_pages_zeroed Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 21/22] mm: add put_page_zeroed and folio_put_zeroed Michael S. Tsirkin
2026-04-26 21:48 ` [PATCH RFC v4 22/22] virtio_balloon: mark deflated pages as zeroed Michael S. Tsirkin