[PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages

* [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
@ 2026-06-08  8:33 Michael S. Tsirkin
  2026-06-08  8:34 ` [PATCH v10 01/37] mm: mempolicy: fix interleave index calculation Michael S. Tsirkin
                   ` (40 more replies)
  0 siblings, 41 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On architectures with aliasing caches, upstream with init_on_alloc
double-zeros user pages: once via kernel_init_pages() in
post_alloc_hook(), and again via clear_user_highpage() at the
callsite (because user_alloc_needs_zeroing() returns true).
This series eliminates that double-zeroing by moving the zeroing
into post_alloc_hook() and by propagating the "host already zeroed
this page" information through the buddy allocator.

For page reporting, VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED (bit 6)
is used. For the inflate/deflate path,
VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE (bit 7) is used.
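
As a rough illustration of how the two feature bits might be tested,
here is a minimal sketch.  The bit numbers 6 and 7 come from this
letter and bit 4 is the existing PAGE_POISON bit; the helpers
has_feature() and reported_pages_are_zero() are hypothetical names
made up for this example and are not functions from the series.  The
poison gating follows the rule stated in the v4 changelog (PAGE_POISON
with a non-zero poison_val means the device fills with poison, not
zeros); the actual driver clears the feature in validate() rather than
testing it at use time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as stated in this cover letter. */
#define VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED	6
#define VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE	7
/* Existing balloon feature bit, needed for the poison gating. */
#define VIRTIO_BALLOON_F_PAGE_POISON		4

/* Hypothetical helper: test one bit in a negotiated feature mask. */
bool has_feature(uint64_t features, unsigned int bit)
{
	return (features >> bit) & 1;
}

/*
 * Hypothetical helper: may the guest assume that reported pages come
 * back zeroed?  Only if DEVICE_INIT_REPORTED was negotiated and the
 * device is not filling pages with a non-zero poison value.
 */
bool reported_pages_are_zero(uint64_t features, uint32_t poison_val)
{
	if (!has_feature(features, VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED))
		return false;
	if (has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON) &&
	    poison_val != 0)
		return false;
	return true;
}
```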

Virtio spec: https://lore.kernel.org/all/cover.1778140241.git.mst@redhat.com

Based on v7.1-rc6.  When applying on mm-unstable, two conflicts
are expected:
- kernel_init_pages() was renamed to clear_highpages_kasan_tagged()
  in mm-unstable.  Use clear_highpages_kasan_tagged() in the
  post_alloc_hook else branch.
- FPI_PREPARED uses BIT(3) in mm-unstable.  Bump FPI_ZEROED to
  BIT(4).
Build-tested on mm-unstable at e9dd96806dbc:
https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git zero-mm-unstable

Patches 1-5: fixes/cleanups, dependencies of the zeroing patches.
Patches 6-9: thread user_addr through page allocator, contig API,
  and gigantic hugetlb allocation.
Patches 10-16: folio_zero_user in post_alloc_hook, vma_alloc_zeroed
  conversion, raw fault address threading.
Patches 17-24: PG_zeroed flag, aliasing guard, buddy merge/split
  tracking, FPI_ZEROED optimization, folio_put_zeroed.
Patches 25-27: __GFP_ZERO callsite conversions (alloc_anon_folio,
  vma_alloc_anon_folio_pmd) with memcg charge failure mitigation.
Patches 28-29: hugetlb __GFP_ZERO + HPG_zeroed.
Patches 30-35: page reporting zeroing (DEVICE_INIT_REPORTED),
  disable indirect descriptors.
Patches 36-37: inflate/deflate zeroing (DEVICE_INIT_ON_INFLATE).

-------

Performance with THP enabled on a 2GB VM, 1 vCPU, allocating
256MB of anonymous pages:

  metric         baseline            optimized           delta
  task-clock     232 +- 20 ms        51 +- 26 ms         -78%
  cache-misses   1.20M +- 248K       288K +- 102K        -76%
  instructions   16.3M +- 1.2M       13.8M +- 1.0M       -15%

With hugetlb surplus pages:

  metric         baseline            optimized           delta
  task-clock     219 +- 23 ms        65 +- 34 ms         -70%
  cache-misses   1.17M +- 391K       263K +- 36K         -78%
  instructions   17.9M +- 1.2M       15.1M +- 724K       -16%

Two flags track known-zero pages:
  PG_zeroed (aliased to PG_private) marks buddy allocator pages that
  are known to contain all zeros, either because the host zeroed
  them during page reporting, or because they were freed via the
  balloon deflate path.  It lives on free-list pages and is consumed
  by post_alloc_hook() on allocation.
  HPG_zeroed (stored in hugetlb folio->private bits) serves the same
  purpose for hugetlb pool pages, which are kept in a pool and may
  be zeroed long after buddy allocation, so PG_zeroed (consumed at
  allocation time) cannot track their state.

PG_zeroed lifecycle:

  Sets PG_zeroed:
  - page_reporting_drain: on reported pages when host zeroes them
  - __free_pages_ok / __free_frozen_pages: when FPI_ZEROED is set
    (balloon deflate path)
  - buddy merge: on merged page if both buddies were zeroed
  - expand(): propagate to split-off buddy sub-pages

  Clears PG_zeroed:
  - __free_pages_prepare: clears all PAGE_FLAGS_CHECK_AT_PREP flags
    (PG_zeroed included), preventing PG_private aliasing leaks
  - rmqueue_buddy / __rmqueue_pcplist: read-then-clear, passes
    zeroed hint to prep_new_page -> post_alloc_hook
  - __isolate_free_page: clear (compaction/page_reporting isolation)
  - compaction, alloc_contig, split_free_frozen: clear before use
  - buddy merge: clear both pages before merge, then conditionally
    re-set on merged head if both were zeroed
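
The merge, split and consume rules above can be modelled with a toy
structure.  This is only a sketch of the bookkeeping, assuming a
made-up struct toy_block in place of struct page; none of the names
below exist in the kernel, and locking, free lists and the PG_private
aliasing are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a free buddy block; not the kernel's struct page. */
struct toy_block {
	unsigned int order;
	bool zeroed;		/* models PG_zeroed */
};

/*
 * Buddy merge: clear the hint on both halves, then re-set it on the
 * merged block only if both halves were known-zero.
 */
struct toy_block toy_merge(struct toy_block *a, struct toy_block *b)
{
	struct toy_block merged;
	bool both = a->zeroed && b->zeroed;

	assert(a->order == b->order);
	a->zeroed = false;
	b->zeroed = false;
	merged.order = a->order + 1;
	merged.zeroed = both;
	return merged;
}

/* Split (models expand()): both halves inherit the parent's hint. */
void toy_split(const struct toy_block *parent,
	       struct toy_block *lo, struct toy_block *hi)
{
	assert(parent->order > 0);
	lo->order = hi->order = parent->order - 1;
	lo->zeroed = hi->zeroed = parent->zeroed;
}

/*
 * Allocation: read-then-clear.  The return value is the hint that
 * would be handed on to post_alloc_hook() to skip zeroing.
 */
bool toy_alloc_consume(struct toy_block *b)
{
	bool was_zeroed = b->zeroed;

	b->zeroed = false;
	return was_zeroed;
}
```

In this model a zeroed order-9 block merged with a non-zeroed buddy
gives a non-zeroed order-10 block, while splitting a zeroed block
yields two zeroed halves, each of which can be consumed exactly once.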

HPG_zeroed lifecycle (hugetlb pool pages, stored in folio->private):

  Sets HPG_zeroed:
  - alloc_surplus_hugetlb_folio: after buddy allocation with
    __GFP_ZERO, mark pool page as known-zero

  Clears HPG_zeroed:
  - free_huge_folio: page was mapped to userspace, no longer
    known-zero when it returns to the pool
  - alloc_hugetlb_folio: cleared unconditionally on output
  - alloc_hugetlb_folio_reserve: cleared after checking

- The optimization is most effective with THP, where entire 2MB
  pages are allocated directly from reported order-9+ buddy pages.
  Without THP, only ~21% of order-0 allocations come from reported
  pages due to low-order fragmentation.
- Persistent hugetlb pool pages are not covered: when freed by
  userspace they return to the hugetlb free pool, not the buddy
  allocator, so they are never reported to the host.  Surplus
  hugetlb pages are allocated from buddy and do benefit.

- PG_zeroed is aliased to PG_private.  __free_pages_prepare() clears it
  (preventing filesystem PG_private from leaking as false PG_zeroed).
  FPI_ZEROED re-sets it after prepare for balloon deflate pages.
  Is aliasing PG_private acceptable, or should a different bit be used?

- With __GFP_ZERO, the folio is zeroed before mem_cgroup_charge().
  If the charge fails (cgroup at limit), the zeroing work is wasted
  and the folio is freed and retried at a smaller order.  Previously,
  zeroing was done after a successful charge.  This is inherent to
  the __GFP_ZERO approach.  Is this acceptable?

- On architectures with aliasing caches, upstream with init_on_alloc
  double-zeros user pages: once via kernel_init_pages() in
  post_alloc_hook, and again via clear_user_highpage() at the
  callsite (because user_alloc_needs_zeroing() returns true).
  Our patches eliminate this by zeroing once via folio_zero_user()
  in post_alloc_hook.  Not a critical fix (people who set init_on_alloc
  know they are paying a performance cost) but a nice cleanup anyway.

Test program:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>

  #ifndef MADV_POPULATE_WRITE
  #define MADV_POPULATE_WRITE 23
  #endif
  #ifndef MAP_HUGETLB
  #define MAP_HUGETLB 0x40000
  #endif

  int main(int argc, char **argv)
  {
      unsigned long size;
      int flags = MAP_PRIVATE | MAP_ANONYMOUS;
      void *p;
      int r;

      if (argc < 2) {
          fprintf(stderr, "usage: %s <size_mb> [huge]\n", argv[0]);
          return 1;
      }
      size = atol(argv[1]) * 1024UL * 1024;
      if (argc >= 3 && strcmp(argv[2], "huge") == 0)
          flags |= MAP_HUGETLB;
      p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
      if (p == MAP_FAILED) {
          perror("mmap");
          return 1;
      }
      r = madvise(p, size, MADV_POPULATE_WRITE);
      if (r) {
          perror("madvise");
          return 1;
      }
      munmap(p, size);
      return 0;
  }

Test script (bench.sh):

  #!/bin/bash
  # Usage: bench.sh <size_mb> <iterations> [huge]
  # Feature negotiation (DEVICE_INIT_REPORTED/ON_INFLATE) is
  # handled by QEMU command line flags.
  SZ=${1:-256}; ITER=${2:-10}; HUGE=${3:-}
  FLUSH=/sys/module/page_reporting/parameters/flush
  CSV=/tmp/perf.csv
  rmmod virtio_balloon 2>/dev/null
  insmod /mnt/share/virtio_balloon.ko
  echo 512 > $FLUSH
  [ "$HUGE" = "huge" ] && echo $((SZ/2)) > /proc/sys/vm/nr_overcommit_hugepages
  rm -f $CSV
  echo "=== sz=${SZ}MB iter=$ITER $HUGE ==="
  for i in $(seq 1 $ITER); do
      echo 3 > /proc/sys/vm/drop_caches
      echo 512 > $FLUSH
      perf stat -e task-clock,instructions,cache-misses \
          -x, -o $CSV --append -- /mnt/share/alloc_once $SZ $HUGE
  done
  [ "$HUGE" = "huge" ] && echo 0 > /proc/sys/vm/nr_overcommit_hugepages
  rmmod virtio_balloon
  awk -F, '/^#/||/^$/{next}{v=$1+0;e=$3;gsub(/ /,"",e);s[e]+=v;ss[e]+=v*v;n[e]++}
  END{for(e in s){a=s[e]/n[e];d=sqrt(ss[e]/n[e]-a*a);printf "  %-16s %10.0f +- %8.0f (n=%d)\n",e,a,d,n[e]}}' $CSV

Compile and run:
  gcc -static -O2 -o alloc_once alloc_once.c
  bash bench.sh 256 10            # regular pages
  bash bench.sh 256 10 huge       # hugetlb surplus

Note about Sashiko (sashiko.dev) false positives:
  Sashiko's mm-alloc guideline says "Any optimization replacing
  clear_user_highpage() with __GFP_ZERO is wrong on [cache-aliasing]
  architectures". This is correct for mainline but not for this
  series, which threads user_addr through the allocator so that
  post_alloc_hook() calls folio_zero_user() with the dcache flush.
  Suggested guideline update: add "unless the caller passes a
  valid user address (i.e. not USER_ADDR_NONE) to vma_alloc_folio(),
  alloc_contig_frozen_pages_user() etc., which reaches
  post_alloc_hook() for the dcache flush".

Pre-existing bugs found during review (not fixed, not made worse):
  - do_swap_page() returns VM_FAULT_OOM on large-folio swapin race
    instead of retrying.
  - free_huge_folio() called with refcount==1 on
    mem_cgroup_charge_hugetlb failure.
  - memfd_alloc_folio() double-decrements resv_huge_pages on error.
  - wait_event in virtballoon_free_page_report hangs on broken
    virtqueue (pre-existing, same as old single-buffer code).
  - tell_host() GFP_KERNEL under balloon_lock risks OOM deadlock.

Changes since v9:
- Fix W=1 kerneldoc warning on alloc_contig_frozen_pages_user_noprof.
- Fix link error on !MMU configs (m68k, arm allnoconfig): move
  folio_zero_user stub to new mm/folio_zero.h header.
- Reorder patches: move PG_zeroed tracking and folio_put_zeroed
  before __GFP_ZERO conversions, allowing folio_put_zeroed to
  handle memcg charge failures.
- Better handle memcg charge failures.

Changes since v8 (address Sashiko v8 review findings):
- Fix mempolicy interleave: combine vm_pgoff and VMA offset into
  a single expression before shifting, fixing carry loss for
  file-backed VMAs with unaligned vm_pgoff.
- Fix memory-failure: wrap ClearPageHWPoison in retry path with
  zone->lock (same race as TestSetPageHWPoison).
- Fix stale comment: "folio_zero_user writes" -> "page zeroing"
  in huge_memory.c __folio_mark_uptodate comment.
- Drop rounddown_pow_of_two for page reporting capacity (no-op
  for compiler optimization, halves batch size for non-power-of-2).
- Reorder: move "mm: balloon: use put_page_zeroed" before
  "virtio_balloon: implement DEVICE_INIT_ON_INFLATE" so the
  ClearPageZeroed handling is in place before any page gets
  the flag set.
- Various commit log improvements (PowerPC note in aliasing
  patch, memory-failure note about other HWPoison calls,
  wording fixes).

Changes since v7 (address Sashiko AI review findings):
- Fix dcache flush on VIPT aliasing architectures: add
  user_alloc_needs_zeroing() guard in post_alloc_hook to force
  folio_zero_user for user pages when cache aliasing requires it.
  Host-zeroed pages excluded (!zeroed).  Optimization preserved.
- Fix folio_zero_user stub: replace macro with non-inline function
  in mm/memory.c to avoid double-evaluation and missing include.
- Fix C89 declaration-after-statement in free_huge_folio.
- Fix CMA __GFP_ZERO: pass through to cma_alloc_frozen_compound
  so HPG_zeroed accurately reflects whether page was zeroed.
- Fix big-endian bitmap: use test_bit_le() for inflate_bitmap.
- Fix migratepage: clear PageZeroed on old page before deflation.
- Fix page_reporting flush: overflow-safe loop, add -EINTR on
  signal, add code comment explaining double flush_delayed_work.
- Add atomic ClearPageZeroed (CLEARPAGEFLAG) for balloon migration
  path where zone->lock is not held.
- Add VM_WARN_ON_ONCE for order>0 without __GFP_COMP in
  post_alloc_hook (folio_zero_user requires compound metadata).
- Add _noprof pattern for vma_alloc_zeroed_movable_folio to
  preserve memory allocation profiling attribution.
- Add PageReported propagation in split_large_buddy (was missing
  from patch 2).
- Add FPI_ZEROED guard: skip PageZeroed when page_poisoning
  enabled and init_on_free disabled (poison overwrites zeroes).
- Add DMA alignment comment for inflate_bitmap (ACCESS_PLATFORM
  cleared, so not needed now).
- Restore tell_host comment explaining vq buffer assumption.
- Various code comments documenting design decisions.
- Drop __GFP_ZERO from gather_surplus_pages: avoid shifting
  zeroing from fault time to reservation time (mmap/fallocate).
  Pool pages are zeroed at fault time via alloc_hugetlb_folio.
  Fresh surplus allocations at fault time still benefit from
  __GFP_ZERO + HPG_zeroed.
- New patch: add alloc_contig_frozen_pages_user API with user_addr
  for cache-friendly zeroing in the contiguous allocation path.
- New patch: thread user_addr through gigantic hugetlb allocation
  via alloc_contig_frozen_pages_user.
- New patch: replace user_alloc_needs_zeroing() with aliasing-only
  checks (cpu_dcache_is_aliasing || cpu_icache_is_aliasing) in the
  post_alloc_hook guard.  Avoids redundant re-zero on non-aliasing.
- New patch: serialize TestSetPageHWPoison with zone->lock in
  memory_failure to fix pre-existing race with non-atomic buddy
  flag operations (e.g. page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP).
- New patch: disable VIRTIO_RING_F_INDIRECT_DESC in balloon to
  prevent GFP_KERNEL allocation under balloon_lock (OOM deadlock).
- New patch: skip kernel_init_pages for FPI_ZEROED when page
  poisoning is not enabled (page already zero, skip redundant work).

Also since v7 (address review by Gregory Price):
- Drop from_pool bool in alloc_hugetlb_folio: use
  folio_test_hugetlb_zeroed directly.  HPG_zeroed is set by
  alloc_surplus_hugetlb_folio for fresh allocations, so the
  check handles both pool and fresh pages.
- Drop bool *zeroed output parameter from alloc_hugetlb_folio:
  sink zeroing inside the function.  When __GFP_ZERO is set and
  !folio_test_hugetlb_zeroed, call folio_zero_user internally.
- Rename addr to user_addr in alloc_hugetlb_folio, align
  internally with huge_page_mask.
- Add Reviewed-by: Gregory Price tags on reviewed patches.

New patches since v7:
- mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
- mm: add alloc_contig_frozen_pages_user for cache-friendly zeroing
- mm: hugetlb: thread user_addr through gigantic page allocation
- mm: page_alloc: use aliasing checks instead of
  user_alloc_needs_zeroing
- virtio_balloon: disable indirect descriptors
- mm: page_alloc: skip kernel_init_pages for FPI_ZEROED when safe

Changes since v6 (address review by Gregory Price):
- Rework hugetlb: use gfp_t parameter instead of bool zero /
  bool *zeroed.  Sink zeroing inside alloc_hugetlb_folio().
  Pass raw fault address (user_addr) for cache-friendly zeroing
  on both pool-page and fresh allocation paths.
  (Suggested by Gregory Price)
- Reorder compaction_alloc_noprof() to call prep_compound_page
  before post_alloc_hook for consistency.
  (Suggested by Gregory Price)
- Reorder: interleave fix first, PageReported propagation and
  capacity fix moved to front as dependencies.
- Add USER_ADDR_NONE comments in mmap.c and internal.h explaining
  why -1 is never a valid userspace address.
- Fix err uninitialized warning in virtballoon_free_page_report().
- Lots of commit log tweaks.

Also in v7:
- Fix hugetlb pool page zeroing to use vmf->real_address
  (the actual faulting subpage) instead of vmf->address
  (hugepage-aligned), preserving cache-friendly zeroing
  locality that upstream had at the callsite.
- Remove dead/broken alloc_hugetlb_folio !CONFIG_HUGETLB_PAGE
  stub (returned NULL but callers check IS_ERR).

Changes since v5:
- Rebased onto v7.1-rc2.
- Split alloc_anon_folio and alloc_swap_folio raw fault address
  changes into separate patches.
- In virtio, move PAGE_POISON check for DEVICE_INIT_REPORTED
  from probe() to validate(), clearing the feature instead of
  just gating host_zeroes_pages.  Same for confidential
  computing check.
- Fix bisectability: FPI_ZEROED definition and usage now in
  the same patch.
- Lots of commit log tweaks.
- Reorder: REPORTED before ON_INFLATE.
- Kerneldoc fixes.

Changes since v4:
With virtio spec posted, update to latest spec:
- Add VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED (bit 6) for reporting.
- Add VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE (bit 7) for inflate.
- Per-page virtqueue submission, per-page used_len feedback.
- Balloon migration preserves PageZeroed hint.
- Page_reporting capacity bugfix for small virtqueues.
- PG_zeroed propagation in split_large_buddy.
- Disable both features for confidential computing guests.
- Gate host_zeroes_pages on PAGE_POISON/poison_val: when PAGE_POISON
  is negotiated with non-zero poison_val, device fills with poison
  not zeros, so host_zeroes_pages must be false.
- Disable ON_INFLATE when PAGE_POISON with non-zero poison_val.
- Bound inflate bitmap reads by used_len from device.
- Move ON_INFLATE poison_val check to validate() for proper
  feature negotiation.
- Fix NUMA interleave index for unaligned VMA start (new patch 1).
- Drop vma_alloc_folio_user_addr: with the ilx fix, callers can
  pass raw fault address to vma_alloc_folio directly.
- Tested with DEBUG_VM, INIT_ON_ALLOC/FREE enabled.

Changes since v3 (address review by Gregory Price and David Hildenbrand):
- Keep user_addr threading internal: public APIs (__alloc_pages,
  __folio_alloc, folio_alloc_mpol) are unchanged.  Only internal
  functions (__alloc_frozen_pages_noprof, __alloc_pages_mpol) carry
  user_addr.  This eliminates all API churn for external callers.
- Add vma_alloc_folio_user_addr() (2/22) to separate NUMA policy
  address from the zeroing hint address.  Fixes NUMA interleave
  index corruption when passing unaligned fault address for
  higher-order allocations.
- Add per-page zeroed_bitmap to page_reporting_dev_info (17/22).
  The driver's report() callback manages the bitmap.  Drain
  checks it gated by the host_zeroes_pages static key.  This
  matches the proposed virtio balloon extension at
  https://lore.kernel.org/all/cover.1776874126.git.mst@redhat.com/
- Clear PG_zeroed in __isolate_free_page() to prevent the aliased
  PG_private flag from leaking to compaction/alloc_contig paths.
- Do not exclude PG_zeroed from PAGE_FLAGS_CHECK_AT_PREP macro.
  Instead, __free_pages_prepare() clears it (preventing filesystem
  PG_private leaking as false PG_zeroed), and FPI_ZEROED sets it
  after prepare.  Only buddy merge assertion is relaxed.
- Initialize alloc_context.user_addr in alloc_pages_bulk_noprof.
- Deflate and hugetlb changes are much smaller now.  Still, the
  patchset can be merged gradually, if desired.

Changes since v2 (address review by Gregory Price and David Hildenbrand):
- v2 used pghint_t / vma_alloc_folio_hints API.  v3 switches to
  threading user_addr through the page allocator and using __GFP_ZERO,
  so post_alloc_hook() can use folio_zero_user() for cache-friendly
  zeroing when the user fault address is known.
- Use FPI_ZEROED to set PG_zeroed after __free_pages_prepare() instead
  of runtime masking in __free_one_page (further refined in v4).
- Drop redundant page_poisoning_enabled() check from mm core free
  path, already guarded at feature negotiation time in
  virtio_balloon_validate.  The balloon driver keeps its own
  page_poisoning_enabled_static() check as defense in depth.
- Split free_frozen_pages_zeroed and put_page_zeroed into separate
  patches.  David Hildenbrand indicated he intends to rework balloon
  pages to be frozen (no refcount), at which point put_page_zeroed
  (21/22) can be dropped and the balloon can call
  free_frozen_pages_zeroed directly.
- Use HPG_zeroed flag (in hugetlb folio->private) for hugetlb pool
  pages instead of PG_zeroed, since pool pages are zeroed long after
  buddy allocation and PG_zeroed is consumed at allocation time.
- syzbot CI found a PF_NO_COMPOUND BUG in the v2 pghint_t approach
  where __ClearPageZeroed was called on compound hugetlb pages in
  free_huge_folio.  The v3 HPG_zeroed approach avoids this.
- Remove redundant arch vma_alloc_zeroed_movable_folio overrides
  on x86, s390, m68k, and alpha (12/22). Suggested by David
  Hildenbrand.
- Updated benchmarking script to compute per-run avg +- stddev
  via awk on CSV output.

Changes v1->v2:
- Replaced __GFP_PREZEROED with PG_zeroed page flag (aliased PG_private)
- Added pghint_t type and vma_alloc_folio_hints() API
- Track PG_zeroed across buddy merges and splits
- Added post_alloc_hook integration (single consume/clear point)
- Added hugetlb support (pool pages + memfd)
- Added page_reporting flush parameter for deterministic testing
- Added free_frozen_pages_hint/put_page_hint for balloon deflate path
- Added try_to_claim_block PG_zeroed preservation
- Updated perf numbers with per-iteration flush methodology

Written with assistance from Claude (claude-opus-4-6).
Reviewed by cursor-agent (GPT-5.4-xhigh).
Everything manually read, patchset split and commit logs edited manually.


Michael S. Tsirkin (37):
  mm: mempolicy: fix interleave index calculation
  mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
  mm: page_alloc: propagate PageReported flag across buddy splits
  mm: page_reporting: allow driver to set batch capacity
  mm: hugetlb: remove dead alloc_hugetlb_folio stub
  mm: move vma_alloc_folio_noprof to page_alloc.c
  mm: thread user_addr through page allocator for cache-friendly zeroing
  mm: add alloc_contig_frozen_pages_user for cache-friendly zeroing
  mm: hugetlb: thread user_addr through gigantic page allocation
  mm: add folio_zero_user stub for configs without THP/HUGETLBFS
  mm: page_alloc: move prep_compound_page before post_alloc_hook
  mm: use folio_zero_user for user pages in post_alloc_hook
  mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio
  mm: remove arch vma_alloc_zeroed_movable_folio overrides
  mm: alloc_anon_folio: pass raw fault address to vma_alloc_folio
  mm: alloc_swap_folio: pass raw fault address to vma_alloc_folio
  mm: page_reporting: skip redundant zeroing of host-zeroed reported
    pages
  mm: page_alloc: use aliasing checks instead of
    user_alloc_needs_zeroing
  mm: page_alloc: clear PG_zeroed on buddy merge if not both zero
  mm: page_alloc: preserve PG_zeroed in page_del_and_expand
  mm: page_alloc: propagate PG_zeroed in split_large_buddy
  mm: add free_frozen_pages_zeroed
  mm: page_alloc: skip kernel_init_pages for FPI_ZEROED when safe
  mm: add put_page_zeroed and folio_put_zeroed
  mm: use __GFP_ZERO in alloc_anon_folio
  mm: vma_alloc_anon_folio_pmd: pass raw fault address to
    vma_alloc_folio
  mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd
  mm: hugetlb: add gfp parameter and skip zeroing for zeroed pages
  mm: memfd: skip zeroing for zeroed hugetlb pool pages
  mm: page_reporting: add per-page zeroed bitmap for host feedback
  virtio_balloon: submit reported pages as individual buffers
  virtio_balloon: disable indirect descriptors
  mm: page_reporting: add flush parameter with page budget
  virtio_balloon: skip zeroing for host-zeroed reported pages
  virtio_balloon: disable reporting zeroed optimization for confidential
    guests
  mm: balloon: use put_page_zeroed for zeroed balloon pages
  virtio_balloon: implement VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE

 arch/alpha/include/asm/page.h       |   3 -
 arch/m68k/include/asm/page_no.h     |   3 -
 arch/s390/include/asm/page.h        |   3 -
 arch/x86/include/asm/page.h         |   3 -
 drivers/virtio/virtio_balloon.c     | 177 ++++++++++++++---
 fs/hugetlbfs/inode.c                |   3 +-
 include/linux/cma.h                 |   3 +-
 include/linux/gfp.h                 |  18 +-
 include/linux/highmem.h             |  15 +-
 include/linux/hugetlb.h             |  18 +-
 include/linux/mm.h                  |  13 ++
 include/linux/page-flags.h          |  11 ++
 include/linux/page_reporting.h      |  13 ++
 include/uapi/linux/virtio_balloon.h |   2 +
 mm/balloon.c                        |  10 +-
 mm/cma.c                            |   6 +-
 mm/compaction.c                     |   9 +-
 mm/folio_zero.h                     |  18 ++
 mm/huge_memory.c                    |  16 +-
 mm/hugetlb.c                        | 138 ++++++++-----
 mm/hugetlb_cma.c                    |   4 +-
 mm/internal.h                       |  22 ++-
 mm/memfd.c                          |  14 +-
 mm/memory-failure.c                 |  10 +
 mm/memory.c                         |  19 +-
 mm/mempolicy.c                      |  75 +++----
 mm/mmap.c                           |   6 +
 mm/page_alloc.c                     | 297 +++++++++++++++++++++++-----
 mm/page_reporting.c                 |  99 ++++++++--
 mm/page_reporting.h                 |  12 ++
 mm/slub.c                           |   4 +-
 mm/swap.c                           |  20 +-
 32 files changed, 792 insertions(+), 272 deletions(-)
 create mode 100644 mm/folio_zero.h

-- 
MST




* [PATCH v10 01/37] mm: mempolicy: fix interleave index calculation
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
@ 2026-06-08  8:34 ` Michael S. Tsirkin
  2026-06-08  9:43   ` Lorenzo Stoakes
  2026-06-08  8:34 ` [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock Michael S. Tsirkin
                   ` (39 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

The NUMA interleave index was computed as two separate terms:

    *ilx += vma->vm_pgoff >> order;
    *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);

This has two problems:

1. When vm_start is not aligned to the folio size, the
   subtraction before the shift lets low bits affect the
   result via borrows.

2. For file-backed VMAs, shifting vm_pgoff and the VMA
   offset independently loses carries between them, giving
   wrong chunk indices when vm_pgoff is not aligned to order.

Combine into a single expression that adds vm_pgoff and
the page-granularity VMA offset first, then shifts once:

    *ilx += (vma->vm_pgoff +
            (addr >> PAGE_SHIFT) -
            (vma->vm_start >> PAGE_SHIFT)) >> order;

For anonymous VMAs, vm_pgoff equals vm_start >> PAGE_SHIFT,
so the vm_pgoff and vm_start terms cancel and the result
reduces to addr >> (PAGE_SHIFT + order), same as before.

For file-backed VMAs, the sum vm_pgoff + (addr >> PAGE_SHIFT)
- (vm_start >> PAGE_SHIFT) gives the file page offset of addr.
Shifting by order gives the correct file chunk index.
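
To make the two failure modes concrete, the old and new expressions
can be compared in a small userspace sketch.  The functions ilx_old()
and ilx_new() are hypothetical wrappers written for this illustration
(the kernel code adds to *ilx in place), and PAGE_SHIFT is assumed to
be 12.

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumption: 4K pages */

/* Old calculation: two terms shifted independently. */
unsigned long ilx_old(unsigned long vm_pgoff, unsigned long vm_start,
		      unsigned long addr, unsigned int order)
{
	unsigned long ilx = 0;

	ilx += vm_pgoff >> order;
	ilx += (addr - vm_start) >> (PAGE_SHIFT + order);
	return ilx;
}

/* New calculation: add at page granularity first, then shift once. */
unsigned long ilx_new(unsigned long vm_pgoff, unsigned long vm_start,
		      unsigned long addr, unsigned int order)
{
	return (vm_pgoff +
		(addr >> PAGE_SHIFT) -
		(vm_start >> PAGE_SHIFT)) >> order;
}
```

With order 2, a file-backed VMA with vm_pgoff = 3 and
vm_start = 0x10000 faulting at addr = 0x11000 is at file page 4, so
the chunk index should be 1; ilx_old() returns 0 and ilx_new()
returns 1.  For an anonymous VMA starting at 0x11000 (vm_pgoff = 17)
and faulting at 0x14000, ilx_old() returns 4 while ilx_new() returns
5, which equals addr >> (PAGE_SHIFT + order).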

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Gregory Price <gourry@gourry.net>
---
 mm/mempolicy.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4e4421b22b59..d139b074a599 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2048,8 +2048,9 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 		pol = get_task_policy(current);
 	if (pol->mode == MPOL_INTERLEAVE ||
 	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
-		*ilx += vma->vm_pgoff >> order;
-		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
+		*ilx += (vma->vm_pgoff +
+			(addr >> PAGE_SHIFT) -
+			(vma->vm_start >> PAGE_SHIFT)) >> order;
 	}
 	return pol;
 }
-- 
MST




* [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
  2026-06-08  8:34 ` [PATCH v10 01/37] mm: mempolicy: fix interleave index calculation Michael S. Tsirkin
@ 2026-06-08  8:34 ` Michael S. Tsirkin
  2026-06-08  9:43   ` Lorenzo Stoakes
  2026-06-08  8:34 ` [PATCH v10 03/37] mm: page_alloc: propagate PageReported flag across buddy splits Michael S. Tsirkin
                   ` (38 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Miaohe Lin

TestSetPageHWPoison() is called without zone->lock, so its atomic
update to page->flags can race with non-atomic flag operations
that run under zone->lock in the buddy allocator.

In particular, __free_pages_prepare() does:

    page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;

This non-atomic read-modify-write, while correctly excluding
__PG_HWPOISON from the mask, can still lose a concurrent
TestSetPageHWPoison if the read happens before the poison bit
is set and the write happens after.  Follow-up patches in this
series add similar non-atomic flag operations as well.
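
The lost update can be replayed deterministically in a single thread
by writing out the problematic interleaving step by step.  This is a
sketch only: the TOY_* flag values and both functions are invented
for the illustration and do not correspond to real page flag bits.

```c
#include <assert.h>

/* Invented bit values, for illustration only. */
#define TOY_PG_PRIVATE		(1UL << 0)
#define TOY_PG_HWPOISON		(1UL << 1)
/* Stand-in for PAGE_FLAGS_CHECK_AT_PREP; excludes the poison bit. */
#define TOY_CHECK_AT_PREP	TOY_PG_PRIVATE

/*
 * Unserialized case: CPU A's "flags &= ~mask" is a load followed by
 * a store, and CPU B's atomic set lands between the two.
 */
unsigned long toy_lost_update(unsigned long flags)
{
	unsigned long snapshot;

	snapshot = flags;			/* CPU A: load */
	flags |= TOY_PG_HWPOISON;		/* CPU B: TestSetPageHWPoison */
	flags = snapshot & ~TOY_CHECK_AT_PREP;	/* CPU A: store, B's bit lost */
	return flags;
}

/*
 * Serialized case: with both sides under the same lock, CPU B's
 * update can only happen before or after the whole read-modify-write.
 */
unsigned long toy_serialized(unsigned long flags)
{
	flags &= ~TOY_CHECK_AT_PREP;		/* CPU A: full RMW */
	flags |= TOY_PG_HWPOISON;		/* CPU B: runs afterwards */
	return flags;
}
```

The mask never contains the poison bit, yet the stale snapshot still
overwrites it, which is why excluding __PG_HWPOISON from the mask is
not sufficient on its own.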

Fix by acquiring zone->lock around TestSetPageHWPoison and
around ClearPageHWPoison in the retry path.  This
serializes with all buddy flag manipulation.  The cost is
negligible: one lock/unlock in an extremely rare path
(hardware memory errors).

Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
in this file operate on pages already removed from the buddy
allocator or on non-buddy pages (DAX, hugetlb), so they do not
need zone->lock protection.

Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
 mm/memory-failure.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d4361309..3880486028a1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2348,6 +2348,8 @@ int memory_failure(unsigned long pfn, int flags)
 	unsigned long page_flags;
 	bool retry = true;
 	int hugetlb = 0;
+	struct zone *zone;
+	unsigned long mf_flags;
 
 	if (!sysctl_memory_failure_recovery)
 		panic("Memory failure on page %lx", pfn);
@@ -2390,7 +2392,11 @@ int memory_failure(unsigned long pfn, int flags)
 	if (hugetlb)
 		goto unlock_mutex;
 
+	/* Serialize with non-atomic buddy flag operations */
+	zone = page_zone(p);
+	spin_lock_irqsave(&zone->lock, mf_flags);
 	if (TestSetPageHWPoison(p)) {
+		spin_unlock_irqrestore(&zone->lock, mf_flags);
 		res = -EHWPOISON;
 		if (flags & MF_ACTION_REQUIRED)
 			res = kill_accessing_process(current, pfn, flags);
@@ -2399,6 +2405,7 @@ int memory_failure(unsigned long pfn, int flags)
 		action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
 		goto unlock_mutex;
 	}
+	spin_unlock_irqrestore(&zone->lock, mf_flags);
 
 	/*
 	 * We need/can do nothing about count=0 pages.
@@ -2420,7 +2427,10 @@ int memory_failure(unsigned long pfn, int flags)
 			} else {
 				/* We lost the race, try again */
 				if (retry) {
+					/* Serialize with non-atomic buddy flag operations */
+					spin_lock_irqsave(&zone->lock, mf_flags);
 					ClearPageHWPoison(p);
+					spin_unlock_irqrestore(&zone->lock, mf_flags);
 					retry = false;
 					goto try_again;
 				}
-- 
MST




* [PATCH v10 03/37] mm: page_alloc: propagate PageReported flag across buddy splits
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
  2026-06-08  8:34 ` [PATCH v10 01/37] mm: mempolicy: fix interleave index calculation Michael S. Tsirkin
  2026-06-08  8:34 ` [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock Michael S. Tsirkin
@ 2026-06-08  8:34 ` Michael S. Tsirkin
  2026-06-08  9:52   ` Lorenzo Stoakes
  2026-06-08  8:34 ` [PATCH v10 04/37] mm: page_reporting: allow driver to set batch capacity Michael S. Tsirkin
                   ` (37 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

When a reported free page is split via expand() to satisfy a
smaller allocation, the sub-pages placed back on the free lists
lose the PageReported flag.  This means they will be unnecessarily
re-reported to the hypervisor in the next reporting cycle, wasting
work.

While I was unable to quantify the performance difference, it is
an obvious waste, even if small.

Propagate the PageReported flag to sub-pages during expand(),
both in page_del_and_expand() and try_to_claim_block(), so
that they are recognized as already-reported.

split_large_buddy() also propagates PageReported via a bool
parameter: the caller saves PageReported before
del_page_from_free_list() clears it, then passes the saved
value. The flag is set after __free_one_page() with a
PageBuddy check, matching the page_reporting_drain() pattern.
Free-path callers pass false (freshly freed pages are never
reported).
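
As a rough illustration of the expand() propagation, the toy model
below splits a block and lets every split-off half inherit the
parent's reported state. It is a userspace sketch only: struct
demo_page, demo_expand() and DEMO_MAX_PAGES are hypothetical stand-ins,
not the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_PAGES 16	/* one order-4 block */

/* Hypothetical per-page state; the kernel keeps this in struct page. */
struct demo_page {
	bool on_free_list;
	bool reported;
	int order;
};

/*
 * Split the block starting at pages[0] from order high down to order
 * low, putting each split-off upper half back on the "free list".  If
 * the parent block was reported, each half inherits the flag.
 * Returns the number of pages given back.
 */
static unsigned int demo_expand(struct demo_page *pages, int low, int high,
				bool reported)
{
	unsigned int size = 1U << high;
	unsigned int nr_added = 0;

	assert(low >= 0 && low <= high && size <= DEMO_MAX_PAGES);

	while (high > low) {
		high--;
		size >>= 1;
		pages[size].on_free_list = true;
		pages[size].order = high;
		pages[size].reported = reported;
		nr_added += size;
	}
	return nr_added;
}
```

The caller has to sample the reported state before removing the parent
from the free list, because removal clears it; that is why the flag
travels as a separate bool argument rather than being read inside the
helper.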

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
 mm/page_alloc.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d49c254174da..8dae5b3f5876 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1502,7 +1502,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 
 /* Split a multi-block free page into its individual pageblocks. */
 static void split_large_buddy(struct zone *zone, struct page *page,
-			      unsigned long pfn, int order, fpi_t fpi)
+			      unsigned long pfn, int order, fpi_t fpi,
+			      bool reported)
 {
 	unsigned long end = pfn + (1 << order);
 
@@ -1517,6 +1518,8 @@ static void split_large_buddy(struct zone *zone, struct page *page,
 		int mt = get_pfnblock_migratetype(page, pfn);
 
 		__free_one_page(page, pfn, zone, order, mt, fpi);
+		if (reported && PageBuddy(page) && buddy_order(page) == order)
+			__SetPageReported(page);
 		pfn += 1 << order;
 		if (pfn == end)
 			break;
@@ -1559,11 +1562,12 @@ static void free_one_page(struct zone *zone, struct page *page,
 		llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
 			unsigned int p_order = p->private;
 
-			split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
+			split_large_buddy(zone, p, page_to_pfn(p), p_order,
+					  fpi_flags, false);
 			__count_vm_events(PGFREE, 1 << p_order);
 		}
 	}
-	split_large_buddy(zone, page, pfn, order, fpi_flags);
+	split_large_buddy(zone, page, pfn, order, fpi_flags, false);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	__count_vm_events(PGFREE, 1 << order);
@@ -1694,7 +1698,7 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
  * -- nyc
  */
 static inline unsigned int expand(struct zone *zone, struct page *page, int low,
-				  int high, int migratetype)
+				  int high, int migratetype, bool reported)
 {
 	unsigned int size = 1 << high;
 	unsigned int nr_added = 0;
@@ -1716,6 +1720,15 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
 		__add_to_free_list(&page[size], zone, high, migratetype, false);
 		set_buddy_order(&page[size], high);
 		nr_added += size;
+
+		/*
+		 * The parent page has been reported to the host.  The
+		 * sub-pages are part of the same reported block, so mark
+		 * them reported too.  This avoids re-reporting pages that
+		 * the host already knows about.
+		 */
+		if (reported)
+			__SetPageReported(&page[size]);
 	}
 
 	return nr_added;
@@ -1726,9 +1739,10 @@ static __always_inline void page_del_and_expand(struct zone *zone,
 						int high, int migratetype)
 {
 	int nr_pages = 1 << high;
+	bool was_reported = page_reported(page);
 
 	__del_page_from_free_list(page, zone, high, migratetype);
-	nr_pages -= expand(zone, page, low, high, migratetype);
+	nr_pages -= expand(zone, page, low, high, migratetype, was_reported);
 	account_freepages(zone, -nr_pages, migratetype);
 }
 
@@ -2116,11 +2130,13 @@ static bool __move_freepages_block_isolate(struct zone *zone,
 	/* We're a part of a larger buddy */
 	if (PageBuddy(buddy) && buddy_order(buddy) > pageblock_order) {
 		int order = buddy_order(buddy);
+		bool reported = PageReported(buddy);
 
 		del_page_from_free_list(buddy, zone, order,
 					get_pfnblock_migratetype(buddy, buddy_pfn));
 		toggle_pageblock_isolate(page, isolate);
-		split_large_buddy(zone, buddy, buddy_pfn, order, FPI_NONE);
+		split_large_buddy(zone, buddy, buddy_pfn, order, FPI_NONE,
+				  reported);
 		return true;
 	}
 
@@ -2283,10 +2299,12 @@ try_to_claim_block(struct zone *zone, struct page *page,
 	/* Take ownership for orders >= pageblock_order */
 	if (current_order >= pageblock_order) {
 		unsigned int nr_added;
+		bool was_reported = page_reported(page);
 
 		del_page_from_free_list(page, zone, current_order, block_type);
 		change_pageblock_range(page, current_order, start_type);
-		nr_added = expand(zone, page, order, current_order, start_type);
+		nr_added = expand(zone, page, order, current_order, start_type,
+				  was_reported);
 		account_freepages(zone, nr_added, start_type);
 		return page;
 	}
-- 
MST




* [PATCH v10 04/37] mm: page_reporting: allow driver to set batch capacity
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (2 preceding siblings ...)
  2026-06-08  8:34 ` [PATCH v10 03/37] mm: page_alloc: propagate PageReported flag across buddy splits Michael S. Tsirkin
@ 2026-06-08  8:34 ` Michael S. Tsirkin
  2026-06-08  8:34 ` [PATCH v10 05/37] mm: hugetlb: remove dead alloc_hugetlb_folio stub Michael S. Tsirkin
                   ` (36 subsequent siblings)
  40 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Add a capacity field to page_reporting_dev_info so drivers can
control the maximum number of pages per report batch. This is
useful when the driver needs to reserve virtqueue descriptors for
metadata (e.g., a bitmap buffer) alongside the page buffers.

The value is capped at PAGE_REPORTING_CAPACITY.
If unset (0), defaults to PAGE_REPORTING_CAPACITY.

The virtio_balloon driver sets capacity to the reporting virtqueue
size, letting page_reporting adapt to whatever the device provides.

Note: capacity need not be a power of 2. The DIV_ROUND_UP
in page_reporting_cycle() uses integer division, which the
compiler handles efficiently. Rounding capacity down to a power
of 2 instead could shrink the batch by up to almost half for
non-power-of-2 virtqueue sizes, wasting capacity.
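
The clamping rule and the budget arithmetic can be sketched in
isolation as follows. This is an illustration only: the demo_* names
are hypothetical, and the value 32 for the cap is an assumption made
for the example rather than something stated in this patch.

```c
#include <assert.h>

/* Assumed cap for illustration; the kernel defines the real constant. */
#define DEMO_REPORTING_CAPACITY 32U

/* Round-up integer division, same shape as the kernel's DIV_ROUND_UP. */
#define DEMO_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Unset (0) or oversized requests fall back to the cap. */
static unsigned int demo_clamp_capacity(unsigned int requested)
{
	if (!requested || requested > DEMO_REPORTING_CAPACITY)
		return DEMO_REPORTING_CAPACITY;
	return requested;
}

/* Per-cycle budget: free pages divided by 16 batches, rounded up. */
static unsigned long demo_budget(unsigned long nr_free, unsigned int capacity)
{
	assert(capacity != 0);
	return DEMO_DIV_ROUND_UP(nr_free, (unsigned long)capacity * 16);
}
```

With a capacity of 24 the divisor is 384, which is not a power of 2,
yet the budget is still well defined; the division just cannot be
reduced to a shift.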

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
 drivers/virtio/virtio_balloon.c |  5 +----
 include/linux/page_reporting.h  |  3 +++
 mm/page_reporting.c             | 25 ++++++++++++++-----------
 3 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f6c2dff33f8a..6a1a610c2cb1 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -1017,10 +1017,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		unsigned int capacity;
 
 		capacity = virtqueue_get_vring_size(vb->reporting_vq);
-		if (capacity < PAGE_REPORTING_CAPACITY) {
-			err = -ENOSPC;
-			goto out_unregister_oom;
-		}
 
 		vb->pr_dev_info.order = PAGE_REPORTING_ORDER_UNSPECIFIED;
 
@@ -1041,6 +1037,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		vb->pr_dev_info.order = 5;
 #endif
 
+		vb->pr_dev_info.capacity = capacity;
 		err = page_reporting_register(&vb->pr_dev_info);
 		if (err)
 			goto out_unregister_oom;
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index 9d4ca5c218a0..5ab5be02fa15 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -22,6 +22,9 @@ struct page_reporting_dev_info {
 
 	/* Minimal order of page reporting */
 	unsigned int order;
+
+	/* Max pages per report batch (default PAGE_REPORTING_CAPACITY) */
+	unsigned int capacity;
 };
 
 /* Tear-down and bring-up for page reporting devices */
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 7418f2e500bb..5b6b17f67131 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -174,10 +174,10 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
 	 * list processed. This should result in us reporting all pages on
 	 * an idle system in about 30 seconds.
 	 *
-	 * The division here should be cheap since PAGE_REPORTING_CAPACITY
-	 * should always be a power of 2.
+	 * The division here uses integer division; capacity need
+	 * not be a power of 2.
 	 */
-	budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
+	budget = DIV_ROUND_UP(area->nr_free, prdev->capacity * 16);
 
 	/* loop through free list adding unreported pages to sg list */
 	list_for_each_entry_safe(page, next, list, lru) {
@@ -222,10 +222,10 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
 		spin_unlock_irq(&zone->lock);
 
 		/* begin processing pages in local list */
-		err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
+		err = prdev->report(prdev, sgl, prdev->capacity);
 
 		/* reset offset since the full list was reported */
-		*offset = PAGE_REPORTING_CAPACITY;
+		*offset = prdev->capacity;
 
 		/* update budget to reflect call to report function */
 		budget--;
@@ -234,7 +234,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
 		spin_lock_irq(&zone->lock);
 
 		/* flush reported pages from the sg list */
-		page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
+		page_reporting_drain(prdev, sgl, prdev->capacity, !err);
 
 		/*
 		 * Reset next to first entry, the old next isn't valid
@@ -260,13 +260,13 @@ static int
 page_reporting_process_zone(struct page_reporting_dev_info *prdev,
 			    struct scatterlist *sgl, struct zone *zone)
 {
-	unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
+	unsigned int order, mt, leftover, offset = prdev->capacity;
 	unsigned long watermark;
 	int err = 0;
 
 	/* Generate minimum watermark to be able to guarantee progress */
 	watermark = low_wmark_pages(zone) +
-		    (PAGE_REPORTING_CAPACITY << page_reporting_order);
+		    (prdev->capacity << page_reporting_order);
 
 	/*
 	 * Cancel request if insufficient free memory or if we failed
@@ -290,7 +290,7 @@ page_reporting_process_zone(struct page_reporting_dev_info *prdev,
 	}
 
 	/* report the leftover pages before going idle */
-	leftover = PAGE_REPORTING_CAPACITY - offset;
+	leftover = prdev->capacity - offset;
 	if (leftover) {
 		sgl = &sgl[offset];
 		err = prdev->report(prdev, sgl, leftover);
@@ -322,11 +322,11 @@ static void page_reporting_process(struct work_struct *work)
 	atomic_set(&prdev->state, state);
 
 	/* allocate scatterlist to store pages being reported on */
-	sgl = kmalloc_objs(*sgl, PAGE_REPORTING_CAPACITY);
+	sgl = kmalloc_objs(*sgl, prdev->capacity);
 	if (!sgl)
 		goto err_out;
 
-	sg_init_table(sgl, PAGE_REPORTING_CAPACITY);
+	sg_init_table(sgl, prdev->capacity);
 
 	for_each_zone(zone) {
 		err = page_reporting_process_zone(prdev, sgl, zone);
@@ -377,6 +377,9 @@ int page_reporting_register(struct page_reporting_dev_info *prdev)
 			page_reporting_order = pageblock_order;
 	}
 
+	if (!prdev->capacity || prdev->capacity > PAGE_REPORTING_CAPACITY)
+		prdev->capacity = PAGE_REPORTING_CAPACITY;
+
 	/* initialize state and work structures */
 	atomic_set(&prdev->state, PAGE_REPORTING_IDLE);
 	INIT_DELAYED_WORK(&prdev->work, &page_reporting_process);
-- 
MST




* [PATCH v10 05/37] mm: hugetlb: remove dead alloc_hugetlb_folio stub
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (3 preceding siblings ...)
  2026-06-08  8:34 ` [PATCH v10 04/37] mm: page_reporting: allow driver to set batch capacity Michael S. Tsirkin
@ 2026-06-08  8:34 ` Michael S. Tsirkin
  2026-06-08  9:56   ` Lorenzo Stoakes
  2026-06-08  8:35 ` [PATCH v10 06/37] mm: move vma_alloc_folio_noprof to page_alloc.c Michael S. Tsirkin
                   ` (35 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Remove the !CONFIG_HUGETLB_PAGE stub for alloc_hugetlb_folio().

The stub is dead code: all callers are in mm/hugetlb.c
(CONFIG_HUGETLB_PAGE) or fs/hugetlbfs/inode.c (CONFIG_HUGETLBFS),
and CONFIG_HUGETLB_PAGE is def_bool HUGETLBFS with nothing
selecting it independently.

The stub is also broken: it returns NULL, but all callers check
IS_ERR(folio), so a NULL return would not be caught and would
crash on the subsequent folio dereference.
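
To see why a NULL return slips past an IS_ERR() check, here is a
userspace re-creation of the error-pointer convention. The demo_*
helpers are hypothetical names that mirror the logic of the kernel's
ERR_PTR()/IS_ERR(); they are not the kernel macros themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Error pointers occupy the top DEMO_MAX_ERRNO addresses. */
#define DEMO_MAX_ERRNO 4095UL

/* Encode a negative errno value as a pointer. */
static void *demo_err_ptr(long error)
{
	assert(error < 0 && error >= -(long)DEMO_MAX_ERRNO);
	return (void *)error;
}

/* True only for pointers in the reserved error range; NULL is not. */
static bool demo_is_err(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-DEMO_MAX_ERRNO;
}
```

A caller written as "if (demo_is_err(folio)) return ...;" therefore
falls through for NULL and goes on to dereference it.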

Remove it now since follow-up patches change the signature of
alloc_hugetlb_folio and would otherwise need to update the
broken stub too.

Reviewed-by: Gregory Price <gourry@gourry.net>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/hugetlb.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 5957bc25efa8..1f7ae6609e51 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1123,13 +1123,6 @@ static inline void wait_for_freed_hugetlb_folios(void)
 {
 }
 
-static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
-					   unsigned long addr,
-					   bool cow_from_owner)
-{
-	return NULL;
-}
-
 static inline struct folio *
 alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
 			    nodemask_t *nmask, gfp_t gfp_mask)
-- 
MST




* [PATCH v10 06/37] mm: move vma_alloc_folio_noprof to page_alloc.c
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (4 preceding siblings ...)
  2026-06-08  8:34 ` [PATCH v10 05/37] mm: hugetlb: remove dead alloc_hugetlb_folio stub Michael S. Tsirkin
@ 2026-06-08  8:35 ` Michael S. Tsirkin
  2026-06-08 10:05   ` Lorenzo Stoakes
  2026-06-08  8:35 ` [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing Michael S. Tsirkin
                   ` (34 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Move vma_alloc_folio_noprof() from an inline in gfp.h (for !NUMA)
and mempolicy.c (for NUMA) to page_alloc.c.

This prepares for a subsequent patch that will thread user_addr
through the allocator: having vma_alloc_folio_noprof in page_alloc.c
means user_addr can be passed to the internal allocation path
without changing public API signatures or duplicating plumbing
in both gfp.h and mempolicy.c.

The !NUMA path gains the VM_DROPPABLE -> __GFP_NOWARN check
that the NUMA path already had.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 include/linux/gfp.h |  9 ++-------
 mm/mempolicy.c      | 32 --------------------------------
 mm/page_alloc.c     | 43 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+), 39 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 51ef13ed756e..7ccbda35b9ad 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -318,13 +318,13 @@ static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask,
 
 #define  alloc_pages_node(...)			alloc_hooks(alloc_pages_node_noprof(__VA_ARGS__))
 
+struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
+		struct vm_area_struct *vma, unsigned long addr);
 #ifdef CONFIG_NUMA
 struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order);
 struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order);
 struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
 		struct mempolicy *mpol, pgoff_t ilx, int nid);
-struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
-		unsigned long addr);
 #else
 static inline struct page *alloc_pages_noprof(gfp_t gfp_mask, unsigned int order)
 {
@@ -339,11 +339,6 @@ static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int orde
 {
 	return folio_alloc_noprof(gfp, order);
 }
-static inline struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
-		struct vm_area_struct *vma, unsigned long addr)
-{
-	return folio_alloc_noprof(gfp, order);
-}
 #endif
 
 #define alloc_pages(...)			alloc_hooks(alloc_pages_noprof(__VA_ARGS__))
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d139b074a599..a1707ad498a8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2516,38 +2516,6 @@ struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
 	return page_rmappable_folio(page);
 }
 
-/**
- * vma_alloc_folio - Allocate a folio for a VMA.
- * @gfp: GFP flags.
- * @order: Order of the folio.
- * @vma: Pointer to VMA.
- * @addr: Virtual address of the allocation.  Must be inside @vma.
- *
- * Allocate a folio for a specific address in @vma, using the appropriate
- * NUMA policy.  The caller must hold the mmap_lock of the mm_struct of the
- * VMA to prevent it from going away.  Should be used for all allocations
- * for folios that will be mapped into user space, excepting hugetlbfs, and
- * excepting where direct use of folio_alloc_mpol() is more appropriate.
- *
- * Return: The folio on success or NULL if allocation fails.
- */
-struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
-		unsigned long addr)
-{
-	struct mempolicy *pol;
-	pgoff_t ilx;
-	struct folio *folio;
-
-	if (vma->vm_flags & VM_DROPPABLE)
-		gfp |= __GFP_NOWARN;
-
-	pol = get_vma_policy(vma, addr, order, &ilx);
-	folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
-	mpol_cond_put(pol);
-	return folio;
-}
-EXPORT_SYMBOL(vma_alloc_folio_noprof);
-
 struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned order)
 {
 	struct mempolicy *pol = &default_policy;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8dae5b3f5876..6a605d05e8cd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5286,6 +5286,49 @@ struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_
 }
 EXPORT_SYMBOL(__folio_alloc_noprof);
 
+#ifdef CONFIG_NUMA
+/**
+ * vma_alloc_folio - Allocate a folio for a VMA.
+ * @gfp: GFP flags.
+ * @order: Order of the folio.
+ * @vma: Pointer to VMA.
+ * @addr: Virtual address of the allocation.  Must be inside @vma.
+ *
+ * Allocate a folio for a specific address in @vma, using the appropriate
+ * NUMA policy.  The caller must hold the mmap_lock of the mm_struct of the
+ * VMA to prevent it from going away.  Should be used for all allocations
+ * for folios that will be mapped into user space, excepting hugetlbfs, and
+ * excepting where direct use of folio_alloc_mpol() is more appropriate.
+ *
+ * Return: The folio on success or NULL if allocation fails.
+ */
+struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
+		struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol;
+	pgoff_t ilx;
+	struct folio *folio;
+
+	if (vma->vm_flags & VM_DROPPABLE)
+		gfp |= __GFP_NOWARN;
+
+	pol = get_vma_policy(vma, addr, order, &ilx);
+	folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
+	mpol_cond_put(pol);
+	return folio;
+}
+#else
+struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
+		struct vm_area_struct *vma, unsigned long addr)
+{
+	if (vma->vm_flags & VM_DROPPABLE)
+		gfp |= __GFP_NOWARN;
+
+	return folio_alloc_noprof(gfp, order);
+}
+#endif
+EXPORT_SYMBOL(vma_alloc_folio_noprof);
+
 /*
  * Common helper functions. Never use with __GFP_HIGHMEM because the returned
  * address cannot represent highmem pages. Use alloc_pages and then kmap if
-- 
MST




* [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (5 preceding siblings ...)
  2026-06-08  8:35 ` [PATCH v10 06/37] mm: move vma_alloc_folio_noprof to page_alloc.c Michael S. Tsirkin
@ 2026-06-08  8:35 ` Michael S. Tsirkin
  2026-06-08 10:23   ` Lorenzo Stoakes
  2026-06-08  8:35 ` [PATCH v10 08/37] mm: add alloc_contig_frozen_pages_user " Michael S. Tsirkin
                   ` (33 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Thread a user virtual address from vma_alloc_folio() down through
the page allocator to post_alloc_hook(). This is plumbing
preparation for a subsequent patch that will use user_addr to
call folio_zero_user() for cache-friendly zeroing of user pages.

The user_addr is stored in struct alloc_context and flows through:
  vma_alloc_folio -> folio_alloc_mpol -> __alloc_pages_mpol ->
  __alloc_frozen_pages -> get_page_from_freelist -> prep_new_page ->
  post_alloc_hook

USER_ADDR_NONE ((unsigned long)-1) is used for non-user
allocations, since address 0 is a valid userspace mapping.
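
The choice of sentinel can be shown with a small sketch. The struct
and helper below are hypothetical stand-ins for illustration; only the
all-ones value comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* All-ones sentinel: "no user address", as described above. */
#define DEMO_USER_ADDR_NONE ((unsigned long)-1)

/* Hypothetical stand-in for the allocation context. */
struct demo_alloc_context {
	unsigned long user_addr;
};

/* Address 0 counts as a real user address; only the sentinel does not. */
static bool demo_has_user_addr(const struct demo_alloc_context *ac)
{
	assert(ac);
	return ac->user_addr != DEMO_USER_ADDR_NONE;
}
```

Using 0 as the "none" marker would misclassify a mapping at address 0
as a non-user allocation, which is the case the sentinel is meant to
avoid.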

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 include/linux/gfp.h |  2 +-
 mm/compaction.c     |  5 ++---
 mm/hugetlb.c        | 36 ++++++++++++++++++++----------------
 mm/internal.h       | 21 ++++++++++++++++++---
 mm/mempolicy.c      | 44 ++++++++++++++++++++++++++++++++------------
 mm/mmap.c           |  6 ++++++
 mm/page_alloc.c     | 44 +++++++++++++++++++++++++++++---------------
 mm/slub.c           |  4 ++--
 8 files changed, 110 insertions(+), 52 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 7ccbda35b9ad..ee35c5367abc 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -337,7 +337,7 @@ static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
 static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
 		struct mempolicy *mpol, pgoff_t ilx, int nid)
 {
-	return folio_alloc_noprof(gfp, order);
+	return __folio_alloc_noprof(gfp, order, numa_node_id(), NULL);
 }
 #endif
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 3648ce22c807..72684fe81e83 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -82,7 +82,7 @@ static inline bool is_via_compact_memory(int order) { return false; }
 
 static struct page *mark_allocated_noprof(struct page *page, unsigned int order, gfp_t gfp_flags)
 {
-	post_alloc_hook(page, order, __GFP_MOVABLE);
+	post_alloc_hook(page, order, __GFP_MOVABLE, USER_ADDR_NONE);
 	set_page_refcounted(page);
 	return page;
 }
@@ -1849,8 +1849,7 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
 		set_page_private(&freepage[size], start_order);
 	}
 	dst = (struct folio *)freepage;
-
-	post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
+	post_alloc_hook(&dst->page, order, __GFP_MOVABLE, USER_ADDR_NONE);
 	set_page_refcounted(&dst->page);
 	if (order)
 		prep_compound_page(&dst->page, order);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4b80b167cc9c..f3bc15a7889a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1786,7 +1786,8 @@ struct address_space *hugetlb_folio_mapping_lock_write(struct folio *folio)
 }
 
 static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
-		int nid, nodemask_t *nmask, nodemask_t *node_alloc_noretry)
+		int nid, nodemask_t *nmask, nodemask_t *node_alloc_noretry,
+		unsigned long addr)
 {
 	struct folio *folio;
 	bool alloc_try_hard = true;
@@ -1803,7 +1804,7 @@ static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
 	if (alloc_try_hard)
 		gfp_mask |= __GFP_RETRY_MAYFAIL;
 
-	folio = (struct folio *)__alloc_frozen_pages(gfp_mask, order, nid, nmask);
+	folio = (struct folio *)__alloc_frozen_pages(gfp_mask, order, nid, nmask, addr);
 
 	/*
 	 * If we did not specify __GFP_RETRY_MAYFAIL, but still got a
@@ -1832,7 +1833,7 @@ static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
 
 static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
 		gfp_t gfp_mask, int nid, nodemask_t *nmask,
-		nodemask_t *node_alloc_noretry)
+		nodemask_t *node_alloc_noretry, unsigned long addr)
 {
 	struct folio *folio;
 	int order = huge_page_order(h);
@@ -1844,7 +1845,7 @@ static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
 		folio = alloc_gigantic_frozen_folio(order, gfp_mask, nid, nmask);
 	else
 		folio = alloc_buddy_frozen_folio(order, gfp_mask, nid, nmask,
-						 node_alloc_noretry);
+						 node_alloc_noretry, addr);
 	if (folio)
 		init_new_hugetlb_folio(folio);
 	return folio;
@@ -1858,11 +1859,12 @@ static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
  * pages is zero, and the accounting must be done in the caller.
  */
 static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
-		gfp_t gfp_mask, int nid, nodemask_t *nmask)
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		unsigned long addr)
 {
 	struct folio *folio;
 
-	folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL);
+	folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL, addr);
 	if (folio)
 		hugetlb_vmemmap_optimize_folio(h, folio);
 	return folio;
@@ -1902,7 +1904,7 @@ static struct folio *alloc_pool_huge_folio(struct hstate *h,
 		struct folio *folio;
 
 		folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
-					nodes_allowed, node_alloc_noretry);
+					nodes_allowed, node_alloc_noretry, USER_ADDR_NONE);
 		if (folio)
 			return folio;
 	}
@@ -2071,7 +2073,8 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn, unsigned long end_pfn)
  * Allocates a fresh surplus page from the page allocator.
  */
 static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
-				gfp_t gfp_mask,	int nid, nodemask_t *nmask)
+				gfp_t gfp_mask,	int nid, nodemask_t *nmask,
+				unsigned long addr)
 {
 	struct folio *folio = NULL;
 
@@ -2083,7 +2086,7 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
 		goto out_unlock;
 	spin_unlock_irq(&hugetlb_lock);
 
-	folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask);
+	folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, addr);
 	if (!folio)
 		return NULL;
 
@@ -2126,7 +2129,7 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
 	if (hstate_is_gigantic(h))
 		return NULL;
 
-	folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask);
+	folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, USER_ADDR_NONE);
 	if (!folio)
 		return NULL;
 
@@ -2162,14 +2165,14 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
 	if (mpol_is_preferred_many(mpol)) {
 		gfp_t gfp = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
 
-		folio = alloc_surplus_hugetlb_folio(h, gfp, nid, nodemask);
+		folio = alloc_surplus_hugetlb_folio(h, gfp, nid, nodemask, addr);
 
 		/* Fallback to all nodes if page==NULL */
 		nodemask = NULL;
 	}
 
 	if (!folio)
-		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, nid, nodemask);
+		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, nid, nodemask, addr);
 	mpol_cond_put(mpol);
 	return folio;
 }
@@ -2276,7 +2279,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 		 * down the road to pick the current node if that is the case.
 		 */
 		folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
-						    NUMA_NO_NODE, &alloc_nodemask);
+						    NUMA_NO_NODE, &alloc_nodemask,
+						    USER_ADDR_NONE);
 		if (!folio) {
 			alloc_ok = false;
 			break;
@@ -2682,7 +2686,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
 			spin_unlock_irq(&hugetlb_lock);
 			gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 			new_folio = alloc_fresh_hugetlb_folio(h, gfp_mask,
-							      nid, NULL);
+							      nid, NULL, USER_ADDR_NONE);
 			if (!new_folio)
 				return -ENOMEM;
 			goto retry;
@@ -3380,13 +3384,13 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 			gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 
 			folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid,
-					&node_states[N_MEMORY], NULL);
+					&node_states[N_MEMORY], NULL, USER_ADDR_NONE);
 			if (!folio && !list_empty(&folio_list) &&
 			    hugetlb_vmemmap_optimizable_size(h)) {
 				prep_and_add_allocated_folios(h, &folio_list);
 				INIT_LIST_HEAD(&folio_list);
 				folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid,
-						&node_states[N_MEMORY], NULL);
+						&node_states[N_MEMORY], NULL, USER_ADDR_NONE);
 			}
 			if (!folio)
 				break;
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..9d2198114510 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -662,6 +662,16 @@ void calculate_min_free_kbytes(void);
 int __meminit init_per_zone_wmark_min(void);
 void page_alloc_sysctl_init(void);
 
+/*
+ * Sentinel for user_addr: indicates a non-user allocation.
+ * Cannot use 0 because address 0 is a valid userspace mapping.
+ * (unsigned long)-1 is safe because:
+ * 1. vm_end = addr + len <= TASK_SIZE, and vm_end is exclusive,
+ *    so -1 is never inside any VMA.
+ * 2. It will only be compared to page-aligned addresses.
+ */
+#define USER_ADDR_NONE	((unsigned long)-1)
+
 /*
  * Structure for holding the mostly immutable allocation parameters passed
  * between functions involved in allocations, including the alloc_pages*
@@ -693,6 +703,7 @@ struct alloc_context {
 	 */
 	enum zone_type highest_zoneidx;
 	bool spread_dirty_pages;
+	unsigned long user_addr;
 };
 
 /*
@@ -916,13 +927,14 @@ static inline void init_compound_tail(struct page *tail,
 	prep_compound_tail(tail, head, order);
 }
 
-void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
+void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
+		     unsigned long user_addr);
 extern bool free_pages_prepare(struct page *page, unsigned int order);
 
 extern int user_min_free_kbytes;
 
 struct page *__alloc_frozen_pages_noprof(gfp_t, unsigned int order, int nid,
-		nodemask_t *);
+		nodemask_t *, unsigned long user_addr);
 #define __alloc_frozen_pages(...) \
 	alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
 void free_frozen_pages(struct page *page, unsigned int order);
@@ -930,10 +942,13 @@ void free_unref_folios(struct folio_batch *fbatch);
 
 #ifdef CONFIG_NUMA
 struct page *alloc_frozen_pages_noprof(gfp_t, unsigned int order);
+struct folio *folio_alloc_mpol_user_noprof(gfp_t gfp, unsigned int order,
+		struct mempolicy *pol, pgoff_t ilx, int nid,
+		unsigned long user_addr);
 #else
 static inline struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order)
 {
-	return __alloc_frozen_pages_noprof(gfp, order, numa_node_id(), NULL);
+	return __alloc_frozen_pages_noprof(gfp, order, numa_node_id(), NULL, USER_ADDR_NONE);
 }
 #endif
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a1707ad498a8..f573ff32e94d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2413,7 +2413,8 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk,
 }
 
 static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
-						int nid, nodemask_t *nodemask)
+						int nid, nodemask_t *nodemask,
+						unsigned long user_addr)
 {
 	struct page *page;
 	gfp_t preferred_gfp;
@@ -2426,25 +2427,29 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
 	 */
 	preferred_gfp = gfp | __GFP_NOWARN;
 	preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
-	page = __alloc_frozen_pages_noprof(preferred_gfp, order, nid, nodemask);
+	page = __alloc_frozen_pages_noprof(preferred_gfp, order, nid,
+					   nodemask, user_addr);
 	if (!page)
-		page = __alloc_frozen_pages_noprof(gfp, order, nid, NULL);
+		page = __alloc_frozen_pages_noprof(gfp, order, nid, NULL,
+						   user_addr);
 
 	return page;
 }
 
 /**
- * alloc_pages_mpol - Allocate pages according to NUMA mempolicy.
+ * __alloc_pages_mpol - Allocate pages according to NUMA mempolicy.
  * @gfp: GFP flags.
  * @order: Order of the page allocation.
  * @pol: Pointer to the NUMA mempolicy.
  * @ilx: Index for interleave mempolicy (also distinguishes alloc_pages()).
  * @nid: Preferred node (usually numa_node_id() but @mpol may override it).
+ * @user_addr: User fault address for cache-friendly zeroing, or USER_ADDR_NONE.
  *
  * Return: The page on success or NULL if allocation fails.
  */
-static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
-		struct mempolicy *pol, pgoff_t ilx, int nid)
+static struct page *__alloc_pages_mpol(gfp_t gfp, unsigned int order,
+		struct mempolicy *pol, pgoff_t ilx, int nid,
+		unsigned long user_addr)
 {
 	nodemask_t *nodemask;
 	struct page *page;
@@ -2452,7 +2457,8 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 	nodemask = policy_nodemask(gfp, pol, ilx, &nid);
 
 	if (pol->mode == MPOL_PREFERRED_MANY)
-		return alloc_pages_preferred_many(gfp, order, nid, nodemask);
+		return alloc_pages_preferred_many(gfp, order, nid, nodemask,
+						 user_addr);
 
 	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
 	    /* filter "hugepage" allocation, unless from alloc_pages() */
@@ -2476,7 +2482,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 			 */
 			page = __alloc_frozen_pages_noprof(
 				gfp | __GFP_THISNODE | __GFP_NORETRY, order,
-				nid, NULL);
+				nid, NULL, user_addr);
 			if (page || !(gfp & __GFP_DIRECT_RECLAIM))
 				return page;
 			/*
@@ -2488,7 +2494,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 		}
 	}
 
-	page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask);
+	page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask, user_addr);
 
 	if (unlikely(pol->mode == MPOL_INTERLEAVE ||
 		     pol->mode == MPOL_WEIGHTED_INTERLEAVE) && page) {
@@ -2504,11 +2510,18 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 	return page;
 }
 
-struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
+static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 		struct mempolicy *pol, pgoff_t ilx, int nid)
 {
-	struct page *page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
-			ilx, nid);
+	return __alloc_pages_mpol(gfp, order, pol, ilx, nid, USER_ADDR_NONE);
+}
+
+struct folio *folio_alloc_mpol_user_noprof(gfp_t gfp, unsigned int order,
+		struct mempolicy *pol, pgoff_t ilx, int nid,
+		unsigned long user_addr)
+{
+	struct page *page = __alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
+			ilx, nid, user_addr);
 	if (!page)
 		return NULL;
 
@@ -2516,6 +2529,13 @@ struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
 	return page_rmappable_folio(page);
 }
 
+struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
+		struct mempolicy *pol, pgoff_t ilx, int nid)
+{
+	return folio_alloc_mpol_user_noprof(gfp, order, pol, ilx, nid,
+					    USER_ADDR_NONE);
+}
+
 struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned order)
 {
 	struct mempolicy *pol = &default_policy;
diff --git a/mm/mmap.c b/mm/mmap.c
index 5754d1c36462..73413cebc418 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -855,6 +855,12 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 	if (IS_ERR_VALUE(addr))
 		return addr;
 
+	/*
+	 * The check below ensures vm_end = addr + len <= TASK_SIZE.
+	 * Since (unsigned long)-1 (USER_ADDR_NONE) >= TASK_SIZE and
+	 * vm_end is exclusive, USER_ADDR_NONE is thus never a valid
+	 * userspace address.
+	 */
 	if (addr > TASK_SIZE - len)
 		return -ENOMEM;
 	if (offset_in_page(addr))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6a605d05e8cd..21b52c879751 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1815,7 +1815,7 @@ static inline bool should_skip_init(gfp_t flags)
 }
 
 inline void post_alloc_hook(struct page *page, unsigned int order,
-				gfp_t gfp_flags)
+				gfp_t gfp_flags, unsigned long user_addr)
 {
 	const bool zero_tags = gfp_flags & __GFP_ZEROTAGS;
 	bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
@@ -1870,9 +1870,10 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 }
 
 static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
-							unsigned int alloc_flags)
+							unsigned int alloc_flags,
+							unsigned long user_addr)
 {
-	post_alloc_hook(page, order, gfp_flags);
+	post_alloc_hook(page, order, gfp_flags, user_addr);
 
 	if (order && (gfp_flags & __GFP_COMP))
 		prep_compound_page(page, order);
@@ -3956,7 +3957,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		page = rmqueue(zonelist_zone(ac->preferred_zoneref), zone, order,
 				gfp_mask, alloc_flags, ac->migratetype);
 		if (page) {
-			prep_new_page(page, order, gfp_mask, alloc_flags);
+			prep_new_page(page, order, gfp_mask, alloc_flags,
+				      ac->user_addr);
 
 			return page;
 		} else {
@@ -4184,7 +4186,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 	/* Prep a captured page if available */
 	if (page)
-		prep_new_page(page, order, gfp_mask, alloc_flags);
+		prep_new_page(page, order, gfp_mask, alloc_flags,
+			      ac->user_addr);
 
 	/* Try get a page from the freelist if available */
 	if (!page)
@@ -5061,7 +5064,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 	struct zoneref *z;
 	struct per_cpu_pages *pcp;
 	struct list_head *pcp_list;
-	struct alloc_context ac;
+	struct alloc_context ac = { .user_addr = USER_ADDR_NONE };
 	gfp_t alloc_gfp;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
 	int nr_populated = 0, nr_account = 0;
@@ -5176,7 +5179,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 		}
 		nr_account++;
 
-		prep_new_page(page, 0, gfp, 0);
+		prep_new_page(page, 0, gfp, 0, USER_ADDR_NONE);
 		set_page_refcounted(page);
 		page_array[nr_populated++] = page;
 	}
@@ -5201,12 +5204,13 @@ EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
-		int preferred_nid, nodemask_t *nodemask)
+		int preferred_nid, nodemask_t *nodemask,
+		unsigned long user_addr)
 {
 	struct page *page;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
 	gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
-	struct alloc_context ac = { };
+	struct alloc_context ac = { .user_addr = user_addr };
 
 	/*
 	 * There are several places where we assume that the order value is sane
@@ -5267,10 +5271,12 @@ EXPORT_SYMBOL(__alloc_frozen_pages_noprof);
 
 struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
 		int preferred_nid, nodemask_t *nodemask)
+
 {
 	struct page *page;
 
-	page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
+	page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid,
+					   nodemask, USER_ADDR_NONE);
 	if (page)
 		set_page_refcounted(page);
 	return page;
@@ -5313,7 +5319,8 @@ struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
 		gfp |= __GFP_NOWARN;
 
 	pol = get_vma_policy(vma, addr, order, &ilx);
-	folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
+	folio = folio_alloc_mpol_user_noprof(gfp, order, pol, ilx,
+					     numa_node_id(), addr);
 	mpol_cond_put(pol);
 	return folio;
 }
@@ -5321,10 +5328,17 @@ struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
 struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
 		struct vm_area_struct *vma, unsigned long addr)
 {
+	struct page *page;
+
 	if (vma->vm_flags & VM_DROPPABLE)
 		gfp |= __GFP_NOWARN;
 
-	return folio_alloc_noprof(gfp, order);
+	page = __alloc_frozen_pages_noprof(gfp | __GFP_COMP, order,
+					   numa_node_id(), NULL, addr);
+	if (!page)
+		return NULL;
+	set_page_refcounted(page);
+	return page_rmappable_folio(page);
 }
 #endif
 EXPORT_SYMBOL(vma_alloc_folio_noprof);
@@ -6905,7 +6919,7 @@ static void split_free_frozen_pages(struct list_head *list, gfp_t gfp_mask)
 		list_for_each_entry_safe(page, next, &list[order], lru) {
 			int i;
 
-			post_alloc_hook(page, order, gfp_mask);
+			post_alloc_hook(page, order, gfp_mask, USER_ADDR_NONE);
 			if (!order)
 				continue;
 
@@ -7111,7 +7125,7 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
 		struct page *head = pfn_to_page(start);
 
 		check_new_pages(head, order);
-		prep_new_page(head, order, gfp_mask, 0);
+		prep_new_page(head, order, gfp_mask, 0, USER_ADDR_NONE);
 	} else {
 		ret = -EINVAL;
 		WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
@@ -7776,7 +7790,7 @@ struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned
 	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC | __GFP_COMP
 			| gfp_flags;
 	unsigned int alloc_flags = ALLOC_TRYLOCK;
-	struct alloc_context ac = { };
+	struct alloc_context ac = { .user_addr = USER_ADDR_NONE };
 	struct page *page;
 
 	VM_WARN_ON_ONCE(gfp_flags & ~__GFP_ACCOUNT);
diff --git a/mm/slub.c b/mm/slub.c
index a2bf3756ca7d..f397fa2f3f80 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3275,7 +3275,7 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
 	else if (node == NUMA_NO_NODE)
 		page = alloc_frozen_pages(flags, order);
 	else
-		page = __alloc_frozen_pages(flags, order, node, NULL);
+		page = __alloc_frozen_pages(flags, order, node, NULL, USER_ADDR_NONE);
 
 	if (!page)
 		return NULL;
@@ -5236,7 +5236,7 @@ static void *___kmalloc_large_node(size_t size, gfp_t flags, int node)
 	if (node == NUMA_NO_NODE)
 		page = alloc_frozen_pages_noprof(flags, order);
 	else
-		page = __alloc_frozen_pages_noprof(flags, order, node, NULL);
+		page = __alloc_frozen_pages_noprof(flags, order, node, NULL, USER_ADDR_NONE);
 
 	if (page) {
 		ptr = page_address(page);
-- 
MST




* [PATCH v10 08/37] mm: add alloc_contig_frozen_pages_user for cache-friendly zeroing
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (6 preceding siblings ...)
  2026-06-08  8:35 ` [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing Michael S. Tsirkin
@ 2026-06-08  8:35 ` Michael S. Tsirkin
  2026-06-08 10:29   ` Lorenzo Stoakes
  2026-06-08  8:35 ` [PATCH v10 09/37] mm: hugetlb: thread user_addr through gigantic page allocation Michael S. Tsirkin
                   ` (32 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:35 UTC (permalink / raw)
  To: linux-kernel

Add a _user variant of alloc_contig_frozen_pages that accepts a user_addr
parameter for cache-friendly zeroing of contiguous allocations.

No functional change: existing callers keep using
alloc_contig_frozen_pages(), which is now a thin wrapper that passes
USER_ADDR_NONE.

Note for reviewers: non-compound contiguous allocations are still
zeroed via kernel_init_pages(), as before this patch. They carry no
fault address because they do not come from the page fault path.
For compound allocations, user_addr reaches post_alloc_hook(), which
(once a later patch in this series wires it up) calls
folio_zero_user() and so performs the dcache flush needed on
cache-aliasing architectures.

Note about Sashiko (sashiko.dev) false positives: Sashiko
flags two issues here: (1) user_addr silently ignored for
non-compound allocations, and (2) post_alloc_hook ignores
user_addr. Both are false positives: (1) non-compound
contiguous allocations have no fault address to pass, and
(2) post_alloc_hook does use user_addr when it is not
USER_ADDR_NONE.
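
The routing described above can be sketched as a small userspace
model. This is only an illustration under stated assumptions, not
kernel code: model_contig_range(), model_contig_legacy() and
model_contig_self_check() are hypothetical names, and the model only
tracks which address would be handed to post_alloc_hook().

```c
#include <assert.h>
#include <stdbool.h>

/* Same sentinel value as the USER_ADDR_NONE definition in mm/internal.h. */
#define USER_ADDR_NONE ((unsigned long)-1)

/*
 * Hypothetical model of __alloc_contig_frozen_range(): returns the
 * address that ends up being handed to post_alloc_hook().
 */
static unsigned long model_contig_range(bool compound, unsigned long user_addr)
{
	/* Compound: prep_new_page(head, order, gfp_mask, 0, user_addr). */
	if (compound)
		return user_addr;
	/* Non-compound: the split path always passes USER_ADDR_NONE. */
	return USER_ADDR_NONE;
}

/* Hypothetical model of the legacy entry point, now a thin wrapper. */
static unsigned long model_contig_legacy(bool compound)
{
	return model_contig_range(compound, USER_ADDR_NONE);
}

/* Small self-check exercising the three cases. */
void model_contig_self_check(void)
{
	const unsigned long addr = 0x40000000UL; /* arbitrary page-aligned address */

	assert(model_contig_range(true, addr) == addr);
	assert(model_contig_range(false, addr) == USER_ADDR_NONE);
	assert(model_contig_legacy(true) == USER_ADDR_NONE);
}
```

Calling model_contig_self_check() from a test main() exercises all
three cases; the legacy wrapper never forwards a user address, which
is why existing callers see no behaviour change.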

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
 include/linux/gfp.h |  6 ++++++
 mm/page_alloc.c     | 42 ++++++++++++++++++++++++++++++++----------
 2 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ee35c5367abc..73109d4e31a4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -453,6 +453,12 @@ struct page *alloc_contig_frozen_pages_noprof(unsigned long nr_pages,
 #define alloc_contig_frozen_pages(...) \
 	alloc_hooks(alloc_contig_frozen_pages_noprof(__VA_ARGS__))
 
+struct page *alloc_contig_frozen_pages_user_noprof(unsigned long nr_pages,
+		gfp_t gfp_mask, int nid, nodemask_t *nodemask,
+		unsigned long user_addr);
+#define alloc_contig_frozen_pages_user(...) \
+	alloc_hooks(alloc_contig_frozen_pages_user_noprof(__VA_ARGS__))
+
 struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
 		int nid, nodemask_t *nodemask);
 #define alloc_contig_pages(...)	\
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 21b52c879751..6d3f284c607d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6975,13 +6975,15 @@ static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages
 }
 
 /**
- * alloc_contig_frozen_range() -- tries to allocate given range of frozen pages
+ * __alloc_contig_frozen_range() -- tries to allocate given range of frozen pages
  * @start:	start PFN to allocate
  * @end:	one-past-the-last PFN to allocate
  * @alloc_flags:	allocation information
  * @gfp_mask:	GFP mask. Node/zone/placement hints are ignored; only some
  *		action and reclaim modifiers are supported. Reclaim modifiers
  *		control allocation behavior during compaction/migration/reclaim.
+ * @user_addr:	user virtual address for cache-friendly zeroing, or
+ *		USER_ADDR_NONE for kernel allocations.
  *
  * The PFN range does not have to be pageblock aligned. The PFN range must
  * belong to a single zone.
@@ -6997,8 +6999,9 @@ static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages
  *
  * Return: zero on success or negative error code.
  */
-int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
-		acr_flags_t alloc_flags, gfp_t gfp_mask)
+static int __alloc_contig_frozen_range(unsigned long start, unsigned long end,
+		acr_flags_t alloc_flags, gfp_t gfp_mask,
+		unsigned long user_addr)
 {
 	const unsigned int order = ilog2(end - start);
 	unsigned long outer_start, outer_end;
@@ -7125,7 +7128,7 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
 		struct page *head = pfn_to_page(start);
 
 		check_new_pages(head, order);
-		prep_new_page(head, order, gfp_mask, 0, USER_ADDR_NONE);
+		prep_new_page(head, order, gfp_mask, 0, user_addr);
 	} else {
 		ret = -EINVAL;
 		WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
@@ -7135,6 +7138,13 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
 	undo_isolate_page_range(start, end);
 	return ret;
 }
+
+int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
+		acr_flags_t alloc_flags, gfp_t gfp_mask)
+{
+	return __alloc_contig_frozen_range(start, end, alloc_flags, gfp_mask,
+					   USER_ADDR_NONE);
+}
 EXPORT_SYMBOL(alloc_contig_frozen_range_noprof);
 
 /**
@@ -7227,14 +7237,16 @@ static bool zone_spans_last_pfn(const struct zone *zone,
 	return zone_spans_pfn(zone, last_pfn);
 }
 
-/**
- * alloc_contig_frozen_pages() -- tries to find and allocate contiguous range of frozen pages
+/*
+ * alloc_contig_frozen_pages_user_noprof() -- allocate contiguous frozen pages with user address
  * @nr_pages:	Number of contiguous pages to allocate
  * @gfp_mask:	GFP mask. Node/zone/placement hints limit the search; only some
  *		action and reclaim modifiers are supported. Reclaim modifiers
  *		control allocation behavior during compaction/migration/reclaim.
  * @nid:	Target node
  * @nodemask:	Mask for other possible nodes
+ * @user_addr:	user virtual address for cache-friendly zeroing, or
+ *		USER_ADDR_NONE for kernel allocations.
  *
  * This routine is a wrapper around alloc_contig_frozen_range(). It scans over
  * zones on an applicable zonelist to find a contiguous pfn range which can then
@@ -7253,8 +7265,9 @@ static bool zone_spans_last_pfn(const struct zone *zone,
  *
  * Return: pointer to contiguous frozen pages on success, or NULL if not successful.
  */
-struct page *alloc_contig_frozen_pages_noprof(unsigned long nr_pages,
-		gfp_t gfp_mask, int nid, nodemask_t *nodemask)
+struct page *alloc_contig_frozen_pages_user_noprof(unsigned long nr_pages,
+		gfp_t gfp_mask, int nid, nodemask_t *nodemask,
+		unsigned long user_addr)
 {
 	unsigned long ret, pfn, flags;
 	struct zonelist *zonelist;
@@ -7282,10 +7295,11 @@ struct page *alloc_contig_frozen_pages_noprof(unsigned long nr_pages,
 				 * win the race and cause allocation to fail.
 				 */
 				spin_unlock_irqrestore(&zone->lock, flags);
-				ret = alloc_contig_frozen_range_noprof(pfn,
+				ret = __alloc_contig_frozen_range(pfn,
 							pfn + nr_pages,
 							ACR_FLAGS_NONE,
-							gfp_mask);
+							gfp_mask,
+							user_addr);
 				if (!ret)
 					return pfn_to_page(pfn);
 				spin_lock_irqsave(&zone->lock, flags);
@@ -7307,6 +7321,14 @@ struct page *alloc_contig_frozen_pages_noprof(unsigned long nr_pages,
 	}
 	return NULL;
 }
+EXPORT_SYMBOL(alloc_contig_frozen_pages_user_noprof);
+
+struct page *alloc_contig_frozen_pages_noprof(unsigned long nr_pages,
+		gfp_t gfp_mask, int nid, nodemask_t *nodemask)
+{
+	return alloc_contig_frozen_pages_user_noprof(nr_pages, gfp_mask, nid,
+						     nodemask, USER_ADDR_NONE);
+}
 EXPORT_SYMBOL(alloc_contig_frozen_pages_noprof);
 
 /**
-- 
MST




* [PATCH v10 09/37] mm: hugetlb: thread user_addr through gigantic page allocation
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (7 preceding siblings ...)
  2026-06-08  8:35 ` [PATCH v10 08/37] mm: add alloc_contig_frozen_pages_user " Michael S. Tsirkin
@ 2026-06-08  8:35 ` Michael S. Tsirkin
  2026-06-08  8:36 ` [PATCH v10 10/37] mm: add folio_zero_user stub for configs without THP/HUGETLBFS Michael S. Tsirkin
                   ` (31 subsequent siblings)
  40 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:35 UTC (permalink / raw)
  To: linux-kernel

Thread the user_addr parameter through alloc_gigantic_frozen_folio so that
gigantic page allocations can benefit from cache-friendly zeroing.

Note: the CMA path (hugetlb_cma_alloc_frozen_folio) does not
receive user_addr because CMA uses alloc_contig_frozen_pages,
not the _user variant. CMA-allocated pages get zeroed via
the normal __GFP_ZERO path without cache-friendly addressing.
This is acceptable: gigantic pages are rare and the CMA path
is a fallback when buddy allocation fails.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
 mm/hugetlb.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f3bc15a7889a..5d7e546565f5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1355,7 +1355,7 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 
 #if defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) && defined(CONFIG_CONTIG_ALLOC)
 static struct folio *alloc_gigantic_frozen_folio(int order, gfp_t gfp_mask,
-		int nid, nodemask_t *nodemask)
+		int nid, nodemask_t *nodemask, unsigned long addr)
 {
 	struct folio *folio;
 
@@ -1366,13 +1366,15 @@ static struct folio *alloc_gigantic_frozen_folio(int order, gfp_t gfp_mask,
 	if (hugetlb_cma_exclusive_alloc())
 		return NULL;
 
-	folio = (struct folio *)alloc_contig_frozen_pages(1 << order, gfp_mask,
-							  nid, nodemask);
+	folio = (struct folio *)alloc_contig_frozen_pages_user(1 << order,
+							      gfp_mask,
+							      nid, nodemask,
+							      addr);
 	return folio;
 }
 #else /* !CONFIG_ARCH_HAS_GIGANTIC_PAGE || !CONFIG_CONTIG_ALLOC */
 static struct folio *alloc_gigantic_frozen_folio(int order, gfp_t gfp_mask, int nid,
-					  nodemask_t *nodemask)
+					  nodemask_t *nodemask, unsigned long addr)
 {
 	return NULL;
 }
@@ -1842,7 +1844,8 @@ static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
 		nid = numa_mem_id();
 
 	if (order_is_gigantic(order))
-		folio = alloc_gigantic_frozen_folio(order, gfp_mask, nid, nmask);
+		folio = alloc_gigantic_frozen_folio(order, gfp_mask, nid, nmask,
+						    addr);
 	else
 		folio = alloc_buddy_frozen_folio(order, gfp_mask, nid, nmask,
 						 node_alloc_noretry, addr);
-- 
MST




* [PATCH v10 10/37] mm: add folio_zero_user stub for configs without THP/HUGETLBFS
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (8 preceding siblings ...)
  2026-06-08  8:35 ` [PATCH v10 09/37] mm: hugetlb: thread user_addr through gigantic page allocation Michael S. Tsirkin
@ 2026-06-08  8:36 ` Michael S. Tsirkin
  2026-06-08  9:12   ` Lorenzo Stoakes
  2026-06-08  8:36 ` [PATCH v10 11/37] mm: page_alloc: move prep_compound_page before post_alloc_hook Michael S. Tsirkin
                   ` (30 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:36 UTC (permalink / raw)
  To: linux-kernel

folio_zero_user() is defined in mm/memory.c under
CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS.  A subsequent patch
will call it from post_alloc_hook() for all user page zeroing, so
configs without THP or HUGETLBFS will need a stub.

Add a stub that calls clear_user_highpages() with addr_hint aligned
down to the folio size.

Without THP/HUGETLBFS, only order-0 user pages are allocated, so
the locality optimization in the real folio_zero_user() (zero near
the faulting address last) is not needed.
This also matches what vma_alloc_zeroed_movable_folio currently does.
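
The address arithmetic in the stub can be checked with a minimal
userspace sketch. It assumes a 4 KiB base page and power-of-two folio
sizes; MODEL_PAGE_SIZE, model_align_down(), model_stub_base() and
model_stub_self_check() are hypothetical names used only for
illustration.

```c
#include <assert.h>

/* Assumed base page size for this model; real kernels use PAGE_SIZE. */
#define MODEL_PAGE_SIZE 4096UL

/* Round addr down to a multiple of size; size must be a power of two. */
static unsigned long model_align_down(unsigned long addr, unsigned long size)
{
	return addr & ~(size - 1);
}

/*
 * Hypothetical model of the stub's base computation: the folio size is
 * the page size shifted by the folio order, and the hint is aligned
 * down to that size.
 */
static unsigned long model_stub_base(unsigned long addr_hint, unsigned int order)
{
	return model_align_down(addr_hint, MODEL_PAGE_SIZE << order);
}

/* Small self-check with hand-computed values. */
void model_stub_self_check(void)
{
	/* Order 0: only the offset within the 4 KiB page is dropped. */
	assert(model_stub_base(0x12345678UL, 0) == 0x12345000UL);
	/* Order 2 (16 KiB): the low 14 bits are dropped. */
	assert(model_stub_base(0x12345678UL, 2) == 0x12344000UL);
	/* An already aligned hint is returned unchanged. */
	assert(model_stub_base(0x12344000UL, 2) == 0x12344000UL);
}
```

In the configurations the stub targets, only the order-0 case is
expected to occur; the higher-order case is shown only to make the
alignment visible.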

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 mm/folio_zero.h | 18 ++++++++++++++++++
 mm/page_alloc.c |  1 +
 2 files changed, 19 insertions(+)
 create mode 100644 mm/folio_zero.h

diff --git a/mm/folio_zero.h b/mm/folio_zero.h
new file mode 100644
index 000000000000..c135b3a34da8
--- /dev/null
+++ b/mm/folio_zero.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef MM_FOLIO_ZERO_H
+#define MM_FOLIO_ZERO_H
+
+#include <linux/highmem.h>
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+void folio_zero_user(struct folio *folio, unsigned long addr_hint);
+#else
+static inline void folio_zero_user(struct folio *folio, unsigned long addr_hint)
+{
+	unsigned long base = ALIGN_DOWN(addr_hint, folio_size(folio));
+
+	clear_user_highpages(&folio->page, base, folio_nr_pages(folio));
+}
+#endif
+
+#endif /* MM_FOLIO_ZERO_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d3f284c607d..0943ab724032 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -17,6 +17,7 @@
 #include <linux/stddef.h>
 #include <linux/mm.h>
 #include <linux/highmem.h>
+#include "folio_zero.h"
 #include <linux/interrupt.h>
 #include <linux/jiffies.h>
 #include <linux/compiler.h>
-- 
MST




* [PATCH v10 11/37] mm: page_alloc: move prep_compound_page before post_alloc_hook
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (9 preceding siblings ...)
  2026-06-08  8:36 ` [PATCH v10 10/37] mm: add folio_zero_user stub for configs without THP/HUGETLBFS Michael S. Tsirkin
@ 2026-06-08  8:36 ` Michael S. Tsirkin
  2026-06-08 10:33   ` Lorenzo Stoakes
  2026-06-08  8:36 ` [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook Michael S. Tsirkin
                   ` (29 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:36 UTC (permalink / raw)
  To: linux-kernel

Move prep_compound_page() before post_alloc_hook() in prep_new_page().

The next patch adds a folio_zero_user() call to post_alloc_hook(),
which uses folio_nr_pages() to determine how many pages to zero.
Without compound metadata set up first, folio_nr_pages() returns 1
for higher-order allocations, so only the first page would be zeroed.
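
As a rough illustration of the ordering problem, the toy model below
counts how many pages a size-driven hook would zero depending on
whether the compound metadata is set up first.  All toy_* names are
hypothetical stand-ins and do not reflect the kernel's struct page
layout or the real prep_compound_page()/folio_nr_pages() code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page; this is NOT the kernel's struct page. */
struct toy_page {
	bool is_compound;	/* set by toy_prep_compound() */
	unsigned int order;	/* meaningful only when is_compound is set */
};

/* Hypothetical analogue of prep_compound_page(): record the order. */
static void toy_prep_compound(struct toy_page *page, unsigned int order)
{
	page->is_compound = true;
	page->order = order;
}

/* Hypothetical analogue of folio_nr_pages(): 1 without compound metadata. */
static size_t toy_nr_pages(const struct toy_page *page)
{
	return page->is_compound ? (size_t)1 << page->order : 1;
}

/*
 * Number of pages a hook sized by toy_nr_pages() would zero, depending on
 * whether the compound metadata is set up before or after the hook runs.
 */
static size_t toy_pages_zeroed(unsigned int order, bool prep_first)
{
	struct toy_page page = { .is_compound = false, .order = 0 };
	size_t zeroed;

	if (prep_first && order)
		toy_prep_compound(&page, order);
	zeroed = toy_nr_pages(&page);	/* the "hook" samples the size here */
	if (!prep_first && order)
		toy_prep_compound(&page, order);
	return zeroed;
}
```

With order 3, toy_pages_zeroed() reports a single page for the old
ordering and all eight pages once the prep step runs first.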

All other operations in post_alloc_hook() (arch_alloc_page, KASAN,
debug, page owner, etc.) use raw page pointers with explicit order
counts and are unaffected by this reordering.

Also reorder compaction_alloc_noprof() for consistency. Compaction
currently passes USER_ADDR_NONE so folio_zero_user() is not called
there, but keeping the same ordering avoids a trap for future changes.

Reviewed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 mm/compaction.c | 4 ++--
 mm/page_alloc.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 72684fe81e83..4336e433c99b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1849,10 +1849,10 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
 		set_page_private(&freepage[size], start_order);
 	}
 	dst = (struct folio *)freepage;
-	post_alloc_hook(&dst->page, order, __GFP_MOVABLE, USER_ADDR_NONE);
-	set_page_refcounted(&dst->page);
 	if (order)
 		prep_compound_page(&dst->page, order);
+	post_alloc_hook(&dst->page, order, __GFP_MOVABLE, USER_ADDR_NONE);
+	set_page_refcounted(&dst->page);
 	cc->nr_freepages -= 1 << order;
 	cc->nr_migratepages -= 1 << order;
 	return page_rmappable_folio(&dst->page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0943ab724032..4676fd49819e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1874,11 +1874,11 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
 							unsigned int alloc_flags,
 							unsigned long user_addr)
 {
-	post_alloc_hook(page, order, gfp_flags, user_addr);
-
 	if (order && (gfp_flags & __GFP_COMP))
 		prep_compound_page(page, order);
 
+	post_alloc_hook(page, order, gfp_flags, user_addr);
+
 	/*
 	 * page is set pfmemalloc when ALLOC_NO_WATERMARKS was necessary to
 	 * allocate the page. The expectation is that the caller is taking
-- 
MST




* [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (10 preceding siblings ...)
  2026-06-08  8:36 ` [PATCH v10 11/37] mm: page_alloc: move prep_compound_page before post_alloc_hook Michael S. Tsirkin
@ 2026-06-08  8:36 ` Michael S. Tsirkin
  2026-06-08 11:23   ` Lorenzo Stoakes
  2026-06-08  8:36 ` [PATCH v10 13/37] mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio Michael S. Tsirkin
                   ` (28 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

When post_alloc_hook() has to zero a user page for an explicit
__GFP_ZERO allocation (user_addr is set), use folio_zero_user()
instead of kernel_init_pages().  folio_zero_user() zeros the pages
nearest the faulting address last, keeping those cachelines hot for
the impending user access.
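
The "faulting address last" property can be sketched as follows.
This is a simplified illustration only: the ordering that
folio_zero_user() actually uses may differ, and toy_zero_order() is a
hypothetical helper that merely guarantees the target page is written
last.

```c
#include <stddef.h>

/*
 * Fill order[] with the sequence in which nr pages would be zeroed so that
 * page `target` comes last.  Returns the index of the final page zeroed.
 * Simplified illustration only; requires nr > 0 and target < nr.
 */
static size_t toy_zero_order(size_t nr, size_t target, size_t *order)
{
	size_t lo = 0, hi = nr - 1, k = 0;

	while (lo < hi) {
		/* Zero whichever end is currently farther from the target. */
		if (target - lo >= hi - target)
			order[k++] = lo++;
		else
			order[k++] = hi--;
	}
	order[k] = lo;	/* lo == hi == target here */
	return order[k];
}
```

For an 8-page folio with the fault in page 5, this sketch visits the
pages as 0, 1, 2, 3, 7, 4, 6, 5, so the page about to be touched by
userspace is the most recently written one.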

folio_zero_user() is only used for explicit __GFP_ZERO, not for
init_on_alloc.  On architectures with virtually-indexed caches
(e.g., ARM), clear_user_highpage() performs per-line cache
operations; using it for init_on_alloc would add overhead that
kernel_init_pages() avoids (the page fault path flushes the
cache at PTE installation time regardless).
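
The selection between the two zeroing paths can be summarised in a
small decision function.  This is a sketch that mirrors the branch
structure of the diff in this patch, with hypothetical toy_* names;
the boolean inputs stand in for the init variable, the __GFP_ZERO
bit, user_addr != USER_ADDR_NONE and user_alloc_needs_zeroing().

```c
#include <stdbool.h>

enum toy_zero_method {
	TOY_ZERO_NONE,		/* nothing to do */
	TOY_ZERO_KERNEL,	/* stands in for kernel_init_pages() */
	TOY_ZERO_USER,		/* stands in for folio_zero_user() */
};

static enum toy_zero_method toy_pick_zeroing(bool init, bool gfp_zero,
					     bool has_user_addr,
					     bool user_needs_zeroing)
{
	/* A __GFP_ZERO user page may still need zeroing at this point. */
	if (!init && gfp_zero && has_user_addr && user_needs_zeroing)
		init = true;
	if (!init)
		return TOY_ZERO_NONE;
	/* Explicit __GFP_ZERO for a user page: fault-address-aware path. */
	if (gfp_zero && has_user_addr)
		return TOY_ZERO_USER;
	/* init_on_alloc or no user address: plain kernel zeroing. */
	return TOY_ZERO_KERNEL;
}
```

Note that an init_on_alloc-only request (gfp_zero false) always
lands on the kernel path, which is the behaviour the paragraph above
argues for.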

No functional change yet: current callers do not pass __GFP_ZERO
for user pages (they zero at the callsite instead).  Subsequent
patches will convert them.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
 mm/page_alloc.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4676fd49819e..d4fbf1861a8a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1861,9 +1861,38 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 		for (i = 0; i != 1 << order; ++i)
 			page_kasan_tag_reset(page + i);
 	}
-	/* If memory is still not initialized, initialize it now. */
-	if (init)
-		kernel_init_pages(page, 1 << order);
+	/*
+	 * On architectures with cache aliasing, pages zeroed via the
+	 * kernel direct map (e.g. init_on_free) must be re-zeroed
+	 * through a user-congruent mapping.  Host-zeroed pages
+	 * (zeroed flag) don't need this: physical RAM is clean.
+	 */
+	if (!init && (gfp_flags & __GFP_ZERO) &&
+	    user_addr != USER_ADDR_NONE &&
+	    user_alloc_needs_zeroing())
+		init = true;
+	/*
+	 * If memory is still not initialized, initialize it now.
+	 * When __GFP_ZERO was explicitly requested and user_addr is set,
+	 * use folio_zero_user() which zeros near the faulting address
+	 * last, keeping those cachelines hot.  For init_on_alloc, use
+	 * kernel_init_pages() to avoid unnecessary cache flush overhead
+	 * on architectures with virtually-indexed caches.
+	 */
+	if (init) {
+		if ((gfp_flags & __GFP_ZERO) && user_addr != USER_ADDR_NONE) {
+			/*
+			 * folio_zero_user relies on folio_nr_pages which
+			 * requires __GFP_COMP for order > 0.  All user folio
+			 * allocations set __GFP_COMP via __folio_alloc.
+			 * user_addr != USER_ADDR_NONE implies sleepable
+			 * context (user page fault).
+			 */
+			VM_WARN_ON_ONCE(order && !(gfp_flags & __GFP_COMP));
+			folio_zero_user(page_folio(page), user_addr);
+		} else
+			kernel_init_pages(page, 1 << order);
+	}
 
 	set_page_owner(page, order, gfp_flags);
 	page_table_check_alloc(page, order);
-- 
MST




* [PATCH v10 13/37] mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (11 preceding siblings ...)
  2026-06-08  8:36 ` [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook Michael S. Tsirkin
@ 2026-06-08  8:36 ` Michael S. Tsirkin
  2026-06-08 10:39   ` Lorenzo Stoakes
  2026-06-08  8:37 ` [PATCH v10 14/37] mm: remove arch vma_alloc_zeroed_movable_folio overrides Michael S. Tsirkin
                   ` (27 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Now that post_alloc_hook() handles cache-friendly user page
zeroing via folio_zero_user(), convert vma_alloc_zeroed_movable_folio()
to pass __GFP_ZERO instead of zeroing at the callsite.

Note: before this series, replacing clear_user_highpage() with
__GFP_ZERO was unsafe on cache-aliasing architectures because
__GFP_ZERO uses clear_page() without a dcache flush. With this
series it is safe as long as the caller passes a valid user
address (not USER_ADDR_NONE) to vma_alloc_folio() and friends:
the address reaches post_alloc_hook(), which zeros through
folio_zero_user() and so performs the dcache flush. Passing
USER_ADDR_NONE remains unsafe.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
 include/linux/highmem.h | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index d7aac9de1c8a..8b0afaabbc6e 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -320,13 +320,8 @@ static inline
 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
 				   unsigned long vaddr)
 {
-	struct folio *folio;
-
-	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vaddr);
-	if (folio && user_alloc_needs_zeroing())
-		clear_user_highpage(&folio->page, vaddr);
-
-	return folio;
+	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
+			      0, vma, vaddr);
 }
 #endif
 
-- 
MST




* [PATCH v10 14/37] mm: remove arch vma_alloc_zeroed_movable_folio overrides
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (12 preceding siblings ...)
  2026-06-08  8:36 ` [PATCH v10 13/37] mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio Michael S. Tsirkin
@ 2026-06-08  8:37 ` Michael S. Tsirkin
  2026-06-08 11:29   ` Lorenzo Stoakes
  2026-06-08  8:37 ` [PATCH v10 15/37] mm: alloc_anon_folio: pass raw fault address to vma_alloc_folio Michael S. Tsirkin
                   ` (26 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Magnus Lindholm,
	Greg Ungerer, Geert Uytterhoeven

Now that the generic vma_alloc_zeroed_movable_folio() uses
__GFP_ZERO, the arch-specific macros on alpha, m68k, s390, and
x86 that did the same thing are redundant.  Remove them.

arm64 is not affected: it has a real function override that
handles MTE tag zeroing, not just __GFP_ZERO.

Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: Magnus Lindholm <linmag7@gmail.com>
Acked-by: Greg Ungerer <gerg@linux-m68k.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Gregory Price <gourry@gourry.net>
---
 arch/alpha/include/asm/page.h   | 3 ---
 arch/m68k/include/asm/page_no.h | 3 ---
 arch/s390/include/asm/page.h    | 3 ---
 arch/x86/include/asm/page.h     | 3 ---
 include/linux/highmem.h         | 8 +++++---
 5 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index 59d01f9b77f6..4327029cd660 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -12,9 +12,6 @@
 
 extern void clear_page(void *page);
 
-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
-
 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h
index d2532bc407ef..f511b763a235 100644
--- a/arch/m68k/include/asm/page_no.h
+++ b/arch/m68k/include/asm/page_no.h
@@ -12,9 +12,6 @@ extern unsigned long memory_end;
 
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
-
 #define __pa(vaddr)		((unsigned long)(vaddr))
 #define __va(paddr)		((void *)((unsigned long)(paddr)))
 
diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index 56da819a79e6..e995d2a413f9 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -67,9 +67,6 @@ static inline void copy_page(void *to, void *from)
 
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
-
 #ifdef CONFIG_STRICT_MM_TYPECHECKS
 #define STRICT_MM_TYPECHECKS
 #endif
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 416dc88e35c1..92fa975b46f3 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -28,9 +28,6 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 	copy_page(to, from);
 }
 
-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
-
 #ifndef __pa
 #define __pa(x)		__phys_addr((unsigned long)(x))
 #endif
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 8b0afaabbc6e..642718a50c27 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -303,7 +303,6 @@ static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
 #endif
 }
 
-#ifndef vma_alloc_zeroed_movable_folio
 /**
  * vma_alloc_zeroed_movable_folio - Allocate a zeroed page for a VMA.
  * @vma: The VMA the page is to be allocated for.
@@ -317,12 +316,15 @@ static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
  * we are out of memory.
  */
 static inline
-struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
+struct folio *vma_alloc_zeroed_movable_folio_noprof(struct vm_area_struct *vma,
 				   unsigned long vaddr)
 {
-	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
+	return vma_alloc_folio_noprof(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
 			      0, vma, vaddr);
 }
+#ifndef vma_alloc_zeroed_movable_folio
+#define vma_alloc_zeroed_movable_folio(...) \
+	alloc_hooks(vma_alloc_zeroed_movable_folio_noprof(__VA_ARGS__))
 #endif
 
 static inline void clear_highpage(struct page *page)
-- 
MST




* [PATCH v10 15/37] mm: alloc_anon_folio: pass raw fault address to vma_alloc_folio
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (13 preceding siblings ...)
  2026-06-08  8:37 ` [PATCH v10 14/37] mm: remove arch vma_alloc_zeroed_movable_folio overrides Michael S. Tsirkin
@ 2026-06-08  8:37 ` Michael S. Tsirkin
  2026-06-08 11:35   ` Lorenzo Stoakes
  2026-06-08  8:37 ` [PATCH v10 16/37] mm: alloc_swap_folio: " Michael S. Tsirkin
                   ` (25 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Pass vmf->address directly instead of ALIGN_DOWN(vmf->address, ...).
NUMA interleave is not affected: the ilx calculation in
get_vma_policy() shifts addr >> PAGE_SHIFT >> order, which drops
every bit that ALIGN_DOWN() would have cleared. post_alloc_hook()
will use the raw address for cache-friendly zeroing via
folio_zero_user().
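
A worked check of the shift argument, assuming 4 KiB pages and
treating the index as just the shift described above (the real
calculation involves VMA offsets that this sketch omits; the toy_*
names and TOY_* constants are hypothetical):

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT	12
#define TOY_PAGE_SIZE	(UINT64_C(1) << TOY_PAGE_SHIFT)

/* Round addr down to a multiple of size (size must be a power of two). */
static uint64_t toy_align_down(uint64_t addr, uint64_t size)
{
	return addr & ~(size - 1);
}

/* The shift from the commit message: addr >> PAGE_SHIFT >> order. */
static uint64_t toy_ilx(uint64_t addr, unsigned int order)
{
	return addr >> TOY_PAGE_SHIFT >> order;
}

/* True if the raw and the folio-aligned address give the same index. */
static int toy_ilx_unchanged(uint64_t addr, unsigned int order)
{
	uint64_t aligned = toy_align_down(addr, TOY_PAGE_SIZE << order);

	return toy_ilx(addr, order) == toy_ilx(aligned, order);
}
```

For example, with order 4 the address 0x7f0000123456 aligns down to
0x7f0000120000, and both shift to the same value because the low
16 bits are discarded either way.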

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 mm/memory.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 86a973119bd4..21f640674c4f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5268,8 +5268,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	/* Try allocating the highest of the remaining orders. */
 	gfp = vma_thp_gfp_mask(vma);
 	while (orders) {
-		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
-		folio = vma_alloc_folio(gfp, order, vma, addr);
+		folio = vma_alloc_folio(gfp, order, vma, vmf->address);
 		if (folio) {
 			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
 				count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
-- 
MST




* [PATCH v10 16/37] mm: alloc_swap_folio: pass raw fault address to vma_alloc_folio
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (14 preceding siblings ...)
  2026-06-08  8:37 ` [PATCH v10 15/37] mm: alloc_anon_folio: pass raw fault address to vma_alloc_folio Michael S. Tsirkin
@ 2026-06-08  8:37 ` Michael S. Tsirkin
  2026-06-08 11:37   ` Lorenzo Stoakes
  2026-06-08  8:37 ` [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
                   ` (24 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Same change as the previous patch but for alloc_swap_folio:
pass vmf->address directly instead of ALIGN_DOWN(vmf->address, ...).

Note: NUMA interleave is not affected by the raw address;
the ilx calculation shifts addr >> PAGE_SHIFT >> order,
dropping every bit that ALIGN_DOWN() would have cleared.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 mm/memory.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 21f640674c4f..6c14b90f558e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4750,8 +4750,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	/* Try allocating the highest of the remaining orders. */
 	gfp = vma_thp_gfp_mask(vma);
 	while (orders) {
-		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
-		folio = vma_alloc_folio(gfp, order, vma, addr);
+		folio = vma_alloc_folio(gfp, order, vma, vmf->address);
 		if (folio) {
 			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
 							    gfp, entry))
-- 
MST




* [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (15 preceding siblings ...)
  2026-06-08  8:37 ` [PATCH v10 16/37] mm: alloc_swap_folio: " Michael S. Tsirkin
@ 2026-06-08  8:37 ` Michael S. Tsirkin
  2026-06-08 12:00   ` Lorenzo Stoakes
  2026-06-08  8:38 ` [PATCH v10 18/37] mm: page_alloc: use aliasing checks instead of user_alloc_needs_zeroing Michael S. Tsirkin
                   ` (23 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

When a guest reports free pages to the hypervisor via the page reporting
framework (used by virtio-balloon and hv_balloon), the host typically
zeros those pages when reclaiming their backing memory.  However, when
those pages are later allocated in the guest, post_alloc_hook() zeros
them again whenever __GFP_ZERO is set.  This double-zeroing is
wasteful, especially for large pages.

Avoid redundant zeroing:

- Add a host_zeroes_pages flag to page_reporting_dev_info, allowing
  drivers to declare that their host zeros reported pages on reclaim.
  A static key (page_reporting_host_zeroes) gates the fast path.

- Add PG_zeroed page flag (sharing PG_private bit) to mark pages
  that have been zeroed by the host.  Set it in
  page_reporting_drain() after the host reports them.

- Thread the zeroed bool through rmqueue -> prep_new_page ->
  post_alloc_hook, where it skips redundant zeroing for __GFP_ZERO
  allocations.
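
The three steps above can be sketched with a toy model: the hint is
read and cleared when the page leaves the free list, then handed to
the hook, which skips its own memset when the hint is set.  All
toy_* names are hypothetical, and the sketch omits the static key,
tag zeroing and the init_on_alloc handling of the real
post_alloc_hook().

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

/* Toy page: a hint bit plus a small payload; NOT the kernel's struct page. */
struct toy_page {
	bool pg_zeroed;
	unsigned char data[TOY_PAGE_SIZE];
};

/* Leaving the free list: sample the hint, then clear it on the page. */
static bool toy_rmqueue(struct toy_page *page)
{
	bool zeroed = page->pg_zeroed;

	page->pg_zeroed = false;
	return zeroed;
}

/* Returns true if this call actually had to zero the page. */
static bool toy_post_alloc_hook(struct toy_page *page, bool want_zero,
				bool zeroed)
{
	if (!want_zero || zeroed)
		return false;
	memset(page->data, 0, sizeof(page->data));
	return true;
}

/* Run one allocation and report whether the guest zeroed the page. */
static bool toy_alloc_did_zero(bool host_zeroed, bool want_zero)
{
	struct toy_page page;

	/* Host-zeroed pages start clean; others hold stale bytes. */
	memset(page.data, host_zeroed ? 0 : 0xaa, sizeof(page.data));
	page.pg_zeroed = host_zeroed;
	return toy_post_alloc_hook(&page, want_zero, toy_rmqueue(&page));
}
```

Passing the hint as a plain bool rather than leaving it in the page
flags means the bit is already clear by the time the new owner sees
the page, which matters because the flag shares its bit with
PG_private.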

Currently the PG_zeroed hint can be lost when pages are
split (expand) or merged in the buddy allocator.  This is
harmless: losing the hint just means the page gets re-zeroed,
which is correct but suboptimal.  Follow-up patches propagate
PG_zeroed across splits and merges to preserve the hint on
common paths.

No driver sets host_zeroes_pages yet; a follow-up patch to
virtio_balloon is needed to opt in.

PG_zeroed pages may pass through PCP lists before being freed.
This is safe: __free_pages_prepare clears all
PAGE_FLAGS_CHECK_AT_PREP flags (including PG_zeroed/PG_private)
before the page re-enters the buddy allocator.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 include/linux/page-flags.h     |  9 +++++
 include/linux/page_reporting.h |  3 ++
 mm/compaction.c                |  6 ++-
 mm/internal.h                  |  2 +-
 mm/page_alloc.c                | 68 +++++++++++++++++++++++-----------
 mm/page_reporting.c            | 14 ++++++-
 mm/page_reporting.h            | 12 ++++++
 7 files changed, 88 insertions(+), 26 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7223f6f4e2b4..91f8ddb1d512 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -135,6 +135,8 @@ enum pageflags {
 	PG_swapcache = PG_owner_priv_1, /* Swap page: swp_entry_t in private */
 	/* Some filesystems */
 	PG_checked = PG_owner_priv_1,
+	/* Page contents are known to be zero */
+	PG_zeroed = PG_private,
 
 	/*
 	 * Depending on the way an anonymous folio can be mapped into a page
@@ -673,6 +675,13 @@ FOLIO_TEST_CLEAR_FLAG_FALSE(young)
 FOLIO_FLAG_FALSE(idle)
 #endif
 
+/*
+ * PageZeroed() tracks pages known to be zero.  The allocator
+ * uses this to skip redundant zeroing in post_alloc_hook().
+ */
+__PAGEFLAG(Zeroed, zeroed, PF_NO_COMPOUND)
+#define __PG_ZEROED (1UL << PG_zeroed)
+
 /*
  * PageReported() is used to track reported free pages within the Buddy
  * allocator. We can use the non-atomic version of the test and set
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index 5ab5be02fa15..c331c6b36687 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -14,6 +14,9 @@ struct page_reporting_dev_info {
 	int (*report)(struct page_reporting_dev_info *prdev,
 		      struct scatterlist *sg, unsigned int nents);
 
+	/* If true, host zeros reported pages on reclaim */
+	bool host_zeroes_pages;
+
 	/* work struct for processing reports */
 	struct delayed_work work;
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 4336e433c99b..8000fc5e0a2e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -82,7 +82,8 @@ static inline bool is_via_compact_memory(int order) { return false; }
 
 static struct page *mark_allocated_noprof(struct page *page, unsigned int order, gfp_t gfp_flags)
 {
-	post_alloc_hook(page, order, __GFP_MOVABLE, USER_ADDR_NONE);
+	__ClearPageZeroed(page);
+	post_alloc_hook(page, order, __GFP_MOVABLE, false, USER_ADDR_NONE);
 	set_page_refcounted(page);
 	return page;
 }
@@ -1849,9 +1850,10 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
 		set_page_private(&freepage[size], start_order);
 	}
 	dst = (struct folio *)freepage;
+	__ClearPageZeroed(&dst->page);
 	if (order)
 		prep_compound_page(&dst->page, order);
-	post_alloc_hook(&dst->page, order, __GFP_MOVABLE, USER_ADDR_NONE);
+	post_alloc_hook(&dst->page, order, __GFP_MOVABLE, false, USER_ADDR_NONE);
 	set_page_refcounted(&dst->page);
 	cc->nr_freepages -= 1 << order;
 	cc->nr_migratepages -= 1 << order;
diff --git a/mm/internal.h b/mm/internal.h
index 9d2198114510..4af5e72742ba 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -928,7 +928,7 @@ static inline void init_compound_tail(struct page *tail,
 }
 
 void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
-		     unsigned long user_addr);
+		     bool zeroed, unsigned long user_addr);
 extern bool free_pages_prepare(struct page *page, unsigned int order);
 
 extern int user_min_free_kbytes;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4fbf1861a8a..45e824b1ec75 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1743,6 +1743,7 @@ static __always_inline void page_del_and_expand(struct zone *zone,
 	bool was_reported = page_reported(page);
 
 	__del_page_from_free_list(page, zone, high, migratetype);
+
 	nr_pages -= expand(zone, page, low, high, migratetype, was_reported);
 	account_freepages(zone, -nr_pages, migratetype);
 }
@@ -1815,8 +1816,10 @@ static inline bool should_skip_init(gfp_t flags)
 	return (flags & __GFP_SKIP_ZERO);
 }
 
+
 inline void post_alloc_hook(struct page *page, unsigned int order,
-				gfp_t gfp_flags, unsigned long user_addr)
+				gfp_t gfp_flags, bool zeroed,
+				unsigned long user_addr)
 {
 	const bool zero_tags = gfp_flags & __GFP_ZEROTAGS;
 	bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
@@ -1825,6 +1828,14 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 
 	set_page_private(page, 0);
 
+	/*
+	 * If the page is zeroed, skip memory initialization.
+	 * We still need to handle tag zeroing separately since the host
+	 * does not know about memory tags.
+	 */
+	if (zeroed && init && !zero_tags)
+		init = false;
+
 	arch_alloc_page(page, order);
 	debug_pagealloc_map_pages(page, 1 << order);
 
@@ -1867,7 +1878,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	 * through a user-congruent mapping.  Host-zeroed pages
 	 * (zeroed flag) don't need this: physical RAM is clean.
 	 */
-	if (!init && (gfp_flags & __GFP_ZERO) &&
+	if (!zeroed && !init && (gfp_flags & __GFP_ZERO) &&
 	    user_addr != USER_ADDR_NONE &&
 	    user_alloc_needs_zeroing())
 		init = true;
@@ -1900,13 +1911,13 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 }
 
 static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
-							unsigned int alloc_flags,
-							unsigned long user_addr)
+			  unsigned int alloc_flags, bool zeroed,
+			  unsigned long user_addr)
 {
 	if (order && (gfp_flags & __GFP_COMP))
 		prep_compound_page(page, order);
 
-	post_alloc_hook(page, order, gfp_flags, user_addr);
+	post_alloc_hook(page, order, gfp_flags, zeroed, user_addr);
 
 	/*
 	 * page is set pfmemalloc when ALLOC_NO_WATERMARKS was necessary to
@@ -3174,6 +3185,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
 	}
 
 	del_page_from_free_list(page, zone, order, mt);
+	__ClearPageZeroed(page);
 
 	/*
 	 * Set the pageblock if the isolated page is at least half of a
@@ -3246,7 +3258,7 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 static __always_inline
 struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 			   unsigned int order, unsigned int alloc_flags,
-			   int migratetype)
+			   int migratetype, bool *zeroed)
 {
 	struct page *page;
 	unsigned long flags;
@@ -3281,6 +3293,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 			}
 		}
 		spin_unlock_irqrestore(&zone->lock, flags);
+		*zeroed = PageZeroed(page);
+		__ClearPageZeroed(page);
 	} while (check_new_pages(page, order));
 
 	/*
@@ -3349,10 +3363,9 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
 /* Remove page from the per-cpu list, caller must protect the list */
 static inline
 struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
-			int migratetype,
-			unsigned int alloc_flags,
+			int migratetype, unsigned int alloc_flags,
 			struct per_cpu_pages *pcp,
-			struct list_head *list)
+			struct list_head *list, bool *zeroed)
 {
 	struct page *page;
 
@@ -3387,6 +3400,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 		page = list_first_entry(list, struct page, pcp_list);
 		list_del(&page->pcp_list);
 		pcp->count -= 1 << order;
+		*zeroed = PageZeroed(page);
+		__ClearPageZeroed(page);
 	} while (check_new_pages(page, order));
 
 	return page;
@@ -3395,7 +3410,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 /* Lock and remove page from the per-cpu list */
 static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			int migratetype, unsigned int alloc_flags)
+			int migratetype, unsigned int alloc_flags,
+			bool *zeroed)
 {
 	struct per_cpu_pages *pcp;
 	struct list_head *list;
@@ -3413,7 +3429,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	 */
 	pcp->free_count >>= 1;
 	list = &pcp->lists[order_to_pindex(migratetype, order)];
-	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
+	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags,
+				 pcp, list, zeroed);
 	pcp_spin_unlock(pcp);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3438,19 +3455,19 @@ static inline
 struct page *rmqueue(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
 			gfp_t gfp_flags, unsigned int alloc_flags,
-			int migratetype)
+			int migratetype, bool *zeroed)
 {
 	struct page *page;
 
 	if (likely(pcp_allowed_order(order))) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
-				       migratetype, alloc_flags);
+				       migratetype, alloc_flags, zeroed);
 		if (likely(page))
 			goto out;
 	}
 
 	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
-							migratetype);
+			     migratetype, zeroed);
 
 out:
 	/* Separate test+clear to avoid unnecessary atomics */
@@ -3841,6 +3858,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct pglist_data *last_pgdat = NULL;
 	bool last_pgdat_dirty_ok = false;
 	bool no_fallback;
+	bool zeroed;
 	bool skip_kswapd_nodes = nr_online_nodes > 1;
 	bool skipped_kswapd_nodes = false;
 
@@ -3985,10 +4003,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 
 try_this_zone:
 		page = rmqueue(zonelist_zone(ac->preferred_zoneref), zone, order,
-				gfp_mask, alloc_flags, ac->migratetype);
+					gfp_mask, alloc_flags, ac->migratetype,
+					&zeroed);
 		if (page) {
 			prep_new_page(page, order, gfp_mask, alloc_flags,
-				      ac->user_addr);
+				      zeroed, ac->user_addr);
 
 			return page;
 		} else {
@@ -4215,9 +4234,11 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	count_vm_event(COMPACTSTALL);
 
 	/* Prep a captured page if available */
-	if (page)
-		prep_new_page(page, order, gfp_mask, alloc_flags,
+	if (page) {
+		__ClearPageZeroed(page);
+		prep_new_page(page, order, gfp_mask, alloc_flags, false,
 			      ac->user_addr);
+	}
 
 	/* Try get a page from the freelist if available */
 	if (!page)
@@ -5190,6 +5211,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 	/* Attempt the batch allocation */
 	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
 	while (nr_populated < nr_pages) {
+		bool zeroed = false;
 
 		/* Skip existing pages */
 		if (page_array[nr_populated]) {
@@ -5198,7 +5220,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 		}
 
 		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
-								pcp, pcp_list);
+					 pcp, pcp_list, &zeroed);
 		if (unlikely(!page)) {
 			/* Try and allocate at least one page */
 			if (!nr_account) {
@@ -5209,7 +5231,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 		}
 		nr_account++;
 
-		prep_new_page(page, 0, gfp, 0, USER_ADDR_NONE);
+		prep_new_page(page, 0, gfp, 0, zeroed, USER_ADDR_NONE);
 		set_page_refcounted(page);
 		page_array[nr_populated++] = page;
 	}
@@ -6949,7 +6971,8 @@ static void split_free_frozen_pages(struct list_head *list, gfp_t gfp_mask)
 		list_for_each_entry_safe(page, next, &list[order], lru) {
 			int i;
 
-			post_alloc_hook(page, order, gfp_mask, USER_ADDR_NONE);
+			__ClearPageZeroed(page);
+			post_alloc_hook(page, order, gfp_mask, false, USER_ADDR_NONE);
 			if (!order)
 				continue;
 
@@ -7157,8 +7180,9 @@ static int __alloc_contig_frozen_range(unsigned long start, unsigned long end,
 	} else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) {
 		struct page *head = pfn_to_page(start);
 
+		__ClearPageZeroed(head);
 		check_new_pages(head, order);
-		prep_new_page(head, order, gfp_mask, 0, user_addr);
+		prep_new_page(head, order, gfp_mask, 0, false, user_addr);
 	} else {
 		ret = -EINVAL;
 		WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 5b6b17f67131..84ebc4547119 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -50,6 +50,8 @@ EXPORT_SYMBOL_GPL(page_reporting_order);
 #define PAGE_REPORTING_DELAY	(2 * HZ)
 static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
 
+DEFINE_STATIC_KEY_FALSE(page_reporting_host_zeroes);
+
 enum {
 	PAGE_REPORTING_IDLE = 0,
 	PAGE_REPORTING_REQUESTED,
@@ -129,8 +131,11 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
 		 * report on the new larger page when we make our way
 		 * up to that higher order.
 		 */
-		if (PageBuddy(page) && buddy_order(page) == order)
+		if (PageBuddy(page) && buddy_order(page) == order) {
 			__SetPageReported(page);
+			if (page_reporting_host_zeroes_pages())
+				__SetPageZeroed(page);
+		}
 	} while ((sg = sg_next(sg)));
 
 	/* reinitialize scatterlist now that it is empty */
@@ -390,6 +395,10 @@ int page_reporting_register(struct page_reporting_dev_info *prdev)
 	/* Assign device to allow notifications */
 	rcu_assign_pointer(pr_dev_info, prdev);
 
+	/* enable zeroed page optimization if host zeroes reported pages */
+	if (prdev->host_zeroes_pages)
+		static_branch_enable(&page_reporting_host_zeroes);
+
 	/* enable page reporting notification */
 	if (!static_key_enabled(&page_reporting_enabled)) {
 		static_branch_enable(&page_reporting_enabled);
@@ -414,6 +423,9 @@ void page_reporting_unregister(struct page_reporting_dev_info *prdev)
 
 		/* Flush any existing work, and lock it out */
 		cancel_delayed_work_sync(&prdev->work);
+
+		if (prdev->host_zeroes_pages)
+			static_branch_disable(&page_reporting_host_zeroes);
 	}
 
 	mutex_unlock(&page_reporting_mutex);
diff --git a/mm/page_reporting.h b/mm/page_reporting.h
index c51dbc228b94..736ea7b37e9e 100644
--- a/mm/page_reporting.h
+++ b/mm/page_reporting.h
@@ -15,6 +15,13 @@ DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
 extern unsigned int page_reporting_order;
 void __page_reporting_notify(void);
 
+DECLARE_STATIC_KEY_FALSE(page_reporting_host_zeroes);
+
+static inline bool page_reporting_host_zeroes_pages(void)
+{
+	return static_branch_unlikely(&page_reporting_host_zeroes);
+}
+
 static inline bool page_reported(struct page *page)
 {
 	return static_branch_unlikely(&page_reporting_enabled) &&
@@ -46,6 +53,11 @@ static inline void page_reporting_notify_free(unsigned int order)
 #else /* CONFIG_PAGE_REPORTING */
 #define page_reported(_page)	false
 
+static inline bool page_reporting_host_zeroes_pages(void)
+{
+	return false;
+}
+
 static inline void page_reporting_notify_free(unsigned int order)
 {
 }
-- 
MST




* [PATCH v10 18/37] mm: page_alloc: use aliasing checks instead of user_alloc_needs_zeroing
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (16 preceding siblings ...)
  2026-06-08  8:37 ` [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
@ 2026-06-08  8:38 ` Michael S. Tsirkin
  2026-06-08 11:39   ` Lorenzo Stoakes
  2026-06-08  8:38 ` [PATCH v10 19/37] mm: page_alloc: clear PG_zeroed on buddy merge if not both zero Michael S. Tsirkin
                   ` (22 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Replace user_alloc_needs_zeroing() with the direct aliasing checks
(cpu_dcache_is_aliasing() || cpu_icache_is_aliasing()) in the
post_alloc_hook aliasing guard.

user_alloc_needs_zeroing() includes a !init_on_alloc term that
means "allocator didn't zero this page."  But in this guard's
context (!zeroed && !init && __GFP_ZERO), we already know the page
is zero; init incorporates init_on_alloc via want_init_on_alloc().
The only question left is whether the cache architecture needs
the data re-zeroed through a congruent mapping, which is purely
cpu_dcache_is_aliasing() || cpu_icache_is_aliasing().

On non-aliasing architectures with init_on_free=true and
init_on_alloc=false, this avoids a redundant re-zero of an
already-zero page.
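
For illustration only, the two predicates can be compared in a small
user-space model. The struct and function names are hypothetical stand-ins,
not kernel API: old_guard() mirrors user_alloc_needs_zeroing() as described
above (aliasing checks plus the !init_on_alloc term), and new_guard() keeps
only the aliasing checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the relevant system properties (not kernel API). */
struct cache_model {
	bool dcache_aliasing;	/* stands in for cpu_dcache_is_aliasing() */
	bool icache_aliasing;	/* stands in for cpu_icache_is_aliasing() */
	bool init_on_alloc;	/* stands in for the init_on_alloc setting */
};

/* Old predicate: aliasing checks plus the !init_on_alloc term. */
static bool old_guard(const struct cache_model *m)
{
	assert(m);
	return m->dcache_aliasing || m->icache_aliasing || !m->init_on_alloc;
}

/* New predicate: aliasing checks only. */
static bool new_guard(const struct cache_model *m)
{
	assert(m);
	return m->dcache_aliasing || m->icache_aliasing;
}
```

In this model the two predicates disagree only when neither cache is
aliasing and init_on_alloc is off, which is the case this patch targets.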

Note on PowerPC: PowerPC overrides clear_user_page to call
flush_dcache_page after clear_page, but on freshly allocated
pages PG_dcache_clean is already clear (cleared by
__free_pages_prepare), so flush_dcache_page is a no-op.
Skipping this here thus has no effect.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 45e824b1ec75..edfc83571985 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1880,7 +1880,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	 */
 	if (!zeroed && !init && (gfp_flags & __GFP_ZERO) &&
 	    user_addr != USER_ADDR_NONE &&
-	    user_alloc_needs_zeroing())
+	    (cpu_dcache_is_aliasing() || cpu_icache_is_aliasing()))
 		init = true;
 	/*
 	 * If memory is still not initialized, initialize it now.
-- 
MST




* [PATCH v10 19/37] mm: page_alloc: clear PG_zeroed on buddy merge if not both zero
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (17 preceding siblings ...)
  2026-06-08  8:38 ` [PATCH v10 18/37] mm: page_alloc: use aliasing checks instead of user_alloc_needs_zeroing Michael S. Tsirkin
@ 2026-06-08  8:38 ` Michael S. Tsirkin
  2026-06-08 11:47   ` Lorenzo Stoakes
  2026-06-08  8:38 ` [PATCH v10 20/37] mm: page_alloc: preserve PG_zeroed in page_del_and_expand Michael S. Tsirkin
                   ` (21 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

When two buddy pages merge in __free_one_page(), preserve
PG_zeroed on the merged page only if both buddies have the
flag set.  Otherwise clear it.

Without this, the merged page could inherit PG_zeroed from just one
buddy, and a later __GFP_ZERO allocation would skip zeroing stale
data in the non-zero half.
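
A minimal user-space sketch of the merge rule, assuming a simplified block
descriptor (struct block and merge_buddies() are hypothetical names, not
the kernel's struct page or __free_one_page()):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a free block; not the kernel's struct page. */
struct block {
	unsigned int order;
	bool zeroed;	/* models PG_zeroed */
};

/*
 * Merge two free buddies of equal order into one block of order + 1.
 * The result is marked zeroed only if both halves were.
 */
static struct block merge_buddies(struct block a, struct block b)
{
	struct block merged;

	assert(a.order == b.order);
	merged.order = a.order + 1;
	merged.zeroed = a.zeroed && b.zeroed;
	return merged;
}
```

Merging a zeroed block with a non-zeroed one yields a non-zeroed block, so
the stale half is still cleared on the next __GFP_ZERO allocation.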

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 include/linux/page-flags.h |  1 +
 mm/page_alloc.c            | 15 ++++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 91f8ddb1d512..9365d59ac1d6 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -680,6 +680,7 @@ FOLIO_FLAG_FALSE(idle)
  * uses this to skip redundant zeroing in post_alloc_hook().
  */
 __PAGEFLAG(Zeroed, zeroed, PF_NO_COMPOUND)
+CLEARPAGEFLAG(Zeroed, zeroed, PF_NO_COMPOUND)
 #define __PG_ZEROED (1UL << PG_zeroed)
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index edfc83571985..a90bca5317c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -941,10 +941,14 @@ static inline void __free_one_page(struct page *page,
 	unsigned long buddy_pfn = 0;
 	unsigned long combined_pfn;
 	struct page *buddy;
+	bool buddy_zeroed;
+	bool page_zeroed;
 	bool to_tail;
 
 	VM_BUG_ON(!zone_is_initialized(zone));
-	VM_BUG_ON_PAGE(page->flags.f & PAGE_FLAGS_CHECK_AT_PREP, page);
+	/* PG_zeroed (aliased to PG_private) is valid on free-list pages */
+	VM_BUG_ON_PAGE(page->flags.f &
+		       (PAGE_FLAGS_CHECK_AT_PREP & ~__PG_ZEROED), page);
 
 	VM_BUG_ON(migratetype == -1);
 	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
@@ -979,6 +983,8 @@ static inline void __free_one_page(struct page *page,
 				goto done_merging;
 		}
 
+		buddy_zeroed = PageZeroed(buddy);
+
 		/*
 		 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
 		 * merge with it and move up one order.
@@ -997,10 +1003,17 @@ static inline void __free_one_page(struct page *page,
 			change_pageblock_range(buddy, order, migratetype);
 		}
 
+		page_zeroed = PageZeroed(page);
+		__ClearPageZeroed(page);
+		__ClearPageZeroed(buddy);
+
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
 		order++;
+
+		if (page_zeroed && buddy_zeroed)
+			__SetPageZeroed(page);
 	}
 
 done_merging:
-- 
MST




* [PATCH v10 20/37] mm: page_alloc: preserve PG_zeroed in page_del_and_expand
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (18 preceding siblings ...)
  2026-06-08  8:38 ` [PATCH v10 19/37] mm: page_alloc: clear PG_zeroed on buddy merge if not both zero Michael S. Tsirkin
@ 2026-06-08  8:38 ` Michael S. Tsirkin
  2026-06-08  8:38 ` [PATCH v10 21/37] mm: page_alloc: propagate PG_zeroed in split_large_buddy Michael S. Tsirkin
                   ` (20 subsequent siblings)
  40 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Propagate PG_zeroed through buddy splits in page_del_and_expand()
and try_to_claim_block().  When a zeroed high-order page is split
to satisfy a smaller allocation, the sub-pages placed back on the
free lists keep PG_zeroed.
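
As a rough user-space model of the split (expand_model() and
MODEL_MAX_ORDER are hypothetical; unlike the kernel's expand(), this sketch
counts halves rather than pages and keeps no free lists):

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_ORDER 10	/* hypothetical limit for this sketch */

/*
 * Split a free block of order `high` down to order `low`.  At each step
 * the upper half goes back on the free list for that order; record
 * whether it would carry the zeroed mark.  Returns the number of halves
 * handed back.
 */
static unsigned int expand_model(unsigned int low, unsigned int high,
				 bool zeroed,
				 bool freed_zeroed[MODEL_MAX_ORDER + 1])
{
	unsigned int nr = 0;

	assert(low <= high && high <= MODEL_MAX_ORDER);
	while (high > low) {
		high--;
		/* the half of order `high` inherits the parent's mark */
		freed_zeroed[high] = zeroed;
		nr++;
	}
	return nr;
}
```

Splitting a zeroed order-3 block to satisfy an order-0 request hands back
one half each of order 2, 1 and 0, all of which keep the mark.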

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 mm/page_alloc.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a90bca5317c1..7a6dedd716e2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1712,7 +1712,8 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
  * -- nyc
  */
 static inline unsigned int expand(struct zone *zone, struct page *page, int low,
-				  int high, int migratetype, bool reported)
+				  int high, int migratetype, bool reported,
+				  bool zeroed)
 {
 	unsigned int size = 1 << high;
 	unsigned int nr_added = 0;
@@ -1743,6 +1744,8 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
 		 */
 		if (reported)
 			__SetPageReported(&page[size]);
+		if (zeroed)
+			__SetPageZeroed(&page[size]);
 	}
 
 	return nr_added;
@@ -1754,10 +1757,12 @@ static __always_inline void page_del_and_expand(struct zone *zone,
 {
 	int nr_pages = 1 << high;
 	bool was_reported = page_reported(page);
+	bool was_zeroed = PageZeroed(page);
 
 	__del_page_from_free_list(page, zone, high, migratetype);
 
-	nr_pages -= expand(zone, page, low, high, migratetype, was_reported);
+	nr_pages -= expand(zone, page, low, high, migratetype, was_reported,
+			   was_zeroed);
 	account_freepages(zone, -nr_pages, migratetype);
 }
 
@@ -2355,11 +2360,12 @@ try_to_claim_block(struct zone *zone, struct page *page,
 	if (current_order >= pageblock_order) {
 		unsigned int nr_added;
 		bool was_reported = page_reported(page);
+		bool was_zeroed = PageZeroed(page);
 
 		del_page_from_free_list(page, zone, current_order, block_type);
 		change_pageblock_range(page, current_order, start_type);
 		nr_added = expand(zone, page, order, current_order, start_type,
-				  was_reported);
+				  was_reported, was_zeroed);
 		account_freepages(zone, nr_added, start_type);
 		return page;
 	}
-- 
MST




* [PATCH v10 21/37] mm: page_alloc: propagate PG_zeroed in split_large_buddy
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (19 preceding siblings ...)
  2026-06-08  8:38 ` [PATCH v10 20/37] mm: page_alloc: preserve PG_zeroed in page_del_and_expand Michael S. Tsirkin
@ 2026-06-08  8:38 ` Michael S. Tsirkin
  2026-06-08  8:38 ` [PATCH v10 22/37] mm: add free_frozen_pages_zeroed Michael S. Tsirkin
                   ` (19 subsequent siblings)
  40 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

When splitting a large buddy page, propagate the PG_zeroed flag
to each sub-page before freeing it.  __free_pages_prepare clears
all flags (including PG_zeroed), so the flag must be re-set on
each fragment after the split.  This ensures that the buddy merge
logic can see PG_zeroed on pages that were part of a larger
zeroed block.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 mm/page_alloc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7a6dedd716e2..21f9e92922f1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1520,6 +1520,7 @@ static void split_large_buddy(struct zone *zone, struct page *page,
 			      bool reported)
 {
 	unsigned long end = pfn + (1 << order);
+	bool zeroed = PageZeroed(page);
 
 	VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
 	/* Caller removed page from freelist, buddy info cleared! */
@@ -1531,6 +1532,8 @@ static void split_large_buddy(struct zone *zone, struct page *page,
 	do {
 		int mt = get_pfnblock_migratetype(page, pfn);
 
+		if (zeroed)
+			__SetPageZeroed(page);
 		__free_one_page(page, pfn, zone, order, mt, fpi);
 		if (reported && PageBuddy(page) && buddy_order(page) == order)
 			__SetPageReported(page);
-- 
MST




* [PATCH v10 22/37] mm: add free_frozen_pages_zeroed
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (20 preceding siblings ...)
  2026-06-08  8:38 ` [PATCH v10 21/37] mm: page_alloc: propagate PG_zeroed in split_large_buddy Michael S. Tsirkin
@ 2026-06-08  8:38 ` Michael S. Tsirkin
  2026-06-08 12:06   ` Lorenzo Stoakes
  2026-06-08  8:38 ` [PATCH v10 23/37] mm: page_alloc: skip kernel_init_pages for FPI_ZEROED when safe Michael S. Tsirkin
                   ` (18 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Add free_frozen_pages_zeroed(page, order) to free a frozen page
while marking it as zeroed, so the next allocation can skip
redundant zeroing.

An FPI_ZEROED internal flag carries the hint through the free path.
PageZeroed is set after __free_pages_prepare() clears all flags,
so the hint survives on the free list.
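
The ordering can be sketched in user space as follows. All names are
hypothetical stand-ins (MODEL_FPI_ZEROED mirrors the BIT(3) value used in
this patch; free_model() is not a kernel function), and the poisoning
parameter models the page_poisoning_enabled_static() check in the diff:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_FPI_NONE   0u
#define MODEL_FPI_ZEROED (1u << 3)	/* mirrors BIT(3) in the patch */

struct page_model {
	bool zeroed;	/* models PG_zeroed */
};

/* Models the free path: prepare clears flags, then the hint is re-applied. */
static void free_model(struct page_model *p, unsigned int fpi_flags,
		       bool poisoning_enabled)
{
	assert(p);
	p->zeroed = false;	/* the prepare step clears all flags */
	if ((fpi_flags & MODEL_FPI_ZEROED) && !poisoning_enabled)
		p->zeroed = true;	/* contents are still zero: keep the hint */
}
```

Setting the mark before the prepare step would lose it, which is why the
hint travels as an FPI flag and is applied afterwards.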

__SetPageZeroed is non-atomic but safe here: the page is frozen
(refcount 0) and not yet on any free list.

Note: when want_init_on_free() zeroes the page via
kernel_init_pages(), the page is zero but the direct-map
cache lines may be dirty. A later patch (skip
kernel_init_pages for FPI_ZEROED) avoids the redundant
re-zero, and post_alloc_hook handles the dcache flush
for user pages on aliasing architectures.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
 include/linux/gfp.h |  1 +
 mm/internal.h       |  1 +
 mm/page_alloc.c     | 23 ++++++++++++++++++++++-
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 73109d4e31a4..d24b61e45861 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -384,6 +384,7 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages_nolock(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
+void free_frozen_pages_zeroed(struct page *page, unsigned int order);
 
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
diff --git a/mm/internal.h b/mm/internal.h
index 4af5e72742ba..fd910743ddc3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -938,6 +938,7 @@ struct page *__alloc_frozen_pages_noprof(gfp_t, unsigned int order, int nid,
 #define __alloc_frozen_pages(...) \
 	alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
 void free_frozen_pages(struct page *page, unsigned int order);
+void free_frozen_pages_zeroed(struct page *page, unsigned int order);
 void free_unref_folios(struct folio_batch *fbatch);
 
 #ifdef CONFIG_NUMA
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 21f9e92922f1..008f1a311c40 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -91,6 +91,13 @@ typedef int __bitwise fpi_t;
 /* Free the page without taking locks. Rely on trylock only. */
 #define FPI_TRYLOCK		((__force fpi_t)BIT(2))
 
+/*
+ * The page contents are known to be zero (e.g., the host zeroed them
+ * during balloon deflate).  Set PageZeroed after free so the next
+ * allocation can skip redundant zeroing.
+ */
+#define FPI_ZEROED		((__force fpi_t)BIT(3))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1596,8 +1603,12 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	unsigned long pfn = page_to_pfn(page);
 	struct zone *zone = page_zone(page);
 
-	if (__free_pages_prepare(page, order, fpi_flags))
+	if (__free_pages_prepare(page, order, fpi_flags)) {
+		/* Don't mark zeroed if poison overwrote with 0xAA. */
+		if ((fpi_flags & FPI_ZEROED) && !page_poisoning_enabled_static())
+			__SetPageZeroed(page);
 		free_one_page(zone, page, pfn, order, fpi_flags);
+	}
 }
 
 void __meminit __free_pages_core(struct page *page, unsigned int order,
@@ -3020,6 +3031,10 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
 	if (!__free_pages_prepare(page, order, fpi_flags))
 		return;
 
+	/* Don't mark zeroed if poison overwrote with 0xAA. */
+	if ((fpi_flags & FPI_ZEROED) && !page_poisoning_enabled_static())
+		__SetPageZeroed(page);
+
 	/*
 	 * We only track unmovable, reclaimable and movable on pcp lists.
 	 * Place ISOLATE pages on the isolated list because they are being
@@ -3058,6 +3073,12 @@ void free_frozen_pages(struct page *page, unsigned int order)
 	__free_frozen_pages(page, order, FPI_NONE);
 }
 
+void free_frozen_pages_zeroed(struct page *page, unsigned int order)
+{
+	__free_frozen_pages(page, order, FPI_ZEROED);
+}
+EXPORT_SYMBOL(free_frozen_pages_zeroed);
+
 void free_frozen_pages_nolock(struct page *page, unsigned int order)
 {
 	__free_frozen_pages(page, order, FPI_TRYLOCK);
-- 
MST




* [PATCH v10 23/37] mm: page_alloc: skip kernel_init_pages for FPI_ZEROED when safe
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (21 preceding siblings ...)
  2026-06-08  8:38 ` [PATCH v10 22/37] mm: add free_frozen_pages_zeroed Michael S. Tsirkin
@ 2026-06-08  8:38 ` Michael S. Tsirkin
  2026-06-08 12:18   ` Lorenzo Stoakes
  2026-06-08  8:38 ` [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed Michael S. Tsirkin
                   ` (17 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

In __free_pages_prepare(), when FPI_ZEROED is set the page is already
known to be zero. We can skip kernel_init_pages() as long as page
poisoning is not enabled; with poisoning, the poison pattern overwrites
the zeroes and the page must be re-zeroed.

This avoids redundant zeroing work when freeing pages that are already
known to contain all zeros.
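
The resulting condition is easy to misread, so here it is as a standalone
predicate. must_zero_on_free() is a hypothetical helper written for this
note; its three booleans stand in for init, (fpi_flags & FPI_ZEROED) and
page_poisoning_enabled_static() in the diff below:

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when the free path still has to zero the page itself. */
static bool must_zero_on_free(bool init, bool fpi_zeroed, bool poisoning)
{
	bool result = init && !(fpi_zeroed && !poisoning);

	/* De Morgan: same as init && (!fpi_zeroed || poisoning). */
	assert(result == (init && (!fpi_zeroed || poisoning)));
	return result;
}
```

Zeroing is skipped only when the page is known-zero and nothing has
overwritten it since; every other combination behaves as before.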

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
 mm/page_alloc.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 008f1a311c40..e3a7c40c769c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1443,7 +1443,14 @@ __always_inline bool __free_pages_prepare(struct page *page,
 		if (kasan_has_integrated_init())
 			init = false;
 	}
-	if (init)
+	/*
+	 * Skip redundant zeroing when the page is already known-zero
+	 * (FPI_ZEROED) and page poisoning did not overwrite it.
+	 * When page_poisoning is enabled, kernel_poison_pages above
+	 * wrote PAGE_POISON (0xAA), so we must re-zero.
+	 */
+	if (init && !((fpi_flags & FPI_ZEROED) &&
+		      !page_poisoning_enabled_static()))
 		kernel_init_pages(page, 1 << order);
 
 	/*
-- 
MST




* [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (22 preceding siblings ...)
  2026-06-08  8:38 ` [PATCH v10 23/37] mm: page_alloc: skip kernel_init_pages for FPI_ZEROED when safe Michael S. Tsirkin
@ 2026-06-08  8:38 ` Michael S. Tsirkin
  2026-06-08 12:25   ` Lorenzo Stoakes
  2026-06-08  8:39 ` [PATCH v10 25/37] mm: use __GFP_ZERO in alloc_anon_folio Michael S. Tsirkin
                   ` (16 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Add put_page_zeroed() / folio_put_zeroed() for callers that hold
a reference to a page known to be zeroed.

If this drops the last reference, the zeroed hint is
propagated to the buddy allocator.  If someone else still holds a
reference, the hint is simply lost - this is best-effort.

This is useful for balloon drivers during deflation: the host
has already zeroed the pages, and the balloon is typically the
sole owner.  But if the page happens to be shared, silently
dropping the hint is safe and avoids the need for callers to
check the refcount.

Note: put_page_zeroed() uses folio_put_testzero(), which only
detects sole ownership at the instant of the atomic decrement.
A concurrent reference holder (e.g. migration) means the hint
is silently lost. This is by design: the zeroed hint is a
performance optimization, not a correctness requirement.
Losing it just means the next allocation re-zeroes the page.
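
The best-effort behaviour can be modelled with a plain reference counter.
This is only a sketch: struct folio_model and both helpers are
hypothetical, and the counter is non-atomic, unlike the real refcount.

```c
#include <assert.h>
#include <stdbool.h>

struct folio_model {
	int refcount;
	bool freed;
	bool zeroed_hint;	/* whether the hint reached the allocator */
};

/* Drop one reference; only the final put frees and passes the hint on. */
static void put_zeroed_model(struct folio_model *f)
{
	assert(f && f->refcount > 0);
	if (--f->refcount == 0) {
		f->freed = true;
		f->zeroed_hint = true;
	}
}

/* Ordinary put: frees on the last reference but carries no hint. */
static void put_model(struct folio_model *f)
{
	assert(f && f->refcount > 0);
	if (--f->refcount == 0)
		f->freed = true;
}
```

With two holders, the zeroed put is not the last one, so the folio is
later freed without the hint; with a single holder the hint goes through.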

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
 include/linux/mm.h | 13 +++++++++++++
 mm/swap.c          | 20 ++++++++++++++++++--
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 06bbe9eba636..79b3a8cb9a3b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1913,6 +1913,7 @@ static inline struct folio *virt_to_folio(const void *x)
 }
 
 void __folio_put(struct folio *folio);
+void __folio_put_zeroed(struct folio *folio);
 
 void split_page(struct page *page, unsigned int order);
 void folio_copy(struct folio *dst, struct folio *src);
@@ -2090,6 +2091,18 @@ static inline void folio_put(struct folio *folio)
 		__folio_put(folio);
 }
 
+/* Caller must be sole owner to guarantee page is still zero */
+static inline void folio_put_zeroed(struct folio *folio)
+{
+	if (folio_put_testzero(folio))
+		__folio_put_zeroed(folio);
+}
+
+static inline void put_page_zeroed(struct page *page)
+{
+	folio_put_zeroed(page_folio(page));
+}
+
 /**
  * folio_put_refs - Reduce the reference count on a folio.
  * @folio: The folio.
diff --git a/mm/swap.c b/mm/swap.c
index 5cc44f0de987..ecec780172ad 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -94,13 +94,15 @@ static void page_cache_release(struct folio *folio)
 		lruvec_unlock_irqrestore(lruvec, flags);
 }
 
-void __folio_put(struct folio *folio)
+static void ___folio_put(struct folio *folio, bool zeroed)
 {
+	/* zeroed hint ignored for now, no current user */
 	if (unlikely(folio_is_zone_device(folio))) {
 		free_zone_device_folio(folio);
 		return;
 	}
 
+	/* zeroed hint ignored for now, no current user */
 	if (folio_test_hugetlb(folio)) {
 		free_huge_folio(folio);
 		return;
@@ -109,10 +111,24 @@ void __folio_put(struct folio *folio)
 	page_cache_release(folio);
 	folio_unqueue_deferred_split(folio);
 	mem_cgroup_uncharge(folio);
-	free_frozen_pages(&folio->page, folio_order(folio));
+	if (zeroed)
+		free_frozen_pages_zeroed(&folio->page, folio_order(folio));
+	else
+		free_frozen_pages(&folio->page, folio_order(folio));
+}
+
+void __folio_put(struct folio *folio)
+{
+	___folio_put(folio, false);
 }
 EXPORT_SYMBOL(__folio_put);
 
+void __folio_put_zeroed(struct folio *folio)
+{
+	___folio_put(folio, true);
+}
+EXPORT_SYMBOL(__folio_put_zeroed);
+
 typedef void (*move_fn_t)(struct lruvec *lruvec, struct folio *folio);
 
 static void lru_add(struct lruvec *lruvec, struct folio *folio)
-- 
MST




* [PATCH v10 25/37] mm: use __GFP_ZERO in alloc_anon_folio
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (23 preceding siblings ...)
  2026-06-08  8:38 ` [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed Michael S. Tsirkin
@ 2026-06-08  8:39 ` Michael S. Tsirkin
  2026-06-08 12:29   ` Lorenzo Stoakes
  2026-06-08  8:39 ` [PATCH v10 26/37] mm: vma_alloc_anon_folio_pmd: pass raw fault address to vma_alloc_folio Michael S. Tsirkin
                   ` (15 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Convert alloc_anon_folio() to pass __GFP_ZERO instead of zeroing
at the callsite. post_alloc_hook uses the fault address passed
through vma_alloc_folio for cache-friendly zeroing.

Note: before this series, replacing folio_zero_user() with
__GFP_ZERO was unsafe on cache-aliasing architectures because
__GFP_ZERO used clear_page() without a dcache flush. With this
series, it is safe as long as the caller passes a valid user address
(not USER_ADDR_NONE) to vma_alloc_folio() and friends, which deliver
it to post_alloc_hook() for the dcache flush via folio_zero_user().
It remains unsafe only if USER_ADDR_NONE is passed.

Note: with __GFP_ZERO, the folio is zeroed before
mem_cgroup_charge().  If the charge fails, the zeroing work is
wasted.  Previously zeroing was done after a successful charge.
This is inherent to moving zeroing into the allocator.
Charge failures are rare (only at cgroup limits).

Use folio_put_zeroed() on charge failure so the zeroed hint
propagates to the buddy allocator, avoiding redundant re-zeroing
on the next allocation attempt.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Assisted-by: Claude:claude-opus-4-6
---
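Reviewer note: the charge-failure reasoning in the commit message can be
illustrated with a small user-space model. This is only a sketch under
stated assumptions: struct model_page, model_alloc_zeroed(), model_put()
and model_put_zeroed() are hypothetical stand-ins for the buddy
allocator, folio_put() and folio_put_zeroed(); none of them are kernel
interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_PAGE_SIZE 64	/* tiny "page" so the sketch stays small */

/* Hypothetical model of a page plus the allocator's zeroed hint. */
struct model_page {
	unsigned char data[MODEL_PAGE_SIZE];
	bool free_zeroed;	/* hint: contents known to be zero while free */
	unsigned int zero_ops;	/* how many times the allocator cleared it */
};

/* Allocate with zero-on-alloc semantics; skip the clear if hinted. */
static void model_alloc_zeroed(struct model_page *p)
{
	if (!p->free_zeroed) {
		memset(p->data, 0, sizeof(p->data));
		p->zero_ops++;
	}
	p->free_zeroed = false;	/* the new owner may dirty the page */
}

/* Plain free: nothing is known about the contents. */
static void model_put(struct model_page *p)
{
	p->free_zeroed = false;
}

/* Free by a sole owner that never wrote to the page. */
static void model_put_zeroed(struct model_page *p)
{
	p->free_zeroed = true;
}

/* Allocate, fail the charge, free, retry; return the number of clears. */
static unsigned int model_charge_failure_cost(bool use_put_zeroed)
{
	struct model_page p;

	memset(p.data, 0xaa, sizeof(p.data));	/* stale contents */
	p.free_zeroed = false;
	p.zero_ops = 0;

	model_alloc_zeroed(&p);		/* first attempt: must zero */
	if (use_put_zeroed)		/* pretend the cgroup charge failed */
		model_put_zeroed(&p);
	else
		model_put(&p);
	model_alloc_zeroed(&p);		/* retry */

	assert(p.data[0] == 0);		/* the page is zero either way */
	return p.zero_ops;
}
```

In this model the retry costs a second clear after model_put() and no
extra clear after model_put_zeroed(), which is the saving the commit
message describes for the real folio_put_zeroed() path.
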
 mm/memory.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6c14b90f558e..6d6a3e1a02c1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5265,25 +5265,16 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 		goto fallback;
 
 	/* Try allocating the highest of the remaining orders. */
-	gfp = vma_thp_gfp_mask(vma);
+	gfp = vma_thp_gfp_mask(vma) | __GFP_ZERO;
 	while (orders) {
 		folio = vma_alloc_folio(gfp, order, vma, vmf->address);
 		if (folio) {
 			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
 				count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
-				folio_put(folio);
+				folio_put_zeroed(folio);
 				goto next;
 			}
 			folio_throttle_swaprate(folio, gfp);
-			/*
-			 * When a folio is not zeroed during allocation
-			 * (__GFP_ZERO not used) or user folios require special
-			 * handling, folio_zero_user() is used to make sure
-			 * that the page corresponding to the faulting address
-			 * will be hot in the cache after zeroing.
-			 */
-			if (user_alloc_needs_zeroing())
-				folio_zero_user(folio, vmf->address);
 			return folio;
 		}
 next:
-- 
MST




* [PATCH v10 26/37] mm: vma_alloc_anon_folio_pmd: pass raw fault address to vma_alloc_folio
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (24 preceding siblings ...)
  2026-06-08  8:39 ` [PATCH v10 25/37] mm: use __GFP_ZERO in alloc_anon_folio Michael S. Tsirkin
@ 2026-06-08  8:39 ` Michael S. Tsirkin
  2026-06-08 12:30   ` Lorenzo Stoakes
  2026-06-08  8:39 ` [PATCH v10 27/37] mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd Michael S. Tsirkin
                   ` (14 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Drop the redundant HPAGE_PMD_MASK alignment at the callsite.
NUMA interleave is not affected by passing the raw address: the ilx
calculation shifts addr >> PAGE_SHIFT >> order, which drops every
bit below the folio size regardless of alignment. post_alloc_hook
will use the raw address for cache-friendly zeroing.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Gregory Price <gourry@gourry.net>
---
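Reviewer note: the alignment argument can be checked with a few lines of
arithmetic. The sketch below is a simplified model of the shift quoted
in the commit message, not the kernel's mempolicy code. The constants
assume 4 KiB pages and a PMD order of 9 (as on x86-64); the MODEL_*
names and helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; other configurations differ. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_HPAGE_ORDER	9
#define MODEL_HPAGE_SIZE \
	(UINT64_C(1) << (MODEL_PAGE_SHIFT + MODEL_HPAGE_ORDER))
#define MODEL_HPAGE_MASK	(~(MODEL_HPAGE_SIZE - 1))

/* Simplified model of the shift described in the commit message. */
static uint64_t model_ilx(uint64_t addr, unsigned int order)
{
	return addr >> MODEL_PAGE_SHIFT >> order;
}

/* Returns 1 when the raw and the PMD-aligned address give the same index. */
static int model_ilx_same(uint64_t addr)
{
	return model_ilx(addr, MODEL_HPAGE_ORDER) ==
	       model_ilx(addr & MODEL_HPAGE_MASK, MODEL_HPAGE_ORDER);
}

/* Spot-check a few offsets inside one PMD-sized region. */
static void model_ilx_selfcheck(void)
{
	const uint64_t base = UINT64_C(0x7f0000200000);

	assert(model_ilx_same(base));
	assert(model_ilx_same(base + 0x1234));
	assert(model_ilx_same(base + MODEL_HPAGE_SIZE - 1));
}
```

Masking with MODEL_HPAGE_MASK clears exactly the low 21 bits, and the
combined shift discards the same 21 bits, so both inputs map to the same
index in this model.
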
 mm/huge_memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..d689e6491ddb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1337,7 +1337,7 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
 	const int order = HPAGE_PMD_ORDER;
 	struct folio *folio;
 
-	folio = vma_alloc_folio(gfp, order, vma, addr & HPAGE_PMD_MASK);
+	folio = vma_alloc_folio(gfp, order, vma, addr);
 
 	if (unlikely(!folio)) {
 		count_vm_event(THP_FAULT_FALLBACK);
-- 
MST




* [PATCH v10 27/37] mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (25 preceding siblings ...)
  2026-06-08  8:39 ` [PATCH v10 26/37] mm: vma_alloc_anon_folio_pmd: pass raw fault address to vma_alloc_folio Michael S. Tsirkin
@ 2026-06-08  8:39 ` Michael S. Tsirkin
  2026-06-08 12:32   ` Lorenzo Stoakes
  2026-06-08  8:39 ` [PATCH v10 28/37] mm: hugetlb: add gfp parameter and skip zeroing for zeroed pages Michael S. Tsirkin
                   ` (13 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Convert vma_alloc_anon_folio_pmd() to pass __GFP_ZERO instead of
zeroing at the callsite. post_alloc_hook uses the fault address
passed through vma_alloc_folio for cache-friendly zeroing.

Note: before this series, replacing folio_zero_user() with
__GFP_ZERO was unsafe on cache-aliasing architectures because
__GFP_ZERO uses clear_page() without a dcache flush. With this
series, it is safe if the caller passes a valid user address
(not USER_ADDR_NONE) to vma_alloc_folio() etc., which delivers
it to post_alloc_hook() for the dcache flush via
folio_zero_user(). It is only unsafe if USER_ADDR_NONE is passed.

Note: with __GFP_ZERO, the folio is zeroed before
mem_cgroup_charge().  If the charge fails, the zeroing work is
wasted.  Previously zeroing was done after a successful charge.
This is inherent to moving zeroing into the allocator.
Charge failures are rare (only at cgroup limits).

Use folio_put_zeroed() on charge failure so the zeroed hint
propagates to the buddy allocator, avoiding redundant re-zeroing
on the next allocation attempt.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Assisted-by: Claude:claude-opus-4-6
---
 mm/huge_memory.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d689e6491ddb..0dec3c717ff2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1333,7 +1333,7 @@ EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
 static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
 		unsigned long addr)
 {
-	gfp_t gfp = vma_thp_gfp_mask(vma);
+	gfp_t gfp = vma_thp_gfp_mask(vma) | __GFP_ZERO;
 	const int order = HPAGE_PMD_ORDER;
 	struct folio *folio;
 
@@ -1347,7 +1347,7 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
 
 	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
 	if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
-		folio_put(folio);
+		folio_put_zeroed(folio);
 		count_vm_event(THP_FAULT_FALLBACK);
 		count_vm_event(THP_FAULT_FALLBACK_CHARGE);
 		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
@@ -1356,17 +1356,9 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
 	}
 	folio_throttle_swaprate(folio, gfp);
 
-       /*
-	* When a folio is not zeroed during allocation (__GFP_ZERO not used)
-	* or user folios require special handling, folio_zero_user() is used to
-	* make sure that the page corresponding to the faulting address will be
-	* hot in the cache after zeroing.
-	*/
-	if (user_alloc_needs_zeroing())
-		folio_zero_user(folio, addr);
 	/*
 	 * The memory barrier inside __folio_mark_uptodate makes sure that
-	 * folio_zero_user writes become visible before the set_pmd_at()
+	 * page zeroing becomes visible before the set_pmd_at()
 	 * write.
 	 */
 	__folio_mark_uptodate(folio);
-- 
MST




* [PATCH v10 28/37] mm: hugetlb: add gfp parameter and skip zeroing for zeroed pages
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (26 preceding siblings ...)
  2026-06-08  8:39 ` [PATCH v10 27/37] mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd Michael S. Tsirkin
@ 2026-06-08  8:39 ` Michael S. Tsirkin
  2026-06-08 12:44   ` Lorenzo Stoakes
  2026-06-08  8:39 ` [PATCH v10 29/37] mm: memfd: skip zeroing for zeroed hugetlb pool pages Michael S. Tsirkin
                   ` (12 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Add a gfp_t parameter to alloc_hugetlb_folio(). When __GFP_ZERO
is set, the function guarantees the returned folio is zeroed:
- Fresh allocations (buddy or gigantic): zeroed by
  post_alloc_hook via __GFP_ZERO, HPG_zeroed set by
  alloc_surplus_hugetlb_folio.
- Pool pages with HPG_zeroed set: already zeroed, skip.
- Pool pages without HPG_zeroed: zeroed via folio_zero_user().

The address parameter is renamed to user_addr; the function
aligns it internally for reservation and NUMA policy lookups.
For pages that need zeroing, user_addr is passed to
folio_zero_user() for cache-friendly zeroing near the faulting
subpage.  All callers pass a page-aligned address; the
hugetlb_no_page caller passes vmf->real_address & PAGE_MASK
for consistency.

HPG_zeroed (stored in hugetlb folio->private bits) tracks
known-zero pool pages. It is set when alloc_surplus_hugetlb_folio
allocates with __GFP_ZERO, and cleared in free_huge_folio when
the page returns to the pool after userspace use.

Note: for gigantic CMA pages, __GFP_ZERO is passed through
to cma_alloc_frozen_compound() via its caller_gfp parameter,
so the pages ARE zeroed by the allocator. HPG_zeroed is only
set when __GFP_ZERO was in the original gfp_mask.
Pool pages allocated without __GFP_ZERO (e.g. by
alloc_pool_huge_folio) do not get HPG_zeroed; they are zeroed
later by folio_zero_user() at fault time.

Note: with __GFP_ZERO, the folio is zeroed before
mem_cgroup_charge_hugetlb().  If the charge fails, the zeroed
folio is freed back.  Before this patch the folio was zeroed only
after a successful charge, so freeing it without recording that it
is zero would be a regression.  Thread a zeroed hint through
free_huge_folio so that surplus pages freed back to buddy preserve
the zeroed state via free_frozen_pages_zeroed, avoiding redundant
re-zeroing on the next allocation.

Suggested-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
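Reviewer note: the three cases listed in the commit message reduce to
one test at the end of alloc_hugetlb_folio(). The user-space sketch
below models only that decision. MODEL_GFP_ZERO, struct model_hfolio and
model_finish_alloc() are hypothetical names, and the bit value is a
stand-in rather than the kernel's __GFP_ZERO.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_GFP_ZERO 0x1u	/* stand-in bit, not the kernel's value */

/* Hypothetical model of a hugetlb folio's zeroing state. */
struct model_hfolio {
	bool hpg_zeroed;	/* models the HPG_zeroed flag */
	unsigned int zero_ops;	/* counts explicit clears at this step */
};

/*
 * Zero only when the caller asked for it and the folio is not already
 * known-zero, then drop the hint because the folio is being handed out.
 * Returns true when an explicit clear was performed.
 */
static bool model_finish_alloc(struct model_hfolio *f, unsigned int gfp)
{
	bool cleared = false;

	if ((gfp & MODEL_GFP_ZERO) && !f->hpg_zeroed) {
		f->zero_ops++;
		cleared = true;
	}
	f->hpg_zeroed = false;
	return cleared;
}

/* Walk through the cases from the commit message. */
static void model_finish_alloc_selfcheck(void)
{
	struct model_hfolio known_zero = { .hpg_zeroed = true, .zero_ops = 0 };
	struct model_hfolio unknown = { .hpg_zeroed = false, .zero_ops = 0 };

	/* Fresh or pool folio with the hint set: no second clear. */
	assert(!model_finish_alloc(&known_zero, MODEL_GFP_ZERO));
	/* Pool folio without the hint: cleared here. */
	assert(model_finish_alloc(&unknown, MODEL_GFP_ZERO));
	/* Caller did not request zeroing (e.g. a copy target): untouched. */
	assert(!model_finish_alloc(&unknown, 0));
}
```

The hint is cleared unconditionally in this model, mirroring the patch,
so a folio handed to a caller never carries a stale known-zero claim.
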
 fs/hugetlbfs/inode.c    |  3 +-
 include/linux/hugetlb.h |  5 ++-
 mm/hugetlb.c            | 78 +++++++++++++++++++++++++++--------------
 3 files changed, 57 insertions(+), 29 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 78d61bf2bd9b..2c0c51fe9ec3 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -790,13 +790,12 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		 * folios in these areas, we need to consume the reserves
 		 * to keep reservation accounting consistent.
 		 */
-		folio = alloc_hugetlb_folio(&pseudo_vma, addr, false);
+		folio = alloc_hugetlb_folio(&pseudo_vma, addr, false, __GFP_ZERO);
 		if (IS_ERR(folio)) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			error = PTR_ERR(folio);
 			goto out;
 		}
-		folio_zero_user(folio, addr);
 		__folio_mark_uptodate(folio);
 		error = hugetlb_add_to_page_cache(folio, mapping, index);
 		if (unlikely(error)) {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1f7ae6609e51..06d033a57a61 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -593,6 +593,7 @@ enum hugetlb_page_flags {
 	HPG_vmemmap_optimized,
 	HPG_raw_hwp_unreliable,
 	HPG_cma,
+	HPG_zeroed,
 	__NR_HPAGEFLAGS,
 };
 
@@ -653,6 +654,7 @@ HPAGEFLAG(Freed, freed)
 HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
 HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)
 HPAGEFLAG(Cma, cma)
+HPAGEFLAG(Zeroed, zeroed)
 
 #ifdef CONFIG_HUGETLB_PAGE
 
@@ -700,7 +702,8 @@ int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list);
 int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
 void wait_for_freed_hugetlb_folios(void);
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
-				unsigned long addr, bool cow_from_owner);
+				unsigned long user_addr, bool cow_from_owner,
+				gfp_t gfp);
 struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask, gfp_t gfp_mask,
 				bool allow_alloc_fallback);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5d7e546565f5..ed00db703911 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1455,7 +1455,8 @@ void add_hugetlb_folio(struct hstate *h, struct folio *folio,
 }
 
 static void __update_and_free_hugetlb_folio(struct hstate *h,
-						struct folio *folio)
+						struct folio *folio,
+						bool zeroed)
 {
 	bool clear_flag = folio_test_hugetlb_vmemmap_optimized(folio);
 
@@ -1506,6 +1507,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
 	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
 	if (folio_test_hugetlb_cma(folio))
 		hugetlb_cma_free_frozen_folio(folio);
+	else if (zeroed)
+		free_frozen_pages_zeroed(&folio->page, folio_order(folio));
 	else
 		free_frozen_pages(&folio->page, folio_order(folio));
 }
@@ -1545,7 +1548,7 @@ static void free_hpage_workfn(struct work_struct *work)
 		 */
 		h = size_to_hstate(folio_size(folio));
 
-		__update_and_free_hugetlb_folio(h, folio);
+		__update_and_free_hugetlb_folio(h, folio, false);
 
 		cond_resched();
 	}
@@ -1559,10 +1562,10 @@ static inline void flush_free_hpage_work(struct hstate *h)
 }
 
 static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
-				 bool atomic)
+				 bool atomic, bool zeroed)
 {
 	if (!folio_test_hugetlb_vmemmap_optimized(folio) || !atomic) {
-		__update_and_free_hugetlb_folio(h, folio);
+		__update_and_free_hugetlb_folio(h, folio, zeroed);
 		return;
 	}
 
@@ -1596,7 +1599,7 @@ static void bulk_vmemmap_restore_error(struct hstate *h,
 			spin_lock_irq(&hugetlb_lock);
 			__folio_clear_hugetlb(folio);
 			spin_unlock_irq(&hugetlb_lock);
-			update_and_free_hugetlb_folio(h, folio, false);
+			update_and_free_hugetlb_folio(h, folio, false, false);
 			cond_resched();
 		}
 	} else {
@@ -1621,7 +1624,7 @@ static void bulk_vmemmap_restore_error(struct hstate *h,
 				spin_lock_irq(&hugetlb_lock);
 				__folio_clear_hugetlb(folio);
 				spin_unlock_irq(&hugetlb_lock);
-				update_and_free_hugetlb_folio(h, folio, false);
+				update_and_free_hugetlb_folio(h, folio, false, false);
 				cond_resched();
 				break;
 			}
@@ -1664,7 +1667,7 @@ static void update_and_free_pages_bulk(struct hstate *h,
 	}
 
 	list_for_each_entry_safe(folio, t_folio, &non_hvo_folios, lru) {
-		update_and_free_hugetlb_folio(h, folio, false);
+		update_and_free_hugetlb_folio(h, folio, false, false);
 		cond_resched();
 	}
 }
@@ -1680,7 +1683,7 @@ struct hstate *size_to_hstate(unsigned long size)
 	return NULL;
 }
 
-void free_huge_folio(struct folio *folio)
+static void __free_huge_folio(struct folio *folio, bool zeroed)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
@@ -1692,6 +1695,9 @@ void free_huge_folio(struct folio *folio)
 	bool restore_reserve;
 	unsigned long flags;
 
+	/* Page was mapped to userspace; no longer known-zero */
+	folio_clear_hugetlb_zeroed(folio);
+
 	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
 	VM_BUG_ON_FOLIO(folio_mapcount(folio), folio);
 
@@ -1735,12 +1741,12 @@ void free_huge_folio(struct folio *folio)
 	if (folio_test_hugetlb_temporary(folio)) {
 		remove_hugetlb_folio(h, folio, false);
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
-		update_and_free_hugetlb_folio(h, folio, true);
+		update_and_free_hugetlb_folio(h, folio, true, zeroed);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		remove_hugetlb_folio(h, folio, true);
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
-		update_and_free_hugetlb_folio(h, folio, true);
+		update_and_free_hugetlb_folio(h, folio, true, zeroed);
 	} else {
 		arch_clear_hugetlb_flags(folio);
 		enqueue_hugetlb_folio(h, folio);
@@ -1748,6 +1754,11 @@ void free_huge_folio(struct folio *folio)
 	}
 }
 
+void free_huge_folio(struct folio *folio)
+{
+	__free_huge_folio(folio, false);
+}
+
 /*
  * Must be called with the hugetlb lock held
  */
@@ -2031,7 +2042,7 @@ int dissolve_free_hugetlb_folio(struct folio *folio)
 			rc = 0;
 		}
 
-		update_and_free_hugetlb_folio(h, folio, false);
+		update_and_free_hugetlb_folio(h, folio, false, false);
 		return rc;
 	}
 out:
@@ -2093,6 +2104,10 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
 	if (!folio)
 		return NULL;
 
+	/* Mark as known-zero only if __GFP_ZERO was requested */
+	if (gfp_mask & __GFP_ZERO)
+		folio_set_hugetlb_zeroed(folio);
+
 	spin_lock_irq(&hugetlb_lock);
 	/*
 	 * nr_huge_pages needs to be adjusted within the same lock cycle
@@ -2156,11 +2171,11 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
  */
 static
 struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
-		struct vm_area_struct *vma, unsigned long addr)
+		struct vm_area_struct *vma, unsigned long addr, gfp_t gfp)
 {
 	struct folio *folio = NULL;
 	struct mempolicy *mpol;
-	gfp_t gfp_mask = htlb_alloc_mask(h);
+	gfp_t gfp_mask = htlb_alloc_mask(h) | gfp;
 	int nid;
 	nodemask_t *nodemask;
 
@@ -2715,7 +2730,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
 		 * Folio has been replaced, we can safely free the old one.
 		 */
 		spin_unlock_irq(&hugetlb_lock);
-		update_and_free_hugetlb_folio(h, old_folio, false);
+		update_and_free_hugetlb_folio(h, old_folio, false, false);
 	}
 
 	return ret;
@@ -2723,7 +2738,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
 free_new:
 	spin_unlock_irq(&hugetlb_lock);
 	if (new_folio)
-		update_and_free_hugetlb_folio(h, new_folio, false);
+		update_and_free_hugetlb_folio(h, new_folio, false, false);
 
 	return ret;
 }
@@ -2857,16 +2872,19 @@ typedef enum {
  * When it's set, the allocation will bypass all vma level reservations.
  */
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
-				    unsigned long addr, bool cow_from_owner)
+				    unsigned long user_addr, bool cow_from_owner,
+				    gfp_t gfp)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
+	unsigned long addr = user_addr & huge_page_mask(h);
 	struct folio *folio;
 	long retval, gbl_chg, gbl_reserve;
 	map_chg_state map_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg = NULL;
-	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+
+	gfp |= htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
 
 	idx = hstate_index(h);
 
@@ -2934,13 +2952,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
 	if (!folio) {
 		spin_unlock_irq(&hugetlb_lock);
-		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
+		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, user_addr, gfp);
 		if (!folio)
 			goto out_uncharge_cgroup;
 		spin_lock_irq(&hugetlb_lock);
 		list_add(&folio->lru, &h->hugepage_activelist);
 		folio_ref_unfreeze(folio, 1);
-		/* Fall through */
 	}
 
 	/*
@@ -2963,6 +2980,10 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	spin_unlock_irq(&hugetlb_lock);
 
+	if ((gfp & __GFP_ZERO) && !folio_test_hugetlb_zeroed(folio))
+		folio_zero_user(folio, user_addr);
+	folio_clear_hugetlb_zeroed(folio);
+
 	hugetlb_set_folio_subpool(folio, spool);
 
 	if (map_chg != MAP_CHG_ENFORCED) {
@@ -2999,7 +3020,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
 
 	if (ret == -ENOMEM) {
-		free_huge_folio(folio);
+		__free_huge_folio(folio, !!(gfp & __GFP_ZERO));
 		return ERR_PTR(-ENOMEM);
 	}
 
@@ -4971,7 +4992,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				spin_unlock(src_ptl);
 				spin_unlock(dst_ptl);
 				/* Do not use reserve as it's private owned */
-				new_folio = alloc_hugetlb_folio(dst_vma, addr, false);
+				new_folio = alloc_hugetlb_folio(dst_vma, addr, false, 0);
 				if (IS_ERR(new_folio)) {
 					folio_put(pte_folio);
 					ret = PTR_ERR(new_folio);
@@ -5500,7 +5521,7 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
 	 * be acquired again before returning to the caller, as expected.
 	 */
 	spin_unlock(vmf->ptl);
-	new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner);
+	new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner, 0);
 
 	if (IS_ERR(new_folio)) {
 		/*
@@ -5760,7 +5781,13 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 				goto out;
 		}
 
-		folio = alloc_hugetlb_folio(vma, vmf->address, false);
+		/*
+		 * Passing vmf->real_address would work just as well,
+		 * but PAGE_MASK helps make sure we never pass
+		 * USER_ADDR_NONE by mistake.
+		 */
+		folio = alloc_hugetlb_folio(vma, vmf->real_address & PAGE_MASK,
+					   false, __GFP_ZERO);
 		if (IS_ERR(folio)) {
 			/*
 			 * Returning error will result in faulting task being
@@ -5780,7 +5807,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 				ret = 0;
 			goto out;
 		}
-		folio_zero_user(folio, vmf->real_address);
 		__folio_mark_uptodate(folio);
 		new_folio = true;
 
@@ -6219,7 +6245,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 			goto out;
 		}
 
-		folio = alloc_hugetlb_folio(dst_vma, dst_addr, false);
+		folio = alloc_hugetlb_folio(dst_vma, dst_addr, false, 0);
 		if (IS_ERR(folio)) {
 			pte_t *actual_pte = hugetlb_walk(dst_vma, dst_addr, PMD_SIZE);
 			if (actual_pte) {
@@ -6266,7 +6292,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 			goto out;
 		}
 
-		folio = alloc_hugetlb_folio(dst_vma, dst_addr, false);
+		folio = alloc_hugetlb_folio(dst_vma, dst_addr, false, 0);
 		if (IS_ERR(folio)) {
 			folio_put(*foliop);
 			ret = -ENOMEM;
-- 
MST




* [PATCH v10 29/37] mm: memfd: skip zeroing for zeroed hugetlb pool pages
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (27 preceding siblings ...)
  2026-06-08  8:39 ` [PATCH v10 28/37] mm: hugetlb: add gfp parameter and skip zeroing for zeroed pages Michael S. Tsirkin
@ 2026-06-08  8:39 ` Michael S. Tsirkin
  2026-06-08 12:47   ` Lorenzo Stoakes
  2026-06-08  8:39 ` [PATCH v10 30/37] mm: page_reporting: add per-page zeroed bitmap for host feedback Michael S. Tsirkin
                   ` (11 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Add bool *zeroed output to alloc_hugetlb_folio_reserve() so
callers can check whether the pool page is known-zero.  memfd's
memfd_alloc_folio() uses this to skip the explicit folio_zero_user()
when the page is already zero.

This avoids redundant zeroing for memfd hugetlb pages that are
already known to be zero (HPG_zeroed set) and were never mapped to
userspace.

Note: HPG_zeroed is currently only set for surplus pages
allocated with __GFP_ZERO (via alloc_surplus_hugetlb_folio),
not for pool pages from alloc_pool_huge_folio. So the
zeroed output from alloc_hugetlb_folio_reserve is typically
false for pool-only reservations. It becomes true when
surplus pages fill the reservation. The addr_hint 0 passed
to folio_zero_user is acceptable for memfd: these pages are
not mapped yet and will get proper dcache handling at mmap
time via the page fault path.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
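Reviewer note: the zeroed output parameter follows a report-then-clear
pattern, and the caller zeroes only when the report is false. The sketch
below models that contract in user space. struct model_pool_folio,
model_reserve() and model_memfd_alloc() are hypothetical names, and
initialising the local flag to false is a defensive choice made for this
sketch only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a pool folio and its known-zero flag. */
struct model_pool_folio {
	unsigned char data[32];
	bool hpg_zeroed;
};

/* Report the hint through *zeroed, then clear it on the folio. */
static struct model_pool_folio *model_reserve(struct model_pool_folio *f,
					      bool *zeroed)
{
	if (zeroed && f) {
		*zeroed = f->hpg_zeroed;
		f->hpg_zeroed = false;
	}
	return f;
}

/* Caller side; returns true if it had to clear the folio itself. */
static bool model_memfd_alloc(struct model_pool_folio *f)
{
	bool zeroed = false;	/* defensive default for this sketch */

	if (!model_reserve(f, &zeroed))
		return false;	/* nothing allocated, nothing to clear */
	if (zeroed)
		return false;	/* known-zero: skip the clear */
	memset(f->data, 0, sizeof(f->data));
	return true;
}

/* A hinted folio is left alone; an unhinted one is cleared. */
static void model_memfd_selfcheck(void)
{
	struct model_pool_folio hinted = { .data = {0}, .hpg_zeroed = true };
	struct model_pool_folio stale = { .data = {7}, .hpg_zeroed = false };

	assert(!model_memfd_alloc(&hinted));
	assert(model_memfd_alloc(&stale));
	assert(stale.data[0] == 0);
}
```

Clearing the flag at the point of reporting means the hint is consumed
exactly once in this model, so a later free and reuse of the same folio
cannot inherit it.
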
 include/linux/cma.h     |  3 ++-
 include/linux/hugetlb.h |  6 ++++--
 mm/cma.c                |  6 ++++--
 mm/hugetlb.c            | 11 +++++++++--
 mm/hugetlb_cma.c        |  4 ++--
 mm/memfd.c              | 14 ++++++++------
 6 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/include/linux/cma.h b/include/linux/cma.h
index 8555d38a97b1..dee88909cf5d 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -53,7 +53,8 @@ extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long
 
 struct page *cma_alloc_frozen(struct cma *cma, unsigned long count,
 		unsigned int align, bool no_warn);
-struct page *cma_alloc_frozen_compound(struct cma *cma, unsigned int order);
+struct page *cma_alloc_frozen_compound(struct cma *cma, unsigned int order,
+				       gfp_t caller_gfp);
 bool cma_release_frozen(struct cma *cma, const struct page *pages,
 		unsigned long count);
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 06d033a57a61..7eb529eabe99 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -708,7 +708,8 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask, gfp_t gfp_mask,
 				bool allow_alloc_fallback);
 struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
-					  nodemask_t *nmask, gfp_t gfp_mask);
+					  nodemask_t *nmask, gfp_t gfp_mask,
+					  bool *zeroed);
 
 int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping,
 			pgoff_t idx);
@@ -1128,7 +1129,8 @@ static inline void wait_for_freed_hugetlb_folios(void)
 
 static inline struct folio *
 alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
-			    nodemask_t *nmask, gfp_t gfp_mask)
+			    nodemask_t *nmask, gfp_t gfp_mask,
+			    bool *zeroed)
 {
 	return NULL;
 }
diff --git a/mm/cma.c b/mm/cma.c
index c7ca567f4c5c..27971f6264ab 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -924,9 +924,11 @@ struct page *cma_alloc_frozen(struct cma *cma, unsigned long count,
 	return __cma_alloc_frozen(cma, count, align, gfp);
 }
 
-struct page *cma_alloc_frozen_compound(struct cma *cma, unsigned int order)
+struct page *cma_alloc_frozen_compound(struct cma *cma, unsigned int order,
+				       gfp_t caller_gfp)
 {
-	gfp_t gfp = GFP_KERNEL | __GFP_COMP | __GFP_NOWARN;
+	gfp_t gfp = GFP_KERNEL | __GFP_COMP | __GFP_NOWARN |
+		    (caller_gfp & __GFP_ZERO);
 
 	return __cma_alloc_frozen(cma, 1 << order, order, gfp);
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed00db703911..a087e915783f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2196,7 +2196,7 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
 }
 
 struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
-		nodemask_t *nmask, gfp_t gfp_mask)
+		nodemask_t *nmask, gfp_t gfp_mask, bool *zeroed)
 {
 	struct folio *folio;
 
@@ -2212,6 +2212,12 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
 		h->resv_huge_pages--;
 
 	spin_unlock_irq(&hugetlb_lock);
+
+	if (zeroed && folio) {
+		*zeroed = folio_test_hugetlb_zeroed(folio);
+		folio_clear_hugetlb_zeroed(folio);
+	}
+
 	return folio;
 }
 
@@ -2296,7 +2302,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 		 * It is okay to use NUMA_NO_NODE because we use numa_mem_id()
 		 * down the road to pick the current node if that is the case.
 		 */
-		folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
+		folio = alloc_surplus_hugetlb_folio(h,
+						    htlb_alloc_mask(h),
 						    NUMA_NO_NODE, &alloc_nodemask,
 						    USER_ADDR_NONE);
 		if (!folio) {
diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
index 7693ccefd0c6..c9266b25be3d 100644
--- a/mm/hugetlb_cma.c
+++ b/mm/hugetlb_cma.c
@@ -35,14 +35,14 @@ struct folio *hugetlb_cma_alloc_frozen_folio(int order, gfp_t gfp_mask,
 		return NULL;
 
 	if (hugetlb_cma[nid])
-		page = cma_alloc_frozen_compound(hugetlb_cma[nid], order);
+		page = cma_alloc_frozen_compound(hugetlb_cma[nid], order, gfp_mask);
 
 	if (!page && !(gfp_mask & __GFP_THISNODE)) {
 		for_each_node_mask(node, *nodemask) {
 			if (node == nid || !hugetlb_cma[node])
 				continue;
 
-			page = cma_alloc_frozen_compound(hugetlb_cma[node], order);
+			page = cma_alloc_frozen_compound(hugetlb_cma[node], order, gfp_mask);
 			if (page)
 				break;
 		}
diff --git a/mm/memfd.c b/mm/memfd.c
index abe13b291ddc..a99617a62e33 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -69,6 +69,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
 #ifdef CONFIG_HUGETLB_PAGE
 	struct folio *folio;
 	gfp_t gfp_mask;
+	bool zeroed;
 
 	if (is_file_hugepages(memfd)) {
 		/*
@@ -93,17 +94,18 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
 		folio = alloc_hugetlb_folio_reserve(h,
 						    numa_node_id(),
 						    NULL,
-						    gfp_mask);
+						    gfp_mask,
+						    &zeroed);
 		if (folio) {
 			u32 hash;
 
 			/*
-			 * Zero the folio to prevent information leaks to userspace.
-			 * Use folio_zero_user() which is optimized for huge/gigantic
-			 * pages. Pass 0 as addr_hint since this is not a faulting path
-			 *  and we don't have a user virtual address yet.
+			 * Zero the folio to prevent information leaks to
+			 * userspace.  Skip if the pool page is known-zero
+			 * (HPG_zeroed set during pool pre-allocation).
 			 */
-			folio_zero_user(folio, 0);
+			if (!zeroed)
+				folio_zero_user(folio, 0);
 
 			/*
 			 * Mark the folio uptodate before adding to page cache,
-- 
MST



* [PATCH v10 30/37] mm: page_reporting: add per-page zeroed bitmap for host feedback
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (28 preceding siblings ...)
  2026-06-08  8:39 ` [PATCH v10 29/37] mm: memfd: skip zeroing for zeroed hugetlb pool pages Michael S. Tsirkin
@ 2026-06-08  8:39 ` Michael S. Tsirkin
  2026-06-08  8:39 ` [PATCH v10 31/37] virtio_balloon: submit reported pages as individual buffers Michael S. Tsirkin
                   ` (10 subsequent siblings)
  40 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

The host may skip zeroing some reported pages (e.g., due to alignment
constraints or bounce buffer fallback in QEMU).  Currently, when
host_zeroes_pages is set, all reported pages are unconditionally
marked PG_zeroed - even ones the host did not actually zero.

Add a zeroed_bitmap to page_reporting_dev_info that the report()
callback can use to indicate which pages were actually zeroed.
The driver's report() callback is responsible for managing the
bitmap: zeroing it before sending pages to the host, then setting
bits for pages the host actually zeroed.

page_reporting_drain() checks the bitmap per-page in addition to the
global host_zeroes_pages flag.

No driver sets host_zeroes_pages yet, so the static key is
off and the bitmap is never read.  Behavior is unchanged.
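
The gating rule can be modeled in user space.  This is a minimal
sketch, not kernel code: CAPACITY, bitmap_test(), bitmap_set() and
page_counts_as_zeroed() are hypothetical stand-ins for the kernel's
bitmap helpers and for the check in page_reporting_drain().  The
PageBuddy()/buddy_order() test that the real function also performs
is left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define CAPACITY 32	/* hypothetical stand-in for PAGE_REPORTING_CAPACITY */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS ((CAPACITY + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Test bit i of a word-array bitmap. */
static bool bitmap_test(const unsigned long *map, unsigned int i)
{
	return (map[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1UL;
}

/* Set bit i of a word-array bitmap. */
static void bitmap_set(unsigned long *map, unsigned int i)
{
	map[i / BITS_PER_WORD] |= 1UL << (i % BITS_PER_WORD);
}

/*
 * A page counts as zeroed only if the batch was reported, the global
 * flag is on, and the driver set the bit for this batch position.
 */
static bool page_counts_as_zeroed(bool reported, bool host_zeroes_pages,
				  const unsigned long *zeroed_bitmap,
				  unsigned int i)
{
	assert(i < CAPACITY);
	return reported && host_zeroes_pages &&
	       bitmap_test(zeroed_bitmap, i);
}
```

The index advances for every scatterlist entry, including entries
that are skipped, so bit i always refers to batch position i.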

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 include/linux/page_reporting.h | 7 +++++++
 mm/page_reporting.c            | 8 ++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index c331c6b36687..e2e6a487ddab 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -17,6 +17,13 @@ struct page_reporting_dev_info {
 	/* If true, host zeros reported pages on reclaim */
 	bool host_zeroes_pages;
 
+	/*
+	 * Per-page zeroed status, indexed by scatterlist position.
+	 * The driver's report() callback must clear the bitmap,
+	 * then set bits for pages that were actually zeroed.
+	 */
+	DECLARE_BITMAP(zeroed_bitmap, PAGE_REPORTING_CAPACITY);
+
 	/* work struct for processing reports */
 	struct delayed_work work;
 
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 84ebc4547119..9c39448e758b 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -108,6 +108,7 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
 		     struct scatterlist *sgl, unsigned int nents, bool reported)
 {
 	struct scatterlist *sg = sgl;
+	unsigned int i = 0;
 
 	/*
 	 * Drain the now reported pages back into their respective
@@ -122,7 +123,7 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
 
 		/* If the pages were not reported due to error skip flagging */
 		if (!reported)
-			continue;
+			goto next;
 
 		/*
 		 * If page was not commingled with another page we can
@@ -133,9 +134,12 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
 		 */
 		if (PageBuddy(page) && buddy_order(page) == order) {
 			__SetPageReported(page);
-			if (page_reporting_host_zeroes_pages())
+			if (page_reporting_host_zeroes_pages() &&
+			    test_bit(i, prdev->zeroed_bitmap))
 				__SetPageZeroed(page);
 		}
+next:
+		i++;
 	} while ((sg = sg_next(sg)));
 
 	/* reinitialize scatterlist now that it is empty */
-- 
MST



* [PATCH v10 31/37] virtio_balloon: submit reported pages as individual buffers
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (29 preceding siblings ...)
  2026-06-08  8:39 ` [PATCH v10 30/37] mm: page_reporting: add per-page zeroed bitmap for host feedback Michael S. Tsirkin
@ 2026-06-08  8:39 ` Michael S. Tsirkin
  2026-06-08  8:40 ` [PATCH v10 32/37] virtio_balloon: disable indirect descriptors Michael S. Tsirkin
                   ` (9 subsequent siblings)
  40 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Submit each reported page as a separate virtqueue buffer instead
of one buffer with an sg list of all pages. This avoids indirect
descriptor allocation (kmalloc in the reporting path) and gives
per-page used length feedback from the device.

On error, the already-queued pages are kicked and drained
before the error is returned. The caller (page_reporting_drain)
then marks the batch as unreported, which is conservative
but correct.

Note: if the virtqueue is broken, wait_event on
virtqueue_get_buf hangs.  This is a pre-existing issue:
the old single-buffer code had the same hang.  EVENT_IDX
is not a concern: callbacks were never disabled, so the
virtqueue manages used_event automatically.
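
The submit-then-drain pattern, including truncation on error, can
be sketched in user space.  Everything here is an assumption made
for illustration: struct fake_vq, fake_add() and submit_each() are
hypothetical and only model a queue with a fixed number of free
slots, not the virtio API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical queue model: accepts at most 'room' buffers. */
struct fake_vq {
	unsigned int room;
	unsigned int queued;
};

/* Queue one buffer, or fail when the queue is full. */
static int fake_add(struct fake_vq *vq)
{
	if (vq->queued >= vq->room)
		return -ENOSPC;
	vq->queued++;
	return 0;
}

/*
 * Queue up to 'nents' single-page buffers.  On failure, shrink the
 * batch to what was queued, still drain those, and return the error.
 */
static int submit_each(struct fake_vq *vq, unsigned int nents,
		       unsigned int *completed)
{
	unsigned int i;
	int err = 0;

	assert(vq && completed);
	*completed = 0;

	for (i = 0; i < nents; i++) {
		err = fake_add(vq);
		if (err) {
			nents = i;	/* only the first i were queued */
			break;
		}
	}

	/* Collect exactly as many completions as buffers queued. */
	for (i = 0; i < nents; i++) {
		vq->queued--;
		(*completed)++;
	}

	return err;
}
```

Draining before returning the error leaves the queue empty for
the next batch, which the "add to an empty queue" assumption in the
driver relies on.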

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
 drivers/virtio/virtio_balloon.c | 40 +++++++++++++++++++++------------
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6a1a610c2cb1..53b4a3984e7d 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -187,7 +187,9 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 
 	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
 
-	/* We should always be able to add one buffer to an empty queue. */
+	/* We made sure the vq is large enough so we should always
+	 * be able to add one buffer to an empty queue.
+	 */
 	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
 	virtqueue_kick(vq);
 
@@ -202,25 +204,35 @@ static int virtballoon_free_page_report(struct page_reporting_dev_info *pr_dev_i
 	struct virtio_balloon *vb =
 		container_of(pr_dev_info, struct virtio_balloon, pr_dev_info);
 	struct virtqueue *vq = vb->reporting_vq;
-	unsigned int unused, err;
+	unsigned int i, err = 0;
 
 	/* We should always be able to add these buffers to an empty queue. */
-	err = virtqueue_add_inbuf(vq, sg, nents, vb, GFP_NOWAIT);
+	for (i = 0; i < nents; i++) {
+		struct scatterlist one;
 
-	/*
-	 * In the extremely unlikely case that something has occurred and we
-	 * are able to trigger an error we will simply display a warning
-	 * and exit without actually processing the pages.
-	 */
-	if (WARN_ON_ONCE(err))
-		return err;
+		sg_init_table(&one, 1);
+		sg_set_page(&one, sg_page(&sg[i]), sg[i].length,
+			    sg[i].offset);
+		err = virtqueue_add_inbuf(vq, &one, 1, &sg[i], GFP_NOWAIT);
+		if (WARN_ON_ONCE(err)) {
+			nents = i;
+			break;
+		}
+	}
 
-	virtqueue_kick(vq);
+	if (nents) {
+		virtqueue_kick(vq);
 
-	/* When host has read buffer, this completes via balloon_ack */
-	wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
+		/* When host has read buffer, this completes via balloon_ack */
+		for (i = 0; i < nents; i++) {
+			unsigned int unused;
 
-	return 0;
+			wait_event(vb->acked,
+				   virtqueue_get_buf(vq, &unused));
+		}
+	}
+
+	return err;
 }
 
 static void set_page_pfns(struct virtio_balloon *vb,
-- 
MST



* [PATCH v10 32/37] virtio_balloon: disable indirect descriptors
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (30 preceding siblings ...)
  2026-06-08  8:39 ` [PATCH v10 31/37] virtio_balloon: submit reported pages as individual buffers Michael S. Tsirkin
@ 2026-06-08  8:40 ` Michael S. Tsirkin
  2026-06-08  8:40 ` [PATCH v10 33/37] mm: page_reporting: add flush parameter with page budget Michael S. Tsirkin
                   ` (8 subsequent siblings)
  40 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

A follow-up patch (DEVICE_INIT_ON_INFLATE) adds an inbuf
(bitmap) alongside the outbuf (pfns), making total_sg=2.
With INDIRECT_DESC negotiated, virtqueue_use_indirect()
triggers for total_sg > 1.  Balloon never needs more than
2 entries and virtqueue sizes are 128+, so indirect
descriptors are unnecessary.

Disabling them avoids a GFP_KERNEL allocation inside
virtqueue_add_sgs when VIRTIO_RING_F_INDIRECT_DESC is negotiated.
This allocation could trigger OOM reclaim while balloon_lock is
held, leading to a deadlock: OOM notifier -> leak_balloon ->
balloon_lock.

With single-buffer submissions (previous patch) and no indirect
descriptors, virtqueue_add never allocates memory.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
 drivers/virtio/virtio_balloon.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 53b4a3984e7d..1fa1c7fa285f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -7,6 +7,7 @@
  */
 
 #include <linux/virtio.h>
+#include <uapi/linux/virtio_ring.h>
 #include <linux/virtio_balloon.h>
 #include <linux/swap.h>
 #include <linux/workqueue.h>
@@ -1175,6 +1176,13 @@ static int virtballoon_validate(struct virtio_device *vdev)
 	else if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON))
 		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING);
 
+	/*
+	 * Balloon submits 1-2 sg entries max per buffer, virtqueue
+	 * sizes are 128+.  Disable indirect descriptors to avoid
+	 * GFP_KERNEL allocation in virtqueue_add under balloon_lock,
+	 * which could deadlock via OOM -> oom_notify -> leak_balloon.
+	 */
+	__virtio_clear_bit(vdev, VIRTIO_RING_F_INDIRECT_DESC);
 	__virtio_clear_bit(vdev, VIRTIO_F_ACCESS_PLATFORM);
 	return 0;
 }
-- 
MST



* [PATCH v10 33/37] mm: page_reporting: add flush parameter with page budget
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (31 preceding siblings ...)
  2026-06-08  8:40 ` [PATCH v10 32/37] virtio_balloon: disable indirect descriptors Michael S. Tsirkin
@ 2026-06-08  8:40 ` Michael S. Tsirkin
  2026-06-08  8:40 ` [PATCH v10 34/37] virtio_balloon: skip zeroing for host-zeroed reported pages Michael S. Tsirkin
                   ` (7 subsequent siblings)
  40 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Add a write-only module parameter 'flush' that triggers immediate
page reporting.  The value specifies a page budget: at least
this many pages (at page_reporting_order) will be reported,
or all unreported pages if fewer remain.  The actual number
reported may exceed the budget since each reporting pass
processes a full cycle across all zones.

This is helpful when there is a lot of memory freed quickly,
and a single cycle may not process all free pages due to
internal budget limits.

  echo 512 > /sys/module/page_reporting/parameters/flush

Note: the set callback runs under kernel param_lock,
so writing this parameter blocks other built-in parameter
writes until the flush loop completes.  This is acceptable
for a privileged debug/test parameter.

Note: flush_delayed_work is called twice per loop iteration
(once before requesting a new cycle, once after). The first call
ensures any pending work completes before we submit new budget.
__page_reporting_request is benign: if a concurrent request
arrives between our flushes, it queues new delayed work that
will run on its own schedule.
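
The budget accounting can be checked with a small user-space model.
This is only a sketch of the loop counter: flush_passes() and
min_uint() are hypothetical names, and the early exits taken when
the device goes idle or a signal is pending are deliberately
omitted.

```c
#include <assert.h>

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Count how many request/flush passes the loop would run if it
 * never hit an early exit.  Each pass is charged at most 'capacity'
 * pages against the budget.
 */
static unsigned int flush_passes(unsigned int budget,
				 unsigned int capacity)
{
	unsigned int reported, passes = 0;

	assert(capacity > 0);
	for (reported = 0; reported < budget;
	     reported += min_uint(capacity, budget - reported))
		passes++;

	return passes;
}
```

Under these assumptions, a budget of 512 with a capacity of 32
gives 16 passes.  The charge per pass is an accounting step only;
as noted above, a pass may report more pages than it is charged.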

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 mm/page_reporting.c | 54 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 9c39448e758b..400150a2aa15 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -358,6 +358,60 @@ static void page_reporting_process(struct work_struct *work)
 static DEFINE_MUTEX(page_reporting_mutex);
 DEFINE_STATIC_KEY_FALSE(page_reporting_enabled);
 
+static int page_reporting_flush_set(const char *val,
+				    const struct kernel_param *kp)
+{
+	struct page_reporting_dev_info *prdev;
+	unsigned int budget;
+	int err;
+
+	err = kstrtouint(val, 0, &budget);
+	if (err)
+		return err;
+	if (!budget)
+		return 0;
+
+	mutex_lock(&page_reporting_mutex);
+	prdev = rcu_dereference_protected(pr_dev_info,
+				lockdep_is_held(&page_reporting_mutex));
+	if (prdev) {
+		unsigned int reported;
+		bool interrupted = false;
+
+		for (reported = 0; reported < budget;
+		     reported += min(prdev->capacity, budget - reported)) {
+			/*
+			 * First flush completes any previously scheduled
+			 * reporting work.  Then request a new reporting
+			 * cycle and flush again to execute it.
+			 */
+			flush_delayed_work(&prdev->work);
+			__page_reporting_request(prdev);
+			flush_delayed_work(&prdev->work);
+			if (atomic_read(&prdev->state) == PAGE_REPORTING_IDLE)
+				break;
+			if (signal_pending(current)) {
+				interrupted = true;
+				break;
+			}
+		}
+		if (interrupted) {
+			mutex_unlock(&page_reporting_mutex);
+			return -EINTR;
+		}
+	}
+	mutex_unlock(&page_reporting_mutex);
+	return 0;
+}
+
+static const struct kernel_param_ops flush_ops = {
+	.set = page_reporting_flush_set,
+	.get = param_get_uint,
+};
+static unsigned int page_reporting_flush;
+module_param_cb(flush, &flush_ops, &page_reporting_flush, 0200);
+MODULE_PARM_DESC(flush, "Report at least N pages at page_reporting_order, or until all reported");
+
 int page_reporting_register(struct page_reporting_dev_info *prdev)
 {
 	int err = 0;
-- 
MST



* [PATCH v10 34/37] virtio_balloon: skip zeroing for host-zeroed reported pages
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (32 preceding siblings ...)
  2026-06-08  8:40 ` [PATCH v10 33/37] mm: page_reporting: add flush parameter with page budget Michael S. Tsirkin
@ 2026-06-08  8:40 ` Michael S. Tsirkin
  2026-06-08  8:40 ` [PATCH v10 35/37] virtio_balloon: disable reporting zeroed optimization for confidential guests Michael S. Tsirkin
                   ` (6 subsequent siblings)
  40 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

Implement VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED (per virtio spec
proposal): when negotiated, the device initializes reported pages
(zeros, or poison_val if PAGE_POISON).

Check per-page used length returned by the device to determine
which reported pages were zeroed. If used_len matches the page
size, the device successfully initialized the page (e.g. via
MADV_DONTNEED), and we set the corresponding zeroed_bitmap bit.
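
A user-space sketch of that check follows.  struct fake_sg and
zeroed_index() are hypothetical; they model only the comparison of
used_len against the buffer length and the pointer arithmetic that
recovers the batch position.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one scatterlist entry. */
struct fake_sg {
	unsigned int length;	/* bytes offered to the device */
};

/*
 * Return the batch index of 'entry' if the device reported writing
 * the whole buffer, or -1 if the page must not be marked zeroed.
 */
static long zeroed_index(const struct fake_sg *sg, size_t nents,
			 const struct fake_sg *entry,
			 unsigned int used_len)
{
	ptrdiff_t idx = entry - sg;

	assert(idx >= 0 && (size_t)idx < nents);
	return used_len == entry->length ? (long)idx : -1;
}
```

Because each buffer's token is a pointer into the original array,
the index is recovered without assuming anything about the order in
which the device completes buffers.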

Gate host_zeroes_pages on the feature bit and page content:
when PAGE_POISON is negotiated with non-zero poison_val, the
device fills with poison not zeros, so pages are not zeroed.

Clear the feature in validate() if REPORTING is not present
or if PAGE_POISON is active with non-zero poison_val.

See the virtio spec change:
https://github.com/oasis-tcs/virtio-spec/issues/244

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 drivers/virtio/virtio_balloon.c     | 22 ++++++++++++++++++++--
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1fa1c7fa285f..e3afa6f32ba5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -207,6 +207,8 @@ static int virtballoon_free_page_report(struct page_reporting_dev_info *pr_dev_i
 	struct virtqueue *vq = vb->reporting_vq;
 	unsigned int i, err = 0;
 
+	bitmap_zero(pr_dev_info->zeroed_bitmap, nents);
+
 	/* We should always be able to add these buffers to an empty queue. */
 	for (i = 0; i < nents; i++) {
 		struct scatterlist one;
@@ -226,10 +228,14 @@ static int virtballoon_free_page_report(struct page_reporting_dev_info *pr_dev_i
 
 		/* When host has read buffer, this completes via balloon_ack */
 		for (i = 0; i < nents; i++) {
-			unsigned int unused;
+			struct scatterlist *entry;
+			unsigned int used_len;
 
 			wait_event(vb->acked,
-				   virtqueue_get_buf(vq, &unused));
+				   (entry = virtqueue_get_buf(vq, &used_len)));
+			if (used_len == entry->length)
+				set_bit(entry - sg,
+					pr_dev_info->zeroed_bitmap);
 		}
 	}
 
@@ -1051,6 +1057,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 #endif
 
 		vb->pr_dev_info.capacity = capacity;
+		vb->pr_dev_info.host_zeroes_pages =
+			virtio_has_feature(vdev,
+					   VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED);
 		err = page_reporting_register(&vb->pr_dev_info);
 		if (err)
 			goto out_unregister_oom;
@@ -1176,6 +1185,14 @@ static int virtballoon_validate(struct virtio_device *vdev)
 	else if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON))
 		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING);
 
+	if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_REPORTING))
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED);
+
+	/* Device fills with poison_val, not zeros; disable zeroed hint */
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON) &&
+	    !want_init_on_free())
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED);
+
 	/*
 	 * Balloon submits 1-2 sg entries max per buffer, virtqueue
 	 * sizes are 128+.  Disable indirect descriptors to avoid
@@ -1194,6 +1211,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
 	VIRTIO_BALLOON_F_PAGE_POISON,
 	VIRTIO_BALLOON_F_REPORTING,
+	VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index ee35a372805d..13074631f300 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -37,6 +37,7 @@
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
 #define VIRTIO_BALLOON_F_REPORTING	5 /* Page reporting virtqueue */
+#define VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED	6 /* Device initializes reported pages */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
MST



* [PATCH v10 35/37] virtio_balloon: disable reporting zeroed optimization for confidential guests
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (33 preceding siblings ...)
  2026-06-08  8:40 ` [PATCH v10 34/37] virtio_balloon: skip zeroing for host-zeroed reported pages Michael S. Tsirkin
@ 2026-06-08  8:40 ` Michael S. Tsirkin
  2026-06-08  8:40 ` [PATCH v10 36/37] mm: balloon: use put_page_zeroed for zeroed balloon pages Michael S. Tsirkin
                   ` (5 subsequent siblings)
  40 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

In confidential computing environments (TDX, SEV-SNP), the host
is untrusted and may lie about zeroing reported pages. Clear
DEVICE_INIT_REPORTED in validate() so the guest does not skip
re-zeroing based on hints from an untrusted device.

Note: currently REPORTING remains enabled and
VIRTIO_F_ACCESS_PLATFORM is cleared in CC environments.
This is known to leak free page physical addresses to the
host.  Whether that, or ballooning in general, is a security
concern in CC is up to the user.  This patch only disables
our new zeroed-page hints where the host is untrusted.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 drivers/virtio/virtio_balloon.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index e3afa6f32ba5..bf1172ad5419 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -19,6 +19,7 @@
 #include <linux/wait.h>
 #include <linux/mm.h>
 #include <linux/page_reporting.h>
+#include <linux/cc_platform.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -1193,6 +1194,8 @@ static int virtballoon_validate(struct virtio_device *vdev)
 	    !want_init_on_free())
 		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED);
 
+	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED);
 	/*
 	 * Balloon submits 1-2 sg entries max per buffer, virtqueue
 	 * sizes are 128+.  Disable indirect descriptors to avoid
-- 
MST



* [PATCH v10 36/37] mm: balloon: use put_page_zeroed for zeroed balloon pages
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (34 preceding siblings ...)
  2026-06-08  8:40 ` [PATCH v10 35/37] virtio_balloon: disable reporting zeroed optimization for confidential guests Michael S. Tsirkin
@ 2026-06-08  8:40 ` Michael S. Tsirkin
  2026-06-08 11:10   ` David Hildenbrand (Arm)
  2026-06-08  8:40 ` [PATCH v10 37/37] virtio_balloon: implement VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE Michael S. Tsirkin
                   ` (4 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

When a balloon page marked PageZeroed is freed during migration,
use put_page_zeroed() to propagate the zeroed hint to the buddy
allocator. Previously the hint was silently lost via plain put_page().

No page has PageZeroed set yet; the next patch
(VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE) will set it on
pages the host has zeroed during inflate.

Note: during balloon migration, the migration core holds an
extra reference, so put_page_zeroed() will not be the final
put. The zeroed hint is lost in that case, which is
acceptable: it is a best-effort optimization.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
 mm/balloon.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/balloon.c b/mm/balloon.c
index 96a8f1e20bc6..6c9dd8ab0c5d 100644
--- a/mm/balloon.c
+++ b/mm/balloon.c
@@ -324,7 +324,15 @@ static int balloon_page_migrate(struct page *newpage, struct page *page,
 	balloon_page_finalize(page);
 	spin_unlock_irqrestore(&balloon_pages_lock, flags);
 
-	put_page(page);
+	if (PageZeroed(page)) {
+		/* Atomic to serialize with memory_failure's
+		 * TestSetPageHWPoison; not under zone->lock here.
+		 */
+		ClearPageZeroed(page);
+		put_page_zeroed(page);
+	} else {
+		put_page(page);
+	}
 
 	return 0;
 }
-- 
MST



* [PATCH v10 37/37] virtio_balloon: implement VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (35 preceding siblings ...)
  2026-06-08  8:40 ` [PATCH v10 36/37] mm: balloon: use put_page_zeroed for zeroed balloon pages Michael S. Tsirkin
@ 2026-06-08  8:40 ` Michael S. Tsirkin
  2026-06-08  9:17 ` [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Lorenzo Stoakes
                   ` (3 subsequent siblings)
  40 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08  8:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

When the device offers DEVICE_INIT_ON_INFLATE (bit 7), the device
initializes inflated pages and returns a per-page bitmap indicating
which pages were successfully initialized.

The driver appends a device-writable bitmap buffer to each inflate
descriptor chain via virtqueue_add_sgs. After the host acknowledges,
the driver checks bitmap bits (bounded by used_len) and marks pages
with SetPageZeroed.

tell_host() returns used_len from virtqueue_get_buf(). Bitmap reads
are bounded: fill_balloon() and virtballoon_migratepage() only trust
bits within the used_len range.
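
The bounding rule can be sketched in user space.  This is an
illustration under stated assumptions, not the driver code:
host_initialized() is a hypothetical helper, and the layout (bit i
lives in byte i / 8 at bit position i % 8, used_len counted in
bytes) is assumed here rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Trust bit i only if the device reported writing the byte that
 * holds it.  Bytes at or beyond used_len may be stale and are
 * treated as "not initialized".
 */
static bool host_initialized(const uint8_t *bitmap, size_t used_len,
			     size_t i)
{
	assert(bitmap);
	if (i / 8 >= used_len)
		return false;	/* device did not write this byte */
	return (bitmap[i / 8] >> (i % 8)) & 1u;
}
```

Treating out-of-range bits as clear is the conservative choice: a
page is then zeroed again by the guest instead of being trusted.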

On deflate, release_pages_balloon checks PageZeroed per page and
uses put_page_zeroed for pages the host initialized, propagating
the zeroed hint to the buddy allocator.

If inflate_vq has fewer than 2 descriptors, probe fails with
-ENOSPC. If PAGE_POISON is negotiated with non-zero poison_val,
the feature is cleared in validate().

See the virtio spec change:
https://lore.kernel.org/all/9c69b992c3dd83dfef3db92cd86b2fd8a0730d48.1777731396.git.mst@redhat.com

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: cursor-agent:GPT-5.4-xhigh
---
 drivers/virtio/virtio_balloon.c     | 107 ++++++++++++++++++++++++----
 include/linux/page-flags.h          |   1 +
 include/uapi/linux/virtio_balloon.h |   1 +
 3 files changed, 97 insertions(+), 12 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index bf1172ad5419..0cd4fd06ca2d 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -20,6 +20,7 @@
 #include <linux/mm.h>
 #include <linux/page_reporting.h>
 #include <linux/cc_platform.h>
+#include <asm-generic/bitops/le.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -122,6 +123,13 @@ struct virtio_balloon {
 	struct virtqueue *reporting_vq;
 	struct page_reporting_dev_info pr_dev_info;
 
+	/* No DMA alignment needed: ACCESS_PLATFORM is cleared,
+	 * so virtio bypasses the DMA API.  If this ever changes,
+	 * add ____dma_from_device_aligned here.
+	 */
+	/* Bitmap returned by host for DEVICE_INIT_ON_INFLATE */
+	DECLARE_BITMAP(inflate_bitmap, VIRTIO_BALLOON_ARRAY_PFNS_MAX);
+
 	/* State for keeping the wakeup_source active while adjusting the balloon */
 	spinlock_t wakeup_lock;
 	bool processing_wakeup_event;
@@ -182,22 +190,33 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static unsigned int tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 {
-	struct scatterlist sg;
+	struct scatterlist sg_out, sg_in;
+	struct scatterlist *sgs[] = { &sg_out, &sg_in };
 	unsigned int len;
 
-	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+	sg_init_one(&sg_out, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
 
 	/* We made sure the vq is large enough so we should always
 	 * be able to add one buffer to an empty queue.
 	 */
-	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+	if (vq == vb->inflate_vq &&
+	    virtio_has_feature(vb->vdev,
+			       VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE)) {
+		unsigned int bitmap_bytes;
+
+		bitmap_bytes = DIV_ROUND_UP(vb->num_pfns, 8);
+		bitmap_zero(vb->inflate_bitmap, vb->num_pfns);
+		sg_init_one(&sg_in, vb->inflate_bitmap, bitmap_bytes);
+		virtqueue_add_sgs(vq, sgs, 1, 1, vb, GFP_KERNEL);
+	} else {
+		virtqueue_add_outbuf(vq, &sg_out, 1, vb, GFP_KERNEL);
+	}
 	virtqueue_kick(vq);
 
-	/* When host has read buffer, this completes via balloon_ack */
 	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
-
+	return len;
 }
 
 static int virtballoon_free_page_report(struct page_reporting_dev_info *pr_dev_info,
@@ -300,8 +319,37 @@ static unsigned int fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns != 0) {
+		unsigned int used_len = tell_host(vb, vb->inflate_vq);
+
+		if (virtio_has_feature(vb->vdev,
+				       VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE)) {
+			unsigned int i;
+			unsigned int valid_bits = used_len * 8;
+
+			for (i = 0; i < vb->num_pfns;
+			     i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
+				unsigned int pfn, j;
+				bool zeroed = true;
+
+				if (i + VIRTIO_BALLOON_PAGES_PER_PAGE > valid_bits)
+					break;
+				for (j = 0; j < VIRTIO_BALLOON_PAGES_PER_PAGE; j++) {
+					if (!test_bit_le(i + j, vb->inflate_bitmap)) {
+						zeroed = false;
+						break;
+					}
+				}
+				if (zeroed) {
+					pfn = virtio32_to_cpu(vb->vdev,
+							      vb->pfns[i]);
+					SetPageZeroed(pfn_to_page(pfn >>
+						(PAGE_SHIFT -
+						 VIRTIO_BALLOON_PFN_SHIFT)));
+				}
+			}
+		}
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -314,7 +362,12 @@ static void release_pages_balloon(struct virtio_balloon *vb,
 
 	list_for_each_entry_safe(page, next, pages, lru) {
 		list_del(&page->lru);
-		put_page(page); /* balloon reference */
+		if (PageZeroed(page)) {
+			ClearPageZeroed(page);
+			put_page_zeroed(page);
+		} else {
+			put_page(page);
+		}
 	}
 }
 
@@ -861,8 +914,27 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	/* balloon's page migration 1st step  -- inflate "newpage" */
 	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
 	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
+	{
+		unsigned int used_len = tell_host(vb, vb->inflate_vq);
 
+		if (virtio_has_feature(vb->vdev,
+				       VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE) &&
+		    used_len >= DIV_ROUND_UP(VIRTIO_BALLOON_PAGES_PER_PAGE, 8)) {
+			unsigned int j;
+			bool zeroed = true;
+
+			for (j = 0; j < VIRTIO_BALLOON_PAGES_PER_PAGE; j++) {
+				if (!test_bit_le(j, vb->inflate_bitmap)) {
+					zeroed = false;
+					break;
+				}
+			}
+			if (zeroed)
+				SetPageZeroed(newpage);
+		}
+	}
+
+	ClearPageZeroed(page);
 	/* balloon's page migration 2nd step -- deflate "page" */
 	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
 	set_page_pfns(vb, vb->pfns, page);
@@ -966,6 +1038,12 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE) &&
+	    virtqueue_get_vring_size(vb->inflate_vq) < 2) {
+		err = -ENOSPC;
+		goto out_del_vqs;
+	}
+
 	if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 		vb->vb_dev_info.adjust_managed_page_count = true;
 #ifdef CONFIG_BALLOON_MIGRATION
@@ -1191,11 +1269,15 @@ static int virtballoon_validate(struct virtio_device *vdev)
 
 	/* Device fills with poison_val, not zeros; disable zeroed hint */
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON) &&
-	    !want_init_on_free())
+	    !want_init_on_free()) {
 		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED);
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE);
+	}
 
-	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
 		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED);
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE);
+	}
 	/*
 	 * Balloon submits 1-2 sg entries max per buffer, virtqueue
 	 * sizes are 128+.  Disable indirect descriptors to avoid
@@ -1215,6 +1297,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_PAGE_POISON,
 	VIRTIO_BALLOON_F_REPORTING,
 	VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED,
+	VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9365d59ac1d6..caecf0fbf0e2 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -680,6 +680,7 @@ FOLIO_FLAG_FALSE(idle)
  * uses this to skip redundant zeroing in post_alloc_hook().
  */
 __PAGEFLAG(Zeroed, zeroed, PF_NO_COMPOUND)
+SETPAGEFLAG(Zeroed, zeroed, PF_NO_COMPOUND)
 CLEARPAGEFLAG(Zeroed, zeroed, PF_NO_COMPOUND)
 #define __PG_ZEROED (1UL << PG_zeroed)
 
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 13074631f300..cbaf18e0b17c 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -38,6 +38,7 @@
 #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
 #define VIRTIO_BALLOON_F_REPORTING	5 /* Page reporting virtqueue */
 #define VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED	6 /* Device initializes reported pages */
+#define VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE	7 /* Device initializes pages on inflate */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
MST




* Re: [PATCH v10 10/37] mm: add folio_zero_user stub for configs without THP/HUGETLBFS
  2026-06-08  8:36 ` [PATCH v10 10/37] mm: add folio_zero_user stub for configs without THP/HUGETLBFS Michael S. Tsirkin
@ 2026-06-08  9:12   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08  9:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:36:02AM -0400, Michael S. Tsirkin wrote:
> folio_zero_user() is defined in mm/memory.c under
> CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS.  A subsequent patch
> will call it from post_alloc_hook() for all user page zeroing, so
> configs without THP or HUGETLBFS will need a stub.
>
> Add a stub that uses clear_user_highpages() with aligned
> addr_hint.

It's not a stub? You're doing default behaviour for !THP || hugetlbfs?

>
> Without THP/HUGETLBFS, only order-0 user pages are allocated, so
> the locality optimization in the real folio_zero_user() (zero near
> the faulting address last) is not needed.

The 'real' folio_zero_user()?

But they're both real? In what sense is this unreal?

This commit message is pretty confusing.

> This also matches what vma_alloc_zeroed_movable_folio currently does.

You're also not explaining why we need a random new file just for a single
define, this is really weird?

And please don't add new files without adding an entry to MAINTAINERS.

While we're here, can we just move this function somewhere else? memory.c is
like the junkyard of mm. Maybe util.c.

Anyway this seems like it belongs in mm/internal.h.

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh
> ---
>  mm/folio_zero.h | 18 ++++++++++++++++++
>  mm/page_alloc.c |  1 +
>  2 files changed, 19 insertions(+)
>  create mode 100644 mm/folio_zero.h
>
> diff --git a/mm/folio_zero.h b/mm/folio_zero.h
> new file mode 100644
> index 000000000000..c135b3a34da8
> --- /dev/null
> +++ b/mm/folio_zero.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef MM_FOLIO_ZERO_H
> +#define MM_FOLIO_ZERO_H
> +
> +#include <linux/highmem.h>
> +
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> +void folio_zero_user(struct folio *folio, unsigned long addr_hint);
> +#else

No comment explaining what this does in comparison to the THP/hugetlbfs version?
Let's have a kdoc comment here.

> +static inline void folio_zero_user(struct folio *folio, unsigned long addr_hint)

So we have a folio (physical metadata) and an address hint?

> +{
> +	unsigned long base = ALIGN_DOWN(addr_hint, folio_size(folio));

const please.

> +
> +	clear_user_highpages(&folio->page, base, folio_nr_pages(folio));
> +}
> +#endif
> +
> +#endif /* MM_FOLIO_ZERO_H */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d3f284c607d..0943ab724032 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -17,6 +17,7 @@
>  #include <linux/stddef.h>
>  #include <linux/mm.h>
>  #include <linux/highmem.h>
> +#include "folio_zero.h"
>  #include <linux/interrupt.h>
>  #include <linux/jiffies.h>
>  #include <linux/compiler.h>
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (36 preceding siblings ...)
  2026-06-08  8:40 ` [PATCH v10 37/37] virtio_balloon: implement VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE Michael S. Tsirkin
@ 2026-06-08  9:17 ` Lorenzo Stoakes
  2026-06-08 12:52   ` Lorenzo Stoakes
  2026-06-08 11:02 ` Vlastimil Babka (SUSE)
                   ` (2 subsequent siblings)
  40 siblings, 1 reply; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08  9:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:33:46AM -0400, Michael S. Tsirkin wrote:
> Further, on architectures with aliasing caches, upstream with init_on_alloc
> double-zeros user pages: once via kernel_init_pages() in
> post_alloc_hook, and again via clear_user_highpage() at the
> callsite (because user_alloc_needs_zeroing() returns true).
> This series eliminates that double-zeroing by moving the zeroing
> into the post_alloc_hook + propagating the "host
> already zeroed this page" information through the buddy allocator.
>
> For page reporting, VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED (bit 6)
> is used. For the inflate/deflate path,
> VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE (bit 7) is used.
>
> Virtio spec: https://lore.kernel.org/all/cover.1778140241.git.mst@redhat.com
>
> Based on v7.1-rc6.  When applying on mm-unstable, two conflicts
> are expected:
> - kernel_init_pages() was renamed to clear_highpages_kasan_tagged()
>   in mm-unstable.  Use clear_highpages_kasan_tagged() in the
>   post_alloc_hook else branch.
> - FPI_PREPARED uses BIT(3) in mm-unstable.  Bump FPI_ZEROED to
>   BIT(4).
> Build-tested on mm-unstable at e9dd96806dbc:
> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git zero-mm-unstable
>
> Patches 1-5: fixes/cleanups, dependencies of the zeroing patches.
> Patches 6-9: thread user_addr through page allocator, contig API,
>   and gigantic hugetlb allocation.
> Patches 10-16: folio_zero_user in post_alloc_hook, vma_alloc_zeroed
>   conversion, raw fault address threading.
> Patches 17-24: PG_zeroed flag, aliasing guard, buddy merge/split
>   tracking, FPI_ZEROED optimization, folio_put_zeroed.
> Patches 25-27: __GFP_ZERO callsite conversions (alloc_anon_folio,
>   vma_alloc_anon_folio_pmd) with memcg charge failure mitigation.
> Patches 28-29: hugetlb __GFP_ZERO + HPG_zeroed.
> Patches 30-35: page reporting zeroing (DEVICE_INIT_REPORTED),
>   disable indirect descriptors.
> Patches 36-37: inflate/deflate zeroing (DEVICE_INIT_ON_INFLATE).

This seems far too much for one series.

You're doing a bunch of mm stuff that seems relatively independent, then
putting the virtio stuff on top.

I think this should be broken out into separate series laying foundations
rather than doing it all in one go, which is also difficult for review
purposes.

Adding a new folio flag is also contentious, for instance; we may want to
go bit-by-bit and ensure that each foundational element is acceptable
before doing the next bit rather than having it as part of a big series.

Looking through the changelog only adds to this feeling! Huge numbers of
changes, even relatively recently and I'm not sure all relevant maintainers
in mm have had a look through either.

Thanks, Lorenzo

>
> -------
>
> Performance with THP enabled on a 2GB VM, 1 vCPU, allocating
> 256MB of anonymous pages:
>
>   metric         baseline            optimized           delta
>   task-clock     232 +- 20 ms        51 +- 26 ms         -78%
>   cache-misses   1.20M +- 248K       288K +- 102K        -76%
>   instructions   16.3M +- 1.2M       13.8M +- 1.0M       -15%
>
> With hugetlb surplus pages:
>
>   metric         baseline            optimized           delta
>   task-clock     219 +- 23 ms        65 +- 34 ms         -70%
>   cache-misses   1.17M +- 391K       263K +- 36K         -78%
>   instructions   17.9M +- 1.2M       15.1M +- 724K       -16%
>
> Two flags track known-zero pages:
>   PG_zeroed (aliased to PG_private) marks buddy allocator pages that
>   are known to contain all zeros, either because the host zeroed
>   them during page reporting, or because they were freed via the
>   balloon deflate path.  It lives on free-list pages and is consumed
>   by post_alloc_hook() on allocation.
>   HPG_zeroed (stored in hugetlb folio->private bits) serves the same
>   purpose for hugetlb pool pages, which are kept in a pool and may
>   be zeroed long after buddy allocation, so PG_zeroed (consumed at
>   allocation time) cannot track their state.
>
> PG_zeroed lifecycle:
>
>   Sets PG_zeroed:
>   - page_reporting_drain: on reported pages when host zeroes them
>   - __free_pages_ok / __free_frozen_pages: when FPI_ZEROED is set
>     (balloon deflate path)
>   - buddy merge: on merged page if both buddies were zeroed
>   - expand(): propagate to split-off buddy sub-pages
>
>   Clears PG_zeroed:
>   - __free_pages_prepare: clears all PAGE_FLAGS_CHECK_AT_PREP flags
>     (PG_zeroed included), preventing PG_private aliasing leaks
>   - rmqueue_buddy / __rmqueue_pcplist: read-then-clear, passes
>     zeroed hint to prep_new_page -> post_alloc_hook
>   - __isolate_free_page: clear (compaction/page_reporting isolation)
>   - compaction, alloc_contig, split_free_frozen: clear before use
>   - buddy merge: clear both pages before merge, then conditionally
>     re-set on merged head if both were zeroed
>
> HPG_zeroed lifecycle (hugetlb pool pages, stored in folio->private):
>
>   Sets HPG_zeroed:
>   - alloc_surplus_hugetlb_folio: after buddy allocation with
>     __GFP_ZERO, mark pool page as known-zero
>
>   Clears HPG_zeroed:
>   - free_huge_folio: page was mapped to userspace, no longer
>     known-zero when it returns to the pool
>   - alloc_hugetlb_folio: cleared unconditionally on output
>   - alloc_hugetlb_folio_reserve: cleared after checking
>
> - The optimization is most effective with THP, where entire 2MB
>   pages are allocated directly from reported order-9+ buddy pages.
>   Without THP, only ~21% of order-0 allocations come from reported
>   pages due to low-order fragmentation.
> - Persistent hugetlb pool pages are not covered: when freed by
>   userspace they return to the hugetlb free pool, not the buddy
>   allocator, so they are never reported to the host.  Surplus
>   hugetlb pages are allocated from buddy and do benefit.
>
> - PG_zeroed is aliased to PG_private.  __free_pages_prepare() clears it
>   (preventing filesystem PG_private from leaking as false PG_zeroed).
>   FPI_ZEROED re-sets it after prepare for balloon deflate pages.
>   Is aliasing PG_private acceptable, or should a different bit be used?
>
> - With __GFP_ZERO, the folio is zeroed before mem_cgroup_charge().
>   If the charge fails (cgroup at limit), the zeroing work is wasted
>   and the folio is freed and retried at a smaller order.  Previously,
>   zeroing was done after a successful charge.  This is inherent to
>   the __GFP_ZERO approach.  Is this acceptable?
>
> - On architectures with aliasing caches, upstream with init_on_alloc
>   double-zeros user pages: once via kernel_init_pages() in
>   post_alloc_hook, and again via clear_user_highpage() at the
>   callsite (because user_alloc_needs_zeroing() returns true).
>   Our patches eliminate this by zeroing once via folio_zero_user()
>   in post_alloc_hook.  Not a critical fix (people who set init_on_alloc
>   know they are paying performance) but a nice cleanup anyway.
>
> Test program:
>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <string.h>
>   #include <sys/mman.h>
>
>   #ifndef MADV_POPULATE_WRITE
>   #define MADV_POPULATE_WRITE 23
>   #endif
>   #ifndef MAP_HUGETLB
>   #define MAP_HUGETLB 0x40000
>   #endif
>
>   int main(int argc, char **argv)
>   {
>       unsigned long size;
>       int flags = MAP_PRIVATE | MAP_ANONYMOUS;
>       void *p;
>       int r;
>
>       if (argc < 2) {
>           fprintf(stderr, "usage: %s <size_mb> [huge]\n", argv[0]);
>           return 1;
>       }
>       size = atol(argv[1]) * 1024UL * 1024;
>       if (argc >= 3 && strcmp(argv[2], "huge") == 0)
>           flags |= MAP_HUGETLB;
>       p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
>       if (p == MAP_FAILED) {
>           perror("mmap");
>           return 1;
>       }
>       r = madvise(p, size, MADV_POPULATE_WRITE);
>       if (r) {
>           perror("madvise");
>           return 1;
>       }
>       munmap(p, size);
>       return 0;
>   }
>
> Test script (bench.sh):
>
>   #!/bin/bash
>   # Usage: bench.sh <size_mb> <iterations> [huge]
>   # Feature negotiation (DEVICE_INIT_REPORTED/ON_INFLATE) is
>   # handled by QEMU command line flags.
>   SZ=${1:-256}; ITER=${2:-10}; HUGE=${3:-}
>   FLUSH=/sys/module/page_reporting/parameters/flush
>   CSV=/tmp/perf.csv
>   rmmod virtio_balloon 2>/dev/null
>   insmod /mnt/share/virtio_balloon.ko
>   echo 512 > $FLUSH
>   [ "$HUGE" = "huge" ] && echo $((SZ/2)) > /proc/sys/vm/nr_overcommit_hugepages
>   rm -f $CSV
>   echo "=== sz=${SZ}MB iter=$ITER $HUGE ==="
>   for i in $(seq 1 $ITER); do
>       echo 3 > /proc/sys/vm/drop_caches
>       echo 512 > $FLUSH
>       perf stat -e task-clock,instructions,cache-misses \
>           -x, -o $CSV --append -- /mnt/share/alloc_once $SZ $HUGE
>   done
>   [ "$HUGE" = "huge" ] && echo 0 > /proc/sys/vm/nr_overcommit_hugepages
>   rmmod virtio_balloon
>   awk -F, '/^#/||/^$/{next}{v=$1+0;e=$3;gsub(/ /,"",e);s[e]+=v;ss[e]+=v*v;n[e]++}
>   END{for(e in s){a=s[e]/n[e];d=sqrt(ss[e]/n[e]-a*a);printf "  %-16s %10.0f +- %8.0f (n=%d)\n",e,a,d,n[e]}}' $CSV
>
> Compile and run:
>   gcc -static -O2 -o alloc_once alloc_once.c
>   bash bench.sh 256 10            # regular pages
>   bash bench.sh 256 10 huge       # hugetlb surplus
>
> Note about Sashiko (sashiko.dev) false positives:
>   Sashiko's mm-alloc guideline says "Any optimization replacing
>   clear_user_highpage() with __GFP_ZERO is wrong on [cache-aliasing]
>   architectures". This is correct for mainline but not for this
>   series, which threads user_addr through the allocator so that
>   post_alloc_hook() calls folio_zero_user() with the dcache flush.
>   Suggested guideline update: add "unless the caller passes a
>   valid user address (i.e. not USER_ADDR_NONE) to vma_alloc_folio(),
>   alloc_contig_frozen_pages_user() etc., which reaches
>   post_alloc_hook() for the dcache flush".
>
> Pre-existing bugs found during review (not fixed, not made worse):
>   - do_swap_page() returns VM_FAULT_OOM on large-folio swapin race
>     instead of retrying.
>   - free_huge_folio() called with refcount==1 on
>     mem_cgroup_charge_hugetlb failure.
>   - memfd_alloc_folio() double-decrements resv_huge_pages on error.
>   - wait_event in virtballoon_free_page_report hangs on broken
>     virtqueue (pre-existing, same as old single-buffer code).
>   - tell_host() GFP_KERNEL under balloon_lock risks OOM deadlock.
>
> Changes since v9:
> - Fix W=1 kerneldoc warning on alloc_contig_frozen_pages_user_noprof.
> - Fix link error on !MMU configs (m68k, arm allnoconfig): move
>   folio_zero_user stub to new mm/folio_zero.h header.
> - Reorder patches: move PG_zeroed tracking and folio_put_zeroed
>   before __GFP_ZERO conversions, allowing folio_put_zeroed to
>   handle memcg charge failures.
> - Better handle memcg charge failures.
>
> Changes since v8 (address Sashiko v8 review findings):
> - Fix mempolicy interleave: combine vm_pgoff and VMA offset into
>   a single expression before shifting, fixing carry loss for
>   file-backed VMAs with unaligned vm_pgoff.
> - Fix memory-failure: wrap ClearPageHWPoison in retry path with
>   zone->lock (same race as TestSetPageHWPoison).
> - Fix stale comment: "folio_zero_user writes" -> "page zeroing"
>   in huge_memory.c __folio_mark_uptodate comment.
> - Drop rounddown_pow_of_two for page reporting capacity (no-op
>   for compiler optimization, halves batch size for non-power-of-2).
> - Reorder: move "mm: balloon: use put_page_zeroed" before
>   "virtio_balloon: implement DEVICE_INIT_ON_INFLATE" so the
>   ClearPageZeroed handling is in place before any page gets
>   the flag set.
> - Various commit log improvements (PowerPC note in aliasing
>   patch, memory-failure note about other HWPoison calls,
>   wording fixes).
>
> Changes since v7 (address Sashiko AI review findings):
> - Fix dcache flush on VIPT aliasing architectures: add
>   user_alloc_needs_zeroing() guard in post_alloc_hook to force
>   folio_zero_user for user pages when cache aliasing requires it.
>   Host-zeroed pages excluded (!zeroed).  Optimization preserved.
> - Fix folio_zero_user stub: replace macro with non-inline function
>   in mm/memory.c to avoid double-evaluation and missing include.
> - Fix C89 declaration-after-statement in free_huge_folio.
> - Fix CMA __GFP_ZERO: pass through to cma_alloc_frozen_compound
>   so HPG_zeroed accurately reflects whether page was zeroed.
> - Fix big-endian bitmap: use test_bit_le() for inflate_bitmap.
> - Fix migratepage: clear PageZeroed on old page before deflation.
> - Fix page_reporting flush: overflow-safe loop, add -EINTR on
>   signal, add code comment explaining double flush_delayed_work.
> - Add atomic ClearPageZeroed (CLEARPAGEFLAG) for balloon migration
>   path where zone->lock is not held.
> - Add VM_WARN_ON_ONCE for order>0 without __GFP_COMP in
>   post_alloc_hook (folio_zero_user requires compound metadata).
> - Add _noprof pattern for vma_alloc_zeroed_movable_folio to
>   preserve memory allocation profiling attribution.
> - Add PageReported propagation in split_large_buddy (was missing
>   from patch 2).
> - Add FPI_ZEROED guard: skip PageZeroed when page_poisoning
>   enabled and init_on_free disabled (poison overwrites zeroes).
> - Add DMA alignment comment for inflate_bitmap (ACCESS_PLATFORM
>   cleared, so not needed now).
> - Restore tell_host comment explaining vq buffer assumption.
> - Various code comments documenting design decisions.
> - Drop __GFP_ZERO from gather_surplus_pages: avoid shifting
>   zeroing from fault time to reservation time (mmap/fallocate).
>   Pool pages are zeroed at fault time via alloc_hugetlb_folio.
>   Fresh surplus allocations at fault time still benefit from
>   __GFP_ZERO + HPG_zeroed.
> - New patch: add alloc_contig_frozen_pages_user API with user_addr
>   for cache-friendly zeroing in the contiguous allocation path.
> - New patch: thread user_addr through gigantic hugetlb allocation
>   via alloc_contig_frozen_pages_user.
> - New patch: replace user_alloc_needs_zeroing() with aliasing-only
>   checks (cpu_dcache_is_aliasing || cpu_icache_is_aliasing) in the
>   post_alloc_hook guard.  Avoids redundant re-zero on non-aliasing.
> - New patch: serialize TestSetPageHWPoison with zone->lock in
>   memory_failure to fix pre-existing race with non-atomic buddy
>   flag operations (e.g. page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP).
> - New patch: disable VIRTIO_RING_F_INDIRECT_DESC in balloon to
>   prevent GFP_KERNEL allocation under balloon_lock (OOM deadlock).
> - New patch: skip kernel_init_pages for FPI_ZEROED when page
>   poisoning is not enabled (page already zero, skip redundant work).
>
> Also since v7 (address review by Gregory Price):
> - Drop from_pool bool in alloc_hugetlb_folio: use
>   folio_test_hugetlb_zeroed directly.  HPG_zeroed is set by
>   alloc_surplus_hugetlb_folio for fresh allocations, so the
>   check handles both pool and fresh pages.
> - Drop bool *zeroed output parameter from alloc_hugetlb_folio:
>   sink zeroing inside the function.  When __GFP_ZERO is set and
>   !folio_test_hugetlb_zeroed, call folio_zero_user internally.
> - Rename addr to user_addr in alloc_hugetlb_folio, align
>   internally with huge_page_mask.
> - Add Reviewed-by: Gregory Price tags on reviewed patches.
>
> New patches since v7:
> - mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
> - mm: add alloc_contig_frozen_pages_user for cache-friendly zeroing
> - mm: hugetlb: thread user_addr through gigantic page allocation
> - mm: page_alloc: use aliasing checks instead of
>   user_alloc_needs_zeroing
> - virtio_balloon: disable indirect descriptors
> - mm: page_alloc: skip kernel_init_pages for FPI_ZEROED when safe
>
> Changes since v6 (address review by Gregory Price):
> - Rework hugetlb: use gfp_t parameter instead of bool zero /
>   bool *zeroed.  Sink zeroing inside alloc_hugetlb_folio().
>   Pass raw fault address (user_addr) for cache-friendly zeroing
>   on both pool-page and fresh allocation paths.
>   (Suggested by Gregory Price)
> - Reorder compaction_alloc_noprof() to call prep_compound_page
>   before post_alloc_hook for consistency.
>   (Suggested by Gregory Price)
> - Reorder: interleave fix first, PageReported propagation and
>   capacity fix moved to front as dependencies.
> - Add USER_ADDR_NONE comments in mmap.c and internal.h explaining why -1 is
>   never a valid userspace address.
> - Fix err uninitialized warning in virtballoon_free_page_report().
> - Lots of commit log tweaks.
>
> Also in v7:
> - Fix hugetlb pool page zeroing to use vmf->real_address
>   (the actual faulting subpage) instead of vmf->address
>   (hugepage-aligned), preserving cache-friendly zeroing
>   locality that upstream had at the callsite.
> - Remove dead/broken alloc_hugetlb_folio !CONFIG_HUGETLB_PAGE
>   stub (returned NULL but callers check IS_ERR).
>
> Changes since v5:
> - Rebased onto v7.1-rc2.
> - Split alloc_anon_folio and alloc_swap_folio raw fault address
>   changes into separate patches.
> - In virtio, move PAGE_POISON check for DEVICE_INIT_REPORTED
>   from probe() to validate(), clearing the feature instead of
>   just gating host_zeroes_pages.  Same for confidential
>   computing check.
> - Fix bisectability: FPI_ZEROED definition and usage now in
>   the same patch.
> - Lots of commit log tweaks.
> - Reorder: REPORTED before ON_INFLATE.
> - Kerneldoc fixes.
>
> Changes since v4:
> With virtio spec posted, update to latest spec:
> - Add VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED (bit 6) for reporting.
> - Add VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE (bit 7) for inflate.
> - Per-page virtqueue submission, per-page used_len feedback.
> - Balloon migration preserves PageZeroed hint.
> - Page_reporting capacity bugfix for small virtqueues.
> - PG_zeroed propagation in split_large_buddy.
> - Disable both features for confidential computing guests.
> - Gate host_zeroes_pages on PAGE_POISON/poison_val: when PAGE_POISON
>   is negotiated with non-zero poison_val, device fills with poison
>   not zeros, so host_zeroes_pages must be false.
> - Disable ON_INFLATE when PAGE_POISON with non-zero poison_val.
> - Bound inflate bitmap reads by used_len from device.
> - Move ON_INFLATE poison_val check to validate() for proper
>   feature negotiation.
> - Fix NUMA interleave index for unaligned VMA start (new patch 1).
> - Drop vma_alloc_folio_user_addr: with the ilx fix, callers can
>   pass raw fault address to vma_alloc_folio directly.
> - Tested with DEBUG_VM, INIT_ON_ALLOC/FREE enabled.
>
> Changes since v3 (address review by Gregory Price and David Hildenbrand):
> - Keep user_addr threading internal: public APIs (__alloc_pages,
>   __folio_alloc, folio_alloc_mpol) are unchanged.  Only internal
>   functions (__alloc_frozen_pages_noprof, __alloc_pages_mpol) carry
>   user_addr.  This eliminates all API churn for external callers.
> - Add vma_alloc_folio_user_addr() (2/22) to separate NUMA policy
>   address from the zeroing hint address.  Fixes NUMA interleave
>   index corruption when passing unaligned fault address for
>   higher-order allocations.
> - Add per-page zeroed_bitmap to page_reporting_dev_info (17/22).
>   The driver's report() callback manages the bitmap.  Drain
>   checks it gated by the host_zeroes_pages static key.  This
>   matches the proposed virtio balloon extension at
>   https://lore.kernel.org/all/cover.1776874126.git.mst@redhat.com/
> - Clear PG_zeroed in __isolate_free_page() to prevent the aliased
>   PG_private flag from leaking to compaction/alloc_contig paths.
> - Do not exclude PG_zeroed from PAGE_FLAGS_CHECK_AT_PREP macro.
>   Instead, __free_pages_prepare() clears it (preventing filesystem
>   PG_private leaking as false PG_zeroed), and FPI_ZEROED sets it
>   after prepare.  Only buddy merge assertion is relaxed.
> - Initialize alloc_context.user_addr in alloc_pages_bulk_noprof.
> - Deflate and hugetlb changes are much smaller now.  Still, the
>   patchset can be merged gradually, if desired.
>
> Changes since v2 (address review by Gregory Price and David Hildenbrand):
> - v2 used pghint_t / vma_alloc_folio_hints API.  v3 switches to
>   threading user_addr through the page allocator and using __GFP_ZERO,
>   so post_alloc_hook() can use folio_zero_user() for cache-friendly
>   zeroing when the user fault address is known.
> - Use FPI_ZEROED to set PG_zeroed after __free_pages_prepare() instead
>   of runtime masking in __free_one_page (further refined in v4).
> - Drop redundant page_poisoning_enabled() check from mm core free
>   path, already guarded at feature negotiation time in
>   virtio_balloon_validate.  The balloon driver keeps its own
>   page_poisoning_enabled_static() check as defense in depth.
> - Split free_frozen_pages_zeroed and put_page_zeroed into separate
>   patches.  David Hildenbrand indicated he intends to rework balloon
>   pages to be frozen (no refcount), at which point put_page_zeroed
>   (21/22) can be dropped and the balloon can call
>   free_frozen_pages_zeroed directly.
> - Use HPG_zeroed flag (in hugetlb folio->private) for hugetlb pool
>   pages instead of PG_zeroed, since pool pages are zeroed long after
>   buddy allocation and PG_zeroed is consumed at allocation time.
> - syzbot CI found a PF_NO_COMPOUND BUG in the v2 pghint_t approach
>   where __ClearPageZeroed was called on compound hugetlb pages in
>   free_huge_folio.  The v3 HPG_zeroed approach avoids this.
> - Remove redundant arch vma_alloc_zeroed_movable_folio overrides
>   on x86, s390, m68k, and alpha (12/22). Suggested by David
>   Hildenbrand.
> - Updated benchmarking script to compute per-run avg +- stddev
>   via awk on CSV output.
>
> Changes v1->v2:
> - Replaced __GFP_PREZEROED with PG_zeroed page flag (aliased PG_private)
> - Added pghint_t type and vma_alloc_folio_hints() API
> - Track PG_zeroed across buddy merges and splits
> - Added post_alloc_hook integration (single consume/clear point)
> - Added hugetlb support (pool pages + memfd)
> - Added page_reporting flush parameter for deterministic testing
> - Added free_frozen_pages_hint/put_page_hint for balloon deflate path
> - Added try_to_claim_block PG_zeroed preservation
> - Updated perf numbers with per-iteration flush methodology
>
> Written with assistance from Claude (claude-opus-4-6).
> Reviewed by cursor-agent (GPT-5.4-xhigh).
> Everything manually read, patchset split and commit logs edited manually.
>
>
> Michael S. Tsirkin (37):
>   mm: mempolicy: fix interleave index calculation
>   mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
>   mm: page_alloc: propagate PageReported flag across buddy splits
>   mm: page_reporting: allow driver to set batch capacity
>   mm: hugetlb: remove dead alloc_hugetlb_folio stub
>   mm: move vma_alloc_folio_noprof to page_alloc.c
>   mm: thread user_addr through page allocator for cache-friendly zeroing
>   mm: add alloc_contig_frozen_pages_user for cache-friendly zeroing
>   mm: hugetlb: thread user_addr through gigantic page allocation
>   mm: add folio_zero_user stub for configs without THP/HUGETLBFS
>   mm: page_alloc: move prep_compound_page before post_alloc_hook
>   mm: use folio_zero_user for user pages in post_alloc_hook
>   mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio
>   mm: remove arch vma_alloc_zeroed_movable_folio overrides
>   mm: alloc_anon_folio: pass raw fault address to vma_alloc_folio
>   mm: alloc_swap_folio: pass raw fault address to vma_alloc_folio
>   mm: page_reporting: skip redundant zeroing of host-zeroed reported
>     pages
>   mm: page_alloc: use aliasing checks instead of
>     user_alloc_needs_zeroing
>   mm: page_alloc: clear PG_zeroed on buddy merge if not both zero
>   mm: page_alloc: preserve PG_zeroed in page_del_and_expand
>   mm: page_alloc: propagate PG_zeroed in split_large_buddy
>   mm: add free_frozen_pages_zeroed
>   mm: page_alloc: skip kernel_init_pages for FPI_ZEROED when safe
>   mm: add put_page_zeroed and folio_put_zeroed
>   mm: use __GFP_ZERO in alloc_anon_folio
>   mm: vma_alloc_anon_folio_pmd: pass raw fault address to
>     vma_alloc_folio
>   mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd
>   mm: hugetlb: add gfp parameter and skip zeroing for zeroed pages
>   mm: memfd: skip zeroing for zeroed hugetlb pool pages
>   mm: page_reporting: add per-page zeroed bitmap for host feedback
>   virtio_balloon: submit reported pages as individual buffers
>   virtio_balloon: disable indirect descriptors
>   mm: page_reporting: add flush parameter with page budget
>   virtio_balloon: skip zeroing for host-zeroed reported pages
>   virtio_balloon: disable reporting zeroed optimization for confidential
>     guests
>   mm: balloon: use put_page_zeroed for zeroed balloon pages
>   virtio_balloon: implement VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE
>
>  arch/alpha/include/asm/page.h       |   3 -
>  arch/m68k/include/asm/page_no.h     |   3 -
>  arch/s390/include/asm/page.h        |   3 -
>  arch/x86/include/asm/page.h         |   3 -
>  drivers/virtio/virtio_balloon.c     | 177 ++++++++++++++---
>  fs/hugetlbfs/inode.c                |   3 +-
>  include/linux/cma.h                 |   3 +-
>  include/linux/gfp.h                 |  18 +-
>  include/linux/highmem.h             |  15 +-
>  include/linux/hugetlb.h             |  18 +-
>  include/linux/mm.h                  |  13 ++
>  include/linux/page-flags.h          |  11 ++
>  include/linux/page_reporting.h      |  13 ++
>  include/uapi/linux/virtio_balloon.h |   2 +
>  mm/balloon.c                        |  10 +-
>  mm/cma.c                            |   6 +-
>  mm/compaction.c                     |   9 +-
>  mm/folio_zero.h                     |  18 ++
>  mm/huge_memory.c                    |  16 +-
>  mm/hugetlb.c                        | 138 ++++++++-----
>  mm/hugetlb_cma.c                    |   4 +-
>  mm/internal.h                       |  22 ++-
>  mm/memfd.c                          |  14 +-
>  mm/memory-failure.c                 |  10 +
>  mm/memory.c                         |  19 +-
>  mm/mempolicy.c                      |  75 +++----
>  mm/mmap.c                           |   6 +
>  mm/page_alloc.c                     | 297 +++++++++++++++++++++++-----
>  mm/page_reporting.c                 |  99 ++++++++--
>  mm/page_reporting.h                 |  12 ++
>  mm/slub.c                           |   4 +-
>  mm/swap.c                           |  20 +-
>  32 files changed, 792 insertions(+), 272 deletions(-)
>  create mode 100644 mm/folio_zero.h
>
> --
> MST
>



* Re: [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
  2026-06-08  8:34 ` [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock Michael S. Tsirkin
@ 2026-06-08  9:43   ` Lorenzo Stoakes
  2026-06-08 13:48     ` Michael S. Tsirkin
  0 siblings, 1 reply; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08  9:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Miaohe Lin

On Mon, Jun 08, 2026 at 04:34:23AM -0400, Michael S. Tsirkin wrote:
> TestSetPageHWPoison() is called without zone->lock, so its atomic
> update to page->flags can race with non-atomic flag operations
> that run under zone->lock in the buddy allocator.
>
> In particular, __free_pages_prepare() does:
>
>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>
> This non-atomic read-modify-write, while correctly excluding
> __PG_HWPOISON from the mask, can still lose a concurrent
> TestSetPageHWPoison if the read happens before the poison bit
> is set and the write happens after.  Follow-up patches in this
> series add similar non-atomic flag operations as well.
>
> Fix by acquiring zone->lock around TestSetPageHWPoison and
> around ClearPageHWPoison in the retry path.  This
> serializes with all buddy flag manipulation.  The cost is
> negligible: one lock/unlock in an extremely rare path
> (hardware memory errors).
>
> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> in this file operate on pages already removed from the buddy
> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> need zone->lock protection.
>
> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Can we have Fixes: and Cc: stable and also send this separately please?

These patches seem like unrelated fixups that you've discovered along the way,
and don't belong as part of the already rather large series, unless I'm missing
something here.

Thanks, Lorenzo

> Assisted-by: Claude:claude-opus-4-6
> ---
>  mm/memory-failure.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index ee42d4361309..3880486028a1 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2348,6 +2348,8 @@ int memory_failure(unsigned long pfn, int flags)
>  	unsigned long page_flags;
>  	bool retry = true;
>  	int hugetlb = 0;
> +	struct zone *zone;
> +	unsigned long mf_flags;
>
>  	if (!sysctl_memory_failure_recovery)
>  		panic("Memory failure on page %lx", pfn);
> @@ -2390,7 +2392,11 @@ int memory_failure(unsigned long pfn, int flags)
>  	if (hugetlb)
>  		goto unlock_mutex;
>
> +	/* Serialize with non-atomic buddy flag operations */
> +	zone = page_zone(p);
> +	spin_lock_irqsave(&zone->lock, mf_flags);
>  	if (TestSetPageHWPoison(p)) {
> +		spin_unlock_irqrestore(&zone->lock, mf_flags);
>  		res = -EHWPOISON;
>  		if (flags & MF_ACTION_REQUIRED)
>  			res = kill_accessing_process(current, pfn, flags);
> @@ -2399,6 +2405,7 @@ int memory_failure(unsigned long pfn, int flags)
>  		action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
>  		goto unlock_mutex;
>  	}
> +	spin_unlock_irqrestore(&zone->lock, mf_flags);
>
>  	/*
>  	 * We need/can do nothing about count=0 pages.
> @@ -2420,7 +2427,10 @@ int memory_failure(unsigned long pfn, int flags)
>  			} else {
>  				/* We lost the race, try again */
>  				if (retry) {
> +					/* Serialize with non-atomic buddy flag operations */
> +					spin_lock_irqsave(&zone->lock, mf_flags);
>  					ClearPageHWPoison(p);
> +					spin_unlock_irqrestore(&zone->lock, mf_flags);
>  					retry = false;
>  					goto try_again;
>  				}
> --
> MST
>
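
The lost update described in the quoted commit message can be replayed
deterministically in userspace. This is only a sketch: the bit positions,
the mask name and both helper functions are made up for illustration, and
the "CPUs" are just statements executed in a fixed order rather than real
threads.

```c
/* Illustrative bit layout; not the kernel's real page flag positions. */
#define SKETCH_HWPOISON   (1UL << 0)
#define SKETCH_PRIVATE    (1UL << 1)

/* Stand-in for PAGE_FLAGS_CHECK_AT_PREP: it excludes the poison bit. */
#define SKETCH_CHECK_AT_PREP  SKETCH_PRIVATE

/*
 * Replay of the racy interleaving: CPU A performs a non-atomic
 * read-modify-write while CPU B sets the poison bit in between.
 */
static unsigned long replay_unlocked(unsigned long flags)
{
	unsigned long snapshot = flags;               /* CPU A: load */

	flags |= SKETCH_HWPOISON;                     /* CPU B: set poison bit */
	flags = snapshot & ~SKETCH_CHECK_AT_PREP;     /* CPU A: store stale value */
	return flags;
}

/*
 * With both sides under the same lock, the read-modify-write cannot
 * straddle the poison update, so one of the two serial orders results.
 * This replays the order in which CPU A runs first.
 */
static unsigned long replay_locked(unsigned long flags)
{
	flags &= ~SKETCH_CHECK_AT_PREP;               /* CPU A: complete RMW */
	flags |= SKETCH_HWPOISON;                     /* CPU B: set poison bit */
	return flags;
}
```

Starting from a page with only SKETCH_PRIVATE set, replay_unlocked()
returns a value without the poison bit even though the mask never named
that bit, while replay_locked() keeps it.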



* Re: [PATCH v10 01/37] mm: mempolicy: fix interleave index calculation
  2026-06-08  8:34 ` [PATCH v10 01/37] mm: mempolicy: fix interleave index calculation Michael S. Tsirkin
@ 2026-06-08  9:43   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08  9:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:34:06AM -0400, Michael S. Tsirkin wrote:
> The NUMA interleave index was computed as two separate terms:
>
>     *ilx += vma->vm_pgoff >> order;
>     *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
>
> This has two problems:
>
> 1. When vm_start is not aligned to the folio size, the
>    subtraction before the shift lets low bits affect the
>    result via borrows.

This feels really vague. Do you have examples of where the calculation has
been impacted like this?

How is this a problem that affects things in practice?

>
> 2. For file-backed VMAs, shifting vm_pgoff and the VMA
>    offset independently loses carries between them, giving
>    wrong chunk indices when vm_pgoff is not aligned to order.

Similar comments to the above. An example would be helpful here.

>
> Combine into a single expression that adds vm_pgoff and

Combining this kind of thing into a single expression is the complete
opposite of what we want to do when refactoring code. No thanks.

> the page-granularity VMA offset first, then shifts once:
>
>     *ilx += (vma->vm_pgoff +
>             (addr >> PAGE_SHIFT) -
>             (vma->vm_start >> PAGE_SHIFT)) >> order;
>
> For anonymous VMAs, vm_pgoff equals vm_start >> PAGE_SHIFT,

This is completely incorrect.

For anonymous VMAs:

vm_pgoff = vm_start_at_first_fault >> PAGE_SHIFT.

So if you remap a _faulted_ VMA, the vm_start changes, vm_pgoff does
not. The two terms can be COMPLETELY independent.

> so the vm_pgoff and vm_start terms cancel and the result

No, they do not.

> reduces to addr >> (PAGE_SHIFT + order), same as before.

No, it doesn't.

>
> For file-backed VMAs, the sum vm_pgoff + (addr >> PAGE_SHIFT)
> - (vm_start >> PAGE_SHIFT) gives the file page offset of addr.

That's the page offset in a VMA for both anon and file-backed?

(addr - vm_start) >> PAGE_SHIFT is the page offset into a VMA (canonically
determined by linear_page_index())

> Shifting by order gives the correct file chunk index.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

You're claiming we have an incorrect calculation here, but are not
providing a Fixes patch or Cc: stable or sending this separately as a fix?

> Assisted-by: Claude:claude-opus-4-6
> Reviewed-by: Gregory Price <gourry@gourry.net>
> ---
>  mm/mempolicy.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4e4421b22b59..d139b074a599 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2048,8 +2048,9 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  		pol = get_task_policy(current);
>  	if (pol->mode == MPOL_INTERLEAVE ||
>  	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
> -		*ilx += vma->vm_pgoff >> order;
> -		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
> +		*ilx += (vma->vm_pgoff +
> +			(addr >> PAGE_SHIFT) -
> +			(vma->vm_start >> PAGE_SHIFT)) >> order;

This is horrible code. Not only are you doing everything in a single
expression for some reason, you're also making the parens confusing and not
explaining what you're doing here at all.

The code before was at least tractable, this is objectively making it
worse.

And anyway, the canonical way to find the page offset into a VMA is
linear_page_index():

static inline pgoff_t linear_page_index(const struct vm_area_struct *vma,
					const unsigned long address)
{
	pgoff_t pgoff;
	pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
	pgoff += vma->vm_pgoff;
	return pgoff;
}

So isn't a far better solution therefore:

	const pgoff_t index = linear_page_index(vma, addr);

	*ilx += index >> order;

Which has the benefit of being readable, uses the canonical method for
determining page offset in the VMA + eliminates the open-coded stuff?

>  	}
>  	return pol;
>  }
> --
> MST
>

Thanks, Lorenzo
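
For reference, the two forms of the calculation discussed above can be
compared in userspace. This is a sketch under stated assumptions: 4 KiB
pages, a made-up struct vma_sketch holding only the two fields involved,
and hypothetical function names. page_index() mirrors the
linear_page_index() body quoted above.

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the two VMA fields the calculation reads. */
struct vma_sketch {
	unsigned long vm_start;	/* first byte address of the mapping */
	unsigned long vm_pgoff;	/* offset of vm_start, in pages */
};

/* Pre-patch form: shift each term separately, then add. */
static unsigned long ilx_two_terms(const struct vma_sketch *vma,
				   unsigned long addr, int order)
{
	unsigned long ilx = vma->vm_pgoff >> order;

	assert(addr >= vma->vm_start);
	ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
	return ilx;
}

/* Same arithmetic as linear_page_index(). */
static unsigned long page_index(const struct vma_sketch *vma,
				unsigned long addr)
{
	assert(addr >= vma->vm_start);
	return ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
}

/* Add first, shift once. */
static unsigned long ilx_single_shift(const struct vma_sketch *vma,
				      unsigned long addr, int order)
{
	return page_index(vma, addr) >> order;
}
```

With vm_pgoff = 3, vm_start = 0x10000, addr = 0x11000 and order = 2, the
two-term form gives (3 >> 2) + (0x1000 >> 14) = 0, while the single-shift
form gives (1 + 3) >> 2 = 1: the carry between the two terms is lost.
With vm_pgoff = 0x100 and vm_start = 0x7f0000 (as could happen after
remapping a faulted anonymous VMA), the single-shift form at
addr = vm_start and order = 0 gives 0x100, not addr >> PAGE_SHIFT = 0x7f0.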



* Re: [PATCH v10 03/37] mm: page_alloc: propagate PageReported flag across buddy splits
  2026-06-08  8:34 ` [PATCH v10 03/37] mm: page_alloc: propagate PageReported flag across buddy splits Michael S. Tsirkin
@ 2026-06-08  9:52   ` Lorenzo Stoakes
  2026-06-08 12:50     ` Matthew Wilcox
  0 siblings, 1 reply; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08  9:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:34:39AM -0400, Michael S. Tsirkin wrote:
> When a reported free page is split via expand() to satisfy a
> smaller allocation, the sub-pages placed back on the free lists
> lose the PageReported flag.  This means they will be unnecessarily
> re-reported to the hypervisor in the next reporting cycle, wasting
> work.
>
> While I was unable to quantify the performance difference, it is
> an obvious waste, even if small.

Hmm, that makes me wonder if this is really worth it?

You're making everything more complicated propagating yet another parameter
around (which could actually in itself cause perf/bloat issues) for
something you say yourself isn't really doing much?

>
> Propagate the PageReported flag to sub-pages during expand(),
> both in page_del_and_expand() and try_to_claim_block(), so
>
> split_large_buddy() also propagates PageReported via a bool
> parameter: the caller saves PageReported before
> del_page_from_free_list() clears it, then passes the saved
> value. The flag is set after __free_one_page() with a
> PageBuddy check, matching the page_reporting_drain() pattern.
> Free-path callers pass false (freshly freed pages are never
> reported).
> that they are recognized as already-reported.

The sentence is completely garbled?

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Again this really doesn't feel like it belongs to this series.

If it's really a bug, it should have fixes/Cc: stable, but honestly my
assessment here is this doesn't seem worthwhile at all.

You're not making a case for it and you're adding complexity. We don't just
take AI-reported bugs at face value (if this was indeed one).

(If this was reported by sashiko, some people now add
Reported-by: sashiko-bot@kernel.org.)

> Assisted-by: Claude:claude-opus-4-6
> ---
>  mm/page_alloc.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d49c254174da..8dae5b3f5876 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1502,7 +1502,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>
>  /* Split a multi-block free page into its individual pageblocks. */
>  static void split_large_buddy(struct zone *zone, struct page *page,
> -			      unsigned long pfn, int order, fpi_t fpi)
> +			      unsigned long pfn, int order, fpi_t fpi,
> +			      bool reported)
>  {
>  	unsigned long end = pfn + (1 << order);
>
> @@ -1517,6 +1518,8 @@ static void split_large_buddy(struct zone *zone, struct page *page,
>  		int mt = get_pfnblock_migratetype(page, pfn);
>
>  		__free_one_page(page, pfn, zone, order, mt, fpi);
> +		if (reported && PageBuddy(page) && buddy_order(page) == order)

Isn't this racy? We just freed this page to the buddy allocator, couldn't
it be allocated before we set this flag?

> +			__SetPageReported(page);
>  		pfn += 1 << order;
>  		if (pfn == end)
>  			break;
> @@ -1559,11 +1562,12 @@ static void free_one_page(struct zone *zone, struct page *page,
>  		llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
>  			unsigned int p_order = p->private;
>
> -			split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
> +			split_large_buddy(zone, p, page_to_pfn(p), p_order,
> +					  fpi_flags, false);

I don't love adding a mystery meat boolean parameter like this.

>  			__count_vm_events(PGFREE, 1 << p_order);
>  		}
>  	}
> -	split_large_buddy(zone, page, pfn, order, fpi_flags);
> +	split_large_buddy(zone, page, pfn, order, fpi_flags, false);
>  	spin_unlock_irqrestore(&zone->lock, flags);
>
>  	__count_vm_events(PGFREE, 1 << order);
> @@ -1694,7 +1698,7 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
>   * -- nyc
>   */
>  static inline unsigned int expand(struct zone *zone, struct page *page, int low,
> -				  int high, int migratetype)
> +				  int high, int migratetype, bool reported)
>  {
>  	unsigned int size = 1 << high;
>  	unsigned int nr_added = 0;
> @@ -1716,6 +1720,15 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
>  		__add_to_free_list(&page[size], zone, high, migratetype, false);
>  		set_buddy_order(&page[size], high);
>  		nr_added += size;
> +
> +		/*
> +		 * The parent page has been reported to the host.  The
> +		 * sub-pages are part of the same reported block, so mark
> +		 * them reported too.  This avoids re-reporting pages that
> +		 * the host already knows about.
> +		 */

This comment seems entirely redundant (and very much like something claude would
add, it's too verbose like this).

> +		if (reported)
> +			__SetPageReported(&page[size]);
>  	}
>
>  	return nr_added;
> @@ -1726,9 +1739,10 @@ static __always_inline void page_del_and_expand(struct zone *zone,
>  						int high, int migratetype)
>  {
>  	int nr_pages = 1 << high;
> +	bool was_reported = page_reported(page);
>
>  	__del_page_from_free_list(page, zone, high, migratetype);
> -	nr_pages -= expand(zone, page, low, high, migratetype);
> +	nr_pages -= expand(zone, page, low, high, migratetype, was_reported);

I don't love propagating yet another parameter like this.

>  	account_freepages(zone, -nr_pages, migratetype);
>  }
>
> @@ -2116,11 +2130,13 @@ static bool __move_freepages_block_isolate(struct zone *zone,
>  	/* We're a part of a larger buddy */
>  	if (PageBuddy(buddy) && buddy_order(buddy) > pageblock_order) {
>  		int order = buddy_order(buddy);
> +		bool reported = PageReported(buddy);
>
>  		del_page_from_free_list(buddy, zone, order,
>  					get_pfnblock_migratetype(buddy, buddy_pfn));
>  		toggle_pageblock_isolate(page, isolate);
> -		split_large_buddy(zone, buddy, buddy_pfn, order, FPI_NONE);
> +		split_large_buddy(zone, buddy, buddy_pfn, order, FPI_NONE,
> +				  reported);
>  		return true;
>  	}
>
> @@ -2283,10 +2299,12 @@ try_to_claim_block(struct zone *zone, struct page *page,
>  	/* Take ownership for orders >= pageblock_order */
>  	if (current_order >= pageblock_order) {
>  		unsigned int nr_added;
> +		bool was_reported = page_reported(page);
>
>  		del_page_from_free_list(page, zone, current_order, block_type);
>  		change_pageblock_range(page, current_order, start_type);
> -		nr_added = expand(zone, page, order, current_order, start_type);
> +		nr_added = expand(zone, page, order, current_order, start_type,
> +				  was_reported);
>  		account_freepages(zone, nr_added, start_type);
>  		return page;
>  	}
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 05/37] mm: hugetlb: remove dead alloc_hugetlb_folio stub
  2026-06-08  8:34 ` [PATCH v10 05/37] mm: hugetlb: remove dead alloc_hugetlb_folio stub Michael S. Tsirkin
@ 2026-06-08  9:56   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08  9:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:34:57AM -0400, Michael S. Tsirkin wrote:
> Remove the !CONFIG_HUGETLB_PAGE stub for alloc_hugetlb_folio().
>
> The stub is dead code: all callers are in mm/hugetlb.c
> (CONFIG_HUGETLB_PAGE) or fs/hugetlbfs/inode.c (CONFIG_HUGETLBFS),

obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o hugetlb_sysfs.o hugetlb_sysctl.o

mm/hugetlb.c seems dependent on CONFIG_HUGETLBFS not CONFIG_HUGETLB_PAGE?

> and CONFIG_HUGETLB_PAGE is def_bool HUGETLBFS with nothing
> selecting it independently.
>
> The stub is also broken: it returns NULL, but all callers check
> IS_ERR(folio), so a NULL return would not be caught and would
> crash on the subsequent folio dereference.
>
> Remove it now since follow-up patches change the signature of
> alloc_hugetlb_folio and would otherwise need to update the
> broken stub too.
>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Assisted-by: Claude:claude-opus-4-6
> Reviewed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

The logic seems good but you should fix up the commit message. With that
fixed:

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

Thanks, Lorenzo

> ---
>  include/linux/hugetlb.h | 7 -------
>  1 file changed, 7 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 5957bc25efa8..1f7ae6609e51 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1123,13 +1123,6 @@ static inline void wait_for_freed_hugetlb_folios(void)
>  {
>  }
>
> -static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> -					   unsigned long addr,
> -					   bool cow_from_owner)
> -{
> -	return NULL;
> -}
> -
>  static inline struct folio *
>  alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
>  			    nodemask_t *nmask, gfp_t gfp_mask)
> --
> MST
>
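
The point in the quoted commit message about the stub returning NULL while
callers check IS_ERR() can be shown with a userspace re-creation of the
error-pointer convention. This is a sketch: is_err() and err_ptr() are
local stand-ins written for this example, following the usual convention
that the top 4095 addresses encode errno values.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095	/* largest errno value encoded in a pointer */

/* Stand-in for IS_ERR(): true only for the top MAX_ERRNO addresses. */
static bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Stand-in for ERR_PTR(): encode a negative errno as a pointer. */
static void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}
```

Here is_err(NULL) is false, so a caller that only tests is_err() would go
on to dereference the NULL pointer, whereas is_err(err_ptr(-12)) is true
(12 being ENOMEM on Linux).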



* Re: [PATCH v10 06/37] mm: move vma_alloc_folio_noprof to page_alloc.c
  2026-06-08  8:35 ` [PATCH v10 06/37] mm: move vma_alloc_folio_noprof to page_alloc.c Michael S. Tsirkin
@ 2026-06-08 10:05   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 10:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:35:05AM -0400, Michael S. Tsirkin wrote:
> Move vma_alloc_folio_noprof() from an inline in gfp.h (for !NUMA)
> and mempolicy.c (for NUMA) to page_alloc.c.

This is an incorrectly described patch. You're not just moving the
function, you're also altering its behaviour.

Do the two changes separately, and justify why you are changing this
behaviour.

>
> This prepares for a subsequent patch that will thread user_addr
> through the allocator: having vma_alloc_folio_noprof in page_alloc.c
> means user_addr can be passed to the internal allocation path
> without changing public API signatures or duplicating plumbing
> in both gfp.h and mempolicy.c.

We seem to have a fun mess with some NUMA stuff living in mempolicy.c and
some living in page_alloc.c with #ifdef CONFIG_NUMA's around it.

But I guess that was a pre-existing problem...

>
> The !NUMA path gains the VM_DROPPABLE -> __GFP_NOWARN check
> that the NUMA path already had.

What is your justification for doing this? Commit messages that put the
code changes in English are not helpful, tell me _why_ you are doing this.

It seems that you're suggesting not doing this was a pre-existing bug?
Therefore it would need a fixes/cc: stable no? Unless you feel this isn't
significant enough to require that?

If not, then why are you changing this behaviour?

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh
> ---
>  include/linux/gfp.h |  9 ++-------
>  mm/mempolicy.c      | 32 --------------------------------
>  mm/page_alloc.c     | 43 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 45 insertions(+), 39 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 51ef13ed756e..7ccbda35b9ad 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -318,13 +318,13 @@ static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask,
>
>  #define  alloc_pages_node(...)			alloc_hooks(alloc_pages_node_noprof(__VA_ARGS__))
>
> +struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
> +		struct vm_area_struct *vma, unsigned long addr);
>  #ifdef CONFIG_NUMA
>  struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order);
>  struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order);
>  struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
>  		struct mempolicy *mpol, pgoff_t ilx, int nid);
> -struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
> -		unsigned long addr);
>  #else
>  static inline struct page *alloc_pages_noprof(gfp_t gfp_mask, unsigned int order)
>  {
> @@ -339,11 +339,6 @@ static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int orde
>  {
>  	return folio_alloc_noprof(gfp, order);
>  }
> -static inline struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
> -		struct vm_area_struct *vma, unsigned long addr)
> -{
> -	return folio_alloc_noprof(gfp, order);
> -}
>  #endif
>
>  #define alloc_pages(...)			alloc_hooks(alloc_pages_noprof(__VA_ARGS__))
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d139b074a599..a1707ad498a8 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2516,38 +2516,6 @@ struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
>  	return page_rmappable_folio(page);
>  }
>
> -/**
> - * vma_alloc_folio - Allocate a folio for a VMA.
> - * @gfp: GFP flags.
> - * @order: Order of the folio.
> - * @vma: Pointer to VMA.
> - * @addr: Virtual address of the allocation.  Must be inside @vma.
> - *
> - * Allocate a folio for a specific address in @vma, using the appropriate
> - * NUMA policy.  The caller must hold the mmap_lock of the mm_struct of the
> - * VMA to prevent it from going away.  Should be used for all allocations
> - * for folios that will be mapped into user space, excepting hugetlbfs, and
> - * excepting where direct use of folio_alloc_mpol() is more appropriate.
> - *
> - * Return: The folio on success or NULL if allocation fails.
> - */
> -struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
> -		unsigned long addr)
> -{
> -	struct mempolicy *pol;
> -	pgoff_t ilx;
> -	struct folio *folio;
> -
> -	if (vma->vm_flags & VM_DROPPABLE)
> -		gfp |= __GFP_NOWARN;
> -
> -	pol = get_vma_policy(vma, addr, order, &ilx);
> -	folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
> -	mpol_cond_put(pol);
> -	return folio;
> -}
> -EXPORT_SYMBOL(vma_alloc_folio_noprof);
> -
>  struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned order)
>  {
>  	struct mempolicy *pol = &default_policy;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8dae5b3f5876..6a605d05e8cd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5286,6 +5286,49 @@ struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_
>  }
>  EXPORT_SYMBOL(__folio_alloc_noprof);
>
> +#ifdef CONFIG_NUMA
> +/**
> + * vma_alloc_folio - Allocate a folio for a VMA.
> + * @gfp: GFP flags.
> + * @order: Order of the folio.
> + * @vma: Pointer to VMA.
> + * @addr: Virtual address of the allocation.  Must be inside @vma.
> + *
> + * Allocate a folio for a specific address in @vma, using the appropriate
> + * NUMA policy.  The caller must hold the mmap_lock of the mm_struct of the
> + * VMA to prevent it from going away.  Should be used for all allocations
> + * for folios that will be mapped into user space, excepting hugetlbfs, and
> + * excepting where direct use of folio_alloc_mpol() is more appropriate.
> + *
> + * Return: The folio on success or NULL if allocation fails.
> + */
> +struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
> +		struct vm_area_struct *vma, unsigned long addr)
> +{
> +	struct mempolicy *pol;
> +	pgoff_t ilx;
> +	struct folio *folio;
> +
> +	if (vma->vm_flags & VM_DROPPABLE)
> +		gfp |= __GFP_NOWARN;
> +
> +	pol = get_vma_policy(vma, addr, order, &ilx);
> +	folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
> +	mpol_cond_put(pol);
> +	return folio;
> +}
> +#else
> +struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
> +		struct vm_area_struct *vma, unsigned long addr)
> +{
> +	if (vma->vm_flags & VM_DROPPABLE)
> +		gfp |= __GFP_NOWARN;
> +
> +	return folio_alloc_noprof(gfp, order);
> +}
> +#endif
> +EXPORT_SYMBOL(vma_alloc_folio_noprof);
> +
>  /*
>   * Common helper functions. Never use with __GFP_HIGHMEM because the returned
>   * address cannot represent highmem pages. Use alloc_pages and then kmap if
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08  8:35 ` [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing Michael S. Tsirkin
@ 2026-06-08 10:23   ` Lorenzo Stoakes
  2026-06-08 11:06     ` Lorenzo Stoakes
  2026-06-08 11:08     ` David Hildenbrand (Arm)
  0 siblings, 2 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 10:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

I noticed this patch, again, sneaks in some additional code changes that
are not mentioned in the commit message and seem irrelevant to the patch.

Not sure if the AI is doing this, but please don't do that.

On Mon, Jun 08, 2026 at 04:35:17AM -0400, Michael S. Tsirkin wrote:
> Thread a user virtual address from vma_alloc_folio() down through
> the page allocator to post_alloc_hook(). This is plumbing
> preparation for a subsequent patch that will use user_addr to
> call folio_zero_user() for cache-friendly zeroing of user pages.

This feels like a very weak justification for code that massively changes mm
code and makes it all much worse.

>
> The user_addr is stored in struct alloc_context and flows through:
>   vma_alloc_folio -> folio_alloc_mpol -> __alloc_pages_mpol ->
>   __alloc_frozen_pages -> get_page_from_freelist -> prep_new_page ->
>   post_alloc_hook

Is this the only relevant code path? You're changing a TON of code here
that will have multiple different code paths?

>
> USER_ADDR_NONE ((unsigned long)-1) is used for non-user
> allocations, since address 0 is a valid userspace mapping.

Ugh god, so now we're passing a user address through allocation paths that
might not even be aware of this or be allocating memory at a point when the
mapping is known?
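
To make the convention being discussed concrete, here is a minimal userspace
sketch of the sentinel as the quoted patch describes it. The USER_ADDR_NONE
definition is taken from the patch; the 4 KiB page size and the sketch_*
helper names are assumptions made up for illustration and exist nowhere in
the kernel.

```c
#include <assert.h>

/* Assumption for the sketch: 4 KiB pages. The kernel's PAGE_SIZE is per-arch. */
#define SKETCH_PAGE_SIZE 4096UL

/* Same definition as in the quoted patch. */
#define USER_ADDR_NONE ((unsigned long)-1)

/* Hypothetical helper, not in the patch: did the caller pass a real address? */
static int sketch_has_user_addr(unsigned long user_addr)
{
	return user_addr != USER_ADDR_NONE;
}

/* Hypothetical helper: round an address down to its page boundary. */
static unsigned long sketch_page_align_down(unsigned long addr)
{
	return addr & ~(SKETCH_PAGE_SIZE - 1);
}

/* Self-check of the two properties the patch comment relies on. */
static void sketch_selfcheck(void)
{
	/* 0 is a usable user address, so it cannot double as "none". */
	assert(sketch_has_user_addr(0UL));
	/* All-ones is not page-aligned, so no aligned address can equal it. */
	assert(sketch_page_align_down(USER_ADDR_NONE) != USER_ADDR_NONE);
	assert(!sketch_has_user_addr(USER_ADDR_NONE));
}
```

Every caller that has no mapping in hand still has to know to pass the
sentinel, which is the burden being objected to here.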

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Honestly I absolutely hate this patch.

You're passing a user address through allocation paths that might not even
have one associated with them, you're vandalising a bunch of core mm stuff
here for virtio, and then also adding a sentinel to indicate 'not user
address' just to confuse matters further.

We separate the physical allocation of memory from mapping it, and you're
now coupling something that is explicitly decoupled to suit a specific
usecase.

It feels like a hack and I don't think we can accept this upstream, as-is.

> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh

This really feels like the kind of brute force thing AI does when coming up
with code for this kind of thing.

I think it needs some architectural thought rather than that here. I'd
suggest thinking about how we might achieve the same objective without
doing this.

> ---
>  include/linux/gfp.h |  2 +-
>  mm/compaction.c     |  5 ++---
>  mm/hugetlb.c        | 36 ++++++++++++++++++++----------------
>  mm/internal.h       | 21 ++++++++++++++++++---
>  mm/mempolicy.c      | 44 ++++++++++++++++++++++++++++++++------------
>  mm/mmap.c           |  6 ++++++
>  mm/page_alloc.c     | 44 +++++++++++++++++++++++++++++---------------
>  mm/slub.c           |  4 ++--
>  8 files changed, 110 insertions(+), 52 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 7ccbda35b9ad..ee35c5367abc 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -337,7 +337,7 @@ static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
>  static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
>  		struct mempolicy *mpol, pgoff_t ilx, int nid)
>  {
> -	return folio_alloc_noprof(gfp, order);
> +	return __folio_alloc_noprof(gfp, order, numa_node_id(), NULL);
>  }
>  #endif
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 3648ce22c807..72684fe81e83 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -82,7 +82,7 @@ static inline bool is_via_compact_memory(int order) { return false; }
>
>  static struct page *mark_allocated_noprof(struct page *page, unsigned int order, gfp_t gfp_flags)
>  {
> -	post_alloc_hook(page, order, __GFP_MOVABLE);
> +	post_alloc_hook(page, order, __GFP_MOVABLE, USER_ADDR_NONE);

This is really confusing. Essentially 'we don't know what the user address
is' here. And now we're making anybody who calls this have to figure out
what the user address field is actually for and under what circumstances we
supply it.

>  	set_page_refcounted(page);
>  	return page;
>  }
> @@ -1849,8 +1849,7 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
>  		set_page_private(&freepage[size], start_order);
>  	}
>  	dst = (struct folio *)freepage;
> -
> -	post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
> +	post_alloc_hook(&dst->page, order, __GFP_MOVABLE, USER_ADDR_NONE);
>  	set_page_refcounted(&dst->page);
>  	if (order)
>  		prep_compound_page(&dst->page, order);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4b80b167cc9c..f3bc15a7889a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1786,7 +1786,8 @@ struct address_space *hugetlb_folio_mapping_lock_write(struct folio *folio)
>  }
>
>  static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
> -		int nid, nodemask_t *nmask, nodemask_t *node_alloc_noretry)
> +		int nid, nodemask_t *nmask, nodemask_t *node_alloc_noretry,
> +		unsigned long addr)

addr? Can we use user_addr please? This is confusing.

>  {
>  	struct folio *folio;
>  	bool alloc_try_hard = true;
> @@ -1803,7 +1804,7 @@ static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
>  	if (alloc_try_hard)
>  		gfp_mask |= __GFP_RETRY_MAYFAIL;
>
> -	folio = (struct folio *)__alloc_frozen_pages(gfp_mask, order, nid, nmask);
> +	folio = (struct folio *)__alloc_frozen_pages(gfp_mask, order, nid, nmask, addr);
>
>  	/*
>  	 * If we did not specify __GFP_RETRY_MAYFAIL, but still got a
> @@ -1832,7 +1833,7 @@ static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
>
>  static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
>  		gfp_t gfp_mask, int nid, nodemask_t *nmask,
> -		nodemask_t *node_alloc_noretry)
> +		nodemask_t *node_alloc_noretry, unsigned long addr)
>  {
>  	struct folio *folio;
>  	int order = huge_page_order(h);
> @@ -1844,7 +1845,7 @@ static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
>  		folio = alloc_gigantic_frozen_folio(order, gfp_mask, nid, nmask);
>  	else
>  		folio = alloc_buddy_frozen_folio(order, gfp_mask, nid, nmask,
> -						 node_alloc_noretry);
> +						 node_alloc_noretry, addr);
>  	if (folio)
>  		init_new_hugetlb_folio(folio);
>  	return folio;
> @@ -1858,11 +1859,12 @@ static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
>   * pages is zero, and the accounting must be done in the caller.
>   */
>  static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
> -		gfp_t gfp_mask, int nid, nodemask_t *nmask)
> +		gfp_t gfp_mask, int nid, nodemask_t *nmask,
> +		unsigned long addr)
>  {
>  	struct folio *folio;
>
> -	folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL);
> +	folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL, addr);
>  	if (folio)
>  		hugetlb_vmemmap_optimize_folio(h, folio);
>  	return folio;
> @@ -1902,7 +1904,7 @@ static struct folio *alloc_pool_huge_folio(struct hstate *h,
>  		struct folio *folio;
>
>  		folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
> -					nodes_allowed, node_alloc_noretry);
> +					nodes_allowed, node_alloc_noretry, USER_ADDR_NONE);
>  		if (folio)
>  			return folio;
>  	}
> @@ -2071,7 +2073,8 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn, unsigned long end_pfn)
>   * Allocates a fresh surplus page from the page allocator.
>   */
>  static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
> -				gfp_t gfp_mask,	int nid, nodemask_t *nmask)
> +				gfp_t gfp_mask,	int nid, nodemask_t *nmask,
> +				unsigned long addr)

Sometimes user_addr, sometimes addr. No thanks.

>  {
>  	struct folio *folio = NULL;
>
> @@ -2083,7 +2086,7 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
>  		goto out_unlock;
>  	spin_unlock_irq(&hugetlb_lock);
>
> -	folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask);
> +	folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, addr);
>  	if (!folio)
>  		return NULL;
>
> @@ -2126,7 +2129,7 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
>  	if (hstate_is_gigantic(h))
>  		return NULL;
>
> -	folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask);
> +	folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, USER_ADDR_NONE);
>  	if (!folio)
>  		return NULL;
>
> @@ -2162,14 +2165,14 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
>  	if (mpol_is_preferred_many(mpol)) {
>  		gfp_t gfp = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
>
> -		folio = alloc_surplus_hugetlb_folio(h, gfp, nid, nodemask);
> +		folio = alloc_surplus_hugetlb_folio(h, gfp, nid, nodemask, addr);
>
>  		/* Fallback to all nodes if page==NULL */
>  		nodemask = NULL;
>  	}
>
>  	if (!folio)
> -		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, nid, nodemask);
> +		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, nid, nodemask, addr);
>  	mpol_cond_put(mpol);
>  	return folio;
>  }
> @@ -2276,7 +2279,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  		 * down the road to pick the current node if that is the case.
>  		 */
>  		folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
> -						    NUMA_NO_NODE, &alloc_nodemask);
> +						    NUMA_NO_NODE, &alloc_nodemask,
> +						    USER_ADDR_NONE);
>  		if (!folio) {
>  			alloc_ok = false;
>  			break;
> @@ -2682,7 +2686,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
>  			spin_unlock_irq(&hugetlb_lock);
>  			gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>  			new_folio = alloc_fresh_hugetlb_folio(h, gfp_mask,
> -							      nid, NULL);
> +							      nid, NULL, USER_ADDR_NONE);
>  			if (!new_folio)
>  				return -ENOMEM;
>  			goto retry;
> @@ -3380,13 +3384,13 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
>  			gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>
>  			folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid,
> -					&node_states[N_MEMORY], NULL);
> +					&node_states[N_MEMORY], NULL, USER_ADDR_NONE);
>  			if (!folio && !list_empty(&folio_list) &&
>  			    hugetlb_vmemmap_optimizable_size(h)) {
>  				prep_and_add_allocated_folios(h, &folio_list);
>  				INIT_LIST_HEAD(&folio_list);
>  				folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid,
> -						&node_states[N_MEMORY], NULL);
> +						&node_states[N_MEMORY], NULL, USER_ADDR_NONE);
>  			}
>  			if (!folio)
>  				break;
> diff --git a/mm/internal.h b/mm/internal.h
> index 5a2ddcf68e0b..9d2198114510 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -662,6 +662,16 @@ void calculate_min_free_kbytes(void);
>  int __meminit init_per_zone_wmark_min(void);
>  void page_alloc_sysctl_init(void);
>
> +/*
> + * Sentinel for user_addr: indicates a non-user allocation.
> + * Cannot use 0 because address 0 is a valid userspace mapping.
> + * (unsigned long)-1 is safe because:
> + * 1. vm_end = addr + len <= TASK_SIZE, and vm_end is exclusive,
> + *    so -1 is never inside any VMA.
> + * 2. It will only be compared to page-aligned addresses.
> + */
> +#define USER_ADDR_NONE	((unsigned long)-1)
> +
>  /*
>   * Structure for holding the mostly immutable allocation parameters passed
>   * between functions involved in allocations, including the alloc_pages*
> @@ -693,6 +703,7 @@ struct alloc_context {
>  	 */
>  	enum zone_type highest_zoneidx;
>  	bool spread_dirty_pages;
> +	unsigned long user_addr;
>  };
>
>  /*
> @@ -916,13 +927,14 @@ static inline void init_compound_tail(struct page *tail,
>  	prep_compound_tail(tail, head, order);
>  }
>
> -void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
> +void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> +		     unsigned long user_addr);
>  extern bool free_pages_prepare(struct page *page, unsigned int order);
>
>  extern int user_min_free_kbytes;
>
>  struct page *__alloc_frozen_pages_noprof(gfp_t, unsigned int order, int nid,
> -		nodemask_t *);
> +		nodemask_t *, unsigned long user_addr);
>  #define __alloc_frozen_pages(...) \
>  	alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
>  void free_frozen_pages(struct page *page, unsigned int order);
> @@ -930,10 +942,13 @@ void free_unref_folios(struct folio_batch *fbatch);
>
>  #ifdef CONFIG_NUMA
>  struct page *alloc_frozen_pages_noprof(gfp_t, unsigned int order);
> +struct folio *folio_alloc_mpol_user_noprof(gfp_t gfp, unsigned int order,
> +		struct mempolicy *pol, pgoff_t ilx, int nid,
> +		unsigned long user_addr);
>  #else
>  static inline struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order)
>  {
> -	return __alloc_frozen_pages_noprof(gfp, order, numa_node_id(), NULL);
> +	return __alloc_frozen_pages_noprof(gfp, order, numa_node_id(), NULL, USER_ADDR_NONE);
>  }
>  #endif
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index a1707ad498a8..f573ff32e94d 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2413,7 +2413,8 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk,
>  }
>
>  static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
> -						int nid, nodemask_t *nodemask)
> +						int nid, nodemask_t *nodemask,
> +						unsigned long user_addr)
>  {
>  	struct page *page;
>  	gfp_t preferred_gfp;
> @@ -2426,25 +2427,29 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
>  	 */
>  	preferred_gfp = gfp | __GFP_NOWARN;
>  	preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
> -	page = __alloc_frozen_pages_noprof(preferred_gfp, order, nid, nodemask);
> +	page = __alloc_frozen_pages_noprof(preferred_gfp, order, nid,
> +					   nodemask, user_addr);
>  	if (!page)
> -		page = __alloc_frozen_pages_noprof(gfp, order, nid, NULL);
> +		page = __alloc_frozen_pages_noprof(gfp, order, nid, NULL,
> +						   user_addr);
>
>  	return page;
>  }
>
>  /**
> - * alloc_pages_mpol - Allocate pages according to NUMA mempolicy.
> + * __alloc_pages_mpol - Allocate pages according to NUMA mempolicy.
>   * @gfp: GFP flags.
>   * @order: Order of the page allocation.
>   * @pol: Pointer to the NUMA mempolicy.
>   * @ilx: Index for interleave mempolicy (also distinguishes alloc_pages()).
>   * @nid: Preferred node (usually numa_node_id() but @mpol may override it).
> + * @user_addr: User fault address for cache-friendly zeroing, or USER_ADDR_NONE.

OK this isn't great - 'for cache-friendly zeroing' is vague, confusing, and
a break in the abstraction (don't tell me one specific detail of what
you're doing down the call stack).

'User fault address' is nebulous and confusing.

>   *
>   * Return: The page on success or NULL if allocation fails.
>   */
> -static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
> -		struct mempolicy *pol, pgoff_t ilx, int nid)
> +static struct page *__alloc_pages_mpol(gfp_t gfp, unsigned int order,
> +		struct mempolicy *pol, pgoff_t ilx, int nid,
> +		unsigned long user_addr)
>  {
>  	nodemask_t *nodemask;
>  	struct page *page;
> @@ -2452,7 +2457,8 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  	nodemask = policy_nodemask(gfp, pol, ilx, &nid);
>
>  	if (pol->mode == MPOL_PREFERRED_MANY)
> -		return alloc_pages_preferred_many(gfp, order, nid, nodemask);
> +		return alloc_pages_preferred_many(gfp, order, nid, nodemask,
> +						 user_addr);
>
>  	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
>  	    /* filter "hugepage" allocation, unless from alloc_pages() */
> @@ -2476,7 +2482,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  			 */
>  			page = __alloc_frozen_pages_noprof(
>  				gfp | __GFP_THISNODE | __GFP_NORETRY, order,
> -				nid, NULL);
> +				nid, NULL, user_addr);
>  			if (page || !(gfp & __GFP_DIRECT_RECLAIM))
>  				return page;
>  			/*
> @@ -2488,7 +2494,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  		}
>  	}
>
> -	page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask);
> +	page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask, user_addr);
>
>  	if (unlikely(pol->mode == MPOL_INTERLEAVE ||
>  		     pol->mode == MPOL_WEIGHTED_INTERLEAVE) && page) {
> @@ -2504,11 +2510,18 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  	return page;
>  }
>
> -struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
> +static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  		struct mempolicy *pol, pgoff_t ilx, int nid)
>  {
> -	struct page *page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
> -			ilx, nid);
> +	return __alloc_pages_mpol(gfp, order, pol, ilx, nid, USER_ADDR_NONE);
> +}
> +
> +struct folio *folio_alloc_mpol_user_noprof(gfp_t gfp, unsigned int order,
> +		struct mempolicy *pol, pgoff_t ilx, int nid,
> +		unsigned long user_addr)
> +{
> +	struct page *page = __alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
> +			ilx, nid, user_addr);
>  	if (!page)
>  		return NULL;
>
> @@ -2516,6 +2529,13 @@ struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
>  	return page_rmappable_folio(page);
>  }
>
> +struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
> +		struct mempolicy *pol, pgoff_t ilx, int nid)
> +{
> +	return folio_alloc_mpol_user_noprof(gfp, order, pol, ilx, nid,
> +					    USER_ADDR_NONE);
> +}
> +
>  struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned order)
>  {
>  	struct mempolicy *pol = &default_policy;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5754d1c36462..73413cebc418 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -855,6 +855,12 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>  	if (IS_ERR_VALUE(addr))
>  		return addr;
>
> +	/*
> +	 * The check below ensures vm_end = addr + len <= TASK_SIZE.
> +	 * Since (unsigned long)-1 (USER_ADDR_NONE) >= TASK_SIZE and
> +	 * vm_end is exclusive, USER_ADDR_NONE is thus never a valid
> +	 * userspace address.
> +	 */

You're adding what seems to be an AI-generated comment randomly here for some
reason? Let's not, thanks.

>  	if (addr > TASK_SIZE - len)
>  		return -ENOMEM;
>  	if (offset_in_page(addr))
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6a605d05e8cd..21b52c879751 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1815,7 +1815,7 @@ static inline bool should_skip_init(gfp_t flags)
>  }
>
>  inline void post_alloc_hook(struct page *page, unsigned int order,
> -				gfp_t gfp_flags)
> +				gfp_t gfp_flags, unsigned long user_addr)
>  {
>  	const bool zero_tags = gfp_flags & __GFP_ZEROTAGS;
>  	bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
> @@ -1870,9 +1870,10 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  }
>
>  static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
> -							unsigned int alloc_flags)
> +							unsigned int alloc_flags,
> +							unsigned long user_addr)
>  {
> -	post_alloc_hook(page, order, gfp_flags);
> +	post_alloc_hook(page, order, gfp_flags, user_addr);
>
>  	if (order && (gfp_flags & __GFP_COMP))
>  		prep_compound_page(page, order);
> @@ -3956,7 +3957,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  		page = rmqueue(zonelist_zone(ac->preferred_zoneref), zone, order,
>  				gfp_mask, alloc_flags, ac->migratetype);
>  		if (page) {
> -			prep_new_page(page, order, gfp_mask, alloc_flags);
> +			prep_new_page(page, order, gfp_mask, alloc_flags,
> +				      ac->user_addr);
>
>  			return page;
>  		} else {
> @@ -4184,7 +4186,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>
>  	/* Prep a captured page if available */
>  	if (page)
> -		prep_new_page(page, order, gfp_mask, alloc_flags);
> +		prep_new_page(page, order, gfp_mask, alloc_flags,
> +			      ac->user_addr);
>
>  	/* Try get a page from the freelist if available */
>  	if (!page)
> @@ -5061,7 +5064,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  	struct zoneref *z;
>  	struct per_cpu_pages *pcp;
>  	struct list_head *pcp_list;
> -	struct alloc_context ac;
> +	struct alloc_context ac = { .user_addr = USER_ADDR_NONE };
>  	gfp_t alloc_gfp;
>  	unsigned int alloc_flags = ALLOC_WMARK_LOW;
>  	int nr_populated = 0, nr_account = 0;
> @@ -5176,7 +5179,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  		}
>  		nr_account++;
>
> -		prep_new_page(page, 0, gfp, 0);
> +		prep_new_page(page, 0, gfp, 0, USER_ADDR_NONE);
>  		set_page_refcounted(page);
>  		page_array[nr_populated++] = page;
>  	}
> @@ -5201,12 +5204,13 @@ EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
>   * This is the 'heart' of the zoned buddy allocator.
>   */
>  struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
> -		int preferred_nid, nodemask_t *nodemask)
> +		int preferred_nid, nodemask_t *nodemask,
> +		unsigned long user_addr)
>  {
>  	struct page *page;
>  	unsigned int alloc_flags = ALLOC_WMARK_LOW;
>  	gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
> -	struct alloc_context ac = { };
> +	struct alloc_context ac = { .user_addr = user_addr };
>
>  	/*
>  	 * There are several places where we assume that the order value is sane
> @@ -5267,10 +5271,12 @@ EXPORT_SYMBOL(__alloc_frozen_pages_noprof);
>
>  struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>  		int preferred_nid, nodemask_t *nodemask)
> +
>  {
>  	struct page *page;
>
> -	page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
> +	page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid,
> +					   nodemask, USER_ADDR_NONE);
>  	if (page)
>  		set_page_refcounted(page);
>  	return page;
> @@ -5313,7 +5319,8 @@ struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
>  		gfp |= __GFP_NOWARN;
>
>  	pol = get_vma_policy(vma, addr, order, &ilx);
> -	folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
> +	folio = folio_alloc_mpol_user_noprof(gfp, order, pol, ilx,
> +					     numa_node_id(), addr);
>  	mpol_cond_put(pol);
>  	return folio;
>  }
> @@ -5321,10 +5328,17 @@ struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
>  struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
>  		struct vm_area_struct *vma, unsigned long addr)
>  {
> +	struct page *page;
> +
>  	if (vma->vm_flags & VM_DROPPABLE)
>  		gfp |= __GFP_NOWARN;
>
> -	return folio_alloc_noprof(gfp, order);
> +	page = __alloc_frozen_pages_noprof(gfp | __GFP_COMP, order,
> +					   numa_node_id(), NULL, addr);
> +	if (!page)
> +		return NULL;
> +	set_page_refcounted(page);
> +	return page_rmappable_folio(page);

Err, what?

You're adding arbitrary new code here? What's the justification? Is it an
open-coded version of what was there before just to propagate the addr?

In any case this is totally unacceptable: don't open-code like this, and
don't make changes like this when the commit message doesn't mention them.
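
Purely to illustrate what 'not open-coding' could look like (it says nothing
about whether threading user_addr is the right design), here is a userspace
mock. Every mock_* name and the mock_folio type are invented for this sketch
and are not kernel functions; only USER_ADDR_NONE comes from the quoted patch.

```c
#include <stddef.h>

/* Same definition as in the quoted patch. */
#define USER_ADDR_NONE ((unsigned long)-1)

/* Mock stand-in for struct folio; records what the allocator was handed. */
struct mock_folio {
	unsigned int order;
	unsigned long user_addr;
};

/* Hypothetical common helper: the one place that knows the full sequence. */
static struct mock_folio *mock_folio_alloc_user(struct mock_folio *slot,
						unsigned int order,
						unsigned long user_addr)
{
	if (!slot)
		return NULL;	/* models allocation failure */
	slot->order = order;
	slot->user_addr = user_addr;
	return slot;
}

/* Existing entry point kept as a thin wrapper; callers are unchanged. */
static struct mock_folio *mock_folio_alloc(struct mock_folio *slot,
					   unsigned int order)
{
	return mock_folio_alloc_user(slot, order, USER_ADDR_NONE);
}

/* VMA-aware caller: forwards its address instead of repeating the steps. */
static struct mock_folio *mock_vma_alloc_folio(struct mock_folio *slot,
					       unsigned int order,
					       unsigned long addr)
{
	return mock_folio_alloc_user(slot, order, addr);
}
```

The point of the shape is that the allocate/refcount/rmappable sequence lives
in one helper, so both configurations call it rather than one of them carrying
a private copy.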

>  }
>  #endif
>  EXPORT_SYMBOL(vma_alloc_folio_noprof);
> @@ -6905,7 +6919,7 @@ static void split_free_frozen_pages(struct list_head *list, gfp_t gfp_mask)
>  		list_for_each_entry_safe(page, next, &list[order], lru) {
>  			int i;
>
> -			post_alloc_hook(page, order, gfp_mask);
> +			post_alloc_hook(page, order, gfp_mask, USER_ADDR_NONE);
>  			if (!order)
>  				continue;
>
> @@ -7111,7 +7125,7 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
>  		struct page *head = pfn_to_page(start);
>
>  		check_new_pages(head, order);
> -		prep_new_page(head, order, gfp_mask, 0);
> +		prep_new_page(head, order, gfp_mask, 0, USER_ADDR_NONE);
>  	} else {
>  		ret = -EINVAL;
>  		WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
> @@ -7776,7 +7790,7 @@ struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned
>  	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC | __GFP_COMP
>  			| gfp_flags;
>  	unsigned int alloc_flags = ALLOC_TRYLOCK;
> -	struct alloc_context ac = { };
> +	struct alloc_context ac = { .user_addr = USER_ADDR_NONE };
>  	struct page *page;
>
>  	VM_WARN_ON_ONCE(gfp_flags & ~__GFP_ACCOUNT);
> diff --git a/mm/slub.c b/mm/slub.c
> index a2bf3756ca7d..f397fa2f3f80 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3275,7 +3275,7 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
>  	else if (node == NUMA_NO_NODE)
>  		page = alloc_frozen_pages(flags, order);
>  	else
> -		page = __alloc_frozen_pages(flags, order, node, NULL);
> +		page = __alloc_frozen_pages(flags, order, node, NULL, USER_ADDR_NONE);
>
>  	if (!page)
>  		return NULL;
> @@ -5236,7 +5236,7 @@ static void *___kmalloc_large_node(size_t size, gfp_t flags, int node)
>  	if (node == NUMA_NO_NODE)
>  		page = alloc_frozen_pages_noprof(flags, order);
>  	else
> -		page = __alloc_frozen_pages_noprof(flags, order, node, NULL);
> +		page = __alloc_frozen_pages_noprof(flags, order, node, NULL, USER_ADDR_NONE);
>
>  	if (page) {
>  		ptr = page_address(page);
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 08/37] mm: add alloc_contig_frozen_pages_user for cache-friendly zeroing
  2026-06-08  8:35 ` [PATCH v10 08/37] mm: add alloc_contig_frozen_pages_user " Michael S. Tsirkin
@ 2026-06-08 10:29   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 10:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:35:29AM -0400, Michael S. Tsirkin wrote:
> Add a _user variant of alloc_contig_frozen_pages that accepts a user_addr
> parameter for cache-friendly zeroing of contiguous allocations.
>
> No functional change; all existing callers continue to pass
> USER_ADDR_NONE.

Well, except that you're adding a new function...

>
> Note for reviewers: non-compound contiguous allocations are

Please don't put notes for reviewers in commit messages.

> zeroed via kernel_init_pages, same as before this patch.
> There is no fault address because these allocations are not
> from the page fault path. For compound allocations, user_addr
> reaches post_alloc_hook() which calls folio_zero_user() with
> the dcache flush on cache-aliasing architectures.

Yeah it's exactly this kind of 'just have to know' stuff that makes this
user_addr approach unacceptable.

We mustn't add more cognitive overhead for already confusing code. Now
everybody using these has to figure out what 'user_addr' means and will
inevitably get it wrong.

This whole approach needs to be rethought.

>
> Note about Sashiko (sashiko.dev) false positives: sashiko
> flags two issues here: (1) user_addr silently ignored for
> non-compound allocations, and (2) post_alloc_hook ignores
> user_addr. Both are false positives: (1) non-compound
> contiguous allocations have no fault address to pass, and
> (2) post_alloc_hook does use user_addr when it is not
> USER_ADDR_NONE.

Please don't put AI hallucinations in commit messages.

If you want something to not appear in a commit message, but want reviewers
to see it, put it below the '---' in the patch.
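
As a rough illustration of that layout (the subject, author, and file names
below are placeholders, not taken from this series), text placed between the
'---' line and the diff is shown to reviewers but dropped by git am when the
patch is applied:

```
Subject: [PATCH] subsystem: short summary

Commit message body explaining what changes and why.

Signed-off-by: Author Name <author@example.org>
---
Notes for reviewers go here, e.g. remarks about tool output
that should not end up in the git history.

 path/to/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
```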


>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

This patch is unacceptable for the same reasons given for 7/37.

> Assisted-by: Claude:claude-opus-4-6
> ---
>  include/linux/gfp.h |  6 ++++++
>  mm/page_alloc.c     | 42 ++++++++++++++++++++++++++++++++----------
>  2 files changed, 38 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index ee35c5367abc..73109d4e31a4 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -453,6 +453,12 @@ struct page *alloc_contig_frozen_pages_noprof(unsigned long nr_pages,
>  #define alloc_contig_frozen_pages(...) \
>  	alloc_hooks(alloc_contig_frozen_pages_noprof(__VA_ARGS__))
>
> +struct page *alloc_contig_frozen_pages_user_noprof(unsigned long nr_pages,
> +		gfp_t gfp_mask, int nid, nodemask_t *nodemask,
> +		unsigned long user_addr);
> +#define alloc_contig_frozen_pages_user(...) \
> +	alloc_hooks(alloc_contig_frozen_pages_user_noprof(__VA_ARGS__))
> +
>  struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
>  		int nid, nodemask_t *nodemask);
>  #define alloc_contig_pages(...)	\
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 21b52c879751..6d3f284c607d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6975,13 +6975,15 @@ static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages
>  }
>
>  /**
> - * alloc_contig_frozen_range() -- tries to allocate given range of frozen pages
> + * __alloc_contig_frozen_range() -- tries to allocate given range of frozen pages
>   * @start:	start PFN to allocate
>   * @end:	one-past-the-last PFN to allocate
>   * @alloc_flags:	allocation information
>   * @gfp_mask:	GFP mask. Node/zone/placement hints are ignored; only some
>   *		action and reclaim modifiers are supported. Reclaim modifiers
>   *		control allocation behavior during compaction/migration/reclaim.
> + * @user_addr:	user virtual address for cache-friendly zeroing, or
> + *		USER_ADDR_NONE for kernel allocations.

Yeah, I really do not want us passing USER_ADDR_NONE for kernel allocations,
thanks. This makes already pretty confusing logic hugely confusing.

>   *
>   * The PFN range does not have to be pageblock aligned. The PFN range must
>   * belong to a single zone.
> @@ -6997,8 +6999,9 @@ static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages
>   *
>   * Return: zero on success or negative error code.
>   */
> -int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
> -		acr_flags_t alloc_flags, gfp_t gfp_mask)
> +static int __alloc_contig_frozen_range(unsigned long start, unsigned long end,
> +		acr_flags_t alloc_flags, gfp_t gfp_mask,
> +		unsigned long user_addr)
>  {
>  	const unsigned int order = ilog2(end - start);
>  	unsigned long outer_start, outer_end;
> @@ -7125,7 +7128,7 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
>  		struct page *head = pfn_to_page(start);
>
>  		check_new_pages(head, order);
> -		prep_new_page(head, order, gfp_mask, 0, USER_ADDR_NONE);
> +		prep_new_page(head, order, gfp_mask, 0, user_addr);
>  	} else {
>  		ret = -EINVAL;
>  		WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
> @@ -7135,6 +7138,13 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
>  	undo_isolate_page_range(start, end);
>  	return ret;
>  }
> +
> +int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
> +		acr_flags_t alloc_flags, gfp_t gfp_mask)
> +{
> +	return __alloc_contig_frozen_range(start, end, alloc_flags, gfp_mask,
> +					   USER_ADDR_NONE);
> +}
>  EXPORT_SYMBOL(alloc_contig_frozen_range_noprof);
>
>  /**
> @@ -7227,14 +7237,16 @@ static bool zone_spans_last_pfn(const struct zone *zone,
>  	return zone_spans_pfn(zone, last_pfn);
>  }
>
> -/**
> - * alloc_contig_frozen_pages() -- tries to find and allocate contiguous range of frozen pages
> +/*
> + * alloc_contig_frozen_pages_user_noprof() -- allocate contiguous frozen pages with user address
>   * @nr_pages:	Number of contiguous pages to allocate
>   * @gfp_mask:	GFP mask. Node/zone/placement hints limit the search; only some
>   *		action and reclaim modifiers are supported. Reclaim modifiers
>   *		control allocation behavior during compaction/migration/reclaim.
>   * @nid:	Target node
>   * @nodemask:	Mask for other possible nodes
> + * @user_addr:	user virtual address for cache-friendly zeroing, or
> + *		USER_ADDR_NONE for kernel allocations.
>   *
>   * This routine is a wrapper around alloc_contig_frozen_range(). It scans over
>   * zones on an applicable zonelist to find a contiguous pfn range which can then
> @@ -7253,8 +7265,9 @@ static bool zone_spans_last_pfn(const struct zone *zone,
>   *
>   * Return: pointer to contiguous frozen pages on success, or NULL if not successful.
>   */
> -struct page *alloc_contig_frozen_pages_noprof(unsigned long nr_pages,
> -		gfp_t gfp_mask, int nid, nodemask_t *nodemask)
> +struct page *alloc_contig_frozen_pages_user_noprof(unsigned long nr_pages,
> +		gfp_t gfp_mask, int nid, nodemask_t *nodemask,
> +		unsigned long user_addr)
>  {
>  	unsigned long ret, pfn, flags;
>  	struct zonelist *zonelist;
> @@ -7282,10 +7295,11 @@ struct page *alloc_contig_frozen_pages_noprof(unsigned long nr_pages,
>  				 * win the race and cause allocation to fail.
>  				 */
>  				spin_unlock_irqrestore(&zone->lock, flags);
> -				ret = alloc_contig_frozen_range_noprof(pfn,
> +				ret = __alloc_contig_frozen_range(pfn,
>  							pfn + nr_pages,
>  							ACR_FLAGS_NONE,
> -							gfp_mask);
> +							gfp_mask,
> +							user_addr);
>  				if (!ret)
>  					return pfn_to_page(pfn);
>  				spin_lock_irqsave(&zone->lock, flags);
> @@ -7307,6 +7321,14 @@ struct page *alloc_contig_frozen_pages_noprof(unsigned long nr_pages,
>  	}
>  	return NULL;
>  }
> +EXPORT_SYMBOL(alloc_contig_frozen_pages_user_noprof);

Generally we don't add EXPORT_SYMBOL() for new stuff unless
unavoidable. EXPORT_SYMBOL_GPL() is preferred.

> +
> +struct page *alloc_contig_frozen_pages_noprof(unsigned long nr_pages,
> +		gfp_t gfp_mask, int nid, nodemask_t *nodemask)
> +{
> +	return alloc_contig_frozen_pages_user_noprof(nr_pages, gfp_mask, nid,
> +						     nodemask, USER_ADDR_NONE);
> +}
>  EXPORT_SYMBOL(alloc_contig_frozen_pages_noprof);
>
>  /**
> --
> MST
>

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 11/37] mm: page_alloc: move prep_compound_page before post_alloc_hook
  2026-06-08  8:36 ` [PATCH v10 11/37] mm: page_alloc: move prep_compound_page before post_alloc_hook Michael S. Tsirkin
@ 2026-06-08 10:33   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 10:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:36:20AM -0400, Michael S. Tsirkin wrote:
> Move prep_compound_page() before post_alloc_hook() in prep_new_page().
>
> The next patch adds a folio_zero_user() call to post_alloc_hook(),
> which uses folio_nr_pages() to determine how many pages to zero.
> Without compound metadata set up first, folio_nr_pages() returns 1
> for higher-order allocations, so only the first page would be zeroed.
>
> All other operations in post_alloc_hook() (arch_alloc_page, KASAN,
> debug, page owner, etc.) use raw page pointers with explicit order
> counts and are unaffected by this reordering.

I'd put this justification for why this is safe above the 'next patch' stuff.

>
> Also reorder compaction_alloc_noprof() for consistency. Compaction
> currently passes USER_ADDR_NONE so folio_zero_user() is not called

Yeah again, this 'just so' stuff of when or when not an address is passed is
concerning.

> there, but keeping the same ordering avoids a future tripping hazard.

No functional changes or functional changes? If none then say so.

>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh
> ---
>  mm/compaction.c | 4 ++--
>  mm/page_alloc.c | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 72684fe81e83..4336e433c99b 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1849,10 +1849,10 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
>  		set_page_private(&freepage[size], start_order);
>  	}
>  	dst = (struct folio *)freepage;
> -	post_alloc_hook(&dst->page, order, __GFP_MOVABLE, USER_ADDR_NONE);
> -	set_page_refcounted(&dst->page);
>  	if (order)
>  		prep_compound_page(&dst->page, order);
> +	post_alloc_hook(&dst->page, order, __GFP_MOVABLE, USER_ADDR_NONE);
> +	set_page_refcounted(&dst->page);
>  	cc->nr_freepages -= 1 << order;
>  	cc->nr_migratepages -= 1 << order;
>  	return page_rmappable_folio(&dst->page);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0943ab724032..4676fd49819e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1874,11 +1874,11 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
>  							unsigned int alloc_flags,
>  							unsigned long user_addr)
>  {
> -	post_alloc_hook(page, order, gfp_flags, user_addr);
> -
>  	if (order && (gfp_flags & __GFP_COMP))
>  		prep_compound_page(page, order);
>
> +	post_alloc_hook(page, order, gfp_flags, user_addr);
> +
>  	/*
>  	 * page is set pfmemalloc when ALLOC_NO_WATERMARKS was necessary to
>  	 * allocate the page. The expectation is that the caller is taking
> --
> MST
>

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 13/37] mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio
  2026-06-08  8:36 ` [PATCH v10 13/37] mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio Michael S. Tsirkin
@ 2026-06-08 10:39   ` Lorenzo Stoakes
  2026-06-08 10:55     ` Lorenzo Stoakes
  0 siblings, 1 reply; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 10:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:36:51AM -0400, Michael S. Tsirkin wrote:
> Now that post_alloc_hook() handles cache-friendly user page
> zeroing via folio_zero_user(), convert vma_alloc_zeroed_movable_folio()
> to pass __GFP_ZERO instead of zeroing at the callsite.
>
> Note: before this series, replacing clear_user_highpage() with
> __GFP_ZERO was unsafe on cache-aliasing architectures because
> __GFP_ZERO uses clear_page() without a dcache flush. With this
> series, it is safe if the caller passes a valid user address
> (not USER_ADDR_NONE) to vma_alloc_folio() etc., which delivers

Wait, so now you're making actual correctness predicated on correctly
passing the right user address??

> it to post_alloc_hook() for the dcache flush via
> folio_zero_user(). It is only unsafe if USER_ADDR_NONE is passed.

Yeah, ok I'm beating a dead horse a bit here, but no to this approach.

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> ---
>  include/linux/highmem.h | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index d7aac9de1c8a..8b0afaabbc6e 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -320,13 +320,8 @@ static inline
>  struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
>  				   unsigned long vaddr)
>  {
> -	struct folio *folio;
> -
> -	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vaddr);
> -	if (folio && user_alloc_needs_zeroing())

So now we are unconditionally zeroing the pages even if
!user_alloc_needs_zeroing()? You don't mention this in the commit message
and it seems like it'll regress performance?

> -		clear_user_highpage(&folio->page, vaddr);
> -
> -	return folio;
> +	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
> +			      0, vma, vaddr);
>  }
>  #endif
>
> --
> MST
>

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 13/37] mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio
  2026-06-08 10:39   ` Lorenzo Stoakes
@ 2026-06-08 10:55     ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 10:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 11:40:02AM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:36:51AM -0400, Michael S. Tsirkin wrote:
> > Now that post_alloc_hook() handles cache-friendly user page
> > zeroing via folio_zero_user(), convert vma_alloc_zeroed_movable_folio()
> > to pass __GFP_ZERO instead of zeroing at the callsite.
> >
> > Note: before this series, replacing clear_user_highpage() with
> > __GFP_ZERO was unsafe on cache-aliasing architectures because
> > __GFP_ZERO uses clear_page() without a dcache flush. With this
> > series, it is safe if the caller passes a valid user address
> > (not USER_ADDR_NONE) to vma_alloc_folio() etc., which delivers
>
> Wait, so now you're making actual correctness predicated on correctly
> passing the right user address??
>
> > it to post_alloc_hook() for the dcache flush via
> > folio_zero_user(). It is only unsafe if USER_ADDR_NONE is passed.
>
> Yeah, ok I'm beating a dead horse a bit here, but no to this approach.
>
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Assisted-by: Claude:claude-opus-4-6
> > ---
> >  include/linux/highmem.h | 9 ++-------
> >  1 file changed, 2 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> > index d7aac9de1c8a..8b0afaabbc6e 100644
> > --- a/include/linux/highmem.h
> > +++ b/include/linux/highmem.h
> > @@ -320,13 +320,8 @@ static inline
> >  struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
> >  				   unsigned long vaddr)
> >  {
> > -	struct folio *folio;
> > -
> > -	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vaddr);
> > -	if (folio && user_alloc_needs_zeroing())
>
> So now we are unconditionally zeroing the pages even if
> !user_alloc_needs_zeroing()? You don't mention this in the commit message
> and it seems like it'll regress performance?

OK sorry I see in 12/37 you do this there instead.

>
> > -		clear_user_highpage(&folio->page, vaddr);
> > -
> > -	return folio;
> > +	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
> > +			      0, vma, vaddr);
> >  }
> >  #endif
> >
> > --
> > MST
> >
>
> Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (37 preceding siblings ...)
  2026-06-08  9:17 ` [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Lorenzo Stoakes
@ 2026-06-08 11:02 ` Vlastimil Babka (SUSE)
  2026-06-08 11:13   ` Vlastimil Babka (SUSE)
  2026-06-08 14:21 ` Matthew Wilcox
  2026-06-09  3:58 ` New design Matthew Wilcox
  40 siblings, 1 reply; 131+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-06-08 11:02 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On 6/8/26 10:33, Michael S. Tsirkin wrote:
> Further, on architectures with aliasing caches, upstream with init_on_alloc

It seems those are niche architectures so we can ignore that part for perf
purposes; the other reason why user_alloc_needs_zeroing() would be true is
booting with init_on_alloc.

> double-zeros user pages: once via kernel_init_pages() in
> post_alloc_hook, and again via clear_user_highpage() at the
> callsite (because user_alloc_needs_zeroing() returns true).
> This series eliminates that double-zeroing by moving the zeroing
> into the post_alloc_hook + propagating the "host
> already zeroed this page" information through the buddy allocator.

I wonder if this whole thing would be much easier if those that would want
performance would enable only init_on_free and not init_on_alloc at the same
time. If you enable both you're paranoid and just eat the cost I guess. Then
you're maybe also paranoid so much that you wouldn't want to trust the host
zeroing anyway?

post_alloc_hook() already has logic that if init_on_free is enabled, there's
no init during alloc. So then I think you would only need to communicate
that host zeroed some pages when the guest adds them to the buddy allocator,
and that's it?

I.e. if we can just assume that everything in the buddy is zeroed, we don't
need all the PG_zero flag and whatnot complexity.
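
A minimal userspace sketch of the decision described above, assuming a
simplified model of how post_alloc_hook() combines the two boot options with
an explicit zeroing request. The struct and function names are hypothetical
and are not kernel API; the point is only that with init_on_free set, no
zeroing happens at allocation time.

```c
#include <stdbool.h>

/* Hypothetical model of the two boot-time knobs; not a kernel structure. */
struct init_policy {
	bool init_on_alloc;	/* models booting with init_on_alloc=1 */
	bool init_on_free;	/* models booting with init_on_free=1 */
};

/*
 * Simplified model: does the allocator zero a page at allocation time?
 * With init_on_free, pages were already zeroed when they were freed,
 * so allocation-time zeroing is skipped even if zeroing was requested.
 */
static bool model_zero_on_alloc(const struct init_policy *p, bool gfp_zero)
{
	if (p->init_on_free)
		return false;
	return p->init_on_alloc || gfp_zero;
}
```

Under this model, host-zeroed pages would only need to be known-zero when they
enter the buddy allocator; no per-page flag is consulted at allocation time.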

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 10:23   ` Lorenzo Stoakes
@ 2026-06-08 11:06     ` Lorenzo Stoakes
  2026-06-08 13:04       ` Matthew Wilcox
  2026-06-08 11:08     ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 11:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

In terms of an alternative, I feel that we should:

1. Continue to not change external APIs (correct to not do so)

2. Avoid adding this to any code paths that don't strictly need it

3. Refactor the existing code to make it harder to get wrong. Don't have an
   easily confused sentinel, rather have some kind of context state that is
   passed through.

I mean we already have alloc_context and you're already updating it :)

But instead of overloading user_addr to indicate all kinds of things, instead
make life easier by actually breaking things out.

Like:

enum alloc_context_type {
	KERNEL_ALLOCATION,
	USER_MAPPED_ALLOCATION,
	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
	/* Perhaps some other states we want to encode? */
};

struct alloc_context {
	...

	enum alloc_context_type type;
	unsigned long user_addr; // Only set if type == USER_MAPPED_ALLOCATION

	// Maybe something suggesting context or whether we init before in some
	// cases?
};

And thread that through?

That way we can also add further context later if required rather than this
awkward easily misunderstood parameter.

It might mean some further preparatory patches, but it avoids the confusion of not
knowing what to set this to, and could perhaps be set further up the stack?

Then look to completely eliminate cases where we zero other than with
__GFP_ZERO.

Perhaps others have ideas too, but this kind of approach seems more appropriate?

It also seems right to send this work as an entirely separate series. 'Always
zero user memory using __GFP_ZERO' seems like a totally valid independent
series.
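
To make the idea concrete, here is a compilable userspace sketch, assuming the
enum above and a cut-down stand-in for struct alloc_context. The struct name,
the zero_method enum and pick_zero_method() are all hypothetical; they only
illustrate dispatching on an explicit type rather than on a sentinel address.

```c
#include <stdbool.h>

/* Names follow the enum sketched in this message; all are hypothetical. */
enum alloc_context_type {
	KERNEL_ALLOCATION,
	USER_MAPPED_ALLOCATION,
	USER_UNMAPPED_ALLOCATION,
};

/* Cut-down stand-in for the kernel's struct alloc_context. */
struct alloc_context_sketch {
	enum alloc_context_type type;
	unsigned long user_addr;	/* meaningful only for USER_MAPPED_ALLOCATION */
};

enum zero_method {
	ZERO_KERNEL,	/* stand-in for kernel_init_pages() */
	ZERO_USER,	/* stand-in for folio_zero_user() */
};

/* The type field, not the address value, selects the zeroing path. */
static enum zero_method pick_zero_method(const struct alloc_context_sketch *ac)
{
	switch (ac->type) {
	case USER_MAPPED_ALLOCATION:
		return ZERO_USER;
	case KERNEL_ALLOCATION:
	case USER_UNMAPPED_ALLOCATION:
	default:
		return ZERO_KERNEL;
	}
}
```

Because the type is explicit, user address 0 stays an ordinary valid address
and a kernel allocation never has to invent an address value at all.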

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 10:23   ` Lorenzo Stoakes
  2026-06-08 11:06     ` Lorenzo Stoakes
@ 2026-06-08 11:08     ` David Hildenbrand (Arm)
  2026-06-08 15:27       ` Zi Yan
  1 sibling, 1 reply; 131+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-08 11:08 UTC (permalink / raw)
  To: Lorenzo Stoakes, Michael S. Tsirkin
  Cc: linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, virtualization, linux-mm, Andrea Arcangeli

On 6/8/26 12:23, Lorenzo Stoakes wrote:
> I noticed this patch, again, sneaks in some additional code changes that
> are not mentioned in the commit message and seem irrelevant to the patch.
> 
> Not sure if the AI is doing this, but please don't do that.
> 
> On Mon, Jun 08, 2026 at 04:35:17AM -0400, Michael S. Tsirkin wrote:
>> Thread a user virtual address from vma_alloc_folio() down through
>> the page allocator to post_alloc_hook(). This is plumbing
>> preparation for a subsequent patch that will use user_addr to
>> call folio_zero_user() for cache-friendly zeroing of user pages.
> 
> This feels like very weak justification for code that massively changes mm
> code and makes it all much worse.
> 
>>
>> The user_addr is stored in struct alloc_context and flows through:
>>   vma_alloc_folio -> folio_alloc_mpol -> __alloc_pages_mpol ->
>>   __alloc_frozen_pages -> get_page_from_freelist -> prep_new_page ->
>>   post_alloc_hook
> 
> Is this the only relevant code path? You're changing a TON of code here
> that will have multiple different code paths?
> 
>>
>> USER_ADDR_NONE ((unsigned long)-1) is used for non-user
>> allocations, since address 0 is a valid userspace mapping.
> 
> Ugh god, so now we're passing a user address through allocation paths that
> might not even be aware of this or be allocating memory at a point when the
> mapping is known?

The original idea was to do this only with internal interfaces. I think I
suggested before that we leave hugetlb alone for now.

Fundamentally, user_alloc_needs_zeroing() is something we should get rid of, so
that we can just use __GFP_ZERO and do the zeroing in exactly one place.


The question is how to pass that information (user_addr) through internal APIs.

I previously said that ideally we'd end up with a folio allocation interface in
mm/page_alloc.c, from where we could get this done more cleanly internally.

But I don't want what the previous proposals did: use GFP flags+leak state or
return magic "zeroed" flags.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 36/37] mm: balloon: use put_page_zeroed for zeroed balloon pages
  2026-06-08  8:40 ` [PATCH v10 36/37] mm: balloon: use put_page_zeroed for zeroed balloon pages Michael S. Tsirkin
@ 2026-06-08 11:10   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 131+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-08 11:10 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, virtualization, linux-mm, Andrea Arcangeli

On 6/8/26 10:40, Michael S. Tsirkin wrote:
> When a balloon page marked PageZeroed is freed during migration,
> use put_page_zeroed() to propagate the zeroed hint to the buddy
> allocator. Previously the hint was silently lost via plain put_page().
> 
> No page has PageZeroed set yet; the next patch
> (VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE) will set it on
> pages the host has zeroed during inflate.
> Note: during balloon migration, the migration core holds an
> extra reference, so put_page_zeroed() will not be the final
> put. The zeroed hint is lost in that case, which is
> acceptable: it is a best-effort optimization.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> ---
>  mm/balloon.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/balloon.c b/mm/balloon.c
> index 96a8f1e20bc6..6c9dd8ab0c5d 100644
> --- a/mm/balloon.c
> +++ b/mm/balloon.c
> @@ -324,7 +324,15 @@ static int balloon_page_migrate(struct page *newpage, struct page *page,
>  	balloon_page_finalize(page);
>  	spin_unlock_irqrestore(&balloon_pages_lock, flags);
>  
> -	put_page(page);
> +	if (PageZeroed(page)) {
> +		/* Atomic to serialize with memory_failure's
> +		 * TestSetPageHWPoison; not under zone->lock here.
> +		 */
> +		ClearPageZeroed(page);
> +		put_page_zeroed(page);
> +	} else {
> +		put_page(page);
> +	}
>  
>  	return 0;
>  }

I think I raised previously that this is best done later.

This patch set is currently trying to do too many things, and is on the larger
side. I think we should try to reduce it to the bare minimum and get some
agreement between maintainers on the core design.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
  2026-06-08 11:02 ` Vlastimil Babka (SUSE)
@ 2026-06-08 11:13   ` Vlastimil Babka (SUSE)
  2026-06-08 15:45     ` Gregory Price
  0 siblings, 1 reply; 131+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-06-08 11:13 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On 6/8/26 13:02, Vlastimil Babka (SUSE) wrote:
> On 6/8/26 10:33, Michael S. Tsirkin wrote:
>> Further, on architectures with aliasing caches, upstream with init_on_alloc
> 
> It seems those are niche architectures so we can ignore that part for perf
> purposes; the other reason why user_alloc_needs_zeroing() would be true is
> booting with init_on_alloc.

OK I misread how user_alloc_needs_zeroing() works wrt init_on_alloc, as it's
negated. But you're changing that anyway to skip that user zeroing, right?

"
This series eliminates that double-zeroing by moving the zeroing
into the post_alloc_hook + propagating the "host
already zeroed this page" information through the buddy allocator.
"

So relying on "everything in buddy is zeroed" would still work I'd think.

>> double-zeros user pages: once via kernel_init_pages() in
>> post_alloc_hook, and again via clear_user_highpage() at the
>> callsite (because user_alloc_needs_zeroing() returns true).
>> This series eliminates that double-zeroing by moving the zeroing
>> into the post_alloc_hook + propagating the "host
>> already zeroed this page" information through the buddy allocator.
> 
> I wonder if this whole thing would be much easier if those that would want
> performance would enable only init_on_free and not init_on_alloc at the same
> time. If you enable both you're paranoid and just eat the cost I guess. Then
> you're maybe also paranoid so much that you wouldn't want to trust the host
> zeroing anyway?
> 
> post_alloc_hook() already has logic that if init_on_free is enabled, there's
> no init during alloc. So then I think you would only need to communicate
> that host zeroed some pages when the guest adds them to the buddy allocator,
> and that's it?
> 
> I.e. if we can just assume that everything in the buddy is zeroed, we don't
> need all the PG_zero flag and whatnot complexity.



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
  2026-06-08  8:36 ` [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook Michael S. Tsirkin
@ 2026-06-08 11:23   ` Lorenzo Stoakes
  2026-06-08 15:53     ` Gregory Price
  2026-06-08 19:42     ` Michael S. Tsirkin
  0 siblings, 2 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 11:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:36:38AM -0400, Michael S. Tsirkin wrote:
> When post_alloc_hook() needs to zero a page for an explicit
> __GFP_ZERO allocation for a user page (user_addr is set), use folio_zero_user()
> instead of kernel_init_pages().  This zeros near the faulting
> address last, keeping those cachelines hot for the impending
> user access.
>
> folio_zero_user() is only used for explicit __GFP_ZERO, not for
> init_on_alloc.  On architectures with virtually-indexed caches
> (e.g., ARM), clear_user_highpage() performs per-line cache
> operations; using it for init_on_alloc would add overhead that
> kernel_init_pages() avoids (the page fault path flushes the
> cache at PTE installation time regardless).
>
> No functional change yet: current callers do not pass __GFP_ZERO
> for user pages (they zero at the callsite instead).  Subsequent
> patches will convert them.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> ---
>  mm/page_alloc.c | 35 ++++++++++++++++++++++++++++++++---
>  1 file changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4676fd49819e..d4fbf1861a8a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1861,9 +1861,38 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  		for (i = 0; i != 1 << order; ++i)
>  			page_kasan_tag_reset(page + i);
>  	}
> -	/* If memory is still not initialized, initialize it now. */
> -	if (init)
> -		kernel_init_pages(page, 1 << order);
> +	/*
> +	 * On architectures with cache aliasing, pages zeroed via the
> +	 * kernel direct map (e.g. init_on_free) must be re-zeroed
> +	 * through a user-congruent mapping.  Host-zeroed pages
> +	 * (zeroed flag) don't need this: physical RAM is clean.
> +	 */
> +	if (!init && (gfp_flags & __GFP_ZERO) &&
> +	    user_addr != USER_ADDR_NONE &&
> +	    user_alloc_needs_zeroing())

We check this (gfp_flags & __GFP_ZERO) && user_addr != USER_ADDR_NONE thing
twice, can we just put in a 'init_should_folio_zero' const bool or something?

> +		init = true;

As Vlasta says, I'm not sure we want to add complexity just for these arches.

> +	/*
> +	 * If memory is still not initialized, initialize it now.

I kinda hate that 'init' is unclear as to 'do init' or 'was init somewhere
else'... Anyway.

> +	 * When __GFP_ZERO was explicitly requested and user_addr is set,
> +	 * use folio_zero_user() which zeros near the faulting address
> +	 * last, keeping those cachelines hot.  For init_on_alloc, use
> +	 * kernel_init_pages() to avoid unnecessary cache flush overhead
> +	 * on architectures with virtually-indexed caches.

This whole paragraph seems pretty useless and just describing the code?

> +	 */
> +	if (init) {
> +		if ((gfp_flags & __GFP_ZERO) && user_addr != USER_ADDR_NONE) {
> +			/*
> +			 * folio_zero_user relies on folio_nr_pages which
> +			 * requires __GFP_COMP for order > 0.  All user folio
> +			 * allocations set __GFP_COMP via __folio_alloc.

This whole paragraph is useless and very like the kind of stuff AI generates for
comments, i.e. overly long + entirely unnecessary stuff.

> +			 * user_addr != USER_ADDR_NONE implies sleepable
> +			 * context (user page fault).

Can you safely assume that? Also inferring which context we are in from this
parameter seems risky.

It seems to me that you're now making it such that kernel developers:

- Have to know when and when not to specify a user address, and under what
  circumstances we might consider that to be mapped.

- Need to know to do this correctly for aliasing architectures or have silent
  correctness issues.

- Need to take context into account when specifying this.

We definitely need to find a simpler way to do this!

> +			 */
> +			VM_WARN_ON_ONCE(order && !(gfp_flags & __GFP_COMP));

Surely by now we can assume this?

> +			folio_zero_user(page_folio(page), user_addr);
> +		} else
> +			kernel_init_pages(page, 1 << order);

I hate this hanging else branch... definitely prefer {} on both branches.

But in any case it seems like we could avoid some indentation with something
like:

	if (init && init_should_folio_zero) {
		...
	} else if (init) {
		...
	}

Or even a:

	if (!init)
		goto out;

And stick an out label below?
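
To make the first suggestion concrete, here is a userspace sketch of the
dispatch only (not kernel code). GFP_ZERO_BIT, USER_ADDR_NONE's value, the
enum and pick_zero_path() are all hypothetical stand-ins; the earlier
aliasing step that may set init = true is not modelled.

```c
#include <stdbool.h>

/* Stand-ins for the kernel definitions; values are illustrative only. */
#define GFP_ZERO_BIT   0x1u
#define USER_ADDR_NONE (~0UL)

enum zero_path { ZERO_NONE, ZERO_FOLIO_USER, ZERO_KERNEL_INIT };

/*
 * Hypothetical helper: report which zeroing routine the hook would call,
 * using one named predicate instead of repeating the test.
 */
static enum zero_path pick_zero_path(bool init, unsigned int gfp_flags,
				     unsigned long user_addr)
{
	const bool init_should_folio_zero =
		(gfp_flags & GFP_ZERO_BIT) && user_addr != USER_ADDR_NONE;

	if (init && init_should_folio_zero) {
		return ZERO_FOLIO_USER;		/* folio_zero_user() in the patch */
	} else if (init) {
		return ZERO_KERNEL_INIT;	/* kernel_init_pages() in the patch */
	}

	return ZERO_NONE;
}
```

Both branches carry braces and the predicate is evaluated once, which is
all the sketch is meant to show.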

> +	}

>
>  	set_page_owner(page, order, gfp_flags);
>  	page_table_check_alloc(page, order);
> --
> MST
>

Oh and in general it seems that this conflicts with [0] which removes
kernel_init_pages().

[0]: https://lore.kernel.org/all/20260422102729.166599-1-hsalunke@amd.com/

Thanks, Lorenzo



* Re: [PATCH v10 14/37] mm: remove arch vma_alloc_zeroed_movable_folio overrides
  2026-06-08  8:37 ` [PATCH v10 14/37] mm: remove arch vma_alloc_zeroed_movable_folio overrides Michael S. Tsirkin
@ 2026-06-08 11:29   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 11:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Magnus Lindholm,
	Greg Ungerer, Geert Uytterhoeven

On Mon, Jun 08, 2026 at 04:37:08AM -0400, Michael S. Tsirkin wrote:
> Now that the generic vma_alloc_zeroed_movable_folio() uses
> __GFP_ZERO, the arch-specific macros on alpha, m68k, s390, and
> x86 that did the same thing are redundant.  Remove them.
>
> arm64 is not affected: it has a real function override that
> handles MTE tag zeroing, not just __GFP_ZERO.
>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Acked-by: Magnus Lindholm <linmag7@gmail.com>
> Acked-by: Greg Ungerer <gerg@linux-m68k.org>
> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Reviewed-by: Gregory Price <gourry@gourry.net>
> ---
>  arch/alpha/include/asm/page.h   | 3 ---
>  arch/m68k/include/asm/page_no.h | 3 ---
>  arch/s390/include/asm/page.h    | 3 ---
>  arch/x86/include/asm/page.h     | 3 ---
>  include/linux/highmem.h         | 8 +++++---
>  5 files changed, 5 insertions(+), 15 deletions(-)
>
> diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
> index 59d01f9b77f6..4327029cd660 100644
> --- a/arch/alpha/include/asm/page.h
> +++ b/arch/alpha/include/asm/page.h
> @@ -12,9 +12,6 @@
>
>  extern void clear_page(void *page);
>
> -#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
> -	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
> -
>  extern void copy_page(void * _to, void * _from);
>  #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
>
> diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h
> index d2532bc407ef..f511b763a235 100644
> --- a/arch/m68k/include/asm/page_no.h
> +++ b/arch/m68k/include/asm/page_no.h
> @@ -12,9 +12,6 @@ extern unsigned long memory_end;
>
>  #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
>
> -#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
> -	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
> -
>  #define __pa(vaddr)		((unsigned long)(vaddr))
>  #define __va(paddr)		((void *)((unsigned long)(paddr)))
>
> diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
> index 56da819a79e6..e995d2a413f9 100644
> --- a/arch/s390/include/asm/page.h
> +++ b/arch/s390/include/asm/page.h
> @@ -67,9 +67,6 @@ static inline void copy_page(void *to, void *from)
>
>  #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
>
> -#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
> -	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
> -
>  #ifdef CONFIG_STRICT_MM_TYPECHECKS
>  #define STRICT_MM_TYPECHECKS
>  #endif
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index 416dc88e35c1..92fa975b46f3 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -28,9 +28,6 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
>  	copy_page(to, from);
>  }
>
> -#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
> -	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
> -
>  #ifndef __pa
>  #define __pa(x)		__phys_addr((unsigned long)(x))
>  #endif
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index 8b0afaabbc6e..642718a50c27 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -303,7 +303,6 @@ static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
>  #endif
>  }
>
> -#ifndef vma_alloc_zeroed_movable_folio

We're specifying this function unconditionally even though arm64 overrides?

>  /**
>   * vma_alloc_zeroed_movable_folio - Allocate a zeroed page for a VMA.
>   * @vma: The VMA the page is to be allocated for.
> @@ -317,12 +316,15 @@ static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
>   * we are out of memory.
>   */
>  static inline
> -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
> +struct folio *vma_alloc_zeroed_movable_folio_noprof(struct vm_area_struct *vma,
>  				   unsigned long vaddr)
>  {
> -	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
> +	return vma_alloc_folio_noprof(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
>  			      0, vma, vaddr);

This whole change seems unnecessary?

>  }
> +#ifndef vma_alloc_zeroed_movable_folio
> +#define vma_alloc_zeroed_movable_folio(...) \
> +	alloc_hooks(vma_alloc_zeroed_movable_folio_noprof(__VA_ARGS__))
>  #endif

I don't know why we need to add more of this alloc_hooks() dance when we could
just do:

#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)

Like the existing arch stuff?

>
>  static inline void clear_highpage(struct page *page)
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 15/37] mm: alloc_anon_folio: pass raw fault address to vma_alloc_folio
  2026-06-08  8:37 ` [PATCH v10 15/37] mm: alloc_anon_folio: pass raw fault address to vma_alloc_folio Michael S. Tsirkin
@ 2026-06-08 11:35   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 11:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:37:25AM -0400, Michael S. Tsirkin wrote:
> Pass vmf->address directly instead of ALIGN_DOWN(vmf->address, ...).
> NUMA interleave is not affected: the ilx calculation in
> get_vma_policy() shifts addr >> PAGE_SHIFT >> order, which
> drops sub-page bits regardless of alignment. post_alloc_hook
> will use the raw address for cache-friendly zeroing via
> folio_zero_user().

I'm confused as to the justification for this? You're saying 'make change X,
it's safe because Y'. So the justification is now this post_alloc_hook thing.

But are you now creating a new requirement of vma_alloc_folio() that you must
specify the actual address we are faulting on, not an address within the folio
or the folio's base address?

(If that's a requirement, why is it?)

If so you should update the vma_alloc_folio() description 'virtual address of
the allocation' is not at all clear.

And if that _is_ a requirement, then are you sure all allocation paths are
correct?

I already see addr & HPAGE_PMD_MASK in vma_alloc_anon_folio_pmd() for
instance?

If it's not a requirement, why are we doing this? It's surely useless in
that case?
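
For reference, the alignment claim in the commit message can be checked in
isolation. This is a userspace sketch that models only the shift as stated
there (addr >> PAGE_SHIFT >> order); it assumes 4 KiB pages, and
shift_index() and index_ignores_alignment() are hypothetical names. Any
other terms in the real ilx calculation are not modelled.

```c
#include <stdbool.h>

#define PAGE_SHIFT 12UL				/* assumes 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))	/* a must be a power of two */

/* Index as described in the commit message: addr >> PAGE_SHIFT >> order. */
static unsigned long shift_index(unsigned long addr, unsigned int order)
{
	return addr >> PAGE_SHIFT >> order;
}

/* True if the raw and the order-aligned address give the same index. */
static bool index_ignores_alignment(unsigned long addr, unsigned int order)
{
	unsigned long aligned = ALIGN_DOWN(addr, PAGE_SIZE << order);

	return shift_index(addr, order) == shift_index(aligned, order);
}
```

ALIGN_DOWN() clears exactly the low PAGE_SHIFT + order bits that the shift
discards, so the two results agree for this simplified form.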

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh
> ---
>  mm/memory.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 86a973119bd4..21f640674c4f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5268,8 +5268,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  	/* Try allocating the highest of the remaining orders. */
>  	gfp = vma_thp_gfp_mask(vma);
>  	while (orders) {
> -		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);

If you're removing this usage, could you also remove the silly thing of us
declaring this at function scope then using it in branches when we should always
have declared these separately?

> -		folio = vma_alloc_folio(gfp, order, vma, addr);
> +		folio = vma_alloc_folio(gfp, order, vma, vmf->address);
>  		if (folio) {
>  			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
>  				count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 16/37] mm: alloc_swap_folio: pass raw fault address to vma_alloc_folio
  2026-06-08  8:37 ` [PATCH v10 16/37] mm: alloc_swap_folio: " Michael S. Tsirkin
@ 2026-06-08 11:37   ` Lorenzo Stoakes
  2026-06-08 15:59     ` Gregory Price
  0 siblings, 1 reply; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 11:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:37:41AM -0400, Michael S. Tsirkin wrote:
> Same change as the previous patch but for alloc_swap_folio:

Please don't say 'same change as the previous patch' :) explain what you're
doing here. It's a pain to have to go check otherwise.

> pass vmf->address directly instead of ALIGN_DOWN(vmf->address, ...).
>
> Note: NUMA interleave is not affected by the raw address;
> the ilx calculation shifts addr >> PAGE_SHIFT >> order,
> dropping sub-page bits regardless of alignment.

You're expressing the same thing as the last patch differently, but then
eliding other explanations from that?

All the same questions I asked on the previous patch apply to this one too.

And also - if you've now made this a _requirement_, with things broken
otherwise, then aren't these bisection hazards that should be squashed into
the change?

No patch should break anything at any point.

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh
> ---
>  mm/memory.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 21f640674c4f..6c14b90f558e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4750,8 +4750,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  	/* Try allocating the highest of the remaining orders. */
>  	gfp = vma_thp_gfp_mask(vma);
>  	while (orders) {
> -		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> -		folio = vma_alloc_folio(gfp, order, vma, addr);
> +		folio = vma_alloc_folio(gfp, order, vma, vmf->address);
>  		if (folio) {
>  			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
>  							    gfp, entry))
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 18/37] mm: page_alloc: use aliasing checks instead of user_alloc_needs_zeroing
  2026-06-08  8:38 ` [PATCH v10 18/37] mm: page_alloc: use aliasing checks instead of user_alloc_needs_zeroing Michael S. Tsirkin
@ 2026-06-08 11:39   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 11:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:38:00AM -0400, Michael S. Tsirkin wrote:
> Replace user_alloc_needs_zeroing() with the direct aliasing checks
> (cpu_dcache_is_aliasing() || cpu_icache_is_aliasing()) in the
> post_alloc_hook aliasing guard.
>
> user_alloc_needs_zeroing() includes a !init_on_alloc term that
> means "allocator didn't zero this page."  But in this guard's
> context (!zeroed && !init && __GFP_ZERO), we already know the page
> is zero; init incorporates init_on_alloc via want_init_on_alloc().
> The only question left is whether the cache architecture needs
> the data re-zeroed through a congruent mapping, which is purely
> cpu_dcache_is_aliasing() || cpu_icache_is_aliasing().
>
> On non-aliasing architectures with init_on_free=true and
> init_on_alloc=false, this avoids a redundant re-zero of an
> already-zero page.
>
> Note on PowerPC: PowerPC overrides clear_user_page to call
> flush_dcache_page after clear_page, but on freshly allocated
> pages PG_dcache_clean is already clear (cleared by
> __free_pages_prepare), so flush_dcache_page is a no-op.
> Skipping this here thus has no effect.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6

This seems like an odd ordering of patches, can we group like changes
together?

> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 45e824b1ec75..edfc83571985 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1880,7 +1880,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  	 */
>  	if (!zeroed && !init && (gfp_flags & __GFP_ZERO) &&
>  	    user_addr != USER_ADDR_NONE &&
> -	    user_alloc_needs_zeroing())
> +	    (cpu_dcache_is_aliasing() || cpu_icache_is_aliasing()))

Let's try and simplify things rather than adding endlessly huge if conditionals?

It's now incredibly hard to track exactly what's going on here, and that is
bug-bait.
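
One possible shape, as a userspace sketch only: give each term of the guard
a name and return early. GFP_ZERO_BIT, the USER_ADDR_NONE value and both
helper names are hypothetical; cache_aliasing stands in for
cpu_dcache_is_aliasing() || cpu_icache_is_aliasing().

```c
#include <stdbool.h>

#define GFP_ZERO_BIT   0x1u	/* stand-in for __GFP_ZERO */
#define USER_ADDR_NONE (~0UL)	/* stand-in for the series' sentinel */

/* Explicit __GFP_ZERO request for a page that will be mapped to userspace. */
static bool is_user_zero_request(unsigned int gfp_flags,
				 unsigned long user_addr)
{
	return (gfp_flags & GFP_ZERO_BIT) && user_addr != USER_ADDR_NONE;
}

/*
 * Mirrors the guard quoted above: re-zero only when nobody has zeroed the
 * page on this path, the caller asked for a zeroed user page, and the
 * cache is aliasing.
 */
static bool needs_congruent_rezero(bool zeroed, bool init,
				   unsigned int gfp_flags,
				   unsigned long user_addr,
				   bool cache_aliasing)
{
	if (zeroed || init)
		return false;
	if (!is_user_zero_request(gfp_flags, user_addr))
		return false;

	return cache_aliasing;
}
```

The truth table is the same as the five-term conditional; only the layout
changes.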

>  		init = true;
>  	/*
>  	 * If memory is still not initialized, initialize it now.
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 19/37] mm: page_alloc: clear PG_zeroed on buddy merge if not both zero
  2026-06-08  8:38 ` [PATCH v10 19/37] mm: page_alloc: clear PG_zeroed on buddy merge if not both zero Michael S. Tsirkin
@ 2026-06-08 11:47   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 11:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:38:08AM -0400, Michael S. Tsirkin wrote:
> When two buddy pages merge in __free_one_page(), preserve
> PG_zeroed on the merged page only if both buddies have the
> flag set.  Otherwise clear it.
>
> The merged page would inherit PG_zeroed, and a later __GFP_ZERO
> allocation would skip zeroing stale data in the non-zero half.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh
> ---
>  include/linux/page-flags.h |  1 +
>  mm/page_alloc.c            | 15 ++++++++++++++-
>  2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 91f8ddb1d512..9365d59ac1d6 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -680,6 +680,7 @@ FOLIO_FLAG_FALSE(idle)
>   * uses this to skip redundant zeroing in post_alloc_hook().
>   */
>  __PAGEFLAG(Zeroed, zeroed, PF_NO_COMPOUND)
> +CLEARPAGEFLAG(Zeroed, zeroed, PF_NO_COMPOUND)
>  #define __PG_ZEROED (1UL << PG_zeroed)
>
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index edfc83571985..a90bca5317c1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -941,10 +941,14 @@ static inline void __free_one_page(struct page *page,
>  	unsigned long buddy_pfn = 0;
>  	unsigned long combined_pfn;
>  	struct page *buddy;
> +	bool buddy_zeroed;
> +	bool page_zeroed;
>  	bool to_tail;
>
>  	VM_BUG_ON(!zone_is_initialized(zone));
> -	VM_BUG_ON_PAGE(page->flags.f & PAGE_FLAGS_CHECK_AT_PREP, page);
> +	/* PG_zeroed (aliased to PG_private) is valid on free-list pages */
> +	VM_BUG_ON_PAGE(page->flags.f &

NIT: We don't add new VM_BUG_ON()'s, please use VM_WARN_ON().

> +		       (PAGE_FLAGS_CHECK_AT_PREP & ~__PG_ZEROED), page);
>
>  	VM_BUG_ON(migratetype == -1);
>  	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> @@ -979,6 +983,8 @@ static inline void __free_one_page(struct page *page,
>  				goto done_merging;
>  		}
>
> +		buddy_zeroed = PageZeroed(buddy);
> +
>  		/*
>  		 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
>  		 * merge with it and move up one order.
> @@ -997,10 +1003,17 @@ static inline void __free_one_page(struct page *page,
>  			change_pageblock_range(buddy, order, migratetype);
>  		}
>
> +		page_zeroed = PageZeroed(page);
> +		__ClearPageZeroed(page);
> +		__ClearPageZeroed(buddy);
> +
>  		combined_pfn = buddy_pfn & pfn;
>  		page = page + (combined_pfn - pfn);
>  		pfn = combined_pfn;
>  		order++;
> +
> +		if (page_zeroed && buddy_zeroed)
> +			__SetPageZeroed(page);
>  	}
>
>  done_merging:
> --
> MST
>
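
The merge rule in the hunk above can be modelled in userspace as follows.
This is a sketch under assumptions, not the kernel code: struct merged and
merge_buddies() are hypothetical, and the buddy is taken to be
pfn ^ (1 << order), with the combined pfn computed as buddy_pfn & pfn as in
the patch.

```c
#include <stdbool.h>

/* Result of merging one page with its buddy (hypothetical type). */
struct merged {
	unsigned long pfn;
	unsigned int order;
	bool zeroed;
};

/* The buddy of a 2^order block differs from it only in bit 'order'. */
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

/* The merged block keeps the zeroed hint only if both halves had it. */
static struct merged merge_buddies(unsigned long pfn, unsigned int order,
				   bool page_zeroed, bool buddy_zeroed)
{
	unsigned long buddy_pfn = find_buddy_pfn(pfn, order);
	struct merged m = {
		.pfn = buddy_pfn & pfn,
		.order = order + 1,
		.zeroed = page_zeroed && buddy_zeroed,
	};

	return m;
}
```

For example, merging pfn 6 at order 1 with its buddy at pfn 4 yields an
order-2 block at pfn 4, which is marked zeroed only if both inputs were.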



* Re: [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
  2026-06-08  8:37 ` [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
@ 2026-06-08 12:00   ` Lorenzo Stoakes
  2026-06-08 16:09     ` Gregory Price
  0 siblings, 1 reply; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 12:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:37:48AM -0400, Michael S. Tsirkin wrote:
> When a guest reports free pages to the hypervisor via the page reporting
> framework (used by virtio-balloon and hv_balloon), the host typically
> zeros those pages when reclaiming their backing memory.  However, when
> those pages are later allocated in the guest, post_alloc_hook()
> unconditionally zeros them again if __GFP_ZERO is set.  This
> double-zeroing is wasteful, especially for large pages.
>
> Avoid redundant zeroing:
>
> - Add a host_zeroes_pages flag to page_reporting_dev_info, allowing
>   drivers to declare that their host zeros reported pages on reclaim.
>   A static key (page_reporting_host_zeroes) gates the fast path.
>
> - Add PG_zeroed page flag (sharing PG_private bit) to mark pages
>   that have been zeroed by the host.  Set it in
>   page_reporting_drain() after the host reports them.

I think this flag is really confusingly named, if it's a virtualised host
thing, then can we please encode that in the flag name?

I was looking at a later commit and wondering who was doing the zeroing
exactly.

And could we please propagate that throughout the code, some nebulous 'bool
zeroed = ... ' begs the question of whether it's the kernel who did it and
why we are adding logic.

>
> - Thread the zeroed bool through rmqueue -> prep_new_page ->
>   post_alloc_hook, where it skips redundant zeroing for __GFP_ZERO
>   allocations.
>
> Currently the PG_zeroed hint can be lost when pages are
> split (expand) or merged in the buddy allocator.  This is
> harmless: losing the hint just means the page gets re-zeroed,
> which is correct but suboptimal.  Follow-up patches propagate
> PG_zeroed across splits and merges to preserve the hint on
> common paths.
>
> No driver sets host_zeroes_pages yet; a follow-up patch to
> virtio_balloon is needed to opt in.
>
> PG_zeroed pages may pass through PCP lists before being freed.
> This is safe: __free_pages_prepare clears all
> PAGE_FLAGS_CHECK_AT_PREP flags (including PG_zeroed/PG_private)
> before the page re-enters the buddy allocator.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh
> ---
>  include/linux/page-flags.h     |  9 +++++
>  include/linux/page_reporting.h |  3 ++
>  mm/compaction.c                |  6 ++-
>  mm/internal.h                  |  2 +-
>  mm/page_alloc.c                | 68 +++++++++++++++++++++++-----------
>  mm/page_reporting.c            | 14 ++++++-
>  mm/page_reporting.h            | 12 ++++++
>  7 files changed, 88 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 7223f6f4e2b4..91f8ddb1d512 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -135,6 +135,8 @@ enum pageflags {
>  	PG_swapcache = PG_owner_priv_1, /* Swap page: swp_entry_t in private */
>  	/* Some filesystems */
>  	PG_checked = PG_owner_priv_1,
> +	/* Page contents are known to be zero */
> +	PG_zeroed = PG_private,
>
>  	/*
>  	 * Depending on the way an anonymous folio can be mapped into a page
> @@ -673,6 +675,13 @@ FOLIO_TEST_CLEAR_FLAG_FALSE(young)
>  FOLIO_FLAG_FALSE(idle)
>  #endif
>
> +/*
> + * PageZeroed() tracks pages known to be zero.  The allocator
> + * uses this to skip redundant zeroing in post_alloc_hook().
> + */
> +__PAGEFLAG(Zeroed, zeroed, PF_NO_COMPOUND)
> +#define __PG_ZEROED (1UL << PG_zeroed)
> +
>  /*
>   * PageReported() is used to track reported free pages within the Buddy
>   * allocator. We can use the non-atomic version of the test and set
> diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
> index 5ab5be02fa15..c331c6b36687 100644
> --- a/include/linux/page_reporting.h
> +++ b/include/linux/page_reporting.h
> @@ -14,6 +14,9 @@ struct page_reporting_dev_info {
>  	int (*report)(struct page_reporting_dev_info *prdev,
>  		      struct scatterlist *sg, unsigned int nents);
>
> +	/* If true, host zeros reported pages on reclaim */
> +	bool host_zeroes_pages;
> +
>  	/* work struct for processing reports */
>  	struct delayed_work work;
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4336e433c99b..8000fc5e0a2e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -82,7 +82,8 @@ static inline bool is_via_compact_memory(int order) { return false; }
>
>  static struct page *mark_allocated_noprof(struct page *page, unsigned int order, gfp_t gfp_flags)
>  {
> -	post_alloc_hook(page, order, __GFP_MOVABLE, USER_ADDR_NONE);
> +	__ClearPageZeroed(page);
> +	post_alloc_hook(page, order, __GFP_MOVABLE, false, USER_ADDR_NONE);
>  	set_page_refcounted(page);
>  	return page;
>  }
> @@ -1849,9 +1850,10 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
>  		set_page_private(&freepage[size], start_order);
>  	}
>  	dst = (struct folio *)freepage;
> +	__ClearPageZeroed(&dst->page);
>  	if (order)
>  		prep_compound_page(&dst->page, order);
> -	post_alloc_hook(&dst->page, order, __GFP_MOVABLE, USER_ADDR_NONE);
> +	post_alloc_hook(&dst->page, order, __GFP_MOVABLE, false, USER_ADDR_NONE);
>  	set_page_refcounted(&dst->page);
>  	cc->nr_freepages -= 1 << order;
>  	cc->nr_migratepages -= 1 << order;
> diff --git a/mm/internal.h b/mm/internal.h
> index 9d2198114510..4af5e72742ba 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -928,7 +928,7 @@ static inline void init_compound_tail(struct page *tail,
>  }
>
>  void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> -		     unsigned long user_addr);
> +		     bool zeroed, unsigned long user_addr);

host_zeroed or something would be more appropriate no?

But in general do we need to propagate this around, can't we derive it from
the page zeroed flag?

It's really confusing as to _which_ zeroing this refers to, it seems the
only one relevant here is the VM host zeroing but that's completely
non-obvious and now everybody using these functions with the extra param
will simply have to happen to know this.

If we could find a way to avoid this propagation that'd be ideal.

Failing that, making it clear this is _only_ for vm host zeroing would be
better, but then maybe we need to think about how we could encode this in
some other way, e.g. passing alloc_context perhaps?

>  extern bool free_pages_prepare(struct page *page, unsigned int order);
>
>  extern int user_min_free_kbytes;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d4fbf1861a8a..45e824b1ec75 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1743,6 +1743,7 @@ static __always_inline void page_del_and_expand(struct zone *zone,
>  	bool was_reported = page_reported(page);
>
>  	__del_page_from_free_list(page, zone, high, migratetype);
> +

Stray whitespace?

>  	nr_pages -= expand(zone, page, low, high, migratetype, was_reported);
>  	account_freepages(zone, -nr_pages, migratetype);
>  }
> @@ -1815,8 +1816,10 @@ static inline bool should_skip_init(gfp_t flags)
>  	return (flags & __GFP_SKIP_ZERO);
>  }
>
> +

Stray whitespace?

>  inline void post_alloc_hook(struct page *page, unsigned int order,
> -				gfp_t gfp_flags, unsigned long user_addr)
> +				gfp_t gfp_flags, bool zeroed,
> +				unsigned long user_addr)
>  {
>  	const bool zero_tags = gfp_flags & __GFP_ZEROTAGS;
>  	bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
> @@ -1825,6 +1828,14 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>
>  	set_page_private(page, 0);
>
> +	/*
> +	 * If the page is zeroed, skip memory initialization.
> +	 * We still need to handle tag zeroing separately since the host
> +	 * does not know about memory tags.
> +	 */
> +	if (zeroed && init && !zero_tags)
> +		init = false;
> +
>  	arch_alloc_page(page, order);
>  	debug_pagealloc_map_pages(page, 1 << order);
>
> @@ -1867,7 +1878,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  	 * through a user-congruent mapping.  Host-zeroed pages
>  	 * (zeroed flag) don't need this: physical RAM is clean.
>  	 */
> -	if (!init && (gfp_flags & __GFP_ZERO) &&
> +	if (!zeroed && !init && (gfp_flags & __GFP_ZERO) &&
>  	    user_addr != USER_ADDR_NONE &&
>  	    user_alloc_needs_zeroing())
>  		init = true;
> @@ -1900,13 +1911,13 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  }
>
>  static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
> -							unsigned int alloc_flags,
> -							unsigned long user_addr)
> +			  unsigned int alloc_flags, bool zeroed,
> +			  unsigned long user_addr)
>  {
>  	if (order && (gfp_flags & __GFP_COMP))
>  		prep_compound_page(page, order);
>
> -	post_alloc_hook(page, order, gfp_flags, user_addr);
> +	post_alloc_hook(page, order, gfp_flags, zeroed, user_addr);
>
>  	/*
>  	 * page is set pfmemalloc when ALLOC_NO_WATERMARKS was necessary to
> @@ -3174,6 +3185,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
>  	}
>
>  	del_page_from_free_list(page, zone, order, mt);
> +	__ClearPageZeroed(page);
>
>  	/*
>  	 * Set the pageblock if the isolated page is at least half of a
> @@ -3246,7 +3258,7 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
>  static __always_inline
>  struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  			   unsigned int order, unsigned int alloc_flags,
> -			   int migratetype)
> +			   int migratetype, bool *zeroed)
>  {
>  	struct page *page;
>  	unsigned long flags;
> @@ -3281,6 +3293,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  			}
>  		}
>  		spin_unlock_irqrestore(&zone->lock, flags);
> +		*zeroed = PageZeroed(page);
> +		__ClearPageZeroed(page);
>  	} while (check_new_pages(page, order));
>
>  	/*
> @@ -3349,10 +3363,9 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
>  /* Remove page from the per-cpu list, caller must protect the list */
>  static inline
>  struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
> -			int migratetype,
> -			unsigned int alloc_flags,
> +			int migratetype, unsigned int alloc_flags,
>  			struct per_cpu_pages *pcp,
> -			struct list_head *list)
> +			struct list_head *list, bool *zeroed)
>  {
>  	struct page *page;
>
> @@ -3387,6 +3400,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>  		page = list_first_entry(list, struct page, pcp_list);
>  		list_del(&page->pcp_list);
>  		pcp->count -= 1 << order;
> +		*zeroed = PageZeroed(page);
> +		__ClearPageZeroed(page);
>  	} while (check_new_pages(page, order));
>
>  	return page;
> @@ -3395,7 +3410,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>  /* Lock and remove page from the per-cpu list */
>  static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>  			struct zone *zone, unsigned int order,
> -			int migratetype, unsigned int alloc_flags)
> +			int migratetype, unsigned int alloc_flags,
> +			bool *zeroed)
>  {
>  	struct per_cpu_pages *pcp;
>  	struct list_head *list;
> @@ -3413,7 +3429,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>  	 */
>  	pcp->free_count >>= 1;
>  	list = &pcp->lists[order_to_pindex(migratetype, order)];
> -	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
> +	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags,
> +				 pcp, list, zeroed);
>  	pcp_spin_unlock(pcp);
>  	if (page) {
>  		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
> @@ -3438,19 +3455,19 @@ static inline
>  struct page *rmqueue(struct zone *preferred_zone,
>  			struct zone *zone, unsigned int order,
>  			gfp_t gfp_flags, unsigned int alloc_flags,
> -			int migratetype)
> +			int migratetype, bool *zeroed)
>  {
>  	struct page *page;
>
>  	if (likely(pcp_allowed_order(order))) {
>  		page = rmqueue_pcplist(preferred_zone, zone, order,
> -				       migratetype, alloc_flags);
> +				       migratetype, alloc_flags, zeroed);
>  		if (likely(page))
>  			goto out;
>  	}
>
>  	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> -							migratetype);
> +			     migratetype, zeroed);
>
>  out:
>  	/* Separate test+clear to avoid unnecessary atomics */
> @@ -3841,6 +3858,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  	struct pglist_data *last_pgdat = NULL;
>  	bool last_pgdat_dirty_ok = false;
>  	bool no_fallback;
> +	bool zeroed;
>  	bool skip_kswapd_nodes = nr_online_nodes > 1;
>  	bool skipped_kswapd_nodes = false;
>
> @@ -3985,10 +4003,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>
>  try_this_zone:
>  		page = rmqueue(zonelist_zone(ac->preferred_zoneref), zone, order,
> -				gfp_mask, alloc_flags, ac->migratetype);
> +					gfp_mask, alloc_flags, ac->migratetype,
> +					&zeroed);
>  		if (page) {
>  			prep_new_page(page, order, gfp_mask, alloc_flags,
> -				      ac->user_addr);
> +				      zeroed, ac->user_addr);
>
>  			return page;
>  		} else {
> @@ -4215,9 +4234,11 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	count_vm_event(COMPACTSTALL);
>
>  	/* Prep a captured page if available */
> -	if (page)
> -		prep_new_page(page, order, gfp_mask, alloc_flags,
> +	if (page) {
> +		__ClearPageZeroed(page);
> +		prep_new_page(page, order, gfp_mask, alloc_flags, false,
>  			      ac->user_addr);
> +	}
>
>  	/* Try get a page from the freelist if available */
>  	if (!page)
> @@ -5190,6 +5211,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  	/* Attempt the batch allocation */
>  	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
>  	while (nr_populated < nr_pages) {
> +		bool zeroed = false;
>
>  		/* Skip existing pages */
>  		if (page_array[nr_populated]) {
> @@ -5198,7 +5220,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  		}
>
>  		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
> -								pcp, pcp_list);
> +					 pcp, pcp_list, &zeroed);
>  		if (unlikely(!page)) {
>  			/* Try and allocate at least one page */
>  			if (!nr_account) {
> @@ -5209,7 +5231,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  		}
>  		nr_account++;
>
> -		prep_new_page(page, 0, gfp, 0, USER_ADDR_NONE);
> +		prep_new_page(page, 0, gfp, 0, zeroed, USER_ADDR_NONE);
>  		set_page_refcounted(page);
>  		page_array[nr_populated++] = page;
>  	}
> @@ -6949,7 +6971,8 @@ static void split_free_frozen_pages(struct list_head *list, gfp_t gfp_mask)
>  		list_for_each_entry_safe(page, next, &list[order], lru) {
>  			int i;
>
> -			post_alloc_hook(page, order, gfp_mask, USER_ADDR_NONE);
> +			__ClearPageZeroed(page);
> +			post_alloc_hook(page, order, gfp_mask, false, USER_ADDR_NONE);
>  			if (!order)
>  				continue;
>
> @@ -7157,8 +7180,9 @@ static int __alloc_contig_frozen_range(unsigned long start, unsigned long end,
>  	} else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) {
>  		struct page *head = pfn_to_page(start);
>
> +		__ClearPageZeroed(head);
>  		check_new_pages(head, order);
> -		prep_new_page(head, order, gfp_mask, 0, user_addr);
> +		prep_new_page(head, order, gfp_mask, 0, false, user_addr);

A nit but I really hate these kinds of mystery meat booleans, which mean
you have to now go look up what this is.

One thing we use in mm quite a bit now is e.g. '/*zeroed=*/ false'. Though
some might say even having a boolean like this is a code smell in itself.
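
To illustrate the annotation convention, here is a minimal standalone sketch.
Every name in it is hypothetical (prep_page_sketch(), prep_page_enum_sketch()
and enum page_zero_state are made up for illustration, not kernel API). It only
contrasts an annotated bare boolean with a self-describing enum at the callsite:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a function that grew a bare bool parameter. */
static int prep_page_sketch(unsigned int order, bool zeroed)
{
	assert(order < 31);

	/* Return the number of base pages that still need zeroing. */
	return zeroed ? 0 : (1 << order);
}

/* Hypothetical alternative: an enum makes every callsite self-describing. */
enum page_zero_state {
	PAGE_NOT_HOST_ZEROED,
	PAGE_HOST_ZEROED,
};

static int prep_page_enum_sketch(unsigned int order, enum page_zero_state state)
{
	/* The inline comment names the argument where a bool is unavoidable. */
	return prep_page_sketch(order, /*zeroed=*/ state == PAGE_HOST_ZEROED);
}
```

With the enum, a call such as prep_page_enum_sketch(order, PAGE_HOST_ZEROED)
can be read without looking up the prototype. The annotated form is the
cheaper fix when the signature has to stay as it is.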

>  	} else {
>  		ret = -EINVAL;
>  		WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> index 5b6b17f67131..84ebc4547119 100644
> --- a/mm/page_reporting.c
> +++ b/mm/page_reporting.c
> @@ -50,6 +50,8 @@ EXPORT_SYMBOL_GPL(page_reporting_order);
>  #define PAGE_REPORTING_DELAY	(2 * HZ)
>  static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
>
> +DEFINE_STATIC_KEY_FALSE(page_reporting_host_zeroes);
> +
>  enum {
>  	PAGE_REPORTING_IDLE = 0,
>  	PAGE_REPORTING_REQUESTED,
> @@ -129,8 +131,11 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
>  		 * report on the new larger page when we make our way
>  		 * up to that higher order.
>  		 */
> -		if (PageBuddy(page) && buddy_order(page) == order)
> +		if (PageBuddy(page) && buddy_order(page) == order) {
>  			__SetPageReported(page);
> +			if (page_reporting_host_zeroes_pages())
> +				__SetPageZeroed(page);
> +		}
>  	} while ((sg = sg_next(sg)));
>
>  	/* reinitialize scatterlist now that it is empty */
> @@ -390,6 +395,10 @@ int page_reporting_register(struct page_reporting_dev_info *prdev)
>  	/* Assign device to allow notifications */
>  	rcu_assign_pointer(pr_dev_info, prdev);
>
> +	/* enable zeroed page optimization if host zeroes reported pages */
> +	if (prdev->host_zeroes_pages)
> +		static_branch_enable(&page_reporting_host_zeroes);
> +
>  	/* enable page reporting notification */
>  	if (!static_key_enabled(&page_reporting_enabled)) {
>  		static_branch_enable(&page_reporting_enabled);
> @@ -414,6 +423,9 @@ void page_reporting_unregister(struct page_reporting_dev_info *prdev)
>
>  		/* Flush any existing work, and lock it out */
>  		cancel_delayed_work_sync(&prdev->work);
> +
> +		if (prdev->host_zeroes_pages)
> +			static_branch_disable(&page_reporting_host_zeroes);
>  	}
>
>  	mutex_unlock(&page_reporting_mutex);
> diff --git a/mm/page_reporting.h b/mm/page_reporting.h
> index c51dbc228b94..736ea7b37e9e 100644
> --- a/mm/page_reporting.h
> +++ b/mm/page_reporting.h
> @@ -15,6 +15,13 @@ DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
>  extern unsigned int page_reporting_order;
>  void __page_reporting_notify(void);
>
> +DECLARE_STATIC_KEY_FALSE(page_reporting_host_zeroes);
> +
> +static inline bool page_reporting_host_zeroes_pages(void)
> +{
> +	return static_branch_unlikely(&page_reporting_host_zeroes);
> +}
> +
>  static inline bool page_reported(struct page *page)
>  {
>  	return static_branch_unlikely(&page_reporting_enabled) &&
> @@ -46,6 +53,11 @@ static inline void page_reporting_notify_free(unsigned int order)
>  #else /* CONFIG_PAGE_REPORTING */
>  #define page_reported(_page)	false
>
> +static inline bool page_reporting_host_zeroes_pages(void)
> +{
> +	return false;
> +}
> +
>  static inline void page_reporting_notify_free(unsigned int order)
>  {
>  }
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 22/37] mm: add free_frozen_pages_zeroed
  2026-06-08  8:38 ` [PATCH v10 22/37] mm: add free_frozen_pages_zeroed Michael S. Tsirkin
@ 2026-06-08 12:06   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 12:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:38:37AM -0400, Michael S. Tsirkin wrote:
> Add free_frozen_pages_zeroed(page, order) to free a frozen page
> while marking it as zeroed, so the next allocation can skip
> redundant zeroing.
>
> An FPI_ZEROED internal flag carries the hint through the free path.
> PageZeroed is set after __free_pages_prepare() clears all flags,
> so the hint survives on the free list.
>
> __SetPageZeroed is non-atomic but safe here: the page is frozen
> (refcount 0) and not yet on any free list.
>
> Note: when want_init_on_free() zeroes the page via
> kernel_init_pages(), the page is zero but the direct-map
> cache lines may be dirty. A later patch (skip
> kernel_init_pages for FPI_ZEROED) avoids the redundant
> re-zero, and post_alloc_hook handles the dcache flush
> for user pages on aliasing architectures.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> ---
>  include/linux/gfp.h |  1 +
>  mm/internal.h       |  1 +
>  mm/page_alloc.c     | 23 ++++++++++++++++++++++-
>  3 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 73109d4e31a4..d24b61e45861 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -384,6 +384,7 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
>  extern void __free_pages(struct page *page, unsigned int order);
>  extern void free_pages_nolock(struct page *page, unsigned int order);
>  extern void free_pages(unsigned long addr, unsigned int order);
> +void free_frozen_pages_zeroed(struct page *page, unsigned int order);
>
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr), 0)
> diff --git a/mm/internal.h b/mm/internal.h
> index 4af5e72742ba..fd910743ddc3 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -938,6 +938,7 @@ struct page *__alloc_frozen_pages_noprof(gfp_t, unsigned int order, int nid,
>  #define __alloc_frozen_pages(...) \
>  	alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
>  void free_frozen_pages(struct page *page, unsigned int order);
> +void free_frozen_pages_zeroed(struct page *page, unsigned int order);

This is badly named. That name implies you're freeing frozen, zeroed pages, not
that you're marking them zeroed.

And again, you're overloading 'zeroed' here. Be specific, it's about
host zeroing in virtualisation.

>  void free_unref_folios(struct folio_batch *fbatch);
>
>  #ifdef CONFIG_NUMA
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 21f9e92922f1..008f1a311c40 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -91,6 +91,13 @@ typedef int __bitwise fpi_t;
>  /* Free the page without taking locks. Rely on trylock only. */
>  #define FPI_TRYLOCK		((__force fpi_t)BIT(2))
>
> +/*
> + * The page contents are known to be zero (e.g., the host zeroed them
> + * during balloon deflate).  Set PageZeroed after free so the next

Can we just be specific that this is about VM hosts, I don't imagine that we are
going to ever have a use beyond that, and we can adjust the phrasing later if
needed.

Otherwise it's just confusing right now, you're overloading 'zeroed' to mean
different things and we do that enough in mm already.

> + * allocation can skip redundant zeroing.
> + */
> +#define FPI_ZEROED		((__force fpi_t)BIT(3))

Hmm now we have another flag to propagate this around... this is messy.

Now we have multiple different ways of representing this state... ugh.

> +
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> @@ -1596,8 +1603,12 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>  	unsigned long pfn = page_to_pfn(page);
>  	struct zone *zone = page_zone(page);
>
> -	if (__free_pages_prepare(page, order, fpi_flags))
> +	if (__free_pages_prepare(page, order, fpi_flags)) {
> +		/* Don't mark zeroed if poison overwrote with 0xAA. */

Can we not reference arbitrary values in comments? And this comment seems
redundant.

> +		if ((fpi_flags & FPI_ZEROED) && !page_poisoning_enabled_static())
> +			__SetPageZeroed(page);
>  		free_one_page(zone, page, pfn, order, fpi_flags);
> +	}
>  }
>
>  void __meminit __free_pages_core(struct page *page, unsigned int order,
> @@ -3020,6 +3031,10 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
>  	if (!__free_pages_prepare(page, order, fpi_flags))
>  		return;
>
> +	/* Don't mark zeroed if poison overwrote with 0xAA. */

Same comment as above.

> +	if ((fpi_flags & FPI_ZEROED) && !page_poisoning_enabled_static())
> +		__SetPageZeroed(page);
> +
>  	/*
>  	 * We only track unmovable, reclaimable and movable on pcp lists.
>  	 * Place ISOLATE pages on the isolated list because they are being
> @@ -3058,6 +3073,12 @@ void free_frozen_pages(struct page *page, unsigned int order)
>  	__free_frozen_pages(page, order, FPI_NONE);
>  }
>

No comment describing this? kdoc please.

> +void free_frozen_pages_zeroed(struct page *page, unsigned int order)
> +{
> +	__free_frozen_pages(page, order, FPI_ZEROED);
> +}
> +EXPORT_SYMBOL(free_frozen_pages_zeroed);

Do we have to use EXPORT_SYMBOL()? Why not EXPORT_SYMBOL_GPL()?

> +
>  void free_frozen_pages_nolock(struct page *page, unsigned int order)
>  {
>  	__free_frozen_pages(page, order, FPI_TRYLOCK);
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 23/37] mm: page_alloc: skip kernel_init_pages for FPI_ZEROED when safe
  2026-06-08  8:38 ` [PATCH v10 23/37] mm: page_alloc: skip kernel_init_pages for FPI_ZEROED when safe Michael S. Tsirkin
@ 2026-06-08 12:18   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 12:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:38:46AM -0400, Michael S. Tsirkin wrote:
> In __free_pages_prepare(), when FPI_ZEROED is set the page is already
> known to be zero. We can skip kernel_init_pages() if page poisoning is
> not enabled (because poison would overwrite the zeroes).
>
> This avoids redundant zeroing work when freeing pages that are already
> known to contain all zeros.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> ---
>  mm/page_alloc.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 008f1a311c40..e3a7c40c769c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1443,7 +1443,14 @@ __always_inline bool __free_pages_prepare(struct page *page,
>  		if (kasan_has_integrated_init())
>  			init = false;
>  	}
> -	if (init)
> +	/*
> +	 * Skip redundant zeroing when the page is already known-zero
> +	 * (FPI_ZEROED) and page poisoning did not overwrite it.
> +	 * When page_poisoning is enabled, kernel_poison_pages above
> +	 * wrote PAGE_POISON (0xAA), so we must re-zero.
> +	 */

Again, please stop specifying arbitrary hex values in comments, this seems
mostly 'describing what we do here'.

Maybe drop to just e.g.:

/* if poisoned or not zeroed by a virtualised host, zero now. */

or suchlike?

> +	if (init && !((fpi_flags & FPI_ZEROED) &&
> +		      !page_poisoning_enabled_static()))

This condition is absolutely horrible, !(X && !Y), you're making life difficult
for the readers.

'if not both zeroed and not poisoned' is how that reads logically. Which is hard
to understand.

De Morgan's law gives us -> !zeroed || poisoned

How about:

	if (init && (!(fpi_flags & FPI_ZEROED) || page_poisoning_enabled_static()))

Or preferably something like:

	const bool poisoned = page_poisoning_enabled_static();
	const bool vm_host_zeroed = fpi_flags & FPI_ZEROED;

	...

	if (init && (poisoned || !vm_host_zeroed))
		...

?
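
As a sanity check that the rewrite preserves behaviour, here is a small
userspace sketch that compares the original and the De Morgan form over all
eight input combinations. The function names are hypothetical, and the three
booleans stand in for init, (fpi_flags & FPI_ZEROED) and
page_poisoning_enabled_static() respectively:

```c
#include <assert.h>
#include <stdbool.h>

/* The condition as written in the patch: init && !(zeroed && !poisoned). */
static bool need_init_original(bool init, bool zeroed, bool poisoned)
{
	return init && !(zeroed && !poisoned);
}

/* The suggested form after De Morgan: init && (poisoned || !zeroed). */
static bool need_init_rewritten(bool init, bool zeroed, bool poisoned)
{
	return init && (poisoned || !zeroed);
}

/* Returns true if both forms agree for every combination of inputs. */
static bool forms_agree(void)
{
	for (int i = 0; i < 8; i++) {
		bool init = i & 1, zeroed = i & 2, poisoned = i & 4;

		if (need_init_original(init, zeroed, poisoned) !=
		    need_init_rewritten(init, zeroed, poisoned))
			return false;
	}
	return true;
}
```

The only case where zeroing is skipped while init is set is zeroed && !poisoned,
which is what the rewritten form states directly.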


>  		kernel_init_pages(page, 1 << order);
>
>  	/*
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed
  2026-06-08  8:38 ` [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed Michael S. Tsirkin
@ 2026-06-08 12:25   ` Lorenzo Stoakes
  2026-06-08 12:46     ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 12:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:38:54AM -0400, Michael S. Tsirkin wrote:
> Add put_page_zeroed() / folio_put_zeroed() for callers that hold
> a reference to a page known to be zeroed.
>
> If this drops the last reference, the zeroed hint is
> propagated to the buddy allocator.  If someone else still holds a
> reference, the hint is simply lost - this is best-effort.
>
> This is useful for balloon drivers during deflation: the host
> has already zeroed the pages, and the balloon is typically the
> sole owner.  But if the page happens to be shared, silently
> dropping the hint is safe and avoids the need for callers to
> check the refcount.
>
> Note: put_page_zeroed uses folio_put_testzero() which only
> detects sole ownership at the instant of the atomic decrement.
> A concurrent reference holder (e.g. migration) means the hint
> is silently lost. This is by design: the zeroed hint is a
> performance optimization, not a correctness requirement.
> Losing it just means the next allocation re-zeroes the page.

Do not describe specific expected races like this only in the commit
message; they need to be in the code too. Subtleties need to be called out.

The commit message also doesn't at all explain why PG_zeroed doesn't
suffice here.
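
For reference, the best-effort behaviour the commit message describes can be
modelled in userspace with C11 atomics. This is only a sketch under stated
assumptions: struct obj_sketch and the *_sketch() helpers are hypothetical
stand-ins for the folio refcount and the put paths, not the kernel
implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted folio. */
struct obj_sketch {
	atomic_int refcount;
	bool hint_zeroed;	/* only ever written by whoever drops the last ref */
};

/* Drop one reference; returns true if this was the last one. */
static bool put_testzero_sketch(struct obj_sketch *o)
{
	int old = atomic_fetch_sub(&o->refcount, 1);

	assert(old > 0);
	return old == 1;
}

/* Put with a "known zero" hint: the hint survives only on the final put. */
static void put_zeroed_sketch(struct obj_sketch *o)
{
	if (put_testzero_sketch(o))
		o->hint_zeroed = true;
	/* Otherwise another holder remains and the hint is silently dropped. */
}

/* Plain put: the final put frees without any hint. */
static void put_sketch(struct obj_sketch *o)
{
	if (put_testzero_sketch(o))
		o->hint_zeroed = false;
}
```

If a second holder exists when put_zeroed_sketch() runs, the later plain put
is the one that frees the object, so the hint never reaches the free path.
That is the subtlety that would need a comment next to the code.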

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6

I really don't understand why you have a 'zeroed' folio flag but need to
also have new API calls to detect that?

They're also HORRIBLY named. Zeroed as in what? Zero page? Huge zero page?
Memory zeroed by kernel? Pages that userland happen to have zeroed? Or host
VM zeroed?

Each of these is a case we address individually, and each relates to folios.

You absolutely fail to clarify _which one_ you mean, and provide absolutely
no documentation and add an exported mm API with no description.

I just don't think this is something we want to add? Especially on something
so fundamental?

> ---
>  include/linux/mm.h | 13 +++++++++++++
>  mm/swap.c          | 20 ++++++++++++++++++--
>  2 files changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 06bbe9eba636..79b3a8cb9a3b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1913,6 +1913,7 @@ static inline struct folio *virt_to_folio(const void *x)
>  }
>
>  void __folio_put(struct folio *folio);
> +void __folio_put_zeroed(struct folio *folio);
>
>  void split_page(struct page *page, unsigned int order);
>  void folio_copy(struct folio *dst, struct folio *src);
> @@ -2090,6 +2091,18 @@ static inline void folio_put(struct folio *folio)
>  		__folio_put(folio);
>  }
>
> +/* Caller must be sole owner to guarantee page is still zero */
> +static inline void folio_put_zeroed(struct folio *folio)
> +{
> +	if (folio_put_testzero(folio))
> +		__folio_put_zeroed(folio);
> +}
> +
> +static inline void put_page_zeroed(struct page *page)
> +{
> +	folio_put_zeroed(page_folio(page));
> +}
> +

Please stop adding more APIs to mm without kdocs. This just isn't
acceptable.

>  /**
>   * folio_put_refs - Reduce the reference count on a folio.
>   * @folio: The folio.
> diff --git a/mm/swap.c b/mm/swap.c
> index 5cc44f0de987..ecec780172ad 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -94,13 +94,15 @@ static void page_cache_release(struct folio *folio)
>  		lruvec_unlock_irqrestore(lruvec, flags);
>  }
>
> -void __folio_put(struct folio *folio)
> +static void ___folio_put(struct folio *folio, bool zeroed)
>  {
> +	/* zeroed hint ignored for now, no current user */

Please don't add comments about why you didn't do something that nobody
here knows about with no context.

If you want to say something about this, make it clear. This is so succinct
it's utterly meaningless.

>  	if (unlikely(folio_is_zone_device(folio))) {
>  		free_zone_device_folio(folio);
>  		return;
>  	}
>
> +	/* zeroed hint ignored for now, no current user */
>  	if (folio_test_hugetlb(folio)) {
>  		free_huge_folio(folio);
>  		return;
> @@ -109,10 +111,24 @@ void __folio_put(struct folio *folio)
>  	page_cache_release(folio);
>  	folio_unqueue_deferred_split(folio);
>  	mem_cgroup_uncharge(folio);
> -	free_frozen_pages(&folio->page, folio_order(folio));
> +	if (zeroed)
> +		free_frozen_pages_zeroed(&folio->page, folio_order(folio));
> +	else
> +		free_frozen_pages(&folio->page, folio_order(folio));
> +}
> +
> +void __folio_put(struct folio *folio)
> +{
> +	___folio_put(folio, false);
>  }
>  EXPORT_SYMBOL(__folio_put);
>

No documentation again...

> +void __folio_put_zeroed(struct folio *folio)
> +{
> +	___folio_put(folio, true);
> +}
> +EXPORT_SYMBOL(__folio_put_zeroed);
> +
>  typedef void (*move_fn_t)(struct lruvec *lruvec, struct folio *folio);
>
>  static void lru_add(struct lruvec *lruvec, struct folio *folio)
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 25/37] mm: use __GFP_ZERO in alloc_anon_folio
  2026-06-08  8:39 ` [PATCH v10 25/37] mm: use __GFP_ZERO in alloc_anon_folio Michael S. Tsirkin
@ 2026-06-08 12:29   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 12:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:39:02AM -0400, Michael S. Tsirkin wrote:
> Convert alloc_anon_folio() to pass __GFP_ZERO instead of zeroing
> at the callsite. post_alloc_hook uses the fault address passed
> through vma_alloc_folio for cache-friendly zeroing.
>
> Note: before this series, replacing clear_user_highpage() with
> __GFP_ZERO was unsafe on cache-aliasing architectures because
> __GFP_ZERO uses clear_page() without a dcache flush. With this
> series, it is safe if the caller passes a valid user address
> (not USER_ADDR_NONE) to vma_alloc_folio() etc., which delivers
> it to post_alloc_hook() for the dcache flush via
> folio_zero_user(). It is only unsafe if USER_ADDR_NONE is passed.
>
> Note: with __GFP_ZERO, the folio is zeroed before
> mem_cgroup_charge().  If the charge fails, the zeroing work is
> wasted.  Previously zeroing was done after a successful charge.
> This is inherent to moving zeroing into the allocator.
> Charge failures are rare (only at cgroup limits).
>
> Use folio_put_zeroed() on charge failure so the zeroed hint
> propagates to the buddy allocator, avoiding redundant re-zeroing
> on the next allocation attempt.

Is this even worth the effort? This is surely not a hotpath...

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Assisted-by: Claude:claude-opus-4-6
> ---
>  mm/memory.c | 13 ++-----------
>  1 file changed, 2 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 6c14b90f558e..6d6a3e1a02c1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5265,25 +5265,16 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  		goto fallback;
>
>  	/* Try allocating the highest of the remaining orders. */
> -	gfp = vma_thp_gfp_mask(vma);
> +	gfp = vma_thp_gfp_mask(vma) | __GFP_ZERO;
>  	while (orders) {
>  		folio = vma_alloc_folio(gfp, order, vma, vmf->address);
>  		if (folio) {
>  			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
>  				count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> -				folio_put(folio);
> +				folio_put_zeroed(folio);

You just allocated the folio as zeroed above, should PG_zeroed not be set thus
making it unnecessary to add folio_put_zeroed()?

>  				goto next;
>  			}
>  			folio_throttle_swaprate(folio, gfp);
> -			/*
> -			 * When a folio is not zeroed during allocation
> -			 * (__GFP_ZERO not used) or user folios require special
> -			 * handling, folio_zero_user() is used to make sure
> -			 * that the page corresponding to the faulting address
> -			 * will be hot in the cache after zeroing.
> -			 */
> -			if (user_alloc_needs_zeroing())
> -				folio_zero_user(folio, vmf->address);
>  			return folio;
>  		}
>  next:
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 26/37] mm: vma_alloc_anon_folio_pmd: pass raw fault address to vma_alloc_folio
  2026-06-08  8:39 ` [PATCH v10 26/37] mm: vma_alloc_anon_folio_pmd: pass raw fault address to vma_alloc_folio Michael S. Tsirkin
@ 2026-06-08 12:30   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 12:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:39:10AM -0400, Michael S. Tsirkin wrote:
> Drop the redundant HPAGE_PMD_MASK alignment at the callsite.
> NUMA interleave is not affected by the raw address; the ilx
> calculation shifts addr >> PAGE_SHIFT >> order, dropping
> sub-page bits regardless of alignment. post_alloc_hook will
> use the raw address for cache-friendly zeroing.

But then what's the point in this change?

And why are we changing what we pass in this parameter but not the
vma_alloc_folio() kdoc?
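
For what it's worth, the arithmetic claim in the commit message can be checked
in isolation with the sketch below. It assumes 4 KiB base pages and an order-9
PMD folio, and it models only the addr >> PAGE_SHIFT >> order shift quoted
above. The real interleave index calculation may involve other terms (for
example an offset relative to the VMA), which this does not cover. All names
are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_SKETCH	12	/* assumes 4 KiB base pages */
#define PMD_ORDER_SKETCH	9	/* assumes 2 MiB PMD-sized folios */
#define PMD_MASK_SKETCH \
	(~((UINT64_C(1) << (PAGE_SHIFT_SKETCH + PMD_ORDER_SKETCH)) - 1))

/* Models only the shift from the commit message, nothing else. */
static uint64_t index_sketch(uint64_t addr, unsigned int order)
{
	assert(order < 52);
	return addr >> PAGE_SHIFT_SKETCH >> order;
}
```

For any addr, index_sketch(addr, PMD_ORDER_SKETCH) equals
index_sketch(addr & PMD_MASK_SKETCH, PMD_ORDER_SKETCH), because the mask only
clears bits that the shift discards anyway.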

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Reviewed-by: Gregory Price <gourry@gourry.net>
> ---
>  mm/huge_memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 970e077019b7..d689e6491ddb 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1337,7 +1337,7 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
>  	const int order = HPAGE_PMD_ORDER;
>  	struct folio *folio;
>
> -	folio = vma_alloc_folio(gfp, order, vma, addr & HPAGE_PMD_MASK);
> +	folio = vma_alloc_folio(gfp, order, vma, addr);
>
>  	if (unlikely(!folio)) {
>  		count_vm_event(THP_FAULT_FALLBACK);
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 27/37] mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd
  2026-06-08  8:39 ` [PATCH v10 27/37] mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd Michael S. Tsirkin
@ 2026-06-08 12:32   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 12:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:39:18AM -0400, Michael S. Tsirkin wrote:
> Convert vma_alloc_anon_folio_pmd() to pass __GFP_ZERO instead of
> zeroing at the callsite. post_alloc_hook uses the fault address
> passed through vma_alloc_folio for cache-friendly zeroing.
>
> Note: before this series, replacing folio_zero_user() with
> __GFP_ZERO was unsafe on cache-aliasing architectures because
> __GFP_ZERO uses clear_page() without a dcache flush. With this
> series, it is safe if the caller passes a valid user address
> (not USER_ADDR_NONE) to vma_alloc_folio() etc., which delivers
> it to post_alloc_hook() for the dcache flush via
> folio_zero_user(). It is only unsafe if USER_ADDR_NONE is passed.
>
> Note: with __GFP_ZERO, the folio is zeroed before
> mem_cgroup_charge().  If the charge fails, the zeroing work is
> wasted.  Previously zeroing was done after a successful charge.
> This is inherent to moving zeroing into the allocator.
> Charge failures are rare (only at cgroup limits).
>
> Use folio_put_zeroed() on charge failure so the zeroed hint
> propagates to the buddy allocator, avoiding redundant re-zeroing
> on the next allocation attempt.

Again, is this worth it?...

Every bit of code added increases the risk of bugs, the maintenance burden,
etc. Let's not do stuff just because we can.

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Assisted-by: Claude:claude-opus-4-6
> ---
>  mm/huge_memory.c | 14 +++-----------
>  1 file changed, 3 insertions(+), 11 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d689e6491ddb..0dec3c717ff2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1333,7 +1333,7 @@ EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
>  static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
>  		unsigned long addr)
>  {
> -	gfp_t gfp = vma_thp_gfp_mask(vma);
> +	gfp_t gfp = vma_thp_gfp_mask(vma) | __GFP_ZERO;
>  	const int order = HPAGE_PMD_ORDER;
>  	struct folio *folio;
>
> @@ -1347,7 +1347,7 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
>
>  	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>  	if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
> -		folio_put(folio);
> +		folio_put_zeroed(folio);

Same comments as previously.

>  		count_vm_event(THP_FAULT_FALLBACK);
>  		count_vm_event(THP_FAULT_FALLBACK_CHARGE);
>  		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
> @@ -1356,17 +1356,9 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
>  	}
>  	folio_throttle_swaprate(folio, gfp);
>
> -       /*
> -	* When a folio is not zeroed during allocation (__GFP_ZERO not used)
> -	* or user folios require special handling, folio_zero_user() is used to
> -	* make sure that the page corresponding to the faulting address will be
> -	* hot in the cache after zeroing.
> -	*/
> -	if (user_alloc_needs_zeroing())
> -		folio_zero_user(folio, addr);
>  	/*
>  	 * The memory barrier inside __folio_mark_uptodate makes sure that
> -	 * folio_zero_user writes become visible before the set_pmd_at()
> +	 * page zeroing becomes visible before the set_pmd_at()

folio zeroing?

>  	 * write.
>  	 */
>  	__folio_mark_uptodate(folio);
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 28/37] mm: hugetlb: add gfp parameter and skip zeroing for zeroed pages
  2026-06-08  8:39 ` [PATCH v10 28/37] mm: hugetlb: add gfp parameter and skip zeroing for zeroed pages Michael S. Tsirkin
@ 2026-06-08 12:44   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 12:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:39:26AM -0400, Michael S. Tsirkin wrote:
> Add a gfp_t parameter to alloc_hugetlb_folio(). When __GFP_ZERO
> is set, the function guarantees the returned folio is zeroed:
> - Fresh allocations (buddy or gigantic): zeroed by
>   post_alloc_hook via __GFP_ZERO, HPG_zeroed set by
>   alloc_surplus_hugetlb_folio.
> - Pool pages with HPG_zeroed set: already zeroed, skip.
> - Pool pages without HPG_zeroed: zeroed via folio_zero_user().
>
> The address parameter is renamed to user_addr; the function
> aligns it internally for reservation and NUMA policy lookups.
> For pages that need zeroing, user_addr is passed to
> folio_zero_user() for cache-friendly zeroing near the faulting
> subpage.  All callers pass a page-aligned address; the
> hugetlb_no_page caller passes vmf->real_address & PAGE_MASK
> for consistency.
>
> HPG_zeroed (stored in hugetlb folio->private bits) tracks
> known-zero pool pages. It is set when alloc_surplus_hugetlb_folio
> allocates with __GFP_ZERO, and cleared in free_huge_folio when
> the page returns to the pool after userspace use.
>
> Note: for gigantic CMA pages, __GFP_ZERO is passed through
> to cma_alloc_frozen_compound() via its caller_gfp parameter,
> so the pages ARE zeroed by the allocator. HPG_zeroed is only
> set when __GFP_ZERO was in the original gfp_mask.
> Pool pages allocated without __GFP_ZERO (e.g. by
> alloc_pool_huge_folio) do not get HPG_zeroed; they are zeroed
> later by folio_zero_user() at fault time.
>
> Note: with __GFP_ZERO, the folio is zeroed before
> mem_cgroup_charge_hugetlb().  If the charge fails, the zeroed
> folio is freed back.  Before this patch it is zeroed after charge, so
> simply freeing after zeroing would be a regression.  Thread a
> zeroed hint through free_huge_folio so surplus pages freed back
> to buddy preserve the zeroed state via free_frozen_pages_zeroed,
> avoiding redundant re-zeroing on the next allocation.
>
> Suggested-by: Gregory Price <gourry@gourry.net>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh

This whole set of hugetlb changes should be a separate series. Or really
deferred.

You very badly need to pare this down into the _minimum_ changes required rather
than AI go brrrr on everything.

And propagating _yet more_ 'zeroed' state seems unpleasant, do we have to?

> ---
>  fs/hugetlbfs/inode.c    |  3 +-
>  include/linux/hugetlb.h |  5 ++-
>  mm/hugetlb.c            | 78 +++++++++++++++++++++++++++--------------
>  3 files changed, 57 insertions(+), 29 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 78d61bf2bd9b..2c0c51fe9ec3 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -790,13 +790,12 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  		 * folios in these areas, we need to consume the reserves
>  		 * to keep reservation accounting consistent.
>  		 */
> -		folio = alloc_hugetlb_folio(&pseudo_vma, addr, false);
> +		folio = alloc_hugetlb_folio(&pseudo_vma, addr, false, __GFP_ZERO);
>  		if (IS_ERR(folio)) {
>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  			error = PTR_ERR(folio);
>  			goto out;
>  		}
> -		folio_zero_user(folio, addr);
>  		__folio_mark_uptodate(folio);
>  		error = hugetlb_add_to_page_cache(folio, mapping, index);
>  		if (unlikely(error)) {
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1f7ae6609e51..06d033a57a61 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -593,6 +593,7 @@ enum hugetlb_page_flags {
>  	HPG_vmemmap_optimized,
>  	HPG_raw_hwp_unreliable,
>  	HPG_cma,
> +	HPG_zeroed,
>  	__NR_HPAGEFLAGS,
>  };
>
> @@ -653,6 +654,7 @@ HPAGEFLAG(Freed, freed)
>  HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
>  HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)
>  HPAGEFLAG(Cma, cma)
> +HPAGEFLAG(Zeroed, zeroed)

Same comments about this naming as elsewhere. Let's be specific about what
_kind_ of zeroing this is.

>
>  #ifdef CONFIG_HUGETLB_PAGE
>
> @@ -700,7 +702,8 @@ int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list);
>  int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
>  void wait_for_freed_hugetlb_folios(void);
>  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> -				unsigned long addr, bool cow_from_owner);
> +				unsigned long user_addr, bool cow_from_owner,
> +				gfp_t gfp);

You already started calling into this user_addr stuff in an earlier patch I
believe, so why not rename it then...

>  struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
>  				nodemask_t *nmask, gfp_t gfp_mask,
>  				bool allow_alloc_fallback);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5d7e546565f5..ed00db703911 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1455,7 +1455,8 @@ void add_hugetlb_folio(struct hstate *h, struct folio *folio,
>  }
>
>  static void __update_and_free_hugetlb_folio(struct hstate *h,
> -						struct folio *folio)
> +						struct folio *folio,
> +						bool zeroed)
>  {
>  	bool clear_flag = folio_test_hugetlb_vmemmap_optimized(folio);
>
> @@ -1506,6 +1507,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>  	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
>  	if (folio_test_hugetlb_cma(folio))
>  		hugetlb_cma_free_frozen_folio(folio);
> +	else if (zeroed)
> +		free_frozen_pages_zeroed(&folio->page, folio_order(folio));
>  	else
>  		free_frozen_pages(&folio->page, folio_order(folio));
>  }
> @@ -1545,7 +1548,7 @@ static void free_hpage_workfn(struct work_struct *work)
>  		 */
>  		h = size_to_hstate(folio_size(folio));
>
> -		__update_and_free_hugetlb_folio(h, folio);
> +		__update_and_free_hugetlb_folio(h, folio, false);
>
>  		cond_resched();
>  	}
> @@ -1559,10 +1562,10 @@ static inline void flush_free_hpage_work(struct hstate *h)
>  }
>
>  static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
> -				 bool atomic)
> +				 bool atomic, bool zeroed)
>  {
>  	if (!folio_test_hugetlb_vmemmap_optimized(folio) || !atomic) {
> -		__update_and_free_hugetlb_folio(h, folio);
> +		__update_and_free_hugetlb_folio(h, folio, zeroed);
>  		return;
>  	}
>
> @@ -1596,7 +1599,7 @@ static void bulk_vmemmap_restore_error(struct hstate *h,
>  			spin_lock_irq(&hugetlb_lock);
>  			__folio_clear_hugetlb(folio);
>  			spin_unlock_irq(&hugetlb_lock);
> -			update_and_free_hugetlb_folio(h, folio, false);
> +			update_and_free_hugetlb_folio(h, folio, false, false);
>  			cond_resched();
>  		}
>  	} else {
> @@ -1621,7 +1624,7 @@ static void bulk_vmemmap_restore_error(struct hstate *h,
>  				spin_lock_irq(&hugetlb_lock);
>  				__folio_clear_hugetlb(folio);
>  				spin_unlock_irq(&hugetlb_lock);
> -				update_and_free_hugetlb_folio(h, folio, false);
> +				update_and_free_hugetlb_folio(h, folio, false, false);
>  				cond_resched();
>  				break;
>  			}
> @@ -1664,7 +1667,7 @@ static void update_and_free_pages_bulk(struct hstate *h,
>  	}
>
>  	list_for_each_entry_safe(folio, t_folio, &non_hvo_folios, lru) {
> -		update_and_free_hugetlb_folio(h, folio, false);
> +		update_and_free_hugetlb_folio(h, folio, false, false);
>  		cond_resched();
>  	}
>  }
> @@ -1680,7 +1683,7 @@ struct hstate *size_to_hstate(unsigned long size)
>  	return NULL;
>  }
>
> -void free_huge_folio(struct folio *folio)
> +static void __free_huge_folio(struct folio *folio, bool zeroed)

This zeroed flag seems to be used for both hugetlb zeroed flag state and
__GFP_ZERO?

What does it mean? Can we be specific and comment? Because it's very confusing.

>  {
>  	/*
>  	 * Can't pass hstate in here because it is called from the
> @@ -1692,6 +1695,9 @@ void free_huge_folio(struct folio *folio)
>  	bool restore_reserve;
>  	unsigned long flags;
>
> +	/* Page was mapped to userspace; no longer known-zero */

Again, please be specific about the kind of zeroing.

> +	folio_clear_hugetlb_zeroed(folio);
> +
>  	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
>  	VM_BUG_ON_FOLIO(folio_mapcount(folio), folio);
>
> @@ -1735,12 +1741,12 @@ void free_huge_folio(struct folio *folio)
>  	if (folio_test_hugetlb_temporary(folio)) {
>  		remove_hugetlb_folio(h, folio, false);
>  		spin_unlock_irqrestore(&hugetlb_lock, flags);
> -		update_and_free_hugetlb_folio(h, folio, true);
> +		update_and_free_hugetlb_folio(h, folio, true, zeroed);
>  	} else if (h->surplus_huge_pages_node[nid]) {
>  		/* remove the page from active list */
>  		remove_hugetlb_folio(h, folio, true);
>  		spin_unlock_irqrestore(&hugetlb_lock, flags);
> -		update_and_free_hugetlb_folio(h, folio, true);
> +		update_and_free_hugetlb_folio(h, folio, true, zeroed);
>  	} else {
>  		arch_clear_hugetlb_flags(folio);
>  		enqueue_hugetlb_folio(h, folio);
> @@ -1748,6 +1754,11 @@ void free_huge_folio(struct folio *folio)
>  	}
>  }
>
> +void free_huge_folio(struct folio *folio)

_Please_ be specific about hugetlb in newer functions. 'Huge' is very
overloaded.

And again you're doing the pattern of adding various 'zeroed' state, but then
_also_ adding a specific flag for hinting zeroed state.

I don't understand why we need both and you're adding complexity here for that.

> +{
> +	__free_huge_folio(folio, false);
> +}
> +
>  /*
>   * Must be called with the hugetlb lock held
>   */
> @@ -2031,7 +2042,7 @@ int dissolve_free_hugetlb_folio(struct folio *folio)
>  			rc = 0;
>  		}
>
> -		update_and_free_hugetlb_folio(h, folio, false);
> +		update_and_free_hugetlb_folio(h, folio, false, false);

More mystery meat boolean flags.
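
To illustrate the readability concern with call sites such as
"(h, folio, false, false)", here is one possible alternative shape,
sketched in user space. Every name below (demo_free_flags, demo_decode,
and so on) is hypothetical and does not come from the patch or from the
kernel; the two bits simply mirror the patch's "atomic" and "zeroed"
arguments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag names, for illustration only. */
enum demo_free_flags {
	DEMO_FREE_DEFAULT	= 0,
	DEMO_FREE_ATOMIC	= 1 << 0,	/* mirrors the "atomic" argument */
	DEMO_FREE_KNOWN_ZERO	= 1 << 1,	/* mirrors the "zeroed" argument */
};

struct demo_free_request {
	bool atomic;
	bool known_zero;
};

/* Decode a flags word into the two booleans the callee acts on. */
static inline struct demo_free_request demo_decode(unsigned int flags)
{
	struct demo_free_request req;

	/* Reject bits this sketch does not define. */
	assert((flags & ~(unsigned int)(DEMO_FREE_ATOMIC |
					DEMO_FREE_KNOWN_ZERO)) == 0);

	req.atomic = (flags & DEMO_FREE_ATOMIC) != 0;
	req.known_zero = (flags & DEMO_FREE_KNOWN_ZERO) != 0;
	return req;
}
```

A call site would then read demo_decode(DEMO_FREE_DEFAULT) or
demo_decode(DEMO_FREE_ATOMIC | DEMO_FREE_KNOWN_ZERO), which says what is
being requested without consulting the prototype.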

>  		return rc;
>  	}
>  out:
> @@ -2093,6 +2104,10 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
>  	if (!folio)
>  		return NULL;
>
> +	/* Mark as known-zero only if __GFP_ZERO was requested */

This comment is redundant and underspecified.

> +	if (gfp_mask & __GFP_ZERO)
> +		folio_set_hugetlb_zeroed(folio);

So now we're marking zeroed even in cases where it's not the host VM zeroing?

Is this useful?

> +
>  	spin_lock_irq(&hugetlb_lock);
>  	/*
>  	 * nr_huge_pages needs to be adjusted within the same lock cycle
> @@ -2156,11 +2171,11 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
>   */
>  static
>  struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
> -		struct vm_area_struct *vma, unsigned long addr)
> +		struct vm_area_struct *vma, unsigned long addr, gfp_t gfp)

Can we not propagate arbitrary GFP flags if we can avoid it?

>  {
>  	struct folio *folio = NULL;
>  	struct mempolicy *mpol;
> -	gfp_t gfp_mask = htlb_alloc_mask(h);
> +	gfp_t gfp_mask = htlb_alloc_mask(h) | gfp;
>  	int nid;
>  	nodemask_t *nodemask;
>
> @@ -2715,7 +2730,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
>  		 * Folio has been replaced, we can safely free the old one.
>  		 */
>  		spin_unlock_irq(&hugetlb_lock);
> -		update_and_free_hugetlb_folio(h, old_folio, false);
> +		update_and_free_hugetlb_folio(h, old_folio, false, false);
>  	}
>
>  	return ret;
> @@ -2723,7 +2738,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
>  free_new:
>  	spin_unlock_irq(&hugetlb_lock);
>  	if (new_folio)
> -		update_and_free_hugetlb_folio(h, new_folio, false);
> +		update_and_free_hugetlb_folio(h, new_folio, false, false);
>
>  	return ret;
>  }
> @@ -2857,16 +2872,19 @@ typedef enum {
>   * When it's set, the allocation will bypass all vma level reservations.
>   */
>  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> -				    unsigned long addr, bool cow_from_owner)
> +				    unsigned long user_addr, bool cow_from_owner,
> +				    gfp_t gfp)
>  {
>  	struct hugepage_subpool *spool = subpool_vma(vma);
>  	struct hstate *h = hstate_vma(vma);
> +	unsigned long addr = user_addr & huge_page_mask(h);
>  	struct folio *folio;
>  	long retval, gbl_chg, gbl_reserve;
>  	map_chg_state map_chg;
>  	int ret, idx;
>  	struct hugetlb_cgroup *h_cg = NULL;
> -	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> +
> +	gfp |= htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
>
>  	idx = hstate_index(h);
>
> @@ -2934,13 +2952,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
>  	if (!folio) {
>  		spin_unlock_irq(&hugetlb_lock);
> -		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
> +		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, user_addr, gfp);
>  		if (!folio)
>  			goto out_uncharge_cgroup;
>  		spin_lock_irq(&hugetlb_lock);
>  		list_add(&folio->lru, &h->hugepage_activelist);
>  		folio_ref_unfreeze(folio, 1);
> -		/* Fall through */

Why are you dropping this?

>  	}
>
>  	/*
> @@ -2963,6 +2980,10 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>
>  	spin_unlock_irq(&hugetlb_lock);
>
> +	if ((gfp & __GFP_ZERO) && !folio_test_hugetlb_zeroed(folio))
> +		folio_zero_user(folio, user_addr);
> +	folio_clear_hugetlb_zeroed(folio);

So this represents general zeroed state (not just host VM, based on how you set
it above), but you also clear it when you just zeroed the folio? I'm confused.

What does this flag actually mean?
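
As an aid to following the question, below is a user-space model of how
the quoted hunks appear to use the flag: set on a fresh allocation that
requested zeroing, consulted at hand-out time to skip a second pass, then
cleared. It is one reading of the diff and not a statement of the
author's intent. struct demo_folio, DEMO_GFP_ZERO and the demo_*
functions are stand-ins invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_GFP_ZERO 0x1u	/* stand-in for __GFP_ZERO */

struct demo_folio {
	unsigned char data[64];
	bool hpg_zeroed;		/* stand-in for HPG_zeroed */
	unsigned int zero_passes;	/* counts how often we zeroed */
};

static inline void demo_zero(struct demo_folio *f)
{
	assert(f != NULL);
	memset(f->data, 0, sizeof(f->data));
	f->zero_passes++;
}

/* Fresh allocation: zero if asked, and record that it happened. */
static inline void demo_alloc_fresh(struct demo_folio *f, unsigned int gfp)
{
	f->hpg_zeroed = false;
	if (gfp & DEMO_GFP_ZERO) {
		demo_zero(f);
		f->hpg_zeroed = true;
	}
}

/* Hand-out: zero only if requested and not known-zero, then drop the flag. */
static inline void demo_hand_out(struct demo_folio *f, unsigned int gfp)
{
	if ((gfp & DEMO_GFP_ZERO) && !f->hpg_zeroed)
		demo_zero(f);
	f->hpg_zeroed = false;
}

/* Free after use: contents are no longer known to be zero. */
static inline void demo_free(struct demo_folio *f)
{
	f->hpg_zeroed = false;
}
```

In this model the flag is a one-shot "already zeroed, skip once" marker
rather than a lasting property of the folio, which may be the source of
the confusion raised above.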

> +
>  	hugetlb_set_folio_subpool(folio, spool);
>
>  	if (map_chg != MAP_CHG_ENFORCED) {
> @@ -2999,7 +3020,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
>
>  	if (ret == -ENOMEM) {
> -		free_huge_folio(folio);
> +		__free_huge_folio(folio, !!(gfp & __GFP_ZERO));

Is the !! actually necessary?
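
On the plain C side of this question: converting any nonzero scalar to
bool (_Bool) yields 1, so "!!" changes nothing when the parameter type is
bool. It only matters when the destination is a narrower integer type
that would truncate the bit. The sketch below shows both cases in user
space; DEMO_GFP_ZERO is an arbitrary stand-in bit, not the kernel's
value, and whether static checkers such as sparse prefer the explicit
"!!" for a gfp_t operand is not covered here.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_GFP_ZERO 0x100u	/* stand-in bit, not the kernel's value */

/* Parameter is bool: any nonzero argument arrives as true (1). */
static inline int demo_via_bool(bool zeroed)
{
	return zeroed;
}

/* Parameter is a narrow integer: bits above the low 8 are truncated. */
static inline int demo_via_u8(unsigned char zeroed)
{
	return zeroed;
}

static inline void demo_check_conversions(void)
{
	unsigned int gfp = 0x1100u;	/* has DEMO_GFP_ZERO set */

	assert(demo_via_bool(gfp & DEMO_GFP_ZERO) == 1);	/* no !! needed */
	assert(demo_via_u8(gfp & DEMO_GFP_ZERO) == 0);		/* bit is lost */
	assert(demo_via_u8(!!(gfp & DEMO_GFP_ZERO)) == 1);	/* !! keeps it */
}
```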

>  		return ERR_PTR(-ENOMEM);
>  	}
>
> @@ -4971,7 +4992,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  				spin_unlock(src_ptl);
>  				spin_unlock(dst_ptl);
>  				/* Do not use reserve as it's private owned */
> -				new_folio = alloc_hugetlb_folio(dst_vma, addr, false);
> +				new_folio = alloc_hugetlb_folio(dst_vma, addr, false, 0);
>  				if (IS_ERR(new_folio)) {
>  					folio_put(pte_folio);
>  					ret = PTR_ERR(new_folio);
> @@ -5500,7 +5521,7 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
>  	 * be acquired again before returning to the caller, as expected.
>  	 */
>  	spin_unlock(vmf->ptl);
> -	new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner);
> +	new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner, 0);
>
>  	if (IS_ERR(new_folio)) {
>  		/*
> @@ -5760,7 +5781,13 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>  				goto out;
>  		}
>
> -		folio = alloc_hugetlb_folio(vma, vmf->address, false);
> +		/*
> +		 * Passing vmf->real_address would work just as well,
> +		 * but PAGE_MASK helps make sure we never pass
> +		 * USER_ADDR_NONE by mistake.
> +		 */

Wait what??

Your whole thesis is that USER_ADDR_NONE can never possibly get used, and now
you're guarding against that not being true?


> +		folio = alloc_hugetlb_folio(vma, vmf->real_address & PAGE_MASK,
> +					   false, __GFP_ZERO);

OK so we propagate GFP just for __GFP_ZERO... Can we not?

>  		if (IS_ERR(folio)) {
>  			/*
>  			 * Returning error will result in faulting task being
> @@ -5780,7 +5807,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>  				ret = 0;
>  			goto out;
>  		}
> -		folio_zero_user(folio, vmf->real_address);
>  		__folio_mark_uptodate(folio);
>  		new_folio = true;
>
> @@ -6219,7 +6245,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>  			goto out;
>  		}
>
> -		folio = alloc_hugetlb_folio(dst_vma, dst_addr, false);
> +		folio = alloc_hugetlb_folio(dst_vma, dst_addr, false, 0);
>  		if (IS_ERR(folio)) {
>  			pte_t *actual_pte = hugetlb_walk(dst_vma, dst_addr, PMD_SIZE);
>  			if (actual_pte) {
> @@ -6266,7 +6292,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>  			goto out;
>  		}
>
> -		folio = alloc_hugetlb_folio(dst_vma, dst_addr, false);
> +		folio = alloc_hugetlb_folio(dst_vma, dst_addr, false, 0);

Not loving the '0' on these... Let's find another way to represent that.

>  		if (IS_ERR(folio)) {
>  			folio_put(*foliop);
>  			ret = -ENOMEM;
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed
  2026-06-08 12:25   ` Lorenzo Stoakes
@ 2026-06-08 12:46     ` David Hildenbrand (Arm)
  2026-06-08 14:08       ` Michael S. Tsirkin
  0 siblings, 1 reply; 131+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-08 12:46 UTC (permalink / raw)
  To: Lorenzo Stoakes, Michael S. Tsirkin
  Cc: linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, virtualization, linux-mm, Andrea Arcangeli

On 6/8/26 14:25, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:38:54AM -0400, Michael S. Tsirkin wrote:
>> Add put_page_zeroed() / folio_put_zeroed() for callers that hold
>> a reference to a page known to be zeroed.
>>
>> If this drops the last reference, the zeroed hint is
>> propagated to the buddy allocator.  If someone else still holds a
>> reference, the hint is simply lost - this is best-effort.
>>
>> This is useful for balloon drivers during deflation: the host
>> has already zeroed the pages, and the balloon is typically the
>> sole owner.  But if the page happens to be shared, silently
>> dropping the hint is safe and avoids the need for callers to
>> check the refcount.
>>
>> Note: put_page_zeroed uses folio_put_testzero() which only
>> detects sole ownership at the instant of the atomic decrement.
>> A concurrent reference holder (e.g. migration) means the hint
>> is silently lost. This is by design: the zeroed hint is a
>> performance optimization, not a correctness requirement.
>> Losing it just means the next allocation re-zeroes the page.
> 
> Do not put comments about specific expected races like this in the commit
> message but not in the code. Subtleties need to be called out.
> 
> The commit message also doesn't at all explain why PG_zeroed doesn't
> suffice here.
> 
>>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> Assisted-by: Claude:claude-opus-4-6
> 
> I really don't understand why you have a 'zeroed' folio flag but need to
> also have new API calls to detect that?
> 
> They're also HORRIBLY named. Zeroed as in what? Zero page? Huge zero page?
> Memory zeroed by kernel? Pages that userland happen to have zeroed? Or host
> VM zeroed?
> 
> Each are cases we address individually and relate to folios.
> 
> You absolutely fail to clarify _which one_ you mean, and provide absolutely
> no documentation and add an exported mm API with no description.
> 
> This is just I think not something we want to add? Especially on something
> so fundamental?

I raised previously that providing a folio helper is odd, and suggested
that we defer this change [1].

Maybe we'd want to add such an interface for frozen pages later (to be used by
the balloon), but I don't think we want that for folios.

[1] https://lore.kernel.org/all/5f76af6e-9818-42ea-a305-c0fc1d920dca@kernel.org/

-- 
Cheers,

David



* Re: [PATCH v10 29/37] mm: memfd: skip zeroing for zeroed hugetlb pool pages
  2026-06-08  8:39 ` [PATCH v10 29/37] mm: memfd: skip zeroing for zeroed hugetlb pool pages Michael S. Tsirkin
@ 2026-06-08 12:47   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 12:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:39:37AM -0400, Michael S. Tsirkin wrote:
> Add bool *zeroed output to alloc_hugetlb_folio_reserve() so
> callers can check whether the pool page is known-zero.  memfd's
> memfd_alloc_folio() uses this to skip the explicit folio_zero_user()
> when the page is already zero.

But why does memfd do that?

This is more AI-ish 'write out in English what the code does' which isn't
really helpful.

>
> This avoids redundant zeroing for memfd hugetlb pages that were
> pre-allocated into the pool and never mapped to userspace.

I think this should lead the commit message, given it seems to be the whole
intent, no?

>
> Note: HPG_zeroed is currently only set for surplus pages
> allocated with __GFP_ZERO (via alloc_surplus_hugetlb_folio),
> not for pool pages from alloc_pool_huge_folio. So the
> zeroed output from alloc_hugetlb_folio_reserve is typically
> false for pool-only reservations. It becomes true when
> surplus pages fill the reservation. The addr_hint 0 passed
> to folio_zero_user is acceptable for memfd: these pages are
> not mapped yet and will get proper dcache handling at mmap
> time via the page fault path.

This paragraph is really hard to read, and you don't seem to propagate the
same very specific information in the code so people maintaining it don't
know what's going on.

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6

This is committing the sins of the rest and adding more complexity
throughout.

The whole approach needs a rework I think, but hugetlbfs stuff should be
deferred in general.

> ---
>  include/linux/cma.h     |  3 ++-
>  include/linux/hugetlb.h |  6 ++++--
>  mm/cma.c                |  6 ++++--
>  mm/hugetlb.c            | 11 +++++++++--
>  mm/hugetlb_cma.c        |  4 ++--
>  mm/memfd.c              | 14 ++++++++------
>  6 files changed, 29 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/cma.h b/include/linux/cma.h
> index 8555d38a97b1..dee88909cf5d 100644
> --- a/include/linux/cma.h
> +++ b/include/linux/cma.h
> @@ -53,7 +53,8 @@ extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long
>
>  struct page *cma_alloc_frozen(struct cma *cma, unsigned long count,
>  		unsigned int align, bool no_warn);
> -struct page *cma_alloc_frozen_compound(struct cma *cma, unsigned int order);
> +struct page *cma_alloc_frozen_compound(struct cma *cma, unsigned int order,
> +				       gfp_t caller_gfp);
>  bool cma_release_frozen(struct cma *cma, const struct page *pages,
>  		unsigned long count);
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 06d033a57a61..7eb529eabe99 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -708,7 +708,8 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
>  				nodemask_t *nmask, gfp_t gfp_mask,
>  				bool allow_alloc_fallback);
>  struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> -					  nodemask_t *nmask, gfp_t gfp_mask);
> +					  nodemask_t *nmask, gfp_t gfp_mask,
> +					  bool *zeroed);
>
>  int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping,
>  			pgoff_t idx);
> @@ -1128,7 +1129,8 @@ static inline void wait_for_freed_hugetlb_folios(void)
>
>  static inline struct folio *
>  alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> -			    nodemask_t *nmask, gfp_t gfp_mask)
> +			    nodemask_t *nmask, gfp_t gfp_mask,
> +			    bool *zeroed)
>  {
>  	return NULL;
>  }
> diff --git a/mm/cma.c b/mm/cma.c
> index c7ca567f4c5c..27971f6264ab 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -924,9 +924,11 @@ struct page *cma_alloc_frozen(struct cma *cma, unsigned long count,
>  	return __cma_alloc_frozen(cma, count, align, gfp);
>  }
>
> -struct page *cma_alloc_frozen_compound(struct cma *cma, unsigned int order)
> +struct page *cma_alloc_frozen_compound(struct cma *cma, unsigned int order,
> +				       gfp_t caller_gfp)
>  {
> -	gfp_t gfp = GFP_KERNEL | __GFP_COMP | __GFP_NOWARN;
> +	gfp_t gfp = GFP_KERNEL | __GFP_COMP | __GFP_NOWARN |
> +		    (caller_gfp & __GFP_ZERO);
>
>  	return __cma_alloc_frozen(cma, 1 << order, order, gfp);
>  }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ed00db703911..a087e915783f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2196,7 +2196,7 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
>  }
>
>  struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> -		nodemask_t *nmask, gfp_t gfp_mask)
> +		nodemask_t *nmask, gfp_t gfp_mask, bool *zeroed)
>  {
>  	struct folio *folio;
>
> @@ -2212,6 +2212,12 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
>  		h->resv_huge_pages--;
>
>  	spin_unlock_irq(&hugetlb_lock);
> +
> +	if (zeroed && folio) {
> +		*zeroed = folio_test_hugetlb_zeroed(folio);
> +		folio_clear_hugetlb_zeroed(folio);
> +	}
> +
>  	return folio;
>  }
>
> @@ -2296,7 +2302,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  		 * It is okay to use NUMA_NO_NODE because we use numa_mem_id()
>  		 * down the road to pick the current node if that is the case.
>  		 */
> -		folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
> +		folio = alloc_surplus_hugetlb_folio(h,
> +						    htlb_alloc_mask(h),
>  						    NUMA_NO_NODE, &alloc_nodemask,
>  						    USER_ADDR_NONE);
>  		if (!folio) {
> diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
> index 7693ccefd0c6..c9266b25be3d 100644
> --- a/mm/hugetlb_cma.c
> +++ b/mm/hugetlb_cma.c
> @@ -35,14 +35,14 @@ struct folio *hugetlb_cma_alloc_frozen_folio(int order, gfp_t gfp_mask,
>  		return NULL;
>
>  	if (hugetlb_cma[nid])
> -		page = cma_alloc_frozen_compound(hugetlb_cma[nid], order);
> +		page = cma_alloc_frozen_compound(hugetlb_cma[nid], order, gfp_mask);
>
>  	if (!page && !(gfp_mask & __GFP_THISNODE)) {
>  		for_each_node_mask(node, *nodemask) {
>  			if (node == nid || !hugetlb_cma[node])
>  				continue;
>
> -			page = cma_alloc_frozen_compound(hugetlb_cma[node], order);
> +			page = cma_alloc_frozen_compound(hugetlb_cma[node], order, gfp_mask);
>  			if (page)
>  				break;
>  		}
> diff --git a/mm/memfd.c b/mm/memfd.c
> index abe13b291ddc..a99617a62e33 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -69,6 +69,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
>  #ifdef CONFIG_HUGETLB_PAGE
>  	struct folio *folio;
>  	gfp_t gfp_mask;
> +	bool zeroed;
>
>  	if (is_file_hugepages(memfd)) {
>  		/*
> @@ -93,17 +94,18 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
>  		folio = alloc_hugetlb_folio_reserve(h,
>  						    numa_node_id(),
>  						    NULL,
> -						    gfp_mask);
> +						    gfp_mask,
> +						    &zeroed);
>  		if (folio) {
>  			u32 hash;
>
>  			/*
> -			 * Zero the folio to prevent information leaks to userspace.
> -			 * Use folio_zero_user() which is optimized for huge/gigantic
> -			 * pages. Pass 0 as addr_hint since this is not a faulting path
> -			 *  and we don't have a user virtual address yet.
> +			 * Zero the folio to prevent information leaks to
> +			 * userspace.  Skip if the pool page is known-zero
> +			 * (HPG_zeroed set during pool pre-allocation).
>  			 */
> -			folio_zero_user(folio, 0);
> +			if (!zeroed)
> +				folio_zero_user(folio, 0);
>
>  			/*
>  			 * Mark the folio uptodate before adding to page cache,
> --
> MST
>

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 03/37] mm: page_alloc: propagate PageReported flag across buddy splits
  2026-06-08  9:52   ` Lorenzo Stoakes
@ 2026-06-08 12:50     ` Matthew Wilcox
  0 siblings, 0 replies; 131+ messages in thread
From: Matthew Wilcox @ 2026-06-08 12:50 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 10:52:28AM +0100, Lorenzo Stoakes wrote:
> > -			split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
> > +			split_large_buddy(zone, p, page_to_pfn(p), p_order,
> > +					  fpi_flags, false);
> 
> I don't love adding a mystery meat boolean parameter like this.

Particularly when we have the FPI flags already being passed.  Surely
this should just be another FPI flag?
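
To make the suggestion concrete, here is a minimal user-space sketch
contrasting the two call styles. It is not the series' code: FPI_ZEROED
is named in the cover letter, but the bit value, the fpi_t typedef as
written here, and the demo_* function names are assumptions made up for
illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the allocator's free-path flags. */
typedef unsigned int fpi_t;
#define FPI_NONE	0u
#define FPI_ZEROED	(1u << 4)	/* assumed bit position */

/*
 * Boolean style: a call site reads demo_split_bool(order, flags, false),
 * and the reader has to look up what the trailing "false" means.
 */
static bool demo_split_bool(unsigned int order, fpi_t fpi_flags, bool zeroed)
{
	(void)order;
	(void)fpi_flags;
	return zeroed;
}

/*
 * Flag style: the same information travels in the flags word that is
 * already being passed, so the call site names what it is asking for.
 */
static bool demo_split_flags(unsigned int order, fpi_t fpi_flags)
{
	(void)order;
	return (fpi_flags & FPI_ZEROED) != 0;
}
```

With the flag style, a caller would write something like
demo_split_flags(order, fpi_flags | FPI_ZEROED), and existing callers
need no signature change.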




* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
  2026-06-08  9:17 ` [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Lorenzo Stoakes
@ 2026-06-08 12:52   ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 12:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 10:17:47AM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:33:46AM -0400, Michael S. Tsirkin wrote:
> > Further, on architectures with aliasing caches, upstream with init_on_alloc
> > double-zeros user pages: once via kernel_init_pages() in
> > post_alloc_hook, and again via clear_user_highpage() at the
> > callsite (because user_alloc_needs_zeroing() returns true).
> > This series eliminates that double-zeroing by moving the zeroing
> > into the post_alloc_hook + propagating the "host
> > already zeroed this page" information through the buddy allocator.
> >
> > For page reporting, VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED (bit 6)
> > is used. For the inflate/deflate path,
> > VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE (bit 7) is used.
> >
> > Virtio spec: https://lore.kernel.org/all/cover.1778140241.git.mst@redhat.com
> >
> > Based on v7.1-rc6.  When applying on mm-unstable, two conflicts
> > are expected:
> > - kernel_init_pages() was renamed to clear_highpages_kasan_tagged()
> >   in mm-unstable.  Use clear_highpages_kasan_tagged() in the
> >   post_alloc_hook else branch.
> > - FPI_PREPARED uses BIT(3) in mm-unstable.  Bump FPI_ZEROED to
> >   BIT(4).
> > Build-tested on mm-unstable at e9dd96806dbc:
> > https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git zero-mm-unstable
> >
> > Patches 1-5: fixes/cleanups, dependencies of the zeroing patches.
> > Patches 6-9: thread user_addr through page allocator, contig API,
> >   and gigantic hugetlb allocation.
> > Patches 10-16: folio_zero_user in post_alloc_hook, vma_alloc_zeroed
> >   conversion, raw fault address threading.
> > Patches 17-24: PG_zeroed flag, aliasing guard, buddy merge/split
> >   tracking, FPI_ZEROED optimization, folio_put_zeroed.
> > Patches 25-27: __GFP_ZERO callsite conversions (alloc_anon_folio,
> >   vma_alloc_anon_folio_pmd) with memcg charge failure mitigation.
> > Patches 28-29: hugetlb __GFP_ZERO + HPG_zeroed.
> > Patches 30-35: page reporting zeroing (DEVICE_INIT_REPORTED),
> >   disable indirect descriptors.
> > Patches 36-37: inflate/deflate zeroing (DEVICE_INIT_ON_INFLATE).
>
> This seems far too much for one series.
>
> You're doing a bunch of mm stuff that seems relatively independent, then
> putting the virtio stuff on top.
>
> I think this should be broken out into separate series laying foundations
> rather than doing it all in one go, which is also difficult for review
> purposes.
>
> Adding a new folio flag is contentious also for instance, we maybe want to
> go bit-by-bit and ensure that each foundational element is acceptable
> before doing the next bit rather than having it as part of a big series.
>
> Looking through the changelog only adds to this feeling! Huge numbers of
> changes, even relatively recently and I'm not sure all relevant maintainers
> in mm have had a look through either.
>
> Thanks, Lorenzo

Additionally, it seems you've missed/ignored (I hope not) a bunch of
pre-existing feedback, so it'd be helpful if you'd carefully go through
what people have previously asked!

Keeping track of that, especially on a big series, is a lot of work.

Thanks, Lorenzo



* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 11:06     ` Lorenzo Stoakes
@ 2026-06-08 13:04       ` Matthew Wilcox
  2026-06-08 13:09         ` Lorenzo Stoakes
  2026-06-08 19:59         ` Gregory Price
  0 siblings, 2 replies; 131+ messages in thread
From: Matthew Wilcox @ 2026-06-08 13:04 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
> But instead of overloading user_addr to indicate all kinds of things, instead
> make life easier by actually breaking things out.
> 
> Like:
> 
> enum alloc_context_type {
> 	KERNEL_ALLOCATION,
> 	USER_MAPPED_ALLOCATION,
> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
> 	/* Perhaps some other states we want to encode? */
> };
> 
> struct alloc_context {
> 	...
> 
> 	enum alloc_context_type type;
> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
> 
> 	// Maybe something suggesting context or whether we init before in some
> 	// cases?
> };

Ugh, please, no.  As I suggested last time I commented on this
trainwreck of a series, lift the zeroing functionality from
alloc_frozen_pages() into its callers.



* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 13:04       ` Matthew Wilcox
@ 2026-06-08 13:09         ` Lorenzo Stoakes
  2026-06-08 14:26           ` David Hildenbrand (Arm)
  2026-06-08 19:59         ` Gregory Price
  1 sibling, 1 reply; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 13:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
> > But instead of overloading user_addr to indicate all kinds of things, instead
> > make life easier by actually breaking things out.
> >
> > Like:
> >
> > enum alloc_context_type {
> > 	KERNEL_ALLOCATION,
> > 	USER_MAPPED_ALLOCATION,
> > 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
> > 	/* Perhaps some other states we want to encode? */
> > };
> >
> > struct alloc_context {
> > 	...
> >
> > 	enum alloc_context_type type;
> > 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
> >
> > 	// Maybe something suggesting context or whether we init before in some
> > 	// cases?
> > };
>
> Ugh, please, no.  As I suggested last time I commented on this
> trainwreck of a series, lift the zeroing functionality from
> alloc_frozen_pages() into its callers.

I've not looked at the callers closely enough to see the delta on that, but if
it avoids this mess then also worth looking at yes...

Cheers, Lorenzo



* Re: [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
  2026-06-08  9:43   ` Lorenzo Stoakes
@ 2026-06-08 13:48     ` Michael S. Tsirkin
  2026-06-08 14:14       ` Lorenzo Stoakes
  2026-06-08 16:20       ` Andrew Morton
  0 siblings, 2 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 13:48 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Miaohe Lin

On Mon, Jun 08, 2026 at 10:43:21AM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:34:23AM -0400, Michael S. Tsirkin wrote:
> > TestSetPageHWPoison() is called without zone->lock, so its atomic
> > update to page->flags can race with non-atomic flag operations
> > that run under zone->lock in the buddy allocator.
> >
> > In particular, __free_pages_prepare() does:
> >
> >     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >
> > This non-atomic read-modify-write, while correctly excluding
> > __PG_HWPOISON from the mask, can still lose a concurrent
> > TestSetPageHWPoison if the read happens before the poison bit
> > is set and the write happens after.  Follow-up patches in this
> > series add similar non-atomic flag operations as well.
> >
> > Fix by acquiring zone->lock around TestSetPageHWPoison and
> > around ClearPageHWPoison in the retry path.  This
> > serializes with all buddy flag manipulation.  The cost is
> > negligible: one lock/unlock in an extremely rare path
> > (hardware memory errors).
> >
> > Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> > in this file operate on pages already removed from the buddy
> > allocator or on non-buddy pages (DAX, hugetlb), so they do not
> > need zone->lock protection.
> >
> > Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> Can we have Fixes: and Cc: stable and also send this separately please?
> 
> These patches seem like unrelated fixups that you've discovered along the way,
> and don't belong as part of the already rather large series, unless I'm missing
> something here.
> 
> Thanks, Lorenzo

I think you are missing that they are a dependency, not unrelated.
For example, this issue gets worse with the patchset as there are more
places that manipulate flags without atomics. No?
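
The lost update described in the quoted commit message can be replayed
deterministically in user space. The sketch below is an illustration
only: the DEMO_* bit positions and function names are hypothetical, and
the interleaving is hard-coded as three sequential steps rather than
produced by real threads.

```c
/* Hypothetical bit positions, chosen only for this demonstration. */
#define DEMO_PG_HWPOISON	(1ul << 0)
#define DEMO_CHECK_AT_PREP	(1ul << 1)

/*
 * Replays one unlucky interleaving:
 *   1. the freeing path loads the flags word
 *   2. the memory-failure path sets the poison bit
 *   3. the freeing path stores its stale, masked copy
 * The mask never names the poison bit, yet the bit is lost anyway.
 */
static unsigned long demo_replay_lost_update(unsigned long flags)
{
	unsigned long stale = flags;		/* step 1: load */

	flags |= DEMO_PG_HWPOISON;		/* step 2: concurrent set */
	flags = stale & ~DEMO_CHECK_AT_PREP;	/* step 3: stale store */
	return flags;
}

/*
 * The same two operations when a common lock forces each one to run
 * to completion before the other starts.
 */
static unsigned long demo_replay_serialized(unsigned long flags)
{
	flags &= ~DEMO_CHECK_AT_PREP;		/* whole RMW first */
	flags |= DEMO_PG_HWPOISON;		/* then the poison set */
	return flags;
}
```

Starting from a word with only DEMO_CHECK_AT_PREP set, the first replay
ends with no bits set, while the serialized one ends with only the
poison bit set.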


You are welcome to send this to stable, but I think stable rules
preclude theoretical bugfixes.

As for Fixes: the issue has been there for decades. I wouldn't know
what to attribute it to.


I guess I could send these separately, too, why not. Not sure
what this accomplishes, but hey.  But is that an ack? You want
this fix merged even before the feature?



> > Assisted-by: Claude:claude-opus-4-6
> > ---
> >  mm/memory-failure.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index ee42d4361309..3880486028a1 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -2348,6 +2348,8 @@ int memory_failure(unsigned long pfn, int flags)
> >  	unsigned long page_flags;
> >  	bool retry = true;
> >  	int hugetlb = 0;
> > +	struct zone *zone;
> > +	unsigned long mf_flags;
> >
> >  	if (!sysctl_memory_failure_recovery)
> >  		panic("Memory failure on page %lx", pfn);
> > @@ -2390,7 +2392,11 @@ int memory_failure(unsigned long pfn, int flags)
> >  	if (hugetlb)
> >  		goto unlock_mutex;
> >
> > +	/* Serialize with non-atomic buddy flag operations */
> > +	zone = page_zone(p);
> > +	spin_lock_irqsave(&zone->lock, mf_flags);
> >  	if (TestSetPageHWPoison(p)) {
> > +		spin_unlock_irqrestore(&zone->lock, mf_flags);
> >  		res = -EHWPOISON;
> >  		if (flags & MF_ACTION_REQUIRED)
> >  			res = kill_accessing_process(current, pfn, flags);
> > @@ -2399,6 +2405,7 @@ int memory_failure(unsigned long pfn, int flags)
> >  		action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
> >  		goto unlock_mutex;
> >  	}
> > +	spin_unlock_irqrestore(&zone->lock, mf_flags);
> >
> >  	/*
> >  	 * We need/can do nothing about count=0 pages.
> > @@ -2420,7 +2427,10 @@ int memory_failure(unsigned long pfn, int flags)
> >  			} else {
> >  				/* We lost the race, try again */
> >  				if (retry) {
> > +					/* Serialize with non-atomic buddy flag operations */
> > +					spin_lock_irqsave(&zone->lock, mf_flags);
> >  					ClearPageHWPoison(p);
> > +					spin_unlock_irqrestore(&zone->lock, mf_flags);
> >  					retry = false;
> >  					goto try_again;
> >  				}
> > --
> > MST
> >




* Re: [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed
  2026-06-08 12:46     ` David Hildenbrand (Arm)
@ 2026-06-08 14:08       ` Michael S. Tsirkin
  2026-06-08 14:28         ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 14:08 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, linux-kernel, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 02:46:34PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 14:25, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:38:54AM -0400, Michael S. Tsirkin wrote:
> >> Add put_page_zeroed() / folio_put_zeroed() for callers that hold
> >> a reference to a page known to be zeroed.
> >>
> >> If this drops the last reference, the zeroed hint is
> >> propagated to the buddy allocator.  If someone else still holds a
> >> reference, the hint is simply lost - this is best-effort.
> >>
> >> This is useful for balloon drivers during deflation: the host
> >> has already zeroed the pages, and the balloon is typically the
> >> sole owner.  But if the page happens to be shared, silently
> >> dropping the hint is safe and avoids the need for callers to
> >> check the refcount.
> >>
> >> Note: put_page_zeroed uses folio_put_testzero() which only
> >> detects sole ownership at the instant of the atomic decrement.
> >> A concurrent reference holder (e.g. migration) means the hint
> >> is silently lost. This is by design: the zeroed hint is a
> >> performance optimization, not a correctness requirement.
> >> Losing it just means the next allocation re-zeroes the page.
> > 
> > Do not put comments about specific expected races like this in the commit
> > message but not in the code. Subtleties need to be called out.
> > 
> > The commit message also doesn't at all explain why PG_zeroed doesn't
> > suffice here.
> > 
> >>
> >> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >> Assisted-by: Claude:claude-opus-4-6
> > 
> > I really don't understand why you have a 'zeroed' folio flag but need to
> > also have new API calls to detect that?
> > 
> > They're also HORRIBLY named. Zeroed as in what? Zero page? Huge zero page?
> > Memory zeroed by kernel? Pages that userland happen to have zeroed? Or host
> > VM zeroed?
> > 
> > Each are cases we address individually and relate to folios.
> > 
> > You absolutely fail to clarify _which one_ you mean, and provide absolutely
> > no documentation and add an exported mm API with no description.
> > 
> > This is just I think not something we want to add? Especially on something
> > so fundamental?
> 
> I raised previously that providing a folio helper is odd, and I suggested
> that we defer this change.

Sadly it's a dependency actually - without it memcg failures would cause
repeated re-zeroing where previously it failed without zeroing.

It's the result of the whole GFP_ZERO idea.
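
For readers following along, the best-effort semantics from the quoted
commit message can be modelled roughly as below. This is a user-space
sketch under stated assumptions, not the patch: struct demo_page and
every demo_* name are hypothetical, and C11 atomics stand in for the
kernel's refcount primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a page with a refcount and a free-list hint. */
struct demo_page {
	atomic_int refcount;
	bool on_free_list;
	bool free_list_zeroed;	/* hint seen by the next allocation */
};

/* Drop one reference; true only if this call dropped the last one. */
static bool demo_put_testzero(struct demo_page *p)
{
	return atomic_fetch_sub(&p->refcount, 1) == 1;
}

/* Ordinary put: the page is freed with no claim about its contents. */
static void demo_put_page(struct demo_page *p)
{
	if (demo_put_testzero(p)) {
		p->on_free_list = true;
		p->free_list_zeroed = false;
	}
}

/*
 * "Zeroed" put: the hint reaches the free list only when this caller
 * held the last reference. Otherwise it is silently dropped, and the
 * eventual final put frees the page as not-known-zero.
 */
static void demo_put_page_zeroed(struct demo_page *p)
{
	if (demo_put_testzero(p)) {
		p->on_free_list = true;
		p->free_list_zeroed = true;
	}
}
```

Losing the hint in the shared case costs one extra zeroing on the next
allocation; it never lets a non-zero page be treated as zero.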



> Maybe we'd want to add such an interface for frozen pages later (to be used by
> the balloon), but I don't think we want that for folios.
> 
> [1] https://lore.kernel.org/all/5f76af6e-9818-42ea-a305-c0fc1d920dca@kernel.org/
> 
> -- 
> Cheers,
> 
> David




* Re: [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
  2026-06-08 13:48     ` Michael S. Tsirkin
@ 2026-06-08 14:14       ` Lorenzo Stoakes
  2026-06-08 20:17         ` Michael S. Tsirkin
  2026-06-08 16:20       ` Andrew Morton
  1 sibling, 1 reply; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 14:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Miaohe Lin

On Mon, Jun 08, 2026 at 09:48:34AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 10:43:21AM +0100, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:34:23AM -0400, Michael S. Tsirkin wrote:
> > > TestSetPageHWPoison() is called without zone->lock, so its atomic
> > > update to page->flags can race with non-atomic flag operations
> > > that run under zone->lock in the buddy allocator.
> > >
> > > In particular, __free_pages_prepare() does:
> > >
> > >     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > >
> > > This non-atomic read-modify-write, while correctly excluding
> > > __PG_HWPOISON from the mask, can still lose a concurrent
> > > TestSetPageHWPoison if the read happens before the poison bit
> > > is set and the write happens after.  Follow-up patches in this
> > > series add similar non-atomic flag operations as well.
> > >
> > > Fix by acquiring zone->lock around TestSetPageHWPoison and
> > > around ClearPageHWPoison in the retry path.  This
> > > serializes with all buddy flag manipulation.  The cost is
> > > negligible: one lock/unlock in an extremely rare path
> > > (hardware memory errors).
> > >
> > > Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> > > in this file operate on pages already removed from the buddy
> > > allocator or on non-buddy pages (DAX, hugetlb), so they do not
> > > need zone->lock protection.
> > >
> > > Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >
> > Can we have Fixes: and Cc: stable and also send this separately please?
> >
> > These patches seem like unrelated fixups that you've discovered along the way,
> > and don't belong as part of the already rather large series, unless I'm missing
> > something here.
> >
> > Thanks, Lorenzo
>
> I think you are missing that they are a dependency, not unrelated.

Then say so.

> For example, this issue gets worse with the patchset as there are more
> places that manipulate flags without atomics. No?

It's your job to make that case, not mine.

>
>
> You are welcome to send this to stable, but I think stable rules
> preclude theoretical bugfixes.

It's a dependency but also theoretical?

>
> As for Fixes: the issue has been there for decades. I wouldn't know
> what to attribute it to.

Again, your job.

>
>
> I guess I could send these separately, too, why not. Not sure
> what this accomplishes, but hey.  But is that an ack? You want
> this fix merged even before the feature?

I already made the case as to why, as have other maintainers.

If you need to review what an ack looks like please consult
https://docs.kernel.org/process/5.Posting.html

Thanks, Lorenzo



* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (38 preceding siblings ...)
  2026-06-08 11:02 ` Vlastimil Babka (SUSE)
@ 2026-06-08 14:21 ` Matthew Wilcox
  2026-06-08 20:02   ` Michael S. Tsirkin
  2026-06-09  3:58 ` New design Matthew Wilcox
  40 siblings, 1 reply; 131+ messages in thread
From: Matthew Wilcox @ 2026-06-08 14:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:33:46AM -0400, Michael S. Tsirkin wrote:
> Further, on architectures with aliasing caches, upstream with init_on_alloc

Further to what?  Did you leave out some paragraphs here?

As far as I can tell, this patch series decides to trust that the
hypervisor has zeroed pages that it allocates to the guest.  But
as far as I can tell, the trend is towards less trust in the hypervisor
from the guest, not more.



* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 13:09         ` Lorenzo Stoakes
@ 2026-06-08 14:26           ` David Hildenbrand (Arm)
  2026-06-08 14:31             ` Matthew Wilcox
  0 siblings, 1 reply; 131+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-08 14:26 UTC (permalink / raw)
  To: Lorenzo Stoakes, Matthew Wilcox
  Cc: Michael S. Tsirkin, linux-kernel, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On 6/8/26 15:09, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
>>> But instead of overloading user_addr to indicate all kinds of things, instead
>>> make life easier by actually breaking things out.
>>>
>>> Like:
>>>
>>> enum alloc_context_type {
>>> 	KERNEL_ALLOCATION,
>>> 	USER_MAPPED_ALLOCATION,
>>> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
>>> 	/* Perhaps some other states we want to encode? */
>>> };
>>>
>>> struct alloc_context {
>>> 	...
>>>
>>> 	enum alloc_context_type type;
>>> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
>>>
>>> 	// Maybe something suggesting context or whether we init before in some
>>> 	// cases?
>>> };
>>
>> Ugh, please, no.  As I suggested last time I commented on this
>> trainwreck of a series, lift the zeroing functionality from
>> alloc_frozen_pages() into its callers.
> 
> I've not looked at the callers closely enough to see the delta on that, but if
> it avoids this mess then also worth looking at yes...

If that means that we would handle __GFP_ZERO consistently in the callers of
alloc_frozen_pages(), that would also do I guess. We'd still have to pass the
user address down to some degree, through folio interfaces only at least.

-- 
Cheers,

David



* Re: [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed
  2026-06-08 14:08       ` Michael S. Tsirkin
@ 2026-06-08 14:28         ` David Hildenbrand (Arm)
  2026-06-08 19:58           ` Michael S. Tsirkin
  0 siblings, 1 reply; 131+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-08 14:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On 6/8/26 16:08, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 02:46:34PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/8/26 14:25, Lorenzo Stoakes wrote:
>>>
>>> Do not put comments about specific expected races like this in the commit
>>> message but not in the code. Subtleties need to be called out.
>>>
>>> The commit message also doesn't at all explain why PG_zeroed doesn't
>>> suffice here.
>>>
>>>
>>> I really don't understand why you have a 'zeroed' folio flag but need to
>>> also have new API calls to detect that?
>>>
>>> They're also HORRIBLY named. Zeroed as in what? Zero page? Huge zero page?
>>> Memory zeroed by kernel? Pages that userland happen to have zeroed? Or host
>>> VM zeroed?
>>>
>>> Each are cases we address individually and relate to folios.
>>>
>>> You absolutely fail to clarify _which one_ you mean, and provide absolutely
>>> no documentation and add an exported mm API with no description.
>>>
>>> This is just I think not something we want to add? Especially on something
>>> so fundamental?
>>
>> I raised previously that providing a folio helper is odd, and I suggested
>> that we defer this change.
> 
> Sadly it's a dependency actually - without it memcg failures would cause
> repeated re-zeroing where previously it failed without zeroing.

Oh, you mean that we succeeded allocating (+zeroing) but failed to charge?

I don't immediately see that to be a real problem?

-- 
Cheers,

David



* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 14:26           ` David Hildenbrand (Arm)
@ 2026-06-08 14:31             ` Matthew Wilcox
  2026-06-08 14:37               ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 131+ messages in thread
From: Matthew Wilcox @ 2026-06-08 14:31 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:26:18PM +0200, David Hildenbrand (Arm) wrote:
> If that means that we would handle __GFP_ZERO consistently in the callers of
> alloc_frozen_pages(), that would also do I guess. We'd still have to pass the
> user address down to some degree, through folio interfaces only at least.

What I don't understand is how the kernel page allocator needs to know
the user address in order to effectively zero it, but the hypervisor is
able to zero the page without knowing the user address.  It feels like
somebody has x86-centric thinking where cache colouring doesn't matter.



* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 14:31             ` Matthew Wilcox
@ 2026-06-08 14:37               ` David Hildenbrand (Arm)
  2026-06-08 14:44                 ` Matthew Wilcox
  0 siblings, 1 reply; 131+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-08 14:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On 6/8/26 16:31, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 04:26:18PM +0200, David Hildenbrand (Arm) wrote:
>> If that means that we would handle __GFP_ZERO consistently in the callers of
>> alloc_frozen_pages(), that would also do I guess. We'd still have to pass the
>> user address down to some degree, through folio interfaces only at least.
> 
> What I don't understand is how the kernel page allocator needs to know
> the user address in order to effectively zero it, but the hypervisor is
> able to zero the page without knowing the user address.  It feels like
> somebody has x86-centric thinking where cache colouring doesn't matter.

(not commenting on the icache/dcache mess we have to drag along)

The thing is that with free-page-reporting the memory is already zeroed by the
hypervisor as part of discarding that memory previously (e.g., MADV_DONTNEED)
and allocating fresh pages on re-access.

So it's not a question of "why is the hypervisor zeroing less efficiently", as
zeroing is just a side-product of reclaiming that memory in the first place.

-- 
Cheers,

David



* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 14:37               ` David Hildenbrand (Arm)
@ 2026-06-08 14:44                 ` Matthew Wilcox
  2026-06-08 14:55                   ` David Hildenbrand (Arm)
  2026-06-08 19:33                   ` Michael S. Tsirkin
  0 siblings, 2 replies; 131+ messages in thread
From: Matthew Wilcox @ 2026-06-08 14:44 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:37:03PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 16:31, Matthew Wilcox wrote:
> > On Mon, Jun 08, 2026 at 04:26:18PM +0200, David Hildenbrand (Arm) wrote:
> >> If that means that we would handle __GFP_ZERO consistently in the callers of
> >> alloc_frozen_pages(), that would also do I guess. We'd still have to pass the
> >> user address down to some degree, through folio interfaces only at least.
> > 
> > What I don't understand is how the kernel page allocator needs to know
> > the user address in order to effectively zero it, but the hypervisor is
> > able to zero the page without knowing the user address.  It feels like
> > somebody has x86-centric thinking where cache colouring doesn't matter.
> 
> (not commenting on the icache/dcache mess we have to drag along)

Well, that was kind of the point of this email ... I did ask the
question you're answering in a different email so let me respond
to that too.

> The thing is that with free-page-reporting the memory is already zeroed by the
> hypervisor as part of discarding that memory previously (e.g., MADV_DONTNEED)
> and allocating fresh pages on re-access.
> 
> So it's not a question of "why is the hypervisor zeroing less efficiently", as
> zeroing is just a side-product of reclaiming that memory in the first place.

We definitely have users who don't want the guest to trust the
hypervisor.  So how do they disable this optimisation?



* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 14:44                 ` Matthew Wilcox
@ 2026-06-08 14:55                   ` David Hildenbrand (Arm)
  2026-06-09  9:54                     ` Michael S. Tsirkin
  2026-06-08 19:33                   ` Michael S. Tsirkin
  1 sibling, 1 reply; 131+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-08 14:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On 6/8/26 16:44, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 04:37:03PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/8/26 16:31, Matthew Wilcox wrote:
>>>
>>> What I don't understand is how the kernel page allocator needs to know
>>> the user address in order to effectively zero it, but the hypervisor is
>>> able to zero the page without knowing the user address.  It feels like
>>> somebody has x86-centric thinking where cache colouring doesn't matter.
>>
>> (not commenting on the icache/dcache mess we have to drag along)
> 
> Well, that was kind of the point of this email ... I did ask the
> question you're answering in a different email so let me respond
> to that too.

Now I'm confused :)

> 
>> The thing is that with free-page-reporting the memory is already zeroed by the
>> hypervisor as part of discarding that memory previously (e.g., MADV_DONTNEED)
>> and allocating fresh pages on re-access.
>>
>> So it's not a question of "why is the hypervisor zeroing less efficiently", as
>> zeroing is just a side-product of reclaiming that memory in the first place.
> 
> We definitely have users who don't want the guest to trust the
> hypervisor.  So how do they disable this optimisation?

Right, I don't think we currently have a toggle to disable free page reporting.
So IIUC, this optimization would similarly automatically get enabled if the
hypervisor advertises it.

-- 
Cheers,

David



* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 11:08     ` David Hildenbrand (Arm)
@ 2026-06-08 15:27       ` Zi Yan
  2026-06-08 18:39         ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 131+ messages in thread
From: Zi Yan @ 2026-06-08 15:27 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:

> On 6/8/26 12:23, Lorenzo Stoakes wrote:
>> I noticed this patch, again, sneaks in some additional code changes that
>> are not mentioned in the commit message and seem irrelevant to the patch.
>>
>> Not sure if the AI is doing this, but please don't do that.
>>
>> On Mon, Jun 08, 2026 at 04:35:17AM -0400, Michael S. Tsirkin wrote:
>>> Thread a user virtual address from vma_alloc_folio() down through
>>> the page allocator to post_alloc_hook(). This is plumbing
>>> preparation for a subsequent patch that will use user_addr to
>>> call folio_zero_user() for cache-friendly zeroing of user pages.
>>
>> This feels like very weak justification for code that massively changes mm
>> code and makes it all much worse.
>>
>>>
>>> The user_addr is stored in struct alloc_context and flows through:
>>>   vma_alloc_folio -> folio_alloc_mpol -> __alloc_pages_mpol ->
>>>   __alloc_frozen_pages -> get_page_from_freelist -> prep_new_page ->
>>>   post_alloc_hook
>>
>> Is this the only relevant code path? You're changing a TON of code here
>> that will have multiple different code paths?
>>
>>>
>>> USER_ADDR_NONE ((unsigned long)-1) is used for non-user
>>> allocations, since address 0 is a valid userspace mapping.
>>
>> Ugh god, so now we're passing a user address through allocation paths that
>> might not even be aware of this or be allocating memory at a point when the
>> mapping is known?
>
> The original idea was to do this only with internal interfaces. I think I
> suggested before that we leave hugetlb alone for now.
>
> Fundamentally, user_alloc_needs_zeroing() is something we should get rid of, to
> just be able to use __GFP_ZERO and do zeroing at exactly one place.

Just a reminder that user_alloc_needs_zeroing() not only checks init_on_alloc,
but also some arch-specific page-clearing requirements. To do zeroing in one place,
clear_page() used in post_alloc_hook() will need to learn how to handle
arch-specific zeroing from clear_user_page()/clear_user_highpage().

>
>
> The question is how to pass that information (user_addr) through internal APIs.

Or should we defer zeroing after a page is returned from allocator? So that
user_addr does not need to be passed through irrelevant allocation APIs.
Something like:

alloc_page_wrapper(gfp, order, user_addr)
{
	page = alloc_pages();
	if (page && (gfp & __GFP_ZERO))
		clear_page(page);
	return page;
}
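
For illustration only, the wrapper idea above can be modelled in userspace C.
This is a sketch under stated assumptions, not kernel code: every sketch_*
name and SKETCH_GFP_ZERO are hypothetical stand-ins, malloc() plays the role
of the page allocator, and memset() plays the role of an arch-aware clearing
routine that would actually use user_addr on aliasing-cache architectures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_GFP_ZERO  0x1u	/* hypothetical stand-in for __GFP_ZERO */

/* Hypothetical stand-in for alloc_pages(): returns poisoned, non-zero memory. */
static void *sketch_alloc_pages(unsigned int order)
{
	size_t len = (size_t)SKETCH_PAGE_SIZE << order;
	void *p = malloc(len);

	if (p)
		memset(p, 0xaa, len);	/* poison so that zeroing is observable */
	return p;
}

/*
 * Hypothetical stand-in for an arch-aware user-page clear. A real
 * implementation would use user_addr for cache colouring; here it is unused.
 */
static void sketch_clear_user_pages(void *page, unsigned int order,
				    unsigned long user_addr)
{
	(void)user_addr;
	memset(page, 0, (size_t)SKETCH_PAGE_SIZE << order);
}

/* Zeroing happens after allocation, so the allocator never sees user_addr. */
static void *sketch_alloc_page_wrapper(unsigned int gfp, unsigned int order,
				       unsigned long user_addr)
{
	void *page = sketch_alloc_pages(order);

	if (page && (gfp & SKETCH_GFP_ZERO))
		sketch_clear_user_pages(page, order, user_addr);
	return page;
}
```

The point of the shape is that only the outermost wrapper knows about the
user address; whether such a wrapper can be kept internal enough to avoid
misuse is exactly the open question in this thread.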

>
> I previously said, that ideally we'd end up with a folio allocation interface in
> mm/page_alloc.c, from where we could get this done more cleanly internally.
>
> But I don't want what the previous proposals did: use GFP flags+leak state or
> return magic "zeroed" flags.


Best Regards,
Yan, Zi



* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
  2026-06-08 11:13   ` Vlastimil Babka (SUSE)
@ 2026-06-08 15:45     ` Gregory Price
  2026-06-08 17:50       ` Lorenzo Stoakes
  0 siblings, 1 reply; 131+ messages in thread
From: Gregory Price @ 2026-06-08 15:45 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 01:13:42PM +0200, Vlastimil Babka (SUSE) wrote:
> On 6/8/26 13:02, Vlastimil Babka (SUSE) wrote:
> > On 6/8/26 10:33, Michael S. Tsirkin wrote:
> >> Further, on architectures with aliasing caches, upstream with init_on_alloc
> > 
> > It seems those are niche architectures so we can ignore that part for perf
> > purposes; the other reason why user_alloc_needs_zeroing() would be true is
> > booting with init_on_alloc.
> 
> OK I misread how user_alloc_needs_zeroing() works wrt init_on_alloc, as it's
> negated. But you're changing that anyway to skip that user zeroing, right?
> 
> "
> This series eliminates that double-zeroing by moving the zeroing
> into the post_alloc_hook + propagating the "host
> already zeroed this page" information through the buddy allocator.
> "
> 
> So relying on "everything in buddy is zeroed" would still work I'd think.
> 

This regresses for anything that previously didn't zero on free or
alloc, which is most kernel allocations.

I think the scope of this set has increased too much based on early
feedback to fix the userland-initiated allocations piece along with the
balloon/reporting/double-zero piece.  That's making all of this
difficult to continue following.

~Gregory



* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
  2026-06-08 11:23   ` Lorenzo Stoakes
@ 2026-06-08 15:53     ` Gregory Price
  2026-06-08 19:45       ` Michael S. Tsirkin
  2026-06-08 19:42     ` Michael S. Tsirkin
  1 sibling, 1 reply; 131+ messages in thread
From: Gregory Price @ 2026-06-08 15:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 12:23:07PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:36:38AM -0400, Michael S. Tsirkin wrote:
> 
> > +			 * user_addr != USER_ADDR_NONE implies sleepable
> > +			 * context (user page fault).
> 
> Can you safely assume that? Also inferring which context we are in from this
> parameter seems risky.
> 
> It seems to me that you're now making it such that kernel developers:
> 
> - Have to know when and when not to specify a user address, and under what
>   circumstances we might consider that to be mapped.
> 
> - Need to know to do this correctly for aliasing architectures or have silent
>   correctness issues.
> 
> - Need to take context into account when specifying this.
> 
> We definitely need to find a simpler way to do this!
>

This feedback was poked at in earlier versions.  There's a tension
between keeping the old interface as-is, having explicit interfaces
for something like this, and the state of a page inside the
allocator vs outside.

Double-plus complicated by the fact that we're trying to reason about
two allocators at once:  host and guest.

It seems it has gotten a bit more complicated since then (I missed this
"sleepable context" bit, not sure if it was there in prior versions).

If `user_addr` is now implying anything other than exactly: "This needs
to be zeroed / caches flushed", then this is bad.

~Gregory



* Re: [PATCH v10 16/37] mm: alloc_swap_folio: pass raw fault address to vma_alloc_folio
  2026-06-08 11:37   ` Lorenzo Stoakes
@ 2026-06-08 15:59     ` Gregory Price
  2026-06-08 20:09       ` Michael S. Tsirkin
  0 siblings, 1 reply; 131+ messages in thread
From: Gregory Price @ 2026-06-08 15:59 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 12:37:20PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:37:41AM -0400, Michael S. Tsirkin wrote:
> > Same change as the previous patch but for alloc_swap_folio:
> 
> Please don't say 'same change as the previous patch' :) explain what you're
> doing here. It's a pain to have to go check otherwise.
> 

MST you need to slow down a bit.

I gave you this same feedback 3 versions ago:

https://lore.kernel.org/linux-mm/agXUHItfxSwtriRF@gourry-fedora-PF4VCD3F/

~Gregory



* Re: [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
  2026-06-08 12:00   ` Lorenzo Stoakes
@ 2026-06-08 16:09     ` Gregory Price
  2026-06-08 18:40       ` Matthew Wilcox
  0 siblings, 1 reply; 131+ messages in thread
From: Gregory Price @ 2026-06-08 16:09 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 01:00:17PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:37:48AM -0400, Michael S. Tsirkin wrote:
> >
> >  void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> > -		     unsigned long user_addr);
> > +		     bool zeroed, unsigned long user_addr);
> 
> host_zeroed or something would be more appropriate no?
> 
> But in general do we need to propagate this around, can't we derive it from
> the page zeroed flag?
> 
> It's really confusing as to _which_ zeroing this refers to, it seems the
> only one relevant here is the VM host zeroing but that's completely
> non-obvious and now everybody using these functions with the extra param
> will simply have to happen to know this.
> 
> If we could find a way to avoid this propagation that'd be ideal.
> 
> Failing that, making it clear this is _only_ for vm host zeroing would be
> better, but then maybe we need to think about how we could encode this in
> some other way, e.g. passing alloc_context perhaps?
> 

This is unaddressed feedback from 3 versions ago:

https://lore.kernel.org/linux-mm/agXYbcuQYooG74pb@gourry-fedora-PF4VCD3F/

We can infer all of this from snapshotted page flags and propagate those
around.  This is infinitely more useful than just a single flag being
pulled out into a boolean, and more extensible.

void
post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
                uint64_t pg_alloc_flags, unsigned long user_addr);

		^^^ page_alloc.c internal flags only

Once the allocator gets a page it wants to return, it can take a snapshot
of the flags at that point, and then doodle all over the flags as it
goes through the page setup prior to return (including the post hook).

Haven't seen a reason why this shouldn't be done this way.
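
For illustration only, a userspace C model of the snapshot idea. This is a
sketch under stated assumptions, not kernel code: struct sk_page, the sk_*
functions, and both flag bits are hypothetical, and the real page flags and
post_alloc_hook() signature differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SK_PAGE_SIZE		4096u
#define SK_PG_HOST_ZEROED	(UINT64_C(1) << 0) /* hypothetical page flag */
#define SK_ALLOC_HOST_ZEROED	(UINT64_C(1) << 0) /* hypothetical allocator-internal flag */

/* Hypothetical stand-in for struct page plus the memory it describes. */
struct sk_page {
	uint64_t flags;
	unsigned char data[SK_PAGE_SIZE];
};

/*
 * Snapshot the state the allocator cares about, then clear the page flags
 * so later setup code is free to reuse them.
 */
static uint64_t sk_snapshot_flags(struct sk_page *page)
{
	uint64_t snap = 0;

	if (page->flags & SK_PG_HOST_ZEROED)
		snap |= SK_ALLOC_HOST_ZEROED;
	page->flags = 0;
	return snap;
}

/* Returns true if the page had to be zeroed here. */
static bool sk_post_alloc_hook(struct sk_page *page, bool want_zero,
			       uint64_t pg_alloc_flags)
{
	if (!want_zero || (pg_alloc_flags & SK_ALLOC_HOST_ZEROED))
		return false;	/* nothing requested, or host already zeroed */
	memset(page->data, 0, sizeof(page->data));
	return true;
}
```

Because the decision is driven by the snapshot rather than by the live
flags, further bits can be added to the snapshot without widening the hook's
parameter list again.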

~Gregory



* Re: [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
  2026-06-08 13:48     ` Michael S. Tsirkin
  2026-06-08 14:14       ` Lorenzo Stoakes
@ 2026-06-08 16:20       ` Andrew Morton
  1 sibling, 0 replies; 131+ messages in thread
From: Andrew Morton @ 2026-06-08 16:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Miaohe Lin

On Mon, 8 Jun 2026 09:48:34 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:

> > These patches seem like unrelated fixups that you've discovered along the way,
> > and don't belong as part of the already rather large series, unless I'm missing
> > something here.
> > 
> > Thanks, Lorenzo
> 
> I think you are missing that they are a dependency, not unrelated.
> For example, this issue gets worse with the patchset as there are more
> places that manipulate flags without atomics. No?
> 
> 
> You are welcome to send this to stable, but I think stable rules
> preclude theoretical bugfixes.

I agree with Lorenzo, please - let's have fixes separated out.  After
all, the rest of the series might never be merged (sorry, it happens).

Whether each fix gets a cc:stable is TBD, case-by-case - it depends on
the expected userspace impact.




* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
  2026-06-08 15:45     ` Gregory Price
@ 2026-06-08 17:50       ` Lorenzo Stoakes
  2026-06-08 19:39         ` Michael S. Tsirkin
  0 siblings, 1 reply; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-08 17:50 UTC (permalink / raw)
  To: Gregory Price
  Cc: Vlastimil Babka (SUSE), Michael S. Tsirkin, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 11:45:18AM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 01:13:42PM +0200, Vlastimil Babka (SUSE) wrote:
> > On 6/8/26 13:02, Vlastimil Babka (SUSE) wrote:
> > > On 6/8/26 10:33, Michael S. Tsirkin wrote:
> > >> Further, on architectures with aliasing caches, upstream with init_on_alloc
> > >
> > > It seems those are niche architectures so we can ignore that part for perf
> > > purposes; the other reason why user_alloc_needs_zeroing() would be true is
> > > booting with init_on_alloc.
> >
> > OK I misread how user_alloc_needs_zeroing() works wrt init_on_alloc, as it's
> > negated. But you're changing that anyway to skip that user zeroing, right?
> >
> > "
> > This series eliminates that double-zeroing by moving the zeroing
> > into the post_alloc_hook + propagating the "host
> > already zeroed this page" information through the buddy allocator.
> > "
> >
> > So relying on "everything in buddy is zeroed" would still work I'd think.
> >
>
> This regresses for anything that previously didn't zero on free or
> alloc, which is most kernel allocations.
>
> I think the scope of this set has increased too much based on early
> feedback to fix the userland-initiated allocations piece along with the
> balloon/reporting/double-zero piece.  That's making all of this
> difficult to continue following.

Yeah I feel this is 3, 4 or 5 series put together, and there's a lot to
discuss in each :) so it's pretty difficult to work with them all put
together.

These need to be deferred/separated.

>
> ~Gregory

Thanks, Lorenzo



* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 15:27       ` Zi Yan
@ 2026-06-08 18:39         ` David Hildenbrand (Arm)
  2026-06-08 19:43           ` Gregory Price
  0 siblings, 1 reply; 131+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-08 18:39 UTC (permalink / raw)
  To: Zi Yan
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On 6/8/26 17:27, Zi Yan wrote:
> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
> 
>> On 6/8/26 12:23, Lorenzo Stoakes wrote:
>>> I noticed this patch, again, sneaks in some additional code changes that
>>> are not mentioned in the commit message and seem irrelevant to the patch.
>>>
>>> Not sure if the AI is doing this, but please don't do that.
>>>
>>>
>>> This feels like very weak justification for code that massively changes mm
>>> code and makes it all much worse.
>>>
>>>
>>> Is this the only relevant code path? You're changing a TON of code here
>>> that will have multiple different code paths?
>>>
>>>
>>> Ugh god, so now we're passing a user address through allocation paths that
>>> might not even be aware of this or be allocating memory at a point when the
>>> mapping is known?
>>
>> The original idea was to do this only with internal interfaces. I think I
>> suggested before that we leave hugetlb alone for now.
>>
>> Fundamentally, user_alloc_needs_zeroing() is something we should get rid of, to
>> just be able to use __GFP_ZERO and do zeroing at exactly one place.
> 
> Just a reminder that user_alloc_needs_zeroing() not only checks init_on_alloc,
> but also some arch-specific page-clearing requirements. To do zeroing in one place,
> clear_page() used in post_alloc_hook() will need to learn how to handle
> arch-specific zeroing from clear_user_page()/clear_user_highpage().

Right.

> 
>>
>>
>> The question is how to pass that information (user_addr) through internal APIs.
> 
> Or should we defer zeroing after a page is returned from allocator? So that
> user_addr does not need to be passed through irrelevant allocation APIs.
> Something like:
> 
> alloc_page_wrapper(gfp, order, user_addr)
> {
> 	page = alloc_pages();
> 	if (page && (gfp & __GFP_ZERO))
> 		clear_page(page);
> 	return page;
> }
> 

Not really sure what's best here. I think we'd want to limit the lifting to some
internal API, so it cannot easily be messed up by random kernel code calling
into the wrong API and not getting pages cleared.

-- 
Cheers,

David



* Re: [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
  2026-06-08 16:09     ` Gregory Price
@ 2026-06-08 18:40       ` Matthew Wilcox
  2026-06-08 19:55         ` Michael S. Tsirkin
  0 siblings, 1 reply; 131+ messages in thread
From: Matthew Wilcox @ 2026-06-08 18:40 UTC (permalink / raw)
  To: Gregory Price
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 12:09:19PM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 01:00:17PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:37:48AM -0400, Michael S. Tsirkin wrote:
> > >
> > >  void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> > > -		     unsigned long user_addr);
> > > +		     bool zeroed, unsigned long user_addr);
> > 
> > host_zeroed or something would be more appropriate no?
> > 
> > But in general do we need to propagate this around, can't we derive it from
> > the page zeroed flag?
> > 
> > It's really confusing as to _which_ zeroing this refers to, it seems the
> > only one relevant here is the VM host zeroing but that's completely
> > non-obvious and now everybody using these functions with the extra param
> > will simply have to happen to know this.
> > 
> > If we could find a way to avoid this propagation that'd be ideal.
> > 
> > Failing that, making it clear this is _only_ for vm host zeroing would be
> > better, but then maybe we need to think about how we could encode this in
> > some other way, e.g. passing alloc_context perhaps?
> > 
> 
> This is unaddressed feedback from 3 versions ago:
> 
> https://lore.kernel.org/linux-mm/agXYbcuQYooG74pb@gourry-fedora-PF4VCD3F/
> 
> We can infer all of this from snapshotted page flags and propagate those
> around.  This is infinitely more useful than just a single flag being
> pulled out into a boolean, and more extensible.
> 
> void
> post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
>                 uint64_t pg_alloc_flags, unsigned long user_addr);
> 
> 		^^^ page_alloc.c internal flags only
> 
> Once the allocator gets a page it wants to return, it can take a snapshot
> of the flags at that point, and then doodle all over the flags as it
> goes through the page setup prior to return (including the post hook).
> 
> Haven't seen a reason why this shouldn't be done this way.

I'd tuned out this awful series since it'd become apparent that my
feedback wasn't going to be taken seriously.  Good to know I shouldn't
take it personally because he does it to you too.

I think it's apparent that Michael has no understanding of the MM.
So we should start again with the architecture.  Let's look at the
problem that he's trying to solve:

 - The hypervisor has zeroed the memory that it passes to the MM
 - But the MM then zeroes it again before passing it to userspace.
 - We want to avoid this

Let's make sure that's the actual problem before going any further.
Because I do have a design that will satisfy that without doing this
insane level of invasive change, but if that's not the problem, there's
no point wasting my time writing it up.
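
For reference, the snapshot idea quoted earlier in this message can be
modelled in userspace roughly as follows.  This is only a sketch under
stated assumptions: struct mock_page, MOCK_PG_ZEROED and both functions
are hypothetical stand-ins, not names from the series or from
page_alloc.c.  The point it shows is that the flags are copied once
when the page is picked, the live flags are then free to be reset, and
the hook decides from the copy alone.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MOCK_PAGE_SIZE 4096
/* Hypothetical bit meaning "contents are already known to be zero". */
#define MOCK_PG_ZEROED (UINT64_C(1) << 0)

/* Hypothetical stand-in for struct page plus the memory it describes. */
struct mock_page {
	uint64_t flags;
	unsigned char data[MOCK_PAGE_SIZE];
};

/*
 * Model of a post-allocation hook that takes a snapshot of the flags
 * instead of a single bool.  Returns true if it had to zero the page.
 */
static bool mock_post_alloc_hook(struct mock_page *page, bool want_zero,
				 uint64_t pg_alloc_flags)
{
	if (!want_zero || (pg_alloc_flags & MOCK_PG_ZEROED))
		return false;
	memset(page->data, 0, sizeof(page->data));
	return true;
}

/*
 * Model of the allocator's "page chosen" step: snapshot the flags
 * first, then reset the live flags, then run the hook on the snapshot.
 */
static bool mock_prep_new_page(struct mock_page *page, bool want_zero)
{
	uint64_t snapshot = page->flags;

	page->flags = 0;
	return mock_post_alloc_hook(page, want_zero, snapshot);
}
```

Because the hook only ever looks at the snapshot, further allocator-
internal bits could be added without changing its signature again.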


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 14:44                 ` Matthew Wilcox
  2026-06-08 14:55                   ` David Hildenbrand (Arm)
@ 2026-06-08 19:33                   ` Michael S. Tsirkin
  1 sibling, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 19:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand (Arm), Lorenzo Stoakes, linux-kernel,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 03:44:29PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 04:37:03PM +0200, David Hildenbrand (Arm) wrote:
> > On 6/8/26 16:31, Matthew Wilcox wrote:
> > > On Mon, Jun 08, 2026 at 04:26:18PM +0200, David Hildenbrand (Arm) wrote:
> > >> If that means that we would handle __GFP_ZERO consistently in the callers of
> > >> alloc_frozen_pages(), that would also do I guess. We'd still have to pass the
> > >> user address down to some degree, through folio interfaces only at least.
> > > 
> > > What I don't understand is how the kernel page allocator needs to know
> > > the user address in order to effectively zero it, but the hypervisor is
> > > able to zero the page without knowing the user address.  It feels like
> > > somebody has x86-centric thinking where cache colouring doesn't matter.
> > 
> > > (not commenting on the icache/dcache mess we have to drag along)
> 
> Well, that was kind of the point of this email ... I did ask the
> question you're answering in a different email so let me respond
> to that too.
> 
> > The thing is that with free-page-reporting the memory is already zeroed by the
> > hypervisor as part of discarding that memory previously (e.g., MADV_DONTNEED)
> > and allocating fresh pages on re-access.
> > 
> > So it's not a question of "why is the hypervisor zeroing less efficiently", as
> > zeroing is just a side-product of reclaiming that memory in the first place.
> 
> We definitely have users who don't want the guest to trust the
> hypervisor.  So how do they disable this optimisation?

What do you mean, how? This is done by:

[PATCH v10 35/37] virtio_balloon: disable reporting zeroed optimization for confidential guests


-- 
MST



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
  2026-06-08 17:50       ` Lorenzo Stoakes
@ 2026-06-08 19:39         ` Michael S. Tsirkin
  0 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 19:39 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Gregory Price, Vlastimil Babka (SUSE), linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 06:50:46PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 11:45:18AM -0400, Gregory Price wrote:
> > On Mon, Jun 08, 2026 at 01:13:42PM +0200, Vlastimil Babka (SUSE) wrote:
> > > On 6/8/26 13:02, Vlastimil Babka (SUSE) wrote:
> > > > On 6/8/26 10:33, Michael S. Tsirkin wrote:
> > > >> Further, on architectures with aliasing caches, upstream with init_on_alloc
> > > >
> > > > It seems those are niche architectures so we can ignore that part for perf
> > > > purposes; the other reason why user_alloc_needs_zeroing() would be true is
> > > > booting with init_on_alloc.
> > >
> > > OK I misread how user_alloc_needs_zeroing() works wrt init_on_alloc, as it's
> > > negated. But you're changing that anyway to skip that user zeroing, right?
> > >
> > > "
> > > This series eliminates that double-zeroing by moving the zeroing
> > > into the post_alloc_hook + propagating the "host
> > > already zeroed this page" information through the buddy allocator.
> > > "
> > >
> > > So relying on "everything in buddy is zeroed" would still work I'd think.
> > >
> >
> > This regresses for anything that previously didn't zero on free or
> > alloc, which is most kernel allocations.
> >
> > I think the scope of this set has increased too much based on early
> > feedback to fix the userland-initiated allocations piece along with the
> > balloon/reporting/double-zero piece.  That's making all of this
> > difficult to continue following.
> 
> Yeah I feel this is 3, 4 or 5 series put together, and there's a lot to
> discuss in each :) so it's pretty difficult to work with them all put
> together.
> 
> These need to be deferred/separated.

I can do that, it's just that the real performance benefits
only come with the last patches in the series.

If I send a series that merely moves zeroing around, with
a bunch of threading of addresses and stuff to achieve that,
0 perf gain and slight degradation in corner cases like memcg
failures, you feel it will be well received? You guys really
want to do that, independently of the rest?

Just making sure, I'm not the maintainer here.


> >
> > ~Gregory
> 
> Thanks, Lorenzo



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
  2026-06-08 11:23   ` Lorenzo Stoakes
  2026-06-08 15:53     ` Gregory Price
@ 2026-06-08 19:42     ` Michael S. Tsirkin
  1 sibling, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 19:42 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 12:23:07PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:36:38AM -0400, Michael S. Tsirkin wrote:
> > When post_alloc_hook() needs to zero a page for an explicit
> > __GFP_ZERO allocation for a user page (user_addr is set), use folio_zero_user()
> > instead of kernel_init_pages().  This zeros near the faulting
> > address last, keeping those cachelines hot for the impending
> > user access.
> >
> > folio_zero_user() is only used for explicit __GFP_ZERO, not for
> > init_on_alloc.  On architectures with virtually-indexed caches
> > (e.g., ARM), clear_user_highpage() performs per-line cache
> > operations; using it for init_on_alloc would add overhead that
> > kernel_init_pages() avoids (the page fault path flushes the
> > cache at PTE installation time regardless).
> >
> > No functional change yet: current callers do not pass __GFP_ZERO
> > for user pages (they zero at the callsite instead).  Subsequent
> > patches will convert them.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Assisted-by: Claude:claude-opus-4-6
> > ---
> >  mm/page_alloc.c | 35 ++++++++++++++++++++++++++++++++---
> >  1 file changed, 32 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4676fd49819e..d4fbf1861a8a 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1861,9 +1861,38 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >  		for (i = 0; i != 1 << order; ++i)
> >  			page_kasan_tag_reset(page + i);
> >  	}
> > -	/* If memory is still not initialized, initialize it now. */
> > -	if (init)
> > -		kernel_init_pages(page, 1 << order);
> > +	/*
> > +	 * On architectures with cache aliasing, pages zeroed via the
> > +	 * kernel direct map (e.g. init_on_free) must be re-zeroed
> > +	 * through a user-congruent mapping.  Host-zeroed pages
> > +	 * (zeroed flag) don't need this: physical RAM is clean.
> > +	 */
> > +	if (!init && (gfp_flags & __GFP_ZERO) &&
> > +	    user_addr != USER_ADDR_NONE &&
> > +	    user_alloc_needs_zeroing())
> 
> We check this (gfp_flags & __GFP_ZERO) && user_addr != USER_ADDR_NONE thing
> twice, can we just put in a 'init_should_folio_zero' const bool or something?
> 
> > +		init = true;
> 
> As Vlasta says not sure if we want to add complexity just for these arches.
> 
> > +	/*
> > +	 * If memory is still not initialized, initialize it now.
> 
> I kinda hate that 'init' is unclear as to 'do init' or 'was init somewhere
> > else'... Anyway.
> 
> > +	 * When __GFP_ZERO was explicitly requested and user_addr is set,
> > +	 * use folio_zero_user() which zeros near the faulting address
> > +	 * last, keeping those cachelines hot.  For init_on_alloc, use
> > +	 * kernel_init_pages() to avoid unnecessary cache flush overhead
> > +	 * on architectures with virtually-indexed caches.
> 
> This whole paragraph seems pretty useless and just describing the code?
> 
> > +	 */
> > +	if (init) {
> > +		if ((gfp_flags & __GFP_ZERO) && user_addr != USER_ADDR_NONE) {
> > +			/*
> > +			 * folio_zero_user relies on folio_nr_pages which
> > +			 * requires __GFP_COMP for order > 0.  All user folio
> > +			 * allocations set __GFP_COMP via __folio_alloc.
> 
> This whole paragraph is useless and very like the kind of stuff AI generates for
> comments, i.e. overly long + entirely unnecessary stuff.


It was an attempt to make sashiko shut up, it doesn't understand the
context and kept complaining. Didn't really help so yea I should drop this.

> 
> > +			 * user_addr != USER_ADDR_NONE implies sleepable
> > +			 * context (user page fault).
> 
> Can you safely assume that? Also inferring which context we are in from this
> parameter seems risky.
> 
> It seems to me that you're now making it such that kernel developers:
> 
> - Have to know when and when not to specify a user address, and under what
>   circumstances we might consider that to be mapped.
> 
> - Need to know to do this correctly for aliasing architectures or have silent
>   correctness issues.
> 
> - Need to take context into account when specifying this.
> 
> We definitely need to find a simpler way to do this!
> 
> > +			 */
> > +			VM_WARN_ON_ONCE(order && !(gfp_flags & __GFP_COMP));
> 
> Surely by now we can assume this?

Another attempt to make it obvious.


> > +			folio_zero_user(page_folio(page), user_addr);
> > +		} else
> > +			kernel_init_pages(page, 1 << order);
> 
> I hate this hanging else branch... definitely prefer {} on both branches.
> 
> But in any case it seems like we could avoid some indentation with something
> like:
> 
> 	if (init && init_should_folio_zero) {
> 		...
> 	} else if (init) {
> 		...
> 	}
> 
> Or even a:
> 
> 	if (!init)
> 		goto out;
> 
> And stick an out label below?
> 
> > +	}
> 
> >
> >  	set_page_owner(page, order, gfp_flags);
> >  	page_table_check_alloc(page, order);
> > --
> > MST
> >
> 
> Oh and in general it seems that this conflicts with [0] which removes
> kernel_init_pages().
> 
> [0]: https://lore.kernel.org/all/20260422102729.166599-1-hsalunke@amd.com/
> 
> Thanks, Lorenzo
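
The restructuring suggested in the review above (compute the repeated
condition once, then branch on it) can be sketched in userspace as
below.  Every name here is a hypothetical stand-in: MOCK_GFP_ZERO and
MOCK_USER_ADDR_NONE model __GFP_ZERO and USER_ADDR_NONE, the
user_needs_zeroing argument models the result of
user_alloc_needs_zeroing(), and the enum only records which zeroing
routine would run.  The branches are meant to mirror the quoted hunk,
not to replace it.

```c
#include <stdbool.h>

#define MOCK_GFP_ZERO       0x1u     /* stand-in for __GFP_ZERO */
#define MOCK_USER_ADDR_NONE (~0UL)   /* stand-in for USER_ADDR_NONE */

enum mock_zero_method {
	MOCK_ZERO_NONE,    /* nothing to do */
	MOCK_ZERO_KERNEL,  /* models kernel_init_pages() */
	MOCK_ZERO_USER,    /* models folio_zero_user() */
};

/*
 * init is false when an earlier step decided the page is already
 * initialised; it may be forced back on for user pages on
 * architectures that need zeroing through a user-congruent mapping.
 */
static enum mock_zero_method
mock_pick_zeroing(bool init, unsigned int gfp, unsigned long user_addr,
		  bool user_needs_zeroing)
{
	/* The condition the patch tests twice, evaluated once. */
	const bool user_zero = (gfp & MOCK_GFP_ZERO) &&
			       user_addr != MOCK_USER_ADDR_NONE;

	if (!init && user_zero && user_needs_zeroing)
		init = true;
	if (!init)
		return MOCK_ZERO_NONE;
	return user_zero ? MOCK_ZERO_USER : MOCK_ZERO_KERNEL;
}
```

Written this way there is no hanging else and only one level of
nesting, which is roughly what the "goto out" variant would give too.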



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 18:39         ` David Hildenbrand (Arm)
@ 2026-06-08 19:43           ` Gregory Price
  2026-06-08 19:52             ` Zi Yan
  0 siblings, 1 reply; 131+ messages in thread
From: Gregory Price @ 2026-06-08 19:43 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Zi Yan, Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli

On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 17:27, Zi Yan wrote:
> > On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
> > 
> > Or should we defer zeroing after a page is returned from allocator? So that
> > user_addr does not need to be passed through irrelevant allocation APIs.
> > Something like:
> > 
> > alloc_page_wrapper(gfp, order, user_addr)
> > {
> > 	page = alloc_pages();
> > 	if (gfp & __GFP_ZERO)
> > 		clear_page(page);
> > }
> > 
> 
> Not really sure what's best here. I think we'd want to limit the lifting to some
> internal API, so it cannot easily be messed up by random kernel code calling
> into the wrong API and not getting pages cleared.
> 

We're a bit in circles on this.  We discussed explicit interfaces a few
months back and the trade off was:

a) add user_addr to the existing API and cause churn

or

b) add special interface like above
   increase the buddy surface
   leaves open the ability for users to get it wrong easily

If we forget VMs for a moment and break this step out separately, the
core question is whether page_alloc.c is the right place to be calling
the folio_user_zero() or whatever it is.

We seem to have agreed "yes", which necessitates the plumbing of the
address into the allocator.  The question is whether it should churn the
existing interface or have its own explicit interface.

The implications of getting it wrong are:  a user page doesn't get
zeroed.  *oof*

I think that's why we thought the churn was better.  But considering
we're already in the same state now (callers are responsible for calling
folio_user_zero() or whatever), maybe that's not horrid? 

~Gregory


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
  2026-06-08 15:53     ` Gregory Price
@ 2026-06-08 19:45       ` Michael S. Tsirkin
  2026-06-08 20:16         ` Gregory Price
  0 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 19:45 UTC (permalink / raw)
  To: Gregory Price
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 11:53:40AM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 12:23:07PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:36:38AM -0400, Michael S. Tsirkin wrote:
> > 
> > > +			 * user_addr != USER_ADDR_NONE implies sleepable
> > > +			 * context (user page fault).
> > 
> > Can you safely assume that? Also inferring which context we are in from this
> > parameter seems risky.
> > 
> > It seems to me that you're now making it such that kernel developers:
> > 
> > - Have to know when and when not to specify a user address, and under what
> >   circumstances we might consider that to be mapped.
> > 
> > - Need to know to do this correctly for aliasing architectures or have silent
> >   correctness issues.
> > 
> > - Need to take context into account when specifying this.
> > 
> > We definitely need to find a simpler way to do this!
> >
> 
> This feedback was poked at in earlier versions.  There's a tension
> between keeping the old interface as-is, having explicit interfaces
> for something like this, and the state of a page inside the
> allocator vs outside.
> 
> Double-plus complicated by the fact that we're trying to reason about
> two allocators at once:  host and guest.
> 
> It seems it has gotten a bit more complicated since then (I missed this
> "sleepable context" bit, not sure if it was there on prior versions).
> 
> If `user_addr` is now implying anything other than exactly: "This needs
> to be zeroed / caches flushed", then this is bad.
> 
> ~Gregory

Well if you do folio_zero_user in a non sleepable context then things
are not going to work. So combining e.g. GFP_ATOMIC and __GFP_ZERO and
user_addr all together is not a good idea.

You are saying it's bad? It's pretty fundamental to the idea of moving
zeroing into the allocator, I feel.
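
To make the constraint concrete, here is a minimal userspace sketch of
the check being described, assuming "sleepable" can be read off the
allocation flags.  The MOCK_* values and both helpers are hypothetical;
in the kernel the equivalent test would presumably be
gfpflags_allow_blocking(), which looks at __GFP_DIRECT_RECLAIM.

```c
#include <stdbool.h>

#define MOCK_GFP_ZERO           0x1u    /* stand-in for __GFP_ZERO */
#define MOCK_GFP_DIRECT_RECLAIM 0x2u    /* stand-in: caller may sleep */
#define MOCK_USER_ADDR_NONE     (~0UL)  /* stand-in for USER_ADDR_NONE */

/* Models a "may this allocation block?" test on the flags. */
static bool mock_allow_blocking(unsigned int gfp)
{
	return (gfp & MOCK_GFP_DIRECT_RECLAIM) != 0;
}

/*
 * A request that would take the user-zeroing path (zeroing requested
 * and a user address supplied) is only acceptable when the caller may
 * sleep.  Every other combination is unaffected by this rule.
 */
static bool mock_user_zero_request_ok(unsigned int gfp,
				      unsigned long user_addr)
{
	const bool user_zero = (gfp & MOCK_GFP_ZERO) &&
			       user_addr != MOCK_USER_ADDR_NONE;

	return !user_zero || mock_allow_blocking(gfp);
}
```

A check of this shape derives the context from the flags rather than
inferring it from user_addr, which is the distinction the review
comments are drawing.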

-- 
MST



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 19:43           ` Gregory Price
@ 2026-06-08 19:52             ` Zi Yan
  2026-06-08 20:25               ` Gregory Price
  0 siblings, 1 reply; 131+ messages in thread
From: Zi Yan @ 2026-06-08 19:52 UTC (permalink / raw)
  To: Gregory Price
  Cc: David Hildenbrand (Arm), Lorenzo Stoakes, Michael S. Tsirkin,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On 8 Jun 2026, at 15:43, Gregory Price wrote:

> On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/8/26 17:27, Zi Yan wrote:
>>> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
>>>
>>> Or should we defer zeroing after a page is returned from allocator? So that
>>> user_addr does not need to be passed through irrelevant allocation APIs.
>>> Something like:
>>>
>>> alloc_page_wrapper(gfp, order, user_addr)
>>> {
>>> 	page = alloc_pages();
>>> 	if (gfp & __GFP_ZERO)
>>> 		clear_page(page);
>>> }
>>>
>>
>> Not really sure what's best here. I think we'd want to limit the lifting to some
>> internal API, so it cannot easily be messed up by random kernel code calling
>> into the wrong API and not getting pages cleared.
>>
>
> We're a bit in circles on this.  We discussed explicit interfaces a few
> months back and the trade off was:
>
> a) add user_addr to the existing API and cause churn
>
> or
>
> b) add special interface like above
>    increase the buddy surface
>    leaves open the ability for users to get it wrong easily
>
> If we forget VMs for a moment and break this step out separately, the
> core question is whether page_alloc.c is the right place to be calling
> the folio_user_zero() or whatever it is.

page_alloc.c calling folio_user_zero() is fine, but my question is
whether we should do the zeroing inside post_alloc_hook(), part of
allocation.

What I propose is to lift __GFP_ZERO up as much as possible,
so that most of allocation code does not need to care about it.
We do the zeroing right before the page is returned to callers.

>
> We seem to have agreed "yes", which necessitates the plumbing of the
> address into the allocator.  The question is whether it should churn the
> existing interface or have its own explicit interface.
>
> The implications of getting it wrong are:  a user page doesn't get
> zeroed.  *oof*
>
> I think that's why we thought the churn was better.  But considering
> we're already in the same state now (callers are responsible for calling
> folio_user_zero() or whatever), maybe that's not horrid?

Yeah, some callers think they know better about huge page zeroing and want to
do it differently. That caused double zeroing when init_on_alloc was
introduced and was mitigated by user_alloc_needs_zeroing().
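
As a rough userspace illustration of lifting the zeroing out of the
core allocation path: the wrapper below strips the zero flag before
calling the inner allocator and clears the memory itself just before
returning.  All names are hypothetical (MOCK_GFP_ZERO stands in for
__GFP_ZERO, malloc() stands in for the buddy allocator), and the 0xAA
fill only models stale page contents.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MOCK_GFP_ZERO  0x1u   /* stand-in for __GFP_ZERO */
#define MOCK_PAGE_SIZE 4096u

/* Records what the inner allocator was asked for, for inspection. */
static unsigned int last_inner_flags;

/* Hypothetical core allocator: it never zeroes and returns junk. */
static void *mock_alloc_pages(unsigned int flags, unsigned int order)
{
	size_t len = (size_t)MOCK_PAGE_SIZE << order;
	void *p = malloc(len);

	last_inner_flags = flags;
	if (p)
		memset(p, 0xAA, len);
	return p;
}

/*
 * Wrapper that owns the zeroing: the core path never sees the zero
 * flag, and the clear happens right before the memory is handed back.
 */
static void *mock_alloc_pages_zeroing(unsigned int flags,
				      unsigned int order)
{
	void *p = mock_alloc_pages(flags & ~MOCK_GFP_ZERO, order);

	if (p && (flags & MOCK_GFP_ZERO))
		memset(p, 0, (size_t)MOCK_PAGE_SIZE << order);
	return p;
}
```

The cost noted earlier in the thread still applies: a caller that goes
around the wrapper and calls the inner allocator directly gets memory
that was never cleared.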

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
  2026-06-08 18:40       ` Matthew Wilcox
@ 2026-06-08 19:55         ` Michael S. Tsirkin
  0 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 19:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gregory Price, Lorenzo Stoakes, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 07:40:23PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 12:09:19PM -0400, Gregory Price wrote:
> > On Mon, Jun 08, 2026 at 01:00:17PM +0100, Lorenzo Stoakes wrote:
> > > On Mon, Jun 08, 2026 at 04:37:48AM -0400, Michael S. Tsirkin wrote:
> > > >
> > > >  void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> > > > -		     unsigned long user_addr);
> > > > +		     bool zeroed, unsigned long user_addr);
> > > 
> > > host_zeroed or something would be more appropriate no?
> > > 
> > > But in general do we need to propagate this around, can't we derive it from
> > > the page zeroed flag?
> > > 
> > > It's really confusing as to _which_ zeroing this refers to, it seems the
> > > only one relevant here is the VM host zeroing but that's completely
> > > non-obvious and now everybody using these functions with the extra param
> > > will simply have to happen to know this.
> > > 
> > > If we could find a way to avoid this propagation that'd be ideal.
> > > 
> > > Failing that, making it clear this is _only_ for vm host zeroing would be
> > > better, but then maybe we need to think about how we could encode this in
> > > some other way, e.g. passing alloc_context perhaps?
> > > 
> > 
> > This is unaddressed feedback from 3 versions ago:
> > 
> > https://lore.kernel.org/linux-mm/agXYbcuQYooG74pb@gourry-fedora-PF4VCD3F/
> > 
> > We can infer all of this from snapshotted page flags and propagate those
> > around.  This is infinitely more useful than just a single flag being
> > pulled out into a boolean, and more extensible.
> > 
> > void
> > post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> >                 uint64_t pg_alloc_flags, unsigned long user_addr);
> > 
> > 		^^^ page_alloc.c internal flags only
> > 
> > Once the allocator gets a page it wants to return, it can take a snapshot
> > of the flags at that point, and then doodle all over the flags as it
> > goes through the page setup prior to return (including the post hook).
> > 
> > Haven't seen a reason why this shouldn't be done this way.
> 
> I'd tuned out this awful series since it'd become apparent that my
> feedback wasn't going to be taken seriously.

Sorry about that.

> Good to know I shouldn't
> take it personally because he does it to you too.
> 
> I think it's apparent that Michael has no understanding of the MM.

It's a bit of an overstatement but I'm more of a networking guy, yes.
What I freely admit I don't understand is why I have to refactor all of mm
first.

> So we should start again with the architecture.  Let's look at the
> problem that he's trying to solve:
> 
>  - The hypervisor has zeroed the memory that it passes to the MM
>  - But the MM then zeroes it again before passing it to userspace.
>  - We want to avoid this
> 
> Let's make sure that's the actual problem before going any further.
> Because I do have a design that will satisfy that without doing this
> insane level of invasive change, but if that's not the problem, there's
> no point wasting my time writing it up.

Yes that's exactly the problem I was trying to solve. Early RFC versions
didn't do this invasive change:

https://lore.kernel.org/lkml/cover.1776689093.git.mst@redhat.com/


But I was asked to refactor mm first and implement the optimization
second.

Sure, glad to hear what the design is.



-- 
MST



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed
  2026-06-08 14:28         ` David Hildenbrand (Arm)
@ 2026-06-08 19:58           ` Michael S. Tsirkin
  0 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 19:58 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, linux-kernel, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:28:48PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 16:08, Michael S. Tsirkin wrote:
> > On Mon, Jun 08, 2026 at 02:46:34PM +0200, David Hildenbrand (Arm) wrote:
> >> On 6/8/26 14:25, Lorenzo Stoakes wrote:
> >>>
> >>> Do not put comments about specific expected races like this in the commit
> >>> message but not in the code. Subtleties need to be called out.
> >>>
> >>> The commit message also doesn't at all explain why PG_zeroed doesn't
> >>> suffice here.
> >>>
> >>>
> >>> I really don't understand why you have a 'zeroed' folio flag but need to
> >>> also have new API calls to detect that?
> >>>
> >>> They're also HORRIBLY named. Zeroed as in what? Zero page? Huge zero page?
> >>> Memory zeroed by kernel? Pages that userland happen to have zeroed? Or host
> >>> VM zeroed?
> >>>
> >>> Each are cases we address individually and relate to folios.
> >>>
> >>> You absolutely fail to clarify _which one_ you mean, and provide absolutely
> >>> no documentation and add an exported mm API with no description.
> >>>
> >>> This is just I think not something we want to add? Especially on something
> >>> so fundamental?
> >>
> >> I raised previously that providing a folio helper is odd, and that I suggested
> >> that we defer this change.
> > 
> > Sadly it's a dependency actually - without it memcg failures would cause
> > repeated re-zeroing where previously it failed without zeroing.
> 
> Oh, you mean that we succeeded allocating (+zeroing) but failed to charge?
> 
> I don't immediately see that to be a real problem?

Yes exactly.
I don't really know if any real applications live close enough to memcg edge
that repeatedly wasting cycles zeroing pages then discarding that
information will be noticeable.

I should be able to write a test to show the difference, if that's the
question.

So just writing code in a way that we are not regressing them seemed cleaner to me.
But I'm not a maintainer so hey. Just so we are clear.

-- 
MST



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 13:04       ` Matthew Wilcox
  2026-06-08 13:09         ` Lorenzo Stoakes
@ 2026-06-08 19:59         ` Gregory Price
  2026-06-08 20:21           ` Zi Yan
  1 sibling, 1 reply; 131+ messages in thread
From: Gregory Price @ 2026-06-08 19:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
> > But instead of overloading user_addr to indicate all kinds of things, instead
> > make life easier by actually breaking things out.
> > 
> > Like:
> > 
> > enum alloc_context_type {
> > 	KERNEL_ALLOCATION,
> > 	USER_MAPPED_ALLOCATION,
> > 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
> > 	/* Perhaps some other states we want to encode? */
> > };
> > 
> > struct alloc_context {
> > 	...
> > 
> > 	enum alloc_context_type type;
> > 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
> > 
> > 	// Maybe something suggesting context or whether we init before in some
> > 	// cases?
> > };
> 
> Ugh, please, no.  As I suggested last time I commented on this
> trainwreck of a series, lift the zeroing functionality from
> alloc_frozen_pages() into its callers.

This sort of just implies writing the "alloc_frozen_zeroed_pages()"
wrapper that does the zeroing at the end before return, and then killing
the post hook nonsense associated with it in the first place.

None of this resolves the user address annoyance which is needed on some
archs for cache flushing.  Whether anyone agrees that the page allocator
should be responsible for this particular operation - open debate.

~Gregory


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
  2026-06-08 14:21 ` Matthew Wilcox
@ 2026-06-08 20:02   ` Michael S. Tsirkin
  0 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 20:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 03:21:25PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 04:33:46AM -0400, Michael S. Tsirkin wrote:
> > Further, on architectures with aliasing caches, upstream with init_on_alloc
> 
> Further to what?  Did you leave out some paragraphs here?
> 
> As far as I can tell, this patch series decides to trust that the
> hypervisor has zeroed pages that it allocates to the guest.  But
> as far as I can tell, the trend is towards less trust in the hypervisor
> from the guest, not more.

AKA confidential computing. I'm not a visionary, no idea about trends, but
yes these are used more than in the past (not hard given it used to be
0% of the market in the past).

Page reporting already leaks some info like free page addresses, so it's
for trusted hypervisors.

Anyway:
Subject: [PATCH v10 35/37] virtio_balloon: disable reporting zeroed optimization for confidential guests

makes sure guests that do not trust hypervisors are not affected.

-- 
MST



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 16/37] mm: alloc_swap_folio: pass raw fault address to vma_alloc_folio
  2026-06-08 15:59     ` Gregory Price
@ 2026-06-08 20:09       ` Michael S. Tsirkin
  0 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 20:09 UTC (permalink / raw)
  To: Gregory Price
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 11:59:32AM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 12:37:20PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:37:41AM -0400, Michael S. Tsirkin wrote:
> > > Same change as the previous patch but for alloc_swap_folio:
> > 
> > Please don't say 'same change as the previous patch' :) explain what you're
> > doing here. It's a pain to have to go check otherwise.
> > 
> 
> MST you need to slow down a bit.
> 
> I gave you this same feedback 3 versions ago:
> 
> https://lore.kernel.org/linux-mm/agXUHItfxSwtriRF@gourry-fedora-PF4VCD3F/
> 
> ~Gregory

Ooof I do try but the patchset is just too big. Sorry. I need to find a
way to split it. Or maybe Matthew will tell me how to make it much
smaller, he says he sees a way that will make everyone happy. Let's
wait.

-- 
MST



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
  2026-06-08 19:45       ` Michael S. Tsirkin
@ 2026-06-08 20:16         ` Gregory Price
  2026-06-08 20:30           ` Michael S. Tsirkin
  0 siblings, 1 reply; 131+ messages in thread
From: Gregory Price @ 2026-06-08 20:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 03:45:58PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 11:53:40AM -0400, Gregory Price wrote:
> > 
> > If `user_addr` is now implying anything other than exactly: "This needs
> > to be zeroed / caches flushed", then this is bad.
> > 
> > ~Gregory
> 
> Well if you do folio_zero_user in a non sleepable context then things
> are not going to work. So combining e.g. GFP_ATOMIC and GFP_ZERO and
> user_addr all together is not a good idea.
>

Can you say whether (GFP_ATOMIC | GFP_ZERO) w/o user_addr has the same
issue?  If not, then this subtle complexity is now a tripping hazard.

Is there some combination of arguments here that should just outright
fail if a user attempts it?

>
> You are saying it's bad? It's pretty fundamental to the idea of moving
> zeroing into the allocator, I feel.
> 

I'm saying having to infer that safety state from the cobbling of those
things together is not a good pattern (at least as-is).

If the introduction of user_addr into the mix is the thing that causes
us to have to infer safety, then there's an argument the page allocator
shouldn't handle that operation (in this case: user_addr cache flush).

Please consider that this is arguably the most fundamental interface in
all of mm/.  All we're doing is going through the process of figuring
out what changes here are reasonable while trying to meet your goal.

~Gregory

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
  2026-06-08 14:14       ` Lorenzo Stoakes
@ 2026-06-08 20:17         ` Michael S. Tsirkin
  0 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 20:17 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Miaohe Lin

On Mon, Jun 08, 2026 at 03:14:51PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 09:48:34AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jun 08, 2026 at 10:43:21AM +0100, Lorenzo Stoakes wrote:
> > > On Mon, Jun 08, 2026 at 04:34:23AM -0400, Michael S. Tsirkin wrote:
> > > > TestSetPageHWPoison() is called without zone->lock, so its atomic
> > > > update to page->flags can race with non-atomic flag operations
> > > > that run under zone->lock in the buddy allocator.
> > > >
> > > > In particular, __free_pages_prepare() does:
> > > >
> > > >     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > > >
> > > > This non-atomic read-modify-write, while correctly excluding
> > > > __PG_HWPOISON from the mask, can still lose a concurrent
> > > > TestSetPageHWPoison if the read happens before the poison bit
> > > > is set and the write happens after.  Follow-up patches in this
> > > > series add similar non-atomic flag operations as well.
> > > >
> > > > Fix by acquiring zone->lock around TestSetPageHWPoison and
> > > > around ClearPageHWPoison in the retry path.  This
> > > > serializes with all buddy flag manipulation.  The cost is
> > > > negligible: one lock/unlock in an extremely rare path
> > > > (hardware memory errors).
> > > >
> > > > Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> > > > in this file operate on pages already removed from the buddy
> > > > allocator or on non-buddy pages (DAX, hugetlb), so they do not
> > > > need zone->lock protection.
> > > >
> > > > Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > >
> > > Can we have Fixes: and Cc: stable and also send this separately please?
> > >
> > > These patches seem like unrelated fixups that you've discovered along the way,
> > > and don't belong as part of the already rather large series, unless I'm missing
> > > something here.
> > >
> > > Thanks, Lorenzo
> >
> > I think you are missing that they are a dependency, not unrelated.
> 
> Then say so.
> 
> > For example, this issue gets worse with the patchset as there are more
> > places that manipulate flags without atomics. No?
> 
> It's your job to make that case, not mine.
> 
> >
> >
> > You are welcome to send this to stable, but I think stable rules
> > preclude theoretical bugfixes.
> 
> It's a dependency but also theoretical?

As in, the race is extremely hard to trigger and I have no idea if it
triggers for anyone, but it's obvious from reading the code that
theoretically it exists? Yes.
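
The lost update can be replayed deterministically in userspace. The sketch
below is only an illustration of the interleaving from the patch
description: the DEMO_* bit values and the demo_* helpers are hypothetical
and are not the kernel's page flag layout or API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen only for this illustration. */
#define DEMO_FLAGS_CHECK_AT_PREP 0x0Fu /* bits cleared when a page is freed */
#define DEMO_FLAG_HWPOISON       0x10u /* bit set by the memory-failure path */

/*
 * Replays the racy interleaving in a single thread:
 *   freeing side:  stale = flags            (read half of the RMW)
 *   poison side:   flags |= HWPOISON        (lands in between)
 *   freeing side:  flags = stale & ~MASK    (write half, stale value)
 */
static uint32_t demo_unserialized(uint32_t flags)
{
	uint32_t stale;

	assert((flags & DEMO_FLAG_HWPOISON) == 0); /* page starts unpoisoned */

	stale = flags;                              /* read */
	flags |= DEMO_FLAG_HWPOISON;                /* concurrent poison */
	flags = stale & ~DEMO_FLAGS_CHECK_AT_PREP;  /* write overwrites the poison bit */
	return flags;
}

/*
 * With both sides under one lock the operations cannot interleave, so
 * only the two whole-operation orders are possible. The mask excludes
 * the poison bit, so both orders keep it.
 */
static uint32_t demo_serialized(uint32_t flags, int poison_first)
{
	if (poison_first) {
		flags |= DEMO_FLAG_HWPOISON;
		flags &= ~DEMO_FLAGS_CHECK_AT_PREP;
	} else {
		flags &= ~DEMO_FLAGS_CHECK_AT_PREP;
		flags |= DEMO_FLAG_HWPOISON;
	}
	return flags;
}
```

Starting from flags 0x0F, demo_unserialized() ends with the poison bit
clear, while demo_serialized() ends with it set for either order.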

> >
> > As for Fixes: the issue has been there for decades. I wouldn't know
> > what to attribute it for.
> 
> Again, your job.

Alright, if you insist:

Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")

now everyone running 2.6 kernels will backport this fix, I presume.


> >
> >
> > I guess I could send these separately, too, why not. Not sure
> > what this accomplishes, but hey.  But is that an ack? You want
> > this fix merged even before the feature?
> 
> I already made the case as to why, as have other maintainers.
> 
> If you need to review what an ack looks like please consult
> https://docs.kernel.org/process/5.Posting.html
> 
> Thanks, Lorenzo

I am merely asking if you want this patch in the set including
all these nits I had to fix.

-- 
MST



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 19:59         ` Gregory Price
@ 2026-06-08 20:21           ` Zi Yan
  2026-06-08 20:33             ` Michael S. Tsirkin
  0 siblings, 1 reply; 131+ messages in thread
From: Zi Yan @ 2026-06-08 20:21 UTC (permalink / raw)
  To: Gregory Price
  Cc: Matthew Wilcox, Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli

On 8 Jun 2026, at 15:59, Gregory Price wrote:

> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
>>> But instead of overloading user_addr to indicate all kinds of things, instead
>>> make life easier by actually breaking things out.
>>>
>>> Like:
>>>
>>> enum alloc_context_type {
>>> 	KERNEL_ALLOCATION,
>>> 	USER_MAPPED_ALLOCATION,
>>> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
>>> 	/* Perhaps some other states we want to encode? */
>>> };
>>>
>>> struct alloc_context {
>>> 	...
>>>
>>> 	enum alloc_context_type type;
>>> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
>>>
>>> 	// Maybe something suggesting context or whether we init before in some
>>> 	// cases?
>>> };
>>
>> Ugh, please, no.  As I suggested last time I commented on this
>> trainwreck of a series, lift the zeroing functionality from
>> alloc_frozen_pages() into its callers.
>
> This sort of just implies writing the "alloc_frozen_zeroed_pages()"
> wrapper that does the zeroing at the end before return, and then killing
> the post hook nonsense associated with it in the first place.

This means it is going to be a multi-step optimization. This is probably
step 1.

>
> None of this resolves the user address annoyance which is needed on some
> archs for cache flushing.  Whether anyone agrees that the page allocator
> should be responsible for this particular operation - open debate.

This is probably step 2. But does the virtio use case apply to these
archs? Does the performance matter for them? If not, maybe this part can
be left as a TODO.


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 19:52             ` Zi Yan
@ 2026-06-08 20:25               ` Gregory Price
  2026-06-08 20:37                 ` Zi Yan
  0 siblings, 1 reply; 131+ messages in thread
From: Gregory Price @ 2026-06-08 20:25 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand (Arm), Lorenzo Stoakes, Michael S. Tsirkin,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 03:52:20PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 15:43, Gregory Price wrote:
> 
> > On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
> >> On 6/8/26 17:27, Zi Yan wrote:
> >>> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
> >>>
> >>> Or should we defer zeroing after a page is returned from allocator? So that
> >>> user_addr does not need to be passed through irrelevant allocation APIs.
> >>> Something like:
> >>>
> >>> alloc_page_wrapper(gfp, order, user_addr)
> >>> {
> >>> 	page = alloc_pages();
> >>> 	if (gfp & __GFP_ZERO)
> >>> 		clear_page(page);
> >>> }
> >>>
> >>
> >> Not really sure what's best here. I think we'd want to limit the lifting to some
> >> internal API, so it cannot easily be messed up by random kernel code calling
> >> into the wrong API and not getting pages cleared.
> >>
> >
> > We're a bit in circles on this.  We discussed explicit interfaces a few
> > months back and the trade off was:
> >
> > a) add user_addr to the existing API and cause churn
> >
> > or
> >
> > b) add special interface like above
> >    increase the buddy surface
> >    leaves open the ability for users to get it wrong easily
> >
> > If we forget VMs for a moment and break this step out separately, the
> > core question is whether page_alloc.c is the right place to be calling
> > the folio_user_zero() or whatever it is.
> 
> page_alloc.c calling folio_user_zero() is fine, but my question is
> whether we should do the zeroing inside post_alloc_hook(), part of
> allocation.
> 
> What I propose is to lift __GFP_ZERO up as much as possible,
> so that most of allocation code does not need to care about it.
> We do the zeroing right before the page is returned to callers.
>

essentially we end up with something like

alloc_frozen_...(..., gfp)
{
  folio = whatever(..., gfp);
  if (gfp & __GFP_ZERO)
    folio_zero(folio, -1); /* don't do cache flush part */
}

alloc_frozen_user_...(..., gfp, user_addr)
{
  folio = whatever(..., gfp);
  if (gfp & __GFP_ZERO)
    folio_zero(folio, user_addr); /* do cache flush part */
}

The downside of this is obvious: it's easy for developers to get this
wrong and call the non-user interface for user-bound allocations and
miss the cache flush (that is only needed on some archs).

Not saying that's a deal breaker, but it's something to chew on.
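
A userspace model of the two-wrapper shape sketched above may make the
failure mode easier to see. Everything here is hypothetical: struct
demo_page, the DEMO_* constants and the demo_alloc_*() helpers exist only
for this illustration, and the "flush" is a bookkeeping flag standing in
for whatever cache maintenance an arch with aliasing caches would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_GFP_ZERO  0x1u   /* hypothetical stand-in for __GFP_ZERO */
#define DEMO_PAGE_SIZE 4096u

struct demo_page {
	unsigned char data[DEMO_PAGE_SIZE];
	bool zeroed;            /* contents were cleared */
	bool flushed_for_user;  /* user-address cache step was performed */
};

/* Common allocation path: knows nothing about zeroing. */
static struct demo_page *demo_raw_alloc(void)
{
	struct demo_page *p = malloc(sizeof(*p));

	if (p) {
		memset(p->data, 0xAA, sizeof(p->data)); /* stale contents */
		p->zeroed = false;
		p->flushed_for_user = false;
	}
	return p;
}

/* Kernel-only wrapper: zero on request, no user-address handling. */
static struct demo_page *demo_alloc_kernel(unsigned int gfp)
{
	struct demo_page *p = demo_raw_alloc();

	if (p && (gfp & DEMO_GFP_ZERO)) {
		memset(p->data, 0, sizeof(p->data));
		p->zeroed = true;
	}
	return p;
}

/* User wrapper: zero on request and do the user-address step as well. */
static struct demo_page *demo_alloc_user(unsigned int gfp,
					 unsigned long user_addr)
{
	struct demo_page *p = demo_raw_alloc();

	if (p && (gfp & DEMO_GFP_ZERO)) {
		memset(p->data, 0, sizeof(p->data));
		p->zeroed = true;
		(void)user_addr; /* a real arch hook would key off this */
		p->flushed_for_user = true;
	}
	return p;
}
```

A page from demo_alloc_kernel() that later ends up mapped to userspace is
zeroed but never gets flushed_for_user set, and nothing in the types stops
a caller from doing that.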

~Gregory


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
  2026-06-08 20:16         ` Gregory Price
@ 2026-06-08 20:30           ` Michael S. Tsirkin
  2026-06-08 20:53             ` Gregory Price
  0 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 20:30 UTC (permalink / raw)
  To: Gregory Price
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:16:53PM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 03:45:58PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jun 08, 2026 at 11:53:40AM -0400, Gregory Price wrote:
> > > 
> > > If `user_addr` is now implying anything other than exactly: "This needs
> > > to be zeroed / caches flushed", then this is bad.
> > > 
> > > ~Gregory
> > 
> > Well if you do folio_zero_user in a non sleepable context then things
> > are not going to work. So combining e.g. GFP_ATOMIC and GFP_ZERO and
> > user_addr all together is not a good idea.
> >
> 
> Can you say whether (GFP_ATOMIC | GFP_ZERO) w/o user_addr has the same
> issue?

I don't think it does, because it does not call folio_zero_user, right?

>  If not, then this subtle complexity is now a tripping hazard.

Yes.

> Is there some combination of arguments here that should just outright
> fail if a user attempts it?

__GFP_DIRECT_RECLAIM at least.
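
One possible reading of this, as a sketch only: a request that asks for
zeroing with a user address needs a context that may sleep, so it gets
rejected when direct reclaim is not allowed. The DEMO_* values, the
"no user address" sentinel and demo_alloc_args_valid() are all
hypothetical and are not part of the series or of the kernel API.

```c
#include <assert.h>   /* used by the checks that exercise this helper */
#include <stdbool.h>

/* Hypothetical stand-ins for __GFP_ZERO and __GFP_DIRECT_RECLAIM. */
#define DEMO_GFP_ZERO           0x1u
#define DEMO_GFP_DIRECT_RECLAIM 0x2u

/* Assumed sentinel meaning "not a user-mapped allocation". */
#define DEMO_NO_USER_ADDR       (~0UL)

/*
 * Returns false for the one combination discussed here: zeroing
 * requested, user address supplied, but the caller cannot sleep.
 */
static bool demo_alloc_args_valid(unsigned int gfp, unsigned long user_addr)
{
	bool wants_user_zero = (gfp & DEMO_GFP_ZERO) != 0 &&
			       user_addr != DEMO_NO_USER_ADDR;
	bool may_sleep = (gfp & DEMO_GFP_DIRECT_RECLAIM) != 0;

	return !wants_user_zero || may_sleep;
}
```

Zeroing without a user address, or a user address without zeroing, passes
regardless of the reclaim flag, which matches the point that the hazard
only appears once user_addr enters the mix.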

> >
> > You are saying it's bad? It's pretty fundamental to the idea of moving
> > zeroing into the allocator, I feel.
> > 
> 
> I'm saying having to infer that safety state from the cobbling of those
> things together is not a good pattern (at least as-is).

Understood. Don't have a better idea, yet.

> If the introduction of user_addr into the mix is the thing that causes
> us to have to infer safety, then there's an argument the page allocator
> shouldn't handle that operation (in this case: user_addr cache flush).

It's not just the flush, we are also trying to use that to optimize
zeroing.

> 
> Please consider that this is arguably the most fundamental interface in
> all of mm/.  All we're doing is going through the process of figuring
> out what changes here are reasonable while trying to meet your goal.
> 
> ~Gregory

I don't mind discarding all of this and doing something else completely,
but I dislike it that multiple people are apparently now angry that I
don't address all the contradictory comments at the same time.
I thought just sending a patchset to show what the result looks like
is easier than arguing about architecture, and would be helpful.

I'm not pushing any of the mm rework, I was asked to do it,
myself I just want the ridiculously effective optimization in there.


-- 
MST



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 20:21           ` Zi Yan
@ 2026-06-08 20:33             ` Michael S. Tsirkin
  2026-06-08 20:40               ` Zi Yan
  0 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 20:33 UTC (permalink / raw)
  To: Zi Yan
  Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:21:08PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 15:59, Gregory Price wrote:
> 
> > On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
> >> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
> >>> But instead of overloading user_addr to indicate all kinds of things, instead
> >>> make life easier by actually breaking things out.
> >>>
> >>> Like:
> >>>
> >>> enum alloc_context_type {
> >>> 	KERNEL_ALLOCATION,
> >>> 	USER_MAPPED_ALLOCATION,
> >>> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
> >>> 	/* Perhaps some other states we want to encode? */
> >>> };
> >>>
> >>> struct alloc_context {
> >>> 	...
> >>>
> >>> 	enum alloc_context_type type;
> >>> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
> >>>
> >>> 	// Maybe something suggesting context or whether we init before in some
> >>> 	// cases?
> >>> };
> >>
> >> Ugh, please, no.  As I suggested last time I commented on this
> >> trainwreck of a series, lift the zeroing functionality from
> >> alloc_frozen_pages() into its callers.
> >
> > This sort of just implies writing the "alloc_frozen_zeroed_pages()"
> > wrapper that does the zeroing at the end before return, and then killing
> > the post hook nonsense associated with it in the first place.
> 
> This means it is going to be a multi-step optimization. This is probably
> step 1.
> 
> >
> > None of this resolves the user address annoyance which is needed on some
> > archs for cache flushing.  Whether anyone agrees that the page allocator
> > should be responsible for this particular operation - open debate.
> 
> This is probably step 2. But does the virtio use case apply to these
> archs? Does the performance matter for them? If not, maybe this part can
> be left as a TODO.
> 
> 
> Best Regards,
> Yan, Zi

I doubt it. But I don't get what's proposed, the code that we
have to modify is arch independent?



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 20:25               ` Gregory Price
@ 2026-06-08 20:37                 ` Zi Yan
  2026-06-08 20:56                   ` Gregory Price
  2026-06-08 21:03                   ` Michael S. Tsirkin
  0 siblings, 2 replies; 131+ messages in thread
From: Zi Yan @ 2026-06-08 20:37 UTC (permalink / raw)
  To: Gregory Price
  Cc: David Hildenbrand (Arm), Lorenzo Stoakes, Michael S. Tsirkin,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On 8 Jun 2026, at 16:25, Gregory Price wrote:

> On Mon, Jun 08, 2026 at 03:52:20PM -0400, Zi Yan wrote:
>> On 8 Jun 2026, at 15:43, Gregory Price wrote:
>>
>>> On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
>>>> On 6/8/26 17:27, Zi Yan wrote:
>>>>> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
>>>>>
>>>>> Or should we defer zeroing after a page is returned from allocator? So that
>>>>> user_addr does not need to be passed through irrelevant allocation APIs.
>>>>> Something like:
>>>>>
>>>>> alloc_page_wrapper(gfp, order, user_addr)
>>>>> {
>>>>> 	page = alloc_pages();
>>>>> 	if (gfp & __GFP_ZERO)
>>>>> 		clear_page(page);
>>>>> }
>>>>>
>>>>
>>>> Not really sure what's best here. I think we'd want to limit the lifting to some
>>>> internal API, so it cannot easily be messed up by random kernel code calling
>>>> into the wrong API and not getting pages cleared.
>>>>
>>>
>>> We're a bit in circles on this.  We discussed explicit interfaces a few
>>> months back and the trade off was:
>>>
>>> a) add user_addr to the existing API and cause churn
>>>
>>> or
>>>
>>> b) add special interface like above
>>>    increase the buddy surface
>>>    leaves open the ability for users to get it wrong easily
>>>
>>> If we forget VMs for a moment and break this step out separately, the
>>> core question is whether page_alloc.c is the right place to be calling
>>> the folio_user_zero() or whatever it is.
>>
>> page_alloc.c calling folio_user_zero() is fine, but my question is
>> whether we should do the zeroing inside post_alloc_hook(), part of
>> allocation.
>>
>> What I propose is to lift __GFP_ZERO up as much as possible,
>> so that most of allocation code does not need to care about it.
>> We do the zeroing right before the page is returned to callers.
>>
>
> essentially we end up with something like
>
> alloc_frozen_...(..., gfp)
> {
>   folio = whatever(..., gfp);
>   if (gfp & __GFP_ZERO)
>     folio_zero(folio, -1); /* don't do cache flush part */
> }
>
> alloc_frozen_user_...(..., gfp, user_addr)
> {
>   folio = whatever(..., gfp);
>   if (gfp & __GFP_ZERO)
>     folio_zero(folio, user_addr); /* do cache flush part */
> }
>
> The downside of this is obvious: it's easy for developers to get this
> wrong and call the non-user interface for user-bound allocations and
> miss the cache flush (that is only needed on some archs).
>
> Not saying that's a deal breaker, but it's something to chew on.

I agree that misuse can cause trouble. But if we do the churn approach,
what prevents a developer from doing alloc_frozen(..., user_addr = -1)
and using the returned page for userspace? It is possible the allocated
page can be exported to userspace later.

BTW, that cache flush thing is fragile even today, you probably can
do alloc_page() + vm_insert() to get a page without doing proper flush
and export it to userspace. There seems to be no mechanism to
prevent that.

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 20:33             ` Michael S. Tsirkin
@ 2026-06-08 20:40               ` Zi Yan
  2026-06-08 21:04                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 131+ messages in thread
From: Zi Yan @ 2026-06-08 20:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli

On 8 Jun 2026, at 16:33, Michael S. Tsirkin wrote:

> On Mon, Jun 08, 2026 at 04:21:08PM -0400, Zi Yan wrote:
>> On 8 Jun 2026, at 15:59, Gregory Price wrote:
>>
>>> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
>>>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
>>>>> But instead of overloading user_addr to indicate all kinds of things, instead
>>>>> make life easier by actually breaking things out.
>>>>>
>>>>> Like:
>>>>>
>>>>> enum alloc_context_type {
>>>>> 	KERNEL_ALLOCATION,
>>>>> 	USER_MAPPED_ALLOCATION,
>>>>> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
>>>>> 	/* Perhaps some other states we want to encode? */
>>>>> };
>>>>>
>>>>> struct alloc_context {
>>>>> 	...
>>>>>
>>>>> 	enum alloc_context_type type;
>>>>> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
>>>>>
>>>>> 	// Maybe something suggesting context or whether we init before in some
>>>>> 	// cases?
>>>>> };
>>>>
>>>> Ugh, please, no.  As I suggested last time I commented on this
>>>> trainwreck of a series, lift the zeroing functionality from
>>>> alloc_frozen_pages() into its callers.
>>>
>>> This sort of just implies writing the "alloc_frozen_zeroed_pages()"
>>> wrapper that does the zeroing at the end before return, and then killing
>>> the post hook nonsense associated with it in the first place.
>>
>> This means it is going to be a multi-step optimization. This is probably
>> step 1.
>>
>>>
>>> None of this resolves the user address annoyance which is needed on some
>>> archs for cache flushing.  Whether anyone agrees that the page allocator
>>> should be responsible for this particular operation - open debate.
>>
>> This is probably step 2. But does the virtio use case apply to these
>> archs? Does the performance matter for them? If not, maybe this part can
>> be left as a TODO.
>>
>>
>> Best Regards,
>> Yan, Zi
>
> I doubt it. But I don't get what's proposed, the code that we
> have to modify is arch independent?

Change user_alloc_needs_zeroing() to only check address aliasing even
if that can cause double zeroing for virtio.

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
  2026-06-08 20:30           ` Michael S. Tsirkin
@ 2026-06-08 20:53             ` Gregory Price
  2026-06-08 21:16               ` Michael S. Tsirkin
  0 siblings, 1 reply; 131+ messages in thread
From: Gregory Price @ 2026-06-08 20:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:30:46PM -0400, Michael S. Tsirkin wrote:
> > 
> > Please consider that this is arguably the most fundamental interface in
> > all of mm/.  All we're doing is going through the process of figuring
> > out what changes here are reasonable while trying to meet your goal.
> > 
> > ~Gregory
> 
> I don't mind discarding all of this and doing something else completely,
> but I dislike it that multiple people are apparently now angry that I

I wouldn't say anyone is angry, I think most folks are tripping on the
complexity of the set - which has increased (at the request of others).

> don't address all the contradictory comments at the same time.

Such is life in mm/ :] - it's hard to know the entire state machine,
and sometimes the contradictions aren't even wrong.

> I thought just sending a patchset to show how the result looks like
> is easier than arguing about architecture, and would be helpful.
> 

Notice: When folks argue implementation, they largely agree the
end goal is useful.  I haven't seen anyone say your problem isn't
real or that it shouldn't be addressed - just opinions on a particular
path forward (which is utterly normal here).

Getting the right incantation of an API is really hard when the
API being changed is something that underpins the entire kernel.

> I'm not pushing any of the mm rework, I was asked to do it,
> myself I just want the ridiculously effective optimization in there.
> 

As Lorenzo, David, and Matthew have said, the focus of the patch set
does seem to have become unwieldy (in part at the request of folks
asking something be done differently).

What needs to be done now is to break it up into some pull-ahead 
sets that are easier to review.  Having a brief RFC doc that lays out
the set of patches might help clarify the confusion going on here,
especially as new folks come in to ask "What's all this about?".

As a start:

  1) the user_addr and zeroing piece seems like a discrete
     improvement worthy of its own set - aside from end goal.

     This is needed by your patch set, but was requested to
     try to push us towards a more reasonable pattern for
     folio_zero_user().

  2) There are a handful of patches that seem able to pull-ahead
     (some of the mempolicy stuff), either as prep work for #1 or
     just on their own.

     Some of these patches seem like latent bugs that aren't hit by
     current users, but do seem to be doing something subtly wrong?

  3) the final virtio piece seems like it should be entirely separate
     once the core pieces are done.

It's not uncommon for core changes like this to take multiple preparatory
sets over many major versions before the final feature lands.

~Gregory

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 20:37                 ` Zi Yan
@ 2026-06-08 20:56                   ` Gregory Price
  2026-06-08 21:03                   ` Michael S. Tsirkin
  1 sibling, 0 replies; 131+ messages in thread
From: Gregory Price @ 2026-06-08 20:56 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand (Arm), Lorenzo Stoakes, Michael S. Tsirkin,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:37:59PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 16:25, Gregory Price wrote:
> >
> > essentially we end up with something like
> >
> > alloc_frozen_...(..., gfp)
> > {
> >   folio = whatever(..., gfp);
> >   if (gfp & __GFP_ZERO)
> >     folio_zero(folio, -1); /* don't do cache flush part */
> > }
> >
> > alloc_frozen_user_...(..., gfp, user_addr)
> > {
> >   folio = whatever(..., gfp);
> >   if (gfp & __GFP_ZERO)
> >     folio_zero(folio, user_addr); /* do cache flush part */
> > }
> >
> > The downside of this is obvious: it's easy for developers to get this
> > wrong and call the non-user interface for user-bound allocations and
> > miss the cache flush (that is only needed on some archs).
> >
> > Not saying that's a deal breaker, but it's something to chew on.
> 
> I agree that misuse can cause trouble. But if we do the churn approach,
> what prevents developer from doing alloc_frozen(..., user_addr = -1)
> and using the returned page for userspace? It is possible the allocated
> page can be exported to userspace later.
> 
> BTW, that cache flush thing is fragile even today, you probably can
> do alloc_page() + vm_insert() to get a page without doing proper flush
> and export it to userspace. There seems to be no mechanism to
> prevent that.
>

Oh of course, I said that elsewhere.  It leaves us in a spot where we're
not technically worse than we were yesterday - except that the surface
of the buddy has increased (developers need to know about 2 APIs instead
of 1).  That carries maintenance burden (if something in alloc_frozen()
changes, something in alloc_frozen_user() may need to change).

There's a careful dance here.
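
The two-interface shape quoted above can be sketched as a small userspace
model. This is only an illustration under stated assumptions: every name
below (model_folio, model_alloc_frozen(), model_alloc_frozen_user(),
MODEL_GFP_ZERO, MODEL_NO_USER_ADDR) is hypothetical and stands in for the
pseudocode in the thread, not for any real kernel interface, and the
"cache flush" is just a counter.

```c
/* Userspace model only: none of these names are real kernel interfaces. */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE    4096u
#define MODEL_GFP_ZERO     0x1u    /* stands in for __GFP_ZERO */
#define MODEL_NO_USER_ADDR (~0ul)  /* the "-1" sentinel from the thread */

struct model_folio {
	unsigned char data[MODEL_PAGE_SIZE];
	unsigned int flushes;      /* counts modelled cache flushes */
};

/* Zero the folio; model the arch cache flush only for user-bound pages. */
static void model_folio_zero(struct model_folio *folio, unsigned long user_addr)
{
	memset(folio->data, 0, sizeof(folio->data));
	if (user_addr != MODEL_NO_USER_ADDR)
		folio->flushes++;
}

/* Shared allocation path: hands back deliberately dirty memory. */
static struct model_folio *model_alloc_raw(void)
{
	struct model_folio *folio = malloc(sizeof(*folio));

	if (!folio)
		return NULL;
	memset(folio->data, 0xAA, sizeof(folio->data));
	folio->flushes = 0;
	return folio;
}

/* Kernel-only interface: zeroes on request, never flushes. */
static struct model_folio *model_alloc_frozen(unsigned int gfp)
{
	struct model_folio *folio = model_alloc_raw();

	if (folio && (gfp & MODEL_GFP_ZERO))
		model_folio_zero(folio, MODEL_NO_USER_ADDR);
	return folio;
}

/* User interface: zeroes on request and does the flush part. */
static struct model_folio *model_alloc_frozen_user(unsigned int gfp,
						   unsigned long user_addr)
{
	struct model_folio *folio;

	/* Passing the sentinel here would silently skip the flush. */
	assert(user_addr != MODEL_NO_USER_ADDR);

	folio = model_alloc_raw();
	if (folio && (gfp & MODEL_GFP_ZERO))
		model_folio_zero(folio, user_addr);
	return folio;
}

/* Helper for checking the result: true if every byte is zero. */
static bool model_folio_is_zeroed(const struct model_folio *folio)
{
	for (size_t i = 0; i < sizeof(folio->data); i++)
		if (folio->data[i])
			return false;
	return true;
}
```

Both entry points return zeroed memory when MODEL_GFP_ZERO is set, so a
caller that picks model_alloc_frozen() for a user-bound page sees no
functional difference on its own machine; only the flush counter differs.
That is the "easy to get wrong" part, and it is also why a change to one
wrapper may need a matching change in the other.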

~Gregory


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 20:37                 ` Zi Yan
  2026-06-08 20:56                   ` Gregory Price
@ 2026-06-08 21:03                   ` Michael S. Tsirkin
  1 sibling, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 21:03 UTC (permalink / raw)
  To: Zi Yan
  Cc: Gregory Price, David Hildenbrand (Arm), Lorenzo Stoakes,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:37:59PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 16:25, Gregory Price wrote:
> 
> > On Mon, Jun 08, 2026 at 03:52:20PM -0400, Zi Yan wrote:
> >> On 8 Jun 2026, at 15:43, Gregory Price wrote:
> >>
> >>> On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
> >>>> On 6/8/26 17:27, Zi Yan wrote:
> >>>>> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
> >>>>>
> >>>>> Or should we defer zeroing after a page is returned from allocator? So that
> >>>>> user_addr does not need to be passed through irrelevant allocation APIs.
> >>>>> Something like:
> >>>>>
> >>>>> alloc_page_wrapper(gfp, order, user_addr)
> >>>>> {
> >>>>> 	page = alloc_pages();
> >>>>> 	if (gfp & __GFP_ZERO)
> >>>>> 		clear_page(page);
> >>>>> }
> >>>>>
> >>>>
> >>>> Not really sure what's best here. I think we'd want to limit the lifting to some
> >>>> internal API, so it cannot easily be messed up by random kernel code calling
> >>>> into the wrong API and not getting pages cleared.
> >>>>
> >>>
> >>> We're a bit in circles on this.  We discussed explicit interfaces a few
> >>> months back and the trade off was:
> >>>
> >>> a) add user_addr to the existing API and cause churn
> >>>
> >>> or
> >>>
> >>> b) add special interface like above
> >>>    increase the buddy surface
> >>>    leaves open the ability for users to get it wrong easily
> >>>
> >>> If we forget VMs for a moment and break this step out separately, the
> >>> core question is whether page_alloc.c is the right place to be calling
> >>> the folio_user_zero() or whatever it is.
> >>
> >> page_alloc.c calling folio_user_zero() is fine, but my question is
> >> whether we should do the zeroing inside post_alloc_hook(), part of
> >> allocation.
> >>
> >> What I propose is to lift __GFP_ZERO up as much as possible,
> >> so that most of allocation code does not need to care about it.
> >> We do the zeroing right before the page is returned to callers.
> >>
> >
> > essentially we end up with something like
> >
> > alloc_frozen_...(..., gfp)
> > {
> >   folio = whatever(..., gfp);
> >   if (gfp & __GFP_ZERO)
> >     folio_zero(folio, -1); /* don't do cache flush part */
> > }
> >
> > alloc_frozen_user_...(..., gfp, user_addr)
> > {
> >   folio = whatever(..., gfp);
> >   if (gfp & __GFP_ZERO)
> >     folio_zero(folio, user_addr); /* do cache flush part */
> > }
> >
> > The downside of this is obvious: it's easy for developers to get this
> > wrong and call the non-user interface for user-bound allocations and
> > miss the cache flush (that is only needed on some archs).
> >
> > Not saying that's a deal breaker, but it's something to chew on.
> 
> I agree that misuse can cause trouble. But if we do the churn approach,
> what prevents developer from doing alloc_frozen(..., user_addr = -1)
> and using the returned page for userspace? It is possible the allocated
> page can be exported to userspace later.
> 
> BTW, that cache flush thing is fragile even today,

Probably arch dependent. On arm32, I think if you miss the flush, then
PG_dcache_clean will be clear and then you get a perf hit but
it's still correct. Didn't check others.

> you probably can
> do alloc_page() + vm_insert() to get a page without doing proper flush
> and export it to userspace. There seems to be no mechanism to
> prevent that.
> 
> Best Regards,
> Yan, Zi

Because maybe you want to expose data to userspace?


-- 
MST



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 20:40               ` Zi Yan
@ 2026-06-08 21:04                 ` Michael S. Tsirkin
  2026-06-08 21:16                   ` Zi Yan
  0 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 21:04 UTC (permalink / raw)
  To: Zi Yan
  Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:40:15PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 16:33, Michael S. Tsirkin wrote:
> 
> > On Mon, Jun 08, 2026 at 04:21:08PM -0400, Zi Yan wrote:
> >> On 8 Jun 2026, at 15:59, Gregory Price wrote:
> >>
> >>> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
> >>>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
> >>>>> But instead of overloading user_addr to indicate all kinds of things, instead
> >>>>> make life easier by actually breaking things out.
> >>>>>
> >>>>> Like:
> >>>>>
> >>>>> enum alloc_context_type {
> >>>>> 	KERNEL_ALLOCATION,
> >>>>> 	USER_MAPPED_ALLOCATION,
> >>>>> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
> >>>>> 	/* Perhaps some other states we want to encode? */
> >>>>> };
> >>>>>
> >>>>> struct alloc_context {
> >>>>> 	...
> >>>>>
> >>>>> 	enum alloc_context_type type;
> >>>>> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
> >>>>>
> >>>>> 	// Maybe something suggesting context or whether we init before in some
> >>>>> 	// cases?
> >>>>> };
> >>>>
> >>>> Ugh, please, no.  As I suggested last time I commented on this
> >>>> trainwreck of a series, lift the zeroing functionality from
> >>>> alloc_frozen_pages() into its callers.
> >>>
> >>> This sort of just implies writing the "alloc_frozen_zeroed_pages()"
> >>> wrapper that does the zeroing at the end before return, and then killing
> >>> the post hook nonsense associated with it in the first place.
> >>
> >> This means it is going to be a multi-step optimization. This is probably
> >> step 1.
> >>
> >>>
> >>> None of this resolves the user address annoyance which is needed on some
> >>> archs for cache flushing.  Whether anyone agrees that the page allocator
> >>> should be responsible for this particular operation - open debate.
> >>
> >> This is probably step 2. But does the virtio use case apply to these
> >> archs? Does the performance matter for them? If not, maybe this part can
> >> be left as a TODO.
> >>
> >>
> >> Best Regards,
> >> Yan, Zi
> >
> > I doubt it. But I don't get what's proposed, the code that we
> > have to modify is arch independent?
> 
> Change user_alloc_needs_zeroing() to only check address aliasing even
> if that can cause double zeroing for virtio.
> 
> Best Regards,
> Yan, Zi

Ah. I started with exactly that in v1/v2. It's a simple approach.

But mm maintainers said no, user_alloc_needs_zeroing is a hack and
I must not add to it.

-- 
MST



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 21:04                 ` Michael S. Tsirkin
@ 2026-06-08 21:16                   ` Zi Yan
  2026-06-08 21:51                     ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 131+ messages in thread
From: Zi Yan @ 2026-06-08 21:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli

On 8 Jun 2026, at 17:04, Michael S. Tsirkin wrote:

> On Mon, Jun 08, 2026 at 04:40:15PM -0400, Zi Yan wrote:
>> On 8 Jun 2026, at 16:33, Michael S. Tsirkin wrote:
>>
>>> On Mon, Jun 08, 2026 at 04:21:08PM -0400, Zi Yan wrote:
>>>> On 8 Jun 2026, at 15:59, Gregory Price wrote:
>>>>
>>>>> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
>>>>>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
>>>>>>> But instead of overloading user_addr to indicate all kinds of things, instead
>>>>>>> make life easier by actually breaking things out.
>>>>>>>
>>>>>>> Like:
>>>>>>>
>>>>>>> enum alloc_context_type {
>>>>>>> 	KERNEL_ALLOCATION,
>>>>>>> 	USER_MAPPED_ALLOCATION,
>>>>>>> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
>>>>>>> 	/* Perhaps some other states we want to encode? */
>>>>>>> };
>>>>>>>
>>>>>>> struct alloc_context {
>>>>>>> 	...
>>>>>>>
>>>>>>> 	enum alloc_context_type type;
>>>>>>> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
>>>>>>>
>>>>>>> 	// Maybe something suggesting context or whether we init before in some
>>>>>>> 	// cases?
>>>>>>> };
>>>>>>
>>>>>> Ugh, please, no.  As I suggested last time I commented on this
>>>>>> trainwreck of a series, lift the zeroing functionality from
>>>>>> alloc_frozen_pages() into its callers.
>>>>>
>>>>> This sort of just implies writing the "alloc_frozen_zeroed_pages()"
>>>>> wrapper that does the zeroing at the end before return, and then killing
>>>>> the post hook nonsense associated with it in the first place.
>>>>
>>>> This means it is going to be a multi-step optimization. This is probably
>>>> step 1.
>>>>
>>>>>
>>>>> None of this resolves the user address annoyance which is needed on some
>>>>> archs for cache flushing.  Whether anyone agrees that the page allocator
>>>>> should be responsible for this particular operation - open debate.
>>>>
>>>> This is probably step 2. But does the virtio use case apply to these
>>>> archs? Does the performance matter for them? If not, maybe this part can
>>>> be left as a TODO.
>>>>
>>>>
>>>> Best Regards,
>>>> Yan, Zi
>>>
>>> I doubt it. But I don't get what's proposed, the code that we
>>> have to modify is arch independent?
>>
>> Change user_alloc_needs_zeroing() to only check address aliasing even
>> if that can cause double zeroing for virtio.
>>
>> Best Regards,
>> Yan, Zi
>
> Ah. I started with exactly that in v1/v2. It's a simple approach.
>
> But mm maintainers said no, user_alloc_needs_zeroing is a hack and
> I must not add to it.

Got it. It sounds like you are now getting conflicting ideas. Maybe you
should start a [DISCUSSION] thread that presents the high-level idea of
what you want to achieve and all the ideas you got from the reviews, so
that people in this thread can see the big picture and come to a
consensus before you send another version.

Thank you for patiently replying to my comments, since those points
apparently have been discussed in prior submissions.

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
  2026-06-08 20:53             ` Gregory Price
@ 2026-06-08 21:16               ` Michael S. Tsirkin
  2026-06-08 21:33                 ` Gregory Price
  0 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 21:16 UTC (permalink / raw)
  To: Gregory Price
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:53:14PM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 04:30:46PM -0400, Michael S. Tsirkin wrote:
> > > 
> > > Please consider that this is arguably the most fundamental interface in
> > > > all of mm/.  All we're doing is going through the process of figuring
> > > out what changes here are reasonable while trying to meet your goal.
> > > 
> > > ~Gregory
> > 
> > I don't mind discarding all of this and doing something else completely,
> > but I dislike it that multiple people are apparently now angry that I
> 
> I wouldn't say anyone is angry, I think most folks are tripping on the
> complexity of the set - which has increased (at the request of others).
> 
> > don't address all the contradictory comments at the same time.
> 
> Such is life in mm/ :] - it's hard to know the entire state machine,
> and sometimes the contradictions aren't even wrong.
> 
> > I thought just sending a patchset to show how the result looks like
> > is easier than arguing about architecture, and would be helpful.
> > 
> 
> Notice: When folks argue implementation, they largely agree the
> end goal is useful.  I haven't seen anyone say your problem isn't
> real or that it shouldn't be addressed - just opinions on a particular
> path forward (which is utterly normal here).
> 
> Getting the right incantation of an API is really hard when the
> API being changed is something that underpins the entire kernel.
> 
> > I'm not pushing any of the mm rework, I was asked to do it,
> > myself I just want the ridiculously effective optimization in there.
> > 
> 
> As Lorenzo, David, and Matthew have said, the focus of the patch set
> does seem to have become unwieldy (in part at the request of folks
> asking something be done differently).
> 
> What needs to be done now is to break it up into some pull-ahead 
> sets that are easier to review.  Having a brief RFC doc that lays out
> the set of patches might help clarify the confusion going on here,
> especially as new folks come in to ask "What's all this about?".
> 
> As a start:
> 
>   1) the user_addr and zeroing piece seems like a discrete
>      improvement worthy of its own set - aside from end goal.
> 
>      This is needed by your patch set, but was requested to
>      try to push us towards a more reasonable pattern for
>      folio_zero_user().

What I worry about is that people can't agree on what API they want.

Simply not being an mm maintainer, I don't really have the
perspective of what changes are envisioned down the road
and so what api makes sense for you guys.

I don't mind trying all kinds of approaches, but it seems to
be past the point where people feel it's costing too much of
their time with all of these revisions.


>   2) There are a handful of patches that seem able to pull-ahead
>      (some of the mempolicy stuff), either as prep work for #1 or
>      just on their own.
> 
>      Some of these patches seem like latent bugs that aren't hit by
>      current users, but do seem to be doing something subtly wrong?

Right.

>   3) the final virtio piece seems like it should be entirely separate
>      once the core pieces are done.
> 
> It's not uncommon for core changes like this to take multiple preparatory
> sets over many major versions before the final feature lands.
> 
> ~Gregory



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
  2026-06-08 21:16               ` Michael S. Tsirkin
@ 2026-06-08 21:33                 ` Gregory Price
  2026-06-08 21:46                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 131+ messages in thread
From: Gregory Price @ 2026-06-08 21:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 05:16:53PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 04:53:14PM -0400, Gregory Price wrote:
> > 
> > As a start:
> > 
> >   1) the user_addr and zeroing piece seems like a discrete
> >      improvement worthy of its own set - aside from end goal.
> > 
> >      This is needed by your patch set, but was requested to
> >      try to push us towards a more reasonable pattern for
> >      folio_zero_user().
> 
> What I worry about is that people can't agree on what API they want.
>

Oh that's just our base state of existence.  We mostly agree that
all APIs are bad in some way and we don't want any of them :P

What you're looking for is to get people to agree to the
least-offensive, least-worst option :]

I don't think we're far off from that.  I suggest doing as Zi said and
start a [DISCUSSION] thread on specifically this and lay out the needs
and wants and design issues that you've learned from the past set of
versions and continue the discussion there.

It helps to take some snippets from your set to lay out what you've
learned and explain why you need the folio_zero_user() stuff to get from
A->Z, and then let maintainers hash out whether that should live in
post_alloc_hook or new interfaces (or outside page_alloc.c altogether).

> I don't mind trying all kinds of approaches, but it seems to
> be past the point where people feel it's costing too much of
> their time with all of these revisions.
> 

People are still commenting, so I don't think you've gotten there yet.
I think the rate of revision is what's costing too much attention.

You'd save yourself some revisions by taking the attention you have
right now and starting the discussion thread (and consider submitting
the topic to LPC if that's something that interests you!).

All this is to say you're doing fine, just keep on keepin' on. Maybe
pivot your approach from iterations to discussion for a bit until the
opinions settle.

~Gregory

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
  2026-06-08 21:33                 ` Gregory Price
@ 2026-06-08 21:46                   ` Michael S. Tsirkin
  2026-06-08 21:53                     ` Gregory Price
  0 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-08 21:46 UTC (permalink / raw)
  To: Gregory Price
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 05:33:50PM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 05:16:53PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jun 08, 2026 at 04:53:14PM -0400, Gregory Price wrote:
> > > 
> > > As a start:
> > > 
> > >   1) the user_addr and zeroing piece seems like a discrete
> > >      improvement worthy of its own set - aside from end goal.
> > > 
> > >      This is needed by your patch set, but was requested to
> > >      try to push us towards a more reasonable pattern for
> > >      folio_zero_user().
> > 
> > What I worry about is that people can't agree on what API they want.
> >
> 
> Oh that's just our base state of existence.  We mostly agree that
> all APIs are bad in some way and we don't want any of them :P
> 
> What you're looking for is to get people to agree to the
> least-offensive, least-worst option :]
> 
> I don't think we're far off from that.  I suggest doing as Zi said and
> start a [DISCUSSION] thread on specifically this and lay out the needs
> and wants and design issues that you've learned from the past set of
> versions and continue the discussion there.
> 
> It helps to take some snippets from your set to lay out what you've
> learned and explain why you need the folio_zero_user() stuff to get from
> A->Z, and then let maintainers hash out whether that should live in
> post_alloc_hook or new interfaces (or outside page_alloc.c altogether).
> 
> > I don't mind trying all kinds of approaches, but it seems to
> > be past the point where people feel it's costing too much of
> > their time with all of these revisions.
> > 
> 
> People are still commenting, so I don't think you've gotten there yet.
> I think the rate of revision is what's costing too much attention.
> 
> You'd save yourself some revisions by taking the attention you have
> right now and starting the discussion thread (and consider submitting
> the topic to LPC if that's something that interests you!).

Well it's in October, is it not? I don't think I have the patience to
keep fiddling with that for half a year.

> All this is to say you're doing fine, just keep on keepin' on. Maybe
> pivot your approach from iterations to discussion for a bit until the
> opinions settle.
> 
> ~Gregory



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 21:16                   ` Zi Yan
@ 2026-06-08 21:51                     ` David Hildenbrand (Arm)
  2026-06-08 22:28                       ` Gregory Price
  0 siblings, 1 reply; 131+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-08 21:51 UTC (permalink / raw)
  To: Zi Yan, Michael S. Tsirkin
  Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli

On 6/8/26 23:16, Zi Yan wrote:
> On 8 Jun 2026, at 17:04, Michael S. Tsirkin wrote:
> 
>> On Mon, Jun 08, 2026 at 04:40:15PM -0400, Zi Yan wrote:
>>>
>>>
>>> Change user_alloc_needs_zeroing() to only check address aliasing even
>>> if that can cause double zeroing for virtio.
>>>
>>> Best Regards,
>>> Yan, Zi
>>
>> Ah. I started with exactly that in v1/v2. It's a simple approach.

Simple, and hacky -> unmergeable. I tried to push it in a different (no GFP
flags -> IMHO better) direction, but the patch set grew in complexity.

I kept saying to keep it simple (e.g., no folio_put optimization, no hugetlb
optimization, simple wrapper functions), and ideally we would have gotten a
better discussion with other folks here much earlier.

And I still do not consider providing a user address to selected interfaces
while centralizing zeroing a bad idea. The real question is how that could be
done in a cleaner way.

Or, as Willy said, we could move zeroing further out to callers, where they
can special-case. But KASAN and friends interact with zeroing in their own
way, so that is not as straightforward as people might think.

>>
>> But mm maintainers said no, user_alloc_needs_zeroing is a hack and
>> I must not add to it.

I mean, I would hope that we can agree that our existing page/folio zeroing is a
mess and should not be extended by slapping more special casing on top?

Sure, we can try cleaning it up, but conceptually, zeroing happening at two
places in the callchain, with random optimizations to avoid double-zeroing is
just bad.
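
To make the "two places in the callchain" problem concrete, here is a tiny
userspace model that counts zeroing passes for one user page. It is a
sketch, not kernel code: model_cfg, model_user_alloc_needs_zeroing() and
model_zero_passes() are hypothetical names, and the predicate is an
assumption based on the behaviour described in the cover letter (on
aliasing-cache architectures with init_on_alloc, the page is zeroed once
in post_alloc_hook and again at the call site).

```c
/* Userspace model only: hypothetical names, no real kernel interfaces. */
#include <assert.h>
#include <stdbool.h>

struct model_cfg {
	bool init_on_alloc;   /* allocator zeroes every page it hands out */
	bool cache_aliasing;  /* arch needs address-aware clearing for user pages */
};

/*
 * Assumed shape of the call-site check: the caller must clear the page
 * itself if the arch has aliasing caches, or if the allocator did not
 * already zero it.
 */
static bool model_user_alloc_needs_zeroing(const struct model_cfg *cfg)
{
	return cfg->cache_aliasing || !cfg->init_on_alloc;
}

/* Count how many times one user page gets zeroed on its way to the caller. */
static unsigned int model_zero_passes(const struct model_cfg *cfg)
{
	unsigned int passes = 0;

	assert(cfg);

	if (cfg->init_on_alloc)
		passes++;	/* first place: the post-allocation hook */
	if (model_user_alloc_needs_zeroing(cfg))
		passes++;	/* second place: the call site */
	return passes;
}
```

Under these assumptions only the combination init_on_alloc plus aliasing
caches yields two passes; every other combination yields exactly one. The
point of the model is that the count depends on a predicate evaluated far
from where the first zeroing happens, which is the coupling being
criticised here.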

The fact that a vma_alloc_zeroed_movable_folio() that can be overridden by
architectures even exists makes me angry. user_alloc_needs_zeroing() is just the
tip of the ugly iceberg.

> 
> Got it. It sounds like you are now getting conflicting ideas. Maybe you
> should start a [DISCUSSION] thread that presents the high-level idea of
> what you want to achieve and all the ideas you got from the reviews, so
> that people in this thread can see the big picture and come to a
> consensus before you send another version.
> 
> Thank you for patiently replying to my comments, since those points
> apparently have been discussed in prior submissions.
> 

There was Willy's comment in RFC v3 [1], which had 19 patches. Unfortunately, he
did not follow up on my initial pushback or on Michael's later question [2].

That would have probably been the right time to wait for more discussion.

RFC v4 had 22 patches with few replies.
v5 had 28 patches with few replies.
v6 had 30 patches with no replies.
v7 had 31 patches with few replies.
v8 had 37 patches with no replies.

[1] https://lore.kernel.org/lkml/aeu5P1bZW3yEH54t@casper.infradead.org/
[2] https://lore.kernel.org/lkml/20260426165330-mutt-send-email-mst@kernel.org/

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
  2026-06-08 21:46                   ` Michael S. Tsirkin
@ 2026-06-08 21:53                     ` Gregory Price
  0 siblings, 0 replies; 131+ messages in thread
From: Gregory Price @ 2026-06-08 21:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 05:46:27PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 05:33:50PM -0400, Gregory Price wrote:
> > 
> > You'd save yourself some revisions by taking the attention you have
> > right now and starting the discussion thread (and consider submitting
> > the topic to LPC if that's something that interests you!).
> 
> Well it's in October, is it not? I don't think I have the patience to
> keep fiddling with that for half a year.
> 

You might be able to find a way forward that doesn't take that long, but
that starts with trying to build consensus on what to build before you
build it.

You're proposing a non-trivial change to the page allocator API; I would
not expect this to move at the speed of Claude.

~Gregory



* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 21:51                     ` David Hildenbrand (Arm)
@ 2026-06-08 22:28                       ` Gregory Price
  2026-06-22 14:25                         ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 131+ messages in thread
From: Gregory Price @ 2026-06-08 22:28 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Zi Yan, Michael S. Tsirkin, Matthew Wilcox, Lorenzo Stoakes,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 11:51:47PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 23:16, Zi Yan wrote:
> 
> There was Willy's comment in RFC v3 [1], which had 19 patches. Unfortunately, he
> did not follow up on my initial pushback or on Michael's later question [2].
> 
> That would have probably been the right time to wait for more discussion.
> 
> RFC v4 had 22 patches with few replies.
> v5 had 28 patches with few replies.
> v6 had 30 patches with no replies.
> v7 had 31 patches with few replies.
> v8 had 37 patches with no replies.
> 
> [1] https://lore.kernel.org/lkml/aeu5P1bZW3yEH54t@casper.infradead.org/
> [2] https://lore.kernel.org/lkml/20260426165330-mutt-send-email-mst@kernel.org/
>

Hm, rewinding on this back to v3 here:
https://lore.kernel.org/lkml/016cc5e5-044c-46c6-a668-200f90a64d85@kernel.org/

You said:

  ```
  Exactly, that's why I am saying that vma_alloc_folio() is the only
  external interface people should be using with a user address.
  ```

Going through the list of folio_zero_user references:

Called unconditionally if a folio is acquired:
   fs/hugetlbfs/inode.c:   folio_zero_user(folio, addr);
   mm/hugetlb.c:           folio_zero_user(folio, vmf->real_address);
   mm/memfd.c:             folio_zero_user(folio, 0);

Called when user_alloc_needs_zeroing() and charging passes:
   mm/huge_memory.c:       folio_zero_user(folio, addr);
   mm/memory.c:            folio_zero_user(folio, vmf->address);

No one outside mm/ should know about this interface at all.
Arguably none of these should know about this interface either.

The appropriate place for this logic appears to be:
    vma_alloc_folio
    alloc_hugetlb_folio
    alloc_hugetlb_folio_reserve

The reason to sink it into the post_alloc_hook is to let the buddy
decide whether the page actually needs to be zeroed (like the virtio
situation) based on PG_zeroed or whatever.

It seems like at a minimum moving the logic all the way into
post_alloc_hook lets us actually delete folio_zero_user() as a published
interface and move it entirely within page_alloc.c.

The catch is user_alloc_needs_zeroing() coming along with it.
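
To illustrate the point about letting the buddy decide, here is a minimal
userspace sketch, not kernel code. The `toy_` names are hypothetical and the
`zeroed` field only stands in for "PG_zeroed or whatever". The caller states
whether it wants zeroed memory; the hook alone decides whether a clear is
actually needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Userspace stand-in for a page plus the allocator's "already zeroed" state. */
struct toy_page {
	bool zeroed;                       /* models PG_zeroed on a free page */
	unsigned char data[TOY_PAGE_SIZE];
};

static unsigned long toy_zeroing_calls;  /* counts the clears actually done */

/*
 * Models the hook that runs on every allocation. The caller only says
 * whether it wants zeroed memory; whether a clear is needed is decided here.
 */
static void toy_post_alloc_hook(struct toy_page *page, bool want_zero)
{
	assert(page != NULL);

	if (want_zero && !page->zeroed) {
		memset(page->data, 0, sizeof(page->data));
		toy_zeroing_calls++;
	}
	/* Once handed out, the page may be written; the hint no longer holds. */
	page->zeroed = false;
}

/* Models a free path that knows the content is zero (e.g. host-reported). */
static void toy_free_zeroed(struct toy_page *page)
{
	assert(page != NULL);

	memset(page->data, 0, sizeof(page->data)); /* stands in for the host */
	page->zeroed = true;
}
```

With this shape, a caller never needs its own folio_zero_user()-style call:
a page freed through toy_free_zeroed() and allocated again with want_zero set
does not bump toy_zeroing_calls.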

~Gregory



* New design
  2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
                   ` (39 preceding siblings ...)
  2026-06-08 14:21 ` Matthew Wilcox
@ 2026-06-09  3:58 ` Matthew Wilcox
  2026-06-09  6:18   ` Gregory Price
                     ` (2 more replies)
  40 siblings, 3 replies; 131+ messages in thread
From: Matthew Wilcox @ 2026-06-09  3:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

OK, here's how I'd structure this:

1. Introduce PG_zeroed for buddy pages
2. Set it if init_on_free is set
3. Set it from balloon driver

https://lore.kernel.org/lkml/c7094de807c0e963526686e1d245bc76193b1a92.1776689093.git.mst@redhat.com/ 

but add FPI_ZEROED instead of an extra bool parameter.

4. Introduce page_is_zeroed like this:

static inline bool page_is_zeroed(const struct page *page)
{
        /*
         * lru.next has bit 2 set if the page is already zeroed.
         * Callers may simply overwrite it once they no longer
         * need to preserve that information.
         */
        return (unsigned long)page->lru.next & BIT(2);
}

(you'll notice this is similar to page_is_pfmemalloc() but it doesn't
need to be in mm.h)

This step is going to be a bit fiddly.  We weren't expecting to return
multiple flags in page->lru.next, so clear_page_pfmemalloc() just sets
page->lru.next to NULL.  So somewhere we need to make sure that
page->lru.next is definitely NULL, and then allow both the zeroed and
pfmemalloc flags to be set in it.

The important part of this is that it allows the zeroed flag to be
returned from the page allocator without introducing pghint_t like you
did in v2.
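
To make the "fiddly" part of step 4 concrete, here is a minimal userspace
sketch, not kernel code. All `toy_` names are hypothetical. Bit 2 for the
zeroed flag is taken from the helper above; bit 1 for pfmemalloc is an
assumption about what page_is_pfmemalloc() tests today. The idea is to reset
lru.next to NULL once, then OR flags in, so that both can coexist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for struct list_head and struct page. */
struct toy_list_head { struct toy_list_head *next, *prev; };
struct toy_page { struct toy_list_head lru; };

#define TOY_BIT(n)      ((uintptr_t)1 << (n))
#define TOY_PFMEMALLOC  TOY_BIT(1)  /* assumed to mirror page_is_pfmemalloc() */
#define TOY_ZEROED      TOY_BIT(2)  /* bit 2, as in page_is_zeroed() above */

/* Start from a known state: nothing may be inherited from list linkage. */
static inline void toy_clear_alloc_flags(struct toy_page *page)
{
	page->lru.next = NULL;
}

/* OR a flag in; only valid after toy_clear_alloc_flags(). */
static inline void toy_set_alloc_flag(struct toy_page *page, uintptr_t flag)
{
	uintptr_t v = (uintptr_t)page->lru.next;

	/* Anything outside the two flag bits means lru.next was not reset. */
	assert((v & ~(TOY_PFMEMALLOC | TOY_ZEROED)) == 0);
	page->lru.next = (struct toy_list_head *)(v | flag);
}

static inline bool toy_page_is_pfmemalloc(const struct toy_page *page)
{
	return ((uintptr_t)page->lru.next & TOY_PFMEMALLOC) != 0;
}

static inline bool toy_page_is_zeroed(const struct toy_page *page)
{
	return ((uintptr_t)page->lru.next & TOY_ZEROED) != 0;
}
```

The assert in toy_set_alloc_flag() is where the ordering requirement shows up:
setting a flag on top of a stale list pointer would trip it.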

5. Now you can start skipping various zeroing steps higher in the call
chain.

I understand David's disgust with vma_alloc_zeroed_movable_folio()
but that is surely a separate cleanup and nothing to do with this
patchset.


* Re: New design
  2026-06-09  3:58 ` New design Matthew Wilcox
@ 2026-06-09  6:18   ` Gregory Price
  2026-06-09  8:06   ` Michael S. Tsirkin
  2026-06-09  8:12   ` David Hildenbrand (Arm)
  2 siblings, 0 replies; 131+ messages in thread
From: Gregory Price @ 2026-06-09  6:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Tue, Jun 09, 2026 at 04:58:14AM +0100, Matthew Wilcox wrote:
> OK, here's how I'd structure this:
> 
> 1. Introduce PG_zeroed for buddy pages
> 2. Set it if init_on_free is set
> 3. Set it from balloon driver
> 
> https://lore.kernel.org/lkml/c7094de807c0e963526686e1d245bc76193b1a92.1776689093.git.mst@redhat.com/ 
> 
> but add FPI_ZEROED instead of an extra bool parameter.
> 
> 4. Introduce page_is_zeroed like this:
> 
> static inline bool page_is_zeroed(const struct page *page)
> {
>         /*
>          * lru.next has bit 2 set if the page is already zeroed.
>          * Callers may simply overwrite it once they no longer
> 	 * need to preserve that information.
>          */
>         return (unsigned long)page->lru.next & BIT(2);
> }
> 
> (you'll notice this is similar to page_is_pfmemalloc() but it doesn't
> need to be in mm.h)
> 
> This step is going to be a bit fiddly.  We weren't expecting to return
> multiple flags in page->lru.next, so clear_page_pfmemalloc() just sets
> page->lru.next to NULL.  So somewhere we need to make sure that
> page->lru.next is definitely NULL, and then allow both the zeroed and
> pfmemalloc flags to be set in it.
> 
> The important part of this is that it allows the zeroed flag to be
> returned from the page allocator without introducing pghint_t like you
> did in v2.
>

Are you suggesting leaking the flags out entirely, or just to the
boundaries of page_alloc.c (__alloc_frozen_pages_noprof(), etc.)?

I assume the latter, but worth clarifying.

Otherwise this seems reasonable.

If we're just going to pile more stuff into lru.next, you might as well
add the alias to mm.h but keep the bits defined in page_alloc.c
to prevent them from escaping (even if they end up set, nothing outside
page_alloc.c knows what any of them mean).

Let me know if my read on this is mistaken or I've misunderstood
anything.

> 5. Now you can start skipping various zeroing steps higher in the call
> chain.
> 
> I understand David's disgust with vma_alloc_zeroed_movable_folio()
> but that is surely a separate cleanup and nothing to do with this
> patchset.



* Re: New design
  2026-06-09  3:58 ` New design Matthew Wilcox
  2026-06-09  6:18   ` Gregory Price
@ 2026-06-09  8:06   ` Michael S. Tsirkin
  2026-06-09 10:04     ` Lorenzo Stoakes
  2026-06-09  8:12   ` David Hildenbrand (Arm)
  2 siblings, 1 reply; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-09  8:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Tue, Jun 09, 2026 at 04:58:14AM +0100, Matthew Wilcox wrote:
> OK, here's how I'd structure this:

Thanks a lot for looking into this and writing this Matthew! Looks
workable, let's see if there's rough consensus around this.
Two questions to make sure I understand.

> 
> 1. Introduce PG_zeroed for buddy pages
> 2. Set it if init_on_free is set

Not 100% sure why we want this bit. And I am not sure this actually works,
because init_on_free does kernel_init_pages() and does not flush the
cache on arm32.
You will notice that user_alloc_needs_zeroing() ignores init_on_free.
Right?
How about we skip step 2, to make the patchset a bit smaller?


> 3. Set it from balloon driver
> 
> https://lore.kernel.org/lkml/c7094de807c0e963526686e1d245bc76193b1a92.1776689093.git.mst@redhat.com/ 
> 
> but add FPI_ZEROED instead of an extra bool parameter.
> 
> 4. Introduce page_is_zeroed like this:
> 
> static inline bool page_is_zeroed(const struct page *page)
> {
>         /*
>          * lru.next has bit 2 set if the page is already zeroed.
>          * Callers may simply overwrite it once they no longer
> 	 * need to preserve that information.
>          */
>         return (unsigned long)page->lru.next & BIT(2);
> }
> 
> (you'll notice this is similar to page_is_pfmemalloc() but it doesn't
> need to be in mm.h)
> 
> This step is going to be a bit fiddly.  We weren't expecting to return
> multiple flags in page->lru.next, so clear_page_pfmemalloc() just sets
> page->lru.next to NULL.  So somewhere we need to make sure that
> page->lru.next is definitely NULL, and then allow both the zeroed and
> pfmemalloc flags to be set in it.
> 
> The important part of this is that it allows the zeroed flag to be
> returned from the page allocator without introducing pghint_t like you
> did in v2.
> 
> 5. Now you can start skipping various zeroing steps higher in the call
> chain.
> I understand David's disgust with vma_alloc_zeroed_movable_folio()
> but that is surely a separate cleanup and nothing to do with this
> patchset.



One other question: would people like to see it as a single patchset
or multiple ones 1-4? Multiple ones would be easier to review but
of course this means no actual perf gain until part 5 is merged. Is that
acceptable?

-- 
MST




* Re: New design
  2026-06-09  3:58 ` New design Matthew Wilcox
  2026-06-09  6:18   ` Gregory Price
  2026-06-09  8:06   ` Michael S. Tsirkin
@ 2026-06-09  8:12   ` David Hildenbrand (Arm)
  2 siblings, 0 replies; 131+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-09  8:12 UTC (permalink / raw)
  To: Matthew Wilcox, Michael S. Tsirkin
  Cc: linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On 6/9/26 05:58, Matthew Wilcox wrote:
> OK, here's how I'd structure this:
> 
> 1. Introduce PG_zeroed for buddy pages
> 2. Set it if init_on_free is set
> 3. Set it from balloon driver
> 
> https://lore.kernel.org/lkml/c7094de807c0e963526686e1d245bc76193b1a92.1776689093.git.mst@redhat.com/ 
> 
> but add FPI_ZEROED instead of an extra bool parameter.
> 
> 4. Introduce page_is_zeroed like this:
> 
> static inline bool page_is_zeroed(const struct page *page)
> {
>         /*
>          * lru.next has bit 2 set if the page is already zeroed.
>          * Callers may simply overwrite it once they no longer
> 	 * need to preserve that information.
>          */
>         return (unsigned long)page->lru.next & BIT(2);
> }
> 
> (you'll notice this is similar to page_is_pfmemalloc() but it doesn't
> need to be in mm.h)
> 
> This step is going to be a bit fiddly.  We weren't expecting to return
> multiple flags in page->lru.next, so clear_page_pfmemalloc() just sets
> page->lru.next to NULL.  So somewhere we need to make sure that
> page->lru.next is definitely NULL, and then allow both the zeroed and
> pfmemalloc flags to be set in it.
> 
> The important part of this is that it allows the zeroed flag to be
> returned from the page allocator without introducing pghint_t like you
> did in v2.

I previously raised (in v2? not sure) that we could use a page flag that is
only used for folios, and then simply clear that flag on the folio allocation
path so that we don't get false positives with the bit set.

> 
> 5. Now you can start skipping various zeroing steps higher in the call
> chain.
> 
> I understand David's disgust with vma_alloc_zeroed_movable_folio()
> but that is surely a separate cleanup and nothing to do with this
> patchset.

Well, in my reality, we're just finding interesting ways to work around the fact
that GFP_ZERO sometimes does what we want, sometimes doesn't.

So we leak information out of the buddy to really only handle one scenario:
fixing up GFP_ZERO currently sometimes not doing what we want.

I'm afraid we couldn't use the above trick to punch zeroed pages back into the
buddy: some random user doing alloc+use+free would be unaware that there is a
bit to clear.

So I assume really only folio allocation would make use of this, to work around
our problematic GFP_ZERO implementation.
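
One reading of the alloc+use+free concern, as a minimal userspace sketch (not
kernel code; every `toy_` name is hypothetical). It assumes a free path that
trusts a hint bit still sitting in the page, and a user who dirties the page
without knowing the bit exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

struct toy_page {
	bool hint_zeroed;   /* hint bit left in the page by the allocator */
	bool pool_zeroed;   /* what the free pool believes about the page */
	unsigned char data[TOY_PAGE_SIZE];
};

/* Allocation hands out a genuinely zeroed page and reports it via the hint. */
static void toy_alloc(struct toy_page *page)
{
	assert(page != NULL);

	memset(page->data, 0, sizeof(page->data));
	page->hint_zeroed = true;
	page->pool_zeroed = false;
}

/* Hypothetical free path that trusts whatever hint is still in the page. */
static void toy_free_trusting_hint(struct toy_page *page)
{
	assert(page != NULL);

	page->pool_zeroed = page->hint_zeroed;
}

/* Ground truth, for checking the pool's belief against the contents. */
static bool toy_is_really_zero(const struct toy_page *page)
{
	for (size_t i = 0; i < sizeof(page->data); i++)
		if (page->data[i])
			return false;
	return true;
}
```

A user that calls toy_alloc(), writes to data[] and then calls
toy_free_trusting_hint() leaves pool_zeroed set although
toy_is_really_zero() is false, which is the false positive in question.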

-- 
Cheers,

David



* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 14:55                   ` David Hildenbrand (Arm)
@ 2026-06-09  9:54                     ` Michael S. Tsirkin
  0 siblings, 0 replies; 131+ messages in thread
From: Michael S. Tsirkin @ 2026-06-09  9:54 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Matthew Wilcox, Lorenzo Stoakes, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Mon, Jun 08, 2026 at 04:55:19PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 16:44, Matthew Wilcox wrote:
> > On Mon, Jun 08, 2026 at 04:37:03PM +0200, David Hildenbrand (Arm) wrote:
> >> On 6/8/26 16:31, Matthew Wilcox wrote:
> >>>
> >>> What I don't understand is how the kernel page allocator needs to know
> >>> the user address in order to effectively zero it, but the hypervisor is
> >>> able to zero the page without knowing the user address.  It feels like
> >>> somebody has x86-centric thinking where cache colouring doesn't matter.
> >>
> >> (not commenting on the icache/dcache mess we have to drag along)
> > 
> > Well, that was kind of the point of this email ... I did ask the
> > question you're answering in a different email so let me respond
> > to that too.
> 
> Now I'm confused :)
> 
> > 
> >> The thing is that with free-page-reporting the memory is already zeroed by the
> >> hypervisor as part of discarding that memory previously (e.g., MADV_DONTNEED)
> >> and allocating fresh pages on re-access.
> >>
> >> So it's not a question of "why is the hypervisor zeroing less efficiently", as
> >> zeroing is just a side-product of reclaiming that memory in the first place.
> > 
> > We definitely have users who don't want the guest to trust the
> > hypervisor.  So how do they disable this optimisation?
> 
> Right, I don't think we currently have a toggle to disable free page reporting.
> So IIUC, this optimization would similarly automatically get enabled if the
> hypervisor advertises it.
> 
> -- 
> Cheers,
> 
> David

Not as the patchset stands:

[PATCH v10 35/37] virtio_balloon: disable reporting zeroed optimization for confidential guests

disables it.

-- 
MST




* Re: New design
  2026-06-09  8:06   ` Michael S. Tsirkin
@ 2026-06-09 10:04     ` Lorenzo Stoakes
  0 siblings, 0 replies; 131+ messages in thread
From: Lorenzo Stoakes @ 2026-06-09 10:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Matthew Wilcox, linux-kernel, David Hildenbrand (Arm), Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On Tue, Jun 09, 2026 at 04:06:08AM -0400, Michael S. Tsirkin wrote:
> One other question: would people like to see it as a single patchset
> or multiple ones 1-4? Multiple ones would be easier to review but
> of course this means no actual perf gain until part 5 is merged. Is that
> acceptable?

Personally I'd like to see multiple patch sets, not all sent at once.

Rather - send the first, wait for review, and once people have given tags,
then send the next - rinse and repeat.

That makes life easier for review, allows forward progress, and avoids
noise on-list.

>
> --
> MST
>

Thanks, Lorenzo



* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
  2026-06-08 22:28                       ` Gregory Price
@ 2026-06-22 14:25                         ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 131+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-22 14:25 UTC (permalink / raw)
  To: Gregory Price
  Cc: Zi Yan, Michael S. Tsirkin, Matthew Wilcox, Lorenzo Stoakes,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli

On 6/9/26 00:28, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 11:51:47PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/8/26 23:16, Zi Yan wrote:
>>
>> There was Willy's comment in RFC v3 [1], which had 19 patches. Unfortunately, he
>> did not follow up on my initial pushback or on Michael's later question [2].
>>
>> That would have probably been the right time to wait for more discussion.
>>
>> RFC v4 had 22 patches with few replies.
>> v5 had 28 patches with few replies.
>> v6 had 30 patches with no replies.
>> v7 had 31 patches with few replies.
>> v8 had 37 patches with no replies.
>>
>> [1] https://lore.kernel.org/lkml/aeu5P1bZW3yEH54t@casper.infradead.org/
>> [2] https://lore.kernel.org/lkml/20260426165330-mutt-send-email-mst@kernel.org/
>>
> 
> Hm, rewinding on this back to v3 here:

I'm late ...

> https://lore.kernel.org/lkml/016cc5e5-044c-46c6-a668-200f90a64d85@kernel.org/
> 
> You said:
> 
>   ```
>   Exactly, that's why I am saying that vma_alloc_folio() is the only
>   external interface people should be using with a user address.
>   ```
> 
> Going through the list of folio_zero_user references:
> 
> Called unconditionally if a folio is acquired:
>    fs/hugetlbfs/inode.c:   folio_zero_user(folio, addr);
>    mm/hugetlb.c:           folio_zero_user(folio, vmf->real_address);
>    mm/memfd.c:             folio_zero_user(folio, 0);
> 
> Called when user_alloc_needs_zeroing() and charging passes:
>    mm/huge_memory.c:       folio_zero_user(folio, addr);
>    mm/memory.c:            folio_zero_user(folio, vmf->address);
> 
> No one outside mm/ should know about this interface at all.

Exactly.

> Arguably none of these should know about this interface either.
> 
> The appropriate place for this logic appears to be:
>     vma_alloc_folio
>     alloc_hugetlb_folio
>     alloc_hugetlb_folio_reserve

Yes. And the hugetlb stuff should just be left alone for now. We don't encourage
new hugetlb features, and the same should apply to new optimizations that are
not straightforward.

> 
> The reason to sink it into the post_alloc_hook is to let the buddy
> decide whether the page actually needs to be zeroed (like the virtio
> situation) based on PG_zeroed or whatever.

It's a bit related to your private node work .... if we have an interface that
consumes an alloc_context, we could just forward the address easily.

Now, there are different ways of doing it, but having folio allocation sometimes
use GFP_ZERO and sometimes not is just nasty. It's all a big hack.

> 
> It seems like at a minimum moving the logic all the way into
> post_alloc_hook lets us actually delete folio_zero_user() as a published
> interface and move it entirely within page_alloc.c.
> 
> The catch is user_alloc_needs_zeroing() coming along with it.

Yes. And once there is only one user remaining, we could inline it and get rid
of this horrible helper :)

-- 
Cheers,

David



end of thread, other threads:[~2026-06-22 14:31 UTC | newest]

Thread overview: 131+ messages
2026-06-08  8:33 [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Michael S. Tsirkin
2026-06-08  8:34 ` [PATCH v10 01/37] mm: mempolicy: fix interleave index calculation Michael S. Tsirkin
2026-06-08  9:43   ` Lorenzo Stoakes
2026-06-08  8:34 ` [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock Michael S. Tsirkin
2026-06-08  9:43   ` Lorenzo Stoakes
2026-06-08 13:48     ` Michael S. Tsirkin
2026-06-08 14:14       ` Lorenzo Stoakes
2026-06-08 20:17         ` Michael S. Tsirkin
2026-06-08 16:20       ` Andrew Morton
2026-06-08  8:34 ` [PATCH v10 03/37] mm: page_alloc: propagate PageReported flag across buddy splits Michael S. Tsirkin
2026-06-08  9:52   ` Lorenzo Stoakes
2026-06-08 12:50     ` Matthew Wilcox
2026-06-08  8:34 ` [PATCH v10 04/37] mm: page_reporting: allow driver to set batch capacity Michael S. Tsirkin
2026-06-08  8:34 ` [PATCH v10 05/37] mm: hugetlb: remove dead alloc_hugetlb_folio stub Michael S. Tsirkin
2026-06-08  9:56   ` Lorenzo Stoakes
2026-06-08  8:35 ` [PATCH v10 06/37] mm: move vma_alloc_folio_noprof to page_alloc.c Michael S. Tsirkin
2026-06-08 10:05   ` Lorenzo Stoakes
2026-06-08  8:35 ` [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing Michael S. Tsirkin
2026-06-08 10:23   ` Lorenzo Stoakes
2026-06-08 11:06     ` Lorenzo Stoakes
2026-06-08 13:04       ` Matthew Wilcox
2026-06-08 13:09         ` Lorenzo Stoakes
2026-06-08 14:26           ` David Hildenbrand (Arm)
2026-06-08 14:31             ` Matthew Wilcox
2026-06-08 14:37               ` David Hildenbrand (Arm)
2026-06-08 14:44                 ` Matthew Wilcox
2026-06-08 14:55                   ` David Hildenbrand (Arm)
2026-06-09  9:54                     ` Michael S. Tsirkin
2026-06-08 19:33                   ` Michael S. Tsirkin
2026-06-08 19:59         ` Gregory Price
2026-06-08 20:21           ` Zi Yan
2026-06-08 20:33             ` Michael S. Tsirkin
2026-06-08 20:40               ` Zi Yan
2026-06-08 21:04                 ` Michael S. Tsirkin
2026-06-08 21:16                   ` Zi Yan
2026-06-08 21:51                     ` David Hildenbrand (Arm)
2026-06-08 22:28                       ` Gregory Price
2026-06-22 14:25                         ` David Hildenbrand (Arm)
2026-06-08 11:08     ` David Hildenbrand (Arm)
2026-06-08 15:27       ` Zi Yan
2026-06-08 18:39         ` David Hildenbrand (Arm)
2026-06-08 19:43           ` Gregory Price
2026-06-08 19:52             ` Zi Yan
2026-06-08 20:25               ` Gregory Price
2026-06-08 20:37                 ` Zi Yan
2026-06-08 20:56                   ` Gregory Price
2026-06-08 21:03                   ` Michael S. Tsirkin
2026-06-08  8:35 ` [PATCH v10 08/37] mm: add alloc_contig_frozen_pages_user " Michael S. Tsirkin
2026-06-08 10:29   ` Lorenzo Stoakes
2026-06-08  8:35 ` [PATCH v10 09/37] mm: hugetlb: thread user_addr through gigantic page allocation Michael S. Tsirkin
2026-06-08  8:36 ` [PATCH v10 10/37] mm: add folio_zero_user stub for configs without THP/HUGETLBFS Michael S. Tsirkin
2026-06-08  9:12   ` Lorenzo Stoakes
2026-06-08  8:36 ` [PATCH v10 11/37] mm: page_alloc: move prep_compound_page before post_alloc_hook Michael S. Tsirkin
2026-06-08 10:33   ` Lorenzo Stoakes
2026-06-08  8:36 ` [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook Michael S. Tsirkin
2026-06-08 11:23   ` Lorenzo Stoakes
2026-06-08 15:53     ` Gregory Price
2026-06-08 19:45       ` Michael S. Tsirkin
2026-06-08 20:16         ` Gregory Price
2026-06-08 20:30           ` Michael S. Tsirkin
2026-06-08 20:53             ` Gregory Price
2026-06-08 21:16               ` Michael S. Tsirkin
2026-06-08 21:33                 ` Gregory Price
2026-06-08 21:46                   ` Michael S. Tsirkin
2026-06-08 21:53                     ` Gregory Price
2026-06-08 19:42     ` Michael S. Tsirkin
2026-06-08  8:36 ` [PATCH v10 13/37] mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio Michael S. Tsirkin
2026-06-08 10:39   ` Lorenzo Stoakes
2026-06-08 10:55     ` Lorenzo Stoakes
2026-06-08  8:37 ` [PATCH v10 14/37] mm: remove arch vma_alloc_zeroed_movable_folio overrides Michael S. Tsirkin
2026-06-08 11:29   ` Lorenzo Stoakes
2026-06-08  8:37 ` [PATCH v10 15/37] mm: alloc_anon_folio: pass raw fault address to vma_alloc_folio Michael S. Tsirkin
2026-06-08 11:35   ` Lorenzo Stoakes
2026-06-08  8:37 ` [PATCH v10 16/37] mm: alloc_swap_folio: " Michael S. Tsirkin
2026-06-08 11:37   ` Lorenzo Stoakes
2026-06-08 15:59     ` Gregory Price
2026-06-08 20:09       ` Michael S. Tsirkin
2026-06-08  8:37 ` [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages Michael S. Tsirkin
2026-06-08 12:00   ` Lorenzo Stoakes
2026-06-08 16:09     ` Gregory Price
2026-06-08 18:40       ` Matthew Wilcox
2026-06-08 19:55         ` Michael S. Tsirkin
2026-06-08  8:38 ` [PATCH v10 18/37] mm: page_alloc: use aliasing checks instead of user_alloc_needs_zeroing Michael S. Tsirkin
2026-06-08 11:39   ` Lorenzo Stoakes
2026-06-08  8:38 ` [PATCH v10 19/37] mm: page_alloc: clear PG_zeroed on buddy merge if not both zero Michael S. Tsirkin
2026-06-08 11:47   ` Lorenzo Stoakes
2026-06-08  8:38 ` [PATCH v10 20/37] mm: page_alloc: preserve PG_zeroed in page_del_and_expand Michael S. Tsirkin
2026-06-08  8:38 ` [PATCH v10 21/37] mm: page_alloc: propagate PG_zeroed in split_large_buddy Michael S. Tsirkin
2026-06-08  8:38 ` [PATCH v10 22/37] mm: add free_frozen_pages_zeroed Michael S. Tsirkin
2026-06-08 12:06   ` Lorenzo Stoakes
2026-06-08  8:38 ` [PATCH v10 23/37] mm: page_alloc: skip kernel_init_pages for FPI_ZEROED when safe Michael S. Tsirkin
2026-06-08 12:18   ` Lorenzo Stoakes
2026-06-08  8:38 ` [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed Michael S. Tsirkin
2026-06-08 12:25   ` Lorenzo Stoakes
2026-06-08 12:46     ` David Hildenbrand (Arm)
2026-06-08 14:08       ` Michael S. Tsirkin
2026-06-08 14:28         ` David Hildenbrand (Arm)
2026-06-08 19:58           ` Michael S. Tsirkin
2026-06-08  8:39 ` [PATCH v10 25/37] mm: use __GFP_ZERO in alloc_anon_folio Michael S. Tsirkin
2026-06-08 12:29   ` Lorenzo Stoakes
2026-06-08  8:39 ` [PATCH v10 26/37] mm: vma_alloc_anon_folio_pmd: pass raw fault address to vma_alloc_folio Michael S. Tsirkin
2026-06-08 12:30   ` Lorenzo Stoakes
2026-06-08  8:39 ` [PATCH v10 27/37] mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd Michael S. Tsirkin
2026-06-08 12:32   ` Lorenzo Stoakes
2026-06-08  8:39 ` [PATCH v10 28/37] mm: hugetlb: add gfp parameter and skip zeroing for zeroed pages Michael S. Tsirkin
2026-06-08 12:44   ` Lorenzo Stoakes
2026-06-08  8:39 ` [PATCH v10 29/37] mm: memfd: skip zeroing for zeroed hugetlb pool pages Michael S. Tsirkin
2026-06-08 12:47   ` Lorenzo Stoakes
2026-06-08  8:39 ` [PATCH v10 30/37] mm: page_reporting: add per-page zeroed bitmap for host feedback Michael S. Tsirkin
2026-06-08  8:39 ` [PATCH v10 31/37] virtio_balloon: submit reported pages as individual buffers Michael S. Tsirkin
2026-06-08  8:40 ` [PATCH v10 32/37] virtio_balloon: disable indirect descriptors Michael S. Tsirkin
2026-06-08  8:40 ` [PATCH v10 33/37] mm: page_reporting: add flush parameter with page budget Michael S. Tsirkin
2026-06-08  8:40 ` [PATCH v10 34/37] virtio_balloon: skip zeroing for host-zeroed reported pages Michael S. Tsirkin
2026-06-08  8:40 ` [PATCH v10 35/37] virtio_balloon: disable reporting zeroed optimization for confidential guests Michael S. Tsirkin
2026-06-08  8:40 ` [PATCH v10 36/37] mm: balloon: use put_page_zeroed for zeroed balloon pages Michael S. Tsirkin
2026-06-08 11:10   ` David Hildenbrand (Arm)
2026-06-08  8:40 ` [PATCH v10 37/37] virtio_balloon: implement VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE Michael S. Tsirkin
2026-06-08  9:17 ` [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages Lorenzo Stoakes
2026-06-08 12:52   ` Lorenzo Stoakes
2026-06-08 11:02 ` Vlastimil Babka (SUSE)
2026-06-08 11:13   ` Vlastimil Babka (SUSE)
2026-06-08 15:45     ` Gregory Price
2026-06-08 17:50       ` Lorenzo Stoakes
2026-06-08 19:39         ` Michael S. Tsirkin
2026-06-08 14:21 ` Matthew Wilcox
2026-06-08 20:02   ` Michael S. Tsirkin
2026-06-09  3:58 ` New design Matthew Wilcox
2026-06-09  6:18   ` Gregory Price
2026-06-09  8:06   ` Michael S. Tsirkin
2026-06-09 10:04     ` Lorenzo Stoakes
2026-06-09  8:12   ` David Hildenbrand (Arm)
