Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andrei Vagin @ 2026-06-24 17:52 UTC (permalink / raw)
  To: Askar Safin
  Cc: akpm, alexander, axboe, bernd, brauner, criu, david, dhowells,
	fuse-devel, hch, jack, joannelkoong, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato,
	rostedt, torvalds, val, viro, willy
In-Reply-To: <20260624071226.2272209-1-safinaskar@gmail.com>

On Wed, Jun 24, 2026 at 12:12 AM Askar Safin <safinaskar@gmail.com> wrote:
>
> Andrei Vagin <avagin@gmail.com>:
> > The CRIU fifo test fails with this change. The problem is that vmsplice
> > with SPLICE_F_NONBLOCK to a fifo file descriptor fails with -EOPNOTSUPP.
> >
> > It seems we need a fix like this one:
> >
> > diff --git a/fs/pipe.c b/fs/pipe.c
> > index 429b0714ec57..6fc49e933727 100644
> > --- a/fs/pipe.c
> > +++ b/fs/pipe.c
> > @@ -1253,6 +1253,7 @@ static int fifo_open(struct inode *inode, struct
> > file *filp)
> >
> >         /* We can only do regular read/write on fifos */
> >         stream_open(inode, filp);
> > +       filp->f_mode |= FMODE_NOWAIT;
> >
> >         switch (filp->f_mode & (FMODE_READ | FMODE_WRITE)) {
> >         case FMODE_READ:
>
> Does CRIU actually rely on ability to do SPLICE_F_NONBLOCK vmsplice into
> named fifos? Or this is merely a test?

Yes, it does.

>
> If this is just a test, I think we need not to preserve this behavior.
>
> I did debian code search with regex "vmsplice.*SPLICE_F_NONBLOCK" and I
> found very few packages. And it seems all them use pipes, not named fifos.

In short, this isn't how such cases are handled in the kernel. The fix is
simple and should be applied to avoid breaking random software.

>
> (On speed: I still think that my vmsplice patches are good thing,
> despite performance regressions in CRIU.)

I already explained that this isn't just a perfomance degradation, it
actually breaks the pre-dump mechanism in CRIU. vmsplice is invoked from
our parasite code within the context of a user process, where execution
speed is critical. A heavy performance penalty completely invalidates
the pre-dump logic, making the feature useless.

Under normal circumstances, patches that cause this kind of breakage
would never be merged. However, since there are exceptions to every
rule, we should let the maintainers decide how to proceed here. In CRIU,
we have a backup plan to utilize process_vm_readv to dump process
memory. We already support this mode, but it isn't the default due to
performance concerns. If these patches are merged, it will be the
only option left for CRIU to implement pre-dumping.

However, we need to look at this case in a broader context. This is yet
another example where the change introduces a workflow breakage, meaning
there might be other workloads out there that could be broken by this
change.

At a minimum, we may need to consider a deprecation plan where vmsplice
with SPLICE_F_GIFT triggers a warning for a few releases before these
changes are applied. Alternatively, we could introduce the proposed
behavior alongside a sysctl to fall back to the old behavior and explicitly
state that this fallback path will be completely deprecated in a future kernel
version.

Thanks,
Andrei


^ permalink raw reply

* Re: [PATCH v8 15/46] KVM: guest_memfd: Call arch invalidate hooks on conversion
From: Ackerley Tng @ 2026-06-24 17:46 UTC (permalink / raw)
  To: Sean Christopherson, Fuad Tabba
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajneQVLriUshjFIO@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Fri, Jun 19, 2026, Fuad Tabba wrote:
>> On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
>> <devnull+ackerleytng.google.com@kernel.org> wrote:
>> >
>> > From: Ackerley Tng <ackerleytng@google.com>
>> >
>> > When memory in guest_memfd is converted from private to shared, the
>> > platform-specific state associated with the guest-private pages must be
>> > invalidated or cleaned up.
>> >
>> > Iterate over the folios in the affected range and call the
>> > kvm_arch_gmem_invalidate() hook for each PFN range. This allows
>> > architectures to perform necessary teardown, such as updating hardware
>> > metadata or encryption states, before the pages are transitioned to the
>> > shared state.
>> >
>> > Invoke this helper after indicating to KVM's mmu code that an invalidation
>> > is in progress to stop in-flight page faults from succeeding.
>> >
>> > Reviewed-by: Fuad Tabba <tabba@google.com>
>> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>> Coming back to this after working through the arm64/pKVM side. My
>> Reviewed-by here is from the previous round and the patch hasn't
>> changed, but I missed an implication for arm64.
>>
>> kvm_arch_gmem_invalidate() is now called from two paths with the same
>> (start, end) signature: folio teardown (kvm_gmem_free_folio) and
>> private->shared conversion (here). For SNP/TDX that's fine, conversion is
>> destructive anyway. For pKVM the two need opposite content semantics:
>> conversion must preserve the page in place (same physical page, the point
>> of in-place conversion without encryption), while teardown must scrub it
>> before returning it to the host.
>>
>> The hook gets only a pfn range with no indication of which caller it's
>> serving, so arm64 can't give the two paths the behaviour they need. It
>> would help to signal intent on the conversion path: a reason/flag, a
>> separate hook, or not routing non-destructive conversion through the
>> teardown hook.
>>
>> arm64 isn't here yet, so this isn't urgent, but the hook is gaining a
>> second caller now, and it's cheaper to leave room for the distinction
>> than to change a generic contract other arches depend on later.
>
> Crud.  It may not be urgent for arm64, but it's urgent for other reasons that
> I "can't" describe in detail at the moment, and even if that weren't the case, I
> think we should clean things up now.  More below.
>
>> >  virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 41 insertions(+)
>> >
>> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> > index 433f79047b9d1..3c94442bc8131 100644
>> > --- a/virt/kvm/guest_memfd.c
>> > +++ b/virt/kvm/guest_memfd.c
>> > @@ -607,6 +607,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>> >         return safe;
>> >  }
>> >
>> > +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
>> > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
>
> Not your fault, but kvm_arch_gmem_invalidate() is badly misnamed.  It's not
> "invalidating" anything, it's much more of a "free" callback, as SNP uses it to
> put physical pages back into a shared state when a maybe-private folio is freed.
>
> As Fuad points out, (ab)using that hook for the private=>shared conversion case
> "works", but not broadly.  And it makes the bad name worse, because it's called
> from code that _is_ doing true invalidations.  For pKVM, it may not even need to
> do anything invalidation-like.
>

Thanks, I also didn't like the naming of kvm_gmem_invalidate(),
especially when conversions also calls
kvm_gmem_invalidate_{start,end}() and those do different things.

> To avoid a conflict with patches that are going to have priority over this series,
> to set the stage for arm64 support, and to avoid avoid bleeding vendor details
> into guest_memfd, as if they are core guest_memfd behavior (only SNP needs the
> "invalidation" on this specific transition), I think we should add an arch hook
> to do conversions straightaway.
>
> Unless there's a clever option I'm missing, it'll mean adding yet another
> HAVE_KVM_ARCH_GMEM_XXX flag?  Hmm, especially because IIUC, arm64/pKVM doesn't
> need a callback for this case, only the free_folio case.
>
>> > +{
>> > +       struct folio_batch fbatch;
>> > +       pgoff_t next = start;
>> > +       int i;
>> > +
>> > +       folio_batch_init(&fbatch);
>> > +       while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
>> > +               for (i = 0; i < folio_batch_count(&fbatch); ++i) {
>> > +                       struct folio *folio = fbatch.folios[i];
>> > +                       pgoff_t start_index, end_index;
>> > +                       kvm_pfn_t start_pfn, end_pfn;
>> > +
>> > +                       start_index = max(start, folio->index);
>> > +                       end_index = min(end, folio_next_index(folio));
>> > +                       /*
>> > +                        * end_index is either in folio or points to
>> > +                        * the first page of the next folio. Hence,
>> > +                        * all pages in range [start_index, end_index)
>> > +                        * are contiguous.
>> > +                        */
>> > +                       start_pfn = folio_file_pfn(folio, start_index);
>> > +                       end_pfn = start_pfn + end_index - start_index;
>> > +
>> > +                       kvm_arch_gmem_invalidate(start_pfn, end_pfn);
>> > +               }
>> > +
>> > +               folio_batch_release(&fbatch);
>> > +               cond_resched();
>> > +       }
>> > +}
>> > +#else
>> > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
>> > +#endif
>> > +
>> >  static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>> >                                      size_t nr_pages, uint64_t attrs,
>> >                                      pgoff_t *err_index)
>> > @@ -647,7 +683,12 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>> >          */
>> >
>> >         kvm_gmem_invalidate_start(inode, start, end);
>> > +
>> > +       if (!to_private)
>> > +               kvm_gmem_invalidate(inode, start, end);
>
> E.g. instead make this something like this?
>
> 	kvm_gmem_set_pfn_attributes(...)
>
> Hrm, though that wastes folio lookups in the to_private case.  So maybe just this,
> assuming pKVM doesn't need to take additional action on conversions?
>
> 	if (!to_private)
> 		kvm_gmem_make_shared(...)
>
> Actually, if we do that, then we don't need a separate arch hook, just a separate
> config.  It'll still bleed SNP details into guest_memfd, but it'll at least be
> done in a way that's more explicitly arch specific (and it's no different than
> what we already do for PREPARE...).
>

pKVM needs some arch guest_memfd lifecycle functions that

+ for conversion, doesn't do anything,
+ for teardown, resets page state (IIUC it'll be reset to
  PKVM_PAGE_OWNED (by the host))

So I think we need different functions for those two stages in the
lifecycle of a page with guest_memfd? What if we have

CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES, which gates

+ kvm_gmem_should_set_pfn_attributes(attributes) and
  .gmem_should_set_pfn_attributes
+ kvm_gmem_set_pfn_attributes(start_pfn, end_pfn, attributes) and
  .gmem_set_pfn_attributes

CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN, which gates

+ kvm_gmem_teardown() and .gmem_teardown

SNP:

+ .gmem_should_set_pfn_attributes = sev_gmem_should_set_pfn_attributes,
  and sev_gmem_should_set_pfn_attributes returns !is_private
+ Rename .gmem_invalidate and sev_gmem_invalidate to *set_pfn_attributes
+ .gmem_teardown = sev_gmem_set_pfn_attributes

TDX:

+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN

pKVM:

+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
+ .gmem_teardown = pkvm_gmem_set_pfn_attributes

Suzuki, does this work for ARM CCA?

This way,

+ The if (is_private) check doesn't leak SNP details into guest_memfd
+ .gmem_make_shared doesn't stick out without a .gmem_make_private
+ .gmem_set_pfn_attributes, .gmem_prepare and .gmem_teardown are aligned
  conceptually as lifecycle hooks

+ I think the private/shared check for prepare can also be folded into
  preparation.
    + Preparation perhaps doesn't need a should_prepare equivalent since
      there's no iteration and getting the gfn is just doing some math?
    + In another patch series?

> E.g. this?  There will still be a looming rename conflict, but that's easy enough
> to handle.
>
> diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> index 9ce5be7843f2..8aead0abd788 100644
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -648,8 +648,8 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>         return safe;
>  }
>
> -#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> +#ifdef CONFIG_KVM_ARCH_GMEM_FREE_ON_SHARED_CONVERSION
> +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t end)
>  {
>         struct folio_batch fbatch;
>         pgoff_t next = start;
> @@ -681,7 +681,7 @@ static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
>         }
>  }
>  #else
> -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t end) { }
>  #endif
>
>  static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> @@ -729,7 +729,7 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>         kvm_gmem_invalidate_start(inode, start, end);
>
>         if (!to_private)
> -               kvm_gmem_invalidate(inode, start, end);
> +               kvm_gmem_make_shared(inode, start, end);
>
>         mas_store_prealloc(&mas, xa_mk_value(attrs));


^ permalink raw reply

* Re: [RFC PATCH] mm: bypass swap readahead for zswap
From: Nhat Pham @ 2026-06-24 17:43 UTC (permalink / raw)
  To: Kairui Song
  Cc: Alexandre Ghiti, akpm, hannes, yosry, chengming.zhou, david, ljs,
	liam, vbabka, rppt, surenb, mhocko, chrisl, baohua, usama.arif,
	linux-mm, linux-kernel
In-Reply-To: <CAMgjq7CfMp_7bhDWirUmxM0pFzk6d9in9h6wuHsMoeUu-+TC_Q@mail.gmail.com>

On Wed, Jun 24, 2026 at 3:31 AM Kairui Song <ryncsn@gmail.com> wrote:
>
>
> Better check zswap_never_enabled first to avoid a xa_load if not needed.

+1.

Maybe also xa_empty() when we're at it? :)


^ permalink raw reply

* [REGRESSION] mm/mprotect: shared-dirty base-page toggle slower since v6.17
From: Chengfeng Lin @ 2026-06-24 17:28 UTC (permalink / raw)
  To: Pedro Falcato, Andrew Morton, linux-mm
  Cc: Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Jann Horn,
	linux-kernel, regressions

Hi,

I have a refreshed bare-metal result for the shared-dirty mprotect()
slowdown I reported earlier from QEMU/lab testing.

The reproducer is intentionally narrow:

  - MAP_SHARED | MAP_ANONYMOUS mapping
  - 64 MiB range, write-prefaulted before timing
  - state check: 4 KiB base pages, no THP backing
  - repeated full-range mprotect(PROT_READ)
  - restore with mprotect(PROT_READ | PROT_WRITE)
  - write-touch after each protect/restore cycle

So this is not a generic mprotect() regression claim.  The scope is the
shared-dirty base-page PTE permission-change path.

The bare-metal machine is an Intel Core i7-14700 system.  The workload is
single-threaded and pinned to one logical CPU with `taskset -c 2`.  The primary
metric is `iteration_ns_per_page`, lower is better.  It is the wall-clock time
for one full protect/restore/write-touch iteration, divided by the number of
4 KiB pages in the range.  Each benchmark step used 9 external rounds, 1000
iterations, and 10 warmup iterations.

First, the v6.12 -> v6.19 result still reproduces on bare metal:

  kernel                         iteration_ns_per_page
  v6.12.77                       26
  v6.19.9                        37

I then narrowed the release window with 3 interleaved boot/run steps per
kernel:

  kernel                         values          mean
  v6.16                          25 25 25        25.000
  v6.17                          37 37 37        37.000
  v6.18                          38 38 38        38.000
  v6.18.19                       38 38 38        38.000
  v6.19.9                        37 36 37        36.667

I also checked later context with the same standalone command:

  kernel                         values          mean
  v7.0.9                         36 36 36        36.000
  v6.19.9 + Pedro v3 patch-only  39 39 39        39.000
  v7.1.0-rc3 mm-unstable/Pedro   39 39 39        39.000

I do not treat the mm-unstable result as a clean release-kernel comparison.
It is only a follow-up check, and in this workload it did not improve the
standalone result.

All of these runs reported `expected_match_ratio=100` and
`unexpected_results=0`.  The state check in the standalone output stays in the
same shape: 4 KiB pages, no THP.

This puts the slowdown in the v6.16 -> v6.17 release window.

As an attribution check, I also built a v6.17 probe kernel that only changes
the present-PTE path in `mm/mprotect.c::change_pte_range()` for this workload
back to a single-PTE start/commit/flush shape.  That is not an upstream patch
and not a clean release-kernel comparison; it is only a hot-path probe.

The result was:

  kernel                         values          mean
  v6.16                          25 25 25        25.000
  v6.17                          37 37 37        37.000
  v6.17 single-PTE probe         25 25 25        25.000

So the targeted probe brings v6.17 back to the v6.16 range for this workload.
That points at the v6.17 PTE-batching shape in `change_pte_range()` as the
main cost for this shared-dirty 4 KiB base-page case.

I do not want to overstate the attribution.  I tried reversing the official
`cac1db8c3aad ("mm: optimize mprotect() by PTE batching")` patch onto my
linux-6.17 tree, but it did not apply cleanly.  That means this is not an
exact revert result.  I can only say that the slowdown appears in the
v6.16 -> v6.17 window, and that this focused probe brings the v6.17 result
back to the v6.16 range.

Evidence bundle:

  https://github.com/lcf0399/linux-mm-regression-evidence/tree/acd7fef0e0276ac361971b0960e6611811edf5b3/mprotect-shared-dirty-toggle

Standalone reproducer:

  https://github.com/lcf0399/linux-mm-regression-evidence/tree/acd7fef0e0276ac361971b0960e6611811edf5b3/mprotect-shared-dirty-toggle/reproducer

For each installed kernel, the standalone reproducer was run as:

  taskset -c 2 env MAPPING_MB=64 ITERATIONS=1000 WARMUP=10 \
    EXTERNAL_ROUNDS=9 ./run_mprotect_shared_dirty_reproducer.sh

For the release-window check, a small systemd/GRUB queue booted each target
kernel before running the same command.

Bare-metal summaries and raw run logs:

  https://github.com/lcf0399/linux-mm-regression-evidence/tree/acd7fef0e0276ac361971b0960e6611811edf5b3/mprotect-shared-dirty-toggle/bare-metal

Release-window narrowing:

  https://github.com/lcf0399/linux-mm-regression-evidence/tree/acd7fef0e0276ac361971b0960e6611811edf5b3/mprotect-shared-dirty-toggle/bare-metal/20260623-narrow-6.16-6.19-3rounds

v6.17 single-PTE probe:

  https://github.com/lcf0399/linux-mm-regression-evidence/tree/acd7fef0e0276ac361971b0960e6611811edf5b3/mprotect-shared-dirty-toggle/bare-metal/20260624-6.17-singlepte-probe

Probe patch used for that attribution run:

  https://github.com/lcf0399/linux-mm-regression-evidence/blob/acd7fef0e0276ac361971b0960e6611811edf5b3/mprotect-shared-dirty-toggle/bare-metal/20260624-6.17-singlepte-probe/0001-mm-mprotect-probe-6.17-single-pte-hotpath.patch

#regzbot introduced: v6.16..v6.17

Does this scope look useful to investigate further?  If yes, I can try a more
exact commit-level check or test a patch you think is the right direction.

Thanks,
Chengfeng


^ permalink raw reply

* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Sean Christopherson @ 2026-06-24 17:01 UTC (permalink / raw)
  To: Binbin Wu
  Cc: ackerleytng, aik, andrew.jones, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <6fc7f450-6d0a-494d-b295-297e4703148d@linux.intel.com>

On Tue, Jun 23, 2026, Binbin Wu wrote:
> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> > @@ -606,12 +608,20 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> >  	next = start;
> >  	while (safe && filemap_get_folios(mapping, &next, last, &fbatch)) {
> >  
> > -		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> > +		for (i = 0; i < folio_batch_count(&fbatch);) {
> >  			struct folio *folio = fbatch.folios[i];
> >  
> > -			if (folio_ref_count(folio) !=
> > -			    folio_nr_pages(folio) + filemap_get_folios_refcount) {
> > -				safe = false;
> > +			safe = (folio_ref_count(folio) ==
> > +				folio_nr_pages(folio) +
> > +				filemap_get_folios_refcount);
> > +
> > +			if (safe) {
> > +				++i;
> > +			} else if (folio_may_be_lru_cached(folio) &&
> > +				   !lru_drained) {
> > +				lru_add_drain_all();
> 
> It seems unprivileged userspace is able to trigger lru_add_drain_all() repeatedly
> by invoking KVM_SET_MEMORY_ATTRIBUTES2 in a loop, which could lead to DoS risk?

FIW, if there's a risk, then AFAICT fadvise() and memfd's F_ADD_SEALS already
have the same risk.


^ permalink raw reply

* Re: [PATCH v4 2/5] mm/zswap: Factor writeback loop out of shrink_worker()
From: Yosry Ahmed @ 2026-06-24 17:00 UTC (permalink / raw)
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, mkoutny, nphamcs,
	chengming.zhou, muchun.song, roman.gushchin, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <0916e673-861f-b472-7417-afbffbcc98ad@gmail.com>

On Wed, Jun 24, 2026 at 4:55 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>
>
>
> On 2026/6/23 07:36, Yosry Ahmed wrote:
> >> +/*
> >> + * Walk the memcg tree and write back zswap pages until the
> >> + * (lower_pages, upper_pages) window closes, or abort encounter
> >> + * MAX_RECLAIM_RETRIES times of the following conditions:
> >> + * - No writeback-candidate memcgs found in a memcg tree walk.
> >> + * - Shrinking a writeback-candidate memcg failed.
> >> + *
> >> + * For shrink_worker(), it passes lower=thr and upper=zswap_total_pages().
> >> + * The @upper limit is refreshed in each iteration by re-evaluating
> >> + * zswap_total_pages(), and the window closes once the total falls
> >> + * below the threshold.
> >
> > This is the wrong abstraction level, and it's obvious by the fact that
> > the function calls zswap_total_pages() again to recalcualte
> > 'upper_pages'. It gets much worse in the next patch as well.
> >
> > The lower_pages and upper_pages thing is also unnecessarily hard to
> > follow.
> >
> > The core of the reuse here is the retry logic. So maybe keep the memcg
> > iteration in the callers, and define a function that takes in one memcg
> > and reclaims one batch from it? failures and attempts can be passed into
> > the function to maintain the state across scans of different memcgs,
> > like zswap_shrink_walk_arg?
> >
> > WDYT?
>
>
> Perhaps something like this?
>
> struct zswap_shrink_state {
>      int attempts;
>      int failures;
>      bool stop;
> };
>
> static bool zswap_shrink_no_candidate(struct zswap_shrink_state *s)
> {
>      if (!s->attempts && ++s->failures == MAX_RECLAIM_RETRIES)
>          return true;
>
>      s->attempts = 0;
>      return false;
> }
>
> static long zswap_shrink_one(struct mem_cgroup *memcg,
>                   struct zswap_shrink_state *s)
> {
>      long shrunk;
>
>      shrunk = shrink_memcg(memcg, NR_ZSWAP_WB_BATCH);
>      if (shrunk == -ENOENT)
>          return 0;
>
>      s->attempts++;
>      if (shrunk <= 0 && ++s->failures == MAX_RECLAIM_RETRIES)
>          s->stop = true;

Do we need 'stop' or can we just return a value here to indicate that
we should stop (e.g. -EBUSY)?

>
>      return shrunk;
> }
>
> static void shrink_worker(struct work_struct *w)
> {
>      struct zswap_shrink_state s = {};
>      unsigned long thr;
>
>      /* Reclaim down to the accept threshold */
>      thr = zswap_accept_thr_pages();
>
>      while (zswap_total_pages() > thr) {
>          struct mem_cgroup *memcg;
>
>          cond_resched();
>
>          memcg = zswap_iter_global();
>          if (!memcg) {
>              if (zswap_shrink_no_candidate(&s))
>                  break;
>              continue;
>          }
>
>          zswap_shrink_one(memcg, &s);
>          /* Drop the extra reference taken by the iterator. */
>          mem_cgroup_put(memcg);
>          if (s.stop)
>              break;
>      }
> }
>
> We could also fold the logic of zswap_shrink_no_candidate() into
> zswap_shrink_one(), but adding a !memcg check inside zswap_shrink_one()
> feels a bit awkward.
>
> WDYT?

I think splitting the shrink/retry logic over 2 functions makes it
more difficult to follow, so yeah I think fold
zswap_shrink_no_candidate() into zswap_shrink_one(). Then the callers
only need to iterate memcgs (depending on the context) and call
zswap_shrink_one() for each of them.


^ permalink raw reply

* Re: [PATCH v4 1/5] mm/zswap: Extend shrink_memcg() writeback capability
From: Yosry Ahmed @ 2026-06-24 16:57 UTC (permalink / raw)
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, mkoutny, nphamcs,
	chengming.zhou, muchun.song, roman.gushchin, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <057ea303-4c27-1a6e-08de-cce26c699097@gmail.com>

>
> /*
>   * Scan up to @nr_to_scan pages across the per-node zswap LRUs of @memcg
>   * and write back the reclaimable ones.
>   *
>   * Since the second-chance algorithm rotates referenced entries to the
>   * LRU tail, the per-node scan is capped at the current LRU length so
>   * each entry is scanned at most once per call. It is up to the caller
>   * to handle retries, deciding whether to scan the next memcg to complete

Nit: "whether to scan another memcg to complete.."

>   * the full iteration, or to rescan the current memcg to drain its zswap
>   * entries.
>   *
>   * Return: The number of compressed bytes written back (>= 0), or -ENOENT
>   * if @memcg has writeback disabled, is a zombie cgroup, or has empty
>   * zswap LRUs.
>   */
> static long shrink_memcg(struct mem_cgroup *memcg, unsigned long nr_to_scan)
> {
>      struct zswap_shrink_walk_arg walk_arg = {
>          .bytes_written = 0,
>          .encountered_page_in_swapcache = false,
>      };
>      unsigned long nr_remaining = nr_to_scan;
>      int nid;
>
>      if (!mem_cgroup_zswap_writeback_enabled(memcg))
>          return -ENOENT;
>
>      /*
>       * Skip zombies because their LRUs are reparented and we would be
>       * reclaiming from the parent instead of the dead memcg.
>       */
>      if (memcg && !mem_cgroup_online(memcg))
>          return -ENOENT;
>
>      for_each_node_state(nid, N_NORMAL_MEMORY) {
>          unsigned long nr_to_walk;
>
>          /*
>           * Cap the walk at the current LRU length to ensure each entry is
>           * scanned at most once per call. Referenced entries are rotated
>           * to the tail for a second chance, and this bound prevents them
>           * from being revisited within a single call. Retries are left to
>           * the caller, which can choose to rescan the current memcg or
>           * move on to the next one.
>           */

Nit: Make this more concise since it's already explained above.

Otherwise this looks good to me, thank you!

>          nr_to_walk = min(nr_remaining,
>                   list_lru_count_one(&zswap_list_lru, nid, memcg));
>          if (!nr_to_walk)
>              continue;
>
>          nr_remaining -= nr_to_walk;
>          list_lru_walk_one(&zswap_list_lru, nid, memcg, &shrink_memcg_cb,
>                    &walk_arg, &nr_to_walk);
>          /* Return the unused share of the budget to the pool. */
>          nr_remaining += nr_to_walk;
>
>          if (!nr_remaining)
>              break;
>      }
>
>      /* Nothing was scanned: every LRU under @memcg was empty. */
>      if (nr_remaining == nr_to_scan)
>          return -ENOENT;
>
>      return walk_arg.bytes_written;
> }
>
>
> Thanks,
> Hao


^ permalink raw reply

* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Sean Christopherson @ 2026-06-24 16:57 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-18-9d2959357853@google.com>

On Thu, Jun 18, 2026, Ackerley Tng wrote:
> When checking if a guest_memfd folio is safe for conversion, its refcount
> is examined. A folio may be present in a per-CPU lru_add fbatch, which
> temporarily increases its refcount. 

Under what circumstances does this happen, and what alternatives are there for
userspace to work around the issue?


^ permalink raw reply

* Re: [PATCH v2 13/13] mm: remove __GFP_NO_CODETAG
From: Suren Baghdasaryan @ 2026-06-24 16:47 UTC (permalink / raw)
  To: Hao Ge
  Cc: Brendan Jackman, Vlastimil Babka, Harry Yoo (Oracle),
	Gregory Price, Alexei Starovoitov, Matthew Wilcox, linux-mm,
	linux-kernel, linux-rt-devel, Michal Hocko, Andrew Morton,
	Johannes Weiner, Zi Yan, Muchun Song, Oscar Salvador,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Mike Rapoport, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Alistair Popple, Ying Huang, Hao Li,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt
In-Reply-To: <6e312b15-d2b5-4137-aa3f-720ec214c7ab@linux.dev>

On Tue, Jun 23, 2026 at 12:57 AM Hao Ge <hao.ge@linux.dev> wrote:
>
> Hi Brendan
>
>
> On 2026/6/22 18:01, Brendan Jackman wrote:
> > Now that alloc_pages has an entrypoint that allows passing alloc_flags,
> > we can take advantage of this to start removing GFP flags that are only
> > used for mm-internal stuff.
> >
> > This requires also plumbing the alloc_flags into some more of the
> > allocator code, in particular __alloc_pages[_noprof]() gets an
> > alloc_flags arg to go along with its callees, and we now need to pass
> > those flags deeper into the allocator so they can reach the alloc_tag
> > code.
> >
> > To try and keep the new ALLOC_NO_CODETAG's scope nice and narrow, don't
> > define it in mm/internal.h, instead just define a "reserved bit" and
> > then use that in places that don't care about what it means.

I don't understand why you want to narrow down visibility of one of
the alloc_flag bits. We don't do that for any other flags, and this
seems like an unnecessary complexity.

> >
> > Signed-off-by: Brendan Jackman <jackmanb@google.com>
>
>
> Nit: The title says "remove __GFP_NO_CODETAG" but the flag isn't really
> removed — it's migrated from gfp_t to alloc_flags as
>
> ALLOC_NO_CODETAG. Something like "mm: replace __GFP_NO_CODETAG with an
> alloc_flag" would be more accurate.
>
>
> Additionally, as Lorenzo pointed out in another thread, you will likely
> need to rebase this series later.
>
> I noticed Vlastimil has already landed the slab changes removing
> __GFP_NO_OBJ_EXT into mainline:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=335c347686e76df9d2c7d7f61b5ea627a4c5cb4c
>
> For v3, it might make sense to fold in Vlastimil's patch so the full
> removal of __GFP_NO_OBJ_EXT can be completed end-to-end
>
> https://lore.kernel.org/all/20260609-slab_alloc_flags-v1-15-2bf4a4b9b526@kernel.org/

I think Vlastimil's patch will be merged before this one, so this
patch could remove __GFP_NO_OBJ_EXT complely, saying that its last
user (__GFP_NO_CODETAG) is gone.

>
>
> > ---
> >   mm/alloc_tag.c       | 18 ++++++++++--------
> >   mm/compaction.c      |  4 ++--
> >   mm/internal.h        |  8 ++++++--
> >   mm/page_alloc.c      | 42 ++++++++++++++++++++++++------------------
> >   mm/page_frag_cache.c |  4 ++--
> >   5 files changed, 44 insertions(+), 32 deletions(-)
> >
> > diff --git a/mm/alloc_tag.c b/mm/alloc_tag.c
> > index d9be1cf5187d9..61a6cba32ff35 100644
> > --- a/mm/alloc_tag.c
> > +++ b/mm/alloc_tag.c
> > @@ -15,6 +15,8 @@
> >   #include <linux/vmalloc.h>
> >   #include <linux/kmemleak.h>
> >
> > +#include "internal.h"
> > +
> >   #define ALLOCINFO_FILE_NAME         "allocinfo"
> >   #define MODULE_ALLOC_TAG_VMAP_SIZE  (100000UL * sizeof(struct alloc_tag))
> >   #define SECTION_START(NAME)         (CODETAG_SECTION_START_PREFIX NAME)
> > @@ -785,16 +787,15 @@ struct pfn_pool {
> >                                        sizeof(unsigned long))
> >
> >   /*
> > - * Skip early PFN recording for a page allocation.  Reuses the
> > - * %__GFP_NO_OBJ_EXT bit.  Used by __alloc_tag_add_early_pfn() to avoid
> > - * recursion when allocating pages for the early PFN tracking list
> > - * itself.
> > + * Skip early PFN recording for a page allocation.  Used by
> > + * __alloc_tag_add_early_pfn() to avoid recursion when allocating pages for the
> > + * early PFN tracking list itself.
> >    *
> >    * Codetags of the pages allocated with __GFP_NO_CODETAG should be
> >    * cleared (via clear_page_tag_ref()) before freeing the pages to prevent
> >    * alloc_tag_sub_check() from triggering a warning.
> >    */
> > -#define __GFP_NO_CODETAG             __GFP_NO_OBJ_EXT
> > +#define ALLOC_NO_CODETAG             __ALLOC_ALLOC_TAG
> >
> >   static struct pfn_pool *current_pfn_pool __initdata;
> >
> > @@ -806,7 +807,8 @@ static void __init __alloc_tag_add_early_pfn(unsigned long pfn)
> >       do {
> >               pool = READ_ONCE(current_pfn_pool);
> >               if (!pool || atomic_read(&pool->count) >= PFN_POOL_SIZE) {
> > -                     struct page *new_page = alloc_page(__GFP_HIGH | __GFP_NO_CODETAG);
> > +                     struct page *new_page = __alloc_pages(__GFP_HIGH, 0, numa_mem_id(),
> > +                                                           NULL, ALLOC_NO_CODETAG);
> >                       struct pfn_pool *new;
> >
> >                       if (!new_page) {
> > @@ -837,7 +839,7 @@ typedef void alloc_tag_add_func(unsigned long pfn);
> >   static alloc_tag_add_func __rcu *alloc_tag_add_early_pfn_ptr __refdata =
> >       RCU_INITIALIZER(__alloc_tag_add_early_pfn);
> >
> > -void alloc_tag_add_early_pfn(unsigned long pfn, gfp_t gfp_flags)
> > +void alloc_tag_add_early_pfn(unsigned long pfn, unsigned int alloc_flags)
>
>
> alloc_tag_add_early_pfn is actually declared in include/linux/alloc_tag.h,
>
> so we need to update this header in sync as well.
>
> include/linux/alloc_tag.h:166:void alloc_tag_add_early_pfn(unsigned long
> pfn, gfp_t gfp_flags);
> include/linux/alloc_tag.h:170:static inline void
> alloc_tag_add_early_pfn(unsigned long pfn, gfp_t gfp_flags) {}
>
>
> >   {
> >       alloc_tag_add_func *alloc_tag_add;
> >
> > @@ -845,7 +847,7 @@ void alloc_tag_add_early_pfn(unsigned long pfn, gfp_t gfp_flags)
> >               return;
> >
> >       /* Skip allocations for the tracking list itself to avoid recursion. */
> > -     if (gfp_flags & __GFP_NO_CODETAG)
> > +     if (alloc_flags & ALLOC_NO_CODETAG)
> >               return;
> >
> >       rcu_read_lock();
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index b776f35ad0200..e90ebd2c54f48 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -82,7 +82,7 @@ static inline bool is_via_compact_memory(int order) { return false; }
> >
> >   static struct page *mark_allocated_noprof(struct page *page, unsigned int order, gfp_t gfp_flags)
> >   {
> > -     post_alloc_hook(page, order, __GFP_MOVABLE);
> > +     post_alloc_hook(page, order, __GFP_MOVABLE, ALLOC_DEFAULT);
> >       set_page_refcounted(page);
> >       return page;
> >   }
> > @@ -1850,7 +1850,7 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
> >       }
> >       dst = (struct folio *)freepage;
> >
> > -     post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
> > +     post_alloc_hook(&dst->page, order, __GFP_MOVABLE, ALLOC_DEFAULT);
> >       set_page_refcounted(&dst->page);
> >       if (order)
> >               prep_compound_page(&dst->page, order);
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 0847b55bfc147..a45bedb9ada5f 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -684,6 +684,8 @@ struct alloc_context {
> >        */
> >       enum zone_type highest_zoneidx;
> >       bool spread_dirty_pages;
> > +     /* Only flags that are global to the whole allocation go here. */
> > +     unsigned int alloc_flags;
> >   };
> >
> >   /*
> > @@ -907,7 +909,8 @@ static inline void init_compound_tail(struct page *tail,
> >       prep_compound_tail(tail, head, order);
> >   }
> >
> > -void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
> > +void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> > +                  unsigned int alloc_flags);
> >   extern bool free_pages_prepare(struct page *page, unsigned int order);
> >
> >   extern int user_min_free_kbytes;
> > @@ -1481,6 +1484,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
> >   #define ALLOC_HIGHATOMIC    0x200 /* Allows access to MIGRATE_HIGHATOMIC */
> >   #define ALLOC_NOLOCK                0x400 /* Only use spin_trylock in allocation path */
> >   #define ALLOC_KSWAPD                0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
> > +#define __ALLOC_ALLOC_TAG      0x1000 /* Reserved bit for use by alloc_tag code */
> >
> >   /* Flags that allow allocations below the min watermark. */
> >   #define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
> > @@ -1956,7 +1960,7 @@ bool may_expand_vm(struct mm_struct *mm, const vma_flags_t *vma_flags,
> >                  unsigned long npages);
> >
> >   struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
> > -             nodemask_t *nodemask);
> > +             nodemask_t *nodemask, unsigned int alloc_flags);
> >   #define __alloc_pages(...)                  alloc_hooks(__alloc_pages_noprof(__VA_ARGS__))
> >
> >   #endif      /* __MM_INTERNAL_H */
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d99e4ea8307ea..d50fd9c77a2e8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1246,7 +1246,7 @@ void __clear_page_tag_ref(struct page *page)
> >   /* Should be called only if mem_alloc_profiling_enabled() */
> >   static noinline
> >   void __pgalloc_tag_add(struct page *page, struct task_struct *task,
> > -                    unsigned int nr, gfp_t gfp_flags)
> > +                    unsigned int nr, unsigned int alloc_flags)
> >   {
> >       union pgtag_ref_handle handle;
> >       union codetag_ref ref;
> > @@ -1260,17 +1260,17 @@ void __pgalloc_tag_add(struct page *page, struct task_struct *task,
> >                * page_ext is not available yet, record the pfn so we can
> >                * clear the tag ref later when page_ext is initialized.
> >                */
> > -             alloc_tag_add_early_pfn(page_to_pfn(page), gfp_flags);
> > +             alloc_tag_add_early_pfn(page_to_pfn(page), alloc_flags);
> >               if (task->alloc_tag)
> >                       alloc_tag_set_inaccurate(task->alloc_tag);
> >       }
> >   }
> >
> >   static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
> > -                                unsigned int nr, gfp_t gfp_flags)
> > +                                unsigned int nr, unsigned int alloc_flags)
>
>
> The pgalloc_tag_add() stub in the non-CONFIG_MEM_ALLOC_PROFILING build
> could use the same parameter types for consistency

Umm, correction. It *should* use the same parameter type. It's
unfortunate that the compiler doesn't catch this...

>
>
> Thanks
>
> Best Regards
>
> Hao
>
>
> >   {
> >       if (mem_alloc_profiling_enabled())
> > -             __pgalloc_tag_add(page, task, nr, gfp_flags);
> > +             __pgalloc_tag_add(page, task, nr, alloc_flags);
> >   }
> >
> >   /* Should be called only if mem_alloc_profiling_enabled() */
> > @@ -1807,7 +1807,7 @@ static inline bool should_skip_init(gfp_t flags)
> >   }
> >
> >   inline void post_alloc_hook(struct page *page, unsigned int order,
> > -                             gfp_t gfp_flags)
> > +                             gfp_t gfp_flags, unsigned int alloc_flags)
> >   {
> >       const bool zero_tags = gfp_flags & __GFP_ZEROTAGS;
> >       bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
> > @@ -1858,13 +1858,13 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >
> >       set_page_owner(page, order, gfp_flags);
> >       page_table_check_alloc(page, order);
> > -     pgalloc_tag_add(page, current, 1 << order, gfp_flags);
> > +     pgalloc_tag_add(page, current, 1 << order, alloc_flags);
> >   }
> >
> >   static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
> >                                                       unsigned int alloc_flags)
> >   {
> > -     post_alloc_hook(page, order, gfp_flags);
> > +     post_alloc_hook(page, order, gfp_flags, alloc_flags);
> >
> >       if (order && (gfp_flags & __GFP_COMP))
> >               prep_compound_page(page, order);
> > @@ -4773,8 +4773,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >        * The fast path uses conservative alloc_flags to succeed only until
> >        * kswapd needs to be woken up, and to avoid the cost of setting up
> >        * alloc_flags precisely. So we do that now.
> > +      *
> > +      * Can't just or alloc_flags if it contains WMARK bits, but those flags
> > +      * shouldn't be set in ac->alloc_flags.
> >        */
> > -     alloc_flags = slowpath_alloc_flags(gfp_mask, order);
> > +     VM_WARN_ON(ac->alloc_flags & ALLOC_WMARK_MASK);
> > +     alloc_flags = ac->alloc_flags | slowpath_alloc_flags(gfp_mask, order);
> >
> >       /*
> >        * We need to recalculate the starting point for the zonelist iterator
> > @@ -4816,7 +4820,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >       reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> >       if (reserve_flags)
> >               alloc_flags = cma_alloc_flags(gfp_mask, reserve_flags) |
> > -                                       (alloc_flags & ALLOC_KSWAPD);
> > +                             ac->alloc_flags | (alloc_flags & ALLOC_KSWAPD);
> >
> >       /*
> >        * Reset the nodemask and zonelist iterators if memory policies can be
> > @@ -5218,7 +5222,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> >       return nr_populated;
> >
> >   failed:
> > -     page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
> > +     page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask, ALLOC_DEFAULT);
> >       if (page)
> >               page_array[nr_populated++] = page;
> >       goto out;
> > @@ -5326,11 +5330,13 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
> >   {
> >       struct page *page;
> >       gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
> > -     struct alloc_context ac = { };
> > +     struct alloc_context ac = {
> > +             .alloc_flags = alloc_flags,
> > +     };
> >       unsigned int fastpath_alloc_flags = alloc_flags;
> >
> >       /* Other flags could be supported later if needed. */
> > -     if (WARN_ON(alloc_flags & ~ALLOC_NOLOCK))
> > +     if (WARN_ON(alloc_flags & ~(ALLOC_NOLOCK | __ALLOC_ALLOC_TAG)))
> >               return NULL;
> >
> >       if (!alloc_order_allowed(gfp, order, alloc_flags))
> > @@ -5398,12 +5404,12 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
> >   EXPORT_SYMBOL(__alloc_frozen_pages_noprof);
> >
> >   struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
> > -             int preferred_nid, nodemask_t *nodemask)
> > +             int preferred_nid, nodemask_t *nodemask, unsigned int alloc_flags)
> >   {
> >       struct page *page;
> >
> >       page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask,
> > -                                        ALLOC_DEFAULT);
> > +                                        alloc_flags);
> >       if (page)
> >               set_page_refcounted(page);
> >       return page;
> > @@ -5418,14 +5424,14 @@ struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask, unsigned int order
> >       VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> >       warn_if_node_offline(nid, gfp_mask);
> >
> > -     return __alloc_pages_noprof(gfp_mask, order, nid, NULL);
> > +     return __alloc_pages_noprof(gfp_mask, order, nid, NULL, ALLOC_DEFAULT);
> >   }
> >
> >   struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
> >               nodemask_t *nodemask)
> >   {
> >       struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
> > -                                     preferred_nid, nodemask);
> > +                                     preferred_nid, nodemask, ALLOC_DEFAULT);
> >       return page_rmappable_folio(page);
> >   }
> >   EXPORT_SYMBOL(__folio_alloc_noprof);
> > @@ -7107,7 +7113,7 @@ static void split_free_frozen_pages(struct list_head *list, gfp_t gfp_mask)
> >               list_for_each_entry_safe(page, next, &list[order], lru) {
> >                       int i;
> >
> > -                     post_alloc_hook(page, order, gfp_mask);
> > +                     post_alloc_hook(page, order, gfp_mask, ALLOC_DEFAULT);
> >                       if (!order)
> >                               continue;
> >
> > @@ -7312,7 +7318,7 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
> >               struct page *head = pfn_to_page(start);
> >
> >               check_new_pages(head, order);
> > -             prep_new_page(head, order, gfp_mask, 0);
> > +             prep_new_page(head, order, gfp_mask, ALLOC_DEFAULT);
> >       } else {
> >               ret = -EINVAL;
> >               WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
> > diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> > index d2423f30577e4..d9573170e0719 100644
> > --- a/mm/page_frag_cache.c
> > +++ b/mm/page_frag_cache.c
> > @@ -57,10 +57,10 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> >       gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> >                  __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> >       page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
> > -                          numa_mem_id(), NULL);
> > +                          numa_mem_id(), NULL, ALLOC_DEFAULT);
> >   #endif
> >       if (unlikely(!page)) {
> > -             page = __alloc_pages(gfp, 0, numa_mem_id(), NULL);
> > +             page = __alloc_pages(gfp, 0, numa_mem_id(), NULL, ALLOC_DEFAULT);
> >               order = 0;
> >       }
> >
> >


^ permalink raw reply

* Re: [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock
From: Usama Arif @ 2026-06-24 16:43 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, cgroups, linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260624152331.2228828-1-joshua.hahnjy@gmail.com>



On 24/06/2026 16:23, Joshua Hahn wrote:
> On Wed, 24 Jun 2026 07:43:47 -0700 Usama Arif <usama.arif@linux.dev> wrote:
> 
>> On Tue, 23 Jun 2026 11:01:22 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
> Hello Usama!!
> 
> Thank you for reviewing the patch : -)
> 
> [...snip...]
> 
>>> @@ -2595,7 +2596,6 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
>>>  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>  			    unsigned int nr_pages)
>>>  {
>>> -	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
>>>  	int nr_retries = MAX_RECLAIM_RETRIES;
>>>  	struct mem_cgroup *mem_over_limit;
>>>  	struct page_counter *counter;
>>> @@ -2606,36 +2606,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>  	bool raised_max_event = false;
>>>  	unsigned long pflags;
>>>  	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
>>> +	unsigned long nr_charged = 0;
>>>  
>>>  retry:
>>> -	if (consume_stock(memcg, nr_pages))
>>> -		return 0;
>>> -
>>> -	if (!allow_spinning)
>>> -		/* Avoid the refill and flush of the older stock */
>>> -		batch = nr_pages;
>>> -
>>>  	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
>>>  	if (do_memsw_account() &&
>>> -	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
>>> +	    !page_counter_try_charge_stock(&memcg->memsw, nr_pages,
>>> +					   &counter, NULL)) {
>>>  		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
>>>  		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
>>>  		goto reclaim;
>>>  	}
>>>  
>>> -	if (page_counter_try_charge(&memcg->memory, batch, &counter))
>>> -		goto done_restock;
>>> +	if (page_counter_try_charge_stock(&memcg->memory, nr_pages,
>>> +					  &counter, &nr_charged)) {
>>> +		if (!nr_charged)
>>> +			return 0;
>>> +		goto handle_high;
>>> +	}
>>>  
>>>  	if (do_memsw_account())
>>> -		page_counter_uncharge(&memcg->memsw, batch);
>>> +		page_counter_uncharge(&memcg->memsw, nr_pages);
>>
>> This needs a transactional rollback. page_counter_try_charge_stock() can
>> succeed by consuming memsw stock and charging 0 new pages, but the
>> memory-failure path unconditionally uncharges nr_pages from memsw.
>> That turns a failed allocation into a real memsw usage decrement.
> 
> Hmmmmmmmmmm....... I'm not sure.
> 
> At this point in the code, we are either (1) using cgroup v1 with memsw
> and charged successfully, or (2) not using cgroup v1 with memsw. So I'm
> not sure if this really is unconditional, we're just distinguishing
> between cases (1) and (2) by checking if we're using cgroupv1.
> 
> Or is your concern with taking a charge via stock, but uncharging with
> a hierarchical page_counter walk?

This was my concern. But I re-read the page_counter stock invariant,
and the stock-hit case is not an undercount? Consuming stock transfers
already-charged credit to the pending allocation; if the later memory charge
fails, page_counter_uncharge() discards that consumed credit from the
hierarchy. That should keeps usage equal to real charges plus remaining stock?

> If so, I think there's a case to be
> made here with just simply returning the stock. I just wanted to keep
> it consistent with the original memcontrol code, which only used
> stock to fulfill charges, not uncharges, since this could make the
> stock grow without bound.
> 
> What do you think? Thanks again for reviewing Usama, I hope you have a
> great day!!!
> Joshua



^ permalink raw reply

* Re: [PATCH v5 4/9] mm/memory_hotplug: add __add_memory_driver_managed() with online_type arg
From: Gupta, Pankaj @ 2026-06-24 16:41 UTC (permalink / raw)
  To: Gregory Price, linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, alison.schofield, Smita.KoralahalliChannabasappa,
	ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-5-gourry@gourry.net>


> Existing callers of add_memory_driver_managed cannot select the
> preferred online type (ZONE_NORMAL vs ZONE_MOVABLE), requiring it to
> hot-add memory as offline blocks, and then follow up by onlining each
> memory block individually.
>
> Most drivers prefer the system default, but the CXL driver wants to
> plumb a preferred policy through the dax kmem driver.
>
> Refactor APIs to add a new interface which allows the dax kmem module
> to select a preferred policy.
>
> Overriding the configured auto-online policy is only safe for known
> in-tree modules, where we know the override reflects a different,
> user-requested policy.  We do not want arbitrary out-of-tree drivers
> silently overriding the system-wide onlining policy, so restrict the
> new interface to the kmem module using EXPORT_SYMBOL_FOR_MODULES()
> rather than a plain EXPORT_SYMBOL_GPL().  Other in-tree modules (e.g.
> cxl_core) can be added to the allowed list as the need arises.
>
> Refactor add_memory_driver_managed, extract __add_memory_driver_managed
> - Add proper kernel-doc for add_memory_driver_managed while refactoring
> - New helper accepts an explicit online_type.
> - New helper validates online_type is between OFFLINE and ONLINE_MOVABLE
>
> Refactor: add_memory_resource, extract __add_memory_resource
> - new helper accepts an explicit online_type
>
> Original APIs now explicitly pass the system-default to new helpers.
>
> No functional change for existing users.
>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>   include/linux/memory_hotplug.h |  3 ++
>   mm/memory_hotplug.c            | 61 +++++++++++++++++++++++++++++-----
>   2 files changed, 56 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index f059025f8f8b..d3edeb80aadb 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -294,6 +294,9 @@ extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
>   extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
>   extern int add_memory_resource(int nid, struct resource *resource,
>   			       mhp_t mhp_flags);
> +int __add_memory_driver_managed(int nid, u64 start, u64 size,
> +				const char *resource_name, mhp_t mhp_flags,
> +				enum mmop online_type);
>   extern int add_memory_driver_managed(int nid, u64 start, u64 size,
>   				     const char *resource_name,
>   				     mhp_t mhp_flags);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 494257054095..a66346def504 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1494,10 +1494,10 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
>    *
>    * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
>    */
> -int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> +static int __add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags,
> +				 enum mmop online_type)
>   {
>   	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> -	enum mmop online_type = mhp_get_default_online_type();
>   	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>   	struct memory_group *group = NULL;
>   	u64 start, size;
> @@ -1585,7 +1585,7 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>   		merge_system_ram_resource(res);
>   
>   	/* online pages if requested */
> -	if (mhp_get_default_online_type() != MMOP_OFFLINE)
> +	if (online_type != MMOP_OFFLINE)
>   		walk_memory_blocks(start, size, &online_type,
>   				   online_memory_block);
>   
> @@ -1603,7 +1603,13 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>   	return ret;
>   }
>   
> -/* requires device_hotplug_lock, see add_memory_resource() */
> +int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> +{
> +	return __add_memory_resource(nid, res, mhp_flags,
> +				     mhp_get_default_online_type());
> +}
> +
> +/* requires device_hotplug_lock, see __add_memory_resource() */
>   int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags)
>   {
>   	struct resource *res;
> @@ -1631,7 +1637,15 @@ int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags)
>   }
>   EXPORT_SYMBOL_GPL(add_memory);
>   
> -/*
> +/**
> + * __add_memory_driver_managed - add driver-managed memory with explicit online_type
> + * @nid: NUMA node ID where the memory will be added
> + * @start: Start physical address of the memory range
> + * @size: Size of the memory range in bytes
> + * @resource_name: Resource name in format "System RAM ($DRIVER)"
> + * @mhp_flags: Memory hotplug flags
> + * @online_type: Auto-Online behavior (offline, online, kernel, movable)
> + *
>    * Add special, driver-managed memory to the system as system RAM. Such
>    * memory is not exposed via the raw firmware-provided memmap as system
>    * RAM, instead, it is detected and added by a driver - during cold boot,
> @@ -1639,6 +1653,7 @@ EXPORT_SYMBOL_GPL(add_memory);
>    *
>    * Reasons why this memory should not be used for the initial memmap of a
>    * kexec kernel or for placing kexec images:
> + *
>    * - The booting kernel is in charge of determining how this memory will be
>    *   used (e.g., use persistent memory as system RAM)
>    * - Coordination with a hypervisor is required before this memory
> @@ -1651,9 +1666,12 @@ EXPORT_SYMBOL_GPL(add_memory);
>    *
>    * The resource_name (visible via /proc/iomem) has to have the format
>    * "System RAM ($DRIVER)".
> + *
> + * Return: 0 on success, negative error code on failure.
>    */
> -int add_memory_driver_managed(int nid, u64 start, u64 size,
> -			      const char *resource_name, mhp_t mhp_flags)
> +int __add_memory_driver_managed(int nid, u64 start, u64 size,
> +		const char *resource_name, mhp_t mhp_flags,
> +		enum mmop online_type)
>   {
>   	struct resource *res;
>   	int rc;
> @@ -1663,6 +1681,9 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
>   	    resource_name[strlen(resource_name) - 1] != ')')
>   		return -EINVAL;
>   
> +	if (online_type < MMOP_OFFLINE || online_type > MMOP_ONLINE_MOVABLE)
> +		return -EINVAL;
> +
>   	lock_device_hotplug();
>   
>   	res = register_memory_resource(start, size, resource_name);
> @@ -1671,7 +1692,7 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
>   		goto out_unlock;
>   	}
>   
> -	rc = add_memory_resource(nid, res, mhp_flags);
> +	rc = __add_memory_resource(nid, res, mhp_flags, online_type);
>   	if (rc < 0)
>   		release_memory_resource(res);
>   
> @@ -1679,6 +1700,30 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
>   	unlock_device_hotplug();
>   	return rc;
>   }
> +EXPORT_SYMBOL_FOR_MODULES(__add_memory_driver_managed, "kmem");
> +
> +/**
> + * add_memory_driver_managed - add driver-managed memory
> + * @nid: NUMA node ID where the memory will be added
> + * @start: Start physical address of the memory range
> + * @size: Size of the memory range in bytes
> + * @resource_name: Resource name in format "System RAM ($DRIVER)"
> + * @mhp_flags: Memory hotplug flags
> + *
> + * Add driver-managed memory with the system default online type set by
> + * build config or kernel boot parameter.
> + *
> + * See __add_memory_driver_managed for more details.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int add_memory_driver_managed(int nid, u64 start, u64 size,
> +			      const char *resource_name, mhp_t mhp_flags)
> +{
> +	return __add_memory_driver_managed(nid, start, size, resource_name,
> +			mhp_flags,
> +			mhp_get_default_online_type());
> +}
>   EXPORT_SYMBOL_GPL(add_memory_driver_managed);
>   
>   /*

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>




^ permalink raw reply

* Re: [PATCH RFC 0/4] memcg,slab: kmalloc_nolock() fixes
From: Alexei Starovoitov @ 2026-06-24 16:30 UTC (permalink / raw)
  To: Harry Yoo (Oracle), Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Vlastimil Babka, Hao Li,
	Christoph Lameter, David Rientjes, Alexei Starovoitov,
	Pedro Falcato
  Cc: cgroups, linux-mm, linux-kernel, bpf
In-Reply-To: <20260624-kmalloc-nolock-fixes-v1-0-fdf4d17351dd@kernel.org>

On Wed Jun 24, 2026 at 6:11 AM PDT, Harry Yoo (Oracle) wrote:
>
> Bug 1 was reported by lockdep, and bugs 2 [2] and 3 [3] were
> reported by Sashiko.

... and in fixes for sashiko complains sashiko finds more issues.
I don't think it will ever end. I suggest to fix realistic scenarios
instead of one out of billion cases that sashiko think is plausible
but will never be hit in reality. The chance of server crashing
due to cosmic rays are higher than such bugs. Hence do not fix them.

> To BPF folks: do we need to backport kmalloc_nolock() support
> for architectures without __CMPXCHG_DOUBLE to v6.18?

nope.

> There are still few users in v6.18, but I can't tell whether it is
> necessary to backport it to v6.18 (hopefully not as urgent as other
> bugfixes).

imo none of these 'fixes' are necessary. Humans are not hitting them.



^ permalink raw reply

* Re: [PATCH v5 2/9] mm/memory_hotplug: pass online_type to online_memory_block() via arg
From: Gupta, Pankaj @ 2026-06-24 16:28 UTC (permalink / raw)
  To: Gregory Price, linux-mm, nvdimm
  Cc: linux-kernel, linux-cxl, driver-core, linux-kselftest,
	kernel-team, david, osalvador, gregkh, rafael, dakr, djbw,
	vishal.l.verma, dave.jiang, akpm, ljs, liam, vbabka, rppt, surenb,
	mhocko, shuah, alison.schofield, Smita.KoralahalliChannabasappa,
	ira.weiny, apopple
In-Reply-To: <20260624145744.3532049-3-gourry@gourry.net>


> Modify online_memory_block() to accept the online type through its arg
> parameter rather than calling mhp_get_default_online_type() internally.
>
> This prepares for allowing callers to specify explicit online types.
>
> Update the caller in add_memory_resource() to pass the default online
> type via a local variable.
>
> No functional change.
>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>   mm/memory_hotplug.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 7ac19fab2263..6833208cc17c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1337,7 +1337,9 @@ static int check_hotplug_memory_range(u64 start, u64 size)
>   
>   static int online_memory_block(struct memory_block *mem, void *arg)
>   {
> -	mem->online_type = mhp_get_default_online_type();
> +	enum mmop *online_type = arg;
> +
> +	mem->online_type = *online_type;
>   	return device_online(&mem->dev);
>   }
>   
> @@ -1494,6 +1496,7 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
>   int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>   {
>   	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> +	enum mmop online_type = mhp_get_default_online_type();
>   	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>   	struct memory_group *group = NULL;
>   	u64 start, size;
> @@ -1582,7 +1585,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>   
>   	/* online pages if requested */
>   	if (mhp_get_default_online_type() != MMOP_OFFLINE)
> -		walk_memory_blocks(start, size, NULL, online_memory_block);
> +		walk_memory_blocks(start, size, &online_type,
> +				   online_memory_block);
>   
>   	return ret;
>   error:
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>



^ permalink raw reply

* Re: [PATCH v2 10/13] mm: Remove __alloc_pages_node()
From: Suren Baghdasaryan @ 2026-06-24 16:24 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Andrew Morton, Vlastimil Babka, Michal Hocko, Johannes Weiner,
	Zi Yan, Muchun Song, Oscar Salvador, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, Harry Yoo (Oracle), Gregory Price,
	Alexei Starovoitov, Matthew Wilcox, linux-mm, linux-kernel,
	linux-rt-devel
In-Reply-To: <20260622-alloc-trylock-v2-10-31f31367d420@google.com>

On Mon, Jun 22, 2026 at 3:01 AM Brendan Jackman <jackmanb@google.com> wrote:
>
> There were only a few users, which have been removed. The only advantage
> of this API over alloc_pages_node() is avoiding a single conditional
> branch. The disadvantages are:
>
> 1. More API surface, more sources of confusion, more maintenance.
>
> 2. Worse impact of CPU hotplug bugs: most users of __alloc_pages_node()
>    were using the result of cpu_to_node(); if the CPU gets hotplugged
>    out this will return NUMA_NO_NODE. If one of these paths fails to
>    protect against a concurrent hotplug then page_alloc.c will use
>    NUMA_NO_NODE as an index into NODE_DATA() and cause some horrible
>    memory corruption or other. With alloc_pages_node(), the code might
>    just work fine.
>
> Ulterior motive: this frees up the __* variants of the allocator APIs to
> serve specifically for use as mm-internal API.

Ah, that's what motivated all that churn! :)

>
> Signed-off-by: Brendan Jackman <jackmanb@google.com>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  include/linux/gfp.h | 20 ++++----------------
>  1 file changed, 4 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index cdf95a9f0b87c..7edcc2e0be9ce 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -278,21 +278,6 @@ static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
>         dump_stack();
>  }
>
> -/*
> - * Allocate pages, preferring the node given as nid. The node must be valid and
> - * online. For more general interface, see alloc_pages_node().
> - */
> -static inline struct page *
> -__alloc_pages_node_noprof(int nid, gfp_t gfp_mask, unsigned int order)
> -{
> -       VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> -       warn_if_node_offline(nid, gfp_mask);
> -
> -       return __alloc_pages_noprof(gfp_mask, order, nid, NULL);
> -}
> -
> -#define  __alloc_pages_node(...)               alloc_hooks(__alloc_pages_node_noprof(__VA_ARGS__))
> -
>  static inline
>  struct folio *__folio_alloc_node_noprof(gfp_t gfp, unsigned int order, int nid)
>  {
> @@ -315,7 +300,10 @@ static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask,
>         if (nid == NUMA_NO_NODE)
>                 nid = numa_mem_id();
>
> -       return __alloc_pages_node_noprof(nid, gfp_mask, order);
> +       VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> +       warn_if_node_offline(nid, gfp_mask);
> +
> +       return __alloc_pages_noprof(gfp_mask, order, nid, NULL);
>  }
>
>  #define  alloc_pages_node(...)                 alloc_hooks(alloc_pages_node_noprof(__VA_ARGS__))
>
> --
> 2.54.0
>


^ permalink raw reply

* Re: [PATCH v2 03/13] mm/page_alloc: unify __alloc_frozen_pages[_nolock]_noprof()
From: Brendan Jackman @ 2026-06-24 16:24 UTC (permalink / raw)
  To: Suren Baghdasaryan, Brendan Jackman
  Cc: Andrew Morton, Vlastimil Babka, Michal Hocko, Johannes Weiner,
	Zi Yan, Muchun Song, Oscar Salvador, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, Harry Yoo (Oracle), Gregory Price,
	Alexei Starovoitov, Matthew Wilcox, linux-mm, linux-kernel,
	linux-rt-devel
In-Reply-To: <CAJuCfpE28TqZy2k5-X1ZEdd0HhTuc5i7+0kyxX5nXH4j+5JVfw@mail.gmail.com>

On Wed Jun 24, 2026 at 4:00 PM UTC, Suren Baghdasaryan wrote:
> On Mon, Jun 22, 2026 at 3:02 AM Brendan Jackman <jackmanb@google.com> wrote:
>>
>> Currently the core allocator code is controlled by ALLOC_NOLOCK, but the
>> main entry point function is significantly different from the normal
>> __alloc_frozen_pages_nolock(), this is tiring when reading the code.
>>
>> Plumb the ALLOC_NOLOCK control one layer up in the call stack: create
>> an alloc_flags argument to __alloc_frozen_pages_nolock() (which is only
>> exposed to mm/) and then turn the nolock variant into a thin wrapper
>> that just sets that flag (as well as handling NUMA_NO_NODE, similar to
>> how some of the wrappers in gfp.h do).
>>
>> Rationale that this doesn't change anything:
>>
>> 1. Simple bits: A bunch of the nolock-specific handling is just moved to
>>    the new alloc_order_allowed(), alloc_trylock_allowed() and
>>    gfp_trylock.
>>
>> 2. __alloc_frozen_pages_noprof() has some extra logic that wasn't
>>    previously in the nolock variant:
>>
>>    a. Application of gfp_allowed_mask; this only affects early boot, and
>>       only flags that affect the slowpath get changed here.
>>
>>    b. Application of current_gfp_context() - also only affects the
>>       slowpath
>>
>> 3. The slowpath itself: this is now just explicitly skipped under
>>    !ALLOC_TRYLOCK.
>>
>> Ulterior motive: adding an alloc_flags arg to the allocator's
>> mm-internal entrypoint can later be used to do more allocation
>> customisation without needing to create new GFP flags.
>
> Looks like a nice overall cleanup.
>
>>
>> While adding this flag to a bunch of places, create ALLOC_DEFAULT to
>> avoid a mysterious literal 0 in most places. alloc_frozen_pages_noprof()
>> is defined above the alloc flags so just leave that as a slightly messy
>> exception instead of trying to fully reorder mm/internal.h for that one
>> case.
>
> Moving the whole alloc_frozen_pages() block down seems simple enough
> and would avoid special-casing this.

Yeah... when you put it like that, I don't actually know why I was so
intimidated by the prospect of moving a handful of function
declarations!

Anyway in the v3 I'm creating a new mm/page_alloc.h so this will happen
as a side effect of that.


^ permalink raw reply

* Re: [PATCH v2 02/13] mm/page_alloc: some renames to clarify alloc_flags scopes
From: Brendan Jackman @ 2026-06-24 16:13 UTC (permalink / raw)
  To: Suren Baghdasaryan, Brendan Jackman
  Cc: Andrew Morton, Vlastimil Babka, Michal Hocko, Johannes Weiner,
	Zi Yan, Muchun Song, Oscar Salvador, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, Harry Yoo (Oracle), Gregory Price,
	Alexei Starovoitov, Matthew Wilcox, linux-mm, linux-kernel,
	linux-rt-devel
In-Reply-To: <CAJuCfpGagFf4RaCNJqu4Y1oqOXXOWSkP1UGuT9LA-9EbDN4Njw@mail.gmail.com>

On Wed Jun 24, 2026 at 3:03 PM UTC, Suren Baghdasaryan wrote:
> On Mon, Jun 22, 2026 at 3:01 AM Brendan Jackman <jackmanb@google.com> wrote:
>>
>> It's pretty confusing that:
>>
>> - The slowpath and fastpath have a totally distinct set of alloc_flags.
>>
>> - gfp_to_alloc_flags() sounds generic but it only influences the
>>   slowpath.
>>
>> - prepare_alloc_pages() is generic in that it sets up the
>>   alloc_context, but the alloc_flags it generates are only used for the
>>   fastpath.
>
> I understand you want to clarify the usage but this particular point
> seems to be an implementation detail. IOW, if tomorrow
> __alloc_frozen_pages_noprof() is changed to use alloc_flags when
> calling __alloc_pages_slowpath(), would we be renaming it back? So, I
> would suggest keeping alloc_flags as is in prepare_alloc_pages() 

I would say yes, we should rename it even though it might mean having to
rename it back later. IMO it's very useful to make it clear to the
reader that they also need to look elsewhere to find the slowpath flag
logic, without them having to notice this rather odd detail of
__alloc_frozen_pages_noprof().

But, yeah I guess prepare_alloc_pages() doesn't really care that its
caller is only using the result for the fastpath so I could see the
rationale for keeping the arg as alloc_flags...

> and its callers. 

...  but for it's caller I really do think the rename is necessary and
doesn't really have any downside? Especially once alloc_context gets an
alloc_flags field as it does later in the series.

> The rest LGTM.

Thanks as always, I appreciate the review.


^ permalink raw reply

* Re: [PATCH v2 04/13] mm/page_alloc: relax GFP WARN in nolock allocs
From: Suren Baghdasaryan @ 2026-06-24 16:04 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Andrew Morton, Vlastimil Babka, Michal Hocko, Johannes Weiner,
	Zi Yan, Muchun Song, Oscar Salvador, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, Harry Yoo (Oracle), Gregory Price,
	Alexei Starovoitov, Matthew Wilcox, linux-mm, linux-kernel,
	linux-rt-devel
In-Reply-To: <20260622-alloc-trylock-v2-4-31f31367d420@google.com>

On Mon, Jun 22, 2026 at 3:01 AM Brendan Jackman <jackmanb@google.com> wrote:
>
> This WARN forbids setting other flags than __GFP_ACCOUNT but we
> unconditionally set the ones in gfp_nolock so they are certainly fine
> for the caller to set.
>
> There are other GFP flags that are almost certainly fine to set here;
> Willy noted GFP_HIGHMEM, GFP_DMA, GFP_MOVABLE and GFP_HARDWALL. But,
> nolock allocation is rather special, so be conservative to try and
> ensure we have a chance to think carefully before nontrivial new
> usecases arise.
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Link: https://lore.kernel.org/linux-mm/ajS96fWbG4dzP3u3@casper.infradead.org/
> Signed-off-by: Brendan Jackman <jackmanb@google.com>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  mm/page_alloc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e31babe2181a1..074e007bf1bc3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5337,7 +5337,8 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
>                 return NULL;
>
>         if (alloc_flags & ALLOC_NOLOCK) {
> -               VM_WARN_ON_ONCE(gfp & ~__GFP_ACCOUNT);
> +               /* Certain other flags could be supported later if needed. */
> +               VM_WARN_ON_ONCE(gfp & ~(__GFP_ACCOUNT | gfp_nolock));
>                 if (!alloc_trylock_allowed())
>                         return NULL;
>                 gfp |= gfp_nolock;
>
> --
> 2.54.0
>


^ permalink raw reply

* Re: [PATCH v2 03/13] mm/page_alloc: unify __alloc_frozen_pages[_nolock]_noprof()
From: Suren Baghdasaryan @ 2026-06-24 16:00 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Andrew Morton, Vlastimil Babka, Michal Hocko, Johannes Weiner,
	Zi Yan, Muchun Song, Oscar Salvador, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, Harry Yoo (Oracle), Gregory Price,
	Alexei Starovoitov, Matthew Wilcox, linux-mm, linux-kernel,
	linux-rt-devel
In-Reply-To: <20260622-alloc-trylock-v2-3-31f31367d420@google.com>

On Mon, Jun 22, 2026 at 3:02 AM Brendan Jackman <jackmanb@google.com> wrote:
>
> Currently the core allocator code is controlled by ALLOC_NOLOCK, but the
> main entry point function is significantly different from the normal
> __alloc_frozen_pages_nolock(), this is tiring when reading the code.
>
> Plumb the ALLOC_NOLOCK control one layer up in the call stack: create
> an alloc_flags argument to __alloc_frozen_pages_nolock() (which is only
> exposed to mm/) and then turn the nolock variant into a thin wrapper
> that just sets that flag (as well as handling NUMA_NO_NODE, similar to
> how some of the wrappers in gfp.h do).
>
> Rationale that this doesn't change anything:
>
> 1. Simple bits: A bunch of the nolock-specific handling is just moved to
>    the new alloc_order_allowed(), alloc_trylock_allowed() and
>    gfp_trylock.
>
> 2. __alloc_frozen_pages_noprof() has some extra logic that wasn't
>    previously in the nolock variant:
>
>    a. Application of gfp_allowed_mask; this only affects early boot, and
>       only flags that affect the slowpath get changed here.
>
>    b. Application of current_gfp_context() - also only affects the
>       slowpath
>
> 3. The slowpath itself: this is now just explicitly skipped under
>    !ALLOC_TRYLOCK.
>
> Ulterior motive: adding an alloc_flags arg to the allocator's
> mm-internal entrypoint can later be used to do more allocation
> customisation without needing to create new GFP flags.

Looks like a nice overall cleanup.

>
> While adding this flag to a bunch of places, create ALLOC_DEFAULT to
> avoid a mysterious literal 0 in most places. alloc_frozen_pages_noprof()
> is defined above the alloc flags so just leave that as a slightly messy
> exception instead of trying to fully reorder mm/internal.h for that one
> case.

Moving the whole alloc_frozen_pages() block down seems simple enough
and would avoid special-casing this.

>
> No functional change intended.
>
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
>  mm/hugetlb.c    |   3 +-
>  mm/internal.h   |   8 ++-
>  mm/mempolicy.c  |  10 ++--
>  mm/page_alloc.c | 178 +++++++++++++++++++++++++++++---------------------------
>  mm/slub.c       |   6 +-
>  5 files changed, 110 insertions(+), 95 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 571212b80835e..2ce6169ca0dfd 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1806,7 +1806,8 @@ static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
>         if (alloc_try_hard)
>                 gfp_mask |= __GFP_RETRY_MAYFAIL;
>
> -       folio = (struct folio *)__alloc_frozen_pages(gfp_mask, order, nid, nmask);
> +       folio = (struct folio *)__alloc_frozen_pages(gfp_mask, order, nid, nmask,
> +                                                    ALLOC_DEFAULT);
>
>         /*
>          * If we did not specify __GFP_RETRY_MAYFAIL, but still got a
> diff --git a/mm/internal.h b/mm/internal.h
> index 1483a4fcdfce1..6bc89ec62e527 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -913,7 +913,7 @@ extern bool free_pages_prepare(struct page *page, unsigned int order);
>  extern int user_min_free_kbytes;
>
>  struct page *__alloc_frozen_pages_noprof(gfp_t, unsigned int order, int nid,
> -               nodemask_t *);
> +               nodemask_t *, unsigned int alloc_flags);
>  #define __alloc_frozen_pages(...) \
>         alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
>  void free_frozen_pages(struct page *page, unsigned int order);
> @@ -924,7 +924,8 @@ struct page *alloc_frozen_pages_noprof(gfp_t, unsigned int order);
>  #else
>  static inline struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order)
>  {
> -       return __alloc_frozen_pages_noprof(gfp, order, numa_node_id(), NULL);
> +       return __alloc_frozen_pages_noprof(gfp, order, numa_node_id(), NULL,
> +                                          0 /* ALLOC_DEFAULT */);
>  }
>  #endif
>
> @@ -1440,6 +1441,9 @@ extern void set_pageblock_order(void);
>  unsigned long reclaim_pages(struct list_head *folio_list);
>  unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>                                             struct list_head *folio_list);
> +
> +
> +#define ALLOC_DEFAULT          0
>  /* The ALLOC_WMARK bits are used as an index to zone->watermark */
>  #define ALLOC_WMARK_MIN                WMARK_MIN
>  #define ALLOC_WMARK_LOW                WMARK_LOW
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 36699fabd3c22..40bbea614aced 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2425,9 +2425,11 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
>          */
>         preferred_gfp = gfp | __GFP_NOWARN;
>         preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
> -       page = __alloc_frozen_pages_noprof(preferred_gfp, order, nid, nodemask);
> +       page = __alloc_frozen_pages_noprof(preferred_gfp, order, nid, nodemask,
> +                                          ALLOC_DEFAULT);
>         if (!page)
> -               page = __alloc_frozen_pages_noprof(gfp, order, nid, NULL);
> +               page = __alloc_frozen_pages_noprof(gfp, order, nid, NULL,
> +                                                  ALLOC_DEFAULT);
>
>         return page;
>  }
> @@ -2475,7 +2477,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>                          */
>                         page = __alloc_frozen_pages_noprof(
>                                 gfp | __GFP_THISNODE | __GFP_NORETRY, order,
> -                               nid, NULL);
> +                               nid, NULL, ALLOC_DEFAULT);
>                         if (page || !(gfp & __GFP_DIRECT_RECLAIM))
>                                 return page;
>                         /*
> @@ -2487,7 +2489,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>                 }
>         }
>
> -       page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask);
> +       page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask, ALLOC_DEFAULT);
>
>         if (unlikely(pol->mode == MPOL_INTERLEAVE ||
>                      pol->mode == MPOL_WEIGHTED_INTERLEAVE) && page) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bc05d75a41627..e31babe2181a1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5204,7 +5204,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>                 }
>                 nr_account++;
>
> -               prep_new_page(page, 0, gfp, 0);
> +               prep_new_page(page, 0, gfp, ALLOC_DEFAULT);
>                 set_page_refcounted(page);
>                 page_array[nr_populated++] = page;
>         }
> @@ -5253,24 +5253,98 @@ void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
>         }
>  }
>
> -/*
> - * This is the 'heart' of the zoned buddy allocator.
> - */
> -struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
> -               int preferred_nid, nodemask_t *nodemask)
> +static inline bool alloc_order_allowed(gfp_t gfp, unsigned int order,
> +                                      unsigned int alloc_flags)
>  {
> -       struct page *page;
> -       unsigned int fastpath_alloc_flags = ALLOC_WMARK_LOW;
> -       gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
> -       struct alloc_context ac = { };
> +       if (alloc_flags & ALLOC_NOLOCK)
> +               return pcp_allowed_order(order);
>
>         /*
>          * There are several places where we assume that the order value is sane
>          * so bail out early if the request is out of bound.
>          */
> -       if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> +       return !(WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp));
> +}
> +
> +static inline bool alloc_trylock_allowed(void)
> +{
> +       /*
> +        * In PREEMPT_RT spin_trylock() will call raw_spin_lock() which is
> +        * unsafe in NMI. If spin_trylock() is called from hard IRQ the current
> +        * task may be waiting for one rt_spin_lock, but rt_spin_trylock() will
> +        * mark the task as the owner of another rt_spin_lock which will
> +        * confuse PI logic, so return immediately if called from hard IRQ or
> +        * NMI.
> +        *
> +        * Note, irqs_disabled() case is ok. This function can be called
> +        * from raw_spin_lock_irqsave region.
> +        */
> +       if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> +               return false;
> +
> +       /* On UP, spin_trylock() always succeeds even when it is locked */
> +       if (!IS_ENABLED(CONFIG_SMP) && in_nmi())
> +               return false;
> +
> +       /* Bailout, since _deferred_grow_zone() needs to take a lock */
> +       if (deferred_pages_enabled())
> +               return false;
> +
> +       return true;
> +}
> +
> +/*
> + * GFP flags to set for ALLOC_NOLOCK i.e. alloc_pages_nolock().
> + *
> + * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
> + * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> + * is not safe in arbitrary context.
> + *
> + * These two are the conditions for gfpflags_allow_spinning() being true.
> + *
> + * Specify __GFP_NOWARN since failing alloc_pages_nolock() is not a reason
> + * to warn. Also warn would trigger printk() which is unsafe from
> + * various contexts. We cannot use printk_deferred_enter() to mitigate,
> + * since the running context is unknown.
> + *
> + * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> + * is safe in any context. Also zeroing the page is mandatory for
> + * BPF use cases.
> + *
> + * Though __GFP_NOMEMALLOC is not checked in the code path below,
> + * specify it here to highlight that alloc_pages_nolock()
> + * doesn't want to deplete reserves.
> + */
> +static const gfp_t gfp_nolock = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC |
> +                               __GFP_COMP;

__alloc_frozen_pages_noprof() is the only user of gfp_nolock. Can we
move it into that function to limit its scope? (unless you plan to use
it elsewhere).

> +
> +/*
> + * This is the 'heart' of the zoned buddy allocator.
> + */
> +struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
> +               int preferred_nid, nodemask_t *nodemask, unsigned int alloc_flags)
> +{
> +       struct page *page;
> +       gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
> +       struct alloc_context ac = { };
> +       unsigned int fastpath_alloc_flags = alloc_flags;
> +
> +       /* Other flags could be supported later if needed. */
> +       if (WARN_ON(alloc_flags & ~ALLOC_NOLOCK))
>                 return NULL;
>
> +       if (!alloc_order_allowed(gfp, order, alloc_flags))
> +               return NULL;
> +
> +       if (alloc_flags & ALLOC_NOLOCK) {
> +               VM_WARN_ON_ONCE(gfp & ~__GFP_ACCOUNT);
> +               if (!alloc_trylock_allowed())
> +                       return NULL;
> +               gfp |= gfp_nolock;
> +       } else {
> +               fastpath_alloc_flags |= ALLOC_WMARK_LOW;
> +       }
> +
>         gfp &= gfp_allowed_mask;
>         /*
>          * Apply scoped allocation constraints. This is mainly about GFP_NOFS
> @@ -5291,9 +5365,9 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
>          */
>         fastpath_alloc_flags |= alloc_flags_nofragment(zonelist_zone(ac.preferred_zoneref), gfp);
>
> -       /* First allocation attempt */
> +       /* First allocation attempt (or, for nolock, only attempt) */
>         page = get_page_from_freelist(alloc_gfp, order, fastpath_alloc_flags, &ac);
> -       if (likely(page))
> +       if (likely(page) || (alloc_flags & ALLOC_NOLOCK))
>                 goto out;
>
>         alloc_gfp = gfp;
> @@ -5310,7 +5384,8 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
>  out:
>         if (memcg_kmem_online() && (gfp & __GFP_ACCOUNT) && page &&
>             unlikely(__memcg_kmem_charge_page(page, gfp, order) != 0)) {
> -               free_frozen_pages(page, order);
> +               __free_frozen_pages(page, order,
> +                                   alloc_flags & ALLOC_NOLOCK ? FPI_TRYLOCK : 0);
>                 page = NULL;
>         }
>
> @@ -5326,7 +5401,8 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>  {
>         struct page *page;
>
> -       page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
> +       page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask,
> +                                          ALLOC_DEFAULT);
>         if (page)
>                 set_page_refcounted(page);
>         return page;
> @@ -7856,80 +7932,10 @@ static bool __free_unaccepted(struct page *page)
>
>  struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
>  {
> -       /*
> -        * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
> -        * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> -        * is not safe in arbitrary context.
> -        *
> -        * These two are the conditions for gfpflags_allow_spinning() being true.
> -        *
> -        * Specify __GFP_NOWARN since failing alloc_pages_nolock() is not a reason
> -        * to warn. Also warn would trigger printk() which is unsafe from
> -        * various contexts. We cannot use printk_deferred_enter() to mitigate,
> -        * since the running context is unknown.
> -        *
> -        * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> -        * is safe in any context. Also zeroing the page is mandatory for
> -        * BPF use cases.
> -        *
> -        * Though __GFP_NOMEMALLOC is not checked in the code path below,
> -        * specify it here to highlight that alloc_pages_nolock()
> -        * doesn't want to deplete reserves.
> -        */
> -       gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC | __GFP_COMP
> -                       | gfp_flags;
> -       unsigned int alloc_flags = ALLOC_NOLOCK;
> -       struct alloc_context ac = { };
> -       struct page *page;
> -
> -       VM_WARN_ON_ONCE(gfp_flags & ~__GFP_ACCOUNT);
> -       /*
> -        * In PREEMPT_RT spin_trylock() will call raw_spin_lock() which is
> -        * unsafe in NMI. If spin_trylock() is called from hard IRQ the current
> -        * task may be waiting for one rt_spin_lock, but rt_spin_trylock() will
> -        * mark the task as the owner of another rt_spin_lock which will
> -        * confuse PI logic, so return immediately if called from hard IRQ or
> -        * NMI.
> -        *
> -        * Note, irqs_disabled() case is ok. This function can be called
> -        * from raw_spin_lock_irqsave region.
> -        */
> -       if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> -               return NULL;
> -
> -       /* On UP, spin_trylock() always succeeds even when it is locked */
> -       if (!IS_ENABLED(CONFIG_SMP) && in_nmi())
> -               return NULL;
> -
> -       if (!pcp_allowed_order(order))
> -               return NULL;
> -
> -       /* Bailout, since _deferred_grow_zone() needs to take a lock */
> -       if (deferred_pages_enabled())
> -               return NULL;
> -
>         if (nid == NUMA_NO_NODE)
>                 nid = numa_node_id();
>
> -       prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> -                           &alloc_gfp, &alloc_flags);
> -
> -       /*
> -        * Best effort allocation from percpu free list.
> -        * If it's empty attempt to spin_trylock zone->lock.
> -        */
> -       page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
> -
> -       /* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
> -
> -       if (memcg_kmem_online() && page && (gfp_flags & __GFP_ACCOUNT) &&
> -           unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
> -               __free_frozen_pages(page, order, FPI_TRYLOCK);
> -               page = NULL;
> -       }
> -       trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
> -       kmsan_alloc_page(page, order, alloc_gfp);
> -       return page;
> +       return __alloc_frozen_pages_noprof(gfp_flags, order, nid, NULL, ALLOC_NOLOCK);
>  }
>  /**
>   * alloc_pages_nolock - opportunistic reentrant allocation from any context
> diff --git a/mm/slub.c b/mm/slub.c
> index a2bf3756ca7d0..b9c1284844a0a 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3275,7 +3275,8 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
>         else if (node == NUMA_NO_NODE)
>                 page = alloc_frozen_pages(flags, order);
>         else
> -               page = __alloc_frozen_pages(flags, order, node, NULL);
> +               page = __alloc_frozen_pages(flags, order, node, NULL,
> +                                           ALLOC_DEFAULT);
>
>         if (!page)
>                 return NULL;
> @@ -5236,7 +5237,8 @@ static void *___kmalloc_large_node(size_t size, gfp_t flags, int node)
>         if (node == NUMA_NO_NODE)
>                 page = alloc_frozen_pages_noprof(flags, order);
>         else
> -               page = __alloc_frozen_pages_noprof(flags, order, node, NULL);
> +               page = __alloc_frozen_pages_noprof(flags, order, node, NULL,
> +                                                  ALLOC_DEFAULT);
>
>         if (page) {
>                 ptr = page_address(page);
>
> --
> 2.54.0
>
>


^ permalink raw reply

* Re: mm: opaque hardware page-table entry handles
From: Zi Yan @ 2026-06-24 15:52 UTC (permalink / raw)
  To: Usama Anjum, Andrew Morton, Lorenzo Stoakes, David Hildenbrand,
	Liam R. Howlett, Mike Rapoport, Ryan Roberts, Anshuman Khandual,
	Catalin Marinas, Will Deacon, Samuel Holland
  Cc: linux-mm, linux-arm-kernel, linux-kernel
In-Reply-To: <74182e50-b54f-4d2d-a27f-3a59a538d6bc@arm.com>

On Wed Jun 24, 2026 at 10:09 AM EDT, Usama Anjum wrote:
> Hi all,
>
> This is a direction-check with the wider community before spending time on the
> development. This picks up the idea that was raised and broadly agreed in the
> earlier thread (Ryan Roberts, Lorenzo Stoakes, David Hildenbrand) [1].
>
> The problem
> -----------
> Core MM code reaches page-table entries by raw pointer dereference (pte_t *,
> pmd_t *, *pud, ...) in places, implicitly assuming a single, uniform
> representation. Sprinkling getters wouldn't solve the problem entirely. The
> problem is one level up: the *pointer type* itself is overloaded. At each level
> there are really three distinct things:
>
>   1. a page-table entry value (pte_t, pmd_t, ...)
>   2. a pointer to an entry value, e.g. a pXX_t on the stack
>   3. a pointer to a live entry in the hardware page table

This sounds good to me, but can you clarify the situation below?

A live entry means the entry can be accessed by hardware when the code
is manipulating it? What type should we use if we are pre-populating
PTEs in a PMD page before we establish the PMD page as a HW page table?
In __split_huge_pmd_locked(), we do that. A PMD page is first withdrawn
and filled with after-split PTEs, pmd_populate() and pte_offset_map()
are used for this not-yet-HW page table. Later, pmd_populate() is used
to make this page table visible to HW. Should we have two versions of
pmd_populate() and pte_offset_map()? Since the first pmd_populate()
would accept pmd_t*, but the second one would accept hw_pmdp, if we are
pedantic. Of course, we can be flexible here to use pmd_populate()
accpeting hw_pmdp for both, since the PMD page table we are modifying
is going to be visible to HW soon. But I think we should have clear
definitions for where these types are used and document them well.

You probably can ask LLMs to check these ambiguous/vague uses throughout
the code base.

>
> Today (2) and (3) share the same type - pte_t *, pmd_t *, and so on. Nothing
> distinguishes a pointer into a live table from a pointer to a stack copy.
>
> A pointer to an on-stack entry value and a pointer to a live hardware entry have
> the same type, so the compiler cannot distinguish them. Passing the stack
> pointer to an arch helper that expects a hardware-entry pointer compiles fine,
> but is wrong - a bug class the type system makes invisible. It also blocks
> evolution: an arch helper may need to read beyond the addressed entry (e.g.
> adjacent or contiguous entries), which only makes sense for a real page-table
> pointer, not a stack copy.
>
> The idea
> --------
> Give (3) its own opaque type that cannot be dereferenced:
>
>     /* opaque handle to a HW page-table entry; not dereferenceable */
>     typedef struct {
> 	pte_t *ptr;
>     } hw_ptep;
>
> With this:
>
>   - a stack value can no longer masquerade as a hardware table entry,
>   - a hardware handle can no longer be raw-dereferenced,
>   - cases that genuinely operate on a value can be refactored to pass the value
>     and let the caller, which knows whether it holds a handle or a stack copy,
>     read it once.
>
> The overload becomes a compile-time type error instead of a silent runtime bug,
> and converting the tree forces every such site to be made explicit. This gives
> us a framework where the architecture can completely virtualize the pgtable if
> it likes; and the compiler can enforce that higher level code can't accidentally
> work around it.
>
> It is opt-in by architectures and incremental. The generic definition is
> just an alias, so arches that do not care build unchanged:
>
>     typedef pte_t *hw_ptep;
>
> An arch flips to the strong struct type when it is ready, and only then does
> it get the stronger checking. This lets the conversion land gradually.
>
> Beyond fixing the latent bug class, this abstraction is an enabler for upcoming
> features that need tighter control over how page tables are accessed and
> manipulated.
>
> Getter flavours
> ---------------
> While converting, it is useful to have two accessor flavours at each level:
>
>   - pXXp_get(hw_ptep)        plain C dereference (compiler may optimize)
>   - pXXp_get_once(hw_ptep)   single-copy-atomic, not torn, elided or
>                              duplicated by the compiler
>
> Keeping them distinct simplifies the conversion and avoids re-introducing the
> class of lockless-read bugs seen on 32-bit.
>
> Example conversion
> ------------------
> Most of the conversion is mechanical.
>
>   -static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>   -		pte_t *ptep, pte_t pte, unsigned int nr)
>   +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>   +		hw_ptep ptep, pte_t pte, unsigned int nr)
>    {
>    	page_table_check_ptes_set(mm, addr, ptep, pte, nr);
>    	for (;;) {
>    		set_pte(ptep, pte);
>    		if (--nr == 0)
>    			break;
>   -		ptep++;
>   +		ptep = hw_pte_next(ptep);
>    		pte = pte_next_pfn(pte);
>    	}
>    }
>
> The bulk of work is this kind of rote substitution. The genuine work is the
> handful of sites that turn out to be operating on a stack copy rather than a
> live entry - those are exactly the ones the new type forces us to surface and 
> fix.
>
> Estimated churn:
> ----------------
> Half way through the prototyping converting only PTE and PMD levels:
>   77 files changed, +1801 / -1425
>   ~57 files reference the new types
>
> So the line count will grow once PUD/P4D/PGD and the remaining call sites are
> converted; expect meaningfully more churn than the numbers above.
>
> Introduce the type as an alias, convert one helper family per patch, and flip
> an arch to the strong type last - with non-opted arches building unchanged at
> every step.
>
> Open questions
> --------------
>   - Is the type-safety + future-feature enablement worth the churn?
>   - Naming: hw_ptep/hw_pmdp vs something else?
>   - Should all five levels be converted before merging anything, or is a staged
>     PTE-and-PMD then landing others acceptable?
>   - Do we want the two getter flavours (pXXp_get / pXXp_get_once) at every
>     level?
>
> [1] https://lore.kernel.org/all/a063f6c5-2785-4a9f-8079-25edb3e54cef@arm.com
>
> Thanks,
> Usama




-- 
Best Regards,
Yan, Zi



^ permalink raw reply

* Re: [PATCH] drm/panthor: Check VMA boundaries for PMD mappings
From: Steven Price @ 2026-06-24 15:34 UTC (permalink / raw)
  To: Christian A. Ehrhardt, Boris Brezillon, dri-devel
  Cc: Liviu Dudau, Andrew Morton, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, linux-mm,
	linux-kernel
In-Reply-To: <20260623181942.1536598-1-lk@c--e.de>

On 23/06/2026 19:19, Christian A. Ehrhardt wrote:
> When checking a different patch[1] sashiko AI pointed out that
> panthor needs the same fix[2]:
> 
> In the ->huge_fault handler do not install a PMD huge page
> mapping if the huge page exceeds the boundaries of the VMA.
> 
> [1] https://lore.kernel.org/lkml/20260622215718.1532689-1-lk@c--e.de/
> [2] https://sashiko.dev/#/patchset/20260622215718.1532689-1-lk%40c--e.de
> 
> Cc: Boris Brezillon <boris.brezillon@collabora.com>
> Cc: Steven Price <steven.price@arm.com>
> Cc: Liviu Dudau <liviu.dudau@arm.com>
> Fixes: 68cbf96b1e9b ("drm/panthor: Part ways with drm_gem_shmem_object")
> Signed-off-by: Christian A. Ehrhardt <lk@c--e.de>

Reviewed-by: Steven Price <steven.price@arm.com>

> ---
>  drivers/gpu/drm/panthor/panthor_gem.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> NOTE:
> The panthor version is only compile tested because I don't
> have the hardware. However, the code is identical to that
> fixed in [1] and I have a reproducer for that.
> 
> No need for for stable backports. The code is new in 7.1.
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
> index a1e2eb1ca7bb..54535bae2b0c 100644
> --- a/drivers/gpu/drm/panthor/panthor_gem.c
> +++ b/drivers/gpu/drm/panthor/panthor_gem.c
> @@ -802,9 +802,13 @@ static vm_fault_t insert_page(struct vm_fault *vmf, unsigned int order, struct p
>  	} else if (order == PMD_ORDER) {
>  		unsigned long pfn = page_to_pfn(page);
>  		unsigned long paddr = pfn << PAGE_SHIFT;
> +		struct vm_area_struct *vma = vmf->vma;
> +		unsigned long start = ALIGN_DOWN(vmf->address, PMD_SIZE);
> +		unsigned long end = start + PMD_SIZE;
> +		bool in_range = vma->vm_start <= start && end <= vma->vm_end;
>  		bool aligned = (vmf->address & ~PMD_MASK) == (paddr & ~PMD_MASK);
>  
> -		if (aligned &&
> +		if (aligned && in_range &&
>  		    folio_test_pmd_mappable(page_folio(page))) {
>  			pfn &= PMD_MASK >> PAGE_SHIFT;
>  			return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);



^ permalink raw reply

* Re: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
From: Kiryl Shutsemau @ 2026-06-24 15:34 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Ard Biesheuvel, nao.horiguchi, linmiaohe, david, lance.yang, akpm,
	baoquan.he, rppt, pratyush, kexec, linux-mm, rneu, riel, caggio
In-Reply-To: <ajvzUg8KK9gtFTYe@gmail.com>

On Wed, Jun 24, 2026 at 08:21:16AM -0700, Breno Leitao wrote:
> > > Possible solutions
> > > ==================
> > ...
> > > 
> > > 2. e820 / EFI memory map (E820_TYPE_UNUSABLE). Tempting because the
> > >    frame would simply never become RAM (no allocator race at all).
> > >    But: it is x86-only (no arm64 equivalent in the same mechanism;
> > >    this series is tested on arm64);
> > 
> > (+Ard. I might get some details around EFI wrong.)
> > 
> > This isn't accurate, and I think it's the right direction for EFI
> > platforms. EFI_UNUSABLE_MEMORY is honored on both arches today, no new
> > consumer code:
> > 
> >   - arm64: reserve_regions() marks non-usable memory nomap.
> 
> Is it true for non-UEFI arm64 hosts?

No. It is EFI-only thingy.

Is there non-EFI server platforms worth caring about?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply

* Re: [PATCH] mm/mm_init: fix incorrect node_spanned_pages
From: Mike Rapoport @ 2026-06-24 15:30 UTC (permalink / raw)
  To: Wei Yang; +Cc: akpm, izumi.taku, linux-mm, Yuan Liu, David Hildenbrand
In-Reply-To: <20260623092653.yfahsc7kekuncbcf@master>

On Tue, Jun 23, 2026 at 09:26:53AM +0000, Wei Yang wrote:
> On Tue, Jun 23, 2026 at 12:22:15PM +0300, Mike Rapoport wrote:
> >
> >After I applied your patch, I did some checks to see why
> >mirrored_kernelcore causes us troubles. I found that unlike other variants
> >of kernelcore/movablecore settings, mirrored_kernelcore creates zone
> >overlap for no apparent reason. I did some git archaeology and I didn't
> >find a justification for making overlapping pages absent in ZONE_NORMAL.
> >
> >So I came up with this cleanup:
> >
> >https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kernelcore-mirror
> >
> >I'm waiting for the bots to chew on it before positing the patches.
> > 
> 
> Ah, just the same as I do :-).
 
I'm going to send my version with your co-developed-by if you don't mind.
 
> -- 
> Wei Yang
> Help you, Help me

-- 
Sincerely yours,
Mike.


^ permalink raw reply

* Re: [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock
From: Joshua Hahn @ 2026-06-24 15:23 UTC (permalink / raw)
  To: Usama Arif
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, cgroups, linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260624144348.4117578-1-usama.arif@linux.dev>

On Wed, 24 Jun 2026 07:43:47 -0700 Usama Arif <usama.arif@linux.dev> wrote:

> On Tue, 23 Jun 2026 11:01:22 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

Hello Usama!!

Thank you for reviewing the patch : -)

[...snip...]

> > @@ -2595,7 +2596,6 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
> >  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  			    unsigned int nr_pages)
> >  {
> > -	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
> >  	int nr_retries = MAX_RECLAIM_RETRIES;
> >  	struct mem_cgroup *mem_over_limit;
> >  	struct page_counter *counter;
> > @@ -2606,36 +2606,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	bool raised_max_event = false;
> >  	unsigned long pflags;
> >  	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
> > +	unsigned long nr_charged = 0;
> >  
> >  retry:
> > -	if (consume_stock(memcg, nr_pages))
> > -		return 0;
> > -
> > -	if (!allow_spinning)
> > -		/* Avoid the refill and flush of the older stock */
> > -		batch = nr_pages;
> > -
> >  	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
> >  	if (do_memsw_account() &&
> > -	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
> > +	    !page_counter_try_charge_stock(&memcg->memsw, nr_pages,
> > +					   &counter, NULL)) {
> >  		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
> >  		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
> >  		goto reclaim;
> >  	}
> >  
> > -	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> > -		goto done_restock;
> > +	if (page_counter_try_charge_stock(&memcg->memory, nr_pages,
> > +					  &counter, &nr_charged)) {
> > +		if (!nr_charged)
> > +			return 0;
> > +		goto handle_high;
> > +	}
> >  
> >  	if (do_memsw_account())
> > -		page_counter_uncharge(&memcg->memsw, batch);
> > +		page_counter_uncharge(&memcg->memsw, nr_pages);
> 
> This needs a transactional rollback. page_counter_try_charge_stock() can
> succeed by consuming memsw stock and charging 0 new pages, but the
> memory-failure path unconditionally uncharges nr_pages from memsw.
> That turns a failed allocation into a real memsw usage decrement.

Hmmmmmmmmmm....... I'm not sure.

At this point in the code, we are either (1) using cgroup v1 with memsw
and charged successfully, or (2) not using cgroup v1 with memsw. So I'm
not sure if this really is unconditional, we're just distinguishing
between cases (1) and (2) by checking if we're using cgroupv1.

Or is your concern with taking a charge via stock, but uncharging with
a hierarchical page_counter walk? If so, I think there's a case to be
made here with just simply returning the stock. I just wanted to keep
it consistent with the original memcontrol code, which only used
stock to fulfill charges, not uncharges, since this could make the
stock grow without bound.

What do you think? Thanks again for reviewing Usama, I hope you have a
great day!!!
Joshua


^ permalink raw reply

* Re: mm/hwpoison: persist poisoned PFN list across kexec via KHO [RFC]
From: Breno Leitao @ 2026-06-24 15:21 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Ard Biesheuvel, nao.horiguchi, linmiaohe, david, lance.yang, akpm,
	baoquan.he, rppt, pratyush, kexec, linux-mm, rneu, riel, caggio
In-Reply-To: <aju5_WjBtagTOSJw@thinkstation>

Hello Kiryl, 

First of all, thanks for the review and topics raised!

On Wed, Jun 24, 2026 at 01:04:19PM +0100, Kiryl Shutsemau wrote:
> On Wed, Jun 24, 2026 at 03:39:38AM -0700, Breno Leitao wrote:
> >   * Consumer: early in the next boot (fs_initcall_sync, before the
> >     buddy allocator has handed anything out) it restores that array
> >     and re-runs memory_failure() on each PFN, re-offlining the frame
> >     and rebuilding the full hwpoison state (PG_hwpoison, counters,
> >     HardwareCorrupted).
> 
> fs_initcall_sync is not before buddy hands anything out - buddy has been
> live since memblock_free_all() in start_kernel(), and every initcall before
> this one has allocated freely. So this is recovery, not prevention: you may
> be running memory_failure() against a frame already in use, possibly by a
> kernel allocation.

Agreed - that wording was wrong. It is recovery, not prevention, and running
memory_failure() against an already-allocated (possibly kernel) frame is the
not ideal, but, still better than what we have today.

> Two windows are missed entirely:
> 
>   - memblock allocations between setup_arch() and memblock_free_all()
>     (page tables, mem_map[], percpu) can land on the bad frame.
> 
>   - The kernel image itself: KASLR picks its location in the
>     decompressor/stub, long before any initcall. The next kernel can end
>     up running *on* the bad frame.
> 
> So I don't think this should be a memory_failure() replay. The frames need
> to leave the next kernel's view at the memory-map level, before memblock
> and KASLR.

Agreed, this is the ideal right approach.

> > Possible solutions
> > ==================
> ...
> > 
> > 2. e820 / EFI memory map (E820_TYPE_UNUSABLE). Tempting because the
> >    frame would simply never become RAM (no allocator race at all).
> >    But: it is x86-only (no arm64 equivalent in the same mechanism;
> >    this series is tested on arm64);
> 
> (+Ard. I might get some details around EFI wrong.)
> 
> This isn't accurate, and I think it's the right direction for EFI
> platforms. EFI_UNUSABLE_MEMORY is honored on both arches today, no new
> consumer code:
> 
>   - arm64: reserve_regions() marks non-usable memory nomap.

Is it true for non-UEFI arm64 hosts?

>   - x86: do_add_efi_memmap() maps it to E820_TYPE_UNUSABLE.
> 
> And it closes the KASLR window for free, because the image is only placed in
> EFI_CONVENTIONAL_MEMORY on both (x86 process_efi_entries(), arm64
> randomalloc.c). So the bad frame is invisible to both the allocator and
> KASLR, which is exactly what fs_initcall_sync can't give you.
> 
> There's also LINUX_EFI_MEMRESERVE (efi_mem_reserve_persistent()) -
> cross-arch, reserved pre-buddy in efi_init() - and looks otherwise fine, but
> it's parsed too late to keep KASLR off the frame.

Thanks, I am wondering if we piggy-back on this EFI_UNUSABLE_MEMORY (or
something similar), than we don't need to use KHO at all, basically just marked
the page as EFI_UNUSABLE_MEMORY at poison time, and rely on kexec to avoid
passing this page forward.

Thanks for the discussion,
--breno


^ permalink raw reply

* Re: [PATCH v2 4/4] mm: try to free swapcache for non-LRU folios
From: David Hildenbrand (Arm) @ 2026-06-24 15:20 UTC (permalink / raw)
  To: Barry Song (Xiaomi), akpm, linux-mm
  Cc: baoquan.he, chrisl, jp.kobryn, kasong, liam, linux-kernel, ljs,
	mhocko, nphamcs, rppt, shakeel.butt, shikemeng, surenb,
	usama.arif, vbabka, youngjun.park, Kairui Song
In-Reply-To: <20260623231635.43086-5-baohua@kernel.org>

On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
> Originally, we unconditionally called lru_add_drain() for write
> swap-in page faults. This might drop the reference held by the per-CPU
> LRU cache if the folio happened to reside there. However, there was no
> guarantee that the folio was actually cached on the current CPU.
> 
> Now that lru_add_drain() has been removed, we have lost one
> opportunity to drop a reference held by the LRU cache. We could
> instead incorporate that possibility into the condition evaluated by
> should_try_to_free_swap().
> 
> Suggested-by: Kairui Song <ryncsn@gmail.com>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
>  mm/memory.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 2983a6baf474..14577c67c61a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5087,8 +5087,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	 * Remove the swap entry and conditionally try to free up the swapcache.
>  	 * Do it after mapping, so raced page faults will likely see the folio
>  	 * in swap cache and wait on the folio lock.
> +	 * Assume non-LRU folios may be queued in the LRU cache, which contributes
> +	 * an additional reference to the folio.
>  	 */
> -	if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
> +	if (should_try_to_free_swap(si, folio, vma, nr_pages +
> +			!folio_test_lru(folio), vmf->flags))
>  		folio_free_swap(folio);
>  
>  	folio_unlock(folio);

Hm, in wp_can_reuse_anon_folio() we'll try dropping the swapcache ourselves.

So I wonder if we still need that handling ("If we want to map a page that's in
the swapcache writable, we ...") at all?


Ahh, I see the problem now:

commit 4b34f1d82c6549837b2061096dea249e881a4495
Author: Kairui Song <kasong@tencent.com>
Date:   Sat Dec 20 03:43:35 2025 +0800

    mm, swap: free the swap cache after folio is mapped

    Currently, we remove the folio from the swap cache and free the swap cache
    before mapping the PTE.  To reduce repeated faults due to parallel swapins
    of the same PTE, change it to remove the folio from the swap cache after
    it is mapped.  So new faults from the swap PTE will be much more likely to
    see the folio in the swap cache and wait on it.

    This does not eliminate all swapin races: an ongoing swapin fault may
    still see an empty swap cache.  That's harmless, as the PTE is changed
    before the swap cache is cleared, so it will just return and not trigger
    any repeated faults.  This does help to reduce the chance.


That changed that behavior such that we *must* now always fallback to do_wp_page().

What a mess (I didn't ack)

-- 
Cheers,

David


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox