Linux-mm Archive on lore.kernel.org
* Re: [PATCH v2 6/7] nfs: Optimize direct I/O to use folios for requests
From: Pranjal Shrivastava @ 2026-06-19 12:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Trond Myklebust, linux-nfs, linux-kernel,
	Anna Schumaker, Shivaji Kant, linux-mm, linux-fsdevel
In-Reply-To: <ajQ21kH1ZVajS2Y7@casper.infradead.org>

On Thu, Jun 18, 2026 at 07:20:06PM +0100, Matthew Wilcox wrote:

Hi Matthew, Christoph, Trond,

> On Thu, Jun 18, 2026 at 07:10:45AM -0700, Christoph Hellwig wrote:
> > On Tue, Jun 16, 2026 at 05:23:48PM +0000, Pranjal Shrivastava wrote:
> > > AFAIU, the MM subsystem explicitly ensures that every valid struct page
> > > is part of a folio.
> > 
> > It is definitively not what the vision for the folio is, although if
> > I'm not mistaken it actually is still true right now.
> 
> It's not true, eg, for slab.  While there's still a struct page there
> for slab, there's no refcount and flags like PG_locked have different
> meanings.  You'll get into a lot of trouble trying to treat slabs as
> folios (and that will include assertions tripping).
> 
> > This whole
> > area is a minefield unfortunately, and we also ran into it with
> > iov_iter_extract_bvecs and the earlier block code it was extracted
> > from.  Adding the relevant people and lists, but for now your best
> > bet is to stick to what the block code does or even better reuse
> > as much as possible of that code.
> 
> Yes.  Fundamentally, it is no business of the filesystem what the iov_iter
> refers to.  We can do direct io to slab memory, vmalloc memory, memory
> that doesn't have a struct page (eg iomem), or whatever we choose.
> 

Thanks for the clarification. I understand the larger vision of keeping
filesystems agnostic to the underlying memory represented by the iov_iter.

The documentation for page_folio() [1] mentions that "Every page is part
of a folio," but it appears there are important nuances regarding slab
and other memory types that I was not aware of.

However, I am a bit confused on one point:
Looking at iov_iter_extract_bvecs() [1] it relies on 
get_contig_folio_len() [2], which calls page_folio() on the pages 
extracted (via iov_iter_extract_pages()) without additional checks for
slab or vmalloc memory. 

I am happy to refactor the NFS Direct I/O path to reuse the same helper
(get_contig_folio_len()) from the bvec extractor, but I'm a little 
confused as the bvec extractor seems to suffer from the same risk?

Is the recommendation to keep these details abstracted by the iov_iter
lib and eventually hide things like iov_iter_extract_pages() and manual
folio conversions from filesystems entirely?

If that's the case, would it help to export get_contig_folio_len() (or 
introduce new helpers) in the iov_iter lib for NFS and other fs to use?

Thanks,
Praan

[1] https://elixir.bootlin.com/linux/v7.1-rc6/source/include/linux/page-flags.h#L291
[2] https://elixir.bootlin.com/linux/v7.1/source/lib/iov_iter.c#L1849





* Re: [PATCH v8 00/46] guest_memfd: In-place conversion support
From: Garg, Shivank @ 2026-06-19 12:28 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>



On 6/19/2026 6:01 AM, Ackerley Tng via B4 Relay wrote:
> This is v8 of guest_memfd in-place conversion support.
> 
> Up till now, guest_memfd supports the entire inode worth of memory being
> used as all-shared, or all-private. CoCo VMs may request guest memory to be
> converted between private and shared states, and the only way to support
> that currently would be to have the userspace VMM provide two sources of
> backing memory from completely different areas of physical memory.
> 
> pKVM has a use case for in-place sharing: the guest and host may be
> cooperating on given data, and pKVM doesn't protect data through
> encryption, so copying that given data between different areas of physical
> memory as part of conversions would be unnecessary work.
> 
> This series also serves as a foundation for guest_memfd huge page
> support. Now, guest_memfd only supports PAGE_SIZE pages, so if two sources
> of backing memory are used, the userspace VMM could maintain a steady total
> memory utilized by punching out the pages that are not used. When huge
> pages are available in guest_memfd, even if the backing memory source
> supports hole punching within a huge page, punching out pages to maintain
> the total memory utilized by a VM would be introducing lots of
> fragmentation.
> 
> In-place conversion avoids fragmentation by allowing the same physical
> memory to be used for both shared and private memory, with guest_memfd
> tracking the shared/private status of all the pages at a per-page
> granularity.
> 
> The central principle, which guest_memfd continues to uphold, is that any
> guest-private page will not be mappable to host userspace. All pages will
> be mmap()-able in host userspace, but accesses to guest-private pages (as
> tracked by guest_memfd) will result in a SIGBUS.
> 
> This series introduces a guest_memfd ioctl (not kvm, vm or vcpu, but
> guest_memfd ioctl) that allows userspace to set memory
> attributes (shared/private) directly through the guest_memfd. This is the
> appropriate interface because shared/private-ness is a property of memory
> and hence the request should be sent directly to the memory provider -
> guest_memfd.
> 
> Tested with both CONFIG_KVM_VM_MEMORY_ATTRIBUTES enabled and disabled:
> 
> + tools/testing/selftests/kvm/guest_memfd_test.c
> + tools/testing/selftests/kvm/pre_fault_memory_test.c
> + tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> + tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> + tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
> 
> Updates for this revision:
> 
> + Updated the series to _not_ deprecate all of VM memory attributes, but
>   only deprecate tracking of the PRIVATE attributes in VM memory
>   attributes. This takes into account upcoming RWX attributes support,
>   which will be tracked at the VM level.
> + Reshuffled the earlier commits that deal with preparing KVM to stop
>   seeing VM memory attributes as the only source of attributes.
> + Addressed comments from v7
> 
> TODOs
> 
> + Retest with TDX selftests. v7 was tested with TDX [12], but the setup there was
>   wrong. Conversions were successful (no errors), but the shared memory being
>   tested is actually in a completely different host physical page.
> + Retest with SNP selftests. v6 was tested with SNP, I ported that to v7
>   and those ran fine too. Just need to double-check for v8.
> 
> This series is based on kvm-x86/next, and here's the tree for your convenience:
> 
> https://github.com/googleprodkernel/linux-cc/commits/guest_memfd-inplace-conversion-v8
> 
> Older series:
> 
> + RFCv7 is at [11]
> + RFCv6 is at [10]
> + RFCv5 is at [8]
> + RFCv4 is at [7]
> + RFCv3 is at [6]
> + RFCv2 is at [5]
> + RFCv1 is at [4]
> + Previous versions of this feature, part of other series, are available at
>   [1][2][3].
> 
> [1] https://lore.kernel.org/all/bd163de3118b626d1005aa88e71ef2fb72f0be0f.1726009989.git.ackerleytng@google.com/
> [2] https://lore.kernel.org/all/20250117163001.2326672-6-tabba@google.com/
> [3] https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com/
> [4] https://lore.kernel.org/all/cover.1760731772.git.ackerleytng@google.com/T/
> [5] https://lore.kernel.org/all/cover.1770071243.git.ackerleytng@google.com/T/
> [6] https://lore.kernel.org/r/20260313-gmem-inplace-conversion-v3-0-5fc12a70ec89@google.com/T/
> [7] https://lore.kernel.org/all/20260326-gmem-inplace-conversion-v4-0-e202fe950ffd@google.com/T/
> [8] https://lore.kernel.org/r/20260428-gmem-inplace-conversion-v5-0-d8608ccfca22@google.com
> [9] https://lore.kernel.org/all/20260414-selftest-global-metadata-v1-0-fd223922bc57@google.com/T/
> [10] https://lore.kernel.org/r/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4@google.com
> [11] https://lore.kernel.org/r/20260522-gmem-inplace-conversion-v7-0-2f0fae496530@google.com
> [12] https://lore.kernel.org/all/20260605134153.204152-1-ackerleytng@google.com/
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> Ackerley Tng (27):
>       KVM: Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES selectable
>       KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined
>       KVM: guest_memfd: Introduce function to check GFN private/shared status
>       KVM: guest_memfd: Only prepare folios for private pages
>       KVM: guest_memfd: Add base support for KVM_SET_MEMORY_ATTRIBUTES2
>       KVM: guest_memfd: Ensure pages are not in use before conversion
>       KVM: guest_memfd: Call arch invalidate hooks on conversion
>       KVM: guest_memfd: Return early if range already has requested attributes
>       KVM: guest_memfd: Advertise KVM_SET_MEMORY_ATTRIBUTES2 ioctl
>       KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
>       KVM: guest_memfd: Use actual size for invalidation in kvm_gmem_release()
>       KVM: guest_memfd: Determine invalidation filter from memory attributes
>       KVM: guest_memfd: Zero page while getting pfn
>       KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
>       KVM: guest_memfd: Make in-place conversion the default
>       KVM: selftests: Test basic single-page conversion flow
>       KVM: selftests: Test conversion flow when INIT_SHARED
>       KVM: selftests: Test conversion precision in guest_memfd
>       KVM: selftests: Test conversion before allocation
>       KVM: selftests: Convert with allocated folios in different layouts
>       KVM: selftests: Test that truncation does not change shared/private status
>       KVM: selftests: Add helpers to pin pages with CONFIG_GUP_TEST
>       KVM: selftests: Test conversion with elevated page refcount
>       KVM: selftests: Reset shared memory after hole-punching
>       KVM: selftests: Provide function to look up guest_memfd details from gpa
>       KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe
>       KVM: selftests: Update private_mem_conversions_test to mmap() guest_memfd
> 
> Michael Roth (1):
>       KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
> 
> Sean Christopherson (18):
>       KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
>       KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
>       KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86
>       KVM: Decouple kvm_has_arch_private_mem from CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>       KVM: Rename memory attribute APIs to prepare for in-place gmem conversion
>       KVM: Provide generic interface for checking memory private/shared status
>       KVM: guest_memfd: Wire up core private/shared attribute interfaces
>       KVM: Consolidate private memory and guest_memfd ifdeffery in kvm_host.h
>       KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
>       KVM: selftests: Create gmem fd before "regular" fd when adding memslot
>       KVM: selftests: Rename guest_memfd{,_offset} to gmem_{fd,offset}
>       KVM: selftests: Add support for mmap() on guest_memfd in core library
>       KVM: selftests: Add selftests global for guest memory attributes capability
>       KVM: selftests: Add helpers for calling ioctls on guest_memfd
>       KVM: selftests: Test that shared/private status is consistent across processes
>       KVM: selftests: Provide common function to set memory attributes
>       KVM: selftests: Check fd/flags provided to mmap() when setting up memslot
>       KVM: selftests: Update private memory exits test to work with per-gmem attributes
> 

Hi,

Thanks for this series.
This works well for me on AMD EPYC 7713 (SEV-SNP enabled). I tested:
1. KVM selftests: all tests pass.
2. Using in-place conversion QEMU branch [1]:
qemu-system-x86_64 \
  -machine q35,confidential-guest-support=sev0 \
  -enable-kvm -cpu EPYC-v4 -smp 8,maxcpus=8 -m 120G -no-reboot \
  -object memory-backend-guest-memfd,id=ram0,size=60G,share=on,host-nodes=0-1,policy=interleave \
  -object memory-backend-guest-memfd,id=ram1,size=60G,share=on,host-nodes=0,policy=bind \
  -numa node,nodeid=0,memdev=ram0,cpus=0-3 \
  -numa node,nodeid=1,memdev=ram1,cpus=4-7 \
  -object sev-snp-guest,id=sev0,policy=0x30000,cbitpos=51,reduced-phys-bits=1,convert-in-place=on \
  -bios "$OVMF" \
  -drive file="$DISK",if=none,id=disk0,format=qcow2 \
  -device virtio-scsi-pci,id=scsi0,disable-legacy=on,iommu_platform=true -device scsi-hd,drive=disk0 \
  -netdev user,id=net0,hostfwd=tcp::8000-:22 -device virtio-net-pci,netdev=net0 \
  -kernel "$KERNEL" -initrd "$INITRD" \
  -append "$ROOT ro console=ttyS0,115200" \
  -trace enable=kvm_convert_memory,file=/tmp/convert.log \
  -nographic -serial mon:stdio

   The guest boots successfully and runs a memory hogger. With this, I verified the
   shared <-> private conversion logs (trace_kvm_convert_memory).

3. Additionally, verified the NUMA placement for SEV-SNP. With this series,
   NUMA mempolicy support for guest_memfd [2] now works for SEV-SNP as well.

[1] https://github.com/amdese/qemu/commits/snp-inplace-rfc1
[2] https://lore.kernel.org/kvm/20251016172853.52451-1-seanjc@google.com

Tested-by: Shivank Garg <shivankg@amd.com>

Best regards,
Shivank



* Re: [PATCH v2] mm/page_alloc: drop flag-conversion "optimisation"
From: Zi Yan @ 2026-06-19 12:27 UTC (permalink / raw)
  To: Brendan Jackman, Brendan Jackman, Andrew Morton, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Johannes Weiner, Zi Yan
  Cc: Harry Yoo (Oracle), Gregory Price, linux-mm, linux-kernel
In-Reply-To: <DJD06ZZOC9IR.33KCFLM7L4267@linux.dev>

On Fri Jun 19, 2026 at 7:53 AM EDT, Brendan Jackman wrote:
> On Mon Jun 15, 2026 at 10:59 AM UTC, Brendan Jackman wrote:
>> On Mon Jun 15, 2026 at 10:54 AM UTC, Brendan Jackman wrote:
>>
>>>
>>> Signed-off-by: Brendan Jackman <jackmanb@google.com>
>>> ---
>>> Changes in v2:
>>> - Updated alloc_flags_nofragment() too.
>>> - Link to v1: https://lore.kernel.org/r/20260612-gfp-pessimisation-v1-1-936eb04202e7@google.com
>>
>> Sigh, maybe one day I'll send a patch without immediately following up
>> with a "oops, I forgot to ...". But today is not that day.
>>
>> Forgot to add these tags from the v1:
>>
>> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>> Reviewed-by: Zi Yan <ziy@nvidia.com>
>> Reviewed-by: Gregory Price <gourry@gourry.net>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>> Acked-by: Harry Yoo (Oracle) <harry@kernel.org>
>>
>> Thanks everyone for the prompt reviews.
>
> Hi Andrew,
>
> This one didn't make it into any of the latest mm-* branches, is
> anything blocked here?

We are in the quiet period, so no patches will be picked up until -rc1 is
out. If yours is not picked up then, feel free to resend it.

-- 
Best Regards,
Yan, Zi




* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
From: Lorenzo Stoakes @ 2026-06-19 12:24 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Rik van Riel, linux-kernel, x86, linux-mm, Thomas Gleixner,
	Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov, Dave Hansen,
	Andrew Morton, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka
In-Reply-To: <CAJuCfpHHkQyj7MMoH++RGraZgQ1Hr+4=T4PfbBn-minUCZyuOQ@mail.gmail.com>

On Tue, Jun 16, 2026 at 11:19:12PM -0700, Suren Baghdasaryan wrote:
> On Tue, Jun 16, 2026 at 12:04 PM Rik van Riel <riel@surriel.com> wrote:
> >
> > __access_remote_vm() takes mmap_read_lock() for the entire transfer and
> > uses get_user_pages_remote(), which faults pages in.  For the common
> > case of reading memory that is already resident -- /proc/PID/cmdline,
> > /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
> > unnecessary and is badly contended on large machines.
> >
> > Add an opportunistic, read-only fast path that transfers what it can
> > without the mmap lock.  For each address it takes the per-VMA lock with
> > lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
> > folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
> > a present page before copying it out.  Anything non-trivial -- a not-
> > present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
> > a race with a VMA writer -- falls back to the existing mmap_lock path
> > for the remainder.
>
> I don't think we should be using per-VMA locks if the read spans
> multiple VMAs. Doing that would risk a possibility of reading
> inconsistent data since we are locking one VMA at a time. While we

Yeah, very true.

Suren has expounded on the possible cases that can occur elsewhere but you can
observe strange states like that.

You can see tools/testing/selftests/proc/proc-maps-race.c for a sense of it and
https://lore.kernel.org/all/20260426062718.1238437-1-surenb@google.com/

Note that for e.g. madvise() this is exactly what we do.

> load and read VMA, its neighboring VMA can be unmapped and another one
> can be mapped in its place. So, our read spanning both VMAs will
> return inconsistent data. access_remote_vm_fast() can check if the
> entire read is contained within one VMA and if not, fall back to
> mmap_lock.

This would also vastly simplify the code. I expect most real-world cases are
like this anyway?
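
For illustration, the single-VMA containment check could look roughly like
the userspace sketch below. The struct and helper names are hypothetical
stand-ins (not kernel API), and the comparison is written so that
addr + len cannot overflow:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA's [vm_start, vm_end) address range. */
struct vma_range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/*
 * Return true only if [addr, addr + len) lies entirely inside @vma.
 * Compares len against the space left in the VMA instead of computing
 * addr + len, which could wrap around.
 */
static bool read_fits_in_vma(const struct vma_range *vma,
			     unsigned long addr, size_t len)
{
	if (addr < vma->start || addr >= vma->end)
		return false;
	return len <= vma->end - addr;
}
```

A fast path built on this would bail out to the mmap_lock path whenever the
helper returns false, so it never needs to re-lock a second VMA.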

Cheers, Lorenzo



* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
From: Lorenzo Stoakes @ 2026-06-19 12:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan
In-Reply-To: <20260616190300.1509639-4-riel@surriel.com>

On Tue, Jun 16, 2026 at 03:03:00PM -0400, Rik van Riel wrote:
> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
> uses get_user_pages_remote(), which faults pages in.  For the common
> case of reading memory that is already resident -- /proc/PID/cmdline,
> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
> unnecessary and is badly contended on large machines.
>
> Add an opportunistic, read-only fast path that transfers what it can
> without the mmap lock.  For each address it takes the per-VMA lock with
> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
> a present page before copying it out.  Anything non-trivial -- a not-
> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
> a race with a VMA writer -- falls back to the existing mmap_lock path
> for the remainder.
>
> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
> for the fast path; the untag mask is a stable per-mm value.
>
> Only reads are handled here; writes keep using the slow path.
>
> Assisted-by: Claude:claude-opus-4-8

This feels as if there was a little too much left to AI :)

> Signed-off-by: Rik van Riel <riel@surriel.com>

This needs to be separated into more patches, functions, and thoroughly reworked
to be upstreamable, unfortunately.

It's additionally quite hard to review in this form.

> ---
>  arch/x86/include/asm/uaccess_64.h |  12 +++
>  include/linux/uaccess.h           |  11 ++
>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>  3 files changed, 188 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index 4a52497ba6a1..c6fac900a747 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>  	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
>  })
>
> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */

How? This is pretty vague.

> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
> +							    unsigned long addr)
> +{
> +	return addr & READ_ONCE((mm)->context.untag_mask);
> +}
> +
> +#define untagged_addr_remote_unlocked(mm, addr)	({			\
> +	unsigned long __addr = (__force unsigned long)(addr);		\
> +	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
> +})

I'm confused why you're implementing this and not just calling untagged_addr()
from untagged_addr_remote_unlocked()?

You don't comment or explain this in the commit msg afaict.

> +
>  #endif
>
>  #define valid_user_address(x) \
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 8a264662b242..c8c83372c9d8 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -34,6 +34,17 @@
>  })
>  #endif
>
> +/*
> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
> + */

It's odd that you comment this but don't explain the confusing bit, namely
why you can't just call untagged_addr().

> +#ifndef untagged_addr_remote_unlocked
> +#define untagged_addr_remote_unlocked(mm, addr)	({	\
> +	(void)(mm);					\

I'm not sure this is required?

> +	untagged_addr(addr);				\

Weird again that x86 needs special treatment but not other arches?

> +})
> +#endif
> +

You should really make untagged_addr_remote() call
untagged_addr_remote_unlocked() after its assert, otherwise it's a really odd
inconsistency.

>  #ifdef masked_user_access_begin
>   #define can_do_masked_user_access() 1
>  # ifndef masked_user_write_access_begin
> diff --git a/mm/memory.c b/mm/memory.c
> index 86a973119bd4..0b23b82eaa18 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,8 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/secretmem.h>
> +#include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>  EXPORT_SYMBOL_GPL(generic_access_phys);
>  #endif
>
> +/*
> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
> + */
> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)

Shall we wait for, or rely on Dave's series to remove CONFIG_PER_VMA_LOCK + make
it permanently on here?

> +/*
> + * Opportunistic lockless fast path for __access_remote_vm() reads.
> + *
> + * Memory already resident in @mm can be read without taking the heavily
> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
> + *
> + * Anything that would require faulting a page in, touching a hugetlb or
> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
> + * path in __access_remote_vm().  Only reads are handled here.

I think referencing the confusing mess that is the special VMA flags is best
avoided.

I think we could be clearer here, like:

This is the read fast path; writes, faulting in, and touching hugetlb or a
remap are handled by the slow path in __access_remote_vm().


> + *
> + * Returns the number of bytes transferred via the fast path.
> + */
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)

In general this function is... ugly. Very ugly :)

You nest while -> while -> for and it's all open-coded.

It needs major refactoring - separate out smaller functions, improve the
comments, separate out logic that can be shared with gup etc.

> +{
> +	void *old_buf = buf;
> +
> +	addr = untagged_addr_remote_unlocked(mm, addr);
> +
> +	while (len) {
> +		struct vm_area_struct *vma;
> +		vm_flags_t vm_flags;
> +
> +		vma = lock_vma_under_rcu(mm, addr);
> +		if (!vma)
> +			break;
> +
> +		/*
> +		 * Mirror the read-side permission checks of check_vma_flags(),
> +		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
> +		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
> +		 * per VMA; anything not positively allowed falls back to the
> +		 * slow path, which re-validates everything.
> +		 */

This feels very overwrought. You're compressing far too much into one lump of
text.

> +		vm_flags = vma->vm_flags;

Please don't use the old VMA flags API for new code. And definitely don't do
some weird vm_flags we keep separate from vma->vm_flags thing here.

	vma_test_any(vma, VMA_IO_BIT, VMA_PFNMAP_BIT)

> +		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
> +		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
> +		    (!(vm_flags & VM_READ) &&

But what does !VM_READ mean exactly? Are you really checking for PROT_NONE?

So vma_is_accessible() is right here no?

> +		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {

This conditional is abhorrent...

I think a good rule of thumb for this kind of thing is to read it out loud in
English - 'if IO or PFN map or hugetlb or secret mem or either not VM_READ etc.'
- if you're confused by it in English then don't put it in code.

Anyway it should clearly be a separate function like:

	static bool vma_can_use_fast_path(const struct vm_area_struct *vma)
	{
		/* We cannot GUP PFN maps or I/O memory. */
		if (vma_test_any(vma, VMA_IO_BIT, VMA_PFNMAP_BIT))
			return false;
		/* Hugetlb is a special snowflake. */
		if (is_vm_hugetlb_page(vma))
			return false;
		... etc. etc. ...
		return true;
	}

Which is vastly clearer.
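
As a further illustration, the read-permission part of the quoted
conditional can be untangled into its positive form. This is only a
userspace sketch of the logic in the patch as posted: the SK_* constants are
hypothetical placeholders, not the kernel's real flag values, and it does not
replace the vma_is_accessible() question raised above.

```c
#include <stdbool.h>

/* Hypothetical flag values for the sketch; not the kernel definitions. */
#define SK_VM_READ	0x1UL
#define SK_VM_MAYREAD	0x10UL
#define SK_FOLL_FORCE	0x10U

/*
 * Positive form of the quoted check: a read is allowed if the VMA is
 * readable, or if the access is forced and the VMA may become readable.
 */
static bool vma_read_allowed(unsigned long vm_flags, unsigned int gup_flags)
{
	if (vm_flags & SK_VM_READ)
		return true;

	return (gup_flags & SK_FOLL_FORCE) && (vm_flags & SK_VM_MAYREAD);
}
```

Read out loud: "readable, or forced and may-read", which is the negation of
the `!VM_READ && (!FOLL_FORCE || !VM_MAYREAD)` test in the patch.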


> +			vma_end_read(vma);
> +			break;
> +		}
> +
> +		/*
> +		 * Copy as much of this VMA as we can without re-acquiring the
> +		 * per-VMA lock; re-lock only when @addr leaves the VMA.
> +		 */

Strange phrasing. I'm not even sure it's a useful comment?

> +		while (len && addr < vma->vm_end) {
> +			struct folio_walk fw;

Be good to avoid mystery meat variable names. 'walk'?

> +			struct folio *folio;
> +			struct page *page;
> +			unsigned long entry_size, entry_left, folio_left, span;
> +			unsigned long copied, idx0;

idx0 is a terrible name :)

All these variables tell you the function is too long.

> +			int offset;
> +
> +			folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
> +			if (!folio) {
> +				vma_end_read(vma);
> +				goto out;
> +			}
> +			page = fw.page;
> +			if (!page) {

Under what circumstances would fw.page be NULL when folio is non-NULL?

> +				folio_walk_end(&fw, vma);
> +				vma_end_read(vma);
> +				goto out;
> +			}
> +			/* Pin the folio so it stays valid after the PTL is dropped. */
> +			folio_get(folio);
> +			folio_walk_end(&fw, vma);
> +
> +			/*
> +			 * folio_walk_start() validated exactly one mapping entry,
> +			 * which covers a contiguous, present run of this folio:
> +			 * PAGE_SIZE for a pte, PMD_SIZE for a pmd leaf, PUD_SIZE
> +			 * for a pud leaf.  Copy up to the end of that entry,
> +			 * bounded by the folio, the VMA and len, so a huge mapping
> +			 * is handled in one walk instead of per page.
> +			 */
> +			offset = offset_in_page(addr);
> +			switch (fw.level) {
> +			case FW_LEVEL_PUD:
> +				entry_size = PUD_SIZE;
> +				break;
> +			case FW_LEVEL_PMD:
> +				entry_size = PMD_SIZE;
> +				break;
> +			default:
> +				entry_size = PAGE_SIZE;
> +				break;
> +			}
> +			entry_left = entry_size - (addr & (entry_size - 1));

Surely we have a better way of doing this? At least needs abstracting, a random
switch in the middle of this code is horrid.
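
One possible shape for that abstraction, as a userspace sketch: the enum,
the helper names and the SKETCH_* sizes are all hypothetical (the sizes
assume x86-64 with 4K pages), so treat this as an illustration of the
arithmetic rather than a proposed kernel helper.

```c
/* Hypothetical mapping levels and sizes (x86-64, 4K pages assumed). */
enum map_level { MAP_LEVEL_PTE, MAP_LEVEL_PMD, MAP_LEVEL_PUD };

#define SKETCH_PAGE_SIZE	(1UL << 12)
#define SKETCH_PMD_SIZE		(1UL << 21)
#define SKETCH_PUD_SIZE		(1UL << 30)

/* Size in bytes covered by one mapping entry at @level. */
static unsigned long map_level_size(enum map_level level)
{
	switch (level) {
	case MAP_LEVEL_PUD:
		return SKETCH_PUD_SIZE;
	case MAP_LEVEL_PMD:
		return SKETCH_PMD_SIZE;
	default:
		return SKETCH_PAGE_SIZE;
	}
}

/*
 * Bytes from @addr to the end of the naturally aligned block of
 * @entry_size bytes containing it. @entry_size must be a power of two.
 */
static unsigned long bytes_left_in_entry(unsigned long addr,
					 unsigned long entry_size)
{
	return entry_size - (addr & (entry_size - 1));
}
```

With helpers like these the caller reduces to a single line, e.g.
bytes_left_in_entry(addr, map_level_size(level)), and the switch no longer
sits in the middle of the copy loop.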

> +			idx0 = folio_page_idx(folio, page);
> +			folio_left = ((folio_nr_pages(folio) - idx0) << PAGE_SHIFT) -
> +				     offset;

Couldn't we just keep track of this without this horrid expression?

> +			span = min3((unsigned long)len, entry_left, folio_left);
> +			span = min(span, vma->vm_end - addr);

You add massive comments for some bits, then do extremely confusing open coded
stuff here?

This needs a lot of breaking up.

> +
> +			/*
> +			 * Copy the span page-by-page: kmap_local_folio() maps one
> +			 * page on HIGHMEM and copy_from_user_page() flushes per
> +			 * page on aliasing caches, but the page tables are not
> +			 * re-walked.  The span borrows the single folio reference
> +			 * taken above, so each mapping is dropped with
> +			 * kunmap_local() (not folio_release_kmap(), which would
> +			 * also drop a folio reference per page).
> +			 */

This is a really confusing mass of text that is really dense and hard to
parse. Clarity is king.

In any case this should be separated out.

> +			for (copied = 0; copied < span; ) {

Very odd for loop.

	copied = 0;
	while (copied < span) {
		...
	}

Would be better. But I think reworking it so a normal for (init; cond; incr)
loop would work would be better?
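
For example, a page-bounded copy can be written as an ordinary
for (init; cond; incr) loop once the chunk size is the increment. The
following is a userspace sketch only: memcpy() stands in for the
kmap_local_folio()/copy_from_user_page() pair, and the function name,
SKETCH_PAGE_SIZE and the nr_chunks out-parameter are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for the sketch */

/*
 * Copy @span bytes from @src + @offset into @dst, never letting a single
 * memcpy() cross a page boundary of the source. Returns the number of
 * bytes copied and stores the number of chunks used in @nr_chunks.
 */
static size_t copy_span_by_page(void *dst, const void *src, size_t offset,
				size_t span, size_t *nr_chunks)
{
	size_t copied, chunk;

	*nr_chunks = 0;
	for (copied = 0; copied < span; copied += chunk) {
		size_t offset_in_page = (offset + copied) % SKETCH_PAGE_SIZE;

		/* Stop at the end of the current page or of the span. */
		chunk = SKETCH_PAGE_SIZE - offset_in_page;
		if (chunk > span - copied)
			chunk = span - copied;

		memcpy((char *)dst + copied,
		       (const char *)src + offset + copied, chunk);
		(*nr_chunks)++;
	}

	return copied;
}
```

Pulled out like this, the loop body also sits at one indentation level and
the units (bytes versus pages) are visible in the names.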

> +				unsigned long foff = offset + copied;

foff :)) now I won't be childish :P

I really dislike overly compressed variable names. It's vague. File offset? I
guess you mean folio offset right?

and why did you call it plain 'offset' before but now specify folio but as 'f'?

> +				unsigned long pidx = idx0 + (foff >> PAGE_SHIFT);

'pidx'?

Equally unnecessarily and confusingly compressed variable name. We can live with
page_index if that's what you mean?

Also it's uncertain what the units are. It's ok to say 'bytes' and 'nr_pages' or
'page_nr' or something, and far clearer.

This should obviously be in another function anyway given indentation levels here.

> +				int poff = foff & ~PAGE_MASK;
> +				int chunk = min_t(unsigned long, span - copied,
> +						  PAGE_SIZE - poff);
> +				void *maddr = kmap_local_folio(folio,
> +						pidx << PAGE_SHIFT);
> +
> +				copy_from_user_page(vma, folio_page(folio, pidx),
> +						    addr + copied, buf + copied,
> +						    maddr + poff, chunk);
> +				kunmap_local(maddr);
> +				copied += chunk;
> +			}
> +
> +			folio_put(folio);
> +			len -= span;
> +			buf += span;
> +			addr += span;
> +		}
> +		vma_end_read(vma);

Really hard to keep track of what's what here.

> +	}
> +out:
> +	return buf - old_buf;
> +}
> +#else
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
> +
>  /*
>   * Access another process' address space as given in mm.
>   */
> @@ -7071,8 +7220,23 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
>  	void *old_buf = buf;
>  	int write = gup_flags & FOLL_WRITE;
>
> +	/*
> +	 * Try the lockless fast path for reads first; it transfers what it can
> +	 * from resident memory without taking mmap_lock, and leaves the
> +	 * remainder (if any) to the slow path below.
> +	 */

This is a weird comment, you should describe what access_remote_vm_fast() does
in access_remote_vm_fast(). You also don't mention the !write here which is the
thing people might wonder about.

I think the code is self-documenting anyway - try fast path - pretty clear.


> +	if (!write) {
> +		int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);

Can be const.

Can't errors arise in access_remote_vm_fast()? And in general it seems it'd make
more sense to return an error/bool and have done as an output param, rather
than infer stuff from done.

> +
> +		addr += done;
> +		buf += done;
> +		len -= done;
> +		if (!len)
> +			return buf - old_buf;

So usual case will be it does everything right? So you do some useless
arithmetic and then return buf - old_buf.

Should probably instead have a return value.

But in general __access_remote_vm() is horrible. I think if you add new features
it's only right you spend some commits cleaning up first.

Otherwise we heap more stuff on top of broken stuff on and on and things get
messier + messier.


> +	}
> +
>  	if (mmap_read_lock_killable(mm))
> -		return 0;
> +		return buf - old_buf;

Err there's other cases where you return 0 here, e.g.:

	/* Avoid triggering the temporary warning in __get_user_pages */
	if (!vma_lookup(mm, addr) && !expand_stack(mm, addr))
		return 0;

So you probably need to fix those up too?

Probably better to just have a return value declared in the function.
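
A rough userspace sketch of that shape, under the assumption that the fast
path can fail: copy_fast(), copy_slow() and copy_all() are hypothetical
stand-ins for access_remote_vm_fast(), the mmap_lock slow path and
__access_remote_vm(). The fast path returns 0 or a negative errno and reports
progress through an output parameter, and the caller keeps a single running
total that every exit returns:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical fast path: copies at most fast_limit bytes, reports progress
 * through *done, and returns 0 on success or a negative errno.
 */
static int copy_fast(const char *src, char *dst, size_t len,
		     size_t fast_limit, size_t *done)
{
	assert(done);

	*done = 0;
	if (!src || !dst)
		return -EFAULT;
	*done = len < fast_limit ? len : fast_limit;
	memcpy(dst, src, *done);
	return 0;
}

/* Hypothetical slow path: copies everything it is given. */
static size_t copy_slow(const char *src, char *dst, size_t len)
{
	memcpy(dst, src, len);
	return len;
}

/* Caller: one running total, returned from every exit. */
static size_t copy_all(const char *src, char *dst, size_t len,
		       size_t fast_limit)
{
	size_t total = 0;
	size_t done;

	if (copy_fast(src, dst, len, fast_limit, &done))
		return total;	/* error: nothing was transferred */
	total += done;
	if (total == len)
		return total;	/* usual case: fast path did everything */

	total += copy_slow(src + total, dst + total, len - total);
	return total;
}
```

The early exits cannot drift apart this way, since none of them hard-codes 0
or recomputes buf - old_buf.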

>
>  	/* Untag the address before looking up the VMA */
>  	addr = untagged_addr_remote(mm, addr);
> --
> 2.53.0-Meta
>

Thanks, Lorenzo



* Re: [Patch v2] mm/page_vma_mapped: revalidate and do proper check before return device-private pmd
From: Lance Yang @ 2026-06-19 12:19 UTC (permalink / raw)
  To: richard.weiyang
  Cc: lance.yang, akpm, david, ljs, riel, liam, vbabka, harry, jannh,
	balbirs, ziy, sj, linux-mm, stable
In-Reply-To: <20260619023025.vqx2dsitxffuuwh3@master>


On Fri, Jun 19, 2026 at 02:30:25AM +0000, Wei Yang wrote:
>On Wed, Jun 17, 2026 at 08:18:15AM +0000, Wei Yang wrote:
>>On Wed, Jun 17, 2026 at 10:32:11AM +0800, Lance Yang wrote:
>>>
>>>On Tue, Jun 16, 2026 at 11:50:22PM +0000, Wei Yang wrote:
>>>>On Tue, Jun 16, 2026 at 08:30:01PM +0800, Lance Yang wrote:
>>>>>
>>>>>On Tue, Jun 16, 2026 at 06:34:36AM +0000, Wei Yang wrote:
>>>>>[...]
>>>>>>diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>>>>>index 2ccbabfb2cc1..21635fab209c 100644
>>>>>>--- a/mm/page_vma_mapped.c
>>>>>>+++ b/mm/page_vma_mapped.c
>>>>>>@@ -243,40 +243,28 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>>>> 		 */
>>>>>> 		pmde = pmdp_get_lockless(pvmw->pmd);
>>>>>> 
>>>>>>-		if (pmd_trans_huge(pmde) || pmd_is_migration_entry(pmde)) {
>>>>>>-			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>>>>>>-			pmde = *pvmw->pmd;
>>>>>>-			if (!pmd_present(pmde)) {
>>>>>>-				softleaf_t entry;
>>>>>>-
>>>>>>-				if (!thp_migration_supported() ||
>>>>>>-				    !(pvmw->flags & PVMW_MIGRATION))
>>>>>>-					return not_found(pvmw);
>>>>>>-				entry = softleaf_from_pmd(pmde);
>>>>>>-
>>>>>>-				if (!softleaf_is_migration(entry) ||
>>>>>>-				    !check_pmd(softleaf_to_pfn(entry), pvmw))
>>>>>>-					return not_found(pvmw);
>>>>>>-				return true;
>>>>>>-			}
>>>>>>-			if (likely(pmd_trans_huge(pmde))) {
>>>>>>-				if (pvmw->flags & PVMW_MIGRATION)
>>>>>>-					return not_found(pvmw);
>>>>>>-				if (!check_pmd(pmd_pfn(pmde), pvmw))
>>>>>>-					return not_found(pvmw);
>>>>>>-				return true;
>>>>>>-			}
>>>>>>-			/* THP pmd was split under us: handle on pte level */
>>>>>>-			spin_unlock(pvmw->ptl);
>>>>>>-			pvmw->ptl = NULL;
>>>>>>-		} else if (!pmd_present(pmde)) {
>>>>>>-			const softleaf_t entry = softleaf_from_pmd(pmde);
>>>>>>-
>>>>>>-			if (softleaf_is_device_private(entry)) {
>>>>>>-				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>>>>>>-				return true;
>>>>>>-			}
>>>>>>+		if (pmd_present(pmde)) {
>>>>>>+			if (!pmd_leaf(pmde))
>>>>>>+				goto pte_table;
>>>>>>+			if (pvmw->flags & PVMW_MIGRATION)
>>>>>>+				return not_found(pvmw);
>>>>>>+			if (!check_pmd(pmd_pfn(pmde), pvmw))
>>>>>>+				return not_found(pvmw);
>>>>>>+		} else if (pmd_is_migration_entry(pmde)) {
>>>>>>+			softleaf_t entry = softleaf_from_pmd(pmde);
>>>>>>+
>>>>>>+			if (!(pvmw->flags & PVMW_MIGRATION))
>>>>>>+				return not_found(pvmw);
>>>>>
>>>>>Looked at history a bit, and I wonder if this changed something old
>>>>>here ...
>>>>>
>>>>>Since 616b8371539a ("mm: thp: enable thp migration in generic path"), PMD
>>>>>migration handling took PTL before doing PVMW_MIGRATION/PFN checks,
>>>>>including not_found() cases. So lockless PMD read was just a filter ...
>>>>>
>>>>>With this fix, true case gets final pmd_same() check, but this
>>>>>not_found() case happens before taking PTL.
>>>>>
>>>>>So a !PVMW_MIGRATION walker could race with someone, e.g.
>>>>>remove_migration_pmd(): we make the not_found() decision from old PMD
>>>>>value that still says "migration", while real *pvmw->pmd may already be
>>>>>present again. We return without ever taking PTL :)
>>>>>
>>>>
>>>>Hi, Lance
>>>>
>>>>Thanks for taking a look.
>>>>
>>>>I am trying to understand the scenario you mentioned. Let's say A migrates a
>>>>pmd and B wants to unmap the pmd.
>>>>
>>>>            A                                        B
>>>>
>>>>  try to migrate a pmd
>>>>  pmd is set to migration entry
>>>>                                           unmap the pmd ...
>>>>  managed to finish migration
>>>>                                           ...still see migration entry,
>>>>                                           so skipped and unmap fail
>>>>
>>>>Would this be a timing case? Even if B grabs the PTL, it could still see the
>>>>migration entry if B visits the pmd before A finishes migration.
>>>>
>>>>Maybe I am missing something; I look forward to your insight.
>>>
>>>Right, seeing migration entry while migration is still ongoing is fine.
>>>
>>>What I meant was this ordering:
>>>
>>>  CPU 0: pmde = pmdp_get_lockless(...); /* migration */
>>>  CPU 1: remove_migration_pmd() restores PMD to present
>>>  CPU 0: returns not_found() from old pmde, without ever taking PTL and
>>>         rechecking *pvmw->pmd
>>>
>>>So issue is not seeing migration entry itself, but making final
>>>not_found() decision from stale lockless PMD value ...
>>>
>>>Before this patch, PMD migration case took PTL before making that
>>>decision ...
>>>
>>
>>Yes, this patch changes the decision-making condition for a pmd entry. Thanks
>>for pointing that out.
>>
>>Hmm... I took another look at the current pte handling and found that for a pte
>>entry we do a two-phase check:
>>
>>  * map_pte() without ptl
>>  * check_pte() with ptl
>>
>>While check_pte() does an extra pfn range check, map_pte() doesn't.
>>
>>This means that for a pte entry we may face the same situation as you describe:
>>making the decision before grabbing the PTL. So far, it looks reasonable.
>>
>>But one thing jumped out at me: PVMW_SYNC. When this flag is specified, all
>>checks are done under PTL. But now for a pmd entry, we don't have a chance to
>>do so.
>>
>>And as the comment says in try_to_migrate_one()
>>
>>	/*
>>	 * When racing against e.g. zap_pte_range() on another cpu,
>>	 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
>>	 * try_to_migrate() may return before folio_mapped() has become false,
>>	 * if page table locking is skipped: use TTU_SYNC to wait for that.
>>	 */
>>
>>I tracked this down to commit a98a2f0c8ce1 ('mm/rmap: split migration into its
>>own function'), but did not find more detail on the reasoning. I do not fully
>>understand it yet, but it seems there is some race between migration and unmap
>>which is protected by PTL?
>>
>>Will look into this to get more detail.
>>
>
>After going through the history, I found this:
>
>   commit 732ed55823fc3ad998d43b86bf771887bcc5ec67
>   Author: Hugh Dickins <hughd@google.com>
>   Date:   Tue Jun 15 18:23:53 2021 -0700
>   
>       mm/thp: try_to_unmap() use TTU_SYNC for safe splitting
>
>This one fixes the race mentioned above: we expect mapcount to be 0, but it is not.

Cool, thanks!

I do want to spend more time on this refactor. It is touching some subtle
page_vma_mapped_walk() rules, so I don't want to skim and guess ...

One case I can pin down now is device-private: the PTE side gives us a
clear rule to compare against :)

On the PTE side:

1) PVMW_SYNC set, PVMW_MIGRATION set

  map_pte() uses pte_offset_map_lock(), so it takes PTL first.
  check_pte() then runs under PTL. Since PVMW_MIGRATION is set,
  check_pte() requires a migration entry, so device-private is rejected.

2) PVMW_SYNC set, PVMW_MIGRATION clear

  map_pte() takes PTL first. check_pte() then runs under PTL.
  Since PVMW_MIGRATION is clear, device-private can be a normal mapping,
  but check_pte() still checks entry type and PFN range.

3) PVMW_SYNC clear, PVMW_MIGRATION set

  map_pte() first does a lockless read. A non-present, non-none PTE can
  still be a candidate, so map_pte() takes PTL. check_pte() then rejects
  device-private, because PVMW_MIGRATION requires a migration entry.

4) PVMW_SYNC clear, PVMW_MIGRATION clear

  map_pte() first does a lockless read. A device-private PTE can be a
  normal mapping candidate, so map_pte() takes PTL. check_pte() then
  checks entry type and PFN range under PTL.

On the PMD device-private side, before this patch, all four cases go
through the same code once the lockless PMD read sees a device-private
entry:

- lockless read PMD into pmde
- pmde is non-present
- decode pmde as a softleaf entry
- entry is device-private
- take pmd_lock()
- return true
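
As a userspace model only, the six steps above and what a locked recheck would
add can be sketched like this. Everything here is made up for illustration
(struct toy_pmd, TOY_DEVICE_PRIVATE, the toy spinlock, the race callback); it
is not the kernel code and says nothing about how the real fix should look:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DEVICE_PRIVATE	UINT64_C(0x1)	/* made-up entry encoding */

/* Toy "PMD": one word that can be read locklessly, plus a toy spinlock. */
struct toy_pmd {
	_Atomic uint64_t val;
	atomic_flag ptl;
};

/* Models a writer that changes the entry between our read and our lock. */
typedef void (*toy_race_fn)(struct toy_pmd *pmd);

static void toy_pmd_init(struct toy_pmd *pmd, uint64_t val)
{
	atomic_init(&pmd->val, val);
	atomic_flag_clear(&pmd->ptl);
}

static void toy_lock(struct toy_pmd *pmd)
{
	while (atomic_flag_test_and_set(&pmd->ptl))
		;	/* spin */
}

static void toy_unlock(struct toy_pmd *pmd)
{
	atomic_flag_clear(&pmd->ptl);
}

static void toy_clear_entry(struct toy_pmd *pmd)
{
	atomic_store(&pmd->val, 0);
}

/* Flow as listed above: classify locklessly, take the lock, return true. */
static bool walk_no_recheck(struct toy_pmd *pmd, toy_race_fn race)
{
	uint64_t snapshot;

	assert(pmd);
	snapshot = atomic_load(&pmd->val);
	if (!(snapshot & TOY_DEVICE_PRIVATE))
		return false;
	if (race)
		race(pmd);
	toy_lock(pmd);
	return true;		/* returns with the lock held */
}

/* Same flow, but the lockless value is only a filter. */
static bool walk_recheck(struct toy_pmd *pmd, toy_race_fn race)
{
	uint64_t snapshot;

	assert(pmd);
	snapshot = atomic_load(&pmd->val);
	if (!(snapshot & TOY_DEVICE_PRIVATE))
		return false;
	if (race)
		race(pmd);
	toy_lock(pmd);
	if (atomic_load(&pmd->val) != snapshot) {	/* reread under lock */
		toy_unlock(pmd);
		return false;
	}
	return true;		/* returns with the lock held */
}
```

If the entry is cleared between the lockless read and the lock,
walk_no_recheck() still returns true, while walk_recheck() notices the change
and returns false.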

So compared with the PTE side:

A) PVMW_SYNC set, PVMW_MIGRATION set

  PTE rejects device-private under PTL.

  PMD returns true.

  This does not match. The PMD code misses the PVMW_MIGRATION direction
  check, and does not reread/revalidate PMD under pmd_lock().

B) PVMW_SYNC set, PVMW_MIGRATION clear

  PTE can accept device-private, but only after locked check_pte()
  validation.

  PMD also returns true.

  The direction is OK, but the final check is missing. PMD returns true
  from the lockless PMD classification, without PMD revalidation and
  without check_pmd() PFN-range check.

C) PVMW_SYNC clear, PVMW_MIGRATION set

  PTE can reach locked check_pte() from the lockless candidate, but
  check_pte() rejects device-private.
  
  PMD returns true.

  Same mismatch as case A: missing PVMW_MIGRATION direction check, and no
  locked PMD revalidation.

D) PVMW_SYNC clear, PVMW_MIGRATION clear

  PTE can accept device-private after locked validation.

  PMD also returns true.

  Direction is OK here as well, but the PMD code still has no final
  locked check matching check_pte(): no PMD reread/revalidation, and no
  check_pmd() PFN-range check.
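
The direction part of cases A-D can be written down as a small userspace
table. This is only a model of the comparison above: TOY_SYNC and
TOY_MIGRATION stand in for PVMW_SYNC and PVMW_MIGRATION, the PTE-side entry
type and PFN checks are assumed to pass, and the missing locked revalidation
in B and D is deliberately not modelled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define TOY_SYNC	0x1u	/* stands in for PVMW_SYNC */
#define TOY_MIGRATION	0x2u	/* stands in for PVMW_MIGRATION */
#define TOY_ALL_FLAGS	(TOY_SYNC | TOY_MIGRATION)

/*
 * PTE side, cases 1-4: device-private is rejected when a migration entry is
 * required, and otherwise accepted after the locked checks.
 */
static bool pte_accepts_device_private(unsigned int flags)
{
	assert(!(flags & ~TOY_ALL_FLAGS));
	return !(flags & TOY_MIGRATION);
}

/* PMD side before the patch: returns true whatever the flags are. */
static bool pmd_accepts_device_private(unsigned int flags)
{
	assert(!(flags & ~TOY_ALL_FLAGS));
	return true;
}

static bool direction_matches(unsigned int flags)
{
	return pte_accepts_device_private(flags) ==
	       pmd_accepts_device_private(flags);
}

/* Print one row per flag combination and return the number of mismatches. */
static int print_comparison(void)
{
	int mismatches = 0;

	for (unsigned int flags = 0; flags <= TOY_ALL_FLAGS; flags++) {
		bool match = direction_matches(flags);

		printf("SYNC=%d MIGRATION=%d pte=%d pmd=%d %s\n",
		       !!(flags & TOY_SYNC), !!(flags & TOY_MIGRATION),
		       pte_accepts_device_private(flags),
		       pmd_accepts_device_private(flags),
		       match ? "match" : "MISMATCH");
		if (!match)
			mismatches++;
	}
	return mismatches;
}
```

The two mismatching rows are the ones with the migration flag set, i.e. cases
A and C.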

>
>IIUC, if we apply the change in this patch, the affected case is
>pmd_is_migration_entry(). In case someone else has cleared it but not yet
>updated mapcount, try_to_migrate() would return before folio_mapped() is false.
>
>Thanks, Lance, for raising the question.
>
>If the above analysis is true, I haven't got a neat way to take this into
>consideration.
>
>BTW, for a fix, I am thinking to keep it simple and direct. So how about leaving
>the refactor as a followup cleanup?

So for a fix, let's line up the PTE and PMD rules first :D

Cheers, Lance

>-- 
>Wei Yang
>Help you, Help me
>



* Re: [RFC PATCH 5/6] arm64: execmem: enable EXECMEM_ROX_CACHE on supported CPUs
From: Ryan Roberts @ 2026-06-19 12:09 UTC (permalink / raw)
  To: Adrian Barnaś, linux-arm-kernel
  Cc: linux-mm, Catalin Marinas, Will Deacon, David Hildenbrand,
	Mike Rapoport (Microsoft), Ard Biesheuvel, Christoph Lameter,
	Yang Shi, Brendan Jackman
In-Reply-To: <20260611130144.1385343-6-abarnas@google.com>

On 11/06/2026 14:01, Adrian Barnaś wrote:
> Enable EXECMEM_ROX_CACHE support for ARM64 systems that implement
> the bbml2_no_abort CPU feature.
> 
> Using the ROX cache brings a performance boost by reducing linear region
> fragmentation caused by strict memory permissions (e.g., W^X enforcement).
> Grouping executable code (which is read-only in the linear region alias)
> into PMD-sized block mappings reduces TLB pressure and page table size.

Do you have any data on fragmentation reduction and/or performance improvement
in practice due to this change?

> 
> This is only enabled on systems with bbml2_no_abort, as splitting
> these large blocks to make pages writable during module loading would
> otherwise risk triggering TLB Conflict Aborts.
> 
> Signed-off-by: Adrian Barnaś <abarnas@google.com>
> ---
>  arch/arm64/Kconfig   |  1 +
>  arch/arm64/mm/init.c | 22 +++++++++++++++++++++-
>  2 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 38dba5f7e4d2..79c347ab841e 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -285,6 +285,7 @@ config ARM64
>  	select USER_STACKTRACE_SUPPORT
>  	select VDSO_GETRANDOM
>  	select VMAP_STACK
> +	select ARCH_HAS_EXECMEM_ROX

nit: This list is sorted in alphabetical order; please maintain that ordering.

>  	help
>  	  ARM 64-bit (AArch64) Linux support.
>  
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 71aa745e0bef..8269d7747b84 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -420,6 +420,12 @@ void execmem_fill_trapping_insns(void *ptr, size_t size)
>  
>  	flush_icache_range((unsigned long)ptr, (unsigned long)ptr + size);
>  }
> +
> +#define MODULE_TEXT_FLAG	EXECMEM_ROX_CACHE
> +#define MODULE_TEXT_PGPROT	PAGE_KERNEL_ROX
> +#else
> +#define MODULE_TEXT_FLAG	(0)
> +#define MODULE_TEXT_PGPROT	PAGE_KERNEL
>  #endif
>  
>  static u64 module_direct_base __ro_after_init = 0;
> @@ -511,6 +517,8 @@ struct execmem_info __init *execmem_arch_setup(void)
>  {
>  	unsigned long fallback_start = 0, fallback_end = 0;
>  	unsigned long start = 0, end = 0;
> +	enum execmem_range_flags module_text_flags = 0;
> +	pgprot_t module_text_pgprot = PAGE_KERNEL;
>  
>  	module_init_limits();
>  
> @@ -531,12 +539,24 @@ struct execmem_info __init *execmem_arch_setup(void)
>  		end = module_plt_base + SZ_2G;
>  	}
>  
> +	/*
> +	 * The ROX Cache requires bbml2_no_abort because it uses large block
> +	 * mappings. On systems without this guarantee, splitting these blocks
> +	 * to make pages writable for module loading can trigger TLB Conflict
> +	 * Aborts.
> +	 */
> +	if (system_supports_bbml2_noabort()) {
> +		module_text_flags = MODULE_TEXT_FLAG;
> +		module_text_pgprot = MODULE_TEXT_PGPROT;
> +	}

Perhaps this is a bit clearer? Then you don't need the MODULE_TEXT_* macros:

	if (IS_ENABLED(CONFIG_ARCH_HAS_EXECMEM_ROX) &&
	    system_supports_bbml2_noabort()) {
		module_text_flags = EXECMEM_ROX_CACHE;
		module_text_pgprot = PAGE_KERNEL_ROX;
	}

Thanks,
Ryan

> +
>  	execmem_info = (struct execmem_info){
>  		.ranges = {
>  			[EXECMEM_MODULE_TEXT] = {
>  				.start	= start,
>  				.end	= end,
> -				.pgprot	= PAGE_KERNEL,
> +				.flags = module_text_flags,
> +				.pgprot	= module_text_pgprot,
>  				.alignment = 1,
>  				.fallback_start	= fallback_start,
>  				.fallback_end	= fallback_end,




* Re: [PATCH v2 0/2] luo: migrate serialized_data to type-safe KHO pointers
From: tarunsahu @ 2026-06-19 12:00 UTC (permalink / raw)
  To: Alexander Graf, Pratyush Yadav, Mike Rapoport, Andrew Morton,
	Pasha Tatashin
  Cc: linux-kernel, linux-mm, kexec
In-Reply-To: <cover.1781615759.git.tarunsahu@google.com>

+Mike, +Pasha

This is currently rebased on top of mainline, and can also be applied to
liveupdate/fixes with no conflict. I would like to know which release
cycle we are targeting for this, so that I can rebase it accordingly.

As for liveupdate/next, it needs some conflict resolution.

Tarun Sahu <tarunsahu@google.com> writes:

> Convert raw serialized_data to KHO serializeable pointer (KHOSER_PTR).
> This series also takes care of resolving the bug with memfd of using
> phys_to_virt before checking the args->serialized_data value.
>
> Tarun Sahu (2):
>   kho: add KHOSER_COPY_TYPE(UN)SAFE for phys copy
>   luo: Update serialized data to serializeable pointer
>
>  include/linux/kho/abi/kexec_handover.h | 12 ++++++++++++
>  include/linux/kho/abi/luo.h            |  6 ++++--
>  include/linux/liveupdate.h             |  4 ++--
>  kernel/liveupdate/luo_file.c           | 24 ++++++++++++------------
>  mm/memfd_luo.c                         | 18 +++++++++---------
>  5 files changed, 39 insertions(+), 25 deletions(-)
>
>
> base-commit: 0e0611827f3349d0a2ac121c023a6d3260dcecdb
> -- 
> 2.54.0.1136.gdb2ca164c4-goog



* Re: [PATCH mm-hotfixses] Revert "mm: limit filemap_fault readahead to VMA boundaries"
From: Pedro Falcato @ 2026-06-19 11:58 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Frederick Mayle, Kalesh Singh, Matthew Wilcox,
	Jan Kara, linux-fsdevel, linux-mm, linux-kernel
In-Reply-To: <20260619112852.104213-1-ljs@kernel.org>

On Fri, Jun 19, 2026 at 12:28:51PM +0100, Lorenzo Stoakes wrote:
> This reverts commit 7b32f64bc512b40b268776c5ac4d354b325b3197.
> 
> This patch caused a significant performance regression, so revert it, and
> we can determine whether the approach is sensible or not moving forwards,
> and if so how to avoid this.
> 
> There was a merge conflict with commit de97ae6222c1 ("mm/readahead: no
> PG_readahead on EOF"), care was taken to ensure that the revert retained the
> behaviour of this patch and cleanly reverts commit 7b32f64bc512 ("mm: limit
> filemap_fault readahead to VMA boundaries") only.
> 
> Fixes: 7b32f64bc512 ("mm: limit filemap_fault readahead to VMA boundaries")
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202606181547.617a6967-lkp@intel.com
> Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>

Reviewed-by: Pedro Falcato <pfalcato@suse.de>

-- 
Pedro



* Re: [PATCH] mm/page_alloc: unify __alloc_frozen_pages[_nolock]_noprof()
From: Brendan Jackman @ 2026-06-19 11:57 UTC (permalink / raw)
  To: Hao Ge, Brendan Jackman, Suren Baghdasaryan,
	Vlastimil Babka (SUSE)
  Cc: Brendan Jackman, Andrew Morton, Michal Hocko, Johannes Weiner,
	Zi Yan, Muchun Song, Oscar Salvador, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, Alexei Starovoitov, Harry Yoo (Oracle),
	Gregory Price, linux-mm, linux-kernel, linux-rt-devel
In-Reply-To: <45fcc57a-ec8d-46d6-9c28-065d001c081f@linux.dev>

On Thu Jun 18, 2026 at 2:22 AM UTC, Hao Ge wrote:
>
> On 2026/6/18 01:14, Brendan Jackman wrote:
>> On Wed Jun 17, 2026 at 4:49 PM UTC, Suren Baghdasaryan wrote:
>>> On Wed, Jun 17, 2026 at 9:39 AM Vlastimil Babka (SUSE)
>>> <vbabka@kernel.org> wrote:
>>>> +Cc Alexei
>>>>
>>>> On 6/17/26 17:29, Brendan Jackman wrote:
>>>>> Currently the core allocator code is controlled by ALLOC_NOLOCK, but the
>>>> It's not, it's ALLOC_TRYLOCK! Thanks for proving that we need to rename it
>>>> to ALLOC_NOLOCK:
>>>>
>>>> https://lore.kernel.org/all/DJ9QPTO2WXNB.10E88ZHWRDHB0@gmail.com/
>>>>
>>>> So you just won the job to do the rename :) I think it should be done before
>>>> this patch, so that the new usages and other _trylock names introduced here
>>>> can be done as _nolock outright.
>> Ack. I'll aim to send that tomorrow once Sashiko has caught up.
>>
>>>>> main entry point function is significantly different from the normal
>>>>> __alloc_frozen_pages_nolock(), this is tiring when reading the code.
>>>>>
>>>>> Plumb the ALLOC_NOLOCK control one layer up in the call stack: create
>>>>> an alloc_flags argument to __alloc_frozen_pages_nolock() (which is only
>>>>> exposed to mm/) and then turn the nolock variant into a thin wrapper
>>>>> that just sets that flag (as well as handling NUMA_NO_NODE, similar to
>>>>> how some of the wrappers in gfp.h do).
>>>>>
>>>>> Rationale that this doesn't change anything:
>>>>>
>>>>> 1. Simple bits: A bunch of the nolock-specific handling is just moved to
>>>>>     the new alloc_order_allowed(), alloc_trylock_allowed() and
>>>>>     gfp_trylock.
>>>>>
>>>>> 2. __alloc_frozen_pages_noprof() has some extra logic that wasn't
>>>>>     previously in the nolock variant:
>>>>>
>>>>>     a. Application of gfp_allowed_mask; this only affects early boot, and
>>>>>        only flags that affect the slowpath get changed here.
>>>>>
>>>>>     b. Application of current_gfp_context() - also only affects the
>>>>>        slowpath
>>>>>
>>>>> 3. The slowpath itself: this is now just explicitly skipped under
>>>>>     !ALLOC_TRYLOCK.
>>>> I'll have to ponder it more closely.
>>>>
>>>>> Ulterior motive: adding an alloc_flags arg to the allocator's
>>>>> mm-internal entrypoint can later be used to do more allocation
>>>>> customisation without needing to create new GFP flags.
>>>> Ack.
>>> I think this change might also help us in removing __GFP_NO_CODETAG
>> Nice, this actually looks trivial? I can probably just tack it onto the
>> v2 for this patch/series.
>>
>>> introduced in [1] and being the only user of __GFP_NO_OBJ_EXT once
>>> Vlastimil's patchset removing other __GFP_NO_OBJ_EXT users lands.
>>> CC'ing Hao as he is brainstorming ways to remove __GFP_NO_CODETAG, and
>>> this might be the answer.
>
>
> Hi Brendan, Suren,
>
> Thanks for CC'ing me, Suren. This is indeed a viable approach
> and I believe it brings us one step closer to removing
> __GFP_NO_CODETAG entirely.
>
> Brendan, I'd actually put together a rough local implementation
> earlier with mostly the same core idea as yours, and this change
> would indeed be minimal based on your patch.
>
> Thanks a lot for being interested in tacking this into your v2 patch series.

Oh, I just took a look and it's a bit more fiddly than I thought because
alloc_tag.c is actually in lib/ not mm/. 

How did you tackle that, can you share your implementation? It would be
nice if we can avoid exposing alloc_flags in gfp.h.



* Re: [PATCH v2] mm/page_alloc: drop flag-conversion "optimisation"
From: Brendan Jackman @ 2026-06-19 11:53 UTC (permalink / raw)
  To: Brendan Jackman, Brendan Jackman, Andrew Morton, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Johannes Weiner, Zi Yan
  Cc: Harry Yoo (Oracle), Gregory Price, linux-mm, linux-kernel
In-Reply-To: <DJ9KK3Z7BPYU.2HDB5IVO6Q40@linux.dev>

On Mon Jun 15, 2026 at 10:59 AM UTC, Brendan Jackman wrote:
> On Mon Jun 15, 2026 at 10:54 AM UTC, Brendan Jackman wrote:
>
>>
>> Signed-off-by: Brendan Jackman <jackmanb@google.com>
>> ---
>> Changes in v2:
>> - Updated alloc_flags_nofragment() too.
>> - Link to v1: https://lore.kernel.org/r/20260612-gfp-pessimisation-v1-1-936eb04202e7@google.com
>
> Sigh, maybe one day I'll send a patch without immediately following up
> with a "oops, I forgot to ...". But today is not that day.
>
> Forgot to add these tags from the v1:
>
> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Harry Yoo (Oracle) <harry@kernel.org>
>
> Thanks everyone for the prompt reviews.

Hi Andrew,

This one didn't make it into any of the latest mm-* branches, is
anything blocked here?



* Re: [PATCH mm-hotfixses] Revert "mm: limit filemap_fault readahead to VMA boundaries"
From: David Hildenbrand (Arm) @ 2026-06-19 11:40 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Frederick Mayle, Kalesh Singh, Matthew Wilcox, Jan Kara,
	linux-fsdevel, linux-mm, linux-kernel
In-Reply-To: <20260619112852.104213-1-ljs@kernel.org>

On 6/19/26 13:28, Lorenzo Stoakes wrote:
> This reverts commit 7b32f64bc512b40b268776c5ac4d354b325b3197.
> 
> This patch caused a significant performance regression, so revert it, and
> we can determine whether the approach is sensible or not moving forwards,
> and if so how to avoid this.
> 
> There was a merge conflict with commit de97ae6222c1 ("mm/readahead: no
> PG_readahead on EOF"), care was taken to ensure that the revert retained the
> behaviour of this patch and cleanly reverts commit 7b32f64bc512 ("mm: limit
> filemap_fault readahead to VMA boundaries") only.
> 
> Fixes: 7b32f64bc512 ("mm: limit filemap_fault readahead to VMA boundaries")
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202606181547.617a6967-lkp@intel.com
> Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
> ---

Thanks!

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David



* [PATCH] Docs/mm: fix documentation warning for GFP parameter in kmalloc_obj, kmalloc_objs and kmalloc_flex
From: Jakov Novak @ 2026-06-19 11:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Vlastimil Babka, Harry Yoo, Andrew Morton, Hao Li,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	linux-kernel-mentees, Shuah Khan, Jakov Novak

Compiling the documentation currently gives the warnings:

WARNING: ./include/linux/slab.h:1100 Excess function parameter 'GFP' description in 'kmalloc_obj'
WARNING: ./include/linux/slab.h:1112 Excess function parameter 'GFP' description in 'kmalloc_objs'
WARNING: ./include/linux/slab.h:1127 Excess function parameter 'GFP' description in 'kmalloc_flex'

This effectively omits the GFP parameter from the current kernel
documentation. This patch marks the "..." parameter with the previous
description of the GFP parameter along with an "optional" tag in
parentheses.

Signed-off-by: Jakov Novak <jakovnovak30@gmail.com>
---
 include/linux/slab.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index d4a873a16289..ee952784a150 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -1093,7 +1093,7 @@ void *kmalloc_nolock(size_t size, gfp_t gfp_flags, int node);
 /**
  * kmalloc_obj - Allocate a single instance of the given type
  * @VAR_OR_TYPE: Variable or type to allocate.
- * @GFP: GFP flags for the allocation.
+ * @...: GFP flags for the allocation (optional).
  *
  * Returns: newly allocated pointer to a @VAR_OR_TYPE on success, or NULL
  * on failure.
@@ -1105,7 +1105,7 @@ void *kmalloc_nolock(size_t size, gfp_t gfp_flags, int node);
  * kmalloc_objs - Allocate an array of the given type
  * @VAR_OR_TYPE: Variable or type to allocate an array of.
  * @COUNT: How many elements in the array.
- * @GFP: GFP flags for the allocation.
+ * @...: GFP flags for the allocation (optional).
  *
  * Returns: newly allocated pointer to array of @VAR_OR_TYPE on success,
  * or NULL on failure.
@@ -1118,7 +1118,7 @@ void *kmalloc_nolock(size_t size, gfp_t gfp_flags, int node);
  * @VAR_OR_TYPE: Variable or type to allocate (with its flex array).
  * @FAM: The name of the flexible array member of the structure.
  * @COUNT: How many flexible array member elements are desired.
- * @GFP: GFP flags for the allocation.
+ * @...: GFP flags for the allocation (optional).
  *
  * Returns: newly allocated pointer to @VAR_OR_TYPE on success, NULL on
  * failure. If @FAM has been annotated with __counted_by(), the allocation
-- 
2.54.0




* [PATCH mm-hotfixses] Revert "mm: limit filemap_fault readahead to VMA boundaries"
From: Lorenzo Stoakes @ 2026-06-19 11:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederick Mayle, Kalesh Singh, Matthew Wilcox, Jan Kara,
	linux-fsdevel, linux-mm, linux-kernel

This reverts commit 7b32f64bc512b40b268776c5ac4d354b325b3197.

This patch caused a significant performance regression, so revert it, and
we can determine whether the approach is sensible or not moving forwards,
and if so how to avoid this.

There was a merge conflict with commit de97ae6222c1 ("mm/readahead: no
PG_readahead on EOF"), care was taken to ensure that the revert retained the
behaviour of this patch and cleanly reverts commit 7b32f64bc512 ("mm: limit
filemap_fault readahead to VMA boundaries") only.

Fixes: 7b32f64bc512 ("mm: limit filemap_fault readahead to VMA boundaries")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202606181547.617a6967-lkp@intel.com
Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
---
 include/linux/pagemap.h | 2 --
 mm/filemap.c            | 4 ----
 mm/readahead.c          | 6 +-----
 3 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 627771e82eb1..2c3718d592d6 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1348,7 +1348,6 @@ struct readahead_control {
 	struct file_ra_state *ra;
 /* private: use the readahead_* accessors instead */
 	pgoff_t _index;
-	pgoff_t _max_index; /* limit readahead to _max_index, inclusive */
 	unsigned int _nr_pages;
 	unsigned int _batch_count;
 	bool dropbehind;
@@ -1362,7 +1361,6 @@ struct readahead_control {
 		.mapping = m,						\
 		.ra = r,						\
 		._index = i,						\
-		._max_index = ULONG_MAX,				\
 	}

 #define VM_READAHEAD_PAGES	(SZ_128K / PAGE_SIZE)
diff --git a/mm/filemap.c b/mm/filemap.c
index dc3a0e960b9f..17a64837597c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3312,8 +3312,6 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	unsigned int thp_order = 0;
 	unsigned short mmap_miss;

-	ractl._max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;
-
 	/* Use the readahead code, even if readahead is disabled */
 	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE)) {
 		/*
@@ -3409,7 +3407,6 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 		 * mmap read-around
 		 */
 		ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
-		ra->start = max(ra->start, vmf->vma->vm_pgoff);
 		ra->size = ra->ra_pages;
 		ra->async_size = ra->ra_pages / 4;
 		ra->order = 0;
@@ -3457,7 +3454,6 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 	}

 	if (folio_test_readahead(folio)) {
-		ractl._max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;
 		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		page_cache_async_ra(&ractl, folio, ra->ra_pages);
 	}
diff --git a/mm/readahead.c b/mm/readahead.c
index 38ce16e3fcbd..558c92957518 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -335,8 +335,6 @@ static void do_page_cache_ra(struct readahead_control *ractl,
 		return;

 	end_index = (isize - 1) >> PAGE_SHIFT;
-	if (end_index > ractl->_max_index)
-		end_index = ractl->_max_index;
 	if (index > end_index)
 		return;
 	/* Don't read past the page containing the last byte of the file */
@@ -487,7 +485,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	pgoff_t start = readahead_index(ractl);
 	pgoff_t index = start;
 	unsigned int min_order = mapping_min_folio_order(mapping);
-	pgoff_t limit;
+	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
 	pgoff_t mark;
 	unsigned int nofs;
 	int err = 0;
@@ -500,8 +498,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		goto fallback;
 	}

-	limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
-	limit = min(limit, ractl->_max_index);
 	if (limit > index + ra->size - 1) {
 		limit = index + ra->size - 1;
 		mark = index + ra->size - ra->async_size;
--
2.54.0



* Re: [linux-next:master] [mm] 7b32f64bc5: pts.svt-av1.Preset13.Bosphorus4K.frames_per_second 45.8% regression
From: Lorenzo Stoakes @ 2026-06-19 11:11 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Suren Baghdasaryan, Jan Kara, kernel test robot, Frederick Mayle,
	oe-lkp, lkp, Andrew Morton, Kalesh Singh, David Hildenbrand,
	Matthew Wilcox, linux-fsdevel, linux-mm
In-Reply-To: <ajQbXthzbr9xgUIM@pedro-suse>

On Thu, Jun 18, 2026 at 05:32:31PM +0100, Pedro Falcato wrote:
> On Thu, Jun 18, 2026 at 04:03:43PM +0000, Suren Baghdasaryan wrote:
> > On Thu, Jun 18, 2026 at 2:30 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > >
> > > On Thu, Jun 18, 2026 at 11:30:47AM +0200, Jan Kara wrote:
> > > > On Thu 18-06-26 16:00:42, kernel test robot wrote:
> > > > > Hello,
> > > > >
> > > > > kernel test robot noticed a 45.8% regression of pts.svt-av1.Preset13.Bosphorus4K.frames_per_second on:
> > > >
> > > > This one looks serious enough and real. It would be good to figure out what
> > > > happens in this benchmark that it benefits from the readahead across VMA
> > > > boundaries so much...
> > >
> > > I think a revert first no? This seems pretty huge for something that isn't key
> > > to the kernel, then a new attempt can be tried with this issue addressed
> > > perhaps?
> >
> > A quick search yields: "The
> > pts.svt-av1.Preset13.Bosphorus4K.frames_per_second is a benchmarking
> > metric from the Phoronix Test Suite that measures how many frames per
> > second a CPU can encode using the open-source SVT-AV1 video encoder."
> >
> > If this is a video encoding benchmark I would expect it to explicitly
> > prefetch the data from the disk before measuring the encoding speed.
> > If limiting readahead caused this regression, I suspect the benchmark
> > doesn't explicitly prefetch the data...
>
> Well, commonly video data doesn't actually fit in memory :)
>
> A quick look at the code (I think it's https://gitlab.com/AOMediaCodec/SVT-AV1/-/blob/master/Source/App/app_process_cmd.c#L821)
> suggests it is progressively mapping the file data for a given frame
> (or frames?). So the old behavior would result in page faults for a given
> frame starting readahead for the next few frames. This looks reasonable.
>
> FWIW I suspected this was a really weird case regarding mprotect or
> something, and I'm happy it isn't; but at least I had a suggestion for that -
> for this, maybe dropping the change (for now?) is the best course of action.

Am sending a revert.

>
> --
> Pedro

Thanks, Lorenzo



* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Fuad Tabba @ 2026-06-19 11:09 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-23-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Update tdx_gmem_post_populate() to handle cases where a source page is
> not explicitly provided. Instead of returning -EOPNOTSUPP when src_page
> is NULL, default to using the page associated with the destination PFN.
>
> This change allows for in-place memory conversion where the data is
> already present in the target PFN, ensuring the TDX module has a valid
> source page reference for the TDH.MEM.PAGE.ADD operation.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Sashiko flagged that when src_page = pfn_to_page(pfn),
tdh_mem_page_add gets identical physical addresses for r8
(destination) and r9 (source), reading with host KeyID and writing
with TD KeyID on the same address. I don't know enough about the TDX
module's operand constraints to confirm whether it allows overlapping
source and destination, but the concern looks legitimate.

nit: why does it have Sean's SoB?

Cheers,
/fuad


>  Documentation/virt/kvm/x86/intel-tdx.rst |  4 ++++
>  arch/x86/kvm/vmx/tdx.c                   | 11 ++++++++---
>  2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
> index 6a222e9d09541..74357fe87f9ec 100644
> --- a/Documentation/virt/kvm/x86/intel-tdx.rst
> +++ b/Documentation/virt/kvm/x86/intel-tdx.rst
> @@ -158,6 +158,10 @@ KVM_TDX_INIT_MEM_REGION
>  Initialize @nr_pages TDX guest private memory starting from @gpa with userspace
>  provided data from @source_addr. @source_addr must be PAGE_SIZE-aligned.
>
> +If guest_memfd in-place conversion is enabled, pass NULL for @source_addr to
> +initialize the memory region using memory contents already populated in
> +guest_memfd memory.
> +
>  Note, before calling this sub command, memory attribute of the range
>  [gpa, gpa + nr_pages] needs to be private.  Userspace can use
>  KVM_SET_MEMORY_ATTRIBUTES to set the attribute.
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index ffe9d0db58c59..56d10333c61a7 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>         if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
>                 return -EIO;
>
> -       if (!src_page)
> -               return -EOPNOTSUPP;
> +       if (!src_page) {
> +               if (!gmem_in_place_conversion)
> +                       return -EOPNOTSUPP;
> +
> +               src_page = pfn_to_page(pfn);
> +       }
>
>         kvm_tdx->page_add_src = src_page;
>         ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
> @@ -3278,7 +3282,8 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
>                         break;
>                 }
>
> -               region.source_addr += PAGE_SIZE;
> +               if (region.source_addr)
> +                       region.source_addr += PAGE_SIZE;
>                 region.gpa += PAGE_SIZE;
>                 region.nr_pages--;
>
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>



* Re: [RFC PATCH v2 1/3] mm/huge_memory: make persistent huge zero folio read-only
From: David Hildenbrand (Arm) @ 2026-06-19 11:09 UTC (permalink / raw)
  To: Xueyuan Chen
  Cc: dave.hansen, akpm, linux-mm, linux-kernel, linux-arm-kernel, x86,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, luto, peterz,
	hpa, ljs, liam, vbabka, rppt, surenb, mhocko, ziy, baolin.wang,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, yang, jannh
In-Reply-To: <20260619025553.226940-1-xueyuan.chen21@gmail.com>

On 6/19/26 04:55, Xueyuan Chen wrote:
> On Thu, Jun 18, 2026 at 02:36:25PM +0200, David Hildenbrand (Arm) wrote:
> 
> Hi, David
> 
> [...]
> 
>> Best to wait for some feedback.
> 
> Sure.
> 
>> I do wonder whether we want to pass an address instead of a page.
>>
>> https://lore.kernel.org/r/20260410151746.61150-2-kalyazin@amazon.com
>>
>> Wants to convert existing ones as well.
>>
>> That would imply that the caller must check for highmem.
>>
>> But then, we could just use existing set_memory_ro(), right?
>>
> 
> Agreed. Passing an address and reusing the existing set_memory_ro() 
> definitely makes things simpler.
> 
> However, there is an arm64 specific limitation:
> currently, the set_memory_r* APIs on arm64 only support the vmap
> region and do not handle linear map addresses.
> 
> If we go this route, should I extend the arm64 set_memory_r*
> implementation in the next version? The plan would be to make it check 
> for the bblm2 feature and modify the linear map PTEs accordingly.
> What do you think?

Good point! It's not really clear on which ranges set_memory*() is supposed to
work ...

arm64 only works on vmalloc/vmap, x86 and riscv on ordinary directmap ... what a
mess.

Having a new direct-map specific function with clear semantics might indeed
avoid even messing with that.

So, yeah, given that we have

	set_direct_map_invalid_noflush
	set_direct_map_default_noflush
	set_direct_map_valid_noflush

Let's add a

	set_direct_map_ro()

Or (my preference)

	change_direct_map_ro()

But given the existing naming scheme ... maybe just set_direct_map_ro() and
we'll clean this up another day.


Now, should there also be a "_noflush" in there, or who is supposed to flush the
TLB (or don't we flush at all, because it's used early during boot so far)?

In any case, for this function we should add excessive documentation and define
clear semantics.

-- 
Cheers,

David



* Re: [Patch v2] mm/page_vma_mapped: revalidate and do proper check before return device-private pmd
From: Lorenzo Stoakes @ 2026-06-19 11:04 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Wei Yang, akpm, riel, liam, vbabka, harry, jannh, balbirs, ziy,
	sj, linux-mm, stable
In-Reply-To: <5e7f7fe5-221a-4fca-aa76-297ae19eb80d@kernel.org>

On Fri, Jun 19, 2026 at 12:48:26PM +0200, David Hildenbrand (Arm) wrote:
> On 6/19/26 12:44, Lorenzo Stoakes wrote:
> > -cc wrong email
> >
> > On Tue, Jun 16, 2026 at 06:34:36AM +0000, Wei Yang wrote:
> >> For pmd_trans_huge() and pmd_is_migration_entry(), we do the following
> >> before returning the pmd entry:
> >>
> >>   * re-validate pmd entry after PTL
> >>   * check PVMW_MIGRATION
> >>   * check_pmd()
> >>   * handle on pte level if split under us
> >>
> >> But for device-private pmd, we just return after pmd_lock().
> >>
> >> This may return an improper entry, e.g. if we are looking for a migration
> >> entry, device-private entry could still be returned, which leads to data
> >> corruption.
> >
> > I don't think this is quite clear?
> >
> > How about:
> >
> > 	If a softleaf entry is present, the existing code simply acquires the
> > 	PMD lock and returns success even if PVMW_MIGRATION is set (indicating a
> > 	migration entry is sought), meaning that the caller can incorrectly
> > 	interpret the entry as something it is not, causing data corruption.
> >
> >>
> >> This patch fixes commit 65edfda6f3f2 ("mm/rmap: extend rmap and migration
> >> support device-private entries") by following the same pattern as
> >> pmd_trans_huge() and pmd_is_migration_entry() for device private entry.
> >>
> >> While at it, it cleans up the pmd entry handling in page_vma_mapped_walk().
> >>
> >>   * Instead of handling trans huge/migration entry/device private entry
> >>     in a mixed manner, we put each case into its own if condition and
> >>     handle with the same pattern.
> >>   * Also we grab the PTL and make sure the pmd has not changed under us after
> >>     the above check, instead of doing the check with the PTL held.
> >>   * restart the process if pmd is changed under us
> >
> > You're doing quite a bit for a fix and you're putting it all in one place.
> >
> > How about do the fix as 1 patch, and then cleanups as other ones? It helps with
> > review too :)
> >
> > It's a general rule of thumb that if you do more than one of moving, refactoring
> > or changing code, to do them as separate patches so a reviewer/somebody
> > bisecting can clearly separate each.
> >
> > Also PLEASE do not add new functionality (this lock recheck) in a fixes
> > patch. We'll end up backporting new logic that way.
> >
> > Make the fixes bit _minimal_.
>
> To be fair, I asked for this
>
> https://lore.kernel.org/all/2d48ef0d-1110-4a9d-adcb-f701a1ce2cfa@kernel.org/
>
> But given that Wei mostly used my quick draft without properly checking the
> implications, yeah, let's fix it first separately.

Ack yeah sorry I mean I agree that it needs cleanup just has to be done in the
right way which clearly I think we agree on :)

>
> I can then follow up with a proper cleanup.

Thanks!

>
> >
> > I think in general Andrew prefers separate fixes patches so I'd just make the
> > _minimal_ change that fixes this for the backport, and the cleanup stuff as a
> > separate series.
> >
>
> The issue is that the existing handling is just crap, and to fix it, we're
> adding more crap. But yeah, let's add more crap first before we clean it up
> properly.

I couldn't agree more and to be clear - I hate how this is right now.

But I think for the fix we have to wade in the crap first then clean it up
afterwards... :)

>
>
> --
> Cheers,
>
> David

Cheers, Lorenzo



* Re: [PATCH v8 11/46] KVM: Consolidate private memory and guest_memfd ifdeffery in kvm_host.h
From: Fuad Tabba @ 2026-06-19 11:02 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-11-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Move the kvm_arch_has_private_mem() stub and a few guest_memfd function
> definitions/declarations "down" in kvm_host.h to utilize existing #ifdefs,
> and so that related code is clustered together.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

SoB fix please. With that...

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad
> ---
>  include/linux/kvm_host.h | 37 ++++++++++++++++---------------------
>  1 file changed, 16 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index acb552745b428..9c1cf1a6559e3 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -722,27 +722,6 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>  }
>  #endif
>
> -#ifndef kvm_arch_has_private_mem
> -static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
> -{
> -       return false;
> -}
> -#endif
> -
> -#ifdef CONFIG_KVM_GUEST_MEMFD
> -bool kvm_arch_supports_gmem_init_shared(struct kvm *kvm);
> -
> -static inline u64 kvm_gmem_get_supported_flags(struct kvm *kvm)
> -{
> -       u64 flags = GUEST_MEMFD_FLAG_MMAP;
> -
> -       if (!kvm || kvm_arch_supports_gmem_init_shared(kvm))
> -               flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> -
> -       return flags;
> -}
> -#endif
> -
>  #ifndef kvm_arch_has_readonly_mem
>  static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
>  {
> @@ -2572,6 +2551,11 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>  #else
>  #define gmem_in_place_conversion false
>
> +static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> +       return false;
> +}
> +
>  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>  {
>         return false;
> @@ -2580,6 +2564,17 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>
>  #ifdef CONFIG_KVM_GUEST_MEMFD
>  bool kvm_gmem_is_private(struct kvm *kvm, gfn_t gfn);
> +bool kvm_arch_supports_gmem_init_shared(struct kvm *kvm);
> +
> +static inline u64 kvm_gmem_get_supported_flags(struct kvm *kvm)
> +{
> +       u64 flags = GUEST_MEMFD_FLAG_MMAP;
> +
> +       if (!kvm || kvm_arch_supports_gmem_init_shared(kvm))
> +               flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> +
> +       return flags;
> +}
>
>  int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>                      gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>



* Re: [PATCH v8 22/46] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Fuad Tabba @ 2026-06-19 11:01 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-22-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Michael Roth <michael.roth@amd.com>
>
> Make the source page for populating an SNP guest_memfd instance optional
> if in-place conversion/population is enabled.  If KVM can convert the page
> in-place, then it's possible for guest memory to be initialized directly
> from userspace by mmap()'ing the guest_memfd and writing to it while the
> corresponding GPA ranges are in a 'shared' state, before converting them
> to the 'private' state expected by KVM_SEV_SNP_LAUNCH_UPDATE.
>
> Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
> for 'uaddr' to be set to NULL when in-place conversion is enabled, which
> SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
> copy in data from a separate memory location. Continue to enforce
> non-NULL when PRIVATE is tracked per-VM, not per-guest_memfd.
>
> Signed-off-by: Michael Roth <michael.roth@amd.com>
> [Added src_page check in error handling path when the firmware command fails]
> [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> [sean: drop explicit vm_memory_attributes references]
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  Documentation/virt/kvm/x86/amd-memory-encryption.rst | 13 +++++++++----
>  arch/x86/kvm/svm/sev.c                               | 16 +++++++++++-----
>  virt/kvm/kvm_main.c                                  |  1 +
>  3 files changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index bd04a908a8dbd..29409297f1ef0 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -503,7 +503,8 @@ secrets.
>
>  It is required that the GPA ranges initialized by this command have had the
>  KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set in advance. See the documentation
> -for KVM_SET_MEMORY_ATTRIBUTES for more details on this aspect.
> +for KVM_SET_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES2 for more details on
> +this aspect.
>
>  Upon success, this command is not guaranteed to have processed the entire
>  range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
> @@ -511,9 +512,13 @@ range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
>  remaining range that has yet to be processed. The caller should continue
>  calling this command until those fields indicate the entire range has been
>  processed, e.g. ``len`` is 0, ``gfn_start`` is equal to the last GFN in the
> -range plus 1, and ``uaddr`` is the last byte of the userspace-provided source
> -buffer address plus 1. In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO,
> -``uaddr`` will be ignored completely.
> +range plus 1, and ``uaddr`` (if specified) is the last byte of the
> +userspace-provided source buffer address plus 1.
> +
> +In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO, ``uaddr`` will be
> +ignored completely. For all other page types, ``uaddr`` is optional if in-place
> +conversion is enable, i.e. when the destination can also be the source, and is

Typo: "is enable" -> "is enabled".

"when the destination can also be the source" is hard to parse without
context. Maybe: "i.e. when the data has been written directly to
guest_memfd while the range was in the shared state".

Also, how does userspace discover whether in-place conversion is
enabled? A cross-reference to KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES
would help here.

Cheers,
/fuad
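
The retry contract in the quoted documentation (keep calling until ``len``
is 0, with ``uaddr`` advancing only when a source buffer was supplied) can be
sketched in plain C. This is a mock under stated assumptions, not the KVM
ioctl: struct sketch_launch_update, sketch_launch_update_once() and
sketch_launch_update_all() are hypothetical names, the 4096-byte page size is
assumed, and the page-type handling (KVM_SEV_SNP_PAGE_TYPE_ZERO) is left out.

```c
#include <stdint.h>

/* Assumed page size, for illustration only. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Hypothetical mirror of the three fields the documentation mentions. */
struct sketch_launch_update {
	uint64_t gfn_start;
	uint64_t uaddr;		/* 0 models "no separate source buffer" */
	uint64_t len;		/* bytes still to process */
};

/*
 * Mock of a single call: processes at most max_pages pages and updates
 * the fields to describe the remaining range. Returns -1 if no progress
 * can be made (max_pages is 0 or len is smaller than one page).
 */
static inline int sketch_launch_update_once(struct sketch_launch_update *p,
					    uint64_t max_pages)
{
	uint64_t pages = p->len / SKETCH_PAGE_SIZE;
	uint64_t count = pages < max_pages ? pages : max_pages;

	if (count == 0)
		return -1;

	p->gfn_start += count;
	p->len -= count * SKETCH_PAGE_SIZE;
	/* Only advance the source address if one was provided. */
	if (p->uaddr)
		p->uaddr += count * SKETCH_PAGE_SIZE;

	return 0;
}

/* Caller side: repeat until len reaches 0; returns the number of calls. */
static inline unsigned int
sketch_launch_update_all(struct sketch_launch_update *p, uint64_t max_pages)
{
	unsigned int calls = 0;

	while (p->len) {
		if (sketch_launch_update_once(p, max_pages))
			break;
		calls++;
	}

	return calls;
}
```

With a NULL (0) ``uaddr`` the loop still terminates on ``len`` alone, which
is why the in-place case does not need the source address to advance.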

> +required if in-place conversion is disabled.
>
>  Parameters (in): struct  kvm_sev_snp_launch_update
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 74fb15551e83f..2b7569b6a8609 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2330,7 +2330,13 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>         int level;
>         int ret;
>
> -       if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page))
> +       /*
> +        * A source page is required if in-place conversion isn't enabled, as
> +        * the data needs to come from a separate physical page.  Zero pages
> +        * are exempt as they don't consume a source page.
> +        */
> +       if (!gmem_in_place_conversion &&
> +           sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page)
>                 return -EINVAL;
>
>         ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level);
> @@ -2377,7 +2383,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>          */
>         if (ret && !snp_page_reclaim(kvm, pfn) &&
>             sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
> -           sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) {
> +           sev_populate_args->fw_error == SEV_RET_INVALID_PARAM && src_page) {
>                 void *src_vaddr = kmap_local_page(src_page);
>                 void *dst_vaddr = kmap_local_pfn(pfn);
>
> @@ -2410,8 +2416,8 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
>         if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
>                 return -EFAULT;
>
> -       pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d\n", __func__,
> -                params.gfn_start, params.len, params.type, params.flags);
> +       pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d src %llx\n", __func__,
> +                params.gfn_start, params.len, params.type, params.flags, params.uaddr);
>
>         if (!params.len || !PAGE_ALIGNED(params.len) || params.flags ||
>             (params.type != KVM_SEV_SNP_PAGE_TYPE_NORMAL &&
> @@ -2468,7 +2474,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
>
>         params.gfn_start += count;
>         params.len -= count * PAGE_SIZE;
> -       if (params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> +       if (src && params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
>                 params.uaddr += count * PAGE_SIZE;
>
>         if (copy_to_user(u64_to_user_ptr(argp->data), &params, sizeof(params)))
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 044486f128c37..dd1d18a1d2f68 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -103,6 +103,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
>
>  #ifdef kvm_arch_has_private_mem
>  bool __ro_after_init gmem_in_place_conversion = false;
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(gmem_in_place_conversion);
>  #endif
>
>  #define MEMORY_ATTRIBUTES_MATCH(one, two)                              \
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>



* Re: [RFC PATCH 4/6] arm64: mm: add helper to fill execmem with trapping instructions
From: Mike Rapoport @ 2026-06-19 10:58 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Adrian Barnaś, linux-arm-kernel, linux-mm, Catalin Marinas,
	Will Deacon, David Hildenbrand, Ard Biesheuvel, Christoph Lameter,
	Yang Shi, Brendan Jackman
In-Reply-To: <666a981f-44b6-4c19-a641-c1eff44fe54f@arm.com>

On Fri, Jun 19, 2026 at 11:54:25AM +0100, Ryan Roberts wrote:
> On 11/06/2026 14:01, Adrian Barnaś wrote:
> > Implement the architecture-specific execmem_fill_trapping_insns() helper
> > to poison executable memory regions.
> > 
> > When CONFIG_ARCH_HAS_EXECMEM_ROX is enabled, the execmem subsystem
> > requires a way to fill unused or freed executable memory with
> > architecture-specific trapping instructions. This implementation fills
> > the specified region with AARCH64_BREAK_FAULT instructions and flushes
> > the icache to ensure the traps are immediately visible to execution.
> > 
> > Signed-off-by: Adrian Barnaś <abarnas@google.com>
> > ---
> >  arch/arm64/mm/init.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> > 
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index c673a9a839dd..71aa745e0bef 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -408,6 +408,20 @@ void dump_mem_limit(void)
> >  }
> >  
> >  #ifdef CONFIG_EXECMEM
> > +
> > +#ifdef CONFIG_ARCH_HAS_EXECMEM_ROX
> > +void execmem_fill_trapping_insns(void *ptr, size_t size)
> > +{
> > +	int nr_inst = size / AARCH64_INSN_SIZE;
> 
> The x86 instruction is 1 byte, so it can exactly fill any provided buffer. For
> arm64, the instruction is 4 bytes so we can only exactly fill the buffer if its
> size is 4-byte aligned.
> 
> I'm guessing that in practice, size will always be page aligned so we are good?

The size is always page aligned:

void *execmem_alloc(enum execmem_type type, size_t size)
{
	...

	size = PAGE_ALIGN(size);

-- 
Sincerely yours,
Mike.
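
The alignment argument above can be checked with a small userspace sketch.
It assumes a 4096-byte page and a 4-byte instruction; sketch_page_align()
and sketch_fill_trapping_insns() are hypothetical stand-ins for the kernel's
PAGE_ALIGN() and the proposed helper, and the fill pattern is BRK #0 used as
a placeholder rather than the kernel's AARCH64_BREAK_FAULT value.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; the kernel defines its own. */
#define SKETCH_PAGE_SIZE	4096UL
#define SKETCH_INSN_SIZE	4UL
#define SKETCH_TRAP_INSN	0xd4200000U	/* BRK #0, placeholder pattern */

/* Round size up to the next multiple of the (assumed) page size. */
static inline size_t sketch_page_align(size_t size)
{
	return (size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * Fill the buffer with whole instructions and return the number of
 * trailing bytes that could not be covered. A non-zero return is the
 * case a WARN_ON_ONCE() would catch.
 */
static inline size_t sketch_fill_trapping_insns(void *ptr, size_t size)
{
	uint32_t *insn = ptr;
	size_t nr_inst = size / SKETCH_INSN_SIZE;

	for (size_t i = 0; i < nr_inst; i++)
		insn[i] = SKETCH_TRAP_INSN;

	return size % SKETCH_INSN_SIZE;
}
```

Because any page-aligned size is a multiple of 4, the remainder is always
zero for sizes that went through the rounding step; only a caller passing an
unrounded size would leave trailing bytes unfilled.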



* Re: [RFC PATCH 4/6] arm64: mm: add helper to fill execmem with trapping instructions
From: Ryan Roberts @ 2026-06-19 10:54 UTC (permalink / raw)
  To: Adrian Barnaś, linux-arm-kernel
  Cc: linux-mm, Catalin Marinas, Will Deacon, David Hildenbrand,
	Mike Rapoport (Microsoft), Ard Biesheuvel, Christoph Lameter,
	Yang Shi, Brendan Jackman
In-Reply-To: <20260611130144.1385343-5-abarnas@google.com>

On 11/06/2026 14:01, Adrian Barnaś wrote:
> Implement the architecture-specific execmem_fill_trapping_insns() helper
> to poison executable memory regions.
> 
> When CONFIG_ARCH_HAS_EXECMEM_ROX is enabled, the execmem subsystem
> requires a way to fill unused or freed executable memory with
> architecture-specific trapping instructions. This implementation fills
> the specified region with AARCH64_BREAK_FAULT instructions and flushes
> the icache to ensure the traps are immediately visible to execution.
> 
> Signed-off-by: Adrian Barnaś <abarnas@google.com>
> ---
>  arch/arm64/mm/init.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index c673a9a839dd..71aa745e0bef 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -408,6 +408,20 @@ void dump_mem_limit(void)
>  }
>  
>  #ifdef CONFIG_EXECMEM
> +
> +#ifdef CONFIG_ARCH_HAS_EXECMEM_ROX
> +void execmem_fill_trapping_insns(void *ptr, size_t size)
> +{
> +	int nr_inst = size / AARCH64_INSN_SIZE;

The x86 instruction is 1 byte, so it can exactly fill any provided buffer. For
arm64, the instruction is 4 bytes so we can only exactly fill the buffer if its
size is 4-byte aligned.

I'm guessing that in practice, size will always be page aligned so we are good?
Perhaps worth a WARN_ON_ONCE() though?

Thanks,
Ryan

> +	__le32 *updptr = ptr;
> +
> +	for (int i = 0; i < nr_inst; i++)
> +		updptr[i] = cpu_to_le32(AARCH64_BREAK_FAULT);
> +
> +	flush_icache_range((unsigned long)ptr, (unsigned long)ptr + size);
> +}
> +#endif
> +
>  static u64 module_direct_base __ro_after_init = 0;
>  static u64 module_plt_base __ro_after_init = 0;
>  




* Re: [PATCH v8 21/46] KVM: guest_memfd: Zero page while getting pfn
From: Fuad Tabba @ 2026-06-19 10:51 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Move the folio initialization logic from kvm_gmem_get_pfn() into
> __kvm_gmem_get_pfn() to also zero pages if the page is to be used in
> kvm_gmem_populate().
>
> With in-place conversion, the existing data in a guest_memfd page can be
> populated into guest memory through platform-specific ioctls.
>
> Without first zeroing the page obtained using __kvm_gmem_get_pfn(), it
> might contain uninitialized host memory, which would leak to the guest if
> the populate completes.
>
> guest_memfd pages are zeroed at most once in the page's entire lifetime
> with guest_memfd, and that is tracked using the uptodate flag.
>
> Zeroing the page in __kvm_gmem_get_pfn() is chosen over zeroing in
> kvm_gmem_get_folio() since other flows, such as a future write() syscall,
> can get a page, write to the page and then set page uptodate without
> zeroing.
>
> This aligns with the concept of zeroing before first use - the other place
> where zeroing happens is in kvm_gmem_fault_user_mapping().
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad
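
The "zero at most once, tracked by the uptodate flag" rule from the commit
message can be illustrated with a userspace mock. struct sketch_folio and
sketch_zero_on_first_use() are hypothetical names; the real code operates on
struct folio with folio_test_uptodate(), clear_highpage() and
folio_mark_uptodate(), as shown in the quoted diff.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a folio: a flag plus a small data buffer. */
struct sketch_folio {
	bool uptodate;
	unsigned char data[64];
};

/*
 * Zero the buffer only if it has never been marked uptodate, then set
 * the flag. Returns true if zeroing happened on this call.
 */
static inline bool sketch_zero_on_first_use(struct sketch_folio *folio)
{
	if (folio->uptodate)
		return false;

	memset(folio->data, 0, sizeof(folio->data));
	folio->uptodate = true;
	return true;
}
```

A second call leaves the contents alone, so data written after the first
use (for example by a populate step) is not wiped.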
> ---
>  virt/kvm/guest_memfd.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 90bc1a26512b6..86c9f5b0863cb 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1137,6 +1137,11 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file,
>                 return ERR_PTR(-EHWPOISON);
>         }
>
> +       if (!folio_test_uptodate(folio)) {
> +               clear_highpage(folio_page(folio, 0));
> +               folio_mark_uptodate(folio);
> +       }
> +
>         *pfn = folio_file_pfn(folio, index);
>         if (max_order)
>                 *max_order = 0;
> @@ -1166,11 +1171,6 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>                 goto out;
>         }
>
> -       if (!folio_test_uptodate(folio)) {
> -               clear_highpage(folio_page(folio, 0));
> -               folio_mark_uptodate(folio);
> -       }
> -
>         if (kvm_gmem_is_private_mem(inode, index))
>                 r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>



* Re: [Patch v2] mm/page_vma_mapped: revalidate and do proper check before return device-private pmd
From: David Hildenbrand (Arm) @ 2026-06-19 10:48 UTC (permalink / raw)
  To: Lorenzo Stoakes, Wei Yang
  Cc: akpm, riel, liam, vbabka, harry, jannh, balbirs, ziy, sj,
	linux-mm, stable
In-Reply-To: <ajUXNjRMraKb6k2n@lucifer>

On 6/19/26 12:44, Lorenzo Stoakes wrote:
> -cc wrong email
> 
> On Tue, Jun 16, 2026 at 06:34:36AM +0000, Wei Yang wrote:
>> For pmd_trans_huge() and pmd_is_migration_entry(), we do the following
>> before returning the pmd entry:
>>
>>   * re-validate pmd entry after PTL
>>   * check PVMW_MIGRATION
>>   * check_pmd()
>>   * handle on pte level if split under us
>>
>> But for device-private pmd, we just return after pmd_lock().
>>
>> This may return an improper entry, e.g. if we are looking for a migration
>> entry, a device-private entry could still be returned, which leads to data
>> corruption.
> 
> I don't think this is quite clear?
> 
> How about:
> 
> 	If a softleaf entry is present, the existing code simply acquires the
> 	PMD lock and returns success even if PVMW_MIGRATION is set (indicating a
> 	migration entry is sought), meaning that the caller can incorrectly
> 	interpret the entry as something it is not, causing data corruption.
> 
>>
>> This patch fixes commit 65edfda6f3f2 ("mm/rmap: extend rmap and migration
>> support device-private entries") by following the same pattern as
>> pmd_trans_huge() and pmd_is_migration_entry() for device-private entries.
>>
>> While at it, it cleans up the pmd entry handling in page_vma_mapped_walk().
>>
>>   * Instead of handling trans huge/migration entry/device private entry
>>     in a mixed manner, we put each case into its own if condition and
>>     handle with the same pattern.
>>   * Also we grab the PTL and make sure the pmd has not changed under us
>>     after the above check, instead of doing the check with the PTL held.
>>   * Restart the process if the pmd has changed under us.
> 
> You're doing quite a bit for a fix and you're putting it all in one place.
> 
> How about doing the fix as one patch, and then the cleanups as other ones? It
> helps with review too :)
> 
> It's a general rule of thumb that if you do more than one of moving, refactoring
> or changing code, to do them as separate patches so a reviewer/somebody
> bisecting can clearly separate each.
> 
> Also PLEASE do not add new functionality (this lock recheck) in a fixes
> patch. We'll end up backporting new logic that way.
> 
> Make the fixes bit _minimal_.

To be fair, I asked for this:

https://lore.kernel.org/all/2d48ef0d-1110-4a9d-adcb-f701a1ce2cfa@kernel.org/

But given that Wei mostly used my quick draft without properly checking the
implications, yeah, let's fix it first separately.

I can then follow up with a proper cleanup.

> 
> I think in general Andrew prefers separate fixes patches so I'd just make the
> _minimal_ change that fixes this for the backport, and the cleanup stuff as a
> separate series.
> 

The issue is that the existing handling is just crap, and to fix it, we're
adding more crap. But yeah, let's add more crap first before we clean it up
properly.


-- 
Cheers,

David
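
The "peek, lock, revalidate, restart" pattern discussed in this thread can be illustrated outside the kernel. This is a rough user-space analogy, not the page_vma_mapped_walk() code: struct sketch_slot, enum sketch_kind and sketch_lock_if_kind() are hypothetical names, and a C11 atomic_flag spinlock stands in for the PMD lock. The point it shows is that an entry of the wrong kind (e.g. device-private when a migration entry is sought) is rejected rather than returned, and that a value seen without the lock is re-read once the lock is held.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_kind {
	SKETCH_NONE,
	SKETCH_HUGE,
	SKETCH_MIGRATION,
	SKETCH_DEVICE_PRIVATE,
};

/* Hypothetical slot: a spinlock plus an entry kind that may change. */
struct sketch_slot {
	atomic_flag lock;
	_Atomic int kind;
};

static void sketch_lock(struct sketch_slot *s)
{
	while (atomic_flag_test_and_set(&s->lock))
		; /* spin until the flag was previously clear */
}

static void sketch_unlock(struct sketch_slot *s)
{
	atomic_flag_clear(&s->lock);
}

/*
 * Returns true with the lock held only if the slot holds the kind the
 * caller asked for. Returns false, without the lock, otherwise.
 */
static bool sketch_lock_if_kind(struct sketch_slot *s, enum sketch_kind want)
{
	assert(s != NULL);
	for (;;) {
		int seen = atomic_load(&s->kind); /* lockless peek */

		if (seen != (int)want)
			return false; /* wrong kind: never hand it back */

		sketch_lock(s);
		if (atomic_load(&s->kind) == seen)
			return true; /* still valid under the lock */

		sketch_unlock(s); /* changed under us: restart */
	}
}
```

A caller that gets true back is expected to call sketch_unlock() when it is done with the entry.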



* Re: [PATCH v8 19/46] KVM: guest_memfd: Use actual size for invalidation in kvm_gmem_release()
From: Fuad Tabba @ 2026-06-19 10:46 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-19-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> __kvm_gmem_invalidate_start() and __kvm_gmem_invalidate_end() actually do
> not specially handle -1ul. -1ul is used as a huge number, which legal
> indices do not exceed, and hence the invalidation works as expected.
>
> Since a later patch is going to make use of the exact range, calculate the
> size of the guest_memfd inode and use it as the end range for invalidating
> SPTEs.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

>  virt/kvm/guest_memfd.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index d163559da0235..d72ecbfcc3144 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -366,6 +366,7 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
>
>  static int kvm_gmem_release(struct inode *inode, struct file *file)
>  {
> +       pgoff_t end = i_size_read(inode) >> PAGE_SHIFT;
>         struct gmem_file *f = file->private_data;
>         struct kvm_memory_slot *slot;
>         struct kvm *kvm = f->kvm;
> @@ -396,9 +397,9 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
>          * Zap all SPTEs pointed at by this file.  Do not free the backing
>          * memory, as its lifetime is associated with the inode, not the file.
>          */
> -       __kvm_gmem_invalidate_start(f, 0, -1ul,
> +       __kvm_gmem_invalidate_start(f, 0, end,
>                                     kvm_gmem_get_invalidate_filter(inode));
> -       __kvm_gmem_invalidate_end(f, 0, -1ul);
> +       __kvm_gmem_invalidate_end(f, 0, end);
>
>         list_del(&f->entry);
>
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
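
The range arithmetic behind this change can be sketched as follows. This is an illustration only, assuming a 4 KiB page size, a page-aligned file size and an exclusive end index; sketch_end_index() and sketch_pages_in_range() are hypothetical helpers, not kernel functions. It shows why -1ul happened to work (it is clamped by the number of pages that actually exist) and how the exact end is derived from the size.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumes 4 KiB pages */

/* Exclusive end index for a file of size_bytes (must be page-aligned). */
static unsigned long sketch_end_index(uint64_t size_bytes)
{
	assert((size_bytes & ((1ull << SKETCH_PAGE_SHIFT) - 1)) == 0);
	return (unsigned long)(size_bytes >> SKETCH_PAGE_SHIFT);
}

/* Number of existing pages covered by the index range [start, end). */
static unsigned long sketch_pages_in_range(unsigned long start,
					   unsigned long end,
					   unsigned long nr_pages)
{
	if (end > nr_pages)
		end = nr_pages; /* a huge end such as -1ul is clamped here */
	return end > start ? end - start : 0;
}
```

Both forms cover the same pages for an existing file; the exact end only matters once later code depends on the precise range.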



