* Re: [PATCH mm-unstable v17 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: David Hildenbrand (Arm) @ 2026-05-20 14:43 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, Usama Arif
In-Reply-To: <CAA1CXcCD5ooRJonAVp2LvnoCrQwcs1-NsAYomXbHTVNSe5X0cw@mail.gmail.com>
>> Calculate maximum allowed empty PTEs or PTEs mapping the shared zeropage ... ?
>>
>>> + * PTEs for the given collapse operation.
>>
>> We usually indent here (second line of subject), I think. Same applies to the
>> other doc below.
>
> Hmm tbh I couldn't find a example of what you meant here. There are
> some that put a space between the first sentence and the @ list.
Yeah, we usually try to make it fit in a single line.
But nevermind, leave it as is.
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v6 12/43] KVM: guest_memfd: Call arch invalidate hooks on conversion
From: Fuad Tabba @ 2026-05-20 14:30 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-12-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> When memory in guest_memfd is converted from private to shared, the
> platform-specific state associated with the guest-private pages must be
> invalidated or cleaned up.
>
> Iterate over the folios in the affected range and call the
> kvm_arch_gmem_invalidate() hook for each PFN range. This allows
> architectures to perform necessary teardown, such as updating hardware
> metadata or encryption states, before the pages are transitioned to the
> shared state.
>
> Invoke this helper after indicating to KVM's mmu code that an invalidation
> is in progress to stop in-flight page faults from succeeding.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Minor nit below, but lgtm.
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 41 insertions(+)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 9d82642a025e9..baf4b88dead1f 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -603,6 +603,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> return safe;
> }
>
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> + struct folio_batch fbatch;
> + pgoff_t next = start;
> + int i;
> +
> + folio_batch_init(&fbatch);
> + while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
> + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> + struct folio *folio = fbatch.folios[i];
> + pgoff_t start_index, end_index;
> + kvm_pfn_t start_pfn, end_pfn;
> +
> + start_index = max(start, folio->index);
> + end_index = min(end, folio_next_index(folio));
> + /*
> + * end_index is either in folio or points to
> + * the first page of the next folio. Hence,
> + * all pages in range [start_index, end_index)
> + * are contiguous.
> + */
> + start_pfn = folio_file_pfn(folio, start_index);
> + end_pfn = start_pfn + end_index - start_index;
> +
> + kvm_arch_gmem_invalidate(start_pfn, end_pfn);
> + }
> +
> + folio_batch_release(&fbatch);
> + cond_resched();
> + }
> +}
> +#else
> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> +#endif
> +
> static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> size_t nr_pages, uint64_t attrs,
> pgoff_t *err_index)
> @@ -643,7 +679,12 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> */
>
> kvm_gmem_invalidate_begin(inode, start, end);
> +
> + if (!to_private)
> + kvm_gmem_invalidate(inode, start, end);
> +
> mas_store_prealloc(&mas, xa_mk_value(attrs));
> +
Why the unrelated extra space?
> kvm_gmem_invalidate_end(inode, start, end);
> out:
> filemap_invalidate_unlock(mapping);
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 11/43] KVM: guest_memfd: Ensure pages are not in use before conversion
From: Fuad Tabba @ 2026-05-20 14:28 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-11-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> When converting memory to private in guest_memfd, it is necessary to ensure
> that the pages are not currently being accessed by any other part of the
> kernel or userspace to avoid any current user writing to guest private
> memory.
>
> guest_memfd checks for unexpected refcounts to determine whether a page is
> still in use. The only expected refcounts after unmapping the range
> requested for conversion are those that are held by guest_memfd itself.
>
> Update the kvm_memory_attributes2 structure to include an error_offset
> field. This allows KVM to report the exact offset where a conversion
> failed to userspace. If the safety check fails, return -EAGAIN and copy
> the error_offset back to userspace so that it can potentially retry the
> operation or handle the failure gracefully.
>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> include/uapi/linux/kvm.h | 3 ++-
> virt/kvm/guest_memfd.c | 65 ++++++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 62 insertions(+), 6 deletions(-)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e6bbf68a83813..0b55258573d3d 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1658,7 +1658,8 @@ struct kvm_memory_attributes2 {
> __u64 size;
> __u64 attributes;
> __u64 flags;
> - __u64 reserved[12];
> + __u64 error_offset;
> + __u64 reserved[11];
> };
>
> #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 91e89b188f583..9d82642a025e9 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -572,9 +572,42 @@ static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
> return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
> }
>
> +static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> + size_t nr_pages, pgoff_t *err_index)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + const int filemap_get_folios_refcount = 1;
> + pgoff_t last = start + nr_pages - 1;
> + struct folio_batch fbatch;
> + bool safe = true;
> + int i;
> +
> + folio_batch_init(&fbatch);
> + while (safe && filemap_get_folios(mapping, &start, last, &fbatch)) {
> +
> + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> + struct folio *folio = fbatch.folios[i];
> +
> + if (folio_ref_count(folio) !=
> + folio_nr_pages(folio) + filemap_get_folios_refcount) {
> + safe = false;
> + *err_index = folio->index;
> + break;
> + }
> + }
> +
> + folio_batch_release(&fbatch);
> + cond_resched();
> + }
> +
> + return safe;
> +}
> +
> static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> - size_t nr_pages, uint64_t attrs)
> + size_t nr_pages, uint64_t attrs,
> + pgoff_t *err_index)
> {
> + bool to_private = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> struct address_space *mapping = inode->i_mapping;
> struct gmem_inode *gi = GMEM_I(inode);
> pgoff_t end = start + nr_pages;
> @@ -588,8 +621,21 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>
> mas_init(&mas, mt, start);
> r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> - if (r)
> + if (r) {
> + *err_index = start;
> goto out;
> + }
> +
> + if (to_private) {
> + unmap_mapping_pages(mapping, start, nr_pages, false);
> +
> + if (!kvm_gmem_is_safe_for_conversion(inode, start, nr_pages,
> + err_index)) {
> + mas_destroy(&mas);
> + r = -EAGAIN;
> + goto out;
> + }
> + }
>
> /*
> * From this point on guest_memfd has performed necessary
> @@ -609,9 +655,10 @@ static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
> struct gmem_file *f = file->private_data;
> struct inode *inode = file_inode(file);
> struct kvm_memory_attributes2 attrs;
> + pgoff_t err_index;
> size_t nr_pages;
> pgoff_t index;
> - int i;
> + int i, r;
>
> if (copy_from_user(&attrs, argp, sizeof(attrs)))
> return -EFAULT;
> @@ -635,8 +682,16 @@ static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
>
> nr_pages = attrs.size >> PAGE_SHIFT;
> index = attrs.offset >> PAGE_SHIFT;
> - return __kvm_gmem_set_attributes(inode, index, nr_pages,
> - attrs.attributes);
> + r = __kvm_gmem_set_attributes(inode, index, nr_pages, attrs.attributes,
> + &err_index);
> + if (r) {
> + attrs.error_offset = ((uint64_t)err_index) << PAGE_SHIFT;
> +
> + if (copy_to_user(argp, &attrs, sizeof(attrs)))
> + return -EFAULT;
> + }
> +
> + return r;
> }
>
> static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 06/43] KVM: x86/mmu: Bug the VM if gmem attributes are queried to determine max mapping level
From: Sean Christopherson @ 2026-05-20 14:21 UTC (permalink / raw)
To: Fuad Tabba
Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CA+EHjTxvLU4XDPXDXYXXWJES1OFQgN8VTRLMgCCNMwBE6Hk8tQ@mail.gmail.com>
On Wed, May 20, 2026, Fuad Tabba wrote:
> On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
> <devnull+ackerleytng.google.com@kernel.org> wrote:
> >
> > From: Ackerley Tng <ackerleytng@google.com>
> >
> > When the maximum mapping level is queried, KVM's MMU lock is held, and
> > while the MMU lock is held, guest_memfd cannot take the
> > filemap_invalidate_lock() to look up the current shared/private state of
> > the gfn, for these reasons:
> >
> > + The MMU lock is a spinlock or rwlock and cannot be held while taking a
> > lock that can sleep.
> > + In guest_memfd's code paths (such as truncate), the
> > filemap_invalidate_lock() is held while taking the MMU lock, and taking
> > the locks in reverse order would introduce a AB-BA deadlock.
> >
> > Currently, the maximum mapping level is only queried from guest_memfd in
> > the process of recovering huge pages, if dirty logging is disabled on a
> > memslot. Dirty logging is not currently supported for guest_memfd, and
> > guest_memfd memslots also cannot be updated.
> >
> > For now, bug the VM if guest_memfd needs to be queried to determine the
> > maximum mapping level. This guard can be removed if/when support is added.
> >
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index a80a876ab4ad6..153bcc5369985 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3357,6 +3357,15 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
> > max_level = fault->max_level;
> > is_private = fault->is_private;
> > } else {
> > + /*
> > + * Memory attributes cannot be obtained from guest_memfd while
> > + * the MMU lock is held.
> > + */
> > + if (KVM_BUG_ON(static_call_query(__kvm_get_memory_attributes) ==
> > + kvm_gmem_get_memory_attributes, kvm)) {
> > + return 0;
> > + }
> > +
>
> This directly takes the address of kvm_gmem_get_memory_attributes,
> which is only compiled if CONFIG_KVM_GUEST_MEMFD=y. This breaks
> ARCH=i386.
And this bleeds guest_memfd implementation details into places they don't belong.
The right way to deal with this is to use lockdep_assert_not_held() in whatever
code mustn't run with mmu_lock held. E.g.
diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
index c9f155c2dc5c..3bea9c1137ef 100644
--- virt/kvm/guest_memfd.c
+++ virt/kvm/guest_memfd.c
@@ -547,6 +547,9 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
struct inode *inode;
+ /* Comment goes here. */
+ lockdep_assert_not_held(&kvm->mmu_lock);
+
/*
* If this gfn has no associated memslot, there's no chance of the gfn
* being backed by private memory, since guest_memfd must be used for
But I'm confused, because kvm_gmem_get_memory_attributes() doesn't actually take
filemap_invalidate_lock(), so what exactly is the problem?
> > max_level = PG_LEVEL_NUM;
> > is_private = kvm_mem_is_private(kvm, gfn);
> > }
> >
> > --
> > 2.54.0.563.g4f69b47b94-goog
> >
> >
^ permalink raw reply related
* Re: [PATCH v6 10/43] KVM: guest_memfd: Add base support for KVM_SET_MEMORY_ATTRIBUTES2
From: Fuad Tabba @ 2026-05-20 14:00 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-10-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Introduce base support for KVM_SET_MEMORY_ATTRIBUTES2 in guest_memfd, which
> just updates attributes tracked by guest_memfd.
>
> Validate input fields in general. Guard usage of KVM_SET_MEMORY_ATTRIBUTES2
> by making sure requested attributes are supported for this instance of kvm.
>
> A new KVM_SET_MEMORY_ATTRIBUTES2 is defined to support writes (unlike
> KVM_SET_MEMORY_ATTRIBUTES) in addition to reads so it can provide error
> details to userspace. This will be used in a later patch.
>
> The two ioctls use their corresponding structs with no overlap, but
> backward compatibility is baked in for future support of
> KVM_SET_MEMORY_ATTRIBUTES2 and struct kvm_memory_attributes2 in the VM
> ioctl.
>
> The process of setting memory attributes is set up such that the later half
> will not fail due to allocation. Any necessary checks are performed before
> the point of no return.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Co-developed-by: Sean Christoperson <seanjc@google.com>
> Signed-off-by: Sean Christoperson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
/fuad
> ---
> include/uapi/linux/kvm.h | 13 ++++++
> virt/kvm/Kconfig | 1 +
> virt/kvm/guest_memfd.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++
> virt/kvm/kvm_main.c | 12 +++++
> 4 files changed, 140 insertions(+)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6c8afa2047bf3..e6bbf68a83813 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1648,6 +1648,19 @@ struct kvm_memory_attributes {
> __u64 flags;
> };
>
> +#define KVM_SET_MEMORY_ATTRIBUTES2 _IOWR(KVMIO, 0xd2, struct kvm_memory_attributes2)
> +
> +struct kvm_memory_attributes2 {
> + union {
> + __u64 address;
> + __u64 offset;
> + };
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> + __u64 reserved[12];
> +};
> +
> #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
>
> #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 3fea89c45cfb4..e371e079e2c50 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -109,6 +109,7 @@ config KVM_VM_MEMORY_ATTRIBUTES
>
> config KVM_GUEST_MEMFD
> select XARRAY_MULTI
> + select KVM_MEMORY_ATTRIBUTES
> bool
>
> config HAVE_KVM_ARCH_GMEM_PREPARE
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 4f7c4824c3a45..91e89b188f583 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -540,11 +540,125 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> }
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_memory_attributes);
>
> +/*
> + * Preallocate memory for attributes to be stored on a maple tree, pointed to
> + * by mas. Adjacent ranges with attributes identical to the new attributes
> + * will be merged. Also sets mas's bounds up for storing attributes.
> + *
> + * This maintains the invariant that ranges with the same attributes will
> + * always be merged.
> + */
> +static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
> + pgoff_t start, size_t nr_pages)
> +{
> + pgoff_t end = start + nr_pages;
> + pgoff_t last = end - 1;
> + void *entry;
> +
> + /* Try extending range. entry is NULL on overflow/wrap-around. */
> + mas_set(mas, end);
> + entry = mas_find(mas, end);
> + if (entry && xa_to_value(entry) == attributes)
> + last = mas->last;
> +
> + if (start > 0) {
> + mas_set(mas, start - 1);
> + entry = mas_find(mas, start - 1);
> + if (entry && xa_to_value(entry) == attributes)
> + start = mas->index;
> + }
> +
> + mas_set_range(mas, start, last);
> + return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
> +}
> +
> +static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> + size_t nr_pages, uint64_t attrs)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + struct gmem_inode *gi = GMEM_I(inode);
> + pgoff_t end = start + nr_pages;
> + struct maple_tree *mt;
> + struct ma_state mas;
> + int r;
> +
> + mt = &gi->attributes;
> +
> + filemap_invalidate_lock(mapping);
> +
> + mas_init(&mas, mt, start);
> + r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> + if (r)
> + goto out;
> +
> + /*
> + * From this point on guest_memfd has performed necessary
> + * checks and can proceed to do guest-breaking changes.
> + */
> +
> + kvm_gmem_invalidate_begin(inode, start, end);
> + mas_store_prealloc(&mas, xa_mk_value(attrs));
> + kvm_gmem_invalidate_end(inode, start, end);
> +out:
> + filemap_invalidate_unlock(mapping);
> + return r;
> +}
> +
> +static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
> +{
> + struct gmem_file *f = file->private_data;
> + struct inode *inode = file_inode(file);
> + struct kvm_memory_attributes2 attrs;
> + size_t nr_pages;
> + pgoff_t index;
> + int i;
> +
> + if (copy_from_user(&attrs, argp, sizeof(attrs)))
> + return -EFAULT;
> +
> + if (attrs.flags)
> + return -EINVAL;
> + for (i = 0; i < ARRAY_SIZE(attrs.reserved); i++) {
> + if (attrs.reserved[i])
> + return -EINVAL;
> + }
> + if (attrs.attributes & ~kvm_supported_mem_attributes(f->kvm))
> + return -EINVAL;
> + if (attrs.size == 0 || attrs.offset + attrs.size < attrs.offset)
> + return -EINVAL;
> + if (!PAGE_ALIGNED(attrs.offset) || !PAGE_ALIGNED(attrs.size))
> + return -EINVAL;
> +
> + if (attrs.offset >= i_size_read(inode) ||
> + attrs.offset + attrs.size > i_size_read(inode))
> + return -EINVAL;
> +
> + nr_pages = attrs.size >> PAGE_SHIFT;
> + index = attrs.offset >> PAGE_SHIFT;
> + return __kvm_gmem_set_attributes(inode, index, nr_pages,
> + attrs.attributes);
> +}
> +
> +static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
> + unsigned long arg)
> +{
> + switch (ioctl) {
> + case KVM_SET_MEMORY_ATTRIBUTES2:
> + if (vm_memory_attributes)
> + return -ENOTTY;
> +
> + return kvm_gmem_set_attributes(file, (void __user *)arg);
> + default:
> + return -ENOTTY;
> + }
> +}
> +
> static struct file_operations kvm_gmem_fops = {
> .mmap = kvm_gmem_mmap,
> .open = generic_file_open,
> .release = kvm_gmem_release,
> .fallocate = kvm_gmem_fallocate,
> + .unlocked_ioctl = kvm_gmem_ioctl,
> };
>
> static int kvm_gmem_migrate_folio(struct address_space *mapping,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ff20e63143642..4d7bf52b7b717 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -110,6 +110,18 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_TRAMP(__kvm_get_memory_attributes));
> #endif
>
> +#define MEMORY_ATTRIBUTES_MATCH(one, two) \
> + static_assert(offsetof(struct kvm_memory_attributes, one) == \
> + offsetof(struct kvm_memory_attributes2, two)); \
> + static_assert(sizeof_field(struct kvm_memory_attributes, one) ==\
> + sizeof_field(struct kvm_memory_attributes2, two))
> +
> +/* Ensure the common parts of the two structs are identical. */
> +MEMORY_ATTRIBUTES_MATCH(address, address);
> +MEMORY_ATTRIBUTES_MATCH(size, size);
> +MEMORY_ATTRIBUTES_MATCH(attributes, attributes);
> +MEMORY_ATTRIBUTES_MATCH(flags, flags);
> +
> /*
> * Ordering of locks:
> *
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 09/43] KVM: Move kvm_supported_mem_attributes() to kvm_host.h
From: Fuad Tabba @ 2026-05-20 13:53 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-9-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Move kvm_supported_mem_attributes() from kvm_main.c to kvm_host.h and
> make it a static inline function. This allows the helper to be used in
> other parts of the KVM subsystem outside of kvm_main.c. This helper will be
> used later by guest_memfd.
>
> No functional change intended.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> include/linux/kvm_host.h | 10 ++++++++++
> virt/kvm/kvm_main.c | 10 ----------
> 2 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1deab76dc0a2c..f9ea95e33d050 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2529,6 +2529,16 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
> }
>
> #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +static inline u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> +#ifdef kvm_arch_has_private_mem
> + if (!kvm || kvm_arch_has_private_mem(kvm))
> + return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +#endif
> +
> + return 0;
> +}
> +
> typedef unsigned long (kvm_get_memory_attributes_t)(struct kvm *kvm, gfn_t gfn);
> DECLARE_STATIC_CALL(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0a4024948711a..ff20e63143642 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2428,16 +2428,6 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> -static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> -{
> -#ifdef kvm_arch_has_private_mem
> - if (!kvm || kvm_arch_has_private_mem(kvm))
> - return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> -#endif
> -
> - return 0;
> -}
> -
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> {
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 08/43] KVM: guest_memfd: Only prepare folios for private pages
From: Fuad Tabba @ 2026-05-20 13:51 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-8-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> All-shared guest_memfd used to be only supported for non-CoCo VMs where
> preparation doesn't apply. INIT_SHARED is about to be supported for
> non-CoCo VMs in a later patch in this series.
>
> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
> guest_memfd in a later patch in this series.
>
> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
> shared folio for a CoCo VM where preparation applies.
>
> Add a check to make sure that preparation is only performed for private
> folios.
>
> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
> conversion to shared.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> virt/kvm/guest_memfd.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 9d025f518c025..4f7c4824c3a45 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -888,6 +888,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> int *max_order)
> {
> pgoff_t index = kvm_gmem_get_index(slot, gfn);
> + struct inode *inode;
> struct folio *folio;
> int r = 0;
>
> @@ -895,7 +896,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> if (!file)
> return -EFAULT;
>
> - filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> + inode = file_inode(file);
> + filemap_invalidate_lock_shared(inode->i_mapping);
>
> folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> if (IS_ERR(folio)) {
> @@ -908,7 +910,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> folio_mark_uptodate(folio);
> }
>
> - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> + if (kvm_gmem_is_private_mem(inode, index))
> + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>
> folio_unlock(folio);
>
> @@ -918,7 +921,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> folio_put(folio);
>
> out:
> - filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> + filemap_invalidate_unlock_shared(inode->i_mapping);
> return r;
> }
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 07/43] KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes
From: Fuad Tabba @ 2026-05-20 13:47 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-7-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Update the guest_memfd populate() flow to pull memory attributes from the
> gmem instance instead of the VM when KVM is not configured to track
> shared/private status in the VM.
>
> Rename the per-VM API to make it clear that it retrieves per-VM
> attributes, i.e. is not suitable for use outside of flows that are
> specific to generic per-VM attributes.
>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
/fuad
> ---
> arch/x86/kvm/mmu/mmu.c | 2 +-
> include/linux/kvm_host.h | 14 +++++++++++++-
> virt/kvm/guest_memfd.c | 24 +++++++++++++++++++++---
> virt/kvm/kvm_main.c | 8 +++-----
> 4 files changed, 38 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 153bcc5369985..bfcf9be25598e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7997,7 +7997,7 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
> const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
>
> if (level == PG_LEVEL_2M)
> - return kvm_range_has_memory_attributes(kvm, start, end, ~0, attrs);
> + return kvm_range_has_vm_memory_attributes(kvm, start, end, ~0, attrs);
>
> for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
> if (hugepage_test_mixed(slot, gfn, level - 1) ||
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 28a54298d27db..1deab76dc0a2c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2549,12 +2549,24 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> #endif
>
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> +extern bool vm_memory_attributes;
> +bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> unsigned long mask, unsigned long attrs);
> bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> struct kvm_gfn_range *range);
> bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> struct kvm_gfn_range *range);
> +#else
> +#define vm_memory_attributes false
> +static inline bool kvm_range_has_vm_memory_attributes(struct kvm *kvm,
> + gfn_t start, gfn_t end,
> + unsigned long mask,
> + unsigned long attrs)
> +{
> + WARN_ONCE(1, "Unexpected call to kvm_range_has_vm_memory_attributes()");
> +
> + return false;
> +}
> #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
> unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn);
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index f055e058a3f28..9d025f518c025 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -924,12 +924,31 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>
> #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE
> +static bool kvm_gmem_range_is_private(struct gmem_inode *gi, pgoff_t index,
> + size_t nr_pages, struct kvm *kvm, gfn_t gfn)
> +{
> + pgoff_t end = index + nr_pages - 1;
> + void *entry;
> +
> + if (vm_memory_attributes)
> + return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +
> + mt_for_each(&gi->attributes, entry, index, end) {
> + if (xa_to_value(entry) != KVM_MEMORY_ATTRIBUTE_PRIVATE)
> + return false;
> + }
> +
> + return true;
> +}
>
> static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> struct file *file, gfn_t gfn, struct page *src_page,
> kvm_gmem_populate_cb post_populate, void *opaque)
> {
> pgoff_t index = kvm_gmem_get_index(slot, gfn);
> + struct gmem_inode *gi;
> struct folio *folio;
> kvm_pfn_t pfn;
> int ret;
> @@ -944,9 +963,8 @@ static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
>
> folio_unlock(folio);
>
> - if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
> - KVM_MEMORY_ATTRIBUTE_PRIVATE,
> - KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> + gi = GMEM_I(file_inode(file));
> + if (!kvm_gmem_range_is_private(gi, index, 1, kvm, gfn)) {
> ret = -EINVAL;
> goto out_put_folio;
> }
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4139e903f756a..0a4024948711a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -103,9 +103,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
>
> #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> -static bool vm_memory_attributes = true;
> -#else
> -#define vm_memory_attributes false
> +bool vm_memory_attributes = true;
> #endif
> DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> @@ -2450,7 +2448,7 @@ static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> * Returns true if _all_ gfns in the range [@start, @end) have attributes
> * such that the bits in @mask match @attrs.
> */
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> +bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> unsigned long mask, unsigned long attrs)
> {
> XA_STATE(xas, &kvm->mem_attr_array, start);
> @@ -2584,7 +2582,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> mutex_lock(&kvm->slots_lock);
>
> /* Nothing to do if the entire range has the desired attributes. */
> - if (kvm_range_has_memory_attributes(kvm, start, end, ~0, attributes))
> + if (kvm_range_has_vm_memory_attributes(kvm, start, end, ~0, attributes))
> goto out_unlock;
>
> /*
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 06/43] KVM: x86/mmu: Bug the VM if gmem attributes are queried to determine max mapping level
From: Fuad Tabba @ 2026-05-20 13:33 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-6-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> When the maximum mapping level is queried, KVM's MMU lock is held, and
> while the MMU lock is held, guest_memfd cannot take the
> filemap_invalidate_lock() to look up the current shared/private state of
> the gfn, for these reasons:
>
> + The MMU lock is a spinlock or rwlock and cannot be held while taking a
> lock that can sleep.
> + In guest_memfd's code paths (such as truncate), the
> filemap_invalidate_lock() is held while taking the MMU lock, and taking
> the locks in reverse order would introduce a AB-BA deadlock.
>
> Currently, the maximum mapping level is only queried from guest_memfd in
> the process of recovering huge pages, if dirty logging is disabled on a
> memslot. Dirty logging is not currently supported for guest_memfd, and
> guest_memfd memslots also cannot be updated.
>
> For now, bug the VM if guest_memfd needs to be queried to determine the
> maximum mapping level. This guard can be removed if/when support is added.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a80a876ab4ad6..153bcc5369985 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3357,6 +3357,15 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
> max_level = fault->max_level;
> is_private = fault->is_private;
> } else {
> + /*
> + * Memory attributes cannot be obtained from guest_memfd while
> + * the MMU lock is held.
> + */
> + if (KVM_BUG_ON(static_call_query(__kvm_get_memory_attributes) ==
> + kvm_gmem_get_memory_attributes, kvm)) {
> + return 0;
> + }
> +
This directly takes the address of kvm_gmem_get_memory_attributes,
which is only compiled if CONFIG_KVM_GUEST_MEMFD=y. This breaks
ARCH=i386.
Cheers,
/fuad
> max_level = PG_LEVEL_NUM;
> is_private = kvm_mem_is_private(kvm, gfn);
> }
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 05/43] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
From: Fuad Tabba @ 2026-05-20 12:08 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-5-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Implement kvm_gmem_get_memory_attributes() for guest_memfd to allow the KVM
> core and architecture code to query per-GFN memory attributes.
>
> kvm_gmem_get_memory_attributes() finds the memory slot for a given GFN and
> queries the guest_memfd file's to determine if the page is marked as
> private.
>
> If vm_memory_attributes is not enabled, there is no shared/private tracking
> at the VM level. Install the guest_memfd implementation as long as
> guest_memfd is enabled to give guest_memfd a chance to respond on
> attributes.
>
> guest_memfd should look up attributes regardless of whether this memslot is
> gmem-only since attributes are now tracked by gmem regardless of whether
> mmap() is enabled.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> include/linux/kvm_host.h | 2 ++
> virt/kvm/guest_memfd.c | 31 +++++++++++++++++++++++++++++++
> virt/kvm/kvm_main.c | 3 +++
> 3 files changed, 36 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c5ba2cb34e45c..28a54298d27db 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2557,6 +2557,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> struct kvm_gfn_range *range);
> #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
> +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn);
> +
> #ifdef CONFIG_KVM_GUEST_MEMFD
> int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 5011d38820d0d..f055e058a3f28 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -509,6 +509,37 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> return 0;
> }
>
> +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +{
> + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> + struct inode *inode;
> +
> + /*
> + * If this gfn has no associated memslot, there's no chance of the gfn
> + * being backed by private memory, since guest_memfd must be used for
> + * private memory, and guest_memfd must be associated with some memslot.
> + */
> + if (!slot)
> + return 0;
> +
> + CLASS(gmem_get_file, file)(slot);
> + if (!file)
> + return 0;
> +
> + inode = file_inode(file);
> +
> + /*
> + * Rely on the maple tree's internal RCU lock to ensure a
> + * stable result. This result can become stale as soon as the
> + * lock is dropped, so the caller _must_ still protect
> + * consumption of private vs. shared by checking
> + * mmu_invalidate_retry_gfn() under mmu_lock to serialize
> + * against ongoing attribute updates.
> + */
> + return kvm_gmem_get_attributes(inode, kvm_gmem_get_index(slot, gfn));
> +}
Doesn't this imply that all consumers of kvm_mem_is_private() should
validate the result using mmu_lock and the invalidation sequence?
sev_handle_rmp_fault() calls kvm_mem_is_private() without holding
mmu_lock and without any retry mechanism. Is that a problem?
Cheers,
/fuad
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_memory_attributes);
> +
> static struct file_operations kvm_gmem_fops = {
> .mmap = kvm_gmem_mmap,
> .open = generic_file_open,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ee26f1d9b5fda..4139e903f756a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2653,6 +2653,9 @@ static void kvm_init_memory_attributes(void)
> if (vm_memory_attributes)
> static_call_update(__kvm_get_memory_attributes,
> kvm_get_vm_memory_attributes);
> + else if (IS_ENABLED(CONFIG_KVM_GUEST_MEMFD))
> + static_call_update(__kvm_get_memory_attributes,
> + kvm_gmem_get_memory_attributes);
> else
> static_call_update(__kvm_get_memory_attributes,
> (void *)__static_call_return0);
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 04/43] KVM: Stub in ability to disable per-VM memory attribute tracking
From: Fuad Tabba @ 2026-05-20 12:08 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-4-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Introduce the basic infrastructure to allow per-VM memory attribute
> tracking to be disabled. This will be built-upon in a later patch, where a
> module param can disable per-VM memory attribute tracking.
>
> Split the Kconfig option into a base KVM_MEMORY_ATTRIBUTES and the
> existing KVM_VM_MEMORY_ATTRIBUTES. The base option provides the core
> plumbing, while the latter enables the full per-VM tracking via an xarray
> and the associated ioctls.
>
> kvm_get_memory_attributes() now performs a static call that either looks up
> kvm->mem_attr_array with CONFIG_KVM_VM_MEMORY_ATTRIBUTES is enabled, or
> just returns 0 otherwise. The static call can be patched depending on
> whether per-VM tracking is enabled by the CONFIG.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> include/linux/kvm_host.h | 23 ++++++++++++---------
> virt/kvm/Kconfig | 4 ++++
> virt/kvm/kvm_main.c | 44 ++++++++++++++++++++++++++++++++++++++++-
> 4 files changed, 62 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 60b997764beef..c9aa50bcdac2d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2369,7 +2369,7 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> int tdp_max_root_level, int tdp_huge_page_level);
>
>
> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
> #endif
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7d079f9701346..c5ba2cb34e45c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2528,19 +2528,15 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
> return slot->flags & KVM_MEMSLOT_GMEM_ONLY;
> }
>
> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +typedef unsigned long (kvm_get_memory_attributes_t)(struct kvm *kvm, gfn_t gfn);
> +DECLARE_STATIC_CALL(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> +
> static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> {
> - return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> + return static_call(__kvm_get_memory_attributes)(kvm, gfn);
> }
>
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> - unsigned long mask, unsigned long attrs);
> -bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> - struct kvm_gfn_range *range);
> -bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> - struct kvm_gfn_range *range);
> -
> static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> {
> return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> @@ -2550,6 +2546,15 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> {
> return false;
> }
> +#endif
> +
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> + unsigned long mask, unsigned long attrs);
> +bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> + struct kvm_gfn_range *range);
> +bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> + struct kvm_gfn_range *range);
> #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
> #ifdef CONFIG_KVM_GUEST_MEMFD
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 5119cb37145fc..3fea89c45cfb4 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -100,7 +100,11 @@ config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
> config KVM_MMU_LOCKLESS_AGING
> bool
>
> +config KVM_MEMORY_ATTRIBUTES
> + bool
> +
> config KVM_VM_MEMORY_ATTRIBUTES
> + select KVM_MEMORY_ATTRIBUTES
> bool
>
> config KVM_GUEST_MEMFD
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index abb9cfa3eb04d..ee26f1d9b5fda 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -101,6 +101,17 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink);
> static bool __ro_after_init allow_unsafe_mappings;
> module_param(allow_unsafe_mappings, bool, 0444);
>
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +static bool vm_memory_attributes = true;
> +#else
> +#define vm_memory_attributes false
> +#endif
> +DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_TRAMP(__kvm_get_memory_attributes));
> +#endif
> +
> /*
> * Ordering of locks:
> *
> @@ -2418,7 +2429,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> }
> #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> {
> #ifdef kvm_arch_has_private_mem
> @@ -2429,6 +2440,12 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> return 0;
> }
>
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +{
> + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> +}
> +
> /*
> * Returns true if _all_ gfns in the range [@start, @end) have attributes
> * such that the bits in @mask match @attrs.
> @@ -2625,7 +2642,24 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>
> return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
> }
> +#else /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> +static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +{
> + BUILD_BUG_ON(1);
> +}
> #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> +static void kvm_init_memory_attributes(void)
> +{
> + if (vm_memory_attributes)
> + static_call_update(__kvm_get_memory_attributes,
> + kvm_get_vm_memory_attributes);
> + else
> + static_call_update(__kvm_get_memory_attributes,
> + (void *)__static_call_return0);
> +}
> +#else /* CONFIG_KVM_MEMORY_ATTRIBUTES */
> +static void kvm_init_memory_attributes(void) { }
> +#endif /* CONFIG_KVM_MEMORY_ATTRIBUTES */
>
> struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> {
> @@ -4925,6 +4959,9 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> return 1;
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> case KVM_CAP_MEMORY_ATTRIBUTES:
> + if (!vm_memory_attributes)
> + return 0;
> +
> return kvm_supported_mem_attributes(kvm);
> #endif
> #ifdef CONFIG_KVM_GUEST_MEMFD
> @@ -5331,6 +5368,10 @@ static long kvm_vm_ioctl(struct file *filp,
> case KVM_SET_MEMORY_ATTRIBUTES: {
> struct kvm_memory_attributes attrs;
>
> + r = -ENOTTY;
> + if (!vm_memory_attributes)
> + goto out;
> +
> r = -EFAULT;
> if (copy_from_user(&attrs, argp, sizeof(attrs)))
> goto out;
> @@ -6527,6 +6568,7 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
> kvm_preempt_ops.sched_in = kvm_sched_in;
> kvm_preempt_ops.sched_out = kvm_sched_out;
>
> + kvm_init_memory_attributes();
> kvm_init_debug();
>
> r = kvm_vfio_ops_init();
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 03/43] KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined
From: Fuad Tabba @ 2026-05-20 12:08 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-3-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Explicitly guard reporting support for KVM_MEMORY_ATTRIBUTE_PRIVATE based
> on kvm_arch_has_private_mem being #defined in anticipation of decoupling
> kvm_supported_mem_attributes() from CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
> guest_memfd support for memory attributes will be unconditional to avoid
> yet more macros (all architectures that support guest_memfd are expected to
> use per-gmem attributes at some point), at which point enumerating support
> KVM_MEMORY_ATTRIBUTE_PRIVATE based solely on memory attributes being
> supported _somewhere_ would result in KVM over-reporting support on arm64.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> include/linux/kvm_host.h | 2 +-
> virt/kvm/kvm_main.c | 2 ++
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7b9faa3545300..7d079f9701346 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -722,7 +722,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
> }
> #endif
>
> -#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#ifndef kvm_arch_has_private_mem
> static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
> {
> return false;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 306153abbafa5..abb9cfa3eb04d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2421,8 +2421,10 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> {
> +#ifdef kvm_arch_has_private_mem
> if (!kvm || kvm_arch_has_private_mem(kvm))
> return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +#endif
>
> return 0;
> }
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 02/43] KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
From: Fuad Tabba @ 2026-05-20 12:08 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-2-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Rename the per-VM memory attributes Kconfig to make it explicitly about
> per-VM attributes in anticipation of adding memory attributes support to
> guest_memfd, at which point it will be possible (and desirable) to have
> memory attributes without the per-VM support, even in x86.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> arch/x86/kvm/Kconfig | 6 +++---
> arch/x86/kvm/mmu/mmu.c | 2 +-
> arch/x86/kvm/x86.c | 2 +-
> include/linux/kvm_host.h | 8 ++++----
> include/trace/events/kvm.h | 4 ++--
> virt/kvm/Kconfig | 2 +-
> virt/kvm/kvm_main.c | 14 +++++++-------
> 8 files changed, 20 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c470e40a00aa4..60b997764beef 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2369,7 +2369,7 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> int tdp_max_root_level, int tdp_huge_page_level);
>
>
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
> #endif
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 801bf9e520db3..26f6afd51bbdc 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -84,7 +84,7 @@ config KVM_SW_PROTECTED_VM
> bool "Enable support for KVM software-protected VMs"
> depends on EXPERT
> depends on KVM_X86 && X86_64
> - select KVM_GENERIC_MEMORY_ATTRIBUTES
> + select KVM_VM_MEMORY_ATTRIBUTES
> help
> Enable support for KVM software-protected VMs. Currently, software-
> protected VMs are purely a development and testing vehicle for
> @@ -135,7 +135,7 @@ config KVM_INTEL_TDX
> bool "Intel Trust Domain Extensions (TDX) support"
> default y
> depends on INTEL_TDX_HOST
> - select KVM_GENERIC_MEMORY_ATTRIBUTES
> + select KVM_VM_MEMORY_ATTRIBUTES
> select HAVE_KVM_ARCH_GMEM_POPULATE
> help
> Provides support for launching Intel Trust Domain Extensions (TDX)
> @@ -159,7 +159,7 @@ config KVM_AMD_SEV
> depends on KVM_AMD && X86_64
> depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
> select ARCH_HAS_CC_PLATFORM
> - select KVM_GENERIC_MEMORY_ATTRIBUTES
> + select KVM_VM_MEMORY_ATTRIBUTES
> select HAVE_KVM_ARCH_GMEM_PREPARE
> select HAVE_KVM_ARCH_GMEM_INVALIDATE
> select HAVE_KVM_ARCH_GMEM_POPULATE
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 892246204435c..a80a876ab4ad6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7899,7 +7899,7 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
> }
>
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
> int level)
> {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0a1b63c63d1a9..1560de1e95be0 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13625,7 +13625,7 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
> }
> }
>
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> kvm_mmu_init_memslot_memory_attributes(kvm, slot);
> #endif
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4c14aee1fb063..7b9faa3545300 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -722,7 +722,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
> }
> #endif
>
> -#ifndef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
> {
> return false;
> @@ -871,7 +871,7 @@ struct kvm {
> #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> struct notifier_block pm_notifier;
> #endif
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> /* Protected by slots_lock (for writes) and RCU (for reads) */
> struct xarray mem_attr_array;
> #endif
> @@ -2528,7 +2528,7 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
> return slot->flags & KVM_MEMSLOT_GMEM_ONLY;
> }
>
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> {
> return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> @@ -2550,7 +2550,7 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> {
> return false;
> }
> -#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
> +#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
> #ifdef CONFIG_KVM_GUEST_MEMFD
> int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
> index b282e3a867696..1ba72bd73ea2f 100644
> --- a/include/trace/events/kvm.h
> +++ b/include/trace/events/kvm.h
> @@ -358,7 +358,7 @@ TRACE_EVENT(kvm_dirty_ring_exit,
> TP_printk("vcpu %d", __entry->vcpu_id)
> );
>
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> /*
> * @start: Starting address of guest memory range
> * @end: End address of guest memory range
> @@ -383,7 +383,7 @@ TRACE_EVENT(kvm_vm_set_mem_attributes,
> TP_printk("%#016llx -- %#016llx [0x%lx]",
> __entry->start, __entry->end, __entry->attr)
> );
> -#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
> +#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
> TRACE_EVENT(kvm_unmap_hva_range,
> TP_PROTO(unsigned long start, unsigned long end),
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 794976b88c6f9..5119cb37145fc 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -100,7 +100,7 @@ config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
> config KVM_MMU_LOCKLESS_AGING
> bool
>
> -config KVM_GENERIC_MEMORY_ATTRIBUTES
> +config KVM_VM_MEMORY_ATTRIBUTES
> bool
>
> config KVM_GUEST_MEMFD
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 89489996fbc1e..306153abbafa5 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1115,7 +1115,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> spin_lock_init(&kvm->mn_invalidate_lock);
> rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> xa_init(&kvm->vcpu_array);
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> xa_init(&kvm->mem_attr_array);
> #endif
>
> @@ -1300,7 +1300,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
> cleanup_srcu_struct(&kvm->irq_srcu);
> srcu_barrier(&kvm->srcu);
> cleanup_srcu_struct(&kvm->srcu);
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> xa_destroy(&kvm->mem_attr_array);
> #endif
> kvm_arch_free_vm(kvm);
> @@ -2418,7 +2418,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> }
> #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> {
> if (!kvm || kvm_arch_has_private_mem(kvm))
> @@ -2623,7 +2623,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>
> return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
> }
> -#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
> +#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
> struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> {
> @@ -4921,7 +4921,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> case KVM_CAP_SYSTEM_EVENT_DATA:
> case KVM_CAP_DEVICE_CTRL:
> return 1;
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> case KVM_CAP_MEMORY_ATTRIBUTES:
> return kvm_supported_mem_attributes(kvm);
> #endif
> @@ -5325,7 +5325,7 @@ static long kvm_vm_ioctl(struct file *filp,
> break;
> }
> #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> case KVM_SET_MEMORY_ATTRIBUTES: {
> struct kvm_memory_attributes attrs;
>
> @@ -5336,7 +5336,7 @@ static long kvm_vm_ioctl(struct file *filp,
> r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> break;
> }
> -#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
> +#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> case KVM_CREATE_DEVICE: {
> struct kvm_create_device cd;
>
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-05-20 12:05 UTC (permalink / raw)
To: Wei Yang
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260512154431.jxcs632mqqatqtsw@master>
On Tue, May 12, 2026 at 9:44 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
> >Enable khugepaged to collapse to mTHP orders. This patch implements the
> >main scanning logic using a bitmap to track occupied pages and a stack
> >structure that allows us to find optimal collapse sizes.
> >
> >Previous to this patch, PMD collapse had 3 main phases, a light weight
> >scanning phase (mmap_read_lock) that determines a potential PMD
> >collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> >phase (mmap_write_lock).
> >
> >To enabled mTHP collapse we make the following changes:
> >
> >During PMD scan phase, track occupied pages in a bitmap. When mTHP
> >orders are enabled, we remove the restriction of max_ptes_none during the
> >scan phase to avoid missing potential mTHP collapse candidates. Once we
> >have scanned the full PMD range and updated the bitmap to track occupied
> >pages, we use the bitmap to find the optimal mTHP size.
> >
> >Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> >and determine the best eligible order for the collapse. A stack structure
> >is used instead of traditional recursion to manage the search. This also
> >prevents a traditional recursive approach when the kernel stack struct is
> >limited. The algorithm recursively splits the bitmap into smaller chunks to
> >find the highest order mTHPs that satisfy the collapse criteria. We start
> >by attempting the PMD order, then moved on the consecutively lower orders
> >(mTHP collapse). The stack maintains a pair of variables (offset, order),
> >indicating the number of PTEs from the start of the PMD, and the order of
> >the potential collapse candidate.
> >
> >The algorithm for consuming the bitmap works as such:
> > 1) push (0, HPAGE_PMD_ORDER) onto the stack
> > 2) pop the stack
> > 3) check if the number of set bits in that (offset,order) pair
> > statisfy the max_ptes_none threshold for that order
> > 4) if yes, attempt collapse
> > 5) if no (or collapse fails), push two new stack items representing
> > the left and right halves of the current bitmap range, at the
> > next lower order
> > 6) repeat at step (2) until stack is empty.
> >
> >Below is a diagram representing the algorithm and stack items:
> >
> > offset mid_offset
> > | |
> > | |
> > v v
> > ____________________________________
> > | PTE Page Table |
> > --------------------------------------
> > <-------><------->
> > order-1 order-1
> >
> >mTHP collapses reject regions containing swapped out or shared pages.
> >This is because adding new entries can lead to new none pages, and these
> >may lead to constant promotion into a higher order mTHP. A similar
> >issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> >introducing at least 2x the number of pages, and on a future scan will
> >satisfy the promotion condition once again. This issue is prevented via
> >the collapse_max_ptes_none() function which imposes the max_ptes_none
> >restrictions above.
> >
> >We currently only support mTHP collapse for max_ptes_none values of 0
> >and HPAGE_PMD_NR - 1. resulting in the following behavior:
> >
> > - max_ptes_none=0: Never introduce new empty pages during collapse
> > - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> > available mTHP order
> >
> >Any other max_ptes_none value will emit a warning and skip mTHP collapse
> >attempts. There should be no behavior change for PMD collapse.
> >
> >Once we determine what mTHP sizes fits best in that PMD range a collapse
> >is attempted. A minimum collapse order of 2 is used as this is the lowest
> >order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> >
> >Currently madv_collapse is not supported and will only attempt PMD
> >collapse.
> >
> >We can also remove the check for is_khugepaged inside the PMD scan as
> >the collapse_max_ptes_none() function handles this logic now.
> >
> >Signed-off-by: Nico Pache <npache@redhat.com>
>
> [...]
>
> >+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> >+ int referenced, int unmapped, struct collapse_control *cc,
> >+ unsigned long enabled_orders)
> >+{
> >+ unsigned int nr_occupied_ptes, nr_ptes;
> >+ int max_ptes_none, collapsed = 0, stack_size = 0;
> >+ unsigned long collapse_address;
> >+ struct mthp_range range;
> >+ u16 offset;
> >+ u8 order;
> >+
> >+ collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> >+
> >+ while (stack_size) {
> >+ range = collapse_mthp_stack_pop(cc, &stack_size);
> >+ order = range.order;
> >+ offset = range.offset;
> >+ nr_ptes = 1UL << order;
> >+
> >+ if (!test_bit(order, &enabled_orders))
> >+ goto next_order;
> >+
> >+ max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
>
> I am thinking whether there is a behavioral change for userfaultfd_armed(vma).
>
> collapse_single_pmd()
> collapse_scan_pmd
> max_ptes_none = collapse_max_ptes_none(cc, vma)
> max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT --- (1)
> mthp_collapse
> max_ptes_none = collapse_max_ptes_none(cc, NULL) --- (2)
> collapse_huge_page(mm)
> hugepage_vma_revalidate(&vma)
> __collapse_huge_page_isolate(vma)
> max_ptes_none = collapse_max_ptes_none(cc, vma)
>
> Before mthp_collapse() introduced, userfaultfd_armed(vma) is skipped if there
> is any pte_none_or_zero() in collapse_scan_pmd().
>
> But now, max_ptes_none could be set to KHUGEPAGED_MAX_PTES_LIMIT at (1), so
> that we can scan all the pte to get the bitmap. This means
> userfaultfd_armed(vma) could continue even with pte_none_or_zero().
>
> Then in mthp_collapse(), collapse_max_ptes_none() at (2) ignores
> userfaultfd_armed(vma), which means it will continue to collapse a
> userfaultfd_armed(vma) when there is pte_none_or_zero().
>
> The good news is we will stop at __collapse_huge_page_isolate(), where we
> get collapse_max_ptes_none() with vma. But we already did a lot of work.
Good catch!
As you stated we eventually ensure we respect the uffd checks. So
there are no correctness issues, just the potential for wasted cycles.
At (1) we only do this if mTHPs are enabled. If that is the case, the
only waste that can arise is at the PMD order, as that order respects
the max_ptes_none value.
I think one approach is to gate (1) with the uffd check as well. That
way, if mTHPs are enabled and its uffd-armed, max_ptes_none will stay
at 0, and we bail early on the scan early if any none_ptes are hit.
But then we lose the ability to collapse to mTHPs that are uffd-armed,
where the PMD has none/zero-ptes and the mTHP fully has 0
non-none/zero-ptes.
ie) assume a PMD is 16 x's [xxxxxxxx00000000]
where x is a populated pte and 0 is not
If we guard this scan (1), then we will never check if its possible to
collapse to the smaller orders.
Let me know if you see a flaw in my logic, I think it's best to keep it as is?
>
> Not sure if I missed something.
>
> >+
> >+ if (max_ptes_none < 0)
> >+ return collapsed;
> >+
> >+ nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> >+ nr_ptes);
> >+
> >+ if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> >+ int ret;
> >+
> >+ collapse_address = address + offset * PAGE_SIZE;
> >+ ret = collapse_huge_page(mm, collapse_address, referenced,
> >+ unmapped, cc, order);
> >+ if (ret == SCAN_SUCCEED) {
> >+ collapsed += nr_ptes;
> >+ continue;
> >+ }
> >+ }
> >+
> >+next_order:
> >+ if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
> >+ const u8 next_order = order - 1;
> >+ const u16 mid_offset = offset + (nr_ptes / 2);
> >+
> >+ collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> >+ next_order);
> >+ collapse_mthp_stack_push(cc, &stack_size, offset,
> >+ next_order);
> >+ }
> >+ }
> >+ return collapsed;
> >+}
> >+
> > static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > struct vm_area_struct *vma, unsigned long start_addr,
> > bool *lock_dropped, struct collapse_control *cc)
> > {
> >- const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> >+ int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> > const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> > const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> >+ enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> > pmd_t *pmd;
> >- pte_t *pte, *_pte;
> >- int none_or_zero = 0, shared = 0, referenced = 0;
> >+ pte_t *pte, *_pte, pteval;
> >+ int i;
> >+ int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> > enum scan_result result = SCAN_FAIL;
> > struct page *page = NULL;
> > struct folio *folio = NULL;
> > unsigned long addr;
> >+ unsigned long enabled_orders;
> > spinlock_t *ptl;
> > int node = NUMA_NO_NODE, unmapped = 0;
> >
> >@@ -1429,8 +1579,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > goto out;
> > }
> >
> >+ bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> > memset(cc->node_load, 0, sizeof(cc->node_load));
> > nodes_clear(cc->alloc_nmask);
> >+
> >+ enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
>
> Would it be 0 at this point?
If your question relates to the issue you brought up above, then yes,
max_ptes_none would be 0 if it's uffd-armed. We must recheck the
uffd-armed status before modifying it to 511.
>
> >+
> >+ /*
> >+ * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> >+ * scan all pages to populate the bitmap for mTHP collapse.
> >+ */
> >+ if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> >+ max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> >+
> > pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> > if (!pte) {
> > cc->progress++;
> >@@ -1438,11 +1599,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > goto out;
> > }
> >
> >- for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >- _pte++, addr += PAGE_SIZE) {
> >+ for (i = 0; i < HPAGE_PMD_NR; i++) {
> >+ _pte = pte + i;
> >+ addr = start_addr + i * PAGE_SIZE;
> >+ pteval = ptep_get(_pte);
> >+
> > cc->progress++;
> >
> >- pte_t pteval = ptep_get(_pte);
> > if (pte_none_or_zero(pteval)) {
> > if (++none_or_zero > max_ptes_none) {
> > result = SCAN_EXCEED_NONE_PTE;
> >@@ -1522,6 +1685,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > }
> > }
> >
> >+ /* Set bit for occupied pages */
> >+ __set_bit(i, cc->mthp_bitmap);
> > /*
> > * Record which node the original page is from and save this
> > * information to cc->node_load[].
> >@@ -1580,10 +1745,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > if (result == SCAN_SUCCEED) {
> > /* collapse_huge_page expects the lock to be dropped before calling */
> > mmap_read_unlock(mm);
> >- result = collapse_huge_page(mm, start_addr, referenced,
> >- unmapped, cc, HPAGE_PMD_ORDER);
> >+ nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
> >+ cc, enabled_orders);
> > /* collapse_huge_page will return with the mmap_lock released */
>
> collapse_huge_page will return with mmap_lock released, but mthp_collapse()
> may not?
We are now releasing the lock before calling mthp_collapse, which
subsequently calls collapse_huge_page. Even if `collapse_huge_page` is
never called-- say, because enabled_orders is 0 (which should not
happen) and all collapse orders are skipped (never calling
collapse_huge_page)-- we still return here with the lock dropped.
I think this is sound. Let me know if you think differently.
Cheers :)
-- Nico
>
> > *lock_dropped = true;
> >+ result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> > }
> > out:
> > trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> >--
> >2.54.0
>
> --
> Wei Yang
> Help you, Help me
>
^ permalink raw reply
* Re: [PATCH] gpu: host1x: trace: fix string fields in host1x traces
From: Thierry Reding @ 2026-05-20 12:03 UTC (permalink / raw)
To: Steven Rostedt
Cc: Artur Kowalski, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, linux-tegra, Thierry Reding, Mikko Perttunen,
David Airlie, Simona Vetter
In-Reply-To: <20260519141059.77435501@fedora>
[-- Attachment #1: Type: text/plain, Size: 845 bytes --]
On Tue, May 19, 2026 at 02:10:59PM -0400, Steven Rostedt wrote:
> On Tue, 19 May 2026 12:16:43 +0200
> Artur Kowalski <arturkow2000@gmail.com> wrote:
>
> > Use __assign_str and __get_str as required by tracing subsystem. Fixes
> > string fields being rejected by the verifier and unreadable from
> > userspace.
>
> Does anyone use these tracepoints? The fact that they have been broken
> for 5 years and nobody noticed makes me think they are useless.
>
> I rather remove them than fix them, but if someone thinks that these
> are still useful then by all means apply this patch.
>
> Acked-by: Steven Rostedt <rostedt@goodmis.org>
I know that Mikko used them a lot early on, but this driver is pretty
mature now, so we rarely need this low level of tracing. I'll defer to
Mikko on whether we still need these.
Thierry
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 6/9] rv: Ensure synchronous cleanup for HA monitors
From: Gabriele Monaco @ 2026-05-20 11:22 UTC (permalink / raw)
To: Wen Yang; +Cc: linux-kernel, Steven Rostedt, Nam Cao, linux-trace-kernel
In-Reply-To: <5f8ceba5-ec25-4c19-a74b-ed98715e89b6@linux.dev>
On Wed, 2026-05-20 at 00:48 +0800, Wen Yang wrote:
> The goal is right. One thing worth double-checking is the load order
> in the callback against the "SMP BARRIER PAIRING" section of
> Documentation/memory-barriers.txt, which states:
>
Yeah I realised that after I sent my answer.. You might have noticed but I
proposed a version using acquire/release semantics in [1].
I'm waiting to send it all in a V2 for the fixes series.
Are you going to send your patch with tracepoint_synchronize_unregister() in the
per-task destruction (can be a patch alone)?
If not I'll do it myself and append that too, I prefer to have everything
together to avoid conflict resolution issues.
Thanks,
Gabriele
[1] -
https://lore.kernel.org/lkml/02c522f2a09183c9e1a6ff5b0110d0d5cc5e35bd.camel@redhat.com/
> [!] Note that the stores before the write barrier would normally be
> expected to match the loads after the read barrier or the
> address-dependency barrier, and vice versa ...
>
> So, we should to swap the read order in the callback so that it matches
> the standard pattern:
>
> void __ha_monitor_timer_callback() {
> guard(rcu)();
> curr_state = READ_ONCE(ha_mon->da_mon.curr_state); /* B:
> before rmb */
> smp_rmb();
> if (unlikely(!da_monitoring(&ha_mon->da_mon))) /* A:
> after rmb */
> return;
> /*
> * Reached here: monitoring = 1 (old_A).
> * Standard wmb/rmb guarantee: curr_state (read before rmb) is also
> * old, i.e. not initial_state.
> */
> ha_react(curr_state, EVENT_NONE, env_string.buffer);
> ...
> }
>
> void da_monitor_reset() {
> da_monitor_reset_hook(da_mon);
> WRITE_ONCE(da_mon->monitoring, 0); /* A: before wmb */
> smp_wmb();
> WRITE_ONCE(da_mon->curr_state, model_get_initial_state()); /*
> B: after wmb */
> }
>
>
>
> --
> Best wishes,
> Wen
>
> >
> > [1] -
> > https://lore.kernel.org/lkml/8af5ba4bd93d2acb8a546e8e47ced974a87c1eb8.1778522945.git.wen.yang@linux.dev
> >
> > >
> > >
> > > --
> > > Best wishes,
> > > Wen
> > >
> > >
> > > On 5/12/26 22:02, Gabriele Monaco wrote:
> > > > HA monitors may start timers, all cleanup functions currently stop the
> > > > timers asynchronously to avoid sleeping in the wrong context.
> > > > Nothing makes sure running callbacks terminate on cleanup.
> > > >
> > > > Run the entire HA timer callback in an RCU read-side critical section,
> > > > this way we can simply synchronize_rcu() with any pending timer and are
> > > > sure any cleanup using kfree_rcu() runs after callbacks terminated.
> > > > Additionally make sure any unlikely callback running late won't run any
> > > > code if the monitor is marked as disabled.
> > > >
> > > > Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> > > > Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
> > > > Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> > > > ---
> > > > include/rv/da_monitor.h | 23 +++++++++++++++++++----
> > > > include/rv/ha_monitor.h | 18 ++++++++++++++++--
> > > > 2 files changed, 35 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> > > > index a4a13b62d1a4..402d3b935c08 100644
> > > > --- a/include/rv/da_monitor.h
> > > > +++ b/include/rv/da_monitor.h
> > > > @@ -57,6 +57,15 @@ static struct rv_monitor rv_this;
> > > > #define da_monitor_reset_hook(da_mon)
> > > > #endif
> > > >
> > > > +/*
> > > > + * Hook to allow the implementation of hybrid automata: define it with
> > > > a
> > > > + * function that waits for the termination of all monitors background
> > > > + * activities (e.g. all timers). This hook can sleep.
> > > > + */
> > > > +#ifndef da_monitor_sync_hook
> > > > +#define da_monitor_sync_hook()
> > > > +#endif
> > > > +
> > > > /*
> > > > * Type for the target id, default to int but can be overridden.
> > > > * A long type can work as hash table key (PER_OBJ) but will be
> > > > downgraded
> > > > to
> > > > @@ -179,6 +188,7 @@ static inline int da_monitor_init(void)
> > > > static inline void da_monitor_destroy(void)
> > > > {
> > > > da_monitor_reset_all();
> > > > + da_monitor_sync_hook();
> > > > }
> > > >
> > > > #ifndef da_implicit_guard
> > > > @@ -232,6 +242,7 @@ static inline int da_monitor_init(void)
> > > > static inline void da_monitor_destroy(void)
> > > > {
> > > > da_monitor_reset_all();
> > > > + da_monitor_sync_hook();
> > > > }
> > > >
> > > > #ifndef da_implicit_guard
> > > > @@ -319,6 +330,7 @@ static inline void da_monitor_destroy(void)
> > > > }
> > > >
> > > > da_monitor_reset_all();
> > > > + da_monitor_sync_hook();
> > > >
> > > > rv_put_task_monitor_slot(task_mon_slot);
> > > > task_mon_slot = RV_PER_TASK_MONITOR_INIT;
> > > > @@ -497,10 +509,9 @@ static void da_monitor_reset_all(void)
> > > > struct da_monitor_storage *mon_storage;
> > > > int bkt;
> > > >
> > > > - rcu_read_lock();
> > > > + guard(rcu)();
> > > > hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node)
> > > > da_monitor_reset(&mon_storage->rv.da_mon);
> > > > - rcu_read_unlock();
> > > > }
> > > >
> > > > static inline int da_monitor_init(void)
> > > > @@ -516,13 +527,17 @@ static inline void da_monitor_destroy(void)
> > > > int bkt;
> > > >
> > > > tracepoint_synchronize_unregister();
> > > > + scoped_guard(rcu) {
> > > > + hash_for_each_rcu(da_monitor_ht, bkt, mon_storage,
> > > > node) {
> > > > + da_monitor_reset_hook(&mon_storage->rv.da_mon);
> > > > + }
> > > > + }
> > > > + da_monitor_sync_hook();
> > > > /*
> > > > * This function is called after all probes are disabled and no
> > > > longer
> > > > * pending, we can safely assume no concurrent user.
> > > > */
> > > > - synchronize_rcu();
> > > > hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node)
> > > > {
> > > > - da_monitor_reset_hook(&mon_storage->rv.da_mon);
> > > > hash_del_rcu(&mon_storage->node);
> > > > kfree(mon_storage);
> > > > }
> > > > diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
> > > > index d59507e8cb30..47ff1a41febe 100644
> > > > --- a/include/rv/ha_monitor.h
> > > > +++ b/include/rv/ha_monitor.h
> > > > @@ -36,6 +36,7 @@ static bool ha_monitor_handle_constraint(struct
> > > > da_monitor
> > > > *da_mon,
> > > > #define da_monitor_event_hook ha_monitor_handle_constraint
> > > > #define da_monitor_init_hook ha_monitor_init_env
> > > > #define da_monitor_reset_hook ha_monitor_reset_env
> > > > +#define da_monitor_sync_hook() synchronize_rcu()
> > > >
> > > > #include <rv/da_monitor.h>
> > > > #include <linux/seq_buf.h>
> > > > @@ -237,12 +238,25 @@ static bool ha_monitor_handle_constraint(struct
> > > > da_monitor *da_mon,
> > > > return false;
> > > > }
> > > >
> > > > +/*
> > > > + * __ha_monitor_timer_callback - generic callback representation
> > > > + *
> > > > + * This callback runs in an RCU read-side critical section to allow the
> > > > + * destruction sequence to easily synchronize_rcu() with all pending
> > > > timer
> > > > + * after asynchronously disabling them.
> > > > + */
> > > > static inline void __ha_monitor_timer_callback(struct ha_monitor
> > > > *ha_mon)
> > > > {
> > > > - enum states curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
> > > > DECLARE_SEQ_BUF(env_string, ENV_BUFFER_SIZE);
> > > > - u64 time_ns = ha_get_ns();
> > > > + enum states curr_state;
> > > > + u64 time_ns;
> > > > +
> > > > + if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
> > > > + return;
> > > >
> > > > + guard(rcu)();
> > > > + curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
> > > > + time_ns = ha_get_ns();
> > > > ha_get_env_string(&env_string, ha_mon, time_ns);
> > > > ha_react(curr_state, EVENT_NONE, env_string.buffer);
> > > > ha_trace_error_env(ha_mon, model_get_state_name(curr_state),
> >
^ permalink raw reply
* Re: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-05-20 6:20 UTC (permalink / raw)
To: Steven Rostedt
Cc: sashiko-bot, sashiko-reviews, bpf, LKML, Linux trace kernel,
Masami Hiramatsu
In-Reply-To: <20260519141726.613e2e54@fedora>
On Tue, 19 May 2026 14:17:26 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 19 May 2026 17:48:47 +0000
> sashiko-bot@kernel.org wrote:
>
> > Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
> > - [High] Using `(STRUCT)` typecasts for non-eprobes leaves the base fetch instruction uninitialized (`FETCH_OP_NOP`), silently breaking the probe.
> > - [High] Prematurely clearing `TPARG_FL_TYPECAST` causes cross-BTF type confusion and breaks string fetches.
> > - [Medium] Erroneous caching of `struct_btf` restricts a single trace probe to using typecasts from only one BTF object, preventing cross-module lookups.
> > - [Medium] Cross-BTF string table access in `parse_btf_arg()` causes invalid string comparisons.
> > - [Low] `clear_btf_context()` fails to clear `ctx->struct_btf` after dropping its reference.
> > --
> >
> > commit de4b6287a7069fde24b0e0f679a545a06fb9ae6c
> > Author: Steven Rostedt <rostedt@goodmis.org>
> >
> > tracing/eprobes: Allow use of BTF names to dereference pointers
> >
> > Add syntax to the parsing of eprobes to be able to typecast a trace event
> > field that is a pointer to a structure.
> >
> > > diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> > > index e1c73065dae51..2bc9f4ca8aa7e 100644
> > > --- a/kernel/trace/trace_probe.c
> > > +++ b/kernel/trace/trace_probe.c
> > > @@ -464,6 +470,27 @@ static const char *fetch_type_from_btf_type(struct btf *btf,
> > > return NULL;
> > > }
> > >
> > > +static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
> > > +{
> > > + int id;
> > > +
> > > + if (!ctx->struct_btf) {
> > > + struct btf *btf;
> > > +
> > > + id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
> > > + if (id < 0)
> > > + return id;
> > > + ctx->struct_btf = btf;
> > > + } else {
> > > + id = btf_find_by_name_kind(ctx->struct_btf, sname, BTF_KIND_STRUCT);
> >
> > [Severity: Medium]
> > Does caching struct_btf here restrict a single trace probe event to using
> > typecasts from only one BTF object?
> >
> > If a probe attempts to typecast structures from two different kernel
> > modules, wouldn't the lookup for the second module's struct fail because
> > it only searches the cached BTF object of the first module (or vmlinux)?
>
> Hmm, good point. I guess I should clear the struct_btf at the end of
> the '(' switch case.
Yes, it should be reset before parsing next argument.
>
> >
> > > + if (id < 0)
> > > + return id;
> > > + }
> > > +
> > > + ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
> > > + return 0;
> > > +}
> >
> > [ ... ]
> >
> > > @@ -515,6 +542,10 @@ static void clear_btf_context(struct traceprobe_parse_context *ctx)
> > > ctx->params = NULL;
> > > ctx->nr_params = 0;
> > > }
> > > + if (ctx->struct_btf) {
> > > + btf_put(ctx->struct_btf);
> > > + ctx->last_struct = NULL;
> >
> > [Severity: Low]
> > Should ctx->struct_btf be explicitly set to NULL after btf_put() drops
> > the reference?
>
> I'm thinking of dropping it in the '(' switch case.
Can you consider making the '(' switch case part as a helper
function because it depends on CONFIG_DEBUG_INFO_BTF?
Thanks,
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH v2] tools/bootconfig: Fix buf leaks in apply_xbc
From: lihongtao @ 2026-05-20 3:01 UTC (permalink / raw)
To: Masami Hiramatsu; +Cc: linux-kernel, linux-trace-kernel, Hongtao Lee
From: Hongtao Lee <lihongtao@kylinos.cn>
If data calloc failed, free the buf before return.
Fixes: 950313ebf79c ("tools: bootconfig: Add bootconfig command")
Signed-off-by: Hongtao Lee <lihongtao@kylinos.cn>
---
V1 -> V2: Change Email Signed name
tools/bootconfig/main.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/bootconfig/main.c b/tools/bootconfig/main.c
index 643f707b8f1d..ddabde20585f 100644
--- a/tools/bootconfig/main.c
+++ b/tools/bootconfig/main.c
@@ -390,8 +390,10 @@ static int apply_xbc(const char *path, const char *xbc_path)
/* Backup the bootconfig data */
data = calloc(size + BOOTCONFIG_ALIGN + BOOTCONFIG_FOOTER_SIZE, 1);
- if (!data)
+ if (!data) {
+ free(buf);
return -ENOMEM;
+ }
memcpy(data, buf, size);
/* Check the data format */
--
2.25.1
^ permalink raw reply related
* Re: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: kernel test robot @ 2026-05-19 22:03 UTC (permalink / raw)
To: Steven Rostedt, LKML, Linux trace kernel, bpf
Cc: oe-kbuild-all, Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
Tom Zanussi, Andrew Morton, Linux Memory Management List,
Thomas Gleixner, Ian Rogers, Jiri Olsa
In-Reply-To: <20260519130144.40e71a00@fedora>
Hi Steven,
kernel test robot noticed the following build errors:
[auto build test ERROR on trace/for-next]
[also build test ERROR on linus/master v7.1-rc4 next-20260519]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-eprobes-Allow-use-of-BTF-names-to-dereference-pointers/20260520-011353
base: https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link: https://lore.kernel.org/r/20260519130144.40e71a00%40fedora
patch subject: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
config: arc-defconfig (https://download.01.org/0day-ci/archive/20260520/202605200549.zZkT9xWc-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260520/202605200549.zZkT9xWc-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605200549.zZkT9xWc-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/trace/trace_probe.c: In function 'parse_probe_vars':
kernel/trace/trace_probe.c:1035:21: error: implicit declaration of function 'parse_trace_event'; did you mean 'parse_trace_event_arg'? [-Wimplicit-function-declaration]
1035 | if (parse_trace_event(arg, code, ctx) < 0)
| ^~~~~~~~~~~~~~~~~
| parse_trace_event_arg
kernel/trace/trace_probe.c: In function 'sprint_nth_btf_arg':
kernel/trace/trace_probe.c:1803:27: error: implicit declaration of function 'ctx_btf' [-Wimplicit-function-declaration]
1803 | struct btf *btf = ctx_btf(ctx);
| ^~~~~~~
>> kernel/trace/trace_probe.c:1803:27: error: initialization of 'struct btf *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
kernel/trace/trace_probe.c: At top level:
kernel/trace/trace_probe.c:318:12: warning: 'parse_trace_event_arg' defined but not used [-Wunused-function]
318 | static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
| ^~~~~~~~~~~~~~~~~~~~~
vim +1803 kernel/trace/trace_probe.c
1797
1798 static int sprint_nth_btf_arg(int idx, const char *type,
1799 char *buf, int bufsize,
1800 struct traceprobe_parse_context *ctx)
1801 {
1802 const char *name;
> 1803 struct btf *btf = ctx_btf(ctx);
1804 int ret;
1805
1806 if (idx >= ctx->nr_params) {
1807 trace_probe_log_err(0, NO_BTFARG);
1808 return -ENOENT;
1809 }
1810 name = btf_name_by_offset(btf, ctx->params[idx].name_off);
1811 if (!name) {
1812 trace_probe_log_err(0, NO_BTF_ENTRY);
1813 return -ENOENT;
1814 }
1815 ret = snprintf(buf, bufsize, "%s%s", name, type);
1816 if (ret >= bufsize) {
1817 trace_probe_log_err(0, ARGS_2LONG);
1818 return -E2BIG;
1819 }
1820 return ret;
1821 }
1822
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
* Re: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: kernel test robot @ 2026-05-19 21:09 UTC (permalink / raw)
To: Steven Rostedt, LKML, Linux trace kernel, bpf
Cc: oe-kbuild-all, Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
Tom Zanussi, Andrew Morton, Linux Memory Management List,
Thomas Gleixner, Ian Rogers, Jiri Olsa
In-Reply-To: <20260519130144.40e71a00@fedora>
Hi Steven,
kernel test robot noticed the following build errors:
[auto build test ERROR on trace/for-next]
[also build test ERROR on linus/master v7.1-rc4 next-20260519]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-eprobes-Allow-use-of-BTF-names-to-dereference-pointers/20260520-011353
base: https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link: https://lore.kernel.org/r/20260519130144.40e71a00%40fedora
patch subject: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
config: s390-randconfig-r071-20260520 (https://download.01.org/0day-ci/archive/20260520/202605200427.0xXjKghz-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 8.5.0
smatch: v0.5.0-9185-gbcc58b9c
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260520/202605200427.0xXjKghz-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605200427.0xXjKghz-lkp@intel.com/
All error/warnings (new ones prefixed by >>):
kernel/trace/trace_probe.c: In function 'parse_probe_vars':
>> kernel/trace/trace_probe.c:1035:7: error: implicit declaration of function 'parse_trace_event'; did you mean 'parse_trace_event_arg'? [-Werror=implicit-function-declaration]
if (parse_trace_event(arg, code, ctx) < 0)
^~~~~~~~~~~~~~~~~
parse_trace_event_arg
kernel/trace/trace_probe.c: In function 'sprint_nth_btf_arg':
>> kernel/trace/trace_probe.c:1803:20: error: implicit declaration of function 'ctx_btf' [-Werror=implicit-function-declaration]
struct btf *btf = ctx_btf(ctx);
^~~~~~~
>> kernel/trace/trace_probe.c:1803:20: warning: initialization of 'struct btf *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
At top level:
>> kernel/trace/trace_probe.c:318:12: warning: 'parse_trace_event_arg' defined but not used [-Wunused-function]
static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
^~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +1035 kernel/trace/trace_probe.c
1020
1021 /* Parse $vars. @orig_arg points '$', which syncs to @ctx->offset */
1022 static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
1023 struct fetch_insn **pcode,
1024 struct fetch_insn *end,
1025 struct traceprobe_parse_context *ctx)
1026 {
1027 struct fetch_insn *code = *pcode;
1028 int err = TP_ERR_BAD_VAR;
1029 char *arg = orig_arg + 1;
1030 unsigned long param;
1031 int ret = 0;
1032 int len;
1033
1034 if (ctx->flags & TPARG_FL_TEVENT) {
> 1035 if (parse_trace_event(arg, code, ctx) < 0)
1036 goto inval;
1037 return 0;
1038 }
1039
1040 if (str_has_prefix(arg, "retval")) {
1041 if (!(ctx->flags & TPARG_FL_RETURN)) {
1042 err = TP_ERR_RETVAL_ON_PROBE;
1043 goto inval;
1044 }
1045 if (!(ctx->flags & TPARG_FL_KERNEL) ||
1046 !IS_ENABLED(CONFIG_PROBE_EVENTS_BTF_ARGS)) {
1047 code->op = FETCH_OP_RETVAL;
1048 return 0;
1049 }
1050 return parse_btf_arg(orig_arg, pcode, end, ctx);
1051 }
1052
1053 len = str_has_prefix(arg, "stack");
1054 if (len) {
1055
1056 if (arg[len] == '\0') {
1057 code->op = FETCH_OP_STACKP;
1058 return 0;
1059 }
1060
1061 if (isdigit(arg[len])) {
1062 ret = kstrtoul(arg + len, 10, ¶m);
1063 if (ret)
1064 goto inval;
1065
1066 if ((ctx->flags & TPARG_FL_KERNEL) &&
1067 param > PARAM_MAX_STACK) {
1068 err = TP_ERR_BAD_STACK_NUM;
1069 goto inval;
1070 }
1071 code->op = FETCH_OP_STACK;
1072 code->param = (unsigned int)param;
1073 return 0;
1074 }
1075 goto inval;
1076 }
1077
1078 if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
1079 code->op = FETCH_OP_COMM;
1080 return 0;
1081 }
1082
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
* Re: [PATCHv2 04/11] libbpf: Change has_nop_combo to work on top of nop10
From: Jiri Olsa @ 2026-05-19 20:36 UTC (permalink / raw)
To: bot+bpf-ci
Cc: oleg, peterz, mingo, mhiramat, andrii, jakub, bpf,
linux-trace-kernel, ast, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <a5a57eb9690a537c2d8d7ec619fb1d200c9fc804436ae085b9344b691c2470df@mail.kernel.org>
On Mon, May 18, 2026 at 11:37:58AM +0000, bot+bpf-ci@kernel.org wrote:
> > libbpf: Change has_nop_combo to work on top of nop10
> >
> > We now expect nop combo with 10 bytes nop instead of 5 bytes nop,
> > fixing has_nop_combo to reflect that.
> >
> > Fixes: 41a5c7df4466 ("libbpf: Add support to detect nop,nop5 instructions combo for usdt probe")
>
> Should the Fixes tag reference commit 5c3c59c7952a ("uprobes/x86: Move
> optimized uprobe from nop5 to nop10") instead? Commit 41a5c7df4466 originally
> introduced has_nop_combo() with the correct nop5 check. Commit 5c3c59c7952a
> later changed the kernel to use nop10 instead of nop5 for uprobe optimization
> but failed to update the libbpf has_nop_combo() function accordingly, creating
> a mismatch between kernel and userspace.
keeping things separated now for easier review
>
> > Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> >
> > diff --git a/tools/lib/bpf/usdt.c b/tools/lib/bpf/usdt.c
> > index e3710933fd52..7e62e4d5bedd 100644
> > --- a/tools/lib/bpf/usdt.c
> > +++ b/tools/lib/bpf/usdt.c
>
> [ ... ]
>
> > @@ -596,14 +596,14 @@ static int parse_usdt_spec(struct usdt_spec *spec, const struct usdt_note *note,
> > #if defined(__x86_64__)
> > static bool has_nop_combo(int fd, long off)
> > {
> > - unsigned char nop_combo[6] = {
> > - 0x90, 0x0f, 0x1f, 0x44, 0x00, 0x00 /* nop,nop5 */
> > + unsigned char nop_combo[11] = {
> > + 0x90, 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00,
> > };
> > - unsigned char buf[6];
> > + unsigned char buf[11];
> >
> > - if (pread(fd, buf, 6, off) != 6)
> > + if (pread(fd, buf, 11, off) != 11)
> > return false;
> > - return memcmp(buf, nop_combo, 6) == 0;
> > + return memcmp(buf, nop_combo, 11) == 0;
> > }
>
> This isn't a bug, but Jakub Sitnicki suggested in v1 review to use
> ARRAY_SIZE(buf) instead of repeating the scalar value 11 in multiple places.
> The suggestion was included alongside the Reviewed-by tag but was not
> addressed.
I think it's easy enough to keep the number
jirka
^ permalink raw reply
* Re: [PATCH mm-unstable v17 00/14] khugepaged: mTHP support
From: Nico Pache @ 2026-05-19 19:20 UTC (permalink / raw)
To: Wei Yang
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260518125007.a4z3pw4r73uuwja4@master>
On Mon, May 18, 2026 at 6:50 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Mon, May 11, 2026 at 12:58:00PM -0600, Nico Pache wrote:
> >The following series provides khugepaged with the capability to collapse
> >anonymous memory regions to mTHPs.
> >
> >To achieve this we generalize the khugepaged functions to no longer depend
> >on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> >pages that are occupied (!none/zero). After the PMD scan is done, we use
> >the bitmap to find the optimal mTHP sizes for the PMD range. The
> >restriction on max_ptes_none is removed during the scan, to make sure we
> >account for the whole PMD range in the bitmap. When no mTHP size is
> >enabled, the legacy behavior of khugepaged is maintained.
> >
> >We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1
> >(ie 511). If any other value is specified, the kernel will emit a warning
> >and no mTHP collapse will be attempted. If a mTHP collapse is attempted,
> >but contains swapped out, or shared pages, we don't perform the collapse.
> >It is now also possible to collapse to mTHPs without requiring the PMD THP
> >size to be enabled. These limitations are to prevent collapse "creep"
> >behavior. This prevents constantly promoting mTHPs to the next available
> >size, which would occur because a collapse introduces more non-zero pages
> >that would satisfy the promotion condition on subsequent scans.
> >
> >Patch 1-2: Generalize hugepage_vma_revalidate and alloc_charge_folio
> > for arbitrary orders.
> >Patch 3: Rework max_ptes_* handling into helper functions
> >Patch 4: Generalize __collapse_huge_page_* for mTHP support
> >Patch 5: Require collapse_huge_page to enter/exit with the lock dropped
> >Patch 6: Generalize collapse_huge_page for mTHP collapse
> >Patch 7: Skip collapsing mTHP to smaller orders
> >Patch 8-9: Add per-order mTHP statistics and tracepoints
> >Patch 10: Introduce collapse_allowable_orders helper function
> >Patch 11-13: Introduce bitmap and mTHP collapse support, fully enabled
> >Patch 14: Documentation
> >
> >Testing:
> >- Built for x86_64, aarch64, ppc64le, and s390x
> >- ran all arches on test suites provided by the kernel-tests project
> >- internal testing suites: functional testing and performance testing
> >- selftests mm
> >- I created a test script that I used to push khugepaged to its limits
> > while monitoring a number of stats and tracepoints. The code is
> > available here[1] (Run in legacy mode for these changes and set mthp
> > sizes to inherit)
> > The summary from my testings was that there was no significant
> > regression noticed through this test. In some cases my changes had
> > better collapse latencies, and was able to scan more pages in the same
> > amount of time/work, but for the most part the results were consistent.
> >- redis testing. I did some testing with these changes along with my defer
> > changes (see followup [2] post for more details). We've decided to get
> > the mTHP changes merged first before attempting the defer series.
> >- some basic testing on 64k page size.
> >- lots of general use.
> >
>
> Two links are missing. I got them from previous version.
>
> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> [2] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/
Oh whoops, ill make sure they are there in the followup
>
> And the test in [1] is a performance test. I am thinking whether we want a
> functional test in selftests.
It also works as a functional test in some regards. The reason i never
pursued self-tests is that I naively thought this was getting merged
6(?) months ago and at the time the selftests infrastructure didn't
support it well. Baolin included patches to clean that up in his shmem
mTHP support patches and added tests for both features. Let's repost
and re-merge this first; then, I will follow up in one or two weeks
regarding self-tests. I'm currently on PTO and only have time to
complete, test, and return the v18 changes to Andrew before they
create a huge merge headache and we miss yet another window.
>
> I did a quick try with following change and some hack.
Thanks Ill use that as a base!
>
> @@ -744,6 +765,51 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
> ksft_test_result_report(exit_status, "%s\n", __func__);
> }
>
> +static void collapse_mth_ptes(struct collapse_context *c, struct mem_ops *ops)
> +{
> + struct thp_settings settings = *thp_current_settings();
> + void *p;
> + int i;
> +
> + /* Disable mthp on fault */
> + for (i = 0; i < NR_ORDERS; i++) {
> + settings.hugepages[i].enabled = THP_NEVER;
> + }
> + thp_push_settings(&settings);
> +
> + p = ops->setup_area(1);
> +
> + ops->fault(p, 0, hpage_pmd_size);
> +
> + /* Expect all order-0 folio after fault */
> + memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
> + expected_orders[0] = hpage_pmd_nr;
> + if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
> + kpageflags_fd, expected_orders,
> + (pmd_order + 1)))
> + ksft_exit_fail_msg("Unexpected huge page at fault\n");
> +
> + /* Enable mthp before collapse */
> + thp_pop_settings();
> + settings.hugepages[2].enabled = THP_ALWAYS;
> + thp_push_settings(&settings);
> +
> + c->collapse("Collapse fully populated PTE table with order 2", p, 1,
> + ops, true);
> +
> + /* Expect all order-2 folio after collapse */
> + memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
> + expected_orders[2] = 1 << (pmd_order - 2);
> + if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
> + kpageflags_fd, expected_orders,
> + (pmd_order + 1)))
> + ksft_exit_fail_msg("Unexpected page order\n");
> +
> + ops->cleanup_area(p, hpage_pmd_size);
> + thp_pop_settings();
> + ksft_test_result_report(exit_status, "%s\n", __func__);
> +}
> +
> static void collapse_swapin_single_pte(struct collapse_context *c, struct mem_ops *ops)
> {
> void *p;
>
> This leverage check_after_split_folio_orders() in split_huge_page_test.c to
> check folio order in PMD range.
>
> --
> Wei Yang
> Help you, Help me
>
^ permalink raw reply
* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Nico Pache @ 2026-05-19 19:05 UTC (permalink / raw)
To: Lorenzo Stoakes, David Hildenbrand (Arm), Wei Yang, Lance Yang
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
liam, mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <agtpK1x27B-E7mMo@lucifer>
On Mon, May 18, 2026 at 1:33 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 03:16:11PM +0200, David Hildenbrand (Arm) wrote:
> > On 5/14/26 05:10, Wei Yang wrote:
> > > On Tue, May 12, 2026 at 03:42:02PM +0800, Lance Yang wrote:
> > >>
> > >> On Mon, May 11, 2026 at 12:58:04PM -0600, Nico Pache wrote:
> > >>> generalize the order of the __collapse_huge_page_* and collapse_max_*
> > >>> functions to support future mTHP collapse.
> > >>>
> > >>> The current mechanism for determining collapse with the
> > >>> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> > >>> raises a key design issue: if we support user defined max_pte_none values
> > >>> (even those scaled by order), a collapse of a lower order can introduces
> > >>> an feedback loop, or "creep", when max_ptes_none is set to a value greater
> > >>> than HPAGE_PMD_NR / 2. [1]
> > >>>
> > >>> With this configuration, a successful collapse to order N will populate
> > >>> enough pages to satisfy the collapse condition on order N+1 on the next
> > >>> scan. This leads to unnecessary work and memory churn.
> > >>>
> > >>> To fix this issue introduce a helper function that will limit mTHP
> > >>> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> > >>> This effectively supports two modes: [2]
> > >>>
> > >>> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
> > >>> that maps the shared zeropage. Consequently, no memory bloat.
> > >>> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
> > >>> available mTHP order.
> > >>>
> > >>> This removes the possiblilty of "creep", while not modifying any uAPI
> > >>> expectations. A warning will be emitted if any non-supported
> > >>> max_ptes_none value is configured with mTHP enabled.
> > >>>
> > >>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> > >>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> > >>> shared or swapped entry.
> > >>>
> > >>> No functional changes in this patch; however it defines future behavior
> > >>> for mTHP collapse.
> > >>>
> > >>> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
> > >>> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
> > >>>
> > >>> Co-developed-by: Dev Jain <dev.jain@arm.com>
> > >>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> > >>> Signed-off-by: Nico Pache <npache@redhat.com>
> > >>> ---
> > >>> include/trace/events/huge_memory.h | 3 +-
> > >>> mm/khugepaged.c | 117 ++++++++++++++++++++---------
> > >>> 2 files changed, 85 insertions(+), 35 deletions(-)
> > >>>
> > >>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > >>> index bcdc57eea270..443e0bd13fdb 100644
> > >>> --- a/include/trace/events/huge_memory.h
> > >>> +++ b/include/trace/events/huge_memory.h
> > >>> @@ -39,7 +39,8 @@
> > >>> EM( SCAN_STORE_FAILED, "store_failed") \
> > >>> EM( SCAN_COPY_MC, "copy_poisoned_page") \
> > >>> EM( SCAN_PAGE_FILLED, "page_filled") \
> > >>> - EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
> > >>> + EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback") \
> > >>> + EMe(SCAN_INVALID_PTES_NONE, "invalid_ptes_none")
> > >>>
> > >>> #undef EM
> > >>> #undef EMe
> > >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > >>> index f68853b3caa7..27465161fa6d 100644
> > >>> --- a/mm/khugepaged.c
> > >>> +++ b/mm/khugepaged.c
> > >>> @@ -61,6 +61,7 @@ enum scan_result {
> > >>> SCAN_COPY_MC,
> > >>> SCAN_PAGE_FILLED,
> > >>> SCAN_PAGE_DIRTY_OR_WRITEBACK,
> > >>> + SCAN_INVALID_PTES_NONE,
> > >>> };
> > >>>
> > >>> #define CREATE_TRACE_POINTS
> > >>> @@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
> > >>> * PTEs for the given collapse operation.
> > >>> * @cc: The collapse control struct
> > >>> * @vma: The vma to check for userfaultfd
> > >>> + * @order: The folio order being collapsed to
> > >>> *
> > >>> * Return: Maximum number of none-page or zero-page PTEs allowed for the
> > >>> * collapse operation.
> > >>> */
> > >>> -static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> > >>> - struct vm_area_struct *vma)
> > >>> +static int collapse_max_ptes_none(struct collapse_control *cc,
> > >>> + struct vm_area_struct *vma, unsigned int order)
> > >>> {
> > >>> + unsigned int max_ptes_none = khugepaged_max_ptes_none;
> > >>> // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
> > >>
> > >> One thing I still want to call out: kernel code usually uses C-style
> > >> comments :)
> > >>
> > >>> if (vma && userfaultfd_armed(vma))
> > >>> return 0;
> > >>> // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> > >>> if (!cc->is_khugepaged)
> > >>> return HPAGE_PMD_NR;
> > >>> - // For all other cases repect the user defined maximum.
> > >>> - return khugepaged_max_ptes_none;
> > >>> + // for PMD collapse, respect the user defined maximum.
> > >>> + if (is_pmd_order(order))
> > >>> + return max_ptes_none;
> > >>> + /* Zero/non-present collapse disabled. */
> > >>> + if (!max_ptes_none)
> > >>> + return 0;
> > >>> + // for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
> > >>> + // scale the maximum number of PTEs to the order of the collapse.
> > >>> + if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> > >>> + return (1 << order) - 1;
> > >>> +
> > >>> + // We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
> > >>> + // Emit a warning and return -EINVAL.
> > >>> + pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
> > >>> + KHUGEPAGED_MAX_PTES_LIMIT);
> > >>
> > >> Maybe fallback to 0 instead, as David suggested earlier?
> > >>
> > >
> > > It looks reasonable to fallback to 0.
> > >
> > > But as the updated Document says in patch 14:
> > >
> > > For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other
> > > value will emit a warning and no mTHP collapse will be attempted.
> > >
> > > This is why it does like this now.
> > >
> > > mthp_collapse()
> > > max_ptes_none = collapse_max_ptes_none();
> > > if (max_ptes_none < 0)
> > > return collapsed;
> > >
> > >> max_ptes_none is mostly legacy PMD THP behavior. mTHP is new, and any
> > >> intermediate value in (0, KHUGEPAGED_MAX_PTES_LIMIT) would implicitly
> > >> disable it :(
> > >>
> > >
> > > So it depends on what we want to do here :-)
> > >
> > > For me, I would vote for fallback to 0.
> >
> > At this point I'll prefer to not return errors from collapse_max_ptes_none().
> > It's just rather awkward to return an error deep down in collapse code for a
> > configuration problem.
> >
> > For mthp collapse, we only support max_ptes_none==0 and
> > max_ptes_none=="HPAGE_PMD_NR - 1" (default).
> >
> > If another value is specified while collapsing mTHP, print a warning and treat
> > it as 0 (save value, no creep, no memory waste).
> >
> > In a sense, this is similar to how we handle max_ptes_shared + max_ptes_swap:
> > for mTHP: we always treat them as being 0 for mTHP collapse (and don't issue a
> > warning, because we would issue a warning with the default settings).
> >
> > @Lorenzo, fine with you?
>
> Yes 100%, this sounds sensible both in terms of the error and the default. Let's
> keep our lives simple(-ish) please :)
Ok thank you im glad we finally came to consensus on this! phew!
>
> >
> > --
> > Cheers,
> >
> > David
>
> Cheers, Lorenzo
>
^ permalink raw reply
* Re: [PATCH mm-unstable v17 02/14] mm/khugepaged: generalize alloc_charge_folio()
From: Nico Pache @ 2026-05-19 19:03 UTC (permalink / raw)
To: Lance Yang, Usama Arif
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm,
anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <51db205d-77cf-416f-bfe5-fd9d0b12c433@linux.dev>
On Mon, May 18, 2026 at 8:50 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
>
> On 2026/5/18 19:55, Usama Arif wrote:
> [...]
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 979885694351..f0e29d5c7b1f 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -1068,21 +1068,26 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> >> }
> >>
> >> static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >> - struct collapse_control *cc)
> >> + struct collapse_control *cc, unsigned int order)
> >> {
> >> gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> >> GFP_TRANSHUGE);
> >> int node = collapse_find_target_node(cc);
> >> struct folio *folio;
> >>
> >> - folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> >> + folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
> >> if (!folio) {
> >> *foliop = NULL;
> >> - count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> >> + if (is_pmd_order(order))
> >> + count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> >> + count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
> >> return SCAN_ALLOC_HUGE_PAGE_FAIL;
> >> }
> >>
> >> - count_vm_event(THP_COLLAPSE_ALLOC);
> >> + if (is_pmd_order(order))
> >> + count_vm_event(THP_COLLAPSE_ALLOC);
> >> + count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
> >> +
> >
> > The vmstat THP_COLLAPSE_ALLOC counter is pmd order only.
> > But after this we have
> >
> > count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1);
> >
> > which is not being guarded with is_pmd_order().
>
> Good catch!
>
> >
> > I think we want this to be pmd order only as well so that
> > the meaning of the vmstat and cgroup counter remains the same?
>
> Agreed. THP_COLLAPSE_ALLOC should remain PMD order only for
> vmstat and memcg events.
>
> So this should be guarded with is_pmd_order() as well :)
Thanks Usama, I added that.
>
> Cheers, Lance
>
^ permalink raw reply
* Re: [PATCH mm-unstable v17 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Nico Pache @ 2026-05-19 18:21 UTC (permalink / raw)
To: Lance Yang
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, usama.arif
In-Reply-To: <20260512044444.71798-1-lance.yang@linux.dev>
On Mon, May 11, 2026 at 10:45 PM Lance Yang <lance.yang@linux.dev> wrote:
>
>
> On Mon, May 11, 2026 at 12:58:03PM -0600, Nico Pache wrote:
> >The following cleanup reworks all the max_ptes_* handling into helper
> >functions. This increases the code readability and will later be used to
> >implement the mTHP handling of these variables.
> >
> >With these changes we abstract all the madvise_collapse() special casing
> >(dont respect the sysctls) away from the functions that utilize them. And
>
> Nit: s/dont/do not/
>
> >will be used later in this series to cleanly restrict the mTHP collapse
> >behavior.
> >
> >No functional change is intended; however, we are now only reading the
> >sysfs variables once per scan, whereas before these variables were being
> >read on each loop iteration.
> >
> >Suggested-by: David Hildenbrand <david@kernel.org>
> >Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> >Acked-by: Usama Arif <usama.arif@linux.dev>
> >Signed-off-by: Nico Pache <npache@redhat.com>
> >---
> > mm/khugepaged.c | 118 +++++++++++++++++++++++++++++++++---------------
> > 1 file changed, 82 insertions(+), 36 deletions(-)
> >
> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >index f0e29d5c7b1f..f68853b3caa7 100644
> >--- a/mm/khugepaged.c
> >+++ b/mm/khugepaged.c
> >@@ -348,6 +348,62 @@ static bool pte_none_or_zero(pte_t pte)
> > return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
> > }
> >
> >+/**
> >+ * collapse_max_ptes_none - Calculate maximum allowed none-page or zero-page
> >+ * PTEs for the given collapse operation.
> >+ * @cc: The collapse control struct
> >+ * @vma: The vma to check for userfaultfd
> >+ *
> >+ * Return: Maximum number of none-page or zero-page PTEs allowed for the
> >+ * collapse operation.
> >+ */
> >+static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> >+ struct vm_area_struct *vma)
> >+{
> >+ // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
> >+ if (vma && userfaultfd_armed(vma))
> >+ return 0;
> >+ // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> >+ if (!cc->is_khugepaged)
> >+ return HPAGE_PMD_NR;
> >+ // For all other cases repect the user defined maximum.
> >+ return khugepaged_max_ptes_none;
>
> Nit: kernel code usually uses C-style comments. This could be:
>
> /* For all other cases, respect the user-defined maximum. */
>
> Also, s/repect/respect/.
>
> >+}
> >+
> >+/**
> >+ * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
> >+ * anonymous pages for the given collapse operation.
> >+ * @cc: The collapse control struct
> >+ *
> >+ * Return: Maximum number of PTEs that map shared anonymous pages for the
> >+ * collapse operation
> >+ */
> >+static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> >+{
> >+ // for MADV_COLLAPSE, do not restrict the number of PTEs that map shared
> >+ // anonymous pages.
>
> Ditto.
>
> >+ if (!cc->is_khugepaged)
> >+ return HPAGE_PMD_NR;
> >+ return khugepaged_max_ptes_shared;
> >+}
> >+
> >+/**
> >+ * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
> >+ * maximum allowed non-present pagecache entries for the given collapse operation.
> >+ * @cc: The collapse control struct
> >+ *
> >+ * Return: Maximum number of non-present PTEs or the maximum allowed non-present
> >+ * pagecache entries for the collapse operation.
> >+ */
> >+static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
> >+{
> >+ // for MADV_COLLAPSE, do not restrict the number PTEs entries or
> >+ // pagecache entries that are non-present.
>
> Same here.
>
> >+ if (!cc->is_khugepaged)
> >+ return HPAGE_PMD_NR;
> >+ return khugepaged_max_ptes_swap;
> >+}
> >+
> > int hugepage_madvise(struct vm_area_struct *vma,
> > vm_flags_t *vm_flags, int advice)
> > {
> >@@ -546,21 +602,19 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > pte_t *_pte;
> > int none_or_zero = 0, shared = 0, referenced = 0;
> > enum scan_result result = SCAN_FAIL;
> >+ unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> >+ unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>
> Nit: could these be const, as David suggested earlier?
>
> Nothing else jumped out at me. LGTM!
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
Ack on all the above thank you !
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox