Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* Re: [PATCH v6 20/43] KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
From: Fuad Tabba @ 2026-05-21  8:54 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-20-91ab5a8b19a4@google.com>

On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Now that guest_memfd supports tracking private vs. shared within gmem
> itself, allow userspace to specify INIT_SHARED on a guest_memfd instance
> for x86 Confidential Computing (CoCo) VMs, so long as per-VM attributes
> are disabled, i.e. when it's actually possible for a guest_memfd instance
> to contain shared memory.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad


> ---
>  arch/x86/kvm/x86.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1560de1e95be0..6609957ecfea3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -14172,14 +14172,13 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
>  }
>
>  #ifdef CONFIG_KVM_GUEST_MEMFD
> -/*
> - * KVM doesn't yet support initializing guest_memfd memory as shared for VMs
> - * with private memory (the private vs. shared tracking needs to be moved into
> - * guest_memfd).
> - */
>  bool kvm_arch_supports_gmem_init_shared(struct kvm *kvm)
>  {
> -       return !kvm_arch_has_private_mem(kvm);
> +       /*
> +        * INIT_SHARED isn't supported if the memory attributes are per-VM,
> +        * in which case guest_memfd can _only_ be used for private memory.
> +        */
> +       return !vm_memory_attributes || !kvm_arch_has_private_mem(kvm);
>  }
>
>  #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>

^ permalink raw reply

* Re: [PATCH v6 19/43] KVM: Let userspace disable per-VM mem attributes, enable per-gmem attributes
From: Fuad Tabba @ 2026-05-21  8:44 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-19-91ab5a8b19a4@google.com>

Hi Ackerley,

On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Make vm_memory_attributes a module parameter so that userspace can disable
> the use of memory attributes on the VM level.
>
> To avoid inconsistencies in the way memory attributes are tracked in KVM
> and guest_memfd, the vm_memory_attributes module_param is made
> read-only (0444).
>
> Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES selectable, only for (CoCo) VM types
> that might use vm_memory_attributes.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Config files always confuse me, but Sashiko might be onto something:

https://sashiko.dev/#/patchset/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4%40google.com?part=19

I think this partially goes back to commit 6, the one I flagged
yesterday. But also adding "default y" to KVM_VM_MEMORY_ATTRIBUTES?
The default value should at least fix this issue, but I'm not sure if
it would cause other problems...

Cheers,
/fuad


> ---
>  arch/x86/kvm/Kconfig | 13 +++++++++----
>  virt/kvm/kvm_main.c  |  1 +
>  2 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index b6d65ee664d0f..8b97d341bd33f 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -82,13 +82,20 @@ config KVM_WERROR
>
>  config KVM_VM_MEMORY_ATTRIBUTES
>         select KVM_MEMORY_ATTRIBUTES
> -       bool
> +       depends on KVM_SW_PROTECTED_VM || KVM_INTEL_TDX || KVM_AMD_SEV
> +       bool "Enable per-VM memory attributes (for CoCo VMs)"
> +       help
> +         Enable support for per-VM memory attributes, which are deprecated in
> +         favor of tracking memory attributes in guest_memfd.  Select this if
> +         you need to run CoCo VMs using a VMM that doesn't support guest_memfd
> +         memory attributes.
> +
> +         If unsure, say N.
>
>  config KVM_SW_PROTECTED_VM
>         bool "Enable support for KVM software-protected VMs"
>         depends on EXPERT
>         depends on KVM_X86 && X86_64
> -       select KVM_VM_MEMORY_ATTRIBUTES
>         help
>           Enable support for KVM software-protected VMs.  Currently, software-
>           protected VMs are purely a development and testing vehicle for
> @@ -139,7 +146,6 @@ config KVM_INTEL_TDX
>         bool "Intel Trust Domain Extensions (TDX) support"
>         default y
>         depends on INTEL_TDX_HOST
> -       select KVM_VM_MEMORY_ATTRIBUTES
>         select HAVE_KVM_ARCH_GMEM_POPULATE
>         help
>           Provides support for launching Intel Trust Domain Extensions (TDX)
> @@ -163,7 +169,6 @@ config KVM_AMD_SEV
>         depends on KVM_AMD && X86_64
>         depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
>         select ARCH_HAS_CC_PLATFORM
> -       select KVM_VM_MEMORY_ATTRIBUTES
>         select HAVE_KVM_ARCH_GMEM_PREPARE
>         select HAVE_KVM_ARCH_GMEM_INVALIDATE
>         select HAVE_KVM_ARCH_GMEM_POPULATE
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cec02d68d7039..ba195bb239aaa 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -104,6 +104,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
>  #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
>  #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>  bool vm_memory_attributes = true;
> +module_param(vm_memory_attributes, bool, 0444);
>  #endif
>  DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
>  EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>

^ permalink raw reply

* Re: [PATCH v20 10/10] ring-buffer: Show persistent buffer dropped events in trace_pipe file
From: Masami Hiramatsu @ 2026-05-21  8:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Ian Rogers
In-Reply-To: <20260520185018.470465795@kernel.org>

On Wed, 20 May 2026 14:49:48 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> When the persistent ring buffer is validated on boot up, if a subbuffer is
> deemed invalid, it resets the buffer and continues. Have the code preserve
> the RB_MISSED_EVENTS flag in the commit portion of the subbuffer header
> and pass that back so that the trace_pipe file can show the missed events
> like the trace file does.
> 
> For example:
> 
>    <...>-1242    [005] d....  4429.120116: page_fault_user: address=0x7ffaebb6e728 ip=0x7ffaeb9d4960 error_code=0x7
>    <...>-1242    [005] .....  4429.120124: mm_page_alloc: page=00000000055254f3 pfn=0x1373bd order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP
>    <...>-1242    [005] d..2.  4429.120132: tlb_flush: pages:1 reason:local MM shootdown (3)
> CPU:5 [LOST EVENTS]
>    <...>-1242    [005] d....  4429.120661: page_fault_user: address=0x55ba7c2d0944 ip=0x55ba7c20cd02 error_code=0x7
>    <...>-1242    [005] .....  4429.120669: mm_page_alloc: page=0000000005a02500 pfn=0x12b6e4 order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP
>    <...>-1242    [005] d..2.  4429.120680: tlb_flush: pages:1 reason:local MM shootdown (3)

OK, this looks good, but I have a comment below.

[...]
> @@ -7123,19 +7132,23 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  		 * the reader page.
>  		 */
>  		if (full &&
> -		    (!read || (len < (commit - read)) ||
> +		    (!read || (len < (size - read)) ||
>  		     cpu_buffer->reader_page == cpu_buffer->commit_page))
>  			return -1;
>  
> -		if (len > (commit - read))
> -			len = (commit - read);
> +		if (len > (size - read))
> +			len = (size - read);
>  
>  		/* Always keep the time extend and data together */
> -		size = rb_event_ts_length(event);
> +		event_size = rb_event_ts_length(event);
>  
> -		if (len < size)
> +		if (len < event_size)
>  			return -1;
>  
> +		if (commit & RB_MISSED_EVENTS) {
> +			printk("MISSED\n");

Is it for debug?

> +			flags = RB_MISSED_EVENTS; }

nit: block closing brace is in the previous line. Maybe typo?

Thanks,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v20 00/10] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu @ 2026-05-21  8:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Ian Rogers
In-Reply-To: <20260520184938.749337513@kernel.org>

On Wed, 20 May 2026 14:49:38 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> 
> Here is the 20th version of improvement patches for making persistent
> ring buffers robust to failures.
> The previous version is here:
> 
>  https://lore.kernel.org/all/177751968499.2136606.17388366710182662849.stgit@mhiramat.tok.corp.google.com/
> 
> None of the patches from the 19th version was changed. Only patches were
> added to it in this version. All of Masami's patches were in version 19,
> and all my patches are new to version 20. The reason I'm including
> Masami's patches with mine is so that Sashiko can handle all of them
> in one go.

Thanks for updating.

BTW, it seems Sashiko is stopping review this series. 
https://sashiko.dev/#/patchset/20260520184938.749337513%40kernel.org

Not sure why.

> 
> I moved patch 1 from v19 to my urgent branch as it was marked as
> fix for stable.
> 
> The patches I added:
> 
> - Fix an invalid sub-buffer in the iterator (added TBD fixes tag)
>   I didn't want to fold the patch into the patch that was fixed
>   as it was written by Masami.
> 
> - Have the dropped buffers be persistent across boots. Masami's patches
>   cleared the detection of lost subbuffers on boot up and the next
>   boot would not show them. If the persistent ring buffer isn't cleared
>   it should report the same info on subsequent boots.
> 
> - Show [LOST EVENTS] in persistent trace file where a subbuffer was
>   reset due to being invalid.
> 
> - Show [LOST EVENTS] in the persistent trace_pipe file as well.
> 
> Masami Hiramatsu (Google) (6):
>       ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
>       ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
>       ring-buffer: Add persistent ring buffer invalid-page inject test
>       ring-buffer: Show commit numbers in buffer_meta file
>       ring-buffer: Cleanup persistent ring buffer validation
>       ring-buffer: Cleanup buffer_data_page related code
> 
> Steven Rostedt (4):
>       ring-buffer: Skip invalid sub-buffers for iterator
>       ring-buffer: Have dropped subbuffers be persistent across reboots
>       ring-buffer: Show persistent buffer dropped events in trace file
>       ring-buffer: Show persistent buffer dropped events in trace_pipe file
> 
> ----
>  include/linux/ring_buffer.h |   1 +
>  kernel/trace/Kconfig        |  34 +++
>  kernel/trace/ring_buffer.c  | 538 +++++++++++++++++++++++++++++---------------
>  kernel/trace/trace.c        |   4 +
>  4 files changed, 397 insertions(+), 180 deletions(-)


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v6 18/43] KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86
From: Fuad Tabba @ 2026-05-21  8:07 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-18-91ab5a8b19a4@google.com>

On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Bury KVM_VM_MEMORY_ATTRIBUTES in x86 to discourage other architectures
> from adding support for per-VM memory attributes, because tracking private
> vs. shared memory on a per-VM basis is now deprecated in favor of tracking
> on a per-guest_memfd basis, and no other memory attributes are on the
> horizon.
>
> This will also allow modifying KVM_VM_MEMORY_ATTRIBUTES to be
> user-selectable (in x86) without creating weirdness in KVM's Kconfigs.
> Now that guest_memfd support memory attributes, it's entirely possible to
> run x86 CoCo VMs without support for KVM_VM_MEMORY_ATTRIBUTES.
>
> Leave the code itself in common KVM so that it's trivial to undo this
> change if new per-VM attributes do come along.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad
> ---
>  arch/x86/kvm/Kconfig | 4 ++++
>  virt/kvm/Kconfig     | 4 ----
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 26f6afd51bbdc..b6d65ee664d0f 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -80,6 +80,10 @@ config KVM_WERROR
>
>           If in doubt, say "N".
>
> +config KVM_VM_MEMORY_ATTRIBUTES
> +       select KVM_MEMORY_ATTRIBUTES
> +       bool
> +
>  config KVM_SW_PROTECTED_VM
>         bool "Enable support for KVM software-protected VMs"
>         depends on EXPERT
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index e371e079e2c50..663de6421eda2 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -103,10 +103,6 @@ config KVM_MMU_LOCKLESS_AGING
>  config KVM_MEMORY_ATTRIBUTES
>         bool
>
> -config KVM_VM_MEMORY_ATTRIBUTES
> -       select KVM_MEMORY_ATTRIBUTES
> -       bool
> -
>  config KVM_GUEST_MEMFD
>         select XARRAY_MULTI
>         select KVM_MEMORY_ATTRIBUTES
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>

^ permalink raw reply

* Re: [PATCH] gpu: host1x: trace: fix string fields in host1x traces
From: Thierry Reding @ 2026-05-21  8:02 UTC (permalink / raw)
  To: Mikko Perttunen
  Cc: Steven Rostedt, Artur Kowalski, Masami Hiramatsu,
	Mathieu Desnoyers, linux-kernel, linux-trace-kernel, linux-tegra,
	Thierry Reding, David Airlie, Simona Vetter
In-Reply-To: <pQKtL-t0RKO7KsvRTAIQ_w@nvidia.com>

[-- Attachment #1: Type: text/plain, Size: 1365 bytes --]

On Thu, May 21, 2026 at 01:33:24PM +0900, Mikko Perttunen wrote:
> On Wednesday, May 20, 2026 9:03 PM Thierry Reding wrote:
> > On Tue, May 19, 2026 at 02:10:59PM -0400, Steven Rostedt wrote:
> > > On Tue, 19 May 2026 12:16:43 +0200
> > > Artur Kowalski <arturkow2000@gmail.com> wrote:
> > > 
> > > > Use __assign_str and __get_str as required by tracing subsystem. Fixes
> > > > string fields being rejected by the verifier and unreadable from
> > > > userspace.
> > > 
> > > Does anyone use these tracepoints? The fact that they have been broken
> > > for 5 years and nobody noticed makes me think they are useless.
> > > 
> > > I rather remove them than fix them, but if someone thinks that these
> > > are still useful then by all means apply this patch.
> > > 
> > > Acked-by: Steven Rostedt <rostedt@goodmis.org>
> > 
> > I know that Mikko used them a lot early on, but this driver is pretty
> > mature now, so we rarely need this low level of tracing. I'll defer to
> > Mikko on whether we still need these.
> > 
> > Thierry
> 
> Yeah, these have been quite useful in the past when debugging why a job
> is failing. Without the cdma traces it can be cumbersome to find out
> exactly what is being sent to the hardware in some cases.
> 
> My preference is for keeping them for now.

Thanks, I'll pick this up then.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v6 17/43] KVM: guest_memfd: Determine invalidation filter from memory attributes
From: Fuad Tabba @ 2026-05-21  7:56 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-17-91ab5a8b19a4@google.com>

On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Before conversion, the range filter doesn't really matter:
>
> + For non-CoCo VMs that use guest_memfd, they have no mirrored tdp, so
>   KVM_DIRECT_ROOTS would have been invalidated anyway.
> + CoCo VMs could not use INIT_SHARED, and there's no conversion support, so
>   always using KVM_FILTER_PRIVATE would have worked.
>
> Now with conversion support, update kvm_gmem_get_invalidate_filter to
> inspect the memory attributes maple tree for a given range.
>
> Instead of determining the invalidation filter based on static inode
> flags, iterate through the attributes maple tree for the specific range
> being invalidated. This allows KVM to identify if the range contains
> private pages, shared pages, or both, and set the filter bits
> accordingly.
>
> Update kvm_gmem_invalidate_begin and kvm_gmem_release to pass the range
> parameters to the filter helper to ensure invalidation accurately
> targets the memory types present in the affected range.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad
> ---
>  virt/kvm/guest_memfd.c | 27 ++++++++++++++++++++-------
>  1 file changed, 20 insertions(+), 7 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 9f6eebfb68f6b..c9f155c2dc5c5 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -193,12 +193,24 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>         return folio;
>  }
>
> -static enum kvm_gfn_range_filter kvm_gmem_get_invalidate_filter(struct inode *inode)
> +static enum kvm_gfn_range_filter kvm_gmem_get_invalidate_filter(
> +               struct inode *inode, pgoff_t start, pgoff_t end)
>  {
> -       if (GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED)
> -               return KVM_FILTER_SHARED;
> +       struct gmem_inode *gi = GMEM_I(inode);
> +       enum kvm_gfn_range_filter filter = 0;
> +       void *entry;
> +
> +       lockdep_assert(mt_lock_is_held(&gi->attributes));
> +
> +       mt_for_each(&gi->attributes, entry, start, end - 1) {
> +               filter |= (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) ?
> +                         KVM_FILTER_PRIVATE : KVM_FILTER_SHARED;
> +
> +               if (filter == (KVM_FILTER_PRIVATE | KVM_FILTER_SHARED))
> +                       break;
> +       }
>
> -       return KVM_FILTER_PRIVATE;
> +       return filter;
>  }
>
>  static void __kvm_gmem_invalidate_begin(struct gmem_file *f, pgoff_t start,
> @@ -244,7 +256,7 @@ static void kvm_gmem_invalidate_begin(struct inode *inode, pgoff_t start,
>         enum kvm_gfn_range_filter attr_filter;
>         struct gmem_file *f;
>
> -       attr_filter = kvm_gmem_get_invalidate_filter(inode);
> +       attr_filter = kvm_gmem_get_invalidate_filter(inode, start, end);
>
>         kvm_gmem_for_each_file(f, inode)
>                 __kvm_gmem_invalidate_begin(f, start, end, attr_filter);
> @@ -367,6 +379,7 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
>  static int kvm_gmem_release(struct inode *inode, struct file *file)
>  {
>         struct gmem_file *f = file->private_data;
> +       enum kvm_gfn_range_filter filter;
>         struct kvm_memory_slot *slot;
>         struct kvm *kvm = f->kvm;
>         unsigned long index;
> @@ -398,8 +411,8 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
>          * memory, as its lifetime is associated with the inode, not the file.
>          */
>         end = i_size_read(inode) >> PAGE_SHIFT;
> -       __kvm_gmem_invalidate_begin(f, 0, end,
> -                                   kvm_gmem_get_invalidate_filter(inode));
> +       filter = kvm_gmem_get_invalidate_filter(inode, 0, end);
> +       __kvm_gmem_invalidate_begin(f, 0, end, filter);
>         __kvm_gmem_invalidate_end(f, 0, end);
>
>         list_del(&f->entry);
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>

^ permalink raw reply

* Re: [PATCH v6 16/43] KVM: guest_memfd: Use actual size for invalidation in kvm_gmem_release()
From: Fuad Tabba @ 2026-05-21  7:30 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-16-91ab5a8b19a4@google.com>

Hi Ackerley,

On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> __kvm_gmem_invalidate_begin() and __kvm_gmem_invalidate_end() actually do
> not specially handle -1ul. -1ul is used as a huge number, which legal
> indices do not exceed, and hence the invalidation works as expected.
>
> Since a later patch is going to make use of the exact range, calculate the
> size of the guest_memfd inode and use it as the end range for invalidating
> SPTEs.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Want to look at what Sashiko has to say? Seems to be a real issue:

https://sashiko.dev/#/patchset/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4%40google.com?part=16

If I understand correctly, the fix should simple: use
check_add_overflow() to validate the offset and size parameters in
kvm_gmem_bind()

   int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
             unsigned int fd, loff_t offset)
   {
       loff_t size = slot->npages << PAGE_SHIFT;
   +    loff_t end;
       unsigned long start, end_index;
       struct gmem_file *f;
...
   -    if (offset < 0 || !PAGE_ALIGNED(offset) ||
   -        offset + size > i_size_read(inode))
   +    if (offset < 0 || !PAGE_ALIGNED(offset) ||
   +        check_add_overflow(offset, size, &end) ||
   +        end > i_size_read(inode))
           goto err;

What do you think?

/fuad

> ---
>  virt/kvm/guest_memfd.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 050a8c092b1a3..9f6eebfb68f6b 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -370,6 +370,7 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
>         struct kvm_memory_slot *slot;
>         struct kvm *kvm = f->kvm;
>         unsigned long index;
> +       pgoff_t end;
>
>         /*
>          * Prevent concurrent attempts to *unbind* a memslot.  This is the last
> @@ -396,9 +397,10 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
>          * Zap all SPTEs pointed at by this file.  Do not free the backing
>          * memory, as its lifetime is associated with the inode, not the file.
>          */
> -       __kvm_gmem_invalidate_begin(f, 0, -1ul,
> +       end = i_size_read(inode) >> PAGE_SHIFT;
> +       __kvm_gmem_invalidate_begin(f, 0, end,
>                                     kvm_gmem_get_invalidate_filter(inode));
> -       __kvm_gmem_invalidate_end(f, 0, -1ul);
> +       __kvm_gmem_invalidate_end(f, 0, end);
>
>         list_del(&f->entry);
>
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>

^ permalink raw reply

* Re: [PATCH v6 05/43] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
From: Fuad Tabba @ 2026-05-21  7:19 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgGQvMdDmVfbk42EY_PGN0ybTp-x21Zj+pg_X1mk9iCRtA@mail.gmail.com>

On Wed, 20 May 2026 at 22:44, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> >
> > [...snip...]
> >
> >> +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> >> +{
> >> +       struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> >> +       struct inode *inode;
> >> +
> >> +       /*
> >> +        * If this gfn has no associated memslot, there's no chance of the gfn
> >> +        * being backed by private memory, since guest_memfd must be used for
> >> +        * private memory, and guest_memfd must be associated with some memslot.
> >> +        */
> >> +       if (!slot)
> >> +               return 0;
> >> +
> >> +       CLASS(gmem_get_file, file)(slot);
> >> +       if (!file)
> >> +               return 0;
> >> +
> >> +       inode = file_inode(file);
> >> +
> >> +       /*
> >> +        * Rely on the maple tree's internal RCU lock to ensure a
> >> +        * stable result. This result can become stale as soon as the
> >> +        * lock is dropped, so the caller _must_ still protect
> >> +        * consumption of private vs. shared by checking
> >> +        * mmu_invalidate_retry_gfn() under mmu_lock to serialize
> >> +        * against ongoing attribute updates.
> >> +        */
> >> +       return kvm_gmem_get_attributes(inode, kvm_gmem_get_index(slot, gfn));
> >> +}
> >
> > Doesn't this imply that all consumers of kvm_mem_is_private() should
> > validate the result using mmu_lock and the invalidation sequence?
>
> Let me know how I can improve the comment.

Given Sean's context, the comment is good I think. I would quibble
with the the "_must_ still protect" phrasing being a bit too strict.

Maybe just soften it slightly to acknowledge the exception? Something like:

  * lock is dropped, so callers that require a strict result _must_ protect
  * consumption of private vs. shared by checking mmu_invalidate_retry_gfn()
  * under mmu_lock to serialize against ongoing attribute updates. Callers
  * doing lockless reads must be able to tolerate a stale result.

That aligns the comment with how KVM is actually using it today. That
said, this is nitpicking. Feel free to use or ignore.

>
> I think the "consumption" of private vs shared here actually means
> something like "don't commit a page being faulted into page tables based
> on the result of kvm_gmem_get_memory_attributes() without checking
> kvm->mmu_invalidate_in_progress.", since a racing conversion may
> complete before you commit.
>
> kvm_mem_is_private() is used from these places:
>
> 1. Fault handling in KVM, like page_fault_can_be_fast(),
>    kvm_mmu_faultin_pfn(), kvm_mmu_page_fault(): this already handles the
>    entire mmu_lock and invalidation dance. No fault will be committed if
>    a racing conversion happened after kvm_mem_is_private() but before
>    the commit.
>
> 2. kvm_mmu_max_mapping_level() from recovering huge pages after
>    disabling dirty logging: Other than that it can't be used with
>    guest_memfd now since dirty logging can't be used with guest_memfd
>    and guest_memfd memslots are not updatable, this holds mmu_lock
>    throughout until the huge page recovery is done. invalidate_begin
>    also involves zapping the pages in the range, so if the order of
>    events is
>
>    | Thread A                     | Thread B          |
>    |------------------------------|-------------------|
>    | invalidate_begin + zap       |                   |
>    | update attributes maple_tree | recover huge page |
>    | invalidate_end               |                   |
>
>    Then recovering will never see the zapped pages, nothing to
>    recover, no kvm_mem_is_private() lookup.
>
> 3. kvm_arch_vcpu_pre_fault_memory()
>
>    This eventually calls kvm_tdp_mmu_page_fault(), which checks
>    is_page_fault_stale(), so it does check before committing.
>
> Were there any other calls I missed?

The one I was looking at was `sev_handle_rmp_fault()`, which does a lockless
read without the retry loop. But as Sean just pointed out, that path can
tolerate false positives/negatives and relies on the guest faulting again,
so the lack of synchronization there is existing behavior and considered "fine".

>
> > sev_handle_rmp_fault() calls kvm_mem_is_private() without holding
> > mmu_lock and without any retry mechanism. Is that a problem?
> >
>
> Sean already replied on your actual question separately :)
>
> > Cheers,
> > /fuad
> >
> >
> >>
> >> [...snip...]
> >>

^ permalink raw reply

* Re: [PATCH v6 15/43] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Fuad Tabba @ 2026-05-21  7:13 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-15-91ab5a8b19a4@google.com>

On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> When checking if a guest_memfd folio is safe for conversion, its refcount
> is examined. A folio may be present in a per-CPU lru_add fbatch, which
> temporarily increases its refcount. This can lead to a false positive,
> incorrectly indicating that the folio is in use and preventing the
> conversion, even if it is otherwise safe. The conversion process might not
> be on the same CPU that holds the folio in its fbatch, making a simple
> per-CPU check insufficient.
>
> To address this, drain all CPUs' lru_add fbatches if an unexpectedly high
> refcount is encountered during the safety check. This is performed at most
> once per conversion request. Draining only if the folio in question may be
> lru cached.
>
> guest_memfd folios are unevictable, so they can only reside in the lru_add
> fbatch. If the folio's refcount is still unsafe after draining, then the
> conversion is truly deemed unsafe.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Not an area I've worked with that much, but it seems right to me:

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad


> ---
>  mm/swap.c              |  2 ++
>  virt/kvm/guest_memfd.c | 18 ++++++++++++++----
>  2 files changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 5cc44f0de9877..3134d9d3d7c30 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -37,6 +37,7 @@
>  #include <linux/page_idle.h>
>  #include <linux/local_lock.h>
>  #include <linux/buffer_head.h>
> +#include <linux/kvm_types.h>
>
>  #include "internal.h"
>
> @@ -904,6 +905,7 @@ void lru_add_drain_all(void)
>         lru_add_drain();
>  }
>  #endif /* CONFIG_SMP */
> +EXPORT_SYMBOL_FOR_KVM(lru_add_drain_all);
>
>  atomic_t lru_disable_count = ATOMIC_INIT(0);
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 034b72b4947fb..050a8c092b1a3 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -8,6 +8,7 @@
>  #include <linux/mempolicy.h>
>  #include <linux/pseudo_fs.h>
>  #include <linux/pagemap.h>
> +#include <linux/swap.h>
>
>  #include "kvm_mm.h"
>
> @@ -596,18 +597,27 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>         const int filemap_get_folios_refcount = 1;
>         pgoff_t last = start + nr_pages - 1;
>         struct folio_batch fbatch;
> +       bool lru_drained = false;
>         bool safe = true;
>         int i;
>
>         folio_batch_init(&fbatch);
>         while (safe && filemap_get_folios(mapping, &start, last, &fbatch)) {
>
> -               for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +               for (i = 0; i < folio_batch_count(&fbatch);) {
>                         struct folio *folio = fbatch.folios[i];
>
> -                       if (folio_ref_count(folio) !=
> -                           folio_nr_pages(folio) + filemap_get_folios_refcount) {
> -                               safe = false;
> +                       safe = (folio_ref_count(folio) ==
> +                               folio_nr_pages(folio) +
> +                               filemap_get_folios_refcount);
> +
> +                       if (safe) {
> +                               ++i;
> +                       } else if (folio_may_be_lru_cached(folio) &&
> +                                  !lru_drained) {
> +                               lru_add_drain_all();
> +                               lru_drained = true;
> +                       } else {
>                                 *err_index = folio->index;
>                                 break;
>                         }
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>

^ permalink raw reply

* Re: [PATCH v6 11/43] KVM: guest_memfd: Ensure pages are not in use before conversion
From: Fuad Tabba @ 2026-05-21  7:09 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-11-91ab5a8b19a4@google.com>

Hi Ackerley,

On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> When converting memory to private in guest_memfd, it is necessary to ensure
> that the pages are not currently being accessed by any other part of the
> kernel or userspace to avoid any current user writing to guest private
> memory.
>
> guest_memfd checks for unexpected refcounts to determine whether a page is
> still in use. The only expected refcounts after unmapping the range
> requested for conversion are those that are held by guest_memfd itself.
>
> Update the kvm_memory_attributes2 structure to include an error_offset
> field. This allows KVM to report the exact offset where a conversion
> failed to userspace. If the safety check fails, return -EAGAIN and copy
> the error_offset back to userspace so that it can potentially retry the
> operation or handle the failure gracefully.
>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> ---
>  include/uapi/linux/kvm.h |  3 ++-
>  virt/kvm/guest_memfd.c   | 65 ++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 62 insertions(+), 6 deletions(-)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e6bbf68a83813..0b55258573d3d 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1658,7 +1658,8 @@ struct kvm_memory_attributes2 {
>         __u64 size;
>         __u64 attributes;
>         __u64 flags;
> -       __u64 reserved[12];
> +       __u64 error_offset;
> +       __u64 reserved[11];
>  };
>
>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 91e89b188f583..9d82642a025e9 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -572,9 +572,42 @@ static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
>         return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
>  }
>
> +static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> +                                           size_t nr_pages, pgoff_t *err_index)
> +{
> +       struct address_space *mapping = inode->i_mapping;
> +       const int filemap_get_folios_refcount = 1;
> +       pgoff_t last = start + nr_pages - 1;
> +       struct folio_batch fbatch;
> +       bool safe = true;
> +       int i;
> +
> +       folio_batch_init(&fbatch);
> +       while (safe && filemap_get_folios(mapping, &start, last, &fbatch)) {
> +
> +               for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +                       struct folio *folio = fbatch.folios[i];
> +
> +                       if (folio_ref_count(folio) !=
> +                           folio_nr_pages(folio) + filemap_get_folios_refcount) {
> +                               safe = false;
> +                               *err_index = folio->index;
> +                               break;

https://sashiko.dev/#/patchset/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4%40google.com?part=11

Sashiko raised a few issues here, but I think this one might be
genuine. Can you look into it please?

If that's right, when huge page support lands, if start falls in the
middle of a large folio, returning folio->index as the err_index will
return an offset strictly less than the requested start. A naive
userspace retry loop resuming from error_offset would step backwards
and corrupt attributes on memory it didn't intend to convert.
err_index should be clamped to max(start, folio->index).

Cheers,
/fuad

> +                       }
> +               }
> +
> +               folio_batch_release(&fbatch);
> +               cond_resched();
> +       }
> +
> +       return safe;
> +}
> +
>  static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> -                                    size_t nr_pages, uint64_t attrs)
> +                                    size_t nr_pages, uint64_t attrs,
> +                                    pgoff_t *err_index)
>  {
> +       bool to_private = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
>         struct address_space *mapping = inode->i_mapping;
>         struct gmem_inode *gi = GMEM_I(inode);
>         pgoff_t end = start + nr_pages;
> @@ -588,8 +621,21 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>
>         mas_init(&mas, mt, start);
>         r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> -       if (r)
> +       if (r) {
> +               *err_index = start;
>                 goto out;
> +       }
> +
> +       if (to_private) {
> +               unmap_mapping_pages(mapping, start, nr_pages, false);
> +
> +               if (!kvm_gmem_is_safe_for_conversion(inode, start, nr_pages,
> +                                                    err_index)) {
> +                       mas_destroy(&mas);
> +                       r = -EAGAIN;
> +                       goto out;
> +               }
> +       }
>
>         /*
>          * From this point on guest_memfd has performed necessary
> @@ -609,9 +655,10 @@ static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
>         struct gmem_file *f = file->private_data;
>         struct inode *inode = file_inode(file);
>         struct kvm_memory_attributes2 attrs;
> +       pgoff_t err_index;
>         size_t nr_pages;
>         pgoff_t index;
> -       int i;
> +       int i, r;
>
>         if (copy_from_user(&attrs, argp, sizeof(attrs)))
>                 return -EFAULT;
> @@ -635,8 +682,16 @@ static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
>
>         nr_pages = attrs.size >> PAGE_SHIFT;
>         index = attrs.offset >> PAGE_SHIFT;
> -       return __kvm_gmem_set_attributes(inode, index, nr_pages,
> -                                        attrs.attributes);
> +       r = __kvm_gmem_set_attributes(inode, index, nr_pages, attrs.attributes,
> +                                     &err_index);
> +       if (r) {
> +               attrs.error_offset = ((uint64_t)err_index) << PAGE_SHIFT;
> +
> +               if (copy_to_user(argp, &attrs, sizeof(attrs)))
> +                       return -EFAULT;
> +       }
> +
> +       return r;
>  }
>
>  static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>

^ permalink raw reply

* Re: [PATCH 2/3] rv/rtapp/sleep: Update nanosleep rule
From: Gabriele Monaco @ 2026-05-21  7:04 UTC (permalink / raw)
  To: Nam Cao; +Cc: Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <a2175a6647bb88797ad8275eddb4b20b749474ec.1779176466.git.namcao@linutronix.de>

On Tue, 2026-05-19 at 09:49 +0200, Nam Cao wrote:
> diff --git a/kernel/trace/rv/monitors/sleep/sleep.c
> b/kernel/trace/rv/monitors/sleep/sleep.c
> index 0a36f5519e6b..e01ac56b3f4a 100644
> --- a/kernel/trace/rv/monitors/sleep/sleep.c
> +++ b/kernel/trace/rv/monitors/sleep/sleep.c
> @@ -43,9 +43,7 @@ static void ltl_atoms_init(struct task_struct *task, struct
> ltl_monitor *mon, bo
>  	ltl_atom_set(mon, LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO, false);
>  
>  	if (task_creation) {
> -		ltl_atom_set(mon, LTL_KTHREAD_SHOULD_STOP, false);

Was it intentional to remove this too? It seems to me now it's cleared
only for existing user threads.

Aren't we sure new kthreads aren't stopping?

> -		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_MONOTONIC, false);
> -		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, false);
> +		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_REALTIME, false);
>  		ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
>  		ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
>  		ltl_atom_set(mon, LTL_FUTEX_WAIT, false);

You missed removing a couple of these in the next branch, the monitor
doesn't build.

Thanks,
Gabriele


^ permalink raw reply

* Re: [PATCH 3/3] rv/rtapp: Add wakeup monitor
From: Gabriele Monaco @ 2026-05-21  6:40 UTC (permalink / raw)
  To: Nam Cao; +Cc: Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <bf775af281cb1540d51ac9a8531b45e723d73226.1779176466.git.namcao@linutronix.de>

On Tue, 2026-05-19 at 09:49 +0200, Nam Cao wrote:
> Add a wakeup monitor to detect a lower-priority task waking up a
> higher-priority task.
> 
> The rtapp/sleep monitor already detects this. However, that monitor
> triggers an error in the context of the woken task and user only gets the
> stacktrace of that task. It is also extremely useful to get the stacktrace
> of the waking task, which this monitor offers. In other words, this monitor
> complements the rtapp/sleep monitor.
> 
> Signed-off-by: Nam Cao <namcao@linutronix.de>

Looks neat, so the idea here is that we are looking at the same events
sleep would react for, just from a different perspective, so it does
make sense to run them both together.

You may want to set

  depends on RV_PER_TASK_MONITORS >= 3

in rtapp/Kconfig. Though if we plan to add more per-task monitors we may
need to find a better solution.

> diff --git a/tools/verification/models/rtapp/wakeup.ltl
> b/tools/verification/models/rtapp/wakeup.ltl
> new file mode 100644
> index 000000000000..a5d63ca0811a
> --- /dev/null
> +++ b/tools/verification/models/rtapp/wakeup.ltl
> @@ -0,0 +1,5 @@
> +RULE = always (((RT and USER_THREAD) imply
> +		(not (WOKEN_BY_LOWER_PRIO or WOKEN_BY_SOFTIRQ)) or
> ALLOWLIST))
> +
> +ALLOWLIST = BLOCK_ON_RT_MUTEX
> +         or FUTEX_LOCK_PI

So here the events and atoms are similar to the ones in sleep, but since
we fail on the waking event, we are going to see it from the perspective
of the waker task, right?

But are those really equivalent? Why do we do RT and USER_THREAD here
while there is a much more nuanced set of conditions in sleep?

If I understand it correctly, sleep can monitor some kernel threads but
this monitor does not, is there a reason for that? Are we just not
interested in the waker for kernel threads?

> diff --git a/kernel/trace/rv/monitors/wakeup/Kconfig
> b/kernel/trace/rv/monitors/wakeup/Kconfig
> new file mode 100644
> index 000000000000..3cf11c5cd5f7
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/wakeup/Kconfig
> @@ -0,0 +1,17 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +config RV_MON_WAKEUP
> +	depends on RV
> +	depends on RV_MON_RTAPP
> +	depends on HAVE_SYSCALL_TRACEPOINTS
> +	select TRACE_IRQFLAGS
> +	default y
> +	select LTL_MON_EVENTS_ID
> +	bool "wakeup monitor"
> +	help
> +	  This monitor detects a lower-priority task waking up a
> +	  higher-priority task. The RV_MON_SLEEP monitor already
> +	  detects this case, but this monitor detects in the context
> +	  of the waking task instead. This and RV_MON_SLEEP can be
> +	  enabled together to get the stacktrace of both the waking
> +	  task and the woken task.

I'm not sure if there is any better terminology, but "waking" task makes
me think of the task that is about to be woken, though it can mean also
that task that is waking another (what you probably mean here).

What about using the waker/wakee terminology?

I see the kernel (events/sched.h) uses waking as well, but it says
waking context (which a bit clearer to me than waking task).
May be worth running it through an LLM which can produce more
English-native unambiguous wording, or maybe I'm just flipping..

Also please document it in Documentation/trace/rv/monitor_rtapp.rst

Thanks,
Gabriele

^ permalink raw reply

* Re: [PATCH v20 09/10] ring-buffer: Show persistent buffer dropped events in trace file
From: Masami Hiramatsu @ 2026-05-21  6:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Ian Rogers
In-Reply-To: <20260520185018.332045380@kernel.org>

On Wed, 20 May 2026 14:49:47 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> When the persistent ring buffer is validated on boot up, if a subbuffer is
> deemed invalid, it resets the buffer and continues. Currently, these lost
> events are not shown in the trace file output.
> 
> Have the trace iterator look for subbuffers that have the RB_MISSED_EVENTS
> set and set the iter->missed_events flag when it is detected. This will
> then have the trace file shows "LOST EVENTS" when it reads across a
> subbuffer that was corrupted and invalidated.

I think it is good to show the dropped events. BTW, is it better to
comment out the line, just for parser?
For example, add a '#' at the like.

# CPU:5 [LOST EVENTS]

Ah, but it is already done...

Thanks,

> 
> For example:
> 
>  <...>-1016    [005] ...1.  6230.660403: preempt_disable: caller=__mod_memcg_state+0x1c8/0x200 parent=__mod_memcg_state+0x1c8/0x200
> CPU:5 [LOST EVENTS]
>  <...>-1016    [005] .....  6230.660673: kmem_cache_alloc: call_site=__anon_vma_prepare+0x1ad/0x1e0 ptr=000000006e40294c name=anon_vma bytes_req=200 bytes_alloc=208 gfp_flags=GFP_KERNEL node=-1 accounted=true
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/trace/ring_buffer.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index ae5c645b59c9..9cdbee171cdc 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -3518,6 +3518,9 @@ static void rb_inc_iter(struct ring_buffer_iter *iter)
>  	else
>  		rb_inc_page(&iter->head_page);
>  
> +	if (rb_page_commit(iter->head_page) & RB_MISSED_EVENTS)
> +		iter->missed_events = -1;
> +
>  	iter->page_stamp = iter->read_stamp = iter->head_page->page->time_stamp;
>  	iter->head = 0;
>  	iter->next_event = 0;
> @@ -5579,6 +5582,7 @@ static void rb_iter_reset(struct ring_buffer_iter *iter)
>  	iter->head_page = cpu_buffer->reader_page;
>  	iter->head = cpu_buffer->reader_page->read;
>  	iter->next_event = iter->head;
> +	iter->missed_events = 0;
>  
>  	iter->cache_reader_page = iter->head_page;
>  	iter->cache_read = cpu_buffer->read;
> @@ -7053,7 +7057,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  	struct ring_buffer_event *event;
>  	struct buffer_data_page *dpage;
>  	struct buffer_page *reader;
> -	unsigned long missed_events;
> +	long missed_events;
>  	unsigned int commit;
>  	unsigned int read;
>  	u64 save_timestamp;
> @@ -7179,6 +7183,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  		local_set(&reader->entries, 0);
>  		reader->read = 0;
>  		data_page->data = dpage;
> +		if (!missed_events && rb_data_page_commit(dpage) & RB_MISSED_EVENTS)
> +			missed_events = -1;
>  
>  		/*
>  		 * Use the real_end for the data size,
> @@ -7196,10 +7202,12 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  	 * Set a flag in the commit field if we lost events
>  	 */
>  	if (missed_events) {
> -		/* If there is room at the end of the page to save the
> +		/*
> +		 * If there is room at the end of the page to save the
>  		 * missed events, then record it there.
>  		 */
> -		if (buffer->subbuf_size - commit >= sizeof(missed_events)) {
> +		if (missed_events > 0 &&
> +		    buffer->subbuf_size - commit >= sizeof(missed_events)) {
>  			memcpy(&dpage->data[commit], &missed_events,
>  			       sizeof(missed_events));
>  			local_add(RB_MISSED_STORED, &dpage->commit);
> -- 
> 2.53.0
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v20 08/10] ring-buffer: Have dropped subbuffers be persistent across reboots
From: Masami Hiramatsu @ 2026-05-21  6:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Ian Rogers
In-Reply-To: <20260520185018.193087991@kernel.org>

On Wed, 20 May 2026 14:49:46 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> When the persistent ring buffer detects a corrupted subbuffer, it will
> zero its size and report dropped pages in the dmesg, then it continues
> normally.
> 
> But if a reboot happens without clearing or restarting tracing on the
> persistent ring buffer, the next boot will show no pages are dropped.
> 
> If the persistent ring buffer is still the same, then it should still
> report dropped pages so the user knows that the buffer has missing events.
> 
> Add the RB_MISSED_EVENTS flag to the commit value of the subbuffer so that
> the next boot will still show that pages were dropped.
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks!

> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/trace/ring_buffer.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index bda53a2d2159..ae5c645b59c9 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1915,7 +1915,7 @@ static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
>  	 * Even after clearing these bits, a commit value greater than the
>  	 * subbuf_size is considered invalid.
>  	 */
> -	tail = rb_data_page_size(dpage);
> +	tail = rb_data_page_commit(dpage);
>  	if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
>  		ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
>  	else
> @@ -1929,7 +1929,7 @@ static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
>  	 */
>  	if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
>  		local_set(&bpage->entries, 0);
> -		local_set(&dpage->commit, 0);
> +		local_set(&dpage->commit, RB_MISSED_EVENTS);
>  		dpage->time_stamp = prev_ts ? prev_ts : next_ts;
>  		ret = -1;
>  	} else {
> @@ -3444,7 +3444,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
>  	 * is a mb(), which will synchronize with the rmb here.
>  	 * (see rb_tail_page_update() and __rb_reserve_next())
>  	 */
> -	commit = rb_page_commit(iter_head_page);
> +	commit = rb_page_size(iter_head_page);
>  	smp_rmb();
>  
>  	/* An event needs to be at least 8 bytes in size */
> @@ -3473,7 +3473,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
>  
>  	/* Make sure the page didn't change since we read this */
>  	if (iter->page_stamp != iter_head_page->page->time_stamp ||
> -	    commit > rb_page_commit(iter_head_page))
> +	    commit > rb_page_size(iter_head_page))
>  		goto reset;
>  
>  	iter->next_event = iter->head + length;
> @@ -5643,7 +5643,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
>  	 * (see rb_tail_page_update())
>  	 */
>  	smp_rmb();
> -	commit = rb_page_commit(commit_page);
> +	commit = rb_page_size(commit_page);
>  	/* We want to make sure that the commit page doesn't change */
>  	smp_rmb();
>  
> @@ -5836,6 +5836,7 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
>  	 */
>  	local_set(&cpu_buffer->reader_page->write, 0);
>  	local_set(&cpu_buffer->reader_page->entries, 0);
> +	rb_init_data_page(cpu_buffer->reader_page->page);
>  	cpu_buffer->reader_page->real_end = 0;
>  
>   spin:
> -- 
> 2.53.0
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-05-21  6:23 UTC (permalink / raw)
  To: Gregory Price
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <20260222084842.1824063-1-gourry@gourry.net>

On Sun, Feb 22, 2026 at 03:48:15AM -0500, Gregory Price wrote:
> Topic type: MM
> 
> Presenter: Gregory Price <gourry@gourry.net>
> 
> This series introduces N_MEMORY_PRIVATE, a NUMA node state for memory
> managed by the buddy allocator but excluded from normal allocations.
> 
> I present it with an end-to-end Compressed RAM service (mm/cram.c)
> that would otherwise not be possible (or would be considerably more
> difficult, be device-specific, and add to the ZONE_DEVICE boondoggle).
> 

Do we have updates/notes from the meeting?

> 
> TL;DR
> ===
> 
> N_MEMORY_PRIVATE is all about isolating NUMA nodes and then punching
> explicit holes in that isolation to do useful things we couldn't do
> before without re-implementing entire portions of mm/ in a driver.
> 
> 
> /* This is my memory. There are many like it, but this one is mine. */
> rc = add_private_memory_driver_managed(nid, start, size, name, flags,
>                                        online_type, private_context);
> 
> page = alloc_pages_node(nid, __GFP_PRIVATE, 0);

Do we want to provide kernel level control over allocation of private
pages, I assumed that only user space applications? I would assume
node affinity would be the way to do so, unless we have multiple

> 
> /* Ok but I want to do something useful with it */
> static const struct node_private_ops ops = {
>         .migrate_to     = my_migrate_to,
>         .folio_migrate  = my_folio_migrate,
>         .flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
> };
> node_private_set_ops(nid, &ops);
>

Could you explain this further? Why does OPS_MIGRATION
and OPS_MEMPOLICY needs to be set explictly?

> /* And now I can use mempolicy with my memory */
> buf = mmap(...);
> mbind(buf, len, mode, private_node, ...);
> buf[0] = 0xdeadbeef;  /* Faults onto private node */
> 
> /* And to be clear, no one else gets my memory */
> buf2 = malloc(4096);  /* Standard allocation */
> buf2[0] = 0xdeadbeef; /* Can never land on private node */
> 
> /* But i can choose to migrate it to the private node */
> move_pages(0, 1, &buf, &private_node, NULL, ...);
> 
> /* And more fun things like this */
> 
> 
> Patchwork
> ===
> A fully working branch based on cxl/next can be found here:
> https://github.com/gourryinverse/linux/tree/private_compression
> 
> A QEMU device which can inject high/low interrupts can be found here:
> https://github.com/gourryinverse/qemu/tree/compressed_cxl_clean
> 
> The additional patches on these branches are CXL and DAX driver
> housecleaning only tangentially relevant to this RFC, so i've
> omitted them for the sake of trying to keep it somewhat clean
> here.  Those patches should (hopefully) be going upstream anyway.
> 
> Patches 1-22: Core Private Node Infrastructure
> 
>   Patch  1:      Introduce N_MEMORY_PRIVATE scaffolding
>   Patch  2:      Introduce __GFP_PRIVATE
>   Patch  3:      Apply allocation isolation mechanisms
>   Patch  4:      Add N_MEMORY nodes to private fallback lists
>   Patches 5-9:   Filter operations not yet supported
>   Patch 10:      free_folio callback
>   Patch 11:      split_folio callback
>   Patches 12-20: mm/ service opt-ins:
>                    Migration, Mempolicy, Demotion, Write Protect,
>                    Reclaim, OOM, NUMA Balancing, Compaction,
>                    LongTerm Pinning
>   Patch 21:      memory_failure callback
>   Patch 22:      Memory hotplug plumbing for private nodes
> 
> Patch 23: mm/cram -- Compressed RAM Management
> 
> Patches 24-27: CXL Driver examples
>   Sysram Regions with Private node support
>   Basic Driver Example: (MIGRATION | MEMPOLICY)
>   Compression Driver Example (Generic)
> 
> 
> Background
> ===
> 
> Today, drivers that want mm-like services on non-general-purpose
> memory either use ZONE_DEVICE (self-managed memory) or hotplug into
> N_MEMORY and accept the risk of uncontrolled allocation.
> 
> Neither option provides what we really want - the ability to:
> 	1) selectively participate in mm/ subsystems, while
> 	2) isolating that memory from general purpose use.
> 
> Some device-attached memory cannot be managed as fully general-purpose
> system RAM.  CXL devices with inline compression, for example, may
> corrupt data or crash the machine if the compression ratio drops
> below a threshold -- we simply run out of physical memory.
> 
> This is a hard problem to solve: how does an operating system deal
> with a device that basically lies about how much capacity it has?
> 
> (We'll discuss that in the CRAM section)
> 
> 
> Core Proposal: N_MEMORY_PRIVATE
> ===
> 
> Introduce N_MEMORY_PRIVATE, a NUMA node state for memory managed by
> the buddy allocator, but excluded from normal allocation paths.
> 
> Private nodes:
> 
>   - Are filtered from zonelist fallback: all existing callers to
>     get_page_from_freelist cannot reach these nodes through any
>     normal fallback mechanism.
> 
>   - Filter allocation requests on __GFP_PRIVATE
>     	numa_zone_allowed() excludes them otherwise. 
> 
>     Applies to systems with and without cpusets.
> 
>     GFP_PRIVATE is (__GFP_PRIVATE | __GFP_THISNODE).
> 
>     Services use it when they need to allocate specifically from
>     a private node (e.g., CRAM allocating a destination folio).
> 
>     No existing allocator path sets __GFP_PRIVATE, so private nodes
>     are unreachable by default.
> 
>   - Use standard struct page / folio.  No ZONE_DEVICE, no pgmap,
>     no struct page metadata limitations.
> 
>   - Use a node-scoped metadata structure to accomplish filtering
>     and callback support.
> 
>   - May participate in the buddy allocator, reclaim, compaction,
>     and LRU like normal memory, gated by an opt-in set of flags.
> 
> The key abstraction is node_private_ops: a per-node callback table
> registered by a driver or service.  
> 
> Each callback is individually gated by an NP_OPS_* capability flag.
> 
> A driver opts in only to the mm/ operations it needs.
> 
> It is similar to ZONE_DEVICE's pgmap at a node granularity.
> 
> In fact...
> 
> 
> Re-use of ZONE_DEVICE Hooks
> ===
> 
> The callback insertion points deliberately mirror existing ZONE_DEVICE
> hooks to minimize the surface area of the mechanism.
> 
> I believe this could subsume most DEVICE_COHERENT users, and greatly
> simplify the device-managed memory development process (no more
> per-driver allocator and migration code).
> 
> (Also it's just "So Fresh, So Clean").
> 
> The base set of callbacks introduced include:
> 
>   free_folio           - mirrors ZONE_DEVICE's
>                          free_zone_device_page() hook in
>                          __folio_put() / folios_put_refs()
> 
>   folio_split          - mirrors ZONE_DEVICE's
>   			 called when a huge page is split up
> 
>   migrate_to           - demote_folio_list() custom demotion (same
>                          site as ZONE_DEVICE demotion rejection)
> 
>   folio_migrate        - called when private node folio is moved to
>                          another location (e.g. compaction)
> 
>   handle_fault         - mirrors the ZONE_DEVICE fault dispatch in
>                          handle_pte_fault() (do_wp_page path)
> 
>   reclaim_policy       - called by reclaim to let a driver own the
>                          boost lifecycle (driver can driver node reclaim)
> 
>   memory_failure       - parallels memory_failure_dev_pagemap(),
>                          but for online pages that enter the normal
>                          hwpoison path
> 
> At skip sites (mlock, madvise, KSM, user migration), a unified
> folio_is_private_managed() predicate covers both ZONE_DEVICE and
> N_MEMORY_PRIVATE folios, consolidating existing zone_device checks
> with private node checks rather than adding new ones.
> 
>   static inline bool folio_is_private_managed(struct folio *folio)
>   {
>           return folio_is_zone_device(folio) ||
>                  folio_is_private_node(folio);
>   }
> 
> Most integration points become a one-line swap:
> 
>   -     if (folio_is_zone_device(folio))
>   +     if (unlikely(folio_is_private_managed(folio)))
> 
> 
> Where a one-line integration is insufficient, the integration is
> kept as clean as possible with zone_device, rather than simply
> adding more call-sites on top of it:
> 
> static inline bool folio_managed_handle_fault(struct folio *folio,
>   struct vm_fault *vmf, vm_fault_t *ret)
> {
>   /* Zone device pages use swap entries; handled in do_swap_page */
>   if (folio_is_zone_device(folio))
>     return false;
> 
>   if (folio_is_private_node(folio)) {
>     const struct node_private_ops *ops = folio_node_private_ops(folio);
> 
>     if (ops && ops->handle_fault) {
>       *ret = ops->handle_fault(vmf);
>       return true;
>     }
>   }
>   return false;
> }
> 
> 
> 
> Flag-gated behavior (NP_OPS_*) controls:
> ===
> 
> We use OPS flags to denote what mm/ services we want to allow on our
> private node.   I've plumbed these through so far:
> 
>   NP_OPS_MIGRATION       - Node supports migration
>   NP_OPS_MEMPOLICY       - Node supports mempolicy actions
>   NP_OPS_DEMOTION        - Node appears in demotion target lists
>   NP_OPS_PROTECT_WRITE   - Node memory is read-only (wrprotect)
>   NP_OPS_RECLAIM         - Node supports reclaim
>   NP_OPS_NUMA_BALANCING  - Node supports numa balancing
>   NP_OPS_COMPACTION      - Node supports compaction
>   NP_OPS_LONGTERM_PIN    - Node supports longterm pinning
>   NP_OPS_OOM_ELIGIBLE	 - (MIGRATION | DEMOTION), node is reachable
>                            as normal system ram storage, so it should
> 			   be considered in OOM pressure calculations.
> 
> I wasn't quite sure how to classify ksm, khugepaged, madvise, and
> mlock - so i have omitted those for now.
> 
> Most hooks are straightforward.
> 
> Including a node as a demotion-eligible target was as simple as:
> 
> static void establish_demotion_targets(void)
> {
>   ..... snip .....
>   /*
>    * Include private nodes that have opted in to demotion
>    * via NP_OPS_DEMOTION.  A node might have custom migrate
>    */
>   all_memory = node_states[N_MEMORY];
>   for_each_node_state(node, N_MEMORY_PRIVATE) {
>       if (node_private_has_flag(node, NP_OPS_DEMOTION))
>       node_set(node, all_memory);
>   }
>   ..... snip .....
> }
> 
> The Migration and Mempolicy support are the two most complex pieces,
> and most useful things are built on top of Migration (meaning the
> remaining implementations are usually simple).
> 
> 
> Private Node Hotplug Lifecycle
> ===
> 
> Registration follows a strict order enforced by
> add_private_memory_driver_managed():
> 
>   1. Driver calls add_private_memory_driver_managed(nid, start,
>      size, resource_name, mhp_flags, online_type, &np).
> 
>   2. node_private_register(nid, &np) stores the driver's
>      node_private in pgdat and sets pgdat->private.  N_MEMORY and
>      N_MEMORY_PRIVATE are mutually exclusive -- registration fails
>      with -EBUSY if the node already has N_MEMORY set.
> 
>      Only one driver may register per private node.
> 
>   3. Memory is hotplugged via __add_memory_driver_managed().
> 
>      When online_pages() runs, it checks pgdat->private and sets
>      N_MEMORY_PRIVATE instead of N_MEMORY.  
> 
>      Zonelist construction gives private nodes a self-only NOFALLBACK
>      list and an N_MEMORY fallback list (so kernel/slab allocations on
>      behalf of private node work can fall back to DRAM).
> 
>   4. kswapd and kcompactd are NOT started for private nodes.  The
>      owning service is responsible for driving reclaim if needed
>      (e.g., CRAM uses watermark_boost to wake kswapd on demand).
> 
> Teardown is the reverse:
> 
>   1. Driver calls offline_and_remove_private_memory(nid, start,
>      size).
> 
>   2. offline_pages() offlines the memory.  When the last block is
>      offlined, N_MEMORY_PRIVATE is cleared automatically.
> 
>   3. node_private_unregister() clears pgdat->node_private and
>      drops the refcount.  It refuses to unregister (-EBUSY) if
>      N_MEMORY_PRIVATE is still set (other memory ranges remain).
> 
> The driver is responsible for ensuring memory is hot-unpluggable
> before teardown.  The service must ensure all memory is cleaned
> up before hot-unplug - or the service must support migration (so
> memory_hotplug.c can evacuate the memory itself).
> 
> In the CRAM example, the service supports migration, so memory
> hot-unplug can remove memory without any special infrastructure.
> 
> 
> Application: Compressed RAM (mm/cram)
> ===
> 
> Compressed RAM has a serious design issue:  Its capacity a lie.
> 
> A compression device reports more capacity than it physically has.
> If workloads write faster than the OS can reclaim from the device,
> we run out of real backing store and corrupt data or crash.
> 
> I call this problem: "Trying to Out Run A Bear"
> 
> I.e. This is only stable as long as we stay ahead of the pressure.
> 
> We don't want to design a system where stability depends on outrunning
> a bear - I am slow and do not know where to acquire bear spray.
> 
>   Fun fact:   Grizzly bears have a top-speed of 56-64 km/h.
>   Unfun Fact: Humans typically top out at ~24 km/h.
> 
> This MVP takes a conservative position:
> 
>    all compressed memory is mapped read-only.
> 
>   - Folios reach the private node only via reclaim (demotion)
>   - migrate_to implements custom demotion with backpressure.
>   - fixup_migration_pte write-protects PTEs on arrival.
>   - wrprotect hooks prevent silent upgrades
>   - handle_fault promotes folios back to DRAM on write.
>   - free_folio scrubs stale data before buddy free.
> 
> Because pages are read-only, writes can never cause runaway
> compression ratio loss behind the allocator's back.  Every write
> goes through handle_fault, which promotes the folio to DRAM first.
> 
> The device only ever sees net compression (demotion in) and explicit
> decompression (promotion out via fault or reclaim), and has a much
> wider timeframe to respond to poor compression scenarios.
> 
> That means there's no bear to out run. The bears are safely asleep in
> their bear den, and even if they show up we have a bear-proof cage.
> 
> The backpressure system is our bear-proof cage: the driver reports real
> device utilization (generalized via watermark_boost on the private
> node's zone), and CRAM throttles demotion when capacity is tight.
> 
> If compression ratios are bad, we stop demoting pages and start
> evicting pages aggressively.
> 
> The service as designed is ~350 functional lines of code because it
> re-uses mm/ services:
> 
>   - Existing reclaim/vmscan code handles demotion.
>   - Existing migration code handles migration to/from.
>   - Existing page fault handling dispatches faults.
> 
> The driver contains all the CXL nastiness core developers don't want
> anything to do with - No vendor logic touches mm/ internals.
> 
> 
> 
> Future CRAM : Loosening the read-only constraint
> ===
> 
> The read-only model is safe but conservative.  For workloads where
> compressed pages are occasionally written, the promotion fault adds
> latency.  A future optimization could allow a tunable fraction of
> compressed pages to be mapped writable, accepting some risk of
> write-driven decompression in exchange for lower overhead.
> 
> The private node ops make this straightforward:
> 
>   - Adjust fixup_migration_pte to selectively skip
>     write-protection.
>   - Use the backpressure system to either revoke writable mappings,
>     deny additional demotions, or evict when device pressure rises.
> 
> This comes at a mild memory overhead: 32MB of DRAM per 1TB of CRAM.
> (1 bit per 4KB page).
> 
> This is not proposed here, but it should be somewhat trivial.
> 
> 
> Discussion Topics
> ===
> 0. Obviously I've included the set as an RFC, please rip it apart.
> 
> 1. Is N_MEMORY_PRIVATE the right isolation abstraction, or should
>    this extend ZONE_DEVICE?  Prior feedback pushed away from new
>    ZONE logic, but this will likely be debated further.
> 
>    My comments on this:
> 
>    ZONE_DEVICE requires re-implementing every service you want to
>    provide to your device memory, including basic allocation.
> 
>    Private nodes use real struct pages with no metadata
>    limitations, participate in the buddy allocator, and get NUMA
>    topology for free.
> 
> 2. Can this subsume ZONE_DEVICE COHERENT users?  The architecture
>    was designed with this in mind, but it is only a thought experiment.
> 
> 3. Is a dedicated mm/ service (cram) the right place for compressed
>    memory management, or should this be purely driver-side until
>    more devices exist?
> 
>    I wrote it this way because I forsee more "innovation" in the
>    compressed RAM space given current... uh... "Market Conditions".
> 
>    I don't see CRAM being CXL-specific, though the only solutions I've
>    seen have been CXL.  Nothing is stopping someone from soldering such
>    memory directly to a PCB.
> 
> 5. Where is your hardware-backed data that shows this works?
> 
>    I should have some by conference time.
> 
> Thanks for reading
> Gregory (Gourry)
> 
> 
> Gregory Price (27):
>   numa: introduce N_MEMORY_PRIVATE node state
>   mm,cpuset: gate allocations from N_MEMORY_PRIVATE behind __GFP_PRIVATE
>   mm/page_alloc: add numa_zone_allowed() and wire it up
>   mm/page_alloc: Add private node handling to build_zonelists
>   mm: introduce folio_is_private_managed() unified predicate
>   mm/mlock: skip mlock for managed-memory folios
>   mm/madvise: skip madvise for managed-memory folios
>   mm/ksm: skip KSM for managed-memory folios
>   mm/khugepaged: skip private node folios when trying to collapse.
>   mm/swap: add free_folio callback for folio release cleanup
>   mm/huge_memory.c: add private node folio split notification callback
>   mm/migrate: NP_OPS_MIGRATION - support private node user migration
>   mm/mempolicy: NP_OPS_MEMPOLICY - support private node mempolicy
>   mm/memory-tiers: NP_OPS_DEMOTION - support private node demotion
>   mm/mprotect: NP_OPS_PROTECT_WRITE - gate PTE/PMD write-upgrades
>   mm: NP_OPS_RECLAIM - private node reclaim participation
>   mm/oom: NP_OPS_OOM_ELIGIBLE - private node OOM participation
>   mm/memory: NP_OPS_NUMA_BALANCING - private node NUMA balancing
>   mm/compaction: NP_OPS_COMPACTION - private node compaction support
>   mm/gup: NP_OPS_LONGTERM_PIN - private node longterm pin support
>   mm/memory-failure: add memory_failure callback to node_private_ops
>   mm/memory_hotplug: add add_private_memory_driver_managed()
>   mm/cram: add compressed ram memory management subsystem
>   cxl/core: Add cxl_sysram region type
>   cxl/core: Add private node support to cxl_sysram
>   cxl: add cxl_mempolicy sample PCI driver
>   cxl: add cxl_compression PCI driver
> 
>  drivers/base/node.c                           |  250 +++-
>  drivers/cxl/Kconfig                           |    2 +
>  drivers/cxl/Makefile                          |    2 +
>  drivers/cxl/core/Makefile                     |    1 +
>  drivers/cxl/core/core.h                       |    4 +
>  drivers/cxl/core/port.c                       |    2 +
>  drivers/cxl/core/region_sysram.c              |  381 ++++++
>  drivers/cxl/cxl.h                             |   53 +
>  drivers/cxl/type3_drivers/Kconfig             |    3 +
>  drivers/cxl/type3_drivers/Makefile            |    3 +
>  .../cxl/type3_drivers/cxl_compression/Kconfig |   20 +
>  .../type3_drivers/cxl_compression/Makefile    |    4 +
>  .../cxl_compression/compression.c             | 1025 +++++++++++++++++
>  .../cxl/type3_drivers/cxl_mempolicy/Kconfig   |   16 +
>  .../cxl/type3_drivers/cxl_mempolicy/Makefile  |    4 +
>  .../type3_drivers/cxl_mempolicy/mempolicy.c   |  297 +++++
>  include/linux/cpuset.h                        |    9 -
>  include/linux/cram.h                          |   66 ++
>  include/linux/gfp_types.h                     |   15 +-
>  include/linux/memory-tiers.h                  |    9 +
>  include/linux/memory_hotplug.h                |   11 +
>  include/linux/migrate.h                       |   17 +-
>  include/linux/mm.h                            |   22 +
>  include/linux/mmzone.h                        |   16 +
>  include/linux/node_private.h                  |  532 +++++++++
>  include/linux/nodemask.h                      |    1 +
>  include/trace/events/mmflags.h                |    4 +-
>  include/uapi/linux/mempolicy.h                |    1 +
>  kernel/cgroup/cpuset.c                        |   49 +-
>  mm/Kconfig                                    |   10 +
>  mm/Makefile                                   |    1 +
>  mm/compaction.c                               |   32 +-
>  mm/cram.c                                     |  508 ++++++++
>  mm/damon/paddr.c                              |    3 +
>  mm/huge_memory.c                              |   23 +-
>  mm/hugetlb.c                                  |    2 +-
>  mm/internal.h                                 |  226 +++-
>  mm/khugepaged.c                               |    7 +-
>  mm/ksm.c                                      |    9 +-
>  mm/madvise.c                                  |    5 +-
>  mm/memory-failure.c                           |   15 +
>  mm/memory-tiers.c                             |   46 +-
>  mm/memory.c                                   |   26 +
>  mm/memory_hotplug.c                           |  122 +-
>  mm/mempolicy.c                                |   69 +-
>  mm/migrate.c                                  |   63 +-
>  mm/mlock.c                                    |    5 +-
>  mm/mprotect.c                                 |    4 +-
>  mm/oom_kill.c                                 |   52 +-
>  mm/page_alloc.c                               |   79 +-
>  mm/rmap.c                                     |    4 +-
>  mm/slub.c                                     |    3 +-
>  mm/swap.c                                     |   21 +-
>  mm/vmscan.c                                   |   55 +-
>  54 files changed, 4057 insertions(+), 152 deletions(-)
>  create mode 100644 drivers/cxl/core/region_sysram.c
>  create mode 100644 drivers/cxl/type3_drivers/Kconfig
>  create mode 100644 drivers/cxl/type3_drivers/Makefile
>  create mode 100644 drivers/cxl/type3_drivers/cxl_compression/Kconfig
>  create mode 100644 drivers/cxl/type3_drivers/cxl_compression/Makefile
>  create mode 100644 drivers/cxl/type3_drivers/cxl_compression/compression.c
>  create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/Kconfig
>  create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/Makefile
>  create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c
>  create mode 100644 include/linux/cram.h
>  create mode 100644 include/linux/node_private.h
>  create mode 100644 mm/cram.c
> 
> -- 
> 2.53.0
> 
> 

Balbir

^ permalink raw reply

* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Vernon Yang @ 2026-05-21  5:11 UTC (permalink / raw)
  To: Wei Yang
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260521024654.2a7teoe665porz76@master>

On Thu, May 21, 2026 at 02:46:54AM +0000, Wei Yang wrote:
> On Thu, May 21, 2026 at 10:36:15AM +0800, Vernon Yang wrote:
> >On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
> >> Enable khugepaged to collapse to mTHP orders. This patch implements the
> >> main scanning logic using a bitmap to track occupied pages and a stack
> >> structure that allows us to find optimal collapse sizes.
> >>
> >> Previous to this patch, PMD collapse had 3 main phases, a light weight
> >> scanning phase (mmap_read_lock) that determines a potential PMD
> >> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> >> phase (mmap_write_lock).
> >>
> >> To enabled mTHP collapse we make the following changes:
> >>
> >> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> >> orders are enabled, we remove the restriction of max_ptes_none during the
> >> scan phase to avoid missing potential mTHP collapse candidates. Once we
> >> have scanned the full PMD range and updated the bitmap to track occupied
> >> pages, we use the bitmap to find the optimal mTHP size.
> >>
> >> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> >> and determine the best eligible order for the collapse. A stack structure
> >> is used instead of traditional recursion to manage the search. This also
> >> prevents a traditional recursive approach when the kernel stack struct is
> >> limited. The algorithm recursively splits the bitmap into smaller chunks to
> >> find the highest order mTHPs that satisfy the collapse criteria. We start
> >> by attempting the PMD order, then moved on the consecutively lower orders
> >> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> >> indicating the number of PTEs from the start of the PMD, and the order of
> >> the potential collapse candidate.
> >>
> >> The algorithm for consuming the bitmap works as such:
> >>     1) push (0, HPAGE_PMD_ORDER) onto the stack
> >>     2) pop the stack
> >>     3) check if the number of set bits in that (offset,order) pair
> >>        statisfy the max_ptes_none threshold for that order
> >>     4) if yes, attempt collapse
> >>     5) if no (or collapse fails), push two new stack items representing
> >>        the left and right halves of the current bitmap range, at the
> >>        next lower order
> >>     6) repeat at step (2) until stack is empty.
> >>
> >> Below is a diagram representing the algorithm and stack items:
> >>
> >>                             offset   mid_offset
> >>                             |        |
> >>                             |        |
> >>                             v        v
> >>           ____________________________________
> >>          |          PTE Page Table            |
> >>          --------------------------------------
> >> 			    <-------><------->
> >>                              order-1  order-1
> >>
> >> mTHP collapses reject regions containing swapped out or shared pages.
> >> This is because adding new entries can lead to new none pages, and these
> >> may lead to constant promotion into a higher order mTHP. A similar
> >> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> >> introducing at least 2x the number of pages, and on a future scan will
> >> satisfy the promotion condition once again. This issue is prevented via
> >> the collapse_max_ptes_none() function which imposes the max_ptes_none
> >> restrictions above.
> >>
> >> We currently only support mTHP collapse for max_ptes_none values of 0
> >> and HPAGE_PMD_NR - 1. resulting in the following behavior:
> >>
> >>     - max_ptes_none=0: Never introduce new empty pages during collapse
> >>     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> >>       available mTHP order
> >>
> >> Any other max_ptes_none value will emit a warning and skip mTHP collapse
> >> attempts. There should be no behavior change for PMD collapse.
> >>
> >> Once we determine what mTHP sizes fits best in that PMD range a collapse
> >> is attempted. A minimum collapse order of 2 is used as this is the lowest
> >> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> >>
> >> Currently madv_collapse is not supported and will only attempt PMD
> >> collapse.
> >>
> >> We can also remove the check for is_khugepaged inside the PMD scan as
> >> the collapse_max_ptes_none() function handles this logic now.
> >>
> >> Signed-off-by: Nico Pache <npache@redhat.com>
> >> ---
> >>  mm/khugepaged.c | 182 +++++++++++++++++++++++++++++++++++++++++++++---
> >>  1 file changed, 174 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 3492b135d667..39bf7ea8a6e8 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -100,6 +100,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >>
> >>  static struct kmem_cache *mm_slot_cache __ro_after_init;
> >>
> >> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
> >> +/*
> >> + * mthp_collapse() does an iterative DFS over a binary tree, from
> >> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> >> + * size needed for a DFS on a binary tree is height + 1, where
> >> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> >> + *
> >> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> >> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> >> + */
> >> +#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> >> +
> >> +/*
> >> + * Defines a range of PTE entries in a PTE page table which are being
> >> + * considered for mTHP collapse.
> >> + *
> >> + * @offset: the offset of the first PTE entry in a PMD range.
> >> + * @order: the order of the PTE entries being considered for collapse.
> >> + */
> >> +struct mthp_range {
> >> +	u16 offset;
> >> +	u8 order;
> >> +};
> >> +
> >>  struct collapse_control {
> >>  	bool is_khugepaged;
> >>
> >> @@ -111,6 +135,12 @@ struct collapse_control {
> >>
> >>  	/* nodemask for allocation fallback */
> >>  	nodemask_t alloc_nmask;
> >> +
> >> +	/* Each bit represents a single occupied (!none/zero) page. */
> >> +	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> >> +	/* A mask of the current range being considered for mTHP collapse. */
> >> +	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> >> +	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> >>  };
> >>
> >>  /**
> >> @@ -1404,20 +1434,140 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> >>  	return result;
> >>  }
> >>
> >> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> >> +				     u16 offset, u8 order)
> >> +{
> >> +	const int size = *stack_size;
> >> +	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> >> +
> >> +	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> >> +	stack->order = order;
> >> +	stack->offset = offset;
> >> +	(*stack_size)++;
> >> +}
> >> +
> >> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> >> +						 int *stack_size)
> >> +{
> >> +	const int size = *stack_size;
> >> +
> >> +	VM_WARN_ON_ONCE(size <= 0);
> >> +	(*stack_size)--;
> >> +	return cc->mthp_bitmap_stack[size - 1];
> >> +}
> >> +
> >> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> >> +						u16 offset, unsigned int nr_ptes)
> >> +{
> >> +	bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> >> +	bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> >> +	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> >> +}
> >> +
> >> +/*
> >> + * mthp_collapse() consumes the bitmap that is generated during
> >> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> >> + *
> >> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> >> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> >> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> >> + * variables (offset, order), indicating the number of PTEs from the start of
> >> + * the PMD, and the order of the potential collapse candidate respectively. We
> >> + * start at the PMD order and check if it is eligible for collapse; if not, we
> >> + * add two entries to the stack at a lower order to represent the left and right
> >> + * halves of the PTE page table we are examining.
> >> + *
> >> + *                         offset       mid_offset
> >> + *                         |         |
> >> + *                         |         |
> >> + *                         v         v
> >> + *      --------------------------------------
> >> + *      |          cc->mthp_bitmap            |
> >> + *      --------------------------------------
> >> + *                         <-------><------->
> >> + *                          order-1  order-1
> >> + *
> >> + * For each of these, we determine how many PTE entries are occupied in the
> >> + * range of PTE entries we propose to collapse, then we compare this to a
> >> + * threshold number of PTE entries which would need to be occupied for a
> >> + * collapse to be permitted at that order (accounting for max_ptes_none).
> >> + *
> >> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> >> + * mTHP.
> >> + */
> >> +static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> >> +		int referenced, int unmapped, struct collapse_control *cc,
> >> +		unsigned long enabled_orders)
> >> +{
> >> +	unsigned int nr_occupied_ptes, nr_ptes;
> >> +	int max_ptes_none, collapsed = 0, stack_size = 0;
> >> +	unsigned long collapse_address;
> >> +	struct mthp_range range;
> >> +	u16 offset;
> >> +	u8 order;
> >> +
> >> +	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> >> +
> >> +	while (stack_size) {
> >> +		range = collapse_mthp_stack_pop(cc, &stack_size);
> >> +		order = range.order;
> >> +		offset = range.offset;
> >> +		nr_ptes = 1UL << order;
> >> +
> >> +		if (!test_bit(order, &enabled_orders))
> >> +			goto next_order;
> >> +
> >> +		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> >> +
> >> +		if (max_ptes_none < 0)
> >> +			return collapsed;
> >> +
> >> +		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> >> +							       nr_ptes);
> >> +
> >> +		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> >> +			int ret;
> >> +
> >> +			collapse_address = address + offset * PAGE_SIZE;
> >> +			ret = collapse_huge_page(mm, collapse_address, referenced,
> >> +						 unmapped, cc, order);
> >> +			if (ret == SCAN_SUCCEED) {
> >> +				collapsed += nr_ptes;
> >> +				continue;
> >> +			}
> >> +		}
> >> +
> >> +next_order:
> >> +		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
> >
> >Hi Nico, thank you very much for your contributions to this series.
> >
> >I found a minor issue, for MADV_COLLAPSE, if collapse_huge_page() fails
> >for some reason (e.g. allocate folio), it goes to next_order and
> >continues splitting to the next small order. However, enabled_orders
> >only supports HPAGE_PMD_ORDER, so it keeps runing the split operations
> >without any effective work until KHUGEPAGED_MIN_MTHP_ORDER is reached
> >before exiting. For khugepaged, e.g. setting only 2MB to always, also
> >same phenomenon.
>
> Yes, but it does no actual work since it is checked after pop up.
>
> >
> >This does not affect the overall functionality of mthp collapse, just
> >redundant.
> >
> >The redundant operations can be easily skipped with the following
> >modification. If I miss some thing, please let me know. Thanks!
> >
> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >index 1a25af3d6d0f..fa407cce525c 100644
> >--- a/mm/khugepaged.c
> >+++ b/mm/khugepaged.c
> >@@ -1574,7 +1574,7 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> > 		}
> >
> > next_order:
> >-		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
> >+		if ((BIT(order) - 1) & enabled_orders) {
> > 			const u8 next_order = order - 1;
> > 			const u16 mid_offset = offset + (nr_ptes / 2);
> >
>
> This would stop the iteration if there are other lower enabled order, right?
             ^^^^                                  ^^^^^^^^^^^^^^^^^^^

NO :)

For more details, please refer to the following information.

|              Scenario               | Old Behavior (order > 2) | New Behavior ((BIT(order)-1) & enabled_orders) |
|-------------------------------------|--------------------------|------------------------------------------------|
| MADV_COLLAPSE                       | Splits 9,8,7,...,3       | No split                                       |
| khugepaged, only 2MB enabled        | Splits 9,8,7,...,3       | No split                                       |
| khugepaged, only 2MB + 64KB enabled | Splits 9,8,7,...,3       | Splits 9,8,7,...,5                             |
| khugepaged, only 32KB enabled       | Splits 9,8,7,...,3       | Splits 9,8,7,...,4                             |
| khugepaged, only 16KB enabled       | Splits 9,8,7,...,3       | Splits 9,8,7,...,3                             |
| khugepaged, all mTHP enabled        | Splits 9,8,7,...,3       | Splits 9,8,7,...,3                             |

--
Cheers,
Vernon

^ permalink raw reply

* [PATCH] tracing: Do not call map->ops->elt_free() if elt_alloc() is not succeeded
From: Masami Hiramatsu (Google) @ 2026-05-21  4:49 UTC (permalink / raw)
  To: Steven Rostedt, Tom Zanussi
  Cc: linux-trace-kernel, linux-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Rosen Penev
In-Reply-To: <20260520223101.34710-1-rosenp@gmail.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

In paths where tracing_map_elt_alloc() failed to allocate objects,
the map->ops->elt_alloc() call was never successful. In this case,
map->ops->elt_free() should not be called.

This bug was found by Sashiko.
Link: https://sashiko.dev/#/patchset/20260520223101.34710-1-rosenp%40gmail.com

Fixes: 2734b629525a ("tracing: Add per-element variable support to tracing_map")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
---
 kernel/trace/tracing_map.c |   17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
index bf1a507695b6..0dd7927df22a 100644
--- a/kernel/trace/tracing_map.c
+++ b/kernel/trace/tracing_map.c
@@ -386,13 +386,11 @@ static void tracing_map_elt_init_fields(struct tracing_map_elt *elt)
 	}
 }
 
-static void tracing_map_elt_free(struct tracing_map_elt *elt)
+static void __tracing_map_elt_free(struct tracing_map_elt *elt)
 {
 	if (!elt)
 		return;
 
-	if (elt->map->ops && elt->map->ops->elt_free)
-		elt->map->ops->elt_free(elt);
 	kfree(elt->fields);
 	kfree(elt->vars);
 	kfree(elt->var_set);
@@ -400,6 +398,17 @@ static void tracing_map_elt_free(struct tracing_map_elt *elt)
 	kfree(elt);
 }
 
+static void tracing_map_elt_free(struct tracing_map_elt *elt)
+{
+	if (!elt)
+		return;
+
+	/* Only objects initialized with alloc_elt() should be passed to free_elt().*/
+	if (elt->map->ops && elt->map->ops->elt_free)
+		elt->map->ops->elt_free(elt);
+	__tracing_map_elt_free(elt);
+}
+
 static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
 {
 	struct tracing_map_elt *elt;
@@ -444,7 +453,7 @@ static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
 	}
 	return elt;
  free:
-	tracing_map_elt_free(elt);
+	__tracing_map_elt_free(elt);
 
 	return ERR_PTR(err);
 }


^ permalink raw reply related

* Re: [PATCH] trace: allocate fields with elt struct
From: Masami Hiramatsu @ 2026-05-21  4:48 UTC (permalink / raw)
  To: Rosen Penev
  Cc: linux-trace-kernel, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, open list:TRACING
In-Reply-To: <20260520223101.34710-1-rosenp@gmail.com>

Hi Rosen,

Please check Sashiko's comment:

https://sashiko.dev/#/patchset/20260520223101.34710-1-rosenp%40gmail.com

One actual bug was found. I'll sent a fix.

Thanks,

On Wed, 20 May 2026 15:31:01 -0700
Rosen Penev <rosenp@gmail.com> wrote:

> Use a flexible array member to embed the fields array in the
> tracing_map_elt allocation, reducing the number of allocations
> per element.
> 
> Since the fields are now embedded in the struct, taking the address
> of a field through a const-qualified elt pointer yields a
> const-qualified pointer. Rather than adding casts, switch the
> comparison functions to take const void * parameters. These are
> all read-only operations.
> 
> Assisted-by: OpenCode:BigPickle
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
>  kernel/trace/tracing_map.c | 41 ++++++++++++++++----------------------
>  kernel/trace/tracing_map.h |  8 ++++----
>  2 files changed, 21 insertions(+), 28 deletions(-)
> 
> diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
> index 1404bf752d99..97f7e3cde262 100644
> --- a/kernel/trace/tracing_map.c
> +++ b/kernel/trace/tracing_map.c
> @@ -125,32 +125,32 @@ u64 tracing_map_read_var_once(struct tracing_map_elt *elt, unsigned int i)
>  	return (u64)atomic64_read(&elt->vars[i]);
>  }
>  
> -int tracing_map_cmp_string(void *val_a, void *val_b)
> +int tracing_map_cmp_string(const void *val_a, const void *val_b)
>  {
> -	char *a = val_a;
> -	char *b = val_b;
> +	const char *a = val_a;
> +	const char *b = val_b;
>  
>  	return strcmp(a, b);
>  }
>  
> -int tracing_map_cmp_none(void *val_a, void *val_b)
> +int tracing_map_cmp_none(const void *val_a, const void *val_b)
>  {
>  	return 0;
>  }
>  
> -static int tracing_map_cmp_atomic64(void *val_a, void *val_b)
> +static int tracing_map_cmp_atomic64(const void *val_a, const void *val_b)
>  {
> -	u64 a = atomic64_read((atomic64_t *)val_a);
> -	u64 b = atomic64_read((atomic64_t *)val_b);
> +	u64 a = atomic64_read((const atomic64_t *)val_a);
> +	u64 b = atomic64_read((const atomic64_t *)val_b);
>  
>  	return (a > b) ? 1 : ((a < b) ? -1 : 0);
>  }
>  
>  #define DEFINE_TRACING_MAP_CMP_FN(type)					\
> -static int tracing_map_cmp_##type(void *val_a, void *val_b)		\
> +static int tracing_map_cmp_##type(const void *val_a, const void *val_b)	\
>  {									\
> -	type a = (type)(*(u64 *)val_a);					\
> -	type b = (type)(*(u64 *)val_b);					\
> +	type a = (type)(*(const u64 *)val_a);				\
> +	type b = (type)(*(const u64 *)val_b);				\
>  									\
>  	return (a > b) ? 1 : ((a < b) ? -1 : 0);			\
>  }
> @@ -385,7 +385,6 @@ static void tracing_map_elt_free(struct tracing_map_elt *elt)
>  
>  	if (elt->map->ops && elt->map->ops->elt_free)
>  		elt->map->ops->elt_free(elt);
> -	kfree(elt->fields);
>  	kfree(elt->vars);
>  	kfree(elt->var_set);
>  	kfree(elt->key);
> @@ -397,7 +396,7 @@ static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
>  	struct tracing_map_elt *elt;
>  	int err = 0;
>  
> -	elt = kzalloc_obj(*elt);
> +	elt = kzalloc_flex(*elt, fields, map->n_fields);
>  	if (!elt)
>  		return ERR_PTR(-ENOMEM);
>  
> @@ -409,12 +408,6 @@ static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
>  		goto free;
>  	}
>  
> -	elt->fields = kzalloc_objs(*elt->fields, map->n_fields);
> -	if (!elt->fields) {
> -		err = -ENOMEM;
> -		goto free;
> -	}
> -
>  	elt->vars = kzalloc_objs(*elt->vars, map->n_vars);
>  	if (!elt->vars) {
>  		err = -ENOMEM;
> @@ -848,10 +841,10 @@ static int cmp_entries_sum(const void *A, const void *B)
>  {
>  	const struct tracing_map_elt *elt_a, *elt_b;
>  	const struct tracing_map_sort_entry *a, *b;
> -	struct tracing_map_sort_key *sort_key;
> -	struct tracing_map_field *field;
> +	const struct tracing_map_sort_key *sort_key;
> +	const struct tracing_map_field *field;
>  	tracing_map_cmp_fn_t cmp_fn;
> -	void *val_a, *val_b;
> +	const void *val_a, *val_b;
>  	int ret = 0;
>  
>  	a = *(const struct tracing_map_sort_entry **)A;
> @@ -879,10 +872,10 @@ static int cmp_entries_key(const void *A, const void *B)
>  {
>  	const struct tracing_map_elt *elt_a, *elt_b;
>  	const struct tracing_map_sort_entry *a, *b;
> -	struct tracing_map_sort_key *sort_key;
> -	struct tracing_map_field *field;
> +	const struct tracing_map_sort_key *sort_key;
> +	const struct tracing_map_field *field;
>  	tracing_map_cmp_fn_t cmp_fn;
> -	void *val_a, *val_b;
> +	const void *val_a, *val_b;
>  	int ret = 0;
>  
>  	a = *(const struct tracing_map_sort_entry **)A;
> diff --git a/kernel/trace/tracing_map.h b/kernel/trace/tracing_map.h
> index 18a02959d77b..90a7fde5dd02 100644
> --- a/kernel/trace/tracing_map.h
> +++ b/kernel/trace/tracing_map.h
> @@ -13,7 +13,7 @@
>  #define TRACING_MAP_VARS_MAX		16
>  #define TRACING_MAP_SORT_KEYS_MAX	2
>  
> -typedef int (*tracing_map_cmp_fn_t) (void *val_a, void *val_b);
> +typedef int (*tracing_map_cmp_fn_t) (const void *val_a, const void *val_b);
>  
>  /*
>   * This is an overview of the tracing_map data structures and how they
> @@ -137,11 +137,11 @@ struct tracing_map_field {
>  
>  struct tracing_map_elt {
>  	struct tracing_map		*map;
> -	struct tracing_map_field	*fields;
>  	atomic64_t			*vars;
>  	bool				*var_set;
>  	void				*key;
>  	void				*private_data;
> +	struct tracing_map_field	fields[];
>  };
>  
>  struct tracing_map_entry {
> @@ -260,8 +260,8 @@ tracing_map_lookup(struct tracing_map *map, void *key);
>  
>  extern tracing_map_cmp_fn_t tracing_map_cmp_num(int field_size,
>  						int field_is_signed);
> -extern int tracing_map_cmp_string(void *val_a, void *val_b);
> -extern int tracing_map_cmp_none(void *val_a, void *val_b);
> +extern int tracing_map_cmp_string(const void *val_a, const void *val_b);
> +extern int tracing_map_cmp_none(const void *val_a, const void *val_b);
>  
>  extern void tracing_map_update_sum(struct tracing_map_elt *elt,
>  				   unsigned int i, u64 n);
> -- 
> 2.54.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] gpu: host1x: trace: fix string fields in host1x traces
From: Mikko Perttunen @ 2026-05-21  4:33 UTC (permalink / raw)
  To: Steven Rostedt, Thierry Reding
  Cc: Artur Kowalski, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, linux-tegra, Thierry Reding, David Airlie,
	Simona Vetter
In-Reply-To: <ag2iF9bZJcBQ93lh@orome>

On Wednesday, May 20, 2026 9:03 PM Thierry Reding wrote:
> On Tue, May 19, 2026 at 02:10:59PM -0400, Steven Rostedt wrote:
> > On Tue, 19 May 2026 12:16:43 +0200
> > Artur Kowalski <arturkow2000@gmail.com> wrote:
> > 
> > > Use __assign_str and __get_str as required by tracing subsystem. Fixes
> > > string fields being rejected by the verifier and unreadable from
> > > userspace.
> > 
> > Does anyone use these tracepoints? The fact that they have been broken
> > for 5 years and nobody noticed makes me think they are useless.
> > 
> > I rather remove them than fix them, but if someone thinks that these
> > are still useful then by all means apply this patch.
> > 
> > Acked-by: Steven Rostedt <rostedt@goodmis.org>
> 
> I know that Mikko used them a lot early on, but this driver is pretty
> mature now, so we rarely need this low level of tracing. I'll defer to
> Mikko on whether we still need these.
> 
> Thierry

Yeah, these have been quite useful in the past when debugging why a job
is failing. Without the cdma traces it can be cumbersome to find out
exactly what is being sent to the hardware in some cases.

My preference is for keeping them for now.

Mikko



^ permalink raw reply

* Re: [PATCH v2] ring-buffer: Fix reporting of missed events in iterator
From: Masami Hiramatsu @ 2026-05-21  3:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux trace kernel, Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260520220801.4fd09d13@fedora>

On Wed, 20 May 2026 22:08:01 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> From c3651ad3ac95b331c7aa010d163704a3702855da Mon Sep 17 00:00:00 2001
> From: Steven Rostedt <rostedt@goodmis.org>
> Date: Wed, 20 May 2026 14:28:17 -0400
> Subject: [PATCH] ring-buffer: Fix reporting of missed events in iterator
> 
> When tracing is active while reading the trace file, if the iterator
> reading the buffer detects that the writer has passed the iterator head,
> it will reset and set a "missed events" flag. This flag is passed to the
> output processing to show the user that events were missed:
> 
>   CPU:4 [LOST EVENTS]
> 
> The problem is that the flag is reset after it is checked in
> ring_buffer_iter_dropped(). But the "trace" file iterates over all the CPU
> ring buffers and it will check if they are dropped when figuring out which
> buffer to print next. This prematurely clears the missed_events flag if
> the CPU buffer with the missed events is not the one that is printed next.
> 
> On the iteration where the CPU buffer with the missed events is printed,
> the check if it had missed events would return false and the output does
> not show that events were missed.
> 
> Do not reset the missed_events flag when checking if there were missed
> events, but instead clear it when moving the iterator head to the next
> event.
> 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

> Cc: stable@vger.kernel.org
> Fixes: c9b7a4a72ff64 ("ring-buffer/tracing: Have iterator acknowledge dropped events")
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
> Changes since v1: https://patch.msgid.link/20260520142817.4050abab@gandalf.local.home
> 
> - Added clearing iter->missed_events in rb_iter_reset() (Sashiko)
> 
>  kernel/trace/ring_buffer.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 7288383b1f27..7b07d2004cc6 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -5429,6 +5429,7 @@ static void rb_iter_reset(struct ring_buffer_iter *iter)
>  	iter->head_page = cpu_buffer->reader_page;
>  	iter->head = cpu_buffer->reader_page->read;
>  	iter->next_event = iter->head;
> +	iter->missed_events = 0;
>  
>  	iter->cache_reader_page = iter->head_page;
>  	iter->cache_read = cpu_buffer->read;
> @@ -6108,10 +6109,7 @@ ring_buffer_peek(struct trace_buffer *buffer, int cpu, u64 *ts,
>   */
>  bool ring_buffer_iter_dropped(struct ring_buffer_iter *iter)
>  {
> -	bool ret = iter->missed_events != 0;
> -
> -	iter->missed_events = 0;
> -	return ret;
> +	return iter->missed_events != 0;
>  }
>  EXPORT_SYMBOL_GPL(ring_buffer_iter_dropped);
>  
> @@ -6273,7 +6271,7 @@ void ring_buffer_iter_advance(struct ring_buffer_iter *iter)
>  	unsigned long flags;
>  
>  	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> -
> +	iter->missed_events = 0;
>  	rb_advance_iter(iter);
>  
>  	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> -- 
> 2.53.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] tracing: Use flexible array for entry fetch code
From: Masami Hiramatsu @ 2026-05-21  3:09 UTC (permalink / raw)
  To: Rosen Penev
  Cc: linux-trace-kernel, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Kees Cook, Gustavo A. R. Silva,
	open list:TRACING,
	open list:KERNEL HARDENING (not covered by other areas):Keyword:b__counted_by(_le|_be|_ptr)?b
In-Reply-To: <20260520215817.16560-1-rosenp@gmail.com>

On Wed, 20 May 2026 14:58:17 -0700
Rosen Penev <rosenp@gmail.com> wrote:

> Store probe entry fetch instructions in the probe_entry_arg
> allocation instead of allocating a separate instruction array.
> 
> This keeps the entry fetch code tied to the entry argument lifetime while
> leaving regular probe_arg instruction arrays separately allocated and
> freed.
> 

Thanks, this looks good to me.

> Assisted-by: Codex:GPT-5.5
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
>  kernel/trace/trace_probe.c | 8 +-------
>  kernel/trace/trace_probe.h | 2 +-
>  2 files changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index e0d3a0da26af..39f040c863e8 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -838,15 +838,10 @@ static int __store_entry_arg(struct trace_probe *tp, int argnum)
>  	int i, offset, last_offset = 0;
>  
>  	if (!earg) {
> -		earg = kzalloc_obj(*tp->entry_arg);
> +		earg = kzalloc_flex(*earg, code, 2 * tp->nr_args + 1);
>  		if (!earg)
>  			return -ENOMEM;
>  		earg->size = 2 * tp->nr_args + 1;
> -		earg->code = kzalloc_objs(struct fetch_insn, earg->size);
> -		if (!earg->code) {
> -			kfree(earg);
> -			return -ENOMEM;
> -		}
>  		/* Fill the code buffer with 'end' to simplify it */
>  		for (i = 0; i < earg->size; i++)
>  			earg->code[i].op = FETCH_OP_END;
> @@ -2051,7 +2046,6 @@ void trace_probe_cleanup(struct trace_probe *tp)
>  		traceprobe_free_probe_arg(&tp->args[i]);
>  
>  	if (tp->entry_arg) {
> -		kfree(tp->entry_arg->code);
>  		kfree(tp->entry_arg);
>  		tp->entry_arg = NULL;
>  	}
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 262d8707a3df..1076f1df347b 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -238,8 +238,8 @@ struct probe_arg {
>  };
>  
>  struct probe_entry_arg {
> -	struct fetch_insn	*code;
>  	unsigned int		size;	/* The entry data size */
> +	struct fetch_insn	code[] __counted_by(size);
>  };
>  
>  struct trace_uprobe_filter {
> -- 
> 2.54.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re:
From: Steven Rostedt @ 2026-05-21  3:00 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
In-Reply-To: <20260520184716.199133347@kernel.org>

This is what happens when I have a typo in the quilt send mail "To:" fields.

[ Yes, I still use quilt to send patch series! ]

-- Steve

^ permalink raw reply

* Re: [PATCH v20 07/10] ring-buffer: Skip invalid sub-buffers for iterator
From: Steven Rostedt @ 2026-05-21  2:58 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: linux-kernel, linux-trace-kernel, Mathieu Desnoyers,
	Catalin Marinas, Will Deacon, Ian Rogers
In-Reply-To: <20260521115125.a2a1d90c5913bd4d27cb9107@kernel.org>

On Thu, 21 May 2026 11:51:25 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > Fixes: **TBD** ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>  
> 
> Should we merge this into the original one?
> 
> Anyway, this looks good to me.

I could. I never know how to do this really. That is, I always feel
uncomfortable modifying someone else's patch. But I could add:

  [ SDR: Added skip of invalid sub-buffer for iterator case ]

Above my SOB.

-- Steve

^ permalink raw reply

* Re: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-21  2:55 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: sashiko-bot, sashiko-reviews, bpf, LKML, Linux trace kernel
In-Reply-To: <20260521105804.a66cf40d48f796ff66ffface@kernel.org>

On Thu, 21 May 2026 10:58:04 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> On Wed, 20 May 2026 12:48:32 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Wed, 20 May 2026 15:20:21 +0900
> > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> >   
> > > > > > @@ -515,6 +542,10 @@ static void clear_btf_context(struct traceprobe_parse_context *ctx)
> > > > > >  		ctx->params = NULL;
> > > > > >  		ctx->nr_params = 0;
> > > > > >  	}
> > > > > > +	if (ctx->struct_btf) {
> > > > > > +		btf_put(ctx->struct_btf);
> > > > > > +		ctx->last_struct = NULL;      
> > > > > 
> > > > > [Severity: Low]
> > > > > Should ctx->struct_btf be explicitly set to NULL after btf_put() drops
> > > > > the reference?    
> > > > 
> > > > I'm thinking of dropping it in the '(' switch case.    
> > > 
> > > Can you consider making the '(' switch case part as a helper
> > > function because it depends on CONFIG_DEBUG_INFO_BTF?  
> > 
> > Should we just encapsulate that entire case statement with:
> > 
> > #ifdef CONFIG_DEBUG_INFO_BTF
> > [..]
> > #endif  
> 
> Yeah that is possible, and I rather like to make it a separate
> function for simplifying switch-case block for readability.
> 

Hmm, but as a separate function, you mean something like this?

	case '(':
		ret = handle_typecast(...);
		break;

And have;

#ifdef CONFIG_DEBUG_INFO_BTF
static int handle_typecast(...)
{
	[ do the real work here ]
}
#else
static int handle_typecast(...)
{
	trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
	return -EOPNOTSUPP;
}
#endif

  ?

-- Steve

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox