Linux Documentation

* Re: [PATCH v5 1/7] tracing/events: Fix to check the simple_tsk_fn creation
From: Masami Hiramatsu @ 2026-06-19  8:33 UTC
  To: Masami Hiramatsu (Google)
  Cc: Steven Rostedt, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178165817322.269421.3992299509400184196.stgit@devnote2>

Let me pick this fix into probes/core.

Thanks,

On Wed, 17 Jun 2026 10:02:53 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> Sashiko pointed out that this sample code does not correctly handle the
> failure of thread creation, because kthread_run() can return -errno
> (encoded as an error pointer).
> 
> Check whether simple_tsk_fn was created successfully.
> 
> Link: https://sashiko.dev/#/patchset/178092865666.163648.10457567771536160909.stgit%40devnote2
> 
> Fixes: 9cfe06f8cd5c ("tracing/events: add trace-events-sample")
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
>  Changes in v4:
>    - Remove the counter decrement in the error path, since foo_bar_reg() always returns 0.
>    - Add a newline to error message.
>  Changes in v3:
>    - Recover the usage counter.
> ---
>  samples/trace_events/trace-events-sample.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
> index ecc7db237f2e..0b7a6efdb247 100644
> --- a/samples/trace_events/trace-events-sample.c
> +++ b/samples/trace_events/trace-events-sample.c
> @@ -107,6 +107,10 @@ int foo_bar_reg(void)
>  	 * for consistency sake, we still take the thread_mutex.
>  	 */
>  	simple_tsk_fn = kthread_run(simple_thread_fn, NULL, "event-sample-fn");
> +	if (IS_ERR_OR_NULL(simple_tsk_fn)) {
> +		pr_err("Failed to create simple_thread_fn\n");
> +		simple_tsk_fn = NULL;
> +	}
>   out:
>  	mutex_unlock(&thread_mutex);
>  	return 0;
> 
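For readers unfamiliar with the error-pointer convention the check above relies on, here is a minimal user-space sketch. The helpers are simplified stand-ins for the kernel's <linux/err.h> macros, and fake_thread_run() and start_sample_thread() are hypothetical names used only for illustration; none of this is the sample module's actual code, and the pr_err() call is omitted.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified user-space stand-ins for the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errnos. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline bool IS_ERR_OR_NULL(const void *ptr)
{
	return ptr == NULL || IS_ERR(ptr);
}

/* Hypothetical stand-in for kthread_run(): fails with -ENOMEM on request. */
static int fake_task;

static void *fake_thread_run(bool fail)
{
	return fail ? ERR_PTR(-ENOMEM) : &fake_task;
}

/*
 * Mirrors the reset step of the patch: an error pointer is replaced by
 * NULL so that a later plain "if (tsk)" test does not mistake it for a
 * valid task.
 */
static void *start_sample_thread(bool fail)
{
	void *tsk = fake_thread_run(fail);

	if (IS_ERR_OR_NULL(tsk))
		tsk = NULL;
	return tsk;
}
```

The point of the reset is that an error pointer is non-NULL, so any later code that only tests the pointer for NULL would otherwise treat it as a live thread.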


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

* Re: [PATCH v8 09/46] KVM: guest_memfd: Introduce function to check GFN private/shared status
From: Fuad Tabba @ 2026-06-19  8:25 UTC
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-9-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Introduce function for KVM to check the private/shared status of guest
> memory at a given GFN.
>
> This will be used in a later patch.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad
> ---
>  include/linux/kvm_host.h |  2 ++
>  virt/kvm/guest_memfd.c   | 31 +++++++++++++++++++++++++++++++
>  2 files changed, 33 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3915da2a61778..27687fb9d5201 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2575,6 +2575,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>  #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
>  #ifdef CONFIG_KVM_GUEST_MEMFD
> +bool kvm_gmem_is_private(struct kvm *kvm, gfn_t gfn);
> +
>  int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>                      gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
>                      int *max_order);
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 8101f64e0366f..bca912db5be6e 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -510,6 +510,37 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
>         return 0;
>  }
>
> +bool kvm_gmem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> +       struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> +       struct inode *inode;
> +
> +       /*
> +        * If this gfn has no associated memslot, there's no chance of the gfn
> +        * being backed by private memory, since guest_memfd must be used for
> +        * private memory, and guest_memfd must be associated with some memslot.
> +        */
> +       if (!slot)
> +               return 0;
> +
> +       CLASS(gmem_get_file, file)(slot);
> +       if (!file)
> +               return 0;
> +
> +       inode = file_inode(file);
> +
> +       /*
> +        * Rely on the maple tree's internal RCU lock to ensure a
> +        * stable result. This result can become stale as soon as the
> +        * lock is dropped, so the caller _must_ still protect
> +        * consumption of private vs. shared by checking
> +        * mmu_invalidate_retry_gfn() under mmu_lock to serialize
> +        * against ongoing attribute updates.
> +        */
> +       return kvm_gmem_is_private_mem(inode, kvm_gmem_get_index(slot, gfn));
> +}
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_is_private);
> +
>  static struct file_operations kvm_gmem_fops = {
>         .mmap           = kvm_gmem_mmap,
>         .open           = generic_file_open,
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
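The comment in kvm_gmem_is_private() says the result can go stale and must be revalidated under mmu_lock. A generic sequence-count retry, sketched below in user space, shows the general shape of such a check. It is not KVM's implementation: struct toy_vm, toy_lookup() and toy_commit() are hypothetical names, the model tracks a single GFN, and it is only exercised single-threaded (real concurrent code would also need atomics or memory barriers around the unlocked reads).

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy model: one lock, one sequence counter, one GFN's private/shared bit. */
struct toy_vm {
	pthread_mutex_t lock;     /* plays the role of mmu_lock */
	unsigned long inval_seq;  /* bumped by every attribute update */
	bool is_private;
};

static void toy_vm_init(struct toy_vm *vm)
{
	pthread_mutex_init(&vm->lock, NULL);
	vm->inval_seq = 0;
	vm->is_private = false;
}

/* Updater: change the attribute and invalidate outstanding snapshots. */
static void toy_set_private(struct toy_vm *vm, bool is_private)
{
	pthread_mutex_lock(&vm->lock);
	vm->is_private = is_private;
	vm->inval_seq++;
	pthread_mutex_unlock(&vm->lock);
}

struct toy_snapshot {
	unsigned long seq;
	bool is_private;
};

/* Step 1: record the sequence count, then query without holding the lock. */
static struct toy_snapshot toy_lookup(const struct toy_vm *vm)
{
	struct toy_snapshot snap = { .seq = vm->inval_seq };

	snap.is_private = vm->is_private;  /* may be stale by the time it is used */
	return snap;
}

/* Step 2: under the lock, use the answer only if no update intervened. */
static bool toy_commit(struct toy_vm *vm, struct toy_snapshot snap, bool *out)
{
	bool valid;

	pthread_mutex_lock(&vm->lock);
	valid = (snap.seq == vm->inval_seq);
	if (valid)
		*out = snap.is_private;
	pthread_mutex_unlock(&vm->lock);
	return valid;  /* false means: retry from toy_lookup() */
}
```

A caller loops on toy_lookup() and toy_commit() until the commit succeeds, so a private/shared answer is only ever consumed while no update has happened since it was read.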

* Re: [PATCH v8 08/46] KVM: Provide generic interface for checking memory private/shared status
From: Fuad Tabba @ 2026-06-19  8:21 UTC
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <CA+EHjTw6x-mxDnJjnhE-6SV73tMrb0paKDTtOC2j6zJ1fXZDLA@mail.gmail.com>

On Fri, 19 Jun 2026 at 09:19, Fuad Tabba <tabba@google.com> wrote:
>
> On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
> <devnull+ackerleytng.google.com@kernel.org> wrote:
> >
> > From: Sean Christopherson <seanjc@google.com>
> >
> > Introduce a generic kvm_mem_is_private() interface using a static call to
> > determine if a GFN is private. This allows the implementation for checking
> > a GFN's private/shared status to be set at runtime.
> >
> > In preparation for choosing implementations between a guest_memfd lookup
> > and the existing VM attribute lookup, rename the existing
> > VM-attribute-based check to kvm_vm_mem_is_private to emphasize that it
> > looks up VM attributes.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
>
> (SoB fix plz)
>
> Reviewed-by: Fuad Tabba <tabba@google.com>
>
> Cheers,
> /fuad
> > ---
> >  include/linux/kvm_host.h | 12 +++++++++++-
> >  virt/kvm/kvm_main.c      | 15 +++++++++++++++
> >  2 files changed, 26 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index eb26d4ea8945a..3915da2a61778 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2546,7 +2546,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> >  bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> >                                          struct kvm_gfn_range *range);
> >
> > -static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +static inline bool kvm_vm_mem_is_private(struct kvm *kvm, gfn_t gfn)

Should have read the Sashiko review first, but where is this used?
It's not used at all in this series...

/fuad

> >  {
> >         return kvm_get_vm_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> >  }
> > @@ -2557,6 +2557,16 @@ static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
> >                                                   KVM_MEMORY_ATTRIBUTE_PRIVATE,
> >                                                   KVM_MEMORY_ATTRIBUTE_PRIVATE);
> >  }
> > +#endif  /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> > +
> > +#ifdef kvm_arch_has_private_mem
> > +typedef bool (kvm_mem_is_private_t)(struct kvm *kvm, gfn_t gfn);
> > +DECLARE_STATIC_CALL(__kvm_mem_is_private, kvm_mem_is_private_t);
> > +
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > +       return static_call(__kvm_mem_is_private)(kvm, gfn);
> > +}
> >  #else
> >  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> >  {
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 6669f1477013c..8b238e461b854 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2627,6 +2627,20 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> >  }
> >  #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> >
> > +#ifdef kvm_arch_has_private_mem
> > +DEFINE_STATIC_CALL_RET0(__kvm_mem_is_private, kvm_mem_is_private_t);
> > +EXPORT_STATIC_CALL_GPL(__kvm_mem_is_private);
> > +
> > +static void kvm_init_memory_attributes(void)
> > +{
> > +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> > +       static_call_update(__kvm_mem_is_private, kvm_vm_mem_is_private);
> > +#endif
> > +}
> > +#else
> > +static void kvm_init_memory_attributes(void) { }
> > +#endif
> > +
> >  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> >  {
> >         return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> > @@ -6528,6 +6542,7 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
> >         kvm_preempt_ops.sched_in = kvm_sched_in;
> >         kvm_preempt_ops.sched_out = kvm_sched_out;
> >
> > +       kvm_init_memory_attributes();
> >         kvm_init_debug();
> >
> >         r = kvm_vfio_ops_init();
> >
> > --
> > 2.55.0.rc0.738.g0c8ab3ebcc-goog
> >
> >
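As a rough user-space analogy for the dispatch set up by the quoted patch: a default target that returns 0 (as DEFINE_STATIC_CALL_RET0 provides), swapped at init time for the VM-attribute lookup (as static_call_update() does). The sketch uses a plain function pointer, whereas real static calls patch the call sites instead of going through a pointer. struct toy_kvm, its bitmap and every function name here are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical VM: one "private" bit for each of GFNs 0..63. */
struct toy_kvm {
	uint64_t private_bitmap;
};

typedef bool (mem_is_private_t)(struct toy_kvm *kvm, gfn_t gfn);

/* Default target: report "not private" until an implementation is chosen. */
static bool mem_is_private_ret0(struct toy_kvm *kvm, gfn_t gfn)
{
	(void)kvm;
	(void)gfn;
	return false;
}

/* VM-attribute-based implementation (stands in for kvm_vm_mem_is_private). */
static bool vm_mem_is_private(struct toy_kvm *kvm, gfn_t gfn)
{
	return gfn < 64 && ((kvm->private_bitmap >> gfn) & 1);
}

/* The switchable target; a function pointer here, a patched call in the kernel. */
static mem_is_private_t *mem_is_private_impl = mem_is_private_ret0;

/* Generic entry point that callers use regardless of the chosen backend. */
static inline bool mem_is_private(struct toy_kvm *kvm, gfn_t gfn)
{
	return mem_is_private_impl(kvm, gfn);
}

/* Init-time selection, analogous to kvm_init_memory_attributes(). */
static void init_memory_attributes(bool have_vm_attrs)
{
	if (have_vm_attrs)
		mem_is_private_impl = vm_mem_is_private;
}
```

Before init_memory_attributes() runs, every query answers "shared"; afterwards the same entry point consults the VM attributes.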

* Re: [PATCH v8 08/46] KVM: Provide generic interface for checking memory private/shared status
From: Fuad Tabba @ 2026-06-19  8:19 UTC
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-8-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Introduce a generic kvm_mem_is_private() interface using a static call to
> determine if a GFN is private. This allows the implementation for checking
> a GFN's private/shared status to be set at runtime.
>
> In preparation for choosing implementations between a guest_memfd lookup
> and the existing VM attribute lookup, rename the existing
> VM-attribute-based check to kvm_vm_mem_is_private to emphasize that it
> looks up VM attributes.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

(SoB fix plz)

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad
> ---
>  include/linux/kvm_host.h | 12 +++++++++++-
>  virt/kvm/kvm_main.c      | 15 +++++++++++++++
>  2 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index eb26d4ea8945a..3915da2a61778 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2546,7 +2546,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
>  bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>                                          struct kvm_gfn_range *range);
>
> -static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +static inline bool kvm_vm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>  {
>         return kvm_get_vm_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
>  }
> @@ -2557,6 +2557,16 @@ static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
>                                                   KVM_MEMORY_ATTRIBUTE_PRIVATE,
>                                                   KVM_MEMORY_ATTRIBUTE_PRIVATE);
>  }
> +#endif  /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> +
> +#ifdef kvm_arch_has_private_mem
> +typedef bool (kvm_mem_is_private_t)(struct kvm *kvm, gfn_t gfn);
> +DECLARE_STATIC_CALL(__kvm_mem_is_private, kvm_mem_is_private_t);
> +
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> +       return static_call(__kvm_mem_is_private)(kvm, gfn);
> +}
>  #else
>  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>  {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6669f1477013c..8b238e461b854 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2627,6 +2627,20 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>  }
>  #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
> +#ifdef kvm_arch_has_private_mem
> +DEFINE_STATIC_CALL_RET0(__kvm_mem_is_private, kvm_mem_is_private_t);
> +EXPORT_STATIC_CALL_GPL(__kvm_mem_is_private);
> +
> +static void kvm_init_memory_attributes(void)
> +{
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +       static_call_update(__kvm_mem_is_private, kvm_vm_mem_is_private);
> +#endif
> +}
> +#else
> +static void kvm_init_memory_attributes(void) { }
> +#endif
> +
>  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
>  {
>         return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> @@ -6528,6 +6542,7 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
>         kvm_preempt_ops.sched_in = kvm_sched_in;
>         kvm_preempt_ops.sched_out = kvm_sched_out;
>
> +       kvm_init_memory_attributes();
>         kvm_init_debug();
>
>         r = kvm_vfio_ops_init();
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>

* Re: [PATCH v8 07/46] KVM: Rename memory attribute APIs to prepare for in-place gmem conversion
From: Fuad Tabba @ 2026-06-19  8:16 UTC
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-7-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Rename memory attribute APIs to add a "vm_" in the name in anticipation of
> moving PRIVATE tracking into guest_memfd, to allow in-place conversion
> between SHARED and PRIVATE.  At that point, there will effectively be two
> (potential) sources of memory attributes: the VM and guest_memfd.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

The SoB is missing (on other patches as well; I won't mention it again).
With that fixed, for this patch and the others I review:

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad
> ---
>  arch/x86/kvm/mmu/mmu.c   |  6 +++---
>  include/linux/kvm_host.h | 15 +++++++++++----
>  virt/kvm/guest_memfd.c   |  6 +++---
>  virt/kvm/kvm_main.c      | 16 ++++++++--------
>  4 files changed, 25 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e0005a21b6e22..cbc50aef801fb 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -8087,11 +8087,11 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
>         const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
>
>         if (level == PG_LEVEL_2M)
> -               return kvm_range_has_memory_attributes(kvm, start, end, ~0, attrs);
> +               return kvm_range_has_vm_memory_attributes(kvm, start, end, ~0, attrs);
>
>         for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
>                 if (hugepage_test_mixed(slot, gfn, level - 1) ||
> -                   attrs != kvm_get_memory_attributes(kvm, gfn))
> +                   attrs != kvm_get_vm_memory_attributes(kvm, gfn))
>                         return false;
>         }
>         return true;
> @@ -8191,7 +8191,7 @@ void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
>                  * be manually checked as the attributes may already be mixed.
>                  */
>                 for (gfn = start; gfn < end; gfn += nr_pages) {
> -                       unsigned long attrs = kvm_get_memory_attributes(kvm, gfn);
> +                       unsigned long attrs = kvm_get_vm_memory_attributes(kvm, gfn);
>
>                         if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
>                                 hugepage_clear_mixed(slot, gfn, level);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d370e834d619e..eb26d4ea8945a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2534,13 +2534,13 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
>  }
>
>  #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> -static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +static inline unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
>  {
>         return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
>  }
>
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> -                                    unsigned long mask, unsigned long attrs);
> +bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> +                                       unsigned long mask, unsigned long attrs);
>  bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
>                                         struct kvm_gfn_range *range);
>  bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> @@ -2548,7 +2548,14 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>
>  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>  {
> -       return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +       return kvm_get_vm_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +}
> +static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
> +                                           gfn_t end)
> +{
> +       return kvm_range_has_vm_memory_attributes(kvm, start, end,
> +                                                 KVM_MEMORY_ATTRIBUTE_PRIVATE,
> +                                                 KVM_MEMORY_ATTRIBUTE_PRIVATE);
>  }
>  #else
>  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index b4c24fdf159f6..8101f64e0366f 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -915,9 +915,9 @@ static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
>
>         folio_unlock(folio);
>
> -       if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
> -                                            KVM_MEMORY_ATTRIBUTE_PRIVATE,
> -                                            KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> +       if (!kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + 1,
> +                                               KVM_MEMORY_ATTRIBUTE_PRIVATE,
> +                                               KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
>                 ret = -EINVAL;
>                 goto out_put_folio;
>         }
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7b989b659cf82..6669f1477013c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2419,7 +2419,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
>  #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> -static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +static u64 kvm_supported_vm_mem_attributes(struct kvm *kvm)
>  {
>  #ifdef kvm_arch_has_private_mem
>         if (!kvm || kvm_arch_has_private_mem(kvm))
> @@ -2433,19 +2433,19 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
>   * Returns true if _all_ gfns in the range [@start, @end) have attributes
>   * such that the bits in @mask match @attrs.
>   */
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> -                                    unsigned long mask, unsigned long attrs)
> +bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> +                                       unsigned long mask, unsigned long attrs)
>  {
>         XA_STATE(xas, &kvm->mem_attr_array, start);
>         unsigned long index;
>         void *entry;
>
> -       mask &= kvm_supported_mem_attributes(kvm);
> +       mask &= kvm_supported_vm_mem_attributes(kvm);
>         if (attrs & ~mask)
>                 return false;
>
>         if (end == start + 1)
> -               return (kvm_get_memory_attributes(kvm, start) & mask) == attrs;
> +               return (kvm_get_vm_memory_attributes(kvm, start) & mask) == attrs;
>
>         guard(rcu)();
>         if (!attrs)
> @@ -2567,7 +2567,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
>         mutex_lock(&kvm->slots_lock);
>
>         /* Nothing to do if the entire range has the desired attributes. */
> -       if (kvm_range_has_memory_attributes(kvm, start, end, ~0, attributes))
> +       if (kvm_range_has_vm_memory_attributes(kvm, start, end, ~0, attributes))
>                 goto out_unlock;
>
>         /*
> @@ -2606,7 +2606,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>         /* flags is currently not used. */
>         if (attrs->flags)
>                 return -EINVAL;
> -       if (attrs->attributes & ~kvm_supported_mem_attributes(kvm))
> +       if (attrs->attributes & ~kvm_supported_vm_mem_attributes(kvm))
>                 return -EINVAL;
>         if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
>                 return -EINVAL;
> @@ -4926,7 +4926,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>                 return 1;
>  #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>         case KVM_CAP_MEMORY_ATTRIBUTES:
> -               return kvm_supported_mem_attributes(kvm);
> +               return kvm_supported_vm_mem_attributes(kvm);
>  #endif
>  #ifdef CONFIG_KVM_GUEST_MEMFD
>         case KVM_CAP_GUEST_MEMFD:
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
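The renamed kvm_range_has_vm_memory_attributes() is documented in the quoted hunk as returning true if all gfns in [start, end) have attributes whose bits under the mask match attrs. A simplified array-based model of that contract is sketched below. The real function walks an xarray under RCU; struct toy_vm, TOY_NR_GFNS and the TOY_ATTR_PRIVATE value are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_GFNS      16
#define TOY_ATTR_PRIVATE (1UL << 3)  /* hypothetical bit value for the sketch */

/* Hypothetical VM: a flat array of per-GFN attribute words. */
struct toy_vm {
	unsigned long attrs[TOY_NR_GFNS];
};

/* GFNs without an entry are treated as having no attributes set. */
static unsigned long toy_get_attrs(const struct toy_vm *vm, size_t gfn)
{
	return gfn < TOY_NR_GFNS ? vm->attrs[gfn] : 0;
}

/*
 * Returns true if every GFN in [start, end) has attributes such that the
 * bits selected by @mask equal @attrs.
 */
static bool toy_range_has_attrs(const struct toy_vm *vm, size_t start,
				size_t end, unsigned long mask,
				unsigned long attrs)
{
	/* Requested bits outside the mask can never match. */
	if (attrs & ~mask)
		return false;

	for (size_t gfn = start; gfn < end; gfn++) {
		if ((toy_get_attrs(vm, gfn) & mask) != attrs)
			return false;
	}
	return true;
}
```

With both mask and attrs set to the private bit, this is the "whole range is private" query that the new kvm_mem_range_is_private() wrapper expresses.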

* Re: [PATCH v8 05/46] KVM: Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES selectable
From: Fuad Tabba @ 2026-06-19  8:12 UTC
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-5-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES selectable, only for (CoCo) VM types
> that might use vm_memory_attributes.
>
> Also document CONFIG_KVM_VM_MEMORY_ATTRIBUTES to specifically be about the
> private/shared attribute.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

You're missing a SoB, but with that fixed:

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

> ---
>  arch/x86/kvm/Kconfig | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 24f96396cfa1c..c28393dc664eb 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -81,13 +81,16 @@ config KVM_WERROR
>           If in doubt, say "N".
>
>  config KVM_VM_MEMORY_ATTRIBUTES
> -       bool
> +       depends on KVM_SW_PROTECTED_VM || KVM_INTEL_TDX || KVM_AMD_SEV
> +       bool "Enable per-VM PRIVATE vs. SHARED attributes (for CoCo VMs)"
> +       help
> +         Enable support for tracking PRIVATE vs. SHARED memory using per-VM
> +         memory attributes.
>
>  config KVM_SW_PROTECTED_VM
>         bool "Enable support for KVM software-protected VMs"
>         depends on EXPERT
>         depends on KVM_X86 && X86_64
> -       select KVM_VM_MEMORY_ATTRIBUTES
>         help
>           Enable support for KVM software-protected VMs.  Currently, software-
>           protected VMs are purely a development and testing vehicle for
> @@ -138,7 +141,6 @@ config KVM_INTEL_TDX
>         bool "Intel Trust Domain Extensions (TDX) support"
>         default y
>         depends on INTEL_TDX_HOST
> -       select KVM_VM_MEMORY_ATTRIBUTES
>         select HAVE_KVM_ARCH_GMEM_POPULATE
>         help
>           Provides support for launching Intel Trust Domain Extensions (TDX)
> @@ -162,7 +164,6 @@ config KVM_AMD_SEV
>         depends on KVM_AMD && X86_64
>         depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
>         select ARCH_HAS_CC_PLATFORM
> -       select KVM_VM_MEMORY_ATTRIBUTES
>         select HAVE_KVM_ARCH_GMEM_PREPARE
>         select HAVE_KVM_ARCH_GMEM_INVALIDATE
>         select HAVE_KVM_ARCH_GMEM_POPULATE
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>

* Re: [PATCH v8 04/46] KVM: Decouple kvm_has_arch_private_mem from CONFIG_KVM_VM_MEMORY_ATTRIBUTES
From: Fuad Tabba @ 2026-06-19  8:10 UTC
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
	yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
	liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-4-9d2959357853@google.com>

On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> When memory attributes become trackable in guest_memfd, the concept of
> having private memory is no longer dependent on
> CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
>
> With this, on x86, kvm_arch_has_private_mem() is defined if some CoCo
> platform support (or the testing CONFIG_KVM_SW_PROTECTED_VM) is compiled
> in.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad
> ---
>  arch/x86/include/asm/kvm_host.h | 4 +++-
>  include/linux/kvm_host.h        | 2 +-
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8e8eb8a5e8a6b..1bde67cf6eb0e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2394,7 +2394,9 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>                        int tdp_max_root_level, int tdp_huge_page_level);
>
>
> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#if defined(CONFIG_KVM_SW_PROTECTED_VM) ||     \
> +       defined(CONFIG_KVM_INTEL_TDX) ||        \
> +       defined(CONFIG_KVM_AMD_SEV)
>  #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
>  #endif
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 201d0f2143976..d370e834d619e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -722,7 +722,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>  }
>  #endif
>
> -#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#ifndef kvm_arch_has_private_mem
>  static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
>  {
>         return false;
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
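The two hunks above use a common header idiom: the arch header defines a function-like macro when the feature is compiled in, and the generic header supplies a fallback only if that macro is absent. A self-contained sketch of the idiom follows, with hypothetical TOY_CONFIG_* switches and toy_* names standing in for the real Kconfig symbols and KVM types.

```c
#include <stdbool.h>

struct toy_kvm_arch {
	bool has_private_mem;
};

struct toy_kvm {
	struct toy_kvm_arch arch;
};

/* Hypothetical build switch standing in for CONFIG_KVM_SW_PROTECTED_VM. */
#define TOY_CONFIG_SW_PROTECTED_VM 1

/* "Arch header": define the macro if any CoCo-style backend is built in. */
#if defined(TOY_CONFIG_SW_PROTECTED_VM) ||	\
	defined(TOY_CONFIG_TDX) ||		\
	defined(TOY_CONFIG_SEV)
#define toy_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
#endif

/* "Generic header": the fallback exists only if the arch did not define it. */
#ifndef toy_arch_has_private_mem
static inline bool toy_arch_has_private_mem(struct toy_kvm *kvm)
{
	(void)kvm;
	return false;
}
#endif
```

Because the generic header keys off the macro's existence rather than a Kconfig symbol, generic code no longer needs to know which option enabled private memory.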

* Re: [PATCH v4 3/5] rpmsg: virtio_rpmsg_bus: get buffer size from config space
From: Arnaud POULIQUEN @ 2026-06-19  7:45 UTC
  To: tanmay.shah, andersson, mathieu.poirier, corbet, skhan
  Cc: linux-remoteproc, linux-doc, linux-kernel
In-Reply-To: <1251e3e1-80fe-4146-a1f8-5eb251a323be@amd.com>



On 6/18/26 18:31, Shah, Tanmay wrote:
> 
> 
> On 6/18/2026 3:32 AM, Arnaud POULIQUEN wrote:
>>
>>
>> On 6/17/26 19:41, Shah, Tanmay wrote:
>>>
>>>
>>> On 6/17/2026 4:15 AM, Arnaud POULIQUEN wrote:
>>>> Hi Tanmay,
>>>>
>>>> On 6/15/26 22:20, Tanmay Shah wrote:
>>>>> 512 bytes isn't always suitable for all cases; let the firmware
>>>>> maker decide the best value via the resource table.
>>>>> This is enabled by the VIRTIO_RPMSG_F_BUFSZ feature bit.
>>>>>
>>>>> Signed-off-by: Tanmay Shah <tanmay.shah@amd.com>
>>>>> ---
>>>>>
>>>>> Changes in v4: squash to virtio rpmsg config patch
>>>>>      - Introduce new patch to modify rpmsg.rst documentation
>>>>>      - check version is always 1.
>>>>>      - check size field is same as size of struct virtio_rpmsg_config
>>>>>      - introduce alignment field
>>>>>      - check alignment field is power of 2
>>>>>      - check tx and rx buf size is aligned with alignment passed in the
>>>>>        structure
>>>>>
>>>>> Changes in v3:
>>>>>      - change version field from u16 to u8
>>>>>      - introduce size field in the rpmsg_virtio_config structure
>>>>>      - check version field is set to any non-zero value.
>>>>>      - check size field is not 0.
>>>>>      - Remove field for private config, as not needed for now.
>>>>>      - add documentation of rpmsg_virtio_config structure
>>>>>
>>>>>     drivers/rpmsg/virtio_rpmsg_bus.c   | 129 +++++++++++++++++++++++
>>>>> +-----
>>>>>     include/linux/rpmsg/virtio_rpmsg.h |  50 +++++++++++
>>>>>     2 files changed, 160 insertions(+), 19 deletions(-)
>>>>>     create mode 100644 include/linux/rpmsg/virtio_rpmsg.h
>>>>>
>>>>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/
>>>>> virtio_rpmsg_bus.c
>>>>> index 99df1ae07055..a59925f870a4 100644
>>>>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>> @@ -15,11 +15,13 @@
>>>>>     #include <linux/idr.h>
>>>>>     #include <linux/jiffies.h>
>>>>>     #include <linux/kernel.h>
>>>>> +#include <linux/log2.h>
>>>>>     #include <linux/module.h>
>>>>>     #include <linux/mutex.h>
>>>>>     #include <linux/rpmsg.h>
>>>>>     #include <linux/rpmsg/byteorder.h>
>>>>>     #include <linux/rpmsg/ns.h>
>>>>> +#include <linux/rpmsg/virtio_rpmsg.h>
>>>>>     #include <linux/scatterlist.h>
>>>>>     #include <linux/slab.h>
>>>>>     #include <linux/sched.h>
>>>>> @@ -39,7 +41,8 @@
>>>>>      * @tx_bufs:    kernel address of tx buffers
>>>>>      * @num_rx_buf: total number of rx buffers
>>>>>      * @num_tx_buf: total number of tx buffers
>>>>> - * @buf_size:   size of one rx or tx buffer
>>>>> + * @rx_buf_size: size of one rx buffer
>>>>> + * @tx_buf_size: size of one tx buffer
>>>>>      * @last_tx_buf: index of last tx buffer used
>>>>>      * @bufs_dma:    dma base addr of the buffers
>>>>>      * @tx_lock:    protects svq and tx_bufs, to allow concurrent
>>>>> senders.
>>>>> @@ -59,7 +62,8 @@ struct virtproc_info {
>>>>>         void *rx_bufs, *tx_bufs;
>>>>>         unsigned int num_rx_buf;
>>>>>         unsigned int num_tx_buf;
>>>>> -    unsigned int buf_size;
>>>>> +    unsigned int rx_buf_size;
>>>>> +    unsigned int tx_buf_size;
>>>>>         int last_tx_buf;
>>>>>         dma_addr_t bufs_dma;
>>>>>         struct mutex tx_lock;
>>>>> @@ -68,9 +72,6 @@ struct virtproc_info {
>>>>>         wait_queue_head_t sendq;
>>>>>     };
>>>>>     -/* The feature bitmap for virtio rpmsg */
>>>>> -#define VIRTIO_RPMSG_F_NS    0 /* RP supports name service
>>>>> notifications */
>>>>> -
>>>>>     /**
>>>>>      * struct rpmsg_hdr - common header for all rpmsg messages
>>>>>      * @src: source address
>>>>> @@ -128,7 +129,7 @@ struct virtio_rpmsg_channel {
>>>>>      * processor.
>>>>>      */
>>>>>     #define MAX_RPMSG_NUM_BUFS    (256)
>>>>> -#define MAX_RPMSG_BUF_SIZE    (512)
>>>>> +#define DEFAULT_RPMSG_BUF_SIZE    (512)
>>>>>       /*
>>>>>      * Local addresses are dynamically allocated on-demand.
>>>>> @@ -444,7 +445,7 @@ static void *get_a_tx_buf(struct virtproc_info
>>>>> *vrp)
>>>>>           /* either pick the next unused tx buffer */
>>>>>         if (vrp->last_tx_buf < vrp->num_tx_buf)
>>>>> -        ret = vrp->tx_bufs + vrp->buf_size * vrp->last_tx_buf++;
>>>>> +        ret = vrp->tx_bufs + vrp->tx_buf_size * vrp->last_tx_buf++;
>>>>>         /* or recycle a used one */
>>>>>         else
>>>>>             ret = virtqueue_get_buf(vrp->svq, &len);
>>>>> @@ -514,7 +515,7 @@ static int rpmsg_send_offchannel_raw(struct
>>>>> rpmsg_device *rpdev,
>>>>>          * messaging), or to improve the buffer allocator, to support
>>>>>          * variable-length buffer sizes.
>>>>>          */
>>>>> -    if (len > vrp->buf_size - sizeof(struct rpmsg_hdr)) {
>>>>> +    if (len > vrp->tx_buf_size - sizeof(struct rpmsg_hdr)) {
>>>>>             dev_err(dev, "message is too big (%d)\n", len);
>>>>>             return -EMSGSIZE;
>>>>>         }
>>>>> @@ -647,7 +648,7 @@ static ssize_t virtio_rpmsg_get_mtu(struct
>>>>> rpmsg_endpoint *ept)
>>>>>         struct rpmsg_device *rpdev = ept->rpdev;
>>>>>         struct virtio_rpmsg_channel *vch =
>>>>> to_virtio_rpmsg_channel(rpdev);
>>>>>     -    return vch->vrp->buf_size - sizeof(struct rpmsg_hdr);
>>>>> +    return vch->vrp->tx_buf_size - sizeof(struct rpmsg_hdr);
>>>>>     }
>>>>>       static int rpmsg_recv_single(struct virtproc_info *vrp, struct
>>>>> device *dev,
>>>>> @@ -673,7 +674,7 @@ static int rpmsg_recv_single(struct virtproc_info
>>>>> *vrp, struct device *dev,
>>>>>          * We currently use fixed-sized buffers, so trivially sanitize
>>>>>          * the reported payload length.
>>>>>          */
>>>>> -    if (len > vrp->buf_size ||
>>>>> +    if (len > vrp->rx_buf_size ||
>>>>>             msg_len > (len - sizeof(struct rpmsg_hdr))) {
>>>>>             dev_warn(dev, "inbound msg too big: (%d, %d)\n", len,
>>>>> msg_len);
>>>>>             return -EINVAL;
>>>>> @@ -706,7 +707,7 @@ static int rpmsg_recv_single(struct virtproc_info
>>>>> *vrp, struct device *dev,
>>>>>             dev_warn_ratelimited(dev, "msg received with no
>>>>> recipient\n");
>>>>>           /* publish the real size of the buffer */
>>>>> -    rpmsg_sg_init(&sg, msg, vrp->buf_size);
>>>>> +    rpmsg_sg_init(&sg, msg, vrp->rx_buf_size);
>>>>>           /* add the buffer back to the remote processor's virtqueue */
>>>>>         err = virtqueue_add_inbuf(vrp->rvq, &sg, 1, msg, GFP_KERNEL);
>>>>> @@ -820,10 +821,13 @@ static int rpmsg_probe(struct virtio_device
>>>>> *vdev)
>>>>>         struct virtproc_info *vrp;
>>>>>         struct virtio_rpmsg_channel *vch = NULL;
>>>>>         struct rpmsg_device *rpdev_ns, *rpdev_ctrl;
>>>>> +    u16 rpmsg_buf_align = 0;
>>>>>         void *bufs_va;
>>>>>         int err = 0, i;
>>>>>         size_t total_buf_space;
>>>>>         bool notify;
>>>>> +    u8 version;
>>>>> +    u16 size;
>>>>>           vrp = kzalloc_obj(*vrp);
>>>>>         if (!vrp)
>>>>> @@ -855,9 +859,90 @@ static int rpmsg_probe(struct virtio_device *vdev)
>>>>>         else
>>>>>             vrp->num_tx_buf = MAX_RPMSG_NUM_BUFS;
>>>>>     -    vrp->buf_size = MAX_RPMSG_BUF_SIZE;
>>>>> +    /*
>>>>> +     * If VIRTIO_RPMSG_F_BUFSZ feature is supported, then configure
>>>>> buf
>>>>> +     * size from virtio device config space from the resource table.
>>>>> +     * If the feature is not supported, then assign default buf size.
>>>>> +     */
>>>>> +    if (virtio_has_feature(vdev, VIRTIO_RPMSG_F_BUFSZ)) {
>>>>> +        virtio_cread(vdev, struct virtio_rpmsg_config,
>>>>> +                 version, &version);
>>>>> +
>>>>> +        /* for now we support only v1 */
>>>>> +        if (version != RPMSG_VDEV_CONFIG_V1) {
>>>>> +            dev_err(&vdev->dev,
>>>>> +                "unsupported vdev config version %u\n", version);
>>>>> +            err = -EINVAL;
>>>>> +            goto vqs_del;
>>>>> +        }
>>>>> +
>>>>> +        /* size of the config space must match */
>>>>> +        virtio_cread(vdev, struct virtio_rpmsg_config,
>>>>> +                 size, &size);
>>>>> +        if (size != sizeof(struct virtio_rpmsg_config)) {
>>>>> +            dev_err(&vdev->dev, "invalid size of vdev config %u\n",
>>>>> +                size);
>>>>> +            err = -EINVAL;
>>>>> +            goto vqs_del;
>>>>> +        }
>>>>>     -    total_buf_space = (vrp->num_rx_buf + vrp->num_tx_buf) * vrp-
>>>>>> buf_size;
>>>>> +        /*
>>>>> +         * Optional alignment applied to each buffer size and to
>>>>> the TX
>>>>> +         * buffer base address (e.g. to align buffers on a cache
>>>>> line).
>>>>> +         * It must be a power of two; zero means no extra alignment.
>>>>> +         */
>>>>> +        virtio_cread(vdev, struct virtio_rpmsg_config,
>>>>> +                 rpmsg_buf_align, &rpmsg_buf_align);
>>>>> +        if (rpmsg_buf_align && !is_power_of_2(rpmsg_buf_align)) {
>>>>> +            dev_err(&vdev->dev,
>>>>> +                "bad vdev config: rpmsg_buf_align %u is not a power
>>>>> of two\n",
>>>>> +                rpmsg_buf_align);
>>>>> +            err = -EINVAL;
>>>>> +            goto vqs_del;
>>>>> +        }
>>>>> +
>>>>> +        /* note: tx and rx are defined from remote view */
>>>>> +        virtio_cread(vdev, struct virtio_rpmsg_config,
>>>>> +                 txbuf_size, &vrp->rx_buf_size);
>>>>> +        virtio_cread(vdev, struct virtio_rpmsg_config,
>>>>> +                 rxbuf_size, &vrp->tx_buf_size);
>>>>> +
>>>>> +        /* The buffers must hold at least the rpmsg header */
>>>>> +        if (vrp->rx_buf_size < sizeof(struct rpmsg_hdr) ||
>>>>> +            vrp->tx_buf_size < sizeof(struct rpmsg_hdr)) {
>>>>> +            dev_err(&vdev->dev,
>>>>> +                "bad vdev config: rx buf sz = %u, tx buf sz = %u\n",
>>>>> +                vrp->rx_buf_size, vrp->tx_buf_size);
>>>>> +            err = -EINVAL;
>>>>> +            goto vqs_del;
>>>>> +        }
>>>>> +
>>>>> +        /*
>>>>> +         * The buffer size must be aligned to the provided
>>>>> alignment for
>>>>> +         * so that the start address of tx bufs can be aligned.
>>>>> +         */
>>>>
>>>> 'tx' should be removed, as this also concerns Rx buffers
>>>>
>>>
>>> Ack.
>>>
>>>>
>>>> What about removing this check to manage alignment during buffer
>>>> allocation?
>>>>
>>>> For example, if the alignment is on a 64-bit address and the tx_buffer
>>>> and rx_buffer sizes are 40 bytes, 48 bytes can be allocated in memory
>>>> for each buffer, and the virtio descriptor can be filled with aligned
>>>> addresses.
>>>>
>>>> In other words, the rpmsg_buf_align field contains the alignment
>>>> constraint from the remote processor. If the Linux kernel wants to
>>>> impose another alignment constraint, it must test or update
>>>> rpmsg_buf_align, but it must not impose alignment on the buffer size.
>>>>
>>>>
>>>
>>> This part I don't understand. `rpmsg_buf_align` is alignment for only
>>> single buffer size. The linux kernel is checking that single rx buf size
>>> and tx buf size is aligned with `rpmsg_buf_align` as firmware has
>>> claimed.
>>>
>>> For reference the openamp-system-reference PR:
>>> https://github.com/OpenAMP/openamp-system-reference/pull/106/changes
>>>
>>>      .vdev_config = {
>>>          .version = 1,
>>>          .reserved = 0,
>>>          .size = (uint16_t)(sizeof(struct rpmsg_virtio_config) -
>>> sizeof(bool)),
>>>          .alignment = RPMSG_BUF_ALIGN,
>>>          .reserved1 = 0,
>>>          /* Tx for host */
>>>          .h2r_buf_size = metal_align_up(4096, RPMSG_BUF_ALIGN),
>>>          /* Rx for host */
>>>          .r2h_buf_size = metal_align_up(4096, RPMSG_BUF_ALIGN),
>>>      },
>>>
>>> IIUC, The linux kernel is not really supposed to modify
>>> `rpmsg_buf_align`. It only uses it to check that firmware has assigned
>>> correct size of single rx and tx buffer.
>>>
>>>
>>> When the linux kernel uses dma_alloc_coherent() API it aligns total
>>> buffer size with page size. That is different than single tx buf size
>>> and single rx buf size. The total buf size alignment to page size is
>>> irrelevant to `rpmsg_buf_align` field.
>>>
>>> Please let me know if I am missing something or didn't understand your
>>> comment. I prefer that `rpmsg_buf_align` should be only modified by the
>>> firmware and not the linux kernel.
>>
>>
>> Sorry it was unclear, let me try to re-explain my suggestion:
>>
>> Two alignment constraints can apply:
>> - The remote processor can require an alignment through
>>    vdev_config::alignment.
>> - The main processor, which runs Linux or another operating system (OS),
>>    can require a different alignment, for example, for cache alignment.
>> The current Linux implementation has no such constraint.
>> Nevertheless, I would be in favor of taking such a future constraint
>> into account without imposing a constraint on the buffer sizes.
> 
> Is this ever going to be true? Is it ever possible that Linux and the remote
> have different cache alignment? IIUC, both will be using the same cache, and
> so the same alignment will be applicable. That is why only a single alignment
> is required.

Some remote processors, for example, some Arm Cortex-M33, do not 
integrate cache. Even if cache exists, cache can be enabled on one 
processor, but not on the other.

> 
>> Based on that, in the short term the local 'rpmsg_buf_align' would still be
>> computed only from vdev_config::alignment (without updating
>> vdev_config::alignment).
>>
>> virtio_cread(vdev, struct virtio_rpmsg_config,
>>                   rpmsg_buf_align, &rpmsg_buf_align);
>>
>> Then you could use the ALIGN() helper:
>>
>> unsigned int rx_buf_align_size = ALIGN(vrp->rx_buf_size,
>>                         rpmsg_buf_align);
>> unsigned int tx_buf_align_size = ALIGN(vrp->tx_buf_size,
>>                         rpmsg_buf_align);
>>
> 
> This is where I have different opinion. Instead of Linux using ALIGN()
> macro, can we expect that firmware must assign the aligned buffer size
> with vdev_config::rpmsg_buf_align? And so Linux will fail if the buffer
> size is not aligned already from the firmware side. That is why I had
> introduced checks instead of doing alignment by linux.
> 
>> total_buf_space = (vrp->num_rx_buf * rx_buf_align_size) +
>>            (vrp->num_tx_buf * tx_buf_align_size);
>>
>> vrp->tx_bufs = bufs_va + vrp->num_rx_buf * rx_buf_align_size;
>>
>> Apply the same rule to cpu_addr in the vring descriptor:
>>
>> void *cpu_addr = vrp->rx_bufs + i * rx_buf_align_size;
>>
>> rpmsg_sg_init(&sg, cpu_addr, vrp->rx_buf_size);
>>
>> With this approach, the buffer addresses remain aligned
>> independently of vdev_config::rxbuf_size and vdev_config::txbuf_size.
>> Don't hesitate to ask if it is still not clear!
> 
> How do they remain aligned independently of tx/rx_buf_size? The tx_bufs address
> is still calculated based on rx_buf_align_size, so its alignment still
> depends on rx_buf_align_size, which is derived using
> vdev_config::rpmsg_buf_align.
> 
> I think we are trying to achieve the same thing, but the implementation is
> different. We just need to decide where the alignment should be done.
> 
> Either on the linux side? Or in the firmware resource table?
> 
> I prefer that the firmware should already provide aligned buffer size,
> and Linux should only check it. If alignment is not done, then simply
> fail with error. That way, firmware also knows the correct size of the
> buffer. If Linux does the alignment, then the firmware is not aware of
> the correct size that is used by the linux.
> 
> I am open to move the alignment operation to the linux side with the
> reasonable justification.

That remains a suggestion. My main concern with the implementation is
that the RPMsg buffer size should depend only on the max payload size
needed, not also on the memory alignment.

If this constraint is kept, it must be imposed on all other non-Linux
solutions. Otherwise, the remote implementation depends on the main
processor implementation.

From my POV, it would be preferable not to impose such a constraint when
possible.

Thanks,
Arnaud
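
For readers following the thread, the two options under discussion (only
checking that the firmware-provided sizes are aligned, versus rounding the
per-buffer stride up on the Linux side) can be sketched in user-space C.
This is an illustration only, not code from the patch: the helper names
size_is_acceptable(), buf_stride() and buf_offset() are hypothetical, and
the bit arithmetic merely mimics what the kernel's IS_ALIGNED() and ALIGN()
macros do for power-of-two alignments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-in for the kernel's is_power_of_2(). */
bool is_pow2(size_t a)
{
	return a != 0 && (a & (a - 1)) == 0;
}

/*
 * Check-only policy: the size must already be a multiple of the alignment,
 * otherwise the configuration is rejected. An alignment of zero means
 * "no extra alignment".
 */
bool size_is_acceptable(size_t buf_size, size_t align)
{
	if (align == 0)
		return true;
	assert(is_pow2(align));
	return (buf_size & (align - 1)) == 0;
}

/*
 * Round-up policy: the payload size is left untouched, but the distance
 * between two consecutive buffers is rounded up to the alignment.
 */
size_t buf_stride(size_t buf_size, size_t align)
{
	if (align == 0)
		return buf_size;
	assert(is_pow2(align));
	return (buf_size + align - 1) & ~(align - 1);
}

/* Offset of buffer i from an (assumed aligned) base address. */
size_t buf_offset(size_t i, size_t buf_size, size_t align)
{
	return i * buf_stride(buf_size, align);
}
```

With a 40-byte buffer and a 16-byte alignment (values picked for
illustration), size_is_acceptable() rejects the configuration, whereas
buf_stride() yields 48, so every buffer start stays 16-byte aligned while
the advertised payload size remains 40.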

> 
> Thank You,
> Tanmay
> 
>>>
>>>
>>>>> +        if (rpmsg_buf_align &&
>>>>> +            (!IS_ALIGNED(vrp->rx_buf_size, rpmsg_buf_align) ||
>>>>> +             !IS_ALIGNED(vrp->tx_buf_size, rpmsg_buf_align))) {
>>>>> +            dev_err(&vdev->dev,
>>>>> +                "bad vdev config: buf sizes (rx %u, tx %u) not
>>>>> aligned to %u\n",
>>>>> +                vrp->rx_buf_size, vrp->tx_buf_size,
>>>>> +                rpmsg_buf_align);
>>>>> +            err = -EINVAL;
>>>>> +            goto vqs_del;
>>>>> +        }
>>>>> +
>>>>> +        dev_dbg(&vdev->dev,
>>>>> +            "vdev config: ver=%u, align=0x%x, rx sz = 0x%x, tx sz =
>>>>> 0x%x\n",
>>>>> +            version, rpmsg_buf_align, vrp->rx_buf_size,
>>>>> +            vrp->tx_buf_size);
>>>>> +    } else {
>>>>> +        vrp->rx_buf_size = DEFAULT_RPMSG_BUF_SIZE;
>>>>> +        vrp->tx_buf_size = DEFAULT_RPMSG_BUF_SIZE;
>>>>> +    }
>>>>> +
>>>>> +    total_buf_space = (vrp->num_rx_buf * vrp->rx_buf_size) +
>>>>> +              (vrp->num_tx_buf * vrp->tx_buf_size);
>>>>>           /* allocate coherent memory for the buffers */
>>>>>         bufs_va = dma_alloc_coherent(vdev->dev.parent,
>>>>> @@ -874,15 +959,20 @@ static int rpmsg_probe(struct virtio_device
>>>>> *vdev)
>>>>>         /* first part of the buffers is dedicated for RX */
>>>>>         vrp->rx_bufs = bufs_va;
>>>>>     -    /* and second part is dedicated for TX */
>>>>> -    vrp->tx_bufs = bufs_va + vrp->num_rx_buf * vrp->buf_size;
>>>>> +    /*
>>>>> +     * Here buf_va is aligned to a page. Also rx buf size is aligned
>>>>> with
>>>>> +     * cache line alignment provided by the firmware, so tx buf's
>>>>> start
>>>>> +     * address is guaranteed to be aligned with the alignment
>>>>> provided by
>>>>> +     * the firmware.
>>>>> +     */
>>>>> +    vrp->tx_bufs = bufs_va + (vrp->num_rx_buf * vrp->rx_buf_size);
>>>>>           /* set up the receive buffers */
>>>>>         for (i = 0; i < vrp->num_rx_buf; i++) {
>>>>>             struct scatterlist sg;
>>>>> -        void *cpu_addr = vrp->rx_bufs + i * vrp->buf_size;
>>>>> +        void *cpu_addr = vrp->rx_bufs + i * vrp->rx_buf_size;
>>>>>     -        rpmsg_sg_init(&sg, cpu_addr, vrp->buf_size);
>>>>> +        rpmsg_sg_init(&sg, cpu_addr, vrp->rx_buf_size);
>>>>>               err = virtqueue_add_inbuf(vrp->rvq, &sg, 1, cpu_addr,
>>>>>                           GFP_KERNEL);
>>>>> @@ -965,8 +1055,8 @@ static int rpmsg_remove_device(struct device
>>>>> *dev, void *data)
>>>>>     static void rpmsg_remove(struct virtio_device *vdev)
>>>>>     {
>>>>>         struct virtproc_info *vrp = vdev->priv;
>>>>> -    unsigned int num_bufs = vrp->num_rx_buf + vrp->num_tx_buf;
>>>>> -    size_t total_buf_space = num_bufs * vrp->buf_size;
>>>>> +    size_t total_buf_space = (vrp->num_rx_buf * vrp->rx_buf_size) +
>>>>> +                 (vrp->num_tx_buf * vrp->tx_buf_size);
>>>>>         int ret;
>>>>>           virtio_reset_device(vdev);
>>>>> @@ -992,6 +1082,7 @@ static struct virtio_device_id id_table[] = {
>>>>>       static unsigned int features[] = {
>>>>>         VIRTIO_RPMSG_F_NS,
>>>>> +    VIRTIO_RPMSG_F_BUFSZ,
>>>>>     };
>>>>>       static struct virtio_driver virtio_ipc_driver = {
>>>>> diff --git a/include/linux/rpmsg/virtio_rpmsg.h b/include/linux/rpmsg/
>>>>> virtio_rpmsg.h
>>>>> new file mode 100644
>>>>> index 000000000000..7e14da68fd17
>>>>> --- /dev/null
>>>>> +++ b/include/linux/rpmsg/virtio_rpmsg.h
>>>>> @@ -0,0 +1,50 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>> +/*
>>>>> + * Copyright (C) Pinecone Inc. 2019
>>>>> + * Copyright (C) Xiang Xiao <xiaoxiang@pinecone.net>
>>>>> + * Copyright (C) Advanced Micro Devices, Inc. 2026
>>>>> + */
>>>>> +
>>>>> +#ifndef _LINUX_VIRTIO_RPMSG_H
>>>>> +#define _LINUX_VIRTIO_RPMSG_H
>>>>> +
>>>>> +#include <linux/types.h>
>>>>> +#include <linux/virtio_types.h>
>>>>> +
>>>>> +/* The feature bitmap for virtio rpmsg */
>>>>> +#define VIRTIO_RPMSG_F_NS    0 /* RP supports name service
>>>>> notifications */
>>>>> +#define VIRTIO_RPMSG_F_BUFSZ    1 /* RP get buffer size from config
>>>>> space */
>>>>> +
>>>>> +/* Version of struct virtio_rpmsg_config understood by this driver */
>>>>> +#define RPMSG_VDEV_CONFIG_V1    1
>>>>> +
>>>>> +/**
>>>>> + * struct virtio_rpmsg_config - config space for rpmsg virtio device
>>>>> + *
>>>>> + * @version:    version of this structure, currently
>>>>> %RPMSG_VDEV_CONFIG_V1.
>>>>> + * @reserved:    reserved for padding, must be zero.
>>>>> + * @size:    size of this structure in bytes.
>>>>> + * @rpmsg_buf_align:    required alignment in bytes for each buffer.
>>>>> Must be a
>>>>> + *        power of two so that both the buffer sizes and the TX buffer
>>>>> + *        base address can be aligned (e.g. to a cache line).
>>>>> + * @reserved1:    reserved for padding, must be zero. Keeps the
>>>>> following 32-bit
>>>>> + *        fields naturally aligned.
>>>>> + * @txbuf_size:    Tx buf size from remote's view. For Linux this is
>>>>> rx buf size.
>>>>> + * @rxbuf_size:    Rx buf size from remote's view. For Linux this is
>>>>> tx buf size.
>>>>> + *
>>>>> + * This is the configuration structure shared by the device and the
>>>>> driver,
>>>>> + * read when %VIRTIO_RPMSG_F_BUFSZ is negotiated. The fields are laid
>>>>> out so
>>>>> + * the structure is naturally 32-bit aligned.
>>>>> + */
>>>>> +struct virtio_rpmsg_config {
>>>>> +    u8 version;
>>>>> +    u8 reserved;
>>>>
>>>> What about defining the version type as u16 to avoid the reserved field?
>>>>
>>>>> +    __virtio16 size;
>>>>> +    __virtio16 rpmsg_buf_align;
>>>>> +    __virtio16 reserved1;
>>>>
>>>> Seems useless if __packed prevents the compiler from inserting extra
>>>> padding
>>>> bytes between fields,
>>>>
>>>>> +    /* The tx/rx individual buffer size (if VIRTIO_RPMSG_F_BUFSZ) */
>>>>> +    __virtio32 txbuf_size;
>>>>> +    __virtio32 rxbuf_size;
>>>>> +} __packed;
>>>>
>>>> proposal
>>>>
>>>> +struct virtio_rpmsg_config {
>>>> +    __virtio16 version;
>>>> +    __virtio16 size;
>>>> +    /* The tx/rx individual buffer size (if VIRTIO_RPMSG_F_BUFSZ) */
>>>> +    __virtio32 txbuf_size;
>>>> +    __virtio32 rxbuf_size;
>>>> +    __virtio16 rpmsg_buf_align;
>>>> +} __packed;
>>>> +
>>>>
>>>
>>> I am okay with the above proposal with minor difference:
>>>
>>> My proposal:
>>>
>>> +struct virtio_rpmsg_config {
>>> +    u8 version;
>>> +    __virtio16 size;
>>> +    __virtio16 rpmsg_buf_align;
>>> +    /* The tx/rx individual buffer size (if VIRTIO_RPMSG_F_BUFSZ) */
>>> +    __virtio32 txbuf_size;
>>> +    __virtio32 rxbuf_size;
>>> +} __packed;
>>>
>>> I just want to keep the version field 8-bit, as we will probably never use
>>> the upper byte of that field if we use 16-bit. The rest is okay. If the
>>> structure is packed then reserved bytes are not needed.
>>>
>>> Please let me know your view.
>>
>> No strong opinion on that. In the end, this structure is read only one
>> time.
>> If it is acceptable to Mathieu, it is acceptable to me.
>>
>> Thanks,
>> Arnaud
>>
>>>
>>> Thanks,
>>> Tanmay
>>>
>>>
>>>> Regards,
>>>> Arnaud
>>>>
>>>>> +
>>>>> +#endif /* _LINUX_VIRTIO_RPMSG_H */
>>>>
>>>
>>
> 
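
As a side note on the three struct virtio_rpmsg_config layouts quoted
above, the effect of __packed on size and field offsets can be checked with
a small user-space sketch. The assumptions are: GCC or Clang (for
__attribute__((packed))), fixed-width integer types standing in for
__virtio16/__virtio32, and hypothetical struct names chosen only to tell
the variants apart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as posted in v4: explicit reserved fields. */
struct cfg_v4 {
	uint8_t  version;
	uint8_t  reserved;
	uint16_t size;
	uint16_t rpmsg_buf_align;
	uint16_t reserved1;
	uint32_t txbuf_size;
	uint32_t rxbuf_size;
} __attribute__((packed));

/* Variant with a 16-bit version and the alignment field last. */
struct cfg_u16_version {
	uint16_t version;
	uint16_t size;
	uint32_t txbuf_size;
	uint32_t rxbuf_size;
	uint16_t rpmsg_buf_align;
} __attribute__((packed));

/* Variant with an 8-bit version and no reserved fields. */
struct cfg_u8_version {
	uint8_t  version;
	uint16_t size;
	uint16_t rpmsg_buf_align;
	uint32_t txbuf_size;
	uint32_t rxbuf_size;
} __attribute__((packed));

/* Sizes with packing applied: no padding is inserted anywhere. */
_Static_assert(sizeof(struct cfg_v4) == 16, "v4 layout");
_Static_assert(sizeof(struct cfg_u16_version) == 14, "u16 version layout");
_Static_assert(sizeof(struct cfg_u8_version) == 13, "u8 version layout");

/* Byte offset of the first 32-bit field in each variant. */
size_t txbuf_offset_v4(void)  { return offsetof(struct cfg_v4, txbuf_size); }
size_t txbuf_offset_u16(void) { return offsetof(struct cfg_u16_version, txbuf_size); }
size_t txbuf_offset_u8(void)  { return offsetof(struct cfg_u8_version, txbuf_size); }
```

Under these assumptions the first two variants keep the 32-bit fields at a
multiple of four (offsets 8 and 4), while the packed 8-bit-version variant
places txbuf_size at offset 5, i.e. the multi-byte fields are no longer
naturally aligned. Whether that matters for config space accesses is for
the maintainers to judge.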



* Re: [PATCH v6 06/16] iio: core: create local __iio_chan_prefix_emit() for reuse
From: Rodrigo Alencar @ 2026-06-19  7:43 UTC (permalink / raw)
  To: Andy Shevchenko, Rodrigo Alencar
  Cc: Nuno Sá, rodrigo.alencar, linux-iio, devicetree,
	linux-kernel, linux-doc, linux-hardening, Lars-Peter Clausen,
	Michael Hennerich, Jonathan Cameron, David Lechner,
	Andy Shevchenko, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Philipp Zabel, Jonathan Corbet, Shuah Khan, Kees Cook,
	Gustavo A. R. Silva
In-Reply-To: <ajQ1bZSNHQ96pyJx@ashevche-desk.local>

On 18/06/26 21:14, Andy Shevchenko wrote:
> On Thu, Jun 18, 2026 at 05:14:19PM +0100, Rodrigo Alencar wrote:
> > On 18/06/26 16:06, Nuno Sá wrote:
> > > On Thu, Jun 18, 2026 at 02:27:22PM +0100, Rodrigo Alencar via B4 Relay wrote:
> 
> ...
> 
> > > > +	dev_attr->attr.name = kasprintf(GFP_KERNEL, "%s%s", prefix, postfix);
> > > > +	if (!dev_attr->attr.name)
> > > >  		return -ENOMEM;
> > > 
> > > I don't oppose the change. Looks like a nice cleanup.
> 
> May I oppose it? I found use scnprintf() is harder to follow in comparison to
> nice kasprintf() that takes care for the dynamically allocated buffer.

In the next patch the function is reused in a sysfs attribute read handler,
a context in which dynamic allocation would not be nice. vscnprintf() is
the main building block of sysfs_emit(), which limits the buffer length to
a page size, so I used scnprintf() to avoid deviating much from that.

kasprintf() is still used in the caller, where the logic was a bit confusing
as it tried to avoid multiple allocations.
 
> Also there is a chance to get a name silently cut due to insufficient space.
> Besides that this function can't be used (again due to 'c') in kasprintf()-like
> wrapper. I do not consider this as a good approach. Have you looked at seq_buf
> instead?

Isn't NAME_MAX the maximum length a filename can have? I suppose there should be
enough space for the channel prefix. Indeed, seq_buf can be used, and it cleans
things up a bit as it tracks the position in the buffer.

> 
> > > But bear in mind this very sensible as any subtle mistake means ABI breakage.
> 
> Which immediately raises a question of test coverage. Do we have one? If not,
> this code must be accompanied with one.

Agreed. I will look into adding tests for v7.

> > Yes! I tried to be careful... this is dangerous stuff!

-- 
Kind regards,

Rodrigo Alencar
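
To make the truncation concern raised above concrete, here is a minimal
user-space sketch. It is not the code from the series: emit_bounded() is a
hypothetical stand-in that mimics the return convention of the kernel's
scnprintf() (number of characters actually stored, excluding the NUL), and
name_fits() is a hypothetical helper that uses the C library's snprintf()
to learn the length that would have been needed.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Format into buf and return how many characters were actually stored,
 * never more than size - 1. Truncation is therefore not visible from the
 * return value alone.
 */
int emit_bounded(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (size == 0)
		return 0;

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	if (n < 0)
		return 0;
	return (size_t)n >= size ? (int)(size - 1) : n;
}

/* Return 1 if "<prefix><postfix>" plus the NUL fits into size bytes. */
int name_fits(size_t size, const char *prefix, const char *postfix)
{
	int needed = snprintf(NULL, 0, "%s%s", prefix, postfix);

	return needed >= 0 && (size_t)needed < size;
}
```

For example, formatting "in_voltage0" followed by "_raw" into an 8-byte
buffer stores only "in_volt" and returns 7, with no error reported; a caller
that cares has to check the required length separately, which is roughly
what an allocating helper such as kasprintf() or a seq_buf overflow check
gives for free.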


* [PATCH v2 4/4] rv/rtapp: Add wakeup monitor
From: Nam Cao @ 2026-06-19  7:21 UTC (permalink / raw)
  To: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-doc,
	linux-kernel
  Cc: Nam Cao
In-Reply-To: <cover.1781852967.git.namcao@linutronix.de>

Add a wakeup monitor to detect a lower-priority task waking up a
higher-priority task.

The rtapp/sleep monitor already detects this. However, that monitor
triggers an error in the context of the wakee task, and the user only gets
the stacktrace of that task. It is also extremely useful to get the
stacktrace of the waker task, which this monitor offers. In other
words, this monitor complements the rtapp/sleep monitor.

Signed-off-by: Nam Cao <namcao@linutronix.de>
---
 Documentation/trace/rv/monitor_rtapp.rst      |  20 +++
 kernel/trace/rv/Kconfig                       |   1 +
 kernel/trace/rv/Makefile                      |   1 +
 kernel/trace/rv/monitors/rtapp/Kconfig        |   2 +-
 kernel/trace/rv/monitors/wakeup/Kconfig       |  16 ++
 kernel/trace/rv/monitors/wakeup/wakeup.c      | 153 ++++++++++++++++++
 kernel/trace/rv/monitors/wakeup/wakeup.h      |  92 +++++++++++
 .../trace/rv/monitors/wakeup/wakeup_trace.h   |  14 ++
 kernel/trace/rv/rv_trace.h                    |   1 +
 tools/verification/models/rtapp/wakeup.ltl    |   5 +
 10 files changed, 304 insertions(+), 1 deletion(-)
 create mode 100644 kernel/trace/rv/monitors/wakeup/Kconfig
 create mode 100644 kernel/trace/rv/monitors/wakeup/wakeup.c
 create mode 100644 kernel/trace/rv/monitors/wakeup/wakeup.h
 create mode 100644 kernel/trace/rv/monitors/wakeup/wakeup_trace.h
 create mode 100644 tools/verification/models/rtapp/wakeup.ltl

diff --git a/Documentation/trace/rv/monitor_rtapp.rst b/Documentation/trace/rv/monitor_rtapp.rst
index 502d3ea412eb..238b59395ff5 100644
--- a/Documentation/trace/rv/monitor_rtapp.rst
+++ b/Documentation/trace/rv/monitor_rtapp.rst
@@ -124,3 +124,23 @@ to handle some special cases:
     real-time-safe because preemption is disabled for the duration.
   - `FUTEX_LOCK_PI` is included in the allowlist for the same reason as
     `BLOCK_ON_RT_MUTEX`.
+
+Monitor wakeup
+++++++++++++++
+
+The `wakeup` monitor reports real-time threads being woken by lower-priority threads,
+which is a hint of priority inversion. Its specification is::
+
+  RULE = always (((RT and USER_THREAD) imply
+                (not (WOKEN_BY_LOWER_PRIO or WOKEN_BY_SOFTIRQ)) or ALLOWLIST))
+
+  ALLOWLIST = BLOCK_ON_RT_MUTEX
+           or FUTEX_LOCK_PI
+
+The `sleep` monitor already reports this type of problem. The difference is the
+context in which the problem is reported. While the `sleep` monitor reports the problem
+in the context of the wakee, this `wakeup` monitor reports the problem in the context of
+the waker. This monitor complements the `sleep` monitor, giving the user a better
+understanding of the issue. For instance, to debug a lower-priority task waking a
+higher-priority task scenario, user can enable both `wakeup` monitor and `sleep`
+monitor to get the stack traces of both tasks.
diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
index 3884b14df375..4d3a14a0bac2 100644
--- a/kernel/trace/rv/Kconfig
+++ b/kernel/trace/rv/Kconfig
@@ -76,6 +76,7 @@ source "kernel/trace/rv/monitors/opid/Kconfig"
 source "kernel/trace/rv/monitors/rtapp/Kconfig"
 source "kernel/trace/rv/monitors/pagefault/Kconfig"
 source "kernel/trace/rv/monitors/sleep/Kconfig"
+source "kernel/trace/rv/monitors/wakeup/Kconfig"
 # Add new rtapp monitors here
 
 source "kernel/trace/rv/monitors/stall/Kconfig"
diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index 94498da35b37..c2c0e4142eb4 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
 obj-$(CONFIG_RV_MON_STALL) += monitors/stall/stall.o
 obj-$(CONFIG_RV_MON_DEADLINE) += monitors/deadline/deadline.o
 obj-$(CONFIG_RV_MON_NOMISS) += monitors/nomiss/nomiss.o
+obj-$(CONFIG_RV_MON_WAKEUP) += monitors/wakeup/wakeup.o
 # Add new monitors here
 obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
 obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
diff --git a/kernel/trace/rv/monitors/rtapp/Kconfig b/kernel/trace/rv/monitors/rtapp/Kconfig
index 1ce9370a9ba8..1fcd7a400ded 100644
--- a/kernel/trace/rv/monitors/rtapp/Kconfig
+++ b/kernel/trace/rv/monitors/rtapp/Kconfig
@@ -1,6 +1,6 @@
 config RV_MON_RTAPP
 	depends on RV
-	depends on RV_PER_TASK_MONITORS >= 2
+	depends on RV_PER_TASK_MONITORS >= 3
 	bool "rtapp monitor"
 	help
 	  Collection of monitors to check for common problems with real-time
diff --git a/kernel/trace/rv/monitors/wakeup/Kconfig b/kernel/trace/rv/monitors/wakeup/Kconfig
new file mode 100644
index 000000000000..ec3a5c06a8c4
--- /dev/null
+++ b/kernel/trace/rv/monitors/wakeup/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+config RV_MON_WAKEUP
+	depends on RV
+	depends on RV_MON_RTAPP
+	depends on HAVE_SYSCALL_TRACEPOINTS
+	default y
+	select LTL_MON_EVENTS_ID
+	bool "wakeup monitor"
+	help
+	  This monitor detects a lower-priority task waking up a
+	  higher-priority task. The RV_MON_SLEEP monitor already
+	  detects this case, but this monitor detects in the context
+	  of the waker task instead. This and RV_MON_SLEEP can be
+	  enabled together to get the stacktrace of both the waker
+	  task and the wakee task.
diff --git a/kernel/trace/rv/monitors/wakeup/wakeup.c b/kernel/trace/rv/monitors/wakeup/wakeup.c
new file mode 100644
index 000000000000..01b47416f24e
--- /dev/null
+++ b/kernel/trace/rv/monitors/wakeup/wakeup.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/ftrace.h>
+#include <linux/tracepoint.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/rv.h>
+#include <rv/instrumentation.h>
+
+#define MODULE_NAME "wakeup"
+
+#include <trace/events/syscalls.h>
+#include <trace/events/sched.h>
+#include <trace/events/lock.h>
+#include <uapi/linux/futex.h>
+
+#include <rv_trace.h>
+#include <monitors/rtapp/rtapp.h>
+
+
+#ifndef __NR_futex
+#define __NR_futex (-__COUNTER__)
+#endif
+#ifndef __NR_futex_time64
+#define __NR_futex_time64 (-__COUNTER__)
+#endif
+
+#include "wakeup.h"
+#include <rv/ltl_monitor.h>
+
+static void ltl_atoms_fetch(struct task_struct *task, struct ltl_monitor *mon)
+{
+	/*
+	 * This includes "actual" real-time tasks and also PI-boosted
+	 * tasks. A task being PI-boosted means it is blocking an "actual"
+	 * real-task, therefore it should also obey the monitor's rule,
+	 * otherwise the "actual" real-task may be delayed.
+	 */
+	ltl_atom_set(mon, LTL_RT, rt_or_dl_task(task));
+}
+
+static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bool task_creation)
+{
+	ltl_atom_set(mon, LTL_WOKEN_BY_LOWER_PRIO, false);
+	ltl_atom_set(mon, LTL_WOKEN_BY_SOFTIRQ, false);
+
+	if (task_creation) {
+		ltl_atom_set(mon, LTL_BLOCK_ON_RT_MUTEX, false);
+		ltl_atom_set(mon, LTL_FUTEX_LOCK_PI, false);
+	}
+
+	ltl_atom_set(mon, LTL_USER_THREAD, !(task->flags & PF_KTHREAD));
+}
+
+static void handle_sched_waking(void *data, struct task_struct *task)
+{
+	if (in_task()) {
+		if (current->prio > task->prio)
+			ltl_atom_pulse(task, LTL_WOKEN_BY_LOWER_PRIO, true);
+	} else if (in_serving_softirq()) {
+		ltl_atom_pulse(task, LTL_WOKEN_BY_SOFTIRQ, true);
+	}
+}
+
+static void handle_contention_begin(void *data, void *lock, unsigned int flags)
+{
+	if (flags & LCB_F_RT)
+		ltl_atom_update(current, LTL_BLOCK_ON_RT_MUTEX, true);
+}
+
+static void handle_contention_end(void *data, void *lock, int ret)
+{
+	ltl_atom_update(current, LTL_BLOCK_ON_RT_MUTEX, false);
+}
+
+static void handle_sys_enter(void *data, struct pt_regs *regs, long id)
+{
+	unsigned long args[6];
+	int op, cmd;
+
+	switch (id) {
+	case __NR_futex:
+	case __NR_futex_time64:
+		syscall_get_arguments(current, regs, args);
+		op = args[1];
+		cmd = op & FUTEX_CMD_MASK;
+
+		switch (cmd) {
+		case FUTEX_LOCK_PI:
+		case FUTEX_LOCK_PI2:
+			ltl_atom_update(current, LTL_FUTEX_LOCK_PI, true);
+			break;
+		}
+		break;
+	}
+}
+
+static void handle_sys_exit(void *data, struct pt_regs *regs, long ret)
+{
+	ltl_atom_update(current, LTL_FUTEX_LOCK_PI, false);
+}
+
+static int enable_wakeup(void)
+{
+	int retval;
+
+	retval = ltl_monitor_init();
+	if (retval)
+		return retval;
+
+	rv_attach_trace_probe("rtapp_wakeup", sched_waking, handle_sched_waking);
+	rv_attach_trace_probe("rtapp_wakeup", contention_begin, handle_contention_begin);
+	rv_attach_trace_probe("rtapp_wakeup", contention_end, handle_contention_end);
+	rv_attach_trace_probe("rtapp_wakeup", sys_enter, handle_sys_enter);
+	rv_attach_trace_probe("rtapp_wakeup", sys_exit, handle_sys_exit);
+
+	return 0;
+}
+
+static void disable_wakeup(void)
+{
+	rv_detach_trace_probe("rtapp_wakeup", sched_waking, handle_sched_waking);
+	rv_detach_trace_probe("rtapp_wakeup", contention_begin, handle_contention_begin);
+	rv_detach_trace_probe("rtapp_wakeup", contention_end, handle_contention_end);
+	rv_detach_trace_probe("rtapp_wakeup", sys_enter, handle_sys_enter);
+	rv_detach_trace_probe("rtapp_wakeup", sys_exit, handle_sys_exit);
+
+	ltl_monitor_destroy();
+}
+
+static struct rv_monitor rv_wakeup = {
+	.name = "wakeup",
+	.description = "Monitor that real-time tasks are not woken by lower-priority tasks",
+	.enable = enable_wakeup,
+	.disable = disable_wakeup,
+};
+
+static int __init register_wakeup(void)
+{
+	return rv_register_monitor(&rv_wakeup, &rv_rtapp);
+}
+
+static void __exit unregister_wakeup(void)
+{
+	rv_unregister_monitor(&rv_wakeup);
+}
+
+module_init(register_wakeup);
+module_exit(unregister_wakeup);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Nam Cao <namcao@linutronix.de>");
+MODULE_DESCRIPTION("Monitor that real-time tasks are not woken by lower-priority tasks");
diff --git a/kernel/trace/rv/monitors/wakeup/wakeup.h b/kernel/trace/rv/monitors/wakeup/wakeup.h
new file mode 100644
index 000000000000..6f80da64e0e1
--- /dev/null
+++ b/kernel/trace/rv/monitors/wakeup/wakeup.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * C implementation of Buchi automaton, automatically generated by
+ * tools/verification/rvgen from the linear temporal logic specification.
+ * For further information, see kernel documentation:
+ *   Documentation/trace/rv/linear_temporal_logic.rst
+ */
+
+#include <linux/rv.h>
+
+#define MONITOR_NAME wakeup
+
+enum ltl_atom {
+	LTL_BLOCK_ON_RT_MUTEX,
+	LTL_FUTEX_LOCK_PI,
+	LTL_RT,
+	LTL_USER_THREAD,
+	LTL_WOKEN_BY_LOWER_PRIO,
+	LTL_WOKEN_BY_SOFTIRQ,
+	LTL_NUM_ATOM
+};
+static_assert(LTL_NUM_ATOM <= RV_MAX_LTL_ATOM);
+
+static const char *ltl_atom_str(enum ltl_atom atom)
+{
+	static const char *const names[] = {
+		"bl_on_rt_mu",
+		"fu_lo_pi",
+		"rt",
+		"us_th",
+		"wo_lo_pr",
+		"wo_so",
+	};
+
+	return names[atom];
+}
+
+enum ltl_buchi_state {
+	S0,
+	RV_NUM_BA_STATES
+};
+static_assert(RV_NUM_BA_STATES <= RV_MAX_BA_STATES);
+
+static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
+{
+	bool woken_by_softirq = test_bit(LTL_WOKEN_BY_SOFTIRQ, mon->atoms);
+	bool woken_by_lower_prio = test_bit(LTL_WOKEN_BY_LOWER_PRIO, mon->atoms);
+	bool user_thread = test_bit(LTL_USER_THREAD, mon->atoms);
+	bool rt = test_bit(LTL_RT, mon->atoms);
+	bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
+	bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
+	bool val9 = block_on_rt_mutex || futex_lock_pi;
+	bool val6 = !woken_by_softirq;
+	bool val5 = !woken_by_lower_prio;
+	bool val8 = val5 && val6;
+	bool val10 = val8 || val9;
+	bool val3 = !user_thread;
+	bool val2 = !rt;
+	bool val4 = val2 || val3;
+	bool val11 = val4 || val10;
+
+	if (val11)
+		__set_bit(S0, mon->states);
+}
+
+static void
+ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned long *next)
+{
+	bool woken_by_softirq = test_bit(LTL_WOKEN_BY_SOFTIRQ, mon->atoms);
+	bool woken_by_lower_prio = test_bit(LTL_WOKEN_BY_LOWER_PRIO, mon->atoms);
+	bool user_thread = test_bit(LTL_USER_THREAD, mon->atoms);
+	bool rt = test_bit(LTL_RT, mon->atoms);
+	bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
+	bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
+	bool val9 = block_on_rt_mutex || futex_lock_pi;
+	bool val6 = !woken_by_softirq;
+	bool val5 = !woken_by_lower_prio;
+	bool val8 = val5 && val6;
+	bool val10 = val8 || val9;
+	bool val3 = !user_thread;
+	bool val2 = !rt;
+	bool val4 = val2 || val3;
+	bool val11 = val4 || val10;
+
+	switch (state) {
+	case S0:
+		if (val11)
+			__set_bit(S0, next);
+		break;
+	}
+}
diff --git a/kernel/trace/rv/monitors/wakeup/wakeup_trace.h b/kernel/trace/rv/monitors/wakeup/wakeup_trace.h
new file mode 100644
index 000000000000..7e056183f920
--- /dev/null
+++ b/kernel/trace/rv/monitors/wakeup/wakeup_trace.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Snippet to be included in rv_trace.h
+ */
+
+#ifdef CONFIG_RV_MON_WAKEUP
+DEFINE_EVENT(event_ltl_monitor_id, event_wakeup,
+	     TP_PROTO(struct task_struct *task, char *states, char *atoms, char *next),
+	     TP_ARGS(task, states, atoms, next));
+DEFINE_EVENT(error_ltl_monitor_id, error_wakeup,
+	     TP_PROTO(struct task_struct *task),
+	     TP_ARGS(task));
+#endif /* CONFIG_RV_MON_WAKEUP */
diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
index 9622c269789c..2f8a932432c9 100644
--- a/kernel/trace/rv/rv_trace.h
+++ b/kernel/trace/rv/rv_trace.h
@@ -241,6 +241,7 @@ DECLARE_EVENT_CLASS(error_ltl_monitor_id,
 );
 #include <monitors/pagefault/pagefault_trace.h>
 #include <monitors/sleep/sleep_trace.h>
+#include <monitors/wakeup/wakeup_trace.h>
 // Add new monitors based on CONFIG_LTL_MON_EVENTS_ID here
 #endif /* CONFIG_LTL_MON_EVENTS_ID */
 
diff --git a/tools/verification/models/rtapp/wakeup.ltl b/tools/verification/models/rtapp/wakeup.ltl
new file mode 100644
index 000000000000..a5d63ca0811a
--- /dev/null
+++ b/tools/verification/models/rtapp/wakeup.ltl
@@ -0,0 +1,5 @@
+RULE = always (((RT and USER_THREAD) imply
+		(not (WOKEN_BY_LOWER_PRIO or WOKEN_BY_SOFTIRQ)) or ALLOWLIST))
+
+ALLOWLIST = BLOCK_ON_RT_MUTEX
+         or FUTEX_LOCK_PI
-- 
2.47.3
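
For readers following the rule in wakeup.ltl, here is a minimal userspace
sketch of the same predicate. It is not part of the patch. The struct
wakeup_atoms type and the function names are hypothetical; only the boolean
structure is taken from the generated ltl_start() in wakeup.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace mirror of the atoms; not the kernel's struct ltl_monitor. */
struct wakeup_atoms {
	bool rt;
	bool user_thread;
	bool woken_by_lower_prio;
	bool woken_by_softirq;
	bool block_on_rt_mutex;
	bool futex_lock_pi;
};

/* Returns true when the rule holds for one snapshot of the atoms. */
static bool wakeup_rule_holds(const struct wakeup_atoms *a)
{
	bool allowlist = a->block_on_rt_mutex || a->futex_lock_pi;
	bool bad_wake = a->woken_by_lower_prio || a->woken_by_softirq;

	/* (RT and USER_THREAD) imply ((not bad_wake) or ALLOWLIST) */
	return !(a->rt && a->user_thread) || !bad_wake || allowlist;
}

/* Small usage example: three snapshots and the expected verdicts. */
void wakeup_rule_selfcheck(void)
{
	struct wakeup_atoms a = {
		.rt = true,
		.user_thread = true,
		.woken_by_lower_prio = true,
	};

	assert(!wakeup_rule_holds(&a));	/* RT user thread woken by lower prio */

	a.futex_lock_pi = true;
	assert(wakeup_rule_holds(&a));	/* allowlisted while in FUTEX_LOCK_PI */

	a.futex_lock_pi = false;
	a.user_thread = false;
	assert(wakeup_rule_holds(&a));	/* kernel threads are not checked */
}
```

The return expression corresponds to val11 in the generated code: val4 is
the negated antecedent, val8 is "no bad wakeup", and val9 is the allowlist.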


^ permalink raw reply related

* [PATCH v2 3/4] rv/rtapp/sleep: Stop monitoring kernel threads
From: Nam Cao @ 2026-06-19  7:21 UTC (permalink / raw)
  To: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-doc,
	linux-kernel
  Cc: Nam Cao, Sebastian Andrzej Siewior
In-Reply-To: <cover.1781852967.git.namcao@linutronix.de>

The rtapp/sleep monitor's primary purpose is detecting common mistakes
with user-space real-time design. Monitoring real-time issues with
kernel threads is a bonus.

However, accommodating kernel threads complicates the monitor because of
edge cases that the monitor sees as a lower-priority task waking a
higher-priority task:

  - kthread_stop() wakes up the task in order to stop it.

  - The rcu thread and migration thread can be woken by any task.

  - The ktimerd thread is woken near the end of irq_exit_rcu(), where
    the preempt counter is "broken" and falsely says this is task
    context. This requires the monitor to use the hardirq_context flag
    instead of the preempt counter.

Besides complicating the monitor, the final case also requires enabling
CONFIG_TRACE_IRQFLAGS (so that "hardirq_context" can be used). This
adds overhead to the kernel even when the monitor is not active. This
may be an obstacle to enabling this monitor in distros' kernels.

Furthermore, kernel threads are usually started before the monitor is
enabled. Consequently, the threads' states (in other words, the monitor's
atomic propositions for the threads) are not fully known to the
monitor. As a result, the kernel threads mostly cannot be monitored.

Overall, the downsides of accommodating kernel threads outweigh the
benefits. Thus, exclude kernel threads to simplify the monitor.

Signed-off-by: Nam Cao <namcao@linutronix.de>
---
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
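
As an illustration only (not part of the patch), the effect of adding
USER_THREAD to the rule's antecedent can be sketched in userspace C. The
function names below are hypothetical; the boolean structure follows val5
in the regenerated sleep.h. The temporal "until" part of the rule is
deliberately left out of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Antecedent of the updated rule: RT and SLEEP and USER_THREAD. */
static bool sleep_rule_applies(bool rt, bool sleep, bool user_thread)
{
	return rt && sleep && user_thread;
}

/*
 * Same shape as val5 in the generated code: the rule is trivially
 * satisfied whenever the antecedent is false.
 */
static bool sleep_rule_vacuous(bool rt, bool sleep, bool user_thread)
{
	bool val3 = !user_thread;
	bool val2 = !sleep;
	bool val4 = val2 || val3;
	bool val1 = !rt;

	return val1 || val4;
}

/* Small usage example. */
void sleep_rule_selfcheck(void)
{
	/* A sleeping real-time user thread is checked by the monitor. */
	assert(sleep_rule_applies(true, true, true));
	assert(!sleep_rule_vacuous(true, true, true));

	/* A sleeping real-time kernel thread is no longer checked. */
	assert(!sleep_rule_applies(true, true, false));
	assert(sleep_rule_vacuous(true, true, false));
}
```

With this shape, a kernel thread never leaves the accepting state because
of how it sleeps, which is why the KERNEL_THREAD, KTHREAD_SHOULD_STOP,
TASK_IS_RCU and TASK_IS_MIGRATION atoms can be dropped.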
---
 Documentation/trace/rv/monitor_rtapp.rst  |  22 ++---
 kernel/trace/rv/monitors/sleep/Kconfig    |   1 -
 kernel/trace/rv/monitors/sleep/sleep.c    |  39 +-------
 kernel/trace/rv/monitors/sleep/sleep.h    | 104 +++++++++-------------
 tools/verification/models/rtapp/sleep.ltl |   7 +-
 5 files changed, 54 insertions(+), 119 deletions(-)

diff --git a/Documentation/trace/rv/monitor_rtapp.rst b/Documentation/trace/rv/monitor_rtapp.rst
index 570be67a8f3b..502d3ea412eb 100644
--- a/Documentation/trace/rv/monitor_rtapp.rst
+++ b/Documentation/trace/rv/monitor_rtapp.rst
@@ -93,9 +93,9 @@ assessment.
 
 The monitor's specification is::
 
-  RULE = always ((RT and SLEEP) imply (RT_FRIENDLY_SLEEP or ALLOWLIST))
+  RULE = always ((RT and SLEEP and USER_THREAD) imply (RT_FRIENDLY_SLEEP or ALLOWLIST))
 
-  RT_FRIENDLY_SLEEP = (RT_VALID_SLEEP_REASON or KERNEL_THREAD)
+  RT_FRIENDLY_SLEEP = RT_VALID_SLEEP_REASON
                   and ((not SCHEDULE_IN) until RT_FRIENDLY_WAKE)
 
   RT_VALID_SLEEP_REASON = FUTEX_WAIT
@@ -110,23 +110,13 @@ The monitor's specification is::
                   or WOKEN_BY_HARDIRQ
                   or WOKEN_BY_NMI
                   or ABORT_SLEEP
-                  or KTHREAD_SHOULD_STOP
 
   ALLOWLIST = BLOCK_ON_RT_MUTEX
            or FUTEX_LOCK_PI
-           or TASK_IS_RCU
-           or TASK_IS_MIGRATION
-
-Beside the scenarios described above, this specification also handle some
-special cases:
-
-  - `KERNEL_THREAD`: kernel tasks do not have any pattern that can be recognized
-    as valid real-time sleeping reasons. Therefore sleeping reason is not
-    checked for kernel tasks.
-  - `KTHREAD_SHOULD_STOP`: a non-real-time thread may stop a real-time kernel
-    thread by waking it and waiting for it to exit (`kthread_stop()`). This
-    wakeup is safe for real-time.
-  - `ALLOWLIST`: to handle known false positives with the kernel.
+
+Beside the scenarios described above, this specification also defines an allow list
+to handle some special cases:
+
   - `BLOCK_ON_RT_MUTEX` is included in the allowlist due to its implementation.
     In the release path of rt_mutex, a boosted task is de-boosted before waking
     the rt_mutex's waiter. Consequently, the monitor may see a real-time-unsafe
diff --git a/kernel/trace/rv/monitors/sleep/Kconfig b/kernel/trace/rv/monitors/sleep/Kconfig
index 6b7a122e7b47..d6ec3e9a91b6 100644
--- a/kernel/trace/rv/monitors/sleep/Kconfig
+++ b/kernel/trace/rv/monitors/sleep/Kconfig
@@ -5,7 +5,6 @@ config RV_MON_SLEEP
 	select RV_LTL_MONITOR
 	depends on HAVE_SYSCALL_TRACEPOINTS
 	depends on RV_MON_RTAPP
-	select TRACE_IRQFLAGS
 	default y
 	select LTL_MON_EVENTS_ID
 	bool "sleep monitor"
diff --git a/kernel/trace/rv/monitors/sleep/sleep.c b/kernel/trace/rv/monitors/sleep/sleep.c
index 638be7d8747f..aa5a984853b5 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.c
+++ b/kernel/trace/rv/monitors/sleep/sleep.c
@@ -43,7 +43,6 @@ static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bo
 	ltl_atom_set(mon, LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO, false);
 
 	if (task_creation) {
-		ltl_atom_set(mon, LTL_KTHREAD_SHOULD_STOP, false);
 		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_REALTIME, false);
 		ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
 		ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
@@ -53,33 +52,7 @@ static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bo
 		ltl_atom_set(mon, LTL_BLOCK_ON_RT_MUTEX, false);
 	}
 
-	if (task->flags & PF_KTHREAD) {
-		ltl_atom_set(mon, LTL_KERNEL_THREAD, true);
-
-		/* kernel tasks do not do syscall */
-		ltl_atom_set(mon, LTL_FUTEX_WAIT, false);
-		ltl_atom_set(mon, LTL_FUTEX_LOCK_PI, false);
-		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_REALTIME, false);
-		ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
-		ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
-		ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
-
-		if (strstarts(task->comm, "migration/"))
-			ltl_atom_set(mon, LTL_TASK_IS_MIGRATION, true);
-		else
-			ltl_atom_set(mon, LTL_TASK_IS_MIGRATION, false);
-
-		if (strstarts(task->comm, "rcu"))
-			ltl_atom_set(mon, LTL_TASK_IS_RCU, true);
-		else
-			ltl_atom_set(mon, LTL_TASK_IS_RCU, false);
-	} else {
-		ltl_atom_set(mon, LTL_KTHREAD_SHOULD_STOP, false);
-		ltl_atom_set(mon, LTL_KERNEL_THREAD, false);
-		ltl_atom_set(mon, LTL_TASK_IS_RCU, false);
-		ltl_atom_set(mon, LTL_TASK_IS_MIGRATION, false);
-	}
-
+	ltl_atom_set(mon, LTL_USER_THREAD, !(task->flags & PF_KTHREAD));
 }
 
 static void handle_sched_set_state(void *data, struct task_struct *task, int state)
@@ -97,7 +70,7 @@ static void handle_sched_exit(void *data, bool is_switch)
 
 static void handle_sched_waking(void *data, struct task_struct *task)
 {
-	if (this_cpu_read(hardirq_context)) {
+	if (in_hardirq()) {
 		ltl_atom_pulse(task, LTL_WOKEN_BY_HARDIRQ, true);
 	} else if (in_task()) {
 		if (current->prio <= task->prio)
@@ -181,12 +154,6 @@ static void handle_sys_exit(void *data, struct pt_regs *regs, long ret)
 	ltl_atom_update(current, LTL_CLOCK_NANOSLEEP, false);
 }
 
-static void handle_kthread_stop(void *data, struct task_struct *task)
-{
-	/* FIXME: this could race with other tracepoint handlers */
-	ltl_atom_update(task, LTL_KTHREAD_SHOULD_STOP, true);
-}
-
 static int enable_sleep(void)
 {
 	int retval;
@@ -200,7 +167,6 @@ static int enable_sleep(void)
 	rv_attach_trace_probe("rtapp_sleep", sched_set_state_tp, handle_sched_set_state);
 	rv_attach_trace_probe("rtapp_sleep", contention_begin, handle_contention_begin);
 	rv_attach_trace_probe("rtapp_sleep", contention_end, handle_contention_end);
-	rv_attach_trace_probe("rtapp_sleep", sched_kthread_stop, handle_kthread_stop);
 	rv_attach_trace_probe("rtapp_sleep", sys_enter, handle_sys_enter);
 	rv_attach_trace_probe("rtapp_sleep", sys_exit, handle_sys_exit);
 	return 0;
@@ -213,7 +179,6 @@ static void disable_sleep(void)
 	rv_detach_trace_probe("rtapp_sleep", sched_set_state_tp, handle_sched_set_state);
 	rv_detach_trace_probe("rtapp_sleep", contention_begin, handle_contention_begin);
 	rv_detach_trace_probe("rtapp_sleep", contention_end, handle_contention_end);
-	rv_detach_trace_probe("rtapp_sleep", sched_kthread_stop, handle_kthread_stop);
 	rv_detach_trace_probe("rtapp_sleep", sys_enter, handle_sys_enter);
 	rv_detach_trace_probe("rtapp_sleep", sys_exit, handle_sys_exit);
 
diff --git a/kernel/trace/rv/monitors/sleep/sleep.h b/kernel/trace/rv/monitors/sleep/sleep.h
index 2fe2ec7edae8..44e593f41e6a 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.h
+++ b/kernel/trace/rv/monitors/sleep/sleep.h
@@ -18,15 +18,12 @@ enum ltl_atom {
 	LTL_EPOLL_WAIT,
 	LTL_FUTEX_LOCK_PI,
 	LTL_FUTEX_WAIT,
-	LTL_KERNEL_THREAD,
-	LTL_KTHREAD_SHOULD_STOP,
 	LTL_NANOSLEEP_CLOCK_REALTIME,
 	LTL_NANOSLEEP_TIMER_ABSTIME,
 	LTL_RT,
 	LTL_SCHEDULE_IN,
 	LTL_SLEEP,
-	LTL_TASK_IS_MIGRATION,
-	LTL_TASK_IS_RCU,
+	LTL_USER_THREAD,
 	LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
 	LTL_WOKEN_BY_HARDIRQ,
 	LTL_WOKEN_BY_NMI,
@@ -43,15 +40,12 @@ static const char *ltl_atom_str(enum ltl_atom atom)
 		"ep_wa",
 		"fu_lo_pi",
 		"fu_wa",
-		"ker_th",
-		"kth_sh_st",
 		"na_cl_re",
 		"na_ti_ab",
 		"rt",
 		"sch_in",
 		"sle",
-		"ta_mi",
-		"ta_rc",
+		"us_th",
 		"wo_eq_hi_pr",
 		"wo_ha",
 		"wo_nm",
@@ -79,46 +73,41 @@ static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
 	bool woken_by_hardirq = test_bit(LTL_WOKEN_BY_HARDIRQ, mon->atoms);
 	bool woken_by_equal_or_higher_prio = test_bit(LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
 	     mon->atoms);
-	bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
-	bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
+	bool user_thread = test_bit(LTL_USER_THREAD, mon->atoms);
 	bool sleep = test_bit(LTL_SLEEP, mon->atoms);
 	bool schedule_in = test_bit(LTL_SCHEDULE_IN, mon->atoms);
 	bool rt = test_bit(LTL_RT, mon->atoms);
 	bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
 	bool nanosleep_clock_realtime = test_bit(LTL_NANOSLEEP_CLOCK_REALTIME, mon->atoms);
-	bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
-	bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
 	bool futex_wait = test_bit(LTL_FUTEX_WAIT, mon->atoms);
 	bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
 	bool epoll_wait = test_bit(LTL_EPOLL_WAIT, mon->atoms);
 	bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
 	bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
 	bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
-	bool val41 = task_is_rcu || task_is_migration;
-	bool val42 = futex_lock_pi || val41;
-	bool val5 = block_on_rt_mutex || val42;
-	bool val33 = abort_sleep || kthread_should_stop;
-	bool val34 = woken_by_nmi || val33;
-	bool val35 = woken_by_hardirq || val34;
-	bool val14 = woken_by_equal_or_higher_prio || val35;
+	bool val7 = block_on_rt_mutex || futex_lock_pi;
+	bool val32 = woken_by_nmi || abort_sleep;
+	bool val33 = woken_by_hardirq || val32;
+	bool val14 = woken_by_equal_or_higher_prio || val33;
 	bool val13 = !schedule_in;
 	bool val25 = !nanosleep_clock_realtime;
 	bool val26 = nanosleep_timer_abstime && val25;
 	bool val18 = clock_nanosleep && val26;
 	bool val20 = val18 || epoll_wait;
-	bool val9 = futex_wait || val20;
-	bool val11 = val9 || kernel_thread;
+	bool val11 = futex_wait || val20;
+	bool val3 = !user_thread;
 	bool val2 = !sleep;
+	bool val4 = val2 || val3;
 	bool val1 = !rt;
-	bool val3 = val1 || val2;
+	bool val5 = val1 || val4;
 
-	if (val3)
+	if (val5)
 		__set_bit(S0, mon->states);
 	if (val11 && val13)
 		__set_bit(S1, mon->states);
 	if (val11 && val14)
 		__set_bit(S4, mon->states);
-	if (val5)
+	if (val7)
 		__set_bit(S5, mon->states);
 }
 
@@ -129,130 +118,125 @@ ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned l
 	bool woken_by_hardirq = test_bit(LTL_WOKEN_BY_HARDIRQ, mon->atoms);
 	bool woken_by_equal_or_higher_prio = test_bit(LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
 	     mon->atoms);
-	bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
-	bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
+	bool user_thread = test_bit(LTL_USER_THREAD, mon->atoms);
 	bool sleep = test_bit(LTL_SLEEP, mon->atoms);
 	bool schedule_in = test_bit(LTL_SCHEDULE_IN, mon->atoms);
 	bool rt = test_bit(LTL_RT, mon->atoms);
 	bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
 	bool nanosleep_clock_realtime = test_bit(LTL_NANOSLEEP_CLOCK_REALTIME, mon->atoms);
-	bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
-	bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
 	bool futex_wait = test_bit(LTL_FUTEX_WAIT, mon->atoms);
 	bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
 	bool epoll_wait = test_bit(LTL_EPOLL_WAIT, mon->atoms);
 	bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
 	bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
 	bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
-	bool val41 = task_is_rcu || task_is_migration;
-	bool val42 = futex_lock_pi || val41;
-	bool val5 = block_on_rt_mutex || val42;
-	bool val33 = abort_sleep || kthread_should_stop;
-	bool val34 = woken_by_nmi || val33;
-	bool val35 = woken_by_hardirq || val34;
-	bool val14 = woken_by_equal_or_higher_prio || val35;
+	bool val7 = block_on_rt_mutex || futex_lock_pi;
+	bool val32 = woken_by_nmi || abort_sleep;
+	bool val33 = woken_by_hardirq || val32;
+	bool val14 = woken_by_equal_or_higher_prio || val33;
 	bool val13 = !schedule_in;
 	bool val25 = !nanosleep_clock_realtime;
 	bool val26 = nanosleep_timer_abstime && val25;
 	bool val18 = clock_nanosleep && val26;
 	bool val20 = val18 || epoll_wait;
-	bool val9 = futex_wait || val20;
-	bool val11 = val9 || kernel_thread;
+	bool val11 = futex_wait || val20;
+	bool val3 = !user_thread;
 	bool val2 = !sleep;
+	bool val4 = val2 || val3;
 	bool val1 = !rt;
-	bool val3 = val1 || val2;
+	bool val5 = val1 || val4;
 
 	switch (state) {
 	case S0:
-		if (val3)
+		if (val5)
 			__set_bit(S0, next);
 		if (val11 && val13)
 			__set_bit(S1, next);
 		if (val11 && val14)
 			__set_bit(S4, next);
-		if (val5)
+		if (val7)
 			__set_bit(S5, next);
 		break;
 	case S1:
 		if (val11 && val13)
 			__set_bit(S1, next);
-		if (val13 && val3)
+		if (val13 && val5)
 			__set_bit(S2, next);
-		if (val14 && val3)
+		if (val14 && val5)
 			__set_bit(S3, next);
 		if (val11 && val14)
 			__set_bit(S4, next);
-		if (val13 && val5)
+		if (val13 && val7)
 			__set_bit(S6, next);
-		if (val14 && val5)
+		if (val14 && val7)
 			__set_bit(S7, next);
 		break;
 	case S2:
 		if (val11 && val13)
 			__set_bit(S1, next);
-		if (val13 && val3)
+		if (val13 && val5)
 			__set_bit(S2, next);
-		if (val14 && val3)
+		if (val14 && val5)
 			__set_bit(S3, next);
 		if (val11 && val14)
 			__set_bit(S4, next);
-		if (val13 && val5)
+		if (val13 && val7)
 			__set_bit(S6, next);
-		if (val14 && val5)
+		if (val14 && val7)
 			__set_bit(S7, next);
 		break;
 	case S3:
-		if (val3)
+		if (val5)
 			__set_bit(S0, next);
 		if (val11 && val13)
 			__set_bit(S1, next);
 		if (val11 && val14)
 			__set_bit(S4, next);
-		if (val5)
+		if (val7)
 			__set_bit(S5, next);
 		break;
 	case S4:
-		if (val3)
+		if (val5)
 			__set_bit(S0, next);
 		if (val11 && val13)
 			__set_bit(S1, next);
 		if (val11 && val14)
 			__set_bit(S4, next);
-		if (val5)
+		if (val7)
 			__set_bit(S5, next);
 		break;
 	case S5:
-		if (val3)
+		if (val5)
 			__set_bit(S0, next);
 		if (val11 && val13)
 			__set_bit(S1, next);
 		if (val11 && val14)
 			__set_bit(S4, next);
-		if (val5)
+		if (val7)
 			__set_bit(S5, next);
 		break;
 	case S6:
 		if (val11 && val13)
 			__set_bit(S1, next);
-		if (val13 && val3)
+		if (val13 && val5)
 			__set_bit(S2, next);
-		if (val14 && val3)
+		if (val14 && val5)
 			__set_bit(S3, next);
 		if (val11 && val14)
 			__set_bit(S4, next);
-		if (val13 && val5)
+		if (val13 && val7)
 			__set_bit(S6, next);
-		if (val14 && val5)
+		if (val14 && val7)
 			__set_bit(S7, next);
 		break;
 	case S7:
-		if (val3)
+		if (val5)
 			__set_bit(S0, next);
 		if (val11 && val13)
 			__set_bit(S1, next);
 		if (val11 && val14)
 			__set_bit(S4, next);
-		if (val5)
+		if (val7)
 			__set_bit(S5, next);
 		break;
 	}
diff --git a/tools/verification/models/rtapp/sleep.ltl b/tools/verification/models/rtapp/sleep.ltl
index 5923e58d7810..4d78fdd204c0 100644
--- a/tools/verification/models/rtapp/sleep.ltl
+++ b/tools/verification/models/rtapp/sleep.ltl
@@ -1,6 +1,6 @@
-RULE = always ((RT and SLEEP) imply (RT_FRIENDLY_SLEEP or ALLOWLIST))
+RULE = always ((RT and SLEEP and USER_THREAD) imply (RT_FRIENDLY_SLEEP or ALLOWLIST))
 
-RT_FRIENDLY_SLEEP = (RT_VALID_SLEEP_REASON or KERNEL_THREAD)
+RT_FRIENDLY_SLEEP = RT_VALID_SLEEP_REASON
                 and ((not SCHEDULE_IN) until RT_FRIENDLY_WAKE)
 
 RT_VALID_SLEEP_REASON = FUTEX_WAIT
@@ -15,9 +15,6 @@ RT_FRIENDLY_WAKE = WOKEN_BY_EQUAL_OR_HIGHER_PRIO
                 or WOKEN_BY_HARDIRQ
                 or WOKEN_BY_NMI
                 or ABORT_SLEEP
-                or KTHREAD_SHOULD_STOP
 
 ALLOWLIST = BLOCK_ON_RT_MUTEX
          or FUTEX_LOCK_PI
-         or TASK_IS_RCU
-         or TASK_IS_MIGRATION
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 2/4] rv/rtapp/sleep: Update nanosleep rule
From: Nam Cao @ 2026-06-19  7:21 UTC (permalink / raw)
  To: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-doc,
	linux-kernel
  Cc: Nam Cao
In-Reply-To: <cover.1781852967.git.namcao@linutronix.de>

CLOCK_REALTIME is the only clock that is often misused in real-time
applications. The other clocks either are safe for real-time uses
(CLOCK_TAI, CLOCK_MONOTONIC, CLOCK_BOOTTIME) or are unlikely to be misused
(CLOCK_AUX, CLOCK_PROCESS_CPUTIME_ID).

Update the monitor to only warn about CLOCK_REALTIME.

While at it, update the out-of-sync documentation.

Signed-off-by: Nam Cao <namcao@linutronix.de>
---
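
As an illustration only (not part of the patch), the updated
RT_FRIENDLY_NANOSLEEP condition can be sketched in userspace C. The
function names are hypothetical. The two comparisons mirror how
handle_sys_enter() sets NANOSLEEP_CLOCK_REALTIME and
NANOSLEEP_TIMER_ABSTIME from the syscall arguments. The sketch assumes a
POSIX <time.h> that provides CLOCK_MONOTONIC and TIMER_ABSTIME.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for CLOCK_* and TIMER_ABSTIME */
#endif
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/*
 * For a clock_nanosleep() call: TIMER_ABSTIME must be used and the
 * clock must not be CLOCK_REALTIME.
 */
static bool rt_friendly_nanosleep(clockid_t clock, int flags)
{
	bool timer_abstime = (flags == TIMER_ABSTIME);
	bool clock_realtime = (clock == CLOCK_REALTIME);

	return timer_abstime && !clock_realtime;
}

/* Small usage example. */
void nanosleep_rule_selfcheck(void)
{
	/* Absolute sleep on a clock that cannot be set: accepted. */
	assert(rt_friendly_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME));

	/* Absolute sleep on CLOCK_REALTIME: flagged. */
	assert(!rt_friendly_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME));

	/* Relative sleep: flagged regardless of the clock. */
	assert(!rt_friendly_nanosleep(CLOCK_MONOTONIC, 0));
}
```

Compared with the previous rule, which accepted only CLOCK_MONOTONIC and
CLOCK_TAI, any clock other than CLOCK_REALTIME now passes this check.
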
 Documentation/trace/rv/monitor_rtapp.rst  | 17 +++++---
 kernel/trace/rv/monitors/sleep/sleep.c    | 12 ++----
 kernel/trace/rv/monitors/sleep/sleep.h    | 52 +++++++++++------------
 tools/verification/models/rtapp/sleep.ltl |  2 +-
 4 files changed, 39 insertions(+), 44 deletions(-)

diff --git a/Documentation/trace/rv/monitor_rtapp.rst b/Documentation/trace/rv/monitor_rtapp.rst
index 01656bf7080a..570be67a8f3b 100644
--- a/Documentation/trace/rv/monitor_rtapp.rst
+++ b/Documentation/trace/rv/monitor_rtapp.rst
@@ -51,12 +51,13 @@ The `sleep` monitor reports real-time threads sleeping in a manner that may
 cause undesirable latency. Real-time applications should only put a real-time
 thread to sleep for one of the following reasons:
 
-  - Cyclic work: real-time thread sleeps waiting for the next cycle. For this
-    case, only the `clock_nanosleep` syscall should be used with `TIMER_ABSTIME`
-    (to avoid time drift) and `CLOCK_MONOTONIC` (to avoid the clock being
-    changed). No other method is safe for real-time. For example, threads
-    waiting for timerfd can be woken by softirq which provides no real-time
-    guarantee.
+  - Cyclic work: real-time thread sleeps waiting for the next
+    cycle. For this case, only the `clock_nanosleep` syscall should be
+    used with `TIMER_ABSTIME` (to avoid time drift). Additionally,
+    `CLOCK_REALTIME` should not be used (to avoid the clock being
+    changed). No other method is safe for real-time. For example,
+    threads waiting for timerfd can be woken by softirq which provides
+    no real-time guarantee.
   - Real-time thread waiting for something to happen (e.g. another thread
     releasing shared resources, or a completion signal from another thread). In
     this case, only futexes (FUTEX_LOCK_PI, FUTEX_LOCK_PI2 or one of
@@ -99,14 +100,16 @@ The monitor's specification is::
 
   RT_VALID_SLEEP_REASON = FUTEX_WAIT
                        or RT_FRIENDLY_NANOSLEEP
+                       or EPOLL_WAIT
 
   RT_FRIENDLY_NANOSLEEP = CLOCK_NANOSLEEP
                       and NANOSLEEP_TIMER_ABSTIME
-                      and NANOSLEEP_CLOCK_MONOTONIC
+                      and not NANOSLEEP_CLOCK_REALTIME
 
   RT_FRIENDLY_WAKE = WOKEN_BY_EQUAL_OR_HIGHER_PRIO
                   or WOKEN_BY_HARDIRQ
                   or WOKEN_BY_NMI
+                  or ABORT_SLEEP
                   or KTHREAD_SHOULD_STOP
 
   ALLOWLIST = BLOCK_ON_RT_MUTEX
diff --git a/kernel/trace/rv/monitors/sleep/sleep.c b/kernel/trace/rv/monitors/sleep/sleep.c
index d6b677fab8f8..638be7d8747f 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.c
+++ b/kernel/trace/rv/monitors/sleep/sleep.c
@@ -44,8 +44,7 @@ static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bo
 
 	if (task_creation) {
 		ltl_atom_set(mon, LTL_KTHREAD_SHOULD_STOP, false);
-		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_MONOTONIC, false);
-		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, false);
+		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_REALTIME, false);
 		ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
 		ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
 		ltl_atom_set(mon, LTL_FUTEX_WAIT, false);
@@ -60,8 +59,7 @@ static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bo
 		/* kernel tasks do not do syscall */
 		ltl_atom_set(mon, LTL_FUTEX_WAIT, false);
 		ltl_atom_set(mon, LTL_FUTEX_LOCK_PI, false);
-		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_MONOTONIC, false);
-		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, false);
+		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_REALTIME, false);
 		ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
 		ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
 		ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
@@ -136,8 +134,7 @@ static void handle_sys_enter(void *data, struct pt_regs *regs, long id)
 	case __NR_clock_nanosleep_time64:
 #endif
 		syscall_get_arguments(current, regs, args);
-		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_MONOTONIC, args[0] == CLOCK_MONOTONIC);
-		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, args[0] == CLOCK_TAI);
+		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_REALTIME, args[0] == CLOCK_REALTIME);
 		ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, args[1] == TIMER_ABSTIME);
 		ltl_atom_update(current, LTL_CLOCK_NANOSLEEP, true);
 		break;
@@ -178,8 +175,7 @@ static void handle_sys_exit(void *data, struct pt_regs *regs, long ret)
 
 	ltl_atom_set(mon, LTL_FUTEX_LOCK_PI, false);
 	ltl_atom_set(mon, LTL_FUTEX_WAIT, false);
-	ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_MONOTONIC, false);
-	ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, false);
+	ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_REALTIME, false);
 	ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
 	ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
 	ltl_atom_update(current, LTL_CLOCK_NANOSLEEP, false);
diff --git a/kernel/trace/rv/monitors/sleep/sleep.h b/kernel/trace/rv/monitors/sleep/sleep.h
index 403dc2852c52..2fe2ec7edae8 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.h
+++ b/kernel/trace/rv/monitors/sleep/sleep.h
@@ -20,8 +20,7 @@ enum ltl_atom {
 	LTL_FUTEX_WAIT,
 	LTL_KERNEL_THREAD,
 	LTL_KTHREAD_SHOULD_STOP,
-	LTL_NANOSLEEP_CLOCK_MONOTONIC,
-	LTL_NANOSLEEP_CLOCK_TAI,
+	LTL_NANOSLEEP_CLOCK_REALTIME,
 	LTL_NANOSLEEP_TIMER_ABSTIME,
 	LTL_RT,
 	LTL_SCHEDULE_IN,
@@ -46,8 +45,7 @@ static const char *ltl_atom_str(enum ltl_atom atom)
 		"fu_wa",
 		"ker_th",
 		"kth_sh_st",
-		"na_cl_mo",
-		"na_cl_ta",
+		"na_cl_re",
 		"na_ti_ab",
 		"rt",
 		"sch_in",
@@ -87,8 +85,7 @@ static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
 	bool schedule_in = test_bit(LTL_SCHEDULE_IN, mon->atoms);
 	bool rt = test_bit(LTL_RT, mon->atoms);
 	bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
-	bool nanosleep_clock_tai = test_bit(LTL_NANOSLEEP_CLOCK_TAI, mon->atoms);
-	bool nanosleep_clock_monotonic = test_bit(LTL_NANOSLEEP_CLOCK_MONOTONIC, mon->atoms);
+	bool nanosleep_clock_realtime = test_bit(LTL_NANOSLEEP_CLOCK_REALTIME, mon->atoms);
 	bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
 	bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
 	bool futex_wait = test_bit(LTL_FUTEX_WAIT, mon->atoms);
@@ -97,17 +94,17 @@ static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
 	bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
 	bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
 	bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
-	bool val42 = task_is_rcu || task_is_migration;
-	bool val43 = futex_lock_pi || val42;
-	bool val5 = block_on_rt_mutex || val43;
-	bool val34 = abort_sleep || kthread_should_stop;
-	bool val35 = woken_by_nmi || val34;
-	bool val36 = woken_by_hardirq || val35;
-	bool val14 = woken_by_equal_or_higher_prio || val36;
+	bool val41 = task_is_rcu || task_is_migration;
+	bool val42 = futex_lock_pi || val41;
+	bool val5 = block_on_rt_mutex || val42;
+	bool val33 = abort_sleep || kthread_should_stop;
+	bool val34 = woken_by_nmi || val33;
+	bool val35 = woken_by_hardirq || val34;
+	bool val14 = woken_by_equal_or_higher_prio || val35;
 	bool val13 = !schedule_in;
-	bool val26 = nanosleep_clock_monotonic || nanosleep_clock_tai;
-	bool val27 = nanosleep_timer_abstime && val26;
-	bool val18 = clock_nanosleep && val27;
+	bool val25 = !nanosleep_clock_realtime;
+	bool val26 = nanosleep_timer_abstime && val25;
+	bool val18 = clock_nanosleep && val26;
 	bool val20 = val18 || epoll_wait;
 	bool val9 = futex_wait || val20;
 	bool val11 = val9 || kernel_thread;
@@ -138,8 +135,7 @@ ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned l
 	bool schedule_in = test_bit(LTL_SCHEDULE_IN, mon->atoms);
 	bool rt = test_bit(LTL_RT, mon->atoms);
 	bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
-	bool nanosleep_clock_tai = test_bit(LTL_NANOSLEEP_CLOCK_TAI, mon->atoms);
-	bool nanosleep_clock_monotonic = test_bit(LTL_NANOSLEEP_CLOCK_MONOTONIC, mon->atoms);
+	bool nanosleep_clock_realtime = test_bit(LTL_NANOSLEEP_CLOCK_REALTIME, mon->atoms);
 	bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
 	bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
 	bool futex_wait = test_bit(LTL_FUTEX_WAIT, mon->atoms);
@@ -148,17 +144,17 @@ ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned l
 	bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
 	bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
 	bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
-	bool val42 = task_is_rcu || task_is_migration;
-	bool val43 = futex_lock_pi || val42;
-	bool val5 = block_on_rt_mutex || val43;
-	bool val34 = abort_sleep || kthread_should_stop;
-	bool val35 = woken_by_nmi || val34;
-	bool val36 = woken_by_hardirq || val35;
-	bool val14 = woken_by_equal_or_higher_prio || val36;
+	bool val41 = task_is_rcu || task_is_migration;
+	bool val42 = futex_lock_pi || val41;
+	bool val5 = block_on_rt_mutex || val42;
+	bool val33 = abort_sleep || kthread_should_stop;
+	bool val34 = woken_by_nmi || val33;
+	bool val35 = woken_by_hardirq || val34;
+	bool val14 = woken_by_equal_or_higher_prio || val35;
 	bool val13 = !schedule_in;
-	bool val26 = nanosleep_clock_monotonic || nanosleep_clock_tai;
-	bool val27 = nanosleep_timer_abstime && val26;
-	bool val18 = clock_nanosleep && val27;
+	bool val25 = !nanosleep_clock_realtime;
+	bool val26 = nanosleep_timer_abstime && val25;
+	bool val18 = clock_nanosleep && val26;
 	bool val20 = val18 || epoll_wait;
 	bool val9 = futex_wait || val20;
 	bool val11 = val9 || kernel_thread;
diff --git a/tools/verification/models/rtapp/sleep.ltl b/tools/verification/models/rtapp/sleep.ltl
index 464c84b9df87..5923e58d7810 100644
--- a/tools/verification/models/rtapp/sleep.ltl
+++ b/tools/verification/models/rtapp/sleep.ltl
@@ -9,7 +9,7 @@ RT_VALID_SLEEP_REASON = FUTEX_WAIT
 
 RT_FRIENDLY_NANOSLEEP = CLOCK_NANOSLEEP
                     and NANOSLEEP_TIMER_ABSTIME
-                    and (NANOSLEEP_CLOCK_MONOTONIC or NANOSLEEP_CLOCK_TAI)
+                    and not NANOSLEEP_CLOCK_REALTIME
 
 RT_FRIENDLY_WAKE = WOKEN_BY_EQUAL_OR_HIGHER_PRIO
                 or WOKEN_BY_HARDIRQ
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 1/4] rv/rtapp/sleep: Make the error more informative for user
From: Nam Cao @ 2026-06-19  7:21 UTC (permalink / raw)
  To: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-doc,
	linux-kernel
  Cc: Nam Cao
In-Reply-To: <cover.1781852967.git.namcao@linutronix.de>

The rtapp/sleep monitor detects real-time tasks which go to sleep in a
real-time-unsafe manner. If this happens, the monitor triggers a trace event
in the sched_wakeup tracepoint's handler.

However, the invoking context of that trace event is not the most
informative, because the stack trace of that event is the wakeup code
path, which is not very helpful:

74.669317: rv:error_sleep: condvar[254]: violation detected
    ltl_validate+0x345 ([kernel.kallsyms])
    handle_sched_wakeup+0x34 ([kernel.kallsyms])
    ttwu_do_activate+0xff ([kernel.kallsyms])
    sched_ttwu_pending+0x104 ([kernel.kallsyms])
    __flush_smp_call_function_queue+0x15b ([kernel.kallsyms])
    __sysvec_call_function_single+0x18 ([kernel.kallsyms])
    sysvec_call_function_single+0x66 ([kernel.kallsyms])
    asm_sysvec_call_function_single+0x1a ([kernel.kallsyms])
    pv_native_safe_halt+0xf ([kernel.kallsyms])
    default_idle+0x9 ([kernel.kallsyms])
    default_idle_call+0x33 ([kernel.kallsyms])
    do_idle+0x234 ([kernel.kallsyms])
    cpu_startup_entry+0x24 ([kernel.kallsyms])
    start_secondary+0xf8 ([kernel.kallsyms])
    common_startup_64+0x13e ([kernel.kallsyms])

What would be much more valuable is the stack trace of the task itself.

Instead of using the sched_wakeup tracepoint, use the sched_exit
tracepoint. This makes the event happen in the task's context, making
the stack trace far more informative for the user:

rv:error_sleep: condvar[254]: violation detected
    ltl_validate+0x345 ([kernel.kallsyms])
    handle_sched_exit+0x39 ([kernel.kallsyms])
    __schedule+0x80f ([kernel.kallsyms])
    schedule+0x22 ([kernel.kallsyms])
    futex_do_wait+0x33 ([kernel.kallsyms])
    __futex_wait+0x8c ([kernel.kallsyms])
    futex_wait+0x73 ([kernel.kallsyms])
    do_futex+0xc6 ([kernel.kallsyms])
    __x64_sys_futex+0x121 ([kernel.kallsyms])
    do_syscall_64+0xf3 ([kernel.kallsyms])
    entry_SYSCALL_64_after_hwframe+0x77 ([kernel.kallsyms])
    __futex_abstimed_wait_common64+0xc6 (inlined)
    __futex_abstimed_wait_common+0xc6 (/usr/lib/x86_64-linux-gnu/libc.so.6)

Signed-off-by: Nam Cao <namcao@linutronix.de>
---
 Documentation/trace/rv/monitor_rtapp.rst  |  2 +-
 kernel/trace/rv/monitors/sleep/sleep.c    | 10 +++++-----
 kernel/trace/rv/monitors/sleep/sleep.h    | 14 +++++++-------
 tools/verification/models/rtapp/sleep.ltl |  2 +-
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/Documentation/trace/rv/monitor_rtapp.rst b/Documentation/trace/rv/monitor_rtapp.rst
index c8104eda924a..01656bf7080a 100644
--- a/Documentation/trace/rv/monitor_rtapp.rst
+++ b/Documentation/trace/rv/monitor_rtapp.rst
@@ -95,7 +95,7 @@ The monitor's specification is::
   RULE = always ((RT and SLEEP) imply (RT_FRIENDLY_SLEEP or ALLOWLIST))
 
   RT_FRIENDLY_SLEEP = (RT_VALID_SLEEP_REASON or KERNEL_THREAD)
-                  and ((not WAKE) until RT_FRIENDLY_WAKE)
+                  and ((not SCHEDULE_IN) until RT_FRIENDLY_WAKE)
 
   RT_VALID_SLEEP_REASON = FUTEX_WAIT
                        or RT_FRIENDLY_NANOSLEEP
diff --git a/kernel/trace/rv/monitors/sleep/sleep.c b/kernel/trace/rv/monitors/sleep/sleep.c
index 8dfe5ec13e19..d6b677fab8f8 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.c
+++ b/kernel/trace/rv/monitors/sleep/sleep.c
@@ -36,7 +36,7 @@ static void ltl_atoms_fetch(struct task_struct *task, struct ltl_monitor *mon)
 static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bool task_creation)
 {
 	ltl_atom_set(mon, LTL_SLEEP, false);
-	ltl_atom_set(mon, LTL_WAKE, false);
+	ltl_atom_set(mon, LTL_SCHEDULE_IN, false);
 	ltl_atom_set(mon, LTL_ABORT_SLEEP, false);
 	ltl_atom_set(mon, LTL_WOKEN_BY_HARDIRQ, false);
 	ltl_atom_set(mon, LTL_WOKEN_BY_NMI, false);
@@ -92,9 +92,9 @@ static void handle_sched_set_state(void *data, struct task_struct *task, int sta
 		ltl_atom_pulse(task, LTL_ABORT_SLEEP, true);
 }
 
-static void handle_sched_wakeup(void *data, struct task_struct *task)
+static void handle_sched_exit(void *data, bool is_switch)
 {
-	ltl_atom_pulse(task, LTL_WAKE, true);
+	ltl_atom_pulse(current, LTL_SCHEDULE_IN, true);
 }
 
 static void handle_sched_waking(void *data, struct task_struct *task)
@@ -200,7 +200,7 @@ static int enable_sleep(void)
 		return retval;
 
 	rv_attach_trace_probe("rtapp_sleep", sched_waking, handle_sched_waking);
-	rv_attach_trace_probe("rtapp_sleep", sched_wakeup, handle_sched_wakeup);
+	rv_attach_trace_probe("rtapp_sleep", sched_exit_tp, handle_sched_exit);
 	rv_attach_trace_probe("rtapp_sleep", sched_set_state_tp, handle_sched_set_state);
 	rv_attach_trace_probe("rtapp_sleep", contention_begin, handle_contention_begin);
 	rv_attach_trace_probe("rtapp_sleep", contention_end, handle_contention_end);
@@ -213,7 +213,7 @@ static int enable_sleep(void)
 static void disable_sleep(void)
 {
 	rv_detach_trace_probe("rtapp_sleep", sched_waking, handle_sched_waking);
-	rv_detach_trace_probe("rtapp_sleep", sched_wakeup, handle_sched_wakeup);
+	rv_detach_trace_probe("rtapp_sleep", sched_exit_tp, handle_sched_exit);
 	rv_detach_trace_probe("rtapp_sleep", sched_set_state_tp, handle_sched_set_state);
 	rv_detach_trace_probe("rtapp_sleep", contention_begin, handle_contention_begin);
 	rv_detach_trace_probe("rtapp_sleep", contention_end, handle_contention_end);
diff --git a/kernel/trace/rv/monitors/sleep/sleep.h b/kernel/trace/rv/monitors/sleep/sleep.h
index 95dc2727c059..403dc2852c52 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.h
+++ b/kernel/trace/rv/monitors/sleep/sleep.h
@@ -24,10 +24,10 @@ enum ltl_atom {
 	LTL_NANOSLEEP_CLOCK_TAI,
 	LTL_NANOSLEEP_TIMER_ABSTIME,
 	LTL_RT,
+	LTL_SCHEDULE_IN,
 	LTL_SLEEP,
 	LTL_TASK_IS_MIGRATION,
 	LTL_TASK_IS_RCU,
-	LTL_WAKE,
 	LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
 	LTL_WOKEN_BY_HARDIRQ,
 	LTL_WOKEN_BY_NMI,
@@ -50,10 +50,10 @@ static const char *ltl_atom_str(enum ltl_atom atom)
 		"na_cl_ta",
 		"na_ti_ab",
 		"rt",
-		"sl",
+		"sch_in",
+		"sle",
 		"ta_mi",
 		"ta_rc",
-		"wak",
 		"wo_eq_hi_pr",
 		"wo_ha",
 		"wo_nm",
@@ -81,10 +81,10 @@ static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
 	bool woken_by_hardirq = test_bit(LTL_WOKEN_BY_HARDIRQ, mon->atoms);
 	bool woken_by_equal_or_higher_prio = test_bit(LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
 	     mon->atoms);
-	bool wake = test_bit(LTL_WAKE, mon->atoms);
 	bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
 	bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
 	bool sleep = test_bit(LTL_SLEEP, mon->atoms);
+	bool schedule_in = test_bit(LTL_SCHEDULE_IN, mon->atoms);
 	bool rt = test_bit(LTL_RT, mon->atoms);
 	bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
 	bool nanosleep_clock_tai = test_bit(LTL_NANOSLEEP_CLOCK_TAI, mon->atoms);
@@ -104,7 +104,7 @@ static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
 	bool val35 = woken_by_nmi || val34;
 	bool val36 = woken_by_hardirq || val35;
 	bool val14 = woken_by_equal_or_higher_prio || val36;
-	bool val13 = !wake;
+	bool val13 = !schedule_in;
 	bool val26 = nanosleep_clock_monotonic || nanosleep_clock_tai;
 	bool val27 = nanosleep_timer_abstime && val26;
 	bool val18 = clock_nanosleep && val27;
@@ -132,10 +132,10 @@ ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned l
 	bool woken_by_hardirq = test_bit(LTL_WOKEN_BY_HARDIRQ, mon->atoms);
 	bool woken_by_equal_or_higher_prio = test_bit(LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
 	     mon->atoms);
-	bool wake = test_bit(LTL_WAKE, mon->atoms);
 	bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
 	bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
 	bool sleep = test_bit(LTL_SLEEP, mon->atoms);
+	bool schedule_in = test_bit(LTL_SCHEDULE_IN, mon->atoms);
 	bool rt = test_bit(LTL_RT, mon->atoms);
 	bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
 	bool nanosleep_clock_tai = test_bit(LTL_NANOSLEEP_CLOCK_TAI, mon->atoms);
@@ -155,7 +155,7 @@ ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned l
 	bool val35 = woken_by_nmi || val34;
 	bool val36 = woken_by_hardirq || val35;
 	bool val14 = woken_by_equal_or_higher_prio || val36;
-	bool val13 = !wake;
+	bool val13 = !schedule_in;
 	bool val26 = nanosleep_clock_monotonic || nanosleep_clock_tai;
 	bool val27 = nanosleep_timer_abstime && val26;
 	bool val18 = clock_nanosleep && val27;
diff --git a/tools/verification/models/rtapp/sleep.ltl b/tools/verification/models/rtapp/sleep.ltl
index 6f26c4810f78..464c84b9df87 100644
--- a/tools/verification/models/rtapp/sleep.ltl
+++ b/tools/verification/models/rtapp/sleep.ltl
@@ -1,7 +1,7 @@
 RULE = always ((RT and SLEEP) imply (RT_FRIENDLY_SLEEP or ALLOWLIST))
 
 RT_FRIENDLY_SLEEP = (RT_VALID_SLEEP_REASON or KERNEL_THREAD)
-                and ((not WAKE) until RT_FRIENDLY_WAKE)
+                and ((not SCHEDULE_IN) until RT_FRIENDLY_WAKE)
 
 RT_VALID_SLEEP_REASON = FUTEX_WAIT
                      or RT_FRIENDLY_NANOSLEEP
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 0/4] rv: rtapp monitor update
From: Nam Cao @ 2026-06-19  7:21 UTC (permalink / raw)
  To: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-doc,
	linux-kernel
  Cc: Nam Cao

A few minor improvements to the rtapp monitor:

  - Make the monitor more informative to the user by changing the
    context of the tracepoint to the monitored task itself, not the
    IPI wakeup path.

  - Update the allow list regarding the clock_nanosleep syscall.

  - Stop monitoring kernel threads to simplify the monitors.

  - Add a new rtapp/wakeup monitor to complement the rtapp/sleep
    monitor.

v2..v1 https://lore.kernel.org/lkml/cover.1779176466.git.namcao@linutronix.de/
  - Use clearer waker/wakee terminology
  - Fix build issue
  - Add new patch "rv/rtapp/sleep: Stop monitoring kernel threads"
  - Require RV_PER_TASK_MONITORS >= 3

Nam Cao (4):
  rv/rtapp/sleep: Make the error more informative for user
  rv/rtapp/sleep: Update nanosleep rule
  rv/rtapp/sleep: Stop monitoring kernel threads
  rv/rtapp: Add wakeup monitor

 Documentation/trace/rv/monitor_rtapp.rst      |  61 ++++---
 kernel/trace/rv/Kconfig                       |   1 +
 kernel/trace/rv/Makefile                      |   1 +
 kernel/trace/rv/monitors/rtapp/Kconfig        |   2 +-
 kernel/trace/rv/monitors/sleep/Kconfig        |   1 -
 kernel/trace/rv/monitors/sleep/sleep.c        |  59 ++-----
 kernel/trace/rv/monitors/sleep/sleep.h        | 142 +++++++---------
 kernel/trace/rv/monitors/wakeup/Kconfig       |  16 ++
 kernel/trace/rv/monitors/wakeup/wakeup.c      | 153 ++++++++++++++++++
 kernel/trace/rv/monitors/wakeup/wakeup.h      |  92 +++++++++++
 .../trace/rv/monitors/wakeup/wakeup_trace.h   |  14 ++
 kernel/trace/rv/rv_trace.h                    |   1 +
 tools/verification/models/rtapp/sleep.ltl     |  11 +-
 tools/verification/models/rtapp/wakeup.ltl    |   5 +
 14 files changed, 396 insertions(+), 163 deletions(-)
 create mode 100644 kernel/trace/rv/monitors/wakeup/Kconfig
 create mode 100644 kernel/trace/rv/monitors/wakeup/wakeup.c
 create mode 100644 kernel/trace/rv/monitors/wakeup/wakeup.h
 create mode 100644 kernel/trace/rv/monitors/wakeup/wakeup_trace.h
 create mode 100644 tools/verification/models/rtapp/wakeup.ltl

-- 
2.47.3


^ permalink raw reply

* [RFC v2 PATCH] reserve_mem: add support for static memory
From: Shyam Saini @ 2026-06-19  6:23 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel
  Cc: rppt, akpm, tgopinath, bboscaccy, kees, tony.luck, gpiccoli, bp,
	rdunlap, peterz, feng.tang, dapeng1.mi, elver, enelsonmoore, kuba,
	lirongqing, ebiggers

reserve_mem relies on dynamic memory allocation, which limits use
cases where memory must be preserved across boots,
e.g. ramoops memory reservation on ACPI platforms.

So add support for passing a predetermined static address and reserving
memory at a specified location. This enables use cases like ramoops
on ACPI platforms to reliably access the ramoops region with previous
boot logs.

Also skip the parsing of <align> when a static address is passed.

Example syntax for a static address:
 reserve_mem=4M@0x1E0000000:oops
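
The two accepted forms (size:align:label and size@address:label) can be
illustrated with a minimal userspace sketch. This is not the kernel code:
parse_size() is only a stand-in for memparse(), struct reserve_req and
parse_reserve_mem() are hypothetical names, and the check that an address
actually follows '@' is an extra assumption not present in the patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type for this sketch; not a kernel structure. */
struct reserve_req {
	uint64_t size;
	uint64_t start;		/* meaningful only when addr_is_static */
	uint64_t align;		/* meaningful only when !addr_is_static */
	bool addr_is_static;
	const char *label;
};

/* Userspace stand-in for memparse(): a number plus optional K/M/G suffix. */
static uint64_t parse_size(const char *s, const char **end)
{
	char *e;
	uint64_t v = strtoull(s, &e, 0);

	switch (*e) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		e++;
		break;
	default:
		break;
	}
	*end = e;
	return v;
}

/* Returns 0 on success, -1 on a malformed parameter string. */
static int parse_reserve_mem(const char *p, struct reserve_req *req)
{
	const char *oldp = p;

	memset(req, 0, sizeof(*req));
	req->size = parse_size(p, &p);
	if (!req->size || p == oldp)
		return -1;

	/* Static form: the address follows '@' and <align> is skipped. */
	if (*p == '@') {
		const char *addr = p + 1;

		req->start = parse_size(addr, &p);
		if (p == addr)	/* assumption: reject an empty address */
			return -1;
		req->addr_is_static = true;
	}
	if (*p != ':')
		return -1;

	/* Dynamic form: <align> sits between the two colons. */
	if (!req->addr_is_static) {
		req->align = parse_size(p + 1, &p);
		if (*p != ':')
			return -1;
	}

	req->label = p + 1;
	return *req->label ? 0 : -1;
}
```

With this sketch, "4M@0x1E0000000:oops" yields a static request and
"12M:4096:oops" a dynamic one with a 4096-byte alignment.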

Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com>
---
v1: https://lore.kernel.org/lkml/0eaf3be2-5121-48b7-aeed-196405c0a480@infradead.org/
v2: Fix code logic and incorporate Randy's suggestion
---
 .../admin-guide/kernel-parameters.txt         | 15 ++++++
 mm/memblock.c                                 | 47 +++++++++++++------
 2 files changed, 47 insertions(+), 15 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b5493a7f8f228..7e0baca564b97 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6563,6 +6563,21 @@ Kernel parameters
 
 			reserve_mem=12M:4096:oops ramoops.mem_name=oops
 
+	reserve_mem=	[RAM]
+			Format: nn[KMG]:<@offset>:<label>
+			Reserve physical memory at predetermined location and label it with
+			a name that other subsystems can use to access it. This is typically
+			used for systems that do not wipe the RAM, and this command
+			line will try to reserve the same physical memory on
+			soft reboots. Note, it is guaranteed to be the same
+			location unless some other early allocation, e.g.: crashkernel=256M
+                        (without static address) is reserved or overlaps this region.
+
+			The format is size:offset:label for example, to request
+			4 megabytes for ramoops at 0x1E0000000:
+
+			reserve_mem=4M@0x1E0000000:oops ramoops.mem_name=oops
+
 	reservetop=	[X86-32,EARLY]
 			Format: nn[KMG]
 			Reserves a hole at the top of the kernel virtual
diff --git a/mm/memblock.c b/mm/memblock.c
index 6349c48154f4b..c76cefa0a8a83 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2721,6 +2721,7 @@ static int __init reserve_mem(char *p)
 	char *name;
 	char *oldp;
 	int len;
+	bool addr_is_static = false;
 
 	if (!p)
 		goto err_param;
@@ -2736,19 +2737,27 @@ static int __init reserve_mem(char *p)
 	if (!size || p == oldp)
 		goto err_param;
 
-	if (*p != ':')
-		goto err_param;
+	/* parse the static memory address */
+	if (*p == '@') {
+		start = memparse(p+1, &p);
+		addr_is_static = true;
+	}
 
-	align = memparse(p+1, &p);
 	if (*p != ':')
 		goto err_param;
 
-	/*
-	 * memblock_phys_alloc() doesn't like a zero size align,
-	 * but it is OK for this command to have it.
-	 */
-	if (align < SMP_CACHE_BYTES)
-		align = SMP_CACHE_BYTES;
+	if (!addr_is_static) {
+		align = memparse(p+1, &p);
+		if (*p != ':')
+			goto err_param;
+
+		/*
+		 * memblock_phys_alloc() doesn't like a zero size align,
+		 * but it is OK for this command to have it.
+		 */
+		if (align < SMP_CACHE_BYTES)
+			align = SMP_CACHE_BYTES;
+	}
 
 	name = p + 1;
 	len = strlen(name);
@@ -2772,14 +2781,22 @@ static int __init reserve_mem(char *p)
 	}
 
 	/* Pick previous allocations up from KHO if available */
-	if (reserve_mem_kho_revive(name, size, align))
+	if (!addr_is_static && reserve_mem_kho_revive(name, size, align))
 		return 1;
 
-	/* TODO: Allocation must be outside of scratch region */
-	start = memblock_phys_alloc(size, align);
-	if (!start) {
-		pr_err("reserve_mem: memblock allocation failed\n");
-		return -ENOMEM;
+	if (addr_is_static) {
+		if (memblock_reserve(start, size)) {
+			pr_err("reserve_mem: memblock reservation failed\n");
+			return -ENOMEM;
+		}
+
+	} else {
+		/* TODO: Allocation must be outside of scratch region */
+		start = memblock_phys_alloc(size, align);
+		if (!start) {
+			pr_err("reserve_mem: memblock allocation failed\n");
+			return -ENOMEM;
+		}
 	}
 
 	reserved_mem_add(start, size, name);
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v4 1/4] KVM: PPC: Introduce KVM_CAP_PPC_COMPAT_CAPS and wire up ioctl
From: Vaibhav Jain @ 2026-06-19  6:14 UTC (permalink / raw)
  To: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan
  Cc: Amit Machhiwal, Anushree Mathur, Paolo Bonzini, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP), Jonathan Corbet,
	Shuah Khan, kvm, linux-kernel, linux-doc, lkp
In-Reply-To: <20260616123314.82721-2-amachhiw@linux.ibm.com>

Hi Amit.

Thanks for the patch and incorporating V3 review comments. Further
review comments inline below:

Amit Machhiwal <amachhiw@linux.ibm.com> writes:

> Introduce a new capability and ioctl to expose CPU compatibility modes
> supported by the host processor for nested guests.
>
> On IBM POWER systems, newer processor generations (N) can operate in
> compatibility modes corresponding to earlier generations, like (N-1) and
> (N-2). This is particularly relevant for nested virtualization, where
> nested KVM guests may need to run with a specific processor compatibility
> level.
>
> Introduce KVM_CAP_PPC_COMPAT_CAPS capability and the corresponding
> KVM_PPC_GET_COMPAT_CAPS vm ioctl. The ioctl returns a bitmap describing
> the compatibility modes supported by the host in respective bit numbers,
> allowing userspace (e.g., QEMU) to select an appropriate compatibility
> level when configuring nested KVM guests.
>
> The ioctl handling is added in kvm_arch_vm_ioctl() and retrieves host
> CPU compatibility capabilities via a PowerPC-specific backend
> implementation when available. The implementation validates the structure
> size from userspace to ensure forward compatibility and returns
> appropriate error codes (EINVAL for invalid size, EFAULT for copy
> failures, ENOTTY if backend is not implemented). The struct
> kvm_ppc_compat_caps includes a size field to support future ABI
> extensions.
>
> Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/kvm_ppc.h  |  1 +
>  arch/powerpc/include/uapi/asm/kvm.h |  7 ++++++
>  arch/powerpc/kvm/powerpc.c          | 35 +++++++++++++++++++++++++++++
>  include/uapi/linux/kvm.h            |  4 ++++
>  4 files changed, 47 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 0953f2daa466..169ea6a7fbad 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -319,6 +319,7 @@ struct kvmppc_ops {
>  	bool (*hash_v3_possible)(void);
>  	int (*create_vm_debugfs)(struct kvm *kvm);
>  	int (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry);
> +	int (*get_compat_caps)(struct kvm_ppc_compat_caps *host_caps);
>  };
>  
>  extern struct kvmppc_ops *kvmppc_hv_ops;
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 077c5437f521..8a38be6c3b03 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -437,6 +437,13 @@ struct kvm_ppc_cpu_char {
>  	__u64	behaviour_mask;		/* valid bits in behaviour */
>  };
>  
> +/* For KVM_PPC_GET_COMPAT_CAPS */
> +struct kvm_ppc_compat_caps {
> +	__u64	flags;			/* Reserved for future use */
> +	__u64	size;			/* Size of this structure */
Suggest moving 'size' to be the first member of the struct. That way
copying the struct from userspace becomes a bit easier.

> +	__u64	compat_capabilities;	/* Capabilities supported by the host */
> +};
> +
>  /*
>   * Values for character and character_mask.
>   * These are identical to the values used by H_GET_CPU_CHARACTERISTICS.
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 98de68379b18..9153b0034b45 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -701,6 +701,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  			}
>  		}
>  		break;
> +#if defined(CONFIG_KVM_BOOK3S_HV_POSSIBLE)
> +	case KVM_CAP_PPC_COMPAT_CAPS:
> +		r = 0;
> +		if (kvmhv_on_pseries())
> +			r = 1;
> +		break;
> +#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
>  	default:
>  		r = 0;
>  		break;
> @@ -2467,6 +2474,34 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  		r = kvm->arch.kvm_ops->svm_off(kvm);
>  		break;
>  	}
> +	case KVM_PPC_GET_COMPAT_CAPS: {
> +		struct kvm_ppc_compat_caps host_caps;
> +		u64 user_size;
> +
> +		r = -EFAULT;
> +		/* First, get the size field from userspace to validate */
> +		if (copy_from_user(&user_size, &((struct kvm_ppc_compat_caps
> +		     __user *)argp)->size, sizeof(user_size))) {
Move the 'size' member to be the first field of the struct. That way the
copy_from_user() call is simplified and you won't have to do any weird
pointer arithmetic.
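
A minimal userspace sketch of the idea, assuming a reordered layout:
struct compat_caps_size_first and read_size_field() are hypothetical
names, and memcpy() only stands in for copy_from_user(). With 'size' at
offset 0, the base address of the user buffer can be used directly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reordering of the proposed struct: 'size' comes first. */
struct compat_caps_size_first {
	uint64_t size;			/* Size of this structure */
	uint64_t flags;			/* Reserved for future use */
	uint64_t compat_capabilities;	/* Capabilities supported by the host */
};

/*
 * Stand-in for the copy_from_user() of the size field. Because 'size'
 * sits at offset 0, no member-address computation is needed.
 */
static uint64_t read_size_field(const void *ubuf)
{
	uint64_t size;

	memcpy(&size, ubuf, sizeof(size));
	return size;
}
```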


> +			goto out;
> +		}
> +
> +		/* Validate size - must be at least the current structure size */
> +		r = -EINVAL;
> +		if (user_size < sizeof(host_caps))
> +			goto out;
The check should be strengthened to
 if (user_size != sizeof(host_caps))
so that if userspace sends a struct larger than what the kernel knows
about, it will be rejected. This will prevent surprises in the future if a
VMM sends a larger struct expecting the kernel to know about it, but an
older kernel only knows about the older, smaller struct. Also look at the
review comment below.

> +
> +		r = -ENOTTY;
> +		memset(&host_caps, 0, sizeof(host_caps));
> +		if (!kvm->arch.kvm_ops->get_compat_caps)
> +			goto out;
> +
> +		r = kvm->arch.kvm_ops->get_compat_caps(&host_caps);
> +		/* Set the actual size of the structure we're returning */
> +		host_caps.size = sizeof(host_caps);
> +		if (!r && copy_to_user(argp, &host_caps, sizeof(host_caps)))
> +			r = -EFAULT;
You are allowing a future userspace VMM to potentially send a larger
'struct kvm_ppc_compat_caps' than what the kernel knows about. This makes
error handling in userspace a bit involved, since some fields in the
'struct kvm_ppc_compat_caps' passed from userspace may remain
uninitialized when userspace sees it. So please mention this
subtle behaviour in the patch description and also
document it in the later doc patch.
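
To illustrate the concern, here is a userspace-only sketch. struct
caps_v2 with its 'extra' field is a hypothetical future layout, and
old_kernel_fill() merely models an older kernel that copies back only
the struct it knows about; none of these names exist in the patch.

```c
#include <stdint.h>
#include <string.h>

/* Layout as proposed in the patch under review. */
struct caps_v1 {
	uint64_t flags;
	uint64_t size;
	uint64_t compat_capabilities;
};

/* Hypothetical future layout with one field an older kernel lacks. */
struct caps_v2 {
	uint64_t flags;
	uint64_t size;
	uint64_t compat_capabilities;
	uint64_t extra;
};

/* Models the older kernel: fills and copies back only caps_v1. */
static void old_kernel_fill(void *ubuf, uint64_t caps)
{
	struct caps_v1 k = { .size = sizeof(k), .compat_capabilities = caps };

	memcpy(ubuf, &k, sizeof(k));	/* stands in for copy_to_user() */
}

/* Userspace side: 'extra' is valid only if the kernel reported it. */
static int extra_is_valid(const struct caps_v2 *u)
{
	return u->size >= sizeof(*u);
}
```

In this model the tail of the larger buffer is never written, so
userspace has to compare the returned 'size' against its own struct size
before trusting any newer field.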

> +		break;
> +	}
>  	default: {
>  		struct kvm *kvm = filp->private_data;
>  		r = kvm->arch.kvm_ops->arch_vm_ioctl(filp, ioctl, arg);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6c8afa2047bf..1788a0068662 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -996,6 +996,7 @@ struct kvm_enable_cap {
>  #define KVM_CAP_S390_USER_OPEREXEC 246
>  #define KVM_CAP_S390_KEYOP 247
>  #define KVM_CAP_S390_VSIE_ESAMODE 248
> +#define KVM_CAP_PPC_COMPAT_CAPS 249
>  
>  struct kvm_irq_routing_irqchip {
>  	__u32 irqchip;
> @@ -1349,6 +1350,9 @@ struct kvm_s390_keyop {
>  #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
>  #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
>  
> +/* Available with KVM_CAP_PPC_COMPAT_CAPS */
> +#define KVM_PPC_GET_COMPAT_CAPS	_IOR(KVMIO,  0xe4, struct kvm_ppc_compat_caps)
> +
>  /*
>   * ioctls for vcpu fds
>   */
> -- 
> 2.50.1 (Apple Git-155)
>
>

-- 
Cheers
~ Vaibhav

^ permalink raw reply

* Re: [PATCH v4 4/4] KVM: PPC: Document KVM_PPC_GET_COMPAT_CAPS ioctl
From: Vaibhav Jain @ 2026-06-19  6:14 UTC (permalink / raw)
  To: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan
  Cc: Amit Machhiwal, Anushree Mathur, Paolo Bonzini, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP), Jonathan Corbet,
	Shuah Khan, kvm, linux-kernel, linux-doc, lkp
In-Reply-To: <20260616123314.82721-5-amachhiw@linux.ibm.com>

Hi Amit,

Thanks for the patch and incorporating V3 review comments. Further
review comments inline below:

Amit Machhiwal <amachhiw@linux.ibm.com> writes:

> Add documentation for the KVM_PPC_GET_COMPAT_CAPS ioctl to the KVM API
> documentation.
>
> The ioctl exposes host processor compatibility modes supported for
> nested KVM guests on PowerPC systems. The documentation includes
> comprehensive error code descriptions, structure field definitions
> including the size field for forward compatibility, and KVM-specific
> capability bit constants.
>
> Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> ---
>  Documentation/virt/kvm/api.rst | 47 ++++++++++++++++++++++++++++++++++
>  1 file changed, 47 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 52bbbb553ce1..ba6feba74d7d 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6553,6 +6553,53 @@ KVM_S390_KEYOP_SSKE
>    Sets the storage key for the guest address ``guest_addr`` to the key
>    specified in ``key``, returning the previous value in ``key``.
>  
> +4.145 KVM_PPC_GET_COMPAT_CAPS
> +-----------------------------
> +:Capability: KVM_CAP_PPC_COMPAT_CAPS
> +:Architectures: powerpc
> +:Type: vm ioctl
> +:Parameters: struct kvm_ppc_compat_caps (out)
> +:Returns: 0 on success, negative value on failure
> +
> +Errors include:
> +
> +  ======== ============================================================
> +  EFAULT   if ``struct kvm_ppc_compat_caps`` cannot be read from or
> +           written to userspace
> +  EINVAL   if the ``size`` field is smaller than the current structure
> +           size, or if the backend implementation fails to retrieve or
> +           map CPU compatibility capabilities
> +  ENOTTY   if the backend does not implement the ``get_compat_caps``
> +           operation (e.g., on non-pseries platforms or when the
> +           required KVM operations are not available)
> +  ======== ============================================================
> +
> +IBM POWER system server-based processors provide a compatibility mode feature
> +where an Nth generation processor can operate in modes consistent with earlier
> +generations such as (N-1) and (N-2).
> +
> +This ioctl provides userspace with information about the CPU compatibility modes
> +supported by the current host processor for booting the nested KVM guests on
> +PowerNV (KVM nested APIv1) and PowerVM (KVM nested APIv2) platforms.
> +

Please add details on how the returned 'size' field can be less than what
userspace sent and how that should be handled.

> +::
> +
> +  struct kvm_ppc_compat_caps {
> +	__u64	flags;			/* Reserved for future use */
> +	__u64	size;			/* Size of this structure */
> +	__u64	compat_capabilities;	/* Capabilities supported by the host */
> +  };
> +
> +The ``compat_capabilities`` bit field describes the processor compatibility
> +modes supported by the host. For example, the following bits indicate support
> +for specific processor modes.
> +
> +::
> +
> +  KVM_PPC_COMPAT_CAP_POWER9  (bit 1): KVM guests can run in Power9 processor mode
> +  KVM_PPC_COMPAT_CAP_POWER10 (bit 2): KVM guests can run in Power10 processor mode
> +  KVM_PPC_COMPAT_CAP_POWER11 (bit 3): KVM guests can run in Power11 processor mode
> +
>  .. _kvm_run:
>  
>  5. The kvm_run structure
> -- 
> 2.50.1 (Apple Git-155)
>

-- 
Cheers
~ Vaibhav

^ permalink raw reply

* Re: [PATCH v4 3/4] KVM: PPC: Book3S HV: Add support for compat CPU capabilities for KVM on PowerNV
From: Vaibhav Jain @ 2026-06-19  6:12 UTC (permalink / raw)
  To: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan
  Cc: Amit Machhiwal, Anushree Mathur, Paolo Bonzini, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP), Jonathan Corbet,
	Shuah Khan, kvm, linux-kernel, linux-doc, lkp
In-Reply-To: <20260616123314.82721-4-amachhiw@linux.ibm.com>

Hi Amit.

Thanks for the patch and incorporating V3 review comments. Further
review comments inline below:

Amit Machhiwal <amachhiw@linux.ibm.com> writes:

> Currently, when booting a compatibility-mode KVM guest (L1) on a PowerNV
> hypervisor (L0), the guest runs with the expected processor
> compatibility level. However, when booting a nested KVM guest (L2)
> inside the L1, QEMU derives the CPU model from the raw host PVR and
> attempts to run the nested guest at that level, instead of honoring the
> compatibility mode of the L1.
>
> Extend host CPU compatibility capability reporting to support nested
> virtualization on PowerNV systems (PAPR nested API v1).
>
> For nested API v2 (PowerVM), compatibility capabilities are obtained
> from the hypervisor via the H_GUEST_GET_CAPABILITIES hcall. This
> information is not available on PowerNV systems.
>
> For nested API v1, derive the compatibility capabilities from the L1
> guest by reading the "cpu-version" property from the device tree, which
> reflects the effective (logical) processor compatibility level. Map this
> value to the corresponding compatibility capability bitmap using
> KVM-specific constants.
>
> Introduce a helper to translate CPU version values into KVM_PPC_COMPAT_CAP
> bits and integrate it into kvmppc_get_compat_caps(). The implementation
> applies masking to ensure only supported processor modes are exposed.
>
> This allows userspace to query host CPU compatibility modes on both
> PowerVM and PowerNV platforms via the KVM_PPC_GET_COMPAT_CAPS ioctl.
>
> Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_hv.c | 37 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index f674386df62c..375e7a7fa9f8 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -6523,15 +6523,50 @@ static bool kvmppc_hash_v3_possible(void)
>  	return true;
>  }
>  
> +static int kvmppc_map_compat_capabilities(const __be32 cpu_version,
> +				      unsigned long *capabilities)
> +{
> +	switch (cpu_version) {
> +	case PVR_ARCH_31_P11:
> +		*capabilities |= KVM_PPC_COMPAT_CAP_POWER11;
Do you need the 'break' here, rather than falling through? A P11 host
can also support the P10 and P9 compat modes.
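
To illustrate the fall-through variant, here is a minimal user-space
sketch, not kernel code. The capability bits are copied from the uapi
hunk in patch 2/4; the PVR_ARCH_* values are stand-ins so that the
sketch compiles on its own (the kernel's asm/reg.h is authoritative),
and the function name is hypothetical. In the kernel itself the
'fallthrough;' pseudo-keyword would replace the comments.

```c
#include <errno.h>
#include <stdint.h>

/* Capability bits as defined in the uapi hunk of patch 2/4. */
#define KVM_PPC_COMPAT_CAP_POWER9	(1ULL << 62)
#define KVM_PPC_COMPAT_CAP_POWER10	(1ULL << 61)
#define KVM_PPC_COMPAT_CAP_POWER11	(1ULL << 60)

/* Stand-in logical PVR values, assumed here for illustration only. */
#define PVR_ARCH_300	0x0f000005u
#define PVR_ARCH_31	0x0f000006u
#define PVR_ARCH_31_P11	0x0f000007u

/* Hypothetical helper: a newer level also advertises every older level. */
static int map_compat_caps_cumulative(uint32_t cpu_version, uint64_t *caps)
{
	switch (cpu_version) {
	case PVR_ARCH_31_P11:
		*caps |= KVM_PPC_COMPAT_CAP_POWER11;
		/* fall through */
	case PVR_ARCH_31:
		*caps |= KVM_PPC_COMPAT_CAP_POWER10;
		/* fall through */
	case PVR_ARCH_300:
		*caps |= KVM_PPC_COMPAT_CAP_POWER9;
		break;
	default:
		return -EINVAL;
	}

	return 0;
}
```

With this shape a P11 cpu-version yields all three bits, a P10 one
yields P10 and P9, and an unknown value leaves the bitmap untouched.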

> +		break;
> +	case PVR_ARCH_31:
> +		*capabilities |= KVM_PPC_COMPAT_CAP_POWER10;
> +		break;
> +	case PVR_ARCH_300:
> +		*capabilities |= KVM_PPC_COMPAT_CAP_POWER9;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
>  
>  static int kvmppc_get_compat_caps(struct kvm_ppc_compat_caps *host_caps)
>  {
> +	struct device_node *np;
>  	unsigned long capabilities = 0;
> +	const __be32 *prop = NULL;
>  	long rc = -EINVAL;
> +	u32 cpu_version;
>  
>  	if (kvmhv_on_pseries()) {
> -		if (kvmhv_is_nestedv2())
> +		if (kvmhv_is_nestedv2()) {
>  			rc = plpar_guest_get_capabilities(0, &capabilities);
> +		} else {
> +			for_each_node_by_type(np, "cpu") {
> +				prop = of_get_property(np, "cpu-version", NULL);
> +				if (prop) {
> +					cpu_version = be32_to_cpup(prop);
> +					break;
> +				}
> +			}
> +			if (!prop)
> +				return -EINVAL;
> +			rc = kvmppc_map_compat_capabilities(cpu_version,
> +								&capabilities);
> +		}
Should you check 'rc' for an error here before assigning 'capabilities'
to 'host_caps->compat_capabilities'? I understand it will be set to '0'
due to its initialization at the top of the function, but it would be
better to make that explicit.
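
Roughly what I have in mind, as a stand-alone sketch rather than a
drop-in hunk. The struct mirrors kvm_ppc_compat_caps from patch 2/4;
the helper name and the SKETCH_ prefix are hypothetical.

```c
#include <stdint.h>

/* Same three bits as KVM_PPC_COMPAT_BITMASK in patch 2/4. */
#define SKETCH_COMPAT_BITMASK	((1ULL << 62) | (1ULL << 61) | (1ULL << 60))

/* Mirrors struct kvm_ppc_compat_caps. */
struct sketch_compat_caps {
	uint64_t size;
	uint64_t compat_capabilities;
};

/*
 * Hypothetical helper: only publish the bitmap when the lookup that
 * produced it succeeded; on failure the caller's struct is not written.
 */
static long publish_compat_caps(long rc, uint64_t capabilities,
				struct sketch_compat_caps *host_caps)
{
	if (rc)
		return rc;

	host_caps->compat_capabilities = capabilities & SKETCH_COMPAT_BITMASK;
	return 0;
}
```

That way the error path no longer relies on the zero initialization of
'capabilities' to produce a sane result.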

>  		host_caps->compat_capabilities = capabilities &
>  							KVM_PPC_COMPAT_BITMASK;
>  	}
> -- 
> 2.50.1 (Apple Git-155)
>
>

-- 
Cheers
~ Vaibhav


* Re: [PATCH v4 2/4] KVM: PPC: Book3S HV: Implement compat CPU capability retrieval for KVM on PowerVM
From: Vaibhav Jain @ 2026-06-19  6:04 UTC
  To: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan
  Cc: Amit Machhiwal, Anushree Mathur, Paolo Bonzini, Nicholas Piggin,
	Michael Ellerman, Christophe Leroy (CS GROUP), Jonathan Corbet,
	Shuah Khan, kvm, linux-kernel, linux-doc, lkp
In-Reply-To: <20260616123314.82721-3-amachhiw@linux.ibm.com>

Hi Amit.

Thanks for the patch and incorporating V3 review comments. Further
review comments inline below:

Amit Machhiwal <amachhiw@linux.ibm.com> writes:

> On POWER systems, the host CPU may run in a compatibility mode (e.g., a
> Power11 processor operating in Power10 compatibility mode). In such
> cases, the effective CPU level exposed to guests differs from the
> physical processor generation.
>
> When running nested KVM guests, QEMU derives the host CPU type using
> mfpvr(), which reflects the physical processor version. This can result
> in a mismatch between the CPU model selected by QEMU and the
> compatibility mode enforced by the host, leading to guest boot failures.
>
> For example, booting a nested guest on a Power11 LPAR configured in
> Power10 compatibility mode fails with:
>
>   KVM-NESTEDv2: couldn't set guest wide elements
>   [..KVM reg dump..]
>
> This occurs because QEMU selects a CPU model corresponding to the
> physical processor (via mfpvr()), while the host operates in a lower
> compatibility mode. As a result, KVM rejects the requested compatibility
> level during guest initialization.
>
> Add support for retrieving host CPU compatibility capabilities for
> nested guests on PowerVM (PAPR nested API v2). The hypervisor provides
> the effective compatibility levels via the H_GUEST_GET_CAPABILITIES
> hcall, which reflects the processor modes negotiated between the Power
> hypervisor (L0) and the host partition (L1).
>
> On pseries systems, obtain the capability bitmap using
> plpar_guest_get_capabilities() and return it via struct
> kvm_ppc_compat_caps. The implementation defines KVM-specific capability
> constants (KVM_PPC_COMPAT_CAP_POWER9/10/11) and applies masking to ensure
> only supported processor modes are exposed to userspace. This information
> is then exposed through the KVM_PPC_GET_COMPAT_CAPS ioctl.
>
> Hook the implementation into the Book3S HV kvmppc_ops so that it can be
> invoked by the generic KVM ioctl handling code.
>
> Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> ---
>  arch/powerpc/include/uapi/asm/kvm.h | 11 ++++++++++-
>  arch/powerpc/kvm/book3s_hv.c        | 17 +++++++++++++++++
>  2 files changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 8a38be6c3b03..730488681443 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -443,7 +443,16 @@ struct kvm_ppc_compat_caps {
>  	__u64	size;			/* Size of this structure */
>  	__u64	compat_capabilities;	/* Capabilities supported by the host */
>  };
> -
> +/*
> + * Capability bits for compat_capabilities field in kvm_ppc_compat_caps.
> + * These bits indicate which processor compatibility modes are supported.
> + */
> +#define KVM_PPC_COMPAT_CAP_POWER9	(1ULL << 62)
> +#define KVM_PPC_COMPAT_CAP_POWER10	(1ULL << 61)
> +#define KVM_PPC_COMPAT_CAP_POWER11	(1ULL << 60)
> +#define KVM_PPC_COMPAT_BITMASK		(KVM_PPC_COMPAT_CAP_POWER9 | \
> +					 KVM_PPC_COMPAT_CAP_POWER10 | \
> +					 KVM_PPC_COMPAT_CAP_POWER11)
>  /*
>   * Values for character and character_mask.
>   * These are identical to the values used by H_GET_CPU_CHARACTERISTICS.
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index f9380ef65750..f674386df62c 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -6523,6 +6523,22 @@ static bool kvmppc_hash_v3_possible(void)
>  	return true;
>  }
>  
> +
> +static int kvmppc_get_compat_caps(struct kvm_ppc_compat_caps *host_caps)
> +{
> +	unsigned long capabilities = 0;
> +	long rc = -EINVAL;
> +
> +	if (kvmhv_on_pseries()) {
> +		if (kvmhv_is_nestedv2())
> +			rc = plpar_guest_get_capabilities(0, &capabilities);
I think that, instead of making the hcall, you should use the
'nested_capabilities' extern symbol, as it would already hold the same
value. This symbol is already accessible in 'book3s_hv.c'.

> +		host_caps->compat_capabilities = capabilities &
> +							KVM_PPC_COMPAT_BITMASK;
> +	}
> +
> +	return rc;
> +}
> +
>  static struct kvmppc_ops kvm_ops_hv = {
>  	.get_sregs = kvm_arch_vcpu_ioctl_get_sregs_hv,
>  	.set_sregs = kvm_arch_vcpu_ioctl_set_sregs_hv,
> @@ -6565,6 +6581,7 @@ static struct kvmppc_ops kvm_ops_hv = {
>  	.hash_v3_possible = kvmppc_hash_v3_possible,
>  	.create_vcpu_debugfs = kvmppc_arch_create_vcpu_debugfs_hv,
>  	.create_vm_debugfs = kvmppc_arch_create_vm_debugfs_hv,
> +	.get_compat_caps = kvmppc_get_compat_caps,
>  };
>  
>  static int kvm_init_subcore_bitmap(void)
> -- 
> 2.50.1 (Apple Git-155)
>
>

-- 
Cheers
~ Vaibhav


* Re: [PATCH v2 5/7] seg6: add End.M.GTP6.D.Di behavior
From: Yuya Kusakabe @ 2026-06-19  5:48 UTC
  To: andrea
  Cc: Yuya Kusakabe, andrea.mayer, davem, edumazet, dsahern, kuba,
	pabeni, horms, justin.iurman, shuah, corbet, skhan, linux-kernel,
	netdev, linux-kselftest, linux-doc, stefano.salsano, ahabdels
In-Reply-To: <20260607160119.ed2022e8a358d700e1134318@common-net.org>

Hi Andrea,

Thank you for the review.

> The patch 4 review applies here, except for the parts where Section 6.4 is
> implemented instead of Section 6.3 (which is incorrectly implemented in
> patch 4).

Answered in the patch 4 reply: the next version of End.M.GTP6.D will
implement Section 6.3 (Args.Mob.Session stamped into SRH[0], no
preserved D), leaving the original-DA preservation exclusive to this
drop-in variant.

> input_action_end_m_gtp6_d_di() and its finish callback are largely
> identical to the patch 4 functions (input_action_end_m_gtp6_d() and its
> finish): the SRH check, GTP-U dispatch, outer strip, inner protocol
> detection, and NF_HOOK invocation are identical. The duplication should be
> reduced via shared helpers.

Will do. The plan is one decap helper (SRH check, GTP-U dispatch,
outer strip, inner protocol detection) shared between End.M.GTP6.D and
End.M.GTP6.D.Di, and one SRv6-push helper (including the GSO offload
setup) shared with H.M.GTP4.D as well, with the GTP-U parser common to
all of them. The D.Di handler then reduces to the prepended-slot
handling specific to the drop-in variant. (The NF_HOOK invocation goes
away in the initial series per the cover letter thread.)

> D.Di does not use teid or qfi, so these variables and the (void) casts are
> dead code and should be avoided. For example, seg6_mobile_parse_gtpu() could
> accept NULL for teid and qfi so callers that do not need them can pass NULL
> directly.

Will do exactly that: the GTP-U parser (and the decap helper above)
will accept NULL for teid/qfi, and the drop-in variant will pass NULL.
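
As a rough user-space sketch of the optional out-parameter shape (not
the actual seg6_mobile_parse_gtpu() signature; the function name and
the reduced scope, mandatory header and TEID only, are assumptions
made for illustration):

```c
#include <stddef.h>
#include <stdint.h>

#define GTP1_HDR_LEN	8	/* flags, type, length (2), TEID (4) */

/*
 * Hypothetical parser: validates the mandatory GTPv1-U header and
 * reports the TEID only when the caller asked for it.
 */
static int sketch_parse_gtpu(const uint8_t *buf, size_t len, uint32_t *teid)
{
	if (len < GTP1_HDR_LEN)
		return -1;

	/* The top three bits of the first octet carry the version (1). */
	if ((buf[0] >> 5) != 1)
		return -1;

	/* A NULL out-pointer means "not interested", as in the D.Di case. */
	if (teid)
		*teid = (uint32_t)buf[4] << 24 | (uint32_t)buf[5] << 16 |
			(uint32_t)buf[6] << 8 | buf[7];

	return 0;
}
```

The drop-in variant would then call it with NULL and carry no unused
locals or (void) casts.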

Thanks,
Yuya


* Re: [PATCH v2 4/7] seg6: add End.M.GTP6.D behavior
From: Yuya Kusakabe @ 2026-06-19  5:27 UTC
  To: andrea
  Cc: Yuya Kusakabe, andrea.mayer, davem, edumazet, dsahern, kuba,
	pabeni, horms, justin.iurman, shuah, corbet, skhan, linux-kernel,
	netdev, linux-kselftest, linux-doc, stefano.salsano, ahabdels
In-Reply-To: <20260607020517.0c6bbb8beba505ac9447545e@common-net.org>

Hi Andrea,

Thank you for the review. The points shared with patches 1-3 will be
addressed as described in those replies; below are the
End.M.GTP6.D-specific ones.

> The "src" attribute is used verbatim here as the outer IPv6 source address,
> same as patch 3. The src dual-semantics overload flagged in the patch 3
> reply applies here too.

Covered in the patch 3 reply: with the End.M.GTP4.E template use
gone, verbatim outer IPv6 SA becomes the single meaning of the src
attribute for the IPv6-emitting behaviors.

> Thank you for the follow-up in the cover letter thread. The finish callback
> writes orig_dst into SRH[0] and Args.Mob.Session into SRH[1]. As far as I
> can see, this matches neither Section 6.3 (Args.Mob.Session in SRH[0], no
> D) nor Section 6.4 (D in SRH[0], no Args.Mob).

Confirmed, that is the bug from my May 10 note. The next version of
End.M.GTP6.D will push the configured SR Policy verbatim and stamp
Args.Mob.Session into SRH[0] (at the locator length given by the
explicit sr_prefix_len attribute) per Section 6.3 S08; preserving the
original outer DA in a prepended slot will be exclusive to
End.M.GTP6.D.Di.

> Same reverse Christmas tree as patch 2; same issue in the other functions
> introduced by this patch.
> gtp is only used as a cast intermediary. Could it be inlined?

Will fix both.

> Nit: gtphl and hdrlen are assigned before the GTP1_F_EXTHDR check. On the
> path where the E flag is not set, gtphl is unused. Moving the gtphl
> assignment after the check would make the flow clearer.

Will move the gtphl dereference after the check; the pull has to stay
before it, since the long header is also consumed for S/PN-only
flags.

> Maybe ext could be renamed to ext_hdr? It would be easier to distinguish
> from ext_units and ext_bytes.
> ext_units is only used to derive ext_bytes. A single ext_len would
> remove the intermediate variable.

Will do both.

> If the extension chain contains more than one PDU Session Container, *qfi
> is silently overwritten. Is that intentional, or should the function reject
> a duplicate?

Not intentional; will reject a duplicate PDU Session Container as
malformed, with a selftest case for it.
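
For reference, a stand-alone sketch of the extension walk with the
duplicate check, assuming the usual GTP-U extension layout (first octet
is the length in 4-octet units, last octet is the Next Extension Header
Type, 0x85 is the PDU Session Container with the QFI in the low six
bits of the second content octet). The function and macro names are
hypothetical and this is not the code from the patch:

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EXT_NONE		0x00
#define SKETCH_EXT_PDU_SESSION	0x85

/*
 * Walk the extension chain starting at ext, whose type was announced
 * by the preceding header. qfi may be NULL. Returns 0 on success and
 * -1 on a malformed chain, including a second PDU Session Container.
 */
static int sketch_walk_gtpu_ext(const uint8_t *ext, size_t len,
				uint8_t type, uint8_t *qfi)
{
	int seen_pdu_session = 0;

	while (type != SKETCH_EXT_NONE) {
		size_t ext_len;

		if (len < 1)
			return -1;

		ext_len = (size_t)ext[0] * 4;	/* length in 4-octet units */
		if (ext_len == 0 || ext_len > len)
			return -1;

		if (type == SKETCH_EXT_PDU_SESSION) {
			if (seen_pdu_session)
				return -1;	/* reject a duplicate */
			seen_pdu_session = 1;
			if (qfi)
				*qfi = ext[2] & 0x3f;
		}

		/* The last octet of each extension names the next one. */
		type = ext[ext_len - 1];
		ext += ext_len;
		len -= ext_len;
	}

	return 0;
}
```

The selftest case would then be a T-PDU carrying two 0x85 extensions
back to back, expected to be dropped as malformed.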

> ext[ext_bytes - 1] reads the Next Extension Header Type field from the last
> byte of the current extension. Would a short comment help the reader?

Will add one.

> input_action_end_m_gtp6_d() does not change skb_dst(skb) before this call,
> so dst and lwtstate are the same ones the caller already dereferenced. When
> can this NULL check trigger?

It cannot: for a route installed with LWTUNNEL_STATE_INPUT_REDIRECT,
lwtunnel_set_redirect() always populates orig_input before dst.input
is replaced. I will drop the checks and call orig_input directly.

> Same dst/lwtstate issue as patch 2. Not introduced by this patch.
> Same missing iptunnel_handle_offloads() as patch 2.

The NF_HOOK split goes away per the cover letter thread, and the SRv6
push will go through a shared helper that calls
iptunnel_handle_offloads(skb, SKB_GSO_IPXIP6) before
seg6_do_srh_encap().

> Same BAD_INNER misuse as patch 2. seg6_do_srh_encap() can also fail from
> seg6_push_hmac(), which is an HMAC error on the new SRH, not an inner-T-PDU
> problem.
[...]
> segments[0], segments[1], saddr, and daddr are written after
> seg6_do_srh_encap() already called skb_postpush_rcsum(). skb->csum can
> be stale. Same for any later change to the outer header or SRH.
>
> HMAC, if configured, is computed on non-final SRH and saddr, hence invalid.

Thanks, both of these are real issues. My plan for the next version:

- every field stamped after seg6_do_srh_encap() (Args.Mob.Session, the
  preserved DA in the drop-in variant, the outer saddr/daddr refresh,
  and the dsfield propagation in H.M.GTP4.D) will go through a small
  helper that applies the corresponding diff to skb->csum when the skb
  is CHECKSUM_COMPLETE;

- the D-side behaviors will reject an HMAC-flagged SRH template at
  configuration time: stamping the per-packet fields after
  seg6_do_srh_encap() has signed the SRH would always invalidate the
  HMAC. Inbound HMAC validation is unaffected. Would you prefer the
  stamp-before-sign ordering solved from the start instead?
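
For the first item, the arithmetic behind patching a running checksum
when a field is rewritten can be sketched in user space as below. This
only illustrates the ones' complement update on a folded 16-bit sum;
the kernel has its own checksum helpers and the eventual helper in the
series may look quite different. All names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t sketch_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones' complement sum over big-endian 16-bit words (len must be even). */
static uint16_t sketch_ocsum(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)p[i] << 8 | p[i + 1];
	return sketch_fold16(sum);
}

/*
 * Apply the diff for one rewritten 16-bit word: subtracting old_val in
 * ones' complement arithmetic is adding its complement.
 */
static uint16_t sketch_ocsum_replace16(uint16_t sum, uint16_t old_val,
				       uint16_t new_val)
{
	return sketch_fold16((uint32_t)sum + (uint16_t)~old_val + new_val);
}
```

Each field stamped after the encap would contribute one such diff per
changed word, so the stored sum stays consistent without re-summing the
whole packet.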

> The initializer on reason is dead. Every goto drop path sets reason
> explicitly before the jump. The variable can be left uninitialized here.

This goes away with the drop-reason rework: the MUP drop reasons will
be out of the initial series per the prep series plan, so the variable
itself is removed.

> Same SRH validation concerns as patch 1. HMAC is not validated here.

The ingress will use the same three-state SRH helper as the other
behaviors, which validates the HMAC whenever an SRH is present.

> Limitation note for both input_action_end() calls above: correct per RFC
> 9433 Section 6.3 S10-S11, but the SRH is absent or SL == 0 here, so
> input_action_end() will always drop without signaling non-GTP-U traffic.
> Perhaps you meant to drop directly with BAD_GTPU?

Right, the End fallback could only ever drop here. Instead of
dropping, I plan to hand non-UDP, non-GTP-U and non-T-PDU packets to
the route's original input path (the orig_input saved by the lwtunnel
input redirect), so a downstream owner of the GTP-U control plane
still receives e.g. Echo Request; the selftests will cover that
passthrough.

> Nit: inner_first could be an inner_ver with the shift done at assignment.
> The name would say what the variable holds.
[...]
> Same repeated size-selection ternary as patch 2.

Will do both: the inner version, header length and protocol computed
once in the switch.

> The anonymous { } block scopes three variables that should be declared at
> function top. Splitting into smaller helpers would make this easier to
> follow.

Will split the dispatch and outer strip into a decap helper shared
with End.M.GTP6.D.Di, with declarations at function top.

> Same missing frag_off check as patch 2.

Will add.

> The "{,.Di}" shell brace notation is unusual. Emitting the actual
> behavior name (End.M.GTP6.D or End.M.GTP6.D.Di) would be clearer.
> Same applies wherever this notation appears in the patchset.

Will replace it with the concrete behavior name everywhere.

Thanks,
Yuya


* Re: [RFC PATCH 0/2] kasan: hw_tags: Add option to tag only at allocation time
From: Dev Jain @ 2026-06-19  5:17 UTC
  To: Ryan Roberts, ryabinin.a.a, akpm, corbet
  Cc: glider, andreyknvl, dvyukov, vincenzo.frascino, kasan-dev,
	linux-mm, linux-kernel, skhan, workflows, linux-doc,
	linux-arm-kernel, anshuman.khandual, kaleshsingh, 21cnbao, david,
	will, catalin.marinas
In-Reply-To: <dbc2800f-7880-486f-831c-ec9b6cedc005@arm.com>



On 18/06/26 7:18 pm, Ryan Roberts wrote:
> On 12/06/2026 05:44, Dev Jain wrote:
>> Introduce a boot option to tag only at allocation time of the objects. This
>> reduces KASAN MTE overhead, the tradeoff being reduced ability of
>> catching bugs.
>>
>> Now, when a memory object will be freed, it will retain the random tag it
>> had at allocation time. This compromises on catching UAF bugs, till the
>> time the object is not reallocated, at which point it will have a new
>> random tag.
>>
>> Hence, not catching "use-after-free-before-reallocation" and not catching
>> "double-free" will be the compromise for reduced KASAN overhead.
> 
> Does standard KASAN with HW_TAGS really detect double-free? How does it do that?
> I could imagine it testing the tags of memory being freed to see if they are set
> to the poison tag, but that would lead to false positives for the GFP_SKIP_KASAN
> case, surely?

I should have mentioned that the double-free check is only for slab
objects; see __kasan_slab_pre_free. So we won't be able to catch
double-free here.

> 
> If I'm right, then the only downgrade this new mode causes is that if
> freed-but-not-yet-reallocated memory is accessed via its dangling pointer, then
> that bad access is not detected. I think that would be benign in all the cases I
> can think of, so while it would be a problem for a debugging use case, it would
> unlikely be a problem for security enforcement?

Okay, so you are saying that we won't catch the bug, but there is no
security problem because the dangling pointer is accessing memory which
isn't in use by anyone else.


> 
> Thanks,
> Ryan
> 
> 
>>
>> This is an RFC because we are not clear about the performance benefit.
>>
>> Android folks, please help with testing!
>>
>> ---
>> Applies on Linus master (9716c086c8e8).
>>
>> Dev Jain (2):
>>   kasan: hw_tags: Use KASAN_PAGE_REDZONE for vmalloc redzoning
>>   kasan: hw_tags: Add boot option to elide free time poisoning
>>
>>  Documentation/dev-tools/kasan.rst |  4 +++
>>  mm/kasan/hw_tags.c                | 45 +++++++++++++++++++++++++++++--
>>  mm/kasan/kasan.h                  | 23 +++++++++++++++-
>>  3 files changed, 69 insertions(+), 3 deletions(-)
>>
> 



* [gourryinverse:scratch/gourry/managed_nodes/rfc5 52/55] htmldocs: Documentation/ABI/testing/sysfs-bus-dax:184: WARNING: Definition list ends without a blank line; unexpected unindent. [docutils]
From: kernel test robot @ 2026-06-19  5:16 UTC
  To: Gregory Price; +Cc: oe-kbuild-all, linux-doc

tree:   https://github.com/gourryinverse/linux scratch/gourry/managed_nodes/rfc5
head:   b20d77118eedcaee86ecfc4a7ecd4285c2762231
commit: e9dd3c30fc0be8eda22f834894fcfe47c5cab4fc [52/55] Documentation/ABI: document anondax private-node sysfs interface
compiler: clang version 22.1.8 (https://github.com/llvm/llvm-project ca7933e47d3a3451d81e72ac174dcb5aa28b59d1)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260619/202606190744.9fwTmXa8-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606190744.9fwTmXa8-lkp@intel.com/

All warnings (new ones prefixed by >>):

   WARNING: Documentation/ABI/testing/sysfs-class-reboot-mode-reboot_modes:36: abi_sys_class_reboot_mode_driver_reboot_modes doesn't have a description
   WARNING: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/os_mode is defined 2 times: Documentation/ABI/testing/sysfs-driver-hid-lenovo-go:364; Documentation/ABI/testing/sysfs-driver-hid-lenovo-go-s:234
   WARNING: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/os_mode_index is defined 2 times: Documentation/ABI/testing/sysfs-driver-hid-lenovo-go:373; Documentation/ABI/testing/sysfs-driver-hid-lenovo-go-s:243
   WARNING: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/touchpad/enabled is defined 2 times: Documentation/ABI/testing/sysfs-driver-hid-lenovo-go:636; Documentation/ABI/testing/sysfs-driver-hid-lenovo-go-s:252
   WARNING: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/touchpad/enabled_index is defined 2 times: Documentation/ABI/testing/sysfs-driver-hid-lenovo-go:645; Documentation/ABI/testing/sysfs-driver-hid-lenovo-go-s:261
>> Documentation/ABI/testing/sysfs-bus-dax:184: WARNING: Definition list ends without a blank line; unexpected unindent. [docutils]
   Documentation/ABI/testing/sysfs-bus-dax:184: ERROR: Unexpected indentation. [docutils]
>> Documentation/ABI/testing/sysfs-bus-dax:184: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils]
   Documentation/core-api/kref:328: ./include/linux/kref.h:72: WARNING: Invalid C declaration: Expected end of definition. [error at 96]
   int kref_put_mutex (struct kref *kref, void (*release)(struct kref *kref), struct mutex *mutex) __cond_acquires(true# mutex)
   ------------------------------------------------------------------------------------------------^
   Documentation/core-api/kref:328: ./include/linux/kref.h:94: WARNING: Invalid C declaration: Expected end of definition. [error at 92]
   int kref_put_lock (struct kref *kref, void (*release)(struct kref *kref), spinlock_t *lock) __cond_acquires(true# lock)


vim +184 Documentation/ABI/testing/sysfs-bus-dax

 > 184	What:		/sys/bus/dax/devices/daxX.Y/reclaim
   185	What:		/sys/bus/dax/devices/daxX.Y/mempolicy
   186	What:		/sys/bus/dax/devices/daxX.Y/hotunplug
   187	What:		/sys/bus/dax/devices/daxX.Y/tiering
   188	What:		/sys/bus/dax/devices/daxX.Y/ltpin
   189	What:		/sys/bus/dax/devices/daxX.Y/user_migrate
   190	Date:		January, 2026
   191	KernelVersion:	v6.21
   192	Contact:	nvdimm@lists.linux.dev
   193	Description:
   194			(RW) anondax only.  Per-service opt-ins (booleans) recorded on
   195			the device and applied to its private node at hotplug, selecting
   196			which mm services may act on the node's memory.  A private node
   197			opts out of everything by default; each toggle relaxes one
   198			service:
   199	
   200			  reclaim		allow reclaim of the node's folios, by the mm
   201						and by userspace MADV_COLD/PAGEOUT/FREE
   202			  mempolicy		allow userspace placement policy
   203						(mbind()/set_mempolicy()/home_node) onto the node
   204			  hotunplug		allow hot-unplug via migration
   205			  tiering		allow kernel access-aware migration: demotion
   206						target, NUMA balancing, and DAMON migration
   207			  ltpin			allow FOLL_LONGTERM GUP pins
   208			  user_migrate		allow userspace move_pages() to/from the node
   209	
   210			Writable only while the device is "unplugged" (-EBUSY otherwise).
   211			Dependencies between opt-ins (tiering requires reclaim) are
   212			validated when the device is hotplugged, not
   213			at write time: an inconsistent combination is accepted by the
   214			write but the subsequent hotplug then fails.
   215			See Documentation/mm/numa_private_nodes.rst.
   216	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [RFC PATCH 0/2] kasan: hw_tags: Add option to tag only at allocation time
From: Dev Jain @ 2026-06-19  4:46 UTC
  To: Lance Yang
  Cc: ryabinin.a.a, akpm, corbet, glider, andreyknvl, dvyukov,
	vincenzo.frascino, kasan-dev, linux-mm, linux-kernel, skhan,
	workflows, linux-doc, linux-arm-kernel, ryan.roberts,
	anshuman.khandual, kaleshsingh, 21cnbao, david, will,
	catalin.marinas
In-Reply-To: <20260613060637.40039-1-lance.yang@linux.dev>



On 13/06/26 11:36 am, Lance Yang wrote:
> 
> On Fri, Jun 12, 2026 at 04:44:22AM +0000, Dev Jain wrote:
>> Introduce a boot option to tag only at allocation time of the objects. This
>> reduces KASAN MTE overhead, the tradeoff being reduced ability of
>> catching bugs.
>>
>> Now, when a memory object will be freed, it will retain the random tag it
>> had at allocation time. This compromises on catching UAF bugs, till the
>> time the object is not reallocated, at which point it will have a new
>> random tag.
>>
>> Hence, not catching "use-after-free-before-reallocation" and not catching
>> "double-free" will be the compromise for reduced KASAN overhead.
> 
> Hmm ... do we also need to teach the KASAN KUnit tests about this mode?
> 
> With kasan.tag_only_on_alloc=on, free-time poisoning is skipped, so
> some UAF and double-free reports are skipped on purpose, but the tests
> still expect them :)

Yeah, my opinion is that we shouldn't bother, but if we go ahead with
this patch in some shape or form then I'll see how to make kasan_test
work with this.

> 
> Cheers, Lance
> 



* Re: [RFC PATCH 2/2] kasan: hw_tags: Add boot option to elide free time poisoning
From: Dev Jain @ 2026-06-19  4:44 UTC
  To: Isaac Manjarres
  Cc: ryabinin.a.a, akpm, corbet, glider, andreyknvl, dvyukov,
	vincenzo.frascino, kasan-dev, linux-mm, linux-kernel, skhan,
	workflows, linux-doc, linux-arm-kernel, ryan.roberts,
	anshuman.khandual, kaleshsingh, 21cnbao, david, will,
	catalin.marinas
In-Reply-To: <aiyi5flNkNNm0pSR@google.com>



On 13/06/26 5:53 am, Isaac Manjarres wrote:
> On Fri, Jun 12, 2026 at 04:44:24AM +0000, Dev Jain wrote:
>> diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
>> index fc9169a547662..4fa8abb312faa 100644
>> --- a/mm/kasan/kasan.h
>> +++ b/mm/kasan/kasan.h
>>  #ifdef CONFIG_KASAN_GENERIC
>> @@ -478,6 +489,16 @@ static inline u8 kasan_random_tag(void) { return 0; }
>>  
>>  static inline void kasan_poison(const void *addr, size_t size, u8 value, bool init)
>>  {
>> +	if (kasan_tag_only_on_alloc_enabled()) {
>> +		if ((value != KASAN_SLAB_REDZONE) && (value != KASAN_PAGE_REDZONE)) {
>> +			if (init)
>> +				memset((void *)kasan_reset_tag(addr), 0, size);
>> +			return;
>> +		}
>> +	}
>> +
>> +	value |= 0xF0;
>> +
> 
> I wonder if it would make more sense to have this as:
> 
> if (kasan_tag_only_on_alloc_enabled() && (value == KASAN_SLAB_FREE ||
>     value == KASAN_PAGE_FREE)) {
> 	if (init)
> 		memset((void *)kasan_reset_tag(addr), 0, size);
> 	return;
> }
> 
> That seems a bit clearer to me as to what it is that you're doing, and
> also makes it so that you don't have to do any bit manipulation
> on the value when you're filling in the redzones.

Ah, so you mean we can define KASAN_SLAB_FREE and KASAN_PAGE_FREE to be
different values, leaving KASAN_SLAB_REDZONE and KASAN_PAGE_REDZONE at
0xFE, the poison value. Yep, I'll do that.
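
Something along these lines, as a user-space model of the decision
only. The SKETCH_* values are made up for illustration and are not the
values the next version will use; how the free markers get translated
back to the hardware poison tag when the option is off is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker values, chosen only so the three differ. */
#define SKETCH_REDZONE		0xFE
#define SKETCH_SLAB_FREE	0xFD
#define SKETCH_PAGE_FREE	0xFC

/*
 * Model of the early return in kasan_poison(): with tag-only-on-alloc
 * enabled, free-time poisoning is skipped while redzone poisoning
 * still goes through.
 */
static bool sketch_skip_poison(bool tag_only_on_alloc, uint8_t value)
{
	return tag_only_on_alloc &&
	       (value == SKETCH_SLAB_FREE || value == SKETCH_PAGE_FREE);
}
```

The redzone values then need no bit manipulation on the way to the
tagging primitive.
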
> 
> Thanks,
> Isaac


