Linux Trace Kernel


* [PATCH v3] mm/lruvec: trace LRU add drains and drain-all requests
From: JP Kobryn @ 2026-06-10 19:52 UTC
  To: linux-mm, willy, shakeel.butt, usama.arif, akpm, vbabka, mhocko,
	rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park
  Cc: linux-kernel, linux-trace-kernel

LRU add batches can be drained before they reach capacity. This can be a
source of LRU lock contention, but it is not currently possible to
attribute these drains to callers with existing tracepoints.

Add mm_lru_add_drain to report the CPU and lru_add batch count when an
lru_add batch is drained. This allows tracing to distinguish full drains
from partial drains and attribute them to the calling stack.

Add mm_lru_add_drain_all to capture callers of __lru_add_drain_all and
whether they set the force flag for all CPUs. The tracepoint resembles
the signature of the enclosing function, but is needed because of
potential inlining.

Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
---
 include/trace/events/pagemap.h | 37 ++++++++++++++++++++++++++++++++++
 mm/swap.c                      |  7 ++++++-
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 171524d3526d..ff3da07ccb40 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -77,6 +77,43 @@ TRACE_EVENT(mm_lru_activate,
 	TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
 );
 
+TRACE_EVENT(mm_lru_add_drain,
+
+	TP_PROTO(int cpu, unsigned int nr),
+
+	TP_ARGS(cpu, nr),
+
+	TP_STRUCT__entry(
+		__field(int,		cpu	)
+		__field(unsigned int,	nr	)
+	),
+
+	TP_fast_assign(
+		__entry->cpu	= cpu;
+		__entry->nr	= nr;
+	),
+
+	TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
+);
+
+TRACE_EVENT(mm_lru_add_drain_all,
+
+	TP_PROTO(bool force_all_cpus),
+
+	TP_ARGS(force_all_cpus),
+
+	TP_STRUCT__entry(
+		__field(bool,	force_all_cpus	)
+	),
+
+	TP_fast_assign(
+		__entry->force_all_cpus	= force_all_cpus;
+	),
+
+	TP_printk("force_all_cpus=%s",
+		__entry->force_all_cpus ? "true" : "false")
+);
+
 #endif /* _TRACE_PAGEMAP_H */
 
 /* This part must be outside protection */
diff --git a/mm/swap.c b/mm/swap.c
index 588f50d8f1a8..e14b7612f896 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
 {
 	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
 	struct folio_batch *fbatch = &fbatches->lru_add;
+	unsigned int nr_folios_add = folio_batch_count(fbatch);
 
-	if (folio_batch_count(fbatch))
+	if (nr_folios_add) {
 		folio_batch_move_lru(fbatch, lru_add);
+		trace_mm_lru_add_drain(cpu, nr_folios_add);
+	}
 
 	fbatch = &fbatches->lru_move_tail;
 	/* Disabling interrupts below acts as a compiler barrier. */
@@ -869,6 +872,8 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
 	if (WARN_ON(!mm_percpu_wq))
 		return;
 
+	trace_mm_lru_add_drain_all(force_all_cpus);
+
 	/*
 	 * Guarantee folio_batch counter stores visible by this CPU
 	 * are visible to other CPUs before loading the current drain
-- 
2.54.0



* Re: [PATCH] tracing: fprobe: Remove __packed from generic __fprobe_header
From: Steven Rostedt @ 2026-06-10 19:51 UTC
  To: David Laight
  Cc: Masami Hiramatsu (Google),
	Markus Schneider-Pargmann (The Capable Hub), Mathieu Desnoyers,
	Heiko Carstens, linux-kernel, linux-trace-kernel
In-Reply-To: <20260610120659.7c61cfa6@pumpkin>

On Wed, 10 Jun 2026 12:06:59 +0100
David Laight <david.laight.linux@gmail.com> wrote:

> So you only want __packed on structures that might be misaligned and those
> that contain misaligned members.
> 
> If the structure is only guaranteed to be 32bit aligned then use __packed
> __aligned(4) so that two 32bit accesses get used instead of 8 8bit ones.
> 
> -- David
> 
> > 
> > Thank you,
> >   
> > > Signed-off-by: Markus Schneider-Pargmann (The Capable Hub) <msp@baylibre.com>
> > > ---
> > >  kernel/trace/fprobe.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> > > index cc49ebd2a773..21751dcdb7b9 100644
> > > --- a/kernel/trace/fprobe.c
> > > +++ b/kernel/trace/fprobe.c
> > > @@ -181,7 +181,7 @@ static inline void read_fprobe_header(unsigned long *stack,
> > >  struct __fprobe_header {
> > >  	struct fprobe *fp;
> > >  	unsigned long size_words;
> > > -} __packed;
> > > +};
> > >  

Does "__packed" really do anything between a pointer and a long?

-- Steve


* Re: [RFC PATCH 1/2] tracing/osnoise: Sample IPI counts
From: Crystal Wood @ 2026-06-10 19:51 UTC
  To: Valentin Schneider, linux-kernel, linux-trace-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Tomas Glozar,
	Costa Shulyupin, Ivan Pravdin
In-Reply-To: <20260610130457.1304245-2-vschneid@redhat.com>

On Wed, 2026-06-10 at 15:04 +0200, Valentin Schneider wrote:
> Osnoise already implicitly accounts for IPIs via its IRQ tracking,

Does it?  It seems that IPIs bypass the kernel/irq subsystem on some
arches (including x86, but not ARM).

It would be nice to solve this properly by adding generic ipi
entry/exit tracing (similar to what ARM already has).

> however it
> can be interesting to distinguish between the two: undesired IPIs usually
> imply a software configuration issue (e.g. wrong/incomplete CPU isolation)
> whereas undesired (non-IPI) IRQs usually imply a hardware configuration
> issue.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> Note that this is modifying the osnoise:osnoise_entry Ftrace entry; I know
> trace events are sort of supposed to be stable, but I'm not sure about
> ftrace entries.

I think old rtla will be OK with this since it looks up fields by name
rather than assuming a fixed layout.

> Alternatively I can have this be purely supported in userspace osnoise by
> hooking into the IPI events and counting IPIs separately from the osnoise
> events.

One benefit I could see of doing this in kernel osnoise would be if you
could atomically correlate the count with the particular noise
interval, but this patch doesn't do that.

> +static void ipi_emission(struct osnoise_variables *osn_var, unsigned int dst_cpu)
> +{
> +	if (!osn_var->sampling)
> +		return;
> +
> +	osn_var->ipi.count++;
> +}
> +
> +static void trace_ipi_send_cpu_callback(void *data, unsigned int cpu,
> +					unsigned long callsite, void *callback)
> +{
> +	struct osnoise_variables *osn_var;
> +
> +	osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
> +	ipi_emission(osn_var, cpu);
> +}
> +
> +static void trace_ipi_send_cpumask_callback(void *data, const struct cpumask *cpumask,
> +					    unsigned long callsite, void *callback)
> +{
> +	struct osnoise_variables *osn_var;
> +	int cpu;
> +
> +	for_each_cpu_and(cpu, cpumask, &osnoise_cpumask) {
> +		osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
> +		ipi_emission(osn_var, cpu);
> +	}
> +}

Isn't this racy to do from a different CPU?  Both in terms of the
counter, and the timing of the increment relative to when the IPI is
actually received.  Not necessarily a huge deal if you only care about
zero versus bignum, but still.  At least worth a comment, if we go with
this approach.

-Crystal



* Re: [PATCH] mm/lruvec: trace LRU add drains and drain-all queuing
From: Shakeel Butt @ 2026-06-10 19:38 UTC
  To: JP Kobryn
  Cc: Barry Song, linux-mm, willy, usama.arif, akpm, vbabka, mhocko,
	rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <ffdf9b07-487b-4668-a91b-27b0ab29c70d@linux.dev>

On Wed, Jun 10, 2026 at 12:20:19PM -0700, JP Kobryn wrote:
> On 6/10/26 11:54 AM, JP Kobryn wrote:
> > On 6/9/26 6:21 PM, Shakeel Butt wrote:
> >> On Tue, Jun 09, 2026 at 05:16:15PM -0700, JP Kobryn wrote:
> >>> On 6/9/26 5:07 PM, JP Kobryn wrote:
> >>>> On 6/9/26 12:44 AM, Barry Song wrote:
> >>>>> On Tue, Jun 9, 2026 at 12:12 PM JP Kobryn <jp.kobryn@linux.dev> wrote:
> >>>>>>

[...]

> >>>>> Do you need tracing on each CPU individually, or is tracing the
> >>>>> entire __lru_add_drain_all() invocation sufficient?
> >>>>
> >>>> I think the latter would be fine. The remote work will invoke the
> >>>> mm_lru_add_drain tracepoint, which will show up as kworker stacks. Since
> >>>> the event already has the CPU, we could see where queued drains actually
> >>>> ran.
> >>>
> >>> Actually if it's just a single invocation and the only event data is the
> >>> force flag, a tracepoint may not even be needed. Other probes can be
> >>> installed on function invocation and read the single argument. I can
> >>> drop this from v2 and keep the single mm_lru_add_drain tracepoint.
> >>
> >> No we do want to trace the callers requesting to drain from all the CPUs. If you
> >> trace just lru_add_drain_cpu() then you will only see that the drain is
> >> requested for a given CPU but no information on the requester. 
> >>
> >> Also as Barry said, I think single trace for whole __lru_add_drain_all() is good
> >> enough.
> > 
> > Right, but couldn't that already be done with fentry or kprobe? If we
> > only need the calling stack and the argument value of force_all_cpus I
> > don't see a strong need for a dedicated tracepoint.
> 
> Never mind that. I see it's declared inline so I'll add a tracepoint and
> send v3.

Thanks. BTW, even without the inline keyword, the compiler can still decide to
inline a function, so kprobe/fentry are not always reliable.
> 


* Re: [PATCH] mm/lruvec: trace LRU add drains and drain-all queuing
From: JP Kobryn @ 2026-06-10 19:20 UTC
  To: Shakeel Butt
  Cc: Barry Song, linux-mm, willy, usama.arif, akpm, vbabka, mhocko,
	rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <6e7739cc-327e-4ca9-86f4-17729b624632@linux.dev>

On 6/10/26 11:54 AM, JP Kobryn wrote:
> On 6/9/26 6:21 PM, Shakeel Butt wrote:
>> On Tue, Jun 09, 2026 at 05:16:15PM -0700, JP Kobryn wrote:
>>> On 6/9/26 5:07 PM, JP Kobryn wrote:
>>>> On 6/9/26 12:44 AM, Barry Song wrote:
>>>>> On Tue, Jun 9, 2026 at 12:12 PM JP Kobryn <jp.kobryn@linux.dev> wrote:
>>>>>>
>>>>>> LRU add batches can be drained before they reach capacity. This can be a
>>>>>> source of LRU lock contention, but it is not currently possible to
>>>>>> attribute these drains to callers with existing tracepoints.
>>>>>>
>>>>>> Add mm_lru_add_drain to report the CPU and lru_add batch count when an
>>>>>> lru_add batch is drained. This allows tracing to distinguish full drains
>>>>>> from partial drains and attribute them to the calling stack.
>>>>>>
>>>>>> Add mm_lru_drain_all_queue to report when lru_add_drain_all() queues
>>>>>> per-CPU drain work. This captures the requester stack and target CPU for
>>>>>> remote drain work. The event is named as a drain-all queue event because
>>>>>> the queued work can be needed for batches other than lru_add.
>>>>>>
>>>>>> Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
>>>>>> ---
>>>>>>  include/trace/events/pagemap.h | 40 ++++++++++++++++++++++++++++++++++
>>>>>>  mm/swap.c                      |  6 ++++-
>>>>>>  2 files changed, 45 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
>>>>>> index 171524d3526d..ea8fc46bedb0 100644
>>>>>> --- a/include/trace/events/pagemap.h
>>>>>> +++ b/include/trace/events/pagemap.h
>>>>>> @@ -77,6 +77,46 @@ TRACE_EVENT(mm_lru_activate,
>>>>>>         TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
>>>>>>  );
>>>>>>
>>>>>> +TRACE_EVENT(mm_lru_add_drain,
>>>>>> +
>>>>>> +       TP_PROTO(int cpu, unsigned int nr),
>>>>>> +
>>>>>> +       TP_ARGS(cpu, nr),
>>>>>> +
>>>>>> +       TP_STRUCT__entry(
>>>>>> +               __field(int,            cpu     )
>>>>>> +               __field(unsigned int,   nr      )
>>>>>> +       ),
>>>>>> +
>>>>>> +       TP_fast_assign(
>>>>>> +               __entry->cpu    = cpu;
>>>>>> +               __entry->nr     = nr;
>>>>>> +       ),
>>>>>> +
>>>>>> +       TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
>>>>>> +);
>>>>>> +
>>>>>> +TRACE_EVENT(mm_lru_drain_all_queue,
>>>>>> +
>>>>>> +       TP_PROTO(int target_cpu, bool force_all_cpus),
>>>>>> +
>>>>>> +       TP_ARGS(target_cpu, force_all_cpus),
>>>>>> +
>>>>>> +       TP_STRUCT__entry(
>>>>>> +               __field(int,    target_cpu      )
>>>>>> +               __field(bool,   force_all_cpus  )
>>>>>> +       ),
>>>>>> +
>>>>>> +       TP_fast_assign(
>>>>>> +               __entry->target_cpu     = target_cpu;
>>>>>> +               __entry->force_all_cpus = force_all_cpus;
>>>>>> +       ),
>>>>>> +
>>>>>> +       TP_printk("target_cpu=%d force_all_cpus=%s",
>>>>>> +               __entry->target_cpu,
>>>>>> +               __entry->force_all_cpus ? "true" : "false")
>>>>>> +);
>>>>>> +
>>>>>>  #endif /* _TRACE_PAGEMAP_H */
>>>>>>
>>>>>>  /* This part must be outside protection */
>>>>>> diff --git a/mm/swap.c b/mm/swap.c
>>>>>> index 588f50d8f1a8..c385b93582eb 100644
>>>>>> --- a/mm/swap.c
>>>>>> +++ b/mm/swap.c
>>>>>> @@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
>>>>>>  {
>>>>>>         struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
>>>>>>         struct folio_batch *fbatch = &fbatches->lru_add;
>>>>>> +       unsigned int nr_folios_add = folio_batch_count(fbatch);
>>>>>>
>>>>>> -       if (folio_batch_count(fbatch))
>>>>>> +       if (nr_folios_add) {
>>>>>>                 folio_batch_move_lru(fbatch, lru_add);
>>>>>> +               trace_mm_lru_add_drain(cpu, nr_folios_add);
>>>>>> +       }
>>>>>>
>>>>>>         fbatch = &fbatches->lru_move_tail;
>>>>>>         /* Disabling interrupts below acts as a compiler barrier. */
>>>>>> @@ -928,6 +931,7 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
>>>>>>                 if (cpu_needs_drain(cpu)) {
>>>>>>                         INIT_WORK(work, lru_add_drain_per_cpu);
>>>>>>                         queue_work_on(cpu, mm_percpu_wq, work);
>>>>>> +                       trace_mm_lru_drain_all_queue(cpu, force_all_cpus);
>>>>>
>>>>> Do you need tracing on each CPU individually, or is tracing the
>>>>> entire __lru_add_drain_all() invocation sufficient?
>>>>
>>>> I think the latter would be fine. The remote work will invoke the
>>>> mm_lru_add_drain tracepoint, which will show up as kworker stacks. Since
>>>> the event already has the CPU, we could see where queued drains actually
>>>> ran.
>>>
>>> Actually if it's just a single invocation and the only event data is the
>>> force flag, a tracepoint may not even be needed. Other probes can be
>>> installed on function invocation and read the single argument. I can
>>> drop this from v2 and keep the single mm_lru_add_drain tracepoint.
>>
>> No we do want to trace the callers requesting to drain from all the CPUs. If you
>> trace just lru_add_drain_cpu() then you will only see that the drain is
>> requested for a given CPU but no information on the requester. 
>>
>> Also as Barry said, I think single trace for whole __lru_add_drain_all() is good
>> enough.
> 
> Right, but couldn't that already be done with fentry or kprobe? If we
> only need the calling stack and the argument value of force_all_cpus I
> don't see a strong need for a dedicated tracepoint.

Never mind that. I see it's declared inline so I'll add a tracepoint and
send v3.



* Re: [PATCH] tracing: reject invalid preemptirq_delay_test CPU affinity
From: Samuel Moelius @ 2026-06-10 19:00 UTC
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, open list:TRACING,
	open list:TRACING
In-Reply-To: <20260609181617.185f1e02@fedora>

On Tue, Jun 9, 2026 at 6:16 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri,  5 Jun 2026 00:40:06 +0000
> Samuel Moelius <sam.moelius@trailofbits.com> wrote:
>
> > preemptirq_delay_test accepts cpu_affinity as a module parameter and,
> > when it is non-negative, writes that CPU directly into a temporary
> > cpumask from the worker thread.  Values outside nr_cpu_ids can set a
> > bit outside the allocated cpumask before the test reports a normal
> > affinity error.
> >
> > Validate the requested CPU before starting the worker thread, and
> > return -EINVAL for invalid affinity requests.
> >
> > Assisted-by: Codex:gpt-5.5-cyber-preview
> > Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
> > ---
> >  kernel/trace/preemptirq_delay_test.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/kernel/trace/preemptirq_delay_test.c b/kernel/trace/preemptirq_delay_test.c
> > index acb0c971a408..0f017799754a 100644
> > --- a/kernel/trace/preemptirq_delay_test.c
> > +++ b/kernel/trace/preemptirq_delay_test.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/kthread.h>
> >  #include <linux/module.h>
> >  #include <linux/printk.h>
> > +#include <linux/cpumask.h>
> >  #include <linux/string.h>
> >  #include <linux/sysfs.h>
> >  #include <linux/completion.h>
> > @@ -152,6 +153,15 @@ static int preemptirq_run_test(void)
> >       struct task_struct *task;
> >       char task_name[50];
> >
> > +     if (cpu_affinity > -1) {
> > +             unsigned int cpu = cpu_affinity;
> > +
> > +             if (cpu >= nr_cpu_ids || !cpu_possible(cpu)) {
> > +                     pr_err("cpu_affinity:%d, invalid CPU\n", cpu_affinity);
> > +                     return -EINVAL;
>
> Just add the check to the preemptirq_delay_run() function where it
> tests affinity. Who cares if it created the thread or not. It's just a
> test.

I am getting ready to travel and I will address this when I return in
about two weeks. Thank you for understanding.


* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-06-10 18:59 UTC
  To: Gregory Price
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <aimSzvoJDrpeQsmM@gourry-fedora-PF4VCD3F>

On 6/10/26 18:37, Gregory Price wrote:
> On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/10/26 12:41, Gregory Price wrote:
>>>
>>> Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
>>> which causes spillage into private nodes because slub allows private
>>> nodes in its mask.  I think this is fixable.
>>>
>>> I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
>>> code, etc), but it seems like fully dropping the FALLBACK entries and
>>> requiring __GFP_THISNODE might be sufficient.
>>
>> Sorry, I haven't been able to follow up so far, and not sure if that's what you
>> are discussing here ...
>>
>> After the LSF/MM session, I was wondering: if we focus on allowing only
>> folio allocations to end up on private memory nodes for now, could the
>> __GFP_THISNODE approach work there?
>>
>> Essentially, disallow any allocations on non-folio paths, and allow folio
>> allocation only with __GFP_THISNODE set.
>>
>> I have to find time to read the other mails in this thread, on my todo list.
>>
>> So sorry if that is precisely what is being discussed here.
>>
> 
> So, I remember this being asked, and I didn't fully grok the request.
> 
> I'm still not sure I fully understand the question, so apologies if I'm
> answering the wrong things here.
> 
> I understand this question in two ways:
> 
>   1) Can we disallow PAGE allocation and limit this to FOLIO allocation

Yes: can we allow only folios to be allocated from private memory nodes? So let
me reply to that one below.

>   2) Can we disallow [Feature] (i.e. slab) allocation targeting the node.
> 
> 
> 1) Can we disallow page allocation and limit this to folios?
> 
> No, I don't think so.
> 
> Folio allocations are written in terms of page allocations, we would
> have to rewrite folio allocation interfaces and introduce a bunch of
> boilerplate for the sake of this.
> 
> struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>                 int preferred_nid, nodemask_t *nodemask)
> {
>         struct page *page;
> 
>         page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
>         if (page)
>                 set_page_refcounted(page);
>         return page;
> }
> 
> struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
>                 nodemask_t *nodemask)
> {
>         struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
>                                         preferred_nid, nodemask);
> 	return page_rmappable_folio(page);
> }

At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
context might be better. I think there was also talk about how the memalloc_*
interface might be a better way forward. Maybe we would start giving the
allocator more context ("we are allocating a folio").

The following is incomplete (esp. hugetlb stuff I assume), just as some idea:

From 64aaff5f40497201ecc089c3339df6576184c433 Mon Sep 17 00:00:00 2001
From: "David Hildenbrand (Arm)" <david@kernel.org>
Date: Wed, 10 Jun 2026 20:55:49 +0200
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
---
 include/linux/sched.h    |  2 +-
 include/linux/sched/mm.h | 11 +++++++++++
 mm/mempolicy.c           | 14 ++++++++++++--
 mm/page_alloc.c          |  7 ++++++-
 4 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee06cba5c6f5..9c850b7be6bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1778,7 +1778,7 @@ extern struct pid *cad_pid;
 						 * I am cleaning dirty pages from some other bdi. */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
-#define PF__HOLE__00800000	0x00800000
+#define PF_MEMALLOC_FOLIO	0x00800000	/* Allocating a folio that can end up on private memory nodes */
 #define PF__HOLE__01000000	0x01000000
 #define PF__HOLE__02000000	0x02000000
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 95d0040df584..2101a447c084 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -471,6 +471,17 @@ static inline void memalloc_pin_restore(unsigned int flags)
 	memalloc_flags_restore(flags);
 }

+static inline unsigned int memalloc_folio_save(void)
+{
+	return memalloc_flags_save(PF_MEMALLOC_FOLIO);
+}
+
+static inline void memalloc_folio_restore(unsigned int flags)
+{
+	memalloc_flags_restore(flags);
+}
+
+
 #ifdef CONFIG_MEMCG
 DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
 /**
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36699fabd3c2..a78b0e5a1fce 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2506,8 +2506,13 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
 		struct mempolicy *pol, pgoff_t ilx, int nid)
 {
-	struct page *page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
+	struct page *page;
+	unsigned int flags;
+
+	flags = memalloc_folio_save();
+	page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
 			ilx, nid);
+	memalloc_folio_restore(flags);
 	if (!page)
 		return NULL;

@@ -2588,7 +2593,12 @@ EXPORT_SYMBOL(alloc_pages_noprof);

 struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
 {
-	return page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
+	struct folio *folio;
+	unsigned int flags;
+
+	flags = memalloc_folio_save();
+	folio = page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
+	memalloc_folio_restore(flags);
+	return folio;
 }
 EXPORT_SYMBOL(folio_alloc_noprof);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee902a468c2f..37434b37f7af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5345,8 +5345,13 @@ EXPORT_SYMBOL(__alloc_pages_noprof);
 struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
 		nodemask_t *nodemask)
 {
-	struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
+	struct page *page;
+	unsigned int flags;
+
+	flags = memalloc_folio_save();
+	page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
 					preferred_nid, nodemask);
+	memalloc_folio_restore(flags);
 	return page_rmappable_folio(page);
 }
 EXPORT_SYMBOL(__folio_alloc_noprof);
-- 
2.43.0


-- 
Cheers,

David


* Re: [PATCH] mm/lruvec: trace LRU add drains and drain-all queuing
From: JP Kobryn @ 2026-06-10 18:54 UTC
  To: Shakeel Butt
  Cc: Barry Song, linux-mm, willy, usama.arif, akpm, vbabka, mhocko,
	rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <aii7WWFmAyoXn9rk@linux.dev>

On 6/9/26 6:21 PM, Shakeel Butt wrote:
> On Tue, Jun 09, 2026 at 05:16:15PM -0700, JP Kobryn wrote:
>> On 6/9/26 5:07 PM, JP Kobryn wrote:
>>> On 6/9/26 12:44 AM, Barry Song wrote:
>>>> On Tue, Jun 9, 2026 at 12:12 PM JP Kobryn <jp.kobryn@linux.dev> wrote:
>>>>>
>>>>> LRU add batches can be drained before they reach capacity. This can be a
>>>>> source of LRU lock contention, but it is not currently possible to
>>>>> attribute these drains to callers with existing tracepoints.
>>>>>
>>>>> Add mm_lru_add_drain to report the CPU and lru_add batch count when an
>>>>> lru_add batch is drained. This allows tracing to distinguish full drains
>>>>> from partial drains and attribute them to the calling stack.
>>>>>
>>>>> Add mm_lru_drain_all_queue to report when lru_add_drain_all() queues
>>>>> per-CPU drain work. This captures the requester stack and target CPU for
>>>>> remote drain work. The event is named as a drain-all queue event because
>>>>> the queued work can be needed for batches other than lru_add.
>>>>>
>>>>> Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
>>>>> ---
>>>>>  include/trace/events/pagemap.h | 40 ++++++++++++++++++++++++++++++++++
>>>>>  mm/swap.c                      |  6 ++++-
>>>>>  2 files changed, 45 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
>>>>> index 171524d3526d..ea8fc46bedb0 100644
>>>>> --- a/include/trace/events/pagemap.h
>>>>> +++ b/include/trace/events/pagemap.h
>>>>> @@ -77,6 +77,46 @@ TRACE_EVENT(mm_lru_activate,
>>>>>         TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
>>>>>  );
>>>>>
>>>>> +TRACE_EVENT(mm_lru_add_drain,
>>>>> +
>>>>> +       TP_PROTO(int cpu, unsigned int nr),
>>>>> +
>>>>> +       TP_ARGS(cpu, nr),
>>>>> +
>>>>> +       TP_STRUCT__entry(
>>>>> +               __field(int,            cpu     )
>>>>> +               __field(unsigned int,   nr      )
>>>>> +       ),
>>>>> +
>>>>> +       TP_fast_assign(
>>>>> +               __entry->cpu    = cpu;
>>>>> +               __entry->nr     = nr;
>>>>> +       ),
>>>>> +
>>>>> +       TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
>>>>> +);
>>>>> +
>>>>> +TRACE_EVENT(mm_lru_drain_all_queue,
>>>>> +
>>>>> +       TP_PROTO(int target_cpu, bool force_all_cpus),
>>>>> +
>>>>> +       TP_ARGS(target_cpu, force_all_cpus),
>>>>> +
>>>>> +       TP_STRUCT__entry(
>>>>> +               __field(int,    target_cpu      )
>>>>> +               __field(bool,   force_all_cpus  )
>>>>> +       ),
>>>>> +
>>>>> +       TP_fast_assign(
>>>>> +               __entry->target_cpu     = target_cpu;
>>>>> +               __entry->force_all_cpus = force_all_cpus;
>>>>> +       ),
>>>>> +
>>>>> +       TP_printk("target_cpu=%d force_all_cpus=%s",
>>>>> +               __entry->target_cpu,
>>>>> +               __entry->force_all_cpus ? "true" : "false")
>>>>> +);
>>>>> +
>>>>>  #endif /* _TRACE_PAGEMAP_H */
>>>>>
>>>>>  /* This part must be outside protection */
>>>>> diff --git a/mm/swap.c b/mm/swap.c
>>>>> index 588f50d8f1a8..c385b93582eb 100644
>>>>> --- a/mm/swap.c
>>>>> +++ b/mm/swap.c
>>>>> @@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
>>>>>  {
>>>>>         struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
>>>>>         struct folio_batch *fbatch = &fbatches->lru_add;
>>>>> +       unsigned int nr_folios_add = folio_batch_count(fbatch);
>>>>>
>>>>> -       if (folio_batch_count(fbatch))
>>>>> +       if (nr_folios_add) {
>>>>>                 folio_batch_move_lru(fbatch, lru_add);
>>>>> +               trace_mm_lru_add_drain(cpu, nr_folios_add);
>>>>> +       }
>>>>>
>>>>>         fbatch = &fbatches->lru_move_tail;
>>>>>         /* Disabling interrupts below acts as a compiler barrier. */
>>>>> @@ -928,6 +931,7 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
>>>>>                 if (cpu_needs_drain(cpu)) {
>>>>>                         INIT_WORK(work, lru_add_drain_per_cpu);
>>>>>                         queue_work_on(cpu, mm_percpu_wq, work);
>>>>> +                       trace_mm_lru_drain_all_queue(cpu, force_all_cpus);
>>>>
>>>> Do you need tracing on each CPU individually, or is tracing the
>>>> entire __lru_add_drain_all() invocation sufficient?
>>>
>>> I think the latter would be fine. The remote work will invoke the
>>> mm_lru_add_drain tracepoint, which will show up as kworker stacks. Since
>>> the event already has the CPU, we could see where queued drains actually
>>> ran.
>>
>> Actually if it's just a single invocation and the only event data is the
>> force flag, a tracepoint may not even be needed. Other probes can be
>> installed on function invocation and read the single argument. I can
>> drop this from v2 and keep the single mm_lru_add_drain tracepoint.
> 
> No we do want to trace the callers requesting to drain from all the CPUs. If you
> trace just lru_add_drain_cpu() then you will only see that the drain is
> requested for a given CPU but no information on the requester. 
> 
> Also as Barry said, I think single trace for whole __lru_add_drain_all() is good
> enough.

Right, but couldn't that already be done with fentry or kprobe? If we
only need the calling stack and the argument value of force_all_cpus I
don't see a strong need for a dedicated tracepoint.


* Re: [PATCHv4 05/13] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Andrii Nakryiko @ 2026-06-10 18:02 UTC
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <aikd6-5HYhCnc0Ze@krava>

On Wed, Jun 10, 2026 at 1:18 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Tue, Jun 09, 2026 at 09:43:15AM -0700, Andrii Nakryiko wrote:
> > On Tue, Jun 9, 2026 at 4:44 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Mon, Jun 08, 2026 at 01:46:39PM -0700, Andrii Nakryiko wrote:
> > > > On Tue, May 26, 2026 at 1:59 PM Jiri Olsa <jolsa@kernel.org> wrote:
> > > > >
> > > > > > Andrii reported an issue with optimized uprobes [1]: the call
> > > > > > instruction stores its return address on the stack and can clobber
> > > > > > the redzone area, where user code may keep temporary data without
> > > > > > adjusting rsp.
> > > > > >
> > > > > > Fix this by moving the optimized uprobes on top of a 10-byte nop
> > > > > > instruction, so we can squeeze in another instruction to escape the
> > > > > > redzone area before doing the call, like:
> > > > >
> > > > >   lea -0x80(%rsp), %rsp
> > > > >   call tramp
> > > > >
> > > > > Note the lea instruction is used to adjust the rsp register without
> > > > > changing the flags.
> > > > >
> > > > > We use nop10 and following transformation to optimized instructions
> > > > > above and back as suggested by Peterz [2].
> > > > >
> > > > > Optimize path (int3_update_optimize):
> > > > >
> > > > >   1) Initial state after set_swbp() installed the uprobe:
> > > > >       cc 2e 0f 1f 84 00 00 00 00 00
> > > > >
> > > > >      From offset 0 this is INT3 followed by the tail of the original
> > > > >      10-byte NOP.
> > > > >
> > > > >      After a previous unoptimization bytes 5..9 may still contain the
> > > > >      old call instruction, which remains valid for threads already there.
> > > > >
> > > > >   2) Rewrite the LEA tail and call displacement:
> > > > >       cc [8d 64 24 80 e8 d0 d1 d2 d3]
> > > > >
> > > > >      From offset 0 this traps on the uprobe INT3.  Bytes 1..9 are not
> > > > >      executable entry points while byte 0 is trapped.
> > > > >
> > > > >   3) Publish the first LEA byte:
> > > > >       [48] 8d 64 24 80 e8 d0 d1 d2 d3
> > > > >
> > > > >      From offset 0 this is:
> > > > >         lea -0x80(%rsp), %rsp
> > > > >         call <uprobe-trampoline>
> > > > >
> > > > > Unoptimize path (int3_update_unoptimize):
> > > > >
> > > > >   1) Initial optimized state:
> > > > >       48 8d 64 24 80 e8 d0 d1 d2 d3
> > > > >      Same as 3) above.
> > > > >
> > > > >   2) Trap new entries before restoring the NOP bytes:
> > > > >       [cc] 8d 64 24 80 e8 d0 d1 d2 d3
> > > > >
> > > > >      From offset 0 this traps. A thread that had already executed the
> > > > >      LEA can still reach the intact CALL at offset 5.
> > > > >
> > > > >   3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > > > >      and byte 5 as CALL.
> > > > >       cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> > > > >
> > > > >      From offset 0 this still traps. Offset 5 is still the CALL for any
> > > > >      thread that was already past the first LEA byte.
> > > > >
> > > > >   4) Publish the first byte of the original NOP:
> > > > >       [66] 2e 0f 1f 84 e8 d0 d1 d2 d3
> > > > >
> > > > >      From offset 0 this is the restored 10-byte NOP; the CALL opcode and
> > > > >      displacement are now only NOP operands.  Offset 5 still decodes as
> > > > >      CALL for a thread that was already there.
> > > > >
> > > > >      Tthere is only a single target uprobe-trampoline for the given nop10
> > > > >      instruction address, so the CALL instruction will not be changed across
> > > > >      unoptimization/optimization cycles.
> > > > >      Therefore, any task that is preempted at the CALL instruction is guaranteed
> > > > >      to observe that CALL and not anything else.
> > > > >
> > > > > Note as explained in [2] we need to use following nop10:
> > > > >        PF1   PF2   ESC   NOPL  MOD   SIB   DISP32
> > > > > NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
> > > > >
> > > > > which means we need to allow 0x2e prefix which maps to INAT_PFX_CS
> > > > > attribute in is_prefix_bad function.
> > > > >
> > > > > Also changing the uprobe syscall error when called out of uprobe
> > > > > trampoline to -EPROTO, so we are able to detect the fixed kernel.
> > > > >
> > > > > The optimized uprobe performance stays the same:
> > > > >
> > > > >         uprobe-nop     :    3.129 ± 0.013M/s
> > > > >         uprobe-push    :    3.045 ± 0.006M/s
> > > > >         uprobe-ret     :    1.095 ± 0.004M/s
> > > > >   -->   uprobe-nop10   :    7.170 ± 0.020M/s
> > > > >         uretprobe-nop  :    2.143 ± 0.021M/s
> > > > >         uretprobe-push :    2.090 ± 0.000M/s
> > > > >         uretprobe-ret  :    0.942 ± 0.000M/s
> > > > >   -->   uretprobe-nop10:    3.381 ± 0.003M/s
> > > > >         usdt-nop       :    3.245 ± 0.004M/s
> > > > >   -->   usdt-nop10     :    7.256 ± 0.023M/s
> > > > >
> > > > > [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > > > > [2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
> > > > > Reported-by: Andrii Nakryiko <andrii@kernel.org>
> > > > > Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > > > > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > > > > Assisted-by: Codex:GPT-5.5
> > > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > > ---
> > > > >  arch/x86/kernel/uprobes.c | 255 ++++++++++++++++++++++++++++----------
> > > > >  1 file changed, 190 insertions(+), 65 deletions(-)
> > > > >
> > > >
> > > > [...]
> > > >
> > > > > @@ -943,13 +1026,31 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > > > >         smp_text_poke_sync_each_cpu();
> > > > >
> > > > >         /*
> > > > > -        * Write first byte.
> > > > > +        * 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > > > > +        *    and byte 5 as CALL:
> > > > > +        *    cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> > > > > +        */
> > > > > +       ctx.expect = EXPECT_SWBP_OPTIMIZED;
> > > > > +       err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1,
> > > > > +                          LEA_INSN_SIZE - 1, verify_insn,
> > > > > +                          true /* is_register */, false /* do_update_ref_ctr */,
> > > >
> > > > tbh, it's quite subtle and non-obvious why is_register should be set
> > > > to true first two times (and especially that is_register and
> > > > do_update_ref_ctr are implicitly connected), not sure how to make it
> > > > cleaner, but maybe leave a short comment explaining this twice
> > > > register, once unregister sequence?
> > >
> > > ok, I came up with comment below
> > >
> > > thanks,
> > > jirka
> > >
> > >
> > > ---
> > > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > > index de544516ea70..92449f34c005 100644
> > > --- a/arch/x86/kernel/uprobes.c
> > > +++ b/arch/x86/kernel/uprobes.c
> > > @@ -1011,6 +1011,12 @@ static int int3_update_unoptimize(struct arch_uprobe *auprobe, struct vm_area_st
> > >         int err;
> > >
> > >         /*
> > > +        * Note the first two uprobe_write calls use is_register=true, because they
> > > +        * are intermediate patching states while the probe is still active.
> >
> > this doesn't really explain why is_register=true is the right one. It
> > actually doesn't matter as long as do_update_ref_ctr=true, isn't that
> > right? So maybe just to avoid a bit of confusion let's pass
> > is_register=false and do_update_ref_ctr=false, and in the comment
> > explain as you said that it's intermediate update and we don't want to
> > update refctr just yet until the very last step?
>
> apart from refctr update there's also different way the concerned
> page is managed, IIUC:
>
> with is_register=true we force to get exclusive anonymous page for
> the update (or pin the existing one)
>
> with is_register=false we try to zap the private anonymous page and
> return the mapping to the original page
>
> there are several comments on this in uprobe_write/__uprobe_write
>
> how about the update below
>
> jirka
>
>
> ---
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index de544516ea70..09f5ff71227c 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -1011,6 +1011,16 @@ static int int3_update_unoptimize(struct arch_uprobe *auprobe, struct vm_area_st
>         int err;
>
>         /*
> +        * Note the first two uprobe_write calls use is_register=true, because they
> +        * are intermediate patching states while the probe is still active, so
> +        * we force the exclusive anonymous page for the update.
> +        * Also we use do_update_ref_ctr=false because refctr was already updated by
> +        * the initial int3 install.
> +        *
> +        * The last uprobe_write to nop10 instruction is called with is_register=false
> +        * and do_update_ref_ctr=true to trigger the refctr update and to instruct
> +        * uprobe_write to zap the anonymous page if it now matches the file page.
> +        *

lgtm!

>          * 1) Initial optimized state:
>          *    48 8d 64 24 80 e8 d0 d1 d2 d3
>          *
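
The optimize and unoptimize byte sequences quoted in the commit message can be replayed with a small user-space model. This is only a sketch: the `write`, `optimize` and `unoptimize` helpers are hypothetical names, the `d0 d1 d2 d3` displacement is the placeholder used in the commit message, and nothing here touches real text pages or models cross-CPU synchronization.

```python
# Toy model of the 10-byte patch site; not kernel code.
NOP10 = bytes.fromhex("662e0f1f840000000000")  # cs nopw 0x0(%rax,%rax,1)
LEA = bytes.fromhex("488d642480")              # lea -0x80(%rsp), %rsp
CALL = bytes.fromhex("e8d0d1d2d3")             # call rel32, placeholder displacement
INT3 = 0xCC


def write(site, offset, data):
    """Return a copy of site with data written at offset (one patching step)."""
    out = bytearray(site)
    out[offset:offset + len(data)] = data
    return bytes(out)


def optimize(site):
    """Yield every state of the optimize path, starting from the INT3 site."""
    assert site[0] == INT3
    yield site                                  # 1) INT3 + tail of the NOP
    site = write(site, 1, (LEA + CALL)[1:])     # 2) LEA tail and CALL behind INT3
    yield site
    site = write(site, 0, LEA[:1])              # 3) publish the first LEA byte
    yield site


def unoptimize(site):
    """Yield every state of the unoptimize path, starting from LEA + CALL."""
    assert site == LEA + CALL
    yield site                                  # 1) optimized state
    site = write(site, 0, bytes([INT3]))        # 2) trap new entries
    yield site
    site = write(site, 1, NOP10[1:5])           # 3) restore NOP bytes 1..4
    yield site
    site = write(site, 0, NOP10[:1])            # 4) publish the first NOP byte
    yield site


swbp = write(NOP10, 0, bytes([INT3]))
opt_states = list(optimize(swbp))
unopt_states = list(unoptimize(opt_states[-1]))
for state in unopt_states:
    print(state.hex(" "))
```

In this model byte 0 is INT3 in every intermediate optimize state, and bytes 5..9 keep the CALL through the whole unoptimize path, which is the property the commit message relies on for threads already past the LEA.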


* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-06-10 17:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <aiMVLtblIKu1DQWJ@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Thu, Jun 04, 2026, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>> >> + KVM: selftests: Test conversion with elevated page refcount
>> >>     + Askar pointed out that soon vmsplice may not pin pages. Should I
>> >>       pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
>> >>       take a dependency on CONFIG_GUP_TEST.
>> >
>> > I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
>> > it probably is the least awful choice.  E.g. KVM also pins pages in certain flows,
>> > but we're _also_ actively working to remove the need to pin.
>> >
>> > Hmm, maybe IORING_REGISTER_PBUF_RING?  AFAICT, it's almost literally a "pin user
>> > memory" syscall.
>> >
>>
>> Hmm that takes a dependency on io_uring, which isn't always compiled
>> in. Between CONFIG_IO_URING and CONFIG_GUP_TEST, I'd rather
>> CONFIG_GUP_TEST.
>
> Or try both?  If it's not a ridiculous amount of work.

CONFIG_GUP_TEST was tried in [1]

[1] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/

It looks like this

  static void pin_pages(void *vaddr, uint64_t size)
  {
  	const struct pin_longterm_test args = {
  		.addr = (uint64_t)vaddr,
  		.size = size,
  		.flags = PIN_LONGTERM_TEST_FLAG_USE_WRITE,
  	};

  	gup_test_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
  	TEST_REQUIRE(gup_test_fd > 0);

  	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_START, &args), 0);
  }

  static void unpin_pages(void)
  {
  	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_STOP), 0);
  }

So in the test I'll call pin_pages(), then try to convert, see that it
fails with EAGAIN and reports the expected error_offset, then I call
unpin_pages(), then I convert again and expect success.
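
The pin / convert / unpin / convert flow can be sketched against a mock object. Everything below is hypothetical: `FakeGuestMemfd` is not a KVM or selftest API, and the error offset is modelled as a page index purely for illustration.

```python
import errno


class FakeGuestMemfd:
    """Hypothetical stand-in: conversion fails while any page in range is pinned."""

    def __init__(self, nr_pages):
        self.nr_pages = nr_pages
        self.pinned = set()

    def pin(self, page):
        self.pinned.add(page)

    def unpin_all(self):
        self.pinned.clear()

    def convert(self, start, nr):
        """Return (0, None) on success or (-EAGAIN, first pinned page index)."""
        for page in range(start, start + nr):
            if page in self.pinned:
                return -errno.EAGAIN, page
        return 0, None


gmem = FakeGuestMemfd(nr_pages=4)
gmem.pin(2)
first_try = gmem.convert(0, 4)    # expected to fail while page 2 is pinned
gmem.unpin_all()
second_try = gmem.convert(0, 4)   # expected to succeed after unpinning
```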

Are you uncomfortable with the CONFIG_GUP_TEST interface? What would you
like me to try with CONFIG_IO_URING? I'm thinking that the main
difference between the two is just down to which non-default CONFIG
option we want to take for guest_memfd tests.


* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-10 16:37 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <c1b66e7a-bb95-4295-8193-55ceadaaa578@kernel.org>

On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
> On 6/10/26 12:41, Gregory Price wrote:
> > On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > 
> > Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
> > which causes spillage into private nodes because slub allows private
> > nodes in its mask.  I think this is fixable.
> > 
> > I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
> > code, etc), but it seems like fully dropping the FALLBACK entries and
> > requiring __GFP_THISNODE might be sufficient.
> 
> Sorry, I haven't been able to follow up so far, and not sure if that's what you
> are discussing here ...
> 
> After the LSF/MM session, I was wondering, whether if we focus on allowing only
> folios allocations to end up on private memory nodes for now: could the
> __GFP_THISNODE approach work there?
> 
> Essentially, disallow any allocations on non-folio paths, and allow folio
> allocation only with __GFP_THISNODE set.
> 
> I have to find time to read the other mails in this thread, on my todo list.
> 
> So sorry if that is precisely what is being discussed here.
> 

So, I remember this being asked, and I didn't fully grok the request.

I'm still not sure I fully understand the question, so apologies if I'm
answering the wrong thing here.

I understand this question in two ways:

  1) Can we disallow PAGE allocation and limit this to FOLIO allocation
  2) Can we disallow [Feature] (i.e. slab) allocation targeting the node.

1) Can we disallow page allocation and limit this to folios?

No, I don't think so.

Folio allocations are written in terms of page allocations, we would
have to rewrite folio allocation interfaces and introduce a bunch of
boilerplate for the sake of this.

struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
                int preferred_nid, nodemask_t *nodemask)
{
        struct page *page;

        page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
        if (page)
                set_page_refcounted(page);
        return page;
}

struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
                nodemask_t *nodemask)
{
        struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
                                        preferred_nid, nodemask);
        return page_rmappable_folio(page);
}

At the end of the day, this all reduces to `get_pages_from_freelist`,
and at that level we don't really care about folio vs page.

__GFP_COMP is insufficient to differentiate between a non-folio compound
page and a folio, and __GFP_COMP is passed into __alloc_pages_*
interfaces all over the kernel.

Trying to detach these paths seems like a horrible rat's nest /
not feasible / will create a lot of boilerplate for little value.

(I did not fully understand this request when it was asked, I do
 not fully understand this request now, please let me know if I
 have misunderstood what you were asking).

2) Can we disallow SLAB allocation.

Yeah, but I think a better question is whether there's a difference
between alloc_pages_node() and kmalloc_node() when it all just sinks
to the same fundamental code in mm/page_alloc.c

Maybe there's an argument for something like NP_OPT_KMALLOC (allow slab
allocations on the private node w/ __GFP_THISNODE)

On my current set, I don't implement any explicit filtering at all in
mm/page_alloc.c - the filtering is a function of the nodes not being
present in the FALLBACK list and only having a NOFALLBACK list.

What __GFP_THISNODE actually does under the hood is just switch
which zone list (FALLBACK vs NOFALLBACK) is used for the target node.

For isolation w/o __GFP_PRIVATE, we're removing N_MEMORY_PRIVATE nodes
from *their own FALLBACK* list and only adding them to their NOFALLBACK
list.  That means to reach a private node you MUST use __GFP_THISNODE.

I realize this is confusing, but essentially we don't have to modify
mm/page_alloc.c to get the __GFP_THISNODE filtering, we get this from
the fallback/nofallback list construction.
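
As a rough illustration of the list construction described above, here is a toy model in which private nodes are left out of every FALLBACK list (including their own) and only appear in their own NOFALLBACK list. The names `build_zonelists` and `alloc`, the node-id ordering, and the per-node free counters are assumptions for the sketch, not the kernel's data structures.

```python
def build_zonelists(nodes, private):
    """Return {nid: {"FALLBACK": [...], "NOFALLBACK": [...]}} for a toy system."""
    zonelists = {}
    for nid in nodes:
        # Toy ordering: local node first, then the rest by id.
        order = [nid] + [n for n in nodes if n != nid]
        zonelists[nid] = {
            # Private nodes are dropped from every FALLBACK list.
            "FALLBACK": [n for n in order if n not in private],
            "NOFALLBACK": [nid],
        }
    return zonelists


def alloc(zonelists, free, nid, thisnode=False):
    """Pick the list by the THISNODE flag; return the first node with free pages."""
    which = "NOFALLBACK" if thisnode else "FALLBACK"
    for candidate in zonelists[nid][which]:
        if free.get(candidate, 0) > 0:
            free[candidate] -= 1
            return candidate
    return None


zl = build_zonelists(nodes=[0, 1, 2], private={2})
free = {0: 1, 1: 1, 2: 4}
print(alloc(zl, free, nid=2))                 # served by a normal node
print(alloc(zl, free, nid=2, thisnode=True))  # reaches the private node
```

No filtering happens inside `alloc` itself; the isolation falls out of which nodes are present in which list, which is the point being made about mm/page_alloc.c.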

Ok, so how does this play out in practice - and why do I call this
filtering mechanism fragile?

Consider kmalloc_node() and ___slab_alloc():

kmalloc_node(...)
  └─ ___slab_alloc()     mm/slub.c:4406   pc.flags |= __GFP_THISNODE
      └─ new_slab(s, pc.flags, node)
          └─ allocate_slab(s, flags, node)
              └─ alloc_slab_page(flags, node, oo, …)
                  └─ __alloc_frozen_pages(flags, order, node, NULL);

Slab silently upgrades the page allocator flags here to include
__GFP_THISNODE - even if the user didn't request that behavior.

This is exactly the kind of "spillage" I said was hard to police at LSF.

Without __GFP_PRIVATE, we have to keep an eye on what around the kernel
is using __GFP_THISNODE and how.

For mm/slub.c we can choose to do one of two things:

  1) 100% refuse slab allocations on private nodes, i.e.:

     kmalloc_node(..., private_nid, __GFP_THISNODE)

     will fail (return NULL).

  or

  2) Do not upgrade private-node slab requests w/ __GFP_THISNODE

     This allows kmalloc_node() to work the same as folio_alloc()
     or alloc_pages() interfaces (__GFP_THISNODE is the key), with
     the understanding that any __GFP_THISNODE user may get access.

We can opt these nodes into slab/kmalloc with a NP_OPT_SLAB
if the owner wants kmalloc_node(), with the understanding that any
caller using __GFP_THISNODE may get access.

That's the kind of fragility I was trying to avoid.
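
The spillage and the two slab options can be compared in a toy model. The flag value, `FALLBACK_NODE`, and the `policy` names are hypothetical and exist only for this sketch; the real decision would live in mm/slub.c.

```python
THISNODE = 0x1       # stand-in bit for __GFP_THISNODE in this toy model
FALLBACK_NODE = 0    # assumed normal node serving requests that cannot go private


def page_alloc(nid, flags, private):
    """Return the node that serves the request in this model."""
    if nid in private and not (flags & THISNODE):
        # The private node is absent from its own FALLBACK list.
        return FALLBACK_NODE
    return nid


def slab_alloc(nid, flags, private, policy="upgrade"):
    """Model three slab behaviours for a node-targeted request.

    "upgrade":     always OR in THISNODE (the behaviour described above)
    "refuse":      option 1, never serve slab requests from private nodes
    "passthrough": option 2, leave the caller's flags alone on private nodes
    """
    if nid in private:
        if policy == "refuse":
            return None
        if policy == "passthrough":
            return page_alloc(nid, flags, private)
    return page_alloc(nid, flags | THISNODE, private)


private = {2}
print(slab_alloc(2, 0, private))                        # spills onto the private node
print(slab_alloc(2, 0, private, policy="passthrough"))  # stays on a normal node
```

With "passthrough", a caller still reaches the private node by passing the flag explicitly, which is the remaining exposure described above.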

That said, in practice, I have found that basic kernel operations don't
generally use kmalloc_node() w/ __GFP_THISNODE - there's just
nothing to prevent anyone from doing so.

So this seems promising...
And then there's arch/powerpc/platforms/powernv/memtrace.c:

static u64 memtrace_alloc_node(u32 nid, u64 size)
{
	... snip ...
        page = alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_THISNODE |
                                  __GFP_NOWARN | __GFP_ZERO, nid, NULL);
	... snip ...
}

static int memtrace_init_regions_runtime(u64 size)
{
	... snip ...
        for_each_online_node(nid) {
                m = memtrace_alloc_node(nid, size);
	... snip ...
}

static int memtrace_enable_set(void *data, u64 val)
{
	... snip ...
        if (memtrace_init_regions_runtime(val))
                goto out_unlock;
	... snip ...
}

This is the *exact* pattern I said would be hard to police - and it
doesn't look like a bug, just not informed that private nodes exist.

This is why I'm concerned with trying to depend on __GFP_THISNODE as the
filtering function.

That said, the number of __GFP_THISNODE users is very limited
kernel-wide, so maybe that's an acceptable maintenance burden?

~Gregory


* Re: [PATCH 1/2] ring-buffer: Fix event length with forced 8-byte alignment
From: Steven Rostedt @ 2026-06-10 16:17 UTC (permalink / raw)
  To: Hui Wang
  Cc: Masami Hiramatsu (Google), mathieu.desnoyers, pjw,
	linux-trace-kernel, shuah, wangfushuai, linux-kselftest
In-Reply-To: <ea9d00cb-54c6-4635-aa13-e5a688375132@canonical.com>

On Tue, 9 Jun 2026 12:22:47 +0800
Hui Wang <hui.wang@canonical.com> wrote:

> Thanks for the pointer. I reverted my two patches and applied the patch 
> you referenced, but unfortunately it doesn't resolve the problem — the 
> testcase still fails in my environment (riscv64 kernel with 
> CONFIG_HAVE_64BIT_ALIGNED_ACCESS enabled).
> 
>  From what I can tell, that fix addresses a different problem than the 
> one I'm hitting: it targets a 64K page-size issue, whereas my failure is 
> caused by the 64-bit alignment requirement 
> (CONFIG_HAVE_64BIT_ALIGNED_ACCESS). So I don't think they're the same 
> root cause.
> 
> So can you please take a look at them again.

OK, taking a deeper look at it, and yes, you are correct. Sorry for
jumping to the conclusion with thinking this was the same issue as what
was brought up before.

I'll take these.

Thanks,

-- Steve


* Re: [PATCHv7 bpf-next 03/29] ftrace: Add add_ftrace_hash_entry function
From: Alexei Starovoitov @ 2026-06-10 15:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Kumar Kartikeya Dwivedi, Jiri Olsa, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, bpf, linux-trace-kernel,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Menglong Dong
In-Reply-To: <20260610113536.77172ad1@robin>

On Wed, Jun 10, 2026 at 8:35 AM Steven Rostedt <rostedt@kernel.org> wrote:
>
> On Tue, 09 Jun 2026 16:43:19 +0200
> "Kumar Kartikeya Dwivedi" <memxor@gmail.com> wrote:
>
> > Hi Steven,
> > Version 8 of this set was already applied to bpf-next.
> >
> > https://lore.kernel.org/bpf/178085644764.273544.8250000589480262551.git-patchwork-notify@kernel.org
>
> It should have waited for my review of the first three patches though.
> I like to run them through my tests before giving the OK. As they are
> generic changes to my code.
>
> They are trivial changes, but regardless, someone should have asked.

If my memory doesn't fail me you said it's fine during v1,v2 iterations.
The last v3 - v8 you were silent, so we assumed you're still fine.

While at it, please review Mykyta's set:
https://patchwork.kernel.org/user/todo/netdevbpf/?series=1096695

It's also been pending for almost a month now.


* Re: [PATCHv7 bpf-next 03/29] ftrace: Add add_ftrace_hash_entry function
From: Steven Rostedt @ 2026-06-10 15:35 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	bpf, linux-trace-kernel, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, Menglong Dong
In-Reply-To: <DJ4LJUWPD2BF.1TY4Z3WY2V0H5@gmail.com>

On Tue, 09 Jun 2026 16:43:19 +0200
"Kumar Kartikeya Dwivedi" <memxor@gmail.com> wrote:

> Hi Steven,
> Version 8 of this set was already applied to bpf-next.
> 
> https://lore.kernel.org/bpf/178085644764.273544.8250000589480262551.git-patchwork-notify@kernel.org

It should have waited for my review of the first three patches though.
I like to run them through my tests before giving the OK. As they are
generic changes to my code.

They are trivial changes, but regardless, someone should have asked.

-- Steve


* Re: [PATCH v3 12/13] verification/rvgen: Remove the old state variables
From: Gabriele Monaco @ 2026-06-10 15:06 UTC (permalink / raw)
  To: Nam Cao
  Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
	linux-kernel
In-Reply-To: <b16f71f3834df7c18ae915071be3709ee8513443.1780908661.git.namcao@linutronix.de>

On Mon, 2026-06-08 at 10:57 +0200, Nam Cao wrote:
> The state variables (states, initial_state, final_states) only
> capture the
> states' names and have less information than their Lark-based
> counterparts.
> 
> Switch to use the new state variables and delete these old ones.
> 
> Signed-off-by: Nam Cao <namcao@linutronix.de>

Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
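
For readers skimming the diff, the effect on callers can be sketched with a stand-in state class. `State` below is hypothetical (the real Lark-based object carries more than a name and an invariant string), and the `_enum` suffix is just an example value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class State:
    """Hypothetical stand-in for the Lark-based state object."""
    name: str
    inv: Optional[str] = None


states = [State("init"), State("running", inv="clk < 10"), State("done")]
initial_state = states[0]

# Callers that used to receive plain strings now go through .name.
enum_lines = [f"\t{initial_state.name}_enum,"]
enum_lines += [f"\t{s.name}_enum," for s in states if s != initial_state]

# len(max(states, key=len)) no longer works on objects; compare name lengths.
max_len = max(len(s.name) for s in states)

# The invariant is available on the same objects, so no second list is needed.
has_invariant = any(s.inv for s in states)
```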

> ---
>  tools/verification/rvgen/rvgen/automata.py |  9 ++++-----
>  tools/verification/rvgen/rvgen/dot2c.py    | 10 +++++-----
>  tools/verification/rvgen/rvgen/dot2k.py    |  8 ++++----
>  3 files changed, 13 insertions(+), 14 deletions(-)
> 
> diff --git a/tools/verification/rvgen/rvgen/automata.py
> b/tools/verification/rvgen/rvgen/automata.py
> index 4c302f5cba68..a3be327c2a73 100644
> --- a/tools/verification/rvgen/rvgen/automata.py
> +++ b/tools/verification/rvgen/rvgen/automata.py
> @@ -411,8 +411,7 @@ class Automata:
>          self.__dot_lines = self.__open_dot()
>          self.__parse_tree = ParseTree(file_path)
>          self.transitions = self.__parse_transitions()
> -        self._states, self._initial_state, self._final_states =
> self.__parse_states()
> -        self.states, self.initial_state, self.final_states =
> self.__get_state_variables()
> +        self.states, self.initial_state, self.final_states =
> self.__parse_states()
>          self.env_types = {}
>          self.env_stored = set()
>          self.constraint_vars = set()
> @@ -603,7 +602,7 @@ class Automata:
>                      envs.append(c.env)
>                      self.__extract_env_var(c)
>  
> -        for state in self._states:
> +        for state in self.states:
>              if state.inv:
>                  envs.append(state.inv.env)
>                  self.__extract_env_var(state.inv)
> @@ -639,7 +638,7 @@ class Automata:
>      def __create_matrix(self) -> list[list[str]]:
>          # transform the array into a dictionary
>          events = self.events
> -        states = [s.name for s in self._states]
> +        states = [s.name for s in self.states]
>          events_dict = {}
>          states_dict = {}
>          nr_event = 0
> @@ -675,7 +674,7 @@ class Automata:
>              for j in range(len(self.states)):
>                  if self.function[j][i] != self.invalid_state_str:
>                      curr_event_used += 1
> -                if self.function[j][i] == self.initial_state:
> +                if self.function[j][i] == self.initial_state.name:
>                      curr_event_will_init += 1
>              if self.function[0][i] != self.invalid_state_str:
>                  curr_event_from_init = True
> diff --git a/tools/verification/rvgen/rvgen/dot2c.py
> b/tools/verification/rvgen/rvgen/dot2c.py
> index fc85ba1f649e..22938ce1bf6c 100644
> --- a/tools/verification/rvgen/rvgen/dot2c.py
> +++ b/tools/verification/rvgen/rvgen/dot2c.py
> @@ -29,10 +29,10 @@ class Dot2c(Automata):
>  
>      def __get_enum_states_content(self) -> list[str]:
>          buff = []
> -        buff.append(f"\t{self.initial_state}{self.enum_suffix},")
> +       
> buff.append(f"\t{self.initial_state.name}{self.enum_suffix},")
>          for state in self.states:
>              if state != self.initial_state:
> -                buff.append(f"\t{state}{self.enum_suffix},")
> +                buff.append(f"\t{state.name}{self.enum_suffix},")
>          buff.append(f"\tstate_max{self.enum_suffix},")
>  
>          return buff
> @@ -142,7 +142,7 @@ class Dot2c(Automata):
>      def format_aut_init_states_string(self) -> list[str]:
>          buff = []
>          buff.append("\t.state_names = {")
> -       
> buff.append(self.__get_string_vector_per_line_content(self.states))
> +       
> buff.append(self.__get_string_vector_per_line_content([s.name for s
> in self.states]))
>          buff.append("\t},")
>  
>          return buff
> @@ -159,7 +159,7 @@ class Dot2c(Automata):
>          return buff
>  
>      def __get_max_strlen_of_states(self) -> int:
> -        max_state_name = len(max(self.states, key=len))
> +        max_state_name = max((len(s.name) for s in self.states))
>          return max(max_state_name, len(self.invalid_state_str))
>  
>      def get_aut_init_function(self) -> str:
> @@ -199,7 +199,7 @@ class Dot2c(Automata):
>          return buff
>  
>      def get_aut_init_initial_state(self) -> str:
> -        return self.initial_state
> +        return self.initial_state.name
>  
>      def format_aut_init_initial_state(self) -> list[str]:
>          buff = []
> diff --git a/tools/verification/rvgen/rvgen/dot2k.py
> b/tools/verification/rvgen/rvgen/dot2k.py
> index dc6d6f33729b..e4b6c7c09170 100644
> --- a/tools/verification/rvgen/rvgen/dot2k.py
> +++ b/tools/verification/rvgen/rvgen/dot2k.py
> @@ -179,7 +179,7 @@ class ha2k(dot2k):
>          self.trace_h = self._read_template_file("trace_hybrid.h")
>          self.has_invariant = False
>          self.has_guard = False
> -        for state in self._states:
> +        for state in self.states:
>              if state.inv:
>                  self.has_invariant = True
>          for transition in self.transitions:
> @@ -314,7 +314,7 @@ f"""static inline bool
> ha_verify_invariants(struct ha_monitor *ha_mon,
>  {{"""]
>  
>          _else = ""
> -        for state in self._states:
> +        for state in self.states:
>              if not state.inv:
>                  continue
>  
> @@ -382,7 +382,7 @@ f"""static inline void ha_setup_invariants(struct
> ha_monitor *ha_mon,
>          buff.append(f"\tif ({condition_str})\n\t\treturn;")
>  
>          _else = ""
> -        for state in self._states:
> +        for state in self.states:
>              inv = state.inv
>              if not inv:
>                  continue
> @@ -391,7 +391,7 @@ f"""static inline void ha_setup_invariants(struct
> ha_monitor *ha_mon,
>              buff.append(f"\t\t{inv};")
>              _else = "else "
>  
> -        for state in self._states:
> +        for state in self.states:
>              inv = state.inv
>              if not inv:
>                  continue



* Re: [PATCH v3 11/13] verification/rvgen: Switch __create_matrix() to Lark
From: Gabriele Monaco @ 2026-06-10 15:05 UTC (permalink / raw)
  To: Nam Cao
  Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
	linux-kernel
In-Reply-To: <4e501de8e4e84a3de3370d82bf346f916aa97706.1780908661.git.namcao@linutronix.de>

On Mon, 2026-06-08 at 10:57 +0200, Nam Cao wrote:
> Switch __create_matrix() to use the transitions parsed by Lark to
> avoid all
> the raw text parsing.
> 
> Also stop parsing constraints in __create_matrix(), that is not used
> anymore.
> 
> Signed-off-by: Nam Cao <namcao@linutronix.de>

Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
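
A standalone sketch of the transition-driven matrix fill, assuming a hypothetical `Transition` record with `src`, `dst`, `event` and `reset` fields as plain values; the real parsed objects and the surrounding class state may differ.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    """Hypothetical stand-in for a Lark-parsed transition."""
    src: str
    dst: str
    event: str
    reset: bool = False


def create_matrix(states, events, transitions, invalid="INVALID_STATE"):
    """Fill a state x event matrix with destination state names."""
    states_dict = {name: i for i, name in enumerate(states)}
    events_dict = {name: i for i, name in enumerate(events)}
    matrix = [[invalid for _ in events] for _ in states]
    self_loop_reset_events = set()
    for t in transitions:
        if t.src == t.dst and t.reset:
            # Events that reset also on self loops.
            self_loop_reset_events.add(t.event)
        matrix[states_dict[t.src]][events_dict[t.event]] = t.dst
    return matrix, self_loop_reset_events


matrix, resets = create_matrix(
    states=["idle", "busy"],
    events=["start", "tick", "stop"],
    transitions=[
        Transition("idle", "busy", "start"),
        Transition("busy", "busy", "tick", reset=True),
        Transition("busy", "idle", "stop"),
    ],
)
```

Because the transitions arrive already parsed, the loop needs no string splitting or quote stripping, which is what lets the patch delete the raw-text path.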

> ---
>  tools/verification/rvgen/rvgen/automata.py | 47 ++++++--------------
> --
>  tools/verification/rvgen/rvgen/dot2k.py    |  2 +-
>  2 files changed, 13 insertions(+), 36 deletions(-)
> 
> diff --git a/tools/verification/rvgen/rvgen/automata.py
> b/tools/verification/rvgen/rvgen/automata.py
> index 2e26bb863245..4c302f5cba68 100644
> --- a/tools/verification/rvgen/rvgen/automata.py
> +++ b/tools/verification/rvgen/rvgen/automata.py
> @@ -418,7 +418,7 @@ class Automata:
>          self.constraint_vars = set()
>          self.self_loop_reset_events = set()
>          self.events, self.envs = self.__get_event_variables()
> -        self.function, self.constraints = self.__create_matrix()
> +        self.function = self.__create_matrix()
>          self.events_start, self.events_start_run =
> self.__store_init_events()
>          self.env_stored = sorted(self.env_stored)
>          self.constraint_vars = sorted(self.constraint_vars)
> @@ -636,10 +636,10 @@ class Automata:
>          if constraint.val[0].isalpha():
>              self.constraint_vars.add(constraint.val)
>  
> -    def __create_matrix(self) -> tuple[list[list[str]],
> dict[_ConstraintKey, list[str]]]:
> +    def __create_matrix(self) -> list[list[str]]:
>          # transform the array into a dictionary
>          events = self.events
> -        states = self.states
> +        states = [s.name for s in self._states]
>          events_dict = {}
>          states_dict = {}
>          nr_event = 0
> @@ -654,39 +654,16 @@ class Automata:
>  
>          # declare the matrix....
>          matrix = [[self.invalid_state_str for _ in range(nr_event)]
> for _ in range(nr_state)]
> -        constraints: dict[_ConstraintKey, list[str]] = {}
>  
> -        # and we are back! Let's fill the matrix
> -        cursor = self.__get_cursor_begin_events()
> -
> -        for line in map(str.lstrip,
> -                        islice(self.__dot_lines, cursor, None)):
> -
> -            if not line or line[0] != '"':
> -                break
> -
> -            split_line = line.split()
> -
> -            if len(split_line) > 2 and split_line[1] == "->":
> -                origin_state = split_line[0].replace('"',
> '').replace(',', '_')
> -                dest_state = split_line[2].replace('"',
> '').replace(',', '_')
> -                possible_events =
> "".join(split_line[split_line.index("label") + 2:-1]).replace('"',
> '')
> -                for event in possible_events.split("\\n"):
> -                    event, *constr = event.split(";")
> -                    if constr:
> -                        key =
> _EventConstraintKey(states_dict[origin_state], events_dict[event])
> -                        constraints[key] = constr
> -                        # those events reset also on self loops
> -                        if origin_state == dest_state and "reset" in
> "".join(constr):
> -                            self.self_loop_reset_events.add(event)
> -                   
> matrix[states_dict[origin_state]][events_dict[event]] = dest_state
> -            else:
> -                state = line.split("label")[1].split('"')[1]
> -                state, *constr = state.replace(" ", "").split("\\n")
> -                if constr:
> -                   
> constraints[_StateConstraintKey(states_dict[state])] = constr
> -
> -        return matrix, constraints
> +        for transition in self.transitions:
> +            src, dst = transition.src, transition.dst
> +            event = transition.event
> +            if src == dst and transition.reset:
> +                # those events reset also on self loops
> +                self.self_loop_reset_events.add(event)
> +            matrix[states_dict[src]][events_dict[event]] = dst
> +
> +        return matrix
>  
>      def __store_init_events(self) -> tuple[list[bool], list[bool]]:
>          events_start = [False] * len(self.events)
> diff --git a/tools/verification/rvgen/rvgen/dot2k.py
> b/tools/verification/rvgen/rvgen/dot2k.py
> index a38ef735a861..dc6d6f33729b 100644
> --- a/tools/verification/rvgen/rvgen/dot2k.py
> +++ b/tools/verification/rvgen/rvgen/dot2k.py
> @@ -403,7 +403,7 @@ f"""static inline void ha_setup_invariants(struct
> ha_monitor *ha_mon,
>  
>      def __fill_constr_func(self) -> list[str]:
>          buff = []
> -        if not self.constraints:
> +        if not self.has_invariant and not self.has_guard:
>              return []
>  
>          buff.append(
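
The transition-driven fill in the new __create_matrix() can be sketched outside rvgen as follows. This is a minimal sketch under stated assumptions, not the rvgen code: `Transition`, `build_matrix`, and `INVALID_STATE` are hypothetical stand-ins for the Lark-parsed objects and for `invalid_state_str`, and `reset` is modelled as an optional clock name.

```python
from dataclasses import dataclass
from typing import Optional

INVALID_STATE = "INVALID_STATE"  # hypothetical placeholder for invalid_state_str


@dataclass
class Transition:
    """Hypothetical stand-in for a parsed transition (not the rvgen class)."""
    src: str
    dst: str
    event: str
    reset: Optional[str] = None  # name of the clock reset by this transition


def build_matrix(states, events, transitions):
    """Return (matrix, self_loop_reset_events) for the given automaton."""
    states_idx = {name: i for i, name in enumerate(states)}
    events_idx = {name: i for i, name in enumerate(events)}
    # Every cell starts as "invalid"; only listed transitions overwrite it.
    matrix = [[INVALID_STATE for _ in events] for _ in states]
    self_loop_reset_events = set()

    for t in transitions:
        if t.src == t.dst and t.reset:
            # Events that reset a clock on a self loop are tracked separately.
            self_loop_reset_events.add(t.event)
        matrix[states_idx[t.src]][events_idx[t.event]] = t.dst
    return matrix, self_loop_reset_events


if __name__ == "__main__":
    demo_states = ["idle", "running"]
    demo_events = ["start", "stop", "tick"]
    demo_transitions = [
        Transition("idle", "running", "start", reset="clk"),
        Transition("running", "running", "tick", reset="clk"),
        Transition("running", "idle", "stop"),
    ]
    demo_matrix, demo_loops = build_matrix(demo_states, demo_events,
                                           demo_transitions)
    for name, row in zip(demo_states, demo_matrix):
        print(name, row)
    print(sorted(demo_loops))
```

The point of the sketch is that once transitions are structured objects, the matrix fill needs no cursor into the dot file and no string splitting.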



* Re: [PATCH v3 10/13] verification/rvgen: Switch __get_event_variables() to Lark
From: Gabriele Monaco @ 2026-06-10 15:04 UTC (permalink / raw)
  To: Nam Cao
  Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
	linux-kernel
In-Reply-To: <50df4dd278779328cf2aa283c8353bac2281a337.1780908661.git.namcao@linutronix.de>

On Mon, 2026-06-08 at 10:57 +0200, Nam Cao wrote:
> Switch __get_event_variables() to use the parsed results from Lark,
> instead
> of raw text processing.
> 
> Signed-off-by: Nam Cao <namcao@linutronix.de>

Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>

> ---
>  tools/verification/rvgen/rvgen/automata.py | 78 ++++++--------------
> --
>  1 file changed, 19 insertions(+), 59 deletions(-)
> 
> diff --git a/tools/verification/rvgen/rvgen/automata.py
> b/tools/verification/rvgen/rvgen/automata.py
> index b86275e7bf28..2e26bb863245 100644
> --- a/tools/verification/rvgen/rvgen/automata.py
> +++ b/tools/verification/rvgen/rvgen/automata.py
> @@ -591,45 +591,22 @@ class Automata:
>      def __get_event_variables(self) -> tuple[list[str], list[str]]:
>          events: list[str] = []
>          envs: list[str] = []
> -        # here we are at the begin of transitions, take a note, we
> will return later.
> -        cursor = self.__get_cursor_begin_events()
>  
> -        for line in map(str.lstrip, islice(self.__dot_lines, cursor,
> None)):
> -            if not line.startswith('"'):
> -                break
> +        for transition in self.transitions:
> +            events.append(transition.event)
>  
> -            # transitions have the format:
> -            # "all_fired" -> "both_fired" [ label = "disable_irq" ];
> -            #  ------------ event is here ------------^^^^^
> -            split_line = line.split()
> -            if len(split_line) > 1 and split_line[1] == "->":
> -                event = "".join(split_line[split_line.index("label")
> + 2:-1]).replace('"', '')
> -
> -                # when a transition has more than one label, they
> are like this
> -                # "local_irq_enable\nhw_local_irq_enable_n"
> -                # so split them.
> -
> -                for i in event.split("\\n"):
> -                    # if the event contains a constraint (hybrid
> automata),
> -                    # it will be separated by a ";":
> -                    # "sched_switch;x<1000;reset(x)"
> -                    ev, *constr = i.split(";")
> -                    if constr:
> -                        if len(constr) > 2:
> -                            raise AutomataError("Only 1 constraint
> and 1 reset are supported")
> -                        envs += self.__extract_env_var(constr)
> -                    events.append(ev)
> -            else:
> -                # state labels have the format:
> -                # "enable_fired" [label =
> "enable_fired\ncondition"];
> -                #  ----- label is here -----^^^^^
> -                # label and node name must be the same, condition is
> optional
> -                state = line.split("label")[1].split('"')[1]
> -                _, *constr = state.split("\\n")
> -                if constr:
> -                    if len(constr) > 1:
> -                        raise AutomataError("Only 1 constraint is
> supported in the state")
> -                    envs +=
> self.__extract_env_var([constr[0].replace(" ", "")])
> +            if transition.reset:
> +                envs.append(transition.reset.env)
> +                self.env_stored.add(transition.reset.env)
> +            if transition.rule:
> +                for c, _ in transition.rule.rules:
> +                    envs.append(c.env)
> +                    self.__extract_env_var(c)
> +
> +        for state in self._states:
> +            if state.inv:
> +                envs.append(state.inv.env)
> +                self.__extract_env_var(state.inv)
>  
>          return sorted(set(events)), sorted(set(envs))
>  
> @@ -653,28 +630,11 @@ class Automata:
>              seps.append(None)
>          return zip(exprs, seps)
>  
> -    def __extract_env_var(self, constraint: list[str]) -> list[str]:
> -        env = []
> -        for c, _ in self._split_constraint_expr(constraint):
> -            rule = self.constraint_rule.search(c)
> -            reset = self.constraint_reset.search(c)
> -            if rule:
> -                env.append(rule["env"])
> -                if rule.groupdict().get("unit"):
> -                    self.env_types[rule["env"]] = rule["unit"]
> -                if rule["val"][0].isalpha():
> -                    self.constraint_vars.add(rule["val"])
> -                # try to infer unit from constants or parameters
> -                val_for_unit = rule["val"].lower().replace("()", "")
> -                if val_for_unit.endswith("_ns"):
> -                    self.env_types[rule["env"]] = "ns"
> -                if val_for_unit.endswith("_jiffies"):
> -                    self.env_types[rule["env"]] = "j"
> -            if reset:
> -                env.append(reset["env"])
> -                # environment variables that are reset need a
> storage
> -                self.env_stored.add(reset["env"])
> -        return env
> +    def __extract_env_var(self, constraint: ConstraintCondition):
> +        if constraint.unit:
> +            self.env_types[constraint.env] = constraint.unit
> +        if constraint.val[0].isalpha():
> +            self.constraint_vars.add(constraint.val)
>  
>      def __create_matrix(self) -> tuple[list[list[str]],
> dict[_ConstraintKey, list[str]]]:
>          # transform the array into a dictionary
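
The collection logic in the new __get_event_variables() can be illustrated with a small standalone sketch. All names here (`Condition`, `Transition`, `State`, `collect_variables`) are hypothetical and only approximate the parsed objects used in the patch; in particular, the patch iterates `transition.rule.rules` as pairs, which this sketch flattens to a plain list of conditions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Condition:
    """Hypothetical parsed constraint such as `clk < 1000 ns`."""
    env: str
    op: str
    val: str
    unit: Optional[str] = None


@dataclass
class Transition:
    """Hypothetical parsed transition with an optional reset and rules."""
    event: str
    reset_env: Optional[str] = None
    rules: list = field(default_factory=list)  # list of Condition


@dataclass
class State:
    """Hypothetical parsed state with an optional invariant."""
    name: str
    inv: Optional[Condition] = None


def collect_variables(transitions, states):
    """Gather events, environment variables and their metadata."""
    events, envs = [], []
    env_stored, env_types, constraint_vars = set(), {}, set()

    def note(cond):
        # Mirrors the reduced __extract_env_var(): record unit and named values.
        if cond.unit:
            env_types[cond.env] = cond.unit
        if cond.val[0].isalpha():
            constraint_vars.add(cond.val)

    for t in transitions:
        events.append(t.event)
        if t.reset_env:
            envs.append(t.reset_env)
            env_stored.add(t.reset_env)  # reset variables need storage
        for cond in t.rules:
            envs.append(cond.env)
            note(cond)

    for s in states:
        if s.inv:
            envs.append(s.inv.env)
            note(s.inv)

    return (sorted(set(events)), sorted(set(envs)),
            env_stored, env_types, constraint_vars)
```

Compared with the removed text parsing, each piece of metadata is read from a typed field, so there is no regular expression to keep in sync with the label syntax.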



* Re: [PATCH v3 09/13] verification/rvgen: Delete __parse_constraint()
From: Gabriele Monaco @ 2026-06-10 15:04 UTC (permalink / raw)
  To: Nam Cao
  Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
	linux-kernel
In-Reply-To: <8d9b9068a5dde8256edd7debe7aab33e15a7fc51.1780908661.git.namcao@linutronix.de>

On Mon, 2026-06-08 at 10:57 +0200, Nam Cao wrote:
> -    def __validate_constraint(self, key: tuple[int, int] | int,
> constr: str,
> -                              rule, reset) -> None:
> -        # event constrains are tuples and allow both rules and reset
> -        # state constraints are only used for expirations (e.g.
> clk<N)
> -        if self.is_event_constraint(key):
> -            if not rule and not reset:
> -                raise AutomataError("Unrecognised event constraint "
> -                                   
> f"({self.states[key[0]]}/{self.events[key[1]]}: {constr})")
> -            if rule and (rule["env"] in self.env_types and
> -                         rule["env"] not in self.env_stored):
> -                raise AutomataError("Clocks in hybrid automata
> always require a storage"
> -                                    f" ({rule["env"]})")
> -        else:
> -            if not rule:
> -                raise AutomataError("Unrecognised state constraint "
> -                                    f"({self.states[key]}:
> {constr})")
> -            if rule["env"] not in self.env_stored:
> -                raise AutomataError("State constraints always
> require a storage "
> -                                    f"({rule["env"]})")

This function used to validate things we are no longer validating. It is now
possible to create a model where a clock is never reset, which doesn't fully
make sense. Should we add that check somewhere else?

Thanks,
Gabriele

> -            if rule["op"] not in ["<", "<="]:
> -                raise AutomataError("State constraints must be clock
> expirations like"
> -                                    f" clk<N ({rule.string})")
> -
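
One possible shape for re-adding the storage checks is sketched below. It is an illustration only: `check_clocks_are_reset` and its arguments are hypothetical, `AutomataError` is a local stand-in for the rvgen exception, and the error messages are copied from the removed __validate_constraint().

```python
class AutomataError(Exception):
    """Local stand-in for rvgen's AutomataError."""


def check_clocks_are_reset(env_types, env_stored, event_envs, state_envs):
    """Raise AutomataError if a constrained clock is never reset.

    env_types:  dict mapping clock variables to their unit
    env_stored: set of variables that are reset somewhere (need storage)
    event_envs: variables used in event constraints
    state_envs: variables used in state constraints (invariants)
    """
    for env in sorted(state_envs):
        if env not in env_stored:
            raise AutomataError(
                f"State constraints always require a storage ({env})")
    for env in sorted(event_envs):
        # Only typed variables (clocks) must have been reset somewhere.
        if env in env_types and env not in env_stored:
            raise AutomataError(
                "Clocks in hybrid automata always require a storage"
                f" ({env})")
```

Such a check would have to run after all transitions and states have been visited, since a reset may appear later in the model than the constraint that depends on it.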



* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-06-10 15:00 UTC (permalink / raw)
  To: Gregory Price, Balbir Singh
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <aik_ddHymus2DJ6D@gourry-fedora-PF4VCD3F>

On 6/10/26 12:41, Gregory Price wrote:
> On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
>>>
>>>    __GFP_THISNODE cannot be overloaded to do anything useful here.
>>
>> Let me clarify, I meant to say, let's use a nodemask for allocation
>> and __GFP_THISNODE gets us to the node we desire, if that is the only
>> node. My earlier comment might not have been clear.
>>
> 
> I've been testing a stripped-back patch set where I drop all FALLBACK
> entries for private nodes (including for itself) and only keep the
> NOFALLBACK entry for private nodes.
> 
> This effectively isolates the nodes for any allocation without
> __GFP_THISNODE.
> 
> This also precludes these nodes from ever using non-mbind mempolicies,
> which I think is a completely reasonable compromise and something I was
> already expecting we would do.
> 
> Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
> which causes spillage into private nodes because slub allows private
> nodes in its mask.  I think this is fixable.
> 
> I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
> code, etc), but it seems like fully dropping the FALLBACK entries and
> requiring __GFP_THISNODE might be sufficient.

Sorry, I haven't been able to follow up so far, and not sure if that's what you
are discussing here ...

After the LSF/MM session, I was wondering: if we focus on allowing only
folio allocations to end up on private memory nodes for now, could the
__GFP_THISNODE approach work there?

Essentially, disallow any allocations on non-folio paths, and allow folio
allocation only with __GFP_THISNODE set.

I still have to find time to read the other mails in this thread; it is on my todo list.

So sorry if that is precisely what is being discussed here.

-- 
Cheers,

David
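
The isolation idea discussed above can be modelled with a toy sketch. This is not kernel code: `build_zonelists` and `candidate_nodes` are hypothetical, node ordering ignores NUMA distance, and the reading that a private node's own fallback list keeps only the non-private nodes is an interpretation of the quoted description.

```python
def build_zonelists(nodes, private):
    """Map node -> {"fallback": [...], "nofallback": [...]} (toy model)."""
    normal = [n for n in nodes if n not in private]
    zonelists = {}
    for node in nodes:
        # Private nodes never appear in any fallback list, including their own.
        own = [node] if node not in private else []
        fallback = own + [n for n in normal if n != node]
        # The NOFALLBACK list always holds just the node itself.
        zonelists[node] = {"fallback": fallback, "nofallback": [node]}
    return zonelists


def candidate_nodes(zonelists, preferred, thisnode):
    """Nodes an allocation may land on; `thisnode` models __GFP_THISNODE."""
    kind = "nofallback" if thisnode else "fallback"
    return zonelists[preferred][kind]


if __name__ == "__main__":
    zl = build_zonelists([0, 1, 2], private={2})
    print(candidate_nodes(zl, 0, thisnode=False))
    print(candidate_nodes(zl, 2, thisnode=False))
    print(candidate_nodes(zl, 2, thisnode=True))
```

In this model an allocation can reach the private node only by naming it and setting the flag, which is also why callers that add the flag internally (the slub case mentioned in the quoted text) need separate attention.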


* Re: [PATCH v4 6/7] Documentation: bootconfig: document build-time cmdline rendering
From: Breno Leitao @ 2026-06-10 14:58 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, kernel-team
In-Reply-To: <20260610233720.82fe59cf42aa57659c2e5697@kernel.org>

On Wed, Jun 10, 2026 at 11:37:20PM +0900, Masami Hiramatsu wrote:
> On Tue, 09 Jun 2026 03:28:33 -0700
> Breno Leitao <leitao@debian.org> wrote:
> 
> > Add a section describing CONFIG_BOOT_CONFIG_EMBED_CMDLINE: what it
> > does (renders the embedded "kernel" subtree to a flat cmdline at
> > build time so early_param() handlers see the values), what it
> > requires (BOOT_CONFIG_EMBED, a non-empty BOOT_CONFIG_EMBED_FILE,
> > and ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG -- currently x86 only),
> > the bootconfig opt-in semantics, the initrd-vs-embedded precedence,
> > and the soft-error overflow behavior.
> 
> Hi Breno,
> 
> Thanks for adding the document. But related to Sashiko's comment,
> I believe it's necessary to pre-describe in this document how the
> kernel behaves with various combinations of cmdline and bootconfig,
> both embedded and initrd/bootloader.
> 
> We can have these ways to pass the kernel options.
> 
> - bootloader cmdline
> - embedded cmdline
> - initrd bootconfig
> - embedded bootconfig (standard/cmdline)

Will do.  For v5 I'll extend bootconfig.rst with a section that walks through
each combination -- which source wins, when parse_early_param() sees it, and
how it shows up in /proc/cmdline and /proc/bootconfig.

> Clearly, we will have the option to choose between a standard embedded
> boot configuration and a command-line one (not both), but the behavior
> is different. I confirmed that is covered.
> 
> Embedded bootconfig is a kind of default bootconfig, which is NOT used
> when initrd has another bootconfig. I made this design because of
> /proc/bootconfig, which is not merged with the embedded one. However,
> CONFIG_BOOT_CONFIG_EMBED_CMDLINE will be a bit different: if it is
> embedded and the "bootconfig" feature is enabled, the embedded one
> has been used already.
> 
> To avoid confusion, when this option is used, shouldn't we treat it
> the same way as if embedded command lines were enabled, and either
> not display it in /proc/bootconfig (or always display it, by merging
> the rendered string)?

You're right that EMBED_CMDLINE breaks it: the embedded kernel.* keys
are already in boot_command_line before setup_boot_config() ever sees
the initrd bconf, so a user reading /proc/bootconfig would see only
the initrd keys while parse_early_param() acted on the embedded ones.
That's exactly the split-state Sashiko was circling around.

Both options you suggest work for me, but they pull in opposite
directions and I'd rather not guess wrong on the user-facing
contract.  Which do you prefer for v5?

  (a) Don't display embedded in /proc/bootconfig -- keep the current
      "file shows the active bootconfig source" behavior and document
      that with EMBED_CMDLINE=y, the kernel.* subtree may have been
      applied separately via the cmdline.

  (b) Always display embedded by merging the rendered string into
      /proc/bootconfig when EMBED_CMDLINE=y, so the file reflects
      what was actually applied.

Happy to go either way.

Thanks for the direction,
--breno
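
The difference between options (a) and (b) can be made concrete with a toy model. It is only a sketch of the two proposals as worded in this message: the function names are hypothetical, bootconfig trees are flattened to dicts, and letting the active source win on a key conflict under (b) is an assumption, since the thread does not settle the merge order.

```python
def applied_via_cmdline(embedded, embed_cmdline):
    """kernel.* keys rendered into the cmdline at build time (toy model)."""
    if not embed_cmdline or not embedded:
        return {}
    return {k: v for k, v in embedded.items() if k.startswith("kernel.")}


def proc_bootconfig_view(embedded, initrd, embed_cmdline, option):
    """What /proc/bootconfig would show under option "a" or "b"."""
    # An initrd bootconfig replaces the embedded one as the active source.
    active = initrd if initrd is not None else (embedded or {})
    view = dict(active)
    if option == "b":
        # Option (b): also show the keys that were applied via the cmdline.
        # Assumption: on a key conflict the active source is kept.
        for key, value in applied_via_cmdline(embedded, embed_cmdline).items():
            view.setdefault(key, value)
    return view


if __name__ == "__main__":
    emb = {"kernel.console": "ttyS0", "init.foo": "1"}
    ird = {"kernel.loglevel": "7"}
    print(applied_via_cmdline(emb, True))
    print(proc_bootconfig_view(emb, ird, True, "a"))
    print(proc_bootconfig_view(emb, ird, True, "b"))
```

With an initrd bootconfig present, option (a) hides the embedded kernel.* keys even though they were applied, which is the split state described above; option (b) makes them visible at the cost of mixing two sources in one file.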


* Re: [PATCH v4 3/7] bootconfig: render embedded bootconfig as a kernel cmdline at build time
From: Breno Leitao @ 2026-06-10 14:50 UTC (permalink / raw)
  To: Julian Braha
  Cc: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, linux-kernel,
	linux-trace-kernel, linux-kbuild, bpf, kernel-team
In-Reply-To: <ebb4fb1d-effe-44c6-82dd-8223b36419a1@gmail.com>

Hello Julian,

On Wed, Jun 10, 2026 at 02:44:52PM +0100, Julian Braha wrote:
> On 6/9/26 11:28, Breno Leitao wrote:
> > +	depends on BOOT_CONFIG_EMBED
> > +	depends on BOOT_CONFIG_EMBED_FILE != ""
> 
> Hi Breno,
> 
> Just an FYI, this dependency on BOOT_CONFIG_EMBED is redundant because
> the:
> BOOT_CONFIG_EMBED_FILE != ""
> is only possible when BOOT_CONFIG_EMBED is enabled.

Good catch, thanks. BOOT_CONFIG_EMBED_FILE itself already depends on
BOOT_CONFIG_EMBED, so the explicit line is redundant. I'll drop it!

Thanks for the review,
--breno


* Re: [PATCH v4 6/7] Documentation: bootconfig: document build-time cmdline rendering
From: Masami Hiramatsu @ 2026-06-10 14:37 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, kernel-team
In-Reply-To: <20260609-bootconfig_using_tools-v4-6-73c463f03a97@debian.org>

On Tue, 09 Jun 2026 03:28:33 -0700
Breno Leitao <leitao@debian.org> wrote:

> Add a section describing CONFIG_BOOT_CONFIG_EMBED_CMDLINE: what it
> does (renders the embedded "kernel" subtree to a flat cmdline at
> build time so early_param() handlers see the values), what it
> requires (BOOT_CONFIG_EMBED, a non-empty BOOT_CONFIG_EMBED_FILE,
> and ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG -- currently x86 only),
> the bootconfig opt-in semantics, the initrd-vs-embedded precedence,
> and the soft-error overflow behavior.

Hi Breno,

Thanks for adding the document. But related to Sashiko's comment,
I believe it's necessary to pre-describe in this document how the
kernel behaves with various combinations of cmdline and bootconfig,
both embedded and initrd/bootloader.

We can have these ways to pass the kernel options.

- bootloader cmdline
- embedded cmdline
- initrd bootconfig
- embedded bootconfig (standard/cmdline)

Clearly, we will have the option to choose between a standard embedded
boot configuration and a command-line one (not both), but the behavior
is different. I confirmed that is covered.

Embedded bootconfig is a kind of default bootconfig, which is NOT used
when initrd has another bootconfig. I made this design because of
/proc/bootconfig, which is not merged with the embedded one. However,
CONFIG_BOOT_CONFIG_EMBED_CMDLINE will be a bit different: if it is
embedded and the "bootconfig" feature is enabled, the embedded one
has been used already.

To avoid confusion, when this option is used, shouldn't we treat it
the same way as if embedded command lines were enabled, and either
not display it in /proc/bootconfig (or always display it, by merging
the rendered string)?

Thank you,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* [PATCH v2 9/9] media: hantro: Add v4l2_hw run/done traces
From: Detlev Casanova @ 2026-06-10 14:33 UTC (permalink / raw)
  To: Daniel Almeida, Mauro Carvalho Chehab, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Nicolas Dufresne,
	Benjamin Gaignard, Philipp Zabel, Heiko Stuebner
  Cc: linux-kernel, linux-media, linux-trace-kernel, linux-rockchip,
	linux-arm-kernel, kernel, Detlev Casanova
In-Reply-To: <20260610-v4l2-add-ftrace-v2-0-9756edf72ac1@collabora.com>

Add the trace calls and retrieve the number of clock cycles for the
rockchip_vpu core.

Signed-off-by: Detlev Casanova <detlev.casanova@collabora.com>
---
 drivers/media/platform/verisilicon/hantro.h               |  1 +
 drivers/media/platform/verisilicon/hantro_drv.c           | 10 ++++++++++
 drivers/media/platform/verisilicon/rockchip_vpu981_regs.h |  1 +
 drivers/media/platform/verisilicon/rockchip_vpu_hw.c      |  4 ++++
 4 files changed, 16 insertions(+)

diff --git a/drivers/media/platform/verisilicon/hantro.h b/drivers/media/platform/verisilicon/hantro.h
index 0353de154a1e..d5cddc783688 100644
--- a/drivers/media/platform/verisilicon/hantro.h
+++ b/drivers/media/platform/verisilicon/hantro.h
@@ -253,6 +253,7 @@ struct hantro_ctx {
 
 	u32 sequence_cap;
 	u32 sequence_out;
+	u32 hw_cycles;
 
 	const struct hantro_fmt *vpu_src_fmt;
 	struct v4l2_pix_format_mplane src_fmt;
diff --git a/drivers/media/platform/verisilicon/hantro_drv.c b/drivers/media/platform/verisilicon/hantro_drv.c
index 2e81877f640f..32855b14e0f1 100644
--- a/drivers/media/platform/verisilicon/hantro_drv.c
+++ b/drivers/media/platform/verisilicon/hantro_drv.c
@@ -25,6 +25,8 @@
 #include <media/videobuf2-core.h>
 #include <media/videobuf2-vmalloc.h>
 
+#include <trace/events/v4l2.h>
+
 #include "hantro_v4l2.h"
 #include "hantro.h"
 #include "hantro_hw.h"
@@ -103,6 +105,9 @@ void hantro_irq_done(struct hantro_dev *vpu,
 	struct hantro_ctx *ctx =
 		v4l2_m2m_get_curr_priv(vpu->m2m_dev);
 
+	if (ctx)
+		trace_v4l2_hw_done(ctx->fh.tgid, ctx->fh.fd, ctx->hw_cycles);
+
 	/*
 	 * If cancel_delayed_work returns false
 	 * the timeout expired. The watchdog is running,
@@ -125,6 +130,9 @@ void hantro_watchdog(struct work_struct *work)
 	ctx = v4l2_m2m_get_curr_priv(vpu->m2m_dev);
 	if (ctx) {
 		vpu_err("frame processing timed out!\n");
+
+		trace_v4l2_hw_done(ctx->fh.tgid, ctx->fh.fd, ctx->hw_cycles);
+
 		if (ctx->codec_ops->reset)
 			ctx->codec_ops->reset(ctx);
 		hantro_job_finish(vpu, ctx, VB2_BUF_STATE_ERROR);
@@ -189,6 +197,8 @@ static void device_run(void *priv)
 	if (ctx->codec_ops->run(ctx))
 		goto err_cancel_job;
 
+	trace_v4l2_hw_run(ctx->fh.tgid, ctx->fh.fd);
+
 	return;
 
 err_cancel_job:
diff --git a/drivers/media/platform/verisilicon/rockchip_vpu981_regs.h b/drivers/media/platform/verisilicon/rockchip_vpu981_regs.h
index e4008da64f19..96b85470208b 100644
--- a/drivers/media/platform/verisilicon/rockchip_vpu981_regs.h
+++ b/drivers/media/platform/verisilicon/rockchip_vpu981_regs.h
@@ -451,6 +451,7 @@
 #define av1_pp0_dup_ver			AV1_DEC_REG(394, 16, 0xff)
 #define av1_pp0_dup_hor			AV1_DEC_REG(394, 24, 0xff)
 
+#define AV1_CYCLE_COUNT			(AV1_SWREG(63))
 #define AV1_TILE_OUT_LU			(AV1_SWREG(65))
 #define AV1_REFERENCE_Y(i)		(AV1_SWREG(67) + ((i) * 0x8))
 #define AV1_SEGMENTATION		(AV1_SWREG(81))
diff --git a/drivers/media/platform/verisilicon/rockchip_vpu_hw.c b/drivers/media/platform/verisilicon/rockchip_vpu_hw.c
index 02673be9878e..f959151b6645 100644
--- a/drivers/media/platform/verisilicon/rockchip_vpu_hw.c
+++ b/drivers/media/platform/verisilicon/rockchip_vpu_hw.c
@@ -424,6 +424,8 @@ static irqreturn_t rk3588_vpu981_irq(int irq, void *dev_id)
 {
 	struct hantro_dev *vpu = dev_id;
 	enum vb2_buffer_state state;
+	struct hantro_ctx *ctx =
+		v4l2_m2m_get_curr_priv(vpu->m2m_dev);
 	u32 status;
 
 	status = vdpu_read(vpu, AV1_REG_INTERRUPT);
@@ -433,6 +435,8 @@ static irqreturn_t rk3588_vpu981_irq(int irq, void *dev_id)
 	vdpu_write(vpu, 0, AV1_REG_INTERRUPT);
 	vdpu_write(vpu, AV1_REG_CONFIG_DEC_CLK_GATE_E, AV1_REG_CONFIG);
 
+	ctx->hw_cycles = vdpu_read(vpu, AV1_CYCLE_COUNT);
+
 	hantro_irq_done(vpu, state);
 
 	return IRQ_HANDLED;

-- 
2.54.0



* [PATCH v2 8/9] media: Add HW run/done trace events
From: Detlev Casanova @ 2026-06-10 14:33 UTC (permalink / raw)
  To: Daniel Almeida, Mauro Carvalho Chehab, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Nicolas Dufresne,
	Benjamin Gaignard, Philipp Zabel, Heiko Stuebner
  Cc: linux-kernel, linux-media, linux-trace-kernel, linux-rockchip,
	linux-arm-kernel, kernel, Detlev Casanova
In-Reply-To: <20260610-v4l2-add-ftrace-v2-0-9756edf72ac1@collabora.com>

Drivers can fire these events when the hardware starts running and when
it is done.
Userspace tracers can use them to observe HW performance and usage.

The hw_done event allows setting the number of clock cycles the HW needed
to do the work, to help tools evaluate performance.

Signed-off-by: Detlev Casanova <detlev.casanova@collabora.com>
---
 drivers/media/v4l2-core/v4l2-trace.c |  3 +++
 include/trace/events/v4l2.h          | 40 ++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/drivers/media/v4l2-core/v4l2-trace.c b/drivers/media/v4l2-core/v4l2-trace.c
index 183d5ecb49c5..59cf6f8807ac 100644
--- a/drivers/media/v4l2-core/v4l2-trace.c
+++ b/drivers/media/v4l2-core/v4l2-trace.c
@@ -12,6 +12,9 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(vb2_v4l2_buf_queue);
 EXPORT_TRACEPOINT_SYMBOL_GPL(vb2_v4l2_dqbuf);
 EXPORT_TRACEPOINT_SYMBOL_GPL(vb2_v4l2_qbuf);
 
+EXPORT_TRACEPOINT_SYMBOL_GPL(v4l2_hw_run);
+EXPORT_TRACEPOINT_SYMBOL_GPL(v4l2_hw_done);
+
 /* Export AV1 controls */
 EXPORT_TRACEPOINT_SYMBOL_GPL(v4l2_ctrl_av1_sequence);
 EXPORT_TRACEPOINT_SYMBOL_GPL(v4l2_ctrl_av1_frame);
diff --git a/include/trace/events/v4l2.h b/include/trace/events/v4l2.h
index e5b80aeecc30..6f1bbb085cb0 100644
--- a/include/trace/events/v4l2.h
+++ b/include/trace/events/v4l2.h
@@ -299,6 +299,46 @@ DEFINE_EVENT(v4l2_stream_class, v4l2_streamoff,
 	TP_ARGS(tgid, fd)
 );
 
+
+/* Events for hardware run/done.
+ *
+ * These events will be fired respectively when the hardware is run (v4l2_hw_run) and done
+ * (v4l2_hw_done).
+ * As for other events, tgid and fd are used to identify the process that opened the video device.
+ *
+ * The v4l2_hw_done event also includes the number of hardware cycles taken by the hardware to
+ * process the command.
+ */
+DEFINE_EVENT(v4l2_stream_class, v4l2_hw_run,
+	TP_PROTO(u32 tgid, u32 fd),
+	TP_ARGS(tgid, fd)
+);
+
+DECLARE_EVENT_CLASS(v4l2_hw_done_class,
+	TP_PROTO(u32 tgid, u32 fd, u32 hw_cycles),
+	TP_ARGS(tgid, fd, hw_cycles),
+
+	TP_STRUCT__entry(
+		__field(u32, tgid)
+		__field(u32, fd)
+		__field(u32, hw_cycles)
+	),
+
+	TP_fast_assign(
+		__entry->tgid = tgid;
+		__entry->fd = fd;
+		__entry->hw_cycles = hw_cycles;
+	),
+
+	TP_printk("tgid = %u, fd = %u, hw_cycles = %u",
+		  __entry->tgid, __entry->fd, __entry->hw_cycles)
+);
+
+DEFINE_EVENT(v4l2_hw_done_class, v4l2_hw_done,
+	TP_PROTO(u32 tgid, u32 fd, u32 hw_cycles),
+	TP_ARGS(tgid, fd, hw_cycles)
+);
+
 #endif /* if !defined(_TRACE_V4L2_H) || defined(TRACE_HEADER_MULTI_READ) */
 
 /* This part must be outside protection */

-- 
2.54.0
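
As an illustration of how a userspace tool might consume these events, the sketch below aggregates hw_cycles per tgid/fd pair from textual trace output. It relies only on the TP_printk format in this patch; the function name, the way lines are obtained, and any prefix before the event name are assumptions.

```python
import re
from collections import defaultdict

# Matches the TP_printk format of v4l2_hw_done from this patch.
HW_DONE_RE = re.compile(
    r"v4l2_hw_done: tgid = (\d+), fd = (\d+), hw_cycles = (\d+)")


def cycles_per_stream(lines):
    """Return {(tgid, fd): (events, total_cycles, mean_cycles)}."""
    totals = defaultdict(lambda: [0, 0])  # (tgid, fd) -> [events, cycles]
    for line in lines:
        match = HW_DONE_RE.search(line)
        if not match:
            continue  # other events, e.g. v4l2_hw_run, are ignored here
        tgid, fd, cycles = map(int, match.groups())
        entry = totals[(tgid, fd)]
        entry[0] += 1
        entry[1] += cycles
    return {key: (n, c, c / n) for key, (n, c) in totals.items()}


if __name__ == "__main__":
    # Made-up sample lines; only the part after the event name matters.
    sample = [
        "dec-1 [000] 1.000: v4l2_hw_run: tgid = 10, fd = 4",
        "dec-1 [000] 1.010: v4l2_hw_done: tgid = 10, fd = 4, hw_cycles = 100",
        "dec-1 [000] 1.020: v4l2_hw_done: tgid = 10, fd = 4, hw_cycles = 300",
    ]
    print(cycles_per_stream(sample))
```

Pairing v4l2_hw_run with v4l2_hw_done by timestamp would additionally give wall-clock latency, but that depends on the trace prefix format and is left out of this sketch.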



* [PATCH v2 7/9] media: Add stream on/off traces and run them in the ioctl
From: Detlev Casanova @ 2026-06-10 14:33 UTC (permalink / raw)
  To: Daniel Almeida, Mauro Carvalho Chehab, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Nicolas Dufresne,
	Benjamin Gaignard, Philipp Zabel, Heiko Stuebner
  Cc: linux-kernel, linux-media, linux-trace-kernel, linux-rockchip,
	linux-arm-kernel, kernel, Detlev Casanova
In-Reply-To: <20260610-v4l2-add-ftrace-v2-0-9756edf72ac1@collabora.com>

This will automatically add stream on/off tracing for all v4l2 drivers.

Signed-off-by: Detlev Casanova <detlev.casanova@collabora.com>
---
 drivers/media/v4l2-core/v4l2-ioctl.c | 20 +++++++++++++++++--
 include/trace/events/v4l2.h          | 37 ++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/drivers/media/v4l2-core/v4l2-ioctl.c b/drivers/media/v4l2-core/v4l2-ioctl.c
index c8746a1637f5..b09489baff3e 100644
--- a/drivers/media/v4l2-core/v4l2-ioctl.c
+++ b/drivers/media/v4l2-core/v4l2-ioctl.c
@@ -1963,13 +1963,29 @@ static int v4l_try_fmt(const struct v4l2_ioctl_ops *ops, struct file *file,
 static int v4l_streamon(const struct v4l2_ioctl_ops *ops, struct file *file,
 			void *arg)
 {
-	return ops->vidioc_streamon(file, NULL, *(unsigned int *)arg);
+	struct v4l2_fh *fh = file_to_v4l2_fh(file);
+	int err;
+
+	err = ops->vidioc_streamon(file, NULL, *(unsigned int *)arg);
+
+	if (!err)
+		trace_v4l2_streamon(fh->tgid, fh->fd);
+
+	return err;
 }
 
 static int v4l_streamoff(const struct v4l2_ioctl_ops *ops, struct file *file,
 			 void *arg)
 {
-	return ops->vidioc_streamoff(file, NULL, *(unsigned int *)arg);
+	struct v4l2_fh *fh = file_to_v4l2_fh(file);
+	int err;
+
+	err = ops->vidioc_streamoff(file, NULL, *(unsigned int *)arg);
+
+	if (!err)
+		trace_v4l2_streamoff(fh->tgid, fh->fd);
+
+	return err;
 }
 
 static int v4l_g_tuner(const struct v4l2_ioctl_ops *ops, struct file *file,
diff --git a/include/trace/events/v4l2.h b/include/trace/events/v4l2.h
index 248bc09bfc99..e5b80aeecc30 100644
--- a/include/trace/events/v4l2.h
+++ b/include/trace/events/v4l2.h
@@ -262,6 +262,43 @@ DEFINE_EVENT(vb2_v4l2_event_class, vb2_v4l2_qbuf,
 	TP_ARGS(q, vb)
 );
 
+
+/* Events for stream on/off.
+ *
+ * These events will be fired every time userspace starts or stops a stream.
+ * tgid and fd are used to identify the process that opened the video device.
+ *
+ * Note that this event can be fired multiple times for a given tgid/fd pair.
+ * E.g.: mem2mem drivers expect stream on/off on both output and capture queues.
+ */
+DECLARE_EVENT_CLASS(v4l2_stream_class,
+	TP_PROTO(u32 tgid, u32 fd),
+	TP_ARGS(tgid, fd),
+
+	TP_STRUCT__entry(
+		__field(u32, tgid)
+		__field(u32, fd)
+	),
+
+	TP_fast_assign(
+		__entry->tgid = tgid;
+		__entry->fd = fd;
+	),
+
+	TP_printk("tgid = %u, fd = %u",
+		  __entry->tgid, __entry->fd)
+);
+
+DEFINE_EVENT(v4l2_stream_class, v4l2_streamon,
+	TP_PROTO(u32 tgid, u32 fd),
+	TP_ARGS(tgid, fd)
+);
+
+DEFINE_EVENT(v4l2_stream_class, v4l2_streamoff,
+	TP_PROTO(u32 tgid, u32 fd),
+	TP_ARGS(tgid, fd)
+);
+
 #endif /* if !defined(_TRACE_V4L2_H) || defined(TRACE_HEADER_MULTI_READ) */
 
 /* This part must be outside protection */

-- 
2.54.0


