Linux Trace Kernel
 help / color / mirror / Atom feed
* Re: [PATCH v4] tracing/probes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-19 16:28 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: LKML, Linux Trace Kernel, bpf, Mathieu Desnoyers, Mark Rutland,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
	Tom Zanussi, Andrew Morton, Thomas Gleixner, Ian Rogers,
	Jiri Olsa
In-Reply-To: <20260520002640.d372bd044ceba9364b1f168f@kernel.org>

On Wed, 20 May 2026 00:26:40 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > Hmm, it may be better to make it one parenthesis?
> > 
> >        (STRUCT,PTR,ASSIGN)->FIELD
> > 
> >        data=(foo,foo_list,list)->data  
> 
> OK, but I don't like this 3 parameters conversion. I want to
> make it a kind of type casting with an option.
> 
> 	(STRUCT,ASSIGN)PTR->FIELD
> 
> 	data=(foo,list)foo_list->data
> 
> The second parenthesis will be eventually needed for nested casting,
> for example, in above case, if the data is a pointer to another data
> structure:
> 
> struct bar {
> 	int	value;
> 	...
> };
> 
> 	value=(bar)((foo,list)foo_list->data)->value

Have fun with the parenthesis parsing ;-)

> 
> 
> > 
> > That would make it easier to differentiate between a simple "typecast" and
> > a container_of() by checking if the content between the parenthesis has a
> > comma.
> > 
> > Maybe even reorder it to:
> > 
> >        (PTR,STRUCT,ASSIGN)->FIELD
> > 
> >        data=(foo_list,foo,list)->data
> > 
> > to match the order of container_of():
> > 
> >       data = container_of(foo_list, struct foo, list)->data;
> > 
> > ?  
> 
> This doesn't seem to conform to the rule here of using parentheses for
> type casting, so I personally don't like it.

OK, as long as it's intuitive and is easy to remember. I hate having to
look up the documents every time I have to do some probe argument
parsing :-(

> 
> 
> > > > diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> > > > index e0d3a0da26af..b0829eb1cb52 100644
> > > > --- a/kernel/trace/trace_probe.c
> > > > +++ b/kernel/trace/trace_probe.c
> > > > @@ -464,6 +464,26 @@ static const char *fetch_type_from_btf_type(struct btf *btf,
> > > >  	return NULL;
> > > >  }
> > > >  
> > > > +static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
> > > > +{
> > > > +	int id;
> > > > +
> > > > +	if (!ctx->btf) {
> > > > +		struct btf *btf;    
> > > 
> > > This needs an empty line here.  
> > 
> > Sure.
> > 
> > For conditional blocks, I don't always add a newline, but this is your code
> > and I'll follow your suggestions.  
> 
> Ah, this is just for fixing checkpatch.pl warning :-)

I added it to keep your checkpatch happy.


> > > > @@ -471,12 +491,12 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
> > > >  	struct btf *btf;
> > > >  	s32 nr;
> > > >  
> > > > -	if (ctx->btf)
> > > > -		return 0;
> > > > -
> > > >  	if (!ctx->funcname)
> > > >  		return -EINVAL;
> > > >  
> > > > +	if (ctx->btf)
> > > > +		return 0;
> > > > +    
> > > 
> > > Could you tell me why this order is changed?
> > > I think this type casting will allow us to skip checking funcname
> > > because btf context is already specified.  
> > 
> > I wanted this to fail if btf was already set but funcname wasn't, because
> > this should only be called for functions.  
> 
> Hmm, OK. Then, can you make a separate patch for this?

I added a struct_btf (I can change it to typecast_btf) and have a
helper function that is:

static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
{
	return ctx->flags & TPARG_FL_STRUCT ?
		ctx->struct_btf : ctx->btf;
}

And have all functions get their btf from that. This way we can keep
the two allocations separate.

> 
> >   
> > > 
> > > Ah, BTW, we may need to use a special struct btf* for type
> > > casting. If the target function is in a module and the
> > > casting type is defined in vmlinux, those are stored in
> > > the different places...  
> > 
> > OK, I'll make a separate btf for it then. I'll have to make sure the btf
> > used for parsing knows which one to use. Shouldn't be too hard if we check
> > for the STRUCT flag in the ctx->flags.  
> 
> Yeah, and personally, I think that flag should be the TYPECAST flag.

I'll update it.

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH v4] tracing/probes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-19 16:38 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: LKML, Linux Trace Kernel, bpf, Mathieu Desnoyers, Mark Rutland,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
	Tom Zanussi, Andrew Morton, Thomas Gleixner, Ian Rogers,
	Jiri Olsa
In-Reply-To: <20260519122836.0bfded70@fedora>

On Tue, 19 May 2026 12:28:36 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I added a struct_btf (I can change it to typecast_btf) and have a
> helper function that is:

Actually, I'm going to keep it as struct_btf as it is a btf for a
structure. I did change the flag to be TYPECAST though.

-- Steve


> 
> static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
> {
> 	return ctx->flags & TPARG_FL_STRUCT ?
> 		ctx->struct_btf : ctx->btf;
> }


^ permalink raw reply

* Re: [PATCH 6/9] rv: Ensure synchronous cleanup for HA monitors
From: Wen Yang @ 2026-05-19 16:48 UTC (permalink / raw)
  To: Gabriele Monaco; +Cc: linux-kernel, Steven Rostedt, Nam Cao, linux-trace-kernel
In-Reply-To: <88a6fc5c08d18e3c1f6d29dc106db80fa688bf87.camel@redhat.com>



On 5/18/26 19:54, Gabriele Monaco wrote:
> On Sun, 2026-05-17 at 17:12 +0800, Wen Yang wrote:
>> The guard(rcu)() + synchronize_rcu() mechanism for HA timer callbacks
>> is correct.
>>
>> One concern: TOCTOU between the pre-check and guard(rcu)().
> 
> Yes, this could happen, but I'm not sure it's really a big issue:
> 
>>
>> da_monitor_reset() calls reset_hook BEFORE clearing monitoring:
>>
>>     da_monitor_reset_hook(da_mon);        /* ha_cancel_timer [async]   */
>>     WRITE_ONCE(da_mon->monitoring, 0);    /* cleared AFTER reset_hook  */
>>     da_mon->curr_state = model_get_initial_state();
>>
>> This may creates a window where the callback pre-check passes but the
>> monitor is reset before guard(rcu)() is acquired:
> 
> If a callback is running, there was a violation because the timer expired, so it
> isn't wrong to report, although we are unloading the monitor.
> 
>>
>>     /* __ha_monitor_timer_callback() */
>>     if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
>>         return;
>>
>>     /* passes: monitoring=1
>>      *
>>      * WINDOW ─ CPU A runs da_monitor_reset_all() here:
>>      *   ha_cancel_timer()  [returns: callback is running, cannot cancel]
>>      *   WRITE_ONCE(monitoring, 0)
>>      *   curr_state = model_get_initial_state()
>>      */
>>     guard(rcu)();
>>     curr_state = READ_ONCE(ha_mon->da_mon.curr_state);  /* initial_state */
>>     /* no second da_monitoring() check */
>>     ha_react(curr_state, EVENT_NONE, env_string.buffer); /* spurious call */
>>     ha_trace_error_env(ha_mon, ...);                     /* fires
>> unconditionally */
>>
>> Result: spurious ha_trace_error_env() for initial_state.  For existing
>> monitors (stall/nomiss/opid), model_should_send_event_env(initial, NONE)
>> returns false, so no false-positive reaction, but the trace event fires.
>> Monitors where initial_state carries a constraint would produce a false
>> positive.
> 
> I'm not sure what you mean here, if I understand the situation correctly: the
> callback is running (so we should react), da_monitor_reset() is too late to stop
> it but somehow manages to reset curr_state on time for the callback to see it
> change: react reports the wrong state in an otherwise valid reaction.
> 
>>
>> Proposed fix : re-check inside the RCU critical section:
>>
>>     guard(rcu)();
>>     if (unlikely(!da_monitoring(&ha_mon->da_mon)))  /* re-check here */
>>         return;
>>     curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
> 
> I'm not sure that's going to fix it anyway, RCU cannot synchronise readers,
> checking again would at most (mildly) reduce the race window, not remove it.
> 
> What we could do is to play with barriersin for the callback to either:
> * see monitoring = 1 AND the old curr_state
> * see monitoring = 0 AND the new curr_state
> 
> Something like:
> 
> void __ha_monitor_timer_callback() {
> 	guard(rcu)(); //this is only for waiters, let them wait more
> 
> 	if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
> 		return;
> 	smp_rmb();
> 	curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
> 	...
> }
> 
> void da_monitor_reset() {
> 	da_monitor_reset_hook(da_mon);
> 	WRITE_ONCE(da_mon->monitoring, 0);
> 	smp_wmb();
> 	WRITE_ONCE(da_mon->curr_state, model_get_initial_state());
> }
> 
> Coupled with your patch [1] adding more atomic accesses to da_mon->monitoring
> should probably do the trick.
> 
> Am I missing anything?
> 

The goal is right.  One thing worth double-checking is the load order
in the callback against the "SMP BARRIER PAIRING" section of
Documentation/memory-barriers.txt, which states:

   [!] Note that the stores before the write barrier would normally be
   expected to match the loads after the read barrier or the
   address-dependency barrier, and vice versa ...

So, we should to swap the read order in the callback so that it matches
the standard pattern:

   void __ha_monitor_timer_callback() {
         guard(rcu)();
         curr_state = READ_ONCE(ha_mon->da_mon.curr_state);  /* B: 
before rmb */
         smp_rmb();
         if (unlikely(!da_monitoring(&ha_mon->da_mon)))       /* A: 
after rmb */
                 return;
         /*
          * Reached here: monitoring = 1 (old_A).
          * Standard wmb/rmb guarantee: curr_state (read before rmb) is also
          * old, i.e. not initial_state.
          */
         ha_react(curr_state, EVENT_NONE, env_string.buffer);
         ...
   }

   void da_monitor_reset() {
         da_monitor_reset_hook(da_mon);
         WRITE_ONCE(da_mon->monitoring, 0);   /* A: before wmb */
         smp_wmb();
         WRITE_ONCE(da_mon->curr_state, model_get_initial_state());  /* 
B: after wmb */
   }



--
Best wishes,
Wen

> 
> [1] -
> https://lore.kernel.org/lkml/8af5ba4bd93d2acb8a546e8e47ced974a87c1eb8.1778522945.git.wen.yang@linux.dev
> 
>>
>>
>> --
>> Best wishes,
>> Wen
>>
>>
>> On 5/12/26 22:02, Gabriele Monaco wrote:
>>> HA monitors may start timers, all cleanup functions currently stop the
>>> timers asynchronously to avoid sleeping in the wrong context.
>>> Nothing makes sure running callbacks terminate on cleanup.
>>>
>>> Run the entire HA timer callback in an RCU read-side critical section,
>>> this way we can simply synchronize_rcu() with any pending timer and are
>>> sure any cleanup using kfree_rcu() runs after callbacks terminated.
>>> Additionally make sure any unlikely callback running late won't run any
>>> code if the monitor is marked as disabled.
>>>
>>> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
>>> Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
>>> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
>>> ---
>>>    include/rv/da_monitor.h | 23 +++++++++++++++++++----
>>>    include/rv/ha_monitor.h | 18 ++++++++++++++++--
>>>    2 files changed, 35 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
>>> index a4a13b62d1a4..402d3b935c08 100644
>>> --- a/include/rv/da_monitor.h
>>> +++ b/include/rv/da_monitor.h
>>> @@ -57,6 +57,15 @@ static struct rv_monitor rv_this;
>>>    #define da_monitor_reset_hook(da_mon)
>>>    #endif
>>>    
>>> +/*
>>> + * Hook to allow the implementation of hybrid automata: define it with a
>>> + * function that waits for the termination of all monitors background
>>> + * activities (e.g. all timers). This hook can sleep.
>>> + */
>>> +#ifndef da_monitor_sync_hook
>>> +#define da_monitor_sync_hook()
>>> +#endif
>>> +
>>>    /*
>>>     * Type for the target id, default to int but can be overridden.
>>>     * A long type can work as hash table key (PER_OBJ) but will be downgraded
>>> to
>>> @@ -179,6 +188,7 @@ static inline int da_monitor_init(void)
>>>    static inline void da_monitor_destroy(void)
>>>    {
>>>    	da_monitor_reset_all();
>>> +	da_monitor_sync_hook();
>>>    }
>>>    
>>>    #ifndef da_implicit_guard
>>> @@ -232,6 +242,7 @@ static inline int da_monitor_init(void)
>>>    static inline void da_monitor_destroy(void)
>>>    {
>>>    	da_monitor_reset_all();
>>> +	da_monitor_sync_hook();
>>>    }
>>>    
>>>    #ifndef da_implicit_guard
>>> @@ -319,6 +330,7 @@ static inline void da_monitor_destroy(void)
>>>    	}
>>>    
>>>    	da_monitor_reset_all();
>>> +	da_monitor_sync_hook();
>>>    
>>>    	rv_put_task_monitor_slot(task_mon_slot);
>>>    	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
>>> @@ -497,10 +509,9 @@ static void da_monitor_reset_all(void)
>>>    	struct da_monitor_storage *mon_storage;
>>>    	int bkt;
>>>    
>>> -	rcu_read_lock();
>>> +	guard(rcu)();
>>>    	hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node)
>>>    		da_monitor_reset(&mon_storage->rv.da_mon);
>>> -	rcu_read_unlock();
>>>    }
>>>    
>>>    static inline int da_monitor_init(void)
>>> @@ -516,13 +527,17 @@ static inline void da_monitor_destroy(void)
>>>    	int bkt;
>>>    
>>>    	tracepoint_synchronize_unregister();
>>> +	scoped_guard(rcu) {
>>> +		hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node) {
>>> +			da_monitor_reset_hook(&mon_storage->rv.da_mon);
>>> +		}
>>> +	}
>>> +	da_monitor_sync_hook();
>>>    	/*
>>>    	 * This function is called after all probes are disabled and no
>>> longer
>>>    	 * pending, we can safely assume no concurrent user.
>>>    	 */
>>> -	synchronize_rcu();
>>>    	hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node) {
>>> -		da_monitor_reset_hook(&mon_storage->rv.da_mon);
>>>    		hash_del_rcu(&mon_storage->node);
>>>    		kfree(mon_storage);
>>>    	}
>>> diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
>>> index d59507e8cb30..47ff1a41febe 100644
>>> --- a/include/rv/ha_monitor.h
>>> +++ b/include/rv/ha_monitor.h
>>> @@ -36,6 +36,7 @@ static bool ha_monitor_handle_constraint(struct da_monitor
>>> *da_mon,
>>>    #define da_monitor_event_hook ha_monitor_handle_constraint
>>>    #define da_monitor_init_hook ha_monitor_init_env
>>>    #define da_monitor_reset_hook ha_monitor_reset_env
>>> +#define da_monitor_sync_hook() synchronize_rcu()
>>>    
>>>    #include <rv/da_monitor.h>
>>>    #include <linux/seq_buf.h>
>>> @@ -237,12 +238,25 @@ static bool ha_monitor_handle_constraint(struct
>>> da_monitor *da_mon,
>>>    	return false;
>>>    }
>>>    
>>> +/*
>>> + * __ha_monitor_timer_callback - generic callback representation
>>> + *
>>> + * This callback runs in an RCU read-side critical section to allow the
>>> + * destruction sequence to easily synchronize_rcu() with all pending timer
>>> + * after asynchronously disabling them.
>>> + */
>>>    static inline void __ha_monitor_timer_callback(struct ha_monitor *ha_mon)
>>>    {
>>> -	enum states curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
>>>    	DECLARE_SEQ_BUF(env_string, ENV_BUFFER_SIZE);
>>> -	u64 time_ns = ha_get_ns();
>>> +	enum states curr_state;
>>> +	u64 time_ns;
>>> +
>>> +	if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
>>> +		return;
>>>    
>>> +	guard(rcu)();
>>> +	curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
>>> +	time_ns = ha_get_ns();
>>>    	ha_get_env_string(&env_string, ha_mon, time_ns);
>>>    	ha_react(curr_state, EVENT_NONE, env_string.buffer);
>>>    	ha_trace_error_env(ha_mon, model_get_state_name(curr_state),
> 

^ permalink raw reply

* [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-19 17:01 UTC (permalink / raw)
  To: LKML, Linux trace kernel, bpf
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
	Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa

From: Steven Rostedt <rostedt@goodmis.org>

Add syntax to the parsing of eprobes to be able to typecast a trace event
field that is a pointer to a structure.

Currently, a dereference must be a number, where the user has to figure
out manually the offset of a member of a structure that they want to
dereference.

But for event probes that records a field that happens to be a pointer to
a structure, it cannot dereference these values with BTF naming, but
must use numerical offsets.

For example, to find out what device a sk_buff is pointing to in the
net_dev_xmit trace event, one must first use gdb to find the offsets of the
members of the structures:

 (gdb) p &((struct sk_buff *)0)->dev
 $1 = (struct net_device **) 0x10
 (gdb) p &((struct net_device *)0)->name
 $2 = (char (*)[16]) 0x118

And then use the raw numbers to dereference:

  # echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events

If BTF is in the kernel, then instead, the skbaddr can be typecast to
sk_buff and use the normal dereference logic.

  # echo 'e:xmit net.net_dev_xmit (sk_buff)skbaddr->dev->name:string' >> dynamic_events
  # echo 1 > events/eprobes/xmit/enable
  # cat trace
[..]
    sshd-session-1022    [000] b..2.   860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0"

The syntax is simply: (STRUCT)(FIELD)->MEMBER[->MEMBER..]

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes from v4: https://patch.msgid.link/20260518232312.0c78f055@gandalf.local.home

- Only describe this for eprobes, as that is currently the only user of
  this. Kprobes registers can come later. (Sashiko)

- Update the documentation to only be in the eprobetrace.rst.

- Make the btf value separate for structures and functions. As a function
  may use a BTF from a module, it can point to a structure that is in
  vmlinux. Make them separate and add a ctx_btf() helper function to
  figure out which one to use. (Masami Hiramatsu)

- Remove updating query_btf_context() by having it check funcname first.
  The above ctx_btf() helper fixes this.

- Rename TPARG_FL_STRUCT to TPARG_FL_TYPECAST (Masami Hiramatsu)

- Have the typecast code call parse_btf_arg() directly, instead of going
  through parse_probe_vars(). This fixes issues with updating the pcode
  value, and simplifies the formatting and code (Sashiko).

  Added a parse_trace_event() helper to duplicate the handling of
  TPARG_FL_TEVENT.

- Use "goto found_type;" instead of relying on type being NULL in
  parse_btf_arg(). (Masami Hiramatsu)

- Add query_btf_struct() stub function when CONFIG_PROBE_EVENTS_BTF_ARGS
  is not defined. (kernel test robot)

- Remove need for '*' in typecast (Masami Hiramatsu)

- Have the event fields not use '$' when the typecast is used.
  (Masami Hiramatsu)

 Documentation/trace/eprobetrace.rst |   4 +
 kernel/trace/trace_probe.c          | 158 ++++++++++++++++++++++------
 kernel/trace/trace_probe.h          |   4 +
 3 files changed, 135 insertions(+), 31 deletions(-)

diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 89b5157cfab8..fe3602540569 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -46,6 +46,10 @@ Synopsis of eprobe_events
 		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
                   "string", "ustring", "symbol", "symstr" and "bitfield" are
                   supported.
+  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+                  a pointer to STRUCT and then derference the pointer defined by
+                  ->MEMBER. Note that when this is used, the FIELD name does not
+                  need to be prefixed with a '$'.
 
 Types
 -----
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e0d3a0da26af..cd54deb985f4 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -376,11 +376,17 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
 		&& BTF_INT_BITS(intdata) == 8;
 }
 
+static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
+{
+	return ctx->flags & TPARG_FL_TYPECAST ?
+		ctx->struct_btf : ctx->btf;
+}
+
 static int check_prepare_btf_string_fetch(char *typename,
 				struct fetch_insn **pcode,
 				struct traceprobe_parse_context *ctx)
 {
-	struct btf *btf = ctx->btf;
+	struct btf *btf = ctx_btf(ctx);
 
 	if (!btf || !ctx->last_type)
 		return 0;
@@ -464,6 +470,27 @@ static const char *fetch_type_from_btf_type(struct btf *btf,
 	return NULL;
 }
 
+static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
+{
+	int id;
+
+	if (!ctx->struct_btf) {
+		struct btf *btf;
+
+		id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
+		if (id < 0)
+			return id;
+		ctx->struct_btf = btf;
+	} else {
+		id = btf_find_by_name_kind(ctx->struct_btf, sname, BTF_KIND_STRUCT);
+		if (id < 0)
+			return id;
+	}
+
+	ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
+	return 0;
+}
+
 static int query_btf_context(struct traceprobe_parse_context *ctx)
 {
 	const struct btf_param *param;
@@ -515,6 +542,10 @@ static void clear_btf_context(struct traceprobe_parse_context *ctx)
 		ctx->params = NULL;
 		ctx->nr_params = 0;
 	}
+	if (ctx->struct_btf) {
+		btf_put(ctx->struct_btf);
+		ctx->last_struct = NULL;
+	}
 }
 
 /* Return 1 if the field separator is arrow operator ('->') */
@@ -554,22 +585,29 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 	struct fetch_insn *code = *pcode;
 	const struct btf_member *field;
 	u32 bitoffs, anon_offs;
+	bool is_struct = ctx->flags & TPARG_FL_TYPECAST;
+	struct btf *btf = ctx_btf(ctx);
 	char *next;
 	int is_ptr;
 	s32 tid;
 
 	do {
-		/* Outer loop for solving arrow operator ('->') */
-		if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
-			trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
-			return -EINVAL;
-		}
-		/* Convert a struct pointer type to a struct type */
-		type = btf_type_skip_modifiers(ctx->btf, type->type, &tid);
-		if (!type) {
-			trace_probe_log_err(ctx->offset, BAD_BTF_TID);
-			return -EINVAL;
+		if (!is_struct) {
+			/* Outer loop for solving arrow operator ('->') */
+			if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
+				trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+				return -EINVAL;
+			}
+
+			/* Convert a struct pointer type to a struct type */
+			type = btf_type_skip_modifiers(btf, type->type, &tid);
+			if (!type) {
+				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+				return -EINVAL;
+			}
 		}
+		/* Only the first type can skip being a pointer */
+		is_struct = false;
 
 		bitoffs = 0;
 		do {
@@ -580,7 +618,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 				return is_ptr;
 
 			anon_offs = 0;
-			field = btf_find_struct_member(ctx->btf, type, fieldname,
+			field = btf_find_struct_member(btf, type, fieldname,
 						       &anon_offs);
 			if (IS_ERR(field)) {
 				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -602,7 +640,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 				ctx->last_bitsize = 0;
 			}
 
-			type = btf_type_skip_modifiers(ctx->btf, field->type, &tid);
+			type = btf_type_skip_modifiers(btf, field->type, &tid);
 			if (!type) {
 				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
 				return -EINVAL;
@@ -627,6 +665,26 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 	return 0;
 }
 
+
+static int parse_trace_event(char *arg, struct fetch_insn *code,
+			     struct traceprobe_parse_context *ctx)
+{
+	int ret;
+
+	if (code->data)
+		return -EFAULT;
+	ret = parse_trace_event_arg(arg, code, ctx);
+	if (!ret)
+		return 0;
+	if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
+		code->op = FETCH_OP_COMM;
+		return 0;
+	}
+	/* backward compatibility */
+	ctx->offset = 0;
+	return -EINVAL;
+}
+
 static int __store_entry_arg(struct trace_probe *tp, int argnum);
 
 static int parse_btf_arg(char *varname,
@@ -636,11 +694,12 @@ static int parse_btf_arg(char *varname,
 	struct fetch_insn *code = *pcode;
 	const struct btf_param *params;
 	const struct btf_type *type;
+	struct btf *btf = ctx_btf(ctx);
 	char *field = NULL;
 	int i, is_ptr, ret;
 	u32 tid;
 
-	if (WARN_ON_ONCE(!ctx->funcname))
+	if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TYPECAST)))
 		return -EINVAL;
 
 	is_ptr = split_next_field(varname, &field, ctx);
@@ -653,6 +712,14 @@ static int parse_btf_arg(char *varname,
 		return -EOPNOTSUPP;
 	}
 
+	if (ctx->flags & TPARG_FL_TEVENT) {
+		int ret;
+
+		ret = parse_trace_event(varname, code, ctx);
+		if (ret < 0)
+			return ret;
+	}
+
 	if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
 		code->op = FETCH_OP_RETVAL;
 		/* Check whether the function return type is not void */
@@ -672,7 +739,7 @@ static int parse_btf_arg(char *varname,
 		return 0;
 	}
 
-	if (!ctx->btf) {
+	if (!btf) {
 		ret = query_btf_context(ctx);
 		if (ret < 0 || ctx->nr_params == 0) {
 			trace_probe_log_err(ctx->offset, NO_BTF_ENTRY);
@@ -682,7 +749,7 @@ static int parse_btf_arg(char *varname,
 	params = ctx->params;
 
 	for (i = 0; i < ctx->nr_params; i++) {
-		const char *name = btf_name_by_offset(ctx->btf, params[i].name_off);
+		const char *name = btf_name_by_offset(btf, params[i].name_off);
 
 		if (name && !strcmp(name, varname)) {
 			if (tparg_is_function_entry(ctx->flags)) {
@@ -704,15 +771,22 @@ static int parse_btf_arg(char *varname,
 			goto found;
 		}
 	}
+
+	if (ctx->flags & TPARG_FL_TYPECAST) {
+		type = ctx->last_struct;
+		goto found_type;
+	}
+
 	trace_probe_log_err(ctx->offset, NO_BTFARG);
 	return -ENOENT;
 
 found:
-	type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+	type = btf_type_skip_modifiers(btf, tid, &tid);
 	if (!type) {
 		trace_probe_log_err(ctx->offset, BAD_BTF_TID);
 		return -EINVAL;
 	}
+found_type:
 	/* Initialize the last type information */
 	ctx->last_type = type;
 	ctx->last_bitoffs = 0;
@@ -727,7 +801,7 @@ static int parse_btf_arg(char *varname,
 static const struct fetch_type *find_fetch_type_from_btf_type(
 					struct traceprobe_parse_context *ctx)
 {
-	struct btf *btf = ctx->btf;
+	struct btf *btf = ctx_btf(ctx);
 	const char *typestr = NULL;
 
 	if (btf && ctx->last_type)
@@ -764,6 +838,11 @@ static void clear_btf_context(struct traceprobe_parse_context *ctx)
 	ctx->btf = NULL;
 }
 
+static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
+{
+	return -EOPNOTSUPP;
+}
+
 static int query_btf_context(struct traceprobe_parse_context *ctx)
 {
 	return -EOPNOTSUPP;
@@ -953,18 +1032,9 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
 	int len;
 
 	if (ctx->flags & TPARG_FL_TEVENT) {
-		if (code->data)
-			return -EFAULT;
-		ret = parse_trace_event_arg(arg, code, ctx);
-		if (!ret)
-			return 0;
-		if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
-			code->op = FETCH_OP_COMM;
-			return 0;
-		}
-		/* backward compatibility */
-		ctx->offset = 0;
-		goto inval;
+		if (parse_trace_event(arg, code, ctx) < 0)
+			goto inval;
+		return 0;
 	}
 
 	if (str_has_prefix(arg, "retval")) {
@@ -1231,6 +1301,30 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 				code->op = FETCH_OP_IMM;
 		}
 		break;
+	case '(':
+		tmp = strchr(arg, ')');
+		if (!tmp) {
+			trace_probe_log_err(ctx->offset + strlen(arg),
+					    DEREF_OPEN_BRACE);
+			return -EINVAL;
+		}
+		*tmp = '\0';
+		ret = query_btf_struct(arg + 1, ctx);
+		*tmp = ')';
+
+		if (ret < 0) {
+			trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
+			return -EINVAL;
+		}
+
+		ctx->flags |= TPARG_FL_TYPECAST;
+		tmp++;
+
+		ctx->offset += tmp - arg;
+		ret = parse_btf_arg(tmp, pcode, end, ctx);
+		ctx->flags &= ~TPARG_FL_TYPECAST;
+		ctx->last_struct = NULL;
+		break;
 	default:
 		if (isalpha(arg[0]) || arg[0] == '_') {	/* BTF variable */
 			if (!tparg_is_function_entry(ctx->flags) &&
@@ -1504,6 +1598,7 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
 	code[FETCH_INSN_MAX - 1].op = FETCH_OP_END;
 
 	ctx->last_type = NULL;
+	ctx->last_struct = NULL;
 	ret = parse_probe_arg(arg, parg->type, &code, &code[FETCH_INSN_MAX - 1],
 			      ctx);
 	if (ret < 0)
@@ -1705,13 +1800,14 @@ static int sprint_nth_btf_arg(int idx, const char *type,
 			      struct traceprobe_parse_context *ctx)
 {
 	const char *name;
+	struct btf *btf = ctx_btf(ctx);
 	int ret;
 
 	if (idx >= ctx->nr_params) {
 		trace_probe_log_err(0, NO_BTFARG);
 		return -ENOENT;
 	}
-	name = btf_name_by_offset(ctx->btf, ctx->params[idx].name_off);
+	name = btf_name_by_offset(btf, ctx->params[idx].name_off);
 	if (!name) {
 		trace_probe_log_err(0, NO_BTF_ENTRY);
 		return -ENOENT;
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 262d8707a3df..6735b8b2432b 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -394,6 +394,7 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
  * TPARG_FL_KERNEL and TPARG_FL_USER are also mutually exclusive.
  * TPARG_FL_FPROBE and TPARG_FL_TPOINT are optional but it should be with
  * TPARG_FL_KERNEL.
+ * TPARG_FL_TYPECAST is set if an argument was typecast to a structure.
  */
 #define TPARG_FL_RETURN BIT(0)
 #define TPARG_FL_KERNEL BIT(1)
@@ -402,6 +403,7 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
 #define TPARG_FL_USER   BIT(4)
 #define TPARG_FL_FPROBE BIT(5)
 #define TPARG_FL_TPOINT BIT(6)
+#define TPARG_FL_TYPECAST BIT(7)
 #define TPARG_FL_LOC_MASK	GENMASK(4, 0)
 
 static inline bool tparg_is_function_entry(unsigned int flags)
@@ -422,7 +424,9 @@ struct traceprobe_parse_context {
 	const struct btf_param *params;	/* Parameter of the function */
 	s32 nr_params;			/* The number of the parameters */
 	struct btf *btf;		/* The BTF to be used */
+	struct btf *struct_btf;		/* The BTF to be used for structs */
 	const struct btf_type *last_type;	/* Saved type */
+	const struct btf_type *last_struct;	/* Saved structure */
 	u32 last_bitoffs;		/* Saved bitoffs */
 	u32 last_bitsize;		/* Saved bitsize */
 	struct trace_probe *tp;
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-19 17:28 UTC (permalink / raw)
  To: LKML, Linux trace kernel, bpf
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
	Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa
In-Reply-To: <20260519130144.40e71a00@fedora>

On Tue, 19 May 2026 13:01:44 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> @@ -636,11 +694,12 @@ static int parse_btf_arg(char *varname,
>  	struct fetch_insn *code = *pcode;
>  	const struct btf_param *params;
>  	const struct btf_type *type;
> +	struct btf *btf = ctx_btf(ctx);
>  	char *field = NULL;
>  	int i, is_ptr, ret;
>  	u32 tid;
>  
> -	if (WARN_ON_ONCE(!ctx->funcname))
> +	if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TYPECAST)))
>  		return -EINVAL;
>  
>  	is_ptr = split_next_field(varname, &field, ctx);
> @@ -653,6 +712,14 @@ static int parse_btf_arg(char *varname,
>  		return -EOPNOTSUPP;
>  	}
>  
> +	if (ctx->flags & TPARG_FL_TEVENT) {
> +		int ret;
> +
> +		ret = parse_trace_event(varname, code, ctx);
> +		if (ret < 0)
> +			return ret;
> +	}
> +
>  	if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
>  		code->op = FETCH_OP_RETVAL;
>  		/* Check whether the function return type is not void */
> @@ -672,7 +739,7 @@ static int parse_btf_arg(char *varname,
>  		return 0;
>  	}
>  
> -	if (!ctx->btf) {
> +	if (!btf) {
>  		ret = query_btf_context(ctx);

Oops, need a:

		btf = ctx->btf;

here!

-- Steve

>  		if (ret < 0 || ctx->nr_params == 0) {
>  			trace_probe_log_err(ctx->offset, NO_BTF_ENTRY);

^ permalink raw reply

* Re: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-19 17:37 UTC (permalink / raw)
  To: LKML, Linux trace kernel, bpf
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
	Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa
In-Reply-To: <20260519132830.3011a194@fedora>

On Tue, 19 May 2026 13:28:30 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 19 May 2026 13:01:44 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > @@ -636,11 +694,12 @@ static int parse_btf_arg(char *varname,
> >  	struct fetch_insn *code = *pcode;
> >  	const struct btf_param *params;
> >  	const struct btf_type *type;
> > +	struct btf *btf = ctx_btf(ctx);
> >  	char *field = NULL;
> >  	int i, is_ptr, ret;
> >  	u32 tid;
> >  
> > -	if (WARN_ON_ONCE(!ctx->funcname))
> > +	if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TYPECAST)))
> >  		return -EINVAL;
> >  
> >  	is_ptr = split_next_field(varname, &field, ctx);
> > @@ -653,6 +712,14 @@ static int parse_btf_arg(char *varname,
> >  		return -EOPNOTSUPP;
> >  	}
> >  
> > +	if (ctx->flags & TPARG_FL_TEVENT) {
> > +		int ret;
> > +
> > +		ret = parse_trace_event(varname, code, ctx);
> > +		if (ret < 0)
> > +			return ret;

Actually, since typecasting is currently only for eprobes, I'm going to
shuffle the code around a bit to only have the TYPECAST affect TEVENT,
and do the jump to found_type from here.

-- Steve


> > +	}
> > +
> >  	if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
> >  		code->op = FETCH_OP_RETVAL;
> >  		/* Check whether the function return type is not void */
> > @@ -672,7 +739,7 @@ static int parse_btf_arg(char *varname,
> >  		return 0;
> >  	}
> >  
> > -	if (!ctx->btf) {
> > +	if (!btf) {
> >  		ret = query_btf_context(ctx);  
> 
> Oops, need a:
> 
> 		btf = ctx->btf;
> 
> here!
> 
> -- Steve
> 
> >  		if (ret < 0 || ctx->nr_params == 0) {
> >  			trace_probe_log_err(ctx->offset, NO_BTF_ENTRY);  


^ permalink raw reply

* Re: [PATCH v3 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Steven Rostedt @ 2026-05-19 17:54 UTC (permalink / raw)
  To: Praveen Talari
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Brown, linux-kernel,
	linux-trace-kernel, linux-arm-msm, linux-spi, mukesh.savaliya,
	aniket.randive, chandana.chiluveru, jyothi.seerapu, Konrad Dybcio
In-Reply-To: <20260518-add-tracepoints-for-qcom-geni-spi-v3-1-7928f6810a79@oss.qualcomm.com>

On Mon, 18 May 2026 22:30:51 +0530
Praveen Talari <praveen.talari@oss.qualcomm.com> wrote:

> @@ -0,0 +1,103 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM qcom_geni_spi
> +
> +#if !defined(_TRACE_QCOM_GENI_SPI_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_QCOM_GENI_SPI_H
> +
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(geni_spi_setup_params,
> +	    TP_PROTO(struct device *dev, u8 cs, u32 mode,
> +		     u32 mode_changed, bool cs_changed),
> +	    TP_ARGS(dev, cs, mode, mode_changed, cs_changed),
> +
> +	    TP_STRUCT__entry(__string(name, dev_name(dev))
> +			     __field(u8, cs)

A u8 followed by a u32 will create a 3 byte hole in the structure
layout that gets recorded onto the ring buffer. Best to move that field
to after the bool cs_changed, for better compaction.

> +			     __field(u32, mode)
> +			     __field(u32, mode_changed)
> +			     __field(bool, cs_changed)
> +	    ),
> +
> +	    TP_fast_assign(__assign_str(name);
> +			   __entry->cs = cs;
> +			   __entry->mode = mode;
> +			   __entry->mode_changed = mode_changed;
> +			   __entry->cs_changed = cs_changed;
> +	    ),
> +
> +	    TP_printk("%s: cs=%u mode=0x%08x mode_changed=0x%08x cs_changed=%d",
> +		      __get_str(name), __entry->cs, __entry->mode,
> +		      __entry->mode_changed, __entry->cs_changed)
> +);
> +
> +TRACE_EVENT(geni_spi_clk_cfg,
> +	    TP_PROTO(struct device *dev, unsigned long req_hz,
> +		     unsigned long sclk_hz, unsigned int clk_idx,
> +		     unsigned int clk_div, unsigned int bpw),
> +	    TP_ARGS(dev, req_hz, sclk_hz, clk_idx, clk_div, bpw),
> +
> +	    TP_STRUCT__entry(__string(name, dev_name(dev))

__string items inject a 4 byte meta data so they are basically the same
as a u32 item on the structure. Move this to the end or after the long
words so that on 64 bit architectures you don't create a 4 byte hole
here.

-- Steve

> +			     __field(unsigned long, req_hz)
> +			     __field(unsigned long, sclk_hz)
> +			     __field(unsigned int, clk_idx)
> +			     __field(unsigned int, clk_div)
> +			     __field(unsigned int, bpw)
> +	    ),
> +
> +	    TP_fast_assign(__assign_str(name);
> +			   __entry->req_hz = req_hz;
> +			   __entry->sclk_hz = sclk_hz;
> +			   __entry->clk_idx = clk_idx;
> +			   __entry->clk_div = clk_div;
> +			   __entry->bpw = bpw;
> +	    ),
> +
> +	    TP_printk("%s: req_hz=%lu sclk_hz=%lu clk_idx=%u clk_div=%u bpw=%u",
> +		      __get_str(name), __entry->req_hz, __entry->sclk_hz,
> +		      __entry->clk_idx, __entry->clk_div, __entry->bpw)
> +);

^ permalink raw reply

* Re: [PATCH] gpu: host1x: trace: fix string fields in host1x traces
From: Steven Rostedt @ 2026-05-19 18:10 UTC (permalink / raw)
  To: Artur Kowalski
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, linux-tegra, Thierry Reding, Mikko Perttunen,
	David Airlie, Simona Vetter
In-Reply-To: <20260519-host1x-tracing-v1-1-55afb8cbd186@gmail.com>

On Tue, 19 May 2026 12:16:43 +0200
Artur Kowalski <arturkow2000@gmail.com> wrote:

> Use __assign_str and __get_str as required by tracing subsystem. Fixes
> string fields being rejected by the verifier and unreadable from
> userspace.

Does anyone use these tracepoints? The fact that they have been broken
for 5 years and nobody noticed makes me think they are useless.

I rather remove them than fix them, but if someone thinks that these
are still useful then by all means apply this patch.

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve


> 
> Tested on v6.18.21.
> 
> Signed-off-by: Artur Kowalski <arturkow2000@gmail.com>
> ---
>  include/trace/events/host1x.h | 50 ++++++++++++++++++++++---------------------
>  1 file changed, 26 insertions(+), 24 deletions(-)
> 
> diff --git a/include/trace/events/host1x.h b/include/trace/events/host1x.h
> index 1ba84b738e46..1b6aeb7b177b 100644
> --- a/include/trace/events/host1x.h
> +++ b/include/trace/events/host1x.h
> @@ -21,9 +21,11 @@ struct host1x_bo;
>  DECLARE_EVENT_CLASS(host1x,
>  	TP_PROTO(const char *name),
>  	TP_ARGS(name),
> -	TP_STRUCT__entry(__field(const char *, name)),
> -	TP_fast_assign(__entry->name = name;),
> -	TP_printk("name=%s", __entry->name)
> +	TP_STRUCT__entry(__string(name, name)),
> +	TP_fast_assign(
> +		__assign_str(name);
> +	),
> +	TP_printk("name=%s", __get_str(name))
>  );
>  
>  DEFINE_EVENT(host1x, host1x_channel_open,
> @@ -52,19 +54,19 @@ TRACE_EVENT(host1x_cdma_push,
>  	TP_ARGS(name, op1, op2),
>  
>  	TP_STRUCT__entry(
> -		__field(const char *, name)
> +		__string(name, name)
>  		__field(u32, op1)
>  		__field(u32, op2)
>  	),
>  
>  	TP_fast_assign(
> -		__entry->name = name;
> +		__assign_str(name);
>  		__entry->op1 = op1;
>  		__entry->op2 = op2;
>  	),
>  
>  	TP_printk("name=%s, op1=%08x, op2=%08x",
> -		__entry->name, __entry->op1, __entry->op2)
> +		__get_str(name), __entry->op1, __entry->op2)
>  );
>  
>  TRACE_EVENT(host1x_cdma_push_wide,
> @@ -73,7 +75,7 @@ TRACE_EVENT(host1x_cdma_push_wide,
>  	TP_ARGS(name, op1, op2, op3, op4),
>  
>  	TP_STRUCT__entry(
> -		__field(const char *, name)
> +		__string(name, name)
>  		__field(u32, op1)
>  		__field(u32, op2)
>  		__field(u32, op3)
> @@ -81,7 +83,7 @@ TRACE_EVENT(host1x_cdma_push_wide,
>  	),
>  
>  	TP_fast_assign(
> -		__entry->name = name;
> +		__assign_str(name);
>  		__entry->op1 = op1;
>  		__entry->op2 = op2;
>  		__entry->op3 = op3;
> @@ -89,7 +91,7 @@ TRACE_EVENT(host1x_cdma_push_wide,
>  	),
>  
>  	TP_printk("name=%s, op1=%08x, op2=%08x, op3=%08x op4=%08x",
> -		__entry->name, __entry->op1, __entry->op2, __entry->op3,
> +		__get_str(name), __entry->op1, __entry->op2, __entry->op3,
>  		__entry->op4)
>  );
>  
> @@ -100,7 +102,7 @@ TRACE_EVENT(host1x_cdma_push_gather,
>  	TP_ARGS(name, bo, words, offset, cmdbuf),
>  
>  	TP_STRUCT__entry(
> -		__field(const char *, name)
> +		__string(name, name)
>  		__field(struct host1x_bo *, bo)
>  		__field(u32, words)
>  		__field(u32, offset)
> @@ -114,14 +116,14 @@ TRACE_EVENT(host1x_cdma_push_gather,
>  					words * sizeof(u32));
>  		}
>  		__entry->cmdbuf = cmdbuf;
> -		__entry->name = name;
> +		__assign_str(name);
>  		__entry->bo = bo;
>  		__entry->words = words;
>  		__entry->offset = offset;
>  	),
>  
>  	TP_printk("name=%s, bo=%p, words=%u, offset=%d, contents=[%s]",
> -	  __entry->name, __entry->bo,
> +	  __get_str(name), __entry->bo,
>  	  __entry->words, __entry->offset,
>  	  __print_hex(__get_dynamic_array(cmdbuf),
>  		  __entry->cmdbuf ? __entry->words * 4 : 0))
> @@ -134,7 +136,7 @@ TRACE_EVENT(host1x_channel_submit,
>  	TP_ARGS(name, cmdbufs, relocs, syncpt_id, syncpt_incrs),
>  
>  	TP_STRUCT__entry(
> -		__field(const char *, name)
> +		__string(name, name)
>  		__field(u32, cmdbufs)
>  		__field(u32, relocs)
>  		__field(u32, syncpt_id)
> @@ -142,7 +144,7 @@ TRACE_EVENT(host1x_channel_submit,
>  	),
>  
>  	TP_fast_assign(
> -		__entry->name = name;
> +		__assign_str(name);
>  		__entry->cmdbufs = cmdbufs;
>  		__entry->relocs = relocs;
>  		__entry->syncpt_id = syncpt_id;
> @@ -151,7 +153,7 @@ TRACE_EVENT(host1x_channel_submit,
>  
>  	TP_printk("name=%s, cmdbufs=%u, relocs=%u, syncpt_id=%u, "
>  		  "syncpt_incrs=%u",
> -		  __entry->name, __entry->cmdbufs, __entry->relocs,
> +		  __get_str(name), __entry->cmdbufs, __entry->relocs,
>  		  __entry->syncpt_id, __entry->syncpt_incrs)
>  );
>  
> @@ -161,19 +163,19 @@ TRACE_EVENT(host1x_channel_submitted,
>  	TP_ARGS(name, syncpt_base, syncpt_max),
>  
>  	TP_STRUCT__entry(
> -		__field(const char *, name)
> +		__string(name, name)
>  		__field(u32, syncpt_base)
>  		__field(u32, syncpt_max)
>  	),
>  
>  	TP_fast_assign(
> -		__entry->name = name;
> +		__assign_str(name);
>  		__entry->syncpt_base = syncpt_base;
>  		__entry->syncpt_max = syncpt_max;
>  	),
>  
>  	TP_printk("name=%s, syncpt_base=%d, syncpt_max=%d",
> -		__entry->name, __entry->syncpt_base, __entry->syncpt_max)
> +		__get_str(name), __entry->syncpt_base, __entry->syncpt_max)
>  );
>  
>  TRACE_EVENT(host1x_channel_submit_complete,
> @@ -182,19 +184,19 @@ TRACE_EVENT(host1x_channel_submit_complete,
>  	TP_ARGS(name, count, thresh),
>  
>  	TP_STRUCT__entry(
> -		__field(const char *, name)
> +		__string(name, name)
>  		__field(int, count)
>  		__field(u32, thresh)
>  	),
>  
>  	TP_fast_assign(
> -		__entry->name = name;
> +		__assign_str(name);
>  		__entry->count = count;
>  		__entry->thresh = thresh;
>  	),
>  
>  	TP_printk("name=%s, count=%d, thresh=%d",
> -		__entry->name, __entry->count, __entry->thresh)
> +		__get_str(name), __entry->count, __entry->thresh)
>  );
>  
>  TRACE_EVENT(host1x_wait_cdma,
> @@ -203,16 +205,16 @@ TRACE_EVENT(host1x_wait_cdma,
>  	TP_ARGS(name, eventid),
>  
>  	TP_STRUCT__entry(
> -		__field(const char *, name)
> +		__string(name, name)
>  		__field(u32, eventid)
>  	),
>  
>  	TP_fast_assign(
> -		__entry->name = name;
> +		__assign_str(name);
>  		__entry->eventid = eventid;
>  	),
>  
> -	TP_printk("name=%s, event=%d", __entry->name, __entry->eventid)
> +	TP_printk("name=%s, event=%d", __get_str(name), __entry->eventid)
>  );
>  
>  TRACE_EVENT(host1x_syncpt_load_min,
> 
> ---
> base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
> change-id: 20260519-host1x-tracing-e2d608ec5e37
> 
> Best regards,
> --  
> Artur Kowalski <arturkow2000@gmail.com>


^ permalink raw reply

* Re: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-19 18:17 UTC (permalink / raw)
  To: sashiko-bot
  Cc: sashiko-reviews, bpf, LKML, Linux trace kernel, Masami Hiramatsu
In-Reply-To: <20260519174848.176A6C2BCB3@smtp.kernel.org>

On Tue, 19 May 2026 17:48:47 +0000
sashiko-bot@kernel.org wrote:

> Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
> - [High] Using `(STRUCT)` typecasts for non-eprobes leaves the base fetch instruction uninitialized (`FETCH_OP_NOP`), silently breaking the probe.
> - [High] Prematurely clearing `TPARG_FL_TYPECAST` causes cross-BTF type confusion and breaks string fetches.
> - [Medium] Erroneous caching of `struct_btf` restricts a single trace probe to using typecasts from only one BTF object, preventing cross-module lookups.
> - [Medium] Cross-BTF string table access in `parse_btf_arg()` causes invalid string comparisons.
> - [Low] `clear_btf_context()` fails to clear `ctx->struct_btf` after dropping its reference.
> --
> 
> commit de4b6287a7069fde24b0e0f679a545a06fb9ae6c
> Author: Steven Rostedt <rostedt@goodmis.org>
> 
> tracing/eprobes: Allow use of BTF names to dereference pointers
> 
> Add syntax to the parsing of eprobes to be able to typecast a trace event
> field that is a pointer to a structure.
> 
> > diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> > index e1c73065dae51..2bc9f4ca8aa7e 100644
> > --- a/kernel/trace/trace_probe.c
> > +++ b/kernel/trace/trace_probe.c
> > @@ -464,6 +470,27 @@ static const char *fetch_type_from_btf_type(struct btf *btf,
> >  	return NULL;
> >  }
> >  
> > +static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
> > +{
> > +	int id;
> > +
> > +	if (!ctx->struct_btf) {
> > +		struct btf *btf;
> > +
> > +		id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
> > +		if (id < 0)
> > +			return id;
> > +		ctx->struct_btf = btf;
> > +	} else {
> > +		id = btf_find_by_name_kind(ctx->struct_btf, sname, BTF_KIND_STRUCT);  
> 
> [Severity: Medium]
> Does caching struct_btf here restrict a single trace probe event to using
> typecasts from only one BTF object?
> 
> If a probe attempts to typecast structures from two different kernel
> modules, wouldn't the lookup for the second module's struct fail because
> it only searches the cached BTF object of the first module (or vmlinux)?

Hmm, good point. I guess I should clear the struct_btf at the end of
the '(' switch case.

> 
> > +		if (id < 0)
> > +			return id;
> > +	}
> > +
> > +	ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
> > +	return 0;
> > +}  
> 
> [ ... ]
> 
> > @@ -515,6 +542,10 @@ static void clear_btf_context(struct traceprobe_parse_context *ctx)
> >  		ctx->params = NULL;
> >  		ctx->nr_params = 0;
> >  	}
> > +	if (ctx->struct_btf) {
> > +		btf_put(ctx->struct_btf);
> > +		ctx->last_struct = NULL;  
> 
> [Severity: Low]
> Should ctx->struct_btf be explicitly set to NULL after btf_put() drops
> the reference?

I'm thinking of dropping it in the '(' switch case.

> 
> While the traceprobe_parse_context might be destroyed shortly after
> parsing completes, leaving a dangling pointer could potentially lead to a
> use-after-free and double-free if any future refactoring reuses the
> parsing context across multiple trace events.
> 
> > +	}
> >  }  
> 
> [ ... ]
> 
> > @@ -672,7 +739,7 @@ static int parse_btf_arg(char *varname,
> >  		return 0;
> >  	}
> >  
> > -	if (!ctx->btf) {
> > +	if (!btf) {
> >  		ret = query_btf_context(ctx);
> >  		if (ret < 0 || ctx->nr_params == 0) {
> >  			trace_probe_log_err(ctx->offset, NO_BTF_ENTRY);
> > @@ -682,7 +749,7 @@ static int parse_btf_arg(char *varname,
> >  	params = ctx->params;
> >  
> >  	for (i = 0; i < ctx->nr_params; i++) {
> > -		const char *name = btf_name_by_offset(ctx->btf, params[i].name_off);
> > +		const char *name = btf_name_by_offset(btf, params[i].name_off);  
> 
> [Severity: Medium]
> When parsing a typecast argument, btf evaluates to ctx->struct_btf. The
> parameter matching loop then calls btf_name_by_offset() using that btf.
> 
> Since params[i].name_off originates from the function's BTF (ctx->btf,
> populated by previous non-typecast arguments), could this lead to reading
> from the wrong string table if ctx->struct_btf is a different kernel
> module?

The new patch will skip this for TYPECAST arguments. Well, at least for
trace events.

> 
> This might result in invalid string comparisons or returning
> (invalid-name-offset) if it reads out of bounds.
> 
> >  
> >  		if (name && !strcmp(name, varname)) {
> >  			if (tparg_is_function_entry(ctx->flags)) {  
> 
> [ ... ]
> 
> > @@ -1231,6 +1301,30 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
> >  				code->op = FETCH_OP_IMM;
> >  		}
> >  		break;
> > +	case '(':
> > +		tmp = strchr(arg, ')');
> > +		if (!tmp) {
> > +			trace_probe_log_err(ctx->offset + strlen(arg),
> > +					    DEREF_OPEN_BRACE);
> > +			return -EINVAL;
> > +		}
> > +		*tmp = '\0';
> > +		ret = query_btf_struct(arg + 1, ctx);
> > +		*tmp = ')';
> > +
> > +		if (ret < 0) {
> > +			trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
> > +			return -EINVAL;
> > +		}
> > +
> > +		ctx->flags |= TPARG_FL_TYPECAST;
> > +		tmp++;
> > +
> > +		ctx->offset += tmp - arg;
> > +		ret = parse_btf_arg(tmp, pcode, end, ctx);  
> 
> [Severity: High]
> When parse_probe_arg() evaluates the (STRUCT) syntax for kprobes or
> fprobes, it sets TPARG_FL_TYPECAST and calls parse_btf_arg(). Inside
> parse_btf_arg(), ctx_btf() evaluates to ctx->struct_btf, so it bypasses
> query_btf_context() and jumps to found_type.
> 
> However, it doesn't seem to initialize code->op with a valid base fetch
> instruction like FETCH_OP_ARG or FETCH_OP_REG.
> 
> Could this leave the base instruction as FETCH_OP_NOP (0)? If so, it
> seems the base address would be 0, causing subsequent field dereferences
> to safely fail on 0 + offset due to copy_from_kernel_nofault(). This might
> allow unsupported syntax like (sk_buff)%di->dev to parse but silently
> fetch nothing.

Yeah, I may have to force this not to work with function types.

> 
> > +		ctx->flags &= ~TPARG_FL_TYPECAST;  
> 
> [Severity: High]
> The TPARG_FL_TYPECAST flag is cleared here, but ctx->last_type still
> points to a type originating from ctx->struct_btf.
> 
> Later in traceprobe_parse_probe_arg_body(),
> check_prepare_btf_string_fetch() is called. Since the flag is cleared,
> ctx_btf() will incorrectly return ctx->btf (vmlinux for kprobes, or
> NULL for eprobes).
> 
> Does this cause check_prepare_btf_string_fetch() to return 0 for eprobes
> and skip adding the necessary dummy FETCH_OP_DEREF for string fetches?
> Without this dummy dereference, string fetches on character pointers
> in eprobes might fail by reading the raw pointer value itself as a string.
> 
> Also, for kprobes, could this cause type confusion if
> btf_type_skip_modifiers() looks up a module's type ID inside the vmlinux
> BTF?

As mentioned above, I need to put and clear the last_struct here.

Thanks,

-- Steve


> 
> > +		ctx->last_struct = NULL;
> > +		break;  
> 


^ permalink raw reply

* Re: [PATCH mm-unstable v17 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Nico Pache @ 2026-05-19 18:21 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, Usama Arif
In-Reply-To: <4566170a-7e3d-49e4-baab-ba2790c198db@kernel.org>

On Tue, May 12, 2026 at 1:30 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 5/11/26 20:58, Nico Pache wrote:
> > The following cleanup reworks all the max_ptes_* handling into helper
> > functions. This increases the code readability and will later be used to
> > implement the mTHP handling of these variables.
> >
> > With these changes we abstract all the madvise_collapse() special casing
> > (dont respect the sysctls) away from the functions that utilize them. And
> > will be used later in this series to cleanly restrict the mTHP collapse
> > behavior.
> >
> > No functional change is intended; however, we are now only reading the
> > sysfs variables once per scan, whereas before these variables were being
> > read on each loop iteration.
> >
> > Suggested-by: David Hildenbrand <david@kernel.org>
> > Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>
> Some nits when re-reading:
>
> > Acked-by: Usama Arif <usama.arif@linux.dev>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 118 +++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 82 insertions(+), 36 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index f0e29d5c7b1f..f68853b3caa7 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -348,6 +348,62 @@ static bool pte_none_or_zero(pte_t pte)
> >       return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
> >  }
> >
> > +/**
> > + * collapse_max_ptes_none - Calculate maximum allowed none-page or zero-page
>
> I know, it's painful, but ...
>
> There is no "none-page".

Yeah I think you mentioned that last review... sorry!

>
> Calculate maximum allowed empty PTEs or PTEs mapping the shared zeropage ... ?
>
> > + * PTEs for the given collapse operation.
>
> We usually indent here (second line of subject), I think. Same applies to the
> other doc below.

Hmm tbh I couldn't find a example of what you meant here. There are
some that put a space between the first sentence and the @ list.

>
> > + * @cc: The collapse control struct
> > + * @vma: The vma to check for userfaultfd
> > + *
> > + * Return: Maximum number of none-page or zero-page PTEs allowed for the
> > + * collapse operation.
>
> Same here.
>
> > + */
> > +static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> > +             struct vm_area_struct *vma)
> > +{
> > +     // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
>
> Lance commented on the comment style.
>
> Is this comment really required? It's pretty self-documenting already.

Dropped it, thanks.

>
> > +     if (vma && userfaultfd_armed(vma))
> > +             return 0;
> > +     // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> > +     if (!cc->is_khugepaged)
> > +             return HPAGE_PMD_NR;
> > +     // For all other cases repect the user defined maximum.
> > +     return khugepaged_max_ptes_none;
> > +}
> > +
> --
> Cheers,
>
> David
>


^ permalink raw reply

* Re: [PATCH mm-unstable v17 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Nico Pache @ 2026-05-19 18:21 UTC (permalink / raw)
  To: Lance Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <20260512044444.71798-1-lance.yang@linux.dev>

On Mon, May 11, 2026 at 10:45 PM Lance Yang <lance.yang@linux.dev> wrote:
>
>
> On Mon, May 11, 2026 at 12:58:03PM -0600, Nico Pache wrote:
> >The following cleanup reworks all the max_ptes_* handling into helper
> >functions. This increases the code readability and will later be used to
> >implement the mTHP handling of these variables.
> >
> >With these changes we abstract all the madvise_collapse() special casing
> >(dont respect the sysctls) away from the functions that utilize them. And
>
> Nit: s/dont/do not/
>
> >will be used later in this series to cleanly restrict the mTHP collapse
> >behavior.
> >
> >No functional change is intended; however, we are now only reading the
> >sysfs variables once per scan, whereas before these variables were being
> >read on each loop iteration.
> >
> >Suggested-by: David Hildenbrand <david@kernel.org>
> >Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> >Acked-by: Usama Arif <usama.arif@linux.dev>
> >Signed-off-by: Nico Pache <npache@redhat.com>
> >---
> > mm/khugepaged.c | 118 +++++++++++++++++++++++++++++++++---------------
> > 1 file changed, 82 insertions(+), 36 deletions(-)
> >
> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >index f0e29d5c7b1f..f68853b3caa7 100644
> >--- a/mm/khugepaged.c
> >+++ b/mm/khugepaged.c
> >@@ -348,6 +348,62 @@ static bool pte_none_or_zero(pte_t pte)
> >       return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
> > }
> >
> >+/**
> >+ * collapse_max_ptes_none - Calculate maximum allowed none-page or zero-page
> >+ * PTEs for the given collapse operation.
> >+ * @cc: The collapse control struct
> >+ * @vma: The vma to check for userfaultfd
> >+ *
> >+ * Return: Maximum number of none-page or zero-page PTEs allowed for the
> >+ * collapse operation.
> >+ */
> >+static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> >+              struct vm_area_struct *vma)
> >+{
> >+      // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
> >+      if (vma && userfaultfd_armed(vma))
> >+              return 0;
> >+      // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> >+      if (!cc->is_khugepaged)
> >+              return HPAGE_PMD_NR;
> >+      // For all other cases repect the user defined maximum.
> >+      return khugepaged_max_ptes_none;
>
> Nit: kernel code usually uses C-style comments. This could be:
>
> /* For all other cases, respect the user-defined maximum. */
>
> Also, s/repect/respect/.
>
> >+}
> >+
> >+/**
> >+ * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
> >+ * anonymous pages for the given collapse operation.
> >+ * @cc: The collapse control struct
> >+ *
> >+ * Return: Maximum number of PTEs that map shared anonymous pages for the
> >+ * collapse operation
> >+ */
> >+static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> >+{
> >+      // for MADV_COLLAPSE, do not restrict the number of PTEs that map shared
> >+      // anonymous pages.
>
> Ditto.
>
> >+      if (!cc->is_khugepaged)
> >+              return HPAGE_PMD_NR;
> >+      return khugepaged_max_ptes_shared;
> >+}
> >+
> >+/**
> >+ * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
> >+ * maximum allowed non-present pagecache entries for the given collapse operation.
> >+ * @cc: The collapse control struct
> >+ *
> >+ * Return: Maximum number of non-present PTEs or the maximum allowed non-present
> >+ * pagecache entries for the collapse operation.
> >+ */
> >+static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
> >+{
> >+      // for MADV_COLLAPSE, do not restrict the number PTEs entries or
> >+      // pagecache entries that are non-present.
>
> Same here.
>
> >+      if (!cc->is_khugepaged)
> >+              return HPAGE_PMD_NR;
> >+      return khugepaged_max_ptes_swap;
> >+}
> >+
> > int hugepage_madvise(struct vm_area_struct *vma,
> >                    vm_flags_t *vm_flags, int advice)
> > {
> >@@ -546,21 +602,19 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >       pte_t *_pte;
> >       int none_or_zero = 0, shared = 0, referenced = 0;
> >       enum scan_result result = SCAN_FAIL;
> >+      unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> >+      unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>
> Nit: could these be const, as David suggested earlier?
>
> Nothing else jumped out at me. LGTM!
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>

Ack on all the above thank you !

>


^ permalink raw reply

* Re: [PATCH mm-unstable v17 02/14] mm/khugepaged: generalize alloc_charge_folio()
From: Nico Pache @ 2026-05-19 19:03 UTC (permalink / raw)
  To: Lance Yang, Usama Arif
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <51db205d-77cf-416f-bfe5-fd9d0b12c433@linux.dev>

On Mon, May 18, 2026 at 8:50 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
>
> On 2026/5/18 19:55, Usama Arif wrote:
> [...]
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 979885694351..f0e29d5c7b1f 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -1068,21 +1068,26 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> >>   }
> >>
> >>   static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >> -            struct collapse_control *cc)
> >> +            struct collapse_control *cc, unsigned int order)
> >>   {
> >>      gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> >>                   GFP_TRANSHUGE);
> >>      int node = collapse_find_target_node(cc);
> >>      struct folio *folio;
> >>
> >> -    folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> >> +    folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
> >>      if (!folio) {
> >>              *foliop = NULL;
> >> -            count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> >> +            if (is_pmd_order(order))
> >> +                    count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> >> +            count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
> >>              return SCAN_ALLOC_HUGE_PAGE_FAIL;
> >>      }
> >>
> >> -    count_vm_event(THP_COLLAPSE_ALLOC);
> >> +    if (is_pmd_order(order))
> >> +            count_vm_event(THP_COLLAPSE_ALLOC);
> >> +    count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
> >> +
> >
> > The vmstat THP_COLLAPSE_ALLOC counter is pmd order only.
> > But after this we have
> >
> >       count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1);
> >
> > which is not being guarded with is_pmd_order().
>
> Good catch!
>
> >
> > I think we want this to be pmd order only as well so that
> > the meaning of the vmstat and cgroup counter remains the same?
>
> Agreed. THP_COLLAPSE_ALLOC should remain PMD order only for
> vmstat and memcg events.
>
> So this should be guarded with is_pmd_order() as well :)

Thanks Usama, I added that.

>
> Cheers, Lance
>


^ permalink raw reply

* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Nico Pache @ 2026-05-19 19:05 UTC (permalink / raw)
  To: Lorenzo Stoakes, David Hildenbrand (Arm), Wei Yang, Lance Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <agtpK1x27B-E7mMo@lucifer>

On Mon, May 18, 2026 at 1:33 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 03:16:11PM +0200, David Hildenbrand (Arm) wrote:
> > On 5/14/26 05:10, Wei Yang wrote:
> > > On Tue, May 12, 2026 at 03:42:02PM +0800, Lance Yang wrote:
> > >>
> > >> On Mon, May 11, 2026 at 12:58:04PM -0600, Nico Pache wrote:
> > >>> generalize the order of the __collapse_huge_page_* and collapse_max_*
> > >>> functions to support future mTHP collapse.
> > >>>
> > >>> The current mechanism for determining collapse with the
> > >>> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> > >>> raises a key design issue: if we support user defined max_pte_none values
> > >>> (even those scaled by order), a collapse of a lower order can introduces
> > >>> an feedback loop, or "creep", when max_ptes_none is set to a value greater
> > >>> than HPAGE_PMD_NR / 2. [1]
> > >>>
> > >>> With this configuration, a successful collapse to order N will populate
> > >>> enough pages to satisfy the collapse condition on order N+1 on the next
> > >>> scan. This leads to unnecessary work and memory churn.
> > >>>
> > >>> To fix this issue introduce a helper function that will limit mTHP
> > >>> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> > >>> This effectively supports two modes: [2]
> > >>>
> > >>> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
> > >>>  that maps the shared zeropage. Consequently, no memory bloat.
> > >>> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
> > >>>  available mTHP order.
> > >>>
> > >>> This removes the possiblilty of "creep", while not modifying any uAPI
> > >>> expectations. A warning will be emitted if any non-supported
> > >>> max_ptes_none value is configured with mTHP enabled.
> > >>>
> > >>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> > >>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> > >>> shared or swapped entry.
> > >>>
> > >>> No functional changes in this patch; however it defines future behavior
> > >>> for mTHP collapse.
> > >>>
> > >>> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
> > >>> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
> > >>>
> > >>> Co-developed-by: Dev Jain <dev.jain@arm.com>
> > >>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> > >>> Signed-off-by: Nico Pache <npache@redhat.com>
> > >>> ---
> > >>> include/trace/events/huge_memory.h |   3 +-
> > >>> mm/khugepaged.c                    | 117 ++++++++++++++++++++---------
> > >>> 2 files changed, 85 insertions(+), 35 deletions(-)
> > >>>
> > >>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > >>> index bcdc57eea270..443e0bd13fdb 100644
> > >>> --- a/include/trace/events/huge_memory.h
> > >>> +++ b/include/trace/events/huge_memory.h
> > >>> @@ -39,7 +39,8 @@
> > >>>   EM( SCAN_STORE_FAILED,          "store_failed")                 \
> > >>>   EM( SCAN_COPY_MC,               "copy_poisoned_page")           \
> > >>>   EM( SCAN_PAGE_FILLED,           "page_filled")                  \
> > >>> - EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
> > >>> + EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")     \
> > >>> + EMe(SCAN_INVALID_PTES_NONE,     "invalid_ptes_none")
> > >>>
> > >>> #undef EM
> > >>> #undef EMe
> > >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > >>> index f68853b3caa7..27465161fa6d 100644
> > >>> --- a/mm/khugepaged.c
> > >>> +++ b/mm/khugepaged.c
> > >>> @@ -61,6 +61,7 @@ enum scan_result {
> > >>>   SCAN_COPY_MC,
> > >>>   SCAN_PAGE_FILLED,
> > >>>   SCAN_PAGE_DIRTY_OR_WRITEBACK,
> > >>> + SCAN_INVALID_PTES_NONE,
> > >>> };
> > >>>
> > >>> #define CREATE_TRACE_POINTS
> > >>> @@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
> > >>>  * PTEs for the given collapse operation.
> > >>>  * @cc: The collapse control struct
> > >>>  * @vma: The vma to check for userfaultfd
> > >>> + * @order: The folio order being collapsed to
> > >>>  *
> > >>>  * Return: Maximum number of none-page or zero-page PTEs allowed for the
> > >>>  * collapse operation.
> > >>>  */
> > >>> -static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> > >>> -         struct vm_area_struct *vma)
> > >>> +static int collapse_max_ptes_none(struct collapse_control *cc,
> > >>> +         struct vm_area_struct *vma, unsigned int order)
> > >>> {
> > >>> + unsigned int max_ptes_none = khugepaged_max_ptes_none;
> > >>>   // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
> > >>
> > >> One thing I still want to call out: kernel code usually uses C-style
> > >> comments :)
> > >>
> > >>>   if (vma && userfaultfd_armed(vma))
> > >>>           return 0;
> > >>>   // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> > >>>   if (!cc->is_khugepaged)
> > >>>           return HPAGE_PMD_NR;
> > >>> - // For all other cases repect the user defined maximum.
> > >>> - return khugepaged_max_ptes_none;
> > >>> + // for PMD collapse, respect the user defined maximum.
> > >>> + if (is_pmd_order(order))
> > >>> +         return max_ptes_none;
> > >>> + /* Zero/non-present collapse disabled. */
> > >>> + if (!max_ptes_none)
> > >>> +         return 0;
> > >>> + // for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
> > >>> + // scale the maximum number of PTEs to the order of the collapse.
> > >>> + if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> > >>> +         return (1 << order) - 1;
> > >>> +
> > >>> + // We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
> > >>> + // Emit a warning and return -EINVAL.
> > >>> + pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
> > >>> +               KHUGEPAGED_MAX_PTES_LIMIT);
> > >>
> > >> Maybe fallback to 0 instead, as David suggested earlier?
> > >>
> > >
> > > It looks reasonable to fallback to 0.
> > >
> > > But as the updated Document says in patch 14:
> > >
> > >   For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other
> > >   value will emit a warning and no mTHP collapse will be attempted.
> > >
> > > This is why it does like this now.
> > >
> > >     mthp_collapse()
> > >         max_ptes_none = collapse_max_ptes_none();
> > >         if (max_ptes_none < 0)
> > >             return collapsed;
> > >
> > >> max_ptes_none is mostly legacy PMD THP behavior. mTHP is new, and any
> > >> intermediate value in (0, KHUGEPAGED_MAX_PTES_LIMIT) would implicitly
> > >> disable it :(
> > >>
> > >
> > > So it depends on what we want to do here :-)
> > >
> > > For me, I would vote for fallback to 0.
> >
> > At this point I'll prefer to not return errors from collapse_max_ptes_none().
> > It's just rather awkward to return an error deep down in collapse code for a
> > configuration problem.
> >
> > For mthp collapse, we only support max_ptes_none==0 and
> > max_ptes_none=="HPAGE_PMD_NR - 1" (default).
> >
> > If another value is specified while collapsing mTHP, print a warning and treat
> > it as 0 (save value, no creep, no memory waste).
> >
> > In a sense, this is similar to how we handle max_ptes_shared + max_ptes_swap:
> > for mTHP: we always treat them as being 0 for mTHP collapse (and don't issue a
> > warning, because we would issue a warning with the default settings).
> >
> > @Lorenzo, fine with you?
>
> Yes 100%, this sounds sensible both in terms of the error and the default. Let's
> keep our lives simple(-ish) please :)

Ok thank you im glad we finally came to consensus on this! phew!

>
> >
> > --
> > Cheers,
> >
> > David
>
> Cheers, Lorenzo
>


^ permalink raw reply

* Re: [PATCH mm-unstable v17 00/14] khugepaged: mTHP support
From: Nico Pache @ 2026-05-19 19:20 UTC (permalink / raw)
  To: Wei Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260518125007.a4z3pw4r73uuwja4@master>

On Mon, May 18, 2026 at 6:50 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Mon, May 11, 2026 at 12:58:00PM -0600, Nico Pache wrote:
> >The following series provides khugepaged with the capability to collapse
> >anonymous memory regions to mTHPs.
> >
> >To achieve this we generalize the khugepaged functions to no longer depend
> >on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> >pages that are occupied (!none/zero). After the PMD scan is done, we use
> >the bitmap to find the optimal mTHP sizes for the PMD range. The
> >restriction on max_ptes_none is removed during the scan, to make sure we
> >account for the whole PMD range in the bitmap. When no mTHP size is
> >enabled, the legacy behavior of khugepaged is maintained.
> >
> >We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1
> >(ie 511). If any other value is specified, the kernel will emit a warning
> >and no mTHP collapse will be attempted. If a mTHP collapse is attempted,
> >but contains swapped out, or shared pages, we don't perform the collapse.
> >It is now also possible to collapse to mTHPs without requiring the PMD THP
> >size to be enabled. These limitations are to prevent collapse "creep"
> >behavior. This prevents constantly promoting mTHPs to the next available
> >size, which would occur because a collapse introduces more non-zero pages
> >that would satisfy the promotion condition on subsequent scans.
> >
> >Patch 1-2:   Generalize hugepage_vma_revalidate and alloc_charge_folio
> >            for arbitrary orders.
> >Patch 3:     Rework max_ptes_* handling into helper functions
> >Patch 4:     Generalize __collapse_huge_page_* for mTHP support
> >Patch 5:     Require collapse_huge_page to enter/exit with the lock dropped
> >Patch 6:     Generalize collapse_huge_page for mTHP collapse
> >Patch 7:     Skip collapsing mTHP to smaller orders
> >Patch 8-9:   Add per-order mTHP statistics and tracepoints
> >Patch 10:    Introduce collapse_allowable_orders helper function
> >Patch 11-13: Introduce bitmap and mTHP collapse support, fully enabled
> >Patch 14:    Documentation
> >
> >Testing:
> >- Built for x86_64, aarch64, ppc64le, and s390x
> >- ran all arches on test suites provided by the kernel-tests project
> >- internal testing suites: functional testing and performance testing
> >- selftests mm
> >- I created a test script that I used to push khugepaged to its limits
> >   while monitoring a number of stats and tracepoints. The code is
> >   available here[1] (Run in legacy mode for these changes and set mthp
> >   sizes to inherit)
> >   The summary from my testings was that there was no significant
> >   regression noticed through this test. In some cases my changes had
> >   better collapse latencies, and was able to scan more pages in the same
> >   amount of time/work, but for the most part the results were consistent.
> >- redis testing. I did some testing with these changes along with my defer
> >  changes (see followup [2] post for more details). We've decided to get
> >  the mTHP changes merged first before attempting the defer series.
> >- some basic testing on 64k page size.
> >- lots of general use.
> >
>
> Two links are missing. I got them from previous version.
>
> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> [2] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/

Oh whoops, ill make sure they are there in the followup

>
> And the test in [1] is a performance test. I am thinking whether we want a
> functional test in selftests.

It also works as a functional test in some regards. The reason i never
pursued self-tests is that I naively thought this was getting merged
6(?) months ago and at the time the selftests infrastructure didn't
support it well. Baolin included patches to clean that up in his shmem
mTHP support patches and added tests for both features. Let's repost
and re-merge this first; then, I will follow up in one or two weeks
regarding self-tests. I'm currently on PTO and only have time to
complete, test, and return the v18 changes to Andrew before they
create a huge merge headache and we miss yet another window.

>
> I did a quick try with following change and some hack.

Thanks Ill use that as a base!

>
> @@ -744,6 +765,51 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
>         ksft_test_result_report(exit_status, "%s\n", __func__);
>  }
>
> +static void collapse_mth_ptes(struct collapse_context *c, struct mem_ops *ops)
> +{
> +       struct thp_settings settings = *thp_current_settings();
> +       void *p;
> +       int i;
> +
> +       /* Disable mthp on fault */
> +       for (i = 0; i < NR_ORDERS; i++) {
> +               settings.hugepages[i].enabled = THP_NEVER;
> +       }
> +       thp_push_settings(&settings);
> +
> +       p = ops->setup_area(1);
> +
> +       ops->fault(p, 0, hpage_pmd_size);
> +
> +       /* Expect all order-0 folio after fault */
> +       memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
> +       expected_orders[0] = hpage_pmd_nr;
> +       if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
> +                                          kpageflags_fd, expected_orders,
> +                                          (pmd_order + 1)))
> +               ksft_exit_fail_msg("Unexpected huge page at fault\n");
> +
> +       /* Enable mthp before collapse */
> +       thp_pop_settings();
> +       settings.hugepages[2].enabled = THP_ALWAYS;
> +       thp_push_settings(&settings);
> +
> +       c->collapse("Collapse fully populated PTE table with order 2", p, 1,
> +                   ops, true);
> +
> +       /* Expect all order-2 folio after collapse */
> +       memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
> +       expected_orders[2] = 1 << (pmd_order - 2);
> +       if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
> +                                          kpageflags_fd, expected_orders,
> +                                          (pmd_order + 1)))
> +               ksft_exit_fail_msg("Unexpected page order\n");
> +
> +       ops->cleanup_area(p, hpage_pmd_size);
> +       thp_pop_settings();
> +       ksft_test_result_report(exit_status, "%s\n", __func__);
> +}
> +
>  static void collapse_swapin_single_pte(struct collapse_context *c, struct mem_ops *ops)
>  {
>         void *p;
>
> This leverage check_after_split_folio_orders() in split_huge_page_test.c to
> check folio order in PMD range.
>
> --
> Wei Yang
> Help you, Help me
>


^ permalink raw reply

* Re: [PATCHv2 04/11] libbpf: Change has_nop_combo to work on top of nop10
From: Jiri Olsa @ 2026-05-19 20:36 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: oleg, peterz, mingo, mhiramat, andrii, jakub, bpf,
	linux-trace-kernel, ast, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <a5a57eb9690a537c2d8d7ec619fb1d200c9fc804436ae085b9344b691c2470df@mail.kernel.org>

On Mon, May 18, 2026 at 11:37:58AM +0000, bot+bpf-ci@kernel.org wrote:
> > libbpf: Change has_nop_combo to work on top of nop10
> >
> > We now expect nop combo with 10 bytes nop instead of 5 bytes nop,
> > fixing has_nop_combo to reflect that.
> >
> > Fixes: 41a5c7df4466 ("libbpf: Add support to detect nop,nop5 instructions combo for usdt probe")
> 
> Should the Fixes tag reference commit 5c3c59c7952a ("uprobes/x86: Move
> optimized uprobe from nop5 to nop10") instead? Commit 41a5c7df4466 originally
> introduced has_nop_combo() with the correct nop5 check. Commit 5c3c59c7952a
> later changed the kernel to use nop10 instead of nop5 for uprobe optimization
> but failed to update the libbpf has_nop_combo() function accordingly, creating
> a mismatch between kernel and userspace.

keeping things separated now for easier review

> 
> > Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> >
> > diff --git a/tools/lib/bpf/usdt.c b/tools/lib/bpf/usdt.c
> > index e3710933fd52..7e62e4d5bedd 100644
> > --- a/tools/lib/bpf/usdt.c
> > +++ b/tools/lib/bpf/usdt.c
> 
> [ ... ]
> 
> > @@ -596,14 +596,14 @@ static int parse_usdt_spec(struct usdt_spec *spec, const struct usdt_note *note,
> >  #if defined(__x86_64__)
> >  static bool has_nop_combo(int fd, long off)
> >  {
> > -	unsigned char nop_combo[6] = {
> > -		0x90, 0x0f, 0x1f, 0x44, 0x00, 0x00 /* nop,nop5 */
> > +	unsigned char nop_combo[11] = {
> > +		0x90, 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00,
> >  	};
> > -	unsigned char buf[6];
> > +	unsigned char buf[11];
> >
> > -	if (pread(fd, buf, 6, off) != 6)
> > +	if (pread(fd, buf, 11, off) != 11)
> >  		return false;
> > -	return memcmp(buf, nop_combo, 6) == 0;
> > +	return memcmp(buf, nop_combo, 11) == 0;
> >  }
> 
> This isn't a bug, but Jakub Sitnicki suggested in v1 review to use
> ARRAY_SIZE(buf) instead of repeating the scalar value 11 in multiple places.
> The suggestion was included alongside the Reviewed-by tag but was not
> addressed.

I think it's easy enough to keep the number

jirka

^ permalink raw reply

* Re: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: kernel test robot @ 2026-05-19 21:09 UTC (permalink / raw)
  To: Steven Rostedt, LKML, Linux trace kernel, bpf
  Cc: oe-kbuild-all, Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
	Tom Zanussi, Andrew Morton, Linux Memory Management List,
	Thomas Gleixner, Ian Rogers, Jiri Olsa
In-Reply-To: <20260519130144.40e71a00@fedora>

Hi Steven,

kernel test robot noticed the following build errors:

[auto build test ERROR on trace/for-next]
[also build test ERROR on linus/master v7.1-rc4 next-20260519]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-eprobes-Allow-use-of-BTF-names-to-dereference-pointers/20260520-011353
base:   https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link:    https://lore.kernel.org/r/20260519130144.40e71a00%40fedora
patch subject: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
config: s390-randconfig-r071-20260520 (https://download.01.org/0day-ci/archive/20260520/202605200427.0xXjKghz-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 8.5.0
smatch: v0.5.0-9185-gbcc58b9c
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260520/202605200427.0xXjKghz-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605200427.0xXjKghz-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   kernel/trace/trace_probe.c: In function 'parse_probe_vars':
>> kernel/trace/trace_probe.c:1035:7: error: implicit declaration of function 'parse_trace_event'; did you mean 'parse_trace_event_arg'? [-Werror=implicit-function-declaration]
      if (parse_trace_event(arg, code, ctx) < 0)
          ^~~~~~~~~~~~~~~~~
          parse_trace_event_arg
   kernel/trace/trace_probe.c: In function 'sprint_nth_btf_arg':
>> kernel/trace/trace_probe.c:1803:20: error: implicit declaration of function 'ctx_btf' [-Werror=implicit-function-declaration]
     struct btf *btf = ctx_btf(ctx);
                       ^~~~~~~
>> kernel/trace/trace_probe.c:1803:20: warning: initialization of 'struct btf *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
   At top level:
>> kernel/trace/trace_probe.c:318:12: warning: 'parse_trace_event_arg' defined but not used [-Wunused-function]
    static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
               ^~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +1035 kernel/trace/trace_probe.c

  1020	
  1021	/* Parse $vars. @orig_arg points '$', which syncs to @ctx->offset */
  1022	static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
  1023				    struct fetch_insn **pcode,
  1024				    struct fetch_insn *end,
  1025				    struct traceprobe_parse_context *ctx)
  1026	{
  1027		struct fetch_insn *code = *pcode;
  1028		int err = TP_ERR_BAD_VAR;
  1029		char *arg = orig_arg + 1;
  1030		unsigned long param;
  1031		int ret = 0;
  1032		int len;
  1033	
  1034		if (ctx->flags & TPARG_FL_TEVENT) {
> 1035			if (parse_trace_event(arg, code, ctx) < 0)
  1036				goto inval;
  1037			return 0;
  1038		}
  1039	
  1040		if (str_has_prefix(arg, "retval")) {
  1041			if (!(ctx->flags & TPARG_FL_RETURN)) {
  1042				err = TP_ERR_RETVAL_ON_PROBE;
  1043				goto inval;
  1044			}
  1045			if (!(ctx->flags & TPARG_FL_KERNEL) ||
  1046			    !IS_ENABLED(CONFIG_PROBE_EVENTS_BTF_ARGS)) {
  1047				code->op = FETCH_OP_RETVAL;
  1048				return 0;
  1049			}
  1050			return parse_btf_arg(orig_arg, pcode, end, ctx);
  1051		}
  1052	
  1053		len = str_has_prefix(arg, "stack");
  1054		if (len) {
  1055	
  1056			if (arg[len] == '\0') {
  1057				code->op = FETCH_OP_STACKP;
  1058				return 0;
  1059			}
  1060	
  1061			if (isdigit(arg[len])) {
  1062				ret = kstrtoul(arg + len, 10, &param);
  1063				if (ret)
  1064					goto inval;
  1065	
  1066				if ((ctx->flags & TPARG_FL_KERNEL) &&
  1067				    param > PARAM_MAX_STACK) {
  1068					err = TP_ERR_BAD_STACK_NUM;
  1069					goto inval;
  1070				}
  1071				code->op = FETCH_OP_STACK;
  1072				code->param = (unsigned int)param;
  1073				return 0;
  1074			}
  1075			goto inval;
  1076		}
  1077	
  1078		if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
  1079			code->op = FETCH_OP_COMM;
  1080			return 0;
  1081		}
  1082	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: kernel test robot @ 2026-05-19 22:03 UTC (permalink / raw)
  To: Steven Rostedt, LKML, Linux trace kernel, bpf
  Cc: oe-kbuild-all, Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
	Tom Zanussi, Andrew Morton, Linux Memory Management List,
	Thomas Gleixner, Ian Rogers, Jiri Olsa
In-Reply-To: <20260519130144.40e71a00@fedora>

Hi Steven,

kernel test robot noticed the following build errors:

[auto build test ERROR on trace/for-next]
[also build test ERROR on linus/master v7.1-rc4 next-20260519]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-eprobes-Allow-use-of-BTF-names-to-dereference-pointers/20260520-011353
base:   https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link:    https://lore.kernel.org/r/20260519130144.40e71a00%40fedora
patch subject: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
config: arc-defconfig (https://download.01.org/0day-ci/archive/20260520/202605200549.zZkT9xWc-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260520/202605200549.zZkT9xWc-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605200549.zZkT9xWc-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/trace/trace_probe.c: In function 'parse_probe_vars':
   kernel/trace/trace_probe.c:1035:21: error: implicit declaration of function 'parse_trace_event'; did you mean 'parse_trace_event_arg'? [-Wimplicit-function-declaration]
    1035 |                 if (parse_trace_event(arg, code, ctx) < 0)
         |                     ^~~~~~~~~~~~~~~~~
         |                     parse_trace_event_arg
   kernel/trace/trace_probe.c: In function 'sprint_nth_btf_arg':
   kernel/trace/trace_probe.c:1803:27: error: implicit declaration of function 'ctx_btf' [-Wimplicit-function-declaration]
    1803 |         struct btf *btf = ctx_btf(ctx);
         |                           ^~~~~~~
>> kernel/trace/trace_probe.c:1803:27: error: initialization of 'struct btf *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
   kernel/trace/trace_probe.c: At top level:
   kernel/trace/trace_probe.c:318:12: warning: 'parse_trace_event_arg' defined but not used [-Wunused-function]
     318 | static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
         |            ^~~~~~~~~~~~~~~~~~~~~


vim +1803 kernel/trace/trace_probe.c

  1797	
  1798	static int sprint_nth_btf_arg(int idx, const char *type,
  1799				      char *buf, int bufsize,
  1800				      struct traceprobe_parse_context *ctx)
  1801	{
  1802		const char *name;
> 1803		struct btf *btf = ctx_btf(ctx);
  1804		int ret;
  1805	
  1806		if (idx >= ctx->nr_params) {
  1807			trace_probe_log_err(0, NO_BTFARG);
  1808			return -ENOENT;
  1809		}
  1810		name = btf_name_by_offset(btf, ctx->params[idx].name_off);
  1811		if (!name) {
  1812			trace_probe_log_err(0, NO_BTF_ENTRY);
  1813			return -ENOENT;
  1814		}
  1815		ret = snprintf(buf, bufsize, "%s%s", name, type);
  1816		if (ret >= bufsize) {
  1817			trace_probe_log_err(0, ARGS_2LONG);
  1818			return -E2BIG;
  1819		}
  1820		return ret;
  1821	}
  1822	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* [PATCH v2] tools/bootconfig: Fix buf leaks in apply_xbc
From: lihongtao @ 2026-05-20  3:01 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: linux-kernel, linux-trace-kernel, Hongtao Lee

From: Hongtao Lee <lihongtao@kylinos.cn>

If data calloc failed, free the buf before return.

Fixes: 950313ebf79c ("tools: bootconfig: Add bootconfig command")
Signed-off-by: Hongtao Lee <lihongtao@kylinos.cn>
---
V1 -> V2: Change Email Signed name
 tools/bootconfig/main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/bootconfig/main.c b/tools/bootconfig/main.c
index 643f707b8f1d..ddabde20585f 100644
--- a/tools/bootconfig/main.c
+++ b/tools/bootconfig/main.c
@@ -390,8 +390,10 @@ static int apply_xbc(const char *path, const char *xbc_path)
 
 	/* Backup the bootconfig data */
 	data = calloc(size + BOOTCONFIG_ALIGN + BOOTCONFIG_FOOTER_SIZE, 1);
-	if (!data)
+	if (!data) {
+		free(buf);
 		return -ENOMEM;
+	}
 	memcpy(data, buf, size);
 
 	/* Check the data format */
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-05-20  6:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: sashiko-bot, sashiko-reviews, bpf, LKML, Linux trace kernel,
	Masami Hiramatsu
In-Reply-To: <20260519141726.613e2e54@fedora>

On Tue, 19 May 2026 14:17:26 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 19 May 2026 17:48:47 +0000
> sashiko-bot@kernel.org wrote:
> 
> > Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
> > - [High] Using `(STRUCT)` typecasts for non-eprobes leaves the base fetch instruction uninitialized (`FETCH_OP_NOP`), silently breaking the probe.
> > - [High] Prematurely clearing `TPARG_FL_TYPECAST` causes cross-BTF type confusion and breaks string fetches.
> > - [Medium] Erroneous caching of `struct_btf` restricts a single trace probe to using typecasts from only one BTF object, preventing cross-module lookups.
> > - [Medium] Cross-BTF string table access in `parse_btf_arg()` causes invalid string comparisons.
> > - [Low] `clear_btf_context()` fails to clear `ctx->struct_btf` after dropping its reference.
> > --
> > 
> > commit de4b6287a7069fde24b0e0f679a545a06fb9ae6c
> > Author: Steven Rostedt <rostedt@goodmis.org>
> > 
> > tracing/eprobes: Allow use of BTF names to dereference pointers
> > 
> > Add syntax to the parsing of eprobes to be able to typecast a trace event
> > field that is a pointer to a structure.
> > 
> > > diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> > > index e1c73065dae51..2bc9f4ca8aa7e 100644
> > > --- a/kernel/trace/trace_probe.c
> > > +++ b/kernel/trace/trace_probe.c
> > > @@ -464,6 +470,27 @@ static const char *fetch_type_from_btf_type(struct btf *btf,
> > >  	return NULL;
> > >  }
> > >  
> > > +static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
> > > +{
> > > +	int id;
> > > +
> > > +	if (!ctx->struct_btf) {
> > > +		struct btf *btf;
> > > +
> > > +		id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
> > > +		if (id < 0)
> > > +			return id;
> > > +		ctx->struct_btf = btf;
> > > +	} else {
> > > +		id = btf_find_by_name_kind(ctx->struct_btf, sname, BTF_KIND_STRUCT);  
> > 
> > [Severity: Medium]
> > Does caching struct_btf here restrict a single trace probe event to using
> > typecasts from only one BTF object?
> > 
> > If a probe attempts to typecast structures from two different kernel
> > modules, wouldn't the lookup for the second module's struct fail because
> > it only searches the cached BTF object of the first module (or vmlinux)?
> 
> Hmm, good point. I guess I should clear the struct_btf at the end of
> the '(' switch case.

Yes, it should be reset before parsing next argument.

> 
> > 
> > > +		if (id < 0)
> > > +			return id;
> > > +	}
> > > +
> > > +	ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
> > > +	return 0;
> > > +}  
> > 
> > [ ... ]
> > 
> > > @@ -515,6 +542,10 @@ static void clear_btf_context(struct traceprobe_parse_context *ctx)
> > >  		ctx->params = NULL;
> > >  		ctx->nr_params = 0;
> > >  	}
> > > +	if (ctx->struct_btf) {
> > > +		btf_put(ctx->struct_btf);
> > > +		ctx->last_struct = NULL;  
> > 
> > [Severity: Low]
> > Should ctx->struct_btf be explicitly set to NULL after btf_put() drops
> > the reference?
> 
> I'm thinking of dropping it in the '(' switch case.

Can you consider making the '(' switch case part as a helper
function because it depends on CONFIG_DEBUG_INFO_BTF?
 
Thanks,


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH 6/9] rv: Ensure synchronous cleanup for HA monitors
From: Gabriele Monaco @ 2026-05-20 11:22 UTC (permalink / raw)
  To: Wen Yang; +Cc: linux-kernel, Steven Rostedt, Nam Cao, linux-trace-kernel
In-Reply-To: <5f8ceba5-ec25-4c19-a74b-ed98715e89b6@linux.dev>

On Wed, 2026-05-20 at 00:48 +0800, Wen Yang wrote:
> The goal is right.  One thing worth double-checking is the load order
> in the callback against the "SMP BARRIER PAIRING" section of
> Documentation/memory-barriers.txt, which states:
> 

Yeah I realised that after I sent my answer.. You might have noticed but I
proposed a version using acquire/release semantics in [1].

I'm waiting to send it all in a V2 for the fixes series.

Are you going to send your patch with tracepoint_synchronize_unregister() in the
per-task destruction (can be a patch alone)?

If not I'll do it myself and append that too, I prefer to have everything
together to avoid conflict resolution issues.

Thanks,
Gabriele

[1] -
https://lore.kernel.org/lkml/02c522f2a09183c9e1a6ff5b0110d0d5cc5e35bd.camel@redhat.com/

>    [!] Note that the stores before the write barrier would normally be
>    expected to match the loads after the read barrier or the
>    address-dependency barrier, and vice versa ...
> 
> So, we should to swap the read order in the callback so that it matches
> the standard pattern:
> 
>    void __ha_monitor_timer_callback() {
>          guard(rcu)();
>          curr_state = READ_ONCE(ha_mon->da_mon.curr_state);  /* B: 
> before rmb */
>          smp_rmb();
>          if (unlikely(!da_monitoring(&ha_mon->da_mon)))       /* A: 
> after rmb */
>                  return;
>          /*
>           * Reached here: monitoring = 1 (old_A).
>           * Standard wmb/rmb guarantee: curr_state (read before rmb) is also
>           * old, i.e. not initial_state.
>           */
>          ha_react(curr_state, EVENT_NONE, env_string.buffer);
>          ...
>    }
> 
>    void da_monitor_reset() {
>          da_monitor_reset_hook(da_mon);
>          WRITE_ONCE(da_mon->monitoring, 0);   /* A: before wmb */
>          smp_wmb();
>          WRITE_ONCE(da_mon->curr_state, model_get_initial_state());  /* 
> B: after wmb */
>    }
> 
> 
> 
> --
> Best wishes,
> Wen
> 
> > 
> > [1] -
> > https://lore.kernel.org/lkml/8af5ba4bd93d2acb8a546e8e47ced974a87c1eb8.1778522945.git.wen.yang@linux.dev
> > 
> > > 
> > > 
> > > --
> > > Best wishes,
> > > Wen
> > > 
> > > 
> > > On 5/12/26 22:02, Gabriele Monaco wrote:
> > > > HA monitors may start timers, all cleanup functions currently stop the
> > > > timers asynchronously to avoid sleeping in the wrong context.
> > > > Nothing makes sure running callbacks terminate on cleanup.
> > > > 
> > > > Run the entire HA timer callback in an RCU read-side critical section,
> > > > this way we can simply synchronize_rcu() with any pending timer and are
> > > > sure any cleanup using kfree_rcu() runs after callbacks terminated.
> > > > Additionally make sure any unlikely callback running late won't run any
> > > > code if the monitor is marked as disabled.
> > > > 
> > > > Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> > > > Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
> > > > Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> > > > ---
> > > >    include/rv/da_monitor.h | 23 +++++++++++++++++++----
> > > >    include/rv/ha_monitor.h | 18 ++++++++++++++++--
> > > >    2 files changed, 35 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> > > > index a4a13b62d1a4..402d3b935c08 100644
> > > > --- a/include/rv/da_monitor.h
> > > > +++ b/include/rv/da_monitor.h
> > > > @@ -57,6 +57,15 @@ static struct rv_monitor rv_this;
> > > >    #define da_monitor_reset_hook(da_mon)
> > > >    #endif
> > > >    
> > > > +/*
> > > > + * Hook to allow the implementation of hybrid automata: define it with
> > > > a
> > > > + * function that waits for the termination of all monitors background
> > > > + * activities (e.g. all timers). This hook can sleep.
> > > > + */
> > > > +#ifndef da_monitor_sync_hook
> > > > +#define da_monitor_sync_hook()
> > > > +#endif
> > > > +
> > > >    /*
> > > >     * Type for the target id, default to int but can be overridden.
> > > >     * A long type can work as hash table key (PER_OBJ) but will be
> > > > downgraded
> > > > to
> > > > @@ -179,6 +188,7 @@ static inline int da_monitor_init(void)
> > > >    static inline void da_monitor_destroy(void)
> > > >    {
> > > >    	da_monitor_reset_all();
> > > > +	da_monitor_sync_hook();
> > > >    }
> > > >    
> > > >    #ifndef da_implicit_guard
> > > > @@ -232,6 +242,7 @@ static inline int da_monitor_init(void)
> > > >    static inline void da_monitor_destroy(void)
> > > >    {
> > > >    	da_monitor_reset_all();
> > > > +	da_monitor_sync_hook();
> > > >    }
> > > >    
> > > >    #ifndef da_implicit_guard
> > > > @@ -319,6 +330,7 @@ static inline void da_monitor_destroy(void)
> > > >    	}
> > > >    
> > > >    	da_monitor_reset_all();
> > > > +	da_monitor_sync_hook();
> > > >    
> > > >    	rv_put_task_monitor_slot(task_mon_slot);
> > > >    	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
> > > > @@ -497,10 +509,9 @@ static void da_monitor_reset_all(void)
> > > >    	struct da_monitor_storage *mon_storage;
> > > >    	int bkt;
> > > >    
> > > > -	rcu_read_lock();
> > > > +	guard(rcu)();
> > > >    	hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node)
> > > >    		da_monitor_reset(&mon_storage->rv.da_mon);
> > > > -	rcu_read_unlock();
> > > >    }
> > > >    
> > > >    static inline int da_monitor_init(void)
> > > > @@ -516,13 +527,17 @@ static inline void da_monitor_destroy(void)
> > > >    	int bkt;
> > > >    
> > > >    	tracepoint_synchronize_unregister();
> > > > +	scoped_guard(rcu) {
> > > > +		hash_for_each_rcu(da_monitor_ht, bkt, mon_storage,
> > > > node) {
> > > > +			da_monitor_reset_hook(&mon_storage->rv.da_mon);
> > > > +		}
> > > > +	}
> > > > +	da_monitor_sync_hook();
> > > >    	/*
> > > >    	 * This function is called after all probes are disabled and no
> > > > longer
> > > >    	 * pending, we can safely assume no concurrent user.
> > > >    	 */
> > > > -	synchronize_rcu();
> > > >    	hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node)
> > > > {
> > > > -		da_monitor_reset_hook(&mon_storage->rv.da_mon);
> > > >    		hash_del_rcu(&mon_storage->node);
> > > >    		kfree(mon_storage);
> > > >    	}
> > > > diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
> > > > index d59507e8cb30..47ff1a41febe 100644
> > > > --- a/include/rv/ha_monitor.h
> > > > +++ b/include/rv/ha_monitor.h
> > > > @@ -36,6 +36,7 @@ static bool ha_monitor_handle_constraint(struct
> > > > da_monitor
> > > > *da_mon,
> > > >    #define da_monitor_event_hook ha_monitor_handle_constraint
> > > >    #define da_monitor_init_hook ha_monitor_init_env
> > > >    #define da_monitor_reset_hook ha_monitor_reset_env
> > > > +#define da_monitor_sync_hook() synchronize_rcu()
> > > >    
> > > >    #include <rv/da_monitor.h>
> > > >    #include <linux/seq_buf.h>
> > > > @@ -237,12 +238,25 @@ static bool ha_monitor_handle_constraint(struct
> > > > da_monitor *da_mon,
> > > >    	return false;
> > > >    }
> > > >    
> > > > +/*
> > > > + * __ha_monitor_timer_callback - generic callback representation
> > > > + *
> > > > + * This callback runs in an RCU read-side critical section to allow the
> > > > + * destruction sequence to easily synchronize_rcu() with all pending
> > > > timer
> > > > + * after asynchronously disabling them.
> > > > + */
> > > >    static inline void __ha_monitor_timer_callback(struct ha_monitor
> > > > *ha_mon)
> > > >    {
> > > > -	enum states curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
> > > >    	DECLARE_SEQ_BUF(env_string, ENV_BUFFER_SIZE);
> > > > -	u64 time_ns = ha_get_ns();
> > > > +	enum states curr_state;
> > > > +	u64 time_ns;
> > > > +
> > > > +	if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
> > > > +		return;
> > > >    
> > > > +	guard(rcu)();
> > > > +	curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
> > > > +	time_ns = ha_get_ns();
> > > >    	ha_get_env_string(&env_string, ha_mon, time_ns);
> > > >    	ha_react(curr_state, EVENT_NONE, env_string.buffer);
> > > >    	ha_trace_error_env(ha_mon, model_get_state_name(curr_state),
> > 


^ permalink raw reply

* Re: [PATCH] gpu: host1x: trace: fix string fields in host1x traces
From: Thierry Reding @ 2026-05-20 12:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Artur Kowalski, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, linux-tegra, Thierry Reding, Mikko Perttunen,
	David Airlie, Simona Vetter
In-Reply-To: <20260519141059.77435501@fedora>

[-- Attachment #1: Type: text/plain, Size: 845 bytes --]

On Tue, May 19, 2026 at 02:10:59PM -0400, Steven Rostedt wrote:
> On Tue, 19 May 2026 12:16:43 +0200
> Artur Kowalski <arturkow2000@gmail.com> wrote:
> 
> > Use __assign_str and __get_str as required by tracing subsystem. Fixes
> > string fields being rejected by the verifier and unreadable from
> > userspace.
> 
> Does anyone use these tracepoints? The fact that they have been broken
> for 5 years and nobody noticed makes me think they are useless.
> 
> I rather remove them than fix them, but if someone thinks that these
> are still useful then by all means apply this patch.
> 
> Acked-by: Steven Rostedt <rostedt@goodmis.org>

I know that Mikko used them a lot early on, but this driver is pretty
mature now, so we rarely need this low level of tracing. I'll defer to
Mikko on whether we still need these.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-05-20 12:05 UTC (permalink / raw)
  To: Wei Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260512154431.jxcs632mqqatqtsw@master>

On Tue, May 12, 2026 at 9:44 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
> >Enable khugepaged to collapse to mTHP orders. This patch implements the
> >main scanning logic using a bitmap to track occupied pages and a stack
> >structure that allows us to find optimal collapse sizes.
> >
> >Previous to this patch, PMD collapse had 3 main phases, a light weight
> >scanning phase (mmap_read_lock) that determines a potential PMD
> >collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> >phase (mmap_write_lock).
> >
> >To enabled mTHP collapse we make the following changes:
> >
> >During PMD scan phase, track occupied pages in a bitmap. When mTHP
> >orders are enabled, we remove the restriction of max_ptes_none during the
> >scan phase to avoid missing potential mTHP collapse candidates. Once we
> >have scanned the full PMD range and updated the bitmap to track occupied
> >pages, we use the bitmap to find the optimal mTHP size.
> >
> >Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> >and determine the best eligible order for the collapse. A stack structure
> >is used instead of traditional recursion to manage the search. This also
> >prevents a traditional recursive approach when the kernel stack struct is
> >limited. The algorithm recursively splits the bitmap into smaller chunks to
> >find the highest order mTHPs that satisfy the collapse criteria. We start
> >by attempting the PMD order, then moved on the consecutively lower orders
> >(mTHP collapse). The stack maintains a pair of variables (offset, order),
> >indicating the number of PTEs from the start of the PMD, and the order of
> >the potential collapse candidate.
> >
> >The algorithm for consuming the bitmap works as such:
> >    1) push (0, HPAGE_PMD_ORDER) onto the stack
> >    2) pop the stack
> >    3) check if the number of set bits in that (offset,order) pair
> >       statisfy the max_ptes_none threshold for that order
> >    4) if yes, attempt collapse
> >    5) if no (or collapse fails), push two new stack items representing
> >       the left and right halves of the current bitmap range, at the
> >       next lower order
> >    6) repeat at step (2) until stack is empty.
> >
> >Below is a diagram representing the algorithm and stack items:
> >
> >                            offset   mid_offset
> >                            |        |
> >                            |        |
> >                            v        v
> >          ____________________________________
> >         |          PTE Page Table            |
> >         --------------------------------------
> >                           <-------><------->
> >                             order-1  order-1
> >
> >mTHP collapses reject regions containing swapped out or shared pages.
> >This is because adding new entries can lead to new none pages, and these
> >may lead to constant promotion into a higher order mTHP. A similar
> >issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> >introducing at least 2x the number of pages, and on a future scan will
> >satisfy the promotion condition once again. This issue is prevented via
> >the collapse_max_ptes_none() function which imposes the max_ptes_none
> >restrictions above.
> >
> >We currently only support mTHP collapse for max_ptes_none values of 0
> >and HPAGE_PMD_NR - 1. resulting in the following behavior:
> >
> >    - max_ptes_none=0: Never introduce new empty pages during collapse
> >    - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> >      available mTHP order
> >
> >Any other max_ptes_none value will emit a warning and skip mTHP collapse
> >attempts. There should be no behavior change for PMD collapse.
> >
> >Once we determine what mTHP sizes fits best in that PMD range a collapse
> >is attempted. A minimum collapse order of 2 is used as this is the lowest
> >order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> >
> >Currently madv_collapse is not supported and will only attempt PMD
> >collapse.
> >
> >We can also remove the check for is_khugepaged inside the PMD scan as
> >the collapse_max_ptes_none() function handles this logic now.
> >
> >Signed-off-by: Nico Pache <npache@redhat.com>
>
> [...]
>
> >+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> >+              int referenced, int unmapped, struct collapse_control *cc,
> >+              unsigned long enabled_orders)
> >+{
> >+      unsigned int nr_occupied_ptes, nr_ptes;
> >+      int max_ptes_none, collapsed = 0, stack_size = 0;
> >+      unsigned long collapse_address;
> >+      struct mthp_range range;
> >+      u16 offset;
> >+      u8 order;
> >+
> >+      collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> >+
> >+      while (stack_size) {
> >+              range = collapse_mthp_stack_pop(cc, &stack_size);
> >+              order = range.order;
> >+              offset = range.offset;
> >+              nr_ptes = 1UL << order;
> >+
> >+              if (!test_bit(order, &enabled_orders))
> >+                      goto next_order;
> >+
> >+              max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
>
> I am thinking whether there is a behavioral change for userfaultfd_armed(vma).
>
> collapse_single_pmd()
>     collapse_scan_pmd
>         max_ptes_none = collapse_max_ptes_none(cc, vma)
>         max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT                --- (1)
>         mthp_collapse
>             max_ptes_none = collapse_max_ptes_none(cc, NULL)     --- (2)
>             collapse_huge_page(mm)
>                 hugepage_vma_revalidate(&vma)
>                 __collapse_huge_page_isolate(vma)
>                     max_ptes_none = collapse_max_ptes_none(cc, vma)
>
> Before mthp_collapse() introduced, userfaultfd_armed(vma) is skipped if there
> is any pte_none_or_zero() in collapse_scan_pmd().
>
> But now, max_ptes_none could be set to KHUGEPAGED_MAX_PTES_LIMIT at (1), so
> that we can scan all the pte to get the bitmap. This means
> userfaultfd_armed(vma) could continue even with pte_none_or_zero().
>
> Then in mthp_collapse(), collapse_max_ptes_none() at (2) ignores
> userfaultfd_armed(vma), which means it will continue to collapse a
> userfaultfd_armed(vma) when there is pte_none_or_zero().
>
> The good news is we will stop at __collapse_huge_page_isolate(), where we
> get collapse_max_ptes_none() with vma. But we already did a lot of work.

Good catch!

As you stated we eventually ensure we respect the uffd checks. So
there are no correctness issues, just the potential for wasted cycles.

At (1) we only do this if mTHPs are enabled. If that is the case, the
only waste that can arise is at the PMD order, as that order respects
the max_ptes_none value.

I think one approach is to gate (1) with the uffd check as well. That
way, if mTHPs are enabled and its uffd-armed, max_ptes_none will stay
at 0, and we bail early on the scan early if any none_ptes are hit.

But then we lose the ability to collapse to mTHPs that are uffd-armed,
where the PMD has none/zero-ptes and the mTHP fully has 0
non-none/zero-ptes.

ie) assume a PMD is 16 x's [xxxxxxxx00000000]
where x is a populated pte and 0 is not
If we guard this scan (1), then we will never check if its possible to
collapse to the smaller orders.

Let me know if you see a flaw in my logic, I think it's best to keep it as is?

>
> Not sure if I missed something.
>
> >+
> >+              if (max_ptes_none < 0)
> >+                      return collapsed;
> >+
> >+              nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> >+                                                             nr_ptes);
> >+
> >+              if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> >+                      int ret;
> >+
> >+                      collapse_address = address + offset * PAGE_SIZE;
> >+                      ret = collapse_huge_page(mm, collapse_address, referenced,
> >+                                               unmapped, cc, order);
> >+                      if (ret == SCAN_SUCCEED) {
> >+                              collapsed += nr_ptes;
> >+                              continue;
> >+                      }
> >+              }
> >+
> >+next_order:
> >+              if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
> >+                      const u8 next_order = order - 1;
> >+                      const u16 mid_offset = offset + (nr_ptes / 2);
> >+
> >+                      collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> >+                                               next_order);
> >+                      collapse_mthp_stack_push(cc, &stack_size, offset,
> >+                                               next_order);
> >+              }
> >+      }
> >+      return collapsed;
> >+}
> >+
> > static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               struct vm_area_struct *vma, unsigned long start_addr,
> >               bool *lock_dropped, struct collapse_control *cc)
> > {
> >-      const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> >+      int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> >       const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> >       const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> >+      enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> >       pmd_t *pmd;
> >-      pte_t *pte, *_pte;
> >-      int none_or_zero = 0, shared = 0, referenced = 0;
> >+      pte_t *pte, *_pte, pteval;
> >+      int i;
> >+      int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> >       enum scan_result result = SCAN_FAIL;
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> >       unsigned long addr;
> >+      unsigned long enabled_orders;
> >       spinlock_t *ptl;
> >       int node = NUMA_NO_NODE, unmapped = 0;
> >
> >@@ -1429,8 +1579,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >       }
> >
> >+      bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> >       memset(cc->node_load, 0, sizeof(cc->node_load));
> >       nodes_clear(cc->alloc_nmask);
> >+
> >+      enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
>
> Would it be 0 at this point?

If your question relates to the issue you brought up above, then yes,
max_ptes_none would be 0 if it's uffd-armed. We must recheck the
uffd-armed status before modifying it to 511.

>
> >+
> >+      /*
> >+       * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> >+       * scan all pages to populate the bitmap for mTHP collapse.
> >+       */
> >+      if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> >+              max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> >+
> >       pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> >       if (!pte) {
> >               cc->progress++;
> >@@ -1438,11 +1599,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >       }
> >
> >-      for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >-           _pte++, addr += PAGE_SIZE) {
> >+      for (i = 0; i < HPAGE_PMD_NR; i++) {
> >+              _pte = pte + i;
> >+              addr = start_addr + i * PAGE_SIZE;
> >+              pteval = ptep_get(_pte);
> >+
> >               cc->progress++;
> >
> >-              pte_t pteval = ptep_get(_pte);
> >               if (pte_none_or_zero(pteval)) {
> >                       if (++none_or_zero > max_ptes_none) {
> >                               result = SCAN_EXCEED_NONE_PTE;
> >@@ -1522,6 +1685,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >                       }
> >               }
> >
> >+              /* Set bit for occupied pages */
> >+              __set_bit(i, cc->mthp_bitmap);
> >               /*
> >                * Record which node the original page is from and save this
> >                * information to cc->node_load[].
> >@@ -1580,10 +1745,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >       if (result == SCAN_SUCCEED) {
> >               /* collapse_huge_page expects the lock to be dropped before calling */
> >               mmap_read_unlock(mm);
> >-              result = collapse_huge_page(mm, start_addr, referenced,
> >-                                          unmapped, cc, HPAGE_PMD_ORDER);
> >+              nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
> >+                                            cc, enabled_orders);
> >               /* collapse_huge_page will return with the mmap_lock released */
>
> collapse_huge_page will return with mmap_lock released, but mthp_collapse()
> may not?

We are now releasing the lock before calling mthp_collapse, which
subsequently calls collapse_huge_page. Even if `collapse_huge_page` is
never called-- say, because enabled_orders is 0 (which should not
happen) and all collapse orders are skipped (never calling
collapse_huge_page)-- we still return here with the lock dropped.

I think this is sound. Let me know if you think differently.

Cheers :)
-- Nico

>
> >               *lock_dropped = true;
> >+              result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> >       }
> > out:
> >       trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> >--
> >2.54.0
>
> --
> Wei Yang
> Help you, Help me
>


^ permalink raw reply

* Re: [PATCH v6 02/43] KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
From: Fuad Tabba @ 2026-05-20 12:08 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-2-91ab5a8b19a4@google.com>

On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Rename the per-VM memory attributes Kconfig to make it explicitly about
> per-VM attributes in anticipation of adding memory attributes support to
> guest_memfd, at which point it will be possible (and desirable) to have
> memory attributes without the per-VM support, even in x86.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad


> ---
>  arch/x86/include/asm/kvm_host.h |  2 +-
>  arch/x86/kvm/Kconfig            |  6 +++---
>  arch/x86/kvm/mmu/mmu.c          |  2 +-
>  arch/x86/kvm/x86.c              |  2 +-
>  include/linux/kvm_host.h        |  8 ++++----
>  include/trace/events/kvm.h      |  4 ++--
>  virt/kvm/Kconfig                |  2 +-
>  virt/kvm/kvm_main.c             | 14 +++++++-------
>  8 files changed, 20 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c470e40a00aa4..60b997764beef 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2369,7 +2369,7 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>                        int tdp_max_root_level, int tdp_huge_page_level);
>
>
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>  #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
>  #endif
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 801bf9e520db3..26f6afd51bbdc 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -84,7 +84,7 @@ config KVM_SW_PROTECTED_VM
>         bool "Enable support for KVM software-protected VMs"
>         depends on EXPERT
>         depends on KVM_X86 && X86_64
> -       select KVM_GENERIC_MEMORY_ATTRIBUTES
> +       select KVM_VM_MEMORY_ATTRIBUTES
>         help
>           Enable support for KVM software-protected VMs.  Currently, software-
>           protected VMs are purely a development and testing vehicle for
> @@ -135,7 +135,7 @@ config KVM_INTEL_TDX
>         bool "Intel Trust Domain Extensions (TDX) support"
>         default y
>         depends on INTEL_TDX_HOST
> -       select KVM_GENERIC_MEMORY_ATTRIBUTES
> +       select KVM_VM_MEMORY_ATTRIBUTES
>         select HAVE_KVM_ARCH_GMEM_POPULATE
>         help
>           Provides support for launching Intel Trust Domain Extensions (TDX)
> @@ -159,7 +159,7 @@ config KVM_AMD_SEV
>         depends on KVM_AMD && X86_64
>         depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
>         select ARCH_HAS_CC_PLATFORM
> -       select KVM_GENERIC_MEMORY_ATTRIBUTES
> +       select KVM_VM_MEMORY_ATTRIBUTES
>         select HAVE_KVM_ARCH_GMEM_PREPARE
>         select HAVE_KVM_ARCH_GMEM_INVALIDATE
>         select HAVE_KVM_ARCH_GMEM_POPULATE
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 892246204435c..a80a876ab4ad6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7899,7 +7899,7 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>                 vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
>  }
>
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>  static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
>                                 int level)
>  {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0a1b63c63d1a9..1560de1e95be0 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13625,7 +13625,7 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>                 }
>         }
>
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>         kvm_mmu_init_memslot_memory_attributes(kvm, slot);
>  #endif
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4c14aee1fb063..7b9faa3545300 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -722,7 +722,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>  }
>  #endif
>
> -#ifndef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>  static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
>  {
>         return false;
> @@ -871,7 +871,7 @@ struct kvm {
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>         struct notifier_block pm_notifier;
>  #endif
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>         /* Protected by slots_lock (for writes) and RCU (for reads) */
>         struct xarray mem_attr_array;
>  #endif
> @@ -2528,7 +2528,7 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
>         return slot->flags & KVM_MEMSLOT_GMEM_ONLY;
>  }
>
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>  static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>  {
>         return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> @@ -2550,7 +2550,7 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>  {
>         return false;
>  }
> -#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
> +#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
>  #ifdef CONFIG_KVM_GUEST_MEMFD
>  int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
> index b282e3a867696..1ba72bd73ea2f 100644
> --- a/include/trace/events/kvm.h
> +++ b/include/trace/events/kvm.h
> @@ -358,7 +358,7 @@ TRACE_EVENT(kvm_dirty_ring_exit,
>         TP_printk("vcpu %d", __entry->vcpu_id)
>  );
>
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>  /*
>   * @start:     Starting address of guest memory range
>   * @end:       End address of guest memory range
> @@ -383,7 +383,7 @@ TRACE_EVENT(kvm_vm_set_mem_attributes,
>         TP_printk("%#016llx -- %#016llx [0x%lx]",
>                   __entry->start, __entry->end, __entry->attr)
>  );
> -#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
> +#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
>  TRACE_EVENT(kvm_unmap_hva_range,
>         TP_PROTO(unsigned long start, unsigned long end),
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 794976b88c6f9..5119cb37145fc 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -100,7 +100,7 @@ config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
>  config KVM_MMU_LOCKLESS_AGING
>         bool
>
> -config KVM_GENERIC_MEMORY_ATTRIBUTES
> +config KVM_VM_MEMORY_ATTRIBUTES
>         bool
>
>  config KVM_GUEST_MEMFD
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 89489996fbc1e..306153abbafa5 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1115,7 +1115,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>         spin_lock_init(&kvm->mn_invalidate_lock);
>         rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>         xa_init(&kvm->vcpu_array);
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>         xa_init(&kvm->mem_attr_array);
>  #endif
>
> @@ -1300,7 +1300,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>         cleanup_srcu_struct(&kvm->irq_srcu);
>         srcu_barrier(&kvm->srcu);
>         cleanup_srcu_struct(&kvm->srcu);
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>         xa_destroy(&kvm->mem_attr_array);
>  #endif
>         kvm_arch_free_vm(kvm);
> @@ -2418,7 +2418,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  }
>  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>  static u64 kvm_supported_mem_attributes(struct kvm *kvm)
>  {
>         if (!kvm || kvm_arch_has_private_mem(kvm))
> @@ -2623,7 +2623,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>
>         return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
>  }
> -#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
> +#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
>  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
>  {
> @@ -4921,7 +4921,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>         case KVM_CAP_SYSTEM_EVENT_DATA:
>         case KVM_CAP_DEVICE_CTRL:
>                 return 1;
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>         case KVM_CAP_MEMORY_ATTRIBUTES:
>                 return kvm_supported_mem_attributes(kvm);
>  #endif
> @@ -5325,7 +5325,7 @@ static long kvm_vm_ioctl(struct file *filp,
>                 break;
>         }
>  #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>         case KVM_SET_MEMORY_ATTRIBUTES: {
>                 struct kvm_memory_attributes attrs;
>
> @@ -5336,7 +5336,7 @@ static long kvm_vm_ioctl(struct file *filp,
>                 r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
>                 break;
>         }
> -#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
> +#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>         case KVM_CREATE_DEVICE: {
>                 struct kvm_create_device cd;
>
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>

^ permalink raw reply

* Re: [PATCH v6 03/43] KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined
From: Fuad Tabba @ 2026-05-20 12:08 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-3-91ab5a8b19a4@google.com>

On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Explicitly guard reporting support for KVM_MEMORY_ATTRIBUTE_PRIVATE based
> on kvm_arch_has_private_mem being #defined in anticipation of decoupling
> kvm_supported_mem_attributes() from CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
> guest_memfd support for memory attributes will be unconditional to avoid
> yet more macros (all architectures that support guest_memfd are expected to
> use per-gmem attributes at some point), at which point enumerating support
> KVM_MEMORY_ATTRIBUTE_PRIVATE based solely on memory attributes being
> supported _somewhere_ would result in KVM over-reporting support on arm64.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

> ---
>  include/linux/kvm_host.h | 2 +-
>  virt/kvm/kvm_main.c      | 2 ++
>  2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7b9faa3545300..7d079f9701346 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -722,7 +722,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>  }
>  #endif
>
> -#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#ifndef kvm_arch_has_private_mem
>  static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
>  {
>         return false;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 306153abbafa5..abb9cfa3eb04d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2421,8 +2421,10 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>  static u64 kvm_supported_mem_attributes(struct kvm *kvm)
>  {
> +#ifdef kvm_arch_has_private_mem
>         if (!kvm || kvm_arch_has_private_mem(kvm))
>                 return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +#endif
>
>         return 0;
>  }
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>

^ permalink raw reply

* Re: [PATCH v6 04/43] KVM: Stub in ability to disable per-VM memory attribute tracking
From: Fuad Tabba @ 2026-05-20 12:08 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-4-91ab5a8b19a4@google.com>

On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Introduce the basic infrastructure to allow per-VM memory attribute
> tracking to be disabled. This will be built-upon in a later patch, where a
> module param can disable per-VM memory attribute tracking.
>
> Split the Kconfig option into a base KVM_MEMORY_ATTRIBUTES and the
> existing KVM_VM_MEMORY_ATTRIBUTES. The base option provides the core
> plumbing, while the latter enables the full per-VM tracking via an xarray
> and the associated ioctls.
>
> kvm_get_memory_attributes() now performs a static call that either looks up
> kvm->mem_attr_array with CONFIG_KVM_VM_MEMORY_ATTRIBUTES is enabled, or
> just returns 0 otherwise. The static call can be patched depending on
> whether per-VM tracking is enabled by the CONFIG.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

> ---
>  arch/x86/include/asm/kvm_host.h |  2 +-
>  include/linux/kvm_host.h        | 23 ++++++++++++---------
>  virt/kvm/Kconfig                |  4 ++++
>  virt/kvm/kvm_main.c             | 44 ++++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 62 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 60b997764beef..c9aa50bcdac2d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2369,7 +2369,7 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>                        int tdp_max_root_level, int tdp_huge_page_level);
>
>
> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
>  #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
>  #endif
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7d079f9701346..c5ba2cb34e45c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2528,19 +2528,15 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
>         return slot->flags & KVM_MEMSLOT_GMEM_ONLY;
>  }
>
> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +typedef unsigned long (kvm_get_memory_attributes_t)(struct kvm *kvm, gfn_t gfn);
> +DECLARE_STATIC_CALL(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> +
>  static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>  {
> -       return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> +       return static_call(__kvm_get_memory_attributes)(kvm, gfn);
>  }
>
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> -                                    unsigned long mask, unsigned long attrs);
> -bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> -                                       struct kvm_gfn_range *range);
> -bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> -                                        struct kvm_gfn_range *range);
> -
>  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>  {
>         return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> @@ -2550,6 +2546,15 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>  {
>         return false;
>  }
> +#endif
> +
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> +                                    unsigned long mask, unsigned long attrs);
> +bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> +                                       struct kvm_gfn_range *range);
> +bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> +                                        struct kvm_gfn_range *range);
>  #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
>  #ifdef CONFIG_KVM_GUEST_MEMFD
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 5119cb37145fc..3fea89c45cfb4 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -100,7 +100,11 @@ config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
>  config KVM_MMU_LOCKLESS_AGING
>         bool
>
> +config KVM_MEMORY_ATTRIBUTES
> +       bool
> +
>  config KVM_VM_MEMORY_ATTRIBUTES
> +       select KVM_MEMORY_ATTRIBUTES
>         bool
>
>  config KVM_GUEST_MEMFD
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index abb9cfa3eb04d..ee26f1d9b5fda 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -101,6 +101,17 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink);
>  static bool __ro_after_init allow_unsafe_mappings;
>  module_param(allow_unsafe_mappings, bool, 0444);
>
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +static bool vm_memory_attributes = true;
> +#else
> +#define vm_memory_attributes false
> +#endif
> +DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_TRAMP(__kvm_get_memory_attributes));
> +#endif
> +
>  /*
>   * Ordering of locks:
>   *
> @@ -2418,7 +2429,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  }
>  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
>  static u64 kvm_supported_mem_attributes(struct kvm *kvm)
>  {
>  #ifdef kvm_arch_has_private_mem
> @@ -2429,6 +2440,12 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
>         return 0;
>  }
>
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +{
> +       return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> +}
> +
>  /*
>   * Returns true if _all_ gfns in the range [@start, @end) have attributes
>   * such that the bits in @mask match @attrs.
> @@ -2625,7 +2642,24 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>
>         return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
>  }
> +#else  /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> +static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +{
> +       BUILD_BUG_ON(1);
> +}
>  #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> +static void kvm_init_memory_attributes(void)
> +{
> +       if (vm_memory_attributes)
> +               static_call_update(__kvm_get_memory_attributes,
> +                                  kvm_get_vm_memory_attributes);
> +       else
> +               static_call_update(__kvm_get_memory_attributes,
> +                                  (void *)__static_call_return0);
> +}
> +#else  /* CONFIG_KVM_MEMORY_ATTRIBUTES */
> +static void kvm_init_memory_attributes(void) { }
> +#endif /* CONFIG_KVM_MEMORY_ATTRIBUTES */
>
>  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
>  {
> @@ -4925,6 +4959,9 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>                 return 1;
>  #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>         case KVM_CAP_MEMORY_ATTRIBUTES:
> +               if (!vm_memory_attributes)
> +                       return 0;
> +
>                 return kvm_supported_mem_attributes(kvm);
>  #endif
>  #ifdef CONFIG_KVM_GUEST_MEMFD
> @@ -5331,6 +5368,10 @@ static long kvm_vm_ioctl(struct file *filp,
>         case KVM_SET_MEMORY_ATTRIBUTES: {
>                 struct kvm_memory_attributes attrs;
>
> +               r = -ENOTTY;
> +               if (!vm_memory_attributes)
> +                       goto out;
> +
>                 r = -EFAULT;
>                 if (copy_from_user(&attrs, argp, sizeof(attrs)))
>                         goto out;
> @@ -6527,6 +6568,7 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
>         kvm_preempt_ops.sched_in = kvm_sched_in;
>         kvm_preempt_ops.sched_out = kvm_sched_out;
>
> +       kvm_init_memory_attributes();
>         kvm_init_debug();
>
>         r = kvm_vfio_ops_init();
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox