Linux Trace Kernel


* Re: [PATCH v3 07/11] HID: Use trace_call__##name() at guarded tracepoint call sites
From: Steven Rostedt @ 2026-05-15 18:43 UTC (permalink / raw)
  To: srinivas pandruvada
  Cc: Vineeth Pillai (Google), Jiri Kosina, Benjamin Tissoires,
	linux-input, linux-trace-kernel, Peter Zijlstra
In-Reply-To: <fbc8c9659f707f46b5d8a6479fc42d5bb1d0efcd.camel@linux.intel.com>

On Fri, 15 May 2026 08:09:25 -0700
srinivas pandruvada <srinivas.pandruvada@linux.intel.com> wrote:

> On Fri, 2026-05-15 at 09:59 -0400, Vineeth Pillai (Google) wrote:
> > From: Vineeth Pillai <vineeth@bitbyteword.org>
> > 
> > Replace trace_foo() with the new trace_call__foo() at sites already
> > guarded by trace_foo_enabled(), avoiding a redundant
> > static_branch_unlikely() re-evaluation inside the tracepoint.
> > trace_call__foo() calls the tracepoint callbacks directly without
> > utilizing the static branch again.
> > 
> > Original v2 series:
> > https://lore.kernel.org/linux-trace-kernel/20260323160052.17528-1-vineeth@bitbyteword.org/
> > 
> > Parts of the original v2 series have already been merged in mainline.
> > This patch is being reposted as a follow-up cleanup for the remaining
> > unmerged pieces.
> > 
> > Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> > Assisted-by: Claude:claude-sonnet-4-6  
> 
>     Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> 

Thanks, I'll take this through my tree.

-- Steve


* Re: [PATCH v3 08/11] scsi: ufs: Use trace_call__##name() at guarded tracepoint call sites
From: Steven Rostedt @ 2026-05-15 18:50 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Vineeth Pillai (Google), James E.J. Bottomley, Martin K. Petersen,
	linux-scsi, linux-trace-kernel, Peter Zijlstra
In-Reply-To: <9fde73e7-0108-48d7-a1a0-ccc9776beb5c@acm.org>

On Fri, 15 May 2026 08:27:27 -0700
Bart Van Assche <bvanassche@acm.org> wrote:

> On 5/15/26 6:59 AM, Vineeth Pillai (Google) wrote:
> >   static void ufshcd_add_query_upiu_trace(struct ufs_hba *hba,
> > @@ -432,8 +432,8 @@ static void ufshcd_add_query_upiu_trace(struct ufs_hba *hba,
> >   	if (!trace_ufshcd_upiu_enabled())
> >   		return;
> >   
> > -	trace_ufshcd_upiu(hba, str_t, &rq_rsp->header,
> > -			  &rq_rsp->qr, UFS_TSF_OSF);
> > +	trace_call__ufshcd_upiu(hba, str_t, &rq_rsp->header,
> > +			       &rq_rsp->qr, UFS_TSF_OSF);
> >   }  
> 
> Instead of making this change, please remove the 
> trace_ufshcd_upiu_enabled() call because it is redundant.

You mean to remove the ufshcd_add_query_upiu_trace() function and just use
a tracepoint where it is called?

Makes sense.

> 
> >   static void ufshcd_add_tm_upiu_trace(struct ufs_hba *hba, unsigned int tag,
> > @@ -445,15 +445,15 @@ static void ufshcd_add_tm_upiu_trace(struct ufs_hba *hba, unsigned int tag,
> >   		return;
> >   
> >   	if (str_t == UFS_TM_SEND)
> > -		trace_ufshcd_upiu(hba, str_t,
> > -				  &descp->upiu_req.req_header,
> > -				  &descp->upiu_req.input_param1,
> > -				  UFS_TSF_TM_INPUT);
> > +		trace_call__ufshcd_upiu(hba, str_t,
> > +					&descp->upiu_req.req_header,
> > +					&descp->upiu_req.input_param1,
> > +					UFS_TSF_TM_INPUT);
> >   	else
> > -		trace_ufshcd_upiu(hba, str_t,
> > -				  &descp->upiu_rsp.rsp_header,
> > -				  &descp->upiu_rsp.output_param1,
> > -				  UFS_TSF_TM_OUTPUT);
> > +		trace_call__ufshcd_upiu(hba, str_t,
> > +					&descp->upiu_rsp.rsp_header,
> > +					&descp->upiu_rsp.output_param1,
> > +					UFS_TSF_TM_OUTPUT);
> >   }  
> 
> Same comment here: I think it would be better to remove the 
> trace_ufshcd_upiu_enabled() call rather than
> changing trace_ufshcd_upiu() into trace_call__ufshcd_upiu().

Well, removing it here would mean placing the if (str_t == UFS_TM_SEND) into
the code and processing it even when tracing is disabled. With the
trace_*_enabled() helper, it's all a nop.

-- Steve




* Re: [PATCH 03/13] verification/rvgen: Implement state and transition parser based on Lark
From: Wander Lairson Costa @ 2026-05-15 19:07 UTC (permalink / raw)
  To: Nam Cao; +Cc: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <361efb610ba7c06b3668a953a6847ea80453c2e3.1777962130.git.namcao@linutronix.de>

On Tue, May 05, 2026 at 08:59:24AM +0200, Nam Cao wrote:
> The DOT parsing scripts directly parse the raw text and they are quite
> fragile. If the input dot files' formats are slightly changed (for
> instance, by breaking some long lines, which is allowed by the DOT
> language), the scripts would fail.
> 
> Prepare to move away from the raw text processing, implement parsers based
> on Lark which parse states, transitions and constraints.
> 
> The parse results are not used yet. The existing scripts will be converted
> one by one to them, and the raw text processing will eventually be removed.
> 
> Signed-off-by: Nam Cao <namcao@linutronix.de>
> ---
>  tools/verification/rvgen/rvgen/automata.py | 207 +++++++++++++++++++++
>  1 file changed, 207 insertions(+)
> 
> diff --git a/tools/verification/rvgen/rvgen/automata.py b/tools/verification/rvgen/rvgen/automata.py
> index 4e3d719a0952..32c16736a41b 100644
> --- a/tools/verification/rvgen/rvgen/automata.py
> +++ b/tools/verification/rvgen/rvgen/automata.py
> @@ -194,6 +194,155 @@ class ParseTree:
>          self.node_attrs = attributes_parser.node_attrs
>          self.edge_attrs = attributes_parser.edge_attrs
>  
> +class ConstraintCondition:
> +    def __init__(self, env: str, op: str, val: str, unit=None):
> +        self.env = env
> +        self.op = op
> +        self.val = val
> +        self.unit = unit
> +        if unit is None:
> +            # try to infer unit from constants or parameters
> +            val_for_unit = val.lower().replace("()", "")
> +            if val_for_unit.endswith("_ns"):
> +                self.unit = "ns"
> +            if val_for_unit.endswith("_jiffies"):
> +                self.unit = "j"
> +
> +class ConstraintRule:
> +    grammar = r'''
> +        rule: condition (OP condition)*
> +
> +        OP: "&&" | "||"
> +
> +        condition: ENV CMP_OP VAL UNIT?
> +
> +        ENV: CNAME
> +
> +        CMP_OP: "==" | "<=" | "<" | ">=" | ">"
> +
> +        VAL: /[0-9]+/
> +           | /[A-Z_]+\(\)/
> +           | /[A-Z_]+/
> +           | /[a-z_]+\(\)/
> +           | /[a-z_]+/
> +
> +        UNIT: "ns" | "us" | "ms" | "s"
> +    '''
> +
> +    def __init__(self, c: ConstraintCondition):
> +        '''
> +        A list of pairs of
> +          - the condition (e.g. is_constr_dl == 1)
> +          - the logical operator ("||" or "&&") combining this
> +            condition with the next one if it exists, otherwise None
> +
> +        TODO: Perhaps use an abstract syntax tree instead, because
> +              this representation cannot capture precedence
> +        '''
> +        self.rules = [[c, None]]

Here self.rules is a list of lists...

> +
> +    def chain(self, op: str, c: ConstraintCondition):
> +        self.rules[-1][1] = op
> +        self.rules.append((c, None))

... but here it is a list of tuples.
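
To make the mismatch concrete, here is a reduced sketch of the two methods
quoted above. The conditions are plain strings standing in for
ConstraintCondition objects (an assumption made only to keep the example
self-contained). The first chain() succeeds because the last entry is still
the list created in __init__, but the second chain() tries to assign into the
tuple appended by the first one and raises TypeError:

```python
class ConstraintRule:
    """Reduced copy of the class under review: only the code touching self.rules."""

    def __init__(self, c):
        # First entry is a mutable list: [condition, operator-to-next]
        self.rules = [[c, None]]

    def chain(self, op, c):
        self.rules[-1][1] = op          # requires the last entry to be mutable
        self.rules.append((c, None))    # as posted: appends an immutable tuple


rule = ConstraintRule("a == 1")         # strings are hypothetical stand-ins
rule.chain("&&", "b == 2")              # works: rules[-1] is still the initial list

try:
    rule.chain("||", "c == 3")          # rules[-1] is now the tuple from the first chain()
    second_chain_error = None
except TypeError as exc:
    second_chain_error = exc

print(type(second_chain_error).__name__)
```

Appending [c, None] instead of (c, None) would keep every entry mutable and
let rules with three or more conditions be built.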

> +
> +class ConstraintReset:
> +    def __init__(self, env):
> +        self.env = env
> +
> +class StateLabelParser:
> +    grammar = r'''
> +    label: CNAME ("\\n" condition)?
> +
> +    %import common.CNAME
> +    %import common.WS
> +    %ignore WS
> +    ''' + ConstraintRule.grammar
> +
> +    def __init__(self, label: str):
> +        parser = lark.Lark(self.grammar, parser='lalr', start="label")
> +        tree = parser.parse(label)
> +
> +        self.state = tree.children[0]
> +        self.constraint = None
> +
> +        if len(tree.children) == 2:
> +            self.constraint = ConstraintCondition(*tree.children[1].children)
> +            if self.constraint.op not in ("<", "<="):
> +                raise AutomataError("State constraints must be clock expirations like"
> +                                    f" clk<N ({label})")
> +
> +class EventLabelParser:
> +    grammar = r'''
> +    events: event ("\\n" event)*
> +
> +    event: name (";" guard)*
> +
> +    guard: reset
> +         | rule
> +         | rule reset
> +         | reset rule
> +
> +    name: CNAME
> +
> +    reset: "reset" "(" ENV ")"
> +
> +    %import common.CNAME
> +    %import common.WS
> +    %ignore WS
> +    ''' + ConstraintRule.grammar
> +
> +    class GetEvents(lark.visitors.Transformer):
> +        def guard(self, args):
> +            reset = None
> +            rule = None
> +            for arg in args:
> +                if arg.data == "reset":
> +                    reset = ConstraintReset(arg.children[0])
> +                elif arg.data == "rule":
> +                    conditions = arg.children
> +                    rule = ConstraintRule(conditions[0])
> +                    for i in range(1, len(conditions), 2):
> +                        rule.chain(conditions[i], conditions[i + 1])
> +            return reset, rule
> +
> +        def OP(self, args):
> +            return args
> +
> +        def condition(self, args):
> +            return ConstraintCondition(*args)
> +
> +        def event(self, args):
> +            name = args[0]
> +            rule, reset = None, None
> +            if len(args) == 2:
> +                reset, rule = args[1]
> +            return name, reset, rule
> +
> +        def events(self, args):
> +            return args
> +
> +        def name(self, args):
> +            return args[0]
> +
> +    def __init__(self, label: str):
> +        parser = lark.Lark(self.grammar, parser='lalr', start="events")
> +        tree = parser.parse(label)
> +        self.events = self.GetEvents().transform(tree)
> +
> +class Transition:
> +    def __init__(self, src: str, dst: str, event: str,
> +                 reset: ConstraintReset, rule: ConstraintRule):
> +        self.src = src
> +        self.dst = dst
> +        self.event = event
> +        self.rule = rule
> +        self.reset = reset
> +
> +class State:
> +    def __init__(self, name: str, inv: ConstraintRule):
> +        self.name = name
> +        self.inv = inv
> +
>  class _ConstraintKey:
>      """Base class for constraint keys."""
>  
> @@ -248,6 +397,8 @@ class Automata:
>          self.name = model_name or self.__get_model_name()
>          self.__dot_lines = self.__open_dot()
>          self.__parse_tree = ParseTree(file_path)
> +        self.transitions = self.__parse_transitions()
> +        self._states, self._initial_state, self._final_states = self.__parse_states()
>          self.states, self.initial_state, self.final_states = self.__get_state_variables()
>          self.env_types = {}
>          self.env_stored = set()
> @@ -323,6 +474,62 @@ class Automata:
>  
>          return cursor
>  
> +    def __parse_transitions(self):
> +        transitions = []
> +
> +        for edge in self.__parse_tree.edges:
> +            attr = self.__parse_tree.edge_attrs.get(edge)
> +            if not attr:
> +                continue
> +
> +            label = attr.get("label")
> +
> +            src, dst = edge
> +
> +            parser = EventLabelParser(label)
> +            for event, reset, rule in parser.events:
> +                transitions.append(Transition(src, dst, event, reset, rule))
> +
> +        transitions.sort(key=lambda t : (t.src, t.event))
> +        return transitions
> +
> +    def __parse_states(self):
> +        initial_state = ""
> +        states = []
> +        final_states = []
> +
> +        for node in self.__parse_tree.nodes:
> +            attr = self.__parse_tree.node_attrs[node]
> +            label = attr["label"]
> +
> +            if node.startswith(Automata.init_marker):
> +                initial_state = node[len(Automata.init_marker):]
> +
> +            if not label:
> +                continue
> +
> +            parser = StateLabelParser(attr["label"])
> +            state = State(parser.state, parser.constraint)
> +
> +            states.append(state)
> +
> +            shape = attr.get("shape")
> +            if shape in ("doublecircle", "ellipse"):
> +                final_states.append(state)
> +
> +
> +        initial_state = next((s for s in states if s.name == initial_state), None)
> +        if not initial_state:
> +            raise AutomataError("The automaton doesn't have an initial state")
> +
> +        if not final_states:
> +            final_states.append(initial_state)
> +
> +        states.remove(initial_state)
> +        states.sort(key=lambda s : s.name)
> +        states.insert(0, initial_state)
> +        return states, initial_state, final_states
> +
>      def __get_state_variables(self) -> tuple[list[str], str, list[str]]:
>          # wait for node declaration
>          states = []
> -- 
> 2.47.3
> 



* Re: [PATCH v3 08/11] scsi: ufs: Use trace_call__##name() at guarded tracepoint call sites
From: Bart Van Assche @ 2026-05-15 19:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Vineeth Pillai (Google), James E.J. Bottomley, Martin K. Petersen,
	linux-scsi, linux-trace-kernel, Peter Zijlstra
In-Reply-To: <20260515145048.1c021bc9@gandalf.local.home>

On 5/15/26 11:50 AM, Steven Rostedt wrote:
> On Fri, 15 May 2026 08:27:27 -0700
> Bart Van Assche <bvanassche@acm.org> wrote:
> 
>> On 5/15/26 6:59 AM, Vineeth Pillai (Google) wrote:
>>>    static void ufshcd_add_query_upiu_trace(struct ufs_hba *hba,
>>> @@ -432,8 +432,8 @@ static void ufshcd_add_query_upiu_trace(struct ufs_hba *hba,
>>>    	if (!trace_ufshcd_upiu_enabled())
>>>    		return;
>>>    
>>> -	trace_ufshcd_upiu(hba, str_t, &rq_rsp->header,
>>> -			  &rq_rsp->qr, UFS_TSF_OSF);
>>> +	trace_call__ufshcd_upiu(hba, str_t, &rq_rsp->header,
>>> +			       &rq_rsp->qr, UFS_TSF_OSF);
>>>    }
>>
>> Instead of making this change, please remove the
>> trace_ufshcd_upiu_enabled() call because it is redundant.
> 
> You mean to remove the ufshcd_add_query_upiu_trace() function and just use
> a tracepoint where it is called?

That would be even better.

>>>    static void ufshcd_add_tm_upiu_trace(struct ufs_hba *hba, unsigned int tag,
>>> @@ -445,15 +445,15 @@ static void ufshcd_add_tm_upiu_trace(struct ufs_hba *hba, unsigned int tag,
>>>    		return;
>>>    
>>>    	if (str_t == UFS_TM_SEND)
>>> -		trace_ufshcd_upiu(hba, str_t,
>>> -				  &descp->upiu_req.req_header,
>>> -				  &descp->upiu_req.input_param1,
>>> -				  UFS_TSF_TM_INPUT);
>>> +		trace_call__ufshcd_upiu(hba, str_t,
>>> +					&descp->upiu_req.req_header,
>>> +					&descp->upiu_req.input_param1,
>>> +					UFS_TSF_TM_INPUT);
>>>    	else
>>> -		trace_ufshcd_upiu(hba, str_t,
>>> -				  &descp->upiu_rsp.rsp_header,
>>> -				  &descp->upiu_rsp.output_param1,
>>> -				  UFS_TSF_TM_OUTPUT);
>>> +		trace_call__ufshcd_upiu(hba, str_t,
>>> +					&descp->upiu_rsp.rsp_header,
>>> +					&descp->upiu_rsp.output_param1,
>>> +					UFS_TSF_TM_OUTPUT);
>>>    }
>>
>> Same comment here: I think it would be better to remove the
>> trace_ufshcd_upiu_enabled() call rather than
>> changing trace_ufshcd_upiu() into trace_call__ufshcd_upiu().
> 
> Well, removing it here would mean placing the if (str_t == UFS_TM_SEND) into
> the code and processing it even when tracing is disabled. With the
> trace_*_enabled() helper, it's all a nop.

The ufshcd_add_tm_upiu_trace() function is only called from the UFS
error handler and hence is not performance sensitive. The execution of
an additional if-test in this function is not a concern at all.

Thanks,

Bart.


* Re: [PATCH 06/13] verification/rvgen: Convert __fill_verify_guards_func() to Lark
From: Wander Lairson Costa @ 2026-05-15 19:35 UTC (permalink / raw)
  To: Nam Cao; +Cc: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <e8a636c8ea6da554fd51b1241b9181f65af420c8.1777962130.git.namcao@linutronix.de>

On Tue, May 05, 2026 at 08:59:27AM +0200, Nam Cao wrote:
> Prepare to remove self.guards and self.__parse_constraints(), convert
> __fill_verify_guards_func() to use the parsed transitions from Lark.
> 
> Signed-off-by: Nam Cao <namcao@linutronix.de>
> ---
>  tools/verification/rvgen/rvgen/dot2k.py | 39 ++++++++++++++++++++-----
>  1 file changed, 31 insertions(+), 8 deletions(-)
> 
> diff --git a/tools/verification/rvgen/rvgen/dot2k.py b/tools/verification/rvgen/rvgen/dot2k.py
> index 3a39ae29e41e..cf7e5ddc649c 100644
> --- a/tools/verification/rvgen/rvgen/dot2k.py
> +++ b/tools/verification/rvgen/rvgen/dot2k.py
> @@ -221,6 +221,20 @@ class ha2k(dot2k):
>      def __parse_single_constraint(self, rule: dict, value: str) -> str:
>          return f"ha_get_env(ha_mon, {rule["env"]}{self.enum_suffix}, time_ns) {rule["op"]} {value}"
>  
> +    def __parse_guard_rule(self, rule) -> str:
> +        buff = []
> +        for c, sep in rule.rules:
> +            env = c.env + self.enum_suffix
> +            op = c.op
> +            val = self.__adjust_value(c.val, c.unit)
> +
> +            cond = f"ha_get_env(ha_mon, {env}, time_ns) {op} {val}"
> +            if sep:
> +                cond += f" {sep}"
> +            buff.append(cond)
> +        buff[-1] += ';'
> +        return buff
> +
>      def __get_constraint_env(self, constr: str) -> str:
>          """Extract the second argument from an ha_ function"""
>          env = constr.split("(")[1].split()[1].rstrip(")").rstrip(",")
> @@ -398,8 +412,9 @@ f"""static inline void ha_convert_inv_guard(struct ha_monitor *ha_mon,
>  
>      def __fill_verify_guards_func(self) -> list[str]:
>          buff = []
> -        if not self.guards:
> -            return []
> +
> +        if not self.has_guard:
> +            return

The function's signature says it returns a list, but this path returns
None.
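
A small sketch of why this matters, assuming the caller concatenates the
result onto another list (the caller is not shown in this hunk, so that part
is an assumption, and both function names below are hypothetical stand-ins):

```python
def fill_as_posted(has_guard: bool) -> list[str]:
    buff = []
    if not has_guard:
        return                      # bare return: the caller receives None
    buff.append("static inline bool ha_verify_guards(...)")
    return buff


def fill_returning_list(has_guard: bool) -> list[str]:
    buff = []
    if not has_guard:
        return buff                 # empty list, matching the annotation
    buff.append("static inline bool ha_verify_guards(...)")
    return buff


output = ["/* generated */"]
output += fill_returning_list(False)    # no-op: extending by an empty list

try:
    output += fill_as_posted(False)     # list += None is not allowed
    concat_error = None
except TypeError as exc:
    concat_error = exc

print(output, type(concat_error).__name__)
```

Keeping the early return as "return []" preserves the behaviour of the code
being replaced.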

>  
>          buff.append(
>  f"""static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
> @@ -410,14 +425,22 @@ f"""static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
>  """)
>  
>          _else = ""
> -        for edge, constr in sorted(self.guards.items()):
> +        for transition in self.transitions:
> +            if not transition.rule and not transition.reset:
> +                continue
> +
>              buff.append(f"\t{_else}if (curr_state == "
> -                        f"{self.states[edge[0]]}{self.enum_suffix} && "
> -                        f"event == {self.events[edge[1]]}{self.enum_suffix})")
> -            if constr.count(";") > 0:
> +                        f"{transition.src}{self.enum_suffix} && "
> +                        f"event == {transition.event}{self.enum_suffix})")
> +            rule = transition.rule
> +            reset = transition.reset
> +            if rule and reset:
>                  buff[-1] += " {"
> -            buff += [f"\t\t{c};" for c in constr.split(";")]
> -            if constr.count(";") > 0:
> +            if rule:
> +                buff.append("\t\t" + self.__format_guard_rules(self.__parse_guard_rule(rule))[0])
> +            if reset:
> +                buff.append(f"\t\tha_reset_env(ha_mon, {reset.env}{self.enum_suffix}, time_ns);")
> +            if rule and reset:
>                  _else = "} else "
>              else:
>                  _else = "else "
> -- 
> 2.47.3
> 



* [PATCH v4 0/3] Enable perf tracing for unprivileged users
From: Anubhav Shelat @ 2026-05-15 19:40 UTC (permalink / raw)
  To: mpetlan, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, James Clark, Thomas Falcon,
	linux-kernel, linux-trace-kernel, linux-perf-users
  Cc: Anubhav Shelat

Enable users to use perf-trace to trace their own processes, like strace
but without the overhead of ptrace(). Ensure that users cannot access
other users' or systemwide tracing data.

Changes in v4:
- Preserve security_perf_event_open(PERF_SECURITY_KERNEL) LSM hook in
  the tp_bypass path.
- Lift the PERF_SAMPLE_IP check out of the tp_bypass path above the
  PERF_SAMPLE_RAW branch so it applies to counting and sampling. This
  also allows us to ensure PERF_SAMPLE_IP is set for uprobes.
- Block counting path for TRACE_EVENT_FL_CAP_ANY for unprivileged users
  with sysctl_perf_event_paranoid > 1.

Changes in v3:
- Don't set PERF_SAMPLE_IP for unprivileged tracepoints. This allows us
  to exclude PERF_SAMPLE_IP from kaddr_leak without weakening KASLR.
- Mount tracefs as world-traversable so users can access eventfs
  directories.

Anubhav Shelat (3):
  perf evsel: don't set PERF_SAMPLE_IP for unprivileged tracepoints
  perf: enable unprivileged syscall tracing with perf trace
  tracefs: make root directory world-traversable

 fs/tracefs/inode.c              |  2 +-
 kernel/events/core.c            | 28 +++++++++++++++++++++++++---
 kernel/trace/trace_event_perf.c | 21 ++++++++++++++++++++-
 kernel/trace/trace_events.c     | 16 ++++++++++++++--
 tools/perf/util/evsel.c         | 14 +++++++++++++-
 5 files changed, 73 insertions(+), 8 deletions(-)

-- 
2.54.0



* [PATCH v4 1/3] perf evsel: don't set PERF_SAMPLE_IP for unprivileged tracepoints
From: Anubhav Shelat @ 2026-05-15 19:40 UTC (permalink / raw)
  To: mpetlan, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, James Clark, Thomas Falcon,
	linux-kernel, linux-trace-kernel, linux-perf-users
  Cc: Anubhav Shelat
In-Reply-To: <20260515194010.93725-2-ashelat@redhat.com>

For tracepoint events the IP is a static kernel address.
It doesn't vary by sample and provides no useful information for
unprivileged users. Skipping setting PERF_SAMPLE_IP for unprivileged
tracepoints avoids exposing a kernel address that reveals the KASLR base
offset.

Make an exception for uprobes, which are registered as
PERF_TYPE_TRACEPOINT, because the IP is important for their
functionality and is a safe userspace address. Detect them with
__probe_ip (entry) and __probe_ret_ip (return) using evsel__field().
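
The resulting condition can be modelled as a small truth function. This is
only a sketch: "privileged" stands for the result of
perf_event_paranoid_check(1), "field_names" stands for the fields that
evsel__field() would find, and the function name is hypothetical.

```python
PERF_TYPE_HARDWARE = 0      # values from the perf_event.h uapi header
PERF_TYPE_TRACEPOINT = 2


def should_set_sample_ip(attr_type, privileged, field_names):
    """Model of the condition guarding evsel__set_sample_bit(evsel, IP)."""
    is_uprobe = "__probe_ip" in field_names or "__probe_ret_ip" in field_names
    return attr_type != PERF_TYPE_TRACEPOINT or privileged or is_uprobe


# Unprivileged kernel tracepoint: IP is left out of the sample type.
print(should_set_sample_ip(PERF_TYPE_TRACEPOINT, False, {"id", "args"}))
# Unprivileged uprobe (entry or return): IP is kept.
print(should_set_sample_ip(PERF_TYPE_TRACEPOINT, False, {"__probe_ip"}))
# Non-tracepoint events and privileged users are unaffected.
print(should_set_sample_ip(PERF_TYPE_HARDWARE, False, set()))
print(should_set_sample_ip(PERF_TYPE_TRACEPOINT, True, {"id", "args"}))
```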

Assisted-by: Claude:claude-sonnet-4.5
Signed-off-by: Anubhav Shelat <ashelat@redhat.com>
---
 tools/perf/util/evsel.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 2ee87fd84d3e..bf66e0c78451 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1509,7 +1509,19 @@ void evsel__config(struct evsel *evsel, const struct record_opts *opts,
 	attr->write_backward = opts->overwrite ? 1 : 0;
 	attr->read_format   = PERF_FORMAT_LOST;
 
-	evsel__set_sample_bit(evsel, IP);
+	/*
+	 * Don't set PERF_SAMPLE_IP for unprivileged kernel tracepoints to
+	 * avoid exposing kernel addresses. Uprobes expose only userspace
+	 * addresses so they're safe. Detect entry and return uprobes.
+	 */
+	if (attr->type != PERF_TYPE_TRACEPOINT || perf_event_paranoid_check(1)
+#ifdef HAVE_LIBTRACEEVENT
+	    || evsel__field(evsel, "__probe_ip")
+	    || evsel__field(evsel, "__probe_ret_ip")
+#endif
+	    )
+		evsel__set_sample_bit(evsel, IP);
+
 	evsel__set_sample_bit(evsel, TID);
 
 	if (evsel->sample_read) {
-- 
2.54.0



* [PATCH v4 3/3] tracefs: make root directory world-traversable
From: Anubhav Shelat @ 2026-05-15 19:40 UTC (permalink / raw)
  To: mpetlan, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, James Clark, Thomas Falcon,
	linux-kernel, linux-trace-kernel, linux-perf-users
  Cc: Anubhav Shelat
In-Reply-To: <20260515194010.93725-2-ashelat@redhat.com>

Change the default tracefs mount mode from 0700 to 0755. This allows
unprivileged users to access the eventfs directories underneath which
already use 0755.

Tracing data files use mode 0440 and 0640 so they are not exposed by
this change. Only the format and id files, which have been marked as
world-readable, become accessible.

Directory listings of kprobes and uprobes, which contain functions or
binaries, become visible to unprivileged users but do not contain kernel
addresses. Admins using probes can restore the previous behavior with
chmod or mount -o mode=700.
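
For reference, the permission bits involved can be checked with Python's stat
module. This is only an illustration of the mode values named above; the
0440 | 0004 line assumes that a world-readable format/id file is the
read-only file mode with the "other read" bit added, and the helper names
are hypothetical.

```python
import stat

OLD_ROOT_MODE = 0o700   # TRACEFS_DEFAULT_MODE before this patch
NEW_ROOT_MODE = 0o755   # TRACEFS_DEFAULT_MODE after this patch


def other_can_traverse(mode):
    """True if users outside owner and group may traverse a directory."""
    return bool(mode & stat.S_IXOTH)


def other_can_read(mode):
    """True if users outside owner and group may read a file."""
    return bool(mode & stat.S_IROTH)


print(stat.filemode(stat.S_IFDIR | OLD_ROOT_MODE), other_can_traverse(OLD_ROOT_MODE))
print(stat.filemode(stat.S_IFDIR | NEW_ROOT_MODE), other_can_traverse(NEW_ROOT_MODE))

# Tracing data files stay closed to "other" regardless of the root mode.
for data_mode in (0o440, 0o640):
    print(stat.filemode(stat.S_IFREG | data_mode), other_can_read(data_mode))

# A file marked world-readable gains only the "other read" bit.
print(stat.filemode(stat.S_IFREG | 0o440 | 0o004))
```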

Signed-off-by: Anubhav Shelat <ashelat@redhat.com>
---
 fs/tracefs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index f3d6188a3b7b..3a6a0c800a8b 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -23,7 +23,7 @@
 #include <linux/slab.h>
 #include "internal.h"

-#define TRACEFS_DEFAULT_MODE	0700
+#define TRACEFS_DEFAULT_MODE	0755
 static struct kmem_cache *tracefs_inode_cachep __ro_after_init;

 static struct vfsmount *tracefs_mount;
-- 
2.54.0


* [PATCH v4 2/3] perf: enable unprivileged syscall tracing with perf trace
From: Anubhav Shelat @ 2026-05-15 19:40 UTC (permalink / raw)
  To: mpetlan, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, James Clark, Thomas Falcon,
	linux-kernel, linux-trace-kernel, linux-perf-users
  Cc: Anubhav Shelat
In-Reply-To: <20260515194010.93725-2-ashelat@redhat.com>

Allow unprivileged users to trace their own processes' syscalls using
perf trace, similar to strace without the intrusive overhead of ptrace().

Currently, perf trace requires CAP_PERFMON or paranoid level ≤ 1 even
though the kernel has existing infrastructure (TRACE_EVENT_FL_CAP_ANY)
specifically designed to mark syscall tracepoints as safe for
unprivileged access. To fix this:

1. Loosen the condition in perf_event_open() which requires privileges
   for all events with exclude_kernel=0. This allows perf_event_open() to
   bypass the paranoid check for task-attached tracepoint events. Ensure
   that sample types which can expose kernel addresses to unprivileged
   users are blocked. Ensure the PERF_SECURITY_KERNEL LSM hook is
   preserved.

2. Make the format and id tracefs files world-readable only for tracepoints
   with TRACE_EVENT_FL_CAP_ANY, allowing unprivileged users to see syscall
   tracepoint ids without exposing sensitive information.

3. Add a check to perf_trace_event_perm() to block PERF_SAMPLE_IP on
   kernel tracepoints for unprivileged users to prevent KASLR bypass. We do
   this here rather than in kaddr_leak because perf_trace_event_perm() can
   distinguish between kernel tracepoints and uprobe tracepoints, where the
   IP is a safe user space address and is necessary for uprobe
   functionality.

4. Restrict pure counting events (no PERF_SAMPLE_RAW) to
   TRACE_EVENT_FL_CAP_ANY tracepoints, preventing unprivileged users from
   counting internal kernel tracepoints while preserving current
   behavior for exclude_kernel=1 events.

Example usage after this change:
  $ perf trace ls          # works as unprivileged user
  $ perf trace             # system-wide, still requires privileges
  $ perf trace -p 1234     # requires ptrace permission on pid 1234
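
The perf_event_open() part of step 1 can be summarised with a small model.
This is a sketch of the decision in the kernel/events/core.c hunk below, not
kernel code: the function name is hypothetical, the constants are the uapi
values from perf_event.h, and the later checks in perf_trace_event_perm()
and the ptrace permission check are not modelled.

```python
PERF_TYPE_TRACEPOINT = 2

PERF_SAMPLE_ADDR = 1 << 3
PERF_SAMPLE_CALLCHAIN = 1 << 5
PERF_SAMPLE_RAW = 1 << 10
PERF_SAMPLE_BRANCH_STACK = 1 << 11
PERF_SAMPLE_REGS_INTR = 1 << 18

# Sample types that can expose kernel addresses.
KADDR_LEAK = (PERF_SAMPLE_CALLCHAIN | PERF_SAMPLE_BRANCH_STACK |
              PERF_SAMPLE_ADDR | PERF_SAMPLE_REGS_INTR)


def needs_perf_allow_kernel(attr_type, pid, sample_type, exclude_kernel):
    """True if the paranoid/capability check is still applied."""
    if exclude_kernel:
        return False                    # the whole block is skipped
    tp_bypass = (attr_type == PERF_TYPE_TRACEPOINT and pid != -1 and
                 not (sample_type & KADDR_LEAK))
    return not tp_bypass                # bypass path calls only the LSM hook


# Task-attached tracepoint with raw samples, as in "perf trace ls".
print(needs_perf_allow_kernel(PERF_TYPE_TRACEPOINT, 1234, PERF_SAMPLE_RAW, False))
# System-wide tracing (pid == -1) still goes through perf_allow_kernel().
print(needs_perf_allow_kernel(PERF_TYPE_TRACEPOINT, -1, PERF_SAMPLE_RAW, False))
# Requesting callchains disables the bypass.
print(needs_perf_allow_kernel(PERF_TYPE_TRACEPOINT, 1234,
                              PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN, False))
```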

Assisted-by: Claude:claude-sonnet-4.5
Signed-off-by: Anubhav Shelat <ashelat@redhat.com>
---
 kernel/events/core.c            | 28 +++++++++++++++++++++++++---
 kernel/trace/trace_event_perf.c | 21 ++++++++++++++++++++-
 kernel/trace/trace_events.c     | 16 ++++++++++++++--
 3 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7935d5663944..ff2d1e9a0b79 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13873,9 +13873,31 @@ SYSCALL_DEFINE5(perf_event_open,
 		return err;
 
 	if (!attr.exclude_kernel) {
-		err = perf_allow_kernel();
-		if (err)
-			return err;
+		bool tp_bypass = false;
+
+		/* Check unprivileged tracepoints */
+		if (attr.type == PERF_TYPE_TRACEPOINT && pid != -1) {
+			/*
+			 * Block sample types that expose kernel addresses to
+			 * prevent KASLR bypass
+			 */
+			u64 kaddr_leak = PERF_SAMPLE_CALLCHAIN |
+					 PERF_SAMPLE_BRANCH_STACK |
+					 PERF_SAMPLE_ADDR |
+					 PERF_SAMPLE_REGS_INTR;
+
+			tp_bypass = !(attr.sample_type & kaddr_leak);
+		}
+
+		if (!tp_bypass) {
+			err = perf_allow_kernel();
+			if (err)
+				return err;
+		} else {
+			err = security_perf_event_open(PERF_SECURITY_KERNEL);
+			if (err)
+				return err;
+		}
 	}
 
 	if (attr.namespaces) {
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index a6bb7577e8c5..466007ed2869 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -72,9 +72,28 @@ static int perf_trace_event_perm(struct trace_event_call *tp_event,
 			return -EINVAL;
 	}
 
+	/*
+	 * PERF_SAMPLE_IP on kernel tracepoints exposes a kernel text
+	 * address, weakening KASLR. Block for unprivileged users unless
+	 * the tracepoint is a uprobe (userspace IP, safe to expose).
+	 */
+	if ((p_event->attr.sample_type & PERF_SAMPLE_IP) &&
+	    !p_event->attr.exclude_kernel &&
+	    !(tp_event->flags & TRACE_EVENT_FL_UPROBE) &&
+	    sysctl_perf_event_paranoid > 1 && !perfmon_capable())
+		return -EACCES;
+
 	/* No tracing, just counting, so no obvious leak */
-	if (!(p_event->attr.sample_type & PERF_SAMPLE_RAW))
+	if (!(p_event->attr.sample_type & PERF_SAMPLE_RAW)) {
+		/* Prevent unprivileged users from counting kernel tracepoints */
+		if (!p_event->attr.exclude_kernel &&
+		    sysctl_perf_event_paranoid > 1 && !perfmon_capable()) {
+			if (!(p_event->attach_state == PERF_ATTACH_TASK &&
+			      (tp_event->flags & TRACE_EVENT_FL_CAP_ANY)))
+				return -EACCES;
+		}
 		return 0;
+	}
 
 	/* Some events are ok to be traced by non-root users... */
 	if (p_event->attach_state == PERF_ATTACH_TASK) {
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index c46e623e7e0d..cbd07e2ec528 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -3050,7 +3050,13 @@ static int event_callback(const char *name, umode_t *mode, void **data,
 	struct trace_event_call *call = file->event_call;
 
 	if (strcmp(name, "format") == 0) {
-		*mode = TRACE_MODE_READ;
+		/*
+		 * Make format tracefs file world readable for tracepoints with
+		 * TRACE_EVENT_FL_CAP_ANY
+		 */
+		*mode = (call->flags & TRACE_EVENT_FL_CAP_ANY) ?
+			(TRACE_MODE_READ | 0004) :
+			TRACE_MODE_READ;
 		*fops = &ftrace_event_format_fops;
 		return 1;
 	}
@@ -3086,7 +3092,13 @@ static int event_callback(const char *name, umode_t *mode, void **data,
 #ifdef CONFIG_PERF_EVENTS
 	if (call->event.type && call->class->reg &&
 	    strcmp(name, "id") == 0) {
-		*mode = TRACE_MODE_READ;
+		/*
+		 * Make id tracefs file world readable for tracepoints with
+		 * TRACE_EVENT_FL_CAP_ANY
+		 */
+		*mode = (call->flags & TRACE_EVENT_FL_CAP_ANY) ?
+			(TRACE_MODE_READ | 0004) :
+			TRACE_MODE_READ;
 		*data = (void *)(long)call->event.type;
 		*fops = &ftrace_event_id_fops;
 		return 1;
-- 
2.54.0



* [PATCH] tracing: Fix desc in error path for the trace remote test module
From: Vincent Donnefort @ 2026-05-15 20:16 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort, Sashiko

During initialisation in remote_test_load(), if one of the
simple_ring_buffers fails to initialise, the error path attempts to
roll back the initialised buffers. However, the rollback incorrectly uses the
global pointer to the trace descriptor, which is only set upon
successful load completion. Fix the error path by using the local
pointer to the descriptor.

Fixes: ea908a2b79c8 ("tracing: Add a trace remote module for testing")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/kernel/trace/remote_test.c b/kernel/trace/remote_test.c
index 6c1b7701ddae..a3e2c9b606eb 100644
--- a/kernel/trace/remote_test.c
+++ b/kernel/trace/remote_test.c
@@ -110,9 +110,9 @@ static struct trace_buffer_desc *remote_test_load(unsigned long size, void *unus
 	return remote_test_buffer_desc;
 
 err_unload:
-	for_each_ring_buffer_desc(rb_desc, cpu, remote_test_buffer_desc)
+	for_each_ring_buffer_desc(rb_desc, cpu, desc)
 		remote_test_unload_simple_rb(rb_desc->cpu);
-	trace_remote_free_buffer(remote_test_buffer_desc);
+	trace_remote_free_buffer(desc);
 
 err_free_desc:
 	kfree(desc);

base-commit: 5d6919055dec134de3c40167a490f33c74c12581
-- 
2.54.0.563.g4f69b47b94-goog



* Re: [PATCH 1/7] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Andrii Nakryiko @ 2026-05-15 20:31 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-2-jolsa@kernel.org>

On Thu, May 14, 2026 at 6:53 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Andrii reported an issue with optimized uprobes [1] that can clobber
> redzone area with call instruction storing return address on stack
> where user code may keep temporary data without adjusting rsp.
>
> Fixing this by moving the optimized uprobes on top of 10-bytes nop
> instruction, so we can squeeze another instruction to escape the
> redzone area before doing the call, like:
>
>   lea -0x80(%rsp), %rsp
>   call tramp
>
> Note the lea instruction is used to adjust the rsp register without
> changing the flags.

I think it should be very loudly explained that we can't go back to
nop10 and have to do a short jump over the patched sequence (and why).

>
> The optimized uprobe performance stays the same:
>
>         uprobe-nop     :    3.129 ± 0.013M/s
>         uprobe-push    :    3.045 ± 0.006M/s
>         uprobe-ret     :    1.095 ± 0.004M/s
>   -->   uprobe-nop10   :    7.170 ± 0.020M/s
>         uretprobe-nop  :    2.143 ± 0.021M/s
>         uretprobe-push :    2.090 ± 0.000M/s
>         uretprobe-ret  :    0.942 ± 0.000M/s
>   -->   uretprobe-nop10:    3.381 ± 0.003M/s
>         usdt-nop       :    3.245 ± 0.004M/s
>   -->   usdt-nop10     :    7.256 ± 0.023M/s
>
> [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> Reported-by: Andrii Nakryiko <andrii@kernel.org>
> Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  arch/x86/kernel/uprobes.c | 121 +++++++++++++++++++++++++++-----------
>  1 file changed, 86 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index ebb1baf1eb1d..f7c4101a4039 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -636,9 +636,21 @@ struct uprobe_trampoline {
>         unsigned long           vaddr;
>  };
>
> +#define LEA_INSN_SIZE          5
> +#define OPT_INSN_SIZE          (LEA_INSN_SIZE + CALL_INSN_SIZE)
> +#define OPT_JMP8_OFFSET                (OPT_INSN_SIZE - JMP8_INSN_SIZE)
> +#define REDZONE_SIZE           0x80
> +
> +static const u8 lea_rsp[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
> +
> +static bool is_lea_insn(const uprobe_opcode_t *insn)
> +{
> +       return !memcmp(insn, lea_rsp, LEA_INSN_SIZE);
> +}
> +
>  static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
>  {
> -       long delta = (long)(vaddr + 5 - vtramp);
> +       long delta = (long)(vaddr + OPT_INSN_SIZE - vtramp);
>
>         return delta >= INT_MIN && delta <= INT_MAX;
>  }
> @@ -651,7 +663,7 @@ static unsigned long find_nearest_trampoline(unsigned long vaddr)
>         };
>         unsigned long low_limit, high_limit;
>         unsigned long low_tramp, high_tramp;
> -       unsigned long call_end = vaddr + 5;
> +       unsigned long call_end = vaddr + OPT_INSN_SIZE;
>
>         if (check_add_overflow(call_end, INT_MIN, &low_limit))
>                 low_limit = PAGE_SIZE;
> @@ -826,8 +838,8 @@ SYSCALL_DEFINE0(uprobe)

Should we change -ENXIO to -EPROTO or some other distinct error code,
so libbpf can avoid using nop5 attachment on kernels new enough to
support nop5 optimization, but old enough to not do this properly with
nop10?

>         regs->ax  = args.ax;
>         regs->r11 = args.r11;
>         regs->cx  = args.cx;
> -       regs->ip  = args.retaddr - 5;
> -       regs->sp += sizeof(args);
> +       regs->ip  = args.retaddr - OPT_INSN_SIZE;
> +       regs->sp += sizeof(args) + REDZONE_SIZE;
>         regs->orig_ax = -1;
>
>         sp = regs->sp;

[...]
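
For reference, the size arithmetic and the lea encoding in the quoted hunks
can be checked with a userspace sketch. LEA_INSN_SIZE, REDZONE_SIZE and the
lea_rsp bytes are taken from the patch; the values of CALL_INSN_SIZE and
JMP8_INSN_SIZE are not shown in the quoted hunks and are assumed here (5-byte
near call, 2-byte short jump). lea_displacement() and probe_addr() are
hypothetical helpers added for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values quoted in the patch. */
#define LEA_INSN_SIZE	5
#define REDZONE_SIZE	0x80

/*
 * Assumed here, not shown in the quoted hunks: a near call with a 32-bit
 * displacement is 5 bytes and a short jump with an 8-bit displacement is
 * 2 bytes on x86-64.
 */
#define CALL_INSN_SIZE	5
#define JMP8_INSN_SIZE	2

#define OPT_INSN_SIZE	(LEA_INSN_SIZE + CALL_INSN_SIZE)
#define OPT_JMP8_OFFSET	(OPT_INSN_SIZE - JMP8_INSN_SIZE)

/* lea -0x80(%rsp), %rsp: REX.W, opcode, ModRM, SIB, disp8 */
static const uint8_t lea_rsp[LEA_INSN_SIZE] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };

/* Same comparison as the patch's is_lea_insn(), on a plain byte buffer. */
static int is_lea_insn(const uint8_t *insn)
{
	assert(insn);
	return !memcmp(insn, lea_rsp, LEA_INSN_SIZE);
}

/* The trailing byte is a signed 8-bit displacement; sign-extend it. */
static int lea_displacement(void)
{
	int b = lea_rsp[LEA_INSN_SIZE - 1];

	return b >= 0x80 ? b - 0x100 : b;
}

/* Address of the probed instruction, given the call's return address. */
static unsigned long probe_addr(unsigned long retaddr)
{
	return retaddr - OPT_INSN_SIZE;
}
```

With these assumptions the lea plus the call fill exactly ten bytes, and the
lea displacement equals -REDZONE_SIZE, which matches the extra REDZONE_SIZE
added back to regs->sp in the syscall.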


* Re: [PATCH v4 3/3] tracefs: make root directory world-traversable
From: Steven Rostedt @ 2026-05-15 23:16 UTC (permalink / raw)
  To: Anubhav Shelat
  Cc: mpetlan, Masami Hiramatsu, Mathieu Desnoyers, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Thomas Falcon, linux-kernel, linux-trace-kernel,
	linux-perf-users
In-Reply-To: <20260515194010.93725-5-ashelat@redhat.com>

On Fri, 15 May 2026 15:40:07 -0400
Anubhav Shelat <ashelat@redhat.com> wrote:

> Change the default tracefs mount mode from 0700 to 0755. This allows
> unprivileged users to access the eventfs directories underneath which
> already use 0755.
> 
> Tracing data files use mode 0440 and 0640 so they are not exposed by
> this change. Only the format and id files, which have been marked as
> world-readable, become accessible.
> 
> Directory listings of kprobes and uprobes, which contain functions or
> binaries, become visible to unprivileged users but do not contain kernel
> addresses. Admins using probes can restore the previous behavior with
> chmod or mount -o mode=700.
> 

I've been thinking about this and I believe a better approach is to
make an eventfs that is mounted at:

 /sys/kernel/events

and has the same directory structure as /sys/kernel/tracing/events but
contains only read-only files like "id" and "format". This directory
would be mounted as 555 and readable by all.

-- Steve


* Re: [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Lance Yang @ 2026-05-16  4:06 UTC (permalink / raw)
  To: Breno Leitao
  Cc: linmiaohe, akpm, david, ljs, vbabka, rppt, surenb, mhocko, shuah,
	nao.horiguchi, rostedt, mhiramat, mathieu.desnoyers, corbet,
	skhan, liam, linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team
In-Reply-To: <agcbfLHT5ZWnNeN0@gmail.com>



On 2026/5/15 21:13, Breno Leitao wrote:
[...]
>>
>> Wonder if it would be simpler to just do a positive check near the top
>> of get_any_page() instead. Something like:
>>
>> static bool hwpoison_unrecoverable_kernel_page(struct page *page,
>> 						unsigned long flags)
> 
> Ack. We probably want to call it something like HWPoisonKernelOwned() to
> follow the same naming semantics of these helpers, such as HWPoisonHandlable()
> 
> By the way, I will re-include the self test back to this patch series,
> In case they are not useful, we do not merge it.
> 

Sounds good :)

Can you also test the relevant page types if possible, especially
the ones the new helper is supposed to classify?

Cheers, Lance


* Re: [RFC PATCH v3] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: Aaron Tomlin @ 2026-05-16 17:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Jonathan Corbet, Song Liu, KP Singh,
	Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard, Kumar Kartikeya Dwivedi,
	Masami Hiramatsu, Shuah Khan, Jiri Olsa, Martin KaFai Lau,
	Yonghong Song, Mathieu Desnoyers, Randy Dunlap, neelx, sean,
	chjohnst, steve, mproche, nick.lange, open list:DOCUMENTATION,
	LKML, bpf, linux-trace-kernel
In-Reply-To: <CAADnVQLw+_NaOVeaKabuf085wNo_-6MAv8w0EDO3fBz3KCQT5g@mail.gmail.com>

On Wed, May 13, 2026 at 09:35:29AM -0700, Alexei Starovoitov wrote:
> On Wed, May 13, 2026 at 8:23 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Wed, 13 May 2026 08:16:07 -0700
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> > > It's impossible to track all modifications.
> > > See what sched-ext is doing.
> > > What does it modify? Everything.
> >
> > What about just having a list of what BPF programs are loaded, what they
> > may be attached to, and what kfuncs they are calling?
> 
> Ohh. These have been available forever.
> Just bpftool prog, bpftool link, bpftool prog dump xlated

Hi Alexei,

Thank you for sharing.

Kind regards,
-- 
Aaron Tomlin


* Re: [RFC PATCH v2.2 18/28] mm/damon: trace probe_hits
From: SeongJae Park @ 2026-05-16 17:31 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt, damon, linux-kernel, linux-mm, linux-trace-kernel
In-Reply-To: <20260515004433.128933-19-sj@kernel.org>

On Thu, 14 May 2026 17:44:19 -0700 SeongJae Park <sj@kernel.org> wrote:

> Introduce a new tracepoint for exposing the per-region per-probe
> positive sample count via tracefs.
> 
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
>  include/trace/events/damon.h | 38 ++++++++++++++++++++++++++++++++++++
>  mm/damon/core.c              |  9 +++++++++
>  2 files changed, 47 insertions(+)
> 
> diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
> index 24fc402ab3c85..ec1e317923fd3 100644
> --- a/include/trace/events/damon.h
> +++ b/include/trace/events/damon.h
> @@ -130,6 +130,44 @@ TRACE_EVENT(damon_monitor_intervals_tune,
>  	TP_printk("sample_us=%lu", __entry->sample_us)
>  );
>  
> +TRACE_EVENT_CONDITION(damon_aggregated_v2,

I was thinking [1] about a better name for this tracepoint.  I will rename it
to 'damon_region_aggregated'.  I will also deprecate damon_aggregated in
multiple phases, as we did for the DAMON debugfs interface.  The idea off the
top of my head at the moment is:

1. announce it as deprecated in the document, by end of 2026
2. rename it (e.g., damon_aggregated_deprecated) by end of 2027
3. remove the code by end of 2028

The deprecation might be done faster than the current idea.

As Steven commented [2], it should be ok to immediately remove it or
extend it to have probe_hits.  But I realize I'm quite lazy at DAMON
user-space tool development, and feel more comfortable with this approach for
now.  Please let me know if anyone has a different opinion.

[1] https://lore.kernel.org/20260514000611.147809-1-sj@kernel.org
[2] https://lore.kernel.org/20260513203237.3b1b3286@gandalf.local.home

Thanks,
SJ

[...]


* [RFC PATCH v3 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-16 18:36 UTC (permalink / raw)
  Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel

TL; DR
======

Extend DAMON for monitoring general data attributes other than accesses.
The short term motivation is lightweight page type (e.g., belonging
cgroup) aware monitoring.  In the long term, this will help extend DAMON
to multiple access event capture primitives (e.g., page faults and
PMU) and eventually pivot DAMON to a "Data Attributes Monitoring and
Operations eNgine".

Background: High Cost of Page Level Properties Monitoring
=========================================================

DAMON was initially introduced as a Data Access MONitor.  It has been
extended not only for access monitoring but also for data access-aware
system operations (DAMOS).  But the monitoring part is still only for
data accesses.

Data access patterns are good information, but some users need more
holistic views.  Particularly, users want to see the access pattern
information together with the types of the memory.  For example, users
who work on using huge pages efficiently want to know how much of
DAMON-found hot/cold regions are backed by huge pages.  Users who run
multiple workloads with different cgroups want to know how much of
DAMON-found hot/cold regions belong to specific cgroups.

To meet this demand, we developed a DAMOS extension for page level
properties based monitoring [1], which has landed in 6.14.  Using the
feature, users can inform the page level data properties that they are
interested in, in a flexible format that uses DAMOS filters.  Then,
DAMON applies the filters to each folio of the entire DAMON region and
lets users know how many bytes of memory in each DAMON region passed the
given filters.

This gives page level detailed and deterministic information to users.
But, because the operation is done at page level, the overhead is
proportional to the memory size.  It was useful for test or debugging
purposes on a small number of machines.  But it was obviously too heavy
to be enabled always on all machines running the real user workloads.
For real world workloads, it was recommended to use the feature with
user-space controlled sampling approaches.  For example, users could do
the page level monitoring only once per hour, on a randomly selected one
percent of the machines in their fleet.  If the runtime is long enough and
the fleet is big enough, it should provide statistically meaningful
data.

But users are too busy to implement such controls on their own.

Data Attributes Monitoring
==========================

Extend DAMON to monitor not only data accesses, but also general data
attributes.  Do the extension while keeping the main promise of DAMON,
the bounded and best-effort minimum overhead.

Allow users to specify what data attributes in addition to the data
access they want to monitor.  Users can install one 'data probe' per
data attribute of their interest for this purpose.  The 'data probe'
should be able to be applied to any memory, and determine if the given
memory has the appropriate data attribute.  E.g., if memory of physical
address 42 belongs to cgroup A.  Each 'data probe' is configured with
filters that are very similar to the DAMOS filters.

When DAMON checks if the sampling address memory of each region has been
accessed since the last check, it also applies the data probes, if any are
registered.  In the same way it accounts access check-positive samples
(nr_accesses), it counts the positive samples of each data probe in
another per-region counter array, namely 'probe_hits'.  When
DAMON resets nr_accesses every aggregation interval, it resets
'probe_hits' together.

Users can read 'probe_hits' just before the values are reset.  In this
way, users can know how many hot/cold memory regions have data
attributes of their interest.  E.g., 30 percent of this system's hot
memory belongs to cgroup A, and 80 percent of the cgroup
A-belonging hot memory is backed by huge pages.
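
The accounting described in this section can be sketched in userspace C as
below.  This is only an illustration of the idea, not the code in this
series: struct region, probe_fn, sample_region() and reset_aggregated() are
hypothetical names, and the two example probes are made up.  MAX_PROBES
reflects the four-probe limit of this version.

```c
#include <assert.h>
#include <string.h>

#define MAX_PROBES 4	/* this version limits data probes to four */

/* Hypothetical, simplified region; not the kernel's struct damon_region. */
struct region {
	unsigned long start, end;
	unsigned int nr_accesses;
	unsigned char probe_hits[MAX_PROBES];
};

/* Hypothetical probe: returns nonzero if @addr has the attribute. */
typedef int (*probe_fn)(unsigned long addr);

/* One sampling step: account the access check and every probe. */
static void sample_region(struct region *r, unsigned long addr, int accessed,
			  const probe_fn *probes, unsigned int nr_probes)
{
	assert(nr_probes <= MAX_PROBES);
	if (accessed)
		r->nr_accesses++;
	for (unsigned int i = 0; i < nr_probes; i++)
		if (probes[i](addr))
			r->probe_hits[i]++;
}

/* End of an aggregation interval: counters are read, then reset together. */
static void reset_aggregated(struct region *r)
{
	r->nr_accesses = 0;
	memset(r->probe_hits, 0, sizeof(r->probe_hits));
}

/* Example probes used only for illustration. */
static int probe_even_page(unsigned long addr)
{
	return ((addr >> 12) & 1) == 0;
}

static int probe_always(unsigned long addr)
{
	(void)addr;
	return 1;
}
```

The sampled address is shared between the access check and the probes, so
each additional probe costs one extra check per sample rather than a scan of
every page in the region.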

Patches Sequence
================

First eight patches implement the core feature, interface and the
working support.  Patch 1 introduces data probe data structure, namely
damon_probe.  Patch 2 extends damon_ctx for installing data probes.
Patch 3 introduces another data structure for filters of each data
probe, namely damon_filter.  Patch 4 updates damon_ctx commit function
to handle the probes.  Patch 5 extends damon_region for the per-region
per-probe positive samples counter, namely probe_hits.  Patch 6 extends
damon_operations for applying probes on the underlying DAMON operations
implementation.  Patch 7 updates kdamond_fn() to invoke the probes
applying callback.  Patch 8 finally implements the probes support on
paddr ops.

Ten changes for user interface (patches 9-18) come next.  Patches 9-13
implement sysfs directories and files for setting data probes, namely
probes directory, probe directory, filters directory, filter directory
and filter directory internal files, respectively.  Patch 14 connects
the user inputs that are made via the sysfs files to DAMON core.
Following three patches (patches 15-17) implement sysfs directories and
files for showing the probe_hits to users, namely probes directory,
probe directory and hits files, respectively.  Patch 18 introduces a new
tracepoint for showing the probe_hits via tracefs.

Patch 19 adds a selftest for the sysfs files.

Patches 20 and 21 document the design and usage of the new feature,
respectively.

Seven additional patches (patches 22-28) for monitoring belonging memory
cgroup follow.  Depending on the feedback, this part might be separated
to another series in future.  Patch 22 defines the DAMON filter type for
the new attribute, namely DAMON_FILTER_TYPE_MEMCG.  Patch 23 adds the
support on paddr ops.  Patch 24 updates the sysfs interface for setup of
the target memcg.  Patch 25 moves code for easy reuse of the filter
target memcg setup.  Patch 26 connects the user input to the core layer.
Finally, patches 27 and 28 update the design and usage documents for the
memcg attribute monitoring support.

Discussions
===========

This allows the page properties monitoring with overhead that is low
enough to be enabled always on real world workloads.  Because the
sampling time for access check is reused for data attributes check, the
upper-bounded and best-effort minimum overhead of DAMON is kept.
Because the sampling memory for access check is reused for data
attributes check, additional overhead is minimal.

DAMOS-based page level properties monitoring should still be useful,
because it provides deterministic page level information.  When in
doubt about the sampling based information, running the DAMOS-based one
together and comparing the results would be useful for debugging and
tuning.

Plan for Dropping RFC tag
=========================

Making changes for feedback from myself, humans and Sashiko should be
the major remaining work.

I'm currently hoping to drop the RFC tag by 7.2-rc1.

Future Works: Mid Term
======================

This version of the implementation limits the maximum number of data
probes to four.  I will try to find a way to remove the limit in future.
I personally think it should be enough for common use cases, though, and
am therefore not giving it high priority at the moment.

Future Works: Long Term
=======================

There are user requests for extending DAMON with detailed access
information, for example, per-CPUs/threads/read/writes monitoring.  For
that, I was working [2] on extending DAMON to use page fault events as
another access check primitive, and making the infrastructure flexible
for future use of yet another access check primitive.  Actually there is
another ongoing work [3] for extending DAMON with PMU events.  The
motivation of the work is reducing the overhead, though.

In my work [2], I was introducing a new interface for access sampling
primitives control.  Now I think this data probe interface can be used
for that, too.  That is, data access becomes just one type of data
attribute.  Also, pg_idle-confirmed access, page fault-confirmed access,
and PMU event-confirmed access will be different types of data
attributes.

The regions adjustment mechanism is currently working based on the
access information.  That's because DAMON is designed for data access
monitoring.  That is, data access information is the primary interest,
and therefore DAMON adjusts regions in a way that can best-present the
information.

Once data access becomes just one of the data attributes, there is no reason
to treat data access as special.  There might be some users not
interested in access at all but want to know the location of memory of
specific type.  Data probes interface will allow doing that.  Further,
we could extend the interface to let users set any data attribute as the
'primary' attribute.  Then, DAMON will split and merge regions in a way
that can best-present the 'primary' attributes.

DAMOS will also be extended, to specify targets based on not only the
data access pattern, but all user-registered data attributes.  From this
stage, we may be able to call DAMON a "Data Attributes Monitoring and
Operations eNgine".

[1] https://lore.kernel.org/20250106193401.109161-1-sj@kernel.org
[2] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org/
[3] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com

Changes from RFC v2.2
- rfc v2.2: https://lore.kernel.org/20260515004433.128933-1-sj@kernel.org
- Rename damon_aggregated_v2 trace event to damon_region_aggregated.
- Address Sashiko issues.
  - Enclose arguments on damon_for_each_{probe,filter}[_safe]() macros.
  - Fix typos in comments and documents.
  - Update probe_hits for region split and merge.
  - Add more documentation for damon_operation->apply_probes() callback.
  - Reduce unnecessary folio_{get,put}() in damon_pa_apply_probes().
  - Define damon_sysfs_probe_attrs as static.
  - Link scheme tried region sysfs dir and increase the count only after
    all internal dir population success.
  - Commit damon_filter->memcg_id for newly added filters.
Changes from RFC v2.1
- rfc v2.1: https://lore.kernel.org/20260514140904.119781-1-sj@kernel.org
- Rebase to mm-stable (7.1-rc3) to avoid Sashiko patch apply failure.
Changes from RFC v2
- rfc v2: https://lore.kernel.org/20260512143645.113201-1-sj@kernel.org
- Optimize nr_probes calculation for probe_hits tracepoint.
- Use TRACE_EVENT_CONDITION() for probe_hits tracepoint.
- Rebase to latest mm-new.
Changes from RFC
- rfc: https://lore.kernel.org/all/20260426205222.93895-1-sj@kernel.org/
- Support memcg DAMON filter.
- Use per-probe probe_hits sysfs file.
- Use dynamic_array for probe_hits tracing.
- Fix filter matching field.
- Fix folio leaking in damon_pa_filter_pass().
- Move nr_regions of damon_aggregated_v2 tracepoint after end.
- Rename DAMON_TEST_TYPE_ANON to DAMON_FILTER_TYPE_ANON.

SeongJae Park (28):
  mm/damon/core: introduce struct damon_probe
  mm/damon/core: embed damon_probe objects in damon_ctx
  mm/damon/core: introduce damon_filter
  mm/damon/core: commit probes
  mm/damon/core: introduce damon_region->probe_hits
  mm/damon/core: introduce damon_ops->apply_probes
  mm/damon/core: do data attributes monitoring
  mm/damon/paddr: support data attributes monitoring
  mm/damon/sysfs: implement probes dir
  mm/damon/sysfs: implement probe dir
  mm/damon/sysfs: implement filters directory
  mm/damon/sysfs: implement filter dir
  mm/damon/sysfs: implement filter dir files
  mm/damon/sysfs: setup probes on DAMON core API parameters
  mm/damon/sysfs-schemes: implement tried_regions/<r>/probes/
  mm/damon/sysfs-schemes: implement probe dir
  mm/damon/sysfs-schemes: implement probe/hits file
  mm/damon: trace probe_hits
  selftests/damon/sysfs.sh: test probes dir
  Docs/mm/damon/design: document data attributes monitoring
  Docs/admin-guide/mm/damon/usage: document data attributes monitoring
  mm/damon/core: introduce DAMON_FILTER_TYPE_MEMCG
  mm/damon/paddr: support DAMON_FILTER_TYPE_MEMCG
  mm/damon/sysfs: add filters/<F>/path file
  mm/damon/sysfs-schemes: move memcg_path_to_id() to sysfs-common
  mm/damon/sysfs: setup damon_filter->memcg_id from path
  Docs/mm/damon/design: update for memcg damon filter
  Docs/admin-guide/mm/damon/usage: update for memcg damon filter

 Documentation/admin-guide/mm/damon/usage.rst |  46 +-
 Documentation/mm/damon/design.rst            |  39 ++
 include/linux/damon.h                        |  69 +++
 include/trace/events/damon.h                 |  38 ++
 mm/damon/core.c                              | 211 +++++++
 mm/damon/paddr.c                             |  76 +++
 mm/damon/sysfs-common.c                      |  41 ++
 mm/damon/sysfs-common.h                      |   2 +
 mm/damon/sysfs-schemes.c                     | 226 ++++++--
 mm/damon/sysfs.c                             | 557 +++++++++++++++++++
 tools/testing/selftests/damon/sysfs.sh       |  48 ++
 11 files changed, 1305 insertions(+), 48 deletions(-)

base-commit: 5d6919055dec134de3c40167a490f33c74c12581
-- 
2.47.3


* [RFC PATCH v3 18/28] mm/damon: trace probe_hits
From: SeongJae Park @ 2026-05-16 18:36 UTC (permalink / raw)
  Cc: SeongJae Park, Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt, damon, linux-kernel, linux-mm, linux-trace-kernel
In-Reply-To: <20260516183712.81393-1-sj@kernel.org>

Introduce a new tracepoint for exposing the per-region per-probe
positive sample count via tracefs.
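
As a rough userspace illustration of what the event carries, the sketch
below formats a probe_hits array the way the probe_hits= field is expected
to look, and mirrors the TP_CONDITION(nr_probes > 0) check.  The helper
names format_probe_hits() and should_emit() are hypothetical, and the
space-separated two-digit hex format is an assumption about __print_hex()
output rather than something this patch defines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format @len bytes of @hits as two-digit hex values separated by spaces.
 * Returns the number of characters written (excluding the NUL), or -1 if
 * @out is too small.
 */
static int format_probe_hits(char *out, size_t outlen,
			     const unsigned char *hits, size_t len)
{
	size_t pos = 0;

	assert(out && outlen > 0);
	out[0] = '\0';
	for (size_t i = 0; i < len; i++) {
		int n = snprintf(out + pos, outlen - pos, "%s%02x",
				 i ? " " : "", hits[i]);

		if (n < 0 || (size_t)n >= outlen - pos)
			return -1;
		pos += n;
	}
	return (int)pos;
}

/* Mirrors TP_CONDITION(nr_probes > 0): no probes, no event. */
static int should_emit(unsigned int nr_probes)
{
	return nr_probes > 0;
}
```

Each byte is one probe's positive sample count for the region, in probe
installation order.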

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 include/trace/events/damon.h | 38 ++++++++++++++++++++++++++++++++++++
 mm/damon/core.c              |  9 +++++++++
 2 files changed, 47 insertions(+)

diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
index 24fc402ab3c85..2fd914895c405 100644
--- a/include/trace/events/damon.h
+++ b/include/trace/events/damon.h
@@ -130,6 +130,44 @@ TRACE_EVENT(damon_monitor_intervals_tune,
 	TP_printk("sample_us=%lu", __entry->sample_us)
 );
 
+TRACE_EVENT_CONDITION(damon_region_aggregated,
+
+	TP_PROTO(unsigned int target_id, struct damon_region *r,
+		unsigned int nr_regions, unsigned int nr_probes),
+
+	TP_ARGS(target_id, r, nr_regions, nr_probes),
+
+	TP_CONDITION(nr_probes > 0),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, target_id)
+		__field(unsigned long, start)
+		__field(unsigned long, end)
+		__field(unsigned int, nr_regions)
+		__field(unsigned int, nr_accesses)
+		__field(unsigned int, age)
+		__dynamic_array(unsigned char, probe_hits, nr_probes)
+	),
+
+	TP_fast_assign(
+		__entry->target_id = target_id;
+		__entry->start = r->ar.start;
+		__entry->end = r->ar.end;
+		__entry->nr_regions = nr_regions;
+		__entry->nr_accesses = r->nr_accesses;
+		__entry->age = r->age;
+		memcpy(__get_dynamic_array(probe_hits), r->probe_hits,
+			sizeof(*r->probe_hits) * nr_probes);
+	),
+
+	TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u %u probe_hits=%s",
+			__entry->target_id, __entry->nr_regions,
+			__entry->start, __entry->end,
+			__entry->nr_accesses, __entry->age,
+			__print_hex(__get_dynamic_array(probe_hits),
+				__get_dynamic_array_len(probe_hits)))
+);
+
 TRACE_EVENT(damon_aggregated,
 
 	TP_PROTO(unsigned int target_id, struct damon_region *r,
diff --git a/mm/damon/core.c b/mm/damon/core.c
index dde3c8d8fef89..11b513eb077fe 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -1881,6 +1881,13 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 {
 	struct damon_target *t;
 	unsigned int ti = 0;	/* target's index */
+	unsigned int nr_probes = 0;
+	struct damon_probe *probe;
+
+	if (trace_damon_region_aggregated_enabled()) {
+		damon_for_each_probe(probe, c)
+			nr_probes++;
+	}
 
 	damon_for_each_target(t, c) {
 		struct damon_region *r;
@@ -1889,6 +1896,8 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 			int i;
 
 			trace_damon_aggregated(ti, r, damon_nr_regions(t));
+			trace_damon_region_aggregated(ti, r,
+					damon_nr_regions(t), nr_probes);
 			damon_warn_fix_nr_accesses_corruption(r);
 			r->last_nr_accesses = r->nr_accesses;
 			r->nr_accesses = 0;
-- 
2.47.3


* Re: [RFC PATCH v3 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-16 18:50 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260516183712.81393-1-sj@kernel.org>

On Sat, 16 May 2026 11:36:41 -0700 SeongJae Park <sj@kernel.org> wrote:

> TL; DR
> ======
> 
> Extend DAMON for monitoring general data attributes other than accesses.
> The short term motivation is lightweight page type (e.g., belonging
> cgroup) aware monitoring.  In the long term, this will help extend DAMON
> to multiple access event capture primitives (e.g., page faults and
> PMU) and eventually pivot DAMON to a "Data Attributes Monitoring and
> Operations eNgine".
[...]
> Changes from RFC v2.1
> - rfc v2.1: https://lore.kernel.org/20260514140904.119781-1-sj@kernel.org
> - Rebase to mm-stable (7.1-rc3) to avoid Sashiko patch apply failure.

This series is still based on mm-stable (7.1-rc3) for the same reason.  The
patches based on mm-new are available at the damon/next tree [1].

[1] https://origin.kernel.org/doc/html/latest/mm/damon/maintainer-profile.html#scm-trees


Thanks,
SJ

[...]


* [PATCH v2] tracing/probes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-16 21:33 UTC (permalink / raw)
  To: LKML, Linux trace kernel, bpf
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
	Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa

From: Steven Rostedt <rostedt@goodmis.org>

Add syntax to the FETCHARGS parsing of probes to allow the use of
structure and member names to get the offsets to dereference pointers.

Currently, a dereference must be a number, where the user has to figure
out manually the offset of a member of a structure that they want to
reference. For example, to get the size of a kmem_cache that was passed to
the function kmem_cache_alloc_noprof, one would need to do:

 # cd /sys/kernel/tracing
 # echo 'f:cache kmem_cache_alloc_noprof size=+0x18($arg1):u32' >> dynamic_events

This requires knowing that the offset of size is 0x18, which can be found
with gdb:

  (gdb) p &((struct kmem_cache *)0)->size
  $1 = (unsigned int *) 0x18

If BTF is in the kernel, it can be used to find this with names, where the
user doesn't need to find the actual offset:

 # echo 'f:cache kmem_cache_alloc_noprof size=+kmem_cache.size($arg1):u32' >> dynamic_events

Instead of the "+0x18", it would have "+kmem_cache.size" where the format is:

  +STRUCT.MEMBER[.MEMBER[..]]

The delimiter is '.' and the first item is the structure name. Then the
member of the structure to get the offset of. If that member is an
embedded structure, another '.MEMBER' may be added to get the offset of
its members with respect to the original value.

  "+kmem_cache.size($arg1)" is equivalent to:

  (*(struct kmem_cache *)$arg1).size
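
The equivalence between a numeric offset and a member name can be sketched
in plain userspace C, where offsetof() plays the role that BTF plays in the
kernel. This is only an illustration: struct cache is a made-up layout (the
real struct kmem_cache differs), and fetch_u32() and fetch_cache_size() are
hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structure; the real struct kmem_cache layout differs. */
struct cache {
	void *cpu_slab;
	unsigned long flags;
	unsigned long min_partial;
	uint32_t size;
	uint32_t object_size;
};

/* What "+OFFS(FETCHARG):u32" does: read 32 bits at base + offset. */
static uint32_t fetch_u32(const void *base, size_t offs)
{
	uint32_t val;

	assert(base);
	memcpy(&val, (const char *)base + offs, sizeof(val));
	return val;
}

/* What "+cache.size(FETCHARG):u32" resolves to: the offset comes from type info. */
static uint32_t fetch_cache_size(const void *base)
{
	return fetch_u32(base, offsetof(struct cache, size));
}
```

The named form gives the same value as the numeric one, but keeps working
when the structure layout changes between kernel builds.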

Anonymous structures are also handled:

  # echo 'e:xmit net.net_dev_xmit +net_device.name(+sk_buff.dev($skbaddr)):string' >> dynamic_events

Where "+net_device.name(+sk_buff.dev($skbaddr))" is equivalent to:

  (*(struct net_device *)((*(struct sk_buff *)($skbaddr)).dev)).name

Note that "dev" of struct sk_buff is inside an anonymous structure:

struct sk_buff {
	union {
		struct {
			/* These two members must be first to match sk_buff_head. */
			struct sk_buff		*next;
			struct sk_buff		*prev;

			union {
				struct net_device	*dev;
				[..]
			};
		};
		[..]
	};

This will search up to three levels of nested anonymous structures before
it fails to find a member.
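
A userspace analogy, assuming a compiler with C11 anonymous members: the
member of an anonymous struct or union is addressed directly by name, and
its offset is relative to the outermost named structure. struct buf and
struct dev below are simplified, hypothetical layouts that only mirror the
nesting shown above, and buf_dev_name() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

struct dev {
	char name[16];
};

/* Simplified, hypothetical layout mirroring the nesting shown above. */
struct buf {
	union {
		struct {
			struct buf *next;
			struct buf *prev;
			union {
				struct dev *dev;
				unsigned long dev_scratch;
			};
		};
		char raw[32];
	};
	unsigned int len;
};

/* Follow "+dev.name(+buf.dev(ptr))": load the pointer, then address name. */
static const char *buf_dev_name(const struct buf *b)
{
	const char *base = (const char *)b;
	struct dev *const *devp;

	assert(b);
	/* "dev" sits two anonymous levels down, yet is named directly. */
	devp = (struct dev *const *)(base + offsetof(struct buf, dev));
	return (const char *)*devp + offsetof(struct dev, name);
}
```

Here "dev" follows the two pointers of the inner anonymous struct, so its
offset is two pointer sizes from the start of struct buf.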

The above produces:

    sshd-session-1080    [000] b..5.  1526.337161: xmit: (net.net_dev_xmit) arg1="enp7s0"

And nested structures can be found by adding more members to the arg:

  # echo 'f:read filemap_readahead.isra.0 file=+0(+dentry.d_name.name(+file.f_path.dentry($arg2))):string' >> dynamic_events

The above is equivalent to:

  *((*(struct dentry *)((*(struct file *)$arg2).f_path.dentry)).d_name.name)

And produces:

       trace-cmd-1381    [002] ...1.  2082.676268: read: (filemap_readahead.isra.0+0x0/0x150) file="trace.dat"

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v1: https://patch.msgid.link/20250729113335.2e4f087d@batman.local.home

- Pass error info back so that error_log can display what went wrong
  (Masami Hiramatsu)

- Use __btf_member_bit_offset() to get nested struct member offsets
  (Douglas Raillard)

- Add btf_put(btf) (Jiri Olsa)

 Documentation/trace/kprobetrace.rst |   3 +
 kernel/trace/trace_btf.c            | 111 ++++++++++++++++++++++++++++
 kernel/trace/trace_btf.h            |  10 +++
 kernel/trace/trace_probe.c          |  19 ++++-
 kernel/trace/trace_probe.h          |   4 +-
 5 files changed, 144 insertions(+), 3 deletions(-)

diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 3b6791c17e9b..00273157100c 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -54,6 +54,8 @@ Synopsis of kprobe_events
   $retval	: Fetch return value.(\*2)
   $comm		: Fetch current task comm.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+  +STRUCT.MEMBER[.MEMBER[..]](FETCHARG) : If BTF is supported, Fetch memory
+		  at FETCHARG + the offset of MEMBER inside of STRUCT.(\*5)
   \IMM		: Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
   FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
@@ -70,6 +72,7 @@ Synopsis of kprobe_events
         accesses one register.
   (\*3) this is useful for fetching a field of data structures.
   (\*4) "u" means user-space dereference. See :ref:`user_mem_access`.
+  (\*5) +STRUCT.MEMBER(FETCHARG) is equivalent to (*(struct STRUCT *)(FETCHARG)).MEMBER
 
 Function arguments at kretprobe
 -------------------------------
diff --git a/kernel/trace/trace_btf.c b/kernel/trace/trace_btf.c
index 00172f301f25..be88cc4d97dd 100644
--- a/kernel/trace/trace_btf.c
+++ b/kernel/trace/trace_btf.c
@@ -120,3 +120,114 @@ const struct btf_member *btf_find_struct_member(struct btf *btf,
 	return member;
 }
 
+#define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3)
+
+static int find_member(const char *ptr, struct btf *btf,
+		       const struct btf_type **type, int level)
+{
+	const struct btf_member *member;
+	const struct btf_type *t = *type;
+	int i;
+
+	/* Max of 3 depth of anonymous structures */
+	if (level > 3)
+		return -E2BIG;
+
+	for_each_member(i, t, member) {
+		const char *tname = btf_name_by_offset(btf, member->name_off);
+
+		if (strcmp(ptr, tname) == 0) {
+			int offset = __btf_member_bit_offset(t, member);
+			*type = btf_type_by_id(btf, member->type);
+			return BITS_ROUNDDOWN_BYTES(offset);
+		}
+
+		/* Handle anonymous structures */
+		if (strlen(tname))
+			continue;
+
+		*type = btf_type_by_id(btf, member->type);
+		if (btf_type_is_struct(*type)) {
+			int offset = find_member(ptr, btf, type, level + 1);
+
+			if (offset < 0)
+				continue;
+
+			return offset + BITS_ROUNDDOWN_BYTES(member->offset);
+		}
+	}
+
+	return -ENOENT;
+}
+
+/**
+ * btf_find_offset - Find an offset of a member for a structure
+ * @arg: A structure name followed by one or more members
+ * @offset_p: A pointer to where to store the offset
+ *
+ * Will parse @arg with the expected format of: struct.member[[.member]..]
+ * It is delimited by '.'. The first item must be a structure type.
+ * The next are its members. If the member is also of a structure type,
+ * another member may follow ".member".
+ *
+ * Note, @arg is modified but will be put back to what it was on return.
+ *
+ * Returns: 0 on success and -EINVAL if no '.' is present
+ *    or -ENXIO if the structure or member is not found.
+ *    Returns -EINVAL if BTF is not defined.
+ *  On success, @offset_p will contain the offset of the member specified
+ *    by @arg.
+ */
+int btf_find_offset(char *arg, long *offset_p)
+{
+	const struct btf_type *t;
+	struct btf *btf;
+	long offset = 0;
+	char *ptr;
+	int ret;
+	s32 id;
+
+	ptr = strchr(arg, '.');
+	if (!ptr)
+		return -EINVAL;
+
+	*ptr = '\0';
+
+	ret = -ENXIO;
+	id = bpf_find_btf_id(arg, BTF_KIND_STRUCT, &btf);
+	if (id < 0)
+		goto error;
+
+	/* Get BTF_KIND_STRUCT type */
+	t = btf_type_by_id(btf, id);
+
+	/* May allow more than one member, as long as they are structures */
+	do {
+		ret = -ENXIO;
+		if (!t || !btf_type_is_struct(t))
+			goto error;
+
+		*ptr++ = '.';
+		arg = ptr;
+		ptr = strchr(ptr, '.');
+		if (ptr)
+			*ptr = '\0';
+
+		ret = find_member(arg, btf, &t, 0);
+		if (ret < 0)
+			goto error;
+
+		offset += ret;
+
+	} while (ptr);
+
+	btf_put(btf);
+	*offset_p = offset;
+	return 0;
+
+error:
+	btf_put(btf);
+	if (ptr)
+		*ptr = '.';
+	return ret;
+}
diff --git a/kernel/trace/trace_btf.h b/kernel/trace/trace_btf.h
index 4bc44bc261e6..7b0797a6050b 100644
--- a/kernel/trace/trace_btf.h
+++ b/kernel/trace/trace_btf.h
@@ -9,3 +9,13 @@ const struct btf_member *btf_find_struct_member(struct btf *btf,
 						const struct btf_type *type,
 						const char *member_name,
 						u32 *anon_offset);
+
+#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
+/* Will modify arg, but will put it back before returning. */
+int btf_find_offset(char *arg, long *offset);
+#else
+static inline int btf_find_offset(char *arg, long *offset)
+{
+	return -EINVAL;
+}
+#endif
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e0d3a0da26af..6fcede2de1a5 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1165,7 +1165,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 
 	case '+':	/* deref memory */
 	case '-':
-		if (arg[1] == 'u') {
+		if (arg[1] == 'u' && isdigit(arg[2])) {
 			deref = FETCH_OP_UDEREF;
 			arg[1] = arg[0];
 			arg++;
@@ -1178,7 +1178,22 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 			return -EINVAL;
 		}
 		*tmp = '\0';
-		ret = kstrtol(arg, 0, &offset);
+		if (arg[0] != '-' && !isdigit(*arg)) {
+			int err = 0;
+			ret = btf_find_offset(arg, &offset);
+			switch (ret) {
+			case -ENODEV: err = TP_ERR_NOSUP_BTFARG; break;
+			case -E2BIG: err = TP_ERR_MEMBER_TOO_DEEP; break;
+			case -EINVAL: err = TP_ERR_BAD_STRUCT_FMT; break;
+			case -ENXIO: err = TP_ERR_BAD_BTF_TID; break;
+			}
+			if (err)
+				__trace_probe_log_err(ctx->offset, err);
+			if (ret < 0)
+				return ret;
+		} else {
+			ret = kstrtol(arg, 0, &offset);
+		}
 		if (ret) {
 			trace_probe_log_err(ctx->offset, BAD_DEREF_OFFS);
 			break;
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 262d8707a3df..d649bb9f5b7c 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -563,7 +563,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(NEED_STRING_TYPE,	"$comm and immediate-string only accepts string type"),\
 	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
 	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
-	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),
+	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"), \
+	C(MEMBER_TOO_DEEP,	"Too many indirections of anonymous structure"), \
+	C(BAD_STRUCT_FMT,	"Unknown BTF structure"),
 
 #undef C
 #define C(a, b)		TP_ERR_##a
-- 
2.53.0


^ permalink raw reply related

* Re: [RFC PATCH v3 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-16 22:03 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260516183712.81393-1-sj@kernel.org>

On Sat, 16 May 2026 11:36:41 -0700 SeongJae Park <sj@kernel.org> wrote:

> TL; DR
> ======
> 
> Extend DAMON for monitoring general data attributes other than accesses.
> The short term motivation is lightweight page type (e.g., belonging
> cgroup) aware monitoring.  In the long term, this will help extend DAMON
> to multiple access event capture primitives (e.g., page faults and
> PMU) and eventually pivot DAMON to a "Data Attributes Monitoring and
> Operations eNgine".

Sashiko found [1] no blockers in this version, only a nice idea for
wordsmithing the documentation.  Unless I get other opinions, I will drop the
RFC tag from the next version of this series.

[1] https://lore.kernel.org/damon/20260516185032.82261-1-sj@kernel.org/


Thanks,
SJ

[...]

^ permalink raw reply

* Re: [PATCH v2] tracing/probes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-05-17  2:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux trace kernel, bpf, Masami Hiramatsu,
	Mathieu Desnoyers, Mark Rutland, Peter Zijlstra, Namhyung Kim,
	Takaya Saeki, Douglas Raillard, Tom Zanussi, Andrew Morton,
	Thomas Gleixner, Ian Rogers, Jiri Olsa
In-Reply-To: <20260516173310.1dbad146@fedora>

On Sat, 16 May 2026 17:33:09 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> Add syntax to the FETCHARGS parsing of probes to allow the use of
> structure and member names to get the offsets to dereference pointers.
> 
> Currently, a dereference must be a number, where the user has to figure
> out manually the offset of a member of a structure that they want to
> reference. For example, to get the size of a kmem_cache that was passed to
> the function kmem_cache_alloc_noprof, one would need to do:
> 
>  # cd /sys/kernel/tracing
>  # echo 'f:cache kmem_cache_alloc_noprof size=+0x18($arg1):u32' >> dynamic_events
> 
> This requires knowing that the offset of size is 0x18, which can be found
> with gdb:
> 
>   (gdb) p &((struct kmem_cache *)0)->size
>   $1 = (unsigned int *) 0x18
> 
> If BTF is in the kernel, it can be used to find this with names, where the
> user doesn't need to find the actual offset:
> 
>  # echo 'f:cache kmem_cache_alloc_noprof size=+kmem_cache.size($arg1):u32' >> dynamic_events
> 
> Instead of the "+0x18", it would have "+kmem_cache.size" where the format is:
> 
>   +STRUCT.MEMBER[.MEMBER[..]]
> 
> The delimiter is '.' and the first item is the structure name, followed by
> the member of that structure whose offset is wanted. If that member is an
> embedded structure, another '.MEMBER' may be added to get the offset of
> its members with respect to the original value.
> 
>   "+kmem_cache.size($arg1)" is equivalent to:
> 
>   (*(struct kmem_cache *)$arg1).size
> 
> Anonymous structures are also handled:
> 
>   # echo 'e:xmit net.net_dev_xmit +net_device.name(+sk_buff.dev($skbaddr)):string' >> dynamic_events
> 
> Where "+net_device.name(+sk_buff.dev($skbaddr))" is equivalent to:
> 
>   (*(struct net_device *)((*(struct sk_buff *)($skbaddr)).dev)).name
> 
> Note that "dev" of struct sk_buff is inside an anonymous structure:
> 
> struct sk_buff {
> 	union {
> 		struct {
> 			/* These two members must be first to match sk_buff_head. */
> 			struct sk_buff		*next;
> 			struct sk_buff		*prev;
> 
> 			union {
> 				struct net_device	*dev;
> 				[..]
> 			};
> 		};
> 		[..]
> 	};
> 
> Up to three levels of nested anonymous structures are searched; beyond
> that, the member lookup fails.
> 
> The above produces:
> 
>     sshd-session-1080    [000] b..5.  1526.337161: xmit: (net.net_dev_xmit) arg1="enp7s0"
> 
> And nested structures can be found by adding more members to the arg:
> 
>   # echo 'f:read filemap_readahead.isra.0 file=+0(+dentry.d_name.name(+file.f_path.dentry($arg2))):string' >> dynamic_events
> 
> The above is equivalent to:
> 
>   *((*(struct dentry *)(*(struct file *)$arg2).f_path.dentry).d_name.name)
> 
> And produces:
> 
>        trace-cmd-1381    [002] ...1.  2082.676268: read: (filemap_readahead.isra.0+0x0/0x150) file="trace.dat"
> 
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Thanks for updating it! This looks good to me.
Let me test it. I think we also need to update the selftests.

Thank you,

> ---
> Changes since v1: https://patch.msgid.link/20250729113335.2e4f087d@batman.local.home
> 
> - Pass error info back so that error_log can display what went wrong
>   (Masami Hiramatsu)
> 
> - Use __btf_member_bit_offset() to get nested struct member offsets
>   (Douglas Raillard)
> 
> - Add btf_put(btf) (Jiri Olsa)
> 
>  Documentation/trace/kprobetrace.rst |   3 +
>  kernel/trace/trace_btf.c            | 111 ++++++++++++++++++++++++++++
>  kernel/trace/trace_btf.h            |  10 +++
>  kernel/trace/trace_probe.c          |  19 ++++-
>  kernel/trace/trace_probe.h          |   4 +-
>  5 files changed, 144 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
> index 3b6791c17e9b..00273157100c 100644
> --- a/Documentation/trace/kprobetrace.rst
> +++ b/Documentation/trace/kprobetrace.rst
> @@ -54,6 +54,8 @@ Synopsis of kprobe_events
>    $retval	: Fetch return value.(\*2)
>    $comm		: Fetch current task comm.
>    +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
> +  +STRUCT.MEMBER[.MEMBER[..]](FETCHARG) : If BTF is supported, Fetch memory
> +		  at FETCHARG + the offset of MEMBER inside of STRUCT.(\*5)
>    \IMM		: Store an immediate value to the argument.
>    NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
>    FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
> @@ -70,6 +72,7 @@ Synopsis of kprobe_events
>          accesses one register.
>    (\*3) this is useful for fetching a field of data structures.
>    (\*4) "u" means user-space dereference. See :ref:`user_mem_access`.
> +  (\*5) +STRUCT.MEMBER(FETCHARG) is equivalent to (*(struct STRUCT *)(FETCHARG)).MEMBER
>  
>  Function arguments at kretprobe
>  -------------------------------
> diff --git a/kernel/trace/trace_btf.c b/kernel/trace/trace_btf.c
> index 00172f301f25..be88cc4d97dd 100644
> --- a/kernel/trace/trace_btf.c
> +++ b/kernel/trace/trace_btf.c
> @@ -120,3 +120,114 @@ const struct btf_member *btf_find_struct_member(struct btf *btf,
>  	return member;
>  }
>  
> +#define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3)
> +
> +static int find_member(const char *ptr, struct btf *btf,
> +		       const struct btf_type **type, int level)
> +{
> +	const struct btf_member *member;
> +	const struct btf_type *t = *type;
> +	int i;
> +
> +	/* Max of 3 depth of anonymous structures */
> +	if (level > 3)
> +		return -E2BIG;
> +
> +	for_each_member(i, t, member) {
> +		const char *tname = btf_name_by_offset(btf, member->name_off);
> +
> +		if (strcmp(ptr, tname) == 0) {
> +			int offset = __btf_member_bit_offset(t, member);
> +			*type = btf_type_by_id(btf, member->type);
> +			return BITS_ROUNDDOWN_BYTES(offset);
> +		}
> +
> +		/* Handle anonymous structures */
> +		if (strlen(tname))
> +			continue;
> +
> +		*type = btf_type_by_id(btf, member->type);
> +		if (btf_type_is_struct(*type)) {
> +			int offset = find_member(ptr, btf, type, level + 1);
> +
> +			if (offset < 0)
> +				continue;
> +
> +			return offset + BITS_ROUNDDOWN_BYTES(member->offset);
> +		}
> +	}
> +
> +	return -ENOENT;
> +}
> +
> +/**
> + * btf_find_offset - Find an offset of a member for a structure
> + * @arg: A structure name followed by one or more members
> + * @offset_p: A pointer to where to store the offset
> + *
> + * Will parse @arg with the expected format of: struct.member[[.member]..]
> + * It is delimited by '.'. The first item must be a structure type.
> + * The next are its members. If the member is also of a structure type,
> + * another member may follow ".member".
> + *
> + * Note, @arg is modified but will be put back to what it was on return.
> + *
> + * Returns: 0 on success and -EINVAL if no '.' is present
> + *    or -ENXIO if the structure or member is not found.
> + *    Returns -EINVAL if BTF is not defined.
> + *  On success, @offset_p will contain the offset of the member specified
> + *    by @arg.
> + */
> +int btf_find_offset(char *arg, long *offset_p)
> +{
> +	const struct btf_type *t;
> +	struct btf *btf;
> +	long offset = 0;
> +	char *ptr;
> +	int ret;
> +	s32 id;
> +
> +	ptr = strchr(arg, '.');
> +	if (!ptr)
> +		return -EINVAL;
> +
> +	*ptr = '\0';
> +
> +	ret = -ENXIO;
> +	id = bpf_find_btf_id(arg, BTF_KIND_STRUCT, &btf);
> +	if (id < 0)
> +		goto error;
> +
> +	/* Get BTF_KIND_STRUCT type */
> +	t = btf_type_by_id(btf, id);
> +
> +	/* May allow more than one member, as long as they are structures */
> +	do {
> +		ret = -ENXIO;
> +		if (!t || !btf_type_is_struct(t))
> +			goto error;
> +
> +		*ptr++ = '.';
> +		arg = ptr;
> +		ptr = strchr(ptr, '.');
> +		if (ptr)
> +			*ptr = '\0';
> +
> +		ret = find_member(arg, btf, &t, 0);
> +		if (ret < 0)
> +			goto error;
> +
> +		offset += ret;
> +
> +	} while (ptr);
> +
> +	btf_put(btf);
> +	*offset_p = offset;
> +	return 0;
> +
> +error:
> +	btf_put(btf);
> +	if (ptr)
> +		*ptr = '.';
> +	return ret;
> +}
> diff --git a/kernel/trace/trace_btf.h b/kernel/trace/trace_btf.h
> index 4bc44bc261e6..7b0797a6050b 100644
> --- a/kernel/trace/trace_btf.h
> +++ b/kernel/trace/trace_btf.h
> @@ -9,3 +9,13 @@ const struct btf_member *btf_find_struct_member(struct btf *btf,
>  						const struct btf_type *type,
>  						const char *member_name,
>  						u32 *anon_offset);
> +
> +#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
> +/* Will modify arg, but will put it back before returning. */
> +int btf_find_offset(char *arg, long *offset);
> +#else
> +static inline int btf_find_offset(char *arg, long *offset)
> +{
> +	return -EINVAL;
> +}
> +#endif
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index e0d3a0da26af..6fcede2de1a5 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -1165,7 +1165,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
>  
>  	case '+':	/* deref memory */
>  	case '-':
> -		if (arg[1] == 'u') {
> +		if (arg[1] == 'u' && isdigit(arg[2])) {
>  			deref = FETCH_OP_UDEREF;
>  			arg[1] = arg[0];
>  			arg++;
> @@ -1178,7 +1178,22 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
>  			return -EINVAL;
>  		}
>  		*tmp = '\0';
> -		ret = kstrtol(arg, 0, &offset);
> +		if (arg[0] != '-' && !isdigit(*arg)) {
> +			int err = 0;
> +			ret = btf_find_offset(arg, &offset);
> +			switch (ret) {
> +			case -ENODEV: err = TP_ERR_NOSUP_BTFARG; break;
> +			case -E2BIG: err = TP_ERR_MEMBER_TOO_DEEP; break;
> +			case -EINVAL: err = TP_ERR_BAD_STRUCT_FMT; break;
> +			case -ENXIO: err = TP_ERR_BAD_BTF_TID; break;
> +			}
> +			if (err)
> +				__trace_probe_log_err(ctx->offset, err);
> +			if (ret < 0)
> +				return ret;
> +		} else {
> +			ret = kstrtol(arg, 0, &offset);
> +		}
>  		if (ret) {
>  			trace_probe_log_err(ctx->offset, BAD_DEREF_OFFS);
>  			break;
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 262d8707a3df..d649bb9f5b7c 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -563,7 +563,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
>  	C(NEED_STRING_TYPE,	"$comm and immediate-string only accepts string type"),\
>  	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
>  	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
> -	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),
> +	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"), \
> +	C(MEMBER_TOO_DEEP,	"Too many indirections of anonymous structure"), \
> +	C(BAD_STRUCT_FMT,	"Unknown BTF structure"),
>  
>  #undef C
>  #define C(a, b)		TP_ERR_##a
> -- 
> 2.53.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Race condition in __modify_ftrace_direct() between tmp_ops registration and direct_functions hash update
From: Afi0 @ 2026-05-17  6:24 UTC (permalink / raw)
  To: security; +Cc: linux-kernel, linux-trace-kernel, rostedt, mhiramat, Greg KH


[-- Attachment #1.1: Type: text/plain, Size: 1881 bytes --]

Hi list,

Apologies for initially sending only to Greg. Resending to the full list as
requested.
------------------------------

Component: kernel/trace/ftrace.c
Function: __modify_ftrace_direct()
Affected versions: Linux kernel 5.15+
Type: TOCTOU / Race condition
CVSS 3.1: AV:L/AC:H/PR:L/UI:N/S:C/C:H/I:H/A:H - 7.8 (High)

SUMMARY

A race condition exists in __modify_ftrace_direct() between the
registration of tmp_ops into ftrace_ops_list and the subsequent update of
direct_functions hash entries. During this window, concurrent CPUs
executing traced functions will read the stale direct call address via
ftrace_find_rec_direct() and jump to it, while the caller may have already
invalidated or freed the old trampoline memory.

VULNERABLE CODE

  err = register_ftrace_function_nolock(&tmp_ops);
  [race window: ftrace_ops_list_func now active, direct_functions not yet updated]
  mutex_lock(&ftrace_lock);
  entry->direct = addr;   /* update happens here, too late */
  mutex_unlock(&ftrace_lock);

IMPACT

CPU executing traced function reads stale direct_functions entry during the
race window. arch_ftrace_set_direct_caller() redirects execution to
potentially freed or invalidated trampoline memory. Use-after-free in
executable code context on SMP systems.

TRIGGER

Requires CAP_PERFMON or CAP_SYS_ADMIN directly. Also reachable via BPF
trampolines (kernel/bpf/trampoline.c calls __modify_ftrace_direct()
internally) with CAP_BPF + CAP_PERFMON, default in many CI/CD container
runtimes. Live patching via klp_patch_func() also goes through this path.

SUGGESTED FIX

Update entry->direct under ftrace_lock BEFORE registering tmp_ops. Add
smp_wmb() between the store and registration to ensure ordering on
weakly-ordered architectures.
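
The general "update the data first, then publish" ordering that this
suggestion relies on can be sketched in userspace with C11 atomics. This is
an analogy only and makes no claim about whether the reported race exists in
ftrace: the demo_* names are hypothetical, and a release store paired with
an acquire load stands in for the kernel's smp_wmb() and its reader-side
counterpart.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical payload; stands in for an entry holding a call address. */
struct demo_entry {
	unsigned long direct;
};

static struct demo_entry demo_slot = { .direct = 0x1000 };

/* Readers only look at the payload once this pointer is non-NULL. */
static _Atomic(struct demo_entry *) demo_published = NULL;

static void demo_publish(unsigned long new_addr)
{
	/* 1. Update the payload while no reader can reach it through us. */
	demo_slot.direct = new_addr;

	/*
	 * 2. Publish. The release store orders the payload write before
	 *    the pointer becomes visible to other threads.
	 */
	atomic_store_explicit(&demo_published, &demo_slot,
			      memory_order_release);
}

static bool demo_read(unsigned long *addr)
{
	/* The acquire load pairs with the release store in demo_publish(). */
	struct demo_entry *e = atomic_load_explicit(&demo_published,
						    memory_order_acquire);

	if (!e)
		return false;

	*addr = e->direct;
	return true;
}
```

A reader that observes the published pointer is then guaranteed to observe
the payload written before it; reversing the two steps in demo_publish()
would reopen a window in which the old payload is still visible.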

Patch attached as 0001-ftrace-fix-race-in-__modify_ftrace_direct.patch

Fixes: 0567d6809440 ("ftrace: Add modify_ftrace_direct()")

Thanks,

 Afi0

[-- Attachment #1.2: Type: text/html, Size: 4441 bytes --]

[-- Attachment #2: 0001-ftrace-fix-race-in-__modify_ftrace_direct.patch --]
[-- Type: text/x-patch, Size: 4719 bytes --]

From b3c4d5e6f7a8b3c4d5e6f7a8b3c4d5e6f7a8b3c4 Mon Sep 17 00:00:00 2001
From: Afi0 <capyenglishlite@gmail.com>
Date: Sat, 16 May 2026 12:11:00 +0000
Subject: [PATCH] ftrace: fix race in __modify_ftrace_direct() between
 tmp_ops registration and direct_functions update

In __modify_ftrace_direct(), register_ftrace_function_nolock() makes
tmp_ops visible in ftrace_ops_list before entry->direct is updated
under ftrace_lock. During this window any CPU entering the traced
function calls call_direct_funcs(), reads the old address from
direct_functions via RCU, and jumps to it via
arch_ftrace_set_direct_caller(). If the caller freed or invalidated
the old trampoline before calling modify_ftrace_direct(), this is a
use-after-free in executable code context.

The race window:

  CPU 0 (__modify_ftrace_direct)       CPU 1 (executing traced func)
  ──────────────────────────────       ──────────────────────────────
  register_ftrace_function_nolock()
    -> tmp_ops visible in ops_list
                                        call_direct_funcs()
                                          ftrace_find_rec_direct() -> old_addr
                                          arch_ftrace_set_direct_caller(old_addr)
                                          jump to old_addr  <- UAF if freed
  mutex_lock(&ftrace_lock)
  entry->direct = addr   <- too late
  mutex_unlock(&ftrace_lock)

Fix: update entry->direct under ftrace_lock BEFORE registering tmp_ops.
Any CPU that observes tmp_ops in ftrace_ops_list after this point will
already see the new address when it calls ftrace_find_rec_direct().
Add smp_wmb() between the store and the registration to ensure the
write is visible on weakly-ordered architectures before tmp_ops
becomes observable via ftrace_ops_list.

On error from register_ftrace_function_nolock(), restore entry->direct
to old_addr since tmp_ops never became visible to other CPUs.

This affects all callers of __modify_ftrace_direct(), including:
  - modify_ftrace_direct() used by kernel modules and live patching
  - modify_ftrace_direct_nolock() used by BPF trampolines
    (kernel/bpf/trampoline.c) reachable with CAP_BPF + CAP_PERFMON

Fixes: 0567d6809440 ("ftrace: Add modify_ftrace_direct()")
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Afi0 <capyenglishlite@gmail.com>
---
 kernel/trace/ftrace.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index a1b2c3d4e5f6..b7c8d9e0f1a2 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5950,6 +5950,7 @@ static int __modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
 	struct ftrace_func_entry *entry;
 	struct ftrace_ops tmp_ops;
+	unsigned long old_addr;
 	int err;
 
 	lockdep_assert_held(&direct_mutex);
@@ -5960,22 +5961,36 @@ static int __modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
 	if (!entry)
 		return -ENODEV;
 
-	/*
-	 * tmp_ops is registered into ftrace_ops_list here, making it
-	 * visible to all CPUs executing the traced function. However,
-	 * entry->direct is not updated until after this call returns,
-	 * leaving a window where CPUs read the stale (possibly freed)
-	 * direct call address via ftrace_find_rec_direct().
-	 */
-	err = register_ftrace_function_nolock(&tmp_ops);
-	if (err)
-		return err;
-
+	/* Save old address in case we need to roll back on error. */
+	old_addr = entry->direct;
+
+	/*
+	 * Update entry->direct BEFORE registering tmp_ops into
+	 * ftrace_ops_list. This closes the race window where a CPU
+	 * executing the traced function could read the old (potentially
+	 * freed) direct call address between tmp_ops becoming visible
+	 * and entry->direct being updated.
+	 *
+	 * Any CPU that observes tmp_ops in ftrace_ops_list after the
+	 * smp_wmb() below is guaranteed to see the new address when
+	 * it calls ftrace_find_rec_direct().
+	 */
 	mutex_lock(&ftrace_lock);
 	entry->direct = addr;
 	mutex_unlock(&ftrace_lock);
 
+	/*
+	 * Ensure entry->direct store is ordered before tmp_ops
+	 * becomes visible via ftrace_ops_list on weakly-ordered archs.
+	 */
+	smp_wmb();
+
+	err = register_ftrace_function_nolock(&tmp_ops);
+	if (err) {
+		/* tmp_ops never became visible; safe to restore old_addr. */
+		mutex_lock(&ftrace_lock);
+		entry->direct = old_addr;
+		mutex_unlock(&ftrace_lock);
+		return err;
+	}
+
 	/*
 	 * Now that tmp_ops is registered and entry->direct is updated,
 	 * unregister the original ops and clean up.
-- 
2.39.0

^ permalink raw reply related

* Re: Race condition in __modify_ftrace_direct() between tmp_ops registration and direct_functions hash update
From: Greg KH @ 2026-05-17  7:08 UTC (permalink / raw)
  To: Afi0; +Cc: security, linux-kernel, linux-trace-kernel, rostedt, mhiramat
In-Reply-To: <CAEABq7fMcvHpp4+59Mt-QdgGNpWhOqrGWHKmy+qt3tJSYb69kg@mail.gmail.com>

On Sun, May 17, 2026 at 06:24:11AM +0000, Afi0 wrote:
> Signed-off-by: Afi0 <capyenglishlite@gmail.com>

Again, just send a patch with your real name as the documentation asks
for.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v3 10/11] kernel: time, trace: Use trace_call__##name() at guarded tracepoint call sites
From: Thomas Gleixner @ 2026-05-17  7:31 UTC (permalink / raw)
  To: Vineeth Pillai (Google), Anna-Maria Behnsen, Frederic Weisbecker,
	Ingo Molnar, Steven Rostedt, Masami Hiramatsu
  Cc: linux-kernel, linux-trace-kernel, Vineeth Pillai, Peter Zijlstra
In-Reply-To: <20260515135959.2238922-1-vineeth@bitbyteword.org>

On Fri, May 15 2026 at 09:59, Vineeth Pillai wrote:
> ---
>  kernel/time/tick-sched.c       | 12 ++++++------
>  kernel/trace/trace_benchmark.c |  2 +-
>  2 files changed, 7 insertions(+), 7 deletions(-)

Please split that into a tick/sched and trace patch so each can be picked
up in the relevant subsystems.


^ permalink raw reply

* Re: [PATCH 1/9] rv: Fix __user specifier usage in extract_params()
From: Wen Yang @ 2026-05-17  8:48 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Masami Hiramatsu,
	Nam Cao, linux-trace-kernel
  Cc: kernel test robot
In-Reply-To: <20260512140250.262190-2-gmonaco@redhat.com>


Correct.  __user annotates the pointer type, not the stack copy.
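
As a small illustration of why the combined declaration was wrong: anything
written among the declaration specifiers applies to every declarator on
that line. The sketch below uses "const" as a stand-in, because __user is
only checked by sparse and ordinary compilers see it as empty; "struct
demo_attr" and the variable names are hypothetical.

```c
#include <assert.h>

struct demo_attr {
	int policy;
};

/* One declaration: the qualifier applies to BOTH declarators. */
const struct demo_attr *shared_ptr, shared_copy;

/* Split declarations: only the pointed-to object is qualified. */
const struct demo_attr *split_ptr;
struct demo_attr split_copy;

/* Evaluates to 1 if taking the address of x yields a pointer to const. */
#define IS_CONST_ATTR(x) \
	_Generic(&(x), const struct demo_attr *: 1, struct demo_attr *: 0)
```

IS_CONST_ATTR(shared_copy) evaluates to 1 while IS_CONST_ATTR(split_copy)
evaluates to 0, which is the same effect the patch fixes for __user.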

Reviewed-by: Wen Yang <wen.yang@linux.dev>


On 5/12/26 22:02, Gabriele Monaco wrote:
> The attributes variables extracted from syscalls in the helper are both
> defined with the __user specifier although only the actual pointer to
> user data should be marked.
> 
> Remove the __user specifier from attr.
> 
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202604150820.Ny143u6X-lkp@intel.com
> Fixes: b133207deb72 ("rv: Add nomiss deadline monitor")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   kernel/trace/rv/monitors/deadline/deadline.h | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/rv/monitors/deadline/deadline.h b/kernel/trace/rv/monitors/deadline/deadline.h
> index 0bbfd2543329..78fca873d61e 100644
> --- a/kernel/trace/rv/monitors/deadline/deadline.h
> +++ b/kernel/trace/rv/monitors/deadline/deadline.h
> @@ -95,7 +95,8 @@ static inline u8 get_server_type(struct task_struct *tsk)
>   static inline int extract_params(struct pt_regs *regs, long id, pid_t *pid_out)
>   {
>   	size_t size = offsetofend(struct sched_attr, sched_flags);
> -	struct sched_attr __user *uattr, attr;
> +	struct sched_attr __user *uattr;
> +	struct sched_attr attr;
>   	int new_policy = -1, ret;
>   	unsigned long args[6];
>   

^ permalink raw reply
