Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* Re: [PATCH 1/7] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Andrii Nakryiko @ 2026-05-15 20:31 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-2-jolsa@kernel.org>

On Thu, May 14, 2026 at 6:53 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Andrii reported an issue with optimized uprobes [1] that can clobber
> redzone area with call instruction storing return address on stack
> where user code may keep temporary data without adjusting rsp.
>
> Fixing this by moving the optimized uprobes on top of 10-bytes nop
> instruction, so we can squeeze another instruction to escape the
> redzone area before doing the call, like:
>
>   lea -0x80(%rsp), %rsp
>   call tramp
>
> Note the lea instruction is used to adjust the rsp register without
> changing the flags.

I think it should be very loudly explained that we can't go back to
nop10 and have to do short jump over patched sequence (and why).

>
> The optimized uprobe performance stays the same:
>
>         uprobe-nop     :    3.129 ± 0.013M/s
>         uprobe-push    :    3.045 ± 0.006M/s
>         uprobe-ret     :    1.095 ± 0.004M/s
>   -->   uprobe-nop10   :    7.170 ± 0.020M/s
>         uretprobe-nop  :    2.143 ± 0.021M/s
>         uretprobe-push :    2.090 ± 0.000M/s
>         uretprobe-ret  :    0.942 ± 0.000M/s
>   -->   uretprobe-nop10:    3.381 ± 0.003M/s
>         usdt-nop       :    3.245 ± 0.004M/s
>   -->   usdt-nop10     :    7.256 ± 0.023M/s
>
> [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> Reported-by: Andrii Nakryiko <andrii@kernel.org>
> Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  arch/x86/kernel/uprobes.c | 121 +++++++++++++++++++++++++++-----------
>  1 file changed, 86 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index ebb1baf1eb1d..f7c4101a4039 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -636,9 +636,21 @@ struct uprobe_trampoline {
>         unsigned long           vaddr;
>  };
>
> +#define LEA_INSN_SIZE          5
> +#define OPT_INSN_SIZE          (LEA_INSN_SIZE + CALL_INSN_SIZE)
> +#define OPT_JMP8_OFFSET                (OPT_INSN_SIZE - JMP8_INSN_SIZE)
> +#define REDZONE_SIZE           0x80
> +
> +static const u8 lea_rsp[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
> +
> +static bool is_lea_insn(const uprobe_opcode_t *insn)
> +{
> +       return !memcmp(insn, lea_rsp, LEA_INSN_SIZE);
> +}
> +
>  static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
>  {
> -       long delta = (long)(vaddr + 5 - vtramp);
> +       long delta = (long)(vaddr + OPT_INSN_SIZE - vtramp);
>
>         return delta >= INT_MIN && delta <= INT_MAX;
>  }
> @@ -651,7 +663,7 @@ static unsigned long find_nearest_trampoline(unsigned long vaddr)
>         };
>         unsigned long low_limit, high_limit;
>         unsigned long low_tramp, high_tramp;
> -       unsigned long call_end = vaddr + 5;
> +       unsigned long call_end = vaddr + OPT_INSN_SIZE;
>
>         if (check_add_overflow(call_end, INT_MIN, &low_limit))
>                 low_limit = PAGE_SIZE;
> @@ -826,8 +838,8 @@ SYSCALL_DEFINE0(uprobe)

should we change -ENXIO to -EPROTO or some other distinct error code,
so libbpf can avoid using nop5 attachment on kernels new enough to
support nop5 optimization, but old enough to not do this properly with
nop10?

>         regs->ax  = args.ax;
>         regs->r11 = args.r11;
>         regs->cx  = args.cx;
> -       regs->ip  = args.retaddr - 5;
> -       regs->sp += sizeof(args);
> +       regs->ip  = args.retaddr - OPT_INSN_SIZE;
> +       regs->sp += sizeof(args) + REDZONE_SIZE;
>         regs->orig_ax = -1;
>
>         sp = regs->sp;

[...]

^ permalink raw reply

* Re: [PATCH v4 3/3] tracefs: make root directory world-traversable
From: Steven Rostedt @ 2026-05-15 23:16 UTC (permalink / raw)
  To: Anubhav Shelat
  Cc: mpetlan, Masami Hiramatsu, Mathieu Desnoyers, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Thomas Falcon, linux-kernel, linux-trace-kernel,
	linux-perf-users
In-Reply-To: <20260515194010.93725-5-ashelat@redhat.com>

On Fri, 15 May 2026 15:40:07 -0400
Anubhav Shelat <ashelat@redhat.com> wrote:

> Change the default tracefs mount mode from 0700 to 0755. This allows
> unprivileged users to access the eventfs directories underneath which
> already use 0755.
> 
> Tracing data files use mode 0440 and 0640 so they are not exposed by
> this change. Only the format and id files, which have been marked as
> work-readable, become accessible.
> 
> Directory listings of kprobes and uprobes, which contain functions or
> binaries, become visible to unprivileged users but do not contain kernel
> addresses. Admins using probes can restore the previous behavior with
> chmod or mount -o mode=700.
> 

I've been thinking about this and I believe a better approach is to
make a eventfs that is mounted at:

 /sys/kernel/events

and be the same directory structure as /sys/kernel/tracing/events but
only contain read only files like "id" and "format". This directory
would be mounted as 555 and readable by all.

-- Steve

^ permalink raw reply

* Re: [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Lance Yang @ 2026-05-16  4:06 UTC (permalink / raw)
  To: Breno Leitao
  Cc: linmiaohe, akpm, david, ljs, vbabka, rppt, surenb, mhocko, shuah,
	nao.horiguchi, rostedt, mhiramat, mathieu.desnoyers, corbet,
	skhan, liam, linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team
In-Reply-To: <agcbfLHT5ZWnNeN0@gmail.com>



On 2026/5/15 21:13, Breno Leitao wrote:
[...]
>>
>> Wonder if it would be simpler to just do a positive check near the top
>> of get_any_page() instead. Something like:
>>
>> static bool hwpoison_unrecoverable_kernel_page(struct page *page,
>> 						unsigned long flags)
> 
> Ack. We probably want to call it something like HWPoisonKernelOwned() to
> follow the same naming sematics of these helpers, such as HWPoisonHandlable()
> 
> By the way, I will re-include the self test back to this patch series,
> In case they are not useful, we do not merge it.
> 

Sounds good :)

Can you also test the relevant page types if possible, especially
the ones the new helper is supposed to classify?

Cheers, Lance

^ permalink raw reply

* Re: [RFC PATCH v3] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: Aaron Tomlin @ 2026-05-16 17:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Jonathan Corbet, Song Liu, KP Singh,
	Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard, Kumar Kartikeya Dwivedi,
	Masami Hiramatsu, Shuah Khan, Jiri Olsa, Martin KaFai Lau,
	Yonghong Song, Mathieu Desnoyers, Randy Dunlap, neelx, sean,
	chjohnst, steve, mproche, nick.lange, open list:DOCUMENTATION,
	LKML, bpf, linux-trace-kernel
In-Reply-To: <CAADnVQLw+_NaOVeaKabuf085wNo_-6MAv8w0EDO3fBz3KCQT5g@mail.gmail.com>

On Wed, May 13, 2026 at 09:35:29AM -0700, Alexei Starovoitov wrote:
> On Wed, May 13, 2026 at 8:23 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Wed, 13 May 2026 08:16:07 -0700
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> > > It's impossible to track all modifications.
> > > See what sched-ext is doing.
> > > What does it modify? Everything.
> >
> > What about just having a list of what BPF programs are loaded, what they
> > may be attached to, and what kfuncs they are calling?
> 
> Ohh. These have been available forever.
> Just bpftool prog, bpftool link, bpftool prog dump xlated

Hi Alexei,

Thank you for sharing.

Kind regards,
-- 
Aaron Tomlin

^ permalink raw reply

* Re: [RFC PATCH v2.2 18/28] mm/damon: trace probe_hits
From: SeongJae Park @ 2026-05-16 17:31 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt, damon, linux-kernel, linux-mm, linux-trace-kernel
In-Reply-To: <20260515004433.128933-19-sj@kernel.org>

On Thu, 14 May 2026 17:44:19 -0700 SeongJae Park <sj@kernel.org> wrote:

> Introduce a new tracepoint for exposing the per-region per-probe
> positive sample count via tracefs.
> 
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
>  include/trace/events/damon.h | 38 ++++++++++++++++++++++++++++++++++++
>  mm/damon/core.c              |  9 +++++++++
>  2 files changed, 47 insertions(+)
> 
> diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
> index 24fc402ab3c85..ec1e317923fd3 100644
> --- a/include/trace/events/damon.h
> +++ b/include/trace/events/damon.h
> @@ -130,6 +130,44 @@ TRACE_EVENT(damon_monitor_intervals_tune,
>  	TP_printk("sample_us=%lu", __entry->sample_us)
>  );
>  
> +TRACE_EVENT_CONDITION(damon_aggregated_v2,

I was thinking [1] about a better name of this tracepoint.  I will rename this
to 'damon_region_aggregated'.  And I will deprecate damon_aggregated, with
multi phase, like we did for DAMON debugfs interface.  The idea off the top of
my head at the moment is,

1. announce it as deprecated on the document, by end of 2026
2. rename it (e.g., damon_aggregated_deprecated) by end of 2027
3. removing the code by end of 2028

The deprecation might be done faster than the current idea.

As Steven commented [2], it should be ok to immediately removing it or
extending it to have probe_hits.  But I realize I'm quite lazy at DAMON
user-space tool development, and feel more comfortable on this approach for
now.  Please let me know if anyone has a different opinion.

[1] https://lore.kernel.org/20260514000611.147809-1-sj@kernel.org
[2] https://lore.kernel.org/20260513203237.3b1b3286@gandalf.local.home

Thanks,
SJ

[...]

^ permalink raw reply

* [RFC PATCH v3 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-16 18:36 UTC (permalink / raw)
  Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel

TL; DR
======

Extend DAMON for monitoring general data attributes other than accesses.
The short term motivation is lightweight page type (e.g., belonging
cgroup) aware monitoring.  In long term, this will help extending DAMON
for multiple access events capture primitives (e.g., page faults and
PMU) and eventually pivotting DAMON to a "Data Attributes Monitoring and
Operations eNgine" in long term.

Background: High Cost of Page Level Properties Monitoring
=========================================================

DAMON is initially introduced as a Data Access MONitor.  It has been
extended for not only access monitoring but also data access-aware
system operations (DAMOS).  But still the monitoring part is only for
data accesses.

Data access patterns is good information, but some users need more
holistic views.  Particularly, users want to show the access pattern
information together with the types of the memory.  For example, users
who work for making huge pages efficiently want to know how much of
DAMON-found hot/cold regions are backed by huge pages.  Users who run
multiple workloads with different cgroups want to know how much of
DAMON-found hot/cold regions belong to specific cgroups.

For the user demand, we developed a DAMOS extension for page level
properties based monitoring [1], which has landed on 6.14.  Using the
feature, users can inform the page level data properties that they are
interested in, in a flexible format that uses DAMOS filters.  Then,
DAMON applies the filters to each folio of the entire DAMON region and
lets users know how many bytes of memory in each DAMON region passed the
given filters.

This gives page level detailed and deterministic information to users.
But, because the operation is done at page level, the overhead is
proportional to the memory size.  It was useful for test or debugging
purposes on a small number of machines.  But it was obviously too heavy
to be enabled always on all machines running the real user workloads.
For real world workloads, it was recommended to use the feature with
user-space controlled sampling approaches.  For example, users could do
the page level monitoring only once per hour, on randomly selected one
percent of machines of their fleet.  If the runtime and the  size of the
fleet is long and big enough, it should provide statistically meaningful
data.

But users are too busy to implement such controls on their own.

Data Attributes Monitoring
==========================

Extend DAMON to monitor not only data accesses, but also general data
attributes.  Do the extension while keeping the main promise of DAMON,
the bounded and best-effort minimum overhead.

Allow users to specify what data attributes in addition to the data
access they want to monitor.  Users can install one 'data probe' per
data attribute of their interest for this purpose.  The 'data probe'
should be able to be applied to any memory, and determine if the given
memory has the appropriate data attribute.  E.g., if memory of physical
address 42 belongs to cgroup A.  Each 'data probe' is configured with
filters that are very similar to the DAMOS filters.

When DAMON checks if each sampling address memory of each region is
accessed since the last check, it applies data probes if registered.
Same to the number of access check-positive samples accounting
(nr_accesses), it accounts the number of each data probe-positive
samples in another per-region counters array, namely 'probe_hits'. When
DAMON resets nr_accesses every aggregation interval, it resets
'probe_hits' together.

Users can read 'probe_hits' just before the values are reset.  In this
way, users can know how many hot/cold memory regions have data
attributes of their interest.  E.g., 30 percent of this system's hot
memory is belonging to cgroup A, and 80 percent of the cgroup
A-belonging hot memory is backed by huge pages.

Patches Sequence
================

First eight patches implement the core feature, interface and the
working support.  Patch 1 introduces data probe data structure, namely
damon_probe.  Patch 2 extends damon_ctx for installing data probes.
Patch 3 introduces another data structure for filters of each data
probe, namely damon_filter.  Patch 4 updates damon_ctx commit function
to handle the probes.  Patch 5 extends damon_region for the per-region
per-probe positive samples counter, namely probe_hits.  Patch 6 extends
damon_operations for applying probes on the underlying DAMON operations
implementation.  Patch 7 updates kdamond_fn() to invoke the probes
applying callback.  Patch 8 finally implements the probes support on
paddr ops.

Ten changes for user interface (patches 9-18) come next.  Patches 9-13
implements sysfs directories and files for setting data probes, namely
probes directory, probe directory, filters directory, filter directory
and filter directory internal files, respectively.  Patch 14 connects
the user inputs that are made via the sysfs files to DAMON core.
Following three patches (patches 15-17) implement sysfs directories and
files for showing the probe_hits to users, namely probes directory,
probe directory and hits files, respectively.  Patch 18 introduces a new
tracepoint for showing the probe_hits via tracefs.

Patch 19 adds a selftest for the sysfs files.

Patches 20 and 21 documents the design and usage of the new feature,
respectively.

Seven additional patches (patches 22-28) for monitoring belonging memory
cgroup follow.  Depending on the feedback, this part might be separated
to another series in future.  Patch 22 defines the DAMON filter type for
the new attribute, namely DAMON_FILTER_TYPE_MEMCG.  Patch 23 add the
support on paddr ops.  Patch 24 updates the sysfs interface for setup of
the target memcg.  Patch 25 move code for easy reuse of the filter
target memcg setup.  Patch 26 connects the user input to the core layer.
Finally, patches 27 and 28 update the design and usage documents for the
memcg attribute monitoring support.

Discussions
===========

This allows the page properties monitoring with overhead that is low
enough to be enabled always on real world workloads.  Because the
sampling time for access check is reused for data attributes check,  the
upper-bounded and best-effort minimum overhead of DAMON is kept.
Because the sampling memory for access check is reused for data
attributes check, additional overhead is minimum.

Still DAMOS-based page level properties monitoring should be useful,
because it provides a deterministic page level information.  When in
doubt of the sampling based information, running DAMOS-based one
together and comparing the results would be useful, for debugging and
tuning.

Plan for Dropping RFC tag
=========================

Making changes for feedback from myself, humans and Sashiko should be
the major remaining work.

I'm currently hoping to drop the RFC tag by 7.2-rc1.

Future Works: Mid Term
========================

This version of implementation is limiting the maximum number of data
probes to four.  I will try to find a way to remove the limit in future.
I personally think it should be enough for common use cases, though, and
therefore not giving high priority at the moment.

Future Works: Long Term
=======================

There are user requests for extending DAMON with detailed access
information, for example, per-CPUs/threads/read/writes monitoring.  For
that, I was working [2] on extending DAMON to use page fault events as
another access check primitives, and making the infrastructure flexible
for future use of yet another access check primitive.  Actually there is
another ongoing work [3] for extending DAMON with PMU events.  The
motivation of the work is reducing the overhead, though.

In my work [2], I was introducing a new interface for access sampling
primitives control.  Now I think this data probe interface can be used
for that, too.  That is, data access becomes just one type of data
attribute.  Also, pg_idle-confirmed access, page fault-confirmed access,
and PMU event-confirmed access will be different types of data
attributes.

The regions adjustment mechanism is currently working based on the
access information.  That's because DAMON is designed for data access
monitoring.  That is, data access information is the primary interest,
and therefore DAMON adjusts regions in a way that can best-present the
information.

Once data access becomes just one of data attributes, there is no reason
to think data access that special.  There might be some users not
interested in access at all but want to know the location of memory of
specific type.  Data probes interface will allow doing that.  Further,
we could extend the interface to let users set any data attribute as the
'primary' attribute.  Then, DAMON will split and merge regions in a way
that can best-present the 'primary' attributes.

DAMOS will also be extended, to specify targets based on not only the
data access pattern, but all user-registered data attributes.  From this
stage, we may be able to call DAMON as a "Data Attributes Monitoring and
Operations eNgine".

[1] https://lore.kernel.org/20250106193401.109161-1-sj@kernel.org
[2] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org/
[3] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com

Changes from RFC v2.2
- rfc v2.2: https://lore.kernel.org/20260515004433.128933-1-sj@kernel.org
- Rename damon_aggregated_v2 trace event to damon_region_aggregated.
- Address Sashiko issues.
  - Enclose arguments on damon_for_each_{probe,filter}[_safe]() macros.
  - Fix typos in comments and documents.
  - Update probe_hits for region split and merge.
  - Add more documentation for damon_operation->apply_probes() callback.
  - Reduce unnecessary folio_{get,put}() in damon_pa_apply_probes().
  - Define damon_sysfs_probe_attrs as static.
  - Link scheme tried region sysfs dir and increase the count only after
    all internal dir population success.
  - Commit damon_filter->memcg_id for newly added filters.
Changes from RFC v2.1
- rfc v2.1: https://lore.kernel.org/20260514140904.119781-1-sj@kernel.org
- Rebase to mm-stable (7.1-rc3) to avoid Sashiko patch apply failure.
Changes from RFC v2
- rfc v2: https://lore.kernel.org/20260512143645.113201-1-sj@kernel.org
- Optimize nr_probes calculation for probe_hits tracepoint.
- Use TRACE_EVENT_CONDITION() for probe_hits tracepoint.
- Rebase to latest mm-new.
Changes from RFC
- rfc: https://lore.kernel.org/all/20260426205222.93895-1-sj@kernel.org/
- Support memcg DAMON filter.
- Use per-probe probe_hits sysfs file.
- Use dynamic_array for probe_hits tracing.
- Fix filter matching field.
- Fix folio leaking in damon_pa_filter_pass().
- Move nr_regions of damon_aggregated_v2 tracepoint after end.
- Rename DAMON_TEST_TYPE_ANON to DAMON_FILTER_TYPE_ANON.

SeongJae Park (28):
  mm/damon/core: introduce struct damon_probe
  mm/damon/core: embed damon_probe objects in damon_ctx
  mm/damon/core: introduce damon_filter
  mm/damon/core: commit probes
  mm/damon/core: introduce damon_region->probe_hits
  mm/damon/core: introduce damon_ops->apply_probes
  mm/damon/core: do data attributes monitoring
  mm/damon/paddr: support data attributes monitoring
  mm/damon/sysfs: implement probes dir
  mm/damon/sysfs: implement probe dir
  mm/damon/sysfs: implement filters directory
  mm/damon/sysfs: implement filter dir
  mm/damon/sysfs: implement filter dir files
  mm/damon/sysfs: setup probes on DAMON core API parameters
  mm/damon/sysfs-schemes: implement tried_regions/<r>/probes/
  mm/damon/sysfs-schemes: implement probe dir
  mm/damon/sysfs-schemes: implement probe/hits file
  mm/damon: trace probe_hits
  selftests/damon/sysfs.sh: test probes dir
  Docs/mm/damon/design: document data attributes monitoring
  Docs/admin-guide/mm/damon/usage: document data attributes monitoring
  mm/damon/core: introduce DAMON_FILTER_TYPE_MEMCG
  mm/damon/paddr: support DAMON_FILTER_TYPE_MEMCG
  mm/damon/sysfs: add filters/<F>/path file
  mm/damon/sysfs-schemes: move memcg_path_to_id() to sysfs-common
  mm/damon/sysfs: setup damon_filter->memcg_id from path
  Docs/mm/damon/design: update for memcg damon filter
  Docs/admin-guide/mm/damon/usage: update for memcg damon filter

 Documentation/admin-guide/mm/damon/usage.rst |  46 +-
 Documentation/mm/damon/design.rst            |  39 ++
 include/linux/damon.h                        |  69 +++
 include/trace/events/damon.h                 |  38 ++
 mm/damon/core.c                              | 211 +++++++
 mm/damon/paddr.c                             |  76 +++
 mm/damon/sysfs-common.c                      |  41 ++
 mm/damon/sysfs-common.h                      |   2 +
 mm/damon/sysfs-schemes.c                     | 226 ++++++--
 mm/damon/sysfs.c                             | 557 +++++++++++++++++++
 tools/testing/selftests/damon/sysfs.sh       |  48 ++
 11 files changed, 1305 insertions(+), 48 deletions(-)

base-commit: 5d6919055dec134de3c40167a490f33c74c12581
-- 
2.47.3

^ permalink raw reply

* [RFC PATCH v3 18/28] mm/damon: trace probe_hits
From: SeongJae Park @ 2026-05-16 18:36 UTC (permalink / raw)
  Cc: SeongJae Park, Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt, damon, linux-kernel, linux-mm, linux-trace-kernel
In-Reply-To: <20260516183712.81393-1-sj@kernel.org>

Introduce a new tracepoint for exposing the per-region per-probe
positive sample count via tracefs.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 include/trace/events/damon.h | 38 ++++++++++++++++++++++++++++++++++++
 mm/damon/core.c              |  9 +++++++++
 2 files changed, 47 insertions(+)

diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
index 24fc402ab3c85..2fd914895c405 100644
--- a/include/trace/events/damon.h
+++ b/include/trace/events/damon.h
@@ -130,6 +130,44 @@ TRACE_EVENT(damon_monitor_intervals_tune,
 	TP_printk("sample_us=%lu", __entry->sample_us)
 );
 
+TRACE_EVENT_CONDITION(damon_region_aggregated,
+
+	TP_PROTO(unsigned int target_id, struct damon_region *r,
+		unsigned int nr_regions, unsigned int nr_probes),
+
+	TP_ARGS(target_id, r, nr_regions, nr_probes),
+
+	TP_CONDITION(nr_probes > 0),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, target_id)
+		__field(unsigned long, start)
+		__field(unsigned long, end)
+		__field(unsigned int, nr_regions)
+		__field(unsigned int, nr_accesses)
+		__field(unsigned int, age)
+		__dynamic_array(unsigned char, probe_hits, nr_probes)
+	),
+
+	TP_fast_assign(
+		__entry->target_id = target_id;
+		__entry->start = r->ar.start;
+		__entry->end = r->ar.end;
+		__entry->nr_regions = nr_regions;
+		__entry->nr_accesses = r->nr_accesses;
+		__entry->age = r->age;
+		memcpy(__get_dynamic_array(probe_hits), r->probe_hits,
+			sizeof(*r->probe_hits) * nr_probes);
+	),
+
+	TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u %u probe_hits=%s",
+			__entry->target_id, __entry->nr_regions,
+			__entry->start, __entry->end,
+			__entry->nr_accesses, __entry->age,
+			__print_hex(__get_dynamic_array(probe_hits),
+				__get_dynamic_array_len(probe_hits)))
+);
+
 TRACE_EVENT(damon_aggregated,
 
 	TP_PROTO(unsigned int target_id, struct damon_region *r,
diff --git a/mm/damon/core.c b/mm/damon/core.c
index dde3c8d8fef89..11b513eb077fe 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -1881,6 +1881,13 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 {
 	struct damon_target *t;
 	unsigned int ti = 0;	/* target's index */
+	unsigned int nr_probes = 0;
+	struct damon_probe *probe;
+
+	if (trace_damon_region_aggregated_enabled()) {
+		damon_for_each_probe(probe, c)
+			nr_probes++;
+	}
 
 	damon_for_each_target(t, c) {
 		struct damon_region *r;
@@ -1889,6 +1896,8 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 			int i;
 
 			trace_damon_aggregated(ti, r, damon_nr_regions(t));
+			trace_damon_region_aggregated(ti, r,
+					damon_nr_regions(t), nr_probes);
 			damon_warn_fix_nr_accesses_corruption(r);
 			r->last_nr_accesses = r->nr_accesses;
 			r->nr_accesses = 0;
-- 
2.47.3

^ permalink raw reply related

* Re: [RFC PATCH v3 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-16 18:50 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260516183712.81393-1-sj@kernel.org>

On Sat, 16 May 2026 11:36:41 -0700 SeongJae Park <sj@kernel.org> wrote:

> TL; DR
> ======
> 
> Extend DAMON for monitoring general data attributes other than accesses.
> The short term motivation is lightweight page type (e.g., belonging
> cgroup) aware monitoring.  In long term, this will help extending DAMON
> for multiple access events capture primitives (e.g., page faults and
> PMU) and eventually pivotting DAMON to a "Data Attributes Monitoring and
> Operations eNgine" in long term.
[...]
> Changes from RFC v2.1
> - rfc v2.1: https://lore.kernel.org/20260514140904.119781-1-sj@kernel.org
> - Rebase to mm-stable (7.1-rc3) to avoid Sashiko patch apply failure.

Still this seires is based on mm-stable (7.1-rc3) for the same reason.  The
patches that based on mm-new is available at damon/next tree [1].

[1] https://origin.kernel.org/doc/html/latest/mm/damon/maintainer-profile.html#scm-trees


Thanks,
SJ

[...]

^ permalink raw reply

* [PATCH v2] tracing/probes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-16 21:33 UTC (permalink / raw)
  To: LKML, Linux trace kernel, bpf
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
	Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa

From: Steven Rostedt <rostedt@goodmis.org>

Add syntax to the FETCHARGS parsing of probes to allow the use of
structure and member names to get the offsets to dereference pointers.

Currently, a dereference must be a number, where the user has to figure
out manually the offset of a member of a structure that they want to
reference. For example, to get the size of a kmem_cache that was passed to
the function kmem_cache_alloc_noprof, one would need to do:

 # cd /sys/kernel/tracing
 # echo 'f:cache kmem_cache_alloc_noprof size=+0x18($arg1):u32' >> dynamic_events

This requires knowing that the offset of size is 0x18, which can be found
with gdb:

  (gdb) p &((struct kmem_cache *)0)->size
  $1 = (unsigned int *) 0x18

If BTF is in the kernel, it can be used to find this with names, where the
user doesn't need to find the actual offset:

 # echo 'f:cache kmem_cache_alloc_noprof size=+kmem_cache.size($arg1):u32' >> dynamic_events

Instead of the "+0x18", it would have "+kmem_cache.size" where the format is:

  +STRUCT.MEMBER[.MEMBER[..]]

The delimiter is '.' and the first item is the structure name. Then the
member of the structure to get the offset of. If that member is an
embedded structure, another '.MEMBER' may be added to get the offset of
its members with respect to the original value.

  "+kmem_cache.size($arg1)" is equivalent to:

  (*(struct kmem_cache *)$arg1).size

Anonymous structures are also handled:

  # echo 'e:xmit net.net_dev_xmit +net_device.name(+sk_buff.dev($skbaddr)):string' >> dynamic_events

Where "+net_device.name(+sk_buff.dev($skbaddr))" is equivalent to:

  (*(struct net_device *)((*(struct sk_buff *)($skbaddr)).dev)->name)

Note that "dev" of struct sk_buff is inside an anonymous structure:

struct sk_buff {
	union {
		struct {
			/* These two members must be first to match sk_buff_head. */
			struct sk_buff		*next;
			struct sk_buff		*prev;

			union {
				struct net_device	*dev;
				[..]
			};
		};
		[..]
	};

This will allow up to three deep of anonymous structures before it will
fail to find a member.

The above produces:

    sshd-session-1080    [000] b..5.  1526.337161: xmit: (net.net_dev_xmit) arg1="enp7s0"

And nested structures can be found by adding more members to the arg:

  # echo 'f:read filemap_readahead.isra.0 file=+0(+dentry.d_name.name(+file.f_path.dentry($arg2))):string' >> dynamic_events

The above is equivalent to:

  *((*(struct dentry *)(*(struct file *)$arg2).f_path.dentry)->d_name.name)

And produces:

       trace-cmd-1381    [002] ...1.  2082.676268: read: (filemap_readahead.isra.0+0x0/0x150) file="trace.dat"

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v1: https://patch.msgid.link/20250729113335.2e4f087d@batman.local.home

- Pass error info back so that error_log can display what went wrong
  (Masami Hiramatsu)

- Use __btf_member_bit_offset() to get nested struct member offsets
  (Douglas Raillard)

- Add btf_put(btf) (Jiri Olsa)

 Documentation/trace/kprobetrace.rst |   3 +
 kernel/trace/trace_btf.c            | 111 ++++++++++++++++++++++++++++
 kernel/trace/trace_btf.h            |  10 +++
 kernel/trace/trace_probe.c          |  19 ++++-
 kernel/trace/trace_probe.h          |   4 +-
 5 files changed, 144 insertions(+), 3 deletions(-)

diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 3b6791c17e9b..00273157100c 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -54,6 +54,8 @@ Synopsis of kprobe_events
   $retval	: Fetch return value.(\*2)
   $comm		: Fetch current task comm.
   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+  +STRUCT.MEMBER[.MEMBER[..]](FETCHARG) : If BTF is supported, Fetch memory
+		  at FETCHARG + the offset of MEMBER inside of STRUCT.(\*5)
   \IMM		: Store an immediate value to the argument.
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
   FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
@@ -70,6 +72,7 @@ Synopsis of kprobe_events
         accesses one register.
   (\*3) this is useful for fetching a field of data structures.
   (\*4) "u" means user-space dereference. See :ref:`user_mem_access`.
+  (\*5) +STRUCT.MEMBER(FETCHARG) is equivalent to (*(struct STRUCT *)(FETCHARG)).MEMBER
 
 Function arguments at kretprobe
 -------------------------------
diff --git a/kernel/trace/trace_btf.c b/kernel/trace/trace_btf.c
index 00172f301f25..be88cc4d97dd 100644
--- a/kernel/trace/trace_btf.c
+++ b/kernel/trace/trace_btf.c
@@ -120,3 +120,114 @@ const struct btf_member *btf_find_struct_member(struct btf *btf,
 	return member;
 }
 
+#define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3)
+
+static int find_member(const char *ptr, struct btf *btf,
+		       const struct btf_type **type, int level)
+{
+	const struct btf_member *member;
+	const struct btf_type *t = *type;
+	int i;
+
+	/* Max of 3 depth of anonymous structures */
+	if (level > 3)
+		return -E2BIG;
+
+	for_each_member(i, t, member) {
+		const char *tname = btf_name_by_offset(btf, member->name_off);
+
+		if (strcmp(ptr, tname) == 0) {
+			int offset = __btf_member_bit_offset(t, member);
+			*type = btf_type_by_id(btf, member->type);
+			return BITS_ROUNDDOWN_BYTES(offset);
+		}
+
+		/* Handle anonymous structures */
+		if (strlen(tname))
+			continue;
+
+		*type = btf_type_by_id(btf, member->type);
+		if (btf_type_is_struct(*type)) {
+			int offset = find_member(ptr, btf, type, level + 1);
+
+			if (offset < 0)
+				continue;
+
+			return offset + BITS_ROUNDDOWN_BYTES(member->offset);
+		}
+	}
+
+	return -ENOENT;
+}
+
+/**
+ * btf_find_offset - Find an offset of a member for a structure
+ * @arg: A structure name followed by one or more members
+ * @offset_p: A pointer to where to store the offset
+ *
+ * Will parse @arg with the expected format of: struct.member[[.member]..]
+ * It is delimited by '.'. The first item must be a structure type.
+ * The next are its members. If the member is also of a structure type it
+ * another member may follow ".member".
+ *
+ * Note, @arg is modified but will be put back to what it was on return.
+ *
+ * Returns: 0 on success and -EINVAL if no '.' is present
+ *    or -ENXIO if the structure or member is not found.
+ *    Returns -EINVAL if BTF is not defined.
+ *  On success, @offset_p will contain the offset of the member specified
+ *    by @arg.
+ */
+int btf_find_offset(char *arg, long *offset_p)
+{
+	const struct btf_type *t;
+	struct btf *btf;
+	long offset = 0;
+	char *ptr;
+	int ret;
+	s32 id;
+
+	ptr = strchr(arg, '.');
+	if (!ptr)
+		return -EINVAL;
+
+	*ptr = '\0';
+
+	ret = -ENXIO;
+	id = bpf_find_btf_id(arg, BTF_KIND_STRUCT, &btf);
+	if (id < 0)
+		goto error;
+
+	/* Get BTF_KIND_FUNC type */
+	t = btf_type_by_id(btf, id);
+
+	/* May allow more than one member, as long as they are structures */
+	do {
+		ret = -ENXIO;
+		if (!t || !btf_type_is_struct(t))
+			goto error;
+
+		*ptr++ = '.';
+		arg = ptr;
+		ptr = strchr(ptr, '.');
+		if (ptr)
+			*ptr = '\0';
+
+		ret = find_member(arg, btf, &t, 0);
+		if (ret < 0)
+			goto error;
+
+		offset += ret;
+
+	} while (ptr);
+
+	btf_put(btf);
+	*offset_p = offset;
+	return 0;
+
+error:
+	btf_put(btf);
+	if (ptr)
+		*ptr = '.';
+	return ret;
+}
diff --git a/kernel/trace/trace_btf.h b/kernel/trace/trace_btf.h
index 4bc44bc261e6..7b0797a6050b 100644
--- a/kernel/trace/trace_btf.h
+++ b/kernel/trace/trace_btf.h
@@ -9,3 +9,13 @@ const struct btf_member *btf_find_struct_member(struct btf *btf,
 						const struct btf_type *type,
 						const char *member_name,
 						u32 *anon_offset);
+
+#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
+/* Will modify arg, but will put it back before returning. */
+int btf_find_offset(char *arg, long *offset);
+#else
+static inline int btf_find_offset(char *arg, long *offset)
+{
+	return -EINVAL;
+}
+#endif
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e0d3a0da26af..6fcede2de1a5 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1165,7 +1165,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 
 	case '+':	/* deref memory */
 	case '-':
-		if (arg[1] == 'u') {
+		if (arg[1] == 'u' && isdigit(arg[2])) {
 			deref = FETCH_OP_UDEREF;
 			arg[1] = arg[0];
 			arg++;
@@ -1178,7 +1178,22 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 			return -EINVAL;
 		}
 		*tmp = '\0';
-		ret = kstrtol(arg, 0, &offset);
+		if (arg[0] != '-' && !isdigit(*arg)) {
+			int err = 0;
+			ret = btf_find_offset(arg, &offset);
+			switch (ret) {
+			case -ENODEV: err = TP_ERR_NOSUP_BTFARG; break;
+			case -E2BIG: err = TP_ERR_MEMBER_TOO_DEEP; break;
+			case -EINVAL: err = TP_ERR_BAD_STRUCT_FMT; break;
+			case -ENXIO: err = TP_ERR_BAD_BTF_TID; break;
+			}
+			if (err)
+				__trace_probe_log_err(ctx->offset, err);
+			if (ret < 0)
+				return ret;
+		} else {
+			ret = kstrtol(arg, 0, &offset);
+		}
 		if (ret) {
 			trace_probe_log_err(ctx->offset, BAD_DEREF_OFFS);
 			break;
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 262d8707a3df..d649bb9f5b7c 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -563,7 +563,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(NEED_STRING_TYPE,	"$comm and immediate-string only accepts string type"),\
 	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
 	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
-	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),
+	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"), \
+	C(MEMBER_TOO_DEEP,	"Too many indirections of anonymous structure"), \
+	C(BAD_STRUCT_FMT,	"Unknown BTF structure"),
 
 #undef C
 #define C(a, b)		TP_ERR_##a
-- 
2.53.0


^ permalink raw reply related

* Re: [RFC PATCH v3 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-16 22:03 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260516183712.81393-1-sj@kernel.org>

On Sat, 16 May 2026 11:36:41 -0700 SeongJae Park <sj@kernel.org> wrote:

> TL; DR
> ======
> 
> Extend DAMON for monitoring general data attributes other than accesses.
> The short term motivation is lightweight page type (e.g., belonging
> cgroup) aware monitoring.  In long term, this will help extending DAMON
> for multiple access events capture primitives (e.g., page faults and
> PMU) and eventually pivotting DAMON to a "Data Attributes Monitoring and
> Operations eNgine" in long term.

Sashiko found [1] no blocker for this version but a nice document wordsmithing
idea.  Unless I get other opinions, I will drop RFC tag from the next version
of this series.

[1] https://lore.kernel.org/damon/20260516185032.82261-1-sj@kernel.org/


Thanks,
SJ

[...]

^ permalink raw reply

* Re: [PATCH v2] tracing/probes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-05-17  2:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux trace kernel, bpf, Masami Hiramatsu,
	Mathieu Desnoyers, Mark Rutland, Peter Zijlstra, Namhyung Kim,
	Takaya Saeki, Douglas Raillard, Tom Zanussi, Andrew Morton,
	Thomas Gleixner, Ian Rogers, Jiri Olsa
In-Reply-To: <20260516173310.1dbad146@fedora>

On Sat, 16 May 2026 17:33:09 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> Add syntax to the FETCHARGS parsing of probes to allow the use of
> structure and member names to get the offsets to dereference pointers.
> 
> Currently, a dereference must be a number, where the user has to figure
> out manually the offset of a member of a structure that they want to
> reference. For example, to get the size of a kmem_cache that was passed to
> the function kmem_cache_alloc_noprof, one would need to do:
> 
>  # cd /sys/kernel/tracing
>  # echo 'f:cache kmem_cache_alloc_noprof size=+0x18($arg1):u32' >> dynamic_events
> 
> This requires knowing that the offset of size is 0x18, which can be found
> with gdb:
> 
>   (gdb) p &((struct kmem_cache *)0)->size
>   $1 = (unsigned int *) 0x18
> 
> If BTF is in the kernel, it can be used to find this with names, where the
> user doesn't need to find the actual offset:
> 
>  # echo 'f:cache kmem_cache_alloc_noprof size=+kmem_cache.size($arg1):u32' >> dynamic_events
> 
> Instead of the "+0x18", it would have "+kmem_cache.size" where the format is:
> 
>   +STRUCT.MEMBER[.MEMBER[..]]
> 
> The delimiter is '.' and the first item is the structure name. Then the
> member of the structure to get the offset of. If that member is an
> embedded structure, another '.MEMBER' may be added to get the offset of
> its members with respect to the original value.
> 
>   "+kmem_cache.size($arg1)" is equivalent to:
> 
>   (*(struct kmem_cache *)$arg1).size
> 
> Anonymous structures are also handled:
> 
>   # echo 'e:xmit net.net_dev_xmit +net_device.name(+sk_buff.dev($skbaddr)):string' >> dynamic_events
> 
> Where "+net_device.name(+sk_buff.dev($skbaddr))" is equivalent to:
> 
>   (*(struct net_device *)((*(struct sk_buff *)($skbaddr)).dev)->name)
> 
> Note that "dev" of struct sk_buff is inside an anonymous structure:
> 
> struct sk_buff {
> 	union {
> 		struct {
> 			/* These two members must be first to match sk_buff_head. */
> 			struct sk_buff		*next;
> 			struct sk_buff		*prev;
> 
> 			union {
> 				struct net_device	*dev;
> 				[..]
> 			};
> 		};
> 		[..]
> 	};
> 
> This will allow up to three deep of anonymous structures before it will
> fail to find a member.
> 
> The above produces:
> 
>     sshd-session-1080    [000] b..5.  1526.337161: xmit: (net.net_dev_xmit) arg1="enp7s0"
> 
> And nested structures can be found by adding more members to the arg:
> 
>   # echo 'f:read filemap_readahead.isra.0 file=+0(+dentry.d_name.name(+file.f_path.dentry($arg2))):string' >> dynamic_events
> 
> The above is equivalent to:
> 
>   *((*(struct dentry *)(*(struct file *)$arg2).f_path.dentry)->d_name.name)
> 
> And produces:
> 
>        trace-cmd-1381    [002] ...1.  2082.676268: read: (filemap_readahead.isra.0+0x0/0x150) file="trace.dat"
> 
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Thanks for updating it! I think this looks good to me. 
Let me test it. I think we also need updating selftests.

Thank you,

> ---
> Changes since v1: https://patch.msgid.link/20250729113335.2e4f087d@batman.local.home
> 
> - Pass error info back so that error_log can display what went wrong
>   (Masami Hiramatsu)
> 
> - Use __btf_member_bit_offset() to get nested struct member offsets
>   (Douglas Raillard)
> 
> - Add btf_put(btf) (Jiri Olsa)
> 
>  Documentation/trace/kprobetrace.rst |   3 +
>  kernel/trace/trace_btf.c            | 111 ++++++++++++++++++++++++++++
>  kernel/trace/trace_btf.h            |  10 +++
>  kernel/trace/trace_probe.c          |  19 ++++-
>  kernel/trace/trace_probe.h          |   4 +-
>  5 files changed, 144 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
> index 3b6791c17e9b..00273157100c 100644
> --- a/Documentation/trace/kprobetrace.rst
> +++ b/Documentation/trace/kprobetrace.rst
> @@ -54,6 +54,8 @@ Synopsis of kprobe_events
>    $retval	: Fetch return value.(\*2)
>    $comm		: Fetch current task comm.
>    +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
> +  +STRUCT.MEMBER[.MEMBER[..]](FETCHARG) : If BTF is supported, Fetch memory
> +		  at FETCHARG + the offset of MEMBER inside of STRUCT.(\*5)
>    \IMM		: Store an immediate value to the argument.
>    NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
>    FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
> @@ -70,6 +72,7 @@ Synopsis of kprobe_events
>          accesses one register.
>    (\*3) this is useful for fetching a field of data structures.
>    (\*4) "u" means user-space dereference. See :ref:`user_mem_access`.
> +  (\*5) +STRUCT.MEMBER(FETCHARG) is equivalent to (*(struct STRUCT *)(FETCHARG)).MEMBER
>  
>  Function arguments at kretprobe
>  -------------------------------
> diff --git a/kernel/trace/trace_btf.c b/kernel/trace/trace_btf.c
> index 00172f301f25..be88cc4d97dd 100644
> --- a/kernel/trace/trace_btf.c
> +++ b/kernel/trace/trace_btf.c
> @@ -120,3 +120,114 @@ const struct btf_member *btf_find_struct_member(struct btf *btf,
>  	return member;
>  }
>  
> +#define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3)
> +
> +static int find_member(const char *ptr, struct btf *btf,
> +		       const struct btf_type **type, int level)
> +{
> +	const struct btf_member *member;
> +	const struct btf_type *t = *type;
> +	int i;
> +
> +	/* Max of 3 depth of anonymous structures */
> +	if (level > 3)
> +		return -E2BIG;
> +
> +	for_each_member(i, t, member) {
> +		const char *tname = btf_name_by_offset(btf, member->name_off);
> +
> +		if (strcmp(ptr, tname) == 0) {
> +			int offset = __btf_member_bit_offset(t, member);
> +			*type = btf_type_by_id(btf, member->type);
> +			return BITS_ROUNDDOWN_BYTES(offset);
> +		}
> +
> +		/* Handle anonymous structures */
> +		if (strlen(tname))
> +			continue;
> +
> +		*type = btf_type_by_id(btf, member->type);
> +		if (btf_type_is_struct(*type)) {
> +			int offset = find_member(ptr, btf, type, level + 1);
> +
> +			if (offset < 0)
> +				continue;
> +
> +			return offset + BITS_ROUNDDOWN_BYTES(member->offset);
> +		}
> +	}
> +
> +	return -ENOENT;
> +}
> +
> +/**
> + * btf_find_offset - Find an offset of a member for a structure
> + * @arg: A structure name followed by one or more members
> + * @offset_p: A pointer to where to store the offset
> + *
> + * Will parse @arg with the expected format of: struct.member[[.member]..]
> + * It is delimited by '.'. The first item must be a structure type.
> + * The next are its members. If the member is also of a structure type it
> + * another member may follow ".member".
> + *
> + * Note, @arg is modified but will be put back to what it was on return.
> + *
> + * Returns: 0 on success and -EINVAL if no '.' is present
> + *    or -ENXIO if the structure or member is not found.
> + *    Returns -EINVAL if BTF is not defined.
> + *  On success, @offset_p will contain the offset of the member specified
> + *    by @arg.
> + */
> +int btf_find_offset(char *arg, long *offset_p)
> +{
> +	const struct btf_type *t;
> +	struct btf *btf;
> +	long offset = 0;
> +	char *ptr;
> +	int ret;
> +	s32 id;
> +
> +	ptr = strchr(arg, '.');
> +	if (!ptr)
> +		return -EINVAL;
> +
> +	*ptr = '\0';
> +
> +	ret = -ENXIO;
> +	id = bpf_find_btf_id(arg, BTF_KIND_STRUCT, &btf);
> +	if (id < 0)
> +		goto error;
> +
> +	/* Get BTF_KIND_FUNC type */
> +	t = btf_type_by_id(btf, id);
> +
> +	/* May allow more than one member, as long as they are structures */
> +	do {
> +		ret = -ENXIO;
> +		if (!t || !btf_type_is_struct(t))
> +			goto error;
> +
> +		*ptr++ = '.';
> +		arg = ptr;
> +		ptr = strchr(ptr, '.');
> +		if (ptr)
> +			*ptr = '\0';
> +
> +		ret = find_member(arg, btf, &t, 0);
> +		if (ret < 0)
> +			goto error;
> +
> +		offset += ret;
> +
> +	} while (ptr);
> +
> +	btf_put(btf);
> +	*offset_p = offset;
> +	return 0;
> +
> +error:
> +	btf_put(btf);
> +	if (ptr)
> +		*ptr = '.';
> +	return ret;
> +}
> diff --git a/kernel/trace/trace_btf.h b/kernel/trace/trace_btf.h
> index 4bc44bc261e6..7b0797a6050b 100644
> --- a/kernel/trace/trace_btf.h
> +++ b/kernel/trace/trace_btf.h
> @@ -9,3 +9,13 @@ const struct btf_member *btf_find_struct_member(struct btf *btf,
>  						const struct btf_type *type,
>  						const char *member_name,
>  						u32 *anon_offset);
> +
> +#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
> +/* Will modify arg, but will put it back before returning. */
> +int btf_find_offset(char *arg, long *offset);
> +#else
> +static inline int btf_find_offset(char *arg, long *offset)
> +{
> +	return -EINVAL;
> +}
> +#endif
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index e0d3a0da26af..6fcede2de1a5 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -1165,7 +1165,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
>  
>  	case '+':	/* deref memory */
>  	case '-':
> -		if (arg[1] == 'u') {
> +		if (arg[1] == 'u' && isdigit(arg[2])) {
>  			deref = FETCH_OP_UDEREF;
>  			arg[1] = arg[0];
>  			arg++;
> @@ -1178,7 +1178,22 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
>  			return -EINVAL;
>  		}
>  		*tmp = '\0';
> -		ret = kstrtol(arg, 0, &offset);
> +		if (arg[0] != '-' && !isdigit(*arg)) {
> +			int err = 0;
> +			ret = btf_find_offset(arg, &offset);
> +			switch (ret) {
> +			case -ENODEV: err = TP_ERR_NOSUP_BTFARG; break;
> +			case -E2BIG: err = TP_ERR_MEMBER_TOO_DEEP; break;
> +			case -EINVAL: err = TP_ERR_BAD_STRUCT_FMT; break;
> +			case -ENXIO: err = TP_ERR_BAD_BTF_TID; break;
> +			}
> +			if (err)
> +				__trace_probe_log_err(ctx->offset, err);
> +			if (ret < 0)
> +				return ret;
> +		} else {
> +			ret = kstrtol(arg, 0, &offset);
> +		}
>  		if (ret) {
>  			trace_probe_log_err(ctx->offset, BAD_DEREF_OFFS);
>  			break;
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 262d8707a3df..d649bb9f5b7c 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -563,7 +563,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
>  	C(NEED_STRING_TYPE,	"$comm and immediate-string only accepts string type"),\
>  	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
>  	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
> -	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),
> +	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"), \
> +	C(MEMBER_TOO_DEEP,	"Too many indirections of anonymous structure"), \
> +	C(BAD_STRUCT_FMT,	"Unknown BTF structure"),
>  
>  #undef C
>  #define C(a, b)		TP_ERR_##a
> -- 
> 2.53.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Race condition in __modify_ftrace_direct() between tmp_ops registration and direct_functions hash update
From: Afi0 @ 2026-05-17  6:24 UTC (permalink / raw)
  To: security; +Cc: linux-kernel, linux-trace-kernel, rostedt, mhiramat, Greg KH


[-- Attachment #1.1: Type: text/plain, Size: 1881 bytes --]

Hi list,

Apologies for initially sending only to Greg. Resending to the full list as
requested.
------------------------------

Component: kernel/trace/ftrace.c Function: __modify_ftrace_direct()
Affected versions: Linux kernel 5.15+ Type: TOCTOU / Race condition CVSS
3.1: AV:L/AC:H/PR:L/UI:N/S:C/C:H/I:H/A:H - 7.8 (High)

SUMMARY

A race condition exists in __modify_ftrace_direct() between the
registration of tmp_ops into ftrace_ops_list and the subsequent update of
direct_functions hash entries. During this window, concurrent CPUs
executing traced functions will read the stale direct call address via
ftrace_find_rec_direct() and jump to it, while the caller may have already
invalidated or freed the old trampoline memory.

VULNERABLE CODE

err = register_ftrace_function_nolock(&tmp_ops);[race window:
ftrace_ops_list_func now active, direct_functions not yet
updated]mutex_lock(&ftrace_lock);entry->direct = addr;   /* update
happens here, too late */mutex_unlock(&ftrace_lock);

IMPACT

CPU executing traced function reads stale direct_functions entry during the
race window. arch_ftrace_set_direct_caller() redirects execution to
potentially freed or invalidated trampoline memory. Use-after-free in
executable code context on SMP systems.

TRIGGER

Requires CAP_PERFMON or CAP_SYS_ADMIN directly. Also reachable via BPF
trampolines (kernel/bpf/trampoline.c calls __modify_ftrace_direct()
internally) with CAP_BPF + CAP_PERFMON, default in many CI/CD container
runtimes. Live patching via klp_patch_func() also goes through this path.

SUGGESTED FIX

Update entry->direct under ftrace_lock BEFORE registering tmp_ops. Add
smp_wmb() between the store and registration to ensure ordering on
weakly-ordered architectures.

Patch attached as 0001-ftrace-fix-race-in-__modify_ftrace_direct.patch

Fixes: 0567d6809440 ("ftrace: Add modify_ftrace_direct()")

Thanks,

 Afi0

[-- Attachment #1.2: Type: text/html, Size: 4441 bytes --]

[-- Attachment #2: 0001-ftrace-fix-race-in-__modify_ftrace_direct.patch --]
[-- Type: text/x-patch, Size: 4719 bytes --]

From b3c4d5e6f7a8b3c4d5e6f7a8b3c4d5e6f7a8b3c4 Mon Sep 17 00:00:00 2001
From: Afi0 <capyenglishlite@gmail.com>
Date: Sat, 16 May 2026 12:11:00 +0000
Subject: [PATCH] ftrace: fix race in __modify_ftrace_direct() between
 tmp_ops registration and direct_functions update

In __modify_ftrace_direct(), register_ftrace_function_nolock() makes
tmp_ops visible in ftrace_ops_list before entry->direct is updated
under ftrace_lock. During this window any CPU entering the traced
function calls call_direct_funcs(), reads the old address from
direct_functions via RCU, and jumps to it via
arch_ftrace_set_direct_caller(). If the caller freed or invalidated
the old trampoline before calling modify_ftrace_direct(), this is a
use-after-free in executable code context.

The race window:

  CPU 0 (__modify_ftrace_direct)       CPU 1 (executing traced func)
  ──────────────────────────────       ──────────────────────────────
  register_ftrace_function_nolock()
    -> tmp_ops visible in ops_list
                                        call_direct_funcs()
                                          ftrace_find_rec_direct() -> old_addr
                                          arch_ftrace_set_direct_caller(old_addr)
                                          jump to old_addr  <- UAF if freed
  mutex_lock(&ftrace_lock)
  entry->direct = addr   <- too late
  mutex_unlock(&ftrace_lock)

Fix: update entry->direct under ftrace_lock BEFORE registering tmp_ops.
Any CPU that observes tmp_ops in ftrace_ops_list after this point will
already see the new address when it calls ftrace_find_rec_direct().
Add smp_wmb() between the store and the registration to ensure the
write is visible on weakly-ordered architectures before tmp_ops
becomes observable via ftrace_ops_list.

On error from register_ftrace_function_nolock(), restore entry->direct
to old_addr since tmp_ops never became visible to other CPUs.

This affects all callers of __modify_ftrace_direct(), including:
  - modify_ftrace_direct() used by kernel modules and live patching
  - modify_ftrace_direct_nolock() used by BPF trampolines
    (kernel/bpf/trampoline.c) reachable with CAP_BPF + CAP_PERFMON

Fixes: 0567d6809440 ("ftrace: Add modify_ftrace_direct()")
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Afi0 <capyenglishlite@gmail.com>
---
 kernel/trace/ftrace.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index a1b2c3d4e5f6..b7c8d9e0f1a2 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5950,6 +5950,7 @@ static int __modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
 	struct ftrace_func_entry *entry;
 	struct ftrace_ops tmp_ops;
+	unsigned long old_addr;
 	int err;
 
 	lockdep_assert_held(&direct_mutex);
@@ -5960,22 +5961,36 @@ static int __modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
 	if (!entry)
 		return -ENODEV;
 
-	/*
-	 * tmp_ops is registered into ftrace_ops_list here, making it
-	 * visible to all CPUs executing the traced function. However,
-	 * entry->direct is not updated until after this call returns,
-	 * leaving a window where CPUs read the stale (possibly freed)
-	 * direct call address via ftrace_find_rec_direct().
-	 */
-	err = register_ftrace_function_nolock(&tmp_ops);
-	if (err)
-		return err;
-
+	/* Save old address in case we need to roll back on error. */
+	old_addr = entry->direct;
+
+	/*
+	 * Update entry->direct BEFORE registering tmp_ops into
+	 * ftrace_ops_list. This closes the race window where a CPU
+	 * executing the traced function could read the old (potentially
+	 * freed) direct call address between tmp_ops becoming visible
+	 * and entry->direct being updated.
+	 *
+	 * Any CPU that observes tmp_ops in ftrace_ops_list after the
+	 * smp_wmb() below is guaranteed to see the new address when
+	 * it calls ftrace_find_rec_direct().
+	 */
 	mutex_lock(&ftrace_lock);
 	entry->direct = addr;
 	mutex_unlock(&ftrace_lock);
 
+	/*
+	 * Ensure entry->direct store is ordered before tmp_ops
+	 * becomes visible via ftrace_ops_list on weakly-ordered archs.
+	 */
+	smp_wmb();
+
+	err = register_ftrace_function_nolock(&tmp_ops);
+	if (err) {
+		/* tmp_ops never became visible; safe to restore old_addr. */
+		mutex_lock(&ftrace_lock);
+		entry->direct = old_addr;
+		mutex_unlock(&ftrace_lock);
+		return err;
+	}
+
 	/*
 	 * Now that tmp_ops is registered and entry->direct is updated,
 	 * unregister the original ops and clean up.
-- 
2.39.0

^ permalink raw reply related

* Re: Race condition in __modify_ftrace_direct() between tmp_ops registration and direct_functions hash update
From: Greg KH @ 2026-05-17  7:08 UTC (permalink / raw)
  To: Afi0; +Cc: security, linux-kernel, linux-trace-kernel, rostedt, mhiramat
In-Reply-To: <CAEABq7fMcvHpp4+59Mt-QdgGNpWhOqrGWHKmy+qt3tJSYb69kg@mail.gmail.com>

On Sun, May 17, 2026 at 06:24:11AM +0000, Afi0 wrote:
> Signed-off-by: Afi0 <capyenglishlite@gmail.com>

Again, just send a patch with your real name as the documentation asks
for.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v3 10/11] kernel: time, trace: Use trace_call__##name() at guarded tracepoint call sites
From: Thomas Gleixner @ 2026-05-17  7:31 UTC (permalink / raw)
  To: Vineeth Pillai (Google), Anna-Maria Behnsen, Frederic Weisbecker,
	Ingo Molnar, Steven Rostedt, Masami Hiramatsu
  Cc: linux-kernel, linux-trace-kernel, Vineeth Pillai, Peter Zijlstra
In-Reply-To: <20260515135959.2238922-1-vineeth@bitbyteword.org>

On Fri, May 15 2026 at 09:59, Vineeth Pillai wrote:
> ---
>  kernel/time/tick-sched.c       | 12 ++++++------
>  kernel/trace/trace_benchmark.c |  2 +-
>  2 files changed, 7 insertions(+), 7 deletions(-)

Please split that into a tick/sched and trace patch so each can be picked
up in the relevant subsystems.


^ permalink raw reply

* Re: [PATCH 1/9] rv: Fix __user specifier usage in extract_params()
From: Wen Yang @ 2026-05-17  8:48 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Masami Hiramatsu,
	Nam Cao, linux-trace-kernel
  Cc: kernel test robot
In-Reply-To: <20260512140250.262190-2-gmonaco@redhat.com>


Correct.  __user annotates the pointer type, not the stack copy.

Reviewed-by: Wen Yang <wen.yang@linux.dev>


On 5/12/26 22:02, Gabriele Monaco wrote:
> The attributes variables extracted from syscalls in the helper are both
> defined with the __user specifier although only the actual pointer to
> user data should be marked.
> 
> Remove the __user specifier from attr.
> 
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202604150820.Ny143u6X-lkp@intel.com
> Fixes: b133207deb72 ("rv: Add nomiss deadline monitor")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   kernel/trace/rv/monitors/deadline/deadline.h | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/rv/monitors/deadline/deadline.h b/kernel/trace/rv/monitors/deadline/deadline.h
> index 0bbfd2543329..78fca873d61e 100644
> --- a/kernel/trace/rv/monitors/deadline/deadline.h
> +++ b/kernel/trace/rv/monitors/deadline/deadline.h
> @@ -95,7 +95,8 @@ static inline u8 get_server_type(struct task_struct *tsk)
>   static inline int extract_params(struct pt_regs *regs, long id, pid_t *pid_out)
>   {
>   	size_t size = offsetofend(struct sched_attr, sched_flags);
> -	struct sched_attr __user *uattr, attr;
> +	struct sched_attr __user *uattr;
> +	struct sched_attr attr;
>   	int new_policy = -1, ret;
>   	unsigned long args[6];
>   

^ permalink raw reply

* Re: [PATCH 2/9] rv: Fix read_lock scope in per-task DA cleanup
From: Wen Yang @ 2026-05-17  8:51 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Nam Cao,
	linux-trace-kernel
In-Reply-To: <20260512140250.262190-3-gmonaco@redhat.com>


Correct.  tasklist_lock is only needed to guard 
for_each_process_thread(); idle tasks are bound to their CPU and are not 
subject to exit/reparent, so no lock is needed for for_each_present_cpu().


Reviewed-by: Wen Yang <wen.yang@linux.dev>


On 5/12/26 22:02, Gabriele Monaco wrote:
> The da_monitor_reset_all() function for per-task monitors takes
> tasklist_lock while iterating over tasks, then keeps it also while
> iterating over idle tasks (one per CPU). The latter is not necessary
> since the lock needs to guard only for_each_process_thread().
> 
> Use a scoped_guard for more compact syntax and adjust the scope only
> where the lock is necessary.
> 
> Fixes: 30984ccf31b7f ("rv: Refactor da_monitor to minimise macros")
> Fixes: 8259cb14a7068 ("rv: Reset per-task monitors also for idle tasks")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/da_monitor.h | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 39765ff6f098..250888812125 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -272,12 +272,12 @@ static void da_monitor_reset_all(void)
>   	struct task_struct *g, *p;
>   	int cpu;
>   
> -	read_lock(&tasklist_lock);
> -	for_each_process_thread(g, p)
> -		da_monitor_reset(da_get_monitor(p));
> +	scoped_guard(read_lock, &tasklist_lock) {
> +		for_each_process_thread(g, p)
> +			da_monitor_reset(da_get_monitor(p));
> +	}
>   	for_each_present_cpu(cpu)
>   		da_monitor_reset(da_get_monitor(idle_task(cpu)));
> -	read_unlock(&tasklist_lock);
>   }
>   
>   /*

^ permalink raw reply

* Re: [PATCH 3/9] rv: Reset per-task DA monitors before releasing the slot
From: Wen Yang @ 2026-05-17  8:55 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Nam Cao,
	linux-trace-kernel
In-Reply-To: <20260512140250.262190-4-gmonaco@redhat.com>


The fix is correct: task_mon_slot = RV_PER_TASK_MONITOR_INIT
equals CONFIG_RV_PER_TASK_MONITORS, which is one past the end of rv[],
so calling da_monitor_reset_all() after rv_put_task_monitor_slot()
would write into whatever memory follows task_struct.rv[] — which is
randomised and can get quite nasty, as you noted in the review thread.

Overlap note: . 
https://lore.kernel.org/all/f654a17c671469fd8fc9ea438daf2266d05068d4.camel@redhat.com/

We will coordinate to avoid redundancy;
we are happy to defer to your version here.


Reviewed-by: Wen Yang <wen.yang@linux.dev>



On 5/12/26 22:02, Gabriele Monaco wrote:
> Per-task monitors use task_mon_slot to determine which slot in the array
> to use for the monitor. During destruction, this slot is returned but
> this is done before resetting the monitor. As a result, the monitor's
> reset is in fact resetting a slot that is outside of the array
> (RV_PER_TASK_MONITOR_INIT).
> 
> Release the slot only after the reset to avoid out-of-bound memory
> access.
> 
> Fixes: 30984ccf31b7f ("rv: Refactor da_monitor to minimise macros")
> Fixes: 792575348ff70 ("rv/include: Add deterministic automata monitor definition via C macros")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/da_monitor.h | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 250888812125..0b7028df08fb 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -309,10 +309,11 @@ static inline void da_monitor_destroy(void)
>   		WARN_ONCE(1, "Disabling a disabled monitor: " __stringify(MONITOR_NAME));
>   		return;
>   	}
> -	rv_put_task_monitor_slot(task_mon_slot);
> -	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
>   
>   	da_monitor_reset_all();
> +
> +	rv_put_task_monitor_slot(task_mon_slot);
> +	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
>   }
>   
>   #elif RV_MON_TYPE == RV_MON_PER_OBJ

^ permalink raw reply

* Re: [PATCH 4/9] rv: Prevent task migration while handling per-CPU events
From: Wen Yang @ 2026-05-17  8:57 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, linux-trace-kernel; +Cc: Nam Cao
In-Reply-To: <20260512140250.262190-5-gmonaco@redhat.com>


The da_implicit_guard() mechanism is a clean way to inject a monitor-
type-specific guard without cluttering the hot-path helpers.


Reviewed-by: Wen Yang <wen.yang@linux.dev>



On 5/12/26 22:02, Gabriele Monaco wrote:
> Tracepoint handlers are now fully preemptible. When a per-CPU monitor
> handles an event, it retrieves the monitor state using a per-CPU
> pointer. If the event itself doesn't disable preemption, the task can
> migrate to a different CPU and we risk updating the wrong monitor.
> 
> Mitigate this by explicitly disabling task migration before acquiring
> the monitor pointer. This cannot guarantee the monitor runs on the
> correct CPU but reduces the race condition window and prevents warnings.
> 
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/da_monitor.h | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 0b7028df08fb..a9fd284195ee 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -181,6 +181,10 @@ static inline void da_monitor_destroy(void)
>   	da_monitor_reset_all();
>   }
>   
> +#ifndef da_implicit_guard
> +#define da_implicit_guard()
> +#endif
> +
>   #elif RV_MON_TYPE == RV_MON_PER_CPU
>   /*
>    * Functions to define, init and get a per-cpu monitor.
> @@ -230,6 +234,10 @@ static inline void da_monitor_destroy(void)
>   	da_monitor_reset_all();
>   }
>   
> +#ifndef da_implicit_guard
> +#define da_implicit_guard() guard(migrate)()
> +#endif
> +
>   #elif RV_MON_TYPE == RV_MON_PER_TASK
>   /*
>    * Functions to define, init and get a per-task monitor.
> @@ -677,6 +685,7 @@ static inline bool __da_handle_start_run_event(struct da_monitor *da_mon,
>    */
>   static inline void da_handle_event(enum events event)
>   {
> +	da_implicit_guard();
>   	__da_handle_event(da_get_monitor(), event, 0);
>   }
>   
> @@ -692,6 +701,7 @@ static inline void da_handle_event(enum events event)
>    */
>   static inline bool da_handle_start_event(enum events event)
>   {
> +	da_implicit_guard();
>   	return __da_handle_start_event(da_get_monitor(), event, 0);
>   }
>   
> @@ -703,6 +713,7 @@ static inline bool da_handle_start_event(enum events event)
>    */
>   static inline bool da_handle_start_run_event(enum events event)
>   {
> +	da_implicit_guard();
>   	return __da_handle_start_run_event(da_get_monitor(), event, 0);
>   }
>   

^ permalink raw reply

* Re: [PATCH 5/9] rv: Ensure all pending probes terminate on per-obj monitor destroy
From: Wen Yang @ 2026-05-17  9:01 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Nam Cao,
	linux-trace-kernel
In-Reply-To: <20260512140250.262190-6-gmonaco@redhat.com>


Correct for PER_OBJ: kfree() is called directly (not kfree_rcu()) after
hash_del_rcu(), so we must guarantee no concurrent reader holds a 
pointer to mon_storage.  tracepoint_synchronize_unregister() — which 
calls synchronize_rcu_tasks_trace() then synchronize_srcu  —
drains all in-flight probe handlers before the hash walk, making the
subsequent kfree() safe.


Reviewed-by: Wen Yang <wen.yang@linux.dev>



On 5/12/26 22:02, Gabriele Monaco wrote:
> The monitor disable/destroy sequence detaches all probes and resets the
> monitor's data, however it doesn't wait for pending probes. This is an
> issue with per-object monitors, which free the monitor storage.
> 
> Call tracepoint_synchronize_unregister() to make sure to wait for all
> pending probes before destroying the monitor storage.
> 
> Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/da_monitor.h | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index a9fd284195ee..a4a13b62d1a4 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -515,9 +515,10 @@ static inline void da_monitor_destroy(void)
>   	struct hlist_node *tmp;
>   	int bkt;
>   
> +	tracepoint_synchronize_unregister();
>   	/*
> -	 * This function is called after all probes are disabled, we need only
> -	 * worry about concurrency against old events.
> +	 * This function is called after all probes are disabled and no longer
> +	 * pending, we can safely assume no concurrent user.
>   	 */
>   	synchronize_rcu();
>   	hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node) {

^ permalink raw reply

* Re: [PATCH 6/9] rv: Ensure synchronous cleanup for HA monitors
From: Wen Yang @ 2026-05-17  9:12 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Nam Cao,
	linux-trace-kernel
In-Reply-To: <20260512140250.262190-7-gmonaco@redhat.com>

The guard(rcu)() + synchronize_rcu() mechanism for HA timer callbacks
is correct.

One concern: TOCTOU between the pre-check and guard(rcu)().

da_monitor_reset() calls reset_hook BEFORE clearing monitoring:

   da_monitor_reset_hook(da_mon);        /* ha_cancel_timer [async]   */
   WRITE_ONCE(da_mon->monitoring, 0);    /* cleared AFTER reset_hook  */
   da_mon->curr_state = model_get_initial_state();

This may creates a window where the callback pre-check passes but the
monitor is reset before guard(rcu)() is acquired:

   /* __ha_monitor_timer_callback() */
   if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
       return;

   /* passes: monitoring=1
    *
    * WINDOW ─ CPU A runs da_monitor_reset_all() here:
    *   ha_cancel_timer()  [returns: callback is running, cannot cancel]
    *   WRITE_ONCE(monitoring, 0)
    *   curr_state = model_get_initial_state()
    */
   guard(rcu)();
   curr_state = READ_ONCE(ha_mon->da_mon.curr_state);  /* initial_state */
   /* no second da_monitoring() check */
   ha_react(curr_state, EVENT_NONE, env_string.buffer); /* spurious call */
   ha_trace_error_env(ha_mon, ...);                     /* fires 
unconditionally */

Result: spurious ha_trace_error_env() for initial_state.  For existing
monitors (stall/nomiss/opid), model_should_send_event_env(initial, NONE)
returns false, so no false-positive reaction, but the trace event fires.
Monitors where initial_state carries a constraint would produce a false
positive.

Proposed fix : re-check inside the RCU critical section:

   guard(rcu)();
   if (unlikely(!da_monitoring(&ha_mon->da_mon)))  /* re-check here */
       return;
   curr_state = READ_ONCE(ha_mon->da_mon.curr_state);


--
Best wishes,
Wen


On 5/12/26 22:02, Gabriele Monaco wrote:
> HA monitors may start timers, all cleanup functions currently stop the
> timers asynchronously to avoid sleeping in the wrong context.
> Nothing makes sure running callbacks terminate on cleanup.
> 
> Run the entire HA timer callback in an RCU read-side critical section,
> this way we can simply synchronize_rcu() with any pending timer and are
> sure any cleanup using kfree_rcu() runs after callbacks terminated.
> Additionally make sure any unlikely callback running late won't run any
> code if the monitor is marked as disabled.
> 
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/da_monitor.h | 23 +++++++++++++++++++----
>   include/rv/ha_monitor.h | 18 ++++++++++++++++--
>   2 files changed, 35 insertions(+), 6 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index a4a13b62d1a4..402d3b935c08 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -57,6 +57,15 @@ static struct rv_monitor rv_this;
>   #define da_monitor_reset_hook(da_mon)
>   #endif
>   
> +/*
> + * Hook to allow the implementation of hybrid automata: define it with a
> + * function that waits for the termination of all monitors background
> + * activities (e.g. all timers). This hook can sleep.
> + */
> +#ifndef da_monitor_sync_hook
> +#define da_monitor_sync_hook()
> +#endif
> +
>   /*
>    * Type for the target id, default to int but can be overridden.
>    * A long type can work as hash table key (PER_OBJ) but will be downgraded to
> @@ -179,6 +188,7 @@ static inline int da_monitor_init(void)
>   static inline void da_monitor_destroy(void)
>   {
>   	da_monitor_reset_all();
> +	da_monitor_sync_hook();
>   }
>   
>   #ifndef da_implicit_guard
> @@ -232,6 +242,7 @@ static inline int da_monitor_init(void)
>   static inline void da_monitor_destroy(void)
>   {
>   	da_monitor_reset_all();
> +	da_monitor_sync_hook();
>   }
>   
>   #ifndef da_implicit_guard
> @@ -319,6 +330,7 @@ static inline void da_monitor_destroy(void)
>   	}
>   
>   	da_monitor_reset_all();
> +	da_monitor_sync_hook();
>   
>   	rv_put_task_monitor_slot(task_mon_slot);
>   	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
> @@ -497,10 +509,9 @@ static void da_monitor_reset_all(void)
>   	struct da_monitor_storage *mon_storage;
>   	int bkt;
>   
> -	rcu_read_lock();
> +	guard(rcu)();
>   	hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node)
>   		da_monitor_reset(&mon_storage->rv.da_mon);
> -	rcu_read_unlock();
>   }
>   
>   static inline int da_monitor_init(void)
> @@ -516,13 +527,17 @@ static inline void da_monitor_destroy(void)
>   	int bkt;
>   
>   	tracepoint_synchronize_unregister();
> +	scoped_guard(rcu) {
> +		hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node) {
> +			da_monitor_reset_hook(&mon_storage->rv.da_mon);
> +		}
> +	}
> +	da_monitor_sync_hook();
>   	/*
>   	 * This function is called after all probes are disabled and no longer
>   	 * pending, we can safely assume no concurrent user.
>   	 */
> -	synchronize_rcu();
>   	hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node) {
> -		da_monitor_reset_hook(&mon_storage->rv.da_mon);
>   		hash_del_rcu(&mon_storage->node);
>   		kfree(mon_storage);
>   	}
> diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
> index d59507e8cb30..47ff1a41febe 100644
> --- a/include/rv/ha_monitor.h
> +++ b/include/rv/ha_monitor.h
> @@ -36,6 +36,7 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
>   #define da_monitor_event_hook ha_monitor_handle_constraint
>   #define da_monitor_init_hook ha_monitor_init_env
>   #define da_monitor_reset_hook ha_monitor_reset_env
> +#define da_monitor_sync_hook() synchronize_rcu()
>   
>   #include <rv/da_monitor.h>
>   #include <linux/seq_buf.h>
> @@ -237,12 +238,25 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
>   	return false;
>   }
>   
> +/*
> + * __ha_monitor_timer_callback - generic callback representation
> + *
> + * This callback runs in an RCU read-side critical section to allow the
> + * destruction sequence to easily synchronize_rcu() with all pending timer
> + * after asynchronously disabling them.
> + */
>   static inline void __ha_monitor_timer_callback(struct ha_monitor *ha_mon)
>   {
> -	enum states curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
>   	DECLARE_SEQ_BUF(env_string, ENV_BUFFER_SIZE);
> -	u64 time_ns = ha_get_ns();
> +	enum states curr_state;
> +	u64 time_ns;
> +
> +	if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
> +		return;
>   
> +	guard(rcu)();
> +	curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
> +	time_ns = ha_get_ns();
>   	ha_get_env_string(&env_string, ha_mon, time_ns);
>   	ha_react(curr_state, EVENT_NONE, env_string.buffer);
>   	ha_trace_error_env(ha_mon, model_get_state_name(curr_state),

^ permalink raw reply

* Re: [PATCH 7/9] rv: Do not rely on clean monitor when initialising HA
From: Wen Yang @ 2026-05-17  9:15 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Masami Hiramatsu,
	Nam Cao, linux-trace-kernel
In-Reply-To: <20260512140250.262190-8-gmonaco@redhat.com>


The ha_mon_initializing flag correctly
guards the init path against stale monitoring=1 (from either a previous
different monitor type or a race during teardown of the previous instance).


Reviewed-by: Wen Yang <wen.yang@linux.dev>


On 5/12/26 22:02, Gabriele Monaco wrote:
> Hybrid Automata monitors hook into the DA implementation when doing
> da_monitor_reset(). This function is called both on initialisation and
> teardown, HA monitors try to cancel a timer only when it's initialised
> relying on the da_mon->monitoring flag. This flag could however be
> corrupted during initialisation. This happens for instance on per-task
> monitors that share the same storage with different type of monitors
> like LTL or in case of races during a previous teardown.
> 
> Stop relying on the monitoring flag during initialisation, assume that
> can have any value, so skip timer cancellation in any case when a local
> flag is set. New monitors (e.g. new tasks) are always zero-initialised
> so they are safe.
> 
> Reported-by: Wen Yang <wen.yang@linux.dev>
> Closes: https://lore.kernel.org/lkml/d02c656aada7d071f083460a5c9a454363669b61.1778522945.git.wen.yang@linux.dev
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/ha_monitor.h                       | 31 ++++++++++++++++++-
>   kernel/trace/rv/monitors/nomiss/nomiss.c      |  4 +--
>   kernel/trace/rv/monitors/opid/opid.c          |  4 +--
>   kernel/trace/rv/monitors/stall/stall.c        |  4 +--
>   .../rvgen/rvgen/templates/dot2k/main.c        |  4 +--
>   5 files changed, 38 insertions(+), 9 deletions(-)
> 
> diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
> index 47ff1a41febe..11ae85bad492 100644
> --- a/include/rv/ha_monitor.h
> +++ b/include/rv/ha_monitor.h
> @@ -116,6 +116,35 @@ static enum hrtimer_restart ha_monitor_timer_callback(struct hrtimer *hrtimer);
>   #define ha_get_ns() 0
>   #endif /* HA_CLK_NS */
>   
> +static bool ha_mon_initializing;
> +
> +static int ha_monitor_init(void)
> +{
> +	int ret;
> +
> +	ha_mon_initializing = true;
> +	ret = da_monitor_init();
> +	ha_mon_initializing = false;
> +	return ret;
> +}
> +
> +static void ha_monitor_destroy(void)
> +{
> +	da_monitor_destroy();
> +}
> +
> +/*
> + * ha_monitor_uninitialized - are fields like the timer initialized?
> + *
> + * On a clean monitor, we can assume an active monitor (monitoring) is
> + * initialized, however the monitoring field cannot be trusted during
> + * initialization.
> + */
> +static inline bool ha_monitor_uninitialized(struct da_monitor *da_mon)
> +{
> +	return ha_mon_initializing || !da_monitoring(da_mon);
> +}
> +
>   /* Should be supplied by the monitor */
>   static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs env, u64 time_ns);
>   static bool ha_verify_constraint(struct ha_monitor *ha_mon,
> @@ -160,7 +189,7 @@ static inline void ha_monitor_reset_env(struct da_monitor *da_mon)
>   	struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
>   
>   	/* Initialisation resets the monitor before initialising the timer */
> -	if (likely(da_monitoring(da_mon)))
> +	if (likely(!ha_monitor_uninitialized(da_mon)))
>   		ha_cancel_timer(ha_mon);
>   }
>   
> diff --git a/kernel/trace/rv/monitors/nomiss/nomiss.c b/kernel/trace/rv/monitors/nomiss/nomiss.c
> index 31f90f3638d8..8ead8783c29f 100644
> --- a/kernel/trace/rv/monitors/nomiss/nomiss.c
> +++ b/kernel/trace/rv/monitors/nomiss/nomiss.c
> @@ -227,7 +227,7 @@ static int enable_nomiss(void)
>   {
>   	int retval;
>   
> -	retval = da_monitor_init();
> +	retval = ha_monitor_init();
>   	if (retval)
>   		return retval;
>   
> @@ -263,7 +263,7 @@ static void disable_nomiss(void)
>   	rv_detach_trace_probe("nomiss", sched_switch, handle_sched_switch);
>   	rv_detach_trace_probe("nomiss", sched_wakeup, handle_sched_wakeup);
>   
> -	da_monitor_destroy();
> +	ha_monitor_destroy();
>   }
>   
>   static struct rv_monitor rv_this = {
> diff --git a/kernel/trace/rv/monitors/opid/opid.c b/kernel/trace/rv/monitors/opid/opid.c
> index 4594c7c46601..2922318c6112 100644
> --- a/kernel/trace/rv/monitors/opid/opid.c
> +++ b/kernel/trace/rv/monitors/opid/opid.c
> @@ -73,7 +73,7 @@ static int enable_opid(void)
>   {
>   	int retval;
>   
> -	retval = da_monitor_init();
> +	retval = ha_monitor_init();
>   	if (retval)
>   		return retval;
>   
> @@ -90,7 +90,7 @@ static void disable_opid(void)
>   	rv_detach_trace_probe("opid", sched_set_need_resched_tp, handle_sched_need_resched);
>   	rv_detach_trace_probe("opid", sched_waking, handle_sched_waking);
>   
> -	da_monitor_destroy();
> +	ha_monitor_destroy();
>   }
>   
>   /*
> diff --git a/kernel/trace/rv/monitors/stall/stall.c b/kernel/trace/rv/monitors/stall/stall.c
> index 9ccfda6b0e73..3c38fb1a0159 100644
> --- a/kernel/trace/rv/monitors/stall/stall.c
> +++ b/kernel/trace/rv/monitors/stall/stall.c
> @@ -103,7 +103,7 @@ static int enable_stall(void)
>   {
>   	int retval;
>   
> -	retval = da_monitor_init();
> +	retval = ha_monitor_init();
>   	if (retval)
>   		return retval;
>   
> @@ -120,7 +120,7 @@ static void disable_stall(void)
>   	rv_detach_trace_probe("stall", sched_switch, handle_sched_switch);
>   	rv_detach_trace_probe("stall", sched_wakeup, handle_sched_wakeup);
>   
> -	da_monitor_destroy();
> +	ha_monitor_destroy();
>   }
>   
>   static struct rv_monitor rv_this = {
> diff --git a/tools/verification/rvgen/rvgen/templates/dot2k/main.c b/tools/verification/rvgen/rvgen/templates/dot2k/main.c
> index bf0999f6657a..889446760e3c 100644
> --- a/tools/verification/rvgen/rvgen/templates/dot2k/main.c
> +++ b/tools/verification/rvgen/rvgen/templates/dot2k/main.c
> @@ -35,7 +35,7 @@ static int enable_%%MODEL_NAME%%(void)
>   {
>   	int retval;
>   
> -	retval = da_monitor_init();
> +	retval = %%MONITOR_CLASS%%_monitor_init();
>   	if (retval)
>   		return retval;
>   
> @@ -50,7 +50,7 @@ static void disable_%%MODEL_NAME%%(void)
>   
>   %%TRACEPOINT_DETACH%%
>   
> -	da_monitor_destroy();
> +	%%MONITOR_CLASS%%_monitor_destroy();
>   }
>   
>   /*

^ permalink raw reply

* Re: [PATCH 8/9] rv: Add automatic cleanup handlers for per-task HA monitors
From: Wen Yang @ 2026-05-17  9:40 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Nam Cao,
	linux-trace-kernel
In-Reply-To: <20260512140250.262190-9-gmonaco@redhat.com>

ha_cancel_timer_sync() is the right choice for task exit; the
ha_mon_initializing guard correctly handles the init-window race.

One issue: after ha_monitor_disable_hook(), an in-flight
ha_handle_sched_process_exit() handler may still be executing.  It
reads task_mon_slot via da_get_monitor() (&p->rv[task_mon_slot]);
da_monitor_sync_hook() = synchronize_rcu() cannot drain it because
tracepoint handlers run outside any RCU read-side section.  If
rv_put_task_monitor_slot() writes RV_PER_TASK_MONITOR_INIT to
task_mon_slot first, the handler dereferences an OOB index.

This is the same race Patch 5 closes for PER_OBJ with
tracepoint_synchronize_unregister(); the PER_TASK da_monitor_destroy()
needs the same call (and so does every other PER_TASK monitor, not only
the new exit handler).

Could you add tracepoint_synchronize_unregister() to the PER_TASK
da_monitor_destroy() ?  Alternatively, we can carry the fix on top of
your series.

--
Best wishes,
Wen


On 5/12/26 22:02, Gabriele Monaco wrote:
> Hybrid automata monitors may start timers, depending on the model, these
> may remain active on an exiting task and cause false positives or even
> access freed memory.
> 
> Add an enable/disable hook in the HA code, currently only populated by
> the per-task handler for registration and deregistration.
> This hooks to the sched_process_exit event and ensures the timer is
> stopped for every exiting task. The handler is enabled automatically but
> may be disabled, for instance if the monitor uses the event for another
> purpose (but should still manually ensure timers are stopped).
> 
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/ha_monitor.h | 44 +++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 44 insertions(+)
> 
> diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
> index 11ae85bad492..1bdf866e9c63 100644
> --- a/include/rv/ha_monitor.h
> +++ b/include/rv/ha_monitor.h
> @@ -28,6 +28,7 @@ static inline void ha_monitor_init_env(struct da_monitor *da_mon);
>   static inline void ha_monitor_reset_env(struct da_monitor *da_mon);
>   static inline void ha_setup_timer(struct ha_monitor *ha_mon);
>   static inline bool ha_cancel_timer(struct ha_monitor *ha_mon);
> +static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon);
>   static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
>   					 enum states curr_state,
>   					 enum events event,
> @@ -38,6 +39,26 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
>   #define da_monitor_reset_hook ha_monitor_reset_env
>   #define da_monitor_sync_hook() synchronize_rcu()
>   
> +#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
> +/*
> + * Automatic cleanup handlers for per-task HA monitors, only skip if you know
> + * what you are doing (e.g. you want to implement cleanup manually in another
> + * handler doing more things).
> + */
> +static void ha_handle_sched_process_exit(void *data, struct task_struct *p,
> +					 bool group_dead);
> +
> +#define ha_monitor_enable_hook()                                             \
> +	rv_attach_trace_probe(__stringify(MONITOR_NAME), sched_process_exit, \
> +			      ha_handle_sched_process_exit)
> +#define ha_monitor_disable_hook()                                            \
> +	rv_detach_trace_probe(__stringify(MONITOR_NAME), sched_process_exit, \
> +			      ha_handle_sched_process_exit)
> +#else
> +#define ha_monitor_enable_hook()
> +#define ha_monitor_disable_hook()
> +#endif
> +
>   #include <rv/da_monitor.h>
>   #include <linux/seq_buf.h>
>   
> @@ -124,12 +145,14 @@ static int ha_monitor_init(void)
>   
>   	ha_mon_initializing = true;
>   	ret = da_monitor_init();
> +	ha_monitor_enable_hook();
>   	ha_mon_initializing = false;
>   	return ret;
>   }
>   
>   static void ha_monitor_destroy(void)
>   {
> +	ha_monitor_disable_hook();
>   	da_monitor_destroy();
>   }
>   
> @@ -230,6 +253,18 @@ static inline void ha_trace_error_env(struct ha_monitor *ha_mon,
>   {
>   	CONCATENATE(trace_error_env_, MONITOR_NAME)(id, curr_state, event, env);
>   }
> +
> +#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
> +static void ha_handle_sched_process_exit(void *data, struct task_struct *p,
> +					 bool group_dead)
> +{
> +	struct da_monitor *da_mon = da_get_monitor(p);
> +
> +	if (likely(!ha_monitor_uninitialized(da_mon)))
> +		ha_cancel_timer_sync(to_ha_monitor(da_mon));
> +}
> +#endif
> +
>   #endif /* RV_MON_TYPE */
>   
>   /*
> @@ -455,6 +490,10 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
>   {
>   	return timer_delete(&ha_mon->timer);
>   }
> +static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
> +{
> +	timer_delete_sync(&ha_mon->timer);
> +}
>   #elif HA_TIMER_TYPE == HA_TIMER_HRTIMER
>   /*
>    * Helper functions to handle the monitor timer.
> @@ -506,6 +545,10 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
>   {
>   	return hrtimer_try_to_cancel(&ha_mon->hrtimer) == 1;
>   }
> +static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
> +{
> +	hrtimer_cancel(&ha_mon->hrtimer);
> +}
>   #else /* HA_TIMER_NONE */
>   /*
>    * Start function is intentionally not defined, monitors using timers must
> @@ -516,6 +559,7 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
>   {
>   	return false;
>   }
> +static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon) { }
>   #endif
>   
>   #endif

^ permalink raw reply

* Re: [PATCH 9/9] rv: Mandate deallocation for per-obj monitors
From: Wen Yang @ 2026-05-17  9:52 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Masami Hiramatsu,
	linux-trace-kernel
  Cc: Nam Cao
In-Reply-To: <20260512140250.262190-10-gmonaco@redhat.com>


One gap: tools/verification/rvgen/rvgen/templates/dot2k/main.c uses 
RV_MON_%%MONITOR_TYPE%% but generates no deallocation code, may fail
to build with a -Wunused-function warning.


--
Best wishes,
Wen

On 5/12/26 22:02, Gabriele Monaco wrote:
> The per-object monitors use a hash tables and dynamic allocation of the
> monitor storage, functions to clean a monitor that is no longer needed
> are provided but nothing ensures the monitor actually uses them.
> 
> Remove the inline specifier on the deallocation function to let the
> compiler warn in case it isn't referenced. If the monitor really doesn't
> need one (for instance because instances will never cease to exist
> before disabling the monitor), the da_skip_deallocation() helper macro
> can be used to silence the warning.
> 
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/da_monitor.h                      | 14 +++++++++++++-
>   kernel/trace/rv/monitors/deadline/deadline.h |  5 ++++-
>   2 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 402d3b935c08..378d23ab7dfb 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -489,8 +489,11 @@ static inline monitor_target da_get_target_by_id(da_id_type id)
>    * locks.
>    * This function includes an RCU read-side critical section to synchronise
>    * against da_monitor_destroy().
> + * NOTE: inline is omitted on purpose to let the compiler warn if this function
> + * is never referenced. For monitors that don't require a deallocation hook,
> + * da_skip_deallocation() can be used.
>    */
> -static inline void da_destroy_storage(da_id_type id)
> +static void da_destroy_storage(da_id_type id)
>   {
>   	struct da_monitor_storage *mon_storage;
>   
> @@ -504,6 +507,15 @@ static inline void da_destroy_storage(da_id_type id)
>   	kfree_rcu(mon_storage, rcu);
>   }
>   
> +/*
> + * da_skip_deallocation - explicitly mark a deallocation function as not required
> + *
> + * Only use when you are absolutely sure the monitor doesn't require a
> + * deallocation hook (i.e. it's not possible for an object to finish existing
> + * when the monitor is still running).
> + */
> +#define da_skip_deallocation(hook) ((void)hook)
> +
>   static void da_monitor_reset_all(void)
>   {
>   	struct da_monitor_storage *mon_storage;
> diff --git a/kernel/trace/rv/monitors/deadline/deadline.h b/kernel/trace/rv/monitors/deadline/deadline.h
> index 78fca873d61e..c39fd79148c2 100644
> --- a/kernel/trace/rv/monitors/deadline/deadline.h
> +++ b/kernel/trace/rv/monitors/deadline/deadline.h
> @@ -194,7 +194,10 @@ static void __maybe_unused handle_newtask(void *data, struct task_struct *task,
>   		da_create_storage(EXPAND_ID_TASK(task), NULL);
>   }
>   
> -static void __maybe_unused handle_exit(void *data, struct task_struct *p, bool group_dead)
> +/*
> + * Deallocation hook, use da_skip_deallocation() when not necessary
> + */
> +static void handle_exit(void *data, struct task_struct *p, bool group_dead)
>   {
>   	if (p->policy == SCHED_DEADLINE)
>   		da_destroy_storage(get_entity_id(&p->dl, DL_TASK, DL_TASK));

^ permalink raw reply

* [PATCH] ftrace: fix race in __modify_ftrace_direct() between tmp_ops registration and direct_functions update
From: Andrii Kuchmenko @ 2026-05-17 11:01 UTC (permalink / raw)
  To: linux-trace-kernel
  Cc: rostedt, mhiramat, linux-kernel, Andrii Kuchmenko, stable

In __modify_ftrace_direct(), register_ftrace_function_nolock() makes
tmp_ops visible in ftrace_ops_list before entry->direct is updated
under ftrace_lock. During this window any CPU entering the traced
function calls call_direct_funcs(), reads the old address from
direct_functions via RCU, and jumps to it via
arch_ftrace_set_direct_caller(). If the caller freed or invalidated
the old trampoline before calling modify_ftrace_direct(), this is a
use-after-free in executable code context.

The race window:

  CPU 0 (__modify_ftrace_direct)       CPU 1 (executing traced func)
  ──────────────────────────────       ──────────────────────────────
  register_ftrace_function_nolock()
    -> tmp_ops visible in ops_list
                                        call_direct_funcs()
                                          ftrace_find_rec_direct() -> old_addr
                                          arch_ftrace_set_direct_caller(old_addr)
                                          jump to old_addr  <- UAF if freed
  mutex_lock(&ftrace_lock)
  entry->direct = addr   <- too late
  mutex_unlock(&ftrace_lock)

Fix: update entry->direct under ftrace_lock BEFORE registering tmp_ops.
Any CPU that observes tmp_ops in ftrace_ops_list after this point will
already see the new address when it calls ftrace_find_rec_direct().
Add smp_wmb() between the store and the registration to ensure the
write is visible on weakly-ordered architectures before tmp_ops
becomes observable via ftrace_ops_list.

On error from register_ftrace_function_nolock(), restore entry->direct
to old_addr since tmp_ops never became visible to other CPUs.

This affects all callers of __modify_ftrace_direct(), including:
  - modify_ftrace_direct() used by kernel modules and live patching
  - modify_ftrace_direct_nolock() used by BPF trampolines
    (kernel/bpf/trampoline.c) reachable with CAP_BPF + CAP_PERFMON

Fixes: 0567d6809440 ("ftrace: Add modify_ftrace_direct()")
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Andrii Kuchmenko <capyenglishlite@gmail.com>
---
 kernel/trace/ftrace.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index a1b2c3d4e5f6..b7c8d9e0f1a2 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5950,6 +5950,7 @@ static int __modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
 	struct ftrace_func_entry *entry;
 	struct ftrace_ops tmp_ops;
+	unsigned long old_addr;
 	int err;
 
 	lockdep_assert_held(&direct_mutex);
@@ -5960,22 +5961,36 @@ static int __modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
 	if (!entry)
 		return -ENODEV;
 
-	/*
-	 * tmp_ops is registered into ftrace_ops_list here, making it
-	 * visible to all CPUs executing the traced function. However,
-	 * entry->direct is not updated until after this call returns,
-	 * leaving a window where CPUs read the stale (possibly freed)
-	 * direct call address via ftrace_find_rec_direct().
-	 */
-	err = register_ftrace_function_nolock(&tmp_ops);
-	if (err)
-		return err;
-
+	/* Save old address in case we need to roll back on error. */
+	old_addr = entry->direct;
+
+	/*
+	 * Update entry->direct BEFORE registering tmp_ops into
+	 * ftrace_ops_list. This closes the race window where a CPU
+	 * executing the traced function could read the old (potentially
+	 * freed) direct call address between tmp_ops becoming visible
+	 * and entry->direct being updated.
+	 *
+	 * Any CPU that observes tmp_ops in ftrace_ops_list after the
+	 * smp_wmb() below is guaranteed to see the new address when
+	 * it calls ftrace_find_rec_direct().
+	 */
 	mutex_lock(&ftrace_lock);
 	entry->direct = addr;
 	mutex_unlock(&ftrace_lock);
 
+	/*
+	 * Ensure entry->direct store is ordered before tmp_ops
+	 * becomes visible via ftrace_ops_list on weakly-ordered archs.
+	 */
+	smp_wmb();
+
+	err = register_ftrace_function_nolock(&tmp_ops);
+	if (err) {
+		/* tmp_ops never became visible; safe to restore old_addr. */
+		mutex_lock(&ftrace_lock);
+		entry->direct = old_addr;
+		mutex_unlock(&ftrace_lock);
+		return err;
+	}
+
 	/*
 	 * Now that tmp_ops is registered and entry->direct is updated,
 	 * unregister the original ops and clean up.
-- 
2.39.0

^ permalink raw reply related

* Re: [PATCH 1/7] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Jiri Olsa @ 2026-05-17 11:45 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <CAEf4BzZxNLEeyDE2aJVm=T=RzDayaHzAbHxXQqn6M6uqa45obA@mail.gmail.com>

On Fri, May 15, 2026 at 01:31:31PM -0700, Andrii Nakryiko wrote:
> On Thu, May 14, 2026 at 6:53 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Andrii reported an issue with optimized uprobes [1] that can clobber
> > redzone area with call instruction storing return address on stack
> > where user code may keep temporary data without adjusting rsp.
> >
> > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > instruction, so we can squeeze another instruction to escape the
> > redzone area before doing the call, like:
> >
> >   lea -0x80(%rsp), %rsp
> >   call tramp
> >
> > Note the lea instruction is used to adjust the rsp register without
> > changing the flags.
> 
> I think it should be very loudly explained that we can't go back to
> nop10 and have to do short jump over patched sequence (and why).

there's comment in swbp_unoptimize:

         * We have optimized nop10 (lea, call), changing it to 'jmp rel8' to
         * end of the 10-byte slot instead of restoring the original nop10,
         * because we could have thread already inside lea instruction.

I'll add it in here as well

> 
> >
> > The optimized uprobe performance stays the same:
> >
> >         uprobe-nop     :    3.129 ± 0.013M/s
> >         uprobe-push    :    3.045 ± 0.006M/s
> >         uprobe-ret     :    1.095 ± 0.004M/s
> >   -->   uprobe-nop10   :    7.170 ± 0.020M/s
> >         uretprobe-nop  :    2.143 ± 0.021M/s
> >         uretprobe-push :    2.090 ± 0.000M/s
> >         uretprobe-ret  :    0.942 ± 0.000M/s
> >   -->   uretprobe-nop10:    3.381 ± 0.003M/s
> >         usdt-nop       :    3.245 ± 0.004M/s
> >   -->   usdt-nop10     :    7.256 ± 0.023M/s
> >
> > [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > Reported-by: Andrii Nakryiko <andrii@kernel.org>
> > Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  arch/x86/kernel/uprobes.c | 121 +++++++++++++++++++++++++++-----------
> >  1 file changed, 86 insertions(+), 35 deletions(-)
> >
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index ebb1baf1eb1d..f7c4101a4039 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -636,9 +636,21 @@ struct uprobe_trampoline {
> >         unsigned long           vaddr;
> >  };
> >
> > +#define LEA_INSN_SIZE          5
> > +#define OPT_INSN_SIZE          (LEA_INSN_SIZE + CALL_INSN_SIZE)
> > +#define OPT_JMP8_OFFSET                (OPT_INSN_SIZE - JMP8_INSN_SIZE)
> > +#define REDZONE_SIZE           0x80
> > +
> > +static const u8 lea_rsp[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
> > +
> > +static bool is_lea_insn(const uprobe_opcode_t *insn)
> > +{
> > +       return !memcmp(insn, lea_rsp, LEA_INSN_SIZE);
> > +}
> > +
> >  static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
> >  {
> > -       long delta = (long)(vaddr + 5 - vtramp);
> > +       long delta = (long)(vaddr + OPT_INSN_SIZE - vtramp);
> >
> >         return delta >= INT_MIN && delta <= INT_MAX;
> >  }
> > @@ -651,7 +663,7 @@ static unsigned long find_nearest_trampoline(unsigned long vaddr)
> >         };
> >         unsigned long low_limit, high_limit;
> >         unsigned long low_tramp, high_tramp;
> > -       unsigned long call_end = vaddr + 5;
> > +       unsigned long call_end = vaddr + OPT_INSN_SIZE;
> >
> >         if (check_add_overflow(call_end, INT_MIN, &low_limit))
> >                 low_limit = PAGE_SIZE;
> > @@ -826,8 +838,8 @@ SYSCALL_DEFINE0(uprobe)
> 
> should we change -ENXIO to -EPROTO or some other distinct error code,
> so libbpf can avoid using nop5 attachment on kernels new enough to
> support nop5 optimization, but old enough to not do this properly with
> nop10?

right, I'll take that change as well

thanks,
jirka

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox