* Re: [PATCH v3] tracing/hist: bound synthetic-field strings with seq_buf
From: Steven Rostedt @ 2026-04-14 8:58 UTC (permalink / raw)
To: Pengpeng Hou
Cc: Masami Hiramatsu, Tom Zanussi, Mathieu Desnoyers,
linux-trace-kernel, linux-kernel
In-Reply-To: <20260409103001.1-tracing-hist-synth-v3-pengpeng@iscas.ac.cn>
On Thu, 9 Apr 2026 10:19:43 +0800
Pengpeng Hou <pengpeng@iscas.ac.cn> wrote:
Hi Pengpeng,
Note, the tracing subsystem uses capital letters in the subject:
Subject: tracing: Bound synthetic-field strings with seq_buf
> The synthetic field helpers build a prefixed synthetic variable name and
> a generated hist command in fixed MAX_FILTER_STR_VAL buffers. The
> current code appends those strings with raw strcat(), so long key lists,
> field names, or saved filters can run past the end of the staging
> buffers.
>
> Build both strings with seq_buf and propagate -E2BIG if either the
> synthetic variable name or the generated command exceeds
> MAX_FILTER_STR_VAL. This keeps the existing tracing-side limit while
> using the helper intended for bounded command construction.
>
> Fixes: 02205a6752f2 ("tracing: Add support for 'field variables'")
> Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
> ---
> Changes since v2: https://lore.kernel.org/all/20260401112224.85582-2-pengpeng@iscas.ac.cn/
>
> - switch the synthetic name and generated command construction to seq_buf
> as suggested by Steven Rostedt
> - keep MAX_FILTER_STR_VAL as the tracing-side limit and return -E2BIG on
> overflow
>
> kernel/trace/trace_events_hist.c | 44 ++++++++++++++++++++++----------
> 1 file changed, 30 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
> index 73ea180cad55..7c3873719beb 100644
> --- a/kernel/trace/trace_events_hist.c
> +++ b/kernel/trace/trace_events_hist.c
> @@ -8,6 +8,7 @@
> #include <linux/module.h>
> #include <linux/kallsyms.h>
> #include <linux/security.h>
> +#include <linux/seq_buf.h>
> #include <linux/mutex.h>
> #include <linux/slab.h>
> #include <linux/stacktrace.h>
> @@ -2962,14 +2963,21 @@ find_synthetic_field_var(struct hist_trigger_data *target_hist_data,
> char *system, char *event_name, char *field_name)
> {
> struct hist_field *event_var;
> + struct seq_buf s;
> char *synthetic_name;
>
> synthetic_name = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
> if (!synthetic_name)
> return ERR_PTR(-ENOMEM);
>
> - strcpy(synthetic_name, "synthetic_");
> - strcat(synthetic_name, field_name);
> + seq_buf_init(&s, synthetic_name, MAX_FILTER_STR_VAL);
> + seq_buf_puts(&s, "synthetic_");
> + seq_buf_puts(&s, field_name);
Should have a comment here specifying what the seq_buf_str() is doing:
/* Terminate synthetic_name with a nul */
> + seq_buf_str(&s);
> + if (seq_buf_has_overflowed(&s)) {
> + kfree(synthetic_name);
> + return ERR_PTR(-E2BIG);
> + }
>
> event_var = find_event_var(target_hist_data, system, event_name, synthetic_name);
>
> @@ -3014,6 +3022,7 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data,
> struct trace_event_file *file;
> struct hist_field *key_field;
> struct hist_field *event_var;
> + struct seq_buf s;
> char *saved_filter;
> char *cmd;
> int ret;
> @@ -3046,41 +3055,48 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data,
> /* See if a synthetic field variable has already been created */
> event_var = find_synthetic_field_var(target_hist_data, subsys_name,
> event_name, field_name);
> - if (!IS_ERR_OR_NULL(event_var))
> + if (IS_ERR(event_var))
> + return event_var;
> + if (event_var)
> return event_var;
Note, the above is equivalent to:
if (event_var)
return event_var;
And since it is a separate issue from the bounding of the string, it
should be a separate patch.
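The equivalence can be checked outside the kernel. The following is a minimal userspace sketch, assuming simplified stand-ins for ERR_PTR() and IS_ERR() modelled on the kernel helpers; two_checks(), one_check() and check_forms_agree() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's error-pointer helpers (assumption). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Form used in the patch: two checks, both returning the same pointer. */
static void *two_checks(void *event_var, void *fallthrough)
{
	if (IS_ERR(event_var))
		return event_var;
	if (event_var)
		return event_var;
	return fallthrough;
}

/* Simplified form: an error pointer is never NULL, so one check suffices. */
static void *one_check(void *event_var, void *fallthrough)
{
	if (event_var)
		return event_var;
	return fallthrough;
}

/* Both forms must agree for NULL, a valid pointer, and an error pointer. */
static void check_forms_agree(void)
{
	int object = 0, marker = 0;
	void *inputs[] = { NULL, &object, ERR_PTR(-ENOMEM) };

	for (size_t i = 0; i < sizeof(inputs) / sizeof(inputs[0]); i++)
		assert(two_checks(inputs[i], &marker) ==
		       one_check(inputs[i], &marker));
}
```

Calling check_forms_agree() from a test program exercises the three interesting inputs; only the NULL case falls through to the rest of the function in either form.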
>
> var_hist = kzalloc_obj(*var_hist);
> if (!var_hist)
> return ERR_PTR(-ENOMEM);
>
> + saved_filter = find_trigger_filter(hist_data, file);
Why did you move this up here?
> +
> cmd = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
> if (!cmd) {
> kfree(var_hist);
> return ERR_PTR(-ENOMEM);
> }
>
> + seq_buf_init(&s, cmd, MAX_FILTER_STR_VAL);
> +
> /* Use the same keys as the compatible histogram */
> - strcat(cmd, "keys=");
> + seq_buf_puts(&s, "keys=");
>
> for_each_hist_key_field(i, hist_data) {
> key_field = hist_data->fields[i];
> if (!first)
> - strcat(cmd, ",");
> - strcat(cmd, key_field->field->name);
> + seq_buf_putc(&s, ',');
> + seq_buf_puts(&s, key_field->field->name);
> first = false;
> }
>
> /* Create the synthetic field variable specification */
> - strcat(cmd, ":synthetic_");
> - strcat(cmd, field_name);
> - strcat(cmd, "=");
> - strcat(cmd, field_name);
> + seq_buf_printf(&s, ":synthetic_%s=%s", field_name, field_name);
>
> /* Use the same filter as the compatible histogram */
> - saved_filter = find_trigger_filter(hist_data, file);
It makes more sense to define saved_filter next to where it is used.
> - if (saved_filter) {
> - strcat(cmd, " if ");
> - strcat(cmd, saved_filter);
> + if (saved_filter)
> + seq_buf_printf(&s, " if %s", saved_filter);
> +
> + seq_buf_str(&s);
> + if (seq_buf_has_overflowed(&s)) {
> + kfree(cmd);
> + kfree(var_hist);
> + return ERR_PTR(-E2BIG);
> }
>
> var_hist->cmd = kstrdup(cmd, GFP_KERNEL);
-- Steve
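For readers unfamiliar with the pattern under review, here is a minimal userspace sketch of bounded string construction with a latched overflow state. It is only loosely modelled on what seq_buf provides; struct bounded_buf, bb_init(), bb_puts(), bb_has_overflowed() and build_synthetic_name() are hypothetical names, not the kernel API, and the exact kernel semantics may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace analogue of a bounded builder; not the kernel API. */
struct bounded_buf {
	char *buf;
	size_t size;	/* total capacity, including room for the NUL */
	size_t len;	/* bytes written; set to size + 1 on overflow */
};

static void bb_init(struct bounded_buf *b, char *buf, size_t size)
{
	assert(size > 0);
	b->buf = buf;
	b->size = size;
	b->len = 0;
	buf[0] = '\0';
}

static bool bb_has_overflowed(const struct bounded_buf *b)
{
	return b->len > b->size;
}

/* Append str; if it does not fit, write nothing and latch the overflow. */
static void bb_puts(struct bounded_buf *b, const char *str)
{
	size_t n = strlen(str);

	if (bb_has_overflowed(b))
		return;
	if (n >= b->size - b->len) {	/* need n bytes plus the NUL */
		b->len = b->size + 1;
		return;
	}
	memcpy(b->buf + b->len, str, n + 1);
	b->len += n;
}

/* Build "synthetic_<field_name>"; returns 0 on success, -1 if too long. */
static int build_synthetic_name(char *out, size_t size, const char *field_name)
{
	struct bounded_buf b;

	bb_init(&b, out, size);
	bb_puts(&b, "synthetic_");
	bb_puts(&b, field_name);
	return bb_has_overflowed(&b) ? -1 : 0;
}
```

The caller only has to test the overflow state once, after all appends, which is the property the patch relies on when it returns -E2BIG.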
^ permalink raw reply
* Re: [PATCH v5 0/3] tracing/fprobe: Fix fprobe_ip_table related bugs
From: Masami Hiramatsu @ 2026-04-14 1:19 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Steven Rostedt, Menglong Dong, Mathieu Desnoyers, jiang.biao,
linux-kernel, linux-trace-kernel
In-Reply-To: <177606956628.929411.17392736689322577701.stgit@devnote2>
On Mon, 13 Apr 2026 17:39:26 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> Here is the 5th series of patches to fix bugs in fprobe.
> The previous version is here.
>
> https://lore.kernel.org/all/177584108931.388483.11311214679686745474.stgit@devnote2/
>
> This version forcibly removes the fprobe_hash_node when fprobe
> registration fails [1/3] and skips updating ftrace_ops when memory
> allocation fails during module unloading [2/3].
Hmm, Sashiko pointed out some issues in fprobe, which do not seem to be
introduced by this series but are existing UAF cases.
https://sashiko.dev/#/patchset/177606956628.929411.17392736689322577701.stgit%40devnote2
Especially,
> In fprobe_return(), the code traverses the fprobe_table which contains
> RCU-protected struct fprobe_hlist nodes. These nodes are freed using
> kfree_rcu(hlist_array, rcu) in unregister_fprobe_nolock().
>
> To safely traverse this RCU-protected list, readers must hold the RCU read
> lock. However, fprobe_return() only calls preempt_disable_notrace(). While
> disabling preemption acts as an RCU-sched read-side critical section on
> non-RT kernels, it does not prevent regular RCU grace periods from
> completing on PREEMPT_RT. Thus, kfree_rcu() can free the hlist_array while
> fprobe_return() is actively iterating over it.
I would like to ask Steve for a comment on this. Is the fgraph return
handler context RCU safe?
Thanks,
>
> Thanks,
> ---
>
> Masami Hiramatsu (Google) (3):
> tracing/fprobe: Remove fprobe from hash in failure path
> tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
> tracing/fprobe: Check the same type fprobe on table as the unregistered one
>
>
> kernel/trace/fprobe.c | 251 +++++++++++++++++++++++++++++--------------------
> 1 file changed, 147 insertions(+), 104 deletions(-)
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
From: Tom Zanussi @ 2026-04-13 22:38 UTC (permalink / raw)
To: Pengpeng Hou, rostedt
Cc: mhiramat, mathieu.desnoyers, tom.zanussi, linux-kernel,
linux-trace-kernel
In-Reply-To: <20260401112224.85582-1-pengpeng@iscas.ac.cn>
On Wed, 2026-04-01 at 19:22 +0800, Pengpeng Hou wrote:
> hist_field_name() uses a static MAX_FILTER_STR_VAL buffer for fully
> qualified variable-reference names, but it currently appends into that
> buffer with strcat() without rebuilding it first. As a result, repeated
> calls append a new "system.event.field" name onto the previous one,
> which can eventually run past the end of full_name.
>
> Build the name with snprintf() on each call and return NULL if the fully
> qualified name does not fit in MAX_FILTER_STR_VAL.
>
> Fixes: 067fe038e70f ("tracing: Add variable reference handling to hist triggers")
> Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Looks good to me, thanks.
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
> ---
> Changes since v1: https://lore.kernel.org/all/20260329030950.32503-1-pengpeng@iscas.ac.cn/
>
> - rebuild full_name on each call instead of falling back to field->name
> - return NULL on overflow as suggested
> - split out the snprintf() length check instead of using an inline if
>
> kernel/trace/trace_events_hist.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
> index 73ea180cad55..f9c8a4f078ea 100644
> --- a/kernel/trace/trace_events_hist.c
> +++ b/kernel/trace/trace_events_hist.c
> @@ -1361,12 +1361,14 @@ static const char *hist_field_name(struct hist_field *field,
> field->flags & HIST_FIELD_FL_VAR_REF) {
> if (field->system) {
> static char full_name[MAX_FILTER_STR_VAL];
> + int len;
> +
> + len = snprintf(full_name, sizeof(full_name), "%s.%s.%s",
> + field->system, field->event_name,
> + field->name);
> + if (len >= sizeof(full_name))
> + return NULL;
>
> - strcat(full_name, field->system);
> - strcat(full_name, ".");
> - strcat(full_name, field->event_name);
> - strcat(full_name, ".");
> - strcat(full_name, field->name);
> field_name = full_name;
> } else
> field_name = field->name;
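The rebuild-on-each-call approach can be tried in userspace. This is a minimal sketch, assuming a small illustrative buffer size (NAME_BUF_LEN) in place of MAX_FILTER_STR_VAL; full_field_name() is a hypothetical name, not the kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NAME_BUF_LEN 32	/* illustrative; the kernel uses MAX_FILTER_STR_VAL */

/*
 * Rebuild the fully qualified name from scratch on every call.
 * Returns NULL when "system.event.field" does not fit in the buffer.
 */
static const char *full_field_name(const char *system, const char *event,
				   const char *field)
{
	static char full_name[NAME_BUF_LEN];
	int len;

	assert(system && event && field);

	/* snprintf() overwrites the buffer, so earlier calls leave no residue. */
	len = snprintf(full_name, sizeof(full_name), "%s.%s.%s",
		       system, event, field);
	if (len < 0 || (size_t)len >= sizeof(full_name))
		return NULL;

	return full_name;
}
```

Because the buffer is static, the returned pointer is only valid until the next call, which is also true of the kernel code being patched.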
^ permalink raw reply
* [PATCH 2/2] selftests/ftrace: Add test case for fully-qualified variable references
From: Tom Zanussi @ 2026-04-13 22:35 UTC (permalink / raw)
To: rostedt
Cc: pengpeng, mhiramat, mathieu.desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <cover.1776112478.git.zanussi@kernel.org>
This test adds a variable (ts0) to two events (sched_waking and
sched_wakeup) and uses a fully-qualified variable reference to
explicitly choose a particular one (sched_wakeup.$ts0) when calculating
the wakeup latency.
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
---
.../trigger-fully-qualified-var-ref.tc | 34 +++++++++++++++++++
1 file changed, 34 insertions(+)
create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/inter-event/trigger-fully-qualified-var-ref.tc
diff --git a/tools/testing/selftests/ftrace/test.d/trigger/inter-event/trigger-fully-qualified-var-ref.tc b/tools/testing/selftests/ftrace/test.d/trigger/inter-event/trigger-fully-qualified-var-ref.tc
new file mode 100644
index 000000000000..8d12cdd06f1d
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/trigger/inter-event/trigger-fully-qualified-var-ref.tc
@@ -0,0 +1,34 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: event trigger - test fully-qualified variable reference support
+# requires: set_event synthetic_events events/sched/sched_process_fork/hist ping:program
+
+fail() { #msg
+ echo $1
+ exit_fail
+}
+
+echo "Test fully-qualified variable reference support"
+
+echo 'wakeup_latency u64 lat; pid_t pid; int prio; char comm[16]' > synthetic_events
+echo 'hist:keys=comm:ts0=common_timestamp.usecs if comm=="ping"' > events/sched/sched_waking/trigger
+echo 'hist:keys=comm:ts0=common_timestamp.usecs if comm=="ping"' > events/sched/sched_wakeup/trigger
+echo 'hist:keys=next_comm:wakeup_lat=common_timestamp.usecs-sched.sched_wakeup.$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,sched.sched_waking.prio,next_comm) if next_comm=="ping"' > events/sched/sched_switch/trigger
+echo 'hist:keys=pid,prio,comm:vals=lat:sort=pid,prio' > events/synthetic/wakeup_latency/trigger
+
+ping $LOCALHOST -c 3
+if ! grep -q "ping" events/synthetic/wakeup_latency/hist; then
+ fail "Failed to create inter-event histogram"
+fi
+
+if ! grep -q "synthetic_prio=prio" events/sched/sched_waking/hist; then
+ fail "Failed to create histogram with fully-qualified variable reference"
+fi
+
+echo '!hist:keys=next_comm:wakeup_lat=common_timestamp.usecs-sched.sched_wakeup.$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,sched.sched_waking.prio,next_comm) if next_comm=="ping"' >> events/sched/sched_switch/trigger
+
+if grep -q "synthetic_prio=prio" events/sched/sched_waking/hist; then
+ fail "Failed to remove histogram with fully-qualified variable reference"
+fi
+
+exit 0
--
2.43.0
^ permalink raw reply related
* [PATCH 1/2] tracing: Fix fully-qualified variable reference printing in histograms
From: Tom Zanussi @ 2026-04-13 22:35 UTC (permalink / raw)
To: rostedt
Cc: pengpeng, mhiramat, mathieu.desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <cover.1776112478.git.zanussi@kernel.org>
The syntax for fully-qualified variable references in histograms is
subsys.event.$var, which is parsed correctly, but not displayed
correctly when printing a histogram spec. The current code puts the $
reference at the beginning of the fully-qualified variable name
i.e. $subsys.event.var, which is incorrect.
Before:
trigger info: hist:keys=next_comm:vals=hitcount:wakeup_lat=common_timestamp.usecs-$sched.sched_wakeup.ts0: ...
After:
trigger info: hist:keys=next_comm:vals=hitcount:wakeup_lat=common_timestamp.usecs-sched.sched_wakeup.$ts0: ...
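The intended placement of the $ can be sketched in userspace as follows. This is an illustration only, assuming a hypothetical helper format_var_ref() that is not part of the patch; it prints subsys.event.$var when a system is given and $var otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a variable reference into out.
 * With a system: "subsys.event.$var"; without: "$var".
 * Returns 0 on success, -1 if the result does not fit.
 */
static int format_var_ref(char *out, size_t size, const char *system,
			  const char *event, const char *var)
{
	int len;

	assert(out && var && size > 0);

	if (system)
		len = snprintf(out, size, "%s.%s.$%s", system, event, var);
	else
		len = snprintf(out, size, "$%s", var);

	if (len < 0 || (size_t)len >= size)
		return -1;
	return 0;
}
```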
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
---
kernel/trace/trace_events_hist.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index b2b675c7d663..0dbbf6cca9bc 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1361,9 +1361,12 @@ static const char *hist_field_name(struct hist_field *field,
field->flags & HIST_FIELD_FL_VAR_REF) {
if (field->system) {
static char full_name[MAX_FILTER_STR_VAL];
+ static char *fmt;
int len;
- len = snprintf(full_name, sizeof(full_name), "%s.%s.%s",
+ fmt = field->flags & HIST_FIELD_FL_VAR_REF ? "%s.%s.$%s" : "%s.%s.%s";
+
+ len = snprintf(full_name, sizeof(full_name), fmt,
field->system, field->event_name,
field->name);
if (len >= sizeof(full_name))
@@ -1742,9 +1745,10 @@ static const char *get_hist_field_flags(struct hist_field *hist_field)
static void expr_field_str(struct hist_field *field, char *expr)
{
- if (field->flags & HIST_FIELD_FL_VAR_REF)
- strcat(expr, "$");
- else if (field->flags & HIST_FIELD_FL_CONST) {
+ if (field->flags & HIST_FIELD_FL_VAR_REF) {
+ if (!field->system)
+ strcat(expr, "$");
+ } else if (field->flags & HIST_FIELD_FL_CONST) {
char str[HIST_CONST_DIGITS_MAX];
snprintf(str, HIST_CONST_DIGITS_MAX, "%llu", field->constant);
@@ -6156,7 +6160,8 @@ static void hist_field_print(struct seq_file *m, struct hist_field *hist_field)
else if (field_name) {
if (hist_field->flags & HIST_FIELD_FL_VAR_REF ||
hist_field->flags & HIST_FIELD_FL_ALIAS)
- seq_putc(m, '$');
+ if (!hist_field->system)
+ seq_putc(m, '$');
seq_printf(m, "%s", field_name);
} else if (hist_field->flags & HIST_FIELD_FL_TIMESTAMP)
seq_puts(m, "common_timestamp");
--
2.43.0
^ permalink raw reply related
* [PATCH 0/2] tracing: fully-qualified var-ref testcase
From: Tom Zanussi @ 2026-04-13 22:35 UTC (permalink / raw)
To: rostedt
Cc: pengpeng, mhiramat, mathieu.desnoyers, linux-kernel,
linux-trace-kernel
Hi Steve,
Here's the testcase for fully-qualified var references mentioned here
[1].
While working on it, I realized that the printing of the
fully-qualified references was wrong (because the testcases use that
output to remove the trigger), so added the first patch.
It depends on Pengpeng Hou's patch:
[PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
Thanks,
Tom
[1] https://lore.kernel.org/lkml/36cf0fc5ee3a4476b0a70536d212278a9ee4d380.camel@kernel.org/
Tom Zanussi (2):
tracing: Fix fully-qualified variable reference printing in histograms
selftests/ftrace: Add test case for fully-qualified variable
references
kernel/trace/trace_events_hist.c | 15 +++++---
.../trigger-fully-qualified-var-ref.tc | 34 +++++++++++++++++++
2 files changed, 44 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/inter-event/trigger-fully-qualified-var-ref.tc
--
2.43.0
^ permalink raw reply
* Re: [PATCH v3 09/11] dt-bindings: input: Document hid-over-spi DT schema
From: Rob Herring @ 2026-04-13 22:34 UTC (permalink / raw)
To: Conor Dooley, Dmitry Torokhov, Jingyuan Liang
Cc: Jiri Kosina, Benjamin Tissoires, Jonathan Corbet, Mark Brown,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Krzysztof Kozlowski, Conor Dooley, linux-input, linux-doc,
linux-kernel, linux-spi, linux-trace-kernel, devicetree, hbarnor,
tfiga, Dmitry Antipov, Jarrett Schultz
In-Reply-To: <20260410-sake-dollop-9f253ddb0749@spud>
On Fri, Apr 10, 2026 at 06:35:00PM +0100, Conor Dooley wrote:
> On Thu, Apr 09, 2026 at 10:16:46AM -0700, Dmitry Torokhov wrote:
> > On Thu, Apr 09, 2026 at 05:02:11PM +0100, Conor Dooley wrote:
> > > On Thu, Apr 02, 2026 at 01:59:46AM +0000, Jingyuan Liang wrote:
> > > > Documentation describes the required and optional properties for
> > > > implementing Device Tree for a Microsoft G6 Touch Digitizer that
> > > > supports HID over SPI Protocol 1.0 specification.
> > > >
> > > > The properties are common to HID over SPI.
> > > >
> > > > Signed-off-by: Dmitry Antipov <dmanti@microsoft.com>
> > > > Signed-off-by: Jarrett Schultz <jaschultz@microsoft.com>
> > > > Signed-off-by: Jingyuan Liang <jingyliang@chromium.org>
> > > > ---
> > > > .../devicetree/bindings/input/hid-over-spi.yaml | 126 +++++++++++++++++++++
> > > > 1 file changed, 126 insertions(+)
> > > >
> > > > diff --git a/Documentation/devicetree/bindings/input/hid-over-spi.yaml b/Documentation/devicetree/bindings/input/hid-over-spi.yaml
> > > > new file mode 100644
> > > > index 000000000000..d1b0a2e26c32
> > > > --- /dev/null
> > > > +++ b/Documentation/devicetree/bindings/input/hid-over-spi.yaml
> > > > @@ -0,0 +1,126 @@
> > > > +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> > > > +%YAML 1.2
> > > > +---
> > > > +$id: http://devicetree.org/schemas/input/hid-over-spi.yaml#
> > > > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > > > +
> > > > +title: HID over SPI Devices
> > > > +
> > > > +maintainers:
> > > > + - Benjamin Tissoires <benjamin.tissoires@redhat.com>
> > > > + - Jiri Kosina <jkosina@suse.cz>
> > >
> > > Why them and not you, the developers of the series?
> > >
> > > > +
> > > > +description: |+
> > > > + HID over SPI provides support for various Human Interface Devices over the
> > > > + SPI bus. These devices can be for example touchpads, keyboards, touch screens
> > > > + or sensors.
> > > > +
> > > > + The specification has been written by Microsoft and is currently available
> > > > + here: https://www.microsoft.com/en-us/download/details.aspx?id=103325
> > > > +
> > > > + If this binding is used, the kernel module spi-hid will handle the
> > > > + communication with the device and the generic hid core layer will handle the
> > > > + protocol.
> > >
> > > This is not relevant to the binding, please remove it.
> > >
> > > > +
> > > > +allOf:
> > > > + - $ref: /schemas/input/touchscreen/touchscreen.yaml#
> > > > +
> > > > +properties:
> > > > + compatible:
> > > > + oneOf:
> > > > + - items:
> > > > + - enum:
> > > > + - microsoft,g6-touch-digitizer
> > > > + - const: hid-over-spi
> > > > + - description: Just "hid-over-spi" alone is allowed, but not recommended.
> > > > + const: hid-over-spi
> > >
> > > Why is it allowed but not recommended? Seems to me like we should
> > > require device-specific compatibles.
> >
> > Why would we want to change the driver code to add a new compatible each
> > time a vendor decides to create a chip that is fully hid-spi-protocol
> > compliant? Or is the plan to still allow "hid-over-spi" fallback but
> > require device-specific compatible that will be ignored unless there is
> > device-specific quirk needed?
The plan is the latter case (the 1st entry up above). The comment is to
remove the 2nd entry (with 'Just "hid-over-spi" alone is allowed, but
not recommended.').
> This has nothing to do with the driver, just the oddity of having a
> comment saying that not having a device specific compatible was
> permitted but not recommended in a binding. Requiring device-specific
> compatibles is the norm after all and a comment like this draws
> more attention to the fact that this is abnormal. Regardless of what the
> driver does, device-specific compatibles should be required.
>
> > > > +
> > > > + reg:
> > > > + maxItems: 1
> > > > +
> > > > + interrupts:
> > > > + maxItems: 1
> > > > +
> > > > + reset-gpios:
> > > > + maxItems: 1
> > > > + description:
> > > > + GPIO specifier for the digitizer's reset pin (active low). The line must
> > > > + be flagged with GPIO_ACTIVE_LOW.
> > > > +
> > > > + vdd-supply:
> > > > + description:
> > > > + Regulator for the VDD supply voltage.
> > > > +
> > > > + input-report-header-address:
> > > > + $ref: /schemas/types.yaml#/definitions/uint32
> > > > + minimum: 0
> > > > + maximum: 0xffffff
> > > > + description:
> > > > + A value to be included in the Read Approval packet, listing an address of
> > > > + the input report header to be put on the SPI bus. This address has 24
> > > > + bits.
> > > > +
> > > > + input-report-body-address:
> > > > + $ref: /schemas/types.yaml#/definitions/uint32
> > > > + minimum: 0
> > > > + maximum: 0xffffff
> > > > + description:
> > > > + A value to be included in the Read Approval packet, listing an address of
> > > > + the input report body to be put on the SPI bus. This address has 24 bits.
> > > > +
> > > > + output-report-address:
> > > > + $ref: /schemas/types.yaml#/definitions/uint32
> > > > + minimum: 0
> > > > + maximum: 0xffffff
> > > > + description:
> > > > + A value to be included in the Output Report sent by the host, listing an
> > > > + address where the output report on the SPI bus is to be written to. This
> > > > + address has 24 bits.
> > > > +
> > > > + read-opcode:
> > > > + $ref: /schemas/types.yaml#/definitions/uint8
> > > > + description:
> > > > + Value to be used in Read Approval packets. 1 byte.
> > > > +
> > > > + write-opcode:
> > > > + $ref: /schemas/types.yaml#/definitions/uint8
> > > > + description:
> > > > + Value to be used in Write Approval packets. 1 byte.
> > >
> > > Why can none of these things be determined from the device's compatible?
> > > > On the surface, they look like the kinds of things that could/should be.
> >
> > Why would we want to keep tables of these values in the kernel and again
> > have to update the driver for each new chip?
>
> That's pretty normal though innit? It's what match data does.
> If someone wants to have properties that communicate data that
> can be determined from the compatible, they need to provide
> justification why it is being done.
IIRC, it was explained in prior versions that the spec itself says these
values vary by device. If we expect variation, then I think these
properties are fine. But please capture the reasoning for them in this
patch or we will just keep asking the same questions over and over.
Rob
^ permalink raw reply
* [PATCH] tracepoint: balance regfunc() on func_add() failure in tracepoint_add_func()
From: David Carlier @ 2026-04-13 19:06 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-trace-kernel, linux-kernel, David Carlier, stable
When a tracepoint goes through the 0 -> 1 transition, tracepoint_add_func()
invokes the subsystem's ext->regfunc() before attempting to install the
new probe via func_add(). If func_add() then fails (for example, when
allocate_probes() cannot allocate a new probe array under memory pressure
and returns -ENOMEM), the function returns the error without calling the
matching ext->unregfunc(), leaving the side effects of regfunc() behind
with no installed probe to justify them.
For syscall tracepoints this is particularly unpleasant: syscall_regfunc()
bumps sys_tracepoint_refcount and sets SYSCALL_TRACEPOINT on every task.
After a leaked failure, the refcount is stuck at a non-zero value with no
consumer, and every task continues paying the syscall trace entry/exit
overhead until reboot. Other subsystems providing regfunc()/unregfunc()
pairs exhibit similarly scoped persistent state.
Mirror the existing 1 -> 0 cleanup and call ext->unregfunc() in the
func_add() error path, gated on the same condition used there so the
unwind is symmetric with the registration.
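The unwind can be illustrated with a small userspace model. This is a simplified sketch: struct fake_tp and the fake_*() functions are hypothetical stand-ins, and the "first probe" test below replaces the static_key_enabled() condition used in the actual change.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a tracepoint with register/unregister hooks. */
struct fake_tp {
	int nr_probes;		/* installed probes */
	int reg_refcount;	/* side effect of regfunc() */
	bool fail_add;		/* simulate func_add() returning -ENOMEM */
};

static int fake_regfunc(struct fake_tp *tp)
{
	tp->reg_refcount++;
	return 0;
}

static void fake_unregfunc(struct fake_tp *tp)
{
	tp->reg_refcount--;
}

static int fake_func_add(struct fake_tp *tp)
{
	if (tp->fail_add)
		return -ENOMEM;
	tp->nr_probes++;
	return 0;
}

/* Add a probe; on the 0 -> 1 transition, undo regfunc() if the add fails. */
static int fake_tp_add_func(struct fake_tp *tp)
{
	bool first = (tp->nr_probes == 0);
	int ret;

	assert(tp);

	if (first) {
		ret = fake_regfunc(tp);
		if (ret < 0)
			return ret;
	}

	ret = fake_func_add(tp);
	if (ret < 0) {
		if (first)
			fake_unregfunc(tp);	/* balance the earlier regfunc() */
		return ret;
	}
	return 0;
}
```

Without the fake_unregfunc() call in the error path, a failed first registration would leave reg_refcount at 1 with no probe installed, which is the leak described above.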
Fixes: 8cf868affdc4 ("tracing: Have the reg function allow to fail")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
---
kernel/tracepoint.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 91905aa19294..dffef52a807b 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -300,6 +300,8 @@ static int tracepoint_add_func(struct tracepoint *tp,
lockdep_is_held(&tracepoints_mutex));
old = func_add(&tp_funcs, func, prio);
if (IS_ERR(old)) {
+ if (tp->ext && tp->ext->unregfunc && !static_key_enabled(&tp->key))
+ tp->ext->unregfunc();
WARN_ON_ONCE(warn && PTR_ERR(old) != -ENOMEM);
return PTR_ERR(old);
}
--
2.53.0
^ permalink raw reply related
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-04-13 17:05 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <2608a03b-72bb-4033-8e6f-a439502b5573@kernel.org>
On Mon, Apr 13, 2026 at 03:11:12PM +0200, David Hildenbrand (Arm) wrote:
> > Normally cloud-hypervisor VMs with virtio-net can't be subject to KSM
> > because the entire boot region gets marked shared.
>
> What exactly do you mean with "mark shared". Do you mean, that "shared
> memory" is used in the hypervisor for all boot memory?
>
Sorry, meant MAP_SHARED. But yes, in some setups the hypervisor simply
makes a memfd with the entire main memory region MAP_SHARED.
This is because the virtio-net device / network stack does GFP_KERNEL
allocations and then pins them on the host to allow zero-copy - so all
of ZONE_NORMAL is a valid target.
(At least that's my best understanding of the entire setup).
>
> You mean, in the VM, memory usable by virtio-net can only be consumed
> from a dedicated physical memory region, and that region would be a
> separate node?
>
Correct - it does require teaching the network stack NUMA awareness.
I was surprised by how little code this required, though I can't be
100% sure of its correctness since networking isn't my normal space.
Alternatively you could imagine this as a real device bringing its own
dedicated networking memory for network buffers, and then telling the
network stack "Hey, prefer this node over normal kernel allocations".
What I'd been hacking on was cobbled together with memfd + SRAT bits to
bring up a private node statically and then have the device claim it -
but this is just a proof of concept. A proper implementation would be
extending virtio-net to report a dedicated EFI_RESERVED region.
> >
> > I see you saw below that one of the extensions is removing the nodes
> > from the fallback list. That is part one, but it's insufficient to
> > prevent complete leakage (someone might iterate over the nodes-possible
> > list and try migrating memory).
>
> Which code would do that?
>
There are many callers of for_each_node() throughout the system,
but here is one discrete example:
int alloc_shrinker_info(struct mem_cgroup *memcg)
{
... snip ...
for_each_node(nid) {
struct shrinker_info *info = kvzalloc_node(sizeof(*info) + array_size,
GFP_KERNEL, nid);
... snip ..
}
If you disallow fallbacks in this scenario, this allocation always fails.
This partially answers your question about slub fallback allocations,
there are slab allocations like this that depend on fallbacks (more
below on this explicitly).
> > Basically the only isolation mechanism we have today is ZONE_DEVICE.
> >
> > Either via mbind and friends, or even just the driver itself managing it
> > directly via alloc_pages_node() and exposing some userland interface.
>
> Would mbind() work here? I thought mbind() would not suddenly give
> access to some ZONE_DEVICE memory.
>
Sorry, these were orthogonal thoughts.
1) We don't have such a mechanism. ZONE_DEVICE's preferred mechanism is
setting up explicit migrations via migrate_device.c
2) mbind / alloc_pages_node would only work for private nodes.
Extending ZONE_DEVICE to enable mbind() would be an extreme lift,
as the kernel makes a lot of assumptions about folio->lru.
This is why I went the node route in the first place.
> >
> > in the NP_OPS_MIGRATION patch, this gets covered.
>
> Right, but I am not sure if NP_OPS_MIGRATION is really the right
> approach for that. Have to think about that.
>
So, OPS is a bit misleading, but it's the closest I came to some
existing pattern. OPS does not necessarily need to imply callbacks.
I've been trying to minimize the patch set and I'm starting to think
the MVP may actually be able to do away with the private_ops structure
for a basic migration+mempolicy example by simply teaching some services
(migrate.c, mempolicy.c) how/when to inject __GFP_PRIVATE.
The mempolicy.c patch already does this, but not migrate.c - I haven't
figured out the right pattern for that yet.
> > 1) as you note, removing it from the default bitmaps, which is actually
> > hard. You can't remove it from the possible-node bitmap, so that
> > just seemed non-tractable.
>
> What about making people use a different set of bitmaps here? Quite some
> work, but maybe that's the right direction given that we'll now treat
> some nodes differently.
>
It's an option, although it is fragile. That means having to police all
future users of possible-nodes, for_each_node, etc.
I've been erring on the side of "not fragile", but I'm open to rework.
> >
> > 2) __GFP_THISNODE actually means (among other things) "don't fallback".
> > And, in fact, there are some hotplug-time allocations that occur in
> > SLAB (pglist_data) that target the private node that *must* fallback
> > to successfully allocate for successful kernel operation.
>
>
> Can you point me at the code?
>
There is actually a comment in slub.c that addresses this directly:
static int slab_mem_going_online_callback(int nid)
{
... snip ...
/*
* XXX: kmem_cache_alloc_node will fallback to other nodes
* since memory is not yet available from the node that
* is brought up.
*/
n = kmem_cache_alloc(kmem_cache_node, GFP_KERNEL);
... snip ...
}
Slab basically acknowledges the behavior is required on existing nodes
and just falls back immediately for the "going online" path.
Other specific calls in the hotplug path:
mm/sparse.c: kzalloc_node(size, GFP_KERNEL, nid)
mm/sparse-vmemmap.c: alloc_pages_node(nid, GFP_KERNEL|...)
mm/slub.c: kmalloc_node(sizeof(*barn), GFP_KERNEL, nid)
There are quite a number of callers to kmem_cache_alloc_node() that
would have to be individually audited.
And some non-slab interface examples as well:
alloc_shrinker_info
alloc_node_nr_active
I've been looking at this for a while, but I'm starting to think trying
to touch all this surface area is simply too fragile compared to just
letting normal memory be a fallback for private nodes and adding:
__GFP_PRIVATE - unlocks private node, but allows fallback
#define GFP_PRIVATE (__GFP_PRIVATE | __GFP_THISNODE) - only this node
__GFP_PRIVATE vs GFP_PRIVATE then is just a matter of use case.
For mbind() it probably makes sense we'd use GFP_PRIVATE - either it
succeeds or it OOMs.
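A rough userspace model of the two behaviours, only to illustrate the
intended semantics. The bit values, struct node_model and model_alloc()
are made up for this sketch and are not taken from the series or from the
real page allocator; MODEL_PRIVATE stands in for __GFP_PRIVATE and
MODEL_GFP_PRIVATE for GFP_PRIVATE.

```c
#include <stdbool.h>

/* Hypothetical bit values, for illustration only (not the kernel's). */
#define MODEL_THISNODE     (1u << 0)  /* stands in for __GFP_THISNODE */
#define MODEL_PRIVATE      (1u << 1)  /* stands in for __GFP_PRIVATE */
#define MODEL_GFP_PRIVATE  (MODEL_PRIVATE | MODEL_THISNODE)

struct node_model {
	bool is_private;  /* node is a private node in this model */
	bool has_free;    /* node can satisfy the request */
};

/*
 * Returns the node id the allocation lands on, or -1 on failure.
 * Private nodes are skipped unless MODEL_PRIVATE is set, and
 * MODEL_THISNODE forbids falling back past the preferred node.
 */
static int model_alloc(const struct node_model *nodes, int nr_nodes,
		       int preferred, unsigned int gfp)
{
	for (int i = 0; i < nr_nodes; i++) {
		int nid = (preferred + i) % nr_nodes;

		/* Anything after the first candidate is a fallback. */
		if (i > 0 && (gfp & MODEL_THISNODE))
			break;
		if (nodes[nid].is_private && !(gfp & MODEL_PRIVATE))
			continue;
		if (nodes[nid].has_free)
			return nid;
	}
	return -1;
}
```

In this model a hotplug-time allocation that targets the private node
without MODEL_PRIVATE quietly lands on normal memory, while
MODEL_GFP_PRIVATE either lands on the private node or fails.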
> > The flexibility is kind of the point :]
>
> Yeah, but it would be interesting which minimal support we would need to
> just let some special memory be managed by the kernel, allowing mbind()
> users to use it, but not have any other fallback allocations end up on it.
>
> Something very basic, on which we could build additional functionality.
>
I actually have a simplistic CXL driver that does exactly this:
https://github.com/gourryinverse/linux/blob/072ecf7cbebd9871e76c0b52fd99aa1321405a59/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c#L65
We have to support migration because mbind can migrate on bind if the
VMA already has memory - but all this means is the migrate interfaces
are live - not that the kernel actually uses them.
so mbind requires (OPS_MIGRATE | OPS_MEMPOLICY)
All these flags say is:
- move_pages() syscalls can accept these nodes
- migrate_pages() function calls can accept these nodes
- mempolicy.c nodemasks allow the nodes (should restrict to mbind)
- vma's with these nodes now inject __GFP_PRIVATE on fault
All other services (reclaim, compaction, khugepaged, etc) do not scan
these nodes and do not know about __GFP_PRIVATE, so they never see
private node folios and can't allocate from the node.
In this example, all migrate_to() really does is inject __GFP_THISNODE,
but I've been thinking about whether we can just do this in migrate.c
and leave implementing the .ops to a user that requires it.
But otherwise "it just works".
One note here though - OOM conditions and allocation failures are not
intuitive, especially when THP/non-order-0 allocations are involved.
But that might just mean this minimal setup should only allow order-0
allocations - which is fiiiiiiiiiiiiiine :P.
-----------------
For basic examples
I've implemented 4 examples to consider building on:
1) CXL mempolicy driver:
https://github.com/gourryinverse/linux/blob/072ecf7cbebd9871e76c0b52fd99aa1321405a59/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c#L65
As described above
2) Virtio-net / CXL.mem Network Card
(Not published yet)
This doesn't require any ops at all - the plumbing happens entirely
inside the kernel. I onlined the node with an SRAT hack and no ops
structure at all associated with the device (just set node affinity
to the pcie_dev and plumbed it through the network stack).
A proper implementation would have virtio-net register its own
reserved memory region and online it during probe.
3) Accelerator
(Not published yet)
I have converted an open source but out of tree GPU driver which
uses NUMA nodes to use private nodes. This required:
NP_OPS_MIGRATION
NP_OPS_MEMPOLICY
The pattern is very similar to the CXL mempolicy driver, except
that the driver had alloc_pages_node() calls that needed to have
__GFP_PRIVATE added to ensure allocations landed on the device.
4) CXL Compressed RAM driver:
https://github.com/gourryinverse/linux/blob/55c06eb6bced58132d9001e318f2958e8ac80614/mm/cram.c#L340
needs pretty much everything - it's "normal memory" with access
rules, so the driver isn't really in the management lifecycle.
In this example - the only way to allocate memory on the node is
via demotion. This allows us to close off the device to new
allocations if the hardware reports low memory but the OS perceives
the device to still have free memory.
Which is a cool example: The driver just sets up the node with
certain attributes and then lets the kernel deal with it.
I have started compacting the _OPS_* flags related to reclaim into a
single NP_OPS_RECLAIM flag while testing with this. Really i've come
around to thinking many mm/ services need to be taken as a package,
not fully piecemeal.
The tl;dr: Once you cede some control over to the kernel, you're
very close to ceding ALL control, but you still get some control
over how/when allocations on the node can be made.
It is important to note that even if we don't expose callbacks, we do
still need a modicum of node filtering in some places that still use
for_each_node() (vmscan.c, compaction.c, oom_kill.c, etc).
These are basically all the places ZONE_DEVICE *implicitly* opts itself
out of by having managed_pages=0. We have to make those situations
explicit - but that doesn't mean we need callbacks.
> >
> > I would simply state: "That depends on the memory device"
>
> Let's keep it very simple: just some memory that you mbind(), and you
> only want the mbind() user to make use of that memory.
>
> What would be the minimal set of hooks to guarantee that.
>
If you want the mbind contract to stay intact:
NP_OPS_MIGRATION (mbind can generate migrations)
NP_OPS_MEMPOLICY (this just tells mempolicy.c to allow the node)
The set of callbacks required should be exactly 0 (assuming we teach
migrate.c to inject __GFP_PRIVATE like we have mempolicy.c).
If your device requires some special notification on allocation, free
or migration to/from you need:
ops.free_folio(folio)
ops.migrate_to(folios, nid, mode, reason, nr_success)
ops.migrate_folio(src_folio, dst_folio)
The free path is the tricky one to get right. You can imagine:
buf = malloc(...);
mbind(buf, private_node);
memset(buf, 0x42, ...);
ioctl(driver, CHECK_OUT_THIS_DATA, buf);
exit(0);
The task dies and frees the pages back to the buddy - the question is
whether the 4-5 free_folio paths (put_folio, put_unref_folios, etc) can
all eat an ops.free_folio() callback to inform the driver the memory has
been freed.
In practice - this worked on my accelerator and compressed examples, but
I can't say it's 100% safe in all contexts. The free path needs more
scrutiny.
> For example, I assume compaction could just be supported for such
> memory? Similarly, longterm-pinning.
>
> For some of the other hooks it's rather unclear how they would affect
> the very simple mbind() rule. What is the effect of demotion or NUMA
> balancing?
>
> I'm afraid we're making things too complicated here or it might be the
> wrong abstraction, if i cannot even figure out how to make the simplest
> use case work.
>
> Maybe I'm wrong :)
>
Actually, quite the opposite: None of that should be engaged by
default. In our above example:
OPS_MIGRATION | OPS_MEMPOLICY
All this should say is that migration and mempolicy are supported - not
that anything in the kernel that uses migration will suddenly operate on
that memory.
So: Compaction, Longterm Pin, NUMA balancing, Demotion - etc - all of
these do not ever operate on this memory by default. Your device driver
or service would have to specifically opt-in to those services and must
be capable of dealing with the implications of that.
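As a sketch of what "opt-in" means here: the flag names below come from
the cover letter, but the bit assignments and np_service_allowed() are
invented for illustration and are not how the series necessarily
implements the check.

```c
#include <stdbool.h>

/* Hypothetical bit assignments; only the flag names come from the thread. */
enum np_ops {
	NP_OPS_MIGRATION      = 1u << 0,
	NP_OPS_MEMPOLICY      = 1u << 1,
	NP_OPS_COMPACTION     = 1u << 2,
	NP_OPS_NUMA_BALANCING = 1u << 3,
	NP_OPS_DEMOTION       = 1u << 4,
	NP_OPS_LONGTERM_PIN   = 1u << 5,
};

/* A service may touch a private node only if every flag it needs is set. */
static bool np_service_allowed(unsigned int node_ops, unsigned int needed)
{
	return (node_ops & needed) == needed;
}

/* The minimal mbind()-only node from the example above. */
static const unsigned int mbind_only_node =
	NP_OPS_MIGRATION | NP_OPS_MEMPOLICY;
```

With mbind_only_node, a check for NP_OPS_MEMPOLICY passes and a check for
NP_OPS_COMPACTION or NP_OPS_DEMOTION fails, so those services skip the
node unless the driver sets the corresponding flag.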
---
kind of neat aside:
You can hotplug private ZONE_NORMAL without NP_OPS_LONGTERM_PIN and as
long as the driver/service controls the type/lifetime of allocations,
the node can remain hot-unpluggable in the future.
e.g. if the service only ever allocates movable allocations, the lack
of NP_OPS_LONGTERM_PIN prevents those pages from being pinned. If you
add NP_OPS_MIGRATION - the attempt to pin will cause migration :]
~Gregory
* Re: [PATCH 0/3] mm: split the file's i_mmap tree for NUMA
From: Mateusz Guzik @ 2026-04-13 15:33 UTC (permalink / raw)
To: Huang Shijie
Cc: akpm, viro, brauner, linux-mm, linux-kernel, linux-arm-kernel,
linux-fsdevel, muchun.song, osalvador, linux-trace-kernel,
linux-perf-users, linux-parisc, nvdimm, zhongyuan, fangbaoshun,
yingzhiwei
In-Reply-To: <20260413062042.804-1-huangsj@hygon.cn>
On Mon, Apr 13, 2026 at 02:20:39PM +0800, Huang Shijie wrote:
> In NUMA, there are maybe many NUMA nodes and many CPUs.
> For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> In the UnixBench tests, there is a test "execl" which tests
> the execve system call.
>
> When we test our server with "./Run -c 384 execl",
> the test result is not good enough. The i_mmap locks contended heavily on
> "libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have
> over 6000 VMAs, all the VMAs can be in different NUMA mode.
> The insert/remove operations do not run quickly enough.
>
> patch 1 & patch 2 are try to hide the direct access of i_mmap.
> patch 3 splits the i_mmap into sibling trees, and we can get better
> performance with this patch set:
> we can get 77% performance improvement(10 times average)
>
To my reading you kept the lock as-is and only distributed the protected
state.
While I don't doubt the improvement, I'm confident should you take a
look at the profile you are going to find this still does not scale with
rwsem being one of the problems (there are other global locks, some of
which already have experimental patches).
Apart from that this does nothing to help high core systems which are
all one node, which imo puts another question mark on this specific
proposal.
Of course one may question whether an RB tree is the right choice here;
it may be that the lock-protected cost can go way down with merely a
better data structure.
Regardless of that, for actual scalability, there will be no way around
decentralizing locking around this and partitioning per some core count
(not just by numa awareness).
Decentralizing locking is definitely possible, but I have not looked
into specifics of how problematic it is. Best case scenario it will
merely work with separate locks. Worst case scenario something needs a fully
stabilized state for traversal, in that case another rw lock can be
slapped around this, creating locking order read lock -> per-subset
write lock -- this will suffer scalability due to the read locking, but
it will still scale drastically better as apart from that there will be
no serialization. In this setting the problematic consumer will write
lock the new thing to stabilize the state.
So my non-maintainer opinion is that the patchset is not worth it as it
fails to address anything for significantly more common and already
affected setups.
Have you looked into splitting the lock?
* Re: [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log
From: Theodore Tso @ 2026-04-13 13:12 UTC (permalink / raw)
To: Li Chen
Cc: Zhang Yi, Andreas Dilger, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, linux-ext4, linux-trace-kernel, linux-kernel
In-Reply-To: <19d86eec635.f7072461135455.4960134919814592320@linux.beauty>
On Mon, Apr 13, 2026 at 09:01:28PM +0800, Li Chen wrote:
> Absolutely! It's great to learn about the Sashiko development site.
> I will address the real issues in the next version.
Note that Sashiko will sometimes report a pre-existing issue as if it
were a problem with the commit. If that happens, feel free to ignore
its complaint; what I consider best practice is to either (a) fix it
in a subsequent patch or patch series, or (b) leave a TODO in the
code.
I've asked the Sashiko folks to add a way to get a URI for each issue
that is identified by Sashiko, so we can put a URL in the TODO comment for
someone who wants to fix it later, and to make it easier for Sashiko
to identify pre-existing issues so it doesn't comment on the same
issue across multiple commit reviews (and perhaps save some of the LLM
token budget :-).
In the next few days, for patches sent to linux-ext4, Sashiko will
start e-mailing its reviews to the patch submitter and to me as the
maintainer. Once we can reduce the false positive rate, I'll ask that
the reviews be cc'ed to the linux-ext4 mailing list. But it seems
good enough to send e-mails to the patch submitter and the
maintainer --- but that's a decision that each subsystem maintainer
will be making on their own.
Cheers,
- Ted
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-04-13 13:11 UTC (permalink / raw)
To: Gregory Price
Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <abwRu1FNqI3dVyqL@gourry-fedora-PF4VCD3F>
On 3/19/26 16:09, Gregory Price wrote:
> On Tue, Mar 17, 2026 at 02:25:29PM +0100, David Hildenbrand (Arm) wrote:
>> On 2/22/26 09:48, Gregory Price wrote:
>>> Topic type: MM
>>
>> Hi Gregory,
>>
>> stumbling over this again, some questions whereby I'll just ignore the
>> compressed RAM bits for now and focus on use cases where promotion etc
>> are not relevant :)
>
> A more concrete example up your alley:
>
> I've since been playing with a virtio-net private node.
>
> Normally cloud-hypervisor VMs with virtio-net can't be subject to KSM
> because the entire boot region gets marked shared.
What exactly do you mean by "mark shared"? Do you mean that "shared
memory" is used in the hypervisor for all boot memory?
> If virtio-net has
> its own private node / region separate from the boot region, the boot
> region is now free to be subject to KSM.
You mean, in the VM, memory usable by virtio-net can only be consumed
from a dedicated physical memory region, and that region would be a
separate node?
>
> I may have that up as an example sometime before LSF, but i need to
> clean up some networking stack hacks i've made to make it work.
>
>>>
>>> N_MEMORY_PRIVATE is all about isolating NUMA nodes and then punching
>>> explicit holes in that isolation to do useful things we couldn't do
>>> before without re-implementing entire portions of mm/ in a driver.
>>
>> Just to clarify: we don't currently have any mechanism to expose, say,
>> SPM/PMEM/whatsoever to the buddy allocator through the dax/kmem driver
>> and *not* have random allocations end up on it, correct?
>>
>> Assume we online the memory to ZONE_MOVABLE, still other (fallback)
>> allocations might end up on that memory.
>>
>
> Correct, when you hotplug memory into a node, it's a free for all.
> Fallbacks are going to happen.
Right, and I agree that having a mechanism to prevent that is reasonable.
>
> I see you saw below that one of the extensions is removing the nodes
> from the fallback list. That is part one, but it's insufficient to
> prevent complete leakage (someone might iterate over the nodes-possible
> list and try migrating memory).
Which code would do that?
>
>> How would we currently handle something like that? (do we have drivers
>> for that? I'd assume that drivers would only migrate some user memory to
>> ZONE_DEVICE memory.)
>>
>> Assuming we don't have such a mechanism, I assume that part of your
>> proposal would be very interesting: online the memory to a
>> "special"/"restricted" (you call it private) NUMA node, whereby all
>> memory of that NUMA node will only be consumable through
>> mbind() and friends.
>>
>
> Basically the only isolation mechanism we have today is ZONE_DEVICE.
>
> Either via mbind and friends, or even just the driver itself managing it
> directly via alloc_pages_node() and exposing some userland interface.
Would mbind() work here? I thought mbind() would not suddenly give
access to some ZONE_DEVICE memory.
>
> You can imagine a network driver providing an ioctl for a shared buffer
> or a driver exposing a mmap'able file descriptor as the trivial case.
Right.
>
>> Any other allocations (including automatic page migration etc) would not
>> end up on that memory.
>
> One of the complications of exposing this memory via mbind is that
> mempolicy.c has a lot of migration mechanics, just to name two:
>
> - migrate on mbind
> - cpuset rebinds
>
> So for a completely solution you need to support migration if you
> support mempolicy. But with the callbacks, you can control how/when
> migration occurs.
>
> tl;dr: many of mm/'s services are actually predicated on migration
> support, so you have to manage that somehow.
Agreed.
>
>>
>> Thinking of some "terribly slow" or "terribly fast" memory that we don't
>> want to involve in automatic memory tiering, being able to just let
>> selected workloads consume that memory sounds very helpful.
>>
>>
>> (wondering if there could be some way allocations might get migrated out
>> of the node, for example, during memory offlining etc, which might also
>> not be desirable)
>>
>
> in the NP_OPS_MIGRATION patch, this gets covered.
Right, but I am not sure if NP_OPS_MIGRATION is really the right
approach for that. Have to think about that.
>
> I'm not sure the NP_OPS_* pattern is what we actually want, it's just
> what i came up with to make it clear what's being enabled.
>
> Basically without NP_OPS_MIGRATION, this memory is completely
> non-migratable. The driver managing it therefore needs to control the
> lifetime, and if hotplug is requested - kill anyone using it (which by
> definition should not the kernel) and either release the pages or take
> them so they can be released while hotplug is spinning.
>
>> I am not sure if __GFP_PRIVATE etc is really required for that. But some
>> mechanism to make that work seems extremely helpful.
>>
>> Because ...
>>
>>> /* And now I can use mempolicy with my memory */
>>> buf = mmap(...);
>>> mbind(buf, len, mode, private_node, ...);
>>> buf[0] = 0xdeadbeef; /* Faults onto private node */
>>
>> ... just being able to consume that memory through mbind() and having
>> guarantees sounds extremely helpful.
>>
>
> Yes! :]
>
>>>
>>> - Filter allocation requests on __GFP_PRIVATE
>>> numa_zone_allowed() excludes them otherwise.
>>
>> I think we discussed that in the past, but why can't we find a way that
>> only people requesting __GFP_THISNODE could allocate that memory, for
>> example? I guess we'd have to remove it from all "default NUMA bitmaps"
>> somehow.
>>
>
> I experimented with this. There were two concerns:
>
> 1) as you note, removing it from the default bitmaps, which is actually
> hard. You can't remove it from the possible-node bitmap, so that
> just seemed non-tractable.
What about making people use a different set of bitmaps here? Quite some
work, but maybe that's the right direction given that we'll now treat
some nodes differently.
>
> 2) __GFP_THISNODE actually means (among other things) "don't fallback".
> And, in fact, there are some hotplug-time allocations that occur in
> SLAB (pglist_data) that target the private node that *must* fallback
> to successfully allocate for successful kernel operation.
Can you point me at the code?
>
> So separating PRIVATE from THISNODE and allowing some use of fallback
> mechanics resolves some problems here.
>
> I think #2 is a solvable problem, but #1 i don't think can be addressed.
> I need to investigate the slab interactions a little more.
I'll also have to think about this some more.
>
>>> - Use standard struct page / folio. No ZONE_DEVICE, no pgmap,
>>> no struct page metadata limitations.
>>
>> Good.
>
> Note: I've actually since explored merging this with pgmap, and
> rebranding it as node-scope pgmap.
>
> In that sense, you could think of this as NODE_DEVICE instead of
> NODE_PRIVATE - but maybe I'm inviting too much baggage :]
:)
NODE_DEVICE sounds interesting though.
>
>>>
>>> Re-use of ZONE_DEVICE Hooks
>>> ===
>>
>> I think all of that might not be required for the simplistic use case I
>> mentioned above (fast/slow memory only to be consumed by selected user
>> space that opts in through mbind() and friends).
>>
>> Or are there other use cases for these callbacks
>>
>
> Many `folio_is_zone_device()` hooks result in the operations being
> a no-op / failing. We need all those same hooks.
>
> Some hooks I added - such as migration hooks, are combined with the
> zone_device hooks via i helper to demonstrate the pattern is the same
> when the memory is opted into migration.
>
> I do not think all of these hooks are required, I would think of this
> more as an exploration of the whole space, and then we can throw what
> does not have an active use case.
>
> For the compressed ram component I've been designing, the needs are:
>
> - Migration
> - Reclaim
> - Demotion
> - Write Protect (maybe, possibly optional)
>
> But you could argue another user might want the same device to have:
> - Migration
> - Mempolicy
>
> Where they manage things from userland, rather than via reclaim.
>
> The flexibility is kind of the point :]
Yeah, but it would be interesting which minimal support we would need to
just let some special memory be managed by the kernel, allowing mbind()
users to use it, but not have any other fallback allocations end up on it.
Something very basic, on which we could build additional functionality.
>
>> [...]
>>>
>>>
>>> Flag-gated behavior (NP_OPS_*) controls:
>>> ===
>>>
>>> We use OPS flags to denote what mm/ services we want to allow on our
>>> private node. I've plumbed these through so far:
>>>
>>> NP_OPS_MIGRATION - Node supports migration
>>> NP_OPS_MEMPOLICY - Node supports mempolicy actions
>>> NP_OPS_DEMOTION - Node appears in demotion target lists
>>> NP_OPS_PROTECT_WRITE - Node memory is read-only (wrprotect)
>>> NP_OPS_RECLAIM - Node supports reclaim
>>> NP_OPS_NUMA_BALANCING - Node supports numa balancing
>>> NP_OPS_COMPACTION - Node supports compaction
>>> NP_OPS_LONGTERM_PIN - Node supports longterm pinning
>>> NP_OPS_OOM_ELIGIBLE - (MIGRATION | DEMOTION), node is reachable
>>> as normal system ram storage, so it should
>>> be considered in OOM pressure calculations.
>>
>> I have to think about all that, and whether that would be required as a
>> first step. I'd assume in a simplistic use case mentioned above we might
>> only forbid the memory to be used as a fallback for any oom etc.
>>
>> Whether reclaim (e.g., swapout) makes sense is a good question.
>>
>
> I would simply state: "That depends on the memory device"
Let's keep it very simple: just some memory that you mbind(), and you
only want the mbind() user to make use of that memory.
What would be the minimal set of hooks to guarantee that?
For example, I assume compaction could just be supported for such
memory? Similarly, longterm-pinning.
For some of the other hooks it's rather unclear how they would affect
the very simple mbind() rule. What is the effect of demotion or NUMA
balancing?
I'm afraid we're making things too complicated here or it might be the
wrong abstraction, if i cannot even figure out how to make the simplest
use case work.
Maybe I'm wrong :)
>
> Which is kind of the point. The ability to isolate and poke holes in
> that isolation explictly, while using the same mm/ code, creates a new
> design space we haven't had before.
>
> ---
>
> I think it would be fair to say all of these would not be required for
> an MVP interface, and should require a use case to merge. But the code
> is here because I wanted to explore just how far it can go.
That's absolutely fair. :)
--
Cheers,
David
* Re: [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log
From: Li Chen @ 2026-04-13 13:01 UTC (permalink / raw)
To: Theodore Tso
Cc: Zhang Yi, Andreas Dilger, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, linux-ext4, linux-trace-kernel, linux-kernel
In-Reply-To: <20260410011843.GD99725@macsyma-wired.lan>
Hi Ted,
---- On Fri, 10 Apr 2026 09:18:43 +0800 Theodore Tso <tytso@mit.edu> wrote ---
> On Tue, Jan 20, 2026 at 07:25:29PM +0800, Li Chen wrote:
> > Hi,
> >
> > (This RFC v4 series is based on linux-next tag next-20260106, plus the
> > prerequisite patch "ext4: fast commit: make s_fc_lock reclaim-safe" posted at:
> > https://lore.kernel.org/all/20260106120621.440126-1-me@linux.beauty/)
>
> Can you take a look at the Sashiko reviews here:
>
> https://sashiko.dev/#/patchset/20260408112020.716706-1-me%40linux.beauty
>
> There seems to be at least one legitimate concern, which is the
> potential cur_lblk overflow. There are a couple of others which I
> think is real; could you please look at their review comments?
Absolutely! It's great to learn about the Sashiko development site.
I will address the real issues in the next version.
Regards,
Li
* Re: [RFC v6 6/7] ext4: fast commit: add lock_updates tracepoint
From: Li Chen @ 2026-04-13 12:58 UTC (permalink / raw)
To: Steven Rostedt
Cc: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Masami Hiramatsu,
Mathieu Desnoyers, linux-ext4, linux-kernel, linux-trace-kernel
In-Reply-To: <20260408160405.45a5ee09@gandalf.local.home>
Hi Steven,
---- On Thu, 09 Apr 2026 04:02:56 +0800 Steven Rostedt <rostedt@goodmis.org> wrote ---
> On Wed, 8 Apr 2026 19:20:17 +0800
> Li Chen <me@linux.beauty> wrote:
>
> > Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
> > so it is useful to quantify the time spent with updates locked and to
> > understand why snapshotting can fail.
> >
> > Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
> > the updates-locked window along with the number of snapshotted inodes
> > and ranges. Record the first snapshot failure reason in a stable snap_err
> > field for tooling.
> >
>
> [..]
>
> > @@ -1338,13 +1375,13 @@ static int ext4_fc_perform_commit(journal_t *journal)
> > if (ret)
> > return ret;
> >
> > -
> > ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
> > if (ret)
> > return ret;
> >
> > /* Step 4: Mark all inodes as being committed. */
> > jbd2_journal_lock_updates(journal);
> > + lock_start = ktime_get();
>
> ktime_get() is rather quick but if you care about micro-optimizations, you
> could have:
>
> if (trace_ext4_fc_lock_updates_enabled())
> lock_start = ktime_get();
> else
> lock_start = 0;
>
> > /*
> > * The journal is now locked. No more handles can start and all the
> > * previous handles are now drained. Snapshotting happens in this
> > @@ -1358,8 +1395,15 @@ static int ext4_fc_perform_commit(journal_t *journal)
> > }
> > ext4_fc_unlock(sb, alloc_ctx);
> >
> > - ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
> > + ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size,
> > + &snap_inodes, &snap_ranges, &snap_err);
> > jbd2_journal_unlock_updates(journal);
> > + if (trace_ext4_fc_lock_updates_enabled()) {
>
> if (trace_ext4_fc_lock_updates_enabled() && lock_start) {
>
> But feel free to ignore this if the overhead of always calling ktime_get()
> is not an issue.
>
>
> > + locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
> > + trace_ext4_fc_lock_updates(sb, commit_tid, locked_ns,
> > + snap_inodes, snap_ranges, ret,
> > + snap_err);
> > + }
> > kvfree(inodes);
> > if (ret)
> > return ret;
> > @@ -1564,7 +1608,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
> > journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO;
> > set_task_ioprio(current, journal_ioprio);
> > fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
> > - ret = ext4_fc_perform_commit(journal);
> > + ret = ext4_fc_perform_commit(journal, commit_tid);
> > if (ret < 0) {
> > if (ret == -EAGAIN || ret == -E2BIG || ret == -ECANCELED)
> > status = EXT4_FC_STATUS_INELIGIBLE;
> > diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> > index f493642cf121..7028a28316fa 100644
> > --- a/include/trace/events/ext4.h
> > +++ b/include/trace/events/ext4.h
> > @@ -107,6 +107,26 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_VERITY);
> > TRACE_DEFINE_ENUM(EXT4_FC_REASON_MOVE_EXT);
> > TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
> >
> > +#undef EM
> > +#undef EMe
> > +#define EM(a) TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
> > +#define EMe(a) TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
> > +
> > +#define TRACE_SNAP_ERR \
> > + EM(NONE) \
> > + EM(ES_MISS) \
> > + EM(ES_DELAYED) \
> > + EM(ES_OTHER) \
> > + EM(INODES_CAP) \
> > + EM(RANGES_CAP) \
> > + EM(NOMEM) \
> > + EMe(INODE_LOC)
> > +
> > +TRACE_SNAP_ERR
> > +
> > +#undef EM
> > +#undef EMe
> > +
> > #define show_fc_reason(reason) \
> > __print_symbolic(reason, \
> > { EXT4_FC_REASON_XATTR, "XATTR"}, \
> > @@ -2818,6 +2838,47 @@ TRACE_EVENT(ext4_fc_commit_stop,
> > __entry->num_fc_ineligible, __entry->nblks_agg, __entry->tid)
> > );
> >
> > +#define EM(a) { EXT4_FC_SNAP_ERR_##a, #a },
> > +#define EMe(a) { EXT4_FC_SNAP_ERR_##a, #a }
> > +
> > +TRACE_EVENT(ext4_fc_lock_updates,
> > + TP_PROTO(struct super_block *sb, tid_t commit_tid, u64 locked_ns,
> > + unsigned int nr_inodes, unsigned int nr_ranges, int err,
> > + int snap_err),
> > +
> > + TP_ARGS(sb, commit_tid, locked_ns, nr_inodes, nr_ranges, err, snap_err),
> > +
> > + TP_STRUCT__entry(/* entry */
> > + __field(dev_t, dev)
> > + __field(tid_t, tid)
> > + __field(u64, locked_ns)
> > + __field(unsigned int, nr_inodes)
> > + __field(unsigned int, nr_ranges)
> > + __field(int, err)
> > + __field(int, snap_err)
> > + ),
> > +
> > + TP_fast_assign(/* assign */
> > + __entry->dev = sb->s_dev;
> > + __entry->tid = commit_tid;
> > + __entry->locked_ns = locked_ns;
> > + __entry->nr_inodes = nr_inodes;
> > + __entry->nr_ranges = nr_ranges;
> > + __entry->err = err;
> > + __entry->snap_err = snap_err;
> > + ),
> > +
> > + TP_printk("dev %d,%d tid %u locked_ns %llu nr_inodes %u nr_ranges %u err %d snap_err %s",
> > + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tid,
> > + __entry->locked_ns, __entry->nr_inodes, __entry->nr_ranges,
> > + __entry->err, __print_symbolic(__entry->snap_err,
> > + TRACE_SNAP_ERR))
> > +);
> > +
> > +#undef EM
> > +#undef EMe
> > +#undef TRACE_SNAP_ERR
> > +
> > #define FC_REASON_NAME_STAT(reason) \
> > show_fc_reason(reason), \
> > __entry->fc_ineligible_rc[reason]
>
> As for the rest:
>
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
>
> [ Please add this reviewed-by to any new versions so I remember I already
> looked at it. ]
Sure, thanks a lot for your thoughtful review!
Regards,
Li
* [PATCH] tracing: separate module tracepoint strings from trace_printk formats
From: Cao Ruichuang @ 2026-04-13 12:33 UTC (permalink / raw)
To: petr.pavlu; +Cc: linux-trace-kernel, linux-kernel, mhiramat, Cao Ruichuang
In-Reply-To: <41e81533-0fd6-49f5-b7c1-b4e172affd2a@suse.com>
The previous module tracepoint_string() fix took the smallest
implementation path and reused the existing module trace_printk format
storage.
That was enough to make module __tracepoint_str entries show up in
printk_formats and be accepted by trace_is_tracepoint_string(), but it
also made those copied mappings persist after module unload. That does
not match the expected module lifetime semantics.
Handle module tracepoint_string() mappings separately instead of mixing
them into the module trace_printk format list. Keep copying the strings
into tracing-managed storage while the module is loaded, but track them
on their own list and drop them again on MODULE_STATE_GOING.
Keep module trace_printk format handling unchanged.
This split is intentional: module trace_printk formats and module
tracepoint_string() mappings do not have the same lifetime requirements.
Keeping them in one shared structure would either preserve
tracepoint_string() mappings too long again, or require mixed
ownership/refcount rules in a trace_printk-oriented structure.
The separate module tracepoint_string() list intentionally keeps one
copied mapping per module entry instead of trying to share copies across
modules by string contents. printk_formats is address-based, and sharing
those copies would add another layer of shared ownership/refcounting
without changing the lifetime rule this fix is trying to restore.
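As a rough illustration of the lifetime rule described above, here is a
minimal userspace sketch. It is not the kernel code from this patch: the
names str_entry, hold_string(), release_strings() and is_held_string()
are hypothetical, an int stands in for struct module *, and a plain
singly linked list stands in for the kernel list. It only shows one copy
being kept per owner, lookups matching by address rather than by
contents, and every copy being dropped when its owner goes away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for a per-module string entry. */
struct str_entry {
	struct str_entry *next;
	int owner;		/* stand-in for struct module * */
	char *str;		/* private copy of the owner's string */
};

static struct str_entry *str_list;

/* Copy @s for @owner; returns the copy's address, or NULL on failure. */
static const char *hold_string(int owner, const char *s)
{
	struct str_entry *e = malloc(sizeof(*e));
	size_t len;

	if (!e)
		return NULL;
	len = strlen(s) + 1;
	e->str = malloc(len);
	if (!e->str) {
		free(e);
		return NULL;
	}
	memcpy(e->str, s, len);
	e->owner = owner;
	e->next = str_list;
	str_list = e;
	return e->str;
}

/* Drop every copy owned by @owner, as would happen on module unload. */
static void release_strings(int owner)
{
	struct str_entry **pp = &str_list;

	while (*pp) {
		struct str_entry *e = *pp;

		if (e->owner != owner) {
			pp = &e->next;
			continue;
		}
		*pp = e->next;
		free(e->str);
		free(e);
	}
}

/* Address-based lookup: equal contents at another address do not match. */
static bool is_held_string(const char *p)
{
	for (struct str_entry *e = str_list; e; e = e->next)
		if (e->str == p)
			return true;
	return false;
}
```

Two owners holding the same text get two distinct addresses, so
releasing one owner never affects the other owner's mapping.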
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217196
Signed-off-by: Cao Ruichuang <create0818@163.com>
---
include/linux/tracepoint.h | 9 +-
kernel/trace/trace_printk.c | 250 ++++++++++++++++++++++--------------
2 files changed, 157 insertions(+), 102 deletions(-)
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index f14da542402..aec598a4017 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -479,11 +479,10 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
*
* For built-in code, the tracing system uses the original string address.
* For modules, the tracing code saves tracepoint strings into
- * tracing-managed storage when the module loads, so their mappings remain
- * available through printk_formats and trace string checks even after the
- * module's own memory goes away. As long as the string does not change
- * during the life of the module, it is fine to use tracepoint_string()
- * within a module.
+ * tracing-managed storage while the module is loaded, and drops those
+ * mappings again when the module unloads. As long as the string does not
+ * change during the life of the module, it is fine to use
+ * tracepoint_string() within a module.
*/
#define tracepoint_string(str) \
({ \
diff --git a/kernel/trace/trace_printk.c b/kernel/trace/trace_printk.c
index 9f67ce42ef6..0420ffcff93 100644
--- a/kernel/trace/trace_printk.c
+++ b/kernel/trace/trace_printk.c
@@ -21,24 +21,24 @@
#ifdef CONFIG_MODULES
-/*
- * modules trace_printk() formats and tracepoint_string() strings are
- * autosaved in struct trace_bprintk_fmt, which are queued on
- * trace_bprintk_fmt_list.
- */
+/* module trace_printk() formats are autosaved on trace_bprintk_fmt_list. */
static LIST_HEAD(trace_bprintk_fmt_list);
+/* module tracepoint_string() copies live on tracepoint_string_list. */
+static LIST_HEAD(tracepoint_string_list);
-/* serialize accesses to trace_bprintk_fmt_list */
+/* serialize accesses to module format and tracepoint-string lists */
static DEFINE_MUTEX(btrace_mutex);
struct trace_bprintk_fmt {
struct list_head list;
const char *fmt;
- unsigned int type;
};
-#define TRACE_BPRINTK_TYPE BIT(0)
-#define TRACE_TRACEPOINT_TYPE BIT(1)
+struct tracepoint_string_entry {
+ struct list_head list;
+ struct module *mod;
+ const char *str;
+};
static inline struct trace_bprintk_fmt *lookup_format(const char *fmt)
{
@@ -54,24 +54,21 @@ static inline struct trace_bprintk_fmt *lookup_format(const char *fmt)
return NULL;
}
-static void hold_module_trace_format(const char **start, const char **end,
- unsigned int type)
+static void hold_module_trace_bprintk_format(const char **start, const char **end)
{
const char **iter;
char *fmt;
/* allocate the trace_printk per cpu buffers */
- if ((type & TRACE_BPRINTK_TYPE) && start != end)
+ if (start != end)
trace_printk_init_buffers();
mutex_lock(&btrace_mutex);
for (iter = start; iter < end; iter++) {
struct trace_bprintk_fmt *tb_fmt = lookup_format(*iter);
if (tb_fmt) {
- if (!IS_ERR(tb_fmt)) {
- tb_fmt->type |= type;
+ if (!IS_ERR(tb_fmt))
*iter = tb_fmt->fmt;
- }
continue;
}
@@ -83,7 +80,6 @@ static void hold_module_trace_format(const char **start, const char **end,
list_add_tail(&tb_fmt->list, &trace_bprintk_fmt_list);
strcpy(fmt, *iter);
tb_fmt->fmt = fmt;
- tb_fmt->type = type;
} else
kfree(tb_fmt);
}
@@ -93,89 +89,156 @@ static void hold_module_trace_format(const char **start, const char **end,
mutex_unlock(&btrace_mutex);
}
+static void hold_module_tracepoint_strings(struct module *mod,
+ const char **start,
+ const char **end)
+{
+ const char **iter;
+
+ mutex_lock(&btrace_mutex);
+ for (iter = start; iter < end; iter++) {
+ struct tracepoint_string_entry *tp_entry;
+ char *str;
+
+ tp_entry = kmalloc_obj(*tp_entry);
+ if (!tp_entry)
+ continue;
+
+ str = kstrdup(*iter, GFP_KERNEL);
+ if (!str) {
+ kfree(tp_entry);
+ continue;
+ }
+
+ tp_entry->mod = mod;
+ tp_entry->str = str;
+ list_add_tail(&tp_entry->list, &tracepoint_string_list);
+ *iter = tp_entry->str;
+ }
+ mutex_unlock(&btrace_mutex);
+}
+
+static void release_module_tracepoint_strings(struct module *mod)
+{
+ struct tracepoint_string_entry *tp_entry, *n;
+
+ mutex_lock(&btrace_mutex);
+ list_for_each_entry_safe(tp_entry, n, &tracepoint_string_list, list) {
+ if (tp_entry->mod != mod)
+ continue;
+ list_del(&tp_entry->list);
+ kfree(tp_entry->str);
+ kfree(tp_entry);
+ }
+ mutex_unlock(&btrace_mutex);
+}
+
static int module_trace_format_notify(struct notifier_block *self,
unsigned long val, void *data)
{
struct module *mod = data;
- if (val != MODULE_STATE_COMING)
- return NOTIFY_OK;
+ switch (val) {
+ case MODULE_STATE_COMING:
+ if (mod->num_trace_bprintk_fmt) {
+ const char **start = mod->trace_bprintk_fmt_start;
+ const char **end = start + mod->num_trace_bprintk_fmt;
- if (mod->num_trace_bprintk_fmt) {
- const char **start = mod->trace_bprintk_fmt_start;
- const char **end = start + mod->num_trace_bprintk_fmt;
+ hold_module_trace_bprintk_format(start, end);
+ }
+
+ if (mod->num_tracepoint_strings) {
+ const char **start = mod->tracepoint_strings_start;
+ const char **end = start + mod->num_tracepoint_strings;
- hold_module_trace_format(start, end, TRACE_BPRINTK_TYPE);
+ hold_module_tracepoint_strings(mod, start, end);
+ }
+ break;
+ case MODULE_STATE_GOING:
+ release_module_tracepoint_strings(mod);
+ break;
}
- if (mod->num_tracepoint_strings) {
- const char **start = mod->tracepoint_strings_start;
- const char **end = start + mod->num_tracepoint_strings;
+ return NOTIFY_OK;
+}
- hold_module_trace_format(start, end, TRACE_TRACEPOINT_TYPE);
+static const char **find_first_mod_entry(void)
+{
+ struct trace_bprintk_fmt *tb_fmt;
+ struct tracepoint_string_entry *tp_entry;
+
+ if (!list_empty(&trace_bprintk_fmt_list)) {
+ tb_fmt = list_first_entry(&trace_bprintk_fmt_list,
+ typeof(*tb_fmt), list);
+ return &tb_fmt->fmt;
}
- return NOTIFY_OK;
+ if (!list_empty(&tracepoint_string_list)) {
+ tp_entry = list_first_entry(&tracepoint_string_list,
+ typeof(*tp_entry), list);
+ return &tp_entry->str;
+ }
+
+ return NULL;
}
-/*
- * The debugfs/tracing/printk_formats file maps the addresses with
- * the ASCII formats that are used in the bprintk events in the
- * buffer. For userspace tools to be able to decode the events from
- * the buffer, they need to be able to map the address with the format.
- *
- * The addresses of the bprintk formats are in their own section
- * __trace_printk_fmt. But for modules we copy them into a link list.
- * The code to print the formats and their addresses passes around the
- * address of the fmt string. If the fmt address passed into the seq
- * functions is within the kernel core __trace_printk_fmt section, then
- * it simply uses the next pointer in the list.
- *
- * When the fmt pointer is outside the kernel core __trace_printk_fmt
- * section, then we need to read the link list pointers. The trick is
- * we pass the address of the string to the seq function just like
- * we do for the kernel core formats. To get back the structure that
- * holds the format, we simply use container_of() and then go to the
- * next format in the list.
- */
-static const char **
-find_next_mod_format(int start_index, void *v, const char **fmt, loff_t *pos)
+static struct trace_bprintk_fmt *lookup_mod_format_ptr(const char **fmt_ptr)
{
- struct trace_bprintk_fmt *mod_fmt;
+ struct trace_bprintk_fmt *tb_fmt;
- if (list_empty(&trace_bprintk_fmt_list))
- return NULL;
+ list_for_each_entry(tb_fmt, &trace_bprintk_fmt_list, list) {
+ if (fmt_ptr == &tb_fmt->fmt)
+ return tb_fmt;
+ }
- /*
- * v will point to the address of the fmt record from t_next
- * v will be NULL from t_start.
- * If this is the first pointer or called from start
- * then we need to walk the list.
- */
- if (!v || start_index == *pos) {
- struct trace_bprintk_fmt *p;
-
- /* search the module list */
- list_for_each_entry(p, &trace_bprintk_fmt_list, list) {
- if (start_index == *pos)
- return &p->fmt;
- start_index++;
+ return NULL;
+}
+
+static struct tracepoint_string_entry *lookup_mod_tracepoint_ptr(const char **str_ptr)
+{
+ struct tracepoint_string_entry *tp_entry;
+
+ list_for_each_entry(tp_entry, &tracepoint_string_list, list) {
+ if (str_ptr == &tp_entry->str)
+ return tp_entry;
+ }
+
+ return NULL;
+}
+
+static const char **find_next_mod_entry(int start_index, void *v, loff_t *pos)
+{
+ struct trace_bprintk_fmt *tb_fmt;
+ struct tracepoint_string_entry *tp_entry;
+
+ if (!v || start_index == *pos)
+ return find_first_mod_entry();
+
+ tb_fmt = lookup_mod_format_ptr(v);
+ if (tb_fmt) {
+ if (tb_fmt->list.next != &trace_bprintk_fmt_list) {
+ tb_fmt = list_next_entry(tb_fmt, list);
+ return &tb_fmt->fmt;
}
- /* pos > index */
+
+ if (!list_empty(&tracepoint_string_list)) {
+ tp_entry = list_first_entry(&tracepoint_string_list,
+ typeof(*tp_entry), list);
+ return &tp_entry->str;
+ }
+
return NULL;
}
- /*
- * v points to the address of the fmt field in the mod list
- * structure that holds the module print format.
- */
- mod_fmt = container_of(v, typeof(*mod_fmt), fmt);
- if (mod_fmt->list.next == &trace_bprintk_fmt_list)
+ tp_entry = lookup_mod_tracepoint_ptr(v);
+ if (!tp_entry)
return NULL;
- mod_fmt = container_of(mod_fmt->list.next, typeof(*mod_fmt), list);
+ if (tp_entry->list.next == &tracepoint_string_list)
+ return NULL;
- return &mod_fmt->fmt;
+ tp_entry = list_next_entry(tp_entry, list);
+ return &tp_entry->str;
}
static void format_mod_start(void)
@@ -195,8 +258,8 @@ module_trace_format_notify(struct notifier_block *self,
{
return NOTIFY_OK;
}
-static inline const char **
-find_next_mod_format(int start_index, void *v, const char **fmt, loff_t *pos)
+static inline const char **find_next_mod_entry(int start_index, void *v,
+ loff_t *pos)
{
return NULL;
}
@@ -274,7 +337,7 @@ bool trace_is_tracepoint_string(const char *str)
{
const char **ptr = __start___tracepoint_str;
#ifdef CONFIG_MODULES
- struct trace_bprintk_fmt *tb_fmt;
+ struct tracepoint_string_entry *tp_entry;
#endif
for (ptr = __start___tracepoint_str; ptr < __stop___tracepoint_str; ptr++) {
@@ -284,8 +347,8 @@ bool trace_is_tracepoint_string(const char *str)
#ifdef CONFIG_MODULES
mutex_lock(&btrace_mutex);
- list_for_each_entry(tb_fmt, &trace_bprintk_fmt_list, list) {
- if ((tb_fmt->type & TRACE_TRACEPOINT_TYPE) && str == tb_fmt->fmt) {
+ list_for_each_entry(tp_entry, &tracepoint_string_list, list) {
+ if (str == tp_entry->str) {
mutex_unlock(&btrace_mutex);
return true;
}
@@ -297,9 +360,8 @@ bool trace_is_tracepoint_string(const char *str)
static const char **find_next(void *v, loff_t *pos)
{
- const char **fmt = v;
int start_index;
- int last_index;
+ int next_index;
start_index = __stop___trace_bprintk_fmt - __start___trace_bprintk_fmt;
@@ -307,25 +369,19 @@ static const char **find_next(void *v, loff_t *pos)
return __start___trace_bprintk_fmt + *pos;
/*
- * The __tracepoint_str section is treated the same as the
- * __trace_printk_fmt section. The difference is that the
- * __trace_printk_fmt section should only be used by trace_printk()
- * in a debugging environment, as if anything exists in that section
- * the trace_prink() helper buffers are allocated, which would just
- * waste space in a production environment.
- *
- * The __tracepoint_str sections on the other hand are used by
- * tracepoints which need to map pointers to their strings to
- * the ASCII text for userspace.
+ * Built-in __tracepoint_str entries are exported directly from the
+ * core section. Module tracepoint_string() mappings are kept on a
+ * separate tracing-managed list below, because their lifetime is tied
+ * to module load/unload and differs from module trace_printk() formats.
*/
- last_index = start_index;
+ next_index = start_index;
start_index = __stop___tracepoint_str - __start___tracepoint_str;
- if (*pos < last_index + start_index)
- return __start___tracepoint_str + (*pos - last_index);
+ if (*pos < next_index + start_index)
+ return __start___tracepoint_str + (*pos - next_index);
- start_index += last_index;
- return find_next_mod_format(start_index, v, fmt, pos);
+ start_index += next_index;
+ return find_next_mod_entry(start_index, v, pos);
}
static void *
--
2.39.5 (Apple Git-154)
* Re: [PATCH v2] tracing: preserve module tracepoint strings
From: Petr Pavlu @ 2026-04-13 9:40 UTC (permalink / raw)
To: Cao Ruichuang
Cc: rostedt, mhiramat, mathieu.desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <20260410051847.73259-1-create0818@163.com>
On 4/10/26 7:18 AM, Cao Ruichuang wrote:
> tracepoint_string() is documented as exporting constant strings
> through printk_formats, including when it is used from modules.
> That currently does not work.
>
> A small test module that calls
> tracepoint_string("tracepoint_string_test_module_string") loads
> successfully and gets a pointer back, but the string never appears
> in /sys/kernel/tracing/printk_formats. The loader only collects
> __trace_printk_fmt from modules and ignores __tracepoint_str.
>
> Collect module __tracepoint_str entries too, copy them to stable
> tracing-managed storage like module trace_printk formats, and let
> trace_is_tracepoint_string() recognize those copied strings. This
> makes module tracepoint strings visible through printk_formats and
> keeps them accepted by the trace string safety checks.
>
> Update the tracepoint_string() documentation to describe this
> module behavior explicitly, so the comment matches the preserved
> module-string mappings exported by tracing.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217196
> Signed-off-by: Cao Ruichuang <create0818@163.com>
> ---
> v2:
> - update tracepoint_string() documentation to describe the preserved
> module-string mapping explicitly
> - address Petr Pavlu's review about the comment not matching the
> implemented module behavior
I questioned in my previous comment whether the data associated with
tracepoint_string() could be dropped when the module that created it is
unloaded. Typically, modules should not leave any data behind when they
are removed. Note how kernel/trace/trace_events.c tracks event fields
using add_str_to_module() and releases them in
trace_module_remove_events(). In practice, I suppose this isn't a large
problem because the usage of tracepoint_string() is limited and one
won't typically load/unload different modules that use this facility.
Nonetheless, what is the reason for keeping the tracepoint_string()
data for unloaded modules?
--
Thanks,
Petr
* Re: [PATCH] selftests/ftrace: Account for fprobe attachment at creation
From: Masami Hiramatsu @ 2026-04-13 8:48 UTC (permalink / raw)
To: Cao Ruichuang
Cc: rostedt, mathieu.desnoyers, shuah, linux-kernel,
linux-trace-kernel, linux-kselftest
In-Reply-To: <20260410043243.65800-1-create0818@163.com>
On Fri, 10 Apr 2026 12:32:43 +0800
Cao Ruichuang <create0818@163.com> wrote:
> Hi Masami,
>
> I reran this in clean QEMU on two kernels and got different results.
>
> 1. Ubuntu distro kernel:
> Linux 6.8.0-100-generic #100-Ubuntu SMP PREEMPT_DYNAMIC
> Tue Jan 13 16:40:06 UTC 2026
>
> baseline count=2
> after_create123 count=4
> after_enable1/2/3 count=4
>
> baseline enabled_functions:
> __hid_bpf_tail_call ...
>
> after_create123 enabled_functions:
> kernel_clone (2) R ->arch_ftrace_ops_list_func+0x0/0x280
> kmem_cache_free (1) R tramp: ... ->fprobe_handler+0x0/0x40
> __hid_bpf_tail_call ...
>
> 2. Current source-tree kernel built from the clean snapshot of my patch
> branch:
> Linux 7.0.0-rc6 #2 SMP PREEMPT_DYNAMIC Fri Apr 10 12:19:39 CST 2026
>
> baseline count=0
> after_create123 count=0
> after_enable1 count=1
> after_enable2 count=1
> after_enable3 count=2
>
> after_create123 enabled_functions:
> <empty>
>
> after_enable3 enabled_functions:
> kernel_clone (2) ->arch_ftrace_ops_list_func+0x0/0x200
> kmem_cache_free (1) tramp: ... ->fprobe_ftrace_entry+0x0/0x220
>
> So the behavior I reported earlier reproduces on that Ubuntu 6.8 kernel,
> but not on the current source-tree kernel. I think my earlier conclusion
> was too broad.
Thanks for reporting the difference in behavior.
>
> I will stop pushing this testcase change for now unless I can narrow down
> which kernel change caused the difference.
OK.
Thanks!
>
> Thanks,
> Cao Ruichuang
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* [PATCH v5 3/3] tracing/fprobe: Check the same type fprobe on table as the unregistered one
From: Masami Hiramatsu (Google) @ 2026-04-13 8:39 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177606956628.929411.17392736689322577701.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Commit 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
introduced a different ftrace_ops for entry-only fprobes.
However, when unregistering an fprobe, the kernel only checks if another
fprobe exists at the same address, without checking which type of fprobe
it is.
If fprobes of different types are registered at the same address, that
address is added to both fgraph_ops and ftrace_ops, but it is deleted
from only one of them on unregistration: the address of the fprobe
removed first is not deleted from its ops.
This leaves stale entries in either fgraph_ops or ftrace_ops.
For example:
=======
cd /sys/kernel/tracing
# 'Add entry and exit events on the same place'
echo 'f:event1 vfs_read' >> dynamic_events
echo 'f:event2 vfs_read%return' >> dynamic_events
# 'Enable both of them'
echo 1 > events/fprobes/enable
cat enabled_functions
vfs_read (2) ->arch_ftrace_ops_list_func+0x0/0x210
# 'Disable and remove exit event'
echo 0 > events/fprobes/event2/enable
echo -:event2 >> dynamic_events
# 'Disable and remove all events'
echo 0 > events/fprobes/enable
echo > dynamic_events
# 'Add another event'
echo 'f:event3 vfs_open%return' > dynamic_events
cat dynamic_events
f:fprobes/event3 vfs_open%return
echo 1 > events/fprobes/enable
cat enabled_functions
vfs_open (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
vfs_read (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
=======
As you can see, a stale entry for vfs_read remains.
To fix this issue, when unregistering, the kernel should also check
whether any fprobe of the same type still exists at the same address
and, if not, delete the address from the corresponding fgraph_ops or
ftrace_ops.
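To make the difference between the two checks concrete, here is a small
userspace sketch. It is only a model, not the code in this patch: struct
probe, any_exists() and same_type_exists() are hypothetical names, and a
flat array replaces the rhltable. It assumes, as fprobe_is_ftrace() does
in the ftrace-capable configuration, that an entry-only probe belongs to
ftrace_ops and a probe with an exit handler belongs to fgraph_ops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one fprobe registered at one address. */
struct probe {
	unsigned long addr;
	bool has_exit;	/* true: exit handler set, so fgraph-backed */
	bool live;	/* false once the probe has been deleted */
};

/* Entry-only probes use ftrace_ops; the others use fgraph_ops. */
static bool probe_is_ftrace(const struct probe *p)
{
	return !p->has_exit;
}

/* Type-blind check: is any probe left at @ip? */
static bool any_exists(const struct probe *tbl, size_t n, unsigned long ip)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].live && tbl[i].addr == ip)
			return true;
	return false;
}

/* Type-aware check: is a probe of the same kind left at @ip? */
static bool same_type_exists(const struct probe *tbl, size_t n,
			     unsigned long ip, bool ftrace)
{
	for (size_t i = 0; i < n; i++) {
		if (!tbl[i].live || tbl[i].addr != ip)
			continue;
		if (probe_is_ftrace(&tbl[i]) == ftrace)
			return true;
	}
	return false;
}
```

With one entry probe and one exit probe on the same address, deleting
the exit probe first makes the type-blind check report that the address
is still in use, while the type-aware check reports that no fgraph-backed
probe is left, so the address can be dropped from that ops.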
Fixes: 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
kernel/trace/fprobe.c | 85 +++++++++++++++++++++++++++++++++++++------------
1 file changed, 65 insertions(+), 20 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index b5ce98d2ea96..19c5b65ed5fb 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -92,11 +92,8 @@ static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
return ret;
}
-/* Return true if there are synonims */
-static bool delete_fprobe_node(struct fprobe_hlist_node *node)
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
{
- bool ret;
-
lockdep_assert_held(&fprobe_mutex);
/* Avoid double deleting and non-inserted nodes */
@@ -105,13 +102,6 @@ static bool delete_fprobe_node(struct fprobe_hlist_node *node)
rhltable_remove(&fprobe_ip_table, &node->hlist,
fprobe_rht_params);
}
-
- rcu_read_lock();
- ret = !!rhltable_lookup(&fprobe_ip_table, &node->addr,
- fprobe_rht_params);
- rcu_read_unlock();
-
- return ret;
}
/* Check existence of the fprobe */
@@ -345,6 +335,32 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
return !fp->exit_handler;
}
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
+{
+ struct rhlist_head *head, *pos;
+ struct fprobe_hlist_node *node;
+ struct fprobe *fp;
+
+ guard(rcu)();
+ head = rhltable_lookup(&fprobe_ip_table, &ip,
+ fprobe_rht_params);
+ if (!head)
+ return false;
+ /* We have to check the same type on the list. */
+ rhl_for_each_entry_rcu(node, pos, head, hlist) {
+ if (node->addr != ip)
+ break;
+ fp = READ_ONCE(node->fp);
+ if (likely(fp)) {
+ if ((!ftrace && fp->exit_handler) ||
+ (ftrace && !fp->exit_handler))
+ return true;
+ }
+ }
+
+ return false;
+}
+
#ifdef CONFIG_MODULES
static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
@@ -367,6 +383,29 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
return false;
}
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
+{
+ struct rhlist_head *head, *pos;
+ struct fprobe_hlist_node *node;
+ struct fprobe *fp;
+
+ guard(rcu)();
+ head = rhltable_lookup(&fprobe_ip_table, &ip,
+ fprobe_rht_params);
+ if (!head)
+ return false;
+ /* We only need to check fp is there. */
+ rhl_for_each_entry_rcu(node, pos, head, hlist) {
+ if (node->addr != ip)
+ break;
+ fp = READ_ONCE(node->fp);
+ if (likely(fp))
+ return true;
+ }
+
+ return false;
+}
+
#ifdef CONFIG_MODULES
static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
@@ -555,18 +594,25 @@ struct fprobe_addr_list {
static int fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
struct fprobe_addr_list *alist)
{
+ lockdep_assert_in_rcu_read_lock();
+
if (!within_module(node->addr, mod))
return 0;
- if (delete_fprobe_node(node))
- return 0;
+ delete_fprobe_node(node);
/* If no address list is available, we can't track this address. */
if (!alist->addrs)
return 0;
+ /*
+ * Don't care the type here, because all fprobes on the same
+ * address must be removed eventually.
+ */
+ if (!rhltable_lookup(&fprobe_ip_table, &node->addr, fprobe_rht_params)) {
+ alist->addrs[alist->index++] = node->addr;
+ if (alist->index == alist->size)
+ return -ENOSPC;
+ }
- alist->addrs[alist->index++] = node->addr;
- if (alist->index == alist->size)
- return -ENOSPC;
return 0;
}
@@ -924,10 +970,9 @@ static int unregister_fprobe_nolock(struct fprobe *fp, bool force)
/* Remove non-synonim ips from table and hash */
count = 0;
for (i = 0; i < hlist_array->size; i++) {
- if (delete_fprobe_node(&hlist_array->array[i]))
- continue;
-
- if (addrs)
+ delete_fprobe_node(&hlist_array->array[i]);
+ if (addrs && !fprobe_exists_on_hash(hlist_array->array[i].addr,
+ fprobe_is_ftrace(fp)))
addrs[count++] = hlist_array->array[i].addr;
}
del_fprobe_hash(fp);
* [PATCH v5 2/3] tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
From: Masami Hiramatsu (Google) @ 2026-04-13 8:39 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177606956628.929411.17392736689322577701.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
fprobe_remove_node_in_module() is called with the RCU read lock held,
but it invokes kcalloc() if more than 8 fprobes are installed on the
module. Sashiko warns about this because kcalloc() can sleep [1].
[1] https://sashiko.dev/#/patchset/177552432201.853249.5125045538812833325.stgit%40mhiramat.tok.corp.google.com
To fix this issue, expand the batch size to 128 and stop growing the
fprobe_addr_list. When the list fills up, cancel the walk over
fprobe_ip_table, update fgraph/ftrace_ops, and retry the loop.
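The fixed-size batch with retry can be sketched in userspace as follows.
This is only a model under stated assumptions: struct batch, struct
entry, flush_batch() and drain() are hypothetical names, a flat array
replaces the rhashtable walk, and BATCH_SIZE is kept at 4 (the patch
uses 128) so that retries actually happen in a small example. The point
is that nothing is allocated inside the scan; a full batch simply stops
the scan, gets flushed, and the scan starts over.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_SIZE 4	/* small on purpose, to force retries */

struct batch {
	size_t index;
	unsigned long addrs[BATCH_SIZE];
};

/* Hypothetical table entry; removed entries are skipped on later passes. */
struct entry {
	unsigned long addr;
	bool removed;
};

static size_t flushed;	/* total addresses handed to the flush step */

/* Stand-in for updating the ops filters with the collected addresses. */
static void flush_batch(struct batch *b)
{
	flushed += b->index;
	b->index = 0;
}

/* Returns the number of passes needed to drain the table. */
static int drain(struct entry *tbl, size_t n)
{
	struct batch b = { 0 };
	int passes = 0;
	bool retry;

	do {
		retry = false;
		passes++;
		for (size_t i = 0; i < n; i++) {
			if (tbl[i].removed)
				continue;
			tbl[i].removed = true;
			b.addrs[b.index++] = tbl[i].addr;
			if (b.index == BATCH_SIZE) {
				/* Batch full: stop, flush, rescan. */
				retry = true;
				break;
			}
		}
		if (b.index > 0)
			flush_batch(&b);
	} while (retry);

	return passes;
}
```

The cost of this design is an extra pass whenever the batch fills up,
which is the trade made here to avoid a sleeping allocation in the walk.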
Fixes: 0de4c70d04a4 ("tracing: fprobe: use rhltable for fprobe_ip_table")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v5:
- Skip updating ftrace_ops when memory allocation fails during module
unloading.
Changes in v4:
- Fix a typo that caused a build error when CONFIG_DYNAMIC_FTRACE=n.
Changes in v3:
- Retry inside rhltable_walk_enter/exit().
- Rename fprobe_set_ips() to fprobe_remove_ips().
- Rename 'retry' label to 'again'.
---
kernel/trace/fprobe.c | 89 +++++++++++++++++++++++--------------------------
1 file changed, 41 insertions(+), 48 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 1d9a3d2276cd..b5ce98d2ea96 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -346,11 +346,10 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
}
#ifdef CONFIG_MODULES
-static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
- int reset)
+static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
- ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, remove, reset);
- ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, remove, reset);
+ ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+ ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
}
#endif
#else
@@ -369,10 +368,9 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
}
#ifdef CONFIG_MODULES
-static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
- int reset)
+static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
- ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, remove, reset);
+ ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
}
#endif
#endif /* !CONFIG_DYNAMIC_FTRACE_WITH_ARGS && !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
@@ -546,7 +544,7 @@ static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
#ifdef CONFIG_MODULES
-#define FPROBE_IPS_BATCH_INIT 8
+#define FPROBE_IPS_BATCH_INIT 128
/* instruction pointer address list */
struct fprobe_addr_list {
int index;
@@ -554,45 +552,24 @@ struct fprobe_addr_list {
unsigned long *addrs;
};
-static int fprobe_addr_list_add(struct fprobe_addr_list *alist, unsigned long addr)
+static int fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
+ struct fprobe_addr_list *alist)
{
- unsigned long *addrs;
-
- /* Previously we failed to expand the list. */
- if (alist->index == alist->size)
- return -ENOSPC;
-
- alist->addrs[alist->index++] = addr;
- if (alist->index < alist->size)
+ if (!within_module(node->addr, mod))
return 0;
- /* Expand the address list */
- addrs = kcalloc(alist->size * 2, sizeof(*addrs), GFP_KERNEL);
- if (!addrs)
- return -ENOMEM;
-
- memcpy(addrs, alist->addrs, alist->size * sizeof(*addrs));
- alist->size *= 2;
- kfree(alist->addrs);
- alist->addrs = addrs;
+ if (delete_fprobe_node(node))
+ return 0;
+ /* If no address list is available, we can't track this address. */
+ if (!alist->addrs)
+ return 0;
+ alist->addrs[alist->index++] = node->addr;
+ if (alist->index == alist->size)
+ return -ENOSPC;
return 0;
}
-static void fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
- struct fprobe_addr_list *alist)
-{
- if (!within_module(node->addr, mod))
- return;
- if (delete_fprobe_node(node))
- return;
- /*
- * If failed to update alist, just continue to update hlist.
- * Therefore, at list user handler will not hit anymore.
- */
- fprobe_addr_list_add(alist, node->addr);
-}
-
/* Handle module unloading to manage fprobe_ip_table. */
static int fprobe_module_callback(struct notifier_block *nb,
unsigned long val, void *data)
@@ -601,29 +578,45 @@ static int fprobe_module_callback(struct notifier_block *nb,
struct fprobe_hlist_node *node;
struct rhashtable_iter iter;
struct module *mod = data;
+ bool retry;
if (val != MODULE_STATE_GOING)
return NOTIFY_DONE;
alist.addrs = kcalloc(alist.size, sizeof(*alist.addrs), GFP_KERNEL);
- /* If failed to alloc memory, we can not remove ips from hash. */
- if (!alist.addrs)
- return NOTIFY_DONE;
+ /*
+ * If failed to alloc memory, ftrace_ops will not be able to remove ips from
+ * hash, but we can still remove nodes from fprobe_ip_table, so we can avoid
+ * the potential wrong callback. So just print a warning here and try to
+ * continue without address list.
+ */
+ WARN_ONCE(!alist.addrs,
+ "Failed to allocate memory for fprobe_addr_list, ftrace_ops will not be updated");
mutex_lock(&fprobe_mutex);
rhltable_walk_enter(&fprobe_ip_table, &iter);
+again:
+ retry = false;
+ alist.index = 0;
do {
rhashtable_walk_start(&iter);
while ((node = rhashtable_walk_next(&iter)) && !IS_ERR(node))
- fprobe_remove_node_in_module(mod, node, &alist);
+ if (fprobe_remove_node_in_module(mod, node, &alist) < 0) {
+ retry = true;
+ break;
+ }
rhashtable_walk_stop(&iter);
- } while (node == ERR_PTR(-EAGAIN));
- rhashtable_walk_exit(&iter);
+ } while (node == ERR_PTR(-EAGAIN) && !retry);
+ /* Remove any ips from hash table(s) */
+ if (alist.index > 0) {
+ fprobe_remove_ips(alist.addrs, alist.index);
+ if (retry)
+ goto again;
+ }
- if (alist.index > 0)
- fprobe_set_ips(alist.addrs, alist.index, 1, 0);
+ rhashtable_walk_exit(&iter);
mutex_unlock(&fprobe_mutex);
kfree(alist.addrs);
* [PATCH v5 1/3] tracing/fprobe: Remove fprobe from hash in failure path
From: Masami Hiramatsu (Google) @ 2026-04-13 8:39 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177606956628.929411.17392736689322577701.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When register_fprobe_ips() fails, it tries to remove the list of
fprobe_hash_node entries from fprobe_ip_table, but it fails to remove
the fprobe itself from fprobe_table. Moreover, a fprobe_hash_node that
has been added to the rhltable must be freed with kfree_rcu() after it
is removed from the rhltable.
To fix these issues, reuse the internal code of unregister_fprobe() to
roll back the partially registered fprobe.
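The rollback idea can be modelled in userspace like this. It is a sketch
only: struct node, insert_node(), delete_node(), register_all() and the
fail_at knob are hypothetical, and a counter replaces the hash table.
What it tries to show is that a node is marked as inserted only after
the insert succeeds, so a single removal path can be run over all nodes
on failure without touching nodes that never made it into the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	unsigned long addr;
	bool inserted;	/* set only after a successful insert */
};

static size_t table_count;		/* nodes currently in the table */
static size_t fail_at = (size_t)-1;	/* hypothetical fault-injection knob */

static int insert_node(struct node *n, size_t idx)
{
	if (idx == fail_at)
		return -1;
	n->inserted = true;
	table_count++;
	return 0;
}

/* Safe on nodes that were never inserted and on double deletion. */
static void delete_node(struct node *n)
{
	if (!n->inserted)
		return;
	n->inserted = false;
	table_count--;
}

static int register_all(struct node *nodes, size_t num)
{
	for (size_t i = 0; i < num; i++) {
		int ret = insert_node(&nodes[i], i);

		if (ret) {
			/* Roll back through the common removal path. */
			for (size_t j = 0; j < num; j++)
				delete_node(&nodes[j]);
			return ret;
		}
	}
	return 0;
}
```

Because delete_node() ignores nodes that were not inserted, the rollback
does not need to remember how far the registration got.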
Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v5:
- When rolling back an fprobe that failed to register, forcibly remove
the fprobe_hash_node entries and warn on failure.
Changes in v4:
- Remove short-cut case because we always need to update ftrace_ops.
- Use guard(mutex) in register_fprobe_ips() to unlock it correctly.
- Remove redundant !ret check in register_fprobe_ips().
- Do not set hlist_array->size in failure case; instead, set
hlist_array->array[i].fp only when insertion succeeds.
Changes in v3:
- Newly added.
---
kernel/trace/fprobe.c | 101 ++++++++++++++++++++++++++-----------------------
1 file changed, 53 insertions(+), 48 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index dcadf1d23b8a..1d9a3d2276cd 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -4,6 +4,7 @@
*/
#define pr_fmt(fmt) "fprobe: " fmt
+#include <linux/cleanup.h>
#include <linux/err.h>
#include <linux/fprobe.h>
#include <linux/kallsyms.h>
@@ -78,20 +79,27 @@ static const struct rhashtable_params fprobe_rht_params = {
};
/* Node insertion and deletion requires the fprobe_mutex */
-static int insert_fprobe_node(struct fprobe_hlist_node *node)
+static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
{
+ int ret;
+
lockdep_assert_held(&fprobe_mutex);
- return rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
+ ret = rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
+ /* Set the fprobe pointer if insertion was successful. */
+ if (!ret)
+ WRITE_ONCE(node->fp, fp);
+ return ret;
}
/* Return true if there are synonims */
static bool delete_fprobe_node(struct fprobe_hlist_node *node)
{
- lockdep_assert_held(&fprobe_mutex);
bool ret;
- /* Avoid double deleting */
+ lockdep_assert_held(&fprobe_mutex);
+
+ /* Avoid double deleting and non-inserted nodes */
if (READ_ONCE(node->fp) != NULL) {
WRITE_ONCE(node->fp, NULL);
rhltable_remove(&fprobe_ip_table, &node->hlist,
@@ -759,7 +767,6 @@ static int fprobe_init(struct fprobe *fp, unsigned long *addrs, int num)
fp->hlist_array = hlist_array;
hlist_array->fp = fp;
for (i = 0; i < num; i++) {
- hlist_array->array[i].fp = fp;
addr = ftrace_location(addrs[i]);
if (!addr) {
fprobe_fail_cleanup(fp);
@@ -823,6 +830,8 @@ int register_fprobe(struct fprobe *fp, const char *filter, const char *notfilter
}
EXPORT_SYMBOL_GPL(register_fprobe);
+static int unregister_fprobe_nolock(struct fprobe *fp, bool force);
+
/**
* register_fprobe_ips() - Register fprobe to ftrace by address.
* @fp: A fprobe data structure to be registered.
@@ -845,31 +854,27 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
if (ret)
return ret;
- mutex_lock(&fprobe_mutex);
+ guard(mutex)(&fprobe_mutex);
- hlist_array = fp->hlist_array;
if (fprobe_is_ftrace(fp))
ret = fprobe_ftrace_add_ips(addrs, num);
else
ret = fprobe_graph_add_ips(addrs, num);
+ if (ret) {
+ fprobe_fail_cleanup(fp);
+ return ret;
+ }
- if (!ret) {
- add_fprobe_hash(fp);
- for (i = 0; i < hlist_array->size; i++) {
- ret = insert_fprobe_node(&hlist_array->array[i]);
- if (ret)
- break;
- }
- /* fallback on insert error */
+ hlist_array = fp->hlist_array;
+ add_fprobe_hash(fp);
+ for (i = 0; i < hlist_array->size; i++) {
+ ret = insert_fprobe_node(&hlist_array->array[i], fp);
if (ret) {
- for (i--; i >= 0; i--)
- delete_fprobe_node(&hlist_array->array[i]);
+ if (unregister_fprobe_nolock(fp, true))
+ pr_warn("Failed to cleanup fprobe after insertion failure.\n");
+ break;
}
}
- mutex_unlock(&fprobe_mutex);
-
- if (ret)
- fprobe_fail_cleanup(fp);
return ret;
}
@@ -913,37 +918,23 @@ bool fprobe_is_registered(struct fprobe *fp)
return true;
}
-/**
- * unregister_fprobe() - Unregister fprobe.
- * @fp: A fprobe data structure to be unregistered.
- *
- * Unregister fprobe (and remove ftrace hooks from the function entries).
- *
- * Return 0 if @fp is unregistered successfully, -errno if not.
- */
-int unregister_fprobe(struct fprobe *fp)
+static int unregister_fprobe_nolock(struct fprobe *fp, bool force)
{
- struct fprobe_hlist *hlist_array;
+ struct fprobe_hlist *hlist_array = fp->hlist_array;
unsigned long *addrs = NULL;
- int ret = 0, i, count;
+ int i, count;
- mutex_lock(&fprobe_mutex);
- if (!fp || !is_fprobe_still_exist(fp)) {
- ret = -EINVAL;
- goto out;
- }
-
- hlist_array = fp->hlist_array;
addrs = kcalloc(hlist_array->size, sizeof(unsigned long), GFP_KERNEL);
- if (!addrs) {
- ret = -ENOMEM; /* TODO: Fallback to one-by-one loop */
- goto out;
- }
+ if (!addrs && !force)
+ return -ENOMEM;
/* Remove non-synonim ips from table and hash */
count = 0;
for (i = 0; i < hlist_array->size; i++) {
- if (!delete_fprobe_node(&hlist_array->array[i]))
+ if (delete_fprobe_node(&hlist_array->array[i]))
+ continue;
+
+ if (addrs)
addrs[count++] = hlist_array->array[i].addr;
}
del_fprobe_hash(fp);
@@ -955,12 +946,26 @@ int unregister_fprobe(struct fprobe *fp)
kfree_rcu(hlist_array, rcu);
fp->hlist_array = NULL;
+ kfree(addrs);
-out:
- mutex_unlock(&fprobe_mutex);
+ return !addrs ? -ENOMEM : 0;
+}
- kfree(addrs);
- return ret;
+/**
+ * unregister_fprobe() - Unregister fprobe.
+ * @fp: A fprobe data structure to be unregistered.
+ *
+ * Unregister fprobe (and remove ftrace hooks from the function entries).
+ *
+ * Return 0 if @fp is unregistered successfully, -errno if not.
+ */
+int unregister_fprobe(struct fprobe *fp)
+{
+ guard(mutex)(&fprobe_mutex);
+ if (!fp || !is_fprobe_still_exist(fp))
+ return -EINVAL;
+
+ return unregister_fprobe_nolock(fp, false);
}
EXPORT_SYMBOL_GPL(unregister_fprobe);
^ permalink raw reply related
* [PATCH v5 0/3] tracing/fprobe: Fix fprobe_ip_table related bugs
From: Masami Hiramatsu (Google) @ 2026-04-13 8:39 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
Here is the 5th version of the patch series to fix bugs in fprobe.
The previous version is here:
https://lore.kernel.org/all/177584108931.388483.11311214679686745474.stgit@devnote2/
This version forcibly removes the fprobe_hash_node when fprobe
registration fails [1/3] and skips updating ftrace_ops when memory
allocation fails during module unloading [2/3].
Thanks,
---
Masami Hiramatsu (Google) (3):
tracing/fprobe: Remove fprobe from hash in failure path
tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
tracing/fprobe: Check the same type fprobe on table as the unregistered one
kernel/trace/fprobe.c | 251 +++++++++++++++++++++++++++++--------------------
1 file changed, 147 insertions(+), 104 deletions(-)
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [RFC PATCH 1/4] rv/tlob: Add tlob model DOT file
From: Gabriele Monaco @ 2026-04-13 8:19 UTC (permalink / raw)
To: wen.yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-trace-kernel, linux-kernel
In-Reply-To: <64122474633aa17d872a7dc6233d7794e80f2784.1776020428.git.wen.yang@linux.dev>
On Mon, 2026-04-13 at 03:27 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> Add the Graphviz DOT specification for the tlob (task latency over
> budget) deterministic automaton.
>
> The model has three states: unmonitored, on_cpu, and off_cpu.
> trace_start transitions from unmonitored to on_cpu; switch_out and
> switch_in cycle between on_cpu and off_cpu; trace_stop and
> budget_expired return to unmonitored from either active state.
> unmonitored is the sole accepting state.
>
> switch_in, switch_out, and sched_wakeup self-loop in unmonitored;
> sched_wakeup self-loops in on_cpu; switch_out and sched_wakeup
> self-loop in off_cpu.
>
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
Interesting monitor! Thanks.
I'm going to go through it in more detail later, but let me share some initial
comments.
> MAINTAINERS | 3 +++
> tools/verification/models/tlob.dot | 25 +++++++++++++++++++++++++
> 2 files changed, 28 insertions(+)
> create mode 100644 tools/verification/models/tlob.dot
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9fbb619c6..c2c56236c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -23242,7 +23242,10 @@ S: Maintained
> F: Documentation/trace/rv/
> F: include/linux/rv.h
> F: include/rv/
> +F: include/uapi/linux/rv.h
> F: kernel/trace/rv/
> +F: samples/rv/
> +F: tools/testing/selftests/rv/
> F: tools/testing/selftests/verification/
> F: tools/verification/
This change doesn't belong here: the patch itself is not adding those files, so
you should probably move it to a later patch.
>
> diff --git a/tools/verification/models/tlob.dot b/tools/verification/models/tlob.dot
> new file mode 100644
> index 000000000..df34a14b8
> --- /dev/null
> +++ b/tools/verification/models/tlob.dot
> @@ -0,0 +1,25 @@
> +digraph state_automaton {
> + center = true;
> + size = "7,11";
> + {node [shape = plaintext, style=invis, label=""] "__init_unmonitored"};
> + {node [shape = ellipse] "unmonitored"};
> + {node [shape = plaintext] "unmonitored"};
> + {node [shape = plaintext] "on_cpu"};
> + {node [shape = plaintext] "off_cpu"};
> + "__init_unmonitored" -> "unmonitored";
> + "unmonitored" [label = "unmonitored", color = green3];
> + "unmonitored" -> "on_cpu" [ label = "trace_start" ];
> + "unmonitored" -> "unmonitored" [ label =
> "switch_in\nswitch_out\nsched_wakeup" ];
> + "on_cpu" [label = "on_cpu"];
> + "on_cpu" -> "off_cpu" [ label = "switch_out" ];
> + "on_cpu" -> "unmonitored" [ label = "trace_stop\nbudget_expired" ];
> + "on_cpu" -> "on_cpu" [ label = "sched_wakeup" ];
> + "off_cpu" [label = "off_cpu"];
> + "off_cpu" -> "on_cpu" [ label = "switch_in" ];
> + "off_cpu" -> "unmonitored" [ label = "trace_stop\nbudget_expired" ];
> + "off_cpu" -> "off_cpu" [ label = "switch_out\nsched_wakeup" ];
> + { rank = min ;
> + "__init_unmonitored";
> + "unmonitored";
> + }
> +}
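For reference while reviewing, the transitions listed in the commit message
and in the DOT file above can be written as a plain lookup table. This is only
a userspace C sketch, not the header the RV tooling would generate from the
model; the enum names, TLOB_INVALID and tlob_step() are made up for
illustration.

```c
#include <assert.h>

/* States and events named after the DOT model; identifiers are hypothetical. */
enum tlob_state { TLOB_UNMONITORED, TLOB_ON_CPU, TLOB_OFF_CPU, TLOB_STATE_MAX };
enum tlob_event_id {
	EV_TRACE_START, EV_TRACE_STOP, EV_BUDGET_EXPIRED,
	EV_SWITCH_IN, EV_SWITCH_OUT, EV_SCHED_WAKEUP, EV_MAX
};

#define TLOB_INVALID (-1)	/* no edge for this (state, event) pair */

/* One row per state, one column per event, as drawn in tlob.dot. */
static const int tlob_next[TLOB_STATE_MAX][EV_MAX] = {
	[TLOB_UNMONITORED] = {
		[EV_TRACE_START]    = TLOB_ON_CPU,
		[EV_TRACE_STOP]     = TLOB_INVALID,
		[EV_BUDGET_EXPIRED] = TLOB_INVALID,
		[EV_SWITCH_IN]      = TLOB_UNMONITORED,
		[EV_SWITCH_OUT]     = TLOB_UNMONITORED,
		[EV_SCHED_WAKEUP]   = TLOB_UNMONITORED,
	},
	[TLOB_ON_CPU] = {
		[EV_TRACE_START]    = TLOB_INVALID,
		[EV_TRACE_STOP]     = TLOB_UNMONITORED,
		[EV_BUDGET_EXPIRED] = TLOB_UNMONITORED,
		[EV_SWITCH_IN]      = TLOB_INVALID,
		[EV_SWITCH_OUT]     = TLOB_OFF_CPU,
		[EV_SCHED_WAKEUP]   = TLOB_ON_CPU,
	},
	[TLOB_OFF_CPU] = {
		[EV_TRACE_START]    = TLOB_INVALID,
		[EV_TRACE_STOP]     = TLOB_UNMONITORED,
		[EV_BUDGET_EXPIRED] = TLOB_UNMONITORED,
		[EV_SWITCH_IN]      = TLOB_ON_CPU,
		[EV_SWITCH_OUT]     = TLOB_OFF_CPU,
		[EV_SCHED_WAKEUP]   = TLOB_OFF_CPU,
	},
};

/* Return the next state, or TLOB_INVALID if the model has no such edge. */
int tlob_step(int state, int event)
{
	assert(state >= 0 && state < TLOB_STATE_MAX);
	assert(event >= 0 && event < EV_MAX);
	return tlob_next[state][event];
}
```

For example, tlob_step(TLOB_ON_CPU, EV_SWITCH_IN) returns TLOB_INVALID, since
the model has no switch_in edge leaving on_cpu.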
^ permalink raw reply
* Re: [RFC PATCH 2/4] rv/tlob: Add tlob deterministic automaton monitor
From: Gabriele Monaco @ 2026-04-13 8:19 UTC (permalink / raw)
To: wen.yang
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-trace-kernel, linux-kernel
In-Reply-To: <ccf53a89b3b65e5728403dc097fde94ac2591d98.1776020428.git.wen.yang@linux.dev>
On Mon, 2026-04-13 at 03:27 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> Add the tlob (task latency over budget) RV monitor. tlob tracks the
> monotonic elapsed time (CLOCK_MONOTONIC) of a marked per-task code
> path, including time off-CPU, and fires a per-task hrtimer when the
> elapsed time exceeds a configurable budget.
>
> Three-state DA (unmonitored/on_cpu/off_cpu) driven by trace_start,
> switch_in/out, and budget_expired events. Per-task state lives in a
> fixed-size hash table (TLOB_MAX_MONITORED slots) with RCU-deferred
> free.
>
> Two userspace interfaces:
> - tracefs: uprobe pair registration via the monitor file using the
> format "pid:threshold_us:offset_start:offset_stop:binary_path"
> - /dev/rv ioctls (CONFIG_RV_CHARDEV): TLOB_IOCTL_TRACE_START /
> TRACE_STOP; TRACE_STOP returns -EOVERFLOW on violation
>
> Each /dev/rv fd has a per-fd mmap ring buffer (physically contiguous
> pages). A control page (struct tlob_mmap_page) at offset 0 exposes
> head/tail/dropped for lockless userspace reads; struct tlob_event
> records follow at data_offset. Drop-new policy on overflow.
>
> UAPI: include/uapi/linux/rv.h (tlob_start_args, tlob_event,
> tlob_mmap_page, ioctl numbers), monitor_tlob.rst,
> ioctl-number.rst (RV_IOC_MAGIC=0xB9).
>
I'm not fully grasping all the requirements for the monitors yet, but I see you
are reimplementing a lot of functionality in the monitor itself rather than
within RV. Let's see if we can consolidate some of it:
* You're using timer expirations; can we do it with timed automata? [1]
* RV automata usually don't have an /unmonitored/ state. Your trace_start event
  would be the start condition (da_event_start) and the monitor would stop
  running at each violation (it calls da_monitor_reset() automatically), so all
  setup/cleanup logic should be handled implicitly within RV. I believe that
  would also save you that ugly trace_event_tlob() redefinition.
* You're maintaining a local hash table for each task_struct. That could use
  the per-object monitors [2], where your "object" is in fact your struct,
  allocated when you start the monitor with all appropriate fields and indexed
  by pid.
* You are handling violations manually. Considering that timed automata trigger
  a full-fledged violation on timeouts, can you use the RV way (error
  tracepoints or reactors only)? Do you need the additional reporting within
  the tracepoint/ioctl? Can't the userspace consumer infer all of that from
  other events and let RV do just the monitoring?
* I like the uprobe thing; we could probably move all that to a common helper
  once we figure out how to make it generic.
Note: [1] and [2] haven't reached upstream yet, but should reach linux-next soon.
Thanks,
Gabriele
[1] -
https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=rv/for-next&id=f5587d1b6ec938afb2f74fe399a68020d66923e4
[2] -
https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=rv/for-next&id=da282bf7fadb095ee0a40c32ff0126429c769b45
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
> Documentation/trace/rv/index.rst | 1 +
> Documentation/trace/rv/monitor_tlob.rst | 381 +++++++
> .../userspace-api/ioctl/ioctl-number.rst | 1 +
> include/uapi/linux/rv.h | 181 ++++
> kernel/trace/rv/Kconfig | 17 +
> kernel/trace/rv/Makefile | 2 +
> kernel/trace/rv/monitors/tlob/Kconfig | 51 +
> kernel/trace/rv/monitors/tlob/tlob.c | 986 ++++++++++++++++++
> kernel/trace/rv/monitors/tlob/tlob.h | 145 +++
> kernel/trace/rv/monitors/tlob/tlob_trace.h | 42 +
> kernel/trace/rv/rv.c | 4 +
> kernel/trace/rv/rv_dev.c | 602 +++++++++++
> kernel/trace/rv/rv_trace.h | 50 +
> 13 files changed, 2463 insertions(+)
> create mode 100644 Documentation/trace/rv/monitor_tlob.rst
> create mode 100644 include/uapi/linux/rv.h
> create mode 100644 kernel/trace/rv/monitors/tlob/Kconfig
> create mode 100644 kernel/trace/rv/monitors/tlob/tlob.c
> create mode 100644 kernel/trace/rv/monitors/tlob/tlob.h
> create mode 100644 kernel/trace/rv/monitors/tlob/tlob_trace.h
> create mode 100644 kernel/trace/rv/rv_dev.c
>
> diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
> index a2812ac5c..4f2bfaf38 100644
> --- a/Documentation/trace/rv/index.rst
> +++ b/Documentation/trace/rv/index.rst
> @@ -15,3 +15,4 @@ Runtime Verification
> monitor_wwnr.rst
> monitor_sched.rst
> monitor_rtapp.rst
> + monitor_tlob.rst
> diff --git a/Documentation/trace/rv/monitor_tlob.rst b/Documentation/trace/rv/monitor_tlob.rst
> new file mode 100644
> index 000000000..d498e9894
> --- /dev/null
> +++ b/Documentation/trace/rv/monitor_tlob.rst
> @@ -0,0 +1,381 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +Monitor tlob
> +============
> +
> +- Name: tlob - task latency over budget
> +- Type: per-task deterministic automaton
> +- Author: Wen Yang <wen.yang@linux.dev>
> +
> +Description
> +-----------
> +
> +The tlob monitor tracks per-task elapsed time (CLOCK_MONOTONIC, including
> +both on-CPU and off-CPU time) and reports a violation when the monitored
> +task exceeds a configurable latency budget threshold.
> +
> +The monitor implements a three-state deterministic automaton::
> +
> + |
> + | (initial)
> + v
> + +--------------+
> + +-------> | unmonitored |
> + | +--------------+
> + | |
> + | trace_start
> + | v
> + | +--------------+
> + | | on_cpu |
> + | +--------------+
> + | | |
> + | switch_out| | trace_stop / budget_expired
> + | v v
> + | +--------------+ (unmonitored)
> + | | off_cpu |
> + | +--------------+
> + | | |
> + | | switch_in| trace_stop / budget_expired
> + | v v
> + | (on_cpu) (unmonitored)
> + |
> + +-- trace_stop (from on_cpu or off_cpu)
> +
> + Key transitions:
> + unmonitored --(trace_start)--> on_cpu
> + on_cpu --(switch_out)--> off_cpu
> + off_cpu --(switch_in)--> on_cpu
> + on_cpu --(trace_stop)--> unmonitored
> + off_cpu --(trace_stop)--> unmonitored
> + on_cpu --(budget_expired)-> unmonitored [violation]
> + off_cpu --(budget_expired)-> unmonitored [violation]
> +
> + sched_wakeup self-loops in on_cpu and unmonitored; switch_out and
> + sched_wakeup self-loop in off_cpu. budget_expired is fired by the
> + one-shot hrtimer; it always transitions to unmonitored regardless of
> + whether the task is on-CPU or off-CPU when the timer fires.
> +
> +State Descriptions
> +------------------
> +
> +- **unmonitored**: Task is not being traced. Scheduling events
> + (``switch_in``, ``switch_out``, ``sched_wakeup``) are silently
> + ignored (self-loop). The monitor waits for a ``trace_start`` event
> + to begin a new observation window.
> +
> +- **on_cpu**: Task is running on the CPU with the deadline timer armed.
> + A one-shot hrtimer was set for ``threshold_us`` microseconds at
> + ``trace_start`` time. A ``switch_out`` event transitions to
> + ``off_cpu``; the hrtimer keeps running (off-CPU time counts toward
> + the budget). A ``trace_stop`` cancels the timer and returns to
> + ``unmonitored`` (normal completion). If the hrtimer fires
> + (``budget_expired``) the violation is recorded and the automaton
> + transitions to ``unmonitored``.
> +
> +- **off_cpu**: Task was preempted or blocked. The one-shot hrtimer
> + continues to run. A ``switch_in`` event returns to ``on_cpu``.
> + A ``trace_stop`` cancels the timer and returns to ``unmonitored``.
> + If the hrtimer fires (``budget_expired``) while the task is off-CPU,
> + the violation is recorded and the automaton transitions to
> + ``unmonitored``.
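To check my reading of the state descriptions, here is a rough userspace C
sketch of how the on-CPU / off-CPU breakdown could be accumulated from the
switch events. It is an assumption about the bookkeeping, not the code in
tlob.c; struct tlob_acct and the acct_*() helpers are hypothetical names, and
timestamps are plain microsecond counters supplied by the caller.

```c
#include <assert.h>
#include <stdint.h>

enum acct_state { ACCT_ON_CPU, ACCT_OFF_CPU };

/* Hypothetical per-task accounting for one monitoring window. */
struct tlob_acct {
	enum acct_state state;
	uint64_t last_us;	/* timestamp of the last state change */
	uint64_t on_cpu_us;
	uint64_t off_cpu_us;
	uint32_t switches;	/* number of times the task was scheduled out */
};

/* trace_start: the window opens with the task running. */
void acct_start(struct tlob_acct *a, uint64_t now_us)
{
	a->state = ACCT_ON_CPU;
	a->last_us = now_us;
	a->on_cpu_us = 0;
	a->off_cpu_us = 0;
	a->switches = 0;
}

/* Charge the time since the last change to the current state. */
static void acct_charge(struct tlob_acct *a, uint64_t now_us)
{
	assert(now_us >= a->last_us);
	if (a->state == ACCT_ON_CPU)
		a->on_cpu_us += now_us - a->last_us;
	else
		a->off_cpu_us += now_us - a->last_us;
	a->last_us = now_us;
}

/* switch_out: on_cpu -> off_cpu; a self-loop (no-op) when already off_cpu. */
void acct_switch_out(struct tlob_acct *a, uint64_t now_us)
{
	if (a->state != ACCT_ON_CPU)
		return;
	acct_charge(a, now_us);
	a->state = ACCT_OFF_CPU;
	a->switches++;
}

/* switch_in: off_cpu -> on_cpu; ignored in any other state. */
void acct_switch_in(struct tlob_acct *a, uint64_t now_us)
{
	if (a->state != ACCT_OFF_CPU)
		return;
	acct_charge(a, now_us);
	a->state = ACCT_ON_CPU;
}

/* trace_stop or budget_expired: close the window with a final charge. */
void acct_stop(struct tlob_acct *a, uint64_t now_us)
{
	acct_charge(a, now_us);
}
```

With this bookkeeping the two counters always add up to the elapsed time of the
window, which is the quantity the hrtimer compares against threshold_us.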
> +
> +Rationale
> +---------
> +
> +The per-task latency budget threshold allows operators to express timing
> +requirements in microseconds and receive an immediate ftrace event when a
> +task exceeds its budget. This is useful for real-time tasks
> +(``SCHED_FIFO`` / ``SCHED_DEADLINE``) where total elapsed time must
> +remain within a known bound.
> +
> +Each task has an independent threshold, so up to ``TLOB_MAX_MONITORED``
> +(64) tasks with different timing requirements can be monitored
> +simultaneously.
> +
> +On threshold violation the automaton records a ``tlob_budget_exceeded``
> +ftrace event carrying the final on-CPU / off-CPU time breakdown, but does
> +not kill or throttle the task. Monitoring can be restarted by issuing a
> +new ``trace_start`` event (or a new ``TLOB_IOCTL_TRACE_START`` ioctl).
> +
> +A per-task one-shot hrtimer is armed at ``trace_start`` for exactly
> +``threshold_us`` microseconds. It fires at most once per monitoring
> +window, performs an O(1) hash lookup, records the violation, and injects
> +the ``budget_expired`` event into the DA. When ``CONFIG_RV_MON_TLOB``
> +is not set there is zero runtime cost.
> +
> +Usage
> +-----
> +
> +tracefs interface (uprobe-based external monitoring)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The ``monitor`` tracefs file allows any privileged user to instrument an
> +unmodified binary via uprobes, without changing its source code. Write a
> +four-field record to attach two plain entry uprobes: one at
> +``offset_start`` fires ``tlob_start_task()`` and one at ``offset_stop``
> +fires ``tlob_stop_task()``, so the latency budget covers exactly the code
> +region between the two offsets::
> +
> + threshold_us:offset_start:offset_stop:binary_path
> +
> +``binary_path`` comes last so it may freely contain ``:`` (e.g. paths
> +inside a container namespace).
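A note on why keeping binary_path last makes the record easy to parse: the
three numeric fields can be consumed left to right, and whatever remains is the
path, colons included. A minimal userspace sketch under that assumption
follows; struct tlob_binding and tlob_parse_binding() are hypothetical names,
not the parser from this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical parsed form of one "monitor" record. */
struct tlob_binding {
	unsigned long long threshold_us;
	unsigned long long offset_start;
	unsigned long long offset_stop;
	const char *binary_path;	/* points into the input string */
};

/* Parse one unsigned field that must be followed by ':'; skip the colon. */
static int parse_field(const char **p, unsigned long long *out)
{
	char *end;

	/* Reject signs and whitespace that strtoull() would accept. */
	if (**p < '0' || **p > '9')
		return -EINVAL;
	errno = 0;
	*out = strtoull(*p, &end, 0);	/* base 0: decimal or 0x-prefixed hex */
	if (errno || *end != ':')
		return -EINVAL;
	*p = end + 1;
	return 0;
}

/* Parse "threshold_us:offset_start:offset_stop:binary_path". */
int tlob_parse_binding(const char *s, struct tlob_binding *b)
{
	assert(s != NULL && b != NULL);
	if (parse_field(&s, &b->threshold_us) ||
	    parse_field(&s, &b->offset_start) ||
	    parse_field(&s, &b->offset_stop))
		return -EINVAL;
	if (*s == '\0')		/* empty path */
		return -EINVAL;
	b->binary_path = s;	/* the remainder may contain ':' */
	return 0;
}
```

Parsing "1:2:3:/a:b/c" with this sketch yields the path "/a:b/c", while a
record with only two numeric fields is rejected with -EINVAL.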
> +
> +The uprobes fire for every task that executes the probed instruction in
> +the binary, consistent with the native uprobe semantics. All tasks that
> +execute the code region get independent per-task monitoring slots.
> +
> +Using two plain entry uprobes (rather than a uretprobe for the stop) means
> +that a mistyped offset can never corrupt the call stack; the worst outcome
> +of a bad ``offset_stop`` is a missed stop that causes the hrtimer to fire
> +and report a budget violation.
> +
> +Example -- monitor a code region in ``/usr/bin/myapp`` with a 5 ms
> +budget, where the region starts at offset 0x12a0 and ends at 0x12f0::
> +
> + echo 1 > /sys/kernel/tracing/rv/monitors/tlob/enable
> +
> + # Bind uprobes: start probe starts the clock, stop probe stops it
> + echo "5000:0x12a0:0x12f0:/usr/bin/myapp" \
> + > /sys/kernel/tracing/rv/monitors/tlob/monitor
> +
> + # Remove the uprobe binding for this code region
> + echo "-0x12a0:/usr/bin/myapp" > /sys/kernel/tracing/rv/monitors/tlob/monitor
> +
> + # List registered uprobe bindings (mirrors the write format)
> + cat /sys/kernel/tracing/rv/monitors/tlob/monitor
> + # -> 5000:0x12a0:0x12f0:/usr/bin/myapp
> +
> + # Read violations from the trace buffer
> + cat /sys/kernel/tracing/trace
> +
> +Up to ``TLOB_MAX_MONITORED`` tasks may be monitored simultaneously.
> +
> +The offsets can be obtained with ``nm`` or ``readelf``::
> +
> + nm -n /usr/bin/myapp | grep my_function
> + # -> 0000000000012a0 T my_function
> +
> + readelf -s /usr/bin/myapp | grep my_function
> + # -> 42: 0000000000012a0 336 FUNC GLOBAL DEFAULT 13 my_function
> +
> + # offset_start = 0x12a0 (function entry)
> + # offset_stop = 0x12a0 + 0x50 = 0x12f0 (or any instruction before return)
> +
> +Notes:
> +
> +- The uprobes fire for every task that executes the probed instruction,
> + so concurrent calls from different threads each get independent
> + monitoring slots.
> +- ``offset_stop`` need not be a function return; it can be any instruction
> + within the region. If the stop probe is never reached (e.g. early exit
> + path bypasses it), the hrtimer fires and a budget violation is reported.
> +- Each ``(binary_path, offset_start)`` pair may only be registered once.
> + A second write with the same ``offset_start`` for the same binary is
> + rejected with ``-EEXIST``. Two entry uprobes at the same address would
> + both fire for every task, causing ``tlob_start_task()`` to be called
> + twice; the second call would silently fail with ``-EEXIST`` and the
> + second binding's threshold would never take effect. Different code
> + regions that share the same ``offset_stop`` (common exit point) are
> + explicitly allowed.
> +- The uprobe binding is removed when ``-offset_start:binary_path`` is
> + written to ``monitor``, or when the monitor is disabled.
> +- The ``tag`` field in every ``tlob_budget_exceeded`` event is
> + automatically set to ``offset_start`` for the tracefs path, so
> + violation events for different code regions are immediately
> + distinguishable even when ``threshold_us`` values are identical.
> +
> +ftrace ring buffer (budget violation events)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When a monitored task exceeds its latency budget the hrtimer fires,
> +records the violation, and emits a single ``tlob_budget_exceeded`` event
> +into the ftrace ring buffer. **Nothing is written to the ftrace ring
> +buffer while the task is within budget.**
> +
> +The event carries the on-CPU / off-CPU time breakdown so that root-cause
> +analysis (CPU-bound vs. scheduling / I/O overrun) is immediate::
> +
> + cat /sys/kernel/tracing/trace
> +
> +Example output::
> +
> + myapp-1234 [003] .... 12345.678: tlob_budget_exceeded: \
> + myapp[1234]: budget exceeded threshold=5000 \
> + on_cpu=820 off_cpu=4500 switches=3 state=off_cpu tag=0x00000000000012a0
> +
> +Field descriptions:
> +
> +``threshold``
> + Configured latency budget in microseconds.
> +
> +``on_cpu``
> + Cumulative on-CPU time since ``trace_start``, in microseconds.
> +
> +``off_cpu``
> + Cumulative off-CPU (scheduling + I/O wait) time since ``trace_start``,
> + in microseconds.
> +
> +``switches``
> + Number of times the task was scheduled out during this window.
> +
> +``state``
> + DA state when the hrtimer fired: ``on_cpu`` means the task was executing
> + when the budget expired (CPU-bound overrun); ``off_cpu`` means the task
> + was preempted or blocked (scheduling / I/O overrun).
> +
> +``tag``
> + Opaque 64-bit cookie supplied by the caller via ``tlob_start_args.tag``
> + (ioctl path) or automatically set to ``offset_start`` (tracefs uprobe
> + path). Use it to distinguish violations from different code regions
> + monitored by the same thread. Zero when not set.
> +
> +To capture violations in a file::
> +
> + trace-cmd record -e tlob_budget_exceeded &
> + # ... run workload ...
> + trace-cmd report
> +
> +/dev/rv ioctl interface (self-instrumentation)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Tasks can self-instrument their own code paths via the ``/dev/rv`` misc
> +device (requires ``CONFIG_RV_CHARDEV``). The kernel key is
> +``task_struct``; multiple threads sharing a single fd each get their own
> +independent monitoring slot.
> +
> +**Synchronous mode** -- the calling thread checks its own result::
> +
> + int fd = open("/dev/rv", O_RDWR);
> +
> + struct tlob_start_args args = {
> + .threshold_us = 50000, /* 50 ms */
> + .tag = 0, /* optional; 0 = don't care */
> + .notify_fd = -1, /* no fd notification */
> + };
> + ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
> +
> + /* ... code path under observation ... */
> +
> + int ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
> + /* ret == 0: within budget */
> + /* ret == -EOVERFLOW: budget exceeded */
> +
> + close(fd);
> +
> +**Asynchronous mode** -- a dedicated monitor thread receives violation
> +records via ``read()`` on a shared fd, decoupling the observation from
> +the critical path::
> +
> + /* Monitor thread: open a dedicated fd. */
> + int monitor_fd = open("/dev/rv", O_RDWR);
> +
> + /* Worker thread: set notify_fd = monitor_fd in TRACE_START args. */
> + int work_fd = open("/dev/rv", O_RDWR);
> + struct tlob_start_args args = {
> + .threshold_us = 10000, /* 10 ms */
> + .tag = REGION_A,
> + .notify_fd = monitor_fd,
> + };
> + ioctl(work_fd, TLOB_IOCTL_TRACE_START, &args);
> + /* ... critical section ... */
> + ioctl(work_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +
> + /* Monitor thread: blocking read() returns one or more tlob_event records. */
> + struct tlob_event ntfs[8];
> + ssize_t n = read(monitor_fd, ntfs, sizeof(ntfs));
> + for (int i = 0; i < n / sizeof(struct tlob_event); i++) {
> + struct tlob_event *ntf = &ntfs[i];
> + printf("tid=%u tag=0x%llx exceeded budget=%llu us "
> + "(on_cpu=%llu off_cpu=%llu switches=%u state=%s)\n",
> + ntf->tid, ntf->tag, ntf->threshold_us,
> + ntf->on_cpu_us, ntf->off_cpu_us, ntf->switches,
> + ntf->state ? "on_cpu" : "off_cpu");
> + }
> +
> +**mmap ring buffer** -- zero-copy consumption of violation events::
> +
> + int fd = open("/dev/rv", O_RDWR);
> + struct tlob_start_args args = {
> + .threshold_us = 1000, /* 1 ms */
> + .notify_fd = fd, /* push violations to own ring buffer */
> + };
> + ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
> +
> + /* Map the ring: one control page + capacity data records. */
> + size_t pagesize = sysconf(_SC_PAGESIZE);
> + size_t cap = 64; /* read from page->capacity after mmap */
> + size_t len = pagesize + cap * sizeof(struct tlob_event);
> + void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> +
> + struct tlob_mmap_page *page = map;
> + struct tlob_event *data =
> + (struct tlob_event *)((char *)map + page->data_offset);
> +
> + /* Consumer loop: poll for events, read without copying. */
> + while (1) {
> + poll(&(struct pollfd){fd, POLLIN, 0}, 1, -1);
> +
> + uint32_t head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
> + uint32_t tail = page->data_tail;
> + while (tail != head) {
> + handle(&data[tail & (page->capacity - 1)]);
> + tail++;
> + }
> + __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
> + }
> +
> +Note: ``read()`` and ``mmap()`` share the same ring and ``data_tail``
> +cursor. Do not use both simultaneously on the same fd.
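The index arithmetic behind this ring is worth spelling out, since the header
comment in this patch relies on free-running u32 counters and on a
page-rounded mmap length. The helpers below are a userspace sketch of those
rules as I read them from the comment; the ring_*() names are hypothetical and
nothing here is taken from rv_dev.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Records written but not yet consumed. head and tail are free-running
 * u32 counters, so the unsigned subtraction stays correct across wrap.
 */
uint32_t ring_used(uint32_t head, uint32_t tail)
{
	return head - tail;
}

/* Full when the producer is exactly one lap ahead of the consumer. */
int ring_full(uint32_t head, uint32_t tail, uint32_t capacity)
{
	return ring_used(head, tail) == capacity;
}

/* Array slot for a counter value; capacity must be a power of 2. */
uint32_t ring_slot(uint32_t index, uint32_t capacity)
{
	assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
	return index & (capacity - 1);
}

/* mmap length: one control page plus the records, rounded up to a page. */
size_t ring_mmap_len(size_t page_size, size_t capacity, size_t record_size)
{
	size_t raw = page_size + capacity * record_size;

	return (raw + page_size - 1) & ~(page_size - 1);
}
```

As a worked case, with 4096-byte pages, 64 records and 48-byte records (the sum
of the struct tlob_event fields in this patch), ring_mmap_len() gives 8192,
while the unrounded sum is 7168.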
> +
> +``tlob_event`` fields:
> +
> +``tid``
> + Thread ID (``task_pid_vnr``) of the violating task.
> +
> +``threshold_us``
> + Budget that was exceeded, in microseconds.
> +
> +``on_cpu_us``
> + Cumulative on-CPU time at violation time, in microseconds.
> +
> +``off_cpu_us``
> + Cumulative off-CPU time at violation time, in microseconds.
> +
> +``switches``
> + Number of context switches since ``TRACE_START``.
> +
> +``state``
> + 1 = timer fired while task was on-CPU; 0 = timer fired while off-CPU.
> +
> +``tag``
> + Cookie from ``tlob_start_args.tag``; for the tracefs uprobe path this
> + equals ``offset_start``. Zero when not set.
> +
> +tracefs files
> +-------------
> +
> +The following files are created under
> +``/sys/kernel/tracing/rv/monitors/tlob/``:
> +
> +``enable`` (rw)
> + Write ``1`` to enable the monitor; write ``0`` to disable it and
> + stop all currently monitored tasks.
> +
> +``desc`` (ro)
> + Human-readable description of the monitor.
> +
> +``monitor`` (rw)
> + Write ``threshold_us:offset_start:offset_stop:binary_path`` to bind two
> + plain entry uprobes in *binary_path*. The uprobe at *offset_start* fires
> + ``tlob_start_task()``; the uprobe at *offset_stop* fires
> + ``tlob_stop_task()``. Returns ``-EEXIST`` if a binding with the same
> + *offset_start* already exists for *binary_path*. Write
> + ``-offset_start:binary_path`` to remove the binding. Read to list
> + registered bindings, one
> + ``threshold_us:0xoffset_start:0xoffset_stop:binary_path`` entry per line.
> +
> +Specification
> +-------------
> +
> +Graphviz DOT file in tools/verification/models/tlob.dot
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 331223761..8d3af68db 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -385,6 +385,7 @@ Code  Seq#    Include File                 Comments
>  0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h   Marvell CN10K DPI driver
>  0xB8  all    uapi/linux/mshv.h            Microsoft Hyper-V /dev/mshv driver
>                                            <mailto:linux-hyperv@vger.kernel.org>
> +0xB9  00-3F  linux/rv.h                   Runtime Verification (RV) monitors
>  0xBA  00-0F  uapi/linux/liveupdate.h      Pasha Tatashin
>                                            <mailto:pasha.tatashin@soleen.com>
>  0xC0  00-0F  linux/usb/iowarrior.h
> diff --git a/include/uapi/linux/rv.h b/include/uapi/linux/rv.h
> new file mode 100644
> index 000000000..d1b96d8cd
> --- /dev/null
> +++ b/include/uapi/linux/rv.h
> @@ -0,0 +1,181 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * UAPI definitions for Runtime Verification (RV) monitors.
> + *
> + * All RV monitors that expose an ioctl self-instrumentation interface
> + * share the magic byte RV_IOC_MAGIC (0xB9), registered in
> + * Documentation/userspace-api/ioctl/ioctl-number.rst.
> + *
> + * A single /dev/rv misc device serves as the entry point. ioctl numbers
> + * encode both the monitor identity and the operation:
> + *
> + * 0x01 - 0x1F tlob (task latency over budget)
> + * 0x20 - 0x3F reserved for future RV monitors
> + *
> + * Usage examples and design rationale are in:
> + * Documentation/trace/rv/monitor_tlob.rst
> + */
> +
> +#ifndef _UAPI_LINUX_RV_H
> +#define _UAPI_LINUX_RV_H
> +
> +#include <linux/ioctl.h>
> +#include <linux/types.h>
> +
> +/* Magic byte shared by all RV monitor ioctls. */
> +#define RV_IOC_MAGIC 0xB9
> +
> +/* -----------------------------------------------------------------------
> + * tlob: task latency over budget monitor (nr 0x01 - 0x1F)
> + * -----------------------------------------------------------------------
> + */
> +
> +/**
> + * struct tlob_start_args - arguments for TLOB_IOCTL_TRACE_START
> + * @threshold_us: Latency budget for this critical section, in microseconds.
> + * Must be greater than zero.
> + * @tag: Opaque 64-bit cookie supplied by the caller. Echoed back
> + * verbatim in the tlob_budget_exceeded ftrace event and in any
> + * tlob_event record delivered via @notify_fd. Use it to identify
> + * which code region triggered a violation when the same thread
> + * monitors multiple regions sequentially. Set to 0 if not
> + * needed.
> + * @notify_fd: File descriptor that will receive a tlob_event record on
> + * violation. Must refer to an open /dev/rv fd. May equal
> + * the calling fd (self-notification, useful for retrieving the
> + * on_cpu_us / off_cpu_us breakdown after TRACE_STOP returns
> + * -EOVERFLOW). Set to -1 to disable fd notification; in that
> + * case violations are only signalled via the TRACE_STOP return
> + * value and the tlob_budget_exceeded ftrace event.
> + * @flags: Must be 0. Reserved for future extensions.
> + */
> +struct tlob_start_args {
> + __u64 threshold_us;
> + __u64 tag;
> + __s32 notify_fd;
> + __u32 flags;
> +};
> +
> +/**
> + * struct tlob_event - one budget-exceeded event
> + *
> + * Consumed by read() on the notify_fd registered at TLOB_IOCTL_TRACE_START.
> + * Each record describes a single budget exceedance for one task.
> + *
> + * @tid: Thread ID (task_pid_vnr) of the violating task.
> + * @threshold_us: Budget that was exceeded, in microseconds.
> + * @on_cpu_us: Cumulative on-CPU time at violation time, in microseconds.
> + * @off_cpu_us: Cumulative off-CPU (scheduling + I/O wait) time at
> + * violation time, in microseconds.
> + * @switches: Number of context switches since TRACE_START.
> + * @state: DA state at violation: 1 = on_cpu, 0 = off_cpu.
> + * @tag: Cookie from tlob_start_args.tag; for the tracefs uprobe path
> + * this is the offset_start value. Zero when not set.
> + */
> +struct tlob_event {
> + __u32 tid;
> + __u32 pad;
> + __u64 threshold_us;
> + __u64 on_cpu_us;
> + __u64 off_cpu_us;
> + __u32 switches;
> + __u32 state; /* 1 = on_cpu, 0 = off_cpu */
> + __u64 tag;
> +};
> +
> +/**
> + * struct tlob_mmap_page - control page for the mmap'd violation ring buffer
> + *
> + * Mapped at offset 0 of the mmap region returned by mmap(2) on a /dev/rv fd.
> + * The data array of struct tlob_event records begins at offset @data_offset
> + * (always one page from the mmap base; use this field rather than hard-coding
> + * PAGE_SIZE so the code remains correct across architectures).
> + *
> + * Ring layout:
> + *
> + * mmap base + 0 : struct tlob_mmap_page (one page)
> + * mmap base + data_offset : struct tlob_event[capacity]
> + *
> + * The mmap length determines the ring capacity. Compute it as:
> + *
> + * raw = sysconf(_SC_PAGESIZE) + capacity * sizeof(struct tlob_event)
> + * length = (raw + sysconf(_SC_PAGESIZE) - 1) & ~(sysconf(_SC_PAGESIZE) - 1)
> + *
> + * i.e. round the raw byte count up to the next page boundary before
> + * passing it to mmap(2). The kernel requires a page-aligned length.
> + * capacity must be a power of 2. Read @capacity after a successful
> + * mmap(2) for the actual value.
> + *
> + * Producer/consumer ordering contract:
> + *
> + * Kernel (producer):
> + * data[data_head & (capacity - 1)] = event;
> + * // pairs with load-acquire in userspace:
> + * smp_store_release(&page->data_head, data_head + 1);
> + *
> + * Userspace (consumer):
> + * // pairs with store-release in kernel:
> + * head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE);
> + * for (tail = page->data_tail; tail != head; tail++)
> + * handle(&data[tail & (capacity - 1)]);
> + * __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE);
> + *
> + * @data_head and @data_tail are monotonically increasing __u32 counters
> + * in units of records. Unsigned 32-bit wrap-around is handled correctly
> + * by modular arithmetic; the ring is full when
> + * (data_head - data_tail) == capacity.
> + *
> + * When the ring is full the kernel drops the incoming record and increments
> + * @dropped. The consumer should check @dropped periodically to detect loss.
> + *
> + * read() and mmap() share the same ring buffer. Do not use both
> + * simultaneously on the same fd.
> + *
> + * @data_head: Next write slot index. Updated by the kernel with
> + * store-release ordering. Read by userspace with load-acquire.
> + * @data_tail: Next read slot index. Updated by userspace. Read by the
> + * kernel to detect overflow.
> + * @capacity: Actual ring capacity in records (power of 2). Written once
> + * by the kernel at mmap time; read-only for userspace thereafter.
> + * @version: Ring buffer ABI version; currently 1.
> + * @data_offset: Byte offset from the mmap base to the data array.
> + * Always equal to sysconf(_SC_PAGESIZE) on the running kernel.
> + * @record_size: sizeof(struct tlob_event) as seen by the kernel. Verify
> + * this matches userspace's sizeof before indexing the array.
> + * @dropped: Number of events dropped because the ring was full.
> + * Monotonically increasing; read with __ATOMIC_RELAXED.
> + */
> +struct tlob_mmap_page {
> + __u32 data_head;
> + __u32 data_tail;
> + __u32 capacity;
> + __u32 version;
> + __u32 data_offset;
> + __u32 record_size;
> + __u64 dropped;
> +};
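
The length formula and the consumer side of the ordering contract described in the comment above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the patch's code: struct ring_ctl, ring_mmap_length() and ring_drain() are hypothetical names, and the records are modelled as plain uint64_t values instead of struct tlob_event so the example stays small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the control-page fields the consumer touches. */
struct ring_ctl {
	uint32_t data_head;	/* written by the producer (release) */
	uint32_t data_tail;	/* written by the consumer (release) */
	uint32_t capacity;	/* power of two */
};

/* Control page plus data array, rounded up to a whole number of pages. */
static size_t ring_mmap_length(size_t page_size, size_t capacity,
			       size_t record_size)
{
	size_t raw = page_size + capacity * record_size;

	return (raw + page_size - 1) & ~(page_size - 1);
}

/*
 * Copy every published record into out[] and advance data_tail.
 * out[] must have room for capacity entries. Returns the record count.
 */
static uint32_t ring_drain(struct ring_ctl *ctl, const uint64_t *data,
			   uint64_t *out)
{
	uint32_t head, tail, n = 0;

	assert(ctl->capacity && (ctl->capacity & (ctl->capacity - 1)) == 0);

	/* Acquire: record contents written before head are visible. */
	head = __atomic_load_n(&ctl->data_head, __ATOMIC_ACQUIRE);
	for (tail = ctl->data_tail; tail != head; tail++)
		out[n++] = data[tail & (ctl->capacity - 1)];

	/* Release: slots are handed back only after they were read. */
	__atomic_store_n(&ctl->data_tail, tail, __ATOMIC_RELEASE);
	return n;
}
```

Because head and tail are free-running 32-bit counters, the loop compares them for inequality rather than ordering, which keeps it correct when data_head has wrapped past zero and data_tail has not.
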
> +
> +/*
> + * TLOB_IOCTL_TRACE_START - begin monitoring the calling task.
> + *
> + * Arms a per-task hrtimer for threshold_us microseconds. If args.notify_fd
> + * is >= 0, a tlob_event record is pushed into that fd's ring buffer on
> + * violation in addition to the tlob_budget_exceeded ftrace event.
> + * args.notify_fd == -1 disables fd notification.
> + *
> + * Violation records are consumed by read() on the notify_fd (blocking or
> + * non-blocking depending on O_NONBLOCK). On violation, TLOB_IOCTL_TRACE_STOP
> + * also returns -EOVERFLOW regardless of whether notify_fd is set.
> + *
> + * args.flags must be 0.
> + */
> +#define TLOB_IOCTL_TRACE_START _IOW(RV_IOC_MAGIC, 0x01, struct tlob_start_args)
> +
> +/*
> + * TLOB_IOCTL_TRACE_STOP - end monitoring the calling task.
> + *
> + * Returns 0 if within budget, -EOVERFLOW if the budget was exceeded.
> + */
> +#define TLOB_IOCTL_TRACE_STOP _IO(RV_IOC_MAGIC, 0x02)
> +
> +#endif /* _UAPI_LINUX_RV_H */
> diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
> index 5b4be87ba..227573cda 100644
> --- a/kernel/trace/rv/Kconfig
> +++ b/kernel/trace/rv/Kconfig
> @@ -65,6 +65,7 @@ source "kernel/trace/rv/monitors/pagefault/Kconfig"
> source "kernel/trace/rv/monitors/sleep/Kconfig"
> # Add new rtapp monitors here
>
> +source "kernel/trace/rv/monitors/tlob/Kconfig"
> # Add new monitors here
>
> config RV_REACTORS
> @@ -93,3 +94,19 @@ config RV_REACT_PANIC
> help
> Enables the panic reactor. The panic reactor emits a printk()
> message if an exception is found and panic()s the system.
> +
> +config RV_CHARDEV
> + bool "RV ioctl interface via /dev/rv"
> + depends on RV
> + default n
> + help
> + Register a /dev/rv misc device that exposes an ioctl interface
> + for RV monitor self-instrumentation. All RV monitors share the
> + single device node; ioctl numbers encode the monitor identity.
> +
> + When enabled, user-space programs can open /dev/rv and use
> + monitor-specific ioctl commands to bracket code regions they
> + want the kernel RV subsystem to observe.
> +
> + Say Y here if you want to use the tlob self-instrumentation
> + ioctl interface; otherwise say N.
> diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
> index 750e4ad6f..cc3781a3b 100644
> --- a/kernel/trace/rv/Makefile
> +++ b/kernel/trace/rv/Makefile
> @@ -3,6 +3,7 @@
> ccflags-y += -I $(src) # needed for trace events
>
> obj-$(CONFIG_RV) += rv.o
> +obj-$(CONFIG_RV_CHARDEV) += rv_dev.o
> obj-$(CONFIG_RV_MON_WIP) += monitors/wip/wip.o
> obj-$(CONFIG_RV_MON_WWNR) += monitors/wwnr/wwnr.o
> obj-$(CONFIG_RV_MON_SCHED) += monitors/sched/sched.o
> @@ -17,6 +18,7 @@ obj-$(CONFIG_RV_MON_STS) += monitors/sts/sts.o
> obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
> obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
> obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
> +obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o
> # Add new monitors here
> obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
> obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
> diff --git a/kernel/trace/rv/monitors/tlob/Kconfig b/kernel/trace/rv/monitors/tlob/Kconfig
> new file mode 100644
> index 000000000..010237480
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/Kconfig
> @@ -0,0 +1,51 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +config RV_MON_TLOB
> + depends on RV
> + depends on UPROBES
> + select DA_MON_EVENTS_ID
> + bool "tlob monitor"
> + help
> + Enable the tlob (task latency over budget) monitor. This monitor
> + tracks the elapsed time (CLOCK_MONOTONIC) of a marked code path within a
> + task (including both on-CPU and off-CPU time) and reports a
> + violation when the elapsed time exceeds a configurable budget
> + threshold.
> +
> + The monitor implements a three-state deterministic automaton.
> + States: unmonitored, on_cpu, off_cpu.
> + Key transitions:
> + unmonitored --(trace_start)--> on_cpu
> + on_cpu --(switch_out)--> off_cpu
> + off_cpu --(switch_in)--> on_cpu
> + on_cpu --(trace_stop)--> unmonitored
> + off_cpu --(trace_stop)--> unmonitored
> + on_cpu --(budget_expired)--> unmonitored
> + off_cpu --(budget_expired)--> unmonitored
> +
> + External configuration is done via the tracefs "monitor" file:
> + echo pid:threshold_us:binary:offset_start:offset_stop > .../rv/monitors/tlob/monitor
> + echo -pid > .../rv/monitors/tlob/monitor (remove task)
> + cat .../rv/monitors/tlob/monitor (list tasks)
> +
> + The uprobe binding places two plain entry uprobes at offset_start and
> + offset_stop in the binary; these trigger tlob_start_task() and
> + tlob_stop_task() respectively. Using two entry uprobes (rather than a
> + uretprobe) means that a mistyped offset can never corrupt the call
> + stack; the worst outcome is a missed stop, which causes the hrtimer to
> + fire and report a budget violation.
> +
> + Violation events are delivered via a lock-free mmap ring buffer on
> + /dev/rv (enabled by CONFIG_RV_CHARDEV). The consumer mmap()s the
> + device, reads records from the data array using the head/tail indices
> + in the control page, and advances data_tail when done.
> +
> + For self-instrumentation, use TLOB_IOCTL_TRACE_START /
> + TLOB_IOCTL_TRACE_STOP via the /dev/rv misc device (enabled by
> + CONFIG_RV_CHARDEV).
> +
> + Up to TLOB_MAX_MONITORED tasks may be monitored simultaneously.
> +
> + For further information, see:
> + Documentation/trace/rv/monitor_tlob.rst
> +
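
The "Key transitions" list in the help text above can be read as a transition table. The following is a rough sketch covering only the seven transitions that the help text lists; every sk_* name is hypothetical, and the real monitor may handle further events (the C file below also feeds a sched_wakeup event into the automaton).

```c
#include <assert.h>

enum sk_state { SK_UNMONITORED, SK_ON_CPU, SK_OFF_CPU, SK_INVALID };
enum sk_event {
	SK_TRACE_START, SK_SWITCH_OUT, SK_SWITCH_IN,
	SK_TRACE_STOP, SK_BUDGET_EXPIRED, SK_EVENT_COUNT
};

/* Rows: current state. Columns: event, in enum sk_event order. */
static const enum sk_state sk_next[SK_INVALID][SK_EVENT_COUNT] = {
	[SK_UNMONITORED] = { SK_ON_CPU, SK_INVALID, SK_INVALID,
			     SK_INVALID, SK_INVALID },
	[SK_ON_CPU]	 = { SK_INVALID, SK_OFF_CPU, SK_INVALID,
			     SK_UNMONITORED, SK_UNMONITORED },
	[SK_OFF_CPU]	 = { SK_INVALID, SK_INVALID, SK_ON_CPU,
			     SK_UNMONITORED, SK_UNMONITORED },
};

/* Returns the successor state, or SK_INVALID for an unlisted transition. */
static enum sk_state sk_step(enum sk_state s, enum sk_event e)
{
	assert((unsigned int)s <= SK_INVALID);
	assert((unsigned int)e < SK_EVENT_COUNT);

	if (s == SK_INVALID)
		return SK_INVALID;
	return sk_next[s][e];
}
```

Read this way, both trace_stop and budget_expired lead back to unmonitored from either running state, so a monitored region always ends in the same place regardless of whether the budget held.
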
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
> new file mode 100644
> index 000000000..a6e474025
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
> @@ -0,0 +1,986 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob: task latency over budget monitor
> + *
> + * Track the elapsed wall-clock time of a marked code path and detect when
> + * a monitored task exceeds its per-task latency budget. CLOCK_MONOTONIC
> + * is used so both on-CPU and off-CPU time count toward the budget.
> + *
> + * Per-task state is maintained in a spinlock-protected hash table. A
> + * one-shot hrtimer fires at the deadline; if the task has not called
> + * trace_stop by then, a violation is recorded.
> + *
> + * Up to TLOB_MAX_MONITORED tasks may be tracked simultaneously.
> + *
> + * Copyright (C) 2026 Wen Yang <wen.yang@linux.dev>
> + */
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/ftrace.h>
> +#include <linux/hash.h>
> +#include <linux/hrtimer.h>
> +#include <linux/kernel.h>
> +#include <linux/ktime.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/namei.h>
> +#include <linux/poll.h>
> +#include <linux/rv.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/atomic.h>
> +#include <linux/rcupdate.h>
> +#include <linux/spinlock.h>
> +#include <linux/tracefs.h>
> +#include <linux/uaccess.h>
> +#include <linux/uprobes.h>
> +#include <kunit/visibility.h>
> +#include <rv/instrumentation.h>
> +
> +/* rv_interface_lock is defined in kernel/trace/rv/rv.c */
> +extern struct mutex rv_interface_lock;
> +
> +#define MODULE_NAME "tlob"
> +
> +#include <rv_trace.h>
> +#include <trace/events/sched.h>
> +
> +#define RV_MON_TYPE RV_MON_PER_TASK
> +#include "tlob.h"
> +#include <rv/da_monitor.h>
> +
> +/* Hash table size; must be a power of two. */
> +#define TLOB_HTABLE_BITS 6
> +#define TLOB_HTABLE_SIZE (1 << TLOB_HTABLE_BITS)
> +
> +/* Maximum binary path length for uprobe binding. */
> +#define TLOB_MAX_PATH 256
> +
> +/* Per-task latency monitoring state. */
> +struct tlob_task_state {
> + struct hlist_node hlist;
> + struct task_struct *task;
> + u64 threshold_us;
> + u64 tag;
> + struct hrtimer deadline_timer;
> + int canceled; /* protected by entry_lock */
> + struct file *notify_file; /* NULL or held reference */
> +
> + /*
> + * entry_lock serialises the mutable accounting fields below.
> + * Lock order: tlob_table_lock -> entry_lock (never reverse).
> + */
> + raw_spinlock_t entry_lock;
> + u64 on_cpu_us;
> + u64 off_cpu_us;
> + ktime_t last_ts;
> + u32 switches;
> + u8 da_state;
> +
> + struct rcu_head rcu; /* for call_rcu() teardown */
> +};
> +
> +/* Per-uprobe-binding state: a start + stop probe pair for one binary region. */
> +struct tlob_uprobe_binding {
> + struct list_head list;
> + u64 threshold_us;
> + struct path path;
> + char binpath[TLOB_MAX_PATH]; /* canonical path for read/remove */
> + loff_t offset_start;
> + loff_t offset_stop;
> + struct uprobe_consumer entry_uc;
> + struct uprobe_consumer stop_uc;
> + struct uprobe *entry_uprobe;
> + struct uprobe *stop_uprobe;
> +};
> +
> +/* Object pool for tlob_task_state. */
> +static struct kmem_cache *tlob_state_cache;
> +
> +/* Hash table and lock protecting table structure (insert/delete/canceled). */
> +static struct hlist_head tlob_htable[TLOB_HTABLE_SIZE];
> +static DEFINE_RAW_SPINLOCK(tlob_table_lock);
> +static atomic_t tlob_num_monitored = ATOMIC_INIT(0);
> +
> +/* Uprobe binding list; protected by tlob_uprobe_mutex. */
> +static LIST_HEAD(tlob_uprobe_list);
> +static DEFINE_MUTEX(tlob_uprobe_mutex);
> +
> +/* Forward declaration */
> +static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer);
> +
> +/* Hash table helpers */
> +
> +static unsigned int tlob_hash_task(const struct task_struct *task)
> +{
> + return hash_ptr((void *)task, TLOB_HTABLE_BITS);
> +}
> +
> +/*
> + * tlob_find_rcu - look up per-task state.
> + * Must be called under rcu_read_lock() or with tlob_table_lock held.
> + */
> +static struct tlob_task_state *tlob_find_rcu(struct task_struct *task)
> +{
> + struct tlob_task_state *ws;
> + unsigned int h = tlob_hash_task(task);
> +
> + hlist_for_each_entry_rcu(ws, &tlob_htable[h], hlist,
> + lockdep_is_held(&tlob_table_lock))
> + if (ws->task == task)
> + return ws;
> + return NULL;
> +}
> +
> +/* Allocate and initialise a new per-task state entry. */
> +static struct tlob_task_state *tlob_alloc(struct task_struct *task,
> + u64 threshold_us, u64 tag)
> +{
> + struct tlob_task_state *ws;
> +
> + ws = kmem_cache_zalloc(tlob_state_cache, GFP_ATOMIC);
> + if (!ws)
> + return NULL;
> +
> + ws->task = task;
> + get_task_struct(task);
> + ws->threshold_us = threshold_us;
> + ws->tag = tag;
> + ws->last_ts = ktime_get();
> + ws->da_state = on_cpu_tlob;
> + raw_spin_lock_init(&ws->entry_lock);
> + hrtimer_setup(&ws->deadline_timer, tlob_deadline_timer_fn,
> + CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> + return ws;
> +}
> +
> +/* RCU callback: free the slab once no readers remain. */
> +static void tlob_free_rcu_slab(struct rcu_head *head)
> +{
> + struct tlob_task_state *ws =
> + container_of(head, struct tlob_task_state, rcu);
> + kmem_cache_free(tlob_state_cache, ws);
> +}
> +
> +/* Arm the one-shot deadline timer for threshold_us microseconds. */
> +static void tlob_arm_deadline(struct tlob_task_state *ws)
> +{
> + hrtimer_start(&ws->deadline_timer,
> + ns_to_ktime(ws->threshold_us * NSEC_PER_USEC),
> + HRTIMER_MODE_REL);
> +}
> +
> +/*
> + * Push a violation record into a monitor fd's ring buffer (softirq context).
> + * Drop-new policy: discard incoming record when full. smp_store_release on
> + * data_head pairs with smp_load_acquire in the consumer.
> + */
> +static void tlob_event_push(struct rv_file_priv *priv,
> + const struct tlob_event *info)
> +{
> + struct tlob_ring *ring = &priv->ring;
> + unsigned long flags;
> + u32 head, tail;
> +
> + spin_lock_irqsave(&ring->lock, flags);
> +
> + head = ring->page->data_head;
> + tail = READ_ONCE(ring->page->data_tail);
> +
> + if (head - tail > ring->mask) {
> + /* Ring full: drop incoming record. */
> + ring->page->dropped++;
> + spin_unlock_irqrestore(&ring->lock, flags);
> + return;
> + }
> +
> + ring->data[head & ring->mask] = *info;
> + /* pairs with smp_load_acquire() in the consumer */
> + smp_store_release(&ring->page->data_head, head + 1);
> +
> + spin_unlock_irqrestore(&ring->lock, flags);
> +
> + wake_up_interruptible_poll(&priv->waitq, EPOLLIN | EPOLLRDNORM);
> +}
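
The drop-new policy and the full test used in tlob_event_push() can be modelled in isolation. The sketch below is single-threaded and leaves out the spinlock and the wakeup; struct ring_sketch and ring_push() are hypothetical names, and the payload is a uint64_t stand-in for struct tlob_event.

```c
#include <assert.h>
#include <stdint.h>

struct ring_sketch {
	uint32_t head;		/* next write slot, free-running */
	uint32_t tail;		/* next read slot, free-running */
	uint32_t mask;		/* capacity - 1, capacity is a power of two */
	uint64_t dropped;	/* records discarded because the ring was full */
	uint64_t data[4];	/* stand-in payload, capacity of at most 4 */
};

/* Drop-new push: returns 1 if stored, 0 if the record was discarded. */
static int ring_push(struct ring_sketch *r, uint64_t rec)
{
	assert(r->mask < 4 && ((r->mask + 1) & r->mask) == 0);

	/* Unsigned subtraction stays correct after head wraps past 2^32. */
	if ((uint32_t)(r->head - r->tail) > r->mask) {
		r->dropped++;
		return 0;
	}

	r->data[r->head & r->mask] = rec;
	/* Publish the slot only after its contents are written. */
	__atomic_store_n(&r->head, r->head + 1, __ATOMIC_RELEASE);
	return 1;
}
```

Since mask is capacity - 1, the test (head - tail) > mask is the same condition as (head - tail) == capacity from the UAPI comment, as long as the consumer never advances tail beyond head.
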
> +
> +#if IS_ENABLED(CONFIG_KUNIT)
> +void tlob_event_push_kunit(struct rv_file_priv *priv,
> + const struct tlob_event *info)
> +{
> + tlob_event_push(priv, info);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_event_push_kunit);
> +#endif /* CONFIG_KUNIT */
> +
> +/*
> + * Budget exceeded: remove the entry, record the violation, and inject
> + * budget_expired into the DA.
> + *
> + * Lock order: tlob_table_lock -> entry_lock. tlob_stop_task() sets
> + * ws->canceled under both locks; if we see it here the stop path owns cleanup.
> + * fput/put_task_struct are done before call_rcu(); the RCU callback only
> + * reclaims the slab.
> + */
> +static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer)
> +{
> + struct tlob_task_state *ws =
> + container_of(timer, struct tlob_task_state, deadline_timer);
> + struct tlob_event info = {};
> + struct file *notify_file;
> + struct task_struct *task;
> + unsigned long flags;
> + /* snapshots taken under entry_lock */
> + u64 on_cpu_us, off_cpu_us, threshold_us, tag;
> + u32 switches;
> + bool on_cpu;
> + bool push_event = false;
> +
> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
> + /* stop path sets canceled under both locks; if set it owns cleanup */
> + if (ws->canceled) {
> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> + return HRTIMER_NORESTART;
> + }
> +
> + /* Finalize accounting and snapshot all fields under entry_lock. */
> + raw_spin_lock(&ws->entry_lock);
> +
> + {
> + ktime_t now = ktime_get();
> + u64 delta_us = ktime_to_us(ktime_sub(now, ws->last_ts));
> +
> + if (ws->da_state == on_cpu_tlob)
> + ws->on_cpu_us += delta_us;
> + else
> + ws->off_cpu_us += delta_us;
> + }
> +
> + ws->canceled = 1;
> + on_cpu_us = ws->on_cpu_us;
> + off_cpu_us = ws->off_cpu_us;
> + threshold_us = ws->threshold_us;
> + tag = ws->tag;
> + switches = ws->switches;
> + on_cpu = (ws->da_state == on_cpu_tlob);
> + notify_file = ws->notify_file;
> + if (notify_file) {
> + info.tid = task_pid_vnr(ws->task);
> + info.threshold_us = threshold_us;
> + info.on_cpu_us = on_cpu_us;
> + info.off_cpu_us = off_cpu_us;
> + info.switches = switches;
> + info.state = on_cpu ? 1 : 0;
> + info.tag = tag;
> + push_event = true;
> + }
> +
> + raw_spin_unlock(&ws->entry_lock);
> +
> + hlist_del_rcu(&ws->hlist);
> + atomic_dec(&tlob_num_monitored);
> + /*
> + * Hold a reference so task remains valid across da_handle_event()
> + * after we drop tlob_table_lock.
> + */
> + task = ws->task;
> + get_task_struct(task);
> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +
> + /*
> + * Both locks are now released; ws is exclusively owned (removed from
> + * the hash table with canceled=1). Emit the tracepoint and push the
> + * violation record.
> + */
> + trace_tlob_budget_exceeded(ws->task, threshold_us, on_cpu_us,
> + off_cpu_us, switches, on_cpu, tag);
> +
> + if (push_event) {
> + struct rv_file_priv *priv = notify_file->private_data;
> +
> + if (priv)
> + tlob_event_push(priv, &info);
> + }
> +
> + da_handle_event(task, budget_expired_tlob);
> +
> + if (notify_file)
> + fput(notify_file); /* ref from fget() at TRACE_START */
> + put_task_struct(ws->task); /* ref from tlob_alloc() */
> + put_task_struct(task); /* extra ref from get_task_struct() above */
> + call_rcu(&ws->rcu, tlob_free_rcu_slab);
> + return HRTIMER_NORESTART;
> +}
> +
> +/* Tracepoint handlers */
> +
> +/*
> + * handle_sched_switch - advance the DA and accumulate on/off-CPU time.
> + *
> + * RCU read-side for lock-free lookup; entry_lock for per-task accounting.
> + * da_handle_event() is called after rcu_read_unlock() to avoid holding the
> + * read-side critical section across the RV framework.
> + */
> +static void handle_sched_switch(void *data, bool preempt,
> + struct task_struct *prev,
> + struct task_struct *next,
> + unsigned int prev_state)
> +{
> + struct tlob_task_state *ws;
> + unsigned long flags;
> + bool do_prev = false, do_next = false;
> + ktime_t now;
> +
> + rcu_read_lock();
> +
> + ws = tlob_find_rcu(prev);
> + if (ws) {
> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
> + if (!ws->canceled) {
> + now = ktime_get();
> + ws->on_cpu_us += ktime_to_us(ktime_sub(now, ws->last_ts));
> + ws->last_ts = now;
> + ws->switches++;
> + ws->da_state = off_cpu_tlob;
> + do_prev = true;
> + }
> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> + }
> +
> + ws = tlob_find_rcu(next);
> + if (ws) {
> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
> + if (!ws->canceled) {
> + now = ktime_get();
> + ws->off_cpu_us += ktime_to_us(ktime_sub(now, ws->last_ts));
> + ws->last_ts = now;
> + ws->da_state = on_cpu_tlob;
> + do_next = true;
> + }
> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> + }
> +
> + rcu_read_unlock();
> +
> + if (do_prev)
> + da_handle_event(prev, switch_out_tlob);
> + if (do_next)
> + da_handle_event(next, switch_in_tlob);
> +}
> +
> +static void handle_sched_wakeup(void *data, struct task_struct *p)
> +{
> + struct tlob_task_state *ws;
> + unsigned long flags;
> + bool found = false;
> +
> + rcu_read_lock();
> + ws = tlob_find_rcu(p);
> + if (ws) {
> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
> + found = !ws->canceled;
> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> + }
> + rcu_read_unlock();
> +
> + if (found)
> + da_handle_event(p, sched_wakeup_tlob);
> +}
> +
> +/* -----------------------------------------------------------------------
> + * Core start/stop helpers (also called from rv_dev.c)
> + * -----------------------------------------------------------------------
> + */
> +
> +/*
> + * __tlob_insert - insert @ws into the hash table and arm its deadline timer.
> + *
> + * Re-checks for duplicates and capacity under tlob_table_lock; the caller
> + * may have done a lock-free pre-check before allocating @ws. On failure @ws
> + * is freed directly (never in table, so no call_rcu needed).
> + */
> +static int __tlob_insert(struct task_struct *task, struct tlob_task_state *ws)
> +{
> + unsigned int h;
> + unsigned long flags;
> +
> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
> + if (tlob_find_rcu(task)) {
> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> + if (ws->notify_file)
> + fput(ws->notify_file);
> + put_task_struct(ws->task);
> + kmem_cache_free(tlob_state_cache, ws);
> + return -EEXIST;
> + }
> + if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) {
> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> + if (ws->notify_file)
> + fput(ws->notify_file);
> + put_task_struct(ws->task);
> + kmem_cache_free(tlob_state_cache, ws);
> + return -ENOSPC;
> + }
> + h = tlob_hash_task(task);
> + hlist_add_head_rcu(&ws->hlist, &tlob_htable[h]);
> + atomic_inc(&tlob_num_monitored);
> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +
> + da_handle_start_run_event(task, trace_start_tlob);
> + tlob_arm_deadline(ws);
> + return 0;
> +}
> +
> +/**
> + * tlob_start_task - begin monitoring @task with latency budget @threshold_us.
> + *
> + * @notify_file: /dev/rv fd whose ring buffer receives a tlob_event on
> + * violation; caller transfers the fget() reference to tlob.c.
> + * Pass NULL for synchronous mode (violations only via
> + * TRACE_STOP return value and the tlob_budget_exceeded event).
> + *
> + * Returns 0, -ENODEV, -EEXIST, -ENOSPC, or -ENOMEM. On failure the caller
> + * retains responsibility for any @notify_file reference.
> + */
> +int tlob_start_task(struct task_struct *task, u64 threshold_us,
> + struct file *notify_file, u64 tag)
> +{
> + struct tlob_task_state *ws;
> + unsigned long flags;
> +
> + if (!tlob_state_cache)
> + return -ENODEV;
> +
> + if (threshold_us > (u64)KTIME_MAX / NSEC_PER_USEC)
> + return -ERANGE;
> +
> + /* Quick pre-check before allocation. */
> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
> + if (tlob_find_rcu(task)) {
> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> + return -EEXIST;
> + }
> + if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) {
> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> + return -ENOSPC;
> + }
> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +
> + ws = tlob_alloc(task, threshold_us, tag);
> + if (!ws)
> + return -ENOMEM;
> +
> + ws->notify_file = notify_file;
> + return __tlob_insert(task, ws);
> +}
> +EXPORT_SYMBOL_GPL(tlob_start_task);
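
The range check at the top of tlob_start_task() guards the multiplication done later in tlob_arm_deadline(). A small userspace sketch of the same idea, assuming KTIME_MAX equals INT64_MAX and NSEC_PER_USEC equals 1000; the SK_* macros and us_to_ns_checked() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK_NSEC_PER_USEC 1000ULL
#define SK_KTIME_MAX	 INT64_MAX	/* assumption: matches the kernel's KTIME_MAX */

/*
 * Convert microseconds to nanoseconds, refusing values whose product
 * would not fit in a signed 64-bit nanosecond count.
 */
static int us_to_ns_checked(uint64_t us, int64_t *ns_out)
{
	assert(ns_out != NULL);

	/* Divide the limit instead of multiplying the input: no overflow. */
	if (us > (uint64_t)SK_KTIME_MAX / SK_NSEC_PER_USEC)
		return -ERANGE;

	*ns_out = (int64_t)(us * SK_NSEC_PER_USEC);
	return 0;
}
```

Checking against the divided limit is what makes the later multiplication safe; testing the product after the fact would be too late, because the 64-bit multiply has already wrapped by then.
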
> +
> +/**
> + * tlob_stop_task - stop monitoring @task before the deadline fires.
> + *
> + * Sets canceled under entry_lock (inside tlob_table_lock) before calling
> + * hrtimer_cancel(), racing safely with the timer callback.
> + *
> + * Returns 0 if within budget, -ESRCH if the entry is gone (deadline already
> + * fired, or TRACE_START was never called).
> + */
> +int tlob_stop_task(struct task_struct *task)
> +{
> + struct tlob_task_state *ws;
> + struct file *notify_file;
> + unsigned long flags;
> +
> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
> + ws = tlob_find_rcu(task);
> + if (!ws) {
> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> + return -ESRCH;
> + }
> +
> + /* Prevent handle_sched_switch from updating accounting after removal. */
> + raw_spin_lock(&ws->entry_lock);
> + ws->canceled = 1;
> + raw_spin_unlock(&ws->entry_lock);
> +
> + hlist_del_rcu(&ws->hlist);
> + atomic_dec(&tlob_num_monitored);
> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +
> + hrtimer_cancel(&ws->deadline_timer);
> +
> + da_handle_event(task, trace_stop_tlob);
> +
> + notify_file = ws->notify_file;
> + if (notify_file)
> + fput(notify_file);
> + put_task_struct(ws->task);
> + call_rcu(&ws->rcu, tlob_free_rcu_slab);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(tlob_stop_task);
> +
> +/* Stop monitoring all tracked tasks; called on monitor disable. */
> +static void tlob_stop_all(void)
> +{
> + struct tlob_task_state *batch[TLOB_MAX_MONITORED];
> + struct tlob_task_state *ws;
> + struct hlist_node *tmp;
> + unsigned long flags;
> + int n = 0, i;
> +
> + raw_spin_lock_irqsave(&tlob_table_lock, flags);
> + for (i = 0; i < TLOB_HTABLE_SIZE; i++) {
> + hlist_for_each_entry_safe(ws, tmp, &tlob_htable[i], hlist) {
> + raw_spin_lock(&ws->entry_lock);
> + ws->canceled = 1;
> + raw_spin_unlock(&ws->entry_lock);
> + hlist_del_rcu(&ws->hlist);
> + atomic_dec(&tlob_num_monitored);
> + if (n < TLOB_MAX_MONITORED)
> + batch[n++] = ws;
> + }
> + }
> + raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +
> + for (i = 0; i < n; i++) {
> + ws = batch[i];
> + hrtimer_cancel(&ws->deadline_timer);
> + da_handle_event(ws->task, trace_stop_tlob);
> + if (ws->notify_file)
> + fput(ws->notify_file);
> + put_task_struct(ws->task);
> + call_rcu(&ws->rcu, tlob_free_rcu_slab);
> + }
> +}
> +
> +/* uprobe binding helpers */
> +
> +static int tlob_uprobe_entry_handler(struct uprobe_consumer *uc,
> + struct pt_regs *regs, __u64 *data)
> +{
> + struct tlob_uprobe_binding *b =
> + container_of(uc, struct tlob_uprobe_binding, entry_uc);
> +
> + tlob_start_task(current, b->threshold_us, NULL, (u64)b->offset_start);
> + return 0;
> +}
> +
> +static int tlob_uprobe_stop_handler(struct uprobe_consumer *uc,
> + struct pt_regs *regs, __u64 *data)
> +{
> + tlob_stop_task(current);
> + return 0;
> +}
> +
> +/*
> + * Register start + stop entry uprobes for a binding.
> + * Both are plain entry uprobes (no uretprobe), so a wrong offset never
> + * corrupts the call stack; the worst outcome is a missed stop (hrtimer
> + * fires and reports a budget violation).
> + * Called with tlob_uprobe_mutex held.
> + */
> +static int tlob_add_uprobe(u64 threshold_us, const char *binpath,
> + loff_t offset_start, loff_t offset_stop)
> +{
> + struct tlob_uprobe_binding *b, *tmp_b;
> + char pathbuf[TLOB_MAX_PATH];
> + struct inode *inode;
> + char *canon;
> + int ret;
> +
> + b = kzalloc(sizeof(*b), GFP_KERNEL);
> + if (!b)
> + return -ENOMEM;
> +
> + if (binpath[0] != '/') {
> + kfree(b);
> + return -EINVAL;
> + }
> +
> + b->threshold_us = threshold_us;
> + b->offset_start = offset_start;
> + b->offset_stop = offset_stop;
> +
> + ret = kern_path(binpath, LOOKUP_FOLLOW, &b->path);
> + if (ret)
> + goto err_free;
> +
> + if (!d_is_reg(b->path.dentry)) {
> + ret = -EINVAL;
> + goto err_path;
> + }
> +
> + /* Reject duplicate start offset for the same binary. */
> + list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
> + if (tmp_b->offset_start == offset_start &&
> + tmp_b->path.dentry == b->path.dentry) {
> + ret = -EEXIST;
> + goto err_path;
> + }
> + }
> +
> + /* Store canonical path for read-back and removal matching. */
> + canon = d_path(&b->path, pathbuf, sizeof(pathbuf));
> + if (IS_ERR(canon)) {
> + ret = PTR_ERR(canon);
> + goto err_path;
> + }
> + strscpy(b->binpath, canon, sizeof(b->binpath));
> +
> + b->entry_uc.handler = tlob_uprobe_entry_handler;
> + b->stop_uc.handler = tlob_uprobe_stop_handler;
> +
> + inode = d_real_inode(b->path.dentry);
> +
> + b->entry_uprobe = uprobe_register(inode, offset_start, 0, &b->entry_uc);
> + if (IS_ERR(b->entry_uprobe)) {
> + ret = PTR_ERR(b->entry_uprobe);
> + b->entry_uprobe = NULL;
> + goto err_path;
> + }
> +
> + b->stop_uprobe = uprobe_register(inode, offset_stop, 0, &b->stop_uc);
> + if (IS_ERR(b->stop_uprobe)) {
> + ret = PTR_ERR(b->stop_uprobe);
> + b->stop_uprobe = NULL;
> + goto err_entry;
> + }
> +
> + list_add_tail(&b->list, &tlob_uprobe_list);
> + return 0;
> +
> +err_entry:
> + uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
> + uprobe_unregister_sync();
> +err_path:
> + path_put(&b->path);
> +err_free:
> + kfree(b);
> + return ret;
> +}
> +
> +/*
> + * Remove the uprobe binding for (offset_start, binpath).
> + * binpath is resolved to a dentry for comparison so symlinks are handled
> + * correctly. Called with tlob_uprobe_mutex held.
> + */
> +static void tlob_remove_uprobe_by_key(loff_t offset_start, const char *binpath)
> +{
> + struct tlob_uprobe_binding *b, *tmp;
> + struct path remove_path;
> +
> + if (kern_path(binpath, LOOKUP_FOLLOW, &remove_path))
> + return;
> +
> + list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
> + if (b->offset_start != offset_start)
> + continue;
> + if (b->path.dentry != remove_path.dentry)
> + continue;
> + uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
> + uprobe_unregister_nosync(b->stop_uprobe, &b->stop_uc);
> + list_del(&b->list);
> + uprobe_unregister_sync();
> + path_put(&b->path);
> + kfree(b);
> + break;
> + }
> +
> + path_put(&remove_path);
> +}
> +
> +/* Unregister all uprobe bindings; called from disable_tlob(). */
> +static void tlob_remove_all_uprobes(void)
> +{
> + struct tlob_uprobe_binding *b, *tmp;
> +
> + mutex_lock(&tlob_uprobe_mutex);
> + list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
> + uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
> + uprobe_unregister_nosync(b->stop_uprobe, &b->stop_uc);
> + list_del(&b->list);
> + path_put(&b->path);
> + kfree(b);
> + }
> + mutex_unlock(&tlob_uprobe_mutex);
> + uprobe_unregister_sync();
> +}
> +
> +/*
> + * tracefs "monitor" file
> + *
> + * Read: one "threshold_us:0xoffset_start:0xoffset_stop:binary_path\n"
> + * line per registered uprobe binding.
> + * Write: "threshold_us:offset_start:offset_stop:binary_path" - add uprobe binding
> + * "-offset_start:binary_path" - remove uprobe binding
> + */
> +
> +static ssize_t tlob_monitor_read(struct file *file,
> + char __user *ubuf,
> + size_t count, loff_t *ppos)
> +{
> + /* pid(10) + threshold(20) + 2 offsets(2*18) + path(256) + delimiters */
> + const int line_sz = TLOB_MAX_PATH + 72;
> + struct tlob_uprobe_binding *b;
> + char *buf, *p;
> + int n = 0, buf_sz, pos = 0;
> + ssize_t ret;
> +
> + mutex_lock(&tlob_uprobe_mutex);
> + list_for_each_entry(b, &tlob_uprobe_list, list)
> + n++;
> + mutex_unlock(&tlob_uprobe_mutex);
> +
> + buf_sz = (n ? n : 1) * line_sz + 1;
> + buf = kmalloc(buf_sz, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + mutex_lock(&tlob_uprobe_mutex);
> + list_for_each_entry(b, &tlob_uprobe_list, list) {
> + p = b->binpath;
> + pos += scnprintf(buf + pos, buf_sz - pos,
> + "%llu:0x%llx:0x%llx:%s\n",
> + b->threshold_us,
> + (unsigned long long)b->offset_start,
> + (unsigned long long)b->offset_stop,
> + p);
> + }
> + mutex_unlock(&tlob_uprobe_mutex);
> +
> + ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
> + kfree(buf);
> + return ret;
> +}
> +
> +/*
> + * Parse "threshold_us:offset_start:offset_stop:binary_path".
> + * binary_path comes last so it may freely contain ':'.
> + * Returns 0 on success.
> + */
> +VISIBLE_IF_KUNIT int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
> + char **path_out,
> + loff_t *start_out, loff_t *stop_out)
> +{
> + unsigned long long thr;
> + long long start, stop;
> + int n = 0;
> +
> + /*
> + * %llu : decimal-only (microseconds)
> + * %lli : auto-base, accepts 0x-prefixed hex for offsets
> + * %n : records the byte offset of the first path character
> + */
> + if (sscanf(buf, "%llu:%lli:%lli:%n", &thr, &start, &stop, &n) != 3)
> + return -EINVAL;
> + if (thr == 0 || n == 0 || buf[n] == '\0')
> + return -EINVAL;
> + if (start < 0 || stop < 0)
> + return -EINVAL;
> +
> + *thr_out = thr;
> + *start_out = start;
> + *stop_out = stop;
> + *path_out = buf + n;
> + return 0;
> +}
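
The parsing approach in tlob_parse_uprobe_line() can be tried out in userspace, since the C library's sscanf() accepts the same conversions. This is only a sketch under that assumption: parse_uprobe_line() is a hypothetical stand-in that takes plain long long outputs instead of u64/loff_t, and the kernel's sscanf() may differ in corner cases.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Parse "threshold_us:offset_start:offset_stop:binary_path". */
static int parse_uprobe_line(const char *buf, unsigned long long *thr,
			     long long *start, long long *stop,
			     const char **path)
{
	int n = 0;

	assert(buf && thr && start && stop && path);

	/*
	 * %lli auto-detects the base, so 0x-prefixed offsets are accepted.
	 * %n is only reached if the third ':' matched; it does not count
	 * toward the return value, hence the comparison with 3.
	 */
	if (sscanf(buf, "%llu:%lli:%lli:%n", thr, start, stop, &n) != 3)
		return -EINVAL;
	if (*thr == 0 || n == 0 || buf[n] == '\0')
		return -EINVAL;
	if (*start < 0 || *stop < 0)
		return -EINVAL;

	*path = buf + n;	/* everything after the third ':' */
	return 0;
}
```

Leaving n at zero when the final ':' is missing is what lets the n == 0 test reject a line that has three numbers but no path field, and taking the path as the remainder of the buffer is why it may contain ':' itself.
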
> +
> +static ssize_t tlob_monitor_write(struct file *file,
> + const char __user *ubuf,
> + size_t count, loff_t *ppos)
> +{
> + char buf[TLOB_MAX_PATH + 64];
> + loff_t offset_start, offset_stop;
> + u64 threshold_us;
> + char *binpath;
> + int ret;
> +
> + if (count >= sizeof(buf))
> + return -EINVAL;
> + if (copy_from_user(buf, ubuf, count))
> + return -EFAULT;
> + buf[count] = '\0';
> +
> + if (count > 0 && buf[count - 1] == '\n')
> + buf[count - 1] = '\0';
> +
> + /* Remove request: "-offset_start:binary_path" */
> + if (buf[0] == '-') {
> + long long off;
> + int n = 0;
> +
> + if (sscanf(buf + 1, "%lli:%n", &off, &n) != 1 || n == 0)
> + return -EINVAL;
> + binpath = buf + 1 + n;
> + if (binpath[0] != '/')
> + return -EINVAL;
> +
> + mutex_lock(&tlob_uprobe_mutex);
> + tlob_remove_uprobe_by_key((loff_t)off, binpath);
> + mutex_unlock(&tlob_uprobe_mutex);
> +
> + return (ssize_t)count;
> + }
> +
> + /*
> + * Uprobe binding: "threshold_us:offset_start:offset_stop:binary_path"
> + * binpath points into buf at the start of the path field.
> + */
> + ret = tlob_parse_uprobe_line(buf, &threshold_us,
> + &binpath, &offset_start, &offset_stop);
> + if (ret)
> + return ret;
> +
> + mutex_lock(&tlob_uprobe_mutex);
> + ret = tlob_add_uprobe(threshold_us, binpath, offset_start, offset_stop);
> + mutex_unlock(&tlob_uprobe_mutex);
> + return ret ? ret : (ssize_t)count;
> +}
> +
> +static const struct file_operations tlob_monitor_fops = {
> + .open = simple_open,
> + .read = tlob_monitor_read,
> + .write = tlob_monitor_write,
> + .llseek = noop_llseek,
> +};
> +
> +/*
> + * __tlob_init_monitor / __tlob_destroy_monitor - called with rv_interface_lock
> + * held (required by da_monitor_init/destroy via rv_get/put_task_monitor_slot).
> + */
> +static int __tlob_init_monitor(void)
> +{
> + int i, retval;
> +
> + tlob_state_cache = kmem_cache_create("tlob_task_state",
> + sizeof(struct tlob_task_state),
> + 0, 0, NULL);
> + if (!tlob_state_cache)
> + return -ENOMEM;
> +
> + for (i = 0; i < TLOB_HTABLE_SIZE; i++)
> + INIT_HLIST_HEAD(&tlob_htable[i]);
> + atomic_set(&tlob_num_monitored, 0);
> +
> + retval = da_monitor_init();
> + if (retval) {
> + kmem_cache_destroy(tlob_state_cache);
> + tlob_state_cache = NULL;
> + return retval;
> + }
> +
> + rv_this.enabled = 1;
> + return 0;
> +}
> +
> +static void __tlob_destroy_monitor(void)
> +{
> + rv_this.enabled = 0;
> + tlob_stop_all();
> + tlob_remove_all_uprobes();
> + /*
> + * Drain pending call_rcu() callbacks from tlob_stop_all() before
> + * destroying the kmem_cache.
> + */
> + synchronize_rcu();
> + da_monitor_destroy();
> + kmem_cache_destroy(tlob_state_cache);
> + tlob_state_cache = NULL;
> +}
> +
> +/*
> + * tlob_init_monitor / tlob_destroy_monitor - KUnit wrappers that acquire
> + * rv_interface_lock, satisfying the lockdep_assert_held() inside
> + * rv_get/put_task_monitor_slot().
> + */
> +VISIBLE_IF_KUNIT int tlob_init_monitor(void)
> +{
> + int ret;
> +
> + mutex_lock(&rv_interface_lock);
> + ret = __tlob_init_monitor();
> + mutex_unlock(&rv_interface_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_init_monitor);
> +
> +VISIBLE_IF_KUNIT void tlob_destroy_monitor(void)
> +{
> + mutex_lock(&rv_interface_lock);
> + __tlob_destroy_monitor();
> + mutex_unlock(&rv_interface_lock);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_destroy_monitor);
> +
> +VISIBLE_IF_KUNIT int tlob_enable_hooks(void)
> +{
> + rv_attach_trace_probe("tlob", sched_switch, handle_sched_switch);
> + rv_attach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
> + return 0;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_enable_hooks);
> +
> +VISIBLE_IF_KUNIT void tlob_disable_hooks(void)
> +{
> + rv_detach_trace_probe("tlob", sched_switch, handle_sched_switch);
> + rv_detach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_disable_hooks);
> +
> +/*
> + * enable_tlob / disable_tlob - called by rv_enable/disable_monitor() which
> + * already holds rv_interface_lock; call the __ variants directly.
> + */
> +static int enable_tlob(void)
> +{
> + int retval;
> +
> + retval = __tlob_init_monitor();
> + if (retval)
> + return retval;
> +
> + return tlob_enable_hooks();
> +}
> +
> +static void disable_tlob(void)
> +{
> + tlob_disable_hooks();
> + __tlob_destroy_monitor();
> +}
> +
> +static struct rv_monitor rv_this = {
> + .name = "tlob",
> + .description = "Per-task latency-over-budget monitor.",
> + .enable = enable_tlob,
> + .disable = disable_tlob,
> + .reset = da_monitor_reset_all,
> + .enabled = 0,
> +};
> +
> +static int __init register_tlob(void)
> +{
> + int ret;
> +
> + ret = rv_register_monitor(&rv_this, NULL);
> + if (ret)
> + return ret;
> +
> + if (rv_this.root_d) {
> + tracefs_create_file("monitor", 0644, rv_this.root_d, NULL,
> + &tlob_monitor_fops);
> + }
> +
> + return 0;
> +}
> +
> +static void __exit unregister_tlob(void)
> +{
> + rv_unregister_monitor(&rv_this);
> +}
> +
> +module_init(register_tlob);
> +module_exit(unregister_tlob);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Wen Yang <wen.yang@linux.dev>");
> +MODULE_DESCRIPTION("tlob: task latency over budget per-task monitor.");
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.h b/kernel/trace/rv/monitors/tlob/tlob.h
> new file mode 100644
> index 000000000..3438a6175
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob.h
> @@ -0,0 +1,145 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _RV_TLOB_H
> +#define _RV_TLOB_H
> +
> +/*
> + * C representation of the tlob automaton, generated from tlob.dot via rvgen
> + * and extended with tlob_start_task()/tlob_stop_task() declarations.
> + * For the format description see Documentation/trace/rv/deterministic_automata.rst
> + */
> +
> +#include <linux/rv.h>
> +#include <uapi/linux/rv.h>
> +
> +#define MONITOR_NAME tlob
> +
> +enum states_tlob {
> + unmonitored_tlob,
> + on_cpu_tlob,
> + off_cpu_tlob,
> + state_max_tlob,
> +};
> +
> +#define INVALID_STATE state_max_tlob
> +
> +enum events_tlob {
> + trace_start_tlob,
> + switch_in_tlob,
> + switch_out_tlob,
> + sched_wakeup_tlob,
> + trace_stop_tlob,
> + budget_expired_tlob,
> + event_max_tlob,
> +};
> +
> +struct automaton_tlob {
> + char *state_names[state_max_tlob];
> + char *event_names[event_max_tlob];
> + unsigned char function[state_max_tlob][event_max_tlob];
> + unsigned char initial_state;
> + bool final_states[state_max_tlob];
> +};
> +
> +static const struct automaton_tlob automaton_tlob = {
> + .state_names = {
> + "unmonitored",
> + "on_cpu",
> + "off_cpu",
> + },
> + .event_names = {
> + "trace_start",
> + "switch_in",
> + "switch_out",
> + "sched_wakeup",
> + "trace_stop",
> + "budget_expired",
> + },
> + .function = {
> + /* unmonitored */
> + {
> + on_cpu_tlob, /* trace_start */
> + unmonitored_tlob, /* switch_in */
> + unmonitored_tlob, /* switch_out */
> + unmonitored_tlob, /* sched_wakeup */
> + INVALID_STATE, /* trace_stop */
> + INVALID_STATE, /* budget_expired */
> + },
> + /* on_cpu */
> + {
> + INVALID_STATE, /* trace_start */
> + INVALID_STATE, /* switch_in */
> + off_cpu_tlob, /* switch_out */
> + on_cpu_tlob, /* sched_wakeup */
> + unmonitored_tlob, /* trace_stop */
> + unmonitored_tlob, /* budget_expired */
> + },
> + /* off_cpu */
> + {
> + INVALID_STATE, /* trace_start */
> + on_cpu_tlob, /* switch_in */
> + off_cpu_tlob, /* switch_out */
> + off_cpu_tlob, /* sched_wakeup */
> + unmonitored_tlob, /* trace_stop */
> + unmonitored_tlob, /* budget_expired */
> + },
> + },
> + /*
> + * final_states: unmonitored is the sole accepting state.
> + * Violations are recorded via ntf_push and tlob_budget_exceeded.
> + */
> + .initial_state = unmonitored_tlob,
> + .final_states = { 1, 0, 0 },
> +};
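
To make the table easier to read, here is a standalone sketch of how a transition lookup over it could work. The names (step(), next_state[], the shortened enum constants) are mine and only mirror the table above; the real lookup is done by the DA monitor framework, which this does not try to reproduce.

```c
/* Hypothetical mirror of the tlob transition table, for illustration only. */
enum state { UNMONITORED, ON_CPU, OFF_CPU, STATE_MAX };
enum event { TRACE_START, SWITCH_IN, SWITCH_OUT, SCHED_WAKEUP,
	     TRACE_STOP, BUDGET_EXPIRED, EVENT_MAX };

#define INVALID STATE_MAX

static const unsigned char next_state[STATE_MAX][EVENT_MAX] = {
	/* unmonitored */
	{ ON_CPU, UNMONITORED, UNMONITORED, UNMONITORED, INVALID, INVALID },
	/* on_cpu */
	{ INVALID, INVALID, OFF_CPU, ON_CPU, UNMONITORED, UNMONITORED },
	/* off_cpu */
	{ INVALID, ON_CPU, OFF_CPU, OFF_CPU, UNMONITORED, UNMONITORED },
};

/*
 * Apply one event. Returns 0 and updates *cur on a valid transition,
 * -1 (leaving *cur untouched) when the table marks it invalid.
 */
static int step(enum state *cur, enum event ev)
{
	unsigned char next = next_state[*cur][ev];

	if (next == INVALID)
		return -1;
	*cur = (enum state)next;
	return 0;
}
```

Read this way, a second trace_start while on_cpu or off_cpu is the kind of transition that would show up as an error_tlob event.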
> +
> +/* Exported for use by the RV ioctl layer (rv_dev.c) */
> +int tlob_start_task(struct task_struct *task, u64 threshold_us,
> + struct file *notify_file, u64 tag);
> +int tlob_stop_task(struct task_struct *task);
> +
> +/* Maximum number of concurrently monitored tasks (also used by KUnit). */
> +#define TLOB_MAX_MONITORED 64U
> +
> +/*
> + * Ring buffer constants (also published in UAPI for mmap size calculation).
> + */
> +#define TLOB_RING_DEFAULT_CAP 64U /* records allocated at open() */
> +#define TLOB_RING_MIN_CAP 8U /* minimum accepted by mmap() */
> +#define TLOB_RING_MAX_CAP 4096U /* maximum accepted by mmap() */
> +
> +/**
> + * struct tlob_ring - per-fd mmap-capable violation ring buffer.
> + *
> + * Allocated as a contiguous page range at rv_open() time:
> + * page 0: struct tlob_mmap_page (shared with userspace)
> + * pages 1-N: struct tlob_event[capacity]
> + */
> +struct tlob_ring {
> + struct tlob_mmap_page *page;
> + struct tlob_event *data;
> + u32 mask;
> + spinlock_t lock;
> + unsigned long base;
> + unsigned int order;
> +};
> +
> +/**
> + * struct rv_file_priv - per-fd private data for /dev/rv.
> + */
> +struct rv_file_priv {
> + struct tlob_ring ring;
> + wait_queue_head_t waitq;
> +};
> +
> +#if IS_ENABLED(CONFIG_KUNIT)
> +int tlob_init_monitor(void);
> +void tlob_destroy_monitor(void);
> +int tlob_enable_hooks(void);
> +void tlob_disable_hooks(void);
> +void tlob_event_push_kunit(struct rv_file_priv *priv,
> + const struct tlob_event *info);
> +int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
> + char **path_out,
> + loff_t *start_out, loff_t *stop_out);
> +#endif /* CONFIG_KUNIT */
> +
> +#endif /* _RV_TLOB_H */
> diff --git a/kernel/trace/rv/monitors/tlob/tlob_trace.h b/kernel/trace/rv/monitors/tlob/tlob_trace.h
> new file mode 100644
> index 000000000..b08d67776
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob_trace.h
> @@ -0,0 +1,42 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Snippet to be included in rv_trace.h
> + */
> +
> +#ifdef CONFIG_RV_MON_TLOB
> +/*
> + * tlob uses the generic event_da_monitor_id and error_da_monitor_id event
> + * classes so that both event classes are instantiated. This avoids a
> + * -Werror=unused-variable warning that the compiler emits when a
> + * DECLARE_EVENT_CLASS has no corresponding DEFINE_EVENT instance.
> + *
> + * The event_tlob tracepoint is defined here but the call-site in
> + * da_handle_event() is overridden with a no-op macro below so that no
> + * trace record is emitted on every scheduler context switch. Budget
> + * violations are reported via the dedicated tlob_budget_exceeded event.
> + *
> + * error_tlob IS kept active so that invalid DA transitions (programming
> + * errors) are still visible in the ftrace ring buffer for debugging.
> + */
> +DEFINE_EVENT(event_da_monitor_id, event_tlob,
> + TP_PROTO(int id, char *state, char *event, char *next_state,
> + bool final_state),
> + TP_ARGS(id, state, event, next_state, final_state));
> +
> +DEFINE_EVENT(error_da_monitor_id, error_tlob,
> + TP_PROTO(int id, char *state, char *event),
> + TP_ARGS(id, state, event));
> +
> +/*
> + * Override the trace_event_tlob() call-site with a no-op after the
> + * DEFINE_EVENT above has satisfied the event class instantiation
> + * requirement. The tracepoint symbol itself exists (and can be enabled
> + * via tracefs) but the automatic call from da_handle_event() is silenced
> + * to avoid per-context-switch ftrace noise during normal operation.
> + */
> +#undef trace_event_tlob
> +#define trace_event_tlob(id, state, event, next_state, final_state) \
> + do { (void)(id); (void)(state); (void)(event); \
> + (void)(next_state); (void)(final_state); } while (0)
> +#endif /* CONFIG_RV_MON_TLOB */
> diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
> index ee4e68102..e754e76d5 100644
> --- a/kernel/trace/rv/rv.c
> +++ b/kernel/trace/rv/rv.c
> @@ -148,6 +148,10 @@
> #include <rv_trace.h>
> #endif
>
> +#ifdef CONFIG_RV_MON_TLOB
> +EXPORT_TRACEPOINT_SYMBOL_GPL(tlob_budget_exceeded);
> +#endif
> +
> #include "rv.h"
>
> DEFINE_MUTEX(rv_interface_lock);
> diff --git a/kernel/trace/rv/rv_dev.c b/kernel/trace/rv/rv_dev.c
> new file mode 100644
> index 000000000..a052f3203
> --- /dev/null
> +++ b/kernel/trace/rv/rv_dev.c
> @@ -0,0 +1,602 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * rv_dev.c - /dev/rv misc device for RV monitor self-instrumentation
> + *
> + * A single misc device (MISC_DYNAMIC_MINOR) serves all RV monitors.
> + * ioctl numbers encode the monitor identity:
> + *
> + * 0x01 - 0x1F tlob (task latency over budget)
> + * 0x20 - 0x3F reserved
> + *
> + * Each monitor exports tlob_start_task() / tlob_stop_task() which are
> + * called here. The calling task is identified by current.
> + *
> + * Magic: RV_IOC_MAGIC (0xB9), defined in include/uapi/linux/rv.h
> + *
> + * Per-fd private data (rv_file_priv)
> + * ------------------------------------
> + * Every open() of /dev/rv allocates an rv_file_priv (defined in tlob.h).
> + * When TLOB_IOCTL_TRACE_START is called with args.notify_fd >= 0, violations
> + * are pushed as tlob_event records into that fd's per-fd ring buffer (tlob_ring)
> + * and its poll/epoll waitqueue is woken.
> + *
> + * Consumers drain records with read() on the notify_fd; read() blocks until
> + * at least one record is available (unless O_NONBLOCK is set).
> + *
> + * Per-thread "started" tracking (tlob_task_handle)
> + * -------------------------------------------------
> + * tlob_stop_task() returns -ESRCH in two distinct situations:
> + *
> + * (a) The deadline timer already fired and removed the tlob hash-table
> + * entry before TRACE_STOP arrived -> budget was exceeded -> -EOVERFLOW
> + *
> + * (b) TRACE_START was never called for this thread -> programming error
> + * -> -ESRCH
> + *
> + * To distinguish them, rv_dev.c maintains a lightweight hash table
> + * (tlob_handles) that records a tlob_task_handle for every task_struct *
> + * for which a successful TLOB_IOCTL_TRACE_START has been
> + * issued but the corresponding TLOB_IOCTL_TRACE_STOP has not yet arrived.
> + *
> + * tlob_task_handle is a thin "session ticket" -- it carries only the
> + * task pointer and the owning file descriptor. The heavy per-task state
> + * (hrtimer, DA state, threshold) lives in tlob_task_state inside tlob.c.
> + *
> + * The table is keyed on task_struct * (same key as tlob.c), protected
> + * by tlob_handles_lock (spinlock, irq-safe). No get_task_struct()
> + * refcount is needed here because tlob.c already holds a reference for
> + * each live entry.
> + *
> + * Multiple threads may share the same fd. Each thread has its own
> + * tlob_task_handle in the table, so concurrent TRACE_START / TRACE_STOP
> + * calls from different threads do not interfere.
> + *
> + * The fd release path (rv_release) calls tlob_stop_task() for every
> + * handle in tlob_handles that belongs to the closing fd, ensuring cleanup
> + * even if the user forgets to call TRACE_STOP.
> + */
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/gfp.h>
> +#include <linux/hash.h>
> +#include <linux/mm.h>
> +#include <linux/miscdevice.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/uaccess.h>
> +#include <uapi/linux/rv.h>
> +
> +#ifdef CONFIG_RV_MON_TLOB
> +#include "monitors/tlob/tlob.h"
> +#endif
> +
> +/* -----------------------------------------------------------------------
> + * tlob_task_handle - per-thread session ticket for the ioctl interface
> + *
> + * One handle is allocated by TLOB_IOCTL_TRACE_START and freed by
> + * TLOB_IOCTL_TRACE_STOP (or by rv_release if the fd is closed).
> + *
> + * @hlist: Hash-table linkage in tlob_handles (keyed on task pointer).
> + * @task: The monitored thread. Plain pointer; no refcount held here
> + * because tlob.c holds one for the lifetime of the monitoring
> + * window, which encompasses the lifetime of this handle.
> + * @file: The /dev/rv file descriptor that issued TRACE_START.
> + * Used by rv_release() to sweep orphaned handles on close().
> + * -----------------------------------------------------------------------
> + */
> +#define TLOB_HANDLES_BITS 5
> +#define TLOB_HANDLES_SIZE (1 << TLOB_HANDLES_BITS)
> +
> +struct tlob_task_handle {
> + struct hlist_node hlist;
> + struct task_struct *task;
> + struct file *file;
> +};
> +
> +static struct hlist_head tlob_handles[TLOB_HANDLES_SIZE];
> +static DEFINE_SPINLOCK(tlob_handles_lock);
> +
> +static unsigned int tlob_handle_hash(const struct task_struct *task)
> +{
> + return hash_ptr((void *)task, TLOB_HANDLES_BITS);
> +}
> +
> +/* Must be called with tlob_handles_lock held. */
> +static struct tlob_task_handle *
> +tlob_handle_find_locked(struct task_struct *task)
> +{
> + struct tlob_task_handle *h;
> + unsigned int slot = tlob_handle_hash(task);
> +
> + hlist_for_each_entry(h, &tlob_handles[slot], hlist) {
> + if (h->task == task)
> + return h;
> + }
> + return NULL;
> +}
> +
> +/*
> + * tlob_handle_alloc - record that @task has an active monitoring session
> + * opened via @file.
> + *
> + * Returns 0 on success, -EEXIST if @task already has a handle (double
> + * TRACE_START without TRACE_STOP), -ENOMEM on allocation failure.
> + */
> +static int tlob_handle_alloc(struct task_struct *task, struct file *file)
> +{
> + struct tlob_task_handle *h;
> + unsigned long flags;
> + unsigned int slot;
> +
> + h = kmalloc(sizeof(*h), GFP_KERNEL);
> + if (!h)
> + return -ENOMEM;
> + h->task = task;
> + h->file = file;
> +
> + spin_lock_irqsave(&tlob_handles_lock, flags);
> + if (tlob_handle_find_locked(task)) {
> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
> + kfree(h);
> + return -EEXIST;
> + }
> + slot = tlob_handle_hash(task);
> + hlist_add_head(&h->hlist, &tlob_handles[slot]);
> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
> + return 0;
> +}
> +
> +/*
> + * tlob_handle_free - remove the handle for @task and free it.
> + *
> + * Returns 1 if a handle existed (TRACE_START was called), 0 if not found
> + * (TRACE_START was never called for this thread).
> + */
> +static int tlob_handle_free(struct task_struct *task)
> +{
> + struct tlob_task_handle *h;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tlob_handles_lock, flags);
> + h = tlob_handle_find_locked(task);
> + if (h) {
> + hlist_del_init(&h->hlist);
> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
> + kfree(h);
> + return 1;
> + }
> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
> + return 0;
> +}
> +
> +/*
> + * tlob_handle_sweep_file - release all handles owned by @file.
> + *
> + * Called from rv_release() when the fd is closed without TRACE_STOP.
> + * Calls tlob_stop_task() for each orphaned handle to drain the tlob
> + * monitoring entries and prevent resource leaks in tlob.c.
> + *
> + * Handles are collected under the lock (short critical section), then
> + * processed outside it (tlob_stop_task() may sleep/spin internally).
> + */
> +#ifdef CONFIG_RV_MON_TLOB
> +static void tlob_handle_sweep_file(struct file *file)
> +{
> + struct tlob_task_handle *batch[TLOB_HANDLES_SIZE];
> + struct tlob_task_handle *h;
> + struct hlist_node *tmp;
> + unsigned long flags;
> + int i, n = 0;
> +
> + spin_lock_irqsave(&tlob_handles_lock, flags);
> + for (i = 0; i < TLOB_HANDLES_SIZE; i++) {
> + hlist_for_each_entry_safe(h, tmp, &tlob_handles[i], hlist) {
> + if (h->file == file) {
> + hlist_del_init(&h->hlist);
> + batch[n++] = h;
> + }
> + }
> + }
> + spin_unlock_irqrestore(&tlob_handles_lock, flags);
> +
> + for (i = 0; i < n; i++) {
> + /*
> + * Ignore -ESRCH: the deadline timer may have already fired
> + * and cleaned up the tlob entry.
> + */
> + tlob_stop_task(batch[i]->task);
> + kfree(batch[i]);
> + }
> +}
> +#else
> +static inline void tlob_handle_sweep_file(struct file *file) {}
> +#endif /* CONFIG_RV_MON_TLOB */
> +
> +/* -----------------------------------------------------------------------
> + * Ring buffer lifecycle
> + * -----------------------------------------------------------------------
> + */
> +
> +/*
> + * tlob_ring_alloc - allocate a ring of @cap records (must be a power of 2).
> + *
> + * Allocates a physically contiguous block of pages:
> + * page 0 : struct tlob_mmap_page (control page, shared with userspace)
> + * pages 1..N : struct tlob_event[cap] (data pages)
> + *
> + * Each page is marked reserved so it can be mapped to userspace via mmap().
> + */
> +static int tlob_ring_alloc(struct tlob_ring *ring, u32 cap)
> +{
> + unsigned int total = PAGE_SIZE + cap * sizeof(struct tlob_event);
> + unsigned int order = get_order(total);
> + unsigned long base;
> + unsigned int i;
> +
> + base = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
> + if (!base)
> + return -ENOMEM;
> +
> + for (i = 0; i < (1u << order); i++)
> + SetPageReserved(virt_to_page((void *)(base + i * PAGE_SIZE)));
> +
> + ring->base = base;
> + ring->order = order;
> + ring->page = (struct tlob_mmap_page *)base;
> + ring->data = (struct tlob_event *)(base + PAGE_SIZE);
> + ring->mask = cap - 1;
> + spin_lock_init(&ring->lock);
> +
> + ring->page->capacity = cap;
> + ring->page->version = 1;
> + ring->page->data_offset = PAGE_SIZE;
> + ring->page->record_size = sizeof(struct tlob_event);
> + return 0;
> +}
> +
> +static void tlob_ring_free(struct tlob_ring *ring)
> +{
> + unsigned int i;
> +
> + if (!ring->base)
> + return;
> +
> + for (i = 0; i < (1u << ring->order); i++)
> + ClearPageReserved(virt_to_page((void *)(ring->base + i * PAGE_SIZE)));
> +
> + free_pages(ring->base, ring->order);
> + ring->base = 0;
> + ring->page = NULL;
> + ring->data = NULL;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * File operations
> + * -----------------------------------------------------------------------
> + */
> +
> +static int rv_open(struct inode *inode, struct file *file)
> +{
> + struct rv_file_priv *priv;
> + int ret;
> +
> + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> + if (!priv)
> + return -ENOMEM;
> +
> + ret = tlob_ring_alloc(&priv->ring, TLOB_RING_DEFAULT_CAP);
> + if (ret) {
> + kfree(priv);
> + return ret;
> + }
> +
> + init_waitqueue_head(&priv->waitq);
> + file->private_data = priv;
> + return 0;
> +}
> +
> +static int rv_release(struct inode *inode, struct file *file)
> +{
> + struct rv_file_priv *priv = file->private_data;
> +
> + tlob_handle_sweep_file(file);
> + tlob_ring_free(&priv->ring);
> + kfree(priv);
> + file->private_data = NULL;
> + return 0;
> +}
> +
> +static __poll_t rv_poll(struct file *file, poll_table *wait)
> +{
> + struct rv_file_priv *priv = file->private_data;
> +
> + if (!priv)
> + return EPOLLERR;
> +
> + poll_wait(file, &priv->waitq, wait);
> +
> + /*
> + * Pairs with smp_store_release(&ring->page->data_head, ...) in
> + * tlob_event_push(). No lock needed: head is written by the kernel
> + * producer and read here; tail is written by the consumer and we only
> + * need an approximate check for the poll fast path.
> + */
> + if (smp_load_acquire(&priv->ring.page->data_head) !=
> + READ_ONCE(priv->ring.page->data_tail))
> + return EPOLLIN | EPOLLRDNORM;
> +
> + return 0;
> +}
> +
> +/*
> + * rv_read - consume tlob_event violation records from this fd's ring buffer.
> + *
> + * Each read() returns a whole number of struct tlob_event records. @count must
> + * be at least sizeof(struct tlob_event); partial-record sizes are rejected with
> + * -EINVAL.
> + *
> + * Blocking behaviour follows O_NONBLOCK on the fd:
> + * O_NONBLOCK clear: blocks until at least one record is available.
> + * O_NONBLOCK set: returns -EAGAIN immediately if the ring is empty.
> + *
> + * Returns the number of bytes copied (always a multiple of sizeof tlob_event),
> + * -EAGAIN if non-blocking and empty, or a negative error code.
> + *
> + * read() and mmap() share the same ring and data_tail cursor; do not use
> + * both simultaneously on the same fd.
> + */
> +static ssize_t rv_read(struct file *file, char __user *buf, size_t count,
> + loff_t *ppos)
> +{
> + struct rv_file_priv *priv = file->private_data;
> + struct tlob_ring *ring;
> + size_t rec = sizeof(struct tlob_event);
> + unsigned long irqflags;
> + ssize_t done = 0;
> + int ret;
> +
> + if (!priv)
> + return -ENODEV;
> +
> + ring = &priv->ring;
> +
> + if (count < rec)
> + return -EINVAL;
> +
> + /* Blocking path: sleep until the producer advances data_head. */
> + if (!(file->f_flags & O_NONBLOCK)) {
> + ret = wait_event_interruptible(priv->waitq,
> + /* pairs with smp_store_release() in the producer */
> + smp_load_acquire(&ring->page->data_head) !=
> + READ_ONCE(ring->page->data_tail));
> + if (ret)
> + return ret;
> + }
> +
> + /*
> + * Drain records into the caller's buffer. ring->lock serialises
> + * concurrent read() callers and the softirq producer.
> + */
> + while (done + rec <= count) {
> + struct tlob_event record;
> + u32 head, tail;
> +
> + spin_lock_irqsave(&ring->lock, irqflags);
> + /* pairs with smp_store_release() in the producer */
> + head = smp_load_acquire(&ring->page->data_head);
> + tail = ring->page->data_tail;
> + if (head == tail) {
> + spin_unlock_irqrestore(&ring->lock, irqflags);
> + break;
> + }
> + record = ring->data[tail & ring->mask];
> + WRITE_ONCE(ring->page->data_tail, tail + 1);
> + spin_unlock_irqrestore(&ring->lock, irqflags);
> +
> + if (copy_to_user(buf + done, &record, rec))
> + return done ? done : -EFAULT;
> + done += rec;
> + }
> +
> + return done ? done : -EAGAIN;
> +}
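
The head/tail indexing used here (free-running u32 counters, masked with capacity - 1 on access) can be sketched in isolation as below. This is single-threaded and leaves out the locking and the acquire/release pairing the patch relies on. Rejecting a push when the ring is full is my assumption, since the producer side is not in this hunk.

```c
#include <stdint.h>

#define RING_CAP  8u			/* must be a power of two */
#define RING_MASK (RING_CAP - 1u)

/* Hypothetical stand-in for the head/tail cursors and the record array. */
struct ring {
	uint32_t head;			/* advanced by the producer */
	uint32_t tail;			/* advanced by the consumer */
	int data[RING_CAP];
};

static int ring_push(struct ring *r, int v)
{
	/* Unsigned subtraction stays correct across u32 wrap-around. */
	if ((uint32_t)(r->head - r->tail) == RING_CAP)
		return -1;		/* full (assumed policy: reject) */
	r->data[r->head & RING_MASK] = v;
	r->head++;
	return 0;
}

static int ring_pop(struct ring *r, int *v)
{
	if (r->head == r->tail)
		return -1;		/* empty */
	*v = r->data[r->tail & RING_MASK];
	r->tail++;
	return 0;
}
```

The counters never need to be reset: because the capacity is a power of two, masking keeps the slot index valid even after head and tail wrap past 2^32.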
> +
> +/*
> + * rv_mmap - map the per-fd violation ring buffer into userspace.
> + *
> + * The mmap region covers the full ring allocation:
> + *
> + * offset 0 : struct tlob_mmap_page (control page)
> + * offset PAGE_SIZE : struct tlob_event[capacity] (data pages)
> + *
> + * The caller must map exactly PAGE_SIZE + capacity * sizeof(struct tlob_event)
> + * bytes starting at offset 0 (vm_pgoff must be 0). The actual capacity is
> + * read from tlob_mmap_page.capacity after a successful mmap(2).
> + *
> + * Private mappings (MAP_PRIVATE) are rejected: the shared data_tail field
> + * written by userspace must be visible to the kernel producer.
> + */
> +static int rv_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct rv_file_priv *priv = file->private_data;
> + struct tlob_ring *ring;
> + unsigned long size = vma->vm_end - vma->vm_start;
> + unsigned long ring_size;
> +
> + if (!priv)
> + return -ENODEV;
> +
> + ring = &priv->ring;
> +
> + if (vma->vm_pgoff != 0)
> + return -EINVAL;
> +
> + ring_size = PAGE_ALIGN(PAGE_SIZE + ((unsigned long)(ring->mask + 1) *
> + sizeof(struct tlob_event)));
> + if (size != ring_size)
> + return -EINVAL;
> +
> + if (!(vma->vm_flags & VM_SHARED))
> + return -EINVAL;
> +
> + return remap_pfn_range(vma, vma->vm_start,
> + page_to_pfn(virt_to_page((void *)ring->base)),
> + ring_size, vma->vm_page_prot);
> +}
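
The length check above compares against the page-aligned size, so a userspace caller would presumably compute the mapping length along these lines. This is a sketch only: ring_map_size() is a hypothetical helper, and the record size is passed in as a parameter because struct tlob_event is not part of this hunk.

```c
#include <stddef.h>

/*
 * Mapping length for one control page plus capacity records, rounded
 * up to a whole number of pages. page_size must be a power of two.
 */
static size_t ring_map_size(size_t page_size, size_t capacity,
			    size_t record_size)
{
	size_t raw = page_size + capacity * record_size;

	return (raw + page_size - 1) & ~(page_size - 1);
}
```

With 4096-byte pages, 64 records of an assumed 32 bytes each need 4096 + 2048 bytes, which rounds up to two pages.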
> +
> +/* -----------------------------------------------------------------------
> + * ioctl dispatcher
> + * -----------------------------------------------------------------------
> + */
> +
> +static long rv_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> + unsigned int nr = _IOC_NR(cmd);
> +
> + /*
> + * Verify the magic byte so we don't accidentally handle ioctls
> + * intended for a different device.
> + */
> + if (_IOC_TYPE(cmd) != RV_IOC_MAGIC)
> + return -ENOTTY;
> +
> +#ifdef CONFIG_RV_MON_TLOB
> + /* tlob: ioctl numbers 0x01 - 0x1F */
> + switch (cmd) {
> + case TLOB_IOCTL_TRACE_START: {
> + struct tlob_start_args args;
> + struct file *notify_file = NULL;
> + int ret, hret;
> +
> + if (copy_from_user(&args,
> + (struct tlob_start_args __user *)arg,
> + sizeof(args)))
> + return -EFAULT;
> + if (args.threshold_us == 0)
> + return -EINVAL;
> + if (args.flags != 0)
> + return -EINVAL;
> +
> + /*
> + * If notify_fd >= 0, resolve it to a file pointer.
> + * fget() bumps the reference count; tlob.c drops it
> + * via fput() when the monitoring window ends.
> + * Reject non-/dev/rv fds to prevent type confusion.
> + */
> + if (args.notify_fd >= 0) {
> + notify_file = fget(args.notify_fd);
> + if (!notify_file)
> + return -EBADF;
> + if (notify_file->f_op != file->f_op) {
> + fput(notify_file);
> + return -EINVAL;
> + }
> + }
> +
> + ret = tlob_start_task(current, args.threshold_us,
> + notify_file, args.tag);
> + if (ret != 0) {
> + /* tlob.c did not take ownership; drop ref. */
> + if (notify_file)
> + fput(notify_file);
> + return ret;
> + }
> +
> + /*
> + * Record session handle. Free any stale handle left by
> + * a previous window whose deadline timer fired (timer
> + * removes tlob_task_state but cannot touch tlob_handles).
> + */
> + tlob_handle_free(current);
> + hret = tlob_handle_alloc(current, file);
> + if (hret < 0) {
> + tlob_stop_task(current);
> + return hret;
> + }
> + return 0;
> + }
> + case TLOB_IOCTL_TRACE_STOP: {
> + int had_handle;
> + int ret;
> +
> + /*
> + * Atomically remove the session handle for current.
> + *
> + * had_handle == 0: TRACE_START was never called for
> + * this thread -> caller bug -> -ESRCH
> + *
> + * had_handle == 1: TRACE_START was called. If
> + * tlob_stop_task() now returns
> + * -ESRCH, the deadline timer already
> + * fired -> budget exceeded -> -EOVERFLOW
> + */
> + had_handle = tlob_handle_free(current);
> + if (!had_handle)
> + return -ESRCH;
> +
> + ret = tlob_stop_task(current);
> + return (ret == -ESRCH) ? -EOVERFLOW : ret;
> + }
> + default:
> + break;
> + }
> +#endif /* CONFIG_RV_MON_TLOB */
> +
> + return -ENOTTY;
> +}
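
The TRACE_STOP return-code mapping is easy to misread, so here it is pulled out as a pure function. trace_stop_result() is a hypothetical name; it only restates the logic in the case block above.

```c
#include <errno.h>

/*
 * had_handle: 1 if a session handle existed for the caller, 0 otherwise.
 * stop_ret:   what tlob_stop_task() returned (only meaningful when
 *             had_handle is 1).
 */
static int trace_stop_result(int had_handle, int stop_ret)
{
	if (!had_handle)
		return -ESRCH;		/* TRACE_START was never called */
	/* Handle existed but the entry is gone: the timer already fired. */
	return stop_ret == -ESRCH ? -EOVERFLOW : stop_ret;
}
```

So userspace sees -ESRCH only for an unmatched TRACE_STOP, and -EOVERFLOW whenever the budget was exceeded before the stop arrived.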
> +
> +/* -----------------------------------------------------------------------
> + * Module init / exit
> + * -----------------------------------------------------------------------
> + */
> +
> +static const struct file_operations rv_fops = {
> + .owner = THIS_MODULE,
> + .open = rv_open,
> + .release = rv_release,
> + .read = rv_read,
> + .poll = rv_poll,
> + .mmap = rv_mmap,
> + .unlocked_ioctl = rv_ioctl,
> +#ifdef CONFIG_COMPAT
> + .compat_ioctl = rv_ioctl,
> +#endif
> + .llseek = noop_llseek,
> +};
> +
> +/*
> + * 0666: /dev/rv is a self-instrumentation device. All ioctls operate
> + * exclusively on the calling task (current); no task can monitor another
> + * via this interface. Opening the device does not grant any privilege
> + * beyond observing one's own latency, so world-read/write is appropriate.
> + */
> +static struct miscdevice rv_miscdev = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = "rv",
> + .fops = &rv_fops,
> + .mode = 0666,
> +};
> +
> +static int __init rv_ioctl_init(void)
> +{
> + int i;
> +
> + for (i = 0; i < TLOB_HANDLES_SIZE; i++)
> + INIT_HLIST_HEAD(&tlob_handles[i]);
> +
> + return misc_register(&rv_miscdev);
> +}
> +
> +static void __exit rv_ioctl_exit(void)
> +{
> + misc_deregister(&rv_miscdev);
> +}
> +
> +module_init(rv_ioctl_init);
> +module_exit(rv_ioctl_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("RV ioctl interface via /dev/rv");
> diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
> index 4a6faddac..65d6c6485 100644
> --- a/kernel/trace/rv/rv_trace.h
> +++ b/kernel/trace/rv/rv_trace.h
> @@ -126,6 +126,7 @@ DECLARE_EVENT_CLASS(error_da_monitor_id,
> #include <monitors/snroc/snroc_trace.h>
> #include <monitors/nrp/nrp_trace.h>
> #include <monitors/sssw/sssw_trace.h>
> +#include <monitors/tlob/tlob_trace.h>
> // Add new monitors based on CONFIG_DA_MON_EVENTS_ID here
>
> #endif /* CONFIG_DA_MON_EVENTS_ID */
> @@ -202,6 +203,55 @@ TRACE_EVENT(rv_retries_error,
> __get_str(event), __get_str(name))
> );
> #endif /* CONFIG_RV_MON_MAINTENANCE_EVENTS */
> +
> +#ifdef CONFIG_RV_MON_TLOB
> +/*
> + * tlob_budget_exceeded - emitted when a monitored task exceeds its latency
> + * budget. Carries the on-CPU / off-CPU time breakdown so that the cause
> + * of the overrun (CPU-bound vs. scheduling/I/O latency) is immediately
> + * visible in the ftrace ring buffer without post-processing.
> + */
> +TRACE_EVENT(tlob_budget_exceeded,
> +
> + TP_PROTO(struct task_struct *task, u64 threshold_us,
> + u64 on_cpu_us, u64 off_cpu_us, u32 switches,
> + bool state_is_on_cpu, u64 tag),
> +
> + TP_ARGS(task, threshold_us, on_cpu_us, off_cpu_us, switches,
> + state_is_on_cpu, tag),
> +
> + TP_STRUCT__entry(
> + __string(comm, task->comm)
> + __field(pid_t, pid)
> + __field(u64, threshold_us)
> + __field(u64, on_cpu_us)
> + __field(u64, off_cpu_us)
> + __field(u32, switches)
> + __field(bool, state_is_on_cpu)
> + __field(u64, tag)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(comm);
> + __entry->pid = task->pid;
> + __entry->threshold_us = threshold_us;
> + __entry->on_cpu_us = on_cpu_us;
> + __entry->off_cpu_us = off_cpu_us;
> + __entry->switches = switches;
> + __entry->state_is_on_cpu = state_is_on_cpu;
> + __entry->tag = tag;
> + ),
> +
> + TP_printk("%s[%d]: budget exceeded threshold=%llu on_cpu=%llu off_cpu=%llu switches=%u state=%s tag=0x%016llx",
> + __get_str(comm), __entry->pid,
> + __entry->threshold_us,
> + __entry->on_cpu_us, __entry->off_cpu_us,
> + __entry->switches,
> + __entry->state_is_on_cpu ? "on_cpu" : "off_cpu",
> + __entry->tag)
> +);
> +#endif /* CONFIG_RV_MON_TLOB */
> +
> #endif /* _TRACE_RV_H */
>
> /* This part must be outside protection */
* [RFC PATCH 0/2] Decouple ftrace/livepatch from module loader via notifier priority and reverse traversal
From: chensong_2000 @ 2026-04-13 8:01 UTC (permalink / raw)
To: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
mark.rutland, mathieu.desnoyers
Cc: linux-modules, linux-kernel, linux-trace-kernel, linux-acpi,
linux-clk, linux-pm, live-patching, dm-devel, linux-raid,
kgdb-bugreport, netdev, Song Chen
From: Song Chen <chensong_2000@189.cn>
This patchset addresses a long-standing tight coupling between the
module loader and two of its key consumers: ftrace and livepatch.
Background:
The module loader currently hard-codes direct calls to
ftrace_module_enable(), klp_module_coming(), klp_module_going() and
ftrace_release_mod() inside prepare_coming_module() and the module
unload path. This hard-coding was necessary because the module notifier
chain could not guarantee the strict call ordering that ftrace and
livepatch require:
During MODULE_STATE_COMING, ftrace must run before livepatch, so
that per-module function records are ready before livepatch registers
its ftrace hooks.
During MODULE_STATE_GOING, livepatch must run before ftrace, so that
livepatch removes its hooks before ftrace releases those records.
This symmetric setup/teardown ordering could not be expressed through
the notifier chain because the chain only supported forward (descending
priority) traversal. Without reverse traversal, it was impossible to
guarantee that the GOING order would be the strict inverse of the
COMING order using a single priority value per notifier.
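To illustrate the ordering argument above, here is a minimal userspace sketch, not the kernel's notifier API. The names struct notifier, struct chain, chain_register() and chain_call(), as well as the priority values, are hypothetical and chosen only for illustration. It models a chain kept sorted by descending priority on a doubly linked list, so that a single priority per notifier yields one order when walked forward (COMING) and the exact inverse when walked backward (GOING).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model; this is not the kernel's notifier_block. */
struct notifier {
	const char *name;
	int priority;			/* higher priority runs first going forward */
	struct notifier *prev, *next;
};

struct chain {
	struct notifier *head, *tail;
};

/* Insert n so the list stays sorted by descending priority. */
static void chain_register(struct chain *c, struct notifier *n)
{
	struct notifier *pos;

	assert(c && n);

	/* Find the first entry with a strictly lower priority than n. */
	pos = c->head;
	while (pos && pos->priority >= n->priority)
		pos = pos->next;

	/* Link n in front of pos, or at the tail when pos is NULL. */
	n->next = pos;
	n->prev = pos ? pos->prev : c->tail;
	if (n->prev)
		n->prev->next = n;
	else
		c->head = n;
	if (pos)
		pos->prev = n;
	else
		c->tail = n;
}

/*
 * Walk the chain and write the visited names into out, separated by
 * spaces. reverse == 0 walks head to tail (the COMING direction in
 * this model); reverse != 0 walks tail to head (the GOING direction).
 */
static void chain_call(const struct chain *c, int reverse,
		       char *out, size_t len)
{
	const struct notifier *n = reverse ? c->tail : c->head;

	assert(c && out && len > 0);

	out[0] = '\0';
	while (n) {
		if (out[0])
			strncat(out, " ", len - strlen(out) - 1);
		strncat(out, n->name, len - strlen(out) - 1);
		n = reverse ? n->prev : n->next;
	}
}
```

In this model, registering an "ftrace" entry with a higher priority than a "livepatch" entry makes the forward walk visit ftrace first and the reverse walk visit livepatch first, regardless of registration order. With a singly linked list only the forward walk is cheap, which is the limitation described above.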
Patch 1 - notifier: replace single-linked list with double-linked list.
Patch 2 - ftrace/klp: decouple from module loader using notifier
priority.
Heads-up: my mailbox's SMTP has been unreliable lately. If a message
bounces, I will have to resend it. Sorry in advance.
Song Chen (2):
kernel/notifier: replace single-linked list with double-linked list
for reverse traversal
kernel/module: Decouple klp and ftrace from load_module
drivers/acpi/sleep.c | 1 -
drivers/clk/clk.c | 2 +-
drivers/cpufreq/cpufreq.c | 2 +-
drivers/md/dm-integrity.c | 1 -
drivers/md/md.c | 1 -
include/linux/module.h | 8 ++
include/linux/notifier.h | 26 ++---
kernel/debug/debug_core.c | 1 -
kernel/livepatch/core.c | 29 ++++-
kernel/module/main.c | 34 +++---
kernel/notifier.c | 219 ++++++++++++++++++++++++++++++++------
kernel/trace/ftrace.c | 38 +++++++
net/ipv4/nexthop.c | 2 +-
13 files changed, 290 insertions(+), 74 deletions(-)
--
2.43.0
* Re: [PATCH mm-unstable v15 06/13] mm/khugepaged: skip collapsing mTHP to smaller orders
From: David Hildenbrand (Arm) @ 2026-04-13 7:37 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <CAA1CXcDnz_7+16sDVbGJ2ZZPWxs7ta_Z0YU6x1dUe7yiSJ3OKg@mail.gmail.com>
On 4/13/26 03:38, Nico Pache wrote:
> On Thu, Mar 12, 2026 at 3:00 PM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
>>
>> On 2/26/26 04:24, Nico Pache wrote:
>>> khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
>>> some pages being unmapped. Skip these cases until we have a way to check
>>> if its ok to collapse to a smaller mTHP size (like in the case of a
>>> partially mapped folio).
>>>
>>> This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].
>>>
>>> [1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/
>>>
>>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Co-developed-by: Dev Jain <dev.jain@arm.com>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>> mm/khugepaged.c | 8 ++++++++
>>> 1 file changed, 8 insertions(+)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index fb3ba8fe5a6c..c739f26dd61e 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -638,6 +638,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>> goto out;
>>> }
>>> }
>>> + /*
>>> + * TODO: In some cases of partially-mapped folios, we'd actually
>>> + * want to collapse.
>>> + */
>>> + if (!is_pmd_order(order) && folio_order(folio) >= order) {
>>> + result = SCAN_PTE_MAPPED_HUGEPAGE;
>>> + goto out;
>>> + }
>>>
>>> if (folio_test_large(folio)) {
>>> struct folio *f;
>>
>> Why aren't we doing the same in hpage_collapse_scan_pmd() ?
>
> We can't do this in the scan phase because we are not yet aware of the
> order we want to collapse to.
Yes, I realized that myself later. It's confusing, so please try
documenting that in the patch description.
--
Cheers,
David