Linux Trace Kernel

* [PATCH v8 06/12] rv: Add sample hybrid monitor stall
From: Gabriele Monaco @ 2026-03-30 11:10 UTC
  To: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
	Jonathan Corbet, Gabriele Monaco, Masami Hiramatsu, linux-doc,
	linux-trace-kernel
  Cc: Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <20260330111010.153663-1-gmonaco@redhat.com>

Add a sample monitor to showcase hybrid/timed automata.
The stall monitor identifies tasks stalled for longer than a threshold
and reacts when that happens.

Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
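Note for readers: the automaton described above can be approximated in
user space. The following is a minimal sketch, not the monitor's
implementation. The state and event names and the transition table follow
stall.h from this patch; every sim_* identifier is hypothetical. It also
assumes the clock is checked only when the next event arrives, whereas the
monitor in this patch arms a timer (ha_start_timer_jiffy) so that a stall
can be reported without waiting for another event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* State and event order mirrors stall.h; all sim_* names are hypothetical. */
enum sim_state { SIM_DEQUEUED, SIM_ENQUEUED, SIM_RUNNING, SIM_STATE_MAX };
enum sim_event {
	SIM_SWITCH_IN, SIM_SWITCH_PREEMPT, SIM_SWITCH_WAIT, SIM_WAKEUP,
	SIM_EVENT_MAX
};

#define SIM_INVALID SIM_STATE_MAX

/* Same transition table as automaton_stall.function in the patch. */
static const unsigned char sim_function[SIM_STATE_MAX][SIM_EVENT_MAX] = {
	/* dequeued */ { SIM_INVALID, SIM_INVALID,  SIM_INVALID,  SIM_ENQUEUED },
	/* enqueued */ { SIM_RUNNING, SIM_INVALID,  SIM_INVALID,  SIM_ENQUEUED },
	/* running  */ { SIM_RUNNING, SIM_ENQUEUED, SIM_DEQUEUED, SIM_RUNNING  },
};

struct sim_monitor {
	enum sim_state state;
	uint64_t clk_reset;	/* time of the last reset(clk) */
	uint64_t threshold;	/* stands in for threshold_jiffies */
};

static void sim_init(struct sim_monitor *m, uint64_t threshold)
{
	assert(m);
	m->state = SIM_DEQUEUED;
	m->clk_reset = 0;
	m->threshold = threshold;
}

/* Returns true if the event is accepted, false on a violation. */
static bool sim_handle_event(struct sim_monitor *m, enum sim_event ev,
			     uint64_t now)
{
	enum sim_state next = sim_function[m->state][ev];

	if (next == SIM_INVALID)
		return false;

	/* Invariant of "enqueued": clk < threshold. */
	if (m->state == SIM_ENQUEUED && now - m->clk_reset >= m->threshold)
		return false;

	/* reset(clk) on the two edges entering "enqueued" from elsewhere. */
	if ((m->state == SIM_DEQUEUED && ev == SIM_WAKEUP) ||
	    (m->state == SIM_RUNNING && ev == SIM_SWITCH_PREEMPT))
		m->clk_reset = now;

	m->state = next;
	return true;
}
```

With a threshold of 1000, a task woken at time 10 and switched in at time
500 is accepted, while a task preempted at time 600 and switched in at time
1600 is rejected because it spent a full threshold in the enqueued state.
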
 Documentation/tools/rv/index.rst             |   1 +
 Documentation/tools/rv/rv-mon-stall.rst      |  44 ++++++
 Documentation/trace/rv/index.rst             |   1 +
 Documentation/trace/rv/monitor_stall.rst     |  43 ++++++
 kernel/trace/rv/Kconfig                      |   1 +
 kernel/trace/rv/Makefile                     |   1 +
 kernel/trace/rv/monitors/stall/Kconfig       |  13 ++
 kernel/trace/rv/monitors/stall/stall.c       | 150 +++++++++++++++++++
 kernel/trace/rv/monitors/stall/stall.h       |  81 ++++++++++
 kernel/trace/rv/monitors/stall/stall_trace.h |  19 +++
 kernel/trace/rv/rv_trace.h                   |   1 +
 tools/verification/models/stall.dot          |  22 +++
 12 files changed, 377 insertions(+)
 create mode 100644 Documentation/tools/rv/rv-mon-stall.rst
 create mode 100644 Documentation/trace/rv/monitor_stall.rst
 create mode 100644 kernel/trace/rv/monitors/stall/Kconfig
 create mode 100644 kernel/trace/rv/monitors/stall/stall.c
 create mode 100644 kernel/trace/rv/monitors/stall/stall.h
 create mode 100644 kernel/trace/rv/monitors/stall/stall_trace.h
 create mode 100644 tools/verification/models/stall.dot

diff --git a/Documentation/tools/rv/index.rst b/Documentation/tools/rv/index.rst
index fd42b0017d07..2aaa01c9fe48 100644
--- a/Documentation/tools/rv/index.rst
+++ b/Documentation/tools/rv/index.rst
@@ -16,3 +16,4 @@ Runtime verification (rv) tool
    rv-mon-wip
    rv-mon-wwnr
    rv-mon-sched
+   rv-mon-stall
diff --git a/Documentation/tools/rv/rv-mon-stall.rst b/Documentation/tools/rv/rv-mon-stall.rst
new file mode 100644
index 000000000000..c79d7c2e4dd4
--- /dev/null
+++ b/Documentation/tools/rv/rv-mon-stall.rst
@@ -0,0 +1,44 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+rv-mon-stall
+============
+--------------------
+Stalled task monitor
+--------------------
+
+:Manual section: 1
+
+SYNOPSIS
+========
+
+**rv mon stall** [*OPTIONS*]
+
+DESCRIPTION
+===========
+
+The stalled task (**stall**) monitor is a sample per-task timed monitor that
+checks if tasks are scheduled within a defined threshold after they are ready.
+
+See kernel documentation for further information about this monitor:
+<https://docs.kernel.org/trace/rv/monitor_stall.html>
+
+OPTIONS
+=======
+
+.. include:: common_ikm.rst
+
+SEE ALSO
+========
+
+**rv**\(1), **rv-mon**\(1)
+
+Linux kernel *RV* documentation:
+<https://www.kernel.org/doc/html/latest/trace/rv/index.html>
+
+AUTHOR
+======
+
+Written by Gabriele Monaco <gmonaco@redhat.com>
+
+.. include:: common_appendix.rst
diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
index ad298784bda2..bf9962f49959 100644
--- a/Documentation/trace/rv/index.rst
+++ b/Documentation/trace/rv/index.rst
@@ -16,3 +16,4 @@ Runtime Verification
    monitor_wwnr.rst
    monitor_sched.rst
    monitor_rtapp.rst
+   monitor_stall.rst
diff --git a/Documentation/trace/rv/monitor_stall.rst b/Documentation/trace/rv/monitor_stall.rst
new file mode 100644
index 000000000000..d29e820b2433
--- /dev/null
+++ b/Documentation/trace/rv/monitor_stall.rst
@@ -0,0 +1,43 @@
+Monitor stall
+=============
+
+- Name: stall - stalled task monitor
+- Type: per-task hybrid automaton
+- Author: Gabriele Monaco <gmonaco@redhat.com>
+
+Description
+-----------
+
+The stalled task (stall) monitor is a sample per-task timed monitor that checks
+if tasks are scheduled within a defined threshold after they are ready::
+
+                        |
+                        |
+                        v
+                      #==========================#
+  +-----------------> H         dequeued         H
+  |                   #==========================#
+  |                     |
+ sched_switch_wait      | sched_wakeup;reset(clk)
+  |                     v
+  |                   +--------------------------+ <+
+  |                   |         enqueued         |  | sched_wakeup
+  |                   | clk < threshold_jiffies  | -+
+  |                   +--------------------------+
+  |                     |                 ^
+  |              sched_switch_in    sched_switch_preempt;reset(clk)
+  |                     v                 |
+  |                   +--------------------------+
+  +------------------ |         running          |
+                      +--------------------------+
+                        ^ sched_switch_in      |
+                        | sched_wakeup         |
+                        +----------------------+
+
+The threshold can be configured as a parameter by either booting with the
+``stall.threshold_jiffies=<new value>`` argument or writing a new value to
+``/sys/module/stall/parameters/threshold_jiffies``.
+
+Specification
+-------------
+Graphviz Dot file in tools/verification/models/stall.dot
diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
index 4ad392dfc57f..720fbe4935f8 100644
--- a/kernel/trace/rv/Kconfig
+++ b/kernel/trace/rv/Kconfig
@@ -78,6 +78,7 @@ source "kernel/trace/rv/monitors/pagefault/Kconfig"
 source "kernel/trace/rv/monitors/sleep/Kconfig"
 # Add new rtapp monitors here
 
+source "kernel/trace/rv/monitors/stall/Kconfig"
 # Add new monitors here
 
 config RV_REACTORS
diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index 750e4ad6fa0f..51c95e2d2da6 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_RV_MON_STS) += monitors/sts/sts.o
 obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
 obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
 obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
+obj-$(CONFIG_RV_MON_STALL) += monitors/stall/stall.o
 # Add new monitors here
 obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
 obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
diff --git a/kernel/trace/rv/monitors/stall/Kconfig b/kernel/trace/rv/monitors/stall/Kconfig
new file mode 100644
index 000000000000..6f846b642544
--- /dev/null
+++ b/kernel/trace/rv/monitors/stall/Kconfig
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+config RV_MON_STALL
+	depends on RV
+	select HA_MON_EVENTS_ID
+	bool "stall monitor"
+	help
+	  Enable the stall sample monitor that illustrates the usage of hybrid
+	  automata monitors. It can be used to identify tasks stalled for
+	  longer than a threshold.
+
+	  For further information, see:
+	    Documentation/trace/rv/monitor_stall.rst
diff --git a/kernel/trace/rv/monitors/stall/stall.c b/kernel/trace/rv/monitors/stall/stall.c
new file mode 100644
index 000000000000..9ccfda6b0e73
--- /dev/null
+++ b/kernel/trace/rv/monitors/stall/stall.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/ftrace.h>
+#include <linux/tracepoint.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/rv.h>
+#include <rv/instrumentation.h>
+
+#define MODULE_NAME "stall"
+
+#include <trace/events/sched.h>
+#include <rv_trace.h>
+
+#define RV_MON_TYPE RV_MON_PER_TASK
+#define HA_TIMER_TYPE HA_TIMER_WHEEL
+#include "stall.h"
+#include <rv/ha_monitor.h>
+
+static u64 threshold_jiffies = 1000;
+module_param(threshold_jiffies, ullong, 0644);
+
+static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_stall env, u64 time_ns)
+{
+	if (env == clk_stall)
+		return ha_get_clk_jiffy(ha_mon, env);
+	return ENV_INVALID_VALUE;
+}
+
+static void ha_reset_env(struct ha_monitor *ha_mon, enum envs_stall env, u64 time_ns)
+{
+	if (env == clk_stall)
+		ha_reset_clk_jiffy(ha_mon, env);
+}
+
+static inline bool ha_verify_invariants(struct ha_monitor *ha_mon,
+					enum states curr_state, enum events event,
+					enum states next_state, u64 time_ns)
+{
+	if (curr_state == enqueued_stall)
+		return ha_check_invariant_jiffy(ha_mon, clk_stall, time_ns);
+	return true;
+}
+
+static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
+				    enum states curr_state, enum events event,
+				    enum states next_state, u64 time_ns)
+{
+	bool res = true;
+
+	if (curr_state == dequeued_stall && event == sched_wakeup_stall)
+		ha_reset_env(ha_mon, clk_stall, time_ns);
+	else if (curr_state == running_stall && event == sched_switch_preempt_stall)
+		ha_reset_env(ha_mon, clk_stall, time_ns);
+	return res;
+}
+
+static inline void ha_setup_invariants(struct ha_monitor *ha_mon,
+				       enum states curr_state, enum events event,
+				       enum states next_state, u64 time_ns)
+{
+	if (next_state == curr_state)
+		return;
+	if (next_state == enqueued_stall)
+		ha_start_timer_jiffy(ha_mon, clk_stall, threshold_jiffies, time_ns);
+	else if (curr_state == enqueued_stall)
+		ha_cancel_timer(ha_mon);
+}
+
+static bool ha_verify_constraint(struct ha_monitor *ha_mon,
+				 enum states curr_state, enum events event,
+				 enum states next_state, u64 time_ns)
+{
+	if (!ha_verify_invariants(ha_mon, curr_state, event, next_state, time_ns))
+		return false;
+
+	if (!ha_verify_guards(ha_mon, curr_state, event, next_state, time_ns))
+		return false;
+
+	ha_setup_invariants(ha_mon, curr_state, event, next_state, time_ns);
+
+	return true;
+}
+
+static void handle_sched_switch(void *data, bool preempt,
+				struct task_struct *prev,
+				struct task_struct *next,
+				unsigned int prev_state)
+{
+	if (!preempt && prev_state != TASK_RUNNING)
+		da_handle_start_event(prev, sched_switch_wait_stall);
+	else
+		da_handle_event(prev, sched_switch_preempt_stall);
+	da_handle_event(next, sched_switch_in_stall);
+}
+
+static void handle_sched_wakeup(void *data, struct task_struct *p)
+{
+	da_handle_event(p, sched_wakeup_stall);
+}
+
+static int enable_stall(void)
+{
+	int retval;
+
+	retval = da_monitor_init();
+	if (retval)
+		return retval;
+
+	rv_attach_trace_probe("stall", sched_switch, handle_sched_switch);
+	rv_attach_trace_probe("stall", sched_wakeup, handle_sched_wakeup);
+
+	return 0;
+}
+
+static void disable_stall(void)
+{
+	rv_this.enabled = 0;
+
+	rv_detach_trace_probe("stall", sched_switch, handle_sched_switch);
+	rv_detach_trace_probe("stall", sched_wakeup, handle_sched_wakeup);
+
+	da_monitor_destroy();
+}
+
+static struct rv_monitor rv_this = {
+	.name = "stall",
+	.description = "identify tasks stalled for longer than a threshold.",
+	.enable = enable_stall,
+	.disable = disable_stall,
+	.reset = da_monitor_reset_all,
+	.enabled = 0,
+};
+
+static int __init register_stall(void)
+{
+	return rv_register_monitor(&rv_this, NULL);
+}
+
+static void __exit unregister_stall(void)
+{
+	rv_unregister_monitor(&rv_this);
+}
+
+module_init(register_stall);
+module_exit(unregister_stall);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Gabriele Monaco <gmonaco@redhat.com>");
+MODULE_DESCRIPTION("stall: identify tasks stalled for longer than a threshold.");
diff --git a/kernel/trace/rv/monitors/stall/stall.h b/kernel/trace/rv/monitors/stall/stall.h
new file mode 100644
index 000000000000..638520cb1082
--- /dev/null
+++ b/kernel/trace/rv/monitors/stall/stall.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Automatically generated C representation of stall automaton
+ * For further information about this format, see kernel documentation:
+ *   Documentation/trace/rv/deterministic_automata.rst
+ */
+
+#define MONITOR_NAME stall
+
+enum states_stall {
+	dequeued_stall,
+	enqueued_stall,
+	running_stall,
+	state_max_stall,
+};
+
+#define INVALID_STATE state_max_stall
+
+enum events_stall {
+	sched_switch_in_stall,
+	sched_switch_preempt_stall,
+	sched_switch_wait_stall,
+	sched_wakeup_stall,
+	event_max_stall,
+};
+
+enum envs_stall {
+	clk_stall,
+	env_max_stall,
+	env_max_stored_stall = env_max_stall,
+};
+
+_Static_assert(env_max_stored_stall <= MAX_HA_ENV_LEN, "Not enough slots");
+
+struct automaton_stall {
+	char *state_names[state_max_stall];
+	char *event_names[event_max_stall];
+	char *env_names[env_max_stall];
+	unsigned char function[state_max_stall][event_max_stall];
+	unsigned char initial_state;
+	bool final_states[state_max_stall];
+};
+
+static const struct automaton_stall automaton_stall = {
+	.state_names = {
+		"dequeued",
+		"enqueued",
+		"running",
+	},
+	.event_names = {
+		"sched_switch_in",
+		"sched_switch_preempt",
+		"sched_switch_wait",
+		"sched_wakeup",
+	},
+	.env_names = {
+		"clk",
+	},
+	.function = {
+		{
+			INVALID_STATE,
+			INVALID_STATE,
+			INVALID_STATE,
+			enqueued_stall,
+		},
+		{
+			running_stall,
+			INVALID_STATE,
+			INVALID_STATE,
+			enqueued_stall,
+		},
+		{
+			running_stall,
+			enqueued_stall,
+			dequeued_stall,
+			running_stall,
+		},
+	},
+	.initial_state = dequeued_stall,
+	.final_states = { 1, 0, 0 },
+};
diff --git a/kernel/trace/rv/monitors/stall/stall_trace.h b/kernel/trace/rv/monitors/stall/stall_trace.h
new file mode 100644
index 000000000000..6a7cc1b1d040
--- /dev/null
+++ b/kernel/trace/rv/monitors/stall/stall_trace.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Snippet to be included in rv_trace.h
+ */
+
+#ifdef CONFIG_RV_MON_STALL
+DEFINE_EVENT(event_da_monitor_id, event_stall,
+	     TP_PROTO(int id, char *state, char *event, char *next_state, bool final_state),
+	     TP_ARGS(id, state, event, next_state, final_state));
+
+DEFINE_EVENT(error_da_monitor_id, error_stall,
+	     TP_PROTO(int id, char *state, char *event),
+	     TP_ARGS(id, state, event));
+
+DEFINE_EVENT(error_env_da_monitor_id, error_env_stall,
+	     TP_PROTO(int id, char *state, char *event, char *env),
+	     TP_ARGS(id, state, event, env));
+#endif /* CONFIG_RV_MON_STALL */
diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
index 7c598967bc0e..1661f8fe4a88 100644
--- a/kernel/trace/rv/rv_trace.h
+++ b/kernel/trace/rv/rv_trace.h
@@ -187,6 +187,7 @@ DECLARE_EVENT_CLASS(error_env_da_monitor_id,
 		__get_str(env))
 );
 
+#include <monitors/stall/stall_trace.h>
 // Add new monitors based on CONFIG_HA_MON_EVENTS_ID here
 
 #endif
diff --git a/tools/verification/models/stall.dot b/tools/verification/models/stall.dot
new file mode 100644
index 000000000000..50077d1dff74
--- /dev/null
+++ b/tools/verification/models/stall.dot
@@ -0,0 +1,22 @@
+digraph state_automaton {
+	center = true;
+	size = "7,11";
+	{node [shape = circle] "enqueued"};
+	{node [shape = plaintext, style=invis, label=""] "__init_dequeued"};
+	{node [shape = doublecircle] "dequeued"};
+	{node [shape = circle] "running"};
+	"__init_dequeued" -> "dequeued";
+	"enqueued" [label = "enqueued\nclk < threshold_jiffies"];
+	"running" [label = "running"];
+	"dequeued" [label = "dequeued", color = green3];
+	"running" -> "running" [ label = "sched_switch_in\nsched_wakeup" ];
+	"enqueued" -> "enqueued" [ label = "sched_wakeup" ];
+	"enqueued" -> "running" [ label = "sched_switch_in" ];
+	"running" -> "dequeued" [ label = "sched_switch_wait" ];
+	"dequeued" -> "enqueued" [ label = "sched_wakeup;reset(clk)" ];
+	"running" -> "enqueued" [ label = "sched_switch_preempt;reset(clk)" ];
+	{ rank = min ;
+		"__init_dequeued";
+		"dequeued";
+	}
+}
-- 
2.53.0


* [PATCH v8 07/12] rv: Convert the opid monitor to a hybrid automaton
From: Gabriele Monaco @ 2026-03-30 11:10 UTC
  To: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
	Gabriele Monaco, Jonathan Corbet, Masami Hiramatsu,
	linux-trace-kernel, linux-doc
  Cc: Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <20260330111010.153663-1-gmonaco@redhat.com>

The opid monitor validates that wakeup and need_resched events only
occur with interrupts and preemption disabled by following the
preemptirq tracepoints.
As reported in [1], those tracepoints might be inaccurate in some
situations (e.g. NMIs).

Since the monitor doesn't validate other ordering properties, remove the
dependency on preemptirq tracepoints and convert the monitor to a hybrid
automaton to validate the constraint during event handling.
This makes the monitor more robust: it removes the workaround for
interrupts missing the preemption tracepoints, which worked on
PREEMPT_RT only, and it allows the monitor to be built on kernels
without the preemptirq tracepoints.

[1] - https://lore.kernel.org/lkml/20250625120823.60600-1-gmonaco@redhat.com

Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
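Note for readers: the constraint checked during event handling can be
illustrated outside the kernel. The following is a minimal sketch under
stated assumptions, not the monitor's code. All sim_* identifiers are
hypothetical, the CPU context is passed in explicitly instead of being read
through irqs_disabled() and preempt_count(), and SIM_PREEMPT_MASK is assumed
to cover the low 8 bits as the kernel's PREEMPT_MASK does.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: same bit layout as the kernel's PREEMPT_MASK. */
#define SIM_PREEMPT_MASK 0x000000ffu

enum sim_opid_event { SIM_NEED_RESCHED, SIM_WAKING };

/* Hypothetical snapshot of what the tracepoint handler would observe. */
struct sim_cpu_ctx {
	bool irqs_disabled;
	unsigned int preempt_count;	/* value seen inside the tracepoint */
	bool config_preemption;		/* stands in for CONFIG_PREEMPTION */
};

/*
 * With preemption support, the tracepoint itself adds one to the
 * preempt count, so a count of 1 still means "preemption was enabled"
 * when the event was hit.
 */
static bool sim_preempt_off(const struct sim_cpu_ctx *c)
{
	assert(c);
	if (c->config_preemption)
		return (c->preempt_count & SIM_PREEMPT_MASK) > 1;
	return true;
}

/* Guards on the single "any" state: true means the event is allowed. */
static bool sim_opid_guard(const struct sim_cpu_ctx *c,
			   enum sim_opid_event ev)
{
	assert(c);
	switch (ev) {
	case SIM_NEED_RESCHED:
		return c->irqs_disabled;
	case SIM_WAKING:
		return c->irqs_disabled && sim_preempt_off(c);
	}
	return false;
}
```

In this sketch a wakeup observed with interrupts disabled and a preempt
count of 1 is rejected, since that count is attributed to the tracepoint
itself, while need_resched in the same context is accepted because it only
requires interrupts to be disabled.
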
 Documentation/trace/rv/monitor_sched.rst   |  62 +++---------
 kernel/trace/rv/monitors/opid/Kconfig      |  11 +-
 kernel/trace/rv/monitors/opid/opid.c       | 111 +++++++--------------
 kernel/trace/rv/monitors/opid/opid.h       |  86 ++++------------
 kernel/trace/rv/monitors/opid/opid_trace.h |   4 +
 kernel/trace/rv/rv_trace.h                 |   2 +-
 tools/verification/models/sched/opid.dot   |  36 ++-----
 7 files changed, 82 insertions(+), 230 deletions(-)

diff --git a/Documentation/trace/rv/monitor_sched.rst b/Documentation/trace/rv/monitor_sched.rst
index 3f8381ad9ec7..0b96d6e147c6 100644
--- a/Documentation/trace/rv/monitor_sched.rst
+++ b/Documentation/trace/rv/monitor_sched.rst
@@ -346,55 +346,21 @@ Monitor opid
 
 The operations with preemption and irq disabled (opid) monitor ensures
 operations like ``wakeup`` and ``need_resched`` occur with interrupts and
-preemption disabled or during interrupt context, in such case preemption may
-not be disabled explicitly.
+preemption disabled.
 ``need_resched`` can be set by some RCU internals functions, in which case it
-doesn't match a task wakeup and might occur with only interrupts disabled::
-
-                 |                     sched_need_resched
-                 |                     sched_waking
-                 |                     irq_entry
-                 |                   +--------------------+
-                 v                   v                    |
-               +------------------------------------------------------+
-  +----------- |                     disabled                         | <+
-  |            +------------------------------------------------------+  |
-  |              |                 ^                                     |
-  |              |          preempt_disable      sched_need_resched      |
-  |       preempt_enable           |           +--------------------+    |
-  |              v                 |           v                    |    |
-  |            +------------------------------------------------------+  |
-  |            |                   irq_disabled                       |  |
-  |            +------------------------------------------------------+  |
-  |                              |             |        ^                |
-  |     irq_entry            irq_entry         |        |                |
-  |     sched_need_resched       v             |   irq_disable           |
-  |     sched_waking +--------------+          |        |                |
-  |           +----- |              |     irq_enable    |                |
-  |           |      |    in_irq    |          |        |                |
-  |           +----> |              |          |        |                |
-  |                  +--------------+          |        |          irq_disable
-  |                     |                      |        |                |
-  | irq_enable          | irq_enable           |        |                |
-  |                     v                      v        |                |
-  |            #======================================================#  |
-  |            H                     enabled                          H  |
-  |            #======================================================#  |
-  |              |                   ^         ^ preempt_enable     |    |
-  |       preempt_disable     preempt_enable   +--------------------+    |
-  |              v                   |                                   |
-  |            +------------------+  |                                   |
-  +----------> | preempt_disabled | -+                                   |
-               +------------------+                                      |
-                 |                                                       |
-                 +-------------------------------------------------------+
-
-This monitor is designed to work on ``PREEMPT_RT`` kernels, the special case of
-events occurring in interrupt context is a shortcut to identify valid scenarios
-where the preemption tracepoints might not be visible, during interrupts
-preemption is always disabled. On non- ``PREEMPT_RT`` kernels, the interrupts
-might invoke a softirq to set ``need_resched`` and wake up a task. This is
-another special case that is currently not supported by the monitor.
+doesn't match a task wakeup and might occur with only interrupts disabled.
+The interrupt and preemption status are validated by the hybrid automaton
+constraints when processing the events::
+
+   |
+   |
+   v
+ #=========#   sched_need_resched;irq_off == 1
+ H         H   sched_waking;irq_off == 1 && preempt_off == 1
+ H   any   H ------------------------------------------------+
+ H         H                                                 |
+ H         H <-----------------------------------------------+
+ #=========#
 
 References
 ----------
diff --git a/kernel/trace/rv/monitors/opid/Kconfig b/kernel/trace/rv/monitors/opid/Kconfig
index 561d32da572b..6d02e239b684 100644
--- a/kernel/trace/rv/monitors/opid/Kconfig
+++ b/kernel/trace/rv/monitors/opid/Kconfig
@@ -2,18 +2,13 @@
 #
 config RV_MON_OPID
 	depends on RV
-	depends on TRACE_IRQFLAGS
-	depends on TRACE_PREEMPT_TOGGLE
 	depends on RV_MON_SCHED
-	default y if PREEMPT_RT
-	select DA_MON_EVENTS_IMPLICIT
+	default y
+	select HA_MON_EVENTS_IMPLICIT
 	bool "opid monitor"
 	help
 	  Monitor to ensure operations like wakeup and need resched occur with
-	  interrupts and preemption disabled or during IRQs, where preemption
-	  may not be disabled explicitly.
-
-	  This monitor is unstable on !PREEMPT_RT, say N unless you are testing it.
+	  interrupts and preemption disabled.
 
 	  For further information, see:
 	    Documentation/trace/rv/monitor_sched.rst
diff --git a/kernel/trace/rv/monitors/opid/opid.c b/kernel/trace/rv/monitors/opid/opid.c
index 25a40e90fa40..4594c7c46601 100644
--- a/kernel/trace/rv/monitors/opid/opid.c
+++ b/kernel/trace/rv/monitors/opid/opid.c
@@ -10,94 +10,63 @@
 #define MODULE_NAME "opid"
 
 #include <trace/events/sched.h>
-#include <trace/events/irq.h>
-#include <trace/events/preemptirq.h>
 #include <rv_trace.h>
 #include <monitors/sched/sched.h>
 
 #define RV_MON_TYPE RV_MON_PER_CPU
 #include "opid.h"
-#include <rv/da_monitor.h>
+#include <rv/ha_monitor.h>
 
-#ifdef CONFIG_X86_LOCAL_APIC
-#include <asm/trace/irq_vectors.h>
-
-static void handle_vector_irq_entry(void *data, int vector)
+static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_opid env, u64 time_ns)
 {
-	da_handle_event(irq_entry_opid);
-}
-
-static void attach_vector_irq(void)
-{
-	rv_attach_trace_probe("opid", local_timer_entry, handle_vector_irq_entry);
-	if (IS_ENABLED(CONFIG_IRQ_WORK))
-		rv_attach_trace_probe("opid", irq_work_entry, handle_vector_irq_entry);
-	if (IS_ENABLED(CONFIG_SMP)) {
-		rv_attach_trace_probe("opid", reschedule_entry, handle_vector_irq_entry);
-		rv_attach_trace_probe("opid", call_function_entry, handle_vector_irq_entry);
-		rv_attach_trace_probe("opid", call_function_single_entry, handle_vector_irq_entry);
+	if (env == irq_off_opid)
+		return irqs_disabled();
+	else if (env == preempt_off_opid) {
+		/*
+		 * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
+		 * preemption (adding one to the preempt_count). Since we are
+		 * interested in the preempt_count at the time the tracepoint was
+		 * hit, we consider 1 as still enabled.
+		 */
+		if (IS_ENABLED(CONFIG_PREEMPTION))
+			return (preempt_count() & PREEMPT_MASK) > 1;
+		return true;
 	}
+	return ENV_INVALID_VALUE;
 }
 
-static void detach_vector_irq(void)
+static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
+				    enum states curr_state, enum events event,
+				    enum states next_state, u64 time_ns)
 {
-	rv_detach_trace_probe("opid", local_timer_entry, handle_vector_irq_entry);
-	if (IS_ENABLED(CONFIG_IRQ_WORK))
-		rv_detach_trace_probe("opid", irq_work_entry, handle_vector_irq_entry);
-	if (IS_ENABLED(CONFIG_SMP)) {
-		rv_detach_trace_probe("opid", reschedule_entry, handle_vector_irq_entry);
-		rv_detach_trace_probe("opid", call_function_entry, handle_vector_irq_entry);
-		rv_detach_trace_probe("opid", call_function_single_entry, handle_vector_irq_entry);
-	}
+	bool res = true;
+
+	if (curr_state == any_opid && event == sched_need_resched_opid)
+		res = ha_get_env(ha_mon, irq_off_opid, time_ns) == 1ull;
+	else if (curr_state == any_opid && event == sched_waking_opid)
+		res = ha_get_env(ha_mon, irq_off_opid, time_ns) == 1ull &&
+		      ha_get_env(ha_mon, preempt_off_opid, time_ns) == 1ull;
+	return res;
 }
 
-#else
-/* We assume irq_entry tracepoints are sufficient on other architectures */
-static void attach_vector_irq(void) { }
-static void detach_vector_irq(void) { }
-#endif
-
-static void handle_irq_disable(void *data, unsigned long ip, unsigned long parent_ip)
+static bool ha_verify_constraint(struct ha_monitor *ha_mon,
+				 enum states curr_state, enum events event,
+				 enum states next_state, u64 time_ns)
 {
-	da_handle_event(irq_disable_opid);
-}
+	if (!ha_verify_guards(ha_mon, curr_state, event, next_state, time_ns))
+		return false;
 
-static void handle_irq_enable(void *data, unsigned long ip, unsigned long parent_ip)
-{
-	da_handle_event(irq_enable_opid);
-}
-
-static void handle_irq_entry(void *data, int irq, struct irqaction *action)
-{
-	da_handle_event(irq_entry_opid);
-}
-
-static void handle_preempt_disable(void *data, unsigned long ip, unsigned long parent_ip)
-{
-	da_handle_event(preempt_disable_opid);
-}
-
-static void handle_preempt_enable(void *data, unsigned long ip, unsigned long parent_ip)
-{
-	da_handle_event(preempt_enable_opid);
+	return true;
 }
 
 static void handle_sched_need_resched(void *data, struct task_struct *tsk, int cpu, int tif)
 {
-	/* The monitor's intitial state is not in_irq */
-	if (this_cpu_read(hardirq_context))
-		da_handle_event(sched_need_resched_opid);
-	else
-		da_handle_start_event(sched_need_resched_opid);
+	da_handle_start_run_event(sched_need_resched_opid);
 }
 
 static void handle_sched_waking(void *data, struct task_struct *p)
 {
-	/* The monitor's intitial state is not in_irq */
-	if (this_cpu_read(hardirq_context))
-		da_handle_event(sched_waking_opid);
-	else
-		da_handle_start_event(sched_waking_opid);
+	da_handle_start_run_event(sched_waking_opid);
 }
 
 static int enable_opid(void)
@@ -108,14 +77,8 @@ static int enable_opid(void)
 	if (retval)
 		return retval;
 
-	rv_attach_trace_probe("opid", irq_disable, handle_irq_disable);
-	rv_attach_trace_probe("opid", irq_enable, handle_irq_enable);
-	rv_attach_trace_probe("opid", irq_handler_entry, handle_irq_entry);
-	rv_attach_trace_probe("opid", preempt_disable, handle_preempt_disable);
-	rv_attach_trace_probe("opid", preempt_enable, handle_preempt_enable);
 	rv_attach_trace_probe("opid", sched_set_need_resched_tp, handle_sched_need_resched);
 	rv_attach_trace_probe("opid", sched_waking, handle_sched_waking);
-	attach_vector_irq();
 
 	return 0;
 }
@@ -124,14 +87,8 @@ static void disable_opid(void)
 {
 	rv_this.enabled = 0;
 
-	rv_detach_trace_probe("opid", irq_disable, handle_irq_disable);
-	rv_detach_trace_probe("opid", irq_enable, handle_irq_enable);
-	rv_detach_trace_probe("opid", irq_handler_entry, handle_irq_entry);
-	rv_detach_trace_probe("opid", preempt_disable, handle_preempt_disable);
-	rv_detach_trace_probe("opid", preempt_enable, handle_preempt_enable);
 	rv_detach_trace_probe("opid", sched_set_need_resched_tp, handle_sched_need_resched);
 	rv_detach_trace_probe("opid", sched_waking, handle_sched_waking);
-	detach_vector_irq();
 
 	da_monitor_destroy();
 }
diff --git a/kernel/trace/rv/monitors/opid/opid.h b/kernel/trace/rv/monitors/opid/opid.h
index 092992514970..fb0aa4c28aa6 100644
--- a/kernel/trace/rv/monitors/opid/opid.h
+++ b/kernel/trace/rv/monitors/opid/opid.h
@@ -8,30 +8,31 @@
 #define MONITOR_NAME opid
 
 enum states_opid {
-	disabled_opid,
-	enabled_opid,
-	in_irq_opid,
-	irq_disabled_opid,
-	preempt_disabled_opid,
+	any_opid,
 	state_max_opid,
 };
 
 #define INVALID_STATE state_max_opid
 
 enum events_opid {
-	irq_disable_opid,
-	irq_enable_opid,
-	irq_entry_opid,
-	preempt_disable_opid,
-	preempt_enable_opid,
 	sched_need_resched_opid,
 	sched_waking_opid,
 	event_max_opid,
 };
 
+enum envs_opid {
+	irq_off_opid,
+	preempt_off_opid,
+	env_max_opid,
+	env_max_stored_opid = irq_off_opid,
+};
+
+_Static_assert(env_max_stored_opid <= MAX_HA_ENV_LEN, "Not enough slots");
+
 struct automaton_opid {
 	char *state_names[state_max_opid];
 	char *event_names[event_max_opid];
+	char *env_names[env_max_opid];
 	unsigned char function[state_max_opid][event_max_opid];
 	unsigned char initial_state;
 	bool final_states[state_max_opid];
@@ -39,68 +40,19 @@ struct automaton_opid {
 
 static const struct automaton_opid automaton_opid = {
 	.state_names = {
-		"disabled",
-		"enabled",
-		"in_irq",
-		"irq_disabled",
-		"preempt_disabled",
+		"any",
 	},
 	.event_names = {
-		"irq_disable",
-		"irq_enable",
-		"irq_entry",
-		"preempt_disable",
-		"preempt_enable",
 		"sched_need_resched",
 		"sched_waking",
 	},
+	.env_names = {
+		"irq_off",
+		"preempt_off",
+	},
 	.function = {
-		{
-			INVALID_STATE,
-			preempt_disabled_opid,
-			disabled_opid,
-			INVALID_STATE,
-			irq_disabled_opid,
-			disabled_opid,
-			disabled_opid,
-		},
-		{
-			irq_disabled_opid,
-			INVALID_STATE,
-			INVALID_STATE,
-			preempt_disabled_opid,
-			enabled_opid,
-			INVALID_STATE,
-			INVALID_STATE,
-		},
-		{
-			INVALID_STATE,
-			enabled_opid,
-			in_irq_opid,
-			INVALID_STATE,
-			INVALID_STATE,
-			in_irq_opid,
-			in_irq_opid,
-		},
-		{
-			INVALID_STATE,
-			enabled_opid,
-			in_irq_opid,
-			disabled_opid,
-			INVALID_STATE,
-			irq_disabled_opid,
-			INVALID_STATE,
-		},
-		{
-			disabled_opid,
-			INVALID_STATE,
-			INVALID_STATE,
-			INVALID_STATE,
-			enabled_opid,
-			INVALID_STATE,
-			INVALID_STATE,
-		},
+		{           any_opid,           any_opid },
 	},
-	.initial_state = disabled_opid,
-	.final_states = { 0, 1, 0, 0, 0 },
+	.initial_state = any_opid,
+	.final_states = { 1 },
 };
diff --git a/kernel/trace/rv/monitors/opid/opid_trace.h b/kernel/trace/rv/monitors/opid/opid_trace.h
index 3df6ff955c30..b04005b64208 100644
--- a/kernel/trace/rv/monitors/opid/opid_trace.h
+++ b/kernel/trace/rv/monitors/opid/opid_trace.h
@@ -12,4 +12,8 @@ DEFINE_EVENT(event_da_monitor, event_opid,
 DEFINE_EVENT(error_da_monitor, error_opid,
 	     TP_PROTO(char *state, char *event),
 	     TP_ARGS(state, event));
+
+DEFINE_EVENT(error_env_da_monitor, error_env_opid,
+	     TP_PROTO(char *state, char *event, char *env),
+	     TP_ARGS(state, event, env));
 #endif /* CONFIG_RV_MON_OPID */
diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
index 1661f8fe4a88..9e8072d863a2 100644
--- a/kernel/trace/rv/rv_trace.h
+++ b/kernel/trace/rv/rv_trace.h
@@ -62,7 +62,6 @@ DECLARE_EVENT_CLASS(error_da_monitor,
 #include <monitors/scpd/scpd_trace.h>
 #include <monitors/snep/snep_trace.h>
 #include <monitors/sts/sts_trace.h>
-#include <monitors/opid/opid_trace.h>
 // Add new monitors based on CONFIG_DA_MON_EVENTS_IMPLICIT here
 
 #ifdef CONFIG_HA_MON_EVENTS_IMPLICIT
@@ -91,6 +90,7 @@ DECLARE_EVENT_CLASS(error_env_da_monitor,
 		__get_str(env))
 );
 
+#include <monitors/opid/opid_trace.h>
 // Add new monitors based on CONFIG_HA_MON_EVENTS_IMPLICIT here
 
 #endif
diff --git a/tools/verification/models/sched/opid.dot b/tools/verification/models/sched/opid.dot
index 840052f6952b..511051fce430 100644
--- a/tools/verification/models/sched/opid.dot
+++ b/tools/verification/models/sched/opid.dot
@@ -1,35 +1,13 @@
 digraph state_automaton {
 	center = true;
 	size = "7,11";
-	{node [shape = plaintext, style=invis, label=""] "__init_disabled"};
-	{node [shape = circle] "disabled"};
-	{node [shape = doublecircle] "enabled"};
-	{node [shape = circle] "enabled"};
-	{node [shape = circle] "in_irq"};
-	{node [shape = circle] "irq_disabled"};
-	{node [shape = circle] "preempt_disabled"};
-	"__init_disabled" -> "disabled";
-	"disabled" [label = "disabled"];
-	"disabled" -> "disabled" [ label = "sched_need_resched\nsched_waking\nirq_entry" ];
-	"disabled" -> "irq_disabled" [ label = "preempt_enable" ];
-	"disabled" -> "preempt_disabled" [ label = "irq_enable" ];
-	"enabled" [label = "enabled", color = green3];
-	"enabled" -> "enabled" [ label = "preempt_enable" ];
-	"enabled" -> "irq_disabled" [ label = "irq_disable" ];
-	"enabled" -> "preempt_disabled" [ label = "preempt_disable" ];
-	"in_irq" [label = "in_irq"];
-	"in_irq" -> "enabled" [ label = "irq_enable" ];
-	"in_irq" -> "in_irq" [ label = "sched_need_resched\nsched_waking\nirq_entry" ];
-	"irq_disabled" [label = "irq_disabled"];
-	"irq_disabled" -> "disabled" [ label = "preempt_disable" ];
-	"irq_disabled" -> "enabled" [ label = "irq_enable" ];
-	"irq_disabled" -> "in_irq" [ label = "irq_entry" ];
-	"irq_disabled" -> "irq_disabled" [ label = "sched_need_resched" ];
-	"preempt_disabled" [label = "preempt_disabled"];
-	"preempt_disabled" -> "disabled" [ label = "irq_disable" ];
-	"preempt_disabled" -> "enabled" [ label = "preempt_enable" ];
+	{node [shape = plaintext, style=invis, label=""] "__init_any"};
+	{node [shape = doublecircle] "any"};
+	"__init_any" -> "any";
+	"any" [label = "any", color = green3];
+	"any" -> "any" [ label = "sched_need_resched;irq_off == 1\nsched_waking;irq_off == 1 && preempt_off == 1" ];
 	{ rank = min ;
-		"__init_disabled";
-		"disabled";
+		"__init_any";
+		"any";
 	}
 }
-- 
2.53.0



* [PATCH v8 08/12] rv: Add support for per-object monitors in DA/HA
From: Gabriele Monaco @ 2026-03-30 11:10 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
	Gabriele Monaco, linux-trace-kernel
  Cc: Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <20260330111010.153663-1-gmonaco@redhat.com>

RV deterministic and hybrid automata currently only support global,
per-cpu and per-task monitors. It isn't possible to write a model that
would follow some different type of object, like a deadline entity or a
lock.

Define the generic per-object monitor implementation which shares part
of the implementation with the per-task monitors.
The user needs to provide an id for the object (e.g. pid for tasks) and
define the data type for the monitor_target (e.g. struct task_struct *
for tasks). Both are supplied to the event handlers, as the id may not
be easily available in the target.

The monitor storage (e.g. the rv monitor, pointer to the target, etc.)
is stored in a hash table indexed by id. Monitor storage objects are
automatically allocated unless specified otherwise (e.g. if the creation
context is unsafe for allocation).
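
As a rough illustration of the storage semantics described above, here is a
userspace Python model, not kernel code. All class and method names are
hypothetical; a dict stands in for the RCU-protected hash table, and starting
a monitor is reduced to a boolean flag. The skip_auto_alloc switch models
DA_SKIP_AUTO_ALLOC, where storage must be pre-allocated and only the target
is filled in later.

```python
class MonitorStorage:
    """Models struct da_monitor_storage: id, target and monitor state."""

    def __init__(self, obj_id, target=None):
        self.id = obj_id
        self.target = target
        self.monitoring = False


class PerObjMonitors:
    """Hypothetical model of the per-object monitor table."""

    def __init__(self, skip_auto_alloc=False):
        self.table = {}  # stands in for the hash table indexed by id
        self.skip_auto_alloc = skip_auto_alloc

    def create_empty_storage(self, obj_id):
        # Models pre-allocation without a target. As in the patch, callers
        # are assumed not to create storage twice for the same id.
        storage = MonitorStorage(obj_id)
        self.table[obj_id] = storage
        return storage

    def get(self, obj_id):
        # Lookup by id, None if the object has no storage yet.
        return self.table.get(obj_id)

    def prepare_storage(self, obj_id, target):
        mon = self.get(obj_id)
        if self.skip_auto_alloc:
            # Fill the target of a pre-allocated storage, never allocate.
            if mon is not None and mon.target is None:
                mon.target = target
            return mon
        # Default behaviour: allocate on first use.
        if mon is None:
            mon = self.create_empty_storage(obj_id)
            mon.target = target
        return mon

    def start(self, obj_id, target):
        # Simplified start: succeeds only if storage is available.
        mon = self.prepare_storage(obj_id, target)
        if mon is None:
            return False
        mon.monitoring = True
        return True

    def destroy_storage(self, obj_id):
        self.table.pop(obj_id, None)
```

In this model, start() on an unknown id succeeds with automatic allocation
and fails with skip_auto_alloc=True until create_empty_storage() has been
called for that id from a context where allocation is safe.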

Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/linux/rv.h      |   1 +
 include/rv/da_monitor.h | 300 +++++++++++++++++++++++++++++++++++++++-
 include/rv/ha_monitor.h |   5 +-
 3 files changed, 300 insertions(+), 6 deletions(-)

diff --git a/include/linux/rv.h b/include/linux/rv.h
index 0aef9e3c785c..541ba404926a 100644
--- a/include/linux/rv.h
+++ b/include/linux/rv.h
@@ -13,6 +13,7 @@
 #define RV_MON_GLOBAL   0
 #define RV_MON_PER_CPU  1
 #define RV_MON_PER_TASK 2
+#define RV_MON_PER_OBJ  3
 
 #ifdef CONFIG_RV
 #include <linux/array_size.h>
diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index ab5fe0896a46..39765ff6f098 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -19,6 +19,8 @@
 #include <linux/stringify.h>
 #include <linux/bug.h>
 #include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/hashtable.h>
 
 /*
  * Per-cpu variables require a unique name although static in some
@@ -57,6 +59,9 @@ static struct rv_monitor rv_this;
 
 /*
  * Type for the target id, default to int but can be overridden.
+ * A long type can work as hash table key (PER_OBJ) but will be downgraded to
+ * int in the event tracepoint.
+ * Unused for implicit monitors.
  */
 #ifndef da_id_type
 #define da_id_type int
@@ -245,9 +250,9 @@ static inline struct da_monitor *da_get_monitor(struct task_struct *tsk)
 }
 
 /*
- * da_get_task - return the task associated to the monitor
+ * da_get_target - return the task associated to the monitor
  */
-static inline struct task_struct *da_get_task(struct da_monitor *da_mon)
+static inline struct task_struct *da_get_target(struct da_monitor *da_mon)
 {
 	return container_of(da_mon, struct task_struct, rv[task_mon_slot].da_mon);
 }
@@ -259,7 +264,7 @@ static inline struct task_struct *da_get_task(struct da_monitor *da_mon)
  */
 static inline da_id_type da_get_id(struct da_monitor *da_mon)
 {
-	return da_get_task(da_mon)->pid;
+	return da_get_target(da_mon)->pid;
 }
 
 static void da_monitor_reset_all(void)
@@ -309,6 +314,221 @@ static inline void da_monitor_destroy(void)
 
 	da_monitor_reset_all();
 }
+
+#elif RV_MON_TYPE == RV_MON_PER_OBJ
+/*
+ * Functions to define, init and get a per-object monitor.
+ */
+
+struct da_monitor_storage {
+	da_id_type id;
+	monitor_target target;
+	union rv_task_monitor rv;
+	struct hlist_node node;
+	struct rcu_head rcu;
+};
+
+#ifndef DA_MONITOR_HT_BITS
+#define DA_MONITOR_HT_BITS 10
+#endif
+static DEFINE_HASHTABLE(da_monitor_ht, DA_MONITOR_HT_BITS);
+
+/*
+ * da_create_empty_storage - pre-allocate an empty storage
+ */
+static inline struct da_monitor_storage *da_create_empty_storage(da_id_type id)
+{
+	struct da_monitor_storage *mon_storage;
+
+	mon_storage = kmalloc_nolock(sizeof(struct da_monitor_storage),
+				     __GFP_ZERO, NUMA_NO_NODE);
+	if (!mon_storage)
+		return NULL;
+
+	mon_storage->id = id;
+	hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
+	return mon_storage;
+}
+
+/*
+ * da_create_storage - create the per-object storage
+ *
+ * The caller is responsible for synchronising writers, either with locks or
+ * implicitly. For instance, if da_create_storage is only called from a single
+ * event for target (e.g. sched_switch), it's safe to call this without locks.
+ */
+static inline struct da_monitor *da_create_storage(da_id_type id,
+						   monitor_target target,
+						   struct da_monitor *da_mon)
+{
+	struct da_monitor_storage *mon_storage;
+
+	if (da_mon)
+		return da_mon;
+
+	mon_storage = da_create_empty_storage(id);
+	if (!mon_storage)
+		return NULL;
+
+	mon_storage->target = target;
+	return &mon_storage->rv.da_mon;
+}
+
+/*
+ * __da_get_mon_storage - get the monitor storage from the hash table
+ */
+static inline struct da_monitor_storage *__da_get_mon_storage(da_id_type id)
+{
+	struct da_monitor_storage *mon_storage;
+
+	lockdep_assert_in_rcu_read_lock();
+	hash_for_each_possible_rcu(da_monitor_ht, mon_storage, node, id) {
+		if (mon_storage->id == id)
+			return mon_storage;
+	}
+
+	return NULL;
+}
+
+/*
+ * da_get_monitor - return the monitor for target
+ */
+static struct da_monitor *da_get_monitor(da_id_type id, monitor_target target)
+{
+	struct da_monitor_storage *mon_storage;
+
+	mon_storage = __da_get_mon_storage(id);
+	return mon_storage ? &mon_storage->rv.da_mon : NULL;
+}
+
+/*
+ * da_get_target - return the object associated to the monitor
+ */
+static inline monitor_target da_get_target(struct da_monitor *da_mon)
+{
+	return container_of(da_mon, struct da_monitor_storage, rv.da_mon)->target;
+}
+
+/*
+ * da_get_id - return the id associated to the monitor
+ */
+static inline da_id_type da_get_id(struct da_monitor *da_mon)
+{
+	return container_of(da_mon, struct da_monitor_storage, rv.da_mon)->id;
+}
+
+/*
+ * da_create_or_get - create the per-object storage if not already there
+ *
+ * This needs a lookup so should be guarded by RCU, the condition is checked
+ * directly in da_create_storage()
+ */
+static inline void da_create_or_get(da_id_type id, monitor_target target)
+{
+	guard(rcu)();
+	da_create_storage(id, target, da_get_monitor(id, target));
+}
+
+/*
+ * da_fill_empty_storage - store the target in a pre-allocated storage
+ *
+ * Can be used as a substitute for da_create_storage when starting a monitor in
+ * an environment where allocation is unsafe.
+ */
+static inline struct da_monitor *da_fill_empty_storage(da_id_type id,
+						       monitor_target target,
+						       struct da_monitor *da_mon)
+{
+	if (unlikely(da_mon && !da_get_target(da_mon)))
+		container_of(da_mon, struct da_monitor_storage, rv.da_mon)->target = target;
+	return da_mon;
+}
+
+/*
+ * da_get_target_by_id - return the object associated to the id
+ */
+static inline monitor_target da_get_target_by_id(da_id_type id)
+{
+	struct da_monitor_storage *mon_storage;
+
+	guard(rcu)();
+	mon_storage = __da_get_mon_storage(id);
+
+	if (unlikely(!mon_storage))
+		return NULL;
+	return mon_storage->target;
+}
+
+/*
+ * da_destroy_storage - destroy the per-object storage
+ *
+ * The caller is responsible for synchronising writers, either with locks or
+ * implicitly. For instance, if da_destroy_storage is called at sched_exit and
+ * da_create_storage can never occur after that, it's safe to call this without
+ * locks.
+ * This function includes an RCU read-side critical section to synchronise
+ * against da_monitor_destroy().
+ */
+static inline void da_destroy_storage(da_id_type id)
+{
+	struct da_monitor_storage *mon_storage;
+
+	guard(rcu)();
+	mon_storage = __da_get_mon_storage(id);
+
+	if (!mon_storage)
+		return;
+	da_monitor_reset_hook(&mon_storage->rv.da_mon);
+	hash_del_rcu(&mon_storage->node);
+	kfree_rcu(mon_storage, rcu);
+}
+
+static void da_monitor_reset_all(void)
+{
+	struct da_monitor_storage *mon_storage;
+	int bkt;
+
+	rcu_read_lock();
+	hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node)
+		da_monitor_reset(&mon_storage->rv.da_mon);
+	rcu_read_unlock();
+}
+
+static inline int da_monitor_init(void)
+{
+	hash_init(da_monitor_ht);
+	return 0;
+}
+
+static inline void da_monitor_destroy(void)
+{
+	struct da_monitor_storage *mon_storage;
+	struct hlist_node *tmp;
+	int bkt;
+
+	/*
+	 * This function is called after all probes are disabled, we need only
+	 * worry about concurrency against old events.
+	 */
+	synchronize_rcu();
+	hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node) {
+		da_monitor_reset_hook(&mon_storage->rv.da_mon);
+		hash_del_rcu(&mon_storage->node);
+		kfree(mon_storage);
+	}
+}
+
+/*
+ * Allow the per-object monitors to run allocation manually, necessary if the
+ * start condition is in a context problematic for allocation (e.g. scheduling).
+ * In such case, if the storage was pre-allocated without a target, set it now.
+ */
+#ifdef DA_SKIP_AUTO_ALLOC
+#define da_prepare_storage da_fill_empty_storage
+#else
+#define da_prepare_storage da_create_storage
+#endif /* DA_SKIP_AUTO_ALLOC */
+
 #endif /* RV_MON_TYPE */
 
 #if RV_MON_TYPE == RV_MON_GLOBAL || RV_MON_TYPE == RV_MON_PER_CPU
@@ -342,9 +562,9 @@ static inline da_id_type da_get_id(struct da_monitor *da_mon)
 	return 0;
 }
 
-#elif RV_MON_TYPE == RV_MON_PER_TASK
+#elif RV_MON_TYPE == RV_MON_PER_TASK || RV_MON_TYPE == RV_MON_PER_OBJ
 /*
- * Trace events for per_task monitors, report the PID of the task.
+ * Trace events for per_task/per_object monitors, report the target id.
  */
 
 static inline void da_trace_event(struct da_monitor *da_mon,
@@ -525,6 +745,76 @@ static inline bool da_handle_start_run_event(struct task_struct *tsk,
 {
 	return __da_handle_start_run_event(da_get_monitor(tsk), event, tsk->pid);
 }
+
+#elif RV_MON_TYPE == RV_MON_PER_OBJ
+/*
+ * Handle events for per-object monitors.
+ */
+
+/*
+ * da_handle_event - handle an event
+ */
+static inline void da_handle_event(da_id_type id, monitor_target target, enum events event)
+{
+	struct da_monitor *da_mon;
+
+	guard(rcu)();
+	da_mon = da_get_monitor(id, target);
+	if (likely(da_mon))
+		__da_handle_event(da_mon, event, id);
+}
+
+/*
+ * da_handle_start_event - start monitoring or handle event
+ *
+ * This function is used to notify the monitor that the system is returning
+ * to the initial state, so the monitor can start monitoring in the next event.
+ * Thus:
+ *
+ * If the monitor already started, handle the event.
+ * If the monitor did not start yet, start the monitor but skip the event.
+ */
+static inline bool da_handle_start_event(da_id_type id, monitor_target target,
+					 enum events event)
+{
+	struct da_monitor *da_mon;
+
+	guard(rcu)();
+	da_mon = da_get_monitor(id, target);
+	da_mon = da_prepare_storage(id, target, da_mon);
+	if (unlikely(!da_mon))
+		return 0;
+	return __da_handle_start_event(da_mon, event, id);
+}
+
+/*
+ * da_handle_start_run_event - start monitoring and handle event
+ *
+ * This function is used to notify the monitor that the system is in the
+ * initial state, so the monitor can start monitoring and handling event.
+ */
+static inline bool da_handle_start_run_event(da_id_type id, monitor_target target,
+					     enum events event)
+{
+	struct da_monitor *da_mon;
+
+	guard(rcu)();
+	da_mon = da_get_monitor(id, target);
+	da_mon = da_prepare_storage(id, target, da_mon);
+	if (unlikely(!da_mon))
+		return 0;
+	return __da_handle_start_run_event(da_mon, event, id);
+}
+
+static inline void da_reset(da_id_type id, monitor_target target)
+{
+	struct da_monitor *da_mon;
+
+	guard(rcu)();
+	da_mon = da_get_monitor(id, target);
+	if (likely(da_mon))
+		da_monitor_reset(da_mon);
+}
 #endif /* RV_MON_TYPE */
 
 #endif
diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index b6cf3b2ba989..d59507e8cb30 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -190,7 +190,10 @@ static inline void ha_trace_error_env(struct ha_monitor *ha_mon,
 {
 	CONCATENATE(trace_error_env_, MONITOR_NAME)(curr_state, event, env);
 }
-#elif RV_MON_TYPE == RV_MON_PER_TASK
+#elif RV_MON_TYPE == RV_MON_PER_TASK || RV_MON_TYPE == RV_MON_PER_OBJ
+
+#define ha_get_target(ha_mon) da_get_target(&ha_mon->da_mon)
+
 static inline void ha_trace_error_env(struct ha_monitor *ha_mon,
 				      char *curr_state, char *event, char *env,
 				      da_id_type id)
-- 
2.53.0



* [PATCH v8 09/12] verification/rvgen: Add support for per-obj monitors
From: Gabriele Monaco @ 2026-03-30 11:10 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
	Gabriele Monaco, linux-trace-kernel
  Cc: Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <20260330111010.153663-1-gmonaco@redhat.com>

The special per-object monitor type was just introduced in RV. It
requires the user to define some functions and types specific to the
object.

Adapt rvgen to add stub definitions for the monitor_target type and
other modifications required to create per-object monitors.
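
To make the effect on generated code concrete, the following standalone
Python sketch mirrors the branch added to the tracepoint handler skeleton in
this patch. The function name handler_call_lines is hypothetical, and the
event name and suffix in the usage note are placeholders; only the emitted
strings are taken from the diff.

```python
def handler_call_lines(monitor_type, handle, event, enum_suffix=""):
    """Return the body lines emitted for one tracepoint handler stub."""
    if monitor_type == "per_task":
        # Per-task monitors pass the task pointer to the handler.
        return [
            "\tstruct task_struct *p = /* XXX: how do I get p? */;",
            f"\tda_{handle}(p, {event}{enum_suffix});",
        ]
    if monitor_type == "per_obj":
        # Per-object monitors pass both the id and the target.
        return [
            "\tint id = /* XXX: how do I get the id? */;",
            "\tmonitor_target t = /* XXX: how do I get t? */;",
            f"\tda_{handle}(id, t, {event}{enum_suffix});",
        ]
    # Global and per-cpu monitors only pass the event.
    return [f"\tda_{handle}({event}{enum_suffix});"]
```

For example, handler_call_lines("per_obj", "handle_event", "some_event",
"_mymon") yields three lines, the last one calling
da_handle_event(id, t, some_event_mymon), which the monitor author then
completes by replacing the XXX placeholders.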

Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---

Notes:
    V6:
    * Use f-strings in newly added code and cleanup
    V3:
    * Add _is_id_monitor() in dot2k to handle per-obj together with per-task

 tools/verification/rvgen/rvgen/dot2k.py     | 17 +++++++++++++----
 tools/verification/rvgen/rvgen/generator.py |  2 +-
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/tools/verification/rvgen/rvgen/dot2k.py b/tools/verification/rvgen/rvgen/dot2k.py
index 3cdc8cfb6be5..e7ba68a54c1f 100644
--- a/tools/verification/rvgen/rvgen/dot2k.py
+++ b/tools/verification/rvgen/rvgen/dot2k.py
@@ -27,6 +27,8 @@ class dot2k(Monitor, Dot2c):
     def fill_monitor_type(self) -> str:
         buff = [ self.monitor_type.upper() ]
         buff += self._fill_timer_type()
+        if self.monitor_type == "per_obj":
+            buff.append("typedef /* XXX: define the target type */ *monitor_target;")
         return "\n".join(buff)
 
     def fill_tracepoint_handlers_skel(self) -> str:
@@ -45,6 +47,10 @@ class dot2k(Monitor, Dot2c):
             if self.monitor_type == "per_task":
                 buff.append("\tstruct task_struct *p = /* XXX: how do I get p? */;");
                 buff.append("\tda_%s(p, %s%s);" % (handle, event, self.enum_suffix));
+            elif self.monitor_type == "per_obj":
+                buff.append("\tint id = /* XXX: how do I get the id? */;")
+                buff.append("\tmonitor_target t = /* XXX: how do I get t? */;")
+                buff.append(f"\tda_{handle}(id, t, {event}{self.enum_suffix});")
             else:
                 buff.append("\tda_%s(%s%s);" % (handle, event, self.enum_suffix));
             buff.append("}")
@@ -92,13 +98,16 @@ class dot2k(Monitor, Dot2c):
 
         return '\n'.join(buff)
 
+    def _is_id_monitor(self) -> bool:
+        return self.monitor_type in ("per_task", "per_obj")
+
     def fill_monitor_class_type(self) -> str:
-        if self.monitor_type == "per_task":
+        if self._is_id_monitor():
             return "DA_MON_EVENTS_ID"
         return "DA_MON_EVENTS_IMPLICIT"
 
     def fill_monitor_class(self) -> str:
-        if self.monitor_type == "per_task":
+        if self._is_id_monitor():
             return "da_monitor_id"
         return "da_monitor"
 
@@ -122,7 +131,7 @@ class dot2k(Monitor, Dot2c):
                 }
         tp_args_id = ("int ", "id")
         tp_args = tp_args_dict[tp_type]
-        if self.monitor_type == "per_task":
+        if self._is_id_monitor():
             tp_args.insert(0, tp_args_id)
         tp_proto_c = ", ".join([a+b for a,b in tp_args])
         tp_args_c = ", ".join([b for a,b in tp_args])
@@ -169,7 +178,7 @@ class ha2k(dot2k):
         self.__parse_constraints()
 
     def fill_monitor_class_type(self) -> str:
-        if self.monitor_type == "per_task":
+        if self._is_id_monitor():
             return "HA_MON_EVENTS_ID"
         return "HA_MON_EVENTS_IMPLICIT"
 
diff --git a/tools/verification/rvgen/rvgen/generator.py b/tools/verification/rvgen/rvgen/generator.py
index b80af3fd6701..5eac12e110dc 100644
--- a/tools/verification/rvgen/rvgen/generator.py
+++ b/tools/verification/rvgen/rvgen/generator.py
@@ -243,7 +243,7 @@ obj-$(CONFIG_RV_MON_%s) += monitors/%s/%s.o
 
 
 class Monitor(RVGenerator):
-    monitor_types = { "global" : 1, "per_cpu" : 2, "per_task" : 3 }
+    monitor_types = { "global" : 1, "per_cpu" : 2, "per_task" : 3, "per_obj" : 4 }
 
     def __init__(self, extra_params={}):
         super().__init__(extra_params)
-- 
2.53.0



* [PATCH v8 10/12] sched: Add deadline tracepoints
From: Gabriele Monaco @ 2026-03-30 11:10 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
	Masami Hiramatsu, Ingo Molnar, Peter Zijlstra, linux-trace-kernel
  Cc: Gabriele Monaco, Phil Auld, Juri Lelli, Tomas Glozar,
	Clark Williams, John Kacur
In-Reply-To: <20260330111010.153663-1-gmonaco@redhat.com>

Add the following tracepoints:

* sched_dl_throttle(dl_se, cpu, type):
    Called when a deadline entity is throttled
* sched_dl_replenish(dl_se, cpu, type):
    Called when a deadline entity's runtime is replenished
* sched_dl_update(dl_se, cpu, type):
    Called when a deadline entity updates without throttle or replenish
* sched_dl_server_start(dl_se, cpu, type):
    Called when a deadline server is started
* sched_dl_server_stop(dl_se, cpu, type):
    Called when a deadline server is stopped

These tracepoints can be useful to validate the deadline scheduler with
RV and are not exported to tracefs.
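
The type argument follows the order of checks in dl_get_type() added by this
patch. The Python model below is only a sketch of that decision order: the
boolean parameters are hypothetical stand-ins for dl_server(dl_se), the
comparisons against rq->fair_server and rq->ext_server, and whether
CONFIG_SCHED_CLASS_EXT is enabled.

```python
# Values match the DL_* defines added to include/trace/events/sched.h.
DL_OTHER = 0
DL_TASK = 1
DL_SERVER_FAIR = 2
DL_SERVER_EXT = 3


def dl_get_type(is_server, is_fair_server=False, is_ext_server=False,
                ext_enabled=True):
    """Classify a deadline entity in the same order as the kernel helper."""
    if not is_server:
        return DL_TASK
    if is_fair_server:
        return DL_SERVER_FAIR
    # The ext server check is only compiled in with CONFIG_SCHED_CLASS_EXT.
    if ext_enabled and is_ext_server:
        return DL_SERVER_EXT
    return DL_OTHER
```

A probe attached to one of these tracepoints can use the value to tell
deadline tasks apart from fair and ext servers without inspecting the
runqueue itself.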

Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---

Notes:
    V8:
    * Use u8 instead of uint8_t
    V7:
    * Export sched_dl_update to modules and fix style
    V6:
    * Add dl_se type to differentiate between fair and ext servers
    * Add event to track dl_update_curr not firing other events

 include/trace/events/sched.h | 26 ++++++++++++++++++++++++++
 kernel/sched/core.c          |  5 +++++
 kernel/sched/deadline.c      | 23 +++++++++++++++++++++++
 3 files changed, 54 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 7b2645b50e78..535860581f15 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -896,6 +896,32 @@ DECLARE_TRACE(sched_set_need_resched,
 	TP_PROTO(struct task_struct *tsk, int cpu, int tif),
 	TP_ARGS(tsk, cpu, tif));
 
+#define DL_OTHER 0
+#define DL_TASK 1
+#define DL_SERVER_FAIR 2
+#define DL_SERVER_EXT 3
+
+DECLARE_TRACE(sched_dl_throttle,
+	TP_PROTO(struct sched_dl_entity *dl_se, int cpu, u8 type),
+	TP_ARGS(dl_se, cpu, type));
+
+DECLARE_TRACE(sched_dl_replenish,
+	TP_PROTO(struct sched_dl_entity *dl_se, int cpu, u8 type),
+	TP_ARGS(dl_se, cpu, type));
+
+/* Call to update_curr_dl_se not involving throttle or replenish */
+DECLARE_TRACE(sched_dl_update,
+	TP_PROTO(struct sched_dl_entity *dl_se, int cpu, u8 type),
+	TP_ARGS(dl_se, cpu, type));
+
+DECLARE_TRACE(sched_dl_server_start,
+	TP_PROTO(struct sched_dl_entity *dl_se, int cpu, u8 type),
+	TP_ARGS(dl_se, cpu, type));
+
+DECLARE_TRACE(sched_dl_server_stop,
+	TP_PROTO(struct sched_dl_entity *dl_se, int cpu, u8 type),
+	TP_ARGS(dl_se, cpu, type));
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 496dff740dca..6a043f11b79d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -122,6 +122,11 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_entry_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_exit_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_set_need_resched_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dl_throttle_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dl_replenish_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dl_update_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dl_server_start_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dl_server_stop_tp);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 DEFINE_PER_CPU(struct rnd_state, sched_rnd_state);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d08b00429323..e511e36916bd 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -115,6 +115,19 @@ static inline bool is_dl_boosted(struct sched_dl_entity *dl_se)
 }
 #endif /* !CONFIG_RT_MUTEXES */
 
+static inline u8 dl_get_type(struct sched_dl_entity *dl_se, struct rq *rq)
+{
+	if (!dl_server(dl_se))
+		return DL_TASK;
+	if (dl_se == &rq->fair_server)
+		return DL_SERVER_FAIR;
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (dl_se == &rq->ext_server)
+		return DL_SERVER_EXT;
+#endif
+	return DL_OTHER;
+}
+
 static inline struct dl_bw *dl_bw_of(int i)
 {
 	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
@@ -733,6 +746,7 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
 		dl_se->dl_throttled = 1;
 		dl_se->dl_defer_armed = 1;
 	}
+	trace_sched_dl_replenish_tp(dl_se, cpu_of(rq), dl_get_type(dl_se, rq));
 }
 
 /*
@@ -848,6 +862,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	if (dl_se->dl_throttled)
 		dl_se->dl_throttled = 0;
 
+	trace_sched_dl_replenish_tp(dl_se, cpu_of(rq), dl_get_type(dl_se, rq));
+
 	/*
 	 * If this is the replenishment of a deferred reservation,
 	 * clear the flag and return.
@@ -1345,6 +1361,7 @@ static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se)
 	    dl_time_before(rq_clock(rq), dl_next_period(dl_se))) {
 		if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se)))
 			return;
+		trace_sched_dl_throttle_tp(dl_se, cpu_of(rq), dl_get_type(dl_se, rq));
 		dl_se->dl_throttled = 1;
 		if (dl_se->runtime > 0)
 			dl_se->runtime = 0;
@@ -1508,6 +1525,7 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 
 throttle:
 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
+		trace_sched_dl_throttle_tp(dl_se, cpu_of(rq), dl_get_type(dl_se, rq));
 		dl_se->dl_throttled = 1;
 
 		/* If requested, inform the user about runtime overruns. */
@@ -1532,6 +1550,8 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 
 		if (!is_leftmost(dl_se, &rq->dl))
 			resched_curr(rq);
+	} else {
+		trace_sched_dl_update_tp(dl_se, cpu_of(rq), dl_get_type(dl_se, rq));
 	}
 
 	/*
@@ -1810,6 +1830,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 	if (WARN_ON_ONCE(!cpu_online(cpu_of(rq))))
 		return;
 
+	trace_sched_dl_server_start_tp(dl_se, cpu_of(rq), dl_get_type(dl_se, rq));
 	dl_se->dl_server_active = 1;
 	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
 	if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
@@ -1821,6 +1842,8 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 	if (!dl_server(dl_se) || !dl_server_active(dl_se))
 		return;
 
+	trace_sched_dl_server_stop_tp(dl_se, cpu_of(dl_se->rq),
+				      dl_get_type(dl_se, dl_se->rq));
 	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
 	hrtimer_try_to_cancel(&dl_se->dl_timer);
 	dl_se->dl_defer_armed = 0;
-- 
2.53.0



* [PATCH v8 11/12] sched/deadline: Move some utility functions to deadline.h
From: Gabriele Monaco @ 2026-03-30 11:10 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli, Ingo Molnar,
	Peter Zijlstra
  Cc: Gabriele Monaco, Juri Lelli, Tomas Glozar, Clark Williams,
	John Kacur, linux-trace-kernel
In-Reply-To: <20260330111010.153663-1-gmonaco@redhat.com>

Some utility functions on sched_dl_entity can be useful outside of
deadline.c, for instance for modelling, without relying on raw
structure fields.

Move functions like dl_task_of and dl_is_implicit to deadline.h to make
them available outside.

Acked-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/linux/sched/deadline.h | 27 +++++++++++++++++++++++++++
 kernel/sched/deadline.c        | 28 +---------------------------
 2 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index c40115d4e34d..1198138cb839 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -37,4 +37,31 @@ extern void dl_clear_root_domain_cpu(int cpu);
 extern u64 dl_cookie;
 extern bool dl_bw_visited(int cpu, u64 cookie);
 
+static inline bool dl_server(struct sched_dl_entity *dl_se)
+{
+	return dl_se->dl_server;
+}
+
+static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
+{
+	BUG_ON(dl_server(dl_se));
+	return container_of(dl_se, struct task_struct, dl);
+}
+
+/*
+ * Regarding the deadline, a task with implicit deadline has a relative
+ * deadline == relative period. A task with constrained deadline has a
+ * relative deadline <= relative period.
+ *
+ * We support constrained deadline tasks. However, there are some restrictions
+ * applied only for tasks which do not have an implicit deadline. See
+ * update_dl_entity() to know more about such restrictions.
+ *
+ * The dl_is_implicit() returns true if the task has an implicit deadline.
+ */
+static inline bool dl_is_implicit(struct sched_dl_entity *dl_se)
+{
+	return dl_se->dl_deadline == dl_se->dl_period;
+}
+
 #endif /* _LINUX_SCHED_DEADLINE_H */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e511e36916bd..c10415c1aa4a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -18,6 +18,7 @@
 
 #include <linux/cpuset.h>
 #include <linux/sched/clock.h>
+#include <linux/sched/deadline.h>
 #include <uapi/linux/sched/types.h>
 #include "sched.h"
 #include "pelt.h"
@@ -57,17 +58,6 @@ static int __init sched_dl_sysctl_init(void)
 late_initcall(sched_dl_sysctl_init);
 #endif /* CONFIG_SYSCTL */
 
-static bool dl_server(struct sched_dl_entity *dl_se)
-{
-	return dl_se->dl_server;
-}
-
-static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
-{
-	BUG_ON(dl_server(dl_se));
-	return container_of(dl_se, struct task_struct, dl);
-}
-
 static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
 {
 	return container_of(dl_rq, struct rq, dl);
@@ -990,22 +980,6 @@ update_dl_revised_wakeup(struct sched_dl_entity *dl_se, struct rq *rq)
 	dl_se->runtime = (dl_se->dl_density * laxity) >> BW_SHIFT;
 }
 
-/*
- * Regarding the deadline, a task with implicit deadline has a relative
- * deadline == relative period. A task with constrained deadline has a
- * relative deadline <= relative period.
- *
- * We support constrained deadline tasks. However, there are some restrictions
- * applied only for tasks which do not have an implicit deadline. See
- * update_dl_entity() to know more about such restrictions.
- *
- * The dl_is_implicit() returns true if the task has an implicit deadline.
- */
-static inline bool dl_is_implicit(struct sched_dl_entity *dl_se)
-{
-	return dl_se->dl_deadline == dl_se->dl_period;
-}
-
 /*
  * When a deadline entity is placed in the runqueue, its runtime and deadline
  * might need to be updated. This is done by a CBS wake up rule. There are two
-- 
2.53.0



* [PATCH v8 12/12] rv: Add nomiss deadline monitor
From: Gabriele Monaco @ 2026-03-30 11:10 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
	Gabriele Monaco, Jonathan Corbet, Masami Hiramatsu,
	linux-trace-kernel, linux-doc
  Cc: Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <20260330111010.153663-1-gmonaco@redhat.com>

Add the deadline monitors collection to validate the deadline scheduler,
both for deadline tasks and servers.

The currently implemented monitors are:
* nomiss:
    validate dl entities run to completion before their deadline

Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---

Notes:
    V8:
    * Warn if kallsyms lookup fails
    * Use u8 instead of uint8_t
    * Drop throttle monitor, will submit separately

 Documentation/trace/rv/index.rst              |   1 +
 Documentation/trace/rv/monitor_deadline.rst   |  84 +++++
 kernel/trace/rv/Kconfig                       |   4 +
 kernel/trace/rv/Makefile                      |   2 +
 kernel/trace/rv/monitors/deadline/Kconfig     |  10 +
 kernel/trace/rv/monitors/deadline/deadline.c  |  44 +++
 kernel/trace/rv/monitors/deadline/deadline.h  | 202 ++++++++++++
 kernel/trace/rv/monitors/nomiss/Kconfig       |  15 +
 kernel/trace/rv/monitors/nomiss/nomiss.c      | 293 ++++++++++++++++++
 kernel/trace/rv/monitors/nomiss/nomiss.h      | 123 ++++++++
 .../trace/rv/monitors/nomiss/nomiss_trace.h   |  19 ++
 kernel/trace/rv/rv_trace.h                    |   1 +
 tools/verification/models/deadline/nomiss.dot |  41 +++
 13 files changed, 839 insertions(+)
 create mode 100644 Documentation/trace/rv/monitor_deadline.rst
 create mode 100644 kernel/trace/rv/monitors/deadline/Kconfig
 create mode 100644 kernel/trace/rv/monitors/deadline/deadline.c
 create mode 100644 kernel/trace/rv/monitors/deadline/deadline.h
 create mode 100644 kernel/trace/rv/monitors/nomiss/Kconfig
 create mode 100644 kernel/trace/rv/monitors/nomiss/nomiss.c
 create mode 100644 kernel/trace/rv/monitors/nomiss/nomiss.h
 create mode 100644 kernel/trace/rv/monitors/nomiss/nomiss_trace.h
 create mode 100644 tools/verification/models/deadline/nomiss.dot

diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
index bf9962f49959..29769f06bb0f 100644
--- a/Documentation/trace/rv/index.rst
+++ b/Documentation/trace/rv/index.rst
@@ -17,3 +17,4 @@ Runtime Verification
    monitor_sched.rst
    monitor_rtapp.rst
    monitor_stall.rst
+   monitor_deadline.rst
diff --git a/Documentation/trace/rv/monitor_deadline.rst b/Documentation/trace/rv/monitor_deadline.rst
new file mode 100644
index 000000000000..84506ed1e293
--- /dev/null
+++ b/Documentation/trace/rv/monitor_deadline.rst
@@ -0,0 +1,84 @@
+Deadline monitors
+=================
+
+- Name: deadline
+- Type: container for multiple monitors
+- Author: Gabriele Monaco <gmonaco@redhat.com>
+
+Description
+-----------
+
+The deadline monitor is a set of specifications to describe the deadline
+scheduler behaviour. It includes monitors per scheduling entity (deadline tasks
+and servers) that work independently to verify different specifications the
+deadline scheduler should follow.
+
+Specifications
+--------------
+
+Monitor nomiss
+~~~~~~~~~~~~~~
+
+The nomiss monitor ensures dl entities get to run *and* run to completion
+before their deadline, although deferrable servers may not run. An entity is
+considered done if ``throttled``, either because it yielded or used up its
+runtime, or when it voluntarily starts ``sleeping``.
+The monitor includes a user configurable deadline threshold. If the total
+utilisation of deadline tasks is larger than 1, they are only guaranteed
+bounded tardiness. See Documentation/scheduler/sched-deadline.rst for more
+details. The threshold (module parameter ``nomiss.deadline_thresh``) can be
+configured to keep the monitor from failing within the tardiness acceptable in
+the system. Since ``dl_throttle`` is a valid outcome for the entity to be done,
+the minimum tardiness needs to be 1 tick to account for the throttle delay,
+unless the ``HRTICK_DL`` scheduler feature is active.
+
+Servers also have an intermediate ``idle`` state, entered from ready or
+running as soon as no runnable task is available, where no timing constraint
+is applied. A server goes to sleep by stopping; there is no wakeup equivalent
+as the order of a server starting and replenishing is not defined, hence a
+server can run from sleeping without being ready::
+
+                                  |
+  sched_wakeup                    v
+  dl_replenish;reset(clk) -- #=========================#
+               |             H                         H dl_replenish;reset(clk)
+               +-----------> H                         H <--------------------+
+                             H                         H                      |
+      +- dl_server_stop ---- H          ready          H                      |
+      |  +-----------------> H   clk < DEADLINE_NS()   H   dl_throttle;       |
+      |  |                   H                         H     is_defer == 1    |
+      |  | sched_switch_in - H                         H -----------------+   |
+      |  |   |               #=========================#                  |   |
+      |  |   |                       |            ^                       |   |
+      |  |   |             dl_server_idle    dl_replenish;reset(clk)      |   |
+      |  |   |                       v            |                       |   |
+      |  |   |                      +--------------+                      |   |
+      |  |   |              +------ |              |                      |   |
+      |  |   |     dl_server_idle   |              | dl_throttle          |   |
+      |  |   |              |       |     idle     | -----------------+   |   |
+      |  |   |              +-----> |              |                  |   |   |
+      |  |   |                      |              |                  |   |   |
+      |  |   |                      |              |                  |   |   |
+   +--+--+---+--- dl_server_stop -- +--------------+                  |   |   |
+   |  |  |   |                       |           ^                    |   |   |
+   |  |  |   |            sched_switch_in    dl_server_idle           |   |   |
+   |  |  |   |                       v           |                    |   |   |
+   |  |  |   |      +---------- +---------------------+               |   |   |
+   |  |  |   | sched_switch_in  |                     |               |   |   |
+   |  |  |   | sched_wakeup     |                     |               |   |   |
+   |  |  |   | dl_replenish;    |      running        | -------+      |   |   |
+   |  |  |   |      reset(clk)  | clk < DEADLINE_NS() |        |      |   |   |
+   |  |  |   |      +---------> |                     | dl_throttle   |   |   |
+   |  |  |   +----------------> |                     |        |      |   |   |
+   |  |  |                      +---------------------+        |      |   |   |
+   |  | sched_wakeup                ^   sched_switch_suspend   |      |   |   |
+   v  v dl_replenish;reset(clk)     |   dl_server_stop         |      |   |   |
+ +--------------+                   |   |                      v      v   v   |
+ |              | - sched_switch_in +   |                     +---------------+
+ |              | <---------------------+     dl_throttle +-- |               |
+ |   sleeping   |                            sched_wakeup |   |   throttled   |
+ |              | -- dl_server_stop        dl_server_idle +-> |               |
+ |              |    dl_server_idle     sched_switch_suspend  +---------------+
+ +--------------+ <---------+                                        ^
+        |                                                            |
+        +------ dl_throttle;is_constr_dl == 1 || is_defer == 1 ------+
diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
index 720fbe4935f8..3884b14df375 100644
--- a/kernel/trace/rv/Kconfig
+++ b/kernel/trace/rv/Kconfig
@@ -79,6 +79,10 @@ source "kernel/trace/rv/monitors/sleep/Kconfig"
 # Add new rtapp monitors here
 
 source "kernel/trace/rv/monitors/stall/Kconfig"
+source "kernel/trace/rv/monitors/deadline/Kconfig"
+source "kernel/trace/rv/monitors/nomiss/Kconfig"
+# Add new deadline monitors here
+
 # Add new monitors here
 
 config RV_REACTORS
diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index 51c95e2d2da6..94498da35b37 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -18,6 +18,8 @@ obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
 obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
 obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
 obj-$(CONFIG_RV_MON_STALL) += monitors/stall/stall.o
+obj-$(CONFIG_RV_MON_DEADLINE) += monitors/deadline/deadline.o
+obj-$(CONFIG_RV_MON_NOMISS) += monitors/nomiss/nomiss.o
 # Add new monitors here
 obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
 obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
diff --git a/kernel/trace/rv/monitors/deadline/Kconfig b/kernel/trace/rv/monitors/deadline/Kconfig
new file mode 100644
index 000000000000..38804a6ad91d
--- /dev/null
+++ b/kernel/trace/rv/monitors/deadline/Kconfig
@@ -0,0 +1,10 @@
+config RV_MON_DEADLINE
+	depends on RV
+	bool "deadline monitor"
+	help
+	  Collection of monitors to check the deadline scheduler and server
+	  behave according to specifications. Enable this to enable all
+	  scheduler specifications supported by the current kernel.
+
+	  For further information, see:
+	    Documentation/trace/rv/monitor_deadline.rst
diff --git a/kernel/trace/rv/monitors/deadline/deadline.c b/kernel/trace/rv/monitors/deadline/deadline.c
new file mode 100644
index 000000000000..d566d4542ebf
--- /dev/null
+++ b/kernel/trace/rv/monitors/deadline/deadline.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/rv.h>
+#include <linux/kallsyms.h>
+
+#define MODULE_NAME "deadline"
+
+#include "deadline.h"
+
+struct rv_monitor rv_deadline = {
+	.name = "deadline",
+	.description = "container for several deadline scheduler specifications.",
+	.enable = NULL,
+	.disable = NULL,
+	.reset = NULL,
+	.enabled = 0,
+};
+
+/* Used by other monitors */
+struct sched_class *rv_ext_sched_class;
+
+static int __init register_deadline(void)
+{
+	if (IS_ENABLED(CONFIG_SCHED_CLASS_EXT)) {
+		rv_ext_sched_class = (void *)kallsyms_lookup_name("ext_sched_class");
+		if (!rv_ext_sched_class)
+			pr_warn("rv: Missing ext_sched_class, monitors may not work.\n");
+	}
+	return rv_register_monitor(&rv_deadline, NULL);
+}
+
+static void __exit unregister_deadline(void)
+{
+	rv_unregister_monitor(&rv_deadline);
+}
+
+module_init(register_deadline);
+module_exit(unregister_deadline);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Gabriele Monaco <gmonaco@redhat.com>");
+MODULE_DESCRIPTION("deadline: container for several deadline scheduler specifications.");
diff --git a/kernel/trace/rv/monitors/deadline/deadline.h b/kernel/trace/rv/monitors/deadline/deadline.h
new file mode 100644
index 000000000000..0bbfd2543329
--- /dev/null
+++ b/kernel/trace/rv/monitors/deadline/deadline.h
@@ -0,0 +1,202 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/kernel.h>
+#include <linux/uaccess.h>
+#include <linux/sched/deadline.h>
+#include <asm/syscall.h>
+#include <uapi/linux/sched/types.h>
+#include <trace/events/sched.h>
+
+/*
+ * Dummy values if not available
+ */
+#ifndef __NR_sched_setscheduler
+#define __NR_sched_setscheduler -__COUNTER__
+#endif
+#ifndef __NR_sched_setattr
+#define __NR_sched_setattr -__COUNTER__
+#endif
+
+extern struct rv_monitor rv_deadline;
+/* Initialised when registering the deadline container */
+extern struct sched_class *rv_ext_sched_class;
+
+/*
+ * If both have dummy values, the syscalls are not supported and we don't even
+ * need to register the handler.
+ */
+static inline bool should_skip_syscall_handle(void)
+{
+	return __NR_sched_setattr < 0 && __NR_sched_setscheduler < 0;
+}
+
+/*
+ * is_supported_type - return true if @type is supported by the deadline monitors
+ */
+static inline bool is_supported_type(u8 type)
+{
+	return type == DL_TASK || type == DL_SERVER_FAIR || type == DL_SERVER_EXT;
+}
+
+/*
+ * is_server_type - return true if @type is a supported server
+ */
+static inline bool is_server_type(u8 type)
+{
+	return is_supported_type(type) && type != DL_TASK;
+}
+
+/*
+ * Use negative numbers for the server.
+ * Currently only one fair server per CPU, may change in the future.
+ */
+#define fair_server_id(cpu) (-cpu)
+#define ext_server_id(cpu) (-cpu - num_possible_cpus())
+#define NO_SERVER_ID (-2 * num_possible_cpus())
+/*
+ * Get a unique id used for dl entities
+ *
+ * The cpu is not required for tasks as the pid is used there, if this function
+ * is called on a dl_se that for sure corresponds to a task, DL_TASK can be
+ * used in place of cpu.
+ * We need the cpu for servers as it is provided in the tracepoint and we
+ * cannot easily retrieve it from the dl_se (requires the struct rq definition).
+ */
+static inline int get_entity_id(struct sched_dl_entity *dl_se, int cpu, u8 type)
+{
+	if (dl_server(dl_se) && type != DL_TASK) {
+		if (type == DL_SERVER_FAIR)
+			return fair_server_id(cpu);
+		if (type == DL_SERVER_EXT)
+			return ext_server_id(cpu);
+		return NO_SERVER_ID;
+	}
+	return dl_task_of(dl_se)->pid;
+}
+
+static inline bool task_is_scx_enabled(struct task_struct *tsk)
+{
+	return IS_ENABLED(CONFIG_SCHED_CLASS_EXT) &&
+	       tsk->sched_class == rv_ext_sched_class;
+}
+
+/* Expand id and target as arguments for da functions */
+#define EXPAND_ID(dl_se, cpu, type) get_entity_id(dl_se, cpu, type), dl_se
+#define EXPAND_ID_TASK(tsk) get_entity_id(&tsk->dl, task_cpu(tsk), DL_TASK), &tsk->dl
+
+static inline u8 get_server_type(struct task_struct *tsk)
+{
+	if (tsk->policy == SCHED_NORMAL || tsk->policy == SCHED_EXT ||
+	    tsk->policy == SCHED_BATCH || tsk->policy == SCHED_IDLE)
+		return task_is_scx_enabled(tsk) ? DL_SERVER_EXT : DL_SERVER_FAIR;
+	return DL_OTHER;
+}
+
+static inline int extract_params(struct pt_regs *regs, long id, pid_t *pid_out)
+{
+	size_t size = offsetofend(struct sched_attr, sched_flags);
+	struct sched_attr __user *uattr, attr;
+	int new_policy = -1, ret;
+	unsigned long args[6];
+
+	switch (id) {
+	case __NR_sched_setscheduler:
+		syscall_get_arguments(current, regs, args);
+		*pid_out = args[0];
+		new_policy = args[1];
+		break;
+	case __NR_sched_setattr:
+		syscall_get_arguments(current, regs, args);
+		*pid_out = args[0];
+		uattr = (struct sched_attr __user *)args[1];
+		/*
+		 * Just copy up to sched_flags, we are not interested after that
+		 */
+		ret = copy_struct_from_user(&attr, size, uattr, size);
+		if (ret)
+			return ret;
+		if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
+			return -EINVAL;
+		new_policy = attr.sched_policy;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return new_policy & ~SCHED_RESET_ON_FORK;
+}
+
+/* Helper functions requiring DA/HA utilities */
+#ifdef RV_MON_TYPE
+
+/*
+ * get_server - get the server of type @type associated to a task
+ *
+ * If the task is a boosted task, the server is available in the task_struct,
+ * otherwise grab the dl entity saved for the CPU where the task is enqueued.
+ * This function assumes the task is enqueued somewhere.
+ */
+static inline struct sched_dl_entity *get_server(struct task_struct *tsk, u8 type)
+{
+	if (tsk->dl_server && get_server_type(tsk) == type)
+		return tsk->dl_server;
+	if (type == DL_SERVER_FAIR)
+		return da_get_target_by_id(fair_server_id(task_cpu(tsk)));
+	if (type == DL_SERVER_EXT)
+		return da_get_target_by_id(ext_server_id(task_cpu(tsk)));
+	return NULL;
+}
+
+/*
+ * Initialise monitors for all tasks and pre-allocate the storage for servers.
+ * This is necessary since we don't have access to the servers here and
+ * allocation can cause deadlocks from their tracepoints. We can only fill
+ * pre-initialised storage from there.
+ */
+static inline int init_storage(bool skip_tasks)
+{
+	struct task_struct *g, *p;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		if (!da_create_empty_storage(fair_server_id(cpu)))
+			goto fail;
+		if (IS_ENABLED(CONFIG_SCHED_CLASS_EXT) &&
+		    !da_create_empty_storage(ext_server_id(cpu)))
+			goto fail;
+	}
+
+	if (skip_tasks)
+		return 0;
+
+	read_lock(&tasklist_lock);
+	for_each_process_thread(g, p) {
+		if (p->policy == SCHED_DEADLINE) {
+			if (!da_create_storage(EXPAND_ID_TASK(p), NULL)) {
+				read_unlock(&tasklist_lock);
+				goto fail;
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+	return 0;
+
+fail:
+	da_monitor_destroy();
+	return -ENOMEM;
+}
+
+static void __maybe_unused handle_newtask(void *data, struct task_struct *task, u64 flags)
+{
+	/* Might be superfluous as tasks are not started with this policy.. */
+	if (task->policy == SCHED_DEADLINE)
+		da_create_storage(EXPAND_ID_TASK(task), NULL);
+}
+
+static void __maybe_unused handle_exit(void *data, struct task_struct *p, bool group_dead)
+{
+	if (p->policy == SCHED_DEADLINE)
+		da_destroy_storage(get_entity_id(&p->dl, DL_TASK, DL_TASK));
+}
+
+#endif
diff --git a/kernel/trace/rv/monitors/nomiss/Kconfig b/kernel/trace/rv/monitors/nomiss/Kconfig
new file mode 100644
index 000000000000..e1886c3a0dd9
--- /dev/null
+++ b/kernel/trace/rv/monitors/nomiss/Kconfig
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+config RV_MON_NOMISS
+	depends on RV
+	depends on HAVE_SYSCALL_TRACEPOINTS
+	depends on RV_MON_DEADLINE
+	default y
+	select HA_MON_EVENTS_ID
+	bool "nomiss monitor"
+	help
+	  Monitor to ensure dl entities run to completion before their deadline.
+	  This monitor is part of the deadline monitors collection.
+
+	  For further information, see:
+	    Documentation/trace/rv/monitor_deadline.rst
diff --git a/kernel/trace/rv/monitors/nomiss/nomiss.c b/kernel/trace/rv/monitors/nomiss/nomiss.c
new file mode 100644
index 000000000000..31f90f3638d8
--- /dev/null
+++ b/kernel/trace/rv/monitors/nomiss/nomiss.c
@@ -0,0 +1,293 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/ftrace.h>
+#include <linux/tracepoint.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/rv.h>
+#include <rv/instrumentation.h>
+
+#define MODULE_NAME "nomiss"
+
+#include <uapi/linux/sched/types.h>
+#include <trace/events/syscalls.h>
+#include <trace/events/sched.h>
+#include <trace/events/task.h>
+#include <rv_trace.h>
+
+#define RV_MON_TYPE RV_MON_PER_OBJ
+#define HA_TIMER_TYPE HA_TIMER_WHEEL
+/* The start condition is on sched_switch, it's dangerous to allocate there */
+#define DA_SKIP_AUTO_ALLOC
+typedef struct sched_dl_entity *monitor_target;
+#include "nomiss.h"
+#include <rv/ha_monitor.h>
+#include <monitors/deadline/deadline.h>
+
+/*
+ * User configurable deadline threshold. If the total utilisation of deadline
+ * tasks is larger than 1, they are only guaranteed bounded tardiness. See
+ * Documentation/scheduler/sched-deadline.rst for more details.
+ * The minimum tardiness without sched_feat(HRTICK_DL) is 1 tick to accommodate
+ * for throttle enforced on the next tick.
+ */
+static u64 deadline_thresh = TICK_NSEC;
+module_param(deadline_thresh, ullong, 0644);
+#define DEADLINE_NS(ha_mon) (ha_get_target(ha_mon)->dl_deadline + deadline_thresh)
+
+static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_nomiss env, u64 time_ns)
+{
+	if (env == clk_nomiss)
+		return ha_get_clk_ns(ha_mon, env, time_ns);
+	else if (env == is_constr_dl_nomiss)
+		return !dl_is_implicit(ha_get_target(ha_mon));
+	else if (env == is_defer_nomiss)
+		return ha_get_target(ha_mon)->dl_defer;
+	return ENV_INVALID_VALUE;
+}
+
+static void ha_reset_env(struct ha_monitor *ha_mon, enum envs_nomiss env, u64 time_ns)
+{
+	if (env == clk_nomiss)
+		ha_reset_clk_ns(ha_mon, env, time_ns);
+}
+
+static inline bool ha_verify_invariants(struct ha_monitor *ha_mon,
+					enum states curr_state, enum events event,
+					enum states next_state, u64 time_ns)
+{
+	if (curr_state == ready_nomiss)
+		return ha_check_invariant_ns(ha_mon, clk_nomiss, time_ns);
+	else if (curr_state == running_nomiss)
+		return ha_check_invariant_ns(ha_mon, clk_nomiss, time_ns);
+	return true;
+}
+
+static inline void ha_convert_inv_guard(struct ha_monitor *ha_mon,
+					enum states curr_state, enum events event,
+					enum states next_state, u64 time_ns)
+{
+	if (curr_state == next_state)
+		return;
+	if (curr_state == ready_nomiss)
+		ha_inv_to_guard(ha_mon, clk_nomiss, DEADLINE_NS(ha_mon), time_ns);
+	else if (curr_state == running_nomiss)
+		ha_inv_to_guard(ha_mon, clk_nomiss, DEADLINE_NS(ha_mon), time_ns);
+}
+
+static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
+				    enum states curr_state, enum events event,
+				    enum states next_state, u64 time_ns)
+{
+	bool res = true;
+
+	if (curr_state == ready_nomiss && event == dl_replenish_nomiss)
+		ha_reset_env(ha_mon, clk_nomiss, time_ns);
+	else if (curr_state == ready_nomiss && event == dl_throttle_nomiss)
+		res = ha_get_env(ha_mon, is_defer_nomiss, time_ns) == 1ull;
+	else if (curr_state == idle_nomiss && event == dl_replenish_nomiss)
+		ha_reset_env(ha_mon, clk_nomiss, time_ns);
+	else if (curr_state == running_nomiss && event == dl_replenish_nomiss)
+		ha_reset_env(ha_mon, clk_nomiss, time_ns);
+	else if (curr_state == sleeping_nomiss && event == dl_replenish_nomiss)
+		ha_reset_env(ha_mon, clk_nomiss, time_ns);
+	else if (curr_state == sleeping_nomiss && event == dl_throttle_nomiss)
+		res = ha_get_env(ha_mon, is_constr_dl_nomiss, time_ns) == 1ull ||
+		      ha_get_env(ha_mon, is_defer_nomiss, time_ns) == 1ull;
+	else if (curr_state == throttled_nomiss && event == dl_replenish_nomiss)
+		ha_reset_env(ha_mon, clk_nomiss, time_ns);
+	return res;
+}
+
+static inline void ha_setup_invariants(struct ha_monitor *ha_mon,
+				       enum states curr_state, enum events event,
+				       enum states next_state, u64 time_ns)
+{
+	if (next_state == curr_state && event != dl_replenish_nomiss)
+		return;
+	if (next_state == ready_nomiss)
+		ha_start_timer_ns(ha_mon, clk_nomiss, DEADLINE_NS(ha_mon), time_ns);
+	else if (next_state == running_nomiss)
+		ha_start_timer_ns(ha_mon, clk_nomiss, DEADLINE_NS(ha_mon), time_ns);
+	else if (curr_state == ready_nomiss)
+		ha_cancel_timer(ha_mon);
+	else if (curr_state == running_nomiss)
+		ha_cancel_timer(ha_mon);
+}
+
+static bool ha_verify_constraint(struct ha_monitor *ha_mon,
+				 enum states curr_state, enum events event,
+				 enum states next_state, u64 time_ns)
+{
+	if (!ha_verify_invariants(ha_mon, curr_state, event, next_state, time_ns))
+		return false;
+
+	ha_convert_inv_guard(ha_mon, curr_state, event, next_state, time_ns);
+
+	if (!ha_verify_guards(ha_mon, curr_state, event, next_state, time_ns))
+		return false;
+
+	ha_setup_invariants(ha_mon, curr_state, event, next_state, time_ns);
+
+	return true;
+}
+
+static void handle_dl_replenish(void *data, struct sched_dl_entity *dl_se,
+				int cpu, u8 type)
+{
+	if (is_supported_type(type))
+		da_handle_event(EXPAND_ID(dl_se, cpu, type), dl_replenish_nomiss);
+}
+
+static void handle_dl_throttle(void *data, struct sched_dl_entity *dl_se,
+			       int cpu, u8 type)
+{
+	if (is_supported_type(type))
+		da_handle_event(EXPAND_ID(dl_se, cpu, type), dl_throttle_nomiss);
+}
+
+static void handle_dl_server_stop(void *data, struct sched_dl_entity *dl_se,
+				  int cpu, u8 type)
+{
+	/*
+	 * This isn't the standard use of da_handle_start_run_event since this
+	 * event can occur not only from the initial state.
+	 * It is fine to use here because it always brings to a known state and
+	 * the fact we "pretend" the transition starts from the initial state
+	 * has no side effect.
+	 */
+	if (is_supported_type(type))
+		da_handle_start_run_event(EXPAND_ID(dl_se, cpu, type), dl_server_stop_nomiss);
+}
+
+static inline void handle_server_switch(struct task_struct *next, int cpu, u8 type)
+{
+	struct sched_dl_entity *dl_se = get_server(next, type);
+
+	if (dl_se && is_idle_task(next))
+		da_handle_event(EXPAND_ID(dl_se, cpu, type), dl_server_idle_nomiss);
+}
+
+static void handle_sched_switch(void *data, bool preempt,
+				struct task_struct *prev,
+				struct task_struct *next,
+				unsigned int prev_state)
+{
+	int cpu = task_cpu(next);
+
+	if (prev_state != TASK_RUNNING && !preempt && prev->policy == SCHED_DEADLINE)
+		da_handle_event(EXPAND_ID_TASK(prev), sched_switch_suspend_nomiss);
+	if (next->policy == SCHED_DEADLINE)
+		da_handle_start_run_event(EXPAND_ID_TASK(next), sched_switch_in_nomiss);
+
+	/*
+	 * The server is available in next only if the next task is boosted,
+	 * otherwise we need to retrieve it.
+	 * Here the server continues in the state running/armed until actually
+	 * stopped, this works since we continue expecting a throttle.
+	 */
+	if (next->dl_server)
+		da_handle_start_event(EXPAND_ID(next->dl_server, cpu,
+						get_server_type(next)),
+				      sched_switch_in_nomiss);
+	else {
+		handle_server_switch(next, cpu, DL_SERVER_FAIR);
+		if (IS_ENABLED(CONFIG_SCHED_CLASS_EXT))
+			handle_server_switch(next, cpu, DL_SERVER_EXT);
+	}
+}
+
+static void handle_sys_enter(void *data, struct pt_regs *regs, long id)
+{
+	struct task_struct *p;
+	int new_policy = -1;
+	pid_t pid = 0;
+
+	new_policy = extract_params(regs, id, &pid);
+	if (new_policy < 0)
+		return;
+	guard(rcu)();
+	p = pid ? find_task_by_vpid(pid) : current;
+	if (unlikely(!p) || new_policy == p->policy)
+		return;
+
+	if (p->policy == SCHED_DEADLINE)
+		da_reset(EXPAND_ID_TASK(p));
+	else if (new_policy == SCHED_DEADLINE)
+		da_create_or_get(EXPAND_ID_TASK(p));
+}
+
+static void handle_sched_wakeup(void *data, struct task_struct *tsk)
+{
+	if (tsk->policy == SCHED_DEADLINE)
+		da_handle_event(EXPAND_ID_TASK(tsk), sched_wakeup_nomiss);
+}
+
+static int enable_nomiss(void)
+{
+	int retval;
+
+	retval = da_monitor_init();
+	if (retval)
+		return retval;
+
+	retval = init_storage(false);
+	if (retval)
+		return retval;
+	rv_attach_trace_probe("nomiss", sched_dl_replenish_tp, handle_dl_replenish);
+	rv_attach_trace_probe("nomiss", sched_dl_throttle_tp, handle_dl_throttle);
+	rv_attach_trace_probe("nomiss", sched_dl_server_stop_tp, handle_dl_server_stop);
+	rv_attach_trace_probe("nomiss", sched_switch, handle_sched_switch);
+	rv_attach_trace_probe("nomiss", sched_wakeup, handle_sched_wakeup);
+	if (!should_skip_syscall_handle())
+		rv_attach_trace_probe("nomiss", sys_enter, handle_sys_enter);
+	rv_attach_trace_probe("nomiss", task_newtask, handle_newtask);
+	rv_attach_trace_probe("nomiss", sched_process_exit, handle_exit);
+
+	return 0;
+}
+
+static void disable_nomiss(void)
+{
+	rv_this.enabled = 0;
+
+	/* Those are RCU writers, detach earlier hoping to close a bit faster */
+	rv_detach_trace_probe("nomiss", task_newtask, handle_newtask);
+	rv_detach_trace_probe("nomiss", sched_process_exit, handle_exit);
+	if (!should_skip_syscall_handle())
+		rv_detach_trace_probe("nomiss", sys_enter, handle_sys_enter);
+
+	rv_detach_trace_probe("nomiss", sched_dl_replenish_tp, handle_dl_replenish);
+	rv_detach_trace_probe("nomiss", sched_dl_throttle_tp, handle_dl_throttle);
+	rv_detach_trace_probe("nomiss", sched_dl_server_stop_tp, handle_dl_server_stop);
+	rv_detach_trace_probe("nomiss", sched_switch, handle_sched_switch);
+	rv_detach_trace_probe("nomiss", sched_wakeup, handle_sched_wakeup);
+
+	da_monitor_destroy();
+}
+
+static struct rv_monitor rv_this = {
+	.name = "nomiss",
+	.description = "dl entities run to completion before their deadline.",
+	.enable = enable_nomiss,
+	.disable = disable_nomiss,
+	.reset = da_monitor_reset_all,
+	.enabled = 0,
+};
+
+static int __init register_nomiss(void)
+{
+	return rv_register_monitor(&rv_this, &rv_deadline);
+}
+
+static void __exit unregister_nomiss(void)
+{
+	rv_unregister_monitor(&rv_this);
+}
+
+module_init(register_nomiss);
+module_exit(unregister_nomiss);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Gabriele Monaco <gmonaco@redhat.com>");
+MODULE_DESCRIPTION("nomiss: dl entities run to completion before their deadline.");
diff --git a/kernel/trace/rv/monitors/nomiss/nomiss.h b/kernel/trace/rv/monitors/nomiss/nomiss.h
new file mode 100644
index 000000000000..3d1b436194d7
--- /dev/null
+++ b/kernel/trace/rv/monitors/nomiss/nomiss.h
@@ -0,0 +1,123 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Automatically generated C representation of nomiss automaton
+ * For further information about this format, see kernel documentation:
+ *   Documentation/trace/rv/deterministic_automata.rst
+ */
+
+#define MONITOR_NAME nomiss
+
+enum states_nomiss {
+	ready_nomiss,
+	idle_nomiss,
+	running_nomiss,
+	sleeping_nomiss,
+	throttled_nomiss,
+	state_max_nomiss,
+};
+
+#define INVALID_STATE state_max_nomiss
+
+enum events_nomiss {
+	dl_replenish_nomiss,
+	dl_server_idle_nomiss,
+	dl_server_stop_nomiss,
+	dl_throttle_nomiss,
+	sched_switch_in_nomiss,
+	sched_switch_suspend_nomiss,
+	sched_wakeup_nomiss,
+	event_max_nomiss,
+};
+
+enum envs_nomiss {
+	clk_nomiss,
+	is_constr_dl_nomiss,
+	is_defer_nomiss,
+	env_max_nomiss,
+	env_max_stored_nomiss = is_constr_dl_nomiss,
+};
+
+_Static_assert(env_max_stored_nomiss <= MAX_HA_ENV_LEN, "Not enough slots");
+#define HA_CLK_NS
+
+struct automaton_nomiss {
+	char *state_names[state_max_nomiss];
+	char *event_names[event_max_nomiss];
+	char *env_names[env_max_nomiss];
+	unsigned char function[state_max_nomiss][event_max_nomiss];
+	unsigned char initial_state;
+	bool final_states[state_max_nomiss];
+};
+
+static const struct automaton_nomiss automaton_nomiss = {
+	.state_names = {
+		"ready",
+		"idle",
+		"running",
+		"sleeping",
+		"throttled",
+	},
+	.event_names = {
+		"dl_replenish",
+		"dl_server_idle",
+		"dl_server_stop",
+		"dl_throttle",
+		"sched_switch_in",
+		"sched_switch_suspend",
+		"sched_wakeup",
+	},
+	.env_names = {
+		"clk",
+		"is_constr_dl",
+		"is_defer",
+	},
+	.function = {
+		{
+			ready_nomiss,
+			idle_nomiss,
+			sleeping_nomiss,
+			throttled_nomiss,
+			running_nomiss,
+			INVALID_STATE,
+			ready_nomiss,
+		},
+		{
+			ready_nomiss,
+			idle_nomiss,
+			sleeping_nomiss,
+			throttled_nomiss,
+			running_nomiss,
+			INVALID_STATE,
+			INVALID_STATE,
+		},
+		{
+			running_nomiss,
+			idle_nomiss,
+			sleeping_nomiss,
+			throttled_nomiss,
+			running_nomiss,
+			sleeping_nomiss,
+			running_nomiss,
+		},
+		{
+			ready_nomiss,
+			sleeping_nomiss,
+			sleeping_nomiss,
+			throttled_nomiss,
+			running_nomiss,
+			INVALID_STATE,
+			ready_nomiss,
+		},
+		{
+			ready_nomiss,
+			throttled_nomiss,
+			INVALID_STATE,
+			throttled_nomiss,
+			INVALID_STATE,
+			throttled_nomiss,
+			throttled_nomiss,
+		},
+	},
+	.initial_state = ready_nomiss,
+	.final_states = { 1, 0, 0, 0, 0 },
+};
diff --git a/kernel/trace/rv/monitors/nomiss/nomiss_trace.h b/kernel/trace/rv/monitors/nomiss/nomiss_trace.h
new file mode 100644
index 000000000000..42e7efaca4e7
--- /dev/null
+++ b/kernel/trace/rv/monitors/nomiss/nomiss_trace.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Snippet to be included in rv_trace.h
+ */
+
+#ifdef CONFIG_RV_MON_NOMISS
+DEFINE_EVENT(event_da_monitor_id, event_nomiss,
+	     TP_PROTO(int id, char *state, char *event, char *next_state, bool final_state),
+	     TP_ARGS(id, state, event, next_state, final_state));
+
+DEFINE_EVENT(error_da_monitor_id, error_nomiss,
+	     TP_PROTO(int id, char *state, char *event),
+	     TP_ARGS(id, state, event));
+
+DEFINE_EVENT(error_env_da_monitor_id, error_env_nomiss,
+	     TP_PROTO(int id, char *state, char *event, char *env),
+	     TP_ARGS(id, state, event, env));
+#endif /* CONFIG_RV_MON_NOMISS */
diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
index 9e8072d863a2..9622c269789c 100644
--- a/kernel/trace/rv/rv_trace.h
+++ b/kernel/trace/rv/rv_trace.h
@@ -188,6 +188,7 @@ DECLARE_EVENT_CLASS(error_env_da_monitor_id,
 );
 
 #include <monitors/stall/stall_trace.h>
+#include <monitors/nomiss/nomiss_trace.h>
 // Add new monitors based on CONFIG_HA_MON_EVENTS_ID here
 
 #endif
diff --git a/tools/verification/models/deadline/nomiss.dot b/tools/verification/models/deadline/nomiss.dot
new file mode 100644
index 000000000000..fd1ea6bf2509
--- /dev/null
+++ b/tools/verification/models/deadline/nomiss.dot
@@ -0,0 +1,41 @@
+digraph state_automaton {
+	center = true;
+	size = "7,11";
+	{node [shape = circle] "idle"};
+	{node [shape = plaintext, style=invis, label=""] "__init_ready"};
+	{node [shape = doublecircle] "ready"};
+	{node [shape = circle] "ready"};
+	{node [shape = circle] "running"};
+	{node [shape = circle] "sleeping"};
+	{node [shape = circle] "throttled"};
+	"__init_ready" -> "ready";
+	"idle" [label = "idle"];
+	"idle" -> "idle" [ label = "dl_server_idle" ];
+	"idle" -> "ready" [ label = "dl_replenish;reset(clk)" ];
+	"idle" -> "running" [ label = "sched_switch_in" ];
+	"idle" -> "sleeping" [ label = "dl_server_stop" ];
+	"idle" -> "throttled" [ label = "dl_throttle" ];
+	"ready" [label = "ready\nclk < DEADLINE_NS()", color = green3];
+	"ready" -> "idle" [ label = "dl_server_idle" ];
+	"ready" -> "ready" [ label = "sched_wakeup\ndl_replenish;reset(clk)" ];
+	"ready" -> "running" [ label = "sched_switch_in" ];
+	"ready" -> "sleeping" [ label = "dl_server_stop" ];
+	"ready" -> "throttled" [ label = "dl_throttle;is_defer == 1" ];
+	"running" [label = "running\nclk < DEADLINE_NS()"];
+	"running" -> "idle" [ label = "dl_server_idle" ];
+	"running" -> "running" [ label = "dl_replenish;reset(clk)\nsched_switch_in\nsched_wakeup" ];
+	"running" -> "sleeping" [ label = "sched_switch_suspend\ndl_server_stop" ];
+	"running" -> "throttled" [ label = "dl_throttle" ];
+	"sleeping" [label = "sleeping"];
+	"sleeping" -> "ready" [ label = "sched_wakeup\ndl_replenish;reset(clk)" ];
+	"sleeping" -> "running" [ label = "sched_switch_in" ];
+	"sleeping" -> "sleeping" [ label = "dl_server_stop\ndl_server_idle" ];
+	"sleeping" -> "throttled" [ label = "dl_throttle;is_constr_dl == 1 || is_defer == 1" ];
+	"throttled" [label = "throttled"];
+	"throttled" -> "ready" [ label = "dl_replenish;reset(clk)" ];
+	"throttled" -> "throttled" [ label = "sched_switch_suspend\nsched_wakeup\ndl_server_idle\ndl_throttle" ];
+	{ rank = min ;
+		"__init_ready";
+		"ready";
+	}
+}
-- 
2.53.0


^ permalink raw reply related

* [PATCH AUTOSEL 6.19-5.10] btrfs: tracepoints: get correct superblock from dentry in event btrfs_sync_file()
From: Sasha Levin @ 2026-03-30 12:38 UTC (permalink / raw)
  To: patches, stable
  Cc: Goldwyn Rodrigues, Boris Burkov, Goldwyn Rodrigues, David Sterba,
	Sasha Levin, clm, rostedt, mhiramat, linux-btrfs, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260330123842.756154-1-sashal@kernel.org>

From: Goldwyn Rodrigues <rgoldwyn@suse.de>

[ Upstream commit a85b46db143fda5869e7d8df8f258ccef5fa1719 ]

If overlay is used on top of btrfs, dentry->d_sb translates to overlay's
super block and fsid assignment will lead to a crash.

Use file_inode(file)->i_sb to always get btrfs_sb.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have all the evidence. Here is the complete analysis.

---

## PHASE 1: COMMIT MESSAGE FORENSICS

**Step 1.1: Subject Line**
Record: [btrfs: tracepoints] [get correct / fix] [Fix incorrect
superblock derivation in the `btrfs_sync_file` trace event when
overlayfs is stacked on btrfs]

**Step 1.2: Tags**
- Reviewed-by: Boris Burkov <boris@bur.io> (btrfs developer/reviewer)
- Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> (author, active
  btrfs contributor)
- Signed-off-by: David Sterba <dsterba@suse.com> (btrfs co-maintainer,
  committer)
- No Fixes: tag (expected for commits under manual review)
- No Reported-by:, Tested-by:, Link:, or Cc: stable tags

Record: Reviewed by btrfs developer, committed by btrfs maintainer. No
Fixes: or Cc: stable (expected for manually reviewed candidates).

**Step 1.3: Commit Body**
Record: Bug: when overlayfs is used on top of btrfs, `dentry->d_sb` in
the tracepoint resolves to the overlay superblock, not btrfs'. The
`btrfs_sb()` inline function then treats the overlay's `s_fs_info` as
`struct btrfs_fs_info *`, and `TP_fast_assign_fsid` dereferences
`fs_info->fs_devices->fsid`—accessing completely invalid memory.
Symptom: kernel crash. Fix: use `file_inode(file)->i_sb` to always get
the btrfs superblock.

**Step 1.4: Hidden Bug Fix**
Record: Not hidden—this is an explicit crash fix. The commit message
directly states "will lead to a crash."

---

## PHASE 2: DIFF ANALYSIS

**Step 2.1: Inventory**
Record: 1 file changed: `include/trace/events/btrfs.h`. Per the
diffstat, 7 lines added and 4 removed within the `TP_fast_assign` block
of `TRACE_EVENT(btrfs_sync_file)`. Single-file, surgical fix.

**Step 2.2: Code Flow Change**

Before:
```c
const struct dentry *dentry = file->f_path.dentry;
const struct inode *inode = d_inode(dentry);
TP_fast_assign_fsid(btrfs_sb(file->f_path.dentry->d_sb));
__entry->parent = btrfs_ino(BTRFS_I(d_inode(dentry->d_parent)));
```

After:
```c
struct dentry *dentry = file_dentry(file);
struct inode *inode = file_inode(file);
struct dentry *parent = dget_parent(dentry);
struct inode *parent_inode = d_inode(parent);
dput(parent);
TP_fast_assign_fsid(btrfs_sb(inode->i_sb));
__entry->parent = btrfs_ino(BTRFS_I(parent_inode));
```

Three independent improvements:
1. **Critical crash fix**: `file->f_path.dentry->d_sb` → `inode->i_sb`
   for the fsid assignment
2. **Correctness**: `file->f_path.dentry` → `file_dentry(file)` and
   `d_inode(dentry)` → `file_inode(file)` (overlay-safe helpers)
3. **Safety**: parent dentry now accessed via `dget_parent()`/`dput()`
   (proper reference counting)

Record: Single hunk, tracepoint-only path, three small correctness
improvements.

**Step 2.3: Bug Mechanism**
Verified:
- `btrfs_sb(sb)` returns `sb->s_fs_info` (`fs/btrfs/super.h` line 21–24)
- `TP_fast_assign_fsid(fs_info)` does `memcpy(__entry->fsid,
  fs_info->fs_devices->fsid, BTRFS_FSID_SIZE)` (line 163–170)
- Overlayfs stores `struct ovl_fs *` in `sb->s_fs_info`
  (`fs/overlayfs/ovl_entry.h` line 115–121)
- When overlay sb is passed to `btrfs_sb()`, the returned pointer is not
  a `btrfs_fs_info`; dereferencing `->fs_devices->fsid` accesses invalid
  memory → crash

Record: [Type confusion via wrong superblock] [overlay's `s_fs_info`
interpreted as `btrfs_fs_info *`, then invalid dereference of
`fs_devices->fsid`]
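
The mechanism can be illustrated with a small userspace sketch. This is
not kernel code: every `mock_*` type and function below is a hypothetical
stand-in with a simplified layout, and the scenario (dentry pointing at
the overlay superblock, inode pointing at the btrfs superblock) is the one
described in the commit message. It only shows why an untyped
`s_fs_info` pointer makes the dentry-based lookup unsafe while the
inode-based lookup is not.

```c
#include <assert.h>
#include <string.h>

#define MOCK_FSID_SIZE 16

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct mock_fs_devices { unsigned char fsid[MOCK_FSID_SIZE]; };
struct mock_btrfs_fs_info { struct mock_fs_devices *fs_devices; };
struct mock_ovl_fs { void *upper_mnt; };	/* unrelated layout */

struct mock_super_block { void *s_fs_info; };	/* untyped pointer */
struct mock_inode { struct mock_super_block *i_sb; };
struct mock_dentry { struct mock_super_block *d_sb; };
struct mock_file { struct mock_dentry *dentry; struct mock_inode *inode; };

/* Same shape as btrfs_sb(): returns s_fs_info with no type check. */
static struct mock_btrfs_fs_info *mock_btrfs_sb(struct mock_super_block *sb)
{
	return sb->s_fs_info;
}

/* Pre-fix lookup: superblock taken from the file's dentry. */
static struct mock_super_block *sb_via_dentry(struct mock_file *f)
{
	return f->dentry->d_sb;
}

/* Post-fix lookup: superblock taken from the file's inode. */
static struct mock_super_block *sb_via_inode(struct mock_file *f)
{
	return f->inode->i_sb;
}

/* Same shape as the fsid memcpy; only valid for genuine btrfs info. */
static void mock_assign_fsid(unsigned char *dst, struct mock_btrfs_fs_info *info)
{
	assert(info && info->fs_devices);
	memcpy(dst, info->fs_devices->fsid, MOCK_FSID_SIZE);
}

/* All objects needed to model "overlay stacked on btrfs". */
struct scenario {
	struct mock_fs_devices devs;
	struct mock_btrfs_fs_info btrfs_info;
	struct mock_ovl_fs ovl_info;
	struct mock_super_block btrfs_sb, ovl_sb;
	struct mock_inode inode;
	struct mock_dentry dentry;
	struct mock_file file;
};

static void setup_overlay_on_btrfs(struct scenario *s)
{
	memset(s, 0, sizeof(*s));
	memset(s->devs.fsid, 0xab, MOCK_FSID_SIZE);
	s->btrfs_info.fs_devices = &s->devs;
	s->btrfs_sb.s_fs_info = &s->btrfs_info;
	s->ovl_sb.s_fs_info = &s->ovl_info;	/* not a btrfs_fs_info */
	s->inode.i_sb = &s->btrfs_sb;		/* inode belongs to btrfs */
	s->dentry.d_sb = &s->ovl_sb;		/* dentry belongs to overlay */
	s->file.dentry = &s->dentry;
	s->file.inode = &s->inode;
}
```

After `setup_overlay_on_btrfs()`, passing `sb_via_inode()` through
`mock_btrfs_sb()` yields the real btrfs info and `mock_assign_fsid()` is
safe. Passing `sb_via_dentry()` instead yields a pointer to the overlay
structure, which must not be dereferenced as btrfs info.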

**Step 2.4: Fix Quality**
Record: Obviously correct—this is the only btrfs tracepoint using
`file->f_path.dentry->d_sb`; all others already use `inode->i_sb`. Fix
aligns this tracepoint with the established pattern. Very low regression
risk: changes only tracepoint data assignment code.

---

## PHASE 3: GIT HISTORY INVESTIGATION

**Step 3.1: Blame**
Verified via `git blame -L 771,779`:
- The buggy fsid line
  (`TP_fast_assign_fsid(btrfs_sb(file->f_path.dentry->d_sb))`)
  introduced in commit `bc074524e123de` (Jeff Mahoney, 2016-06-09,
  "btrfs: prefix fsid to all trace events")
- `git describe --contains bc074524e123de` → `v4.8-rc1~38^2~1^2~12`
- Bug has been present since **v4.8-rc1** — all currently active stable
  trees are affected

Record: [Bug introduced in bc074524e123de, first in v4.8-rc1, present
since 2016 in all active stable trees]

**Step 3.2: Fixes Tag**
Record: No Fixes: tag present. The implicit target is bc074524e123de.

**Step 3.3: File History**
Verified via `git log --oneline -20 -- include/trace/events/btrfs.h`:
- Related prior fix: `f157dd661339f` ("btrfs: fix NULL dereference on
  root when tracing inode eviction") — a different tracepoint crash fix
  in the same file
- Historical related fix: `de17e793b104d` ("btrfs: fix crash/invalid
  memory access on fsync when using overlayfs") — this fixed the **core
  `btrfs_sync_file()` function** for the same overlayfs class of bug,
  but did NOT fix the tracepoint. The current commit completes that
  work.
- The historical commit includes a full oops trace showing the exact
  crash scenario

Record: [Standalone fix. Historical `de17e793b104d` fixed the fsync
function itself but left the tracepoint buggy. This commit completes
that fix.]

**Step 3.4: Author**
Verified: Goldwyn Rodrigues has 10+ btrfs commits including folio
conversions and core btrfs work. David Sterba is a listed btrfs
maintainer.
Record: [Author is established btrfs contributor from SUSE; committed by
btrfs maintainer]

**Step 3.5: Dependencies**
Verified: `file_dentry()`, `file_inode()`, `dget_parent()`, `dput()` all
exist in v5.15, v6.1, v6.6 stable trees.
Record: [No dependencies. All required helper APIs confirmed present in
stable trees.]

---

## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH

**Step 4.1: Lore Search**
Record: lore.kernel.org returned Anubis anti-bot challenge — exact patch
thread not verified.

**Step 4.2: Bug Report**
Verified: The historical commit `de17e793b104d` includes a full kernel
oops trace from `btrfs_sync_file` when using overlayfs. This establishes
that overlayfs+btrfs fsync crashes are a known, real-world class of bug.
The current tracepoint fix addresses the remaining instance of the same
pattern.
Record: [Real-world crash reports documented in historical commit
de17e793b104d with full stack trace]

**Step 4.3–4.4: Related Patches / Stable History**
Record: Could not verify lore threads. No evidence of prior stable
selection for this specific tracepoint fix.

---

## PHASE 5: CODE SEMANTIC ANALYSIS

**Step 5.1: Functions Modified**
Record: `TRACE_EVENT(btrfs_sync_file)` — specifically its
`TP_fast_assign` block

**Step 5.2: Callers**
Verified: `trace_btrfs_sync_file(file, datasync)` called from exactly
one place: `fs/btrfs/file.c:1578` inside `btrfs_sync_file()`.

**Step 5.3–5.4: Call Chain / Reachability**
Verified complete path:
- `fsync(2)` / `fdatasync(2)` → `do_fsync()` → `vfs_fsync()` →
  `vfs_fsync_range()` → `btrfs_sync_file()` → `trace_btrfs_sync_file()`
- Overlayfs path: `ovl_fsync()` (line 441 of `fs/overlayfs/file.c`) →
  `vfs_fsync_range(upperfile, ...)` → `btrfs_sync_file()` →
  `trace_btrfs_sync_file()`
- The tracepoint body executes only when the `btrfs_sync_file`
  tracepoint is enabled (static key gated)

Record: [Directly reachable from userspace fsync() syscall. Overlayfs
path confirmed via ovl_fsync(). Tracepoint gated by static key.]
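
The gating can be pictured with a minimal userspace analogy. All names
below are hypothetical, and a plain boolean stands in for the kernel's
static key (which patches a branch rather than testing a variable). The
sketch only shows that the sync path always reaches the trace call, while
the assignment body, where the bad dereference lived, runs only when the
event is enabled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the tracepoint's static key. */
static bool mock_tracepoint_enabled;

/* Counts how often the assignment body actually ran. */
static int mock_assign_calls;

/* Stand-in for the TP_fast_assign body. */
static void mock_fast_assign(void)
{
	mock_assign_calls++;
}

/* Stand-in for the trace call: the body runs only when enabled. */
static void mock_trace_sync_file(void)
{
	if (mock_tracepoint_enabled)
		mock_fast_assign();
}

/* Stand-in for the sync path: every call reaches the trace call. */
static void mock_sync_file(void)
{
	mock_trace_sync_file();
}
```

With `mock_tracepoint_enabled` left false, any number of
`mock_sync_file()` calls leaves `mock_assign_calls` at zero. Once it is
set, each call runs the assignment body once.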

**Step 5.5: Similar Patterns**
Verified: `TP_fast_assign_fsid(btrfs_sb(file->f_path.dentry->d_sb))`
appears only once in the entire file — this tracepoint. All other btrfs
tracepoints use `inode->i_sb` or receive `fs_info` directly. This is the
sole inconsistent instance.
Record: [Only tracepoint with this bug pattern; all others already
correct]

---

## PHASE 6: STABLE TREE ANALYSIS

**Step 6.1: Presence in Stable Trees**
Verified via `git cat-file -p`:
- v5.15: buggy line at line 701 ✓
- v6.1: buggy line at line 766 ✓
- v6.6: buggy line at line 795 ✓

Record: [All active stable trees (v5.15, v6.1, v6.6) contain the exact
buggy line]

**Step 6.2: Backport Complications**
Record: Clean apply expected for recent stable trees — same code
structure, same APIs available. Minor line number offsets only.

**Step 6.3: Duplicate Fixes**
Record: No alternative fix for this tracepoint found in any stable tree.
The historical `de17e793b104d` fixed only the function, not the
tracepoint.

---

## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT

**Step 7.1: Subsystem**
Record: btrfs filesystem tracepoints — IMPORTANT subsystem. btrfs is
widely used, especially with overlayfs in container environments
(Docker, Podman).

**Step 7.2: Activity**
Record: Active — `include/trace/events/btrfs.h` has seen 20+ recent
commits including other tracepoint crash fixes.

---

## PHASE 8: IMPACT AND RISK ASSESSMENT

**Step 8.1: Affected Users**
Record: Users running overlayfs on top of btrfs with btrfs tracepoints
enabled. This includes container workloads and debugging/tracing
scenarios on production systems.

**Step 8.2: Trigger Conditions**
Record: Enable `btrfs_sync_file` tracepoint (or all btrfs events) + use
overlayfs on btrfs + any `fsync()`/`fdatasync()` call. Deterministic
when conditions met — not a race.

**Step 8.3: Failure Mode**
Record: Kernel crash / oops from invalid memory access in tracepoint
assignment. Severity: **CRITICAL** when triggered (system crash,
potential data loss from incomplete fsync).

**Step 8.4: Risk-Benefit**
- BENEFIT: HIGH — prevents a deterministic kernel crash in a real,
  userspace-triggerable path
- RISK: VERY LOW — ~10 lines changed in a single tracepoint, using
  established VFS helpers consistent with all other btrfs tracepoints
Record: [Benefit: HIGH, Risk: VERY LOW, Ratio: Excellent for
backporting]

---

## PHASE 9: FINAL SYNTHESIS

**Step 9.1: Evidence Compilation**

FOR backporting:
- Fixes a real kernel crash (type confusion → invalid memory access →
  oops)
- Small, surgical fix: ~10 lines in 1 file, 1 tracepoint
- Obviously correct: aligns with how all other btrfs tracepoints handle
  the superblock
- Bug class verified real via historical commit `de17e793b104d` with
  full crash stack trace
- Reviewed by btrfs developer, committed by btrfs maintainer
- Bug present in ALL active stable trees (v5.15, v6.1, v6.6) — confirmed
- All required helper APIs exist in stable trees — confirmed
- No dependencies on other commits
- Overlayfs call path verified to reach the buggy code

AGAINST backporting:
- Tracepoint must be enabled to trigger (narrower population than core
  path bugs)
- No Tested-by: tag

UNRESOLVED:
- Exact lore.kernel.org patch discussion thread (blocked by Anubis)
- Whether unprivileged users can enable this tracepoint
- `git apply --check` not run against stable branches (but same code
  confirmed present)

**Step 9.2: Stable Rules Checklist**
1. Obviously correct and tested? **YES** — consistent with all other
   btrfs tracepoints; maintainer-reviewed
2. Fixes a real bug? **YES** — kernel crash with overlayfs+btrfs+tracing
3. Important issue? **YES** — kernel oops (CRITICAL)
4. Small and contained? **YES** — ~10 lines, single tracepoint
5. No new features/APIs? **YES** — pure bug fix
6. Applies to stable? **YES** — buggy code and required APIs confirmed
   in v5.15/v6.1/v6.6

**Step 9.3: Exception Categories**
Record: Not applicable. This is a standard bug fix, not a device
ID/quirk/DT exception.

**Step 9.4: Decision**
The fix addresses a real, deterministic kernel crash caused by type
confusion when overlayfs is stacked on btrfs. The crash mechanism is
fully verified: `btrfs_sb()` interprets overlay's `s_fs_info` (an
`ovl_fs *`) as `btrfs_fs_info *`, then `TP_fast_assign_fsid`
dereferences `fs_info->fs_devices->fsid` — accessing garbage memory. The
fix is small, obviously correct, consistent with all other btrfs
tracepoints, and has no dependencies. The bug exists in all active
stable trees. The only limiting factor is that tracepoints must be
enabled, but stable kernels are regularly used with tracing enabled for
support and debugging, and a crash in that scenario is unacceptable.

---

## Verification

- [Phase 1] Parsed tags from commit message: Reviewed-by Boris Burkov,
  SOB from David Sterba (btrfs maintainer), SOB from Goldwyn Rodrigues
  (author). No Fixes:, Reported-by:, Link:, Cc: stable.
- [Phase 2] Read `include/trace/events/btrfs.h` lines 163–170: confirmed
  `TP_fast_assign_fsid` does `memcpy(__entry->fsid,
  fs_info->fs_devices->fsid, BTRFS_FSID_SIZE)`
- [Phase 2] Read `include/trace/events/btrfs.h` lines 771–779: confirmed
  pre-fix code uses `file->f_path.dentry->d_sb`
- [Phase 2] Grep confirmed `btrfs_sb()` returns `sb->s_fs_info` in
  `fs/btrfs/super.h` lines 21–24
- [Phase 2] Read `fs/overlayfs/ovl_entry.h` lines 115–121: confirmed
  `OVL_FS(sb)` casts `sb->s_fs_info` to `struct ovl_fs *` — type
  confusion verified
- [Phase 2] Grep confirmed the buggy
  `TP_fast_assign_fsid(btrfs_sb(file->f_path.dentry->d_sb))` is the ONLY
  such pattern in btrfs.h
- [Phase 3] `git blame -L 771,779`: buggy fsid line from
  `bc074524e123de` (Jeff Mahoney, 2016)
- [Phase 3] `git describe --contains bc074524e123de`:
  `v4.8-rc1~38^2~1^2~12` — bug present since v4.8
- [Phase 3] `git show de17e793b104d`: confirmed historical
  overlayfs+btrfs fsync crash with full kernel oops trace; that fix
  addressed only `btrfs_sync_file()`, NOT the tracepoint
- [Phase 3] `git log --oneline --author="Goldwyn Rodrigues" -10 --
  fs/btrfs`: confirmed active btrfs contributor
- [Phase 4] lore.kernel.org: blocked by Anubis — patch thread UNVERIFIED
- [Phase 5] Grep: `trace_btrfs_sync_file` called from exactly
  `fs/btrfs/file.c:1578`
- [Phase 5] Read `fs/overlayfs/file.c` lines 441–464: confirmed
  `ovl_fsync()` → `vfs_fsync_range(upperfile)` call path
- [Phase 6] `git cat-file -p v5.15:include/trace/events/btrfs.h`: buggy
  line at line 701 ✓
- [Phase 6] `git cat-file -p v6.1:include/trace/events/btrfs.h`: buggy
  line at line 766 ✓
- [Phase 6] `git cat-file -p v6.6:include/trace/events/btrfs.h`: buggy
  line at line 795 ✓
- [Phase 6] Verified `file_inode`, `file_dentry`, `dget_parent` present
  in v5.15, v6.1, v6.6 via `git cat-file` grep
- [Phase 8] Failure mode: kernel oops from invalid memory access in
  tracepoint assignment, severity CRITICAL
- UNVERIFIED: Exact lore patch discussion (Anubis blocked); privilege
  requirements for tracepoint enablement; `git apply --check` on stable
  branches

**YES**

 include/trace/events/btrfs.h | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 125bdc166bfed..0864700f76e0a 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -769,12 +769,15 @@ TRACE_EVENT(btrfs_sync_file,
 	),

 	TP_fast_assign(
-		const struct dentry *dentry = file->f_path.dentry;
-		const struct inode *inode = d_inode(dentry);
+		struct dentry *dentry = file_dentry(file);
+		struct inode *inode = file_inode(file);
+		struct dentry *parent = dget_parent(dentry);
+		struct inode *parent_inode = d_inode(parent);

-		TP_fast_assign_fsid(btrfs_sb(file->f_path.dentry->d_sb));
+		dput(parent);
+		TP_fast_assign_fsid(btrfs_sb(inode->i_sb));
 		__entry->ino		= btrfs_ino(BTRFS_I(inode));
-		__entry->parent		= btrfs_ino(BTRFS_I(d_inode(dentry->d_parent)));
+		__entry->parent		= btrfs_ino(BTRFS_I(parent_inode));
 		__entry->datasync	= datasync;
 		__entry->root_objectid	= btrfs_root_id(BTRFS_I(inode)->root);
 	),
-- 
2.53.0


* [PATCH v14 0/5] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu (Google) @ 2026-03-30 12:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers

Hi,

Here is the 14th version of improvement patches for making persistent
ring buffers robust to failures.
The previous version is here:

https://lore.kernel.org/all/177440549083.1529621.15486836623498328967.stgit@mhiramat.tok.corp.google.com/

This version adds a patch that resets the RB_MISSED_* flags when
validating the persistent ring buffer [4/5]. It also renames the selftest
config to CONFIG_RING_BUFFER_PERSISTENT_INJECT, clears
meta->nr_invalid/entry_bytes after testing, and adds test commands to the
config comment [5/5].

I heard the first 3 patches are already under test but have not been
pushed yet, so I kept them in this version.

Thank you,

---

Masami Hiramatsu (Google) (5):
      ring-buffer: Flush and stop persistent ring buffer on panic
      ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
      ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
      ring-buffer: Reset RB_MISSED_* flags on persistent ring buffer
      ring-buffer: Add persistent ring buffer selftest


 arch/alpha/include/asm/Kbuild        |    1 
 arch/arc/include/asm/Kbuild          |    1 
 arch/arm/include/asm/Kbuild          |    1 
 arch/arm64/include/asm/ring_buffer.h |   10 +
 arch/csky/include/asm/Kbuild         |    1 
 arch/hexagon/include/asm/Kbuild      |    1 
 arch/loongarch/include/asm/Kbuild    |    1 
 arch/m68k/include/asm/Kbuild         |    1 
 arch/microblaze/include/asm/Kbuild   |    1 
 arch/mips/include/asm/Kbuild         |    1 
 arch/nios2/include/asm/Kbuild        |    1 
 arch/openrisc/include/asm/Kbuild     |    1 
 arch/parisc/include/asm/Kbuild       |    1 
 arch/powerpc/include/asm/Kbuild      |    1 
 arch/riscv/include/asm/Kbuild        |    1 
 arch/s390/include/asm/Kbuild         |    1 
 arch/sh/include/asm/Kbuild           |    1 
 arch/sparc/include/asm/Kbuild        |    1 
 arch/um/include/asm/Kbuild           |    1 
 arch/x86/include/asm/Kbuild          |    1 
 arch/xtensa/include/asm/Kbuild       |    1 
 include/asm-generic/ring_buffer.h    |   13 ++
 include/linux/ring_buffer.h          |    1 
 kernel/trace/Kconfig                 |   31 ++++
 kernel/trace/ring_buffer.c           |  240 ++++++++++++++++++++++++++--------
 kernel/trace/trace.c                 |    4 +
 26 files changed, 260 insertions(+), 59 deletions(-)
 create mode 100644 arch/arm64/include/asm/ring_buffer.h
 create mode 100644 include/asm-generic/ring_buffer.h

--
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* [PATCH v14 1/5] ring-buffer: Flush and stop persistent ring buffer on panic
From: Masami Hiramatsu (Google) @ 2026-03-30 12:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers
In-Reply-To: <177487498530.3463592.12715592581212799257.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

On real hardware, panic and machine reboot may not flush hardware cache
to memory. This means the persistent ring buffer, which relies on a
coherent state of memory, may not have its events written to the buffer
and they may be lost. Moreover, there may be inconsistency with the
counters which are used for validation of the integrity of the
persistent ring buffer which may cause all data to be discarded.

To avoid this issue, stop recording of the ring buffer on panic and
flush the cache of the ring buffer's memory.

Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v13:
   - Fix a rebase conflict.
 Changes in v11:
   - Do nothing by default since flush_cache_vmap() does nothing on x86
     but it can cause deadlock on some architectures via on_each_cpu()
     because other CPUs will be stopped when panic notifier is called.
 Changes in v9:
   - Fix typo of & to &&.
   - Fix typo of "Generic"
 Changes in v6:
   - Introduce asm/ring_buffer.h for arch_ring_buffer_flush_range().
   - Use flush_cache_vmap() instead of flush_cache_all().
 Changes in v5:
   - Use ring_buffer_record_off() instead of ring_buffer_record_disable().
   - Use flush_cache_all() to ensure flush all cache.
 Changes in v3:
   - update patch description.
---
 arch/alpha/include/asm/Kbuild        |    1 +
 arch/arc/include/asm/Kbuild          |    1 +
 arch/arm/include/asm/Kbuild          |    1 +
 arch/arm64/include/asm/ring_buffer.h |   10 ++++++++++
 arch/csky/include/asm/Kbuild         |    1 +
 arch/hexagon/include/asm/Kbuild      |    1 +
 arch/loongarch/include/asm/Kbuild    |    1 +
 arch/m68k/include/asm/Kbuild         |    1 +
 arch/microblaze/include/asm/Kbuild   |    1 +
 arch/mips/include/asm/Kbuild         |    1 +
 arch/nios2/include/asm/Kbuild        |    1 +
 arch/openrisc/include/asm/Kbuild     |    1 +
 arch/parisc/include/asm/Kbuild       |    1 +
 arch/powerpc/include/asm/Kbuild      |    1 +
 arch/riscv/include/asm/Kbuild        |    1 +
 arch/s390/include/asm/Kbuild         |    1 +
 arch/sh/include/asm/Kbuild           |    1 +
 arch/sparc/include/asm/Kbuild        |    1 +
 arch/um/include/asm/Kbuild           |    1 +
 arch/x86/include/asm/Kbuild          |    1 +
 arch/xtensa/include/asm/Kbuild       |    1 +
 include/asm-generic/ring_buffer.h    |   13 +++++++++++++
 kernel/trace/ring_buffer.c           |   22 ++++++++++++++++++++++
 23 files changed, 65 insertions(+)
 create mode 100644 arch/arm64/include/asm/ring_buffer.h
 create mode 100644 include/asm-generic/ring_buffer.h

diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 483965c5a4de..b154b4e3dfa8 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += agp.h
 generic-y += asm-offsets.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
 generic-y += text-patching.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index 4c69522e0328..483caacc6988 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -5,5 +5,6 @@ generic-y += extable.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
 generic-y += parport.h
+generic-y += ring_buffer.h
 generic-y += user.h
 generic-y += text-patching.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 03657ff8fbe3..decad5f2c826 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -3,6 +3,7 @@ generic-y += early_ioremap.h
 generic-y += extable.h
 generic-y += flat.h
 generic-y += parport.h
+generic-y += ring_buffer.h
 
 generated-y += mach-types.h
 generated-y += unistd-nr.h
diff --git a/arch/arm64/include/asm/ring_buffer.h b/arch/arm64/include/asm/ring_buffer.h
new file mode 100644
index 000000000000..62316c406888
--- /dev/null
+++ b/arch/arm64/include/asm/ring_buffer.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_ARM64_RING_BUFFER_H
+#define _ASM_ARM64_RING_BUFFER_H
+
+#include <asm/cacheflush.h>
+
+/* Flush D-cache on persistent ring buffer */
+#define arch_ring_buffer_flush_range(start, end)	dcache_clean_pop(start, end)
+
+#endif /* _ASM_ARM64_RING_BUFFER_H */
diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild
index 3a5c7f6e5aac..7dca0c6cdc84 100644
--- a/arch/csky/include/asm/Kbuild
+++ b/arch/csky/include/asm/Kbuild
@@ -9,6 +9,7 @@ generic-y += qrwlock.h
 generic-y += qrwlock_types.h
 generic-y += qspinlock.h
 generic-y += parport.h
+generic-y += ring_buffer.h
 generic-y += user.h
 generic-y += vmlinux.lds.h
 generic-y += text-patching.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index 1efa1e993d4b..0f887d4238ed 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += extable.h
 generic-y += iomap.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
 generic-y += text-patching.h
diff --git a/arch/loongarch/include/asm/Kbuild b/arch/loongarch/include/asm/Kbuild
index 9034b583a88a..7e92957baf6a 100644
--- a/arch/loongarch/include/asm/Kbuild
+++ b/arch/loongarch/include/asm/Kbuild
@@ -10,5 +10,6 @@ generic-y += qrwlock.h
 generic-y += user.h
 generic-y += ioctl.h
 generic-y += mmzone.h
+generic-y += ring_buffer.h
 generic-y += statfs.h
 generic-y += text-patching.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index b282e0dd8dc1..62543bf305ff 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -3,5 +3,6 @@ generated-y += syscall_table.h
 generic-y += extable.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
 generic-y += spinlock.h
 generic-y += text-patching.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index 7178f990e8b3..0030309b47ad 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += extable.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
 generic-y += parport.h
+generic-y += ring_buffer.h
 generic-y += syscalls.h
 generic-y += tlb.h
 generic-y += user.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index 684569b2ecd6..9771c3d85074 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -12,5 +12,6 @@ generic-y += mcs_spinlock.h
 generic-y += parport.h
 generic-y += qrwlock.h
 generic-y += qspinlock.h
+generic-y += ring_buffer.h
 generic-y += user.h
 generic-y += text-patching.h
diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild
index 28004301c236..0a2530964413 100644
--- a/arch/nios2/include/asm/Kbuild
+++ b/arch/nios2/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += cmpxchg.h
 generic-y += extable.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
 generic-y += spinlock.h
 generic-y += user.h
 generic-y += text-patching.h
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index cef49d60d74c..8aa34621702d 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -8,4 +8,5 @@ generic-y += spinlock_types.h
 generic-y += spinlock.h
 generic-y += qrwlock_types.h
 generic-y += qrwlock.h
+generic-y += ring_buffer.h
 generic-y += user.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index 4fb596d94c89..d48d158f7241 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
 generic-y += agp.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
 generic-y += user.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 2e23533b67e3..805b5aeebb6f 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -5,4 +5,5 @@ generated-y += syscall_table_spu.h
 generic-y += agp.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
+generic-y += ring_buffer.h
 generic-y += early_ioremap.h
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index bd5fc9403295..7721b63642f4 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -14,5 +14,6 @@ generic-y += ticket_spinlock.h
 generic-y += qrwlock.h
 generic-y += qrwlock_types.h
 generic-y += qspinlock.h
+generic-y += ring_buffer.h
 generic-y += user.h
 generic-y += vmlinux.lds.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 80bad7de7a04..0c1fc47c3ba0 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -7,3 +7,4 @@ generated-y += unistd_nr.h
 generic-y += asm-offsets.h
 generic-y += mcs_spinlock.h
 generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index 4d3f10ed8275..f0403d3ee8ab 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -3,4 +3,5 @@ generated-y += syscall_table.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
 generic-y += parport.h
+generic-y += ring_buffer.h
 generic-y += text-patching.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 17ee8a273aa6..49c6bb326b75 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
 generic-y += agp.h
 generic-y += kvm_para.h
 generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
 generic-y += text-patching.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index 1b9b82bbe322..2a1629ba8140 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -17,6 +17,7 @@ generic-y += module.lds.h
 generic-y += parport.h
 generic-y += percpu.h
 generic-y += preempt.h
+generic-y += ring_buffer.h
 generic-y += runtime-const.h
 generic-y += softirq_stack.h
 generic-y += switch_to.h
diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 4566000e15c4..078fd2c0d69d 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -14,3 +14,4 @@ generic-y += early_ioremap.h
 generic-y += fprobe.h
 generic-y += mcs_spinlock.h
 generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index 13fe45dea296..e57af619263a 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -6,5 +6,6 @@ generic-y += mcs_spinlock.h
 generic-y += parport.h
 generic-y += qrwlock.h
 generic-y += qspinlock.h
+generic-y += ring_buffer.h
 generic-y += user.h
 generic-y += text-patching.h
diff --git a/include/asm-generic/ring_buffer.h b/include/asm-generic/ring_buffer.h
new file mode 100644
index 000000000000..201d2aee1005
--- /dev/null
+++ b/include/asm-generic/ring_buffer.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Generic arch dependent ring_buffer macros.
+ */
+#ifndef __ASM_GENERIC_RING_BUFFER_H__
+#define __ASM_GENERIC_RING_BUFFER_H__
+
+#include <linux/cacheflush.h>
+
+/* Flush cache on ring buffer range if needed. Do nothing by default. */
+#define arch_ring_buffer_flush_range(start, end)	do { } while (0)
+
+#endif /* __ASM_GENERIC_RING_BUFFER_H__ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8b6c39bba56d..3e793bd1c134 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7,6 +7,7 @@
 #include <linux/ring_buffer_types.h>
 #include <linux/sched/isolation.h>
 #include <linux/trace_recursion.h>
+#include <linux/panic_notifier.h>
 #include <linux/trace_events.h>
 #include <linux/ring_buffer.h>
 #include <linux/trace_clock.h>
@@ -31,6 +32,7 @@
 #include <linux/oom.h>
 #include <linux/mm.h>
 
+#include <asm/ring_buffer.h>
 #include <asm/local64.h>
 #include <asm/local.h>
 #include <asm/setup.h>
@@ -559,6 +561,7 @@ struct trace_buffer {
 
 	unsigned long			range_addr_start;
 	unsigned long			range_addr_end;
+	struct notifier_block		flush_nb;
 
 	struct ring_buffer_meta		*meta;
 
@@ -2520,6 +2523,16 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
 	kfree(cpu_buffer);
 }
 
+/* Stop recording on a persistent buffer and flush cache if needed. */
+static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
+{
+	struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
+
+	ring_buffer_record_off(buffer);
+	arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
+	return NOTIFY_DONE;
+}
+
 static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
 					 int order, unsigned long start,
 					 unsigned long end,
@@ -2650,6 +2663,12 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
 
 	mutex_init(&buffer->mutex);
 
+	/* Persistent ring buffer needs to flush cache before reboot. */
+	if (start && end) {
+		buffer->flush_nb.notifier_call = rb_flush_buffer_cb;
+		atomic_notifier_chain_register(&panic_notifier_list, &buffer->flush_nb);
+	}
+
 	return_ptr(buffer);
 
  fail_free_buffers:
@@ -2748,6 +2767,9 @@ ring_buffer_free(struct trace_buffer *buffer)
 {
 	int cpu;
 
+	if (buffer->range_addr_start && buffer->range_addr_end)
+		atomic_notifier_chain_unregister(&panic_notifier_list, &buffer->flush_nb);
+
 	cpuhp_state_remove_instance(CPUHP_TRACE_RB_PREPARE, &buffer->node);
 
 	irq_work_sync(&buffer->irq_work.work);


^ permalink raw reply related
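
For readers following the notifier callback in the patch above, here is a minimal userspace sketch of how rb_flush_buffer_cb() gets from the notifier_block pointer back to its owning buffer. It is not kernel code: struct demo_buffer, demo_flush_cb() and demo_fire() are hypothetical names, the container_of() macro is a simplified stand-in for the kernel's, and the return value 0 is assumed to correspond to NOTIFY_DONE.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, reduced version of the kernel's notifier_block. */
struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long event, void *data);
};

/* Hypothetical stand-in for struct trace_buffer. */
struct demo_buffer {
	unsigned long range_addr_start;
	unsigned long range_addr_end;
	int record_disabled;
	struct notifier_block flush_nb;	/* embedded, as in the patch */
};

/* Recover the owning buffer from the embedded notifier block. */
static int demo_flush_cb(struct notifier_block *nb, unsigned long event,
			 void *data)
{
	struct demo_buffer *buffer =
		container_of(nb, struct demo_buffer, flush_nb);

	(void)event;
	(void)data;
	buffer->record_disabled = 1;	/* stands in for ring_buffer_record_off() */
	return 0;			/* assumed equivalent of NOTIFY_DONE */
}

/* Install the callback and invoke it once, as a panic notifier chain would. */
static int demo_fire(struct demo_buffer *buffer)
{
	buffer->flush_nb.notifier_call = demo_flush_cb;
	buffer->flush_nb.notifier_call(&buffer->flush_nb, 0, NULL);
	return buffer->record_disabled;
}
```

Embedding the notifier_block in the buffer means no extra lookup table is needed: the chain only hands back the block pointer, and the offset of the member is enough to find the buffer.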

* [PATCH v14 2/5] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-03-30 12:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers
In-Reply-To: <177487498530.3463592.12715592581212799257.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Skip invalid sub-buffers when validating the persistent ring buffer
instead of discarding the entire ring buffer. Only the skipped
sub-buffers are invalidated (cleared).

If cached data fails to be written back to memory during a reboot,
the persistent ring buffer may become partially corrupted, while other
sub-buffers may still contain readable event data. Discard only the
sub-buffers that are found to be corrupted.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
  Changes in v11:
  - Fix a typo.
  Changes in v9:
  - Add meta->subbuf_size check.
  - Fix a typo.
  - Handle invalid reader_page case.
  Changes in v8:
  - Add comment in rb_validate_buffer()
  - Clear the RB_MISSED_* flags in rb_validate_buffer() instead of
    skipping subbuf.
  - Remove unused subbuf local variable from rb_cpu_meta_valid().
  Changes in v7:
  - Combined with Handling RB_MISSED_* flags patch, focus on validation at boot.
  - Remove checking subbuffer data when validating metadata, because it should be done
    later.
  - Do not mark the discarded sub buffer page but just reset it.
  Changes in v6:
  - Show invalid page detection message once per CPU.
  Changes in v5:
  - Instead of showing errors for each page, just show the number
    of discarded pages at the end.
  Changes in v3:
  - Record missed data event on commit.
---
 kernel/trace/ring_buffer.c |   98 ++++++++++++++++++++++++++------------------
 1 file changed, 58 insertions(+), 40 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 3e793bd1c134..31cad8edd488 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -370,6 +370,12 @@ static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
 	return local_read(&bpage->page->commit);
 }
 
+/* Size is determined by what has been committed */
+static __always_inline unsigned int rb_page_size(struct buffer_page *bpage)
+{
+	return rb_page_commit(bpage) & ~RB_MISSED_MASK;
+}
+
 static void free_buffer_page(struct buffer_page *bpage)
 {
 	/* Range pages are not to be freed */
@@ -1762,7 +1768,6 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
 			      unsigned long *subbuf_mask)
 {
 	int subbuf_size = PAGE_SIZE;
-	struct buffer_data_page *subbuf;
 	unsigned long buffers_start;
 	unsigned long buffers_end;
 	int i;
@@ -1770,6 +1775,11 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
 	if (!subbuf_mask)
 		return false;
 
+	if (meta->subbuf_size != PAGE_SIZE) {
+		pr_info("Ring buffer boot meta [%d] invalid subbuf_size\n", cpu);
+		return false;
+	}
+
 	buffers_start = meta->first_buffer;
 	buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
 
@@ -1786,11 +1796,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
 		return false;
 	}
 
-	subbuf = rb_subbufs_from_meta(meta);
-
 	bitmap_clear(subbuf_mask, 0, meta->nr_subbufs);
 
-	/* Is the meta buffers and the subbufs themselves have correct data? */
+	/*
+	 * Ensure the meta::buffers array has correct data. The data in each subbufs
+	 * are checked later in rb_meta_validate_events().
+	 */
 	for (i = 0; i < meta->nr_subbufs; i++) {
 		if (meta->buffers[i] < 0 ||
 		    meta->buffers[i] >= meta->nr_subbufs) {
@@ -1798,18 +1809,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
 			return false;
 		}
 
-		if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
-			pr_info("Ring buffer boot meta [%d] buffer invalid commit\n", cpu);
-			return false;
-		}
-
 		if (test_bit(meta->buffers[i], subbuf_mask)) {
 			pr_info("Ring buffer boot meta [%d] array has duplicates\n", cpu);
 			return false;
 		}
 
 		set_bit(meta->buffers[i], subbuf_mask);
-		subbuf = (void *)subbuf + subbuf_size;
 	}
 
 	return true;
@@ -1873,13 +1878,22 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
 	return events;
 }
 
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+			      struct ring_buffer_cpu_meta *meta)
 {
 	unsigned long long ts;
+	unsigned long tail;
 	u64 delta;
-	int tail;
 
-	tail = local_read(&dpage->commit);
+	/*
+	 * When a sub-buffer is recovered from a read, the commit value may
+	 * have RB_MISSED_* bits set, as these bits are reset on reuse.
+	 * Even after clearing these bits, a commit value greater than the
+	 * subbuf_size is considered invalid.
+	 */
+	tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
+	if (tail > meta->subbuf_size)
+		return -1;
 	return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
 }
 
@@ -1890,6 +1904,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	struct buffer_page *head_page, *orig_head;
 	unsigned long entry_bytes = 0;
 	unsigned long entries = 0;
+	int discarded = 0;
 	int ret;
 	u64 ts;
 	int i;
@@ -1900,14 +1915,19 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	orig_head = head_page = cpu_buffer->head_page;
 
 	/* Do the reader page first */
-	ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu);
+	ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu, meta);
 	if (ret < 0) {
-		pr_info("Ring buffer reader page is invalid\n");
-		goto invalid;
+		pr_info("Ring buffer meta [%d] invalid reader page detected\n",
+			cpu_buffer->cpu);
+		discarded++;
+		/* Instead of discard whole ring buffer, discard only this sub-buffer. */
+		local_set(&cpu_buffer->reader_page->entries, 0);
+		local_set(&cpu_buffer->reader_page->page->commit, 0);
+	} else {
+		entries += ret;
+		entry_bytes += rb_page_size(cpu_buffer->reader_page);
+		local_set(&cpu_buffer->reader_page->entries, ret);
 	}
-	entries += ret;
-	entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
-	local_set(&cpu_buffer->reader_page->entries, ret);
 
 	ts = head_page->page->time_stamp;
 
@@ -1935,7 +1955,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 			break;
 
 		/* Stop rewind if the page is invalid. */
-		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
 		if (ret < 0)
 			break;
 
@@ -2014,21 +2034,24 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		if (head_page == cpu_buffer->reader_page)
 			continue;
 
-		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
 		if (ret < 0) {
-			pr_info("Ring buffer meta [%d] invalid buffer page\n",
-				cpu_buffer->cpu);
-			goto invalid;
-		}
-
-		/* If the buffer has content, update pages_touched */
-		if (ret)
-			local_inc(&cpu_buffer->pages_touched);
-
-		entries += ret;
-		entry_bytes += local_read(&head_page->page->commit);
-		local_set(&head_page->entries, ret);
+			if (!discarded)
+				pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+					cpu_buffer->cpu);
+			discarded++;
+			/* Instead of discard whole ring buffer, discard only this sub-buffer. */
+			local_set(&head_page->entries, 0);
+			local_set(&head_page->page->commit, 0);
+		} else {
+			/* If the buffer has content, update pages_touched */
+			if (ret)
+				local_inc(&cpu_buffer->pages_touched);
 
+			entries += ret;
+			entry_bytes += rb_page_size(head_page);
+			local_set(&head_page->entries, ret);
+		}
 		if (head_page == cpu_buffer->commit_page)
 			break;
 	}
@@ -2042,7 +2065,8 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	local_set(&cpu_buffer->entries, entries);
 	local_set(&cpu_buffer->entries_bytes, entry_bytes);
 
-	pr_info("Ring buffer meta [%d] is from previous boot!\n", cpu_buffer->cpu);
+	pr_info("Ring buffer meta [%d] is from previous boot! (%d pages discarded)\n",
+		cpu_buffer->cpu, discarded);
 	return;
 
  invalid:
@@ -3329,12 +3353,6 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 	return NULL;
 }
 
-/* Size is determined by what has been committed */
-static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
-{
-	return rb_page_commit(bpage) & ~RB_MISSED_MASK;
-}
-
 static __always_inline unsigned
 rb_commit_index(struct ring_buffer_per_cpu *cpu_buffer)
 {


^ permalink raw reply related
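
The range check that the patch above adds to rb_validate_buffer() can be sketched in isolation as follows. This is a userspace illustration under stated assumptions: the DEMO_MISSED_* values assume that the RB_MISSED_* flags occupy the two top bits of a 32-bit commit word, and demo_validate_commit() is a hypothetical helper that returns the masked size, whereas the real function goes on to count events with rb_read_data_buffer().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag layout: the two top bits of the 32-bit commit word. */
#define DEMO_MISSED_EVENTS	(1U << 31)
#define DEMO_MISSED_STORED	(1U << 30)
#define DEMO_MISSED_MASK	(DEMO_MISSED_EVENTS | DEMO_MISSED_STORED)

/*
 * Return the payload size of a sub-buffer, or -1 if the commit value is
 * still larger than the sub-buffer after the flag bits are cleared.
 */
static long demo_validate_commit(uint32_t commit, uint32_t subbuf_size)
{
	uint32_t tail = commit & ~DEMO_MISSED_MASK;

	if (tail > subbuf_size)
		return -1;
	return (long)tail;
}
```

Masking first matters: a page that was handed to a reader may legitimately carry a flag bit, and comparing the raw value against the sub-buffer size would reject it as corrupted.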

* [PATCH v14 3/5] ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-03-30 12:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers
In-Reply-To: <177487498530.3463592.12715592581212799257.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Skip invalid sub-buffers when rewinding the persistent ring buffer
instead of stopping the rewind. The skipped sub-buffers are
cleared.

To ensure the rewind stops at an unused page, this also clears
buffer_data_page::time_stamp when tracing resets the buffer. This
allows us to identify unused pages and empty pages.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v12:
   - Fix build error.
 Changes in v11:
   - Reset timestamp when the buffer is invalid.
   - When rewinding, skip subbuf page if timestamp is wrong and
     check timestamp after validating buffer data page.
 Changes in v10:
   - Newly added.
---
 kernel/trace/ring_buffer.c |   76 +++++++++++++++++++++++++-------------------
 1 file changed, 43 insertions(+), 33 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 31cad8edd488..e5178239f2f9 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -363,6 +363,7 @@ struct buffer_page {
 static void rb_init_page(struct buffer_data_page *bpage)
 {
 	local_set(&bpage->commit, 0);
+	bpage->time_stamp = 0;
 }
 
 static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
@@ -1878,12 +1879,14 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
 	return events;
 }
 
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
 			      struct ring_buffer_cpu_meta *meta)
 {
+	struct buffer_data_page *dpage = bpage->page;
 	unsigned long long ts;
 	unsigned long tail;
 	u64 delta;
+	int ret = -1;
 
 	/*
 	 * When a sub-buffer is recovered from a read, the commit value may
@@ -1892,9 +1895,17 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
 	 * subbuf_size is considered invalid.
 	 */
 	tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
-	if (tail > meta->subbuf_size)
-		return -1;
-	return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+	if (tail <= meta->subbuf_size)
+		ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+
+	if (ret < 0) {
+		local_set(&bpage->entries, 0);
+		local_set(&bpage->page->commit, 0);
+	} else {
+		local_set(&bpage->entries, ret);
+	}
+
+	return ret;
 }
 
 /* If the meta data has been validated, now validate the events */
@@ -1915,18 +1926,14 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	orig_head = head_page = cpu_buffer->head_page;
 
 	/* Do the reader page first */
-	ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu, meta);
+	ret = rb_validate_buffer(cpu_buffer->reader_page, cpu_buffer->cpu, meta);
 	if (ret < 0) {
 		pr_info("Ring buffer meta [%d] invalid reader page detected\n",
 			cpu_buffer->cpu);
 		discarded++;
-		/* Instead of discard whole ring buffer, discard only this sub-buffer. */
-		local_set(&cpu_buffer->reader_page->entries, 0);
-		local_set(&cpu_buffer->reader_page->page->commit, 0);
 	} else {
 		entries += ret;
 		entry_bytes += rb_page_size(cpu_buffer->reader_page);
-		local_set(&cpu_buffer->reader_page->entries, ret);
 	}
 
 	ts = head_page->page->time_stamp;
@@ -1945,26 +1952,33 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		if (head_page == cpu_buffer->tail_page)
 			break;
 
-		/* Ensure the page has older data than head. */
-		if (ts < head_page->page->time_stamp)
-			break;
-
-		ts = head_page->page->time_stamp;
-		/* Ensure the page has correct timestamp and some data. */
-		if (!ts || rb_page_commit(head_page) == 0)
-			break;
-
-		/* Stop rewind if the page is invalid. */
-		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
-		if (ret < 0)
+		/* Rewind until unused page (no timestamp, no commit). */
+		if (!head_page->page->time_stamp && rb_page_commit(head_page) == 0)
 			break;
 
-		/* Recover the number of entries and update stats. */
-		local_set(&head_page->entries, ret);
-		if (ret)
-			local_inc(&cpu_buffer->pages_touched);
-		entries += ret;
-		entry_bytes += rb_page_commit(head_page);
+		/*
+		 * Skip if the page is invalid, or its timestamp is newer than the
+		 * previous valid page.
+		 */
+		ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta);
+		if (ret >= 0 && ts < head_page->page->time_stamp) {
+			local_set(&head_page->entries, 0);
+			local_set(&head_page->page->commit, 0);
+			head_page->page->time_stamp = ts;
+			ret = -1;
+		}
+		if (ret < 0) {
+			if (!discarded)
+				pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+					cpu_buffer->cpu);
+			discarded++;
+		} else {
+			entries += ret;
+			entry_bytes += rb_page_size(head_page);
+			if (ret > 0)
+				local_inc(&cpu_buffer->pages_touched);
+			ts = head_page->page->time_stamp;
+		}
 	}
 	if (i)
 		pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2034,15 +2048,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		if (head_page == cpu_buffer->reader_page)
 			continue;
 
-		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
+		ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta);
 		if (ret < 0) {
 			if (!discarded)
 				pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
 					cpu_buffer->cpu);
 			discarded++;
-			/* Instead of discard whole ring buffer, discard only this sub-buffer. */
-			local_set(&head_page->entries, 0);
-			local_set(&head_page->page->commit, 0);
 		} else {
 			/* If the buffer has content, update pages_touched */
 			if (ret)
@@ -2050,7 +2061,6 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 
 			entries += ret;
 			entry_bytes += rb_page_size(head_page);
-			local_set(&head_page->entries, ret);
 		}
 		if (head_page == cpu_buffer->commit_page)
 			break;
@@ -2081,7 +2091,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	/* Reset all the subbuffers */
 	for (i = 0; i < meta->nr_subbufs - 1; i++, rb_inc_page(&head_page)) {
 		local_set(&head_page->entries, 0);
-		local_set(&head_page->page->commit, 0);
+		rb_init_page(head_page->page);
 	}
 }
 


^ permalink raw reply related
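
The rewind rule from the patch above can be modelled with a small userspace loop. This is only a sketch under simplifying assumptions: struct demo_page, struct demo_rewind_stats and demo_rewind() are hypothetical, pages are stored in an array ordered from the head towards older pages, and "invalid" is reduced to an out-of-range commit value instead of a full event scan.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced view of a sub-buffer page. */
struct demo_page {
	uint64_t time_stamp;
	uint32_t commit;
};

struct demo_rewind_stats {
	int discarded;
	unsigned long bytes;
};

/*
 * Walk pages[] from index 0 (next to the head) towards older pages.
 * head_ts is the timestamp of the current head page.
 */
static struct demo_rewind_stats
demo_rewind(struct demo_page *pages, int nr, uint64_t head_ts,
	    uint32_t subbuf_size)
{
	struct demo_rewind_stats st = { 0, 0 };
	uint64_t ts = head_ts;
	int i;

	for (i = 0; i < nr; i++) {
		struct demo_page *p = &pages[i];
		int valid = p->commit <= subbuf_size;

		/* An unused page (no timestamp, no commit) ends the rewind. */
		if (!p->time_stamp && !p->commit)
			break;

		/* A page newer than the previous valid page is also skipped. */
		if (valid && ts < p->time_stamp) {
			p->time_stamp = ts;
			valid = 0;
		}

		if (!valid) {
			p->commit = 0;	/* clear the skipped page */
			st.discarded++;
			continue;
		}

		st.bytes += p->commit;
		ts = p->time_stamp;
	}
	return st;
}
```

For example, with a head timestamp of 100 and pages (90, 100 bytes), (80, 5000 bytes), (95, 50 bytes), (70, 200 bytes), (0, 0), the second page is skipped for its commit value, the third for its timestamp, and the walk stops at the fifth, leaving 300 bytes accounted and two pages discarded.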

* [PATCH v14 4/5] ring-buffer: Reset RB_MISSED_* flags on persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-03-30 12:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers
In-Reply-To: <177487498530.3463592.12715592581212799257.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Reset the RB_MISSED_* flags when the persistent ring buffer is
validated at boot. These flags are used only by the reading
process, which is stopped at reboot and never restarted. The
flags are therefore meaningless in the next boot, and they can
even confuse the read process after reboot.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v14:
   - Newly added.
---
 kernel/trace/ring_buffer.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index e5178239f2f9..5049cf13021e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1903,6 +1903,7 @@ static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
 		local_set(&bpage->page->commit, 0);
 	} else {
 		local_set(&bpage->entries, ret);
+		local_set(&bpage->page->commit, tail);
 	}
 
 	return ret;


^ permalink raw reply related

* [PATCH v14 5/5] ring-buffer: Add persistent ring buffer selftest
From: Masami Hiramatsu (Google) @ 2026-03-30 12:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers
In-Reply-To: <177487498530.3463592.12715592581212799257.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add a self-destructive test for the persistent ring buffer. It
invalidates some sub-buffer pages in the persistent ring buffer
when the kernel panics, and checks after reboot whether the number
of detected invalid pages and the total entry_bytes match the
recorded values.

This ensures that the kernel correctly recovers a partially corrupted
persistent ring buffer at boot.

The test only runs on the persistent ring buffer named
"ptracingtest". The user has to fill it with events before the
kernel panics.

To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_INJECT
and set up the kernel command line:

 reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace
 panic=1

Then run the following commands after the first boot:

 cd /sys/kernel/tracing/instances/ptracingtest
 echo 1 > tracing_on
 echo 1 > events/enable
 sleep 3
 echo c > /proc/sysrq-trigger

After the panic message, the kernel reboots and runs the verification
on the persistent ring buffer, e.g.

 Ring buffer meta [2] invalid buffer page detected
 Ring buffer meta [2] is from previous boot! (318 pages discarded)
 Ring buffer testing [2] invalid pages: PASSED (318/318)
 Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v14:
  - Rename config to CONFIG_RING_BUFFER_PERSISTENT_INJECT.
  - Clear meta->nr_invalid/entry_bytes after testing.
  - Add test commands in config comment.
 Changes in v10:
  - Add entry_bytes test.
  - Do not compile test code if CONFIG_RING_BUFFER_PERSISTENT_SELFTEST=n.
 Changes in v9:
  - Test also reader pages.
---
 include/linux/ring_buffer.h |    1 +
 kernel/trace/Kconfig        |   31 +++++++++++++++++++
 kernel/trace/ring_buffer.c  |   71 +++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.c        |    4 ++
 4 files changed, 107 insertions(+)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 994f52b34344..0670742b2d60 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -238,6 +238,7 @@ int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
 
 enum ring_buffer_flags {
 	RB_FL_OVERWRITE		= 1 << 0,
+	RB_FL_TESTING		= 1 << 1,
 };
 
 #ifdef CONFIG_RING_BUFFER
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..07305ed6d745 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -1202,6 +1202,37 @@ config RING_BUFFER_VALIDATE_TIME_DELTAS
 	  Only say Y if you understand what this does, and you
 	  still want it enabled. Otherwise say N
 
+config RING_BUFFER_PERSISTENT_INJECT
+	bool "Enable persistent ring buffer error injection test"
+	depends on RING_BUFFER
+	help
+	  Run a selftest on the persistent ring buffer named
+	  "ptracingtest" (and its backup) by invalidating ring
+	  buffer pages when the kernel panics.
+	  To use this, boot the kernel with a "ptracingtest"
+	  persistent ring buffer, e.g.
+
+	   reserve_mem=20M:2M:trace trace_instance=ptracingtest@trace panic=1
+
+	  After the first boot, run the test commands, e.g.:
+
+	   cd /sys/kernel/tracing/instances/ptracingtest
+	   echo 1 > events/enable
+	   echo 1 > tracing_on
+	   sleep 3
+	   echo c > /proc/sysrq-trigger
+
+	  After the panic message, the kernel reboots and shows the
+	  test results in the boot log.
+
+	  Note that the user has to enable events on the persistent
+	  ring buffer manually to fill it up before rebooting.
+	  Since this invalidates the data in the target ring buffer,
+	  the "ptracingtest" persistent ring buffer must not be used
+	  for actual tracing, but only for testing.
+
+	  If unsure, say N
+
 config MMIOTRACE_TEST
 	tristate "Test module for mmiotrace"
 	depends on MMIOTRACE && m
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 5049cf13021e..7f8140c54fce 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -64,6 +64,10 @@ struct ring_buffer_cpu_meta {
 	unsigned long	commit_buffer;
 	__u32		subbuf_size;
 	__u32		nr_subbufs;
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+	__u32		nr_invalid;
+	__u32		entry_bytes;
+#endif
 	int		buffers[];
 };
 
@@ -2078,6 +2082,21 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 
 	pr_info("Ring buffer meta [%d] is from previous boot! (%d pages discarded)\n",
 		cpu_buffer->cpu, discarded);
+
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+	if (meta->nr_invalid)
+		pr_info("Ring buffer testing [%d] invalid pages: %s (%d/%d)\n",
+			cpu_buffer->cpu,
+			(discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
+			discarded, meta->nr_invalid);
+	if (meta->entry_bytes)
+		pr_info("Ring buffer testing [%d] entry_bytes: %s (%ld/%ld)\n",
+			cpu_buffer->cpu,
+			(entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
+			(long)entry_bytes, (long)meta->entry_bytes);
+	meta->nr_invalid = 0;
+	meta->entry_bytes = 0;
+#endif
 	return;
 
  invalid:
@@ -2558,12 +2577,64 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
 	kfree(cpu_buffer);
 }
 
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_cpu_meta *meta;
+	struct buffer_data_page *dpage;
+	u32 entry_bytes = 0;
+	unsigned long ptr;
+	int subbuf_size;
+	int invalid = 0;
+	int cpu;
+	int i;
+
+	if (!(buffer->flags & RB_FL_TESTING))
+		return;
+
+	guard(preempt)();
+	cpu = smp_processor_id();
+
+	cpu_buffer = buffer->buffers[cpu];
+	meta = cpu_buffer->ring_meta;
+	ptr = (unsigned long)rb_subbufs_from_meta(meta);
+	subbuf_size = meta->subbuf_size;
+
+	for (i = 0; i < meta->nr_subbufs; i++) {
+		int idx = meta->buffers[i];
+
+		dpage = (void *)(ptr + idx * subbuf_size);
+		/* Skip unused pages */
+		if (!local_read(&dpage->commit))
+			continue;
+
+		/* Invalidate even pages. */
+		if (!(i & 0x1)) {
+			local_add(subbuf_size + 1, &dpage->commit);
+			invalid++;
+		} else {
+			/* Count total commit bytes. */
+			entry_bytes += local_read(&dpage->commit);
+		}
+	}
+
+	pr_info("Inject invalidated %d pages on CPU%d, total size: %ld\n",
+		invalid, cpu, (long)entry_bytes);
+	meta->nr_invalid = invalid;
+	meta->entry_bytes = entry_bytes;
+}
+#else /* !CONFIG_RING_BUFFER_PERSISTENT_INJECT */
+#define rb_test_inject_invalid_pages(buffer)	do { } while (0)
+#endif
+
 /* Stop recording on a persistent buffer and flush cache if needed. */
 static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
 {
 	struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
 
 	ring_buffer_record_off(buffer);
+	rb_test_inject_invalid_pages(buffer);
 	arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
 	return NOTIFY_DONE;
 }
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4189ec9df6a5..108b0d16badf 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9366,6 +9366,8 @@ static void setup_trace_scratch(struct trace_array *tr,
 	memset(tscratch, 0, size);
 }
 
+#define TRACE_TEST_PTRACING_NAME	"ptracingtest"
+
 static int
 allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned long size)
 {
@@ -9378,6 +9380,8 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned
 	buf->tr = tr;
 
 	if (tr->range_addr_start && tr->range_addr_size) {
+		if (!strcmp(tr->name, TRACE_TEST_PTRACING_NAME))
+			rb_flags |= RB_FL_TESTING;
 		/* Add scratch buffer to handle 128 modules */
 		buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
 						      tr->range_addr_start,


^ permalink raw reply related
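
The inject-then-verify idea of the selftest above can be reproduced in userspace to see why the two counters should match after reboot. This is a sketch, not the kernel code: struct demo_meta, demo_inject() and demo_verify() are hypothetical, and each page is reduced to its commit value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two test fields added to the meta data. */
struct demo_meta {
	uint32_t nr_invalid;
	uint32_t entry_bytes;
};

/* Corrupt every even-indexed used page and record what recovery should see. */
static void demo_inject(uint32_t *commit, int nr, uint32_t subbuf_size,
			struct demo_meta *meta)
{
	uint32_t invalid = 0, bytes = 0;
	int i;

	for (i = 0; i < nr; i++) {
		if (!commit[i])
			continue;		/* skip unused pages */
		if (!(i & 0x1)) {
			commit[i] += subbuf_size + 1;	/* now out of range */
			invalid++;
		} else {
			bytes += commit[i];
		}
	}
	meta->nr_invalid = invalid;
	meta->entry_bytes = bytes;
}

/* Recovery pass: discard out-of-range pages, sum the rest, compare. */
static int demo_verify(uint32_t *commit, int nr, uint32_t subbuf_size,
		       const struct demo_meta *meta)
{
	uint32_t discarded = 0, bytes = 0;
	int i;

	for (i = 0; i < nr; i++) {
		if (commit[i] > subbuf_size) {
			commit[i] = 0;
			discarded++;
		} else {
			bytes += commit[i];
		}
	}
	return discarded == meta->nr_invalid && bytes == meta->entry_bytes;
}
```

Adding subbuf_size + 1 to a non-zero commit value guarantees that the result exceeds the sub-buffer size, so every injected page is caught by the range check while the untouched odd pages keep their byte counts.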

* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Breno Leitao @ 2026-03-30 13:15 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
	linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <20260327223744.f246150adc1671f7605a4f0a@kernel.org>

On Fri, Mar 27, 2026 at 10:37:44PM +0900, Masami Hiramatsu wrote:
> On Fri, 27 Mar 2026 03:06:41 -0700
> Breno Leitao <leitao@debian.org> wrote:

> > > To fix this, we need to change setup_arch() for each architecture so
> > > that it calls this bootconfig_apply_early_params().
> > 
> > Could we instead integrate this into parse_early_param() itself? That
> > approach would avoid the need to modify each architecture individually.
> 
> Ah, indeed. 

I investigated integrating bootconfig into parse_early_param() and hit a
blocker: xbc_init() and xbc_make_cmdline() depend on memblock_alloc(), but on
most architectures (x86, arm64, arm, s390, riscv) parse_early_param() is called
from setup_arch() _before_ memblock is initialized.

So, bootconfig will not be available as early as parse_early_param(). 

An alternative is to replace the memblock allocations in lib/bootconfig.c with
static __initdata buffers, similar to Petr's approach in 2023:

	https://lore.kernel.org/all/20231121231342.193646-3-oss@malat.biz/

However, there were concerns about the allocation size:

	Petr Malat <oss@malat.biz> wrote: 
	> To allow handling of early options, it's necessary to eliminate allocations
	> from embedded bootconfig handling

	"Hm, my concern is that this can introduce some sort of overhead to parse the bootconfig."

^ permalink raw reply
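
To make the trade-off in the message above concrete, here is a rough userspace sketch of what replacing a dynamic allocator with a fixed static buffer looks like. It is not taken from lib/bootconfig.c or from the 2023 patch: DEMO_ARENA_SIZE, demo_arena and demo_static_alloc() are hypothetical, and the 4096-byte size is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Arbitrary, assumed size; it is reserved whether or not it is used. */
#define DEMO_ARENA_SIZE 4096

static _Alignas(16) unsigned char demo_arena[DEMO_ARENA_SIZE];
static size_t demo_arena_used;

/*
 * Bump allocator over the static buffer. align must be a power of two.
 * Returns NULL when the request does not fit in the remaining space.
 */
static void *demo_static_alloc(size_t size, size_t align)
{
	size_t start;

	if (align == 0 || (align & (align - 1)))
		return NULL;

	start = (demo_arena_used + align - 1) & ~(align - 1);
	if (start > DEMO_ARENA_SIZE || size > DEMO_ARENA_SIZE - start)
		return NULL;

	demo_arena_used = start + size;
	return demo_arena + start;
}
```

Such a buffer is usable before any allocator is initialized, but its size has to be chosen at build time, which is where the allocation-size concern mentioned above comes from.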

* Re: [PATCH] tracing: Move snapshot code out of trace.c and into trace_snapshot.c
From: Arnd Bergmann @ 2026-03-30 14:06 UTC (permalink / raw)
  To: kernel test robot, Steven Rostedt, LKML, Linux trace kernel
  Cc: llvm, oe-kbuild-all, Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <202603070230.Zz4BBLtb-lkp@intel.com>

On Fri, Mar 6, 2026, at 20:07, kernel test robot wrote:
>>> kernel/trace/trace.c:820:5: warning: no previous prototype for function 'tracing_alloc_snapshot' [-Wmissing-prototypes]
>      820 | int tracing_alloc_snapshot(void)
>          |     ^
>    kernel/trace/trace.c:820:1: note: declare 'static' if the function 
> is not intended to be used outside of this translation unit
>      820 | int tracing_alloc_snapshot(void)
>          | ^
>          | static 
>    1 warning generated.

I saw the same thing and worked around it by removing the function.
I then noticed that a bunch of code surrounding it is also unused
and I removed that as well (see below). This version passes
my randconfig build tests, but I suspect it is still wrong,
since the code never had any callers and I don't understand
why.

       Arnd


diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
index 2670ec7f4262..87466d8df147 100644
--- a/include/linux/trace_printk.h
+++ b/include/linux/trace_printk.h
@@ -38,8 +38,6 @@ enum ftrace_dump_mode {
 void tracing_on(void);
 void tracing_off(void);
 int tracing_is_on(void);
-void tracing_snapshot(void);
-void tracing_snapshot_alloc(void);
 
 extern void tracing_start(void);
 extern void tracing_stop(void);
@@ -184,8 +182,6 @@ static inline void trace_dump_stack(int skip) { }
 static inline void tracing_on(void) { }
 static inline void tracing_off(void) { }
 static inline int tracing_is_on(void) { return 0; }
-static inline void tracing_snapshot(void) { }
-static inline void tracing_snapshot_alloc(void) { }
 
 static inline __printf(1, 2)
 int trace_printk(const char *fmt, ...)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ec2b926436a7..76fe2c758734 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -767,70 +767,6 @@ void tracing_on(void)
 }
 EXPORT_SYMBOL_GPL(tracing_on);
 
-#ifdef CONFIG_TRACER_SNAPSHOT
-/**
- * tracing_snapshot - take a snapshot of the current buffer.
- *
- * This causes a swap between the snapshot buffer and the current live
- * tracing buffer. You can use this to take snapshots of the live
- * trace when some condition is triggered, but continue to trace.
- *
- * Note, make sure to allocate the snapshot with either
- * a tracing_snapshot_alloc(), or by doing it manually
- * with: echo 1 > /sys/kernel/tracing/snapshot
- *
- * If the snapshot buffer is not allocated, it will stop tracing.
- * Basically making a permanent snapshot.
- */
-void tracing_snapshot(void)
-{
-	struct trace_array *tr = &global_trace;
-
-	tracing_snapshot_instance(tr);
-}
-EXPORT_SYMBOL_GPL(tracing_snapshot);
-
-/**
- * tracing_alloc_snapshot - allocate snapshot buffer.
- *
- * This only allocates the snapshot buffer if it isn't already
- * allocated - it doesn't also take a snapshot.
- *
- * This is meant to be used in cases where the snapshot buffer needs
- * to be set up for events that can't sleep but need to be able to
- * trigger a snapshot.
- */
-int tracing_alloc_snapshot(void)
-{
-	struct trace_array *tr = &global_trace;
-	int ret;
-
-	ret = tracing_alloc_snapshot_instance(tr);
-	WARN_ON(ret < 0);
-
-	return ret;
-}
-EXPORT_SYMBOL_GPL(tracing_alloc_snapshot);
-#else
-void tracing_snapshot(void)
-{
-	WARN_ONCE(1, "Snapshot feature not enabled, but internal snapshot used");
-}
-EXPORT_SYMBOL_GPL(tracing_snapshot);
-int tracing_alloc_snapshot(void)
-{
-	WARN_ONCE(1, "Snapshot feature not enabled, but snapshot allocation used");
-	return -ENODEV;
-}
-EXPORT_SYMBOL_GPL(tracing_alloc_snapshot);
-void tracing_snapshot_alloc(void)
-{
-	/* Give warning */
-	tracing_snapshot();
-}
-EXPORT_SYMBOL_GPL(tracing_snapshot_alloc);
-#endif /* CONFIG_TRACER_SNAPSHOT */
-
 void tracer_tracing_off(struct trace_array *tr)
 {
 	if (tr->array_buffer.buffer)
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index e4cf6703b301..6abd9e16ef21 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -2306,7 +2306,6 @@ static inline int register_snapshot_cmd(void) { return 0; }
 # endif
 #else /* !CONFIG_TRACER_SNAPSHOT */
 static inline int trace_allocate_snapshot(struct trace_array *tr, int size) { return 0; }
-static inline int tracing_alloc_snapshot(void) { return 0; }
 static inline void tracing_snapshot_instance(struct trace_array *tr) { }
 static inline int tracing_alloc_snapshot_instance(struct trace_array *tr)
 {
diff --git a/kernel/trace/trace_snapshot.c b/kernel/trace/trace_snapshot.c
index 8865b2ef2264..926f395e5af4 100644
--- a/kernel/trace/trace_snapshot.c
+++ b/kernel/trace/trace_snapshot.c
@@ -237,29 +237,6 @@ void tracing_disarm_snapshot(struct trace_array *tr)
 	spin_unlock(&tr->snapshot_trigger_lock);
 }
 
-/**
- * tracing_snapshot_alloc - allocate and take a snapshot of the current buffer.
- *
- * This is similar to tracing_snapshot(), but it will allocate the
- * snapshot buffer if it isn't already allocated. Use this only
- * where it is safe to sleep, as the allocation may sleep.
- *
- * This causes a swap between the snapshot buffer and the current live
- * tracing buffer. You can use this to take snapshots of the live
- * trace when some condition is triggered, but continue to trace.
- */
-void tracing_snapshot_alloc(void)
-{
-	int ret;
-
-	ret = tracing_alloc_snapshot();
-	if (ret < 0)
-		return;
-
-	tracing_snapshot();
-}
-EXPORT_SYMBOL_GPL(tracing_snapshot_alloc);
-
 /**
  * tracing_snapshot_cond_enable - enable conditional snapshot for an instance
  * @tr:		The tracing instance
@@ -391,8 +368,6 @@ void latency_fsnotify(struct trace_array *tr)
 	 */
 	irq_work_queue(&tr->fsnotify_irqwork);
 }
-#else
-static inline void latency_fsnotify(struct trace_array *tr) { }
 #endif /* LATENCY_FS_NOTIFY */
 static const struct file_operations tracing_max_lat_fops;
 

^ permalink raw reply related

* Re: [PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
From: Steven Rostedt @ 2026-03-30 14:22 UTC (permalink / raw)
  To: Pengpeng Hou
  Cc: mhiramat, mathieu.desnoyers, tom.zanussi, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260330024619.38459-1-pengpeng@iscas.ac.cn>

On Mon, 30 Mar 2026 10:46:19 +0800
Pengpeng Hou <pengpeng@iscas.ac.cn> wrote:


Please resend both patches as a separate thread series. Do not send new
versions of the patch as a reply to the old one. That just makes it much
harder for maintainers to keep track of patches, as they are hidden within
threads.

> hist_field_name() uses a static MAX_FILTER_STR_VAL buffer for fully
> qualified variable-reference names, but it currently appends into that
> buffer with strcat() without rebuilding it first. As a result, repeated
> calls append a new "system.event.field" name onto the previous one,
> which can eventually run past the end of full_name.
> 
> Build the name with snprintf() on each call and return NULL if the fully
> qualified name does not fit in MAX_FILTER_STR_VAL.
> 
> Fixes: 067fe038e70f ("tracing: Add variable reference handling to hist triggers")
> Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
> ---
> v2:

Instead of saying "v2", use:

Changes since v1: https://lore.kernel.org/all/20260329030950.32503-1-pengpeng@iscas.ac.cn/

That keeps the history link of this patch compared to the previous version.

-- Steve


> - rebuild full_name on each call instead of falling back to field->name
> - return NULL on overflow as suggested
> - split out the snprintf() length check instead of using an inline if


^ permalink raw reply

* Re: [PATCH v6] tracing: Preserve repeated boot-time tracing parameters
From: Steven Rostedt @ 2026-03-30 14:43 UTC (permalink / raw)
  To: Wesley Atwell
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260329184254.1813273-1-atwellwea@gmail.com>

On Sun, 29 Mar 2026 12:42:54 -0600
Wesley Atwell <atwellwea@gmail.com> wrote:

BTW, please do not reply to old versions of a patch with new versions. It
makes it much more difficult for maintainers to find the latest patch.

New versions of a patch should *always* start a new thread!

> Some tracing boot parameters already accept delimited value lists, but
> their __setup() handlers keep only the last instance seen at boot.
> Make repeated instances append to the same boot-time buffer in the
> format each parser already consumes.
> 
> Use a shared trace_append_boot_param() helper for the ftrace filters,
> trace_options, and kprobe_event boot parameters. trace_trigger=
> still tokenizes a temporary parse buffer in place, but now copies each
> parsed event/trigger pair into boot-time storage so repeated instances
> do not overwrite earlier ones.
> 
> This also lets Bootconfig array values work naturally when they expand
> to repeated param=value entries.
> 
> Before this change, only the last instance from each repeated
> parameter survived boot.
> 
> Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
> ---
> Changes since v5: https://lore.kernel.org/all/20260328201842.1782806-1-atwellwea@gmail.com/

This is also why I suggested using the above link. The link shows how to
find the old version of the patch, without relying on "In-Reply-To" header.

> - add the separator accounting comment in trace_append_boot_param()
> - keep the existing trace_trigger= temporary buffer and copy each
>   parsed event/trigger pair into boot-time storage instead of tracking
>   a running offset inside that buffer
> 
>  kernel/trace/ftrace.c       | 12 ++++++++----
>  kernel/trace/trace.c        | 30 +++++++++++++++++++++++++++++-
>  kernel/trace/trace.h        |  2 ++
>  kernel/trace/trace_events.c | 24 +++++++++++++++++++++---
>  kernel/trace/trace_kprobe.c |  3 ++-
>  5 files changed, 62 insertions(+), 9 deletions(-)
> 

> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -228,6 +228,33 @@ static int boot_instance_index;
>  static char boot_snapshot_info[COMMAND_LINE_SIZE] __initdata;
>  static int boot_snapshot_index;
>  
> +/*
> + * Repeated boot parameters, including Bootconfig array expansions, need
> + * to stay in the delimiter form that the existing parser consumes.
> + */
> +void __init trace_append_boot_param(char *buf, const char *str, char sep,
> +				    int size)
> +{
> +	int len, needed, str_len;
> +
> +	if (!*str)
> +		return;
> +
> +	len = strlen(buf);
> +	str_len = strlen(str);
> +	needed = len + str_len + 1;

Nit, but it would be nice to have a blank line here.

> +	/* For continuation, account for the separator. */
> +	if (len)
> +		needed++;
> +	if (needed > size)
> +		return;
> +
> +	if (len)
> +		buf[len++] = sep;
> +
> +	strscpy(buf + len, str, size - len);
> +}
> +
>  static int __init set_cmdline_ftrace(char *str)
>  {
>  	strscpy(bootup_tracer_buf, str, MAX_TRACER_SIZE);
> @@ -329,7 +356,8 @@ static char trace_boot_options_buf[MAX_TRACER_SIZE] __initdata;
>  
>  static int __init set_trace_boot_options(char *str)
>  {
> -	strscpy(trace_boot_options_buf, str, MAX_TRACER_SIZE);
> +	trace_append_boot_param(trace_boot_options_buf, str, ',',
> +				MAX_TRACER_SIZE);
>  	return 1;
>  }
>  __setup("trace_options=", set_trace_boot_options);
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index b8f3804586a0..237a0417de1c 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -863,6 +863,8 @@ extern int DYN_FTRACE_TEST_NAME(void);
>  extern int DYN_FTRACE_TEST_NAME2(void);
>  
>  extern void trace_set_ring_buffer_expanded(struct trace_array *tr);
> +void __init trace_append_boot_param(char *buf, const char *str,
> +				    char sep, int size);
>  extern bool tracing_selftest_disabled;
>  
>  #ifdef CONFIG_FTRACE_STARTUP_TEST
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index 249d1cba72c0..1c4a4a46169e 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -17,6 +17,7 @@
>  #include <linux/kthread.h>
>  #include <linux/tracefs.h>
>  #include <linux/uaccess.h>
> +#include <linux/memblock.h>
>  #include <linux/module.h>
>  #include <linux/ctype.h>
>  #include <linux/sort.h>
> @@ -3674,7 +3675,7 @@ trace_create_new_event(struct trace_event_call *call,
>  #define MAX_BOOT_TRIGGERS 32
>  
>  static struct boot_triggers {
> -	const char		*event;
> +	char			*event;
>  	char			*trigger;
>  } bootup_triggers[MAX_BOOT_TRIGGERS];
>  
> @@ -3683,6 +3684,7 @@ static int nr_boot_triggers;
>  
>  static __init int setup_trace_triggers(char *str)
>  {
> +	char *event;
>  	char *trigger;
>  	char *buf;
>  	int i;
> @@ -3692,14 +3694,30 @@ static __init int setup_trace_triggers(char *str)
>  	disable_tracing_selftest("running event triggers");
>  
>  	buf = bootup_trigger_buf;
> -	for (i = 0; i < MAX_BOOT_TRIGGERS; i++) {
> +	for (i = nr_boot_triggers; i < MAX_BOOT_TRIGGERS; i++) {

Let's not make this so complex.

This function isn't the same as the other functions. It doesn't need to add
separators to the temp buffer. It only needs to append it.


>  		trigger = strsep(&buf, ",");
>  		if (!trigger)
>  			break;
> -		bootup_triggers[i].event = strsep(&trigger, ".");
> +		event = strsep(&trigger, ".");
>  		bootup_triggers[i].trigger = trigger;
>  		if (!bootup_triggers[i].trigger)
>  			break;
> +
> +		/*
> +		 * Keep each parsed trigger outside the temporary setup
> +		 * buffer so repeated trace_trigger= entries do not
> +		 * overwrite earlier ones.
> +		 */
> +		bootup_triggers[i].event =
> +			memblock_alloc_or_panic(strlen(event) + 1,
> +						SMP_CACHE_BYTES);
> +		strscpy(bootup_triggers[i].event, event,
> +			strlen(event) + 1);
> +		bootup_triggers[i].trigger =
> +			memblock_alloc_or_panic(strlen(trigger) + 1,
> +						SMP_CACHE_BYTES);
> +		strscpy(bootup_triggers[i].trigger, trigger,
> +			strlen(trigger) + 1);
>  	}

I believe all you need for the boot triggers is this:

  (Not even compile tested)

diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 9928da636c9d..7754a8adb58a 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -3677,20 +3677,24 @@ static struct boot_triggers {
 } bootup_triggers[MAX_BOOT_TRIGGERS];
 
 static char bootup_trigger_buf[COMMAND_LINE_SIZE];
+static int boot_trigger_buf_len;
 static int nr_boot_triggers;
 
 static __init int setup_trace_triggers(char *str)
 {
 	char *trigger;
 	char *buf;
+	int len = boot_trigger_buf_len;
 	int i;
 
-	strscpy(bootup_trigger_buf, str, COMMAND_LINE_SIZE);
+	strscpy(bootup_trigger_buf + len , str, COMMAND_LINE_SIZE - len);
 	trace_set_ring_buffer_expanded(NULL);
 	disable_tracing_selftest("running event triggers");
 
-	buf = bootup_trigger_buf;
-	for (i = 0; i < MAX_BOOT_TRIGGERS; i++) {
+	buf = bootup_trigger_buf + len;
+	boot_trigger_buf_len += strlen(buf);
+
+	for (i = nr_boot_triggers; i < MAX_BOOT_TRIGGERS; i++) {
 		trigger = strsep(&buf, ",");
 		if (!trigger)
 			break;
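
For reference, the separator accounting in trace_append_boot_param(),
quoted earlier in this message, can be tried out in user space. The
following is only a sketch under stated assumptions, not the patch itself:
append_param() is a hypothetical stand-in name, and memcpy() replaces the
kernel-only strscpy(), which should be safe here only because the length is
checked before the copy.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical user-space stand-in for trace_append_boot_param().
 * Appends str to buf, inserting sep between entries, and drops the
 * new entry entirely if it would not fit in size bytes.
 */
static void append_param(char *buf, const char *str, char sep, int size)
{
	int len, needed, str_len;

	assert(buf && str && size > 0);

	/* An empty value adds nothing, not even a separator. */
	if (!*str)
		return;

	len = strlen(buf);
	str_len = strlen(str);
	needed = len + str_len + 1;	/* old contents + new entry + NUL */

	/* For continuation, account for the separator. */
	if (len)
		needed++;
	if (needed > size)
		return;

	if (len)
		buf[len++] = sep;

	/* The fit was checked above, so copy the entry with its NUL. */
	memcpy(buf + len, str, str_len + 1);
}
```

Starting from an empty buffer, appending "a" and then "bc" with ',' as the
separator leaves "a,bc" in the buffer. An entry that does not fit is
dropped whole rather than truncated.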

^ permalink raw reply related

* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Breno Leitao @ 2026-03-30 15:04 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
	linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <acpzhCBEPh-tKVqg@gmail.com>

On Mon, Mar 30, 2026 at 06:15:17AM -0700, Breno Leitao wrote:
> On Fri, Mar 27, 2026 at 10:37:44PM +0900, Masami Hiramatsu wrote:
> > On Fri, 27 Mar 2026 03:06:41 -0700
> > Breno Leitao <leitao@debian.org> wrote:
>
> > > > To fix this, we need to change setup_arch() for each architecture so
> > > > that it calls this bootconfig_apply_early_params().
> > >
> > > Could we instead integrate this into parse_early_param() itself? That
> > > approach would avoid the need to modify each architecture individually.
> >
> > Ah, indeed.
>
> I investigated integrating bootconfig into parse_early_param() and hit a
> blocker: xbc_init() and xbc_make_cmdline() depend on memblock_alloc(), but on
> most architectures (x86, arm64, arm, s390, riscv) parse_early_param() is called
> from setup_arch() _before_ memblock is initialized.

That said, I'd like to propose a simpler approach as a first step:

1) Keep calling bootconfig_apply_early_params() from setup_boot_config().
   This is the least intrusive approach and expands bootconfig support to
   additional early boot parameters.

2) Document that architecture-specific early parameters might be ignored.
   If a parameter is consumed early enough (during setup_arch()), it will
   not see the bootconfig value.

3) Ensure that early bootconfig parameters don't overwrite the boot command
   line. For example, if the boot command line has foo=bar and bootconfig
   later has foo=baz, the command line value should take precedence.
   This prevents early boot code (in setup_arch()) from seeing a parameter
   value that will be changed later.
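
The precedence rule in point 3 comes down to a whole-word lookup of the
parameter name on the command line. A minimal user-space sketch of that
check, modelled on the cmdline_has_param() helper in the patch, could look
like this. The extra cmdline argument is an assumption made so the function
can be tested outside the kernel; the patch reads boot_command_line
directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if param appears in cmdline as a whole word: preceded by
 * a space or the start of the string, and followed by '=', a space, or
 * the end of the string.
 */
static bool cmdline_has_param(const char *cmdline, const char *param)
{
	size_t len = strlen(param);
	const char *p = cmdline;

	/* An empty name would make strstr() match forever. */
	assert(len > 0);

	while ((p = strstr(p, param)) != NULL) {
		bool at_start = (p == cmdline) || (p[-1] == ' ');
		bool at_end = p[len] == '=' || p[len] == ' ' || p[len] == '\0';

		if (at_start && at_end)
			return true;
		/* Partial match only: keep searching after it. */
		p += len;
	}
	return false;
}
```

With the command line "console=ttyS0 mitigations=auto quiet", the names
"mitigations" and "quiet" are found, while "mitigation" (a prefix) and
"auto" (a value, not a name) are not. A bootconfig key would be skipped
only in the first case.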


If that is OK, this is what I have right now:

commit dd6e00e41c381e5fef9d22dda02b104aa8f83101
Author: Breno Leitao <leitao@debian.org>
Date:   Mon Mar 30 06:50:28 2026 -0700

    bootconfig: Apply early options from embedded config
    
    Bootconfig currently cannot apply early kernel parameters. For example,
    the "mitigations=" parameter must be passed through traditional boot
    methods because bootconfig parsing happens after these early parameters
    need to be processed.
    
    Add bootconfig_apply_early_params() which walks all kernel.* keys in the
    parsed XBC tree and calls do_early_param() for each one. It is called
    from setup_boot_config() immediately after a successful xbc_init() on
    the embedded data, which happens before parse_early_param() runs in
    start_kernel().
    
    This allows early options such as:
    
      kernel.mitigations = off
    
    to be placed in the embedded bootconfig and take effect, without
    requiring them on the kernel command line.
    
    If the same parameter appears on both the kernel command line and in
    the embedded bootconfig, the command-line value takes precedence:
    bootconfig_apply_early_params() checks boot_command_line and skips
    any parameter already present there.
    
    Known limitations are documented:
    - Early options in initrd bootconfig are still silently ignored, as the
      initrd is only available after the early param window has closed.
    - Arch-specific early params consumed during setup_arch() (e.g. mem=,
      earlycon, noapic) may not take effect from bootconfig.
    
    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst
index f712758472d5c..6ed852a0c66d8 100644
--- a/Documentation/admin-guide/bootconfig.rst
+++ b/Documentation/admin-guide/bootconfig.rst
@@ -169,6 +169,15 @@ Boot Kernel With a Boot Config
 There are two options to boot the kernel with bootconfig: attaching the
 bootconfig to the initrd image or embedding it in the kernel itself.
 
+Early options (those registered with ``early_param()``) may only be
+specified in the embedded bootconfig, because the initrd is not yet
+available when early parameters are processed.
+
+Note that embedded bootconfig is parsed after ``setup_arch()``, so
+early options that are consumed during architecture initialization
+(e.g., ``mem=``, ``memmap=``, ``earlycon``, ``noapic``, ``nolapic``,
+``acpi=``, ``numa=``, ``iommu=``) may not take effect from bootconfig.
+
 Attaching a Boot Config to Initrd
 ---------------------------------
 
diff --git a/init/Kconfig b/init/Kconfig
index 7484cd703bc1a..34adcc1feb9b6 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1525,6 +1525,16 @@ config BOOT_CONFIG_EMBED
 	  image. But if the system doesn't support initrd, this option will
 	  help you by embedding a bootconfig file while building the kernel.
 
+	  Unlike bootconfig attached to initrd, the embedded bootconfig also
+	  supports early options (those registered with early_param()). Any
+	  kernel.* key in the embedded bootconfig is applied before
+	  parse_early_param() runs.  Early options in initrd bootconfig will
+	  not be applied.  Early options consumed during setup_arch() (e.g.
+	  mem=, memmap=, earlycon, noapic, acpi=, numa=, iommu=) may not
+	  take effect.  If the same early option
+	  appears in both bootconfig and the kernel command line, the
+	  command line value takes precedence.
+
 	  If unsure, say N.
 
 config BOOT_CONFIG_EMBED_FILE
diff --git a/init/main.c b/init/main.c
index 1cb395dd94e43..487fe86ab5c09 100644
--- a/init/main.c
+++ b/init/main.c
@@ -414,10 +414,112 @@ static int __init warn_bootconfig(char *str)
 	return 0;
 }
 
+/*
+ * do_early_param() is defined later in this file but called from
+ * bootconfig_apply_early_params() below, so we need a forward declaration.
+ */
+static int __init do_early_param(char *param, char *val,
+				 const char *unused, void *arg);
+
+/*
+ * Check if a parameter name appears on the kernel command line.
+ * Returns true if the parameter was explicitly passed by the bootloader.
+ */
+static bool __init cmdline_has_param(const char *param)
+{
+	const char *p = boot_command_line;
+	int len = strlen(param);
+
+	while ((p = strstr(p, param)) != NULL) {
+		/* Check it's a whole-word match: preceded by space/start */
+		if (p != boot_command_line && *(p - 1) != ' ') {
+			p += len;
+			continue;
+		}
+		/* Followed by =, space, or end of string */
+		if (p[len] == '=' || p[len] == ' ' || p[len] == '\0')
+			return true;
+		p += len;
+	}
+	return false;
+}
+
+/*
+ * bootconfig_apply_early_params - apply kernel.* keys from the embedded
+ * bootconfig as early_param() calls.
+ *
+ * early_param() handlers run before most of the kernel initialises.
+ * A bootconfig attached to initrd arrives too late because the initrd is
+ * not mapped when early params are processed.  The embedded bootconfig
+ * lives in the kernel image itself (.init.data), so it is always
+ * reachable.
+ *
+ * Called from setup_boot_config() which runs before parse_early_param()
+ * in start_kernel(), but after setup_arch().  Arch-specific early params
+ * parsed during setup_arch() will not see bootconfig values.
+ */
+static void __init bootconfig_apply_early_params(void)
+{
+	struct xbc_node *knode, *vnode, *root;
+	const char *val;
+	char *val_copy;
+
+	root = xbc_find_node("kernel");
+	if (!root)
+		return;
+
+	xbc_node_for_each_key_value(root, knode, val) {
+		if (xbc_node_compose_key_after(root, knode,
+					       xbc_namebuf,
+					       XBC_KEYLEN_MAX) < 0)
+			continue;
+
+		/* Command-line values take precedence over bootconfig */
+		if (cmdline_has_param(xbc_namebuf)) {
+			pr_info("bootconfig: skipping '%s', already on command line\n",
+				xbc_namebuf);
+			continue;
+		}
+
+		/* Boolean key with no value — pass NULL like parse_args() */
+		if (!xbc_node_get_child(knode)) {
+			do_early_param(xbc_namebuf, NULL, NULL, NULL);
+			continue;
+		}
+
+		/*
+		 * Iterate array values: "foo = bar, buz" becomes two
+		 * calls: do_early_param("foo", "bar") and
+		 * do_early_param("foo", "buz").
+		 */
+		vnode = xbc_node_get_child(knode);
+		xbc_array_for_each_value(vnode, val) {
+			/*
+			 * Some early_param handlers save the pointer to
+			 * val, so each value needs its own persistent
+			 * copy.  memblock is available here since we run
+			 * after setup_arch().  These allocations are
+			 * intentionally never freed because the handlers
+			 * may retain references indefinitely.
+			 */
+			val_copy = memblock_alloc(strlen(val) + 1,
+						  SMP_CACHE_BYTES);
+			if (!val_copy) {
+				pr_err("Failed to allocate bootconfig value for '%s'\n",
+				       xbc_namebuf);
+				continue;
+			}
+			strcpy(val_copy, val);
+			do_early_param(xbc_namebuf, val_copy, NULL, NULL);
+		}
+	}
+}
+
 static void __init setup_boot_config(void)
 {
 	static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;
 	const char *msg, *data;
+	bool embedded = false;
 	int pos, ret;
 	size_t size;
 	char *err;
@@ -425,8 +527,11 @@ static void __init setup_boot_config(void)
 	/* Cut out the bootconfig data even if we have no bootconfig option */
 	data = get_boot_config_from_initrd(&size);
 	/* If there is no bootconfig in initrd, try embedded one. */
-	if (!data)
+	if (!data) {
 		data = xbc_get_embedded_bootconfig(&size);
+		/* tag we have embedded data */
+		embedded = !!data;
+	}
 
 	strscpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
 	err = parse_args("bootconfig", tmp_cmdline, NULL, 0, 0, 0, NULL,
@@ -464,6 +569,8 @@ static void __init setup_boot_config(void)
 	} else {
 		xbc_get_info(&ret, NULL);
 		pr_info("Load bootconfig: %ld bytes %d nodes\n", (long)size, ret);
+		if (embedded)
+			bootconfig_apply_early_params();
 		/* keys starting with "kernel." are passed via cmdline */
 		extra_command_line = xbc_make_cmdline("kernel");
 		/* Also, "init." keys are init arguments */

^ permalink raw reply related

* Re: [PATCH v2] tracing/osnoise: Add option to align tlat threads
From: Wander Lairson Costa @ 2026-03-30 16:00 UTC (permalink / raw)
  To: Tomas Glozar
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, John Kacur,
	Luis Goncalves, Crystal Wood, Costa Shulyupin, LKML,
	linux-trace-kernel
In-Reply-To: <20260302131316.385987-1-tglozar@redhat.com>

On Mon, Mar 02, 2026 at 02:13:16PM +0100, Tomas Glozar wrote:
> Add an option called TIMERLAT_ALIGN to osnoise/options, together with a
> corresponding setting osnoise/timerlat_align_us.
> 
> This option sets the alignment of wakeup times between different
> timerlat threads, similarly to cyclictest's -A/--aligned option. If
> TIMERLAT_ALIGN is set, the first thread that reaches the first cycle
> records its first wake-up time. Each following thread sets its first
> wake-up time to a fixed offset from the recorded time, and increments
> it by the same offset.
> 
> Example:
> 
> osnoise/timerlat_period is set to 1000, osnoise/timerlat_align_us is
> set to 20. There are four threads, on CPUs 1 to 4.
> 
> - CPU 4 enters first cycle first. The current time is 20000us, so
> the wake-up of the first cycle is set to 21000us. This time is recorded.
> - CPU 2 enters first cycle next. It reads the recorded time, increments
> it to 21020us, and uses this value as its own wake-up time for the first
> cycle.
> - CPU 3 enters first cycle next. It reads the recorded time, increments
> it to 21040 us, and uses the value as its own wake-up time.
> - CPU 1 proceeds analogously.
> 
> In each next cycle, the wake-up time (called "absolute period" in
> timerlat code) is incremented by the (relative) period of 1000us. Thus,
> the wake-ups in the following cycles (provided the times are reached and
> not in the past) will be as follows:
> 
> CPU 1		CPU 2		CPU 3	 	CPU 4
> 21060us		21020us		21040us		21000us
> 22060us		22020us		22040us		22000us
> ...		...		...		...
> 

Reviewed-by: Wander Lairson Costa <wander@redhat.com>


^ permalink raw reply

* Re: [PATCH] rtla: Fix build without libbpf header
From: Wander Lairson Costa @ 2026-03-30 16:01 UTC (permalink / raw)
  To: Tomas Glozar
  Cc: Steven Rostedt, John Kacur, Luis Goncalves, Crystal Wood,
	Costa Shulyupin, LKML, linux-trace-kernel
In-Reply-To: <20260330091207.16184-1-tglozar@redhat.com>

On Mon, Mar 30, 2026 at 11:12:07AM +0200, Tomas Glozar wrote:
> rtla supports building without libbpf. However, BPF actions
> patchset [1] adds an include of bpf/libbpf.h into timerlat_bpf.h,
> which breaks build on systems that don't have libbpf headers
> installed.
> 
> This is a leftover from a draft version of the patchset where
> timerlat_bpf_set_action() (which takes a struct bpf_program * argument)
> was defined in the header. timerlat_bpf.c already includes bpf/libbpf.h
> via timerlat.skel.h when libbpf is present.
> 
> Remove the redundant include to fix build on systems without libbpf
> headers.
> 
> [1] https://lore.kernel.org/linux-trace-kernel/20251126144205.331954-1-tglozar@redhat.com/T/
> 
> Reported-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> Closes: https://lore.kernel.org/linux-trace-kernel/20260329122202.65a8b575@robin/
> Fixes: 8cd0f08ac72e ("rtla/timerlat: Support tail call from BPF program")
> Signed-off-by: Tomas Glozar <tglozar@redhat.com>
> ---
>  tools/tracing/rtla/src/timerlat_bpf.h | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/tools/tracing/rtla/src/timerlat_bpf.h b/tools/tracing/rtla/src/timerlat_bpf.h
> index 169abeaf4363..f7c5675737fe 100644
> --- a/tools/tracing/rtla/src/timerlat_bpf.h
> +++ b/tools/tracing/rtla/src/timerlat_bpf.h
> @@ -12,7 +12,6 @@ enum summary_field {
>  };
>  
>  #ifndef __bpf__
> -#include <bpf/libbpf.h>
>  #ifdef HAVE_BPF_SKEL
>  int timerlat_bpf_init(struct timerlat_params *params);
>  int timerlat_bpf_attach(void);
> -- 
> 2.53.0
> 

Reviewed-by: Wander Lairson Costa <wander@redhat.com>


^ permalink raw reply

* Re: [PATCH] tracing: Move snapshot code out of trace.c and into trace_snapshot.c
From: Steven Rostedt @ 2026-03-30 16:05 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: kernel test robot, LKML, Linux trace kernel, llvm, oe-kbuild-all,
	Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <8580f943-4c37-4c66-937d-adee13b72201@app.fastmail.com>

On Mon, 30 Mar 2026 16:06:44 +0200
"Arnd Bergmann" <arnd@arndb.de> wrote:

> I saw the same thing and worked around it by removing the function.
> I then noticed that a bunch of code surrounding it is also unused
> and I removed that as well (see below). This version passes
> my randconfig build tests, but I suspect it is still wrong,
> since the code never had any callers and I don't understand
> why.

Note, this code is in include/linux/tracing_printk.h, and is for debugging
purposes (just like trace_printk() is). Hence, it shouldn't be removed.

The purpose is to call tracing_snapshot() when your code detects something
isn't right (but it doesn't crash), and this will take a snapshot of the
current trace that led up to the anomaly.

If anything, I should add more to Documentation/trace/debugging.rst about it.

-- Steve

^ permalink raw reply

* Re: [PATCH v6] tracing: Preserve repeated boot-time tracing parameters
From: Steven Rostedt @ 2026-03-30 16:37 UTC (permalink / raw)
  To: Wesley Atwell
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260330104322.7403c660@gandalf.local.home>

On Mon, 30 Mar 2026 10:43:22 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index 9928da636c9d..7754a8adb58a 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -3677,20 +3677,24 @@ static struct boot_triggers {
>  } bootup_triggers[MAX_BOOT_TRIGGERS];
>  
>  static char bootup_trigger_buf[COMMAND_LINE_SIZE];
> +static int boot_trigger_buf_len;
>  static int nr_boot_triggers;
>  
>  static __init int setup_trace_triggers(char *str)
>  {
>  	char *trigger;
>  	char *buf;
> +	int len = boot_trigger_buf_len;
>  	int i;
>  
> -	strscpy(bootup_trigger_buf, str, COMMAND_LINE_SIZE);
> +	strscpy(bootup_trigger_buf + len , str, COMMAND_LINE_SIZE - len);
>  	trace_set_ring_buffer_expanded(NULL);
>  	disable_tracing_selftest("running event triggers");
>  
> -	buf = bootup_trigger_buf;
> -	for (i = 0; i < MAX_BOOT_TRIGGERS; i++) {
> +	buf = bootup_trigger_buf + len;
> +	boot_trigger_buf_len += strlen(buf);

The above needs to skip the '\0' too:

	boot_trigger_buf_len += strlen(buf) + 1;
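
To see why the offset has to step over the terminator, here is a user-space
sketch of the same append-then-tokenize scheme. All names (add_triggers(),
trigger_buf, BUF_SIZE and so on) are hypothetical, snprintf() stands in for
the kernel-only strscpy(), and the strchr() loop is assumed to behave like
strsep() with a single "," delimiter.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE	64
#define MAX_TRIGGERS	8

static char trigger_buf[BUF_SIZE];
static int trigger_buf_len;		/* next free offset in trigger_buf */
static char *triggers[MAX_TRIGGERS];	/* pointers into trigger_buf */
static int nr_triggers;

static void add_triggers(const char *str)
{
	int len = trigger_buf_len;
	char *buf, *comma;

	assert(str);
	if (len >= BUF_SIZE)
		return;

	/* Copy the new value after everything stored so far. */
	snprintf(trigger_buf + len, BUF_SIZE - len, "%s", str);
	buf = trigger_buf + len;

	/*
	 * The +1 keeps the NUL that ends this value. Without it, the
	 * next call would overwrite that NUL and glue its text onto
	 * the last token saved here.
	 */
	trigger_buf_len += strlen(buf) + 1;

	/* Split in place on ',' and remember each token. */
	while (buf && nr_triggers < MAX_TRIGGERS) {
		comma = strchr(buf, ',');
		if (comma)
			*comma++ = '\0';
		triggers[nr_triggers++] = buf;
		buf = comma;
	}
}
```

Calling add_triggers("a.x,b.y") and then add_triggers("c.z") leaves three
separate entries. If the "+ 1" were dropped, the second copy would start on
the terminator of "b.y", and that entry would then read "b.yc.z".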


> +
> +	for (i = nr_boot_triggers; i < MAX_BOOT_TRIGGERS; i++) {
>  		trigger = strsep(&buf, ",");
>  		if (!trigger)
>  			break;


^ permalink raw reply

* Re: [PATCH v6] tracing: Preserve repeated boot-time tracing parameters
From: Steven Rostedt @ 2026-03-30 16:42 UTC (permalink / raw)
  To: Wesley Atwell
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260330123743.5cd30e56@gandalf.local.home>

On Mon, 30 Mar 2026 12:37:43 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 30 Mar 2026 10:43:22 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> > index 9928da636c9d..7754a8adb58a 100644
> > --- a/kernel/trace/trace_events.c
> > +++ b/kernel/trace/trace_events.c
> > @@ -3677,20 +3677,24 @@ static struct boot_triggers {
> >  } bootup_triggers[MAX_BOOT_TRIGGERS];
> >  
> >  static char bootup_trigger_buf[COMMAND_LINE_SIZE];
> > +static int boot_trigger_buf_len;
> >  static int nr_boot_triggers;
> >  
> >  static __init int setup_trace_triggers(char *str)
> >  {
> >  	char *trigger;
> >  	char *buf;
> > +	int len = boot_trigger_buf_len;
> >  	int i;
> >  
> > -	strscpy(bootup_trigger_buf, str, COMMAND_LINE_SIZE);
> > +	strscpy(bootup_trigger_buf + len , str, COMMAND_LINE_SIZE - len);
> >  	trace_set_ring_buffer_expanded(NULL);
> >  	disable_tracing_selftest("running event triggers");
> >  
> > -	buf = bootup_trigger_buf;
> > -	for (i = 0; i < MAX_BOOT_TRIGGERS; i++) {
> > +	buf = bootup_trigger_buf + len;
> > +	boot_trigger_buf_len += strlen(buf);  
> 
> The above needs to skip the '\0' too:
> 
> 	boot_trigger_buf_len += strlen(buf) + 1;
> 

And since this option is different from the rest, let's make it a separate patch.

-- Steve

> 
> > +
> > +	for (i = nr_boot_triggers; i < MAX_BOOT_TRIGGERS; i++) {
> >  		trigger = strsep(&buf, ",");
> >  		if (!trigger)
> >  			break;  
> 


^ permalink raw reply

* Re: [PATCH v14 1/5] ring-buffer: Flush and stop persistent ring buffer on panic
From: Steven Rostedt @ 2026-03-30 17:54 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel, Ian Rogers
In-Reply-To: <177487499643.3463592.15413057950716995168.stgit@mhiramat.tok.corp.google.com>

On Mon, 30 Mar 2026 21:49:56 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> diff --git a/arch/arm64/include/asm/ring_buffer.h b/arch/arm64/include/asm/ring_buffer.h
> new file mode 100644
> index 000000000000..62316c406888
> --- /dev/null
> +++ b/arch/arm64/include/asm/ring_buffer.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef _ASM_ARM64_RING_BUFFER_H
> +#define _ASM_ARM64_RING_BUFFER_H
> +
> +#include <asm/cacheflush.h>
> +
> +/* Flush D-cache on persistent ring buffer */
> +#define arch_ring_buffer_flush_range(start, end)	dcache_clean_pop(start, end)
> +
> +#endif /* _ASM_ARM64_RING_BUFFER_H */

You probably need to get an ack from the arm64 folks.

-- Steve

^ permalink raw reply
