Linux Trace Kernel

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Alexei Starovoitov @ 2026-05-12 19:27 UTC
  To: Jiri Olsa
  Cc: Masami Hiramatsu, Andrii Nakryiko, bpf, linux-trace-kernel,
	Oleg Nesterov, Peter Zijlstra, Ingo Molnar
In-Reply-To: <agNeEzjiThzmJHiP@krava>

On Tue, May 12, 2026 at 10:07 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> +       /*
> +        * We have nop10 (with first byte overwritten to int3),
> +        * change it to:
> +        *   lea 0x80(%rsp), %rsp
> +        *   call tramp
> +        *
> +        * The first lea instruction skips the stack redzone so the call
> +        * instruction can safely push return address on stack.
> +        */

typo: lea -128(%rsp), %rsp

you can also do:

add $-128, %rsp + call tramp = 4 + 5 = 9 bytes instead of 10.

Initially I didn't like this approach, since we just introduced
usdt nop5 and now need to recompile everything again,
but looking at the fix it's definitely simpler than alternatives
and doesn't have annoying limitations.
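
For illustration, a minimal userspace sketch of the byte counts discussed
above. The lea bytes match the lea_rsp[] array in Jiri's kernel change; the
add encoding and the helper names (seq_size, adjust_disp) are assumptions
made for this example only, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* lea -0x80(%rsp), %rsp: REX.W, opcode 0x8d, ModRM, SIB, disp8 (-128) */
static const uint8_t lea_rsp[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };

/*
 * add $-0x80, %rsp: REX.W, opcode 0x83 /0, ModRM, imm8 (-128).
 * Note that add, unlike lea, also writes the arithmetic flags.
 */
static const uint8_t add_rsp[] = { 0x48, 0x83, 0xc4, 0x80 };

#define CALL_REL32_SIZE	5	/* 0xe8 opcode + 32-bit displacement */

/* Total size of "adjust rsp" followed by "call tramp". */
static size_t seq_size(size_t adjust_size)
{
	return adjust_size + CALL_REL32_SIZE;
}

/* Both encodings end in a sign-extended 8-bit adjustment. */
static int adjust_disp(const uint8_t *insn, size_t size)
{
	assert(size > 0);
	return (int8_t)insn[size - 1];
}
```

With these definitions, seq_size(sizeof(lea_rsp)) is 10 and
seq_size(sizeof(add_rsp)) is 9, which is the 4 + 5 = 9 arithmetic above.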


* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: jane.chu @ 2026-05-12 17:58 UTC
  To: David Hildenbrand (Arm), Breno Leitao, Miaohe Lin,
	Naoya Horiguchi, Andrew Morton, Jonathan Corbet, Shuah Khan,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <9504c193-8c01-4d03-8f62-c50fd7fbdbc0@kernel.org>



On 5/12/2026 1:17 AM, David Hildenbrand (Arm) wrote:
> On 5/11/26 17:38, Breno Leitao wrote:
>> When get_hwpoison_page() returns a negative value, distinguish
>> reserved pages from other failure cases by reporting MF_MSG_KERNEL
>> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>> and should be classified accordingly for proper handling.
>>
>> Sample PG_reserved before the get_hwpoison_page() call. In the
>> MF_COUNT_INCREASED path get_any_page() can drop the caller's
>> reference before returning -EIO, after which the underlying page may
>> have been freed and reallocated with page->flags reset; reading
>> PageReserved(p) at that point would observe stale or unrelated state.
>> The pre-call snapshot reflects what the page actually was at the
>> time of the failure event.
>>
>> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>> Signed-off-by: Breno Leitao <leitao@debian.org>
>> ---
>>   mm/memory-failure.c | 19 ++++++++++++++++++-
>>   1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 866c4428ac7ef..f112fb27a8ff6 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>>   	unsigned long page_flags;
>>   	bool retry = true;
>>   	int hugetlb = 0;
>> +	bool is_reserved;
>>   
>>   	if (!sysctl_memory_failure_recovery)
>>   		panic("Memory failure on page %lx", pfn);
>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>>   	 * In fact it's dangerous to directly bump up page count from 0,
>>   	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>>   	 */
>> +	/*
>> +	 * Pages with PG_reserved set are not currently managed by the
>> +	 * page allocator (memblock-reserved memory, driver reservations,
>> +	 * etc.), so classify them as kernel-owned for reporting.
>> +	 *
>> +	 * Sample the flag before get_hwpoison_page(): in the
>> +	 * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>> +	 * reference before returning -EIO, after which page->flags may
>> +	 * have been reset by the allocator.
>> +	 */
>> +	is_reserved = PageReserved(p);
>> +
>>   	res = get_hwpoison_page(p, flags);
>>   	if (!res) {
>>   		if (is_free_buddy_page(p)) {
>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>>   		}
>>   		goto unlock_mutex;
>>   	} else if (res < 0) {
>> -		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>> +		if (is_reserved)
>> +			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>> +		else
>> +			res = action_result(pfn, MF_MSG_GET_HWPOISON,
>> +					    MF_IGNORED);
>>   		goto unlock_mutex;
>>   	}
>>   
>>
> 
> It's a bit odd that we need this handling when we already have handling for
> reserved pages in error_states[].
> 
> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
> __get_hwpoison_page() ... would always fail? Making
> get_hwpoison_page()->get_any_page() always fail?
> 
> But then, we never call identify_page_state()? And never call me_kernel()?
> 
> This all looks very odd.
> 
> Why would you even want to call get_hwpoison_page() in the first place if you
> find PageReserved?
> 

Ah, good point!
It seems to me that all unhandlable pages should go straight to
identify_page_state:

--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2411,6 +2411,10 @@ int memory_failure(unsigned long pfn, int flags)
          * In fact it's dangerous to directly bump up page count from 0,
          * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
          */
+
+       if (!HWPoisonHandlable(p, flags))
+               goto identify_page_state;
+
         res = get_hwpoison_page(p, flags);
         if (!res) {
                 if (is_free_buddy_page(p)) {

thanks,
-jane
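
The control flow suggested above can be sketched as a toy userspace model.
Everything prefixed toy_ or TOY_ is hypothetical and greatly simplified; in
particular, the real HWPoisonHandlable() and error_states[] check more than
the two flags modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_msg { TOY_MSG_KERNEL, TOY_MSG_UNKNOWN, TOY_MSG_RECOVERED };

struct toy_page {
	bool reserved;	/* stands in for PG_reserved */
	bool lru;	/* stands in for "the allocator/LRU manages it" */
};

/* Stand-in for HWPoisonHandlable(): reserved pages are never handlable. */
static bool toy_handlable(const struct toy_page *p)
{
	return !p->reserved && p->lru;
}

/* Stand-in for identify_page_state() walking error_states[]. */
static enum toy_msg toy_identify_page_state(const struct toy_page *p)
{
	return p->reserved ? TOY_MSG_KERNEL : TOY_MSG_UNKNOWN;
}

static enum toy_msg toy_memory_failure(const struct toy_page *p)
{
	assert(p != NULL);

	/* Proposed early exit: never try to take a reference here. */
	if (!toy_handlable(p))
		return toy_identify_page_state(p);

	/* get_hwpoison_page() and the normal recovery path would run here. */
	return TOY_MSG_RECOVERED;
}
```

In this model a reserved page is classified as kernel-owned without a
separate PG_reserved snapshot, because the reference-taking step is never
reached for it.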






* [PATCH v2] rtla: Stop the record trace on interrupt
From: Crystal Wood @ 2026-05-12 17:37 UTC
  To: Tomas Glozar
  Cc: Steven Rostedt, linux-trace-kernel, John Kacur, Costa Shulyupin,
	Wander Lairson Costa, Crystal Wood

Previously, when rtla received a signal, it stopped the main trace but not
the record trace.  With "--on-end trace", this can lead to
save_trace_to_file() failing to keep up, especially on a debug kernel.
It also adds post-stoppage noise to the trace file.

Signed-off-by: Crystal Wood <crwood@redhat.com>
---
v2: clarify that this matters for --on-end trace

 tools/tracing/rtla/src/common.c   | 19 +++++++++++--------
 tools/tracing/rtla/src/common.h   |  1 -
 tools/tracing/rtla/src/timerlat.c |  2 +-
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
index 35e3d3aa922e..effad523e8cf 100644
--- a/tools/tracing/rtla/src/common.c
+++ b/tools/tracing/rtla/src/common.c
@@ -10,7 +10,7 @@
 
 #include "common.h"
 
-struct trace_instance *trace_inst;
+struct osnoise_tool *trace_tool;
 volatile int stop_tracing;
 int nr_cpus;
 
@@ -21,12 +21,16 @@ static void stop_trace(int sig)
 		 * Stop requested twice in a row; abort event processing and
 		 * exit immediately
 		 */
-		tracefs_iterate_stop(trace_inst->inst);
+		if (trace_tool)
+			tracefs_iterate_stop(trace_tool->trace.inst);
 		return;
 	}
 	stop_tracing = 1;
-	if (trace_inst)
-		trace_instance_stop(trace_inst);
+	if (trace_tool) {
+		trace_instance_stop(&trace_tool->trace);
+		if (trace_tool->record)
+			trace_instance_stop(&trace_tool->record->trace);
+	}
 }
 
 /*
@@ -273,11 +277,10 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
 	tool->params = params;
 
 	/*
-	 * Save trace instance into global variable so that SIGINT can stop
-	 * the timerlat tracer.
+	 * Expose the tool to signal handlers so they can stop the trace.
 	 * Otherwise, rtla could loop indefinitely when overloaded.
 	 */
-	trace_inst = &tool->trace;
+	trace_tool = tool;
 
 	retval = ops->apply_config(tool);
 	if (retval) {
@@ -285,7 +288,7 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
 		goto out_free;
 	}
 
-	retval = enable_tracer_by_name(trace_inst->inst, ops->tracer);
+	retval = enable_tracer_by_name(tool->trace.inst, ops->tracer);
 	if (retval) {
 		err_msg("Failed to enable %s tracer\n", ops->tracer);
 		goto out_free;
diff --git a/tools/tracing/rtla/src/common.h b/tools/tracing/rtla/src/common.h
index 51665db4ffce..eba40b6d9504 100644
--- a/tools/tracing/rtla/src/common.h
+++ b/tools/tracing/rtla/src/common.h
@@ -54,7 +54,6 @@ struct osnoise_context {
 	int			opt_workload;
 };
 
-extern struct trace_instance *trace_inst;
 extern volatile int stop_tracing;
 
 struct hist_params {
diff --git a/tools/tracing/rtla/src/timerlat.c b/tools/tracing/rtla/src/timerlat.c
index f8c057518d22..637f68d684f5 100644
--- a/tools/tracing/rtla/src/timerlat.c
+++ b/tools/tracing/rtla/src/timerlat.c
@@ -202,7 +202,7 @@ void timerlat_analyze(struct osnoise_tool *tool, bool stopped)
 		 * If the trace did not stop with --aa-only, at least print
 		 * the max known latency.
 		 */
-		max_lat = tracefs_instance_file_read(trace_inst->inst, "tracing_max_latency", NULL);
+		max_lat = tracefs_instance_file_read(tool->trace.inst, "tracing_max_latency", NULL);
 		if (max_lat) {
 			printf("  Max latency was %s\n", max_lat);
 			free(max_lat);
-- 
2.54.0
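
The handler logic in the patch above can be sketched in a self-contained
form. This is only a model: struct toy_trace, struct toy_tool,
toy_trace_stop() and install_stop_handler() are hypothetical stand-ins for
the rtla types and helpers, and the abort flag replaces the call to
tracefs_iterate_stop().

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

struct toy_trace {
	volatile sig_atomic_t running;
};

struct toy_tool {
	struct toy_trace trace;		/* main trace instance */
	struct toy_tool *record;	/* optional record instance, may be NULL */
};

static struct toy_tool *trace_tool;
static volatile sig_atomic_t stop_tracing;
static volatile sig_atomic_t abort_requested;

static void toy_trace_stop(struct toy_trace *t)
{
	t->running = 0;
}

static void stop_trace(int sig)
{
	(void)sig;

	if (stop_tracing) {
		/* Stop requested twice in a row: abort event processing. */
		abort_requested = 1;
		return;
	}
	stop_tracing = 1;
	if (trace_tool) {
		toy_trace_stop(&trace_tool->trace);
		/* The fix: stop the record trace as well. */
		if (trace_tool->record)
			toy_trace_stop(&trace_tool->record->trace);
	}
}

/* Expose the tool to the handler before installing it. */
static void install_stop_handler(struct toy_tool *tool)
{
	assert(tool != NULL);
	trace_tool = tool;
	signal(SIGINT, stop_trace);
}
```

Keeping a pointer to the whole tool rather than to a single trace instance
is what lets the handler reach the record instance.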



* [PATCH v2 2/2] serial: qcom-geni: Add tracepoints for Qualcomm GENI serial driver
From: Praveen Talari @ 2026-05-12 17:14 UTC
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
	Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-serial-v2-0-a5726421b3af@oss.qualcomm.com>

Add tracing to the Qualcomm GENI serial driver to improve runtime
observability.

Trace hooks are added at key points including termios and clock
configuration, modem control get/set, interrupt handling, and data
TX/RX paths.

Usage examples:

Enable all serial traces:
  echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_serial/enable
  cat /sys/kernel/debug/tracing/trace_pipe

Example trace output:

2517.938432: geni_serial_clk_cfg: a94000.serial: desired_rate=1843200
     clk_rate=7372800 clk_div=4 clk_idx=0
2517.938753: geni_serial_irq: a94000.serial: m_irq=0x88800000
     s_irq=0x08000111 dma_tx=0x00000000 dma_rx=0x00000000
2517.938803: geni_serial_set_termios: a94000.serial: baud=115200 bpc=8
     tx_trans=0x00000002 tx_par=0x00000000 rx_trans=0x00000000
     rx_par=0x00000000 stop=0
2517.938807: geni_serial_set_mctrl: a94000.serial: mctrl=0x8006
     uart_manual_rfr=0x00000000
2517.938818: geni_serial_get_mctrl: a94000.serial: mctrl=0x0160
     geni_ios=0x00000001
2517.939165: geni_serial_irq: a94000.serial: m_irq=0x00400000
     s_irq=0x00000000 dma_tx=0x00000000 dma_rx=0x00000000
2517.939592: geni_serial_tx_data: a94000.serial: tx_len=8 data=61 62 63
     64 65 66 67 68
2517.940610: geni_serial_irq: a94000.serial: m_irq=0x00000001
     s_irq=0x00000000 dma_tx=0x00000003 dma_rx=0x00000000
2517.942174: geni_serial_irq: a94000.serial: m_irq=0x08000000
     s_irq=0x08000100 dma_tx=0x00000000 dma_rx=0x00000003
2517.942323: geni_serial_rx_data: a94000.serial: rx_len=8 data=61 62 63
     64 65 66 67 68
2517.942680: geni_serial_set_mctrl: a94000.serial: mctrl=0x8000
     uart_manual_rfr=0x80000002

Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
 drivers/tty/serial/qcom_geni_serial.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index e6b0a55f0cfb..9e2de074d799 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -7,6 +7,9 @@
 /* Disable MMIO tracing to prevent excessive logging of unwanted MMIO traces */
 #define __DISABLE_TRACE_MMIO__
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/qcom_geni_serial.h>
+
 #include <linux/clk.h>
 #include <linux/console.h>
 #include <linux/io.h>
@@ -225,7 +228,7 @@ static void qcom_geni_serial_config_port(struct uart_port *uport, int cfg_flags)
 static unsigned int qcom_geni_serial_get_mctrl(struct uart_port *uport)
 {
 	unsigned int mctrl = TIOCM_DSR | TIOCM_CAR;
-	u32 geni_ios;
+	u32 geni_ios = 0;
 
 	if (uart_console(uport)) {
 		mctrl |= TIOCM_CTS;
@@ -235,6 +238,8 @@ static unsigned int qcom_geni_serial_get_mctrl(struct uart_port *uport)
 			mctrl |= TIOCM_CTS;
 	}
 
+	trace_geni_serial_get_mctrl(uport->dev, mctrl, geni_ios);
+
 	return mctrl;
 }
 
@@ -253,6 +258,8 @@ static void qcom_geni_serial_set_mctrl(struct uart_port *uport,
 	if (!(mctrl & TIOCM_RTS) && !uport->suspended)
 		uart_manual_rfr = UART_MANUAL_RFR_EN | UART_RFR_NOT_READY;
 	writel(uart_manual_rfr, uport->membase + SE_UART_MANUAL_RFR);
+
+	trace_geni_serial_set_mctrl(uport->dev, mctrl, uart_manual_rfr);
 }
 
 static const char *qcom_geni_serial_get_type(struct uart_port *uport)
@@ -683,6 +690,8 @@ static void qcom_geni_serial_start_tx_dma(struct uart_port *uport)
 	xmit_size = kfifo_out_linear_ptr(&tport->xmit_fifo, &tail,
 			UART_XMIT_SIZE);
 
+	trace_geni_serial_tx_data(uport->dev, tail, xmit_size);
+
 	qcom_geni_set_rs485_mode(uport, SER_RS485_RTS_ON_SEND);
 
 	qcom_geni_serial_setup_tx(uport, xmit_size);
@@ -909,8 +918,10 @@ static void qcom_geni_serial_handle_rx_dma(struct uart_port *uport, bool drop)
 		return;
 	}
 
-	if (!drop)
+	if (!drop) {
+		trace_geni_serial_rx_data(uport->dev, port->rx_buf, rx_in);
 		handle_rx_uart(uport, rx_in);
+	}
 
 	ret = geni_se_rx_dma_prep(&port->se, port->rx_buf,
 				  DMA_RX_BUF_SIZE,
@@ -1069,6 +1080,10 @@ static irqreturn_t qcom_geni_serial_isr(int isr, void *dev)
 	geni_status = readl(uport->membase + SE_GENI_STATUS);
 	dma = readl(uport->membase + SE_GENI_DMA_MODE_EN);
 	m_irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
+
+	trace_geni_serial_irq(uport->dev, m_irq_status, s_irq_status,
+			      dma_tx_status, dma_rx_status);
+
 	writel(m_irq_status, uport->membase + SE_GENI_M_IRQ_CLEAR);
 	writel(s_irq_status, uport->membase + SE_GENI_S_IRQ_CLEAR);
 	writel(dma_tx_status, uport->membase + SE_DMA_TX_IRQ_CLR);
@@ -1281,8 +1296,8 @@ static int geni_serial_set_rate(struct uart_port *uport, unsigned int baud)
 		return -EINVAL;
 	}
 
-	dev_dbg(port->se.dev, "desired_rate = %u, clk_rate = %lu, clk_div = %u\n, clk_idx = %u\n",
-		baud * sampling_rate, clk_rate, clk_div, clk_idx);
+	trace_geni_serial_clk_cfg(uport->dev, baud * sampling_rate, clk_rate,
+				  clk_div, clk_idx);
 
 	uport->uartclk = clk_rate;
 	port->clk_rate = clk_rate;
@@ -1432,6 +1447,10 @@ static void qcom_geni_serial_set_termios(struct uart_port *uport,
 	writel(bits_per_char, uport->membase + SE_UART_TX_WORD_LEN);
 	writel(bits_per_char, uport->membase + SE_UART_RX_WORD_LEN);
 	writel(stop_bit_len, uport->membase + SE_UART_TX_STOP_BIT_LEN);
+
+	trace_geni_serial_set_termios(uport->dev, baud, bits_per_char,
+				      tx_trans_cfg, tx_parity_cfg, rx_trans_cfg,
+				      rx_parity_cfg, stop_bit_len);
 }
 
 #ifdef CONFIG_SERIAL_QCOM_GENI_CONSOLE

-- 
2.34.1
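
As a reading aid for the data= field in the example trace output above, here
is a small userspace sketch that formats a payload the same way
(space-separated two-digit hex). format_hex() and matches_trace() are
hypothetical helpers written for this example; they are not part of the
driver or of the tracing core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Format len bytes as space-separated two-digit hex.
 * Returns 0 on success, -1 if out is too small.
 */
static int format_hex(char *out, size_t out_size,
		      const unsigned char *buf, size_t len)
{
	size_t pos = 0;

	assert(out != NULL && out_size > 0);
	out[0] = '\0';
	for (size_t i = 0; i < len; i++) {
		int n = snprintf(out + pos, out_size - pos, "%s%02x",
				 i ? " " : "", (unsigned int)buf[i]);

		if (n < 0 || (size_t)n >= out_size - pos)
			return -1;
		pos += (size_t)n;
	}
	return 0;
}

/* Check whether a payload corresponds to a data= field from the trace. */
static int matches_trace(const unsigned char *buf, size_t len,
			 const char *field)
{
	char tmp[256];

	if (format_hex(tmp, sizeof(tmp), buf, len))
		return 0;
	return strcmp(tmp, field) == 0;
}
```

For the 8-byte payload "abcdefgh" this produces "61 62 63 64 65 66 67 68",
the value shown in the tx_data and rx_data lines of the example output.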



* [PATCH v2 1/2] serial: qcom-geni: trace: Add tracepoint support for Qualcomm GENI serial
From: Praveen Talari @ 2026-05-12 17:14 UTC
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
	Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-serial-v2-0-a5726421b3af@oss.qualcomm.com>

Add tracepoint support to the Qualcomm GENI serial driver to provide
runtime visibility into driver behavior without requiring invasive debug
patches.

The trace events cover UART termios configuration, clock setup, modem
control state, interrupt status, and TX/RX data, making it easier to
diagnose communication issues in the field.

Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
v1->v2:
- Removed multiple TX/RX trace events, instead used
  DECLARE_EVENT_CLASS and DEFINE_EVENT.
---
 include/trace/events/qcom_geni_serial.h | 172 ++++++++++++++++++++++++++++++++
 1 file changed, 172 insertions(+)

diff --git a/include/trace/events/qcom_geni_serial.h b/include/trace/events/qcom_geni_serial.h
new file mode 100644
index 000000000000..5e23827881d0
--- /dev/null
+++ b/include/trace/events/qcom_geni_serial.h
@@ -0,0 +1,172 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM qcom_geni_serial
+
+#if !defined(_TRACE_QCOM_GENI_SERIAL_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_QCOM_GENI_SERIAL_H
+
+#include <linux/device.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(geni_serial_set_termios,
+	    TP_PROTO(struct device *dev, unsigned int baud,
+		     unsigned int bits_per_char, u32 tx_trans_cfg,
+		     u32 tx_parity_cfg, u32 rx_trans_cfg,
+		     u32 rx_parity_cfg, u32 stop_bit_len),
+	    TP_ARGS(dev, baud, bits_per_char, tx_trans_cfg, tx_parity_cfg,
+		    rx_trans_cfg, rx_parity_cfg, stop_bit_len),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned int, baud)
+			     __field(unsigned int, bits_per_char)
+			     __field(u32, tx_trans_cfg)
+			     __field(u32, tx_parity_cfg)
+			     __field(u32, rx_trans_cfg)
+			     __field(u32, rx_parity_cfg)
+			     __field(u32, stop_bit_len)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->baud = baud;
+			   __entry->bits_per_char = bits_per_char;
+			   __entry->tx_trans_cfg = tx_trans_cfg;
+			   __entry->tx_parity_cfg = tx_parity_cfg;
+			   __entry->rx_trans_cfg = rx_trans_cfg;
+			   __entry->rx_parity_cfg = rx_parity_cfg;
+			   __entry->stop_bit_len = stop_bit_len;
+	    ),
+
+	    TP_printk("%s: baud=%u bpc=%u tx_trans=0x%08x tx_par=0x%08x rx_trans=0x%08x rx_par=0x%08x stop=%u",
+		      __get_str(name), __entry->baud, __entry->bits_per_char,
+		      __entry->tx_trans_cfg, __entry->tx_parity_cfg,
+		      __entry->rx_trans_cfg, __entry->rx_parity_cfg,
+		      __entry->stop_bit_len)
+);
+
+TRACE_EVENT(geni_serial_clk_cfg,
+	    TP_PROTO(struct device *dev, unsigned int desired_rate,
+		     unsigned long clk_rate, unsigned int clk_div,
+		     unsigned int clk_idx),
+	    TP_ARGS(dev, desired_rate, clk_rate, clk_div, clk_idx),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned int, desired_rate)
+			     __field(unsigned long, clk_rate)
+			     __field(unsigned int, clk_div)
+			     __field(unsigned int, clk_idx)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->desired_rate = desired_rate;
+			   __entry->clk_rate = clk_rate;
+			   __entry->clk_div = clk_div;
+			   __entry->clk_idx = clk_idx;
+	    ),
+
+	    TP_printk("%s: desired_rate=%u clk_rate=%lu clk_div=%u clk_idx=%u",
+		      __get_str(name), __entry->desired_rate, __entry->clk_rate,
+		      __entry->clk_div, __entry->clk_idx)
+);
+
+TRACE_EVENT(geni_serial_irq,
+	    TP_PROTO(struct device *dev, u32 m_irq, u32 s_irq,
+		     u32 dma_tx, u32 dma_rx),
+	    TP_ARGS(dev, m_irq, s_irq, dma_tx, dma_rx),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(u32, m_irq)
+			     __field(u32, s_irq)
+			     __field(u32, dma_tx)
+			     __field(u32, dma_rx)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->m_irq = m_irq;
+			   __entry->s_irq = s_irq;
+			   __entry->dma_tx = dma_tx;
+			   __entry->dma_rx = dma_rx;
+	    ),
+
+	    TP_printk("%s: m_irq=0x%08x s_irq=0x%08x dma_tx=0x%08x dma_rx=0x%08x",
+		      __get_str(name), __entry->m_irq, __entry->s_irq,
+		      __entry->dma_tx, __entry->dma_rx)
+);
+
+DECLARE_EVENT_CLASS(geni_serial_data,
+
+	TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
+
+	TP_ARGS(dev, buf, len),
+
+	TP_STRUCT__entry(__string(name, dev_name(dev))
+			 __field(unsigned int, len)
+			 __dynamic_array(u8, data, len)
+	),
+
+	TP_fast_assign(__assign_str(name);
+		       __entry->len = len;
+		       memcpy(__get_dynamic_array(data), buf, len);
+	),
+
+	TP_printk("%s: len=%u data=%s",
+		  __get_str(name), __entry->len,
+		  __print_hex(__get_dynamic_array(data), __entry->len))
+);
+
+DEFINE_EVENT(geni_serial_data, geni_serial_tx_data,
+
+	TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
+
+	TP_ARGS(dev, buf, len)
+
+);
+
+DEFINE_EVENT(geni_serial_data, geni_serial_rx_data,
+
+	TP_PROTO(struct device *dev, const u8 *buf, unsigned int len),
+
+	TP_ARGS(dev, buf, len)
+
+);
+
+TRACE_EVENT(geni_serial_set_mctrl,
+	    TP_PROTO(struct device *dev, unsigned int mctrl,
+		     u32 uart_manual_rfr),
+	    TP_ARGS(dev, mctrl, uart_manual_rfr),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned int, mctrl)
+			     __field(u32, uart_manual_rfr)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->mctrl = mctrl;
+			   __entry->uart_manual_rfr = uart_manual_rfr;
+	    ),
+
+	    TP_printk("%s: mctrl=0x%04x uart_manual_rfr=0x%08x",
+		      __get_str(name), __entry->mctrl, __entry->uart_manual_rfr)
+);
+
+TRACE_EVENT(geni_serial_get_mctrl,
+	    TP_PROTO(struct device *dev, unsigned int mctrl, u32 geni_ios),
+	    TP_ARGS(dev, mctrl, geni_ios),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned int, mctrl)
+			     __field(u32, geni_ios)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->mctrl = mctrl;
+			   __entry->geni_ios = geni_ios;
+	    ),
+
+	    TP_printk("%s: mctrl=0x%04x geni_ios=0x%08x",
+		      __get_str(name), __entry->mctrl, __entry->geni_ios)
+);
+
+#endif /* _TRACE_QCOM_GENI_SERIAL_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

-- 
2.34.1



* [PATCH v2 0/2] Add tracepoints support for Qualcomm GENI Serial drivers
From: Praveen Talari @ 2026-05-12 17:14 UTC
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-serial,
	Mukesh Kumar Savaliya, Aniket Randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari

Add tracepoints to the Qualcomm GENI (Generic Interface) serial driver.
These trace events enable runtime debugging and performance analysis of
UART operations.

The trace events cover UART termios configuration, clock setup, modem
control state, interrupt status, and the transmitted/received data in
hexadecimal format.

Usage examples:

Enable all serial traces:
  echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_serial/enable
  cat /sys/kernel/debug/tracing/trace_pipe

Example trace output:

2517.938432: geni_serial_clk_cfg: a94000.serial: desired_rate=1843200
     clk_rate=7372800 clk_div=4 clk_idx=0
2517.938753: geni_serial_irq: a94000.serial: m_irq=0x88800000
     s_irq=0x08000111 dma_tx=0x00000000 dma_rx=0x00000000
2517.938803: geni_serial_set_termios: a94000.serial: baud=115200 bpc=8
     tx_trans=0x00000002 tx_par=0x00000000 rx_trans=0x00000000
     rx_par=0x00000000 stop=0
2517.938807: geni_serial_set_mctrl: a94000.serial: mctrl=0x8006
     uart_manual_rfr=0x00000000
2517.938818: geni_serial_get_mctrl: a94000.serial: mctrl=0x0160
     geni_ios=0x00000001
2517.939165: geni_serial_irq: a94000.serial: m_irq=0x00400000
     s_irq=0x00000000 dma_tx=0x00000000 dma_rx=0x00000000
2517.939592: geni_serial_tx_data: a94000.serial: tx_len=8 data=61 62 63
     64 65 66 67 68
2517.940610: geni_serial_irq: a94000.serial: m_irq=0x00000001
     s_irq=0x00000000 dma_tx=0x00000003 dma_rx=0x00000000
2517.942174: geni_serial_irq: a94000.serial: m_irq=0x08000000
     s_irq=0x08000100 dma_tx=0x00000000 dma_rx=0x00000003
2517.942323: geni_serial_rx_data: a94000.serial: rx_len=8 data=61 62 63
     64 65 66 67 68
2517.942680: geni_serial_set_mctrl: a94000.serial: mctrl=0x8000
     uart_manual_rfr=0x80000002

Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
Changes in v2:
- removed multiple trace events for TX/RX events, instead used
  DECLARE_EVENT_CLASS and DEFINE_EVENT.
- Link to v1: https://lore.kernel.org/r/20260506-add-tracepoints-for-qcom-geni-serial-v1-0-544b22612e08@oss.qualcomm.com

---
Praveen Talari (2):
      serial: qcom-geni: trace: Add tracepoint support for Qualcomm GENI serial
      serial: qcom-geni: Add tracepoints for Qualcomm GENI serial driver

 drivers/tty/serial/qcom_geni_serial.c   |  27 ++++-
 include/trace/events/qcom_geni_serial.h | 172 ++++++++++++++++++++++++++++++++
 2 files changed, 195 insertions(+), 4 deletions(-)
---
base-commit: 1f5ffc672165ff851063a5fd044b727ab2517ae3
change-id: 20260427-add-tracepoints-for-qcom-geni-serial-948777218b7b

Best regards,
-- 
Praveen Talari <praveen.talari@oss.qualcomm.com>
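
The numbers in the geni_serial_clk_cfg line of the example output can be
cross-checked with a short sketch. The patch passes baud * sampling_rate as
desired_rate; the sampling rate of 16 used below is an assumption inferred
from the traced values (1843200 / 115200), and both helper names are
hypothetical.

```c
#include <assert.h>

/* desired_rate as traced: the baud rate times the oversampling factor. */
static unsigned int desired_rate(unsigned int baud, unsigned int sampling_rate)
{
	return baud * sampling_rate;
}

/* Rate seen after applying the traced clock divider. */
static unsigned long divided_rate(unsigned long clk_rate, unsigned int clk_div)
{
	assert(clk_div != 0);
	return clk_rate / clk_div;
}
```

With baud=115200 and a sampling rate of 16, desired_rate() gives 1843200,
and divided_rate(7372800, 4) gives the same value, which is consistent with
clk_rate=7372800 clk_div=4 in the trace.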



* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Jiri Olsa @ 2026-05-12 17:06 UTC
  To: Masami Hiramatsu
  Cc: Jiri Olsa, Andrii Nakryiko, bpf, linux-trace-kernel, oleg, peterz,
	mingo
In-Reply-To: <20260512141431.a70375744fdae263bda5b722@kernel.org>

On Tue, May 12, 2026 at 02:14:31PM +0900, Masami Hiramatsu wrote:
> On Sun, 10 May 2026 23:25:26 +0200
> Jiri Olsa <olsajiri@gmail.com> wrote:
> 
> > On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> > > The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> > > probe site with a CALL into a uprobe trampoline. CALL pushes a return
> > > address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> > > user code may keep temporary data without adjusting rsp.
> > > 
> > > Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> > > also does not provide a return address. Replace the single trampoline
> > > entry with a page of 16-byte slots. Each optimized probe jumps to its
> > > assigned slot, the slot moves rsp below the red zone, saves the registers
> > > clobbered by syscall, and invokes the uprobe syscall:
> > > 
> > >   Probe site:   jmp slot_N              (5B, replaces nop5)
> > > 
> > >   Slot N:       lea  -128(%rsp), %rsp   (5B)  skip red zone
> > >                 push %rcx               (1B)  save (syscall clobbers)
> > >                 push %r11               (2B)  save (syscall clobbers)
> > >                 push %rax               (1B)  save (syscall uses for nr)
> > >                 mov  $336, %eax         (5B)  uprobe syscall number
> > >                 syscall                 (2B)
> > > 
> > > All slots contain identical code at different offsets, so the trampoline
> > > page is generated once at boot and mapped read-execute into each process.
> > > The syscall handler identifies the slot from regs->ip, which points just
> > > after the syscall instruction, and uses a per-mm slot table to recover the
> > > original probe address.
> > > 
> > > The uprobe syscall does not return to the trampoline slot. The handler
> > > restores the probe-site register state, runs the uprobe consumers, sets
> > > pt_regs to continue at probe_addr + 5 unless a consumer redirected
> > > execution, and returns directly through the IRET path. This preserves
> > > general purpose registers, including rcx and r11, without requiring any
> > > post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> > > shadow stack concerns.
> > > 
> > > Protect the per-mm trampoline list with RCU and free trampoline metadata
> > > with kfree_rcu(). This lets the syscall path resolve trampoline slots
> > > without taking mmap_lock. The optimized-instruction detection path also
> > > walks the trampoline list under an RCU read-side lock. Since that path
> > > starts from the JMP target, it translates the slot start to the post-syscall
> > > IP expected by the shared resolver before checking the trampoline mapping.
> > > 
> > > Each trampoline page provides 256 slots. Slots stay permanently assigned
> > > to their first probe address and are reused only when the same address is
> > > probed again. Reassigning detached slots is deliberately avoided because a
> > > thread can remain in a trampoline for an unbounded time due to ptrace,
> > > interrupts, or scheduling delays. If a reachable trampoline page runs out
> > > of slots, probes that cannot allocate a slot fall back to the slower INT3
> > > path.
> > > 
> > > Require the entire trampoline page to be reachable by a rel32 JMP before
> > > reusing it for a probe. This keeps every slot in the page within the range
> > > that can be encoded at the probe site.
> > > 
> > > Change the error code returned when the uprobe syscall is invoked outside
> > > a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> > > similar libraries distinguish fixed kernels from kernels with the
> > > red-zone-clobbering implementation and enable nop5 optimization only on
> > > fixed kernels.
> > > 
> > > Performance (usdt single-thread, M/s):
> > > 
> > >                   usdt-nop  usdt-nop5-base  usdt-nop5-fix  nop5-change  iret%
> > >   Skylake          3.149        6.422          4.865         -24.3%     39.1%
> > >   Milan            2.910        3.443          3.820         +11.0%     24.3%
> > >   Sapphire Rapids  1.896        4.023          3.693          -8.2%     24.9%
> > >   Bergamo          3.393        3.895          3.849          -1.2%     24.5%
> > > 
> > > The fixed nop5 path remains faster than the non-optimized INT3 path on all
> > > measured systems. The regression relative to the old CALL-based trampoline
> > > comes from IRET being more expensive than SYSRET, most noticeably on older
> > > Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> > > and AMD Milan improves because removing mmap_lock from the hot path more
> > > than offsets the IRET cost.
> > > 
> > > Multi-threaded throughput scales nearly linearly with the number of CPUs, like
> > > it used to, thanks to lockless RCU-protected uprobe trampoline lookup.
> > 
> > hi,
> > thanks a lot for the fix
> > 
> > FWIW we discussed also an option to have 10-bytes nop and do:
> >   [rsp+0x80, call trampoline]
> > 
> > we would not need the slots re-use logic, but not sure what other
> > surprises there are with 10-bytes nop
> 
> Does this mean we have to update the USDT implementation?

it's the optimized uprobe code, which is used for usdt probes that emit nop5
instead of a single nop

> 
> > 
> > I tried that change [1], it seems to work, but it has other
> > difficulties, like I think the unoptimized path needs to do:
> >   [rsp+0x80, call trampoline] -> [jmp end of 10-bytes nop]
> > instead of patching back the 10-byte nop, because some thread
> > could be inside the nop area already.
> 
> Yeah, but at that moment, we know where the modified code is.
> Maybe memory dump shows different code, but that is also true
> if uprobe is active. So I think it is OK.

hum, I'm not sure what you mean.. I attached the kernel part of my changes
below, if you want to comment on top of that

the whole change including user space changes is in here:
  https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=redzone_fix

jirka
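
For readers following the arithmetic in the attached change, a userspace
sketch of the stack and ip bookkeeping for the 10-byte sequence. The
constants mirror the diff; struct toy_args (four 8-byte fields) and the
helper names are assumptions made for illustration, and the push order is
inferred from the pop sequence in the trampoline.

```c
#include <assert.h>
#include <stdint.h>

#define LEA_INSN_SIZE		5
#define CALL_INSN_SIZE		5
#define UPROBE_OPT_INSN_SIZE	(LEA_INSN_SIZE + CALL_INSN_SIZE)
#define REDZONE_SIZE		0x80

/* Hypothetical layout of what the trampoline leaves on the stack. */
struct toy_args {
	uint64_t ax, r11, cx, retaddr;
};

/* User rsp at syscall entry, starting from rsp at the probe site. */
static uint64_t sp_at_syscall(uint64_t probe_sp)
{
	uint64_t sp = probe_sp;

	sp -= REDZONE_SIZE;	/* lea -128(%rsp), %rsp */
	sp -= 8;		/* call pushes the return address */
	sp -= 3 * 8;		/* trampoline pushes rcx, r11, rax */
	return sp;
}

/* regs->sp += sizeof(args) + REDZONE_SIZE */
static uint64_t recover_sp(uint64_t syscall_sp)
{
	assert(sizeof(struct toy_args) == 32);
	return syscall_sp + sizeof(struct toy_args) + REDZONE_SIZE;
}

/* regs->ip = args.retaddr - UPROBE_OPT_INSN_SIZE */
static uint64_t recover_ip(uint64_t retaddr)
{
	return retaddr - UPROBE_OPT_INSN_SIZE;
}

/* Same delta check as is_reachable_by_call(), for a rel32 call. */
static int reachable(uint64_t vtramp, uint64_t vaddr)
{
	int64_t delta = (int64_t)(vaddr + UPROBE_OPT_INSN_SIZE - vtramp);

	return delta >= INT32_MIN && delta <= INT32_MAX;
}
```

The same accounting explains "ret $0x80" in the trampoline: after the three
pops, ret consumes the return address and then releases the 128 bytes that
the lea reserved.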


---
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index ebb1baf1eb1d..a6db7b76cb49 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -636,9 +636,20 @@ struct uprobe_trampoline {
 	unsigned long		vaddr;
 };
 
+static const u8 lea_rsp[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
+
+#define LEA_INSN_SIZE		5
+#define UPROBE_OPT_INSN_SIZE	(LEA_INSN_SIZE + CALL_INSN_SIZE)
+#define REDZONE_SIZE		0x80
+
+static bool is_lea_insn(const uprobe_opcode_t *insn)
+{
+	return !memcmp(insn, lea_rsp, LEA_INSN_SIZE);
+}
+
 static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
 {
-	long delta = (long)(vaddr + 5 - vtramp);
+	long delta = (long)(vaddr + UPROBE_OPT_INSN_SIZE - vtramp);
 
 	return delta >= INT_MIN && delta <= INT_MAX;
 }
@@ -651,7 +662,7 @@ static unsigned long find_nearest_trampoline(unsigned long vaddr)
 	};
 	unsigned long low_limit, high_limit;
 	unsigned long low_tramp, high_tramp;
-	unsigned long call_end = vaddr + 5;
+	unsigned long call_end = vaddr + UPROBE_OPT_INSN_SIZE;
 
 	if (check_add_overflow(call_end, INT_MIN, &low_limit))
 		low_limit = PAGE_SIZE;
@@ -826,8 +837,8 @@ SYSCALL_DEFINE0(uprobe)
 	regs->ax  = args.ax;
 	regs->r11 = args.r11;
 	regs->cx  = args.cx;
-	regs->ip  = args.retaddr - 5;
-	regs->sp += sizeof(args);
+	regs->ip  = args.retaddr - UPROBE_OPT_INSN_SIZE;
+	regs->sp += sizeof(args) + REDZONE_SIZE;
 	regs->orig_ax = -1;
 
 	sp = regs->sp;
@@ -844,12 +855,12 @@ SYSCALL_DEFINE0(uprobe)
 	 */
 	if (regs->sp != sp) {
 		/* skip the trampoline call */
-		if (args.retaddr - 5 == regs->ip)
-			regs->ip += 5;
+		if (args.retaddr - UPROBE_OPT_INSN_SIZE == regs->ip)
+			regs->ip += UPROBE_OPT_INSN_SIZE;
 		return regs->ax;
 	}
 
-	regs->sp -= sizeof(args);
+	regs->sp -= sizeof(args) + REDZONE_SIZE;
 
 	/* for the case uprobe_consumer has changed ax/r11/cx */
 	args.ax  = regs->ax;
@@ -857,7 +868,7 @@ SYSCALL_DEFINE0(uprobe)
 	args.cx  = regs->cx;
 
 	/* keep return address unless we are instructed otherwise */
-	if (args.retaddr - 5 != regs->ip)
+	if (args.retaddr - UPROBE_OPT_INSN_SIZE != regs->ip)
 		args.retaddr = regs->ip;
 
 	if (shstk_push(args.retaddr) == -EFAULT)
@@ -891,7 +902,7 @@ asm (
 	"pop %rax\n"
 	"pop %r11\n"
 	"pop %rcx\n"
-	"ret\n"
+	"ret $0x80\n"
 	"int3\n"
 	".balign " __stringify(PAGE_SIZE) "\n"
 	".popsection\n"
@@ -930,9 +941,9 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
 		       int nbytes, void *data)
 {
 	struct write_opcode_ctx *ctx = data;
-	uprobe_opcode_t old_opcode[5];
+	uprobe_opcode_t old_opcode[UPROBE_OPT_INSN_SIZE];
 
-	uprobe_copy_from_page(page, ctx->base, (uprobe_opcode_t *) &old_opcode, 5);
+	uprobe_copy_from_page(page, ctx->base, old_opcode, UPROBE_OPT_INSN_SIZE);
 
 	switch (ctx->expect) {
 	case EXPECT_SWBP:
@@ -940,7 +951,7 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
 			return 1;
 		break;
 	case EXPECT_CALL:
-		if (is_call_insn(&old_opcode[0]))
+		if (is_lea_insn(old_opcode))
 			return 1;
 		break;
 	}
@@ -963,7 +974,7 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
  *   - SMP sync all CPUs
  */
 static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
-		       unsigned long vaddr, char *insn, bool optimize)
+		       unsigned long vaddr, char *insn, int size, bool optimize)
 {
 	uprobe_opcode_t int3 = UPROBE_SWBP_INSN;
 	struct write_opcode_ctx ctx = {
@@ -990,7 +1001,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 
 	/* Write all but the first byte of the patched range. */
 	ctx.expect = EXPECT_SWBP;
-	err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, 4, verify_insn,
+	err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, size - 1, verify_insn,
 			   true /* is_register */, false /* do_update_ref_ctr */,
 			   &ctx);
 	if (err)
@@ -1017,17 +1028,35 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 			 unsigned long vaddr, unsigned long tramp)
 {
-	u8 call[5];
+	u8 insn[UPROBE_OPT_INSN_SIZE];
 
-	__text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
-			(const void *) tramp, CALL_INSN_SIZE);
-	return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
+	/*
+	 * We have nop10 (with first byte overwritten to int3),
+	 * change it to:
+	 *   lea -0x80(%rsp), %rsp
+	 *   call tramp
+	 *
+	 * The first lea instruction skips the stack redzone so the call
+	 * instruction can safely push return address on stack.
+	 */
+	memcpy(insn, lea_rsp, LEA_INSN_SIZE);
+	__text_gen_insn(insn + LEA_INSN_SIZE, CALL_INSN_OPCODE,
+			(const void *)(vaddr + LEA_INSN_SIZE),
+			(const void *)tramp, CALL_INSN_SIZE);
+	return int3_update(auprobe, vma, vaddr, insn, UPROBE_OPT_INSN_SIZE, true /* optimize */);
 }
 
 static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 			   unsigned long vaddr)
 {
-	return int3_update(auprobe, vma, vaddr, auprobe->insn, false /* optimize */);
+	/*
+	 * Write JMP rel8 to end of the 10-byte slot instead of restoring the
+	 * original nop10, because we could have thread already inside lea
+	 * instruction.
+	 */
+	u8 jmp[UPROBE_OPT_INSN_SIZE] = { 0xeb, UPROBE_OPT_INSN_SIZE - 2 };
+
+	return int3_update(auprobe, vma, vaddr, jmp, 2, false /* optimize */);
 }
 
 static int copy_from_vaddr(struct mm_struct *mm, unsigned long vaddr, void *dst, int len)
@@ -1049,19 +1078,21 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
 	struct __packed __arch_relative_insn {
 		u8 op;
 		s32 raddr;
-	} *call = (struct __arch_relative_insn *) insn;
+	} *call = (struct __arch_relative_insn *)(insn + LEA_INSN_SIZE);
 
-	if (!is_call_insn(insn))
+	if (!is_lea_insn(insn))
 		return false;
-	return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
+	if (!is_call_insn((uprobe_opcode_t *) call))
+		return false;
+	return __in_uprobe_trampoline(vaddr + UPROBE_OPT_INSN_SIZE + call->raddr);
 }
 
 static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
 {
-	uprobe_opcode_t insn[5];
+	uprobe_opcode_t insn[UPROBE_OPT_INSN_SIZE];
 	int err;
 
-	err = copy_from_vaddr(mm, vaddr, &insn, 5);
+	err = copy_from_vaddr(mm, vaddr, &insn, UPROBE_OPT_INSN_SIZE);
 	if (err)
 		return err;
 	return __is_optimized((uprobe_opcode_t *)&insn, vaddr);
@@ -1131,7 +1162,7 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
 void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
 {
 	struct mm_struct *mm = current->mm;
-	uprobe_opcode_t insn[5];
+	uprobe_opcode_t insn[UPROBE_OPT_INSN_SIZE];
 
 	if (!should_optimize(auprobe))
 		return;
@@ -1142,7 +1173,7 @@ void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
 	 * Check if some other thread already optimized the uprobe for us,
 	 * if it's the case just go away silently.
 	 */
-	if (copy_from_vaddr(mm, vaddr, &insn, 5))
+	if (copy_from_vaddr(mm, vaddr, &insn, UPROBE_OPT_INSN_SIZE))
 		goto unlock;
 	if (!is_swbp_insn((uprobe_opcode_t*) &insn))
 		goto unlock;
@@ -1160,14 +1191,23 @@ void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
 
 static bool can_optimize(struct insn *insn, unsigned long vaddr)
 {
-	if (!insn->x86_64 || insn->length != 5)
+	if (!insn->x86_64)
 		return false;
 
-	if (!insn_is_nop(insn))
+	/* We can't do cross page atomic writes yet. */
+	if (PAGE_SIZE - (vaddr & ~PAGE_MASK) < UPROBE_OPT_INSN_SIZE)
 		return false;
 
-	/* We can't do cross page atomic writes yet. */
-	return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
+	if (insn->length == UPROBE_OPT_INSN_SIZE && insn_is_nop(insn))
+		return true;
+
+	/* JMP rel8 to end of slot — written by swbp_unoptimize. */
+	if (insn->length == 2 &&
+	    insn->opcode.bytes[0] == 0xEB &&
+	    insn->immediate.value == UPROBE_OPT_INSN_SIZE - 2)
+		return true;
+
+	return false;
 }
 #else /* 32-bit: */
 /*
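
The two checks that the new can_optimize() relies on can be tried out in
userspace. This is only a sketch: the sk_* names and the fixed 4096-byte
page size are assumptions, and recognizing a real 10-byte nop (the
insn_is_nop() part) is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_SIZE      4096UL
#define SK_OPT_INSN_SIZE  10UL

/* True if all 10 bytes starting at vaddr lie within one page. */
static bool sk_fits_in_page(uint64_t vaddr)
{
	uint64_t offset = vaddr & (SK_PAGE_SIZE - 1);

	return SK_PAGE_SIZE - offset >= SK_OPT_INSN_SIZE;
}

/* True if insn is the "jmp rel8 to end of slot" left by unoptimize. */
static bool sk_is_skip_jmp(const uint8_t insn[2])
{
	return insn[0] == 0xeb && insn[1] == SK_OPT_INSN_SIZE - 2;
}
```

A probe whose offset within the page is 4086 still fits, one at 4087 does
not, so it would stay on the int3 path.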

^ permalink raw reply related

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Jiri Olsa @ 2026-05-12 16:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Andrii Nakryiko, bpf, linux-trace-kernel, oleg, peterz,
	mingo, mhiramat
In-Reply-To: <CAEf4Bza9PjbaVjFxYDmWPXXGV+Z-_Hn2Kz_KB2TOa5s-_UJ1xA@mail.gmail.com>

On Mon, May 11, 2026 at 06:41:06PM +0200, Andrii Nakryiko wrote:
> On Sun, May 10, 2026 at 2:25 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> > > The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> > > probe site with a CALL into a uprobe trampoline. CALL pushes a return
> > > address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> > > user code may keep temporary data without adjusting rsp.
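
The address arithmetic behind the problem, as a small sketch. The sk_*
names are hypothetical; the only facts used are that CALL stores 8 bytes at
rsp - 8 and that the red zone covers the 128 bytes below rsp.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_RED_ZONE_SIZE 128

/* True if an 8-byte store at addr overlaps [rsp - 128, rsp). */
static bool sk_store_hits_red_zone(uint64_t rsp, uint64_t addr)
{
	return addr < rsp && addr + 8 > rsp - SK_RED_ZONE_SIZE;
}

/* Address written by CALL: the return address goes to rsp - 8. */
static uint64_t sk_call_push_addr(uint64_t rsp)
{
	return rsp - 8;
}
```

With the probed function's rsp, the push lands inside the red zone. If rsp
is first lowered by 128, the same push lands at the original rsp - 136,
just below it.
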
> > >
> > > Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> > > also does not provide a return address. Replace the single trampoline
> > > entry with a page of 16-byte slots. Each optimized probe jumps to its
> > > assigned slot, the slot moves rsp below the red zone, saves the registers
> > > clobbered by syscall, and invokes the uprobe syscall:
> > >
> > >   Probe site:   jmp slot_N              (5B, replaces nop5)
> > >
> > >   Slot N:       lea  -128(%rsp), %rsp   (5B)  skip red zone
> > >                 push %rcx               (1B)  save (syscall clobbers)
> > >                 push %r11               (2B)  save (syscall clobbers)
> > >                 push %rax               (1B)  save (syscall uses for nr)
> > >                 mov  $336, %eax         (5B)  uprobe syscall number
> > >                 syscall                 (2B)
> > >
> > > All slots contain identical code at different offsets, so the trampoline
> > > page is generated once at boot and mapped read-execute into each process.
> > > The syscall handler identifies the slot from regs->ip, which points just
> > > after the syscall instruction, and uses a per-mm slot table to recover the
> > > original probe address.
> > >
> > > The uprobe syscall does not return to the trampoline slot. The handler
> > > restores the probe-site register state, runs the uprobe consumers, sets
> > > pt_regs to continue at probe_addr + 5 unless a consumer redirected
> > > execution, and returns directly through the IRET path. This preserves
> > > general purpose registers, including rcx and r11, without requiring any
> > > post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> > > shadow stack concerns.
> > >
> > > Protect the per-mm trampoline list with RCU and free trampoline metadata
> > > with kfree_rcu(). This lets the syscall path resolve trampoline slots
> > > without taking mmap_lock. The optimized-instruction detection path also
> > > walks the trampoline list under an RCU read-side lock. Since that path
> > > starts from the JMP target, it translates the slot start to the post-syscall
> > > IP expected by the shared resolver before checking the trampoline mapping.
> > >
> > > Each trampoline page provides 256 slots. Slots stay permanently assigned
> > > to their first probe address and are reused only when the same address is
> > > probed again. Reassigning detached slots is deliberately avoided because a
> > > thread can remain in a trampoline for an unbounded time due to ptrace,
> > > interrupts, or scheduling delays. If a reachable trampoline page runs out
> > > of slots, probes that cannot allocate a slot fall back to the slower INT3
> > > path.
> > >
> > > Require the entire trampoline page to be reachable by a rel32 JMP before
> > > reusing it for a probe. This keeps every slot in the page within the range
> > > that can be encoded at the probe site.
> > >
> > > Change the error code returned when the uprobe syscall is invoked outside
> > > a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> > > similar libraries distinguish fixed kernels from kernels with the
> > > red-zone-clobbering implementation and enable nop5 optimization only on
> > > fixed kernels.
> > >
> > > Performance (usdt single-thread, M/s):
> > >
> > >                   usdt-nop  usdt-nop5-base  usdt-nop5-fix  nop5-change  iret%
> > >   Skylake          3.149        6.422          4.865         -24.3%     39.1%
> > >   Milan            2.910        3.443          3.820         +11.0%     24.3%
> > >   Sapphire Rapids  1.896        4.023          3.693          -8.2%     24.9%
> > >   Bergamo          3.393        3.895          3.849          -1.2%     24.5%
> > >
> > > The fixed nop5 path remains faster than the non-optimized INT3 path on all
> > > measured systems. The regression relative to the old CALL-based trampoline
> > > comes from IRET being more expensive than SYSRET, most noticeably on older
> > > Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> > > and AMD Milan improves because removing mmap_lock from the hot path more
> > > than offsets the IRET cost.
> > >
> > > Multi-threaded throughput scales nearly linearly with the number of CPUs, like
> > > it used to, thanks to lockless RCU-protected uprobe trampoline lookup.
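
One way the slot could be recovered from regs->ip, as a sketch only. It
assumes what the description above states (16-byte slots, 256 per page, ip
pointing just past the syscall instruction at the end of the slot); the
sk_* names and the exact validation are made up and may differ from the
actual patch.

```c
#include <stdint.h>

#define SK_SLOT_SIZE       16	/* 5 + 1 + 2 + 1 + 5 + 2 bytes */
#define SK_SLOTS_PER_PAGE  256	/* 256 * 16 = 4096 bytes */

/*
 * ip is the address right after the syscall instruction, i.e. the end of
 * the slot. Returns the slot index, or -1 if ip is not a slot end.
 */
static int sk_slot_from_ip(uint64_t tramp_base, uint64_t ip)
{
	uint64_t off;

	if (ip <= tramp_base)
		return -1;
	off = ip - tramp_base;
	if (off % SK_SLOT_SIZE)
		return -1;
	if (off > (uint64_t)SK_SLOT_SIZE * SK_SLOTS_PER_PAGE)
		return -1;
	return (int)(off / SK_SLOT_SIZE) - 1;
}
```

The index would then be used to look up the probe address in the per-mm
slot table.
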
> >
> > hi,
> > thanks a lot for the fix
> >
> > FWIW we discussed also an option to have 10-bytes nop and do:
> >   [rsp+0x80, call trampoline]
> >
> > we would not need the slots re-use logic, but not sure what other
> > surprises there are with 10-bytes nop
> >
> > I tried that change [1], it seems to work, but it has other
> > difficulties, like I think the unoptimized path needs to do:
> >   [rsp+0x80, call trampoline] -> [jmp end of 10-bytes nop]
> > instead of patching back the 10-byte nop, because some thread
> > could be inside the nop area already.
> >
> 
> Yeah, nop10 and this jump-over-nop10 approach is an alternative. I
> don't have strong feelings apart from the ridiculousness of a 10-byte
> nop :)
> 
> did you get a chance to benchmark your nop10 approach? curious what
> the numbers look like

yes, it's the same as with the nop5

  base:
          usermode-count :  152.509 ± 0.044M/s
          syscall-count  :   15.177 ± 0.021M/s
          uprobe-nop     :    3.215 ± 0.002M/s
          uprobe-push    :    3.054 ± 0.003M/s
          uprobe-ret     :    1.100 ± 0.002M/s
          uprobe-nop5    :    7.251 ± 0.034M/s
          uretprobe-nop  :    2.149 ± 0.012M/s
          uretprobe-push :    2.088 ± 0.001M/s
          uretprobe-ret  :    0.960 ± 0.001M/s
          uretprobe-nop5 :    3.402 ± 0.001M/s
          usdt-nop       :    3.185 ± 0.024M/s
          usdt-nop5      :    7.378 ± 0.016M/s

  nop10:
          usermode-count :  152.503 ± 0.024M/s
          syscall-count  :   15.977 ± 0.047M/s
          uprobe-nop     :    3.174 ± 0.011M/s
          uprobe-push    :    3.030 ± 0.006M/s
          uprobe-ret     :    1.124 ± 0.004M/s
          uprobe-nop5    :    7.201 ± 0.012M/s
          uretprobe-nop  :    2.141 ± 0.005M/s
          uretprobe-push :    2.078 ± 0.007M/s
          uretprobe-ret  :    0.947 ± 0.003M/s
          uretprobe-nop5 :    3.384 ± 0.014M/s
          usdt-nop       :    3.247 ± 0.002M/s
          usdt-nop5      :    7.374 ± 0.027M/s
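
For comparing the two runs, the relative change per row can be computed
with a trivial helper (sk_pct_change is a made-up name, the inputs are the
M/s values from the tables above):

```c
/* Relative change of val against base, in percent. */
static double sk_pct_change(double base, double val)
{
	return (val - base) / base * 100.0;
}
```

For the three nop5 rows this gives about -0.69% (uprobe-nop5), -0.53%
(uretprobe-nop5) and -0.05% (usdt-nop5), so all are below 1% in magnitude.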

jirka

^ permalink raw reply

* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Steven Rostedt @ 2026-05-12 16:47 UTC (permalink / raw)
  To: Jens Remus
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Linus Torvalds, Andrew Morton, Florian Weimer, Kees Cook,
	Carlos O'Donell, Sam James, Dylan Hatch, Borislav Petkov,
	Dave Hansen, David Hildenbrand, H. Peter Anvin, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Heiko Carstens, Vasily Gorbik
In-Reply-To: <43158d95-b4c2-44d2-a244-eb546fb2bfaa@linux.ibm.com>

On Fri, 8 May 2026 09:46:30 +0200
Jens Remus <jremus@linux.ibm.com> wrote:

> >   STACKTRACE_REGISTER_SFRAME - This registers the sframe
> >   STACKTRACE_UNREGISTER_SFRAME - This removes the sframe
> > 
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>  
> 
> LGTM.  Some comments/questions below.

Note, after talking with people at LSF/MM/BPF, I plan on completely
changing this system call into two distinct ones, and only for sframes.
I'll be sending that later this week.

> 
> > diff --git a/include/uapi/linux/stacktrace.h b/include/uapi/linux/stacktrace.h  
> 
> > @@ -0,0 +1,10 @@
> > +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> > +#ifndef _UAPI_LINUX_STACKTRACE_H
> > +#define _UAPI_LINUX_STACKTRACE_H
> > +
> > +enum stacktrace_setup_types {
> > +	STACKTRACE_REGISTER_SFRAME	= 1,
> > +	STACKTRACE_UNREGISTER_SFRAME	= 2,
> > +};
> > +
> > +#endif /* _UAPI_LINUX_STACKTRACE_H */  
> 
> > diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c  
> 
> Having the syscall live in kernel/unwind/sframe.c means it is only
> available if config option HAVE_UNWIND_USER_SFRAME is selected (which
> triggers sframe.o to be built and linked into the kernel), which makes
> sense as long as it only implements sframe-specific functionality.
> I suppose it could be moved elsewhere if non-sframe use cases would
> arise in the future?

The new system calls will only be for sframes. Other unwinders will need to
implement their own system calls.

> 
> Would Dylan need to guard it when introducing HAVE_UNWIND_KERNEL_SFRAME?
> Provided the syscall fails with -ENOSYS if not implemented (e.g. when
> HAVE_UNWIND_USER_SFRAME is not enabled) the dummy implementations of
> sframe_add_section() and sframe_remove_section() in linux/sframe.h also
> return -ENOSYS, so the user observable behavior would be the same and
> it would not matter.  Do you agree?

I'll reply to that when Dylan's patches get closer to acceptance ;-)

> 
> > @@ -12,8 +12,10 @@
> >  #include <linux/mm.h>
> >  #include <linux/string_helpers.h>
> >  #include <linux/sframe.h>
> > +#include <linux/syscalls.h>
> >  #include <asm/unwind_user_sframe.h>
> >  #include <linux/unwind_user_types.h>
> > +#include <uapi/linux/stacktrace.h>
> >  
> >  #include "sframe.h"
> >  #include "sframe_debug.h"
> > @@ -838,3 +840,38 @@ void sframe_free_mm(struct mm_struct *mm)
> >  
> >  	mtree_destroy(&mm->sframe_mt);
> >  }
> > +
> > +/**
> > + * sys_stacktrace_setup - register an address for user space stacktrace walking.
> > + * @op: Type of operation to perform
> > + * @addr_start: The virtual address of the stacktrace information
> > + * @addr_length: The length of the stacktrace information
> > + * @text_start: The virtual address of the text that @addr_start represents
> > + * @text_length: The length of the text
> > + *
> > + * This system call is used by dynamic library utilities to inform the kernel
> > + * of meta data that it loaded that can be used by the kernel to know how
> > + * to stack walk the given text locations.
> > + *
> > + * Currently only sframes are supported, but in the future, this may be used
> > + * to tell the kernel about JIT code which will most likely have a different
> > + * format.
> > + *
> > + * The type command may be extended and parameters may be used for other
> > + * purposes.
> > + *
> > + * Return: 0 if successful, otherwise a negative error.
> > + */
> > +SYSCALL_DEFINE5(stacktrace_setup, int, op, unsigned long, addr_start,
> > +		unsigned long, addr_length, unsigned long, text_start,
> > +		unsigned long, text_length)  
> 
> Would it make sense to keep the parameters generic from start, similar
> to how it is done in prctl()?  Or can this be changed later, if the need
> arises?

Following discussions at LSF/MM/BPF, I'll have the system call parameters be
a pointer to a structure and the size of that structure. The whole API will
then be part of the structure.
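
A userspace sketch of how a pointer-plus-size argument is commonly handled
so the structure can grow later. This mimics the semantics of the kernel's
copy_struct_from_user() with plain memory. The structure layout and the
sk_* names are hypothetical, since the new interface has not been posted
yet.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical argument layout, for illustration only. */
struct sk_sframe_args {
	uint64_t sframe_start;
	uint64_t sframe_length;
	uint64_t text_start;
	uint64_t text_length;
};

/*
 * Copy usize bytes of caller data into *dst.
 * - Older caller (usize < sizeof(*dst)): missing fields read as zero.
 * - Newer caller (usize > sizeof(*dst)): unknown trailing bytes must be
 *   zero, otherwise -E2BIG is returned.
 */
static int sk_copy_args(struct sk_sframe_args *dst, const void *src,
			size_t usize)
{
	const unsigned char *p = src;
	size_t ksize = sizeof(*dst);
	size_t n = usize < ksize ? usize : ksize;
	size_t i;

	for (i = ksize; i < usize; i++)
		if (p[i])
			return -E2BIG;

	memset(dst, 0, ksize);
	memcpy(dst, src, n);
	return 0;
}
```

With this pattern, new fields can be appended to the structure without
adding another system call.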

Thanks for reviewing,

-- Steve

> 
> SYSCALL_DEFINE5(stacktrace_setup, int, op, unsigned long, arg2,
> 		unsigned long, arg3, unsigned long, arg4, unsigned long, arg5)
> 
> > +{
> > +	switch (op) {
> > +	case STACKTRACE_REGISTER_SFRAME:
> > +		return sframe_add_section(addr_start, addr_start + addr_length,
> > +					  text_start, text_start+text_length);  
> 
> Nit:
> 					  text_start, text_start + text_length);
> 
> > +	case STACKTRACE_UNREGISTER_SFRAME:
> > +		return sframe_remove_section(addr_start);
> > +	}
> > +	return -EINVAL;
> > +}  
> Thanks and regards,
> Jens


^ permalink raw reply

* Re: [PATCHv2] uprobes: Use flexible array for xol_area bitmap
From: Oleg Nesterov @ 2026-05-12 16:17 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Rosen Penev, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, open list:PERFORMANCE EVENTS SUBSYSTEM,
	open list:UPROBES
In-Reply-To: <20260512234857.44e2b0fafa4961900fdb7246@kernel.org>

On 05/12, Masami Hiramatsu wrote:
>
> On Tue, 12 May 2026 13:29:52 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > >
> > > -	area = kzalloc_obj(*area);
> > > +	area = kzalloc_flex(*area, bitmap, BITS_TO_LONGS(UINSNS_PER_PAGE));
> >
> > The downside is that kmalloc will use kmem_cache with ->object_size = PAGE_SIZE * 2,
> > almost half of the allocated memory won't be used...
>
> Hmm, is the bitmap so big?
>
> #define UINSNS_PER_PAGE			(PAGE_SIZE/UPROBE_XOL_SLOT_BYTES)
>
> And even on arm64,
>
> #define UPROBE_XOL_SLOT_BYTES	AARCH64_INSN_SIZE
>
> So if PAGE_SIZE is 4k, UINSNS_PER_PAGE is 1k, its BITS_TO_LONGS will
> be 1024/64 = 16. So 128 bytes. So the object is allocated from
> object_size = 256 ?

Indeed you are right.

Sorry for the noise and thanks for correcting me! I can't even explain how
I came to the conclusion that object_size can be greater than PAGE_SIZE with
this change ;)

So I think the patch from Rosen is fine.

Thanks,

Oleg.


^ permalink raw reply

* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Wei Yang @ 2026-05-12 15:44 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260511185817.686831-12-npache@redhat.com>

On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
>Enable khugepaged to collapse to mTHP orders. This patch implements the
>main scanning logic using a bitmap to track occupied pages and a stack
>structure that allows us to find optimal collapse sizes.
>
>Prior to this patch, PMD collapse had 3 main phases: a lightweight
>scanning phase (mmap_read_lock) that determines a potential PMD
>collapse, an alloc phase (mmap unlocked), and finally a heavier collapse
>phase (mmap_write_lock).
>
>To enable mTHP collapse we make the following changes:
>
>During PMD scan phase, track occupied pages in a bitmap. When mTHP
>orders are enabled, we remove the restriction of max_ptes_none during the
>scan phase to avoid missing potential mTHP collapse candidates. Once we
>have scanned the full PMD range and updated the bitmap to track occupied
>pages, we use the bitmap to find the optimal mTHP size.
>
>Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
>and determine the best eligible order for the collapse. A stack structure
>is used instead of traditional recursion to manage the search, which
>avoids deep call chains on the limited kernel stack. The algorithm
>repeatedly splits the bitmap into smaller chunks to find the highest order
>mTHPs that satisfy the collapse criteria. We start by attempting the PMD
>order, then move on to consecutively lower orders (mTHP collapse). The
>stack maintains a pair of variables (offset, order), indicating the number
>of PTEs from the start of the PMD, and the order of the potential collapse
>candidate.
>
>The algorithm for consuming the bitmap works as such:
>    1) push (0, HPAGE_PMD_ORDER) onto the stack
>    2) pop the stack
>    3) check if the number of set bits in that (offset,order) pair
>       satisfies the max_ptes_none threshold for that order
>    4) if yes, attempt collapse
>    5) if no (or collapse fails), push two new stack items representing
>       the left and right halves of the current bitmap range, at the
>       next lower order
>    6) repeat at step (2) until stack is empty.
>
>Below is a diagram representing the algorithm and stack items:
>
>                            offset   mid_offset
>                            |        |
>                            |        |
>                            v        v
>          ____________________________________
>         |          PTE Page Table            |
>         --------------------------------------
>			    <-------><------->
>                             order-1  order-1
>
>mTHP collapses reject regions containing swapped out or shared pages.
>This is because adding new entries can lead to new none pages, and these
>may lead to constant promotion into a higher order mTHP. A similar
>issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
>introducing at least 2x the number of pages, and on a future scan will
>satisfy the promotion condition once again. This issue is prevented via
>the collapse_max_ptes_none() function which imposes the max_ptes_none
>restrictions above.
>
>We currently only support mTHP collapse for max_ptes_none values of 0
>and HPAGE_PMD_NR - 1, resulting in the following behavior:
>
>    - max_ptes_none=0: Never introduce new empty pages during collapse
>    - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>      available mTHP order
>
>Any other max_ptes_none value will emit a warning and skip mTHP collapse
>attempts. There should be no behavior change for PMD collapse.
>
>Once we determine which mTHP size fits best in that PMD range, a collapse
>is attempted. A minimum collapse order of 2 is used as this is the lowest
>order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>
>Currently madv_collapse is not supported and will only attempt PMD
>collapse.
>
>We can also remove the check for is_khugepaged inside the PMD scan as
>the collapse_max_ptes_none() function handles this logic now.
>
>Signed-off-by: Nico Pache <npache@redhat.com>

[...]
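
A minimal userspace sketch of the stack-based bitmap walk from the commit
message above, assuming a 512-entry PMD, a minimum order of 2, and a
per-order threshold of max_ptes_none >> (PMD order - order). The sk_*
names, the threshold scaling and the always-successful "collapse" are
assumptions for illustration, not the patch's code.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_PMD_ORDER  9
#define SK_PMD_NR     (1u << SK_PMD_ORDER)
#define SK_MIN_ORDER  2
#define SK_STACK_MAX  16	/* generous: the walk is depth-first */

struct sk_range {
	uint16_t offset;	/* in PTEs from the start of the PMD */
	uint8_t order;
};

static unsigned int sk_count_present(const bool *present,
				     unsigned int offset, unsigned int nr)
{
	unsigned int i, n = 0;

	for (i = 0; i < nr; i++)
		n += present[offset + i];
	return n;
}

/* Returns the number of PTEs covered by the simulated collapses. */
static unsigned int sk_scan(const bool present[SK_PMD_NR],
			    unsigned int max_ptes_none)
{
	struct sk_range stack[SK_STACK_MAX];
	unsigned int top = 0, collapsed = 0;

	if (max_ptes_none >= SK_PMD_NR)
		max_ptes_none = SK_PMD_NR - 1;

	stack[top++] = (struct sk_range){ 0, SK_PMD_ORDER };

	while (top) {
		struct sk_range r = stack[--top];
		unsigned int nr = 1u << r.order;
		/* Assumption: the threshold scales down with the order. */
		unsigned int max_none =
			max_ptes_none >> (SK_PMD_ORDER - r.order);

		if (sk_count_present(present, r.offset, nr) >= nr - max_none) {
			collapsed += nr;	/* pretend collapse succeeded */
			continue;
		}

		if (r.order > SK_MIN_ORDER && top + 2 <= SK_STACK_MAX) {
			/* Push the right half first so the left pops first. */
			stack[top++] = (struct sk_range){
				(uint16_t)(r.offset + nr / 2),
				(uint8_t)(r.order - 1) };
			stack[top++] = (struct sk_range){
				r.offset, (uint8_t)(r.order - 1) };
		}
	}
	return collapsed;
}
```

With max_ptes_none = 0 and only the first 16 PTEs present, the walk ends up
collapsing a single order-4 range at offset 0 and nothing else.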

>+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>+		int referenced, int unmapped, struct collapse_control *cc,
>+		unsigned long enabled_orders)
>+{
>+	unsigned int nr_occupied_ptes, nr_ptes;
>+	int max_ptes_none, collapsed = 0, stack_size = 0;
>+	unsigned long collapse_address;
>+	struct mthp_range range;
>+	u16 offset;
>+	u8 order;
>+
>+	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>+
>+	while (stack_size) {
>+		range = collapse_mthp_stack_pop(cc, &stack_size);
>+		order = range.order;
>+		offset = range.offset;
>+		nr_ptes = 1UL << order;
>+
>+		if (!test_bit(order, &enabled_orders))
>+			goto next_order;
>+
>+		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);

I am wondering whether there is a behavioral change for userfaultfd_armed(vma).

collapse_single_pmd()
    collapse_scan_pmd
        max_ptes_none = collapse_max_ptes_none(cc, vma)
        max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT                --- (1)
        mthp_collapse
            max_ptes_none = collapse_max_ptes_none(cc, NULL)     --- (2)
            collapse_huge_page(mm)
                hugepage_vma_revalidate(&vma)
                __collapse_huge_page_isolate(vma)
                    max_ptes_none = collapse_max_ptes_none(cc, vma)

Before mthp_collapse() was introduced, a userfaultfd_armed(vma) was skipped if
there was any pte_none_or_zero() in collapse_scan_pmd().

But now, max_ptes_none could be set to KHUGEPAGED_MAX_PTES_LIMIT at (1), so
that we can scan all the PTEs to build the bitmap. This means the scan of a
userfaultfd_armed(vma) could continue even with pte_none_or_zero().

Then in mthp_collapse(), collapse_max_ptes_none() at (2) ignores
userfaultfd_armed(vma), which means it will go on to collapse a
userfaultfd_armed(vma) even when there is pte_none_or_zero().

The good news is we will stop at __collapse_huge_page_isolate(), where we
get collapse_max_ptes_none() with vma. But we already did a lot of work.

Not sure if I missed something.

>+
>+		if (max_ptes_none < 0)
>+			return collapsed;
>+
>+		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
>+							       nr_ptes);
>+
>+		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>+			int ret;
>+
>+			collapse_address = address + offset * PAGE_SIZE;
>+			ret = collapse_huge_page(mm, collapse_address, referenced,
>+						 unmapped, cc, order);
>+			if (ret == SCAN_SUCCEED) {
>+				collapsed += nr_ptes;
>+				continue;
>+			}
>+		}
>+
>+next_order:
>+		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
>+			const u8 next_order = order - 1;
>+			const u16 mid_offset = offset + (nr_ptes / 2);
>+
>+			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
>+						 next_order);
>+			collapse_mthp_stack_push(cc, &stack_size, offset,
>+						 next_order);
>+		}
>+	}
>+	return collapsed;
>+}
>+
> static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 		struct vm_area_struct *vma, unsigned long start_addr,
> 		bool *lock_dropped, struct collapse_control *cc)
> {
>-	const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>+	int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> 	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> 	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>+	enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> 	pmd_t *pmd;
>-	pte_t *pte, *_pte;
>-	int none_or_zero = 0, shared = 0, referenced = 0;
>+	pte_t *pte, *_pte, pteval;
>+	int i;
>+	int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> 	enum scan_result result = SCAN_FAIL;
> 	struct page *page = NULL;
> 	struct folio *folio = NULL;
> 	unsigned long addr;
>+	unsigned long enabled_orders;
> 	spinlock_t *ptl;
> 	int node = NUMA_NO_NODE, unmapped = 0;
> 
>@@ -1429,8 +1579,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 		goto out;
> 	}
> 
>+	bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> 	memset(cc->node_load, 0, sizeof(cc->node_load));
> 	nodes_clear(cc->alloc_nmask);
>+
>+	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);

Would it be 0 at this point?

>+
>+	/*
>+	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
>+	 * scan all pages to populate the bitmap for mTHP collapse.
>+	 */
>+	if (enabled_orders != BIT(HPAGE_PMD_ORDER))
>+		max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>+
> 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> 	if (!pte) {
> 		cc->progress++;
>@@ -1438,11 +1599,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 		goto out;
> 	}
> 
>-	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>-	     _pte++, addr += PAGE_SIZE) {
>+	for (i = 0; i < HPAGE_PMD_NR; i++) {
>+		_pte = pte + i;
>+		addr = start_addr + i * PAGE_SIZE;
>+		pteval = ptep_get(_pte);
>+
> 		cc->progress++;
> 
>-		pte_t pteval = ptep_get(_pte);
> 		if (pte_none_or_zero(pteval)) {
> 			if (++none_or_zero > max_ptes_none) {
> 				result = SCAN_EXCEED_NONE_PTE;
>@@ -1522,6 +1685,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 			}
> 		}
> 
>+		/* Set bit for occupied pages */
>+		__set_bit(i, cc->mthp_bitmap);
> 		/*
> 		 * Record which node the original page is from and save this
> 		 * information to cc->node_load[].
>@@ -1580,10 +1745,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 	if (result == SCAN_SUCCEED) {
> 		/* collapse_huge_page expects the lock to be dropped before calling */
> 		mmap_read_unlock(mm);
>-		result = collapse_huge_page(mm, start_addr, referenced,
>-					    unmapped, cc, HPAGE_PMD_ORDER);
>+		nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
>+					      cc, enabled_orders);
> 		/* collapse_huge_page will return with the mmap_lock released */

collapse_huge_page will return with mmap_lock released, but mthp_collapse()
may not?

> 		*lock_dropped = true;
>+		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> 	}
> out:
> 	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
>-- 
>2.54.0

-- 
Wei Yang
Help you, Help me

^ permalink raw reply

* Re: [PATCHv2] uprobes: Use flexible array for xol_area bitmap
From: Masami Hiramatsu @ 2026-05-12 14:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rosen Penev, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Masami Hiramatsu,
	open list:PERFORMANCE EVENTS SUBSYSTEM, open list:UPROBES
In-Reply-To: <agMPMAnsCH2ZRhf5@redhat.com>

On Tue, 12 May 2026 13:29:52 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> On 05/11, Rosen Penev wrote:
> >
> >  struct xol_area {
> >  	wait_queue_head_t		wq;		/* if all slots are busy */
> > -	unsigned long			*bitmap;	/* 0 = free slot */
> >
> >  	struct page			*page;
> >  	/*
> > @@ -117,6 +116,7 @@ struct xol_area {
> >  	 * the vma go away, and we must handle that reasonably gracefully.
> >  	 */
> >  	unsigned long			vaddr;		/* Page(s) of instruction slots */
> > +	unsigned long			bitmap[];	/* 0 = free slot */
> >  };
> >
> >  static void uprobe_warn(struct task_struct *t, const char *msg)
> > @@ -1755,18 +1755,13 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
> >  	struct xol_area *area;
> >  	void *insns;
> >
> > -	area = kzalloc_obj(*area);
> > +	area = kzalloc_flex(*area, bitmap, BITS_TO_LONGS(UINSNS_PER_PAGE));
> 
> The downside is that kmalloc will use kmem_cache with ->object_size = PAGE_SIZE * 2,
> almost half of the allocated memory won't be used...

Hmm, is the bitmap so big? 

#define UINSNS_PER_PAGE			(PAGE_SIZE/UPROBE_XOL_SLOT_BYTES)

And even on arm64, 

#define UPROBE_XOL_SLOT_BYTES	AARCH64_INSN_SIZE

So if PAGE_SIZE is 4k, UINSNS_PER_PAGE is 1k, its BITS_TO_LONGS will
be 1024/64 = 16. So 128 bytes. So the object is allocated from
object_size = 256 ?
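
As a quick check of the arithmetic above, here is a minimal userspace
sketch (not kernel code). The SKETCH_* constants and function names are
assumptions made for illustration: 4 KiB pages, the 4-byte arm64 slot
size, and 64-bit longs. It only computes the size of the bitmap itself
and says nothing about which kmalloc cache the whole object lands in.

```c
#include <assert.h>

/* Assumed values for illustration only: 4 KiB pages, 4-byte XOL slots
 * (the arm64 case discussed above) and 64-bit longs. */
#define SKETCH_PAGE_SIZE	4096UL
#define SKETCH_SLOT_BYTES	4UL
#define SKETCH_BITS_PER_LONG	64UL

/* Round-up division, in the spirit of the kernel's BITS_TO_LONGS(). */
static unsigned long sketch_bits_to_longs(unsigned long nbits)
{
	return (nbits + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;
}

/* Bytes that a flexible bitmap[] member would occupy for one page. */
static unsigned long sketch_bitmap_bytes(void)
{
	unsigned long slots;

	/* The slot size is expected to divide the page size evenly. */
	assert(SKETCH_PAGE_SIZE % SKETCH_SLOT_BYTES == 0);
	slots = SKETCH_PAGE_SIZE / SKETCH_SLOT_BYTES;

	return sketch_bits_to_longs(slots) * (SKETCH_BITS_PER_LONG / 8);
}
```

With these assumed values, the slot count is 1024, the bitmap needs 16
longs, and the bitmap alone takes 128 bytes.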

Thank you,

> 
> But technically the patch looks correct so I won't argue.
> 
> Oleg.
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] tracing: Switch trace_recursion_record.c code over to use guard()
From: Yash Suthar @ 2026-05-12 14:41 UTC (permalink / raw)
  To: rostedt
  Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
	skhan, me
In-Reply-To: <20260502174741.39636-1-yashsuthar983@gmail.com>

Gentle ping.

Sincerely,
Yash Suthar

On Sat, May 2, 2026 at 11:17 PM Yash Suthar <yashsuthar983@gmail.com> wrote:
>
> Switch mutex_lock()/mutex_unlock() to guard().
> also drop the ret local variable and return directly.
>
> Signed-off-by: Yash Suthar <yashsuthar983@gmail.com>
> ---
>  kernel/trace/trace_recursion_record.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/trace/trace_recursion_record.c b/kernel/trace/trace_recursion_record.c
> index 784fe1fbb866..bac4bc844ccd 100644
> --- a/kernel/trace/trace_recursion_record.c
> +++ b/kernel/trace/trace_recursion_record.c
> @@ -180,9 +180,8 @@ static const struct seq_operations recursed_function_seq_ops = {
>
>  static int recursed_function_open(struct inode *inode, struct file *file)
>  {
> -       int ret = 0;
> +       guard(mutex)(&recursed_function_lock);
>
> -       mutex_lock(&recursed_function_lock);
>         /* If this file was opened for write, then erase contents */
>         if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
>                 /* disable updating records */
> @@ -194,10 +193,9 @@ static int recursed_function_open(struct inode *inode, struct file *file)
>                 atomic_set(&nr_records, 0);
>         }
>         if (file->f_mode & FMODE_READ)
> -               ret = seq_open(file, &recursed_function_seq_ops);
> -       mutex_unlock(&recursed_function_lock);
> +               return seq_open(file, &recursed_function_seq_ops);
>
> -       return ret;
> +       return 0;
>  }
>
>  static ssize_t recursed_function_write(struct file *file,
> --
> 2.43.0
>
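
For readers unfamiliar with the pattern, the following is a minimal
userspace sketch of scope-based unlocking, which is what lets the
patch above return directly without an explicit unlock. It relies on
the GCC/Clang cleanup variable attribute. Every sketch_* name and the
SKETCH_GUARD macro are hypothetical and only stand in for the kernel's
guard(); this is not the kernel implementation.

```c
#include <assert.h>

/* Hypothetical stand-in for a mutex: only tracks whether it is held. */
struct sketch_lock {
	int held;
};

static void sketch_lock_acquire(struct sketch_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void sketch_lock_release(struct sketch_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Cleanup handler: receives a pointer to the guarded variable. */
static void sketch_guard_cleanup(struct sketch_lock **lp)
{
	sketch_lock_release(*lp);
}

/* Take the lock now, release it whenever the enclosing scope is left. */
#define SKETCH_GUARD(lockp)						\
	struct sketch_lock *sketch_guard_var				\
		__attribute__((cleanup(sketch_guard_cleanup))) = (lockp); \
	sketch_lock_acquire(sketch_guard_var)

static struct sketch_lock sketch_open_lock;

/* Mirrors the shape of the patched function: early return, no unlock. */
static int sketch_open(int want_read)
{
	SKETCH_GUARD(&sketch_open_lock);

	if (want_read)
		return 42;	/* lock is dropped by the cleanup handler */

	return 0;		/* same here */
}
```

Both return paths leave sketch_open_lock released, which is why the
ret local variable and the explicit unlock are no longer needed.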

^ permalink raw reply

* [RFC PATCH v2 18/28] mm/damon: trace probe_hits
From: SeongJae Park @ 2026-05-12 14:36 UTC (permalink / raw)
  Cc: SeongJae Park, Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt, damon, linux-kernel, linux-mm, linux-trace-kernel
In-Reply-To: <20260512143645.113201-1-sj@kernel.org>

Introduce a new tracepoint for exposing the per-region per-probe
positive sample count via tracefs.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 include/trace/events/damon.h | 36 ++++++++++++++++++++++++++++++++++++
 mm/damon/core.c              |  7 +++++++
 2 files changed, 43 insertions(+)

diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
index 7e25f4469b81b..d7b94c7640217 100644
--- a/include/trace/events/damon.h
+++ b/include/trace/events/damon.h
@@ -130,6 +130,42 @@ TRACE_EVENT(damon_monitor_intervals_tune,
 	TP_printk("sample_us=%lu", __entry->sample_us)
 );
 
+TRACE_EVENT(damon_aggregated_v2,
+
+	TP_PROTO(unsigned int target_id, struct damon_region *r,
+		unsigned int nr_regions, unsigned int nr_probes),
+
+	TP_ARGS(target_id, r, nr_regions, nr_probes),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, target_id)
+		__field(unsigned long, start)
+		__field(unsigned long, end)
+		__field(unsigned int, nr_regions)
+		__field(unsigned int, nr_accesses)
+		__field(unsigned int, age)
+		__dynamic_array(unsigned char, probe_hits, nr_probes)
+	),
+
+	TP_fast_assign(
+		__entry->target_id = target_id;
+		__entry->start = r->ar.start;
+		__entry->end = r->ar.end;
+		__entry->nr_regions = nr_regions;
+		__entry->nr_accesses = r->nr_accesses;
+		__entry->age = r->age;
+		memcpy(__get_dynamic_array(probe_hits), r->probe_hits,
+			sizeof(*r->probe_hits) * nr_probes);
+	),
+
+	TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u %u probe_hits=%s",
+			__entry->target_id, __entry->nr_regions,
+			__entry->start, __entry->end,
+			__entry->nr_accesses, __entry->age,
+			__print_hex(__get_dynamic_array(probe_hits),
+				__get_dynamic_array_len(probe_hits)))
+);
+
 TRACE_EVENT(damon_aggregated,
 
 	TP_PROTO(unsigned int target_id, struct damon_region *r,
diff --git a/mm/damon/core.c b/mm/damon/core.c
index fe6c789f2cecb..14b15c9876516 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -1905,6 +1905,11 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 {
 	struct damon_target *t;
 	unsigned int ti = 0;	/* target's index */
+	unsigned int nr_probes = 0;
+	struct damon_probe *probe;
+
+	damon_for_each_probe(probe, c)
+		nr_probes++;
 
 	damon_for_each_target(t, c) {
 		struct damon_region *r;
@@ -1913,6 +1918,8 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 			int i;
 
 			trace_damon_aggregated(ti, r, damon_nr_regions(t));
+			trace_damon_aggregated_v2(ti, r, damon_nr_regions(t),
+					nr_probes);
 			damon_warn_fix_nr_accesses_corruption(r);
 			r->last_nr_accesses = r->nr_accesses;
 			r->nr_accesses = 0;
-- 
2.47.3

^ permalink raw reply related

* [RFC PATCH v2 00/28] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-05-12 14:36 UTC (permalink / raw)
  Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel

TL; DR
======

Extend DAMON for monitoring general data attributes other than accesses.
The short term motivation is lightweight page type (e.g., belonging
cgroup) aware monitoring.  In the long term, this will help extending
DAMON for multiple access event capture primitives (e.g., page faults
and PMU) and eventually pivoting DAMON to a "Data Attributes Monitoring
and Operations eNgine".

Background: High Cost of Page Level Properties Monitoring
=========================================================

DAMON was initially introduced as a Data Access MONitor.  It has been
extended for not only access monitoring but also data access-aware
system operations (DAMOS).  But still the monitoring part is only for
data accesses.

Data access patterns are good information, but some users need more
holistic views.  Particularly, users want to show the access pattern
information together with the types of the memory.  For example, users
who work for making huge pages efficiently want to know how much of
DAMON-found hot/cold regions are backed by huge pages.  Users who run
multiple workloads with different cgroups want to know how much of
DAMON-found hot/cold regions belong to specific cgroups.

For the user demand, we developed a DAMOS extension for page level
properties based monitoring [1], which landed in 6.14.  Using the
feature, users can inform the page level data properties that they are
interested in, in a flexible format that uses DAMOS filters.  Then,
DAMON applies the filters to each folio of the entire DAMON region and
lets users know how many bytes of memory in each DAMON region passed the
given filters.

This gives page level detailed and deterministic information to users.
But, because the operation is done at page level, the overhead is
proportional to the memory size.  It was useful for test or debugging
purposes on a small number of machines.  But it was obviously too heavy
to be enabled always on all machines running the real user workloads.
For real world workloads, it was recommended to use the feature with
user-space controlled sampling approaches.  For example, users could do
the page level monitoring only once per hour, on randomly selected one
percent of machines of their fleet.  If the runtime is long enough and
the fleet is big enough, it should provide statistically meaningful
data.

But users are too busy to implement such controls on their own.

Data Attributes Monitoring
==========================

Extend DAMON to monitor not only data accesses, but also general data
attributes.  Do the extension while keeping the main promise of DAMON,
the bounded and best-effort minimum overhead.

Allow users to specify what data attributes in addition to the data
access they want to monitor.  Users can install one 'data probe' per
data attribute of their interest for this purpose.  The 'data probe'
should be able to be applied to any memory, and determine if the given
memory has the appropriate data attribute.  E.g., if memory of physical
address 42 belongs to cgroup A.  Each 'data probe' is configured with
filters that are very similar to the DAMOS filters.

When DAMON checks if each sampling address memory of each region is
accessed since the last check, it applies data probes if registered.
In the same way that it accounts the number of access check-positive
samples (nr_accesses), it accounts the number of positive samples of
each data probe in another per-region counters array, namely
'probe_hits'.  When DAMON resets nr_accesses every aggregation
interval, it resets 'probe_hits' together.

Users can read 'probe_hits' just before the values are reset.  In this
way, users can know how many hot/cold memory regions have data
attributes of their interest.  E.g., 30 percent of this system's hot
memory belongs to cgroup A, and 80 percent of the cgroup A-belonging
hot memory is backed by huge pages.
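
The accounting described in this section can be sketched in userspace C
as below.  This is only an illustration under stated assumptions, not
the code in this series: struct sketch_region and the sketch_*
functions are hypothetical names, the counter width follows the
unsigned char array used by the tracepoint patch, and the limit of four
probes follows the current limitation mentioned later in this letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_PROBES 4	/* assumed from the current probes limit */

/* Hypothetical per-region counters, loosely modeled on damon_region. */
struct sketch_region {
	unsigned int nr_accesses;
	unsigned char probe_hits[SKETCH_MAX_PROBES];
};

/*
 * Account one sample: the access check result bumps nr_accesses, and
 * each probe that matched the sampled memory bumps its own counter.
 */
static void sketch_account_sample(struct sketch_region *r, bool accessed,
				  const bool *probe_positive,
				  unsigned int nr_probes)
{
	unsigned int i;

	assert(nr_probes <= SKETCH_MAX_PROBES);

	if (accessed)
		r->nr_accesses++;
	for (i = 0; i < nr_probes; i++) {
		if (probe_positive[i])
			r->probe_hits[i]++;
	}
}

/* At the end of an aggregation interval, both kinds of counters reset. */
static void sketch_reset_aggregated(struct sketch_region *r)
{
	r->nr_accesses = 0;
	memset(r->probe_hits, 0, sizeof(r->probe_hits));
}
```

A reader of the counters would look at nr_accesses and probe_hits[]
together just before sketch_reset_aggregated() runs, e.g., to estimate
what fraction of a hot region's samples matched a given probe.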

Patches Sequence
================

First eight patches implement the core feature, interface and the
working support.  Patch 1 introduces data probe data structure, namely
damon_probe.  Patch 2 extends damon_ctx for installing data probes.
Patch 3 introduces another data structure for filters of each data
probe, namely damon_filter.  Patch 4 updates damon_ctx commit function
to handle the probes.  Patch 5 extends damon_region for the per-region
per-probe positive samples counter, namely probe_hits.  Patch 6 extends
damon_operations for applying probes on the underlying DAMON operations
implementation.  Patch 7 updates kdamond_fn() to invoke the probes
applying callback.  Patch 8 finally implements the probes support on
paddr ops.

Ten changes for user interface (patches 9-18) come next.  Patches 9-13
implement sysfs directories and files for setting data probes, namely
probes directory, probe directory, filters directory, filter directory
and filter directory internal files, respectively.  Patch 14 connects
the user inputs that are made via the sysfs files to DAMON core.
Following three patches (patches 15-17) implement sysfs directories and
files for showing the probe_hits to users, namely probes directory,
probe directory and hits files, respectively.  Patch 18 introduces a new
tracepoint for showing the probe_hits via tracefs.

Patch 19 adds a selftest for the sysfs files.

Patches 20 and 21 document the design and usage of the new feature,
respectively.

Seven additional patches (patches 22-28) for monitoring belonging memory
cgroup follow.  Depending on the feedback, this part might be separated
to another series in future.  Patch 22 defines the DAMON filter type for
the new attribute, namely DAMON_FILTER_TYPE_MEMCG.  Patch 23 adds the
support on paddr ops.  Patch 24 updates the sysfs interface for setup of
the target memcg.  Patch 25 moves code for easy reuse of the filter
target memcg setup.  Patch 26 connects the user input to the core layer.
Finally, patches 27 and 28 update the design and usage documents for the
memcg attribute monitoring support.

Discussions
===========

This allows the page properties monitoring with overhead that is low
enough to be enabled always on real world workloads.  Because the
sampling time for access check is reused for data attributes check, the
upper-bounded and best-effort minimum overhead of DAMON is kept.
Because the sampling memory for access check is reused for data
attributes check, additional overhead is minimal.

Still DAMOS-based page level properties monitoring should be useful,
because it provides deterministic page level information.  When in
doubt of the sampling based information, running DAMOS-based one
together and comparing the results would be useful, for debugging and
tuning.

Plan for Dropping RFC tag
=========================

I'm considering renaming the tracepoint for exposing probe_hits
(damon_aggregated_v2).

Making changes for feedback from myself, humans and Sashiko should be
the major remaining work.

I'm currently hoping to drop the RFC tag by 7.2-rc1.

Future Works: Mid Term
======================

This version of the implementation limits the maximum number of data
probes to four.  I will try to find a way to remove the limit in future.
I personally think it should be enough for common use cases, though, and
am therefore not giving it high priority at the moment.

Future Works: Long Term
=======================

There are user requests for extending DAMON with detailed access
information, for example, per-CPUs/threads/read/writes monitoring.  For
that, I was working [2] on extending DAMON to use page fault events as
another access check primitive, and making the infrastructure flexible
for future use of yet another access check primitive.  Actually there is
another ongoing work [3] for extending DAMON with PMU events.  The
motivation of the work is reducing the overhead, though.

In my work [2], I was introducing a new interface for access sampling
primitives control.  Now I think this data probe interface can be used
for that, too.  That is, data access becomes just one type of data
attribute.  Also, pg_idle-confirmed access, page fault-confirmed access,
and PMU event-confirmed access will be different types of data
attributes.

The regions adjustment mechanism is currently working based on the
access information.  That's because DAMON is designed for data access
monitoring.  That is, data access information is the primary interest,
and therefore DAMON adjusts regions in a way that can best-present the
information.

Once data access becomes just one of data attributes, there is no reason
to treat data access as that special.  There might be some users not
interested in access at all but want to know the location of memory of
specific type.  Data probes interface will allow doing that.  Further,
we could extend the interface to let users set any data attribute as the
'primary' attribute.  Then, DAMON will split and merge regions in a way
that can best-present the 'primary' attributes.

DAMOS will also be extended, to specify targets based on not only the
data access pattern, but all user-registered data attributes.  From this
stage, we may be able to call DAMON a "Data Attributes Monitoring and
Operations eNgine".

[1] https://lore.kernel.org/20250106193401.109161-1-sj@kernel.org
[2] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org/
[3] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com

Changes from RFC
- rfc: https://lore.kernel.org/all/20260426205222.93895-1-sj@kernel.org/
- Support memcg DAMON filter.
- Use per-probe probe_hits sysfs file.
- Use dynamic_array for probe_hits tracing.
- Fix filter matching field.
- Fix folio leaking in damon_pa_filter_pass().
- Move nr_regions of damon_aggregated_v2 tracepoint after end.
- Rename DAMON_TEST_TYPE_ANON to DAMON_FILTER_TYPE_ANON.

SeongJae Park (28):
  mm/damon/core: introduce struct damon_probe
  mm/damon/core: embed damon_probe objects in damon_ctx
  mm/damon/core: introduce damon_filter
  mm/damon/core: commit probes
  mm/damon/core: introduce damon_region->probe_hits
  mm/damon/core: introduce damon_ops->apply_probes
  mm/damon/core: do data attributes monitoring
  mm/damon/paddr: support data attributes monitoring
  mm/damon/sysfs: implement probes dir
  mm/damon/sysfs: implement probe dir
  mm/damon/sysfs: implement filters directory
  mm/damon/sysfs: implement filter dir
  mm/damon/sysfs: implement filter dir files
  mm/damon/sysfs: setup probes on DAMON core API parameters
  mm/damon/sysfs-schemes: implement tried_regions/<r>/probes/
  mm/damon/sysfs-schemes: implement probe dir
  mm/damon/sysfs-schemes: implement probe/hits file
  mm/damon: trace probe_hits
  selftests/damon/sysfs.sh: test probes dir
  Docs/mm/damon/design: document data attributes monitoring
  Docs/admin-guide/mm/damon/usage: document data attributes monitoring
  mm/damon/core: introduce DAMON_FILTER_TYPE_MEMCG
  mm/damon/paddr: support DAMON_FILTER_TYPE_MEMCG
  mm/damon/sysfs: add filters/<F>/path file
  mm/damon/sysfs-schemes: move memcg_path_to_id() to sysfs-common
  mm/damon/sysfs: setup damon_filter->memcg_id from path
  Docs/mm/damon/design: update for memcg damon filter
  Docs/admin-guide/mm/damon/usage: update for memcg damon filter

 Documentation/admin-guide/mm/damon/usage.rst |  48 +-
 Documentation/mm/damon/design.rst            |  39 ++
 include/linux/damon.h                        |  67 +++
 include/trace/events/damon.h                 |  36 ++
 mm/damon/core.c                              | 195 +++++++
 mm/damon/paddr.c                             |  76 +++
 mm/damon/sysfs-common.c                      |  41 ++
 mm/damon/sysfs-common.h                      |   2 +
 mm/damon/sysfs-schemes.c                     | 222 ++++++--
 mm/damon/sysfs.c                             | 557 +++++++++++++++++++
 tools/testing/selftests/damon/sysfs.sh       |  48 ++
 11 files changed, 1280 insertions(+), 51 deletions(-)

base-commit: 610724cfd93c1c413faf9e5bb63926fe54849887
-- 
2.47.3

^ permalink raw reply

* [PATCH] tracing: Fix unload_page for simple_ring_buffer init rollback
From: Vincent Donnefort @ 2026-05-12 14:16 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
  Cc: kernel-team, linux-kernel, Vincent Donnefort

The unload_page callback expects the return value of load_page() as its
argument: ret = load_page(va); unload(ret). Fix the rollback code in
simple_ring_buffer_init_mm() where the descriptor's VA is used instead
of the loaded page address.

Fixes: 635923081c79 ("tracing: load/unload page callbacks for simple_ring_buffer")
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/kernel/trace/simple_ring_buffer.c b/kernel/trace/simple_ring_buffer.c
index 02af2297ae5a..38cf9abe0be8 100644
--- a/kernel/trace/simple_ring_buffer.c
+++ b/kernel/trace/simple_ring_buffer.c
@@ -431,7 +431,7 @@ int simple_ring_buffer_init_mm(struct simple_rb_per_cpu *cpu_buffer,
 
 	if (ret) {
 		for (i--; i >= 0; i--)
-			unload_page((void *)desc->page_va[i]);
+			unload_page(bpages[i].page);
 		unload_page(cpu_buffer->meta);
 
 		return ret;

base-commit: 5d6919055dec134de3c40167a490f33c74c12581
-- 
2.54.0.563.g4f69b47b94-goog
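
The contract the fix above restores can be illustrated with a small
userspace sketch: the rollback loop has to hand the unload callback the
value that the load callback returned, not the address that was passed
in.  All sketch_* names are hypothetical and the callbacks are toy
stand-ins; this is not the simple_ring_buffer code.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_PAGES 4

static int sketch_backing[SKETCH_NR_PAGES];
static int sketch_loaded_count;

/* Toy load callback: returns a handle that differs from the input. */
static void *sketch_load_page(unsigned long va)
{
	if (va >= SKETCH_NR_PAGES)
		return NULL;	/* simulate a failed load */
	sketch_loaded_count++;
	return &sketch_backing[va];
}

/* Toy unload callback: only accepts handles made by sketch_load_page(). */
static void sketch_unload_page(void *handle)
{
	int i, known = 0;

	for (i = 0; i < SKETCH_NR_PAGES; i++) {
		if (handle == (void *)&sketch_backing[i])
			known = 1;
	}
	assert(known);
	sketch_loaded_count--;
}

/* Load nr pages; on failure, unload what was loaded, newest first. */
static int sketch_init(const unsigned long *vas, int nr, void **handles)
{
	int i;

	for (i = 0; i < nr; i++) {
		handles[i] = sketch_load_page(vas[i]);
		if (!handles[i])
			break;
	}
	if (i < nr) {
		for (i--; i >= 0; i--)
			sketch_unload_page(handles[i]);	/* not vas[i] */
		return -1;
	}
	return 0;
}
```

Passing vas[i] to the unload callback instead would trip the assertion
in this sketch, which is the userspace analogue of the bug being fixed.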


^ permalink raw reply related

* [PATCH 9/9] rv: Mandate deallocation for per-obj monitors
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
	linux-trace-kernel
  Cc: Nam Cao, Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>

The per-object monitors use hash tables and dynamic allocation of the
monitor storage. Functions to clean up a monitor that is no longer
needed are provided, but nothing ensures the monitor actually uses them.

Remove the inline specifier on the deallocation function to let the
compiler warn in case it isn't referenced. If the monitor really doesn't
need one (for instance because instances will never cease to exist
before disabling the monitor), the da_skip_deallocation() helper macro
can be used to silence the warning.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h                      | 14 +++++++++++++-
 kernel/trace/rv/monitors/deadline/deadline.h |  5 ++++-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 402d3b935c08..378d23ab7dfb 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -489,8 +489,11 @@ static inline monitor_target da_get_target_by_id(da_id_type id)
  * locks.
  * This function includes an RCU read-side critical section to synchronise
  * against da_monitor_destroy().
+ * NOTE: inline is omitted on purpose to let the compiler warn if this function
+ * is never referenced. For monitors that don't require a deallocation hook,
+ * da_skip_deallocation() can be used.
  */
-static inline void da_destroy_storage(da_id_type id)
+static void da_destroy_storage(da_id_type id)
 {
 	struct da_monitor_storage *mon_storage;
 
@@ -504,6 +507,15 @@ static inline void da_destroy_storage(da_id_type id)
 	kfree_rcu(mon_storage, rcu);
 }
 
+/*
+ * da_skip_deallocation - explicitly mark a deallocation function as not required
+ *
+ * Only use when you are absolutely sure the monitor doesn't require a
+ * deallocation hook (i.e. it's not possible for an object to finish existing
+ * when the monitor is still running).
+ */
+#define da_skip_deallocation(hook) ((void)hook)
+
 static void da_monitor_reset_all(void)
 {
 	struct da_monitor_storage *mon_storage;
diff --git a/kernel/trace/rv/monitors/deadline/deadline.h b/kernel/trace/rv/monitors/deadline/deadline.h
index 78fca873d61e..c39fd79148c2 100644
--- a/kernel/trace/rv/monitors/deadline/deadline.h
+++ b/kernel/trace/rv/monitors/deadline/deadline.h
@@ -194,7 +194,10 @@ static void __maybe_unused handle_newtask(void *data, struct task_struct *task,
 		da_create_storage(EXPAND_ID_TASK(task), NULL);
 }
 
-static void __maybe_unused handle_exit(void *data, struct task_struct *p, bool group_dead)
+/*
+ * Deallocation hook, use da_skip_deallocation() when not necessary
+ */
+static void handle_exit(void *data, struct task_struct *p, bool group_dead)
 {
 	if (p->policy == SCHED_DEADLINE)
 		da_destroy_storage(get_entity_id(&p->dl, DL_TASK, DL_TASK));
-- 
2.54.0
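
The mechanism can be tried outside the kernel with the sketch below.
The sketch_* names are hypothetical stand-ins. The expected behaviour
is an assumption about typical GCC/Clang builds with -Wall: a static
non-inline function that is never referenced triggers an unused
function warning, and a (void) cast of its name counts as a reference
without calling it.

```c
#include <assert.h>

static int sketch_destroy_calls;

/*
 * Deliberately not inline: compilers typically warn about a static
 * function that is defined but never referenced.
 */
static void sketch_destroy_storage(int id)
{
	assert(id >= 0);
	sketch_destroy_calls++;
}

/* Same shape as da_skip_deallocation(): reference the hook, do nothing. */
#define sketch_skip_deallocation(hook) ((void)hook)

/* A monitor that is sure it never needs to free per-object storage. */
static void sketch_monitor_without_dealloc(void)
{
	sketch_skip_deallocation(sketch_destroy_storage);
}
```

Removing the sketch_skip_deallocation() line from this file is what
would bring the warning back, which is the nudge the patch relies on.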


^ permalink raw reply related

* [PATCH 8/9] rv: Add automatic cleanup handlers for per-task HA monitors
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, Nam Cao,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>

Hybrid automata monitors may start timers, depending on the model. These
may remain active on an exiting task and cause false positives or even
access freed memory.

Add an enable/disable hook in the HA code, currently only populated by
the per-task handler for registration and deregistration.
The handler hooks to the sched_process_exit event and ensures the timer
is stopped for every exiting task. It is enabled automatically but may
be disabled, for instance if the monitor uses the event for another
purpose (but should still manually ensure timers are stopped).

Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/ha_monitor.h | 44 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index 11ae85bad492..1bdf866e9c63 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -28,6 +28,7 @@ static inline void ha_monitor_init_env(struct da_monitor *da_mon);
 static inline void ha_monitor_reset_env(struct da_monitor *da_mon);
 static inline void ha_setup_timer(struct ha_monitor *ha_mon);
 static inline bool ha_cancel_timer(struct ha_monitor *ha_mon);
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon);
 static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
 					 enum states curr_state,
 					 enum events event,
@@ -38,6 +39,26 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
 #define da_monitor_reset_hook ha_monitor_reset_env
 #define da_monitor_sync_hook() synchronize_rcu()
 
+#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
+/*
+ * Automatic cleanup handlers for per-task HA monitors, only skip if you know
+ * what you are doing (e.g. you want to implement cleanup manually in another
+ * handler doing more things).
+ */
+static void ha_handle_sched_process_exit(void *data, struct task_struct *p,
+					 bool group_dead);
+
+#define ha_monitor_enable_hook()                                             \
+	rv_attach_trace_probe(__stringify(MONITOR_NAME), sched_process_exit, \
+			      ha_handle_sched_process_exit)
+#define ha_monitor_disable_hook()                                            \
+	rv_detach_trace_probe(__stringify(MONITOR_NAME), sched_process_exit, \
+			      ha_handle_sched_process_exit)
+#else
+#define ha_monitor_enable_hook()
+#define ha_monitor_disable_hook()
+#endif
+
 #include <rv/da_monitor.h>
 #include <linux/seq_buf.h>
 
@@ -124,12 +145,14 @@ static int ha_monitor_init(void)
 
 	ha_mon_initializing = true;
 	ret = da_monitor_init();
+	ha_monitor_enable_hook();
 	ha_mon_initializing = false;
 	return ret;
 }
 
 static void ha_monitor_destroy(void)
 {
+	ha_monitor_disable_hook();
 	da_monitor_destroy();
 }
 
@@ -230,6 +253,18 @@ static inline void ha_trace_error_env(struct ha_monitor *ha_mon,
 {
 	CONCATENATE(trace_error_env_, MONITOR_NAME)(id, curr_state, event, env);
 }
+
+#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
+static void ha_handle_sched_process_exit(void *data, struct task_struct *p,
+					 bool group_dead)
+{
+	struct da_monitor *da_mon = da_get_monitor(p);
+
+	if (likely(!ha_monitor_uninitialized(da_mon)))
+		ha_cancel_timer_sync(to_ha_monitor(da_mon));
+}
+#endif
+
 #endif /* RV_MON_TYPE */
 
 /*
@@ -455,6 +490,10 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
 {
 	return timer_delete(&ha_mon->timer);
 }
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
+{
+	timer_delete_sync(&ha_mon->timer);
+}
 #elif HA_TIMER_TYPE == HA_TIMER_HRTIMER
 /*
  * Helper functions to handle the monitor timer.
@@ -506,6 +545,10 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
 {
 	return hrtimer_try_to_cancel(&ha_mon->hrtimer) == 1;
 }
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
+{
+	hrtimer_cancel(&ha_mon->hrtimer);
+}
 #else /* HA_TIMER_NONE */
 /*
  * Start function is intentionally not defined, monitors using timers must
@@ -516,6 +559,7 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
 {
 	return false;
 }
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon) { }
 #endif
 
 #endif
-- 
2.54.0


^ permalink raw reply related

* [PATCH 7/9] rv: Do not rely on clean monitor when initialising HA
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
	Nam Cao, linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>

Hybrid Automata monitors hook into the DA implementation when doing
da_monitor_reset(). This function is called both on initialisation and
teardown. HA monitors try to cancel a timer only when it's initialised,
relying on the da_mon->monitoring flag. This flag could however be
corrupted during initialisation. This happens for instance on per-task
monitors that share the same storage with different types of monitors
like LTL, or in case of races during a previous teardown.

Stop relying on the monitoring flag during initialisation and assume it
can have any value, so skip timer cancellation in any case when a local
flag is set. New monitors (e.g. new tasks) are always zero-initialised
so they are safe.

Reported-by: Wen Yang <wen.yang@linux.dev>
Closes: https://lore.kernel.org/lkml/d02c656aada7d071f083460a5c9a454363669b61.1778522945.git.wen.yang@linux.dev
Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/ha_monitor.h                       | 31 ++++++++++++++++++-
 kernel/trace/rv/monitors/nomiss/nomiss.c      |  4 +--
 kernel/trace/rv/monitors/opid/opid.c          |  4 +--
 kernel/trace/rv/monitors/stall/stall.c        |  4 +--
 .../rvgen/rvgen/templates/dot2k/main.c        |  4 +--
 5 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index 47ff1a41febe..11ae85bad492 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -116,6 +116,35 @@ static enum hrtimer_restart ha_monitor_timer_callback(struct hrtimer *hrtimer);
 #define ha_get_ns() 0
 #endif /* HA_CLK_NS */
 
+static bool ha_mon_initializing;
+
+static int ha_monitor_init(void)
+{
+	int ret;
+
+	ha_mon_initializing = true;
+	ret = da_monitor_init();
+	ha_mon_initializing = false;
+	return ret;
+}
+
+static void ha_monitor_destroy(void)
+{
+	da_monitor_destroy();
+}
+
+/*
+ * ha_monitor_uninitialized - are fields like the timer initialized?
+ *
+ * On a clean monitor, we can assume an active monitor (monitoring) is
+ * initialized, however the monitoring field cannot be trusted during
+ * initialization.
+ */
+static inline bool ha_monitor_uninitialized(struct da_monitor *da_mon)
+{
+	return ha_mon_initializing || !da_monitoring(da_mon);
+}
+
 /* Should be supplied by the monitor */
 static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs env, u64 time_ns);
 static bool ha_verify_constraint(struct ha_monitor *ha_mon,
@@ -160,7 +189,7 @@ static inline void ha_monitor_reset_env(struct da_monitor *da_mon)
 	struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
 
 	/* Initialisation resets the monitor before initialising the timer */
-	if (likely(da_monitoring(da_mon)))
+	if (likely(!ha_monitor_uninitialized(da_mon)))
 		ha_cancel_timer(ha_mon);
 }
 
diff --git a/kernel/trace/rv/monitors/nomiss/nomiss.c b/kernel/trace/rv/monitors/nomiss/nomiss.c
index 31f90f3638d8..8ead8783c29f 100644
--- a/kernel/trace/rv/monitors/nomiss/nomiss.c
+++ b/kernel/trace/rv/monitors/nomiss/nomiss.c
@@ -227,7 +227,7 @@ static int enable_nomiss(void)
 {
 	int retval;
 
-	retval = da_monitor_init();
+	retval = ha_monitor_init();
 	if (retval)
 		return retval;
 
@@ -263,7 +263,7 @@ static void disable_nomiss(void)
 	rv_detach_trace_probe("nomiss", sched_switch, handle_sched_switch);
 	rv_detach_trace_probe("nomiss", sched_wakeup, handle_sched_wakeup);
 
-	da_monitor_destroy();
+	ha_monitor_destroy();
 }
 
 static struct rv_monitor rv_this = {
diff --git a/kernel/trace/rv/monitors/opid/opid.c b/kernel/trace/rv/monitors/opid/opid.c
index 4594c7c46601..2922318c6112 100644
--- a/kernel/trace/rv/monitors/opid/opid.c
+++ b/kernel/trace/rv/monitors/opid/opid.c
@@ -73,7 +73,7 @@ static int enable_opid(void)
 {
 	int retval;
 
-	retval = da_monitor_init();
+	retval = ha_monitor_init();
 	if (retval)
 		return retval;
 
@@ -90,7 +90,7 @@ static void disable_opid(void)
 	rv_detach_trace_probe("opid", sched_set_need_resched_tp, handle_sched_need_resched);
 	rv_detach_trace_probe("opid", sched_waking, handle_sched_waking);
 
-	da_monitor_destroy();
+	ha_monitor_destroy();
 }
 
 /*
diff --git a/kernel/trace/rv/monitors/stall/stall.c b/kernel/trace/rv/monitors/stall/stall.c
index 9ccfda6b0e73..3c38fb1a0159 100644
--- a/kernel/trace/rv/monitors/stall/stall.c
+++ b/kernel/trace/rv/monitors/stall/stall.c
@@ -103,7 +103,7 @@ static int enable_stall(void)
 {
 	int retval;
 
-	retval = da_monitor_init();
+	retval = ha_monitor_init();
 	if (retval)
 		return retval;
 
@@ -120,7 +120,7 @@ static void disable_stall(void)
 	rv_detach_trace_probe("stall", sched_switch, handle_sched_switch);
 	rv_detach_trace_probe("stall", sched_wakeup, handle_sched_wakeup);
 
-	da_monitor_destroy();
+	ha_monitor_destroy();
 }
 
 static struct rv_monitor rv_this = {
diff --git a/tools/verification/rvgen/rvgen/templates/dot2k/main.c b/tools/verification/rvgen/rvgen/templates/dot2k/main.c
index bf0999f6657a..889446760e3c 100644
--- a/tools/verification/rvgen/rvgen/templates/dot2k/main.c
+++ b/tools/verification/rvgen/rvgen/templates/dot2k/main.c
@@ -35,7 +35,7 @@ static int enable_%%MODEL_NAME%%(void)
 {
 	int retval;
 
-	retval = da_monitor_init();
+	retval = %%MONITOR_CLASS%%_monitor_init();
 	if (retval)
 		return retval;
 
@@ -50,7 +50,7 @@ static void disable_%%MODEL_NAME%%(void)
 
 %%TRACEPOINT_DETACH%%
 
-	da_monitor_destroy();
+	%%MONITOR_CLASS%%_monitor_destroy();
 }
 
 /*
-- 
2.54.0


^ permalink raw reply related

* [PATCH 6/9] rv: Ensure synchronous cleanup for HA monitors
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, Nam Cao,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>

HA monitors may start timers. All cleanup functions currently stop the
timers asynchronously to avoid sleeping in the wrong context, but
nothing ensures that running callbacks have terminated by the end of
cleanup.

Run the entire HA timer callback in an RCU read-side critical section.
This way, cleanup can simply synchronize_rcu() with any pending timer,
and any cleanup using kfree_rcu() is guaranteed to run after the
callbacks have terminated. Additionally, make sure that a callback
running late does not run any code if the monitor is marked as disabled.

Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 23 +++++++++++++++++++----
 include/rv/ha_monitor.h | 18 ++++++++++++++++--
 2 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index a4a13b62d1a4..402d3b935c08 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -57,6 +57,15 @@ static struct rv_monitor rv_this;
 #define da_monitor_reset_hook(da_mon)
 #endif
 
+/*
+ * Hook to allow the implementation of hybrid automata: define it with a
+ * function that waits for the termination of all monitors' background
+ * activities (e.g. all timers). This hook can sleep.
+ */
+#ifndef da_monitor_sync_hook
+#define da_monitor_sync_hook()
+#endif
+
 /*
  * Type for the target id, default to int but can be overridden.
  * A long type can work as hash table key (PER_OBJ) but will be downgraded to
@@ -179,6 +188,7 @@ static inline int da_monitor_init(void)
 static inline void da_monitor_destroy(void)
 {
 	da_monitor_reset_all();
+	da_monitor_sync_hook();
 }
 
 #ifndef da_implicit_guard
@@ -232,6 +242,7 @@ static inline int da_monitor_init(void)
 static inline void da_monitor_destroy(void)
 {
 	da_monitor_reset_all();
+	da_monitor_sync_hook();
 }
 
 #ifndef da_implicit_guard
@@ -319,6 +330,7 @@ static inline void da_monitor_destroy(void)
 	}
 
 	da_monitor_reset_all();
+	da_monitor_sync_hook();
 
 	rv_put_task_monitor_slot(task_mon_slot);
 	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
@@ -497,10 +509,9 @@ static void da_monitor_reset_all(void)
 	struct da_monitor_storage *mon_storage;
 	int bkt;
 
-	rcu_read_lock();
+	guard(rcu)();
 	hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node)
 		da_monitor_reset(&mon_storage->rv.da_mon);
-	rcu_read_unlock();
 }
 
 static inline int da_monitor_init(void)
@@ -516,13 +527,17 @@ static inline void da_monitor_destroy(void)
 	int bkt;
 
 	tracepoint_synchronize_unregister();
+	scoped_guard(rcu) {
+		hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node) {
+			da_monitor_reset_hook(&mon_storage->rv.da_mon);
+		}
+	}
+	da_monitor_sync_hook();
 	/*
 	 * This function is called after all probes are disabled and no longer
 	 * pending, we can safely assume no concurrent user.
 	 */
-	synchronize_rcu();
 	hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node) {
-		da_monitor_reset_hook(&mon_storage->rv.da_mon);
 		hash_del_rcu(&mon_storage->node);
 		kfree(mon_storage);
 	}
diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index d59507e8cb30..47ff1a41febe 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -36,6 +36,7 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
 #define da_monitor_event_hook ha_monitor_handle_constraint
 #define da_monitor_init_hook ha_monitor_init_env
 #define da_monitor_reset_hook ha_monitor_reset_env
+#define da_monitor_sync_hook() synchronize_rcu()
 
 #include <rv/da_monitor.h>
 #include <linux/seq_buf.h>
@@ -237,12 +238,25 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
 	return false;
 }
 
+/*
+ * __ha_monitor_timer_callback - generic callback representation
+ *
+ * This callback runs in an RCU read-side critical section to allow the
+ * destruction sequence to easily synchronize_rcu() with all pending timers
+ * after asynchronously disabling them.
+ */
 static inline void __ha_monitor_timer_callback(struct ha_monitor *ha_mon)
 {
-	enum states curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
 	DECLARE_SEQ_BUF(env_string, ENV_BUFFER_SIZE);
-	u64 time_ns = ha_get_ns();
+	enum states curr_state;
+	u64 time_ns;
+
+	if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
+		return;
 
+	guard(rcu)();
+	curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
+	time_ns = ha_get_ns();
 	ha_get_env_string(&env_string, ha_mon, time_ns);
 	ha_react(curr_state, EVENT_NONE, env_string.buffer);
 	ha_trace_error_env(ha_mon, model_get_state_name(curr_state),
-- 
2.54.0


^ permalink raw reply related

* [PATCH 5/9] rv: Ensure all pending probes terminate on per-obj monitor destroy
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, Nam Cao,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>

The monitor disable/destroy sequence detaches all probes and resets the
monitor's data, but it doesn't wait for pending probes. This is an
issue with per-object monitors, which free the monitor storage.

Call tracepoint_synchronize_unregister() to wait for all pending probes
before destroying the monitor storage.

Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index a9fd284195ee..a4a13b62d1a4 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -515,9 +515,10 @@ static inline void da_monitor_destroy(void)
 	struct hlist_node *tmp;
 	int bkt;
 
+	tracepoint_synchronize_unregister();
 	/*
-	 * This function is called after all probes are disabled, we need only
-	 * worry about concurrency against old events.
+	 * This function is called after all probes are disabled and no longer
+	 * pending, we can safely assume no concurrent user.
 	 */
 	synchronize_rcu();
 	hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node) {
-- 
2.54.0


^ permalink raw reply related

* [PATCH 4/9] rv: Prevent task migration while handling per-CPU events
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Nam Cao, Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>

Tracepoint handlers are now fully preemptible. When a per-CPU monitor
handles an event, it retrieves the monitor state using a per-CPU
pointer. If the event itself doesn't disable preemption, the task can
migrate to a different CPU and we risk updating the wrong monitor.

Mitigate this by explicitly disabling task migration before acquiring
the monitor pointer. This cannot guarantee the monitor runs on the
correct CPU but reduces the race condition window and prevents warnings.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 0b7028df08fb..a9fd284195ee 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -181,6 +181,10 @@ static inline void da_monitor_destroy(void)
 	da_monitor_reset_all();
 }
 
+#ifndef da_implicit_guard
+#define da_implicit_guard()
+#endif
+
 #elif RV_MON_TYPE == RV_MON_PER_CPU
 /*
  * Functions to define, init and get a per-cpu monitor.
@@ -230,6 +234,10 @@ static inline void da_monitor_destroy(void)
 	da_monitor_reset_all();
 }
 
+#ifndef da_implicit_guard
+#define da_implicit_guard() guard(migrate)()
+#endif
+
 #elif RV_MON_TYPE == RV_MON_PER_TASK
 /*
  * Functions to define, init and get a per-task monitor.
@@ -677,6 +685,7 @@ static inline bool __da_handle_start_run_event(struct da_monitor *da_mon,
  */
 static inline void da_handle_event(enum events event)
 {
+	da_implicit_guard();
 	__da_handle_event(da_get_monitor(), event, 0);
 }
 
@@ -692,6 +701,7 @@ static inline void da_handle_event(enum events event)
  */
 static inline bool da_handle_start_event(enum events event)
 {
+	da_implicit_guard();
 	return __da_handle_start_event(da_get_monitor(), event, 0);
 }
 
@@ -703,6 +713,7 @@ static inline bool da_handle_start_event(enum events event)
  */
 static inline bool da_handle_start_run_event(enum events event)
 {
+	da_implicit_guard();
 	return __da_handle_start_run_event(da_get_monitor(), event, 0);
 }
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH 3/9] rv: Reset per-task DA monitors before releasing the slot
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, Nam Cao,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>

Per-task monitors use task_mon_slot to determine which slot in the array
to use for the monitor. During destruction, the slot is released before
the monitor is reset. As a result, the reset accesses a slot that is
outside of the array (RV_PER_TASK_MONITOR_INIT).

Release the slot only after the reset to avoid out-of-bounds memory
access.
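
The ordering problem can be reproduced in a small userspace sketch.
Everything below is hypothetical (toy_task, NR_SLOTS, SLOT_INIT and the
helper names are not kernel identifiers); it only assumes, as described
above, that the sentinel stored on release lies outside the valid slot
range. The bounds check stands in for what would be an out-of-bounds
access in the real code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SLOTS  2		/* hypothetical number of per-task slots */
#define SLOT_INIT NR_SLOTS	/* sentinel: one past the last valid slot */

struct toy_task {
	int mon_state[NR_SLOTS];
};

static int mon_slot = SLOT_INIT;

/* Reset the monitor state of one task; report whether the slot was valid. */
static bool toy_reset(struct toy_task *t)
{
	if (mon_slot < 0 || mon_slot >= NR_SLOTS)
		return false;	/* the real code would write out of bounds */
	t->mon_state[mon_slot] = 0;
	return true;
}

/* Old order: release the slot, then reset. */
static bool toy_destroy_release_first(struct toy_task *t)
{
	mon_slot = SLOT_INIT;
	return toy_reset(t);
}

/* New order: reset, then release the slot. */
static bool toy_destroy_reset_first(struct toy_task *t)
{
	bool ok = toy_reset(t);

	mon_slot = SLOT_INIT;
	return ok;
}

/* Exercise both orders on a task that uses slot 1. */
void toy_slot_demo(void)
{
	struct toy_task t = { .mon_state = { 7, 7 } };

	mon_slot = 1;
	assert(!toy_destroy_release_first(&t));	/* reset missed the slot */
	assert(t.mon_state[1] == 7);

	mon_slot = 1;
	assert(toy_destroy_reset_first(&t));	/* reset hit slot 1 */
	assert(t.mon_state[1] == 0);
}
```

With the old order the reset never reaches the slot the monitor was
using; with the new order it does, and the slot is handed back only
afterwards.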

Fixes: 30984ccf31b7f ("rv: Refactor da_monitor to minimise macros")
Fixes: 792575348ff70 ("rv/include: Add deterministic automata monitor definition via C macros")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 250888812125..0b7028df08fb 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -309,10 +309,11 @@ static inline void da_monitor_destroy(void)
 		WARN_ONCE(1, "Disabling a disabled monitor: " __stringify(MONITOR_NAME));
 		return;
 	}
-	rv_put_task_monitor_slot(task_mon_slot);
-	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
 
 	da_monitor_reset_all();
+
+	rv_put_task_monitor_slot(task_mon_slot);
+	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
 }
 
 #elif RV_MON_TYPE == RV_MON_PER_OBJ
-- 
2.54.0


^ permalink raw reply related

* [PATCH 2/9] rv: Fix read_lock scope in per-task DA cleanup
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, Nam Cao,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>

The da_monitor_reset_all() function for per-task monitors takes
tasklist_lock while iterating over tasks, then keeps holding it while
iterating over idle tasks (one per CPU). The latter is not necessary
since the lock only needs to guard for_each_process_thread().

Use a scoped_guard for more compact syntax and narrow the lock scope to
where it is necessary.
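
As a rough userspace illustration of the scoping, the sketch below
builds a toy scoped lock on the GCC/Clang cleanup attribute. The names
(toy_rwlock, toy_scoped_read_lock, toy_reset_all) are hypothetical and
the macro is only an approximation of the idea, not the kernel's
scoped_guard() implementation. The counter records whether the lock is
held in each of the two phases.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for tasklist_lock: counts current read holders. */
struct toy_rwlock {
	int readers;
};

static struct toy_rwlock *toy_read_lock(struct toy_rwlock *l)
{
	l->readers++;
	return l;
}

/* Cleanup handler: called with a pointer to the guard variable. */
static void toy_read_unlock(struct toy_rwlock **l)
{
	(*l)->readers--;
}

/* Run the following block once with the lock held, unlock on scope exit. */
#define toy_scoped_read_lock(l)						\
	for (struct toy_rwlock *_g					\
		__attribute__((cleanup(toy_read_unlock))) =		\
			toy_read_lock(l), *_once = _g;			\
	     _once; _once = NULL)

static int held_during_threads;
static int held_during_idle;

void toy_reset_all(struct toy_rwlock *l)
{
	toy_scoped_read_lock(l) {
		/* stands for the for_each_process_thread() walk */
		held_during_threads = l->readers;
	}
	/* stands for the per-CPU idle task walk, now outside the lock */
	held_during_idle = l->readers;
}
```

The lock is dropped as soon as the braces close, so the second phase
runs without it and no explicit unlock call is needed.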

Fixes: 30984ccf31b7f ("rv: Refactor da_monitor to minimise macros")
Fixes: 8259cb14a7068 ("rv: Reset per-task monitors also for idle tasks")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 39765ff6f098..250888812125 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -272,12 +272,12 @@ static void da_monitor_reset_all(void)
 	struct task_struct *g, *p;
 	int cpu;
 
-	read_lock(&tasklist_lock);
-	for_each_process_thread(g, p)
-		da_monitor_reset(da_get_monitor(p));
+	scoped_guard(read_lock, &tasklist_lock) {
+		for_each_process_thread(g, p)
+			da_monitor_reset(da_get_monitor(p));
+	}
 	for_each_present_cpu(cpu)
 		da_monitor_reset(da_get_monitor(idle_task(cpu)));
-	read_unlock(&tasklist_lock);
 }
 
 /*
-- 
2.54.0


^ permalink raw reply related

* [PATCH 1/9] rv: Fix __user specifier usage in extract_params()
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
	Nam Cao, linux-trace-kernel
  Cc: kernel test robot, Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>

The attribute variables extracted from syscalls in the helper are both
defined with the __user specifier, although only the actual pointer to
user data should be marked.

Remove the __user specifier from attr.
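
The reason a single declaration marks both variables can be shown in
plain C with const as a stand-in, since __user is only meaningful to
sparse. This is an analogy under that assumption, with hypothetical
names (attr_like, IS_CONST_OBJ): a specifier placed before the
declarator list belongs to the shared base type and therefore applies
to every declarator in the list.

```c
#include <assert.h>

struct attr_like {
	int flags;
};

/* Evaluates to 1 if the object x has a const-qualified type, 0 otherwise. */
#define IS_CONST_OBJ(x) \
	_Generic(&(x), const struct attr_like *: 1, struct attr_like *: 0)

/* One declaration, two declarators: const applies to ptr's target and to val. */
static int shared_decl_marks_value(void)
{
	struct attr_like const *ptr, val = { 0 };

	ptr = &val;
	(void)ptr;
	return IS_CONST_OBJ(val);
}

/* Split declarations: only the pointer target is const. */
static int split_decl_marks_value(void)
{
	struct attr_like const *ptr;
	struct attr_like val = { 0 };

	ptr = &val;
	(void)ptr;
	return IS_CONST_OBJ(val);
}

void qualifier_demo(void)
{
	assert(shared_decl_marks_value() == 1);
	assert(split_decl_marks_value() == 0);
}
```

Splitting the declaration, as the patch does, is the simplest way to
keep the marking on the pointer only.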

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604150820.Ny143u6X-lkp@intel.com
Fixes: b133207deb72 ("rv: Add nomiss deadline monitor")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 kernel/trace/rv/monitors/deadline/deadline.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/rv/monitors/deadline/deadline.h b/kernel/trace/rv/monitors/deadline/deadline.h
index 0bbfd2543329..78fca873d61e 100644
--- a/kernel/trace/rv/monitors/deadline/deadline.h
+++ b/kernel/trace/rv/monitors/deadline/deadline.h
@@ -95,7 +95,8 @@ static inline u8 get_server_type(struct task_struct *tsk)
 static inline int extract_params(struct pt_regs *regs, long id, pid_t *pid_out)
 {
 	size_t size = offsetofend(struct sched_attr, sched_flags);
-	struct sched_attr __user *uattr, attr;
+	struct sched_attr __user *uattr;
+	struct sched_attr attr;
 	int new_policy = -1, ret;
 	unsigned long args[6];
 
-- 
2.54.0


^ permalink raw reply related
