Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* [PATCH 1/9] rv: Fix __user specifier usage in extract_params()
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
	Nam Cao, linux-trace-kernel
  Cc: kernel test robot, Wen Yang
In-Reply-To: <20260512140250.262190-1-gmonaco@redhat.com>

The attributes variables extracted from syscalls in the helper are both
defined with the __user specifier although only the actual pointer to
user data should be marked.

Remove the __user specifier from attr.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604150820.Ny143u6X-lkp@intel.com
Fixes: b133207deb72 ("rv: Add nomiss deadline monitor")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 kernel/trace/rv/monitors/deadline/deadline.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/rv/monitors/deadline/deadline.h b/kernel/trace/rv/monitors/deadline/deadline.h
index 0bbfd2543329..78fca873d61e 100644
--- a/kernel/trace/rv/monitors/deadline/deadline.h
+++ b/kernel/trace/rv/monitors/deadline/deadline.h
@@ -95,7 +95,8 @@ static inline u8 get_server_type(struct task_struct *tsk)
 static inline int extract_params(struct pt_regs *regs, long id, pid_t *pid_out)
 {
 	size_t size = offsetofend(struct sched_attr, sched_flags);
-	struct sched_attr __user *uattr, attr;
+	struct sched_attr __user *uattr;
+	struct sched_attr attr;
 	int new_policy = -1, ret;
 	unsigned long args[6];
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH 0/9] rv: Fixes on Deterministic and Hybrid Automata
From: Gabriele Monaco @ 2026-05-12 14:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: Gabriele Monaco, Nam Cao, Wen Yang, linux-trace-kernel

Fix issues that were reported by bots or visible only after integration:
 * Make sure timers are always terminated and waited for when disabling
   the monitor or when the target terminates
 * Run per-cpu monitors with migration disabled since preemption is now
   enabled from tracepoints
 * Fix a wrong __user specifier in a helper function
 * Other cleanup and concurrency issues

Cc: Nam Cao <namcao@linutronix.de>
Cc: Wen Yang <wen.yang@linux.dev>
Cc: linux-trace-kernel@vger.kernel.org

Gabriele Monaco (9):
  rv: Fix __user specifier usage in extract_params()
  rv: Fix read_lock scope in per-task DA cleanup
  rv: Reset per-task DA monitors before releasing the slot
  rv: Prevent task migration while handling per-CPU events
  rv: Ensure all pending probes terminate on per-obj monitor destroy
  rv: Ensure synchronous cleanup for HA monitors
  rv: Do not rely on clean monitor when initialising HA
  rv: Add automatic cleanup handlers for per-task HA monitors
  rv: Mandate deallocation for per-obj monitors

 include/rv/da_monitor.h                       | 66 ++++++++++---
 include/rv/ha_monitor.h                       | 93 ++++++++++++++++++-
 kernel/trace/rv/monitors/deadline/deadline.h  |  8 +-
 kernel/trace/rv/monitors/nomiss/nomiss.c      |  4 +-
 kernel/trace/rv/monitors/opid/opid.c          |  4 +-
 kernel/trace/rv/monitors/stall/stall.c        |  4 +-
 .../rvgen/rvgen/templates/dot2k/main.c        |  4 +-
 7 files changed, 157 insertions(+), 26 deletions(-)


base-commit: 6d35786de28116ecf78797a62b84e6bf3c45aa5a
-- 
2.54.0


^ permalink raw reply

* Re: [bug report] bootconfig: init: Allow admin to use bootconfig for kernel command line
From: Breno Leitao @ 2026-05-12 13:54 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Dan Carpenter, kernel-janitors, Linux Trace Kernel, linux-kernel
In-Reply-To: <20260512091638.8b95253ab022d7dabf473465@kernel.org>

On Tue, May 12, 2026 at 09:16:38AM +0900, Masami Hiramatsu wrote:
> Hi Dan,
>
> Thanks for reporting. A similar problem is pointed by Sashiko [1].
>
> [1] https://sashiko.dev/#/patchset/20260508-bootconfig_using_tools-v1-0-1132219aa773%40debian.org
>
> On Fri, 8 May 2026 20:07:25 +0300
> Dan Carpenter <error27@gmail.com> wrote:
>
> > Hello Masami Hiramatsu,
> >
> > Commit 51887d03aca1 ("bootconfig: init: Allow admin to use bootconfig
> > for kernel command line") from Jan 11, 2020 (linux-next), leads to
> > the following Smatch static checker warning:
> >
> > 	init/main.c:368 xbc_snprint_cmdline()

For your information, I am moving this function to lib/bootconfig.

https://lore.kernel.org/all/20260508-bootconfig_using_tools-v1-1-1132219aa773@debian.org/

I understand that no action is required on this report, correct?

Thanks,
--breno

^ permalink raw reply

* [PATCH v3] tracing: Fix nr_subbufs initialization in simple_ring_buffer_init_mm()
From: David Carlier @ 2026-05-12 13:54 UTC (permalink / raw)
  To: catalin.marinas, will
  Cc: rostedt, mhiramat, mathieu.desnoyers, vdonnefort, ryan.roberts,
	maz, linux-arm-kernel, linux-trace-kernel, linux-kernel,
	David Carlier

nr_subbufs in the ring buffer metadata is always initialized to zero
because it is assigned from cpu_buffer->nr_pages before the page
initialization loop has run. While nr_subbufs is not currently read
by the kernel, it should reflect the actual buffer geometry in the
meta page for correctness.

Move the assignment after the page loop so that cpu_buffer->nr_pages
holds the final count.

Fixes: 34e5b958bdad ("tracing: Introduce simple_ring_buffer")
Reviewed-by: Vincent Donnefort <vdonnefort@google.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: David Carlier <devnexen@gmail.com>
---
 kernel/trace/simple_ring_buffer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/simple_ring_buffer.c b/kernel/trace/simple_ring_buffer.c
index 02af2297ae5a..f731f14d0ff7 100644
--- a/kernel/trace/simple_ring_buffer.c
+++ b/kernel/trace/simple_ring_buffer.c
@@ -395,7 +395,6 @@ int simple_ring_buffer_init_mm(struct simple_rb_per_cpu *cpu_buffer,
 
 	memset(cpu_buffer->meta, 0, sizeof(*cpu_buffer->meta));
 	cpu_buffer->meta->meta_page_size = PAGE_SIZE;
-	cpu_buffer->meta->nr_subbufs = cpu_buffer->nr_pages;
 
 	/* The reader page is not part of the ring initially */
 	page = load_page(desc->page_va[0]);
@@ -437,6 +436,7 @@ int simple_ring_buffer_init_mm(struct simple_rb_per_cpu *cpu_buffer,
 		return ret;
 	}
 
+	cpu_buffer->meta->nr_subbufs = cpu_buffer->nr_pages;
 	/* Close the ring */
 	bpage->link.next = &cpu_buffer->tail_page->link;
 	cpu_buffer->tail_page->link.prev = &bpage->link;
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH] rtla: Stop the record trace on interrupt
From: Tomas Glozar @ 2026-05-12 13:51 UTC (permalink / raw)
  To: Crystal Wood
  Cc: Steven Rostedt, linux-trace-kernel, John Kacur, Costa Shulyupin,
	Wander Lairson Costa
In-Reply-To: <CAP4=nvQEnCG2vabuA52p3bmGo7f1_-pZJ2b6rr5N=bKmaBuqxg@mail.gmail.com>

út 12. 5. 2026 v 15:50 odesílatel Tomas Glozar <tglozar@redhat.com> napsal:
>
> This affects only the "new" (since 6.19) --on-end trace option, right?

s/6.19/6.17/


^ permalink raw reply

* Re: [PATCH] rtla: Stop the record trace on interrupt
From: Tomas Glozar @ 2026-05-12 13:50 UTC (permalink / raw)
  To: Crystal Wood
  Cc: Steven Rostedt, linux-trace-kernel, John Kacur, Costa Shulyupin,
	Wander Lairson Costa
In-Reply-To: <20260511223509.1483323-1-crwood@redhat.com>

út 12. 5. 2026 v 0:35 odesílatel Crystal Wood <crwood@redhat.com> napsal:
>
> Before, when rtla gets a signal, it stopped the main trace but not the
> record trace.  save_trace_to_file() could also fail to keep up on a debug
> kernel -- and in any case, it adds post-stoppage noise to the trace file.
>

This affects only the "new" (since 6.19) --on-end trace option, right?
The regular --trace/--on-threshold trace should already disable the
instance in-kernel, as timerlat disables all instances set to the
tracer on stop_tracing threshold.

If so, that should be reflected in the commit message.

> Signed-off-by: Crystal Wood <crwood@redhat.com>
> ---
>  tools/tracing/rtla/src/common.c   | 19 +++++++++++--------
>  tools/tracing/rtla/src/common.h   |  1 -
>  tools/tracing/rtla/src/timerlat.c |  2 +-
>  3 files changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
> index 35e3d3aa922e..effad523e8cf 100644
> --- a/tools/tracing/rtla/src/common.c
> +++ b/tools/tracing/rtla/src/common.c
> @@ -10,7 +10,7 @@
>
>  #include "common.h"
>
> -struct trace_instance *trace_inst;
> +struct osnoise_tool *trace_tool;
>  volatile int stop_tracing;
>  int nr_cpus;
>
> @@ -21,12 +21,16 @@ static void stop_trace(int sig)
>                  * Stop requested twice in a row; abort event processing and
>                  * exit immediately
>                  */
> -               tracefs_iterate_stop(trace_inst->inst);
> +               if (trace_tool)
> +                       tracefs_iterate_stop(trace_tool->trace.inst);

Can this condition be ever false? The signal should be attached after
trace_inst is initialized. Of course, it doesn't hurt to have it there
for safety, as long as we keep in mind that it does not protect from
the trace instance being freed (which should be fixed by commit
be8058f31b4e already).

>                 return;
>         }
>         stop_tracing = 1;
> -       if (trace_inst)
> -               trace_instance_stop(trace_inst);
> +       if (trace_tool) {
> +               trace_instance_stop(&trace_tool->trace);
> +               if (trace_tool->record)
> +                       trace_instance_stop(&trace_tool->record->trace);
> +       }
>  }
>

Otherwise this makes sense. The reason for doing trace_instance_stop()
in stop_trace() and not waiting for cleanup is that in tracefs mode,
the loop would get struck in tracefs_iterate_raw_events() if it cannot
process the events faster than they are created (see commit
c73cab9dbed and a4dfce7559d respectively). But it can cause a
discrepancy in the trace output, as you point out.

>  /*
> @@ -273,11 +277,10 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
>         tool->params = params;
>
>         /*
> -        * Save trace instance into global variable so that SIGINT can stop
> -        * the timerlat tracer.
> +        * Expose the tool to signal handlers so they can stop the trace.
>          * Otherwise, rtla could loop indefinitely when overloaded.
>          */
> -       trace_inst = &tool->trace;
> +       trace_tool = tool;
>
>         retval = ops->apply_config(tool);
>         if (retval) {
> @@ -285,7 +288,7 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
>                 goto out_free;
>         }
>
> -       retval = enable_tracer_by_name(trace_inst->inst, ops->tracer);
> +       retval = enable_tracer_by_name(tool->trace.inst, ops->tracer);
>         if (retval) {
>                 err_msg("Failed to enable %s tracer\n", ops->tracer);
>                 goto out_free;
> diff --git a/tools/tracing/rtla/src/common.h b/tools/tracing/rtla/src/common.h
> index 51665db4ffce..eba40b6d9504 100644
> --- a/tools/tracing/rtla/src/common.h
> +++ b/tools/tracing/rtla/src/common.h
> @@ -54,7 +54,6 @@ struct osnoise_context {
>         int                     opt_workload;
>  };
>
> -extern struct trace_instance *trace_inst;
>  extern volatile int stop_tracing;
>

This is removed as it is not needed now after the consolidation I
assume, since all users are now in common.c? That could also be
mentioned in the commit message.

>  struct hist_params {
> diff --git a/tools/tracing/rtla/src/timerlat.c b/tools/tracing/rtla/src/timerlat.c
> index f8c057518d22..637f68d684f5 100644
> --- a/tools/tracing/rtla/src/timerlat.c
> +++ b/tools/tracing/rtla/src/timerlat.c
> @@ -202,7 +202,7 @@ void timerlat_analyze(struct osnoise_tool *tool, bool stopped)
>                  * If the trace did not stop with --aa-only, at least print
>                  * the max known latency.
>                  */
> -               max_lat = tracefs_instance_file_read(trace_inst->inst, "tracing_max_latency", NULL);
> +               max_lat = tracefs_instance_file_read(tool->trace.inst, "tracing_max_latency", NULL);

Not sure why this used trace_inst instead of tool which is right
there, maybe again refactoring of the code?

>                 if (max_lat) {
>                         printf("  Max latency was %s\n", max_lat);
>                         free(max_lat);
> --
> 2.54.0
>

Otherwise, LGTM. You can also add:

Fixes: c73cab9dbed0 ("rtla/timerlat_hist: Stop timerlat tracer on signal")
Fixes: a4dfce7559d7 ("rtla/timerlat_top: Stop timerlat tracer on signal")
Fixes: 3aadb65db5d6 ("rtla/timerlat: Add action on end feature")

as this fixes two bugs, one with the post-stoppage noise (that is
present since the trace_instance_stop() was added), one with the
save_trace_to_file() not keeping up (which should be since the
--on-end option was added).

Tomas


^ permalink raw reply

* Re: [PATCH v6 2/4] mm/memory-failure: classify get_any_page() failures by reason
From: Breno Leitao @ 2026-05-12 13:33 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett, linux-mm,
	linux-kernel, linux-doc, linux-kselftest, linux-trace-kernel,
	kernel-team, Lance Yang
In-Reply-To: <28b01c14-3d87-4cab-b695-5b9015578785@kernel.org>

On Tue, May 12, 2026 at 10:21:50AM +0200, David Hildenbrand (Arm) wrote:
> 
> >  		}
> >  		goto unlock_mutex;
> >  	} else if (res < 0) {
> > -		if (is_reserved)
> > +		/*
> > +		 * Promote a stable unhandlable kernel page diagnosed by
> > +		 * get_hwpoison_page() to MF_MSG_KERNEL alongside reserved
> > +		 * pages; transient lifecycle races stay as MF_MSG_GET_HWPOISON.
> > +		 */
> > +		if (is_reserved || gp_status == MF_GET_PAGE_UNHANDLABLE)
> >  			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> 
> 
> It's all a bit of a mess. get_hwpoison_page() should just indicate that a page
> is unhandable if it is PG_reserved?

Are you saying that we should identify if the page is PG_reserved in
get_hwpoison_page() instead of in memory_failure(), as done in the
previous patch ("mm/memory-failure: report MF_MSG_KERNEL for reserved
pages") ?

> Why can't we just return a special error code from  get_hwpoison_page()? We ahve
> plenty of errno values to chose from.

Something like:


diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 866c4428ac7ef..0a6d83575833e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -878,7 +878,7 @@ static const char *action_name[] = {
 };
 
 static const char * const action_page_types[] = {
-	[MF_MSG_KERNEL]			= "reserved kernel page",
+	[MF_MSG_KERNEL]			= "unrecoverable kernel page",
 	[MF_MSG_KERNEL_HIGH_ORDER]	= "high-order kernel page",
 	[MF_MSG_HUGE]			= "huge page",
 	[MF_MSG_FREE_HUGE]		= "free huge page",
@@ -1394,6 +1394,21 @@ static int get_any_page(struct page *p, unsigned long flags)
 	int ret = 0, pass = 0;
 	bool count_increased = false;

+	if (PageReserved(p)) {
+		ret = -ENOTRECOVERABLE;
+		goto out;
+	}
+
 	if (flags & MF_COUNT_INCREASED)
 		count_increased = true;
 
@@ -1422,7 +1437,7 @@ static int get_any_page(struct page *p, unsigned long flags)
 				shake_page(p);
 				goto try_again;
 			}
-			ret = -EIO;
+			ret = -ENOTRECOVERABLE;
 			goto out;
 		}
 	}
@@ -1441,10 +1456,10 @@ static int get_any_page(struct page *p, unsigned long flags)
 			goto try_again;
 		}
 		put_page(p);
-		ret = -EIO;
+		ret = -ENOTRECOVERABLE;
 	}
 out:
-	if (ret == -EIO)
+	if (ret == -EIO || ret == -ENOTRECOVERABLE)
 		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
 
 	return ret;
@@ -2431,6 +2448,9 @@ int memory_failure(unsigned long pfn, int flags)
 			res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
 		}
 		goto unlock_mutex;
+	} else if (res == -ENOTRECOVERABLE) {
+		res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+		goto unlock_mutex;
 	} else if (res < 0) {
 		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
 		goto unlock_mutex;


If that is what you are suggestion, maybe we can create another
MF_MSG_RESERVED? and another return value for get_any_page() to track
the reserve pages ?

Thanks for the review and suggestions,
--breno

^ permalink raw reply related

* Re: [RFC PATCH v2 06/10] rvgen: support reset() on the __init arrow for global-window HA clocks
From: Gabriele Monaco @ 2026-05-12 13:25 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <aa156a1c7696e054f8db57c48a26fa6ec1e17395.1778522945.git.wen.yang@linux.dev>



On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> rvgen rejects a state invariant when its env is never reset on any
> state-transition edge.  This prevents expressing monitors where a clock
> tracks the full monitoring window — reset once at object creation,
> active in all states.
> 
> Allow reset() annotations on the __init_STATE -> STATE arrow.
> automata.py adds listed envs to the new env_init_started set (and to
> env_stored so the HA framework allocates per-object storage).  dot2k.py
> uses env_init_started for three purposes:
> 
> - Generate a handle_monitor_start() skeleton that resets the env and
>   arms the timer after the caller sets up DA storage and initial state.
> 
> - Guard ha_inv_to_guard calls with !ha_monitor_env_invalid() for these
>   envs: a concurrent DA event between da_handle_start_event() and
>   ha_reset_env() would otherwise store U64_MAX - BUDGET as the guard
>   anchor, silently disabling enforcement.
> 
> - Always generate ha_verify_guards() for monitors with invariants,
>   providing a stable extension point for future per-event guards.
> 
> Models without __init resets (e.g. stall.dot) are unaffected.
> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>

Mmh, I see that's an issue, but technically the init arrow isn't a real state
transition. In your case, you have a start condition that you labelled
"switch_in_tlob" although it isn't a switch in.

Why don't you make it a separate event (e.g. "start_tlob"), it seems to me you
cannot really have that occur multiple times, so doing the following wouldn't
harm and you wouldn't need to change how HA monitors work:

                |
                |
                v
              +-----------------------------+   start;reset(clk)
              |           running           | -------------------+
 switch_in    |      clk < BUDGET_NS()      |                    |
  +---------> |                             | <------------------+
  |           +-----------------------------+
  |             |                  |
  |             | sleep            |
  |             v                  |
  |                                |
  |          sleeping              |
  |      clk < BUDGET_NS()         |
  |                                | preempt
  |             |                  |
  |             | wakeup           |
  |             v                  |
  |                                |
  |                 waiting        |
  +----------  clk < BUDGET_NS()  <+


Then you also wouldn't need to call reset() and start_timer() manually.
Isn't that feasible?

Thanks,
Gabriele

> ---
>  tools/verification/rvgen/rvgen/automata.py |  26 ++++++
>  tools/verification/rvgen/rvgen/dot2k.py    | 100 +++++++++++++++++++--
>  2 files changed, 119 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/verification/rvgen/rvgen/automata.py
> b/tools/verification/rvgen/rvgen/automata.py
> index b9f8149f7118..178a1a4ffd8a 100644
> --- a/tools/verification/rvgen/rvgen/automata.py
> +++ b/tools/verification/rvgen/rvgen/automata.py
> @@ -69,15 +69,41 @@ class Automata:
>          self.states, self.initial_state, self.final_states =
> self.__get_state_variables()
>          self.env_types = {}
>          self.env_stored = set()
> +        self.env_init_started = set()
>          self.constraint_vars = set()
>          self.self_loop_reset_events = set()
>          self.events, self.envs = self.__get_event_variables()
> +        self.__parse_init_resets()
>          self.function, self.constraints = self.__create_matrix()
>          self.events_start, self.events_start_run = self.__store_init_events()
>          self.env_stored = sorted(self.env_stored)
> +        self.env_init_started = sorted(self.env_init_started)
>          self.constraint_vars = sorted(self.constraint_vars)
>          self.self_loop_reset_events = sorted(self.self_loop_reset_events)
>  
> +    def __parse_init_resets(self) -> None:
> +        """Parse reset() annotations on the __init_STATE -> STATE arrow.
> +
> +        Adds each listed env to env_stored (HA framework allocates per-object
> +        storage) and env_init_started (ha2k generates
> handle_monitor_start()).
> +        """
> +        init_prefix = f'"{self.init_marker}'
> +        for line in map(str.lstrip, self.__dot_lines):
> +            if not line.startswith(init_prefix):
> +                continue
> +            split_line = line.split()
> +            if len(split_line) < 3 or split_line[1] != "->":
> +                continue
> +            if "label" not in line:
> +                continue
> +            label = "".join(split_line[split_line.index("label") + 2:-
> 1]).replace('"', '')
> +            for part in label.split(";"):
> +                reset_m = self.constraint_reset.search(part.strip())
> +                if reset_m:
> +                    env = reset_m["env"]
> +                    self.env_stored.add(env)
> +                    self.env_init_started.add(env)
> +
>      def __get_model_name(self) -> str:
>          basename = ntpath.basename(self.__dot_path)
>          if not basename.endswith(".dot") and not basename.endswith(".gv"):
> diff --git a/tools/verification/rvgen/rvgen/dot2k.py
> b/tools/verification/rvgen/rvgen/dot2k.py
> index e6f476b903b0..e8066260c0af 100644
> --- a/tools/verification/rvgen/rvgen/dot2k.py
> +++ b/tools/verification/rvgen/rvgen/dot2k.py
> @@ -366,7 +366,18 @@ f"""static inline void ha_convert_inv_guard(struct
> ha_monitor *ha_mon,
>              conf_g = [e for s, e in conflict_guards if s == state]
>              if not conf_i and not conf_g:
>                  continue
> -            buff.append(f"\t{_else}if (curr_state ==
> {self.states[state]}{self.enum_suffix})")
> +
> +            state_name = f"{self.states[state]}{self.enum_suffix}"
> +            env_full = self.__get_constraint_env(constr)
> +            env_bare = env_full[:-len(self.enum_suffix)]
> +            if env_bare in self.env_init_started:
> +                # env_store is ENV_INVALID_VALUE until
> handle_monitor_start();
> +                # skip ha_inv_to_guard during the init race window.
> +                cont = "\t\t " if _else else "\t    "
> +                buff.append(f"\t{_else}if (curr_state == {state_name} &&")
> +                buff.append(f"{cont}!ha_monitor_env_invalid(ha_mon,
> {env_full}))")
> +            else:
> +                buff.append(f"\t{_else}if (curr_state == {state_name})")
>  
>              buff.append(f"\t\t{self.__start_to_conv(constr)};")
>              _else = "else "
> @@ -376,16 +387,22 @@ f"""static inline void ha_convert_inv_guard(struct
> ha_monitor *ha_mon,
>  
>      def __fill_verify_guards_func(self) -> list[str]:
>          buff = []
> -        if not self.guards:
> +        # Always generate for monitors with invariants: stable extension
> +        # point for future guard conditions.
> +        if not self.guards and not self.invariants:
>              return []
>  
>          buff.append(
>  f"""static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
>  \t\t\t\t    enum {self.enum_states_def} curr_state, enum
> {self.enum_events_def} event,
>  \t\t\t\t    enum {self.enum_states_def} next_state, u64 time_ns)
> -{{
> -\tbool res = true;
> -""")
> +{{""")
> +
> +        if not self.guards:
> +            buff.append("\treturn true;\n}\n")
> +            return buff
> +
> +        buff.append("\tbool res = true;\n")
>  
>          _else = ""
>          for edge, constr in sorted(self.guards.items()):
> @@ -522,7 +539,7 @@ f"""static bool ha_verify_constraint(struct ha_monitor
> *ha_mon,
>              buff.append("\tha_convert_inv_guard(ha_mon, curr_state, event, "
>                          "next_state, time_ns);\n")
>  
> -        if self.guards:
> +        if self.guards or self.invariants:
>              buff.append("\tif (!ha_verify_guards(ha_mon, curr_state, event, "
>                          "next_state, time_ns))\n\t\treturn false;\n")
>  
> @@ -599,8 +616,77 @@ f"""static bool ha_verify_constraint(struct ha_monitor
> *ha_mon,
>                  buff.append("}\n")
>          return buff
>  
> +    def __fill_init_start_helper(self) -> list[str]:
> +        """Generate handle_monitor_start() for envs reset on the __init
> arrow.
> +
> +        env_store is invalid inside da_handle_start_event(); this helper must
> +        be called after DA storage is allocated and initial state is set.
> +        """
> +        if not self.env_init_started:
> +            return []
> +
> +        # Collect the ha_start_timer call for each init-started env from the
> +        # first state invariant that references it.
> +        timer_calls: dict[str, str] = {}
> +        for env in self.env_init_started:
> +            env_full = f"{env}{self.enum_suffix}"
> +            for constr in self.invariants.values():
> +                if env_full in constr:
> +                    timer_calls[env] = constr
> +                    break
> +
> +        buff = []
> +        buff.append(
> +"""/*
> + * handle_monitor_start - reset per-object clock(s) and arm the timer.
> + *
> + * env_store is invalid inside da_handle_start_event(); call this helper
> + * after allocating DA storage and setting the initial DA state.
> + *
> + * XXX: replace the placeholders with the actual logic for your monitor.
> + */""")
> +
> +        if self.monitor_type == "per_obj":
> +            buff.append("static int handle_monitor_start(int id,
> monitor_target t)")
> +            buff.append("{")
> +            buff.append("\tstruct ha_monitor *ha_mon;")
> +            buff.append("\tu64 time_ns = ktime_get_ns();\n")
> +            buff.append("\t/* XXX: allocate DA storage, e.g.
> da_create_or_get(id, t) */")
> +            buff.append("\t/* XXX: set initial DA state, e.g.
> da_handle_start_event(id, t, <event>) */")
> +            buff.append("\tha_mon = /* XXX: retrieve ha_monitor for (id, t)
> */;")
> +        elif self.monitor_type == "per_task":
> +            buff.append("static int handle_monitor_start(struct task_struct
> *p)")
> +            buff.append("{")
> +            buff.append("\tstruct ha_monitor *ha_mon;")
> +            buff.append("\tu64 time_ns = ktime_get_ns();\n")
> +            buff.append("\t/* XXX: allocate DA storage, e.g.
> da_create_or_get(p->pid, p) */")
> +            buff.append("\t/* XXX: set initial DA state, e.g.
> da_handle_start_event(p->pid, p, <event>) */")
> +            buff.append("\tha_mon = /* XXX: retrieve ha_monitor for p */;")
> +        else:
> +            buff.append("static int handle_monitor_start(void)")
> +            buff.append("{")
> +            buff.append("\tstruct ha_monitor *ha_mon;")
> +            buff.append("\tu64 time_ns = ktime_get_ns();\n")
> +            buff.append("\tha_mon = /* XXX: retrieve global ha_monitor */;")
> +
> +        buff.append("\tif (!ha_mon)")
> +        buff.append("\t\treturn -ENOENT;")
> +
> +        for env in self.env_init_started:
> +            buff.append(f"\tha_reset_env(ha_mon, {env}{self.enum_suffix},
> time_ns);")
> +            if env in timer_calls:
> +                buff.append(f"\t{timer_calls[env]};")
> +            else:
> +                buff.append(f"\t/* XXX: arm timer for {env} */")
> +
> +        buff.append("\treturn 0;")
> +        buff.append("}\n")
> +        return buff
> +
>      def _fill_hybrid_definitions(self) -> list[str]:
> -        return self.__fill_hybrid_get_reset_functions() +
> self.__fill_constr_func()
> +        return (self.__fill_hybrid_get_reset_functions() +
> +                self.__fill_init_start_helper() +
> +                self.__fill_constr_func())
>  
>      def _fill_timer_type(self) -> list:
>          if self.invariants:


^ permalink raw reply

* Re: [PATCH v6 3/4] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-05-12 13:05 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett, linux-mm,
	linux-kernel, linux-doc, linux-kselftest, linux-trace-kernel,
	kernel-team
In-Reply-To: <8d4940bc-d8c4-4e7f-a35d-979e6a781966@kernel.org>

On Tue, May 12, 2026 at 10:22:38AM +0200, David Hildenbrand (Arm) wrote:
> 
> > @@ -1281,6 +1292,18 @@ static void update_per_node_mf_stats(unsigned long pfn,
> >  	++mf_stats->total;
> >  }
> >  
> > +static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
> > +				      enum mf_result result)
> > +{
> > +	if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
> > +		return false;
> > +
> > +	if (type == MF_MSG_KERNEL)
> > +		return true;
> > +
> > +	return false;
> 
> return type == MF_MSG_KERNEL;
> 
> might be simpler.

Ack, I will update once we decide about the other pendencies.

^ permalink raw reply

* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: Breno Leitao @ 2026-05-12 13:04 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett, linux-mm,
	linux-kernel, linux-doc, linux-kselftest, linux-trace-kernel,
	kernel-team, Lance Yang
In-Reply-To: <9504c193-8c01-4d03-8f62-c50fd7fbdbc0@kernel.org>

On Tue, May 12, 2026 at 10:17:00AM +0200, David Hildenbrand (Arm) wrote:
> > @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
> >  	unsigned long page_flags;
> >  	bool retry = true;
> >  	int hugetlb = 0;
> > +	bool is_reserved;
> >  
> >  	if (!sysctl_memory_failure_recovery)
> >  		panic("Memory failure on page %lx", pfn);
> > @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
> >  	 * In fact it's dangerous to directly bump up page count from 0,
> >  	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
> >  	 */
> > +	/*
> > +	 * Pages with PG_reserved set are not currently managed by the
> > +	 * page allocator (memblock-reserved memory, driver reservations,
> > +	 * etc.), so classify them as kernel-owned for reporting.
> > +	 *
> > +	 * Sample the flag before get_hwpoison_page(): in the
> > +	 * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
> > +	 * reference before returning -EIO, after which page->flags may
> > +	 * have been reset by the allocator.
> > +	 */
> > +	is_reserved = PageReserved(p);
> > +
> >  	res = get_hwpoison_page(p, flags);
> >  	if (!res) {
> >  		if (is_free_buddy_page(p)) {
> > @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
> >  		}
> >  		goto unlock_mutex;
> >  	} else if (res < 0) {
> > -		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
> > +		if (is_reserved)
> > +			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> > +		else
> > +			res = action_result(pfn, MF_MSG_GET_HWPOISON,
> > +					    MF_IGNORED);
> >  		goto unlock_mutex;
> >  	}
> >  
> > 
> 
> It's a bit odd that we need this handling when we already have handling for
> reserved pages in error_states[].
> 
> HWPoisonHandlable() would always essentially reject PG_reserved pages. So
> __get_hwpoison_page() ... would always fail? Making
> get_hwpoison_page()->get_any_page() always fail?
> 
> But then, we never call identify_page_state()? And never call me_kernel()?

From what I read, it seems that error_states[0] = { reserved, reserved, MF_MSG_KERNEL, me_kernel }
has been effectively dead code on the hwpoison-from-MCE path for a
while.

My v6 patch relabels the failure-path output to match what me_kernel() would
have reported anyway.

> This all looks very odd.
> 
> Why would you even want to call get_hwpoison_page() in the first place if you
> find PageReserved?

Are you suggesting we should all the page action as soon as we detect the page
is reserved and get out?

Something as:

    if (PageReserved(p)) {
        res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
        goto unlock_mutex;
    }

    res = get_hwpoison_page(p, flags);

Thanks for the review,
--breno

^ permalink raw reply

* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: Lance Yang @ 2026-05-12 12:48 UTC (permalink / raw)
  To: david
  Cc: leitao, linmiaohe, nao.horiguchi, akpm, corbet, skhan, ljs,
	vbabka, rppt, surenb, mhocko, shuah, rostedt, mhiramat,
	mathieu.desnoyers, liam, linux-mm, linux-kernel, linux-doc,
	linux-kselftest, linux-trace-kernel, kernel-team, lance.yang
In-Reply-To: <9504c193-8c01-4d03-8f62-c50fd7fbdbc0@kernel.org>


On Tue, May 12, 2026 at 10:17:00AM +0200, David Hildenbrand (Arm) wrote:
>On 5/11/26 17:38, Breno Leitao wrote:
>> When get_hwpoison_page() returns a negative value, distinguish
>> reserved pages from other failure cases by reporting MF_MSG_KERNEL
>> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
>> and should be classified accordingly for proper handling.
>> 
>> Sample PG_reserved before the get_hwpoison_page() call. In the
>> MF_COUNT_INCREASED path get_any_page() can drop the caller's
>> reference before returning -EIO, after which the underlying page may
>> have been freed and reallocated with page->flags reset; reading
>> PageReserved(p) at that point would observe stale or unrelated state.
>> The pre-call snapshot reflects what the page actually was at the
>> time of the failure event.
>> 
>> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>> Signed-off-by: Breno Leitao <leitao@debian.org>
>> ---
>>  mm/memory-failure.c | 19 ++++++++++++++++++-
>>  1 file changed, 18 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 866c4428ac7ef..f112fb27a8ff6 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>>  	unsigned long page_flags;
>>  	bool retry = true;
>>  	int hugetlb = 0;
>> +	bool is_reserved;
>>  
>>  	if (!sysctl_memory_failure_recovery)
>>  		panic("Memory failure on page %lx", pfn);
>> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>>  	 * In fact it's dangerous to directly bump up page count from 0,
>>  	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>>  	 */
>> +	/*
>> +	 * Pages with PG_reserved set are not currently managed by the
>> +	 * page allocator (memblock-reserved memory, driver reservations,
>> +	 * etc.), so classify them as kernel-owned for reporting.
>> +	 *
>> +	 * Sample the flag before get_hwpoison_page(): in the
>> +	 * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
>> +	 * reference before returning -EIO, after which page->flags may
>> +	 * have been reset by the allocator.
>> +	 */
>> +	is_reserved = PageReserved(p);
>> +
>>  	res = get_hwpoison_page(p, flags);
>>  	if (!res) {
>>  		if (is_free_buddy_page(p)) {
>> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>>  		}
>>  		goto unlock_mutex;
>>  	} else if (res < 0) {
>> -		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
>> +		if (is_reserved)
>> +			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>> +		else
>> +			res = action_result(pfn, MF_MSG_GET_HWPOISON,
>> +					    MF_IGNORED);
>>  		goto unlock_mutex;
>>  	}
>>  
>> 
>
>It's a bit odd that we need this handling when we already have handling for
>reserved pages in error_states[].
>
>HWPoisonHandlable() would always essentially reject PG_reserved pages. So
>__get_hwpoison_page() ... would always fail? Making
>get_hwpoison_page()->get_any_page() always fail?
>
>But then, we never call identify_page_state()? And never call me_kernel()?

Looks like we never get that far ...

>This all looks very odd.
>
>Why would you even want to call get_hwpoison_page() in the first place if you
>find PageReserved?

Ah, I see :)

For a PG_reserved page, I would not expect PageLRU to be set, nor would
I expect it to be in the buddy allocator.

include/linux/page-flags.h also says:

"
Once (if ever) freed, PG_reserved is cleared and they will be given to
the page allocator.
"

So maybe special-case PageReserved() before get_hwpoison_page()?
Something like:

	if (PageReserved(p)){
		res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
		goto unlock_mutex; 	
	}

	res = get_hwpoison_page(p, flags, &gp_status);

Cheers, Lance

^ permalink raw reply

* Re: [PATCHv2] uprobes: Use flexible array for xol_area bitmap
From: Oleg Nesterov @ 2026-05-12 11:29 UTC (permalink / raw)
  To: Rosen Penev
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Masami Hiramatsu,
	open list:PERFORMANCE EVENTS SUBSYSTEM, open list:UPROBES
In-Reply-To: <20260511225648.27886-1-rosenp@gmail.com>

On 05/11, Rosen Penev wrote:
>
>  struct xol_area {
>  	wait_queue_head_t		wq;		/* if all slots are busy */
> -	unsigned long			*bitmap;	/* 0 = free slot */
>
>  	struct page			*page;
>  	/*
> @@ -117,6 +116,7 @@ struct xol_area {
>  	 * the vma go away, and we must handle that reasonably gracefully.
>  	 */
>  	unsigned long			vaddr;		/* Page(s) of instruction slots */
> +	unsigned long			bitmap[];	/* 0 = free slot */
>  };
>
>  static void uprobe_warn(struct task_struct *t, const char *msg)
> @@ -1755,18 +1755,13 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
>  	struct xol_area *area;
>  	void *insns;
>
> -	area = kzalloc_obj(*area);
> +	area = kzalloc_flex(*area, bitmap, BITS_TO_LONGS(UINSNS_PER_PAGE));

The downside is that kmalloc will use kmem_cache with ->object_size = PAGE_SIZE * 2,
almost half of the allocated memory won't be used...

But technically the patch looks correct so I won't argue.

Oleg.


^ permalink raw reply

* Re: [PATCH 07/13] rv: Simply hybrid automata monitors's clock variables
From: Gabriele Monaco @ 2026-05-12  9:31 UTC (permalink / raw)
  To: Nam Cao
  Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
	linux-kernel
In-Reply-To: <87wlxaupmz.fsf@yellow.woof>



On Mon, 2026-05-11 at 13:55 +0200, Nam Cao wrote:
> Gabriele Monaco <gmonaco@redhat.com> writes:
> > Well, this is roughly what we discussed in [1].
> > Now, I didn't submit the throttle monitor yet because it depends on unacked
> > tracepoints.
> 
> Thanks for bringing that up. I had no memory of that discussion.
> 
> > In short, this works with the assumption that the expires value you pass to
> > ha_check_invariant() is the same you used to arm the timer.
> > 
> > That's true for constant values only (the deadline) but not for something
> > like
> > the runtime. I couldn't think of a way to rearrange that model not to
> > require
> > the runtime left field.
> 
> I believe you are referring to this:
> 
>                                      |
>                                      |
>       dl_replenish;reset(clk)        v
>               sched_switch_in   #=========================# sched_switch_in;
>                +--------------- H                         H   reset(clk)
>                |                H                         H <----------------+
>                +--------------> H         running         H                  |
>     dl_throttle;reset(clk)      H clk < runtime_left_ns() H                  |
>    +--------------------------- H                         H sched_switch_out |
>    |       +------------------> H                         H -------------+   |
>    | dl_replenish;reset(clk)    #=========================#              |   |
>    |       |                         |             ^                     |   |
>    v       |                  dl_defer_arm         |                     |   |
> 
> Now that I stared at this again, I think we already deviate from theory
> here. Our documentation mentions that the invariant must be in the form
> 
>         g = v < c | true
> 
> with "c [being] a numerical value".
> 
> The invariant "clk < runtime_left_ns()" means clk must not exceed the
> remaining runtime, which is sampled by calling runtime_left_ns() when
> the state is entered. This is not in the theory. Additionally, I think
> this interpretation is ambiguous; one could also interpret that as "the
> clk value must never exceed the *current* value returned by
> runtime_left_ns()".

Well, that's a fair point. Using functions here is kind of pushing it, but if we
assume the runtime constant for the duration of the invariant (which is what
happens in practice), I don't see that huge difference. Then sure, I'm still
twisting the theory here.

But that's right, it's quite ambiguous. The function is technically syntactic
sugar in RV to allow monitor-specific values, I should probably make it clear it
doesn't make it a dynamic value (at least within the same constraint
validation).

> I digged into the cited academic papers, but couldn't find anything that
> can describe this. The closest I see is the "init" label for states, but
> that is the condition for entering the states.
> 
> > Otherwise.. We could read the remaining time in the timer, but we wouldn't
> > be
> > able to simulate ns precision when using the timer wheel.
> > 
> > Now if we really wanted to go down that path, we are using a union to
> > allocate
> > either timer or hrtimer, the latter is larger, so we /could/ add a u64
> > expire_ns
> > field to the ha_monitor struct without needing additional memory.
> > 
> > If that doesn't sound too wild to you, I may try and sketch something up to
> > see
> > if that's viable. Then this patch could go through as is and I would add the
> > extension together with the submission of throttle.
> 
> That can work, but not ideal, because hrtimer will not be usable.

Why not? If we have HA_TIMER_WHEEL , we'd use timer and expire, if we have
HA_TIMER_HRTIMER we'd only need hrtimer with it's hrtimer_get_expires():

 union {
 struct hrtimer hrtimer;
 struct {
 struct timer_list timer;
 u64 expire; /* Explicitly store the armed budget */
 };

we already can't use timer and hrtimer interchangeably.
What am I missing here?

> Looking at the throttle monitor again, is it possible to rewrite
> runtime_left_ns() to read .dl_runtime instead of .runtime? I don't know
> the deadline schedule very well, but I think .dl_runtime is not changing
> like .runtime?

In theory yes, but since the runtime is consumed only when running, we cannot
just set the timeout once. We either save how much was consumed somewhere or do
some start/pause mechanism.
Neither looks simpler to me.

Thanks,
Gabriele


^ permalink raw reply

* Re: [RFC PATCH v2 02/10] rv/da: fix per-task da_monitor_destroy() ordering and sync
From: Gabriele Monaco @ 2026-05-12  9:09 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <f654a17c671469fd8fc9ea438daf2266d05068d4.camel@redhat.com>

On Tue, 2026-05-12 at 10:27 +0200, Gabriele Monaco wrote:
> On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> > From: Wen Yang <wen.yang@linux.dev>
> > 
> > The following two paths race:
> > 
> >   CPU 0 (disable_stall/__rv_disable_monitor)  CPU 1 (wwnr probe handler)
> 							^ did you mean stall?

Ok I got it now, so essentially you'd reproduce it like:

* start a DA per-task monitor (no timer)
* stop it, a handler is still running after reset, it sets monitoring back to 1
* start an HA per-task monitor

that would use the same slot that is now looking like:

 { monitoring = 1, timer.function = NULL }

because it was not initialised as HA but monitoring was reset in the race.

Thinking about this again, it isn't just an issue with per-task monitors, all
monitors reusing slots would suffer from it.
Besides, relying on monitoring can be fragile when using LTL monitors on the
same task (those don't even have monitoring).

Perhaps the solution isn't that trivial, I'm going to give one more thought on
it, but thanks again for bringing this up!

Gabriele

> >   ------------------------------------------  -----------------------------
> >   disable_stall()
> >     da_monitor_destroy()
> >       da_monitor_reset_all()          <------ [task T: monitoring=0]
> >                                               da_monitor_start(&T->rv[n])
> >                                               /* no timer_setup */
> >                                                monitoring=1  <----
> >   tracepoint_synchronize_unregister()
> >   // CPU 1 probe has already returned; sync returns
> > 
> > Later, enable_stall() acquires the same slot and calls da_monitor_init():
> > 
> >   da_monitor_reset_all()
> >     da_monitor_reset(&T->rv[slot])    // monitoring=1, timer.function==0
> >       ha_monitor_reset_env()
> >         ha_cancel_timer()
> >           timer_delete(&ha_mon->timer)  // ODEBUG: timer never initialised
> > 
> >   ODEBUG: assert_init not available (active state 0)
> >   object type: timer_list
> >   Call trace: timer_delete <- da_monitor_reset_all <- enable_stall
> > 
> > Call tracepoint_synchronize_unregister() inside da_monitor_destroy()
> > before da_monitor_reset_all().  The unregister_trace_xxx() calls in the
> > monitor's disable() have already disconnected the tracepoints; the sync
> > here drains any handler still in flight, so no new monitoring=1 can
> > appear after da_monitor_reset_all() clears the slot.
> > 
> > Also fix the slot release ordering: release the slot only after
> > reset_all() to avoid accessing rv[] with an out-of-bounds index.
> > 
> > Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> > Signed-off-by: Wen Yang <wen.yang@linux.dev>
> > ---
> 
> Thanks for the fix, I have a similar one waiting for submission.
> 
> These are technically 2 separate fixes though: the ordering with unset
> task_mon_slot (independent on HA) and the synchronisation with pending
> tracepoints. They probably deserve separate patches and visibility, the first
> has always been around and we're technically overwriting who knows what.
> 
> 
> The explanation above is a bit hard to follow though, are you talking about a
> handler for the same (stall) monitor running after the reset, effectively
> undoing it by setting the monitoring flag?
> 
> Then this is indeed an issue with ha_monitor_reset_env() which expects a clean
> environment.
> 
> So that's basically what you'd see now much more often because in fact we
> don't
> reset the right slot (though, again, that's a different issue).
> 
> 
> Calling tracepoint_synchronize_unregister() there too would surely fix, but it
> used to be kinda slow. But it's probably gotten faster since now tracepoints
> use
> SRCU, so we can wait for a dedicated grace period.
> 
> I liked the idea to wait cumulatively in the end, but that's just making
> things
> harder.. Let's do like this:
> 
> Prepare 2 separate patches as fixes, put the task slot one first (would ease
> backporting), mention this issue with the race condition only in the second.
> You can send them independently and I'll add them to the tree as urgent.
> 
> 
> I'm soon going to send my set of fixes that will also include the task slot
> patch (not removing to ease my life with conflicts).
> 
> Thanks,
> Gabriele
> 
> >  include/rv/da_monitor.h | 18 ++++++++++++++++--
> >  1 file changed, 16 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> > index 00ded3d5ab3f..d04bb3229c75 100644
> > --- a/include/rv/da_monitor.h
> > +++ b/include/rv/da_monitor.h
> > @@ -304,6 +304,20 @@ static int da_monitor_init(void)
> >  
> >  /*
> >   * da_monitor_destroy - return the allocated slot
> > + *
> > + * Call tracepoint_synchronize_unregister() before reset_all() to close
> > + * the race where an in-flight non-HA probe handler sets monitoring=1
> > + * (without calling timer_setup()) after da_monitor_reset_all() has
> > + * already cleared the slot but before the caller's own sync completes.
> > + * Without this barrier, an HA_TIMER_WHEEL monitor that later acquires
> > + * the same slot would call timer_delete() on a never-initialised
> > + * timer_list, triggering ODEBUG warnings.
> > + *
> > + * Note: tracepoint_synchronize_unregister() is a system-wide barrier
> > + * that waits for all CPUs to finish any in-flight tracepoint handlers.
> > + * The caller's own __rv_disable_monitor() issues a second sync after
> > + * returning from disable(); that redundant call is harmless on the
> > + * infrequent admin (enable/disable) path.
> >   */
> >  static inline void da_monitor_destroy(void)
> >  {
> > @@ -311,10 +325,10 @@ static inline void da_monitor_destroy(void)
> >  		WARN_ONCE(1, "Disabling a disabled monitor: "
> > __stringify(MONITOR_NAME));
> >  		return;
> >  	}
> > +	tracepoint_synchronize_unregister();
> > +	da_monitor_reset_all();
> >  	rv_put_task_monitor_slot(task_mon_slot);
> >  	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
> > -
> > -	da_monitor_reset_all();
> >  }
> >  
> >  #elif RV_MON_TYPE == RV_MON_PER_OBJ


^ permalink raw reply

* Re: [RFC PATCH v2 02/10] rv/da: fix per-task da_monitor_destroy() ordering and sync
From: Gabriele Monaco @ 2026-05-12  8:27 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <d02c656aada7d071f083460a5c9a454363669b61.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> The following two paths race:
> 
>   CPU 0 (disable_stall/__rv_disable_monitor)  CPU 1 (wwnr probe handler)
							^ did you mean stall?
>   ------------------------------------------  -----------------------------
>   disable_stall()
>     da_monitor_destroy()
>       da_monitor_reset_all()          <------ [task T: monitoring=0]
>                                               da_monitor_start(&T->rv[n])
>                                               /* no timer_setup */
>                                                monitoring=1  <----
>   tracepoint_synchronize_unregister()
>   // CPU 1 probe has already returned; sync returns
> 
> Later, enable_stall() acquires the same slot and calls da_monitor_init():
> 
>   da_monitor_reset_all()
>     da_monitor_reset(&T->rv[slot])    // monitoring=1, timer.function==0
>       ha_monitor_reset_env()
>         ha_cancel_timer()
>           timer_delete(&ha_mon->timer)  // ODEBUG: timer never initialised
> 
>   ODEBUG: assert_init not available (active state 0)
>   object type: timer_list
>   Call trace: timer_delete <- da_monitor_reset_all <- enable_stall
> 
> Call tracepoint_synchronize_unregister() inside da_monitor_destroy()
> before da_monitor_reset_all().  The unregister_trace_xxx() calls in the
> monitor's disable() have already disconnected the tracepoints; the sync
> here drains any handler still in flight, so no new monitoring=1 can
> appear after da_monitor_reset_all() clears the slot.
> 
> Also fix the slot release ordering: release the slot only after
> reset_all() to avoid accessing rv[] with an out-of-bounds index.
> 
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---

Thanks for the fix, I have a similar one waiting for submission.

These are technically 2 separate fixes though: the ordering with unset
task_mon_slot (independent on HA) and the synchronisation with pending
tracepoints. They probably deserve separate patches and visibility, the first
has always been around and we're technically overwriting who knows what.


The explanation above is a bit hard to follow though, are you talking about a
handler for the same (stall) monitor running after the reset, effectively
undoing it by setting the monitoring flag?

Then this is indeed an issue with ha_monitor_reset_env() which expects a clean
environment.

So that's basically what you'd see now much more often because in fact we don't
reset the right slot (though, again, that's a different issue).


Calling tracepoint_synchronize_unregister() there too would surely fix, but it
used to be kinda slow. But it's probably gotten faster since now tracepoints use
SRCU, so we can wait for a dedicated grace period.

I liked the idea to wait cumulatively in the end, but that's just making things
harder.. Let's do like this:

Prepare 2 separate patches as fixes, put the task slot one first (would ease
backporting), mention this issue with the race condition only in the second.
You can send them independently and I'll add them to the tree as urgent.


I'm soon going to send my set of fixes that will also include the task slot
patch (not removing to ease my life with conflicts).

Thanks,
Gabriele

>  include/rv/da_monitor.h | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 00ded3d5ab3f..d04bb3229c75 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -304,6 +304,20 @@ static int da_monitor_init(void)
>  
>  /*
>   * da_monitor_destroy - return the allocated slot
> + *
> + * Call tracepoint_synchronize_unregister() before reset_all() to close
> + * the race where an in-flight non-HA probe handler sets monitoring=1
> + * (without calling timer_setup()) after da_monitor_reset_all() has
> + * already cleared the slot but before the caller's own sync completes.
> + * Without this barrier, an HA_TIMER_WHEEL monitor that later acquires
> + * the same slot would call timer_delete() on a never-initialised
> + * timer_list, triggering ODEBUG warnings.
> + *
> + * Note: tracepoint_synchronize_unregister() is a system-wide barrier
> + * that waits for all CPUs to finish any in-flight tracepoint handlers.
> + * The caller's own __rv_disable_monitor() issues a second sync after
> + * returning from disable(); that redundant call is harmless on the
> + * infrequent admin (enable/disable) path.
>   */
>  static inline void da_monitor_destroy(void)
>  {
> @@ -311,10 +325,10 @@ static inline void da_monitor_destroy(void)
>  		WARN_ONCE(1, "Disabling a disabled monitor: "
> __stringify(MONITOR_NAME));
>  		return;
>  	}
> +	tracepoint_synchronize_unregister();
> +	da_monitor_reset_all();
>  	rv_put_task_monitor_slot(task_mon_slot);
>  	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
> -
> -	da_monitor_reset_all();
>  }
>  
>  #elif RV_MON_TYPE == RV_MON_PER_OBJ


^ permalink raw reply

* Re: [PATCH v6 3/4] mm/memory-failure: add panic option for unrecoverable pages
From: David Hildenbrand (Arm) @ 2026-05-12  8:22 UTC (permalink / raw)
  To: Breno Leitao, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Jonathan Corbet, Shuah Khan, Lorenzo Stoakes, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260511-ecc_panic-v6-3-183012ba7d4b@debian.org>


> @@ -1281,6 +1292,18 @@ static void update_per_node_mf_stats(unsigned long pfn,
>  	++mf_stats->total;
>  }
>  
> +static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
> +				      enum mf_result result)
> +{
> +	if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
> +		return false;
> +
> +	if (type == MF_MSG_KERNEL)
> +		return true;
> +
> +	return false;

return type == MF_MSG_KERNEL;

might be simpler.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v6 2/4] mm/memory-failure: classify get_any_page() failures by reason
From: David Hildenbrand (Arm) @ 2026-05-12  8:21 UTC (permalink / raw)
  To: Breno Leitao, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Jonathan Corbet, Shuah Khan, Lorenzo Stoakes, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260511-ecc_panic-v6-2-183012ba7d4b@debian.org>


>  		}
>  		goto unlock_mutex;
>  	} else if (res < 0) {
> -		if (is_reserved)
> +		/*
> +		 * Promote a stable unhandlable kernel page diagnosed by
> +		 * get_hwpoison_page() to MF_MSG_KERNEL alongside reserved
> +		 * pages; transient lifecycle races stay as MF_MSG_GET_HWPOISON.
> +		 */
> +		if (is_reserved || gp_status == MF_GET_PAGE_UNHANDLABLE)
>  			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);


It's all a bit of a mess. get_hwpoison_page() should just indicate that a page
is unhandable if it is PG_reserved?

Why can't we just return a special error code from  get_hwpoison_page()? We ahve
plenty of errno values to chose from.

-- 
Cheers,

David

^ permalink raw reply

* Re: [bug report] bootconfig: init: Allow admin to use bootconfig for kernel command line
From: Dan Carpenter @ 2026-05-12  8:21 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: kernel-janitors, Linux Trace Kernel, linux-kernel, Breno Leitao
In-Reply-To: <20260512091638.8b95253ab022d7dabf473465@kernel.org>

On Tue, May 12, 2026 at 09:16:38AM +0900, Masami Hiramatsu wrote:
> Hi Dan,
> 
> Thanks for reporting. A similar problem is pointed by Sashiko [1].
> 
> [1] https://sashiko.dev/#/patchset/20260508-bootconfig_using_tools-v1-0-1132219aa773%40debian.org
> 
> On Fri, 8 May 2026 20:07:25 +0300
> Dan Carpenter <error27@gmail.com> wrote:
> 
> > Hello Masami Hiramatsu,
> > 
> > Commit 51887d03aca1 ("bootconfig: init: Allow admin to use bootconfig
> > for kernel command line") from Jan 11, 2020 (linux-next), leads to
> > the following Smatch static checker warning:
> > 
> > 	init/main.c:368 xbc_snprint_cmdline()
> > 	use scnprintf() instead of snprintf()
> > 
> > init/main.c
> >     331 static int __init xbc_snprint_cmdline(char *buf, size_t size,
> >     332                                       struct xbc_node *root)
> >     333 {
> >     334         struct xbc_node *knode, *vnode;
> >     335         char *end = buf + size;
> >     336         const char *val, *q;
> >     337         int ret;
> >     338 
> >     339         xbc_node_for_each_key_value(root, knode, val) {
> >     340                 ret = xbc_node_compose_key_after(root, knode,
> >     341                                         xbc_namebuf, XBC_KEYLEN_MAX);
> >     342                 if (ret < 0)
> >     343                         return ret;
> >     344 
> >     345                 vnode = xbc_node_get_child(knode);
> >     346                 if (!vnode) {
> >     347                         ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
> >     348                         if (ret < 0)
> >     349                                 return ret;
> >     350                         buf += ret;
> > 
> > In user space snprintf() can return negative, but in the kernel, no.
> > It returns the number of bytes (not counting the NUL terminator) which
> > would have been copied if there were enough space.  So maybe you want
> > to do something like:
> > 
> > 	remain = rest(buf, end);
> > 	ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
> > 	if (ret >= remain)
> > 		return -ENOSPC;
> 
> Actually, we need to query the length of required buffer size if buf == NULL
> or the buffer size is not enough.
> 
> But as Sashiko pointed, I need to check it with UBSAN. (but I think,
> even if @buf is NULL, the @buf is char *, thus it is safe to add some
> value...)
> 

Sashiko says that pointer math on a NULL is undefined but we do it all
the time in the kernel...  When you are a the 800 pound gorilla, you can
ask compilers to implement features the way you want them to be.  :P

regards,
dan carpenter


^ permalink raw reply

* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: David Hildenbrand (Arm) @ 2026-05-12  8:17 UTC (permalink / raw)
  To: Breno Leitao, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Jonathan Corbet, Shuah Khan, Lorenzo Stoakes, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Liam R. Howlett
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260511-ecc_panic-v6-1-183012ba7d4b@debian.org>

On 5/11/26 17:38, Breno Leitao wrote:
> When get_hwpoison_page() returns a negative value, distinguish
> reserved pages from other failure cases by reporting MF_MSG_KERNEL
> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
> and should be classified accordingly for proper handling.
> 
> Sample PG_reserved before the get_hwpoison_page() call. In the
> MF_COUNT_INCREASED path get_any_page() can drop the caller's
> reference before returning -EIO, after which the underlying page may
> have been freed and reallocated with page->flags reset; reading
> PageReserved(p) at that point would observe stale or unrelated state.
> The pre-call snapshot reflects what the page actually was at the
> time of the failure event.
> 
> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  mm/memory-failure.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 866c4428ac7ef..f112fb27a8ff6 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
>  	unsigned long page_flags;
>  	bool retry = true;
>  	int hugetlb = 0;
> +	bool is_reserved;
>  
>  	if (!sysctl_memory_failure_recovery)
>  		panic("Memory failure on page %lx", pfn);
> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
>  	 * In fact it's dangerous to directly bump up page count from 0,
>  	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>  	 */
> +	/*
> +	 * Pages with PG_reserved set are not currently managed by the
> +	 * page allocator (memblock-reserved memory, driver reservations,
> +	 * etc.), so classify them as kernel-owned for reporting.
> +	 *
> +	 * Sample the flag before get_hwpoison_page(): in the
> +	 * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
> +	 * reference before returning -EIO, after which page->flags may
> +	 * have been reset by the allocator.
> +	 */
> +	is_reserved = PageReserved(p);
> +
>  	res = get_hwpoison_page(p, flags);
>  	if (!res) {
>  		if (is_free_buddy_page(p)) {
> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
>  		}
>  		goto unlock_mutex;
>  	} else if (res < 0) {
> -		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
> +		if (is_reserved)
> +			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> +		else
> +			res = action_result(pfn, MF_MSG_GET_HWPOISON,
> +					    MF_IGNORED);
>  		goto unlock_mutex;
>  	}
>  
> 

It's a bit odd that we need this handling when we already have handling for
reserved pages in error_states[].

HWPoisonHandlable() would always essentially reject PG_reserved pages. So
__get_hwpoison_page() ... would always fail? Making
get_hwpoison_page()->get_any_page() always fail?

But then, we never call identify_page_state()? And never call me_kernel()?

This all looks very odd.

Why would you even want to call get_hwpoison_page() in the first place if you
find PageReserved?

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v17 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped
From: David Hildenbrand (Arm) @ 2026-05-12  7:42 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260511185817.686831-6-npache@redhat.com>

On 5/11/26 20:58, Nico Pache wrote:
> Currently the collapse_huge_page function requires the mmap_read_lock to
> enter with it held, and exit with it dropped. This function moves the
> unlock into its parent caller, and changes this semantic to requiring it
> to enter/exit with it always unlocked.
> 
> In future patches, we need this expectation, as for in mTHP collapse, we
> may have already have dropped the lock, and do not want to conditionally
> check for this by passing through the lock_dropped variable.
> 
> No functional change is expected as one of the first things the
> collapse_huge_page function does is drop this lock before allocating the
> hugepage.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 27465161fa6d..37a5f6791816 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1199,6 +1199,14 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
>  	return SCAN_SUCCEED;
>  }
>  
> +/*
> + * collapse_huge_page expects the mmap_read_lock to be dropped before
> + * entering this function. 

"before entering." Talking about "this function" after naming it sounds odd.

Also, there is only an "mmap_lock".

> The function will also always return with the lock
> + * dropped. 

"collapse_huge_page expects the mmap_lock to be unlocked before entering and
will always return with the lock unlocked."

Or something simple like that?

> The function starts by allocation a folio, which can potentially
> + * take a long time if it involves sync compaction, and we do not need to hold
> + * the mmap_lock during that. We must recheck the vma after taking it again in
> + * write mode.
> + */

"... to avoid holding the mmap_lock while allocating a THP, as that could
trigger direct reclaim/compaction. Note that the VMA must be rechecked after
grabbing the mmap_lock again."

?

Ending up with something like

"collapse_huge_page expects the mmap_lock to be unlocked before entering and
will always return with the lock unlocked, to avoid holding the mmap_lock while
allocating a THP, as that could trigger direct reclaim/compaction.  Note that
the VMA must be rechecked after grabbing the mmap_lock again."

>  static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		int referenced, int unmapped, struct collapse_control *cc)
>  {
> @@ -1214,14 +1222,6 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
> -	/*
> -	 * Before allocating the hugepage, release the mmap_lock read lock.
> -	 * The allocation can take potentially a long time if it involves
> -	 * sync compaction, and we do not need to hold the mmap_lock during
> -	 * that. We will recheck the vma after taking it again in write mode.
> -	 */
> -	mmap_read_unlock(mm);
> -
>  	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
> @@ -1526,6 +1526,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
> +		/* collapse_huge_page expects the lock to be dropped before calling */
> +		mmap_read_unlock(mm);
>  		result = collapse_huge_page(mm, start_addr, referenced,
>  					    unmapped, cc);
>  		/* collapse_huge_page will return with the mmap_lock released */


Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Lance Yang @ 2026-05-12  7:42 UTC (permalink / raw)
  To: npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260511185817.686831-5-npache@redhat.com>


On Mon, May 11, 2026 at 12:58:04PM -0600, Nico Pache wrote:
>generalize the order of the __collapse_huge_page_* and collapse_max_*
>functions to support future mTHP collapse.
>
>The current mechanism for determining collapse with the
>khugepaged_max_ptes_none value is not designed with mTHP in mind. This
>raises a key design issue: if we support user defined max_pte_none values
>(even those scaled by order), a collapse of a lower order can introduces
>an feedback loop, or "creep", when max_ptes_none is set to a value greater
>than HPAGE_PMD_NR / 2. [1]
>
>With this configuration, a successful collapse to order N will populate
>enough pages to satisfy the collapse condition on order N+1 on the next
>scan. This leads to unnecessary work and memory churn.
>
>To fix this issue introduce a helper function that will limit mTHP
>collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
>This effectively supports two modes: [2]
>
>- max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
>  that maps the shared zeropage. Consequently, no memory bloat.
>- max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>  available mTHP order.
>
>This removes the possiblilty of "creep", while not modifying any uAPI
>expectations. A warning will be emitted if any non-supported
>max_ptes_none value is configured with mTHP enabled.
>
>mTHP collapse will not honor the khugepaged_max_ptes_shared or
>khugepaged_max_ptes_swap parameters, and will fail if it encounters a
>shared or swapped entry.
>
>No functional changes in this patch; however it defines future behavior
>for mTHP collapse.
>
>[1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
>[2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
>
>Co-developed-by: Dev Jain <dev.jain@arm.com>
>Signed-off-by: Dev Jain <dev.jain@arm.com>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> include/trace/events/huge_memory.h |   3 +-
> mm/khugepaged.c                    | 117 ++++++++++++++++++++---------
> 2 files changed, 85 insertions(+), 35 deletions(-)
>
>diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
>index bcdc57eea270..443e0bd13fdb 100644
>--- a/include/trace/events/huge_memory.h
>+++ b/include/trace/events/huge_memory.h
>@@ -39,7 +39,8 @@
> 	EM( SCAN_STORE_FAILED,		"store_failed")			\
> 	EM( SCAN_COPY_MC,		"copy_poisoned_page")		\
> 	EM( SCAN_PAGE_FILLED,		"page_filled")			\
>-	EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
>+	EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")	\
>+	EMe(SCAN_INVALID_PTES_NONE,	"invalid_ptes_none")
> 
> #undef EM
> #undef EMe
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index f68853b3caa7..27465161fa6d 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -61,6 +61,7 @@ enum scan_result {
> 	SCAN_COPY_MC,
> 	SCAN_PAGE_FILLED,
> 	SCAN_PAGE_DIRTY_OR_WRITEBACK,
>+	SCAN_INVALID_PTES_NONE,
> };
> 
> #define CREATE_TRACE_POINTS
>@@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
>  * PTEs for the given collapse operation.
>  * @cc: The collapse control struct
>  * @vma: The vma to check for userfaultfd
>+ * @order: The folio order being collapsed to
>  *
>  * Return: Maximum number of none-page or zero-page PTEs allowed for the
>  * collapse operation.
>  */
>-static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
>-		struct vm_area_struct *vma)
>+static int collapse_max_ptes_none(struct collapse_control *cc,
>+		struct vm_area_struct *vma, unsigned int order)
> {
>+	unsigned int max_ptes_none = khugepaged_max_ptes_none;
> 	// If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.

One thing I still want to call out: kernel code usually uses C-style
comments :)

> 	if (vma && userfaultfd_armed(vma))
> 		return 0;
> 	// for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> 	if (!cc->is_khugepaged)
> 		return HPAGE_PMD_NR;
>-	// For all other cases repect the user defined maximum.
>-	return khugepaged_max_ptes_none;
>+	// for PMD collapse, respect the user defined maximum.
>+	if (is_pmd_order(order))
>+		return max_ptes_none;
>+	/* Zero/non-present collapse disabled. */
>+	if (!max_ptes_none)
>+		return 0;
>+	// for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
>+	// scale the maximum number of PTEs to the order of the collapse.
>+	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
>+		return (1 << order) - 1;
>+
>+	// We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
>+	// Emit a warning and return -EINVAL.
>+	pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
>+		      KHUGEPAGED_MAX_PTES_LIMIT);

Maybe fallback to 0 instead, as David suggested earlier?

max_ptes_none is mostly legacy PMD THP behavior. mTHP is new, and any
intermediate value in (0, KHUGEPAGED_MAX_PTES_LIMIT) would implicitly
disable it :(

Treating those values as 0 feels like the least surprising behavior,
IMHO. It also gives mTHP a cleaner staring point, rather than carry over
all the old PMD knob semantics :)

Otherwise, LGTM!
Reviewed-by: Lance Yang <lance.yang@linux.dev>

>+	return -EINVAL;
> }
> 
> /**
>  * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
>  * anonymous pages for the given collapse operation.
>  * @cc: The collapse control struct
>+ * @order: The folio order being collapsed to
>  *
>  * Return: Maximum number of PTEs that map shared anonymous pages for the
>  * collapse operation
>  */
>-static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>+static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
>+		unsigned int order)
> {
> 	// for MADV_COLLAPSE, do not restrict the number of PTEs that map shared
> 	// anonymous pages.
> 	if (!cc->is_khugepaged)
> 		return HPAGE_PMD_NR;
>+	// for mTHP collapse do not allow collapsing anonymous memory pages that
>+	// are shared between processes.
>+	if (!is_pmd_order(order))
>+		return 0;
>+	// for PMD collapse, respect the user defined maximum.
> 	return khugepaged_max_ptes_shared;
> }
> 
>@@ -391,16 +415,22 @@ static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>  * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
>  * maximum allowed non-present pagecache entries for the given collapse operation.
>  * @cc: The collapse control struct
>+ * @order: The folio order being collapsed to
>  *
>  * Return: Maximum number of non-present PTEs or the maximum allowed non-present
>  * pagecache entries for the collapse operation.
>  */
>-static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
>+static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
>+		unsigned int order)
> {
> 	// for MADV_COLLAPSE, do not restrict the number PTEs entries or
> 	// pagecache entries that are non-present.
> 	if (!cc->is_khugepaged)
> 		return HPAGE_PMD_NR;
>+	// for mTHP collapse do not allow any non-present PTEs or pagecache entries.
>+	if (!is_pmd_order(order))
>+		return 0;
>+	// for PMD collapse, respect the user defined maximum.
> 	return khugepaged_max_ptes_swap;
> }
> 
>@@ -594,18 +624,22 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
> 
> static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> 		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
>-		struct list_head *compound_pagelist)
>+		unsigned int order, struct list_head *compound_pagelist)
> {
>+	const unsigned long nr_pages = 1UL << order;
> 	struct page *page = NULL;
> 	struct folio *folio = NULL;
> 	unsigned long addr = start_addr;
> 	pte_t *_pte;
> 	int none_or_zero = 0, shared = 0, referenced = 0;
> 	enum scan_result result = SCAN_FAIL;
>-	unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>-	unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>+	int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
>+	unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
>+
>+	if (max_ptes_none < 0)
>+		return SCAN_INVALID_PTES_NONE;
> 
>-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>+	for (_pte = pte; _pte < pte + nr_pages;
> 	     _pte++, addr += PAGE_SIZE) {
> 		pte_t pteval = ptep_get(_pte);
> 		if (pte_none_or_zero(pteval)) {
>@@ -738,18 +772,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> }
> 
> static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>-						struct vm_area_struct *vma,
>-						unsigned long address,
>-						spinlock_t *ptl,
>-						struct list_head *compound_pagelist)
>+		struct vm_area_struct *vma, unsigned long address,
>+		spinlock_t *ptl, unsigned int order,
>+		struct list_head *compound_pagelist)
> {
>-	unsigned long end = address + HPAGE_PMD_SIZE;
>+	const unsigned long nr_pages = 1UL << order;
>+	unsigned long end = address + (PAGE_SIZE << order);
> 	struct folio *src, *tmp;
> 	pte_t pteval;
> 	pte_t *_pte;
> 	unsigned int nr_ptes;
> 
>-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
>+	for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
> 	     address += nr_ptes * PAGE_SIZE) {
> 		nr_ptes = 1;
> 		pteval = ptep_get(_pte);
>@@ -802,11 +836,10 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> }
> 
> static void __collapse_huge_page_copy_failed(pte_t *pte,
>-					     pmd_t *pmd,
>-					     pmd_t orig_pmd,
>-					     struct vm_area_struct *vma,
>-					     struct list_head *compound_pagelist)
>+		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>+		unsigned int order, struct list_head *compound_pagelist)
> {
>+	const unsigned long nr_pages = 1UL << order;
> 	spinlock_t *pmd_ptl;
> 
> 	/*
>@@ -822,7 +855,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> 	 * Release both raw and compound pages isolated
> 	 * in __collapse_huge_page_isolate.
> 	 */
>-	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
>+	release_pte_pages(pte, pte + nr_pages, compound_pagelist);
> }
> 
> /*
>@@ -842,16 +875,17 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>  */
> static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> 		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>-		unsigned long address, spinlock_t *ptl,
>+		unsigned long address, spinlock_t *ptl, unsigned int order,
> 		struct list_head *compound_pagelist)
> {
>+	const unsigned long nr_pages = 1UL << order;
> 	unsigned int i;
> 	enum scan_result result = SCAN_SUCCEED;
> 
> 	/*
> 	 * Copying pages' contents is subject to memory poison at any iteration.
> 	 */
>-	for (i = 0; i < HPAGE_PMD_NR; i++) {
>+	for (i = 0; i < nr_pages; i++) {
> 		pte_t pteval = ptep_get(pte + i);
> 		struct page *page = folio_page(folio, i);
> 		unsigned long src_addr = address + i * PAGE_SIZE;
>@@ -870,10 +904,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
> 
> 	if (likely(result == SCAN_SUCCEED))
> 		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
>-						    compound_pagelist);
>+						    order, compound_pagelist);
> 	else
> 		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
>-						 compound_pagelist);
>+						 order, compound_pagelist);
> 
> 	return result;
> }
>@@ -1044,12 +1078,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
>  * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
>  */
> static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>-		struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
>-		int referenced)
>+		struct vm_area_struct *vma, unsigned long start_addr,
>+		pmd_t *pmd, int referenced, unsigned int order)
> {
> 	int swapped_in = 0;
> 	vm_fault_t ret = 0;
>-	unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
>+	unsigned long addr, end = start_addr + (PAGE_SIZE << order);
> 	enum scan_result result;
> 	pte_t *pte = NULL;
> 	spinlock_t *ptl;
>@@ -1081,6 +1115,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> 		    pte_present(vmf.orig_pte))
> 			continue;
> 
>+		/*
>+		 * TODO: Support swapin without leading to further mTHP
>+		 * collapses. Currently bringing in new pages via swapin may
>+		 * cause a future higher order collapse on a rescan of the same
>+		 * range.
>+		 */
>+		if (!is_pmd_order(order)) {
>+			pte_unmap(pte);
>+			mmap_read_unlock(mm);
>+			result = SCAN_EXCEED_SWAP_PTE;
>+			goto out;
>+		}
>+
> 		vmf.pte = pte;
> 		vmf.ptl = ptl;
> 		ret = do_swap_page(&vmf);
>@@ -1200,7 +1247,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 		 * that case.  Continuing to collapse causes inconsistency.
> 		 */
> 		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>-						     referenced);
>+						     referenced, HPAGE_PMD_ORDER);
> 		if (result != SCAN_SUCCEED)
> 			goto out_nolock;
> 	}
>@@ -1248,6 +1295,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> 	if (pte) {
> 		result = __collapse_huge_page_isolate(vma, address, pte, cc,
>+						      HPAGE_PMD_ORDER,
> 						      &compound_pagelist);
> 		spin_unlock(pte_ptl);
> 	} else {
>@@ -1278,6 +1326,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> 
> 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> 					   vma, address, pte_ptl,
>+					   HPAGE_PMD_ORDER,
> 					   &compound_pagelist);
> 	pte_unmap(pte);
> 	if (unlikely(result != SCAN_SUCCEED))
>@@ -1313,9 +1362,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> 		struct vm_area_struct *vma, unsigned long start_addr,
> 		bool *lock_dropped, struct collapse_control *cc)
> {
>-	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>-	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>-	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>+	const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>+	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
>+	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> 	pmd_t *pmd;
> 	pte_t *pte, *_pte;
> 	int none_or_zero = 0, shared = 0, referenced = 0;
>@@ -2369,8 +2418,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
> 		unsigned long addr, struct file *file, pgoff_t start,
> 		struct collapse_control *cc)
> {
>-	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
>-	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>+	const int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
>+	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> 	struct folio *folio = NULL;
> 	struct address_space *mapping = file->f_mapping;
> 	XA_STATE(xas, &mapping->i_pages, start);
>-- 
>2.54.0
>
>

^ permalink raw reply

* Re: [PATCH mm-unstable v17 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: David Hildenbrand (Arm) @ 2026-05-12  7:29 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
  Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, Usama Arif
In-Reply-To: <20260511185817.686831-4-npache@redhat.com>

On 5/11/26 20:58, Nico Pache wrote:
> The following cleanup reworks all the max_ptes_* handling into helper
> functions. This increases the code readability and will later be used to
> implement the mTHP handling of these variables.
> 
> With these changes we abstract all the madvise_collapse() special casing
> (dont respect the sysctls) away from the functions that utilize them. And
> will be used later in this series to cleanly restrict the mTHP collapse
> behavior.
> 
> No functional change is intended; however, we are now only reading the
> sysfs variables once per scan, whereas before these variables were being
> read on each loop iteration.
> 
> Suggested-by: David Hildenbrand <david@kernel.org>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>

Some nits when re-reading:

> Acked-by: Usama Arif <usama.arif@linux.dev>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 118 +++++++++++++++++++++++++++++++++---------------
>  1 file changed, 82 insertions(+), 36 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index f0e29d5c7b1f..f68853b3caa7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -348,6 +348,62 @@ static bool pte_none_or_zero(pte_t pte)
>  	return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
>  }
>  
> +/**
> + * collapse_max_ptes_none - Calculate maximum allowed none-page or zero-page

I know, it's painful, but ...

There is no "none-page".

Calculate maximum allowed empty PTEs or PTEs mapping the shared zeropage ... ?

> + * PTEs for the given collapse operation.

We usually indent here (second line of subject), I think. Same applies to the
other doc below.

> + * @cc: The collapse control struct
> + * @vma: The vma to check for userfaultfd
> + *
> + * Return: Maximum number of none-page or zero-page PTEs allowed for the
> + * collapse operation.

Same here.

> + */
> +static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> +		struct vm_area_struct *vma)
> +{
> +	// If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.

Lance commented on the comment style.

Is this comment really required? It's pretty self-documenting already.

> +	if (vma && userfaultfd_armed(vma))
> +		return 0;
> +	// for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> +	if (!cc->is_khugepaged)
> +		return HPAGE_PMD_NR;
> +	// For all other cases repect the user defined maximum.
> +	return khugepaged_max_ptes_none;
> +}
> +
-- 
Cheers,

David

^ permalink raw reply

* [PATCH v2 2/2] spi: qcom-geni: Add trace events for Qualcomm GENI SPI driver
From: Praveen Talari @ 2026-05-12  6:12 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Mark Brown
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
	mukesh.savaliya, aniket.randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-spi-v2-0-3b184068ecf9@oss.qualcomm.com>

Add tracepoints to the Qualcomm GENI (Generic Interface) SPI driver.
These trace events enable runtime debugging and performance analysis
of SPI operations.

The trace events capture SPI clock configuration, FIFO parameters,
transfer details, interrupt status.

Usage examples:

Enable all SPI traces:
  echo 1 > /sys/kernel/tracing/events/spi/enable
  echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_spi/enable
  cat /sys/kernel/debug/tracing/trace_pipe

Example trace output:

1003.956560: spi_message_submit: spi16.0 000000001b20b93c
1003.956642: spi_controller_busy: spi16
1003.956643: spi_message_start: spi16.0 000000001b20b93c
1003.956646: geni_spi_fifo_params: 888000.spi: cs=0 mode=0x00000020
     mode_changed=0x00000007 cs_changed=0
1003.956647: spi_set_cs: spi16.0 activate
1003.956648: spi_transfer_start: spi16.0 00000000ea1cf8b6 len=16
     tx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
rx=[00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00]
1003.956653: geni_spi_clk_cfg: 888000.spi: req_hz=20000000
     sclk_hz=100000000 clk_idx=5 clk_div=5 bpw=8
1003.956691: geni_spi_transfer: 888000.spi: len=16 m_cmd=0x00000003
1003.956708: geni_spi_irq: 888000.spi: m_irq=0x08000081
     dma_tx=0x00000000 dma_rx=0x00000000
1003.956717: spi_transfer_stop: spi16.0 00000000ea1cf8b6 len=16
     tx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
rx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
1003.956717: spi_set_cs: spi16.0 deactivate
1003.956718: spi_message_done: spi16.0 000000001b20b93c len=16/16

Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
v1->v2:
- Removed tx/rx data capture since spi core had already support.
- Updated commit text.
---
 drivers/spi/spi-geni-qcom.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index d5fb0edc8e0c..164c6c0b9544 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -1,6 +1,9 @@
 // SPDX-License-Identifier: GPL-2.0
 // Copyright (c) 2017-2018, The Linux foundation. All rights reserved.
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/qcom_geni_spi.h>
+
 #include <linux/clk.h>
 #include <linux/dmaengine.h>
 #include <linux/dma-mapping.h>
@@ -332,6 +335,9 @@ static int geni_spi_set_clock_and_bw(struct spi_geni_master *mas,
 	writel(clk_sel, se->base + SE_GENI_CLK_SEL);
 	writel(m_clk_cfg, se->base + GENI_SER_M_CLK_CFG);
 
+	trace_geni_spi_clk_cfg(mas->dev, clk_hz, mas->cur_sclk_hz, idx, div,
+			       mas->cur_bits_per_word);
+
 	/* Set BW quota for CPU as driver supports FIFO mode only. */
 	se->icc_paths[CPU_TO_GENI].avg_bw = Bps_to_icc(mas->cur_speed_hz);
 	ret = geni_icc_set_bw(se);
@@ -366,6 +372,9 @@ static int setup_fifo_params(struct spi_device *spi_slv,
 	if ((mode_changed & SPI_CS_HIGH) || (cs_changed && (spi_slv->mode & SPI_CS_HIGH)))
 		writel((spi_slv->mode & SPI_CS_HIGH) ? BIT(chipselect) : 0, se->base + SE_SPI_DEMUX_OUTPUT_INV);
 
+	trace_geni_spi_fifo_params(mas->dev, chipselect, spi_slv->mode,
+				   mode_changed, cs_changed);
+
 	return 0;
 }
 
@@ -861,6 +870,8 @@ static int setup_se_xfer(struct spi_transfer *xfer,
 	spin_lock_irq(&mas->lock);
 	geni_se_setup_m_cmd(se, m_cmd, m_params);
 
+	trace_geni_spi_transfer(mas->dev, len, m_cmd);
+
 	if (mas->cur_xfer_mode == GENI_SE_DMA) {
 		if (m_cmd & SPI_RX_ONLY)
 			geni_se_rx_init_dma(se, sg_dma_address(xfer->rx_sg.sgl),
@@ -915,6 +926,8 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
 	if (!m_irq && !dma_tx_status && !dma_rx_status)
 		return IRQ_NONE;
 
+	trace_geni_spi_irq(mas->dev, m_irq, dma_tx_status, dma_rx_status);
+
 	if (m_irq & (M_CMD_OVERRUN_EN | M_ILLEGAL_CMD_EN | M_CMD_FAILURE_EN |
 		     M_RX_FIFO_RD_ERR_EN | M_RX_FIFO_WR_ERR_EN |
 		     M_TX_FIFO_RD_ERR_EN | M_TX_FIFO_WR_ERR_EN))

-- 
2.34.1


^ permalink raw reply related

* [PATCH v2 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Praveen Talari @ 2026-05-12  6:12 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Mark Brown
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
	mukesh.savaliya, aniket.randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-spi-v2-0-3b184068ecf9@oss.qualcomm.com>

Add tracepoint support to the Qualcomm GENI SPI driver to provide
runtime visibility into driver behavior without requiring invasive debug
patches.

The trace events cover clock and FIFO parameter configuration,
transfer metadata, interrupt status to be making it easier to diagnose
communication issues in the field..

Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
v1->v2:
- Removed TX/RX data tracepoints.
- Updated commit text.
---
 include/trace/events/qcom_geni_spi.h | 103 +++++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/include/trace/events/qcom_geni_spi.h b/include/trace/events/qcom_geni_spi.h
new file mode 100644
index 000000000000..5f39dab47e4e
--- /dev/null
+++ b/include/trace/events/qcom_geni_spi.h
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM qcom_geni_spi
+
+#if !defined(_TRACE_QCOM_GENI_SPI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_QCOM_GENI_SPI_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(geni_spi_fifo_params,
+	    TP_PROTO(struct device *dev, u8 cs, u32 mode,
+		     u32 mode_changed, bool cs_changed),
+	    TP_ARGS(dev, cs, mode, mode_changed, cs_changed),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(u8, cs)
+			     __field(u32, mode)
+			     __field(u32, mode_changed)
+			     __field(bool, cs_changed)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->cs = cs;
+			   __entry->mode = mode;
+			   __entry->mode_changed = mode_changed;
+			   __entry->cs_changed = cs_changed;
+	    ),
+
+	    TP_printk("%s: cs=%u mode=0x%08x mode_changed=0x%08x cs_changed=%d",
+		      __get_str(name), __entry->cs, __entry->mode,
+		      __entry->mode_changed, __entry->cs_changed)
+);
+
+TRACE_EVENT(geni_spi_clk_cfg,
+	    TP_PROTO(struct device *dev, unsigned long req_hz,
+		     unsigned long sclk_hz, unsigned int clk_idx,
+		     unsigned int clk_div, unsigned int bpw),
+	    TP_ARGS(dev, req_hz, sclk_hz, clk_idx, clk_div, bpw),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned long, req_hz)
+			     __field(unsigned long, sclk_hz)
+			     __field(unsigned int, clk_idx)
+			     __field(unsigned int, clk_div)
+			     __field(unsigned int, bpw)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->req_hz = req_hz;
+			   __entry->sclk_hz = sclk_hz;
+			   __entry->clk_idx = clk_idx;
+			   __entry->clk_div = clk_div;
+			   __entry->bpw = bpw;
+	    ),
+
+	    TP_printk("%s: req_hz=%lu sclk_hz=%lu clk_idx=%u clk_div=%u bpw=%u",
+		      __get_str(name), __entry->req_hz, __entry->sclk_hz,
+		      __entry->clk_idx, __entry->clk_div, __entry->bpw)
+);
+
+TRACE_EVENT(geni_spi_transfer,
+	    TP_PROTO(struct device *dev, unsigned int len, u32 m_cmd),
+	    TP_ARGS(dev, len, m_cmd),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(unsigned int, len)
+			     __field(u32, m_cmd)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->len = len;
+			   __entry->m_cmd = m_cmd;
+	    ),
+
+	    TP_printk("%s: len=%u m_cmd=0x%08x",
+		      __get_str(name), __entry->len, __entry->m_cmd)
+);
+
+TRACE_EVENT(geni_spi_irq,
+	    TP_PROTO(struct device *dev, u32 m_irq, u32 dma_tx, u32 dma_rx),
+	    TP_ARGS(dev, m_irq, dma_tx, dma_rx),
+
+	    TP_STRUCT__entry(__string(name, dev_name(dev))
+			     __field(u32, m_irq)
+			     __field(u32, dma_tx)
+			     __field(u32, dma_rx)
+	    ),
+
+	    TP_fast_assign(__assign_str(name);
+			   __entry->m_irq = m_irq;
+			   __entry->dma_tx = dma_tx;
+			   __entry->dma_rx = dma_rx;
+	    ),
+
+	    TP_printk("%s: m_irq=0x%08x dma_tx=0x%08x dma_rx=0x%08x",
+		      __get_str(name), __entry->m_irq, __entry->dma_tx,
+		      __entry->dma_rx)
+);
+
+#endif /* _TRACE_QCOM_GENI_SPI_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

-- 
2.34.1


^ permalink raw reply related

* [PATCH v2 0/2] Add trace events for Qualcomm GENI SPI drivers
From: Praveen Talari @ 2026-05-12  6:12 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Mark Brown
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
	mukesh.savaliya, aniket.randive, chandana.chiluveru,
	jyothi.seerapu, Praveen Talari

Add tracepoints to the Qualcomm GENI (Generic Interface) SPI driver.
These trace events enable runtime debugging and performance analysis
of SPI operations.

The trace events capture SPI clock configuration, FIFO parameters,
transfer details, interrupt status.

Usage examples:

Enable all SPI traces:
  echo 1 > /sys/kernel/tracing/events/spi/enable  
  echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_spi/enable
  cat /sys/kernel/debug/tracing/trace_pipe

Example trace output:

1003.956560: spi_message_submit: spi16.0 000000001b20b93c
1003.956642: spi_controller_busy: spi16
1003.956643: spi_message_start: spi16.0 000000001b20b93c
1003.956646: geni_spi_fifo_params: 888000.spi: cs=0 mode=0x00000020
     mode_changed=0x00000007 cs_changed=0
1003.956647: spi_set_cs: spi16.0 activate
1003.956648: spi_transfer_start: spi16.0 00000000ea1cf8b6 len=16
     tx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
rx=[00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00]
1003.956653: geni_spi_clk_cfg: 888000.spi: req_hz=20000000
     sclk_hz=100000000 clk_idx=5 clk_div=5 bpw=8
1003.956691: geni_spi_transfer: 888000.spi: len=16 m_cmd=0x00000003
1003.956708: geni_spi_irq: 888000.spi: m_irq=0x08000081
     dma_tx=0x00000000 dma_rx=0x00000000
1003.956717: spi_transfer_stop: spi16.0 00000000ea1cf8b6 len=16
     tx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
rx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
1003.956717: spi_set_cs: spi16.0 deactivate
1003.956718: spi_message_done: spi16.0 000000001b20b93c len=16/16

Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
Changes in v2:
- Removed tx/rx data capture since spi core had already support.
- Updated commit text in patches and cover letter.
- Link to v1: https://lore.kernel.org/r/20260506-add-tracepoints-for-qcom-geni-spi-v1-0-c957cfe712d1@oss.qualcomm.com

---
Praveen Talari (2):
      spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
      spi: qcom-geni: Add trace events for Qualcomm GENI SPI driver

 drivers/spi/spi-geni-qcom.c          |  13 +++++
 include/trace/events/qcom_geni_spi.h | 103 +++++++++++++++++++++++++++++++++++
 2 files changed, 116 insertions(+)
---
base-commit: 1f5ffc672165ff851063a5fd044b727ab2517ae3
change-id: 20260506-add-tracepoints-for-qcom-geni-spi-e31457c2267c

Best regards,
-- 
Praveen Talari <praveen.talari@oss.qualcomm.com>


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox