Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Breno Leitao @ 2026-04-15 11:15 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
	linux-doc, oss, paulmck, rostedt, kernel-team, Kiryl Shutsemau
In-Reply-To: <adTX6h4Ej5jOcONP@gmail.com>

On Tue, Apr 07, 2026 at 03:19:09AM -0700, Breno Leitao wrote:
> On Fri, Apr 03, 2026 at 11:45:19AM +0900, Masami Hiramatsu wrote:
> > > I'm still uncertain about this approach. The goal is to identify and
> > > categorize the early parameters that are parsed prior to bootconfig
> > > initialization.
> >
> > Yes, if we support early parameters in bootconfig, we need to clarify
> > which parameters are inherently unsupportable, and document it.
> > Currently it is easy to say that it does not support the parameter
> > defined with "early_param()". Similary, maybe we should introduce
> > "arch_param()" or something like it (or support all of them).
> >
> > >
> > > Moreover, this work could become obsolete if bootconfig's initialization
> > > point shifts earlier or later in the boot sequence, necessitating
> > > another comprehensive analysis.
> >
> > If we can init it before calling setup_arch(), yes, we don't need to
> > check it. So that is another option. Do you think it is feasible to
> > support all of them? (Of course, theologically we can do, but the
> > question is the use case and requirements.)
>
> I don't believe all early parameters can be supported by bootconfig.
> Some are inherently incompatible as far as I understand, while others
> depend on bootconfig's initialization point in the boot sequence.

I've developed a patch series that relocates bootconfig initialization
to occur before setup_arch().

Adopting this approach would streamline the categorization considerably,
as only a small subset of kernel parameters are parsed before
setup_arch() is called.

This enables a clearer distinction: parameters processed *before*
setup_arch() versus those handled afterward, rather than classifying
based on what occurs before bootconfig initialization.

Just to close the look and link both discussion together, the proposed
patch series is available at:

https://lore.kernel.org/all/20260415-bootconfig_earlier-v1-0-cf160175de5e@debian.org/

^ permalink raw reply

* [PATCH 3/3] init: move embedded bootconfig parsing before setup_arch()
From: Breno Leitao @ 2026-04-15 10:51 UTC (permalink / raw)
  To: Andrew Morton, Masami Hiramatsu
  Cc: oss, paulmck, linux-trace-kernel, linux-kernel,
	linux-trace-kernel, kernel-team, Breno Leitao
In-Reply-To: <20260415-bootconfig_earlier-v1-0-cf160175de5e@debian.org>

Split setup_boot_config() into setup_boot_config_from_embedded() for
embedded bootconfig and setup_boot_config_from_initrd() for initrd
bootconfig.

The embedded bootconfig data lives in .init.rodata (compiled into the
kernel via bootconfig-data.S), so it requires no memory allocation to
access. Combined with the previous patches that replaced memblock
allocations with static buffers, this allows embedded bootconfig to
be parsed before setup_arch(), making bootconfig parameters available
during architecture-specific setup.

The initrd bootconfig path remains at its original position since it
needs initrd_start/initrd_end set up by setup_arch().

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 init/main.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 55 insertions(+), 12 deletions(-)

diff --git a/init/main.c b/init/main.c
index b9feca55e01f9..20fded9bfbdd0 100644
--- a/init/main.c
+++ b/init/main.c
@@ -416,20 +416,17 @@ static int __init warn_bootconfig(char *str)
 	return 0;
 }
 
-static void __init setup_boot_config(void)
+/*
+ * Parse bootconfig data and extract kernel/init command line parameters.
+ * Shared by both the early embedded path and the late initrd path.
+ */
+static void __init __boot_config_load(const char *data, size_t size)
 {
 	static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;
-	const char *msg, *data;
+	const char *msg;
 	int pos, ret;
-	size_t size;
 	char *err;
 
-	/* Cut out the bootconfig data even if we have no bootconfig option */
-	data = get_boot_config_from_initrd(&size);
-	/* If there is no bootconfig in initrd, try embedded one. */
-	if (!data)
-		data = xbc_get_embedded_bootconfig(&size);
-
 	strscpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
 	err = parse_args("bootconfig", tmp_cmdline, NULL, 0, 0, 0, NULL,
 			 bootconfig_params);
@@ -471,7 +468,50 @@ static void __init setup_boot_config(void)
 		/* Also, "init." keys are init arguments */
 		extra_init_args = xbc_make_cmdline("init", extra_initargs_buf);
 	}
-	return;
+}
+
+/*
+ * Load embedded bootconfig before setup_arch(). This runs before memblock
+ * is available, relying on static buffers in the bootconfig parser.
+ * The embedded data lives in .init.rodata so no allocation is needed
+ * to access it.
+ */
+static void __init setup_boot_config_from_embedded(void)
+{
+	const char *data;
+	size_t size;
+
+	data = xbc_get_embedded_bootconfig(&size);
+	if (!data)
+		return;
+
+	pr_info("Load embedded bootconfig early\n");
+	__boot_config_load(data, size);
+}
+
+/*
+ * Load bootconfig from initrd. This MUST run after setup_arch() when initrd
+ * boundaries are known. If initrd bootconfig exists, it overrides any embedded
+ * bootconfig that was loaded early, preserving the original priority (initrd
+ * > embedded).
+ */
+static void __init setup_boot_config_from_initrd(void)
+{
+	const char *data;
+	size_t size;
+
+	/* Cut out the bootconfig data even if we have no bootconfig option */
+	data = get_boot_config_from_initrd(&size);
+
+	/* No initrd bootconfig — keep embedded if already loaded */
+	if (!data)
+		return;
+
+	/* Initrd overrides embedded — tear down and re-parse */
+	if (xbc_get_info(NULL, NULL) == 0)
+		_xbc_exit(true);
+
+	__boot_config_load(data, size);
 }
 
 static void __init exit_boot_config(void)
@@ -481,7 +521,9 @@ static void __init exit_boot_config(void)
 
 #else	/* !CONFIG_BOOT_CONFIG */
 
-static void __init setup_boot_config(void)
+static void __init setup_boot_config_from_embedded(void) { }
+
+static void __init setup_boot_config_from_initrd(void)
 {
 	/* Remove bootconfig data from initrd */
 	get_boot_config_from_initrd(NULL);
@@ -1036,13 +1078,14 @@ void start_kernel(void)
 	boot_cpu_init();
 	page_address_init();
 	pr_notice("%s", linux_banner);
+	setup_boot_config_from_embedded();
 	setup_arch(&command_line);
 	mm_core_init_early();
 	/* Static keys and static calls are needed by LSMs */
 	jump_label_init();
 	static_call_init();
 	early_security_init();
-	setup_boot_config();
+	setup_boot_config_from_initrd();
 	setup_command_line(command_line);
 	setup_nr_cpu_ids();
 	setup_per_cpu_areas();

-- 
2.52.0


^ permalink raw reply related

* [PATCH 2/3] init: use static buffers for bootconfig extra command line
From: Breno Leitao @ 2026-04-15 10:51 UTC (permalink / raw)
  To: Andrew Morton, Masami Hiramatsu
  Cc: oss, paulmck, linux-trace-kernel, linux-kernel,
	linux-trace-kernel, kernel-team, Breno Leitao
In-Reply-To: <20260415-bootconfig_earlier-v1-0-cf160175de5e@debian.org>

Replace memblock_alloc/memblock_free in xbc_make_cmdline() with
static __initdata buffers. This removes the last memblock dependency
from the bootconfig loading path, completing the prerequisite for
running bootconfig parsing before memblock is available.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 init/main.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/init/main.c b/init/main.c
index 96f93bb06c490..b9feca55e01f9 100644
--- a/init/main.c
+++ b/init/main.c
@@ -369,11 +369,18 @@ static int __init xbc_snprint_cmdline(char *buf, size_t size,
 }
 #undef rest
 
+/*
+ * Static buffers for bootconfig-generated command line parameters.
+ * Two separate buffers are needed because both "kernel" and "init"
+ * parameters are stored simultaneously.
+ */
+static char extra_cmdline_buf[COMMAND_LINE_SIZE] __initdata;
+static char extra_initargs_buf[COMMAND_LINE_SIZE] __initdata;
+
 /* Make an extra command line under given key word */
-static char * __init xbc_make_cmdline(const char *key)
+static char * __init xbc_make_cmdline(const char *key, char *new_cmdline)
 {
 	struct xbc_node *root;
-	char *new_cmdline;
 	int ret, len = 0;
 
 	root = xbc_find_node(key);
@@ -382,19 +389,12 @@ static char * __init xbc_make_cmdline(const char *key)
 
 	/* Count required buffer size */
 	len = xbc_snprint_cmdline(NULL, 0, root);
-	if (len <= 0)
+	if (len <= 0 || len >= COMMAND_LINE_SIZE)
 		return NULL;
 
-	new_cmdline = memblock_alloc(len + 1, SMP_CACHE_BYTES);
-	if (!new_cmdline) {
-		pr_err("Failed to allocate memory for extra kernel cmdline.\n");
-		return NULL;
-	}
-
 	ret = xbc_snprint_cmdline(new_cmdline, len + 1, root);
 	if (ret < 0 || ret > len) {
 		pr_err("Failed to print extra kernel cmdline.\n");
-		memblock_free(new_cmdline, len + 1);
 		return NULL;
 	}
 
@@ -467,9 +467,9 @@ static void __init setup_boot_config(void)
 		xbc_get_info(&ret, NULL);
 		pr_info("Load bootconfig: %ld bytes %d nodes\n", (long)size, ret);
 		/* keys starting with "kernel." are passed via cmdline */
-		extra_command_line = xbc_make_cmdline("kernel");
+		extra_command_line = xbc_make_cmdline("kernel", extra_cmdline_buf);
 		/* Also, "init." keys are init arguments */
-		extra_init_args = xbc_make_cmdline("init");
+		extra_init_args = xbc_make_cmdline("init", extra_initargs_buf);
 	}
 	return;
 }

-- 
2.52.0


^ permalink raw reply related

* [PATCH 1/3] bootconfig: use static buffers instead of memblock allocation
From: Breno Leitao @ 2026-04-15 10:51 UTC (permalink / raw)
  To: Andrew Morton, Masami Hiramatsu
  Cc: oss, paulmck, linux-trace-kernel, linux-kernel,
	linux-trace-kernel, kernel-team, Breno Leitao
In-Reply-To: <20260415-bootconfig_earlier-v1-0-cf160175de5e@debian.org>

Replace dynamic allocations in the bootconfig parser with static
__initdata buffers. This removes the dependency on memblock being
available when bootconfig is parsed, which is a prerequisite for
moving embedded bootconfig parsing earlier in the boot sequence.

The static buffers are:
- xbc_data_buf[XBC_DATA_MAX + 1] (~32KB) for bootconfig text data
- xbc_nodes_buf[XBC_NODE_MAX] (64KB) for the parse tree nodes

Both are __initdata and freed after init in the kernel. The userspace
tools path (tools/bootconfig) also uses the same static buffers,
which simplifies the code by removing all #ifdef __KERNEL__ blocks
from the parser.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 lib/bootconfig.c | 56 ++++++++++----------------------------------------------
 1 file changed, 10 insertions(+), 46 deletions(-)

diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index c470b93d5dbc2..3e529325c90ac 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -15,12 +15,11 @@
 
 #ifdef __KERNEL__
 #include <linux/bug.h>
-#include <linux/ctype.h>
-#include <linux/errno.h>
 #include <linux/cache.h>
 #include <linux/compiler.h>
+#include <linux/ctype.h>
+#include <linux/errno.h>
 #include <linux/sprintf.h>
-#include <linux/memblock.h>
 #include <linux/string.h>
 
 #ifdef CONFIG_BOOT_CONFIG_EMBED
@@ -54,33 +53,12 @@ static const char *xbc_err_msg __initdata;
 static int xbc_err_pos __initdata;
 static int open_brace[XBC_DEPTH_MAX] __initdata;
 static int brace_index __initdata;
-
-#ifdef __KERNEL__
-static inline void * __init xbc_alloc_mem(size_t size)
-{
-	return memblock_alloc(size, SMP_CACHE_BYTES);
-}
-
-static inline void __init xbc_free_mem(void *addr, size_t size, bool early)
-{
-	if (early)
-		memblock_free(addr, size);
-	else if (addr)
-		memblock_free(addr, size);
-}
-
-#else /* !__KERNEL__ */
-
-static inline void *xbc_alloc_mem(size_t size)
-{
-	return calloc(1, size);
-}
-
-static inline void xbc_free_mem(void *addr, size_t size, bool early)
-{
-	free(addr);
-}
-#endif
+/*
+ * Static buffers for bootconfig data and nodes. Avoids dynamic allocation
+ * so bootconfig can be parsed before memblock is available in the kernel.
+ */
+static char xbc_data_buf[XBC_DATA_MAX + 1] __initdata;
+static struct xbc_node xbc_nodes_buf[XBC_NODE_MAX] __initdata;
 
 /**
  * xbc_get_info() - Get the information of loaded boot config
@@ -930,11 +908,9 @@ static int __init xbc_parse_tree(void)
  */
 void __init _xbc_exit(bool early)
 {
-	xbc_free_mem(xbc_data, xbc_data_size, early);
 	xbc_data = NULL;
 	xbc_data_size = 0;
 	xbc_node_num = 0;
-	xbc_free_mem(xbc_nodes, sizeof(struct xbc_node) * XBC_NODE_MAX, early);
 	xbc_nodes = NULL;
 	brace_index = 0;
 }
@@ -973,24 +949,12 @@ int __init xbc_init(const char *data, size_t size, const char **emsg, int *epos)
 		return -ERANGE;
 	}
 
-	xbc_data = xbc_alloc_mem(size + 1);
-	if (!xbc_data) {
-		if (emsg)
-			*emsg = "Failed to allocate bootconfig data";
-		return -ENOMEM;
-	}
+	xbc_data = xbc_data_buf;
+	xbc_nodes = xbc_nodes_buf;
 	memcpy(xbc_data, data, size);
 	xbc_data[size] = '\0';
 	xbc_data_size = size + 1;
 
-	xbc_nodes = xbc_alloc_mem(sizeof(struct xbc_node) * XBC_NODE_MAX);
-	if (!xbc_nodes) {
-		if (emsg)
-			*emsg = "Failed to allocate bootconfig nodes";
-		_xbc_exit(true);
-		return -ENOMEM;
-	}
-
 	ret = xbc_parse_tree();
 	if (!ret)
 		ret = xbc_verify_tree();

-- 
2.52.0


^ permalink raw reply related

* [PATCH 0/3] bootconfig: break dependency from memblock
From: Breno Leitao @ 2026-04-15 10:51 UTC (permalink / raw)
  To: Andrew Morton, Masami Hiramatsu
  Cc: oss, paulmck, linux-trace-kernel, linux-kernel,
	linux-trace-kernel, kernel-team, Breno Leitao

This series moves embedded bootconfig (CONFIG_BOOT_CONFIG_EMBED) parsing
before setup_arch() in start_kernel(). The goal is to prepare for a
follow-up change that will allow bootconfig to provide early_param()
parameters (e.g. memblock=debug, loglevel) that are consumed during
architecture-specific setup.

Currently, bootconfig depends on memblock for memory allocation, which
is not available until setup_arch() initializes it. This series removes
that dependency by replacing memblock allocations with static __initdata
buffers (~96KB total, freed after init), then splits the bootconfig
loading into an early embedded path and a late initrd path.

The static buffers are used by both the kernel and the userspace
tools/bootconfig parser, which simplifies the code by removing the

These patches alone do not change user-visible behavior. Bootconfig
parameters are still stored in extra_command_line, which gets merged
into boot_command_line by setup_command_line() after setup_arch().

Making parse_early_param() see bootconfig values during setup_arch()
is a separate change that will follow once this infrastructure is
accepted.

Keeping the two efforts separate simplifies review, since the
early_param() integration is a non-trivial change on its own.

Some discussion that led to this decision:

1) It is not trivial to parse early param after setup_arch()
https://lore.kernel.org/all/20260325-early_bootconfig-v2-1-6b05a36fbfb5@debian.org/

2) An similar proposal from Petr 3 years ago:
https://lore.kernel.org/all/20231121231342.193646-2-oss@malat.biz/

---
Breno Leitao (3):
      bootconfig: use static buffers instead of memblock allocation
      init: use static buffers for bootconfig extra command line
      init: move embedded bootconfig parsing before setup_arch()

 init/main.c      | 91 +++++++++++++++++++++++++++++++++++++++++---------------
 lib/bootconfig.c | 56 +++++++---------------------------
 2 files changed, 77 insertions(+), 70 deletions(-)
---
base-commit: 1c7cc4904160c6fc6377564140062d68a3dc93a0
change-id: 20260414-bootconfig_earlier-9e185ceb71ae

Best regards,
--  
Breno Leitao <leitao@debian.org>

^ permalink raw reply

* Re: [PATCH] ring-buffer: Preserve true payload lengths in long data events
From: kernel test robot @ 2026-04-15 10:36 UTC (permalink / raw)
  To: Cao Ruichuang, rostedt, mhiramat
  Cc: oe-kbuild-all, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260407091550.67963-1-create0818@163.com>

Hi Cao,

kernel test robot noticed the following build warnings:

[auto build test WARNING on trace/for-next]
[also build test WARNING on linus/master v7.0 next-20260414]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Cao-Ruichuang/ring-buffer-Preserve-true-payload-lengths-in-long-data-events/20260415-040554
base:   https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link:    https://lore.kernel.org/r/20260407091550.67963-1-create0818%40163.com
patch subject: [PATCH] ring-buffer: Preserve true payload lengths in long data events
config: arc-defconfig (https://download.01.org/0day-ci/archive/20260415/202604151837.uakVEVAS-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260415/202604151837.uakVEVAS-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202604151837.uakVEVAS-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Warning: kernel/trace/ring_buffer.c:4786 function parameter 'force_long' not described in '__ring_buffer_lock_reserve'
>> Warning: kernel/trace/ring_buffer.c:4786 expecting prototype for ring_buffer_lock_reserve(). Prototype was for __ring_buffer_lock_reserve() instead

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* [PATCH v7 6/6] tracing/fprobe: Fix to unregister ftrace_ops if it is empty on module unloading
From: Masami Hiramatsu (Google) @ 2026-04-15 10:09 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177624771150.2407691.15764846647014540969.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Fix fprobe to unregister ftrace_ops if corresponding type of fprobe
does not exist on the fprobe_ip_table and it is expected to be empty
when unloading modules.

Since ftrace thinks that the empty hash means everything to be traced,
if we set fprobes only on the unloaded module, all functions are traced
unexpectedly after unloading module.
e.g.

 # modprobe xt_LOG.ko
 # echo 'f:test log_tg*' > dynamic_events
 # echo 1 > events/fprobes/test/enable
 # cat enabled_functions
log_tg [xt_LOG] (1)             tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_check [xt_LOG] (1)               tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_destroy [xt_LOG] (1)             tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
 # rmmod xt_LOG
 # wc -l enabled_functions
34085 enabled_functions

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v7:
  - Fix to split checking whether ftrace_ops is registered from
    the number of registered fprobes, because ftrace_ops can be
    unregistered in module unloading.
 Changes in v6:
  - Newly added.
---
 kernel/trace/fprobe.c |  214 ++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 165 insertions(+), 49 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 8a5601bf2330..3e3df2e13c06 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -79,7 +79,7 @@ static const struct rhashtable_params fprobe_rht_params = {
 };
 
 /* Node insertion and deletion requires the fprobe_mutex */
-static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
+static int __insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
 {
 	int ret;
 
@@ -92,7 +92,7 @@ static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
 	return ret;
 }
 
-static void delete_fprobe_node(struct fprobe_hlist_node *node)
+static void __delete_fprobe_node(struct fprobe_hlist_node *node)
 {
 	lockdep_assert_held(&fprobe_mutex);
 
@@ -250,7 +250,74 @@ static inline int __fprobe_kprobe_handler(unsigned long ip, unsigned long parent
 	return ret;
 }
 
+static int fprobe_fgraph_entry(struct ftrace_graph_ent *trace, struct fgraph_ops *gops,
+			       struct ftrace_regs *fregs);
+static void fprobe_return(struct ftrace_graph_ret *trace,
+			  struct fgraph_ops *gops,
+			  struct ftrace_regs *fregs);
+
+static struct fgraph_ops fprobe_graph_ops = {
+	.entryfunc	= fprobe_fgraph_entry,
+	.retfunc	= fprobe_return,
+};
+/* Number of fgraph fprobes */
+static int fprobe_graph_active;
+/* Number of fgraph fprobe nodes */
+static int nr_fgraph_fprobes;
+/* Is fprobe_graph_ops registered? */
+static bool fprobe_graph_registered;
+
+/* Add @addrs to the ftrace filter and register fgraph if needed. */
+static int fprobe_graph_add_ips(unsigned long *addrs, int num)
+{
+	int ret;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	ret = ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 0, 0);
+	if (ret)
+		return ret;
+
+	if (!fprobe_graph_active) {
+		if (WARN_ON_ONCE(fprobe_graph_registered))
+			return -EINVAL;
+		ret = register_ftrace_graph(&fprobe_graph_ops);
+		if (WARN_ON_ONCE(ret)) {
+			ftrace_free_filter(&fprobe_graph_ops.ops);
+			return ret;
+		}
+		fprobe_graph_registered = true;
+	}
+	fprobe_graph_active++;
+	return 0;
+}
+
+static void __fprobe_graph_unregister(void)
+{
+	if (fprobe_graph_registered) {
+		unregister_ftrace_graph(&fprobe_graph_ops);
+		ftrace_free_filter(&fprobe_graph_ops.ops);
+		fprobe_graph_registered = false;
+	}
+}
+
+/* Remove @addrs from the ftrace filter and unregister fgraph if possible. */
+static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
+{
+	lockdep_assert_held(&fprobe_mutex);
+
+	if (!fprobe_graph_active)
+		return;
+
+	fprobe_graph_active--;
+	if (!fprobe_graph_active)
+		__fprobe_graph_unregister();
+	else if (num)
+		ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 1, 0);
+}
+
 #if defined(CONFIG_DYNAMIC_FTRACE_WITH_ARGS) || defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS)
+
 /* ftrace_ops callback, this processes fprobes which have only entry_handler. */
 static void fprobe_ftrace_entry(unsigned long ip, unsigned long parent_ip,
 	struct ftrace_ops *ops, struct ftrace_regs *fregs)
@@ -293,7 +360,12 @@ static struct ftrace_ops fprobe_ftrace_ops = {
 	.func	= fprobe_ftrace_entry,
 	.flags	= FTRACE_OPS_FL_SAVE_ARGS,
 };
+/* Number of fgraph fprobes */
 static int fprobe_ftrace_active;
+/* Number of ftrace fprobe nodes */
+static int nr_ftrace_fprobes;
+/* Is fprobe_ftrace_ops registered? */
+static bool fprobe_ftrace_registered;
 
 static int fprobe_ftrace_add_ips(unsigned long *addrs, int num)
 {
@@ -306,24 +378,40 @@ static int fprobe_ftrace_add_ips(unsigned long *addrs, int num)
 		return ret;
 
 	if (!fprobe_ftrace_active) {
+		if (WARN_ON_ONCE(fprobe_ftrace_registered))
+			return -EINVAL;
+
 		ret = register_ftrace_function(&fprobe_ftrace_ops);
 		if (ret) {
 			ftrace_free_filter(&fprobe_ftrace_ops);
 			return ret;
 		}
+		fprobe_ftrace_registered = true;
 	}
 	fprobe_ftrace_active++;
 	return 0;
 }
 
+static void __fprobe_ftrace_unregister(void)
+{
+	if (fprobe_ftrace_registered) {
+		unregister_ftrace_function(&fprobe_ftrace_ops);
+		ftrace_free_filter(&fprobe_ftrace_ops);
+		fprobe_ftrace_registered = false;
+	}
+}
+
 static void fprobe_ftrace_remove_ips(unsigned long *addrs, int num)
 {
 	lockdep_assert_held(&fprobe_mutex);
 
+	if (!fprobe_ftrace_active)
+		return;
+
 	fprobe_ftrace_active--;
 	if (!fprobe_ftrace_active)
-		unregister_ftrace_function(&fprobe_ftrace_ops);
-	if (num)
+		__fprobe_ftrace_unregister();
+	else if (num)
 		ftrace_set_filter_ips(&fprobe_ftrace_ops, addrs, num, 1, 0);
 }
 
@@ -332,6 +420,40 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 	return !fp->exit_handler;
 }
 
+/* Node insertion and deletion requires the fprobe_mutex */
+static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
+{
+	int ret;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	ret = __insert_fprobe_node(node, fp);
+	if (!ret) {
+		if (fprobe_is_ftrace(fp))
+			nr_ftrace_fprobes++;
+		else
+			nr_fgraph_fprobes++;
+	}
+
+	return ret;
+}
+
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
+{
+	struct fprobe *fp;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	fp = READ_ONCE(node->fp);
+	if (fp) {
+		if (fprobe_is_ftrace(fp))
+			nr_ftrace_fprobes--;
+		else
+			nr_fgraph_fprobes--;
+	}
+	__delete_fprobe_node(node);
+}
+
 static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
 {
 	struct rhlist_head *head, *pos;
@@ -361,8 +483,15 @@ static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
 #ifdef CONFIG_MODULES
 static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
-	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
-	ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
+	if (!nr_fgraph_fprobes)
+		__fprobe_graph_unregister();
+	else
+		ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+
+	if (!nr_ftrace_fprobes)
+		__fprobe_ftrace_unregister();
+	else
+		ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
 }
 #endif
 #else
@@ -380,6 +509,32 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 	return false;
 }
 
+/* Node insertion and deletion requires the fprobe_mutex */
+static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
+{
+	int ret;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	ret = __insert_fprobe_node(node, fp);
+	if (!ret)
+		nr_fgraph_fprobes++;
+
+	return ret;
+}
+
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
+{
+	struct fprobe *fp;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	fp = READ_ONCE(node->fp);
+	if (fp)
+		nr_fgraph_fprobes--;
+	__delete_fprobe_node(node);
+}
+
 static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
 {
 	struct rhlist_head *head, *pos;
@@ -406,7 +561,10 @@ static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
 #ifdef CONFIG_MODULES
 static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
-	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+	if (!nr_fgraph_fprobes)
+		__fprobe_graph_unregister();
+	else
+		ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
 }
 #endif
 #endif /* !CONFIG_DYNAMIC_FTRACE_WITH_ARGS && !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
@@ -536,48 +694,6 @@ static void fprobe_return(struct ftrace_graph_ret *trace,
 }
 NOKPROBE_SYMBOL(fprobe_return);
 
-static struct fgraph_ops fprobe_graph_ops = {
-	.entryfunc	= fprobe_fgraph_entry,
-	.retfunc	= fprobe_return,
-};
-static int fprobe_graph_active;
-
-/* Add @addrs to the ftrace filter and register fgraph if needed. */
-static int fprobe_graph_add_ips(unsigned long *addrs, int num)
-{
-	int ret;
-
-	lockdep_assert_held(&fprobe_mutex);
-
-	ret = ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 0, 0);
-	if (ret)
-		return ret;
-
-	if (!fprobe_graph_active) {
-		ret = register_ftrace_graph(&fprobe_graph_ops);
-		if (WARN_ON_ONCE(ret)) {
-			ftrace_free_filter(&fprobe_graph_ops.ops);
-			return ret;
-		}
-	}
-	fprobe_graph_active++;
-	return 0;
-}
-
-/* Remove @addrs from the ftrace filter and unregister fgraph if possible. */
-static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
-{
-	lockdep_assert_held(&fprobe_mutex);
-
-	fprobe_graph_active--;
-	/* Q: should we unregister it ? */
-	if (!fprobe_graph_active)
-		unregister_ftrace_graph(&fprobe_graph_ops);
-
-	if (num)
-		ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 1, 0);
-}
-
 #ifdef CONFIG_MODULES
 
 #define FPROBE_IPS_BATCH_INIT 128


^ permalink raw reply related

* [PATCH v7 5/6] tracing/fprobe: Check the same type fprobe on table as the unregistered one
From: Masami Hiramatsu (Google) @ 2026-04-15 10:09 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177624771150.2407691.15764846647014540969.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Commit 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
introduced a different ftrace_ops for entry-only fprobes.

However, when unregistering an fprobe, the kernel only checks if another
fprobe exists at the same address, without checking which type of fprobe
it is.
If different fprobes are registered at the same address, the same address
will be registered in both fgraph_ops and ftrace_ops, but only one of
them will be deleted when unregistering. (the one removed first will not
be deleted from the ops).

This results in junk entries remaining in either fgraph_ops or ftrace_ops.
For example:
 =======
 cd /sys/kernel/tracing

 # 'Add entry and exit events on the same place'
 echo 'f:event1 vfs_read' >> dynamic_events
 echo 'f:event2 vfs_read%return' >> dynamic_events

 # 'Enable both of them'
 echo 1 > events/fprobes/enable
 cat enabled_functions
vfs_read (2)            ->arch_ftrace_ops_list_func+0x0/0x210

 # 'Disable and remove exit event'
 echo 0 > events/fprobes/event2/enable
 echo -:event2 >> dynamic_events

 # 'Disable and remove all events'
 echo 0 > events/fprobes/enable
 echo > dynamic_events

 # 'Add another event'
 echo 'f:event3 vfs_open%return' > dynamic_events
 cat dynamic_events
f:fprobes/event3 vfs_open%return

 echo 1 > events/fprobes/enable
 cat enabled_functions
vfs_open (1)            tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60    subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
vfs_read (1)            tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60    subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
 =======

As you can see, an entry for the vfs_read remains.

To fix this issue, when unregistering, the kernel should also check if
there is the same type of fprobes still exist at the same address, and
if not, delete its entry from either fgraph_ops or ftrace_ops.

Fixes: 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 kernel/trace/fprobe.c |   82 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 65 insertions(+), 17 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 0df8e003e88e..8a5601bf2330 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -92,11 +92,8 @@ static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
 	return ret;
 }
 
-/* Return true if there are synonims */
-static bool delete_fprobe_node(struct fprobe_hlist_node *node)
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
 {
-	bool ret;
-
 	lockdep_assert_held(&fprobe_mutex);
 
 	/* Avoid double deleting and non-inserted nodes */
@@ -105,13 +102,6 @@ static bool delete_fprobe_node(struct fprobe_hlist_node *node)
 		rhltable_remove(&fprobe_ip_table, &node->hlist,
 				fprobe_rht_params);
 	}
-
-	rcu_read_lock();
-	ret = !!rhltable_lookup(&fprobe_ip_table, &node->addr,
-				fprobe_rht_params);
-	rcu_read_unlock();
-
-	return ret;
 }
 
 /* Check existence of the fprobe */
@@ -342,6 +332,32 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 	return !fp->exit_handler;
 }
 
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
+{
+	struct rhlist_head *head, *pos;
+	struct fprobe_hlist_node *node;
+	struct fprobe *fp;
+
+	guard(rcu)();
+	head = rhltable_lookup(&fprobe_ip_table, &ip,
+				fprobe_rht_params);
+	if (!head)
+		return false;
+	/* We have to check the same type on the list. */
+	rhl_for_each_entry_rcu(node, pos, head, hlist) {
+		if (node->addr != ip)
+			break;
+		fp = READ_ONCE(node->fp);
+		if (likely(fp)) {
+			if ((!ftrace && fp->exit_handler) ||
+			    (ftrace && !fp->exit_handler))
+				return true;
+		}
+	}
+
+	return false;
+}
+
 #ifdef CONFIG_MODULES
 static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
@@ -364,6 +380,29 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 	return false;
 }
 
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
+{
+	struct rhlist_head *head, *pos;
+	struct fprobe_hlist_node *node;
+	struct fprobe *fp;
+
+	guard(rcu)();
+	head = rhltable_lookup(&fprobe_ip_table, &ip,
+				fprobe_rht_params);
+	if (!head)
+		return false;
+	/* We only need to check fp is there. */
+	rhl_for_each_entry_rcu(node, pos, head, hlist) {
+		if (node->addr != ip)
+			break;
+		fp = READ_ONCE(node->fp);
+		if (likely(fp))
+			return true;
+	}
+
+	return false;
+}
+
 #ifdef CONFIG_MODULES
 static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
@@ -552,18 +591,25 @@ struct fprobe_addr_list {
 static int fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
 					 struct fprobe_addr_list *alist)
 {
+	lockdep_assert_in_rcu_read_lock();
+
 	if (!within_module(node->addr, mod))
 		return 0;
 
-	if (delete_fprobe_node(node))
-		return 0;
+	delete_fprobe_node(node);
 	/* If no address list is available, we can't track this address. */
 	if (!alist->addrs)
 		return 0;
+	/*
+	 * Don't care the type here, because all fprobes on the same
+	 * address must be removed eventually.
+	 */
+	if (!rhltable_lookup(&fprobe_ip_table, &node->addr, fprobe_rht_params)) {
+		alist->addrs[alist->index++] = node->addr;
+		if (alist->index == alist->size)
+			return -ENOSPC;
+	}
 
-	alist->addrs[alist->index++] = node->addr;
-	if (alist->index == alist->size)
-		return -ENOSPC;
 	return 0;
 }
 
@@ -931,7 +977,9 @@ static int unregister_fprobe_nolock(struct fprobe *fp)
 	/* Remove non-synonim ips from table and hash */
 	count = 0;
 	for (i = 0; i < hlist_array->size; i++) {
-		if (!delete_fprobe_node(&hlist_array->array[i]) && addrs)
+		delete_fprobe_node(&hlist_array->array[i]);
+		if (addrs && !fprobe_exists_on_hash(hlist_array->array[i].addr,
+						    fprobe_is_ftrace(fp)))
 			addrs[count++] = hlist_array->array[i].addr;
 	}
 	del_fprobe_hash(fp);


^ permalink raw reply related

* [PATCH v7 4/6] tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
From: Masami Hiramatsu (Google) @ 2026-04-15 10:09 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177624771150.2407691.15764846647014540969.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

fprobe_remove_node_in_module() is called under RCU read locked, but
this invokes kcalloc() if there are more than 8 fprobes installed
on the module. Sashiko warns it because kcalloc() can sleep [1].

 [1] https://sashiko.dev/#/patchset/177552432201.853249.5125045538812833325.stgit%40mhiramat.tok.corp.google.com

To fix this issue, expand the batch size to 128 and do not expand
the fprobe_addr_list, but just cancel walking on fprobe_ip_table,
update fgraph/ftrace_ops and retry the loop again.

Fixes: 0de4c70d04a4 ("tracing: fprobe: use rhltable for fprobe_ip_table")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v6:
  - Retry outside rhltable_walk_enter/exit() again.
 Changes in v5:
  - Skip updating ftrace_ops when fails to allocate memory in module
    unloading.
 Changes in v4:
  - fix a build error typo in case of CONFIG_DYNAMIC_FTRACE=n.
 Changes in v3:
  - Retry inside rhltable_walk_enter/exit().
  - Rename fprobe_set_ips() to fprobe_remove_ips().
  - Rename 'retry' label to 'again'.
---
 kernel/trace/fprobe.c |   92 ++++++++++++++++++++++++-------------------------
 1 file changed, 45 insertions(+), 47 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index c58e41b2da8e..0df8e003e88e 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -343,11 +343,10 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 }
 
 #ifdef CONFIG_MODULES
-static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
-			   int reset)
+static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
-	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, remove, reset);
-	ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, remove, reset);
+	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+	ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
 }
 #endif
 #else
@@ -366,10 +365,9 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 }
 
 #ifdef CONFIG_MODULES
-static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
-			   int reset)
+static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
-	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, remove, reset);
+	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
 }
 #endif
 #endif /* !CONFIG_DYNAMIC_FTRACE_WITH_ARGS && !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
@@ -543,7 +541,7 @@ static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
 
 #ifdef CONFIG_MODULES
 
-#define FPROBE_IPS_BATCH_INIT 8
+#define FPROBE_IPS_BATCH_INIT 128
 /* instruction pointer address list */
 struct fprobe_addr_list {
 	int index;
@@ -551,45 +549,24 @@ struct fprobe_addr_list {
 	unsigned long *addrs;
 };
 
-static int fprobe_addr_list_add(struct fprobe_addr_list *alist, unsigned long addr)
+static int fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
+					 struct fprobe_addr_list *alist)
 {
-	unsigned long *addrs;
-
-	/* Previously we failed to expand the list. */
-	if (alist->index == alist->size)
-		return -ENOSPC;
-
-	alist->addrs[alist->index++] = addr;
-	if (alist->index < alist->size)
+	if (!within_module(node->addr, mod))
 		return 0;
 
-	/* Expand the address list */
-	addrs = kcalloc(alist->size * 2, sizeof(*addrs), GFP_KERNEL);
-	if (!addrs)
-		return -ENOMEM;
-
-	memcpy(addrs, alist->addrs, alist->size * sizeof(*addrs));
-	alist->size *= 2;
-	kfree(alist->addrs);
-	alist->addrs = addrs;
+	if (delete_fprobe_node(node))
+		return 0;
+	/* If no address list is available, we can't track this address. */
+	if (!alist->addrs)
+		return 0;
 
+	alist->addrs[alist->index++] = node->addr;
+	if (alist->index == alist->size)
+		return -ENOSPC;
 	return 0;
 }
 
-static void fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
-					 struct fprobe_addr_list *alist)
-{
-	if (!within_module(node->addr, mod))
-		return;
-	if (delete_fprobe_node(node))
-		return;
-	/*
-	 * If failed to update alist, just continue to update hlist.
-	 * Therefore, at list user handler will not hit anymore.
-	 */
-	fprobe_addr_list_add(alist, node->addr);
-}
-
 /* Handle module unloading to manage fprobe_ip_table. */
 static int fprobe_module_callback(struct notifier_block *nb,
 				  unsigned long val, void *data)
@@ -598,29 +575,50 @@ static int fprobe_module_callback(struct notifier_block *nb,
 	struct fprobe_hlist_node *node;
 	struct rhashtable_iter iter;
 	struct module *mod = data;
+	bool retry;
 
 	if (val != MODULE_STATE_GOING)
 		return NOTIFY_DONE;
 
 	alist.addrs = kcalloc(alist.size, sizeof(*alist.addrs), GFP_KERNEL);
-	/* If failed to alloc memory, we can not remove ips from hash. */
-	if (!alist.addrs)
-		return NOTIFY_DONE;
+	/*
+	 * If failed to alloc memory, ftrace_ops will not be able to remove ips from
+	 * hash, but we can still remove nodes from fprobe_ip_table, so we can avoid
+	 * the potential wrong callback. So just print a warning here and try to
+	 * continue without address list.
+	 */
+	WARN_ONCE(!alist.addrs,
+		"Failed to allocate memory for fprobe_addr_list, ftrace_ops will not be updated");
 
 	mutex_lock(&fprobe_mutex);
+again:
+	retry = false;
+	alist.index = 0;
 	rhltable_walk_enter(&fprobe_ip_table, &iter);
 	do {
 		rhashtable_walk_start(&iter);
 
 		while ((node = rhashtable_walk_next(&iter)) && !IS_ERR(node))
-			fprobe_remove_node_in_module(mod, node, &alist);
+			if (fprobe_remove_node_in_module(mod, node, &alist) < 0) {
+				retry = true;
+				break;
+			}
 
 		rhashtable_walk_stop(&iter);
-	} while (node == ERR_PTR(-EAGAIN));
+	} while (node == ERR_PTR(-EAGAIN) && !retry);
 	rhashtable_walk_exit(&iter);
+	/* Remove any ips from hash table(s) */
+	if (alist.index > 0) {
+		fprobe_remove_ips(alist.addrs, alist.index);
+		/*
+		 * If we break rhashtable walk loop except for -EAGAIN, we need
+		 * to restart looping from start for safety. Anyway, this is
+		 * not a hotpath.
+		 */
+		if (retry)
+			goto again;
+	}
 
-	if (alist.index > 0)
-		fprobe_set_ips(alist.addrs, alist.index, 1, 0);
 	mutex_unlock(&fprobe_mutex);
 
 	kfree(alist.addrs);


^ permalink raw reply related

* [PATCH v7 3/6] tracing/fprobe: Remove fprobe from hash in failure path
From: Masami Hiramatsu (Google) @ 2026-04-15 10:08 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177624771150.2407691.15764846647014540969.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

When register_fprobe_ips() fails, it tries to remove a list of
fprobe_hash_node from fprobe_ip_table, but it missed to remove
fprobe itself from fprobe_table. Moreover, when removing
the fprobe_hash_node which is added to rhltable once, it must
use kfree_rcu() after removing from rhltable.

To fix these issues, this reuses unregister_fprobe() internal
code to rollback the half-way registered fprobe.

Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v7:
  - Remove RCU grace period wait, since fprobe itself is not
    that is not needed.
 Changes in v6:
  - Wait for an RCU grace period before returning error in
    unregister_fprobe_nolock().
 Changes in v5:
  - When rolling back an fprobe that failed to register, the
    fprobe_hash_node are forcibly removed and warn if failure.
 Changes in v4:
  - Remove short-cut case because we always need to upadte ftrace_ops.
  - Use guard(mutex) in register_fprobe_ips() to unlock it correctly.
  - Remove redundant !ret check in register_fprobe_ips().
  - Do not set hlist_array->size in failure case, instead,
    hlist_array->array[i].fp is set only when insertion is succeeded.
  Changes in v3:
  - Newly added.
---
 kernel/trace/fprobe.c |   87 +++++++++++++++++++++++++------------------------
 1 file changed, 44 insertions(+), 43 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index d634eb8e8b9e..c58e41b2da8e 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -79,20 +79,27 @@ static const struct rhashtable_params fprobe_rht_params = {
 };
 
 /* Node insertion and deletion requires the fprobe_mutex */
-static int insert_fprobe_node(struct fprobe_hlist_node *node)
+static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
 {
+	int ret;
+
 	lockdep_assert_held(&fprobe_mutex);
 
-	return rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
+	ret = rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
+	/* Set the fprobe pointer if insertion was successful. */
+	if (!ret)
+		WRITE_ONCE(node->fp, fp);
+	return ret;
 }
 
 /* Return true if there are synonims */
 static bool delete_fprobe_node(struct fprobe_hlist_node *node)
 {
-	lockdep_assert_held(&fprobe_mutex);
 	bool ret;
 
-	/* Avoid double deleting */
+	lockdep_assert_held(&fprobe_mutex);
+
+	/* Avoid double deleting and non-inserted nodes */
 	if (READ_ONCE(node->fp) != NULL) {
 		WRITE_ONCE(node->fp, NULL);
 		rhltable_remove(&fprobe_ip_table, &node->hlist,
@@ -757,7 +764,6 @@ static int fprobe_init(struct fprobe *fp, unsigned long *addrs, int num)
 	fp->hlist_array = hlist_array;
 	hlist_array->fp = fp;
 	for (i = 0; i < num; i++) {
-		hlist_array->array[i].fp = fp;
 		addr = ftrace_location(addrs[i]);
 		if (!addr) {
 			fprobe_fail_cleanup(fp);
@@ -821,6 +827,8 @@ int register_fprobe(struct fprobe *fp, const char *filter, const char *notfilter
 }
 EXPORT_SYMBOL_GPL(register_fprobe);
 
+static int unregister_fprobe_nolock(struct fprobe *fp);
+
 /**
  * register_fprobe_ips() - Register fprobe to ftrace by address.
  * @fp: A fprobe data structure to be registered.
@@ -847,28 +855,22 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
 	if (ret)
 		return ret;
 
-	hlist_array = fp->hlist_array;
 	if (fprobe_is_ftrace(fp))
 		ret = fprobe_ftrace_add_ips(addrs, num);
 	else
 		ret = fprobe_graph_add_ips(addrs, num);
-
-	if (!ret) {
-		add_fprobe_hash(fp);
-		for (i = 0; i < hlist_array->size; i++) {
-			ret = insert_fprobe_node(&hlist_array->array[i]);
-			if (ret)
-				break;
-		}
-		/* fallback on insert error */
-		if (ret) {
-			for (i--; i >= 0; i--)
-				delete_fprobe_node(&hlist_array->array[i]);
-		}
+	if (ret) {
+		fprobe_fail_cleanup(fp);
+		return ret;
 	}
 
-	if (ret)
-		fprobe_fail_cleanup(fp);
+	hlist_array = fp->hlist_array;
+	add_fprobe_hash(fp);
+	for (i = 0; i < hlist_array->size; i++) {
+		ret = insert_fprobe_node(&hlist_array->array[i], fp);
+		if (ret)
+			unregister_fprobe_nolock(fp);
+	}
 
 	return ret;
 }
@@ -912,27 +914,12 @@ bool fprobe_is_registered(struct fprobe *fp)
 	return true;
 }
 
-/**
- * unregister_fprobe() - Unregister fprobe.
- * @fp: A fprobe data structure to be unregistered.
- *
- * Unregister fprobe (and remove ftrace hooks from the function entries).
- *
- * Return 0 if @fp is unregistered successfully, -errno if not.
- */
-int unregister_fprobe(struct fprobe *fp)
+static int unregister_fprobe_nolock(struct fprobe *fp)
 {
-	struct fprobe_hlist *hlist_array;
+	struct fprobe_hlist *hlist_array = fp->hlist_array;
 	unsigned long *addrs = NULL;
-	int ret = 0, i, count;
+	int i, count;
 
-	mutex_lock(&fprobe_mutex);
-	if (!fp || !fprobe_registered(fp)) {
-		ret = -EINVAL;
-		goto out;
-	}
-
-	hlist_array = fp->hlist_array;
 	addrs = kcalloc(hlist_array->size, sizeof(unsigned long), GFP_KERNEL);
 	/*
 	 * This will remove fprobe_hash_node from the hash table even if
@@ -958,12 +945,26 @@ int unregister_fprobe(struct fprobe *fp)
 
 	kfree_rcu(hlist_array, rcu);
 	fp->hlist_array = NULL;
+	kfree(addrs);
 
-out:
-	mutex_unlock(&fprobe_mutex);
+	return 0;
+}
 
-	kfree(addrs);
-	return ret;
+/**
+ * unregister_fprobe() - Unregister fprobe.
+ * @fp: A fprobe data structure to be unregistered.
+ *
+ * Unregister fprobe (and remove ftrace hooks from the function entries).
+ *
+ * Return 0 if @fp is unregistered successfully, -errno if not.
+ */
+int unregister_fprobe(struct fprobe *fp)
+{
+	guard(mutex)(&fprobe_mutex);
+	if (!fp || !fprobe_registered(fp))
+		return -EINVAL;
+
+	return unregister_fprobe_nolock(fp);
 }
 EXPORT_SYMBOL_GPL(unregister_fprobe);
 


^ permalink raw reply related

* [PATCH v7 2/6] tracing/fprobe: Unregister fprobe even if memory allocation fails
From: Masami Hiramatsu (Google) @ 2026-04-15 10:08 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177624771150.2407691.15764846647014540969.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

unregister_fprobe() can fail under memory pressure because of memory
allocation failure, but this maybe called from module unloading, and
usually there is no way to retry it. Moreover. trace_fprobe does not
check the return value.

To fix this problem, unregister fprobe and fprobe_hash_node even if
working memory allocation fails.
Anyway, if the last fprobe is removed, the filter will be freed.

Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v7:
  - Newly added.
---
 kernel/trace/fprobe.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index fc7018b28fdd..d634eb8e8b9e 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -934,15 +934,19 @@ int unregister_fprobe(struct fprobe *fp)
 
 	hlist_array = fp->hlist_array;
 	addrs = kcalloc(hlist_array->size, sizeof(unsigned long), GFP_KERNEL);
-	if (!addrs) {
-		ret = -ENOMEM;	/* TODO: Fallback to one-by-one loop */
-		goto out;
-	}
+	/*
+	 * This will remove fprobe_hash_node from the hash table even if
+	 * memory allocation fails. However, ftrace_ops will not be updated.
+	 * Anyway, when the last fprobe is unregistered, ftrace_ops is also
+	 * unregistered.
+	 */
+	if (!addrs)
+		pr_warn("Failed to allocate working array. ftrace_ops may not sync.\n");
 
 	/* Remove non-synonim ips from table and hash */
 	count = 0;
 	for (i = 0; i < hlist_array->size; i++) {
-		if (!delete_fprobe_node(&hlist_array->array[i]))
+		if (!delete_fprobe_node(&hlist_array->array[i]) && addrs)
 			addrs[count++] = hlist_array->array[i].addr;
 	}
 	del_fprobe_hash(fp);


^ permalink raw reply related

* [PATCH v7 1/6] tracing/fprobe: Reject registration of a registered fprobe before init
From: Masami Hiramatsu (Google) @ 2026-04-15 10:08 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177624771150.2407691.15764846647014540969.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Reject registration of a registered fprobe which is on the fprobe
hash table before initializing fprobe.
The add_fprobe_hash() checks this re-register fprobe, but since
fprobe_init() clears hlist_array field, it is too late to check it.
It has to check the re-registration before touncing fprobe.

Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v6:
  - Newly added.
---
 kernel/trace/fprobe.c |   21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index dcadf1d23b8a..fc7018b28fdd 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -4,6 +4,7 @@
  */
 #define pr_fmt(fmt) "fprobe: " fmt
 
+#include <linux/cleanup.h>
 #include <linux/err.h>
 #include <linux/fprobe.h>
 #include <linux/kallsyms.h>
@@ -107,7 +108,7 @@ static bool delete_fprobe_node(struct fprobe_hlist_node *node)
 }
 
 /* Check existence of the fprobe */
-static bool is_fprobe_still_exist(struct fprobe *fp)
+static bool fprobe_registered(struct fprobe *fp)
 {
 	struct hlist_head *head;
 	struct fprobe_hlist *fph;
@@ -120,7 +121,7 @@ static bool is_fprobe_still_exist(struct fprobe *fp)
 	}
 	return false;
 }
-NOKPROBE_SYMBOL(is_fprobe_still_exist);
+NOKPROBE_SYMBOL(fprobe_registered);
 
 static int add_fprobe_hash(struct fprobe *fp)
 {
@@ -132,9 +133,6 @@ static int add_fprobe_hash(struct fprobe *fp)
 	if (WARN_ON_ONCE(!fph))
 		return -EINVAL;
 
-	if (is_fprobe_still_exist(fp))
-		return -EEXIST;
-
 	head = &fprobe_table[hash_ptr(fp, FPROBE_HASH_BITS)];
 	hlist_add_head_rcu(&fp->hlist_array->hlist, head);
 	return 0;
@@ -149,7 +147,7 @@ static int del_fprobe_hash(struct fprobe *fp)
 	if (WARN_ON_ONCE(!fph))
 		return -EINVAL;
 
-	if (!is_fprobe_still_exist(fp))
+	if (!fprobe_registered(fp))
 		return -ENOENT;
 
 	fph->fp = NULL;
@@ -482,7 +480,7 @@ static void fprobe_return(struct ftrace_graph_ret *trace,
 		if (!fp)
 			break;
 		curr += FPROBE_HEADER_SIZE_IN_LONG;
-		if (is_fprobe_still_exist(fp) && !fprobe_disabled(fp)) {
+		if (fprobe_registered(fp) && !fprobe_disabled(fp)) {
 			if (WARN_ON_ONCE(curr + size > size_words))
 				break;
 			fp->exit_handler(fp, trace->func, ret_ip, fregs,
@@ -841,12 +839,14 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
 	struct fprobe_hlist *hlist_array;
 	int ret, i;
 
+	guard(mutex)(&fprobe_mutex);
+	if (fprobe_registered(fp))
+		return -EEXIST;
+
 	ret = fprobe_init(fp, addrs, num);
 	if (ret)
 		return ret;
 
-	mutex_lock(&fprobe_mutex);
-
 	hlist_array = fp->hlist_array;
 	if (fprobe_is_ftrace(fp))
 		ret = fprobe_ftrace_add_ips(addrs, num);
@@ -866,7 +866,6 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
 				delete_fprobe_node(&hlist_array->array[i]);
 		}
 	}
-	mutex_unlock(&fprobe_mutex);
 
 	if (ret)
 		fprobe_fail_cleanup(fp);
@@ -928,7 +927,7 @@ int unregister_fprobe(struct fprobe *fp)
 	int ret = 0, i, count;
 
 	mutex_lock(&fprobe_mutex);
-	if (!fp || !is_fprobe_still_exist(fp)) {
+	if (!fp || !fprobe_registered(fp)) {
 		ret = -EINVAL;
 		goto out;
 	}


^ permalink raw reply related

* [PATCH v7 0/6] tracing/fprobe: Fix fprobe_ip_table related bugs
From: Masami Hiramatsu (Google) @ 2026-04-15 10:08 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel

Hi,

Here is the 7th version of fprobe bugfix series.
The previous version is here.

https://lore.kernel.org/all/177615807787.1165997.921227352050738693.stgit@mhiramat.tok.corp.google.com/

In this version, I added yet another bugfixes (again) for make
unregister_fprobe() more robust for memory pressure [2/6]. And
remove uneeded RCU sync [3/6]. And split checking whether ftrace_ops
is registered from the number of registered fprobes, because
ftrace_ops can be unregistered in module unloading[6/6].

Thank you!

Masami Hiramatsu (Google) (6):
      tracing/fprobe: Reject registration of a registered fprobe before init
      tracing/fprobe: Unregister fprobe even if memory allocation fails
      tracing/fprobe: Remove fprobe from hash in failure path
      tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
      tracing/fprobe: Check the same type fprobe on table as the unregistered one
      tracing/fprobe: Fix to unregister ftrace_ops if it is empty on module unloading


 kernel/trace/fprobe.c |  480 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 323 insertions(+), 157 deletions(-)


base-commit: 4346be6577aaa04586167402ae87bbdbe32484a4
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v6 2/5] tracing/fprobe: Remove fprobe from hash in failure path
From: Masami Hiramatsu @ 2026-04-15 10:06 UTC (permalink / raw)
  To: Menglong Dong
  Cc: Steven Rostedt, Menglong Dong, Mathieu Desnoyers, jiang.biao,
	linux-kernel, linux-trace-kernel
In-Reply-To: <2405872.ElGaqSPkdT@7940hx>

On Wed, 15 Apr 2026 17:47:11 +0800
Menglong Dong <menglong.dong@linux.dev> wrote:

> On 2026/4/14 17:14 Masami Hiramatsu (Google) <mhiramat@kernel.org> write:
> > From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > 
> > When register_fprobe_ips() fails, it tries to remove a list of
> > fprobe_hash_node from fprobe_ip_table, but it missed to remove
> > fprobe itself from fprobe_table. Moreover, when removing
> > the fprobe_hash_node which is added to rhltable once, it must
> > use kfree_rcu() after removing from rhltable.
> > 
> > To fix these issues, this reuses unregister_fprobe() internal
> > code to rollback the half-way registered fprobe.
> > 
> > Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > ---
> [...]
> >  
> > +static int unregister_fprobe_nolock(struct fprobe *fp, bool force);
> > +
> >  /**
> >   * register_fprobe_ips() - Register fprobe to ftrace by address.
> >   * @fp: A fprobe data structure to be registered.
> > @@ -847,29 +855,26 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
> >  	if (ret)
> >  		return ret;
> 
> Hi, Masami. The logic of unregister_fprobe_nolock() looks a little
> messy. How about we make the logic here like this:
> 
> for (i = 0; i < hlist_array->size; i++) {
>     // The node->fp is NULL, so it's safe to add the node before
>     // fprobe_ftrace_add_ips(), right?
>     ret = insert_fprobe_node(&hlist_array->array[i], fp);
>     if (ret)
>         goto fallback_err;
> }
> 
> if (fprobe_is_ftrace(fp))
>     ret = fprobe_ftrace_add_ips(addrs, num);
> else
>     ret = fprobe_graph_add_ips(addrs, num);
> if (ret)
>     goto fallback_err;
> 
> add_fprobe_hash(fp);
> for (i = 0; i < hlist_array->size; i++)
>     WRITE_ONCE(hlist_array->array[i].fp, fp);
> 
> return 0;
> 
> fallback_err:
> for (i--; i >= 0; i--)
>     delete_fprobe_node(&hlist_array->array[i]);
> fprobe_fail_cleanup(fp);
> return ret;
> 
> Then, we don't need to change unregister_fprobe_nolock and
> insert_fprobe_node.

Thanks for the idea, but I don't like repeat it.
It is better to do the same thing(unregister) in the same code.
Above seems a bit optimized for fixing a problem.
(Maybe revisit it later for optimization)

Thank you,


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-04-15  9:49 UTC (permalink / raw)
  To: Gregory Price
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <ad0iT4UWka3gMUpu@gourry-fedora-PF4VCD3F>

On 4/13/26 19:05, Gregory Price wrote:
> On Mon, Apr 13, 2026 at 03:11:12PM +0200, David Hildenbrand (Arm) wrote:
>>> Normally cloud-hypervisor VMs with virtio-net can't be subject to KSM
>>> because the entire boot region gets marked shared.  
>>
>> What exactly do you mean with "mark shared". Do you mean, that "shared
>> memory" is used in the hypervisor for all boot memory?
>>
> 
> Sorry, meant MAP_SHARED.  But yes, in some setups the hypervisor simply
> makes a memfd with the entire main memory region MAP_SHARED.
> 
> This is because the virtio-net device / network stack does GFP_KERNEL
> allocations and then pins them on the host to allow zero-copy - so all
> of ZONE_NORMAL is a valid target.
> 
> (At least that's my best understanding of the entire setup).

I think with vhost-kernel virtio-net just supports MAP_PRIVATE, KSM and
all of that.

The problem is vhost-user, where the other process needs to access all
of VM's memory. That's not only a problem for virtio-net, but also
virtio-fs and all the other stuff that uses vhost-user.

One idea discussed in the past was to let vhost-user access selected
guest memory through QEMU, so there would be no need to even map all of
guest memory into the other processes.

That in turn would stop requiring MAP_SHARED for most guest RAM, only
focusing it on some key parts. Not sure what happened with that idea.

A related series proposed some  MEM_READ/WRITE backend requests [1]

[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-09/msg02693.html


Something else people were discussing in the past was to physically
limit the area where virtio queues could be placed.

> 
>>
>> You mean, in the VM, memory usable by virtio-net can only be consumed
>> from a dedicated physical memory region, and that region would be a
>> separate node?
>>
> 
> Correct - it does requires teaching the network stack numa awareness.
> 
> I was surprised by how little code this required, though I can't be
> 100% sure of its correctness since networking isn't my normal space.

One problem might be that VMs with NUMA disabled or reconfigured would
just break. So you cannot run arbitrary guests in there. That was also
one of the problems of "physically limit the area where virtio queues
could be placed", if you have to be prepared to run arbitrary OSes in
your VM (Windows says hi).

> 
> Alternatively you could imagine this as a real device bringing its own
> dedicated networking memory for network buffers, and then telling the
> network start "Hey, prefer this node over normal kernel allocations".
> 
> What I'd been hacking on was cobbled together with memfd + SRAT bits to
> bring up a private node statically and then have the device claim it -
> but this is just a proof of concept.  A proper implementation would be
> extending virtio-net to report a dedicated EFI_RESERVED region.
> 
>>>
>>> I see you saw below that one of the extensions is removing the nodes
>>> from the fallback list.  That is part one, but it's insufficient to
>>> prevent complete leakage (someone might iterate over the nodes-possible
>>> list and try migrating memory).
>>
>> Which code would do that?
>>
> 
> There are many callers of for_each_node() throughout the system.
> 
> but one discrete example:
> 
> int alloc_shrinker_info(struct mem_cgroup *memcg)
> {
> ... snip ...
>   for_each_node(nid) {
>     struct shrinker_info *info = kvzalloc_node(sizeof(*info) + array_size,
>                                                GFP_KERNEL, nid);
> ... snip ..
> }
> 
> If you disallow fallbacks in this scenario, this allocation always fails.
> 
> This partially answers your question about slub fallback allocations,
> there are slab allocations like this that depend on fallbacks (more
> below on this explicitly).

But that's a different "fallback" problem, no?

You want allocations that target the "special node" to fallback to
*other* nodes, but not other allocations to fallback to *this special* node.

> 
>>> Basically the only isolation mechanism we have today is ZONE_DEVICE.
>>>
>>> Either via mbind and friends, or even just the driver itself managing it
>>> directly via alloc_pages_node() and exposing some userland interface.
>>
>> Would mbind() work here? I thought mbind() would not suddenly give
>> access to some ZONE_DEVICE memory.
>>
> 
> Sorry these were orthogonal thoughts.
> 
> 1) We don't have such a mechanism. ZONE_DEVICE's preferred mechanism is
>    setting up explicit migrations via migrate_device.c

Makes sense.

> 
> 2) mbind / alloc_pages_node would only work for private nodes.
> 
>    Extending ZONE_DEVICE to enable mbind() would be an extreme lift,
>    as the kernel makes a lot of assumptions about folio->lru.
> 
>    This is why i went the node route in the first place.


Agreed.

> 
>>>
>>> in the NP_OPS_MIGRATION patch, this gets covered.
>>
>> Right, but I am not sure if NP_OPS_MIGRATION is really the right
>> approach for that. Have to think about that.
>>
> 
> So, OPS is a bit misleading, but it's the closest i came to some
> existing pattern.  OPS does not necessarily need to imply callbacks.
> 
> I've been trying to minimize the patch set and I'm starting to think
> the MVP may actually be able to do away with the private_ops structure
> for a basic migration+mempolicy example by simply teaching some services
> (migrate.c, mempolicy.c) how/when to inject __GFP_PRIVATE.
> 
> the mempolicy.c patch already does this, but not migrate.c - i haven't
> figured out the right pattern for that yet.

I assume you will be as LSF/MM? Would be good to discuss some of that in
person.

> 
>>> 1) as you note, removing it from the default bitmaps, which is actually
>>>    hard.  You can't remove it from the possible-node bitmap, so that
>>>    just seemed non-tractable.
>>
>> What about making people use a different set of bitmaps here? Quite some
>> work, but maybe that's the right direction given that we'll now treat
>> some nodes differently.
>>
> 
> It's an option, although it is fragile.  That means having to police all
> future users of possible-nodes and for_each_node and etc.
> 
> I've been err'ing on the side of "not fragile", but i'm open to rework.
> 
>>>
>>> 2) __GFP_THISNODE actually means (among other things) "don't fallback".
>>>    And, in fact, there are some hotplug-time allocations that occur in
>>>    SLAB (pglist_data) that target the private node that *must* fallback
>>>    to successfully allocate for successful kernel operation.
>>
>>
>> Can you point me at the code?
>>
> 
> There is actually a comment in slub.c that addresses this directly:
> 
> static int slab_mem_going_online_callback(int nid)
> {
> ... snip ...
> 	/*
> 	 * XXX: kmem_cache_alloc_node will fallback to other nodes
> 	 *      since memory is not yet available from the node that
> 	 *      is brought up.
> 	 */
> 	n = kmem_cache_alloc(kmem_cache_node, GFP_KERNEL);
> ... snip ...
> }
> 
> Slab basically acknowledges the behavior is required on existing nodes
> and just falls back immediately for the "going online" path.
> 
> Other specific calls in the hotplug path:
> 
>   mm/sparse.c:           kzalloc_node(size, GFP_KERNEL, nid)
>   mm/sparse-vmemmap.c:   alloc_pages_node(nid, GFP_KERNEL|...)
>   mm/slub.c:             kmalloc_node(sizeof(*barn), GFP_KERNEL, nid)
> 
> There are quite a number of callers to kmem_cache_alloc_node() that
> would have to be individually audited.
> 
> And some non-slab interfaces examples as well:
> 	alloc_shrinker_info
> 	alloc_node_nr_active
> 
> I've been looking at this for a while, but I'm starting to think trying
> to touch all this surface area is simply too fragile compared to just
> letting normal memory be a fallback for private nodes and adding:
> 
>       __GFP_PRIVATE   - unlock's private node, but allow fallback
> #define GFP_PRIVATE   (__GFP_PRIVATE | __GFP_THISNODE) - only this node
> 
> __GFP_PRIVATE vs GFP_PRIVATE then is just a matter of use case.
> 
> For mbind() it probably makes sense we'd use GFP_PRIVATE - either it
> succeeds or it OOMs.

Needs a second thought regarding fallback logic I raised above.

What I think would have to be audited is the usage of __GFP_THISNODE by
kernel allocations, where we would not actually want to allocate from
this private node.

Maybe we could just outright refuse *any* non-user (movable) allocations
that target the node, even with __GFP_THISNODE.

Because, why would we want kernel allocations to even end up on a
private node that is supposed to only be consumed by user space? Or
which use cases are there where we would want to place kernel
allocations on there?

Needs a second thought, hoping we can discuss that in person.

> 
>>> The flexibility is kind of the point :]
>>
>> Yeah, but it would be interesting which minimal support we would need to
>> just let some special memory be managed by the kernel, allowing mbind()
>> users to use it, but not have any other fallback allocations end up on it.
>>
>> Something very basic, on which we could build additional functionality.
>>
> 
> I actually have a simplistic CXL driver that does exactly this:
> https://github.com/gourryinverse/linux/blob/072ecf7cbebd9871e76c0b52fd99aa1321405a59/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c#L65
> 

Great.

> We have to support migration because mbind can migrate on bind if the
> VMA already has memory - but all this means is the migrate interfaces
> are live - not that the kernel actually uses them.
> 
> so mbind requires (OPS_MIGRATE | OPS_MEMPOLICY)
> 
> All these flags say is:
>    - move_pages() syscalls can accept these nodes
>    - migrate_pages() function calls can accept these nodes
>    - mempolicy.c nodemasks allow the nodes (should restrict to mbind)
>    - vma's with these nodes now inject __GFP_PRIVATE on fault
> 
> All other services (reclaim, compaction, khugepaged, etc) do not scan
> these nodes and do not know about __GFP_PRIVATE, so they never see
> private node folios and can't allocate from the node.
> 
> In this example, all migrate_to() really does is inject __GFP_THISNODE,
> but I've been thinking about whether we can just do this in migrate.c
> and leave implementing the .ops to a user that requires is.
> 
> But otherwise "it just works".
> 
> One note here though - OOM conditions and allocation failures are not
> intuitive, especially when THP/non-order-0 allocations are involved.
> 
> But that might just mean this minimal setup should only allow order-0
> allocations - which is fiiiiiiiiiiiiiine :P.


Again, I am not sure about compaction and khugepaged. All we want to
guarantee is that our memory does not leave the private node.

That doesn't require any __GFP_PRIVATE magic, just en-lighting these
subsuystems that private nodes must use __GFP_THISNODE and must not leak
to other nodes.

> 
> -----------------
> 
> For basic examples
> 
> I've implemented 4 examples to consider building on:
> 
>   1) CXL mempolicy driver:
>      https://github.com/gourryinverse/linux/blob/072ecf7cbebd9871e76c0b52fd99aa1321405a59/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c#L65
> 
>      As described above
> 
>   2) Virtio-net / CXL.mem Network Card
>      (Not published yet)
> 
>      This doesn't require any ops at all - the plumbing happens entirely
>      inside the kernel.  I onlined the node with an SRAT hack and no ops
>      structure at all associated with the device (just set node affinity
>      to the pcie_dev and plumbed it through the network stack).
> 
>      A proper implementation would have virtio-net register is own
>      reserved memory region and online it during probe.
>   
>   3) Accelerator
>      (Not published yet)
> 
>      I have converted an open source but out of tree GPU driver which
>      uses NUMA nodes to use private nodes.  This required:
>             NP_OPS_MIGRATION
>             NP_OPS_MEMPOLICY
> 
>      The pattern is very similar to the CXL mempolicy driver, except
>      that the driver had alloc_pages_node() calls that needed to have
>      __GFP_PRIVATE added to ensure allocations landed on the device.
> 
> 
>   4) CXL Compressed RAM driver:
>      https://github.com/gourryinverse/linux/blob/55c06eb6bced58132d9001e318f2958e8ac80614/mm/cram.c#L340
>      needs pretty much everything - it's "normal memory" with access
>      rules, so the driver isn't really in the management lifecycle.
> 
>      In this example - the only way to allocate memory on the node is
>      via demotion.  This allows us to close off the device to new
>      allocations if the hardware reports low memory but the OS percieves
>      the device to still have free memory.
> 
>      Which is a cool example:  The driver just sets up the node with
>      certain attributes and then lets the kernel deal with it.
> 
> 
> I have started compacting the _OPS_* flags related to reclaim into a
> single NP_OPS_RECLAIM flag while testing with this.  Really i've come
> around to thinking many mm/ services need to be taken as a package,
> not fully piecemeal.
> 
> The tl;dr: Once you cede some control over to the kernel, you're
> very close to ceding ALL control, but you still get some control
> over how/when allocations on the node can be made.
> 
> 
> It is important to note that even if we don't expose callbacks, we do
> still need a modicum of node filtering in some places that still use
> for_each_node() (vmscan.c, compaction.c, oom_kill.c, etc).
> 
> These are basically all the places ZONE_DEVICE *implicitly* opts itself
> out of by having managed_pages=0.  We have to make those situations
> explicit - but that doesn't mean we need callbacks.
> 
>>>
>>> I would simply state: "That depends on the memory device"
>>
>> Let's keep it very simple: just some memory that you mbind(), and you
>> only want the mbind() user to make use of that memory.
>>
>> What would be the minimal set of hooks to guarantee that.
>>
> 
> If you want the mbind contract to stay intact:
> 
>    NP_OPS_MIGRATION (mbind can generate migrations)
>    NP_OPS_MEMPOLICY (this just tells mempolicy.c to allow the node)

I'm missing why these are even opt-in. What's the problem with allowing
mbind and mempolicy to use these nodes in some of your drivers?

I also have some questions about longterm pinnings, but that's better
discussed in person :)

> 
> The set of callbacks required should be exactly 0 (assuming we teach
> migrate.c to inject __GFP_PRIVATE like we have mempolicy.c).
> 
> If your device requires some special notification on allocation, free
> or migration to/from you need:
> 
>    ops.free_folio(folio)
>    ops.migrate_to(folios, nid, mode, reason, nr_success)
>    ops.migrate_folio(src_folio, dst_folio)
> 
> The free path is the tricky one to get right.  You can imagine:
> 
>    buf = malloc(...);
>    mbind(buf, private_node);
>    memset(buf, 0x42, ...);
>    ioctl(driver, CHECK_OUT_THIS_DATA, buf); 
>    exit(0);
> 
> The task dies and frees the pages back to the buddy - the question is
> whether the 4-5 free_folio paths (put_folio, put_unref_folios, etc) can
> all eat an ops.free_folio() callback to inform the driver the memory has
> been freed.

Right, that's rather invasive.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v6 2/5] tracing/fprobe: Remove fprobe from hash in failure path
From: Menglong Dong @ 2026-04-15  9:47 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Masami Hiramatsu (Google)
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177615809677.1165997.619922394559783590.stgit@mhiramat.tok.corp.google.com>

On 2026/4/14 17:14 Masami Hiramatsu (Google) <mhiramat@kernel.org> write:
> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> When register_fprobe_ips() fails, it tries to remove a list of
> fprobe_hash_node from fprobe_ip_table, but it missed to remove
> fprobe itself from fprobe_table. Moreover, when removing
> the fprobe_hash_node which is added to rhltable once, it must
> use kfree_rcu() after removing from rhltable.
> 
> To fix these issues, this reuses unregister_fprobe() internal
> code to rollback the half-way registered fprobe.
> 
> Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
> Cc: stable@vger.kernel.org
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
[...]
>  
> +static int unregister_fprobe_nolock(struct fprobe *fp, bool force);
> +
>  /**
>   * register_fprobe_ips() - Register fprobe to ftrace by address.
>   * @fp: A fprobe data structure to be registered.
> @@ -847,29 +855,26 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
>  	if (ret)
>  		return ret;

Hi, Masami. The logic of unregister_fprobe_nolock() looks a little
messy. How about we make the logic here like this:

for (i = 0; i < hlist_array->size; i++) {
    // The node->fp is NULL, so it's safe to add the node before
    // fprobe_ftrace_add_ips(), right?
    ret = insert_fprobe_node(&hlist_array->array[i], fp);
    if (ret)
        goto fallback_err;
}

if (fprobe_is_ftrace(fp))
    ret = fprobe_ftrace_add_ips(addrs, num);
else
    ret = fprobe_graph_add_ips(addrs, num);
if (ret)
    goto fallback_err;

add_fprobe_hash(fp);
for (i = 0; i < hlist_array->size; i++)
    WRITE_ONCE(hlist_array->array[i].fp, fp);

return 0;

fallback_err:
for (i--; i >= 0; i--)
    delete_fprobe_node(&hlist_array->array[i]);
fprobe_fail_cleanup(fp);
return ret;

Then, we don't need to change unregister_fprobe_nolock and
insert_fprobe_node.

Thanks!
Menglong Dong

>  
> -	hlist_array = fp->hlist_array;
>  	if (fprobe_is_ftrace(fp))
>  		ret = fprobe_ftrace_add_ips(addrs, num);
>  	else
>  		ret = fprobe_graph_add_ips(addrs, num);
> +	if (ret) {
> +		fprobe_fail_cleanup(fp);
> +		return ret;
> +	}
>  
> -	if (!ret) {
> -		add_fprobe_hash(fp);
> -		for (i = 0; i < hlist_array->size; i++) {
> -			ret = insert_fprobe_node(&hlist_array->array[i]);
> -			if (ret)
> -				break;
> -		}
> -		/* fallback on insert error */
> +	hlist_array = fp->hlist_array;
> +	add_fprobe_hash(fp);
> +	for (i = 0; i < hlist_array->size; i++) {
> +		ret = insert_fprobe_node(&hlist_array->array[i], fp);
>  		if (ret) {
> -			for (i--; i >= 0; i--)
> -				delete_fprobe_node(&hlist_array->array[i]);
> +			if (unregister_fprobe_nolock(fp, true))
> +				pr_warn("Failed to cleanup fprobe after insertion failure.\n");
> +			break;
>  		}
>  	}
>  
> -	if (ret)
> -		fprobe_fail_cleanup(fp);
> -
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(register_fprobe_ips);
> @@ -912,37 +917,29 @@ bool fprobe_is_registered(struct fprobe *fp)
>  	return true;
>  }
>  
> -/**
> - * unregister_fprobe() - Unregister fprobe.
> - * @fp: A fprobe data structure to be unregistered.
> - *
> - * Unregister fprobe (and remove ftrace hooks from the function entries).
> - *
> - * Return 0 if @fp is unregistered successfully, -errno if not.
> - */
> -int unregister_fprobe(struct fprobe *fp)
> +static int unregister_fprobe_nolock(struct fprobe *fp, bool force)
>  {
> -	struct fprobe_hlist *hlist_array;
> +	struct fprobe_hlist *hlist_array = fp->hlist_array;
>  	unsigned long *addrs = NULL;
> -	int ret = 0, i, count;
> +	int i, count;
>  
> -	mutex_lock(&fprobe_mutex);
> -	if (!fp || !fprobe_registered(fp)) {
> -		ret = -EINVAL;
> -		goto out;
> -	}
> -
> -	hlist_array = fp->hlist_array;
>  	addrs = kcalloc(hlist_array->size, sizeof(unsigned long), GFP_KERNEL);
> -	if (!addrs) {
> -		ret = -ENOMEM;	/* TODO: Fallback to one-by-one loop */
> -		goto out;
> -	}
> +	if (!addrs && !force)
> +		return -ENOMEM;
> +	/*
> +	 * If @force is set, this function will remove fprobe_hash_node
> +	 * from the hash table even if memory allocation fails. However,
> +	 * ftrace_ops will not be updated. Anyway, when the last fprobe
> +	 * is unregistered, ftrace_ops is also unregistered.
> +	 */
>  
>  	/* Remove non-synonim ips from table and hash */
>  	count = 0;
>  	for (i = 0; i < hlist_array->size; i++) {
> -		if (!delete_fprobe_node(&hlist_array->array[i]))
> +		if (delete_fprobe_node(&hlist_array->array[i]))
> +			continue;
> +
> +		if (addrs)
>  			addrs[count++] = hlist_array->array[i].addr;
>  	}
>  	del_fprobe_hash(fp);
> @@ -951,15 +948,35 @@ int unregister_fprobe(struct fprobe *fp)
>  		fprobe_ftrace_remove_ips(addrs, count);
>  	else
>  		fprobe_graph_remove_ips(addrs, count);
> +	/*
> +	 * If count == 0, instead of calling ftrace_set_filter_ips(),
> +	 * we must wait for RCU grace period to finish del_fprobe_hash().
> +	 */
> +	if (!count)
> +		synchronize_rcu();
>  
>  	kfree_rcu(hlist_array, rcu);
>  	fp->hlist_array = NULL;
> +	kfree(addrs);
>  
> -out:
> -	mutex_unlock(&fprobe_mutex);
> +	return !addrs ? -ENOMEM : 0;
> +}
>  
> -	kfree(addrs);
> -	return ret;
> +/**
> + * unregister_fprobe() - Unregister fprobe.
> + * @fp: A fprobe data structure to be unregistered.
> + *
> + * Unregister fprobe (and remove ftrace hooks from the function entries).
> + *
> + * Return 0 if @fp is unregistered successfully, -errno if not.
> + */
> +int unregister_fprobe(struct fprobe *fp)
> +{
> +	guard(mutex)(&fprobe_mutex);
> +	if (!fp || !fprobe_registered(fp))
> +		return -EINVAL;
> +
> +	return unregister_fprobe_nolock(fp, false);
>  }
>  EXPORT_SYMBOL_GPL(unregister_fprobe);
>  
> 
> 
> 





^ permalink raw reply

* Re: [PATCH] ring-buffer: Preserve true payload lengths in long data events
From: kernel test robot @ 2026-04-15  9:29 UTC (permalink / raw)
  To: Cao Ruichuang, rostedt, mhiramat
  Cc: llvm, oe-kbuild-all, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260407091550.67963-1-create0818@163.com>

Hi Cao,

kernel test robot noticed the following build warnings:

[auto build test WARNING on trace/for-next]
[also build test WARNING on linus/master v7.0 next-20260414]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Cao-Ruichuang/ring-buffer-Preserve-true-payload-lengths-in-long-data-events/20260415-040554
base:   https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link:    https://lore.kernel.org/r/20260407091550.67963-1-create0818%40163.com
patch subject: [PATCH] ring-buffer: Preserve true payload lengths in long data events
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20260415/202604151102.OQ1HyH3F-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260415/202604151102.OQ1HyH3F-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202604151102.OQ1HyH3F-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Warning: kernel/trace/ring_buffer.c:4786 function parameter 'force_long' not described in '__ring_buffer_lock_reserve'
>> Warning: kernel/trace/ring_buffer.c:4786 expecting prototype for ring_buffer_lock_reserve(). Prototype was for __ring_buffer_lock_reserve() instead

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH v3 09/11] dt-bindings: input: Document hid-over-spi DT schema
From: Benjamin Tissoires @ 2026-04-15  8:41 UTC (permalink / raw)
  To: Rob Herring, Dmitry Torokhov
  Cc: Conor Dooley, Jingyuan Liang, Jiri Kosina, Jonathan Corbet,
	Mark Brown, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Krzysztof Kozlowski, Conor Dooley, linux-input, linux-doc,
	linux-kernel, linux-spi, linux-trace-kernel, devicetree, hbarnor,
	tfiga, Dmitry Antipov, Jarrett Schultz
In-Reply-To: <20260413223439.GA3647847-robh@kernel.org>

On Apr 13 2026, Rob Herring wrote:
> On Fri, Apr 10, 2026 at 06:35:00PM +0100, Conor Dooley wrote:
> > On Thu, Apr 09, 2026 at 10:16:46AM -0700, Dmitry Torokhov wrote:
> > > On Thu, Apr 09, 2026 at 05:02:11PM +0100, Conor Dooley wrote:
> > > > On Thu, Apr 02, 2026 at 01:59:46AM +0000, Jingyuan Liang wrote:
> > > > > Documentation describes the required and optional properties for
> > > > > implementing Device Tree for a Microsoft G6 Touch Digitizer that
> > > > > supports HID over SPI Protocol 1.0 specification.
> > > > > 
> > > > > The properties are common to HID over SPI.
> > > > > 
> > > > > Signed-off-by: Dmitry Antipov <dmanti@microsoft.com>
> > > > > Signed-off-by: Jarrett Schultz <jaschultz@microsoft.com>
> > > > > Signed-off-by: Jingyuan Liang <jingyliang@chromium.org>
> > > > > ---
> > > > >  .../devicetree/bindings/input/hid-over-spi.yaml    | 126 +++++++++++++++++++++
> > > > >  1 file changed, 126 insertions(+)
> > > > > 
> > > > > diff --git a/Documentation/devicetree/bindings/input/hid-over-spi.yaml b/Documentation/devicetree/bindings/input/hid-over-spi.yaml
> > > > > new file mode 100644
> > > > > index 000000000000..d1b0a2e26c32
> > > > > --- /dev/null
> > > > > +++ b/Documentation/devicetree/bindings/input/hid-over-spi.yaml
> > > > > @@ -0,0 +1,126 @@
> > > > > +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> > > > > +%YAML 1.2
> > > > > +---
> > > > > +$id: http://devicetree.org/schemas/input/hid-over-spi.yaml#
> > > > > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > > > > +
> > > > > +title: HID over SPI Devices
> > > > > +
> > > > > +maintainers:
> > > > > +  - Benjamin Tissoires <benjamin.tissoires@redhat.com>
> > > > > +  - Jiri Kosina <jkosina@suse.cz>
> > > > 
> > > > Why them and not you, the developers of the series?
> > > > 
> > > > > +
> > > > > +description: |+
> > > > > +  HID over SPI provides support for various Human Interface Devices over the
> > > > > +  SPI bus. These devices can be for example touchpads, keyboards, touch screens
> > > > > +  or sensors.
> > > > > +
> > > > > +  The specification has been written by Microsoft and is currently available
> > > > > +  here: https://www.microsoft.com/en-us/download/details.aspx?id=103325
> > > > > +
> > > > > +  If this binding is used, the kernel module spi-hid will handle the
> > > > > +  communication with the device and the generic hid core layer will handle the
> > > > > +  protocol.
> > > > 
> > > > This is not relevant to the binding, please remove it.
> > > > 
> > > > > +
> > > > > +allOf:
> > > > > +  - $ref: /schemas/input/touchscreen/touchscreen.yaml#
> > > > > +
> > > > > +properties:
> > > > > +  compatible:
> > > > > +    oneOf:
> > > > > +      - items:
> > > > > +          - enum:
> > > > > +              - microsoft,g6-touch-digitizer
> > > > > +          - const: hid-over-spi
> > > > > +      - description: Just "hid-over-spi" alone is allowed, but not recommended.
> > > > > +        const: hid-over-spi
> > > > 
> > > > Why is it allowed but not recommended? Seems to me like we should
> > > > require device-specific compatibles.
> > > 
> > > Why would we want to change the driver code to add a new compatible each
> > > time a vendor decides to create a chip that is fully hid-spi-protocol
> > > compliant? Or is the plan to still allow "hid-over-spi" fallback but
> > > require device-specific compatible that will be ignored unless there is
> > > device-specific quirk needed?
> 
> The plan is the latter case (the 1st entry up above). The comment is 
> remove the 2nd entry (with 'Just "hid-over-spi" alone is allowed, but 
> not recommended.').
> 
> > This has nothing to do with the driver, just the oddity of having a
> > comment saying that not having a device specific compatible was
> > permitted by not recommended in a binding. Requiring device-specific
> > compatibles is the norm after all and a comment like this makes draws
> > more attention to the fact that this is abnormal. Regardless of what the
> > driver does, device-specific compatibles should be required.
> > 
> > > > > +
> > > > > +  reg:
> > > > > +    maxItems: 1
> > > > > +
> > > > > +  interrupts:
> > > > > +    maxItems: 1
> > > > > +
> > > > > +  reset-gpios:
> > > > > +    maxItems: 1
> > > > > +    description:
> > > > > +      GPIO specifier for the digitizer's reset pin (active low). The line must
> > > > > +      be flagged with GPIO_ACTIVE_LOW.
> > > > > +
> > > > > +  vdd-supply:
> > > > > +    description:
> > > > > +      Regulator for the VDD supply voltage.
> > > > > +
> > > > > +  input-report-header-address:
> > > > > +    $ref: /schemas/types.yaml#/definitions/uint32
> > > > > +    minimum: 0
> > > > > +    maximum: 0xffffff
> > > > > +    description:
> > > > > +      A value to be included in the Read Approval packet, listing an address of
> > > > > +      the input report header to be put on the SPI bus. This address has 24
> > > > > +      bits.
> > > > > +
> > > > > +  input-report-body-address:
> > > > > +    $ref: /schemas/types.yaml#/definitions/uint32
> > > > > +    minimum: 0
> > > > > +    maximum: 0xffffff
> > > > > +    description:
> > > > > +      A value to be included in the Read Approval packet, listing an address of
> > > > > +      the input report body to be put on the SPI bus. This address has 24 bits.
> > > > > +
> > > > > +  output-report-address:
> > > > > +    $ref: /schemas/types.yaml#/definitions/uint32
> > > > > +    minimum: 0
> > > > > +    maximum: 0xffffff
> > > > > +    description:
> > > > > +      A value to be included in the Output Report sent by the host, listing an
> > > > > +      address where the output report on the SPI bus is to be written to. This
> > > > > +      address has 24 bits.
> > > > > +
> > > > > +  read-opcode:
> > > > > +    $ref: /schemas/types.yaml#/definitions/uint8
> > > > > +    description:
> > > > > +      Value to be used in Read Approval packets. 1 byte.
> > > > > +
> > > > > +  write-opcode:
> > > > > +    $ref: /schemas/types.yaml#/definitions/uint8
> > > > > +    description:
> > > > > +      Value to be used in Write Approval packets. 1 byte.
> > > > 
> > > > Why can none of these things be determined from the device's compatible?
> > > > On the surface, they like the kinds of things that could/should be.
> > > 
> > > Why would we want to keep tables of these values in the kernel and again
> > > have to update the driver for each new chip?
> > 
> > That's pretty normal though innit? It's what match data does.
> > If someone wants to have properties that communicate data that
> > can be determined from the compatible, they need to provide
> > justification why it is being done.
> 
> IIRC, it was explained in prior versions the spec itself says these 
> values vary by device. If we expect variation, then I think these 
> properties are fine. But please capture the reasoning for them in this 
> patch or we will just keep asking the same questions over and over. 
> 

Dmitry, FWIW, we roughly had the exact same of argument with Rob with
i2c-hid :)

It took me a while, but I finally understood the rationale and agreed to
it (using the i2c-hid examples):

Most i2c-hid devices are following the spec and rely purely on ACPI and
the DT declaration of i2c-hid -> they are working fine and we don't need
anything else for them. They declare their compatible and the i2c-hid
compatible, and they work great.

But some devices need a reset line. But the i2c-hid spec doesn't mention
a reset line at all. And some other devices need a reset line with a
different timing. etc... Relying purely on the i2c-hid driver means that
the driver needs to now the platform the device is on and the exact
device we have in front of us. i2c-hid provide a VID/PID through the
protocol, but we are still lacking information: in some cases, the
timing of the reset line for the same device differs depending on the
platform they run.

Having a device specific compatible means that we can make use of it
before we load i2c-hid. This is why we have i2c-hid-core and module
specifics on the side. Those extra module can do all the oddities they
want, like having 2 or 3 reset lines, but in the end they are using the
core i2c-hid once they are set up.

Think of it as a way to quirk the device upfront without polluting the
i2c-hid processing.

That allowed us to clean up the i2c-hid code by removing the non
standard regulators, reset lines, quirks that are device specific and
keep it closer to the spec.

Cheers,
Benjamin

^ permalink raw reply

* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: Christoph Hellwig @ 2026-04-15  7:40 UTC (permalink / raw)
  To: chensong_2000
  Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
	jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
	mark.rutland, mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260415070137.17860-1-chensong_2000@189.cn>

On Wed, Apr 15, 2026 at 03:01:37PM +0800, chensong_2000@189.cn wrote:
> diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
> index 132a9df98471..b776dbd5a382 100644
> --- a/drivers/acpi/sleep.c
> +++ b/drivers/acpi/sleep.c
> @@ -56,7 +56,6 @@ static int tts_notify_reboot(struct notifier_block *this,
>  
>  static struct notifier_block tts_notifier = {
>  	.notifier_call	= tts_notify_reboot,
> -	.next		= NULL,
>  	.priority	= 0,

IFF this becomes important for some reason (and right now I don't see
it), please start by using proper wrappers for notifiers so that the
implementation details don't leak into the users.  That would actually
be useful on it's own even.


^ permalink raw reply

* Re: [RFC PATCH 0/2] Decouple ftrace/livepatch from module loader via notifier priority and reverse traversal
From: Christoph Hellwig @ 2026-04-15  7:38 UTC (permalink / raw)
  To: chensong_2000
  Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
	jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
	mark.rutland, mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260413080140.180616-1-chensong_2000@189.cn>

On Mon, Apr 13, 2026 at 04:01:40PM +0800, chensong_2000@189.cn wrote:
> From: Song Chen <chensong_2000@189.cn>
> 
> This patchset addresses a long-standing tight coupling between the
> module loader and two of its key consumers: ftrace and livepatch.
> 
> Background:
> 
> The module loader currently hard-codes direct calls to
> ftrace_module_enable(), klp_module_coming(), klp_module_going() and
> ftrace_release_mod() inside prepare_coming_module() and the module
> unload path.

And that is bad why?

>  13 files changed, 290 insertions(+), 74 deletions(-)

This is a lot of new complex code touching a lot of places for no obvious
gain.  What is the reason for this series?  Does it prepare for something
else?


^ permalink raw reply

* [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: chensong_2000 @ 2026-04-15  7:01 UTC (permalink / raw)
  To: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
	jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
	mark.rutland, mathieu.desnoyers
  Cc: linux-modules, linux-kernel, linux-trace-kernel, linux-acpi,
	linux-clk, linux-pm, live-patching, dm-devel, linux-raid,
	kgdb-bugreport, netdev, Song Chen

From: Song Chen <chensong_2000@189.cn>

The current notifier chain implementation uses a single-linked list
(struct notifier_block *next), which only supports forward traversal
in priority order. This makes it difficult to handle cleanup/teardown
scenarios that require notifiers to be called in reverse priority order.

A concrete example is the ordering dependency between ftrace and
livepatch during module load/unload. see the detail here [1].

This patch replaces the single-linked list in struct notifier_block
with a struct list_head, converting the notifier chain into a
doubly-linked list sorted in descending priority order. Based on
this, a new function notifier_call_chain_reverse() is introduced,
which traverses the chain in reverse (ascending priority order).
The corresponding blocking_notifier_call_chain_reverse() is also
added as the locking wrapper for blocking notifier chains.

The internal notifier_call_chain_robust() is updated to use
notifier_call_chain_reverse() for rollback: on error, it records
the failing notifier (last_nb) and the count of successfully called
notifiers (nr), then rolls back exactly those nr-1 notifiers in
reverse order starting from last_nb's predecessor, without needing
to know the total length of the chain.

With this change, subsystems with symmetric setup/teardown ordering
requirements can register a single notifier_block with one priority
value, and rely on blocking_notifier_call_chain() for forward
traversal and blocking_notifier_call_chain_reverse() for reverse
traversal, without needing hard-coded call sequences or separate
notifier registrations for each direction.

[1]:https://lore.kernel.org/all
	/alpine.LNX.2.00.1602172216491.22700@cbobk.fhfr.pm/

Signed-off-by: Song Chen <chensong_2000@189.cn>
---
 drivers/acpi/sleep.c      |   1 -
 drivers/clk/clk.c         |   2 +-
 drivers/cpufreq/cpufreq.c |   2 +-
 drivers/md/dm-integrity.c |   1 -
 drivers/md/md.c           |   1 -
 include/linux/notifier.h  |  26 ++---
 kernel/debug/debug_core.c |   1 -
 kernel/notifier.c         | 219 ++++++++++++++++++++++++++++++++------
 net/ipv4/nexthop.c        |   2 +-
 9 files changed, 201 insertions(+), 54 deletions(-)

diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
index 132a9df98471..b776dbd5a382 100644
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -56,7 +56,6 @@ static int tts_notify_reboot(struct notifier_block *this,
 
 static struct notifier_block tts_notifier = {
 	.notifier_call	= tts_notify_reboot,
-	.next		= NULL,
 	.priority	= 0,
 };
 
diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
index 47093cda9df3..b6fe380d0468 100644
--- a/drivers/clk/clk.c
+++ b/drivers/clk/clk.c
@@ -4862,7 +4862,7 @@ int clk_notifier_unregister(struct clk *clk, struct notifier_block *nb)
 			clk->core->notifier_count--;
 
 			/* XXX the notifier code should handle this better */
-			if (!cn->notifier_head.head) {
+			if (list_empty(&cn->notifier_head.head)) {
 				srcu_cleanup_notifier_head(&cn->notifier_head);
 				list_del(&cn->node);
 				kfree(cn);
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 277884d91913..12637e742ffa 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -445,7 +445,7 @@ static void cpufreq_list_transition_notifiers(void)
 
 	mutex_lock(&cpufreq_transition_notifier_list.mutex);
 
-	for (nb = cpufreq_transition_notifier_list.head; nb; nb = nb->next)
+	list_for_each_entry(nb, &cpufreq_transition_notifier_list.head, entry)
 		pr_info("%pS\n", nb->notifier_call);
 
 	mutex_unlock(&cpufreq_transition_notifier_list.mutex);
diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
index 06e805902151..ccdf75c40b62 100644
--- a/drivers/md/dm-integrity.c
+++ b/drivers/md/dm-integrity.c
@@ -3909,7 +3909,6 @@ static void dm_integrity_resume(struct dm_target *ti)
 	}
 
 	ic->reboot_notifier.notifier_call = dm_integrity_reboot;
-	ic->reboot_notifier.next = NULL;
 	ic->reboot_notifier.priority = INT_MAX - 1;	/* be notified after md and before hardware drivers */
 	WARN_ON(register_reboot_notifier(&ic->reboot_notifier));
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3ce6f9e9d38e..8249e78636ab 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -10480,7 +10480,6 @@ static int md_notify_reboot(struct notifier_block *this,
 
 static struct notifier_block md_notifier = {
 	.notifier_call	= md_notify_reboot,
-	.next		= NULL,
 	.priority	= INT_MAX, /* before any real devices */
 };
 
diff --git a/include/linux/notifier.h b/include/linux/notifier.h
index 01b6c9d9956f..b2abbdfcaadd 100644
--- a/include/linux/notifier.h
+++ b/include/linux/notifier.h
@@ -53,41 +53,41 @@ typedef	int (*notifier_fn_t)(struct notifier_block *nb,
 
 struct notifier_block {
 	notifier_fn_t notifier_call;
-	struct notifier_block __rcu *next;
+	struct list_head __rcu entry;
 	int priority;
 };
 
 struct atomic_notifier_head {
 	spinlock_t lock;
-	struct notifier_block __rcu *head;
+	struct list_head __rcu head;
 };
 
 struct blocking_notifier_head {
 	struct rw_semaphore rwsem;
-	struct notifier_block __rcu *head;
+	struct list_head __rcu head;
 };
 
 struct raw_notifier_head {
-	struct notifier_block __rcu *head;
+	struct list_head __rcu head;
 };
 
 struct srcu_notifier_head {
 	struct mutex mutex;
 	struct srcu_usage srcuu;
 	struct srcu_struct srcu;
-	struct notifier_block __rcu *head;
+	struct list_head __rcu head;
 };
 
 #define ATOMIC_INIT_NOTIFIER_HEAD(name) do {	\
 		spin_lock_init(&(name)->lock);	\
-		(name)->head = NULL;		\
+		INIT_LIST_HEAD(&(name)->head);		\
 	} while (0)
 #define BLOCKING_INIT_NOTIFIER_HEAD(name) do {	\
 		init_rwsem(&(name)->rwsem);	\
-		(name)->head = NULL;		\
+		INIT_LIST_HEAD(&(name)->head);		\
 	} while (0)
 #define RAW_INIT_NOTIFIER_HEAD(name) do {	\
-		(name)->head = NULL;		\
+		INIT_LIST_HEAD(&(name)->head);		\
 	} while (0)
 
 /* srcu_notifier_heads must be cleaned up dynamically */
@@ -97,17 +97,17 @@ extern void srcu_init_notifier_head(struct srcu_notifier_head *nh);
 
 #define ATOMIC_NOTIFIER_INIT(name) {				\
 		.lock = __SPIN_LOCK_UNLOCKED(name.lock),	\
-		.head = NULL }
+		.head = LIST_HEAD_INIT((name).head) }
 #define BLOCKING_NOTIFIER_INIT(name) {				\
 		.rwsem = __RWSEM_INITIALIZER((name).rwsem),	\
-		.head = NULL }
+		.head = LIST_HEAD_INIT((name).head) }
 #define RAW_NOTIFIER_INIT(name)	{				\
-		.head = NULL }
+		.head = LIST_HEAD_INIT((name).head) }
 
 #define SRCU_NOTIFIER_INIT(name, pcpu)				\
 	{							\
 		.mutex = __MUTEX_INITIALIZER(name.mutex),	\
-		.head = NULL,					\
+		.head = LIST_HEAD_INIT((name).head),					\
 		.srcuu = __SRCU_USAGE_INIT(name.srcuu),		\
 		.srcu = __SRCU_STRUCT_INIT(name.srcu, name.srcuu, pcpu, 0), \
 	}
@@ -170,6 +170,8 @@ extern int atomic_notifier_call_chain(struct atomic_notifier_head *nh,
 		unsigned long val, void *v);
 extern int blocking_notifier_call_chain(struct blocking_notifier_head *nh,
 		unsigned long val, void *v);
+extern int blocking_notifier_call_chain_reverse(struct blocking_notifier_head *nh,
+		unsigned long val, void *v);
 extern int raw_notifier_call_chain(struct raw_notifier_head *nh,
 		unsigned long val, void *v);
 extern int srcu_notifier_call_chain(struct srcu_notifier_head *nh,
diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index 0b9495187fba..a26a7683d142 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -1054,7 +1054,6 @@ dbg_notify_reboot(struct notifier_block *this, unsigned long code, void *x)
 
 static struct notifier_block dbg_reboot_notifier = {
 	.notifier_call		= dbg_notify_reboot,
-	.next			= NULL,
 	.priority		= INT_MAX,
 };
 
diff --git a/kernel/notifier.c b/kernel/notifier.c
index 2f9fe7c30287..6f4d887771c4 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -14,39 +14,47 @@
  *	are layered on top of these, with appropriate locking added.
  */
 
-static int notifier_chain_register(struct notifier_block **nl,
+static int notifier_chain_register(struct list_head *nl,
 				   struct notifier_block *n,
 				   bool unique_priority)
 {
-	while ((*nl) != NULL) {
-		if (unlikely((*nl) == n)) {
+	struct notifier_block *cur;
+
+	list_for_each_entry(cur, nl, entry) {
+		if (unlikely(cur == n)) {
 			WARN(1, "notifier callback %ps already registered",
 			     n->notifier_call);
 			return -EEXIST;
 		}
-		if (n->priority > (*nl)->priority)
-			break;
-		if (n->priority == (*nl)->priority && unique_priority)
+
+		if (n->priority == cur->priority && unique_priority)
 			return -EBUSY;
-		nl = &((*nl)->next);
+
+		if (n->priority > cur->priority) {
+			list_add_tail(&n->entry, &cur->entry);
+			goto out;
+		}
 	}
-	n->next = *nl;
-	rcu_assign_pointer(*nl, n);
+
+	list_add_tail(&n->entry, nl);
+out:
 	trace_notifier_register((void *)n->notifier_call);
 	return 0;
 }
 
-static int notifier_chain_unregister(struct notifier_block **nl,
+static int notifier_chain_unregister(struct list_head *nl,
 		struct notifier_block *n)
 {
-	while ((*nl) != NULL) {
-		if ((*nl) == n) {
-			rcu_assign_pointer(*nl, n->next);
+	struct notifier_block *cur;
+
+	list_for_each_entry(cur, nl, entry) {
+		if (cur == n) {
+			list_del(&n->entry);
 			trace_notifier_unregister((void *)n->notifier_call);
 			return 0;
 		}
-		nl = &((*nl)->next);
 	}
+
 	return -ENOENT;
 }
 
@@ -59,25 +67,25 @@ static int notifier_chain_unregister(struct notifier_block **nl,
  *			value of this parameter is -1.
  *	@nr_calls:	Records the number of notifications sent. Don't care
  *			value of this field is NULL.
+ *	@last_nb:  Records the last called notifier block for rolling back
  *	Return:		notifier_call_chain returns the value returned by the
  *			last notifier function called.
  */
-static int notifier_call_chain(struct notifier_block **nl,
+static int notifier_call_chain(struct list_head *nl,
 			       unsigned long val, void *v,
-			       int nr_to_call, int *nr_calls)
+			       int nr_to_call, int *nr_calls,
+				   struct notifier_block **last_nb)
 {
 	int ret = NOTIFY_DONE;
-	struct notifier_block *nb, *next_nb;
-
-	nb = rcu_dereference_raw(*nl);
+	struct notifier_block *nb;
 
-	while (nb && nr_to_call) {
-		next_nb = rcu_dereference_raw(nb->next);
+	if (!nr_to_call)
+		return ret;
 
+	list_for_each_entry(nb, nl, entry) {
 #ifdef CONFIG_DEBUG_NOTIFIERS
 		if (unlikely(!func_ptr_is_kernel_text(nb->notifier_call))) {
 			WARN(1, "Invalid notifier called!");
-			nb = next_nb;
 			continue;
 		}
 #endif
@@ -87,15 +95,118 @@ static int notifier_call_chain(struct notifier_block **nl,
 		if (nr_calls)
 			(*nr_calls)++;
 
+		if (last_nb)
+			*last_nb = nb;
+
 		if (ret & NOTIFY_STOP_MASK)
 			break;
-		nb = next_nb;
-		nr_to_call--;
+
+		if (nr_to_call-- == 0)
+			break;
 	}
 	return ret;
 }
 NOKPROBE_SYMBOL(notifier_call_chain);
 
+/**
+ * notifier_call_chain_reverse - Informs the registered notifiers
+ *			about an event reversely.
+ *	@nl:		Pointer to head of the blocking notifier chain
+ *	@val:		Value passed unmodified to notifier function
+ *	@v:		Pointer passed unmodified to notifier function
+ *	@nr_to_call:	Number of notifier functions to be called. Don't care
+ *			value of this parameter is -1.
+ *	@nr_calls:	Records the number of notifications sent. Don't care
+ *			value of this field is NULL.
+ *	Return:		notifier_call_chain returns the value returned by the
+ *			last notifier function called.
+ */
+static int notifier_call_chain_reverse(struct list_head *nl,
+					struct notifier_block *start,
+					unsigned long val, void *v,
+					int nr_to_call, int *nr_calls)
+{
+	int ret = NOTIFY_DONE;
+	struct notifier_block *nb;
+	bool do_call = (start == NULL);
+
+	if (!nr_to_call)
+		return ret;
+
+	list_for_each_entry_reverse(nb, nl, entry) {
+		if (!do_call) {
+			if (nb == start)
+				do_call = true;
+			continue;
+		}
+#ifdef CONFIG_DEBUG_NOTIFIERS
+		if (unlikely(!func_ptr_is_kernel_text(nb->notifier_call))) {
+			WARN(1, "Invalid notifier called!");
+			continue;
+		}
+#endif
+		trace_notifier_run((void *)nb->notifier_call);
+		ret = nb->notifier_call(nb, val, v);
+
+		if (nr_calls)
+			(*nr_calls)++;
+
+		if (ret & NOTIFY_STOP_MASK)
+			break;
+
+		if (nr_to_call-- == 0)
+			break;
+	}
+	return ret;
+}
+NOKPROBE_SYMBOL(notifier_call_chain_reverse);
+
+/**
+ * notifier_call_chain_rcu - Informs the registered notifiers
+ *			about an event for srcu notifier chain.
+ *	@nl:		Pointer to head of the blocking notifier chain
+ *	@val:		Value passed unmodified to notifier function
+ *	@v:		Pointer passed unmodified to notifier function
+ *	@nr_to_call:	Number of notifier functions to be called. Don't care
+ *			value of this parameter is -1.
+ *	@nr_calls:	Records the number of notifications sent. Don't care
+ *			value of this field is NULL.
+ *	Return:		notifier_call_chain returns the value returned by the
+ *			last notifier function called.
+ */
+static int notifier_call_chain_rcu(struct list_head *nl,
+			       unsigned long val, void *v,
+			       int nr_to_call, int *nr_calls)
+{
+	int ret = NOTIFY_DONE;
+	struct notifier_block *nb;
+
+	if (!nr_to_call)
+		return ret;
+
+	list_for_each_entry_rcu(nb, nl, entry) {
+#ifdef CONFIG_DEBUG_NOTIFIERS
+		if (unlikely(!func_ptr_is_kernel_text(nb->notifier_call))) {
+			WARN(1, "Invalid notifier called!");
+			continue;
+		}
+#endif
+		trace_notifier_run((void *)nb->notifier_call);
+		ret = nb->notifier_call(nb, val, v);
+
+		if (nr_calls)
+			(*nr_calls)++;
+
+		if (ret & NOTIFY_STOP_MASK)
+			break;
+
+		if (nr_to_call-- == 0)
+			break;
+	}
+	return ret;
+}
+NOKPROBE_SYMBOL(notifier_call_chain_rcu);
+
 /**
  * notifier_call_chain_robust - Inform the registered notifiers about an event
  *                              and rollback on error.
@@ -111,15 +222,16 @@ NOKPROBE_SYMBOL(notifier_call_chain);
  *
  * Return:	the return value of the @val_up call.
  */
-static int notifier_call_chain_robust(struct notifier_block **nl,
+static int notifier_call_chain_robust(struct list_head *nl,
 				     unsigned long val_up, unsigned long val_down,
 				     void *v)
 {
 	int ret, nr = 0;
+	struct notifier_block *last_nb = NULL;
 
-	ret = notifier_call_chain(nl, val_up, v, -1, &nr);
+	ret = notifier_call_chain(nl, val_up, v, -1, &nr, &last_nb);
 	if (ret & NOTIFY_STOP_MASK)
-		notifier_call_chain(nl, val_down, v, nr-1, NULL);
+		notifier_call_chain_reverse(nl, last_nb, val_down, v, nr-1, NULL);
 
 	return ret;
 }
@@ -220,7 +332,7 @@ int atomic_notifier_call_chain(struct atomic_notifier_head *nh,
 	int ret;
 
 	rcu_read_lock();
-	ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
+	ret = notifier_call_chain(&nh->head, val, v, -1, NULL, NULL);
 	rcu_read_unlock();
 
 	return ret;
@@ -238,7 +350,7 @@ NOKPROBE_SYMBOL(atomic_notifier_call_chain);
  */
 bool atomic_notifier_call_chain_is_empty(struct atomic_notifier_head *nh)
 {
-	return !rcu_access_pointer(nh->head);
+	return list_empty(&nh->head);
 }
 
 /*
@@ -340,7 +452,7 @@ int blocking_notifier_call_chain_robust(struct blocking_notifier_head *nh,
 	 * racy then it does not matter what the result of the test
 	 * is, we re-check the list after having taken the lock anyway:
 	 */
-	if (rcu_access_pointer(nh->head)) {
+	if (!list_empty(&nh->head)) {
 		down_read(&nh->rwsem);
 		ret = notifier_call_chain_robust(&nh->head, val_up, val_down, v);
 		up_read(&nh->rwsem);
@@ -375,15 +487,52 @@ int blocking_notifier_call_chain(struct blocking_notifier_head *nh,
 	 * racy then it does not matter what the result of the test
 	 * is, we re-check the list after having taken the lock anyway:
 	 */
-	if (rcu_access_pointer(nh->head)) {
+	if (!list_empty(&nh->head)) {
 		down_read(&nh->rwsem);
-		ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
+		ret = notifier_call_chain(&nh->head, val, v, -1, NULL, NULL);
 		up_read(&nh->rwsem);
 	}
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blocking_notifier_call_chain);
 
+/**
+ *	blocking_notifier_call_chain_reverse - Call functions reversely in
+ *				a blocking notifier chain
+ *	@nh: Pointer to head of the blocking notifier chain
+ *	@val: Value passed unmodified to notifier function
+ *	@v: Pointer passed unmodified to notifier function
+ *
+ *	Calls each function in a notifier chain in turn.  The functions
+ *	run in a process context, so they are allowed to block.
+ *
+ *	If the return value of the notifier can be and'ed
+ *	with %NOTIFY_STOP_MASK then blocking_notifier_call_chain()
+ *	will return immediately, with the return value of
+ *	the notifier function which halted execution.
+ *	Otherwise the return value is the return value
+ *	of the last notifier function called.
+ */
+
+int blocking_notifier_call_chain_reverse(struct blocking_notifier_head *nh,
+		unsigned long val, void *v)
+{
+	int ret = NOTIFY_DONE;
+
+	/*
+	 * We check the head outside the lock, but if this access is
+	 * racy then it does not matter what the result of the test
+	 * is, we re-check the list after having taken the lock anyway:
+	 */
+	if (!list_empty(&nh->head)) {
+		down_read(&nh->rwsem);
+		ret = notifier_call_chain_reverse(&nh->head, NULL, val, v, -1, NULL);
+		up_read(&nh->rwsem);
+	}
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blocking_notifier_call_chain_reverse);
+
 /*
  *	Raw notifier chain routines.  There is no protection;
  *	the caller must provide it.  Use at your own risk!
@@ -450,7 +599,7 @@ EXPORT_SYMBOL_GPL(raw_notifier_call_chain_robust);
 int raw_notifier_call_chain(struct raw_notifier_head *nh,
 		unsigned long val, void *v)
 {
-	return notifier_call_chain(&nh->head, val, v, -1, NULL);
+	return notifier_call_chain(&nh->head, val, v, -1, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(raw_notifier_call_chain);
 
@@ -543,7 +692,7 @@ int srcu_notifier_call_chain(struct srcu_notifier_head *nh,
 	int idx;
 
 	idx = srcu_read_lock(&nh->srcu);
-	ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
+	ret = notifier_call_chain_rcu(&nh->head, val, v, -1, NULL);
 	srcu_read_unlock(&nh->srcu, idx);
 	return ret;
 }
@@ -566,7 +715,7 @@ void srcu_init_notifier_head(struct srcu_notifier_head *nh)
 	mutex_init(&nh->mutex);
 	if (init_srcu_struct(&nh->srcu) < 0)
 		BUG();
-	nh->head = NULL;
+	INIT_LIST_HEAD(&nh->head);
 }
 EXPORT_SYMBOL_GPL(srcu_init_notifier_head);
 
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index c942f1282236..0afcba2967c7 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -90,7 +90,7 @@ static const struct nla_policy rtm_nh_res_bucket_policy_get[] = {
 
 static bool nexthop_notifiers_is_empty(struct net *net)
 {
-	return !net->nexthop.notifier_chain.head;
+	return list_empty(&net->nexthop.notifier_chain.head);
 }
 
 static void
-- 
2.43.0


^ permalink raw reply related

* Re: [RFC 0/2] openrisc: Add support for KProbes
From: Masami Hiramatsu @ 2026-04-15  6:48 UTC (permalink / raw)
  To: Sahil Siddiq
  Cc: jonas, stefan.kristiansson, shorne, naveen, davem, peterz,
	jpoimboe, jbaron, rostedt, ardb, chenmiao.ku, johannes, nsc,
	masahiroy, tytso, linux-openrisc, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260407185650.79816-1-sahilcdq0@gmail.com>

Hi Sahil,

On Wed,  8 Apr 2026 00:26:48 +0530
Sahil Siddiq <sahilcdq0@gmail.com> wrote:

> Hi,
> 
> This series adds basic support for KProbes on OpenRISC. There are
> a few changes that I would still like to add and test before this
> can be considered for merging. I was hoping to get some feedback on
> the changes made so far. The implementation in this series is based
> on KProbes for LoongArch, MIPS and RISC-V. 

Thanks for porting!
Sashiko reviewed this series, can you check the comments?
Most comments (not all) look reasonable to me.

https://sashiko.dev/#/patchset/20260407185650.79816-1-sahilcdq0%40gmail.com

Generally, please make better use of macros rather than magic values
in your code to make it easier to understand.
Also, use GENMASK() and BIT() macro to define bitmasks and bit.

Thanks,

> 
> The current state of the series allows traps to be inserted dynamically
> in the kernel. A KProbe can be inserted via a kernel module whose
> init/exit functions are used to register/unregister a KProbe. A pre-
> handler and post-handler can also be provisioned in the module, which
> are associated with the KProbe and triggered when the probe is hit. See
> the documentation on KProbes for a detailed explanation [1].
> 
> The following are yet to be implemented for OpenRISC:
> 1. kretprobes
> 2. kprobe-based event tracing
> 3. ftrace, and kprobe features that depend on ftrace (particularly,
> dynamic tracing)
> 
> I hope to submit a patch for kretprobes soon (possibly in a revision of
> this series).
> 
> I wrote a couple of kernel modules to test these changes. They can be found
> here [2]. I also ran test_kprobes located at ./lib/tests/ against these
> changes [3]. The results are as shown below:
> 
> /home # insmod test_kprobes.ko 
> KTAP version 1
> 1..1
>     KTAP version 1
>     # Subtest: kprobes_test
>     # module: test_kprobes
>     1..3
>     ok 1 test_kprobe
>     ok 2 test_kprobes
>     ok 3 test_kprobe_missed
> # kprobes_test: pass:3 fail:0 skip:0 total:3
> # Totals: pass:3 fail:0 skip:0 total:3
> ok 1 kprobes_test
> /home # 
> 
> When compiling the kernel, the following options should be enabled:
> 1. CONFIG_HAVE_KPROBES=y
> 2. CONFIG_KPROBES=y
> 
> Also ensure that CONFIG_KPROBE_EVENTS is disabled.
> 
> To compile /lib/tests/test_kprobes.c, add the following to .config:
> 1. CONFIG_KUNIT=y
> 2. CONFIG_DEBUG_KERNEL=y
> 3. CONFIG_KPROBES_SANITY_TEST=m
> 
> The first commit cleans up and reorganizes existing functions, fixes
> a few issues with instruction simulation, and introduces new structures
> and macros that will be used by KProbes and other tracing facilities
> in the future.
> 
> The second commit adds support for KProbes. Currently, I have
> implemented this in such a way that KProbes can't be used to probe
> a few "blacklisted" instructions. Probes can't be inserted in a delay
> slot either (similar to MIPS). I have also added a few asm functions
> to the blacklist that I think should not be probed. For e.g., "memset"
> and "_trap_handler" have been blacklisted because probing them causes
> the kernel to hang. However, I am not sure if other functions in "entry.S"
> need to be added as well to the blacklist.
> 
> Thanks,
> Sahil
> 
> [1] https://www.kernel.org/doc/html/latest/trace/kprobes.html
> [2] https://github.com/valdaarhun/or-dev/tree/main/home
> [3] https://github.com/openrisc/linux/blob/for-next/lib/tests/test_kprobes.c
> 
> Sahil Siddiq (2):
>   openrisc: Add utilities and clean up simulation of instructions
>   openrisc: Add KProbes
> 
>  arch/openrisc/Kconfig                   |   1 +
>  arch/openrisc/configs/or1ksim_defconfig |   2 +
>  arch/openrisc/configs/virt_defconfig    |   2 +
>  arch/openrisc/include/asm/asm.h         |  22 ++
>  arch/openrisc/include/asm/break.h       |  19 ++
>  arch/openrisc/include/asm/insn-def.h    |  61 +++-
>  arch/openrisc/include/asm/kprobes.h     |  76 +++++
>  arch/openrisc/include/asm/spr_defs.h    |   1 +
>  arch/openrisc/kernel/Makefile           |   3 +-
>  arch/openrisc/kernel/entry.S            |  16 +
>  arch/openrisc/kernel/insn.c             |  74 +++++
>  arch/openrisc/kernel/jump_label.c       |   2 +-
>  arch/openrisc/kernel/kprobes.c          | 381 ++++++++++++++++++++++++
>  arch/openrisc/kernel/traps.c            |  67 ++---
>  arch/openrisc/lib/memcpy.c              |   2 +
>  arch/openrisc/lib/memset.S              |   4 +
>  arch/openrisc/mm/fault.c                |   5 +
>  samples/kprobes/kprobe_example.c        |   8 +
>  18 files changed, 701 insertions(+), 45 deletions(-)
>  create mode 100644 arch/openrisc/include/asm/asm.h
>  create mode 100644 arch/openrisc/include/asm/break.h
>  create mode 100644 arch/openrisc/include/asm/kprobes.h
>  create mode 100644 arch/openrisc/kernel/insn.c
>  create mode 100644 arch/openrisc/kernel/kprobes.c
> 
> -- 
> 2.53.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Song Chen @ 2026-04-15  6:43 UTC (permalink / raw)
  To: Petr Pavlu
  Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe, jikos, mbenes,
	pmladek, joe.lawrence, rostedt, mhiramat, mark.rutland,
	mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <1191caf5-6a61-4622-a15e-854d3701f4fc@suse.com>

Hi,

On 4/14/26 22:33, Petr Pavlu wrote:
> On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
>> From: Song Chen <chensong_2000@189.cn>
>>
>> ftrace and livepatch currently have their module load/unload callbacks
>> hard-coded in the module loader as direct function calls to
>> ftrace_module_enable(), klp_module_coming(), klp_module_going()
>> and ftrace_release_mod(). This tight coupling was originally introduced
>> to enforce strict call ordering that could not be guaranteed by the
>> module notifier chain, which only supported forward traversal. Their
>> notifiers were moved in and out back and forth. see [1] and [2].
> 
> I'm unclear about what is meant by the notifiers being moved back and
> forth. The links point to patches that converted ftrace+klp from using
> module notifiers to explicit callbacks due to ordering issues, but this
> switch occurred only once. Have there been other attempts to use
> notifiers again?
> 

Yes,only once,i will rephrase.

>>
>> Now that the notifier chain supports reverse traversal via
>> blocking_notifier_call_chain_reverse(), the ordering can be enforced
>> purely through notifier priority. As a result, the module loader is now
>> decoupled from the implementation details of ftrace and livepatch.
>> What's more, adding a new subsystem with symmetric setup/teardown ordering
>> requirements during module load/unload no longer requires modifying
>> kernel/module/main.c; it only needs to register a notifier_block with an
>> appropriate priority.
>>
>> [1]:https://lore.kernel.org/all/
>> 	alpine.LNX.2.00.1602172216491.22700@cbobk.fhfr.pm/
>> [2]:https://lore.kernel.org/all/
>> 	20160301030034.GC12120@packer-debian-8-amd64.digitalocean.com/
> 
> Nit: Avoid wrapping URLs, as it breaks autolinking and makes the links
> harder to copy.
> 
> Better links would be:
> [1] https://lore.kernel.org/all/1455661953-15838-1-git-send-email-jeyu@redhat.com/
> [2] https://lore.kernel.org/all/1458176139-17455-1-git-send-email-jeyu@redhat.com/
> 
> The first link is the final version of what landed as commit
> 7dcd182bec27 ("ftrace/module: remove ftrace module notifier"). The
> second is commit 7e545d6eca20 ("livepatch/module: remove livepatch
> module notifier").
> 

Thank you, i will update.

>>
>> Signed-off-by: Song Chen <chensong_2000@189.cn>
>> ---
>>   include/linux/module.h  |  8 ++++++++
>>   kernel/livepatch/core.c | 29 ++++++++++++++++++++++++++++-
>>   kernel/module/main.c    | 34 +++++++++++++++-------------------
>>   kernel/trace/ftrace.c   | 38 ++++++++++++++++++++++++++++++++++++++
>>   4 files changed, 89 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/linux/module.h b/include/linux/module.h
>> index 14f391b186c6..0bdd56f9defd 100644
>> --- a/include/linux/module.h
>> +++ b/include/linux/module.h
>> @@ -308,6 +308,14 @@ enum module_state {
>>   	MODULE_STATE_COMING,	/* Full formed, running module_init. */
>>   	MODULE_STATE_GOING,	/* Going away. */
>>   	MODULE_STATE_UNFORMED,	/* Still setting it up. */
>> +	MODULE_STATE_FORMED,
> 
> I don't see a reason to add a new module state. Why is it necessary and
> how does it fit with the existing states?
> 
because once notifier fails in state MODULE_STATE_UNFORMED (now only 
ftrace has someting to do in this state), notifier chain will roll back 
by calling blocking_notifier_call_chain_robust, i'm afraid 
MODULE_STATE_GOING is going to jeopardise the notifers which don't 
handle it appropriately, like:

case MODULE_STATE_COMING:
      kmalloc();
case MODULE_STATE_GOING:
      kfree();


>> +};
>> +
>> +enum module_notifier_prio {
>> +	MODULE_NOTIFIER_PRIO_LOW = INT_MIN,	/* Low prioroty, coming last, going first */
>> +	MODULE_NOTIFIER_PRIO_MID = 0,	/* Normal priority. */
>> +	MODULE_NOTIFIER_PRIO_SECOND_HIGH = INT_MAX - 1,	/* Second high priorigy, coming second*/
>> +	MODULE_NOTIFIER_PRIO_HIGH = INT_MAX,	/* High priorigy, coming first, going late. */
> 
> I suggest being explicit about how the notifiers are ordered. For
> example:
> 
> enum module_notifier_prio {
> 	MODULE_NOTIFIER_PRIO_NORMAL,	/* Normal priority, coming last, going first. */
> 	MODULE_NOTIFIER_PRIO_LIVEPATCH,
> 	MODULE_NOTIFIER_PRIO_FTRACE,	/* High priority, coming first, going late. */
> };
> 

accepted.

>>   };
>>   
>>   struct mod_tree_node {
>> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
>> index 28d15ba58a26..ce78bb23e24b 100644
>> --- a/kernel/livepatch/core.c
>> +++ b/kernel/livepatch/core.c
>> @@ -1375,13 +1375,40 @@ void *klp_find_section_by_name(const struct module *mod, const char *name,
>>   }
>>   EXPORT_SYMBOL_GPL(klp_find_section_by_name);
>>   
>> +static int klp_module_callback(struct notifier_block *nb, unsigned long op,
>> +			void *module)
>> +{
>> +	struct module *mod = module;
>> +	int err = 0;
>> +
>> +	switch (op) {
>> +	case MODULE_STATE_COMING:
>> +		err = klp_module_coming(mod);
>> +		break;
>> +	case MODULE_STATE_LIVE:
>> +		break;
>> +	case MODULE_STATE_GOING:
>> +		klp_module_going(mod);
>> +		break;
>> +	default:
>> +		break;
>> +	}
> 
> klp_module_coming() and klp_module_going() are now used only in
> kernel/livepatch/core.c where they are also defined. This means the
> functions can be static and their declarations removed from
> include/linux/livepatch.h.
> 
> Nit: The MODULE_STATE_LIVE and default cases in the switch can be
> removed.
> 

accepted.

>> +
>> +	return notifier_from_errno(err);
>> +}
>> +
>> +static struct notifier_block klp_module_nb = {
>> +	.notifier_call = klp_module_callback,
>> +	.priority = MODULE_NOTIFIER_PRIO_SECOND_HIGH
>> +};
>> +
>>   static int __init klp_init(void)
>>   {
>>   	klp_root_kobj = kobject_create_and_add("livepatch", kernel_kobj);
>>   	if (!klp_root_kobj)
>>   		return -ENOMEM;
>>   
>> -	return 0;
>> +	return register_module_notifier(&klp_module_nb);
>>   }
>>   
>>   module_init(klp_init);
>> diff --git a/kernel/module/main.c b/kernel/module/main.c
>> index c3ce106c70af..226dd5b80997 100644
>> --- a/kernel/module/main.c
>> +++ b/kernel/module/main.c
>> @@ -833,10 +833,8 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
>>   	/* Final destruction now no one is using it. */
>>   	if (mod->exit != NULL)
>>   		mod->exit();
>> -	blocking_notifier_call_chain(&module_notify_list,
>> +	blocking_notifier_call_chain_reverse(&module_notify_list,
>>   				     MODULE_STATE_GOING, mod);
>> -	klp_module_going(mod);
>> -	ftrace_release_mod(mod);
>>   
>>   	async_synchronize_full();
>>   
>> @@ -3135,10 +3133,8 @@ static noinline int do_init_module(struct module *mod)
>>   	mod->state = MODULE_STATE_GOING;
>>   	synchronize_rcu();
>>   	module_put(mod);
>> -	blocking_notifier_call_chain(&module_notify_list,
>> +	blocking_notifier_call_chain_reverse(&module_notify_list,
>>   				     MODULE_STATE_GOING, mod);
>> -	klp_module_going(mod);
>> -	ftrace_release_mod(mod);
>>   	free_module(mod);
>>   	wake_up_all(&module_wq);
>>   
> 
> The patch unexpectedly leaves a call to ftrace_free_mem() in
> do_init_module().

Thanks for pointing it out, it was removed when i implemented and 
tested, but when i organized the patch, it was left. I will remove it.

> 
>> @@ -3281,20 +3277,14 @@ static int complete_formation(struct module *mod, struct load_info *info)
>>   	return err;
>>   }
>>   
>> -static int prepare_coming_module(struct module *mod)
>> +static int prepare_module_state_transaction(struct module *mod,
>> +			unsigned long val_up, unsigned long val_down)
>>   {
>>   	int err;
>>   
>> -	ftrace_module_enable(mod);
>> -	err = klp_module_coming(mod);
>> -	if (err)
>> -		return err;
>> -
>>   	err = blocking_notifier_call_chain_robust(&module_notify_list,
>> -			MODULE_STATE_COMING, MODULE_STATE_GOING, mod);
>> +			val_up, val_down, mod);
>>   	err = notifier_to_errno(err);
>> -	if (err)
>> -		klp_module_going(mod);
>>   
>>   	return err;
>>   }
>> @@ -3468,14 +3458,21 @@ static int load_module(struct load_info *info, const char __user *uargs,
>>   	init_build_id(mod, info);
>>   
>>   	/* Ftrace init must be called in the MODULE_STATE_UNFORMED state */
>> -	ftrace_module_init(mod);
>> +	err = prepare_module_state_transaction(mod,
>> +				MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);
> 
> I believe val_down should be MODULE_STATE_GOING to reverse the
> operation. Why is the new state MODULE_STATE_FORMED needed here?
to avoid this:

case MODULE_STATE_COMING:
      kmalloc();
case MODULE_STATE_GOING:
      kfree();



> 
>> +	if (err)
>> +		goto ddebug_cleanup;
>>   
>>   	/* Finally it's fully formed, ready to start executing. */
>>   	err = complete_formation(mod, info);
>> -	if (err)
>> +	if (err) {
>> +		blocking_notifier_call_chain_reverse(&module_notify_list,
>> +				MODULE_STATE_FORMED, mod);
>>   		goto ddebug_cleanup;
>> +	}
>>   
>> -	err = prepare_coming_module(mod);
>> +	err = prepare_module_state_transaction(mod,
>> +				MODULE_STATE_COMING, MODULE_STATE_GOING);
>>   	if (err)
>>   		goto bug_cleanup;
>>   
>> @@ -3522,7 +3519,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
>>   	destroy_params(mod->kp, mod->num_kp);
>>   	blocking_notifier_call_chain(&module_notify_list,
>>   				     MODULE_STATE_GOING, mod);
> 
> My understanding is that all notifier chains for MODULE_STATE_GOING
> should be reversed.
yes, all, from lowest priority notifier to highest.
I will resend patch 1 which was failed due to my proxy setting.

> 
>> -	klp_module_going(mod);
>>    bug_cleanup:
>>   	mod->state = MODULE_STATE_GOING;
>>   	/* module_bug_cleanup needs module_mutex protection */
> 
> The patch removes the klp_module_going() cleanup call in load_module().
> Similarly, the ftrace_release_mod() call under the ddebug_cleanup label
> should be removed and appropriately replaced with a cleanup via
> a notifier.
> 
     err = prepare_module_state_transaction(mod,
                 MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);
     if (err)
         goto ddebug_cleanup;

ftrace will be cleanup in blocking_notifier_call_chain_robust rolling back.

     err = prepare_module_state_transaction(mod,
                 MODULE_STATE_COMING, MODULE_STATE_GOING);

each notifier including ftrace and klp will be cleanup in 
blocking_notifier_call_chain_robust rolling back.

if all notifiers are successful in MODULE_STATE_COMING, they all will be 
clean up in
  coming_cleanup:
     mod->state = MODULE_STATE_GOING;
     destroy_params(mod->kp, mod->num_kp);
     blocking_notifier_call_chain(&module_notify_list,
                      MODULE_STATE_GOING, mod);

if  something wrong underneath.

>> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
>> index 8df69e702706..efedb98d3db4 100644
>> --- a/kernel/trace/ftrace.c
>> +++ b/kernel/trace/ftrace.c
>> @@ -5241,6 +5241,44 @@ static int __init ftrace_mod_cmd_init(void)
>>   }
>>   core_initcall(ftrace_mod_cmd_init);
>>   
>> +static int ftrace_module_callback(struct notifier_block *nb, unsigned long op,
>> +			void *module)
>> +{
>> +	struct module *mod = module;
>> +
>> +	switch (op) {
>> +	case MODULE_STATE_UNFORMED:
>> +		ftrace_module_init(mod);
>> +		break;
>> +	case MODULE_STATE_COMING:
>> +		ftrace_module_enable(mod);
>> +		break;
>> +	case MODULE_STATE_LIVE:
>> +		ftrace_free_mem(mod, mod->mem[MOD_INIT_TEXT].base,
>> +				mod->mem[MOD_INIT_TEXT].base + mod->mem[MOD_INIT_TEXT].size);
>> +		break;
>> +	case MODULE_STATE_GOING:
>> +	case MODULE_STATE_FORMED:
>> +		ftrace_release_mod(mod);
>> +		break;
>> +	default:
>> +		break;
>> +	}
> 
> ftrace_module_init(), ftrace_module_enable(), ftrace_free_mem() and
> ftrace_release_mod() should be newly used only in kernel/trace/ftrace.c
> where they are also defined. The functions can then be made static and
> removed from include/linux/ftrace.h.
> 
> Nit: The default case in the switch can be removed.
> 

accepted.

>> +
>> +	return notifier_from_errno(0);
> 
> Nit: This can be simply "return NOTIFY_OK;".

accepted
> 
>> +}
>> +
>> +static struct notifier_block ftrace_module_nb = {
>> +	.notifier_call = ftrace_module_callback,
>> +	.priority = MODULE_NOTIFIER_PRIO_HIGH
>> +};
>> +
>> +static int __init ftrace_register_module_notifier(void)
>> +{
>> +	return register_module_notifier(&ftrace_module_nb);
>> +}
>> +core_initcall(ftrace_register_module_notifier);
>> +
>>   static void function_trace_probe_call(unsigned long ip, unsigned long parent_ip,
>>   				      struct ftrace_ops *op, struct ftrace_regs *fregs)
>>   {
> 

Best regards

Song


^ permalink raw reply

* Re: [RFC 1/2] openrisc: Add utilities and clean up simulation of instructions
From: Masami Hiramatsu @ 2026-04-15  6:39 UTC (permalink / raw)
  To: Stafford Horne
  Cc: Sahil Siddiq, jonas, stefan.kristiansson, naveen, davem, mhiramat,
	peterz, jpoimboe, jbaron, rostedt, ardb, chenmiao.ku, johannes,
	nsc, masahiroy, tytso, linux-openrisc, linux-kernel,
	linux-trace-kernel
In-Reply-To: <ad51SeeF2Gv06mgQ@antec>

On Tue, 14 Apr 2026 18:11:37 +0100
Stafford Horne <shorne@gmail.com> wrote:

> Hello Sahil,
> 
> Thanks for your patches.  I have a few questions. so this is not really a
> technical review at the moment just some main items.
> 
> On Wed, Apr 08, 2026 at 12:26:49AM +0530, Sahil Siddiq wrote:
> > Introduce new instruction-related utilities and macros for OpenRISC.
> > This is in preparation for patches that add tracing support such as
> > KProbes.
> > 
> > Simulate l.adrp. Fix bugs in simulation of l.jal and l.jalr. Earlier,
> 
> Why emulate l.adrp?  We don't really use this yet in any OpenRISC code.  This
> instruction is meant to be used to help -fpic code, see:
> 
>   https://openrisc.io/proposals/ladrp

Maybe because l.adrp depends on InsnAddr (instruction address), 
if kprobe runs copied instruction on its trampoline buffer, it
generates a wrong result. If it may not be used in the kernel,
arch_prepare_kprobe() can detect it and return an error.

[...]
> > +
> > +#define OPENRISC_INSN_SIZE  (sizeof(union openrisc_instruction))

This should be defined within an architecture header file.

> > +
> > +/* Helpers for working with l.trap */
> > +static inline unsigned long __emit_trap(unsigned int code)
> > +{
> > +	return (code & 0xffff) | OPCODE_TRAP;
> > +}
> > +
> > +static inline bool has_delay_slot(void)
> > +{
> > +	unsigned int cpucfgr = mfspr(SPR_CPUCFGR);
> > +
> > +	return !(cpucfgr & SPR_CPUCFGR_ND);
> > +}
> 
> This is for handling CPU's that do not have delay slots.  We didn't do this
> before, why are you doing it now?  Should we mention this in the git commit
> message?
> 
> Also, since this is a static configuration for the CPU should we use static keys
> for this?
> 
> > +
> > +void simulate_pc(struct pt_regs *regs, unsigned int jmp);
> > +void simulate_branch(struct pt_regs *regs, unsigned int jmp, bool has_delay_slot);

Also, do not use the same name for a function name and a local variable.
Please rename has_delay_slot() to machine_has_delay_slot() or rename
the has_delay_slot parameter to 'delay_slot'.

[...]
> > +#include <linux/ptrace.h>
> > +#include <asm/insn-def.h>
> > +
> > +void simulate_pc(struct pt_regs *regs, unsigned int jmp)
> > +{
> > +	int displacement;
> > +	unsigned int rd, op;
> > +
> > +	displacement = sign_extend32(((jmp) & 0x7ffff) << 13, 31);
> > +	rd = (jmp & 0x3ffffff) >> 21;

Please use GENMASK() macro instead of hex value.
(Moreover, define meaningful named macro instead of the magic number.)

> > +	op = jmp >> 26;
> > +
> > +	switch (op) {
> > +	case l_adrp:

l_adrp seems have 2 modes. Is this OK only supporting 32-bit?

32-bit Implementation:
rD[31:0] ← exts(Immediate[18:0] << 13) + (InstAddr & -8192)
64-bit Implementation:
rD[63:0] ← exts(Immediate[20:0] << 13) + (InstAddr & -8192)

> > +		regs->gpr[rd] = displacement + (regs->pc & (-8192));

Please use nicely named macro instead of -8192.

Thanks,



-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [RFC 1/2] openrisc: Add utilities and clean up simulation of instructions
From: Sahil @ 2026-04-15  6:10 UTC (permalink / raw)
  To: Stafford Horne
  Cc: jonas, stefan.kristiansson, naveen, davem, mhiramat, peterz,
	jpoimboe, jbaron, rostedt, ardb, chenmiao.ku, johannes, nsc,
	masahiroy, tytso, linux-openrisc, linux-kernel,
	linux-trace-kernel
In-Reply-To: <ad51SeeF2Gv06mgQ@antec>

Hi Stafford,

Thank you for your review.

On 4/14/26 10:41 PM, Stafford Horne wrote:
> Hello Sahil,
> 
> Thanks for your patches.  I have a few questions. so this is not really a
> technical review at the moment just some main items.
> 
> On Wed, Apr 08, 2026 at 12:26:49AM +0530, Sahil Siddiq wrote:
>> Introduce new instruction-related utilities and macros for OpenRISC.
>> This is in preparation for patches that add tracing support such as
>> KProbes.
>>
>> Simulate l.adrp. Fix bugs in simulation of l.jal and l.jalr. Earlier,
> 
> Why emulate l.adrp?  We don't really use this yet in any OpenRISC code.  This
> instruction is meant to be used to help -fpic code, see:
> 
>    https://openrisc.io/proposals/ladrp

Right, I couldn't find l.adrp in the kernel's image after compilation. But I
was thinking, in case l.adrp is used somewhere in the future (for e.g. in a
custom kernel module) and a probe is placed at that instruction, we would need
to emulate that instruction. This is because the target address is calculated
using the instruction's address (InsnAddr & -8192). If l.adrp is moved to an
out-of-line slot to single-step for KProbes, the slot's address will be used to
generate the target address instead of the original address.

Since l.adrp is not really used anywhere and it's highly unlikely that it will
be probed, an alternative could be to add l.adrp to the blacklist so that probes
at this instruction are never inserted.

>> PC was being updated and then saved in the link register r9, resulting
>> in a corrupted page table (bad page map in process). Instead, update
>> PC after storing it in r9.
>>
>> Move instruction simulation to its own file to enable reuse. Clean it
>> up and replace hardcoded values with computed expressions.
>>
>> Link: https://raw.githubusercontent.com/openrisc/doc/master/openrisc-arch-1.4-rev0.pdf
>> Signed-off-by: Sahil Siddiq <sahilcdq0@gmail.com>
>> ---
>>   arch/openrisc/include/asm/insn-def.h | 61 +++++++++++++++++++++--
>>   arch/openrisc/include/asm/spr_defs.h |  1 +
>>   arch/openrisc/kernel/Makefile        |  2 +-
>>   arch/openrisc/kernel/insn.c          | 74 ++++++++++++++++++++++++++++
>>   arch/openrisc/kernel/jump_label.c    |  2 +-
>>   arch/openrisc/kernel/traps.c         | 41 +--------------
>>   6 files changed, 136 insertions(+), 45 deletions(-)
>>   create mode 100644 arch/openrisc/kernel/insn.c
>>
>> diff --git a/arch/openrisc/include/asm/insn-def.h b/arch/openrisc/include/asm/insn-def.h
>> index 1e0c028a5b95..c98f9770c52e 100644
>> --- a/arch/openrisc/include/asm/insn-def.h
>> +++ b/arch/openrisc/include/asm/insn-def.h
>> @@ -3,13 +3,66 @@
>>    * Copyright (C) 2025 Chen Miao
>>    */
>>   
>> +#include <asm/spr.h>
>> +#include <asm/spr_defs.h>
>> +
>>   #ifndef __ASM_OPENRISC_INSN_DEF_H
>>   #define __ASM_OPENRISC_INSN_DEF_H
>>   
>> -/* or1k instructions are always 32 bits. */
>> -#define	OPENRISC_INSN_SIZE		4
>> -
>>   /* or1k nop instruction code */
>> -#define OPENRISC_INSN_NOP     0x15000000U
>> +#define INSN_NOP	0x15000000U
> 
> Why remove the OPENRISC_ namespace from thse definitions?
> 
> Other architectures like arm64, riscv have namespaces on these similar
> definitions.

Oh, got it. I hadn't checked those architectures at that point. I was
following the convention in LoongArch and the namespace wasn't there. I'll
revert this change.

>> +
>> +#define INSN_CSYNC	0x23000000U
>> +#define INSN_MSYNC	0x22000000U
>> +#define INSN_PSYNC	0x22800000U
>> +
>> +#define OPCODE_TRAP	0x21000000U
>> +#define OPCODE_SYS	0x20000000U
>> +#define OPCODE_MACRC	0x18010000U

I'll add the namespace here.

>> +struct pt_regs;
>> +
>> +enum six_bit_opcodes {
> 
> I don't see why we need to name this six_bit as such.  OpenRISC
> instructions have either fir most significant 6-bits or 11-bits
> as opcodes but we still use the first 6-bits to drive the logic.

Maybe it should be renamed to something that indicates this is related
to probes (I wasn't able to come up with a better name).

The enums are used in a few places in KProbe's implementation to check
if an instruction requires emulation, has been blacklisted, or can be
single-stepped in an out-of-line slot. An alternative would be to use
macros (similar to OPCODE_SYS) but that won't have type checking. Either
way, I think it's better than using raw values in multiple places.

Shall I simply rename the enum type, or shall I switch to macros? What
are your thoughts?

>> +	l_rfe = 0x09,
>> +	l_lwa = 0x1b,
>> +	l_mfspr = 0x2d,
>> +	l_mtspr = 0x30,
>> +	l_swa = 0x33,
>> +	l_j = 0x00,
>> +	l_jal = 0x01,
>> +	l_adrp = 0x02,
>> +	l_bnf = 0x03,
>> +	l_bf = 0x04,
>> +	l_jr = 0x11,
>> +	l_jalr = 0x12,
>> +};
>> +
>> +struct insn {
>> +	unsigned int opcode: 6;
>> +	unsigned int operands: 26;
>> +};
>> +
>> +union openrisc_instruction {
>> +	unsigned int word;
>> +	struct insn opcodes_6bit;
>> +};
>> +
>> +#define OPENRISC_INSN_SIZE  (sizeof(union openrisc_instruction))
>> +
>> +/* Helpers for working with l.trap */
>> +static inline unsigned long __emit_trap(unsigned int code)
>> +{
>> +	return (code & 0xffff) | OPCODE_TRAP;
>> +}
>> +
>> +static inline bool has_delay_slot(void)
>> +{
>> +	unsigned int cpucfgr = mfspr(SPR_CPUCFGR);
>> +
>> +	return !(cpucfgr & SPR_CPUCFGR_ND);
>> +}
> 
> This is for handling CPU's that do not have delay slots.  We didn't do this
> before, why are you doing it now?  Should we mention this in the git commit
> message?

I am not entirely sure why CPUs with no delay slot were not taken into account
initially. The "in_delay_slot" function in traps.c [1] always assumes (for the
case where SPR_SR_DSX is not implemented) that the CPU has a delay slot if PC is
a branch instruction.

Given that the specification allows CPUs not to have a delay slot, I thought I
would include that as well. With no SPR_SR_DSX, if it can be guaranteed that PC
will be a branch instruction only when delay slots are enabled, then there should
be no change in behaviour in "simulate_branch" even with the addition of the
"has_delay_slot" check.

However, if PC can be a branch instruction even when there are no delay slots, then
"simulate_branch" now adds 4 to regs->pc instead of adding 8 unconditionally. Please
correct me if I am wrong.

"has_delay_slot" is particularly important for KProbes, since a probe at a branch
instruction with a delay slot means the branch will have to be emulated followed
by single-stepping of the delay slot instruction.

But if there are no delay slots, then only branch emulation is required.

> Also, since this is a static configuration for the CPU should we use static keys
> for this?

Yes, I think static keys can be used here. If the more likely case is that a delay
slot exists, we can make that the more likely branch.

>> +
>> +void simulate_pc(struct pt_regs *regs, unsigned int jmp);
>> +void simulate_branch(struct pt_regs *regs, unsigned int jmp, bool has_delay_slot);
>>   
>>   #endif /* __ASM_OPENRISC_INSN_DEF_H */
>> diff --git a/arch/openrisc/include/asm/spr_defs.h b/arch/openrisc/include/asm/spr_defs.h
>> index f0b6b492e9f4..5d13a9b96263 100644
>> --- a/arch/openrisc/include/asm/spr_defs.h
>> +++ b/arch/openrisc/include/asm/spr_defs.h
>> @@ -179,6 +179,7 @@
>>   #define SPR_CPUCFGR_OF32S  0x00000080  /* ORFPX32 supported */
>>   #define SPR_CPUCFGR_OF64S  0x00000100  /* ORFPX64 supported */
>>   #define SPR_CPUCFGR_OV64S  0x00000200  /* ORVDX64 supported */
>> +#define SPR_CPUCFGR_ND     0x00000400  /* No delay slot */
>>   #define SPR_CPUCFGR_RES	   0xfffffc00  /* Reserved */
>>   
>>   /*
>> diff --git a/arch/openrisc/kernel/Makefile b/arch/openrisc/kernel/Makefile
>> index 19e0eb94f2eb..150779fbf010 100644
>> --- a/arch/openrisc/kernel/Makefile
>> +++ b/arch/openrisc/kernel/Makefile
>> @@ -5,7 +5,7 @@
>>   
>>   always-$(KBUILD_BUILTIN)	:= vmlinux.lds
>>   
>> -obj-y	:= head.o setup.o or32_ksyms.o process.o dma.o \
>> +obj-y	:= head.o insn.o setup.o or32_ksyms.o process.o dma.o \
>>   	   traps.o time.o irq.o entry.o ptrace.o signal.o \
>>   	   sys_call_table.o unwinder.o cacheinfo.o
>>   
>> diff --git a/arch/openrisc/kernel/insn.c b/arch/openrisc/kernel/insn.c
>> new file mode 100644
>> index 000000000000..2c97eceee6d7
>> --- /dev/null
>> +++ b/arch/openrisc/kernel/insn.c
>> @@ -0,0 +1,74 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * OpenRISC instruction utils
>> + *
>> + * Linux architectural port borrowing liberally from similar works of
>> + * others.  All original copyrights apply as per the original source
>> + * declaration.
>> + *
>> + * OpenRISC implementation:
>> + * Copyright (C) 2026 Sahil Siddiq <sahilcdq0@gmail.com>
>> + */
>> +
>> +#include <linux/ptrace.h>
>> +#include <asm/insn-def.h>
>> +
>> +void simulate_pc(struct pt_regs *regs, unsigned int jmp)
>> +{
>> +	int displacement;
>> +	unsigned int rd, op;
>> +
>> +	displacement = sign_extend32(((jmp) & 0x7ffff) << 13, 31);
>> +	rd = (jmp & 0x3ffffff) >> 21;
>> +	op = jmp >> 26;
>> +
>> +	switch (op) {
>> +	case l_adrp:
>> +		regs->gpr[rd] = displacement + (regs->pc & (-8192));
>> +		return;
>> +	default:
>> +		break;
>> +	}
>> +}
> 
> This function is not used, what will it be used for? Could you mention it in the
> commit message?  Also could you document it?

"simulate_pc" is used by KProbes for single-stepping l.adrp. At the moment, it's
not being used anywhere else. Maybe I should move this function to the second commit
which has the KProbe changes.

Also, this function can be removed altogether if we choose to blacklist l.adrp.

>> +
>> +void simulate_branch(struct pt_regs *regs, unsigned int jmp_insn, bool has_delay_slot)
> 
> Since you are using this for more than just instruction simulation now, maybe
> its good to document what it does. For example when we use it below we surround
> it with the in_delay_slot check.

I am a little confused here. The core of the function was already wrapped in the
"in_delay_slot" check [2]. I moved it to a new file so instruction emulation could be
used in other places as well(e.g., in KProbes). Functionality wise, the function is
still the same with the added feature that it also supports CPUs with no delay slot.

>> +{
>> +	int displacement;
>> +	unsigned int rb, op, jmp;
>> +
>> +	displacement = sign_extend32(((jmp_insn) & 0x3ffffff) << 2, 27);
>> +	rb = (jmp_insn & 0x0000ffff) >> 11;
>> +	op = jmp_insn >> 26;
> 
> Naming this variable jmp is a bit misleading.  Maybe something like
> link_displacement, link_offset or fallthrough?

Sure, I'll rename this.

>> +	jmp = has_delay_slot ? 2 * OPENRISC_INSN_SIZE : OPENRISC_INSN_SIZE;
>> +
>> +	switch (op) {
>> +	case l_j: /* l.j */
>> +		regs->pc += displacement;
>> +		return;
>> +	case l_jal: /* l.jal */
>> +		regs->gpr[9] = regs->pc + jmp;
>> +		regs->pc += displacement;
>> +		return;
>> +	case l_bnf: /* l.bnf */
>> +		if (regs->sr & SPR_SR_F)
>> +			regs->pc += jmp;
>> +		else
>> +			regs->pc += displacement;
>> +		return;
>> +	case l_bf: /* l.bf */
>> +		if (regs->sr & SPR_SR_F)
>> +			regs->pc += displacement;
>> +		else
>> +			regs->pc += jmp;
>> +		return;
>> +	case l_jr: /* l.jr */
>> +		regs->pc = regs->gpr[rb];
>> +		return;
>> +	case l_jalr: /* l.jalr */
>> +		regs->gpr[9] = regs->pc + jmp;
>> +		regs->pc = regs->gpr[rb];
>> +		return;
>> +	default:
>> +		break;
>> +	}
>> +}
>> diff --git a/arch/openrisc/kernel/jump_label.c b/arch/openrisc/kernel/jump_label.c
>> index ab7137c23b46..fe082eb847a4 100644
>> --- a/arch/openrisc/kernel/jump_label.c
>> +++ b/arch/openrisc/kernel/jump_label.c
>> @@ -34,7 +34,7 @@ bool arch_jump_label_transform_queue(struct jump_entry *entry,
>>   
>>   		insn = offset;
>>   	} else {
>> -		insn = OPENRISC_INSN_NOP;
>> +		insn = INSN_NOP;
>>   	}
>>   
>>   	if (early_boot_irqs_disabled)
>> diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c
>> index c195be9cc9fc..ee87a3af34fc 100644
>> --- a/arch/openrisc/kernel/traps.c
>> +++ b/arch/openrisc/kernel/traps.c
>> @@ -32,6 +32,7 @@
>>   
>>   #include <asm/bug.h>
>>   #include <asm/fpu.h>
>> +#include <asm/insn-def.h>
>>   #include <asm/io.h>
>>   #include <asm/processor.h>
>>   #include <asm/unwinder.h>
>> @@ -269,47 +270,9 @@ static inline int in_delay_slot(struct pt_regs *regs)
>>   
>>   static inline void adjust_pc(struct pt_regs *regs, unsigned long address)
>>   {
>> -	int displacement;
>> -	unsigned int rb, op, jmp;
>> -
>>   	if (unlikely(in_delay_slot(regs))) {
>>   		/* In delay slot, instruction at pc is a branch, simulate it */
>> -		jmp = *((unsigned int *)regs->pc);
>> -
>> -		displacement = sign_extend32(((jmp) & 0x3ffffff) << 2, 27);
>> -		rb = (jmp & 0x0000ffff) >> 11;
>> -		op = jmp >> 26;
>> -
>> -		switch (op) {
>> -		case 0x00: /* l.j */
>> -			regs->pc += displacement;
>> -			return;
>> -		case 0x01: /* l.jal */
>> -			regs->pc += displacement;
>> -			regs->gpr[9] = regs->pc + 8;
>> -			return;
>> -		case 0x03: /* l.bnf */
>> -			if (regs->sr & SPR_SR_F)
>> -				regs->pc += 8;
>> -			else
>> -				regs->pc += displacement;
>> -			return;
>> -		case 0x04: /* l.bf */
>> -			if (regs->sr & SPR_SR_F)
>> -				regs->pc += displacement;
>> -			else
>> -				regs->pc += 8;
>> -			return;
>> -		case 0x11: /* l.jr */
>> -			regs->pc = regs->gpr[rb];
>> -			return;
>> -		case 0x12: /* l.jalr */
>> -			regs->pc = regs->gpr[rb];
>> -			regs->gpr[9] = regs->pc + 8;
>> -			return;
>> -		default:
>> -			break;
>> -		}
>> +		simulate_branch(regs, *((unsigned int *)regs->pc), has_delay_slot());
>>   	} else {
>>   		regs->pc += 4;
>>   	}
>> -- 
>> 2.53.0
>>

Thanks,
Sahil

[1] https://github.com/openrisc/linux/blob/for-next/arch/openrisc/kernel/traps.c#L248
[2] https://github.com/openrisc/linux/blob/for-next/arch/openrisc/kernel/traps.c#L275-L312


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox