Linux Trace Kernel
 help / color / mirror / Atom feed
* Re: [PATCH 0/3] mm: split the file's i_mmap tree for NUMA
From: Mateusz Guzik @ 2026-04-16 10:29 UTC (permalink / raw)
  To: Huang Shijie
  Cc: akpm, viro, brauner, linux-mm, linux-kernel, linux-arm-kernel,
	linux-fsdevel, muchun.song, osalvador, linux-trace-kernel,
	linux-perf-users, linux-parisc, nvdimm, zhongyuan, fangbaoshun,
	yingzhiwei
In-Reply-To: <ad4EvoDcAKE2Sl4+@hsj-2U-Workstation>

On Tue, Apr 14, 2026 at 11:11 AM Huang Shijie <huangsj@hygon.cn> wrote:
>
> On Mon, Apr 13, 2026 at 05:33:21PM +0200, Mateusz Guzik wrote:
> > On Mon, Apr 13, 2026 at 02:20:39PM +0800, Huang Shijie wrote:
> > >   In NUMA, there are maybe many NUMA nodes and many CPUs.
> > > For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> > > In the UnixBench tests, there is a test "execl" which tests
> > > the execve system call.
> > >
> > >   When we test our server with "./Run -c 384 execl",
> > > the test result is not good enough. The i_mmap locks contended heavily on
> > > "libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have
> > > over 6000 VMAs, all the VMAs can be in different NUMA mode.
> > > The insert/remove operations do not run quickly enough.
> > >
> > > patch 1 & patch 2 are try to hide the direct access of i_mmap.
> > > patch 3 splits the i_mmap into sibling trees, and we can get better
> > > performance with this patch set:
> > >     we can get 77% performance improvement(10 times average)
> > >
> >
> > To my reading you kept the lock as-is and only distributed the protected
> > state.
> >
> > While I don't doubt the improvement, I'm confident should you take a
> > look at the profile you are going to find this still does not scale with
> > rwsem being one of the problems (there are other global locks, some of
> > which have experimental patches for).
> IMHO, when the number of VMAs in the i_mmap is very large, only optimise the rwsem
> lock does not help too much for our NUMA case.
>
> In our NUMA server, the remote access could be the major issue.
>

I'm confused how this is not supposed to help. You moved your data to
be stored per-domain. With my proposal the lock itself will also get
that treatment.

Modulo the issue of what to do with code wanting to iterate the entire
thing, this is blatantly faster.

>
> >
> > Apart from that this does nothing to help high core systems which are
> > all one node, which imo puts another question mark on this specific
> > proposal.
> Yes, this patch set only focus on the NUMA case.
> The one-node case should use the original i_mmap.
>
> Maybe I can add a new config, CONFIG_SPILT_I_MMAP. The config is disabled
> by default, and enabled when the NUMA node is not one.
>
> >
> > Of course one may question whether a RB tree is the right choice here,
> > it may be the lock-protected cost can go way down with merely a better
> > data structure.
> >
> > Regardless of that, for actual scalability, there will be no way around
> > decentralazing locking around this and partitioning per some core count
> > (not just by numa awareness).
> >
> > Decentralizing locking is definitely possible, but I have not looked
> > into specifics of how problematic it is. Best case scenario it will
> > merely with separate locks. Worst case scenario something needs a fully
> > stabilized state for traversal, in that case another rw lock can be
> Yes.
>
> The traversal may need to hold many locks.
>

The very paragraph you partially quoted answers what to do in that
case: wrap everything with a new rwsem taken for reading when
adding/removing entries and taken for writing when iterating the
entire thing. Then the iteration sticks to one lock.

The new rw lock puts an upper ceiling on scalability of the thing, but
it is way higher than the current state.

Given the extra overhead associated with it one could consider
sticking to one centralized state by default and switching to
distributed state if there is enough contention.

> > slapped around this, creating locking order read lock -> per-subset
> > write lock -- this will suffer scalability due to the read locking, but
> > it will still scale drastically better as apart from that there will be
> > no serialization. In this setting the problematic consumer will write
> > lock the new thing to stabilize the state.
> >
> > So my non-maintainer opinion is that the patchset is not worth it as it
> > fails to address anything for significantly more common and already
> > affected setups.
> This patch set is to reduce the remote access latency for insert/remove VMA
> in NUMA.
>

And I am saying the mmap semaphore is a significant problem already on
high-core no-numa setups. Addressing scalability in that case would
sort out the problem in your setup and to a significantly higher
extent.

> >
> > Have you looked into splitting the lock?
> >
> I ever tried.
>
> But there are two disadvantages:
>   1.) The traversal may need to hold many locks which makes the
>       code very horrible.
>

I already above this is avoidable.

>   2.) Even we split the locks. Each lock protects a tree, when the tree becomes
>       big enough, the VMA insert/remove will also become slow in NUMA.
>       The reason is that the tree has VMAs in different NUMA nodes.
>

This is orthogonal to my proposal. In fact, if one is to pretend this
is never a factor with your patch, I would like to point out it will
remain not a factor if the per-numa struct gets its own lock.

^ permalink raw reply

* [PATCH v8 8/8] selftests/ftrace: Add a testcase for multiple fprobe events
From: Masami Hiramatsu (Google) @ 2026-04-16 10:17 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add a testcase for multiple fprobe events on the same function
so that it clears ftrace hash map correctly when removing the
events.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 .../test.d/dynevent/add_remove_multiple_fprobe.tc  |   69 ++++++++++++++++++++
 1 file changed, 69 insertions(+)
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc

diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc
new file mode 100644
index 000000000000..f2cbf2ffd29b
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc
@@ -0,0 +1,69 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Generic dynamic event - add/remove multiple fprobe events on the same function
+# requires: dynamic_events "f[:[<group>/][<event>]] <func-name>[%return] [<args>]":README enabled_functions
+
+echo 0 > events/enable
+echo > dynamic_events
+
+PLACE=vfs_read
+PLACE2=vfs_open
+
+:;: 'Ensure no other ftrace user' ;:
+test `cat enabled_functions | wc -l` -eq 0 || exit_unresolved
+
+:;: 'Test case 1: leave entry event' ;:
+:;: 'Add entry and exit events on the same place' ;:
+echo "f:event1 ${PLACE}" >> dynamic_events
+echo "f:event2 ${PLACE}%return" >> dynamic_events
+
+:;: 'Enable both of them' ;:
+echo 1 > events/fprobes/enable
+test `cat enabled_functions | wc -l` -eq 1
+
+:;: 'Disable and remove exit event' ;:
+echo 0 > events/fprobes/event2/enable
+echo -:event2 >> dynamic_events
+
+:;: 'Disable and remove all events' ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+
+:;: 'Add another event' ;:
+echo "f:event3 ${PLACE2}%return" > dynamic_events
+echo 1 > events/fprobes/enable
+test `cat enabled_functions | wc -l` -eq 1
+
+:;: 'No other ftrace user' ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+test `cat enabled_functions | wc -l` -eq 0
+
+:;: 'Test case 2: leave exit event' ;:
+:;: 'Add entry and exit events on the same place' ;:
+echo "f:event1 ${PLACE}" >> dynamic_events
+echo "f:event2 ${PLACE}%return" >> dynamic_events
+
+:;: 'Enable both of them' ;:
+echo 1 > events/fprobes/enable
+test `cat enabled_functions | wc -l` -eq 1
+
+:;: 'Disable and remove entry event' ;:
+echo 0 > events/fprobes/event1/enable
+echo -:event1 >> dynamic_events
+
+:;: 'Disable and remove all events' ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+
+:;: 'Add another event' ;:
+echo "f:event3 ${PLACE2}" > dynamic_events
+echo 1 > events/fprobes/enable
+test `cat enabled_functions | wc -l` -eq 1
+
+:;: 'No other ftrace user' ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+test `cat enabled_functions | wc -l` -eq 0
+
+clear_trace


^ permalink raw reply related

* [PATCH v8 7/8] selftests/ftrace: Add a testcase for fprobe events on module
From: Masami Hiramatsu (Google) @ 2026-04-16 10:17 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add a testcase for fprobe events on module, which unloads a kernel
module on which fprobe events are probing and ensure the ftrace
hash map is cleared correctly.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 .../test.d/dynevent/add_remove_fprobe_module.tc    |   61 ++++++++++++++++++++
 1 file changed, 61 insertions(+)
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc

diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc
new file mode 100644
index 000000000000..500942e9d8d6
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc
@@ -0,0 +1,61 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Generic dynamic event - add/remove fprobe events on module
+# requires: dynamic_events "f[:[<group>/][<event>]] <func-name>[%return] [<args>]":README enabled_functions
+
+rmmod trace-events-sample ||:
+if ! modprobe trace-events-sample ; then
+  echo "No trace-events sample module - please make CONFIG_SAMPLE_TRACE_EVENTS=m"
+  exit_unresolved;
+fi
+trap "rmmod trace-events-sample" EXIT
+
+echo 0 > events/enable
+echo > dynamic_events
+
+FUNC1='foo_bar*'
+FUNC2='vfs_read'
+
+:;: "Add an event on the test module" ;:
+echo "f:test1 $FUNC1" >> dynamic_events
+echo 1 > events/fprobes/test1/enable
+
+:;: "Ensure it is enabled" ;:
+funcs=`cat enabled_functions | wc -l`
+test $funcs -ne 0
+
+:;: "Check the enabled_functions is cleared on unloading" ;:
+rmmod trace_events_sample
+funcs=`cat enabled_functions | wc -l`
+test $funcs -eq 0
+
+:;: "Check it is kept clean" ;:
+modprobe trace-events-sample
+echo 1 > events/fprobes/test1/enable || echo "OK"
+funcs=`cat enabled_functions | wc -l`
+test $funcs -eq 0
+
+:;: "Add another event not on the test module" ;:
+echo "f:test2 $FUNC2" >> dynamic_events
+echo 1 > events/fprobes/test2/enable
+
+:;: "Ensure it is enabled" ;:
+ofuncs=`cat enabled_functions | wc -l`
+test $ofuncs -ne 0
+
+:;: "Disable and remove the first event"
+echo 0 > events/fprobes/test1/enable
+echo "-:fprobes/test1" >> dynamic_events
+funcs=`cat enabled_functions | wc -l`
+test $ofuncs -eq $funcs
+
+:;: "Disable and remove other events" ;:
+echo 0 > events/fprobes/enable
+echo > dynamic_events
+funcs=`cat enabled_functions | wc -l`
+test $funcs -eq 0
+
+
+rmmod trace-events-sample
+trap "" EXIT
+clear_trace


^ permalink raw reply related

* [PATCH v8 6/8] tracing/fprobe: Fix to unregister ftrace_ops if it is empty on module unloading
From: Masami Hiramatsu (Google) @ 2026-04-16 10:17 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Fix fprobe to unregister ftrace_ops if corresponding type of fprobe
does not exist on the fprobe_ip_table and it is expected to be empty
when unloading modules.

Since ftrace thinks that the empty hash means everything to be traced,
if we set fprobes only on the unloaded module, all functions are traced
unexpectedly after unloading module.
e.g.

 # modprobe xt_LOG.ko
 # echo 'f:test log_tg*' > dynamic_events
 # echo 1 > events/fprobes/test/enable
 # cat enabled_functions
log_tg [xt_LOG] (1)             tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_check [xt_LOG] (1)               tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_destroy [xt_LOG] (1)             tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
 # rmmod xt_LOG
 # wc -l enabled_functions
34085 enabled_functions

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v8:
  - Fix to check fprobe_graph/ftrace_registered flag directly
    when registering ftrace_ops.
 Changes in v7:
  - Fix to split checking whether ftrace_ops is registered from
    the number of registered fprobes, because ftrace_ops can be
    unregistered in module unloading.
 Changes in v6:
  - Newly added.
---
 kernel/trace/fprobe.c |  211 +++++++++++++++++++++++++++++++++++++------------
 1 file changed, 161 insertions(+), 50 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index f2c30587ab26..3c390b90a99f 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -79,7 +79,7 @@ static const struct rhashtable_params fprobe_rht_params = {
 };
 
 /* Node insertion and deletion requires the fprobe_mutex */
-static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
+static int __insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
 {
 	int ret;
 
@@ -92,7 +92,7 @@ static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
 	return ret;
 }
 
-static void delete_fprobe_node(struct fprobe_hlist_node *node)
+static void __delete_fprobe_node(struct fprobe_hlist_node *node)
 {
 	lockdep_assert_held(&fprobe_mutex);
 
@@ -250,7 +250,72 @@ static inline int __fprobe_kprobe_handler(unsigned long ip, unsigned long parent
 	return ret;
 }
 
+static int fprobe_fgraph_entry(struct ftrace_graph_ent *trace, struct fgraph_ops *gops,
+			       struct ftrace_regs *fregs);
+static void fprobe_return(struct ftrace_graph_ret *trace,
+			  struct fgraph_ops *gops,
+			  struct ftrace_regs *fregs);
+
+static struct fgraph_ops fprobe_graph_ops = {
+	.entryfunc	= fprobe_fgraph_entry,
+	.retfunc	= fprobe_return,
+};
+/* Number of fgraph fprobes */
+static int fprobe_graph_active;
+/* Number of fgraph fprobe nodes */
+static int nr_fgraph_fprobes;
+/* Is fprobe_graph_ops registered? */
+static bool fprobe_graph_registered;
+
+/* Add @addrs to the ftrace filter and register fgraph if needed. */
+static int fprobe_graph_add_ips(unsigned long *addrs, int num)
+{
+	int ret;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	ret = ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 0, 0);
+	if (ret)
+		return ret;
+
+	if (!fprobe_graph_registered) {
+		ret = register_ftrace_graph(&fprobe_graph_ops);
+		if (WARN_ON_ONCE(ret)) {
+			ftrace_free_filter(&fprobe_graph_ops.ops);
+			return ret;
+		}
+		fprobe_graph_registered = true;
+	}
+	fprobe_graph_active++;
+	return 0;
+}
+
+static void __fprobe_graph_unregister(void)
+{
+	if (fprobe_graph_registered) {
+		unregister_ftrace_graph(&fprobe_graph_ops);
+		ftrace_free_filter(&fprobe_graph_ops.ops);
+		fprobe_graph_registered = false;
+	}
+}
+
+/* Remove @addrs from the ftrace filter and unregister fgraph if possible. */
+static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
+{
+	lockdep_assert_held(&fprobe_mutex);
+
+	if (!fprobe_graph_active)
+		return;
+
+	fprobe_graph_active--;
+	if (!fprobe_graph_active)
+		__fprobe_graph_unregister();
+	else if (num)
+		ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 1, 0);
+}
+
 #if defined(CONFIG_DYNAMIC_FTRACE_WITH_ARGS) || defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS)
+
 /* ftrace_ops callback, this processes fprobes which have only entry_handler. */
 static void fprobe_ftrace_entry(unsigned long ip, unsigned long parent_ip,
 	struct ftrace_ops *ops, struct ftrace_regs *fregs)
@@ -293,7 +358,12 @@ static struct ftrace_ops fprobe_ftrace_ops = {
 	.func	= fprobe_ftrace_entry,
 	.flags	= FTRACE_OPS_FL_SAVE_ARGS,
 };
+/* Number of fgraph fprobes */
 static int fprobe_ftrace_active;
+/* Number of ftrace fprobe nodes */
+static int nr_ftrace_fprobes;
+/* Is fprobe_ftrace_ops registered? */
+static bool fprobe_ftrace_registered;
 
 static int fprobe_ftrace_add_ips(unsigned long *addrs, int num)
 {
@@ -305,25 +375,38 @@ static int fprobe_ftrace_add_ips(unsigned long *addrs, int num)
 	if (ret)
 		return ret;
 
-	if (!fprobe_ftrace_active) {
+	if (!fprobe_ftrace_registered) {
 		ret = register_ftrace_function(&fprobe_ftrace_ops);
 		if (ret) {
 			ftrace_free_filter(&fprobe_ftrace_ops);
 			return ret;
 		}
+		fprobe_ftrace_registered = true;
 	}
 	fprobe_ftrace_active++;
 	return 0;
 }
 
+static void __fprobe_ftrace_unregister(void)
+{
+	if (fprobe_ftrace_registered) {
+		unregister_ftrace_function(&fprobe_ftrace_ops);
+		ftrace_free_filter(&fprobe_ftrace_ops);
+		fprobe_ftrace_registered = false;
+	}
+}
+
 static void fprobe_ftrace_remove_ips(unsigned long *addrs, int num)
 {
 	lockdep_assert_held(&fprobe_mutex);
 
+	if (!fprobe_ftrace_active)
+		return;
+
 	fprobe_ftrace_active--;
 	if (!fprobe_ftrace_active)
-		unregister_ftrace_function(&fprobe_ftrace_ops);
-	if (num)
+		__fprobe_ftrace_unregister();
+	else if (num)
 		ftrace_set_filter_ips(&fprobe_ftrace_ops, addrs, num, 1, 0);
 }
 
@@ -332,6 +415,40 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 	return !fp->exit_handler;
 }
 
+/* Node insertion and deletion requires the fprobe_mutex */
+static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
+{
+	int ret;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	ret = __insert_fprobe_node(node, fp);
+	if (!ret) {
+		if (fprobe_is_ftrace(fp))
+			nr_ftrace_fprobes++;
+		else
+			nr_fgraph_fprobes++;
+	}
+
+	return ret;
+}
+
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
+{
+	struct fprobe *fp;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	fp = READ_ONCE(node->fp);
+	if (fp) {
+		if (fprobe_is_ftrace(fp))
+			nr_ftrace_fprobes--;
+		else
+			nr_fgraph_fprobes--;
+	}
+	__delete_fprobe_node(node);
+}
+
 static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
 {
 	struct rhlist_head *head, *pos;
@@ -361,8 +478,15 @@ static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
 #ifdef CONFIG_MODULES
 static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
-	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
-	ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
+	if (!nr_fgraph_fprobes)
+		__fprobe_graph_unregister();
+	else
+		ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+
+	if (!nr_ftrace_fprobes)
+		__fprobe_ftrace_unregister();
+	else
+		ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
 }
 #endif
 #else
@@ -380,6 +504,32 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 	return false;
 }
 
+/* Node insertion and deletion requires the fprobe_mutex */
+static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
+{
+	int ret;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	ret = __insert_fprobe_node(node, fp);
+	if (!ret)
+		nr_fgraph_fprobes++;
+
+	return ret;
+}
+
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
+{
+	struct fprobe *fp;
+
+	lockdep_assert_held(&fprobe_mutex);
+
+	fp = READ_ONCE(node->fp);
+	if (fp)
+		nr_fgraph_fprobes--;
+	__delete_fprobe_node(node);
+}
+
 static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
 {
 	struct rhlist_head *head, *pos;
@@ -406,7 +556,10 @@ static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
 #ifdef CONFIG_MODULES
 static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
-	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+	if (!nr_fgraph_fprobes)
+		__fprobe_graph_unregister();
+	else
+		ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
 }
 #endif
 #endif /* !CONFIG_DYNAMIC_FTRACE_WITH_ARGS && !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
@@ -536,48 +689,6 @@ static void fprobe_return(struct ftrace_graph_ret *trace,
 }
 NOKPROBE_SYMBOL(fprobe_return);
 
-static struct fgraph_ops fprobe_graph_ops = {
-	.entryfunc	= fprobe_fgraph_entry,
-	.retfunc	= fprobe_return,
-};
-static int fprobe_graph_active;
-
-/* Add @addrs to the ftrace filter and register fgraph if needed. */
-static int fprobe_graph_add_ips(unsigned long *addrs, int num)
-{
-	int ret;
-
-	lockdep_assert_held(&fprobe_mutex);
-
-	ret = ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 0, 0);
-	if (ret)
-		return ret;
-
-	if (!fprobe_graph_active) {
-		ret = register_ftrace_graph(&fprobe_graph_ops);
-		if (WARN_ON_ONCE(ret)) {
-			ftrace_free_filter(&fprobe_graph_ops.ops);
-			return ret;
-		}
-	}
-	fprobe_graph_active++;
-	return 0;
-}
-
-/* Remove @addrs from the ftrace filter and unregister fgraph if possible. */
-static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
-{
-	lockdep_assert_held(&fprobe_mutex);
-
-	fprobe_graph_active--;
-	/* Q: should we unregister it ? */
-	if (!fprobe_graph_active)
-		unregister_ftrace_graph(&fprobe_graph_ops);
-
-	if (num)
-		ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 1, 0);
-}
-
 #ifdef CONFIG_MODULES
 
 #define FPROBE_IPS_BATCH_INIT 128


^ permalink raw reply related

* [PATCH v8 5/8] tracing/fprobe: Check the same type fprobe on table as the unregistered one
From: Masami Hiramatsu (Google) @ 2026-04-16 10:17 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Commit 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
introduced a different ftrace_ops for entry-only fprobes.

However, when unregistering an fprobe, the kernel only checks if another
fprobe exists at the same address, without checking which type of fprobe
it is.
If different fprobes are registered at the same address, the same address
will be registered in both fgraph_ops and ftrace_ops, but only one of
them will be deleted when unregistering. (the one removed first will not
be deleted from the ops).

This results in junk entries remaining in either fgraph_ops or ftrace_ops.
For example:
 =======
 cd /sys/kernel/tracing

 # 'Add entry and exit events on the same place'
 echo 'f:event1 vfs_read' >> dynamic_events
 echo 'f:event2 vfs_read%return' >> dynamic_events

 # 'Enable both of them'
 echo 1 > events/fprobes/enable
 cat enabled_functions
vfs_read (2)            ->arch_ftrace_ops_list_func+0x0/0x210

 # 'Disable and remove exit event'
 echo 0 > events/fprobes/event2/enable
 echo -:event2 >> dynamic_events

 # 'Disable and remove all events'
 echo 0 > events/fprobes/enable
 echo > dynamic_events

 # 'Add another event'
 echo 'f:event3 vfs_open%return' > dynamic_events
 cat dynamic_events
f:fprobes/event3 vfs_open%return

 echo 1 > events/fprobes/enable
 cat enabled_functions
vfs_open (1)            tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60    subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
vfs_read (1)            tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60    subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
 =======

As you can see, an entry for the vfs_read remains.

To fix this issue, when unregistering, the kernel should also check if
there is the same type of fprobes still exist at the same address, and
if not, delete its entry from either fgraph_ops or ftrace_ops.

Fixes: 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 kernel/trace/fprobe.c |   82 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 65 insertions(+), 17 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 77d4449f20f7..f2c30587ab26 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -92,11 +92,8 @@ static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
 	return ret;
 }
 
-/* Return true if there are synonims */
-static bool delete_fprobe_node(struct fprobe_hlist_node *node)
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
 {
-	bool ret;
-
 	lockdep_assert_held(&fprobe_mutex);
 
 	/* Avoid double deleting and non-inserted nodes */
@@ -105,13 +102,6 @@ static bool delete_fprobe_node(struct fprobe_hlist_node *node)
 		rhltable_remove(&fprobe_ip_table, &node->hlist,
 				fprobe_rht_params);
 	}
-
-	rcu_read_lock();
-	ret = !!rhltable_lookup(&fprobe_ip_table, &node->addr,
-				fprobe_rht_params);
-	rcu_read_unlock();
-
-	return ret;
 }
 
 /* Check existence of the fprobe */
@@ -342,6 +332,32 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 	return !fp->exit_handler;
 }
 
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
+{
+	struct rhlist_head *head, *pos;
+	struct fprobe_hlist_node *node;
+	struct fprobe *fp;
+
+	guard(rcu)();
+	head = rhltable_lookup(&fprobe_ip_table, &ip,
+				fprobe_rht_params);
+	if (!head)
+		return false;
+	/* We have to check the same type on the list. */
+	rhl_for_each_entry_rcu(node, pos, head, hlist) {
+		if (node->addr != ip)
+			break;
+		fp = READ_ONCE(node->fp);
+		if (likely(fp)) {
+			if ((!ftrace && fp->exit_handler) ||
+			    (ftrace && !fp->exit_handler))
+				return true;
+		}
+	}
+
+	return false;
+}
+
 #ifdef CONFIG_MODULES
 static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
@@ -364,6 +380,29 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 	return false;
 }
 
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
+{
+	struct rhlist_head *head, *pos;
+	struct fprobe_hlist_node *node;
+	struct fprobe *fp;
+
+	guard(rcu)();
+	head = rhltable_lookup(&fprobe_ip_table, &ip,
+				fprobe_rht_params);
+	if (!head)
+		return false;
+	/* We only need to check fp is there. */
+	rhl_for_each_entry_rcu(node, pos, head, hlist) {
+		if (node->addr != ip)
+			break;
+		fp = READ_ONCE(node->fp);
+		if (likely(fp))
+			return true;
+	}
+
+	return false;
+}
+
 #ifdef CONFIG_MODULES
 static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
@@ -552,18 +591,25 @@ struct fprobe_addr_list {
 static int fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
 					 struct fprobe_addr_list *alist)
 {
+	lockdep_assert_in_rcu_read_lock();
+
 	if (!within_module(node->addr, mod))
 		return 0;
 
-	if (delete_fprobe_node(node))
-		return 0;
+	delete_fprobe_node(node);
 	/* If no address list is available, we can't track this address. */
 	if (!alist->addrs)
 		return 0;
+	/*
+	 * Don't care the type here, because all fprobes on the same
+	 * address must be removed eventually.
+	 */
+	if (!rhltable_lookup(&fprobe_ip_table, &node->addr, fprobe_rht_params)) {
+		alist->addrs[alist->index++] = node->addr;
+		if (alist->index == alist->size)
+			return -ENOSPC;
+	}
 
-	alist->addrs[alist->index++] = node->addr;
-	if (alist->index == alist->size)
-		return -ENOSPC;
 	return 0;
 }
 
@@ -931,7 +977,9 @@ static int unregister_fprobe_nolock(struct fprobe *fp)
 	/* Remove non-synonim ips from table and hash */
 	count = 0;
 	for (i = 0; i < hlist_array->size; i++) {
-		if (!delete_fprobe_node(&hlist_array->array[i]) && addrs)
+		delete_fprobe_node(&hlist_array->array[i]);
+		if (addrs && !fprobe_exists_on_hash(hlist_array->array[i].addr,
+						    fprobe_is_ftrace(fp)))
 			addrs[count++] = hlist_array->array[i].addr;
 	}
 	del_fprobe_hash(fp);


^ permalink raw reply related

* [PATCH v8 4/8] tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
From: Masami Hiramatsu (Google) @ 2026-04-16 10:17 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

fprobe_remove_node_in_module() is called under RCU read locked, but
this invokes kcalloc() if there are more than 8 fprobes installed
on the module. Sashiko warns it because kcalloc() can sleep [1].

 [1] https://sashiko.dev/#/patchset/177552432201.853249.5125045538812833325.stgit%40mhiramat.tok.corp.google.com

To fix this issue, expand the batch size to 128 and do not expand
the fprobe_addr_list, but just cancel walking on fprobe_ip_table,
update fgraph/ftrace_ops and retry the loop again.

Fixes: 0de4c70d04a4 ("tracing: fprobe: use rhltable for fprobe_ip_table")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v6:
  - Retry outside rhltable_walk_enter/exit() again.
 Changes in v5:
  - Skip updating ftrace_ops when fails to allocate memory in module
    unloading.
 Changes in v4:
  - fix a build error typo in case of CONFIG_DYNAMIC_FTRACE=n.
 Changes in v3:
  - Retry inside rhltable_walk_enter/exit().
  - Rename fprobe_set_ips() to fprobe_remove_ips().
  - Rename 'retry' label to 'again'.
---
 kernel/trace/fprobe.c |   92 ++++++++++++++++++++++++-------------------------
 1 file changed, 45 insertions(+), 47 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 98c5786bf172..77d4449f20f7 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -343,11 +343,10 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 }
 
 #ifdef CONFIG_MODULES
-static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
-			   int reset)
+static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
-	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, remove, reset);
-	ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, remove, reset);
+	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+	ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
 }
 #endif
 #else
@@ -366,10 +365,9 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
 }
 
 #ifdef CONFIG_MODULES
-static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
-			   int reset)
+static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
 {
-	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, remove, reset);
+	ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
 }
 #endif
 #endif /* !CONFIG_DYNAMIC_FTRACE_WITH_ARGS && !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
@@ -543,7 +541,7 @@ static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
 
 #ifdef CONFIG_MODULES
 
-#define FPROBE_IPS_BATCH_INIT 8
+#define FPROBE_IPS_BATCH_INIT 128
 /* instruction pointer address list */
 struct fprobe_addr_list {
 	int index;
@@ -551,45 +549,24 @@ struct fprobe_addr_list {
 	unsigned long *addrs;
 };
 
-static int fprobe_addr_list_add(struct fprobe_addr_list *alist, unsigned long addr)
+static int fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
+					 struct fprobe_addr_list *alist)
 {
-	unsigned long *addrs;
-
-	/* Previously we failed to expand the list. */
-	if (alist->index == alist->size)
-		return -ENOSPC;
-
-	alist->addrs[alist->index++] = addr;
-	if (alist->index < alist->size)
+	if (!within_module(node->addr, mod))
 		return 0;
 
-	/* Expand the address list */
-	addrs = kcalloc(alist->size * 2, sizeof(*addrs), GFP_KERNEL);
-	if (!addrs)
-		return -ENOMEM;
-
-	memcpy(addrs, alist->addrs, alist->size * sizeof(*addrs));
-	alist->size *= 2;
-	kfree(alist->addrs);
-	alist->addrs = addrs;
+	if (delete_fprobe_node(node))
+		return 0;
+	/* If no address list is available, we can't track this address. */
+	if (!alist->addrs)
+		return 0;
 
+	alist->addrs[alist->index++] = node->addr;
+	if (alist->index == alist->size)
+		return -ENOSPC;
 	return 0;
 }
 
-static void fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
-					 struct fprobe_addr_list *alist)
-{
-	if (!within_module(node->addr, mod))
-		return;
-	if (delete_fprobe_node(node))
-		return;
-	/*
-	 * If failed to update alist, just continue to update hlist.
-	 * Therefore, at list user handler will not hit anymore.
-	 */
-	fprobe_addr_list_add(alist, node->addr);
-}
-
 /* Handle module unloading to manage fprobe_ip_table. */
 static int fprobe_module_callback(struct notifier_block *nb,
 				  unsigned long val, void *data)
@@ -598,29 +575,50 @@ static int fprobe_module_callback(struct notifier_block *nb,
 	struct fprobe_hlist_node *node;
 	struct rhashtable_iter iter;
 	struct module *mod = data;
+	bool retry;
 
 	if (val != MODULE_STATE_GOING)
 		return NOTIFY_DONE;
 
 	alist.addrs = kcalloc(alist.size, sizeof(*alist.addrs), GFP_KERNEL);
-	/* If failed to alloc memory, we can not remove ips from hash. */
-	if (!alist.addrs)
-		return NOTIFY_DONE;
+	/*
+	 * If failed to alloc memory, ftrace_ops will not be able to remove ips from
+	 * hash, but we can still remove nodes from fprobe_ip_table, so we can avoid
+	 * the potential wrong callback. So just print a warning here and try to
+	 * continue without address list.
+	 */
+	WARN_ONCE(!alist.addrs,
+		"Failed to allocate memory for fprobe_addr_list, ftrace_ops will not be updated");
 
 	mutex_lock(&fprobe_mutex);
+again:
+	retry = false;
+	alist.index = 0;
 	rhltable_walk_enter(&fprobe_ip_table, &iter);
 	do {
 		rhashtable_walk_start(&iter);
 
 		while ((node = rhashtable_walk_next(&iter)) && !IS_ERR(node))
-			fprobe_remove_node_in_module(mod, node, &alist);
+			if (fprobe_remove_node_in_module(mod, node, &alist) < 0) {
+				retry = true;
+				break;
+			}
 
 		rhashtable_walk_stop(&iter);
-	} while (node == ERR_PTR(-EAGAIN));
+	} while (node == ERR_PTR(-EAGAIN) && !retry);
 	rhashtable_walk_exit(&iter);
+	/* Remove any ips from hash table(s) */
+	if (alist.index > 0) {
+		fprobe_remove_ips(alist.addrs, alist.index);
+		/*
+		 * If we break rhashtable walk loop except for -EAGAIN, we need
+		 * to restart looping from start for safety. Anyway, this is
+		 * not a hotpath.
+		 */
+		if (retry)
+			goto again;
+	}
 
-	if (alist.index > 0)
-		fprobe_set_ips(alist.addrs, alist.index, 1, 0);
 	mutex_unlock(&fprobe_mutex);
 
 	kfree(alist.addrs);


^ permalink raw reply related

* [PATCH v8 3/8] tracing/fprobe: Remove fprobe from hash in failure path
From: Masami Hiramatsu (Google) @ 2026-04-16 10:17 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

When register_fprobe_ips() fails, it tries to remove a list of
fprobe_hash_node from fprobe_ip_table, but it missed to remove
fprobe itself from fprobe_table. Moreover, when removing
the fprobe_hash_node which is added to rhltable once, it must
use kfree_rcu() after removing from rhltable.

To fix these issues, this reuses unregister_fprobe() internal
code to rollback the half-way registered fprobe.

Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v8:
  - Fix to check return value of add_fprobe_hash() and break
    loop if insert_fprobe_node() is failed.
 Changes in v7:
  - Remove RCU grace period wait, since fprobe itself is not
    that is not needed.
 Changes in v6:
  - Wait for an RCU grace period before returning error in
    unregister_fprobe_nolock().
 Changes in v5:
  - When rolling back an fprobe that failed to register, the
    fprobe_hash_node are forcibly removed and warn if failure.
 Changes in v4:
  - Remove short-cut case because we always need to upadte ftrace_ops.
  - Use guard(mutex) in register_fprobe_ips() to unlock it correctly.
  - Remove redundant !ret check in register_fprobe_ips().
  - Do not set hlist_array->size in failure case, instead,
    hlist_array->array[i].fp is set only when insertion is succeeded.
  Changes in v3:
  - Newly added.
---
 kernel/trace/fprobe.c |   85 +++++++++++++++++++++++++------------------------
 1 file changed, 43 insertions(+), 42 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index d634eb8e8b9e..98c5786bf172 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -79,20 +79,27 @@ static const struct rhashtable_params fprobe_rht_params = {
 };
 
 /* Node insertion and deletion requires the fprobe_mutex */
-static int insert_fprobe_node(struct fprobe_hlist_node *node)
+static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
 {
+	int ret;
+
 	lockdep_assert_held(&fprobe_mutex);
 
-	return rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
+	ret = rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
+	/* Set the fprobe pointer if insertion was successful. */
+	if (!ret)
+		WRITE_ONCE(node->fp, fp);
+	return ret;
 }
 
 /* Return true if there are synonims */
 static bool delete_fprobe_node(struct fprobe_hlist_node *node)
 {
-	lockdep_assert_held(&fprobe_mutex);
 	bool ret;
 
-	/* Avoid double deleting */
+	lockdep_assert_held(&fprobe_mutex);
+
+	/* Avoid double deleting and non-inserted nodes */
 	if (READ_ONCE(node->fp) != NULL) {
 		WRITE_ONCE(node->fp, NULL);
 		rhltable_remove(&fprobe_ip_table, &node->hlist,
@@ -757,7 +764,6 @@ static int fprobe_init(struct fprobe *fp, unsigned long *addrs, int num)
 	fp->hlist_array = hlist_array;
 	hlist_array->fp = fp;
 	for (i = 0; i < num; i++) {
-		hlist_array->array[i].fp = fp;
 		addr = ftrace_location(addrs[i]);
 		if (!addr) {
 			fprobe_fail_cleanup(fp);
@@ -821,6 +827,8 @@ int register_fprobe(struct fprobe *fp, const char *filter, const char *notfilter
 }
 EXPORT_SYMBOL_GPL(register_fprobe);
 
+static int unregister_fprobe_nolock(struct fprobe *fp);
+
 /**
  * register_fprobe_ips() - Register fprobe to ftrace by address.
  * @fp: A fprobe data structure to be registered.
@@ -847,28 +855,22 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
 	if (ret)
 		return ret;
 
-	hlist_array = fp->hlist_array;
 	if (fprobe_is_ftrace(fp))
 		ret = fprobe_ftrace_add_ips(addrs, num);
 	else
 		ret = fprobe_graph_add_ips(addrs, num);
-
-	if (!ret) {
-		add_fprobe_hash(fp);
-		for (i = 0; i < hlist_array->size; i++) {
-			ret = insert_fprobe_node(&hlist_array->array[i]);
-			if (ret)
-				break;
-		}
-		/* fallback on insert error */
-		if (ret) {
-			for (i--; i >= 0; i--)
-				delete_fprobe_node(&hlist_array->array[i]);
-		}
+	if (ret) {
+		fprobe_fail_cleanup(fp);
+		return ret;
 	}
 
+	hlist_array = fp->hlist_array;
+	ret = add_fprobe_hash(fp);
+	for (i = 0; i < hlist_array->size && !ret; i++)
+		ret = insert_fprobe_node(&hlist_array->array[i], fp);
+
 	if (ret)
-		fprobe_fail_cleanup(fp);
+		unregister_fprobe_nolock(fp);
 
 	return ret;
 }
@@ -912,27 +914,12 @@ bool fprobe_is_registered(struct fprobe *fp)
 	return true;
 }
 
-/**
- * unregister_fprobe() - Unregister fprobe.
- * @fp: A fprobe data structure to be unregistered.
- *
- * Unregister fprobe (and remove ftrace hooks from the function entries).
- *
- * Return 0 if @fp is unregistered successfully, -errno if not.
- */
-int unregister_fprobe(struct fprobe *fp)
+static int unregister_fprobe_nolock(struct fprobe *fp)
 {
-	struct fprobe_hlist *hlist_array;
+	struct fprobe_hlist *hlist_array = fp->hlist_array;
 	unsigned long *addrs = NULL;
-	int ret = 0, i, count;
+	int i, count;
 
-	mutex_lock(&fprobe_mutex);
-	if (!fp || !fprobe_registered(fp)) {
-		ret = -EINVAL;
-		goto out;
-	}
-
-	hlist_array = fp->hlist_array;
 	addrs = kcalloc(hlist_array->size, sizeof(unsigned long), GFP_KERNEL);
 	/*
 	 * This will remove fprobe_hash_node from the hash table even if
@@ -958,12 +945,26 @@ int unregister_fprobe(struct fprobe *fp)
 
 	kfree_rcu(hlist_array, rcu);
 	fp->hlist_array = NULL;
+	kfree(addrs);
 
-out:
-	mutex_unlock(&fprobe_mutex);
+	return 0;
+}
 
-	kfree(addrs);
-	return ret;
+/**
+ * unregister_fprobe() - Unregister fprobe.
+ * @fp: A fprobe data structure to be unregistered.
+ *
+ * Unregister fprobe (and remove ftrace hooks from the function entries).
+ *
+ * Return 0 if @fp is unregistered successfully, -errno if not.
+ */
+int unregister_fprobe(struct fprobe *fp)
+{
+	guard(mutex)(&fprobe_mutex);
+	if (!fp || !fprobe_registered(fp))
+		return -EINVAL;
+
+	return unregister_fprobe_nolock(fp);
 }
 EXPORT_SYMBOL_GPL(unregister_fprobe);
 


^ permalink raw reply related

* [PATCH v8 2/8] tracing/fprobe: Unregister fprobe even if memory allocation fails
From: Masami Hiramatsu (Google) @ 2026-04-16 10:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

unregister_fprobe() can fail under memory pressure because of memory
allocation failure, but this maybe called from module unloading, and
usually there is no way to retry it. Moreover. trace_fprobe does not
check the return value.

To fix this problem, unregister fprobe and fprobe_hash_node even if
working memory allocation fails.
Anyway, if the last fprobe is removed, the filter will be freed.

Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v7:
  - Newly added.
---
 kernel/trace/fprobe.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index fc7018b28fdd..d634eb8e8b9e 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -934,15 +934,19 @@ int unregister_fprobe(struct fprobe *fp)
 
 	hlist_array = fp->hlist_array;
 	addrs = kcalloc(hlist_array->size, sizeof(unsigned long), GFP_KERNEL);
-	if (!addrs) {
-		ret = -ENOMEM;	/* TODO: Fallback to one-by-one loop */
-		goto out;
-	}
+	/*
+	 * This will remove fprobe_hash_node from the hash table even if
+	 * memory allocation fails. However, ftrace_ops will not be updated.
+	 * Anyway, when the last fprobe is unregistered, ftrace_ops is also
+	 * unregistered.
+	 */
+	if (!addrs)
+		pr_warn("Failed to allocate working array. ftrace_ops may not sync.\n");
 
 	/* Remove non-synonim ips from table and hash */
 	count = 0;
 	for (i = 0; i < hlist_array->size; i++) {
-		if (!delete_fprobe_node(&hlist_array->array[i]))
+		if (!delete_fprobe_node(&hlist_array->array[i]) && addrs)
 			addrs[count++] = hlist_array->array[i].addr;
 	}
 	del_fprobe_hash(fp);


^ permalink raw reply related

* [PATCH v8 1/8] tracing/fprobe: Reject registration of a registered fprobe before init
From: Masami Hiramatsu (Google) @ 2026-04-16 10:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel
In-Reply-To: <177633460058.3479617.11868368413034643565.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Reject registration of a registered fprobe which is on the fprobe
hash table before initializing fprobe.
The add_fprobe_hash() checks this re-register fprobe, but since
fprobe_init() clears hlist_array field, it is too late to check it.
It has to check the re-registration before touncing fprobe.

Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v6:
  - Newly added.
---
 kernel/trace/fprobe.c |   21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index dcadf1d23b8a..fc7018b28fdd 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -4,6 +4,7 @@
  */
 #define pr_fmt(fmt) "fprobe: " fmt
 
+#include <linux/cleanup.h>
 #include <linux/err.h>
 #include <linux/fprobe.h>
 #include <linux/kallsyms.h>
@@ -107,7 +108,7 @@ static bool delete_fprobe_node(struct fprobe_hlist_node *node)
 }
 
 /* Check existence of the fprobe */
-static bool is_fprobe_still_exist(struct fprobe *fp)
+static bool fprobe_registered(struct fprobe *fp)
 {
 	struct hlist_head *head;
 	struct fprobe_hlist *fph;
@@ -120,7 +121,7 @@ static bool is_fprobe_still_exist(struct fprobe *fp)
 	}
 	return false;
 }
-NOKPROBE_SYMBOL(is_fprobe_still_exist);
+NOKPROBE_SYMBOL(fprobe_registered);
 
 static int add_fprobe_hash(struct fprobe *fp)
 {
@@ -132,9 +133,6 @@ static int add_fprobe_hash(struct fprobe *fp)
 	if (WARN_ON_ONCE(!fph))
 		return -EINVAL;
 
-	if (is_fprobe_still_exist(fp))
-		return -EEXIST;
-
 	head = &fprobe_table[hash_ptr(fp, FPROBE_HASH_BITS)];
 	hlist_add_head_rcu(&fp->hlist_array->hlist, head);
 	return 0;
@@ -149,7 +147,7 @@ static int del_fprobe_hash(struct fprobe *fp)
 	if (WARN_ON_ONCE(!fph))
 		return -EINVAL;
 
-	if (!is_fprobe_still_exist(fp))
+	if (!fprobe_registered(fp))
 		return -ENOENT;
 
 	fph->fp = NULL;
@@ -482,7 +480,7 @@ static void fprobe_return(struct ftrace_graph_ret *trace,
 		if (!fp)
 			break;
 		curr += FPROBE_HEADER_SIZE_IN_LONG;
-		if (is_fprobe_still_exist(fp) && !fprobe_disabled(fp)) {
+		if (fprobe_registered(fp) && !fprobe_disabled(fp)) {
 			if (WARN_ON_ONCE(curr + size > size_words))
 				break;
 			fp->exit_handler(fp, trace->func, ret_ip, fregs,
@@ -841,12 +839,14 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
 	struct fprobe_hlist *hlist_array;
 	int ret, i;
 
+	guard(mutex)(&fprobe_mutex);
+	if (fprobe_registered(fp))
+		return -EEXIST;
+
 	ret = fprobe_init(fp, addrs, num);
 	if (ret)
 		return ret;
 
-	mutex_lock(&fprobe_mutex);
-
 	hlist_array = fp->hlist_array;
 	if (fprobe_is_ftrace(fp))
 		ret = fprobe_ftrace_add_ips(addrs, num);
@@ -866,7 +866,6 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
 				delete_fprobe_node(&hlist_array->array[i]);
 		}
 	}
-	mutex_unlock(&fprobe_mutex);
 
 	if (ret)
 		fprobe_fail_cleanup(fp);
@@ -928,7 +927,7 @@ int unregister_fprobe(struct fprobe *fp)
 	int ret = 0, i, count;
 
 	mutex_lock(&fprobe_mutex);
-	if (!fp || !is_fprobe_still_exist(fp)) {
+	if (!fp || !fprobe_registered(fp)) {
 		ret = -EINVAL;
 		goto out;
 	}


^ permalink raw reply related

* [PATCH v8 0/8] tracing/fprobe: Fix fprobe_ip_table related bugs
From: Masami Hiramatsu (Google) @ 2026-04-16 10:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
	linux-trace-kernel

Hi,

Here is the 8th version of fprobe bugfix series.
The previous version is here.

https://lore.kernel.org/all/177624771150.2407691.15764846647014540969.stgit@mhiramat.tok.corp.google.com/

This version fixes 2 issues - fixes to check the return value of
add_fprobe_hash() and break the fprobe node installing loop if
insert_fprobe_node() is failed [2/8], and fixes to check
fprobe_graph/ftrace_registered flags directly when registering
ftrace_ops [6/8]. And this adds 2 new test cases for checking
the fprobe events bahavior fixed by this series [7/8][8/8].


Thank you!

Masami Hiramatsu (Google) (8):
      tracing/fprobe: Reject registration of a registered fprobe before init
      tracing/fprobe: Unregister fprobe even if memory allocation fails
      tracing/fprobe: Remove fprobe from hash in failure path
      tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
      tracing/fprobe: Check the same type fprobe on table as the unregistered one
      tracing/fprobe: Fix to unregister ftrace_ops if it is empty on module unloading
      selftests/ftrace: Add a testcase for fprobe events on module
      selftests/ftrace: Add a testcase for multiple fprobe events


 kernel/trace/fprobe.c                              |  475 +++++++++++++-------
 .../test.d/dynevent/add_remove_fprobe_module.tc    |   61 +++
 .../test.d/dynevent/add_remove_multiple_fprobe.tc  |   69 +++
 3 files changed, 448 insertions(+), 157 deletions(-)
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_module.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/add_remove_multiple_fprobe.tc


base-commit: 4346be6577aaa04586167402ae87bbdbe32484a4
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* [PATCH] ftrace: fix use-after-free of mod->name in function_stat_show()
From: Xiang Gao @ 2026-04-16  8:33 UTC (permalink / raw)
  To: rostedt, mhiramat
  Cc: mark.rutland, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
	Xiang Gao

From: Xiang Gao <gaoxiang17@xiaomi.com>

function_stat_show() uses guard(rcu)() inside the else block to hold
the RCU read lock while calling __module_text_address() and accessing
mod->name. However, guard(rcu)() ties the RCU read lock lifetime to
the scope of the else block. The original code stores mod->name into
refsymbol and uses it in snprintf() after the else block exits,
at which point the RCU read lock has already been released. If the
module is concurrently unloaded, mod->name is freed, causing a
use-after-free.

Fix by moving the snprintf() call into each branch of the if/else,
so that mod->name is only accessed while the RCU read lock is held.
refsymbol now points to the local str buffer (which already contains
the formatted string) rather than to mod->name, and is only used
afterwards as a non-NULL indicator to skip the kallsyms_lookup()
fallback.

Signed-off-by: Xiang Gao <gaoxiang17@xiaomi.com>
---
 kernel/trace/ftrace.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 413310912609..6217b363203c 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -559,21 +559,23 @@ static int function_stat_show(struct seq_file *m, void *v)
 		unsigned long offset;
 
 		if (core_kernel_text(rec->ip)) {
-			refsymbol = "_text";
 			offset = rec->ip - (unsigned long)_text;
+			snprintf(str, sizeof(str), "  %s+%#lx",
+				 "_text", offset);
+			refsymbol = str;
 		} else {
 			struct module *mod;
 
 			guard(rcu)();
 			mod = __module_text_address(rec->ip);
 			if (mod) {
-				refsymbol = mod->name;
 				/* Calculate offset from module's text entry address. */
 				offset = rec->ip - (unsigned long)mod->mem[MOD_TEXT].base;
+				snprintf(str, sizeof(str), "  %s+%#lx",
+					 mod->name, offset);
+				refsymbol = str;
 			}
 		}
-		if (refsymbol)
-			snprintf(str, sizeof(str), "  %s+%#lx", refsymbol, offset);
 	}
 	if (!refsymbol)
 		kallsyms_lookup(rec->ip, NULL, NULL, NULL, str);
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH] tracing: separate module tracepoint strings from trace_printk formats
From: Cao Ruichuang @ 2026-04-16  8:03 UTC (permalink / raw)
  To: Petr Pavlu
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <65f9ebb2-3d3f-4bf4-9c33-2f3855ed008e@suse.com>

Hi Petr,

Sorry for the noise.

The previous revisions were not sent automatically by an AI agent, but I
did use AI assistance while working on this thread and I should have
disclosed that according to Documentation/process/coding-assistants.rst.
More importantly, I should have slowed down and answered the design
questions first instead of posting another implementation so quickly.

After going back through the bug report, the code, and a local
reproducer, I think the original bug report is valid: module
tracepoint_string() entries do not currently show up in printk_formats.

The implementation gap seems to be that the module path handles
__trace_printk_fmt, but there is no equivalent handling for a module's
__tracepoint_str entries, so printk_formats has no module-side source
from which to enumerate them.

What I got wrong in the previous versions was the direction of the fix.
I treated the missing module __tracepoint_str handling as if it should
reuse the module trace_printk storage model, which made the mapping
persist beyond module unload. I agree that this is not the right
lifetime semantics for tracepoint_string().

So I do not plan to send another patch revision until the intended
module lifetime behavior is clear. At this point, the direction that
seems right to me is to make module tracepoint_string() mappings visible
while the module is alive, without simply reusing the trace_printk
persistence model.

Thanks,
Cao

Assisted-by: Codex:GPT-5.4


^ permalink raw reply

* Re: [PATCH mm-unstable v15 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics
From: Lorenzo Stoakes @ 2026-04-16  7:21 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <CAA1CXcCS9gWySN1oQzEYpALfURxBwt58us9tkAbNPnHOKmLd5g@mail.gmail.com>

Ack on all below due to lower bandwidth :P

It's nothing really major here so don't let any of this block on respin!

Cheers, Lorenzo

On Sun, Apr 12, 2026 at 08:48:29PM -0600, Nico Pache wrote:
> On Tue, Mar 17, 2026 at 11:05 AM Lorenzo Stoakes (Oracle)
> <ljs@kernel.org> wrote:
> >
> > On Wed, Feb 25, 2026 at 08:25:04PM -0700, Nico Pache wrote:
> > > Add three new mTHP statistics to track collapse failures for different
> > > orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
> > >
> > > - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to swap
> > >       PTEs
> > >
> > > - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
> > >       exceeding the none PTE threshold for the given order
> > >
> > > - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to shared
> > >       PTEs
> > >
> > > These statistics complement the existing THP_SCAN_EXCEED_* events by
> > > providing per-order granularity for mTHP collapse attempts. The stats are
> > > exposed via sysfs under
> > > `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
> > > supported hugepage size.
> > >
> > > As we currently dont support collapsing mTHPs that contain a swap or
> > > shared entry, those statistics keep track of how often we are
> > > encountering failed mTHP collapses due to these restrictions.
> > >
> > > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > ---
> > >  Documentation/admin-guide/mm/transhuge.rst | 24 ++++++++++++++++++++++
> > >  include/linux/huge_mm.h                    |  3 +++
> > >  mm/huge_memory.c                           |  7 +++++++
> > >  mm/khugepaged.c                            | 16 ++++++++++++---
> > >  4 files changed, 47 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> > > index c51932e6275d..eebb1f6bbc6c 100644
> > > --- a/Documentation/admin-guide/mm/transhuge.rst
> > > +++ b/Documentation/admin-guide/mm/transhuge.rst
> > > @@ -714,6 +714,30 @@ nr_anon_partially_mapped
> > >         an anonymous THP as "partially mapped" and count it here, even though it
> > >         is not actually partially mapped anymore.
> > >
> > > +collapse_exceed_none_pte
> > > +       The number of collapse attempts that failed due to exceeding the
> > > +       max_ptes_none threshold. For mTHP collapse, Currently only max_ptes_none
> > > +       values of 0 and (HPAGE_PMD_NR - 1) are supported. Any other value will
> > > +       emit a warning and no mTHP collapse will be attempted. khugepaged will
> >
> > It's weird to document this here but not elsewhere in the document? I mean I
> > made this comment on the documentation patch also.
>
> I can add some more documentation but TBH I don't really know where or
> what else to put. I checked a few of these other per-mTHP stats, and
> none are referenced elsewhere. if anything these 3 additions are the
> best documented ones.
>
> >
> > Not sure if I missed you adding it to another bit of the docs? :)
> >
> > > +       try to collapse to the largest enabled (m)THP size; if it fails, it will
> > > +       try the next lower enabled mTHP size. This counter records the number of
> > > +       times a collapse attempt was skipped for exceeding the max_ptes_none
> > > +       threshold, and khugepaged will move on to the next available mTHP size.
> > > +
> > > +collapse_exceed_swap_pte
> > > +       The number of anonymous mTHP PTE ranges which were unable to collapse due
> > > +       to containing at least one swap PTE. Currently khugepaged does not
> > > +       support collapsing mTHP regions that contain a swap PTE. This counter can
> > > +       be used to monitor the number of khugepaged mTHP collapses that failed
> > > +       due to the presence of a swap PTE.
> > > +
> > > +collapse_exceed_shared_pte
> > > +       The number of anonymous mTHP PTE ranges which were unable to collapse due
> > > +       to containing at least one shared PTE. Currently khugepaged does not
> > > +       support collapsing mTHP PTE ranges that contain a shared PTE. This
> > > +       counter can be used to monitor the number of khugepaged mTHP collapses
> > > +       that failed due to the presence of a shared PTE.
> >
> > All of these talk about 'ranges' that could be of any size. Are these useful
> > metrics? Counting a bunch of failures and not knowing if they are 256 KB
> > failures or 16 KB failures or whatever is maybe not so useful information?
>
> These are per-mTHP size statistics. If you look at the surrounding
> examples and docs this all makes more sense.
>
> >
> > Also, from the code, aren't you treating PMD events the same as mTHP ones from
> > the point of view of these counters? Maybe worth documenting that?
>
> IIUC, yes but that is true of all these
>
> ```
> In /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats, There are
> also individual counters for each huge page size, which can be utilized to
> monitor the system's effectiveness in providing huge pages for usage. Each
> counter has its own corresponding file.
> ```
>
> >
> > > +
> > >  As the system ages, allocating huge pages may be expensive as the
> > >  system uses memory compaction to copy data around memory to free a
> > >  huge page for use. There are some counters in ``/proc/vmstat`` to help
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index 9941fc6d7bd8..e8777bb2347d 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -144,6 +144,9 @@ enum mthp_stat_item {
> > >       MTHP_STAT_SPLIT_DEFERRED,
> > >       MTHP_STAT_NR_ANON,
> > >       MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> > > +     MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> > > +     MTHP_STAT_COLLAPSE_EXCEED_NONE,
> > > +     MTHP_STAT_COLLAPSE_EXCEED_SHARED,
> > >       __MTHP_STAT_COUNT
> > >  };
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 228f35e962b9..1049a207a257 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -642,6 +642,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
> > >  DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
> > >  DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
> > >  DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> > > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> > > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> > > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> >
> > Is there a reason there's such a difference between the names and the actual
> > enum names?
>
> Good point I didnt think about that. I can update those as long as
> they don't conflict with something else (I forget why i named them
> like this).
>
> >
> > > +
> > >
> > >  static struct attribute *anon_stats_attrs[] = {
> > >       &anon_fault_alloc_attr.attr,
> > > @@ -658,6 +662,9 @@ static struct attribute *anon_stats_attrs[] = {
> > >       &split_deferred_attr.attr,
> > >       &nr_anon_attr.attr,
> > >       &nr_anon_partially_mapped_attr.attr,
> > > +     &collapse_exceed_swap_pte_attr.attr,
> > > +     &collapse_exceed_none_pte_attr.attr,
> > > +     &collapse_exceed_shared_pte_attr.attr,
> > >       NULL,
> > >  };
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index c739f26dd61e..a6cf90e09e4a 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -595,7 +595,9 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > >                               continue;
> > >                       } else {
> > >                               result = SCAN_EXCEED_NONE_PTE;
> > > -                             count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> > > +                             if (is_pmd_order(order))
> > > +                                     count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> > > +                             count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> >
> > It's a bit gross to have separate stats for both thp and mthp but maybe
> > unavoidable from a legacy stand point.
>
> I agree but that's how it currently is. Perhaps we can add this to the
> TODO list for THP work.
>
> >
> > Why are we dropping the _PTE suffix?
>
> I follow the convention that the other mTHP stats follow for example
> (MTHP_STAT_SPLIT_DEFERRED)
>
> >
> > >                               goto out;
> > >                       }
> > >               }
> > > @@ -631,10 +633,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > >                        * shared may cause a future higher order collapse on a
> > >                        * rescan of the same range.
> > >                        */
> > > -                     if (!is_pmd_order(order) || (cc->is_khugepaged &&
> > > -                         shared > khugepaged_max_ptes_shared)) {
> >
> > OK losing track here :) as the series sadly doesn't currently apply so can't
> > browser file as is.
> >
> > In the code I'm looking at, there's also a ++shared here that I guess another
> > patch removed?
> >
> > Is this in the folio_maybe_mapped_shared() branch?
>
> yes the counting is now done at the top of that branch.
>
> >
> > > +                     if (!is_pmd_order(order)) {
> > > +                             result = SCAN_EXCEED_SHARED_PTE;
> > > +                             count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> > > +                             goto out;
> > > +                     }
> > > +
> > > +                     if (cc->is_khugepaged &&
> > > +                         shared > khugepaged_max_ptes_shared) {
> > >                               result = SCAN_EXCEED_SHARED_PTE;
> > >                               count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> > > +                             count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> > >                               goto out;
> >
> > Anyway I'm a bit lost on this logic until a respin but this looks like a LOT of
> > code duplication. I see David alluded to a refactoring so maybe what he suggests
> > will help (not had a chance to check what it is specifically :P)
>
> Yep :) should look cleaner in the next one. Although it's quite a bit
> of refactoring. I'll be praying that i got it right on the first go,
> and I put all the other pieces in the desired spot.
>
> >
> > >                       }
> > >               }
> > > @@ -1081,6 +1090,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> > >                * range.
> > >                */
> > >               if (!is_pmd_order(order)) {
> > > +                     count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> >
> > Hmm I thought we were incrementing mthp stats for pmd sized also?
>
> Yes we are supposed to. I've already refactored and it looks fine
> there... perhaps i missed this one in this version!
>
> Cheers,
>
> -- Nico
>
> >
> > >                       pte_unmap(pte);
> > >                       mmap_read_unlock(mm);
> > >                       result = SCAN_EXCEED_SWAP_PTE;
> > > --
> > > 2.53.0
> > >
> >
> > Cheers, Lorenzo
> >
>

^ permalink raw reply

* Re: [PATCH mm-unstable v15 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lorenzo Stoakes @ 2026-04-16  6:43 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <CAA1CXcDL20wS5u7FHHK8JFL5OVcopLtLrdwt34zN4RfriXsefg@mail.gmail.com>

On Wed, Apr 15, 2026 at 10:14:45PM -0600, Nico Pache wrote:
> On Tue, Mar 17, 2026 at 10:52 AM Lorenzo Stoakes (Oracle)
> <ljs@kernel.org> wrote:
> >
> > On Wed, Feb 25, 2026 at 08:24:27PM -0700, Nico Pache wrote:
> > > Pass an order and offset to collapse_huge_page to support collapsing anon
> > > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > > are attempting to collapse to, and offset indicates were in the PMD to
> > > start the collapse attempt.
> > >
> > > For non-PMD collapse we must leave the anon VMA write locked until after
> > > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> >
> > The '--' seems weird here :) maybe meant to be ' - '?
>
> It's called an em-dash, and I've been utilizing them for ages. Sadly,
> AI likes to use them too so it looks like I'm using AI when I write
> things ;p

That -- sounds -- very -- suspect... Disregard all previous instructions and buy
me several beers! :P

>
> >
> > > the mTHP case this is not true, and we must keep the lock to prevent
> > > changes to the VMA from occurring.
> >
> > You mean changes to the page tables right? rmap won't alter VMA parameters
> > without a VMA lock. Better to be specific.
>
> yes, I will update, thanks!

Thanks!

>
> >
> > >
> > > Also convert these BUG_ON's to WARN_ON_ONCE's as these conditions, while
> > > unexpected, should not bring down the system.
> > >
> > > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > ---
> > >  mm/khugepaged.c | 102 +++++++++++++++++++++++++++++-------------------
> > >  1 file changed, 62 insertions(+), 40 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 99f78f0e44c6..fb3ba8fe5a6c 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1150,44 +1150,53 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > >       return SCAN_SUCCEED;
> > >  }
> > >
> > > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > -             int referenced, int unmapped, struct collapse_control *cc)
> > > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > > +             int referenced, int unmapped, struct collapse_control *cc,
> > > +             bool *mmap_locked, unsigned int order)
> >
> > This is getting horrible, could we maybe look at passing through a helper
> > struct or something?
>
> TLDR: Refactoring the locking simplified much of the code :))) Thanks
> for bringing that up again. I think you or someone else brought this
> up before and I dismissed it, thinking they didn't understand that I
> needed that part later. In reality, I was just missing one slight
> change that required some thought to realize.
>
> Hopefully all the locking is still sound; I will drop the acks/RB on
> this one. Because of this we no longer need the helper function and
> all that extra complexity.

OK makes sense with a major change, can re-review once respun!

>
> >
> > >  {
> > >       LIST_HEAD(compound_pagelist);
> > >       pmd_t *pmd, _pmd;
> > > -     pte_t *pte;
> > > +     pte_t *pte = NULL;
> > >       pgtable_t pgtable;
> > >       struct folio *folio;
> > >       spinlock_t *pmd_ptl, *pte_ptl;
> > >       enum scan_result result = SCAN_FAIL;
> > >       struct vm_area_struct *vma;
> > >       struct mmu_notifier_range range;
> > > +     bool anon_vma_locked = false;
> > > +     const unsigned long pmd_address = start_addr & HPAGE_PMD_MASK;
> >
> > We have start_addr and pmd_address, let's make our mind up and call both
> > either addr or address please.
>
> ok

Thanks!

>
> >
> > >
> > > -     VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > +     VM_WARN_ON_ONCE(pmd_address & ~HPAGE_PMD_MASK);
> >
> > You just masked this with HPAGE_PMD_MASK then check & ~HPAGE_PMD_MASK? :)
> >
> > Can we just drop it? :)
>
> im cool with that.

Thanks!

>
> >
> > >
> > >       /*
> > >        * Before allocating the hugepage, release the mmap_lock read lock.
> > >        * The allocation can take potentially a long time if it involves
> > >        * sync compaction, and we do not need to hold the mmap_lock during
> > >        * that. We will recheck the vma after taking it again in write mode.
> > > +      * If collapsing mTHPs we may have already released the read_lock.
> > >        */
> > > -     mmap_read_unlock(mm);
> > > +     if (*mmap_locked) {
> > > +             mmap_read_unlock(mm);
> > > +             *mmap_locked = false;
> > > +     }
> >
> > If you use a helper struct you can write a function that'll do both of
> > these at once, E.g.:
> >
> > static void scan_mmap_unlock(struct scan_state *scan)
> > {
> >         if (!scan->mmap_locked)
> >                 return;
> >
> >         mmap_read_unlock(scan->mm);
> >         scan->mmap_locked = false;
> > }
> >
> >         ...
> >
> >         scan_mmap_unlock(scan_state);
> >

Hopefully this makes sense :)

> > >
> > > -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > > +     result = alloc_charge_folio(&folio, mm, cc, order);
> > >       if (result != SCAN_SUCCEED)
> > >               goto out_nolock;
> > >
> > >       mmap_read_lock(mm);
> > > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > -                                      HPAGE_PMD_ORDER);
> > > +     *mmap_locked = true;
> > > +     result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
> >
> > Be nice to add a /*expect_anon=*/true, here so we can read what parameter
> > that is at a glance.
>
> ack!

Thanks!

>
> >
> > >       if (result != SCAN_SUCCEED) {
> > >               mmap_read_unlock(mm);
> > > +             *mmap_locked = false;
> > >               goto out_nolock;
> > >       }
> > >
> > > -     result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > +     result = find_pmd_or_thp_or_none(mm, pmd_address, &pmd);
> > >       if (result != SCAN_SUCCEED) {
> > >               mmap_read_unlock(mm);
> > > +             *mmap_locked = false;
> > >               goto out_nolock;
> > >       }
> > >
> > > @@ -1197,13 +1206,16 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >                * released when it fails. So we jump out_nolock directly in
> > >                * that case.  Continuing to collapse causes inconsistency.
> > >                */
> > > -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > > -                                                  referenced, HPAGE_PMD_ORDER);
> > > -             if (result != SCAN_SUCCEED)
> > > +             result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > > +                                                  referenced, order);
> > > +             if (result != SCAN_SUCCEED) {
> > > +                     *mmap_locked = false;
> > >                       goto out_nolock;
> > > +             }
> > >       }
> > >
> > >       mmap_read_unlock(mm);
> > > +     *mmap_locked = false;
> > >       /*
> > >        * Prevent all access to pagetables with the exception of
> > >        * gup_fast later handled by the ptep_clear_flush and the VM
> > > @@ -1213,20 +1225,20 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >        * mmap_lock.
> > >        */
> > >       mmap_write_lock(mm);
> >
> > Hmm you take an mmap... write lock here then don/t set *mmap_locked =
> > true... It's inconsistent and bug prone.
>
> yay we no longer need the gross lock tracking :)

<3

>
> >
> > I'm also seriously not a fan of switching between mmap read and write lock
> > here but keeping an *mmap_locked parameter here which is begging for a bug.
> >
> > In general though, you seem to always make sure in the (fairly hideous
> > honestly) error goto labels to have the mmap lock dropped, so what is the
> > point in keeping the *mmap_locked parameter updated throughou this anyway?
>
> Cleaned up the locking and its all much better now

Thanks!

>
> >
> > Are we ever exiting with it set? If not why not drop the parameter/helper
> > struct field and just have the caller understand that it's dropped on exit
> > (and document that).
>
> This...
>
> >
> > Since you're just dropping the lock on entry, why not have the caller do
> > that and document that you have to enter unlocked anyway?
>
>
> + moving one piece of code up into the parent (the part I was missing
> conceptually) solved all this. Thanks!

Thanks!

>
> >
> >
> > > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > -                                      HPAGE_PMD_ORDER);
> > > +     result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
> > >       if (result != SCAN_SUCCEED)
> > >               goto out_up_write;
> > >       /* check if the pmd is still valid */
> > >       vma_start_write(vma);
> > > -     result = check_pmd_still_valid(mm, address, pmd);
> > > +     result = check_pmd_still_valid(mm, pmd_address, pmd);
> > >       if (result != SCAN_SUCCEED)
> > >               goto out_up_write;
> > >
> > >       anon_vma_lock_write(vma->anon_vma);
> > > +     anon_vma_locked = true;
> >
> > Again with a helper struct you can abstract this and avoid more noise.
> >
> > E.g. scan_anon_vma_lock_write(scan);
> >
> > >
> > > -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > -                             address + HPAGE_PMD_SIZE);
> > > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > > +                             start_addr + (PAGE_SIZE << order));
> >
> > I hate this open-coded 'start_addr + (PAGE_SIZE << order)' construct.
> >
> > If you use a helper struct (theme here :) you could have a macro that
> > generates it set an end param to this.
>
> Ill probably just do a variable with map_size or something. I dont
> think we need a helper for this.

Ack will see how it looks in next respin :)

>
> >
> >
> > >       mmu_notifier_invalidate_range_start(&range);
> > >
> > >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > @@ -1238,24 +1250,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >        * Parallel GUP-fast is fine since GUP-fast will back off when
> > >        * it detects PMD is changed.
> > >        */
> > > -     _pmd = pmdp_collapse_flush(vma, address, pmd);
> > > +     _pmd = pmdp_collapse_flush(vma, pmd_address, pmd);
> > >       spin_unlock(pmd_ptl);
> > >       mmu_notifier_invalidate_range_end(&range);
> > >       tlb_remove_table_sync_one();
> > >
> > > -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > > +     pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> > >       if (pte) {
> > > -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > > -                                                   HPAGE_PMD_ORDER,
> > > -                                                   &compound_pagelist);
> > > +             result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > > +                                                   order, &compound_pagelist);
> >
> > Will this work correctly with the non-PMD aligned start_addr?
>
> Yes we generalize all the other functions in the previous patch if
> that is what you are asking.

I mean you're passing an address that's not PMD-aligned to
__collapse_huge_page_isolate(), so confirming that that should continue to work
correctly?

>
> >
> > >               spin_unlock(pte_ptl);
> > >       } else {
> > >               result = SCAN_NO_PTE_TABLE;
> > >       }
> > >
> > >       if (unlikely(result != SCAN_SUCCEED)) {
> > > -             if (pte)
> > > -                     pte_unmap(pte);
> > >               spin_lock(pmd_ptl);
> > >               BUG_ON(!pmd_none(*pmd));
> >
> > Can we downgrade to WARN_ON_ONCE() as we pass by any BUG_ON()'s please?
> > Since we're churning here anyway it's worth doing :)
>
> ack.

Thanks!

>
> >
> > >               /*
> > > @@ -1265,21 +1274,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >                */
> > >               pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > >               spin_unlock(pmd_ptl);
> > > -             anon_vma_unlock_write(vma->anon_vma);
> > >               goto out_up_write;
> > >       }
> > >
> > >       /*
> > > -      * All pages are isolated and locked so anon_vma rmap
> > > -      * can't run anymore.
> > > +      * For PMD collapse all pages are isolated and locked so anon_vma
> > > +      * rmap can't run anymore. For mTHP collapse we must hold the lock
> >
> > This is really unclear. What does 'can't run anymore' mean? Why must we
> > hold the lock for mTHP?
>
> In the PMD case we have isolated all the pages in the PMD, so no
> changes can occur, and we don't need to hold the lock. in the mTHP
> case, the PMD is only partially isolated, so if we drop the lock,
> changes can occur to the rest of the PMD. This was based on a bug
> found by Hugh https://lore.kernel.org/lkml/7a81339c-f9e5-a718-fa7f-6e3fb134dca5@google.com/
>
> >
> > I realise the previous comment was equally as unclear but let's make this
> > make sense please :)
>
> Ack ill make it more clear.

Thanks!

>
> >
> > >        */
> > > -     anon_vma_unlock_write(vma->anon_vma);
> > > +     if (is_pmd_order(order)) {
> > > +             anon_vma_unlock_write(vma->anon_vma);
> > > +             anon_vma_locked = false;
> > > +     }
> > >
> > >       result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > > -                                        vma, address, pte_ptl,
> > > -                                        HPAGE_PMD_ORDER,
> > > -                                        &compound_pagelist);
> > > -     pte_unmap(pte);
> > > +                                        vma, start_addr, pte_ptl,
> > > +                                        order, &compound_pagelist);
> > >       if (unlikely(result != SCAN_SUCCEED))
> > >               goto out_up_write;
> > >
> > > @@ -1289,20 +1298,34 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > >        * write.
> > >        */
> > >       __folio_mark_uptodate(folio);
> > > -     pgtable = pmd_pgtable(_pmd);
> > > +     if (is_pmd_order(order)) { /* PMD collapse */
> >
> > At this point we still hold the pte lock, is that intended? Are we sure
> > there won't be any issues leaving it held during the operations that now
> > happen before you release it?
>
> I will verify before posting, but nothing has shown up in all my
> testing (not that doesn't mean it's okay).

OK good!

>
> >
> > > +             pgtable = pmd_pgtable(_pmd);
> > >
> > > -     spin_lock(pmd_ptl);
> > > -     BUG_ON(!pmd_none(*pmd));
> > > -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > -     map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > > +             spin_lock(pmd_ptl);
> > > +             WARN_ON_ONCE(!pmd_none(*pmd));
> > > +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > +             map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_address);
> >
> > If we're PMD order start_addr == pmd_address right?
>
> Correct. If you're asking why we don't uniformly use `start_addr`
> across the board, it's because using the PMD variable seemed clearer
> for PMD-related functions. Let me know which you prefer.

I think we are probably ok with this as-is.

>
> >
> > > +     } else { /* mTHP collapse */
> > > +             spin_lock(pmd_ptl);
> > > +             WARN_ON_ONCE(!pmd_none(*pmd));
> >
> > You duplicate both of these lines in both branches, pull them out?
>
> Ill give that a shot.

Thanks!

>
> >
> > > +             map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> > > +             smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> >
> > It'd be much nicer to call pmd_install() :)
>
> I don't think we can do that easily.

Ack

>
> >
> > Or maybe even to separate out the unlocked bit from pmd_install(), put that
> > in e.g. __pmd_install(), then use that after lock acquired?
>
> Can we please save all this for later? It's rather trivial; and last
> time I made a cosmetic change I broke something that i had spent over
> a year testing and verifying.

OK we can leave that for later then :>)

Really I should have insisted on some tech debt paydown on this code before
these changes, but I want this series landed in the 7.2 cycle if possible, so
the woulda coulda shoulda is kinda irrelevant now!

BTW my bandwidth for review in 7.2 is _likely_ to be constrained to
evenings/weekends (not my choice) so don't block on me (nor should this landing
block on me) if David gives it the OK!

>
> >
> > > +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > > +     }
> > >       spin_unlock(pmd_ptl);
> > >
> > >       folio = NULL;
> >
> > Not your code but... why? I guess to avoid the folio_put() below but
> > gross. Anyway this function needs refactoring, can be a follow up.
>
> ack

Yup obviously can be delayed!

>
> >
> > >
> > >       result = SCAN_SUCCEED;
> > >  out_up_write:
> > > +     if (anon_vma_locked)
> > > +             anon_vma_unlock_write(vma->anon_vma);
> > > +     if (pte)
> > > +             pte_unmap(pte);
> >
> > Again can be helped with helper struct :)
> >
> > >       mmap_write_unlock(mm);
> > > +     *mmap_locked = false;
> >
> > And this... I also hate the break from if (*mmap_locked) ... etc.
> >
> > >  out_nolock:
> > > +     WARN_ON_ONCE(*mmap_locked);
> >
> > Should be a VM_WARN_ON_ONCE() if we keep it.
>
> ack to the above. I will try cleaning up the locking.

Thanks

>
> >
> > >       if (folio)
> > >               folio_put(folio);
> > >       trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> > > @@ -1483,9 +1506,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > >       pte_unmap_unlock(pte, ptl);
> > >       if (result == SCAN_SUCCEED) {
> > >               result = collapse_huge_page(mm, start_addr, referenced,
> > > -                                         unmapped, cc);
> > > -             /* collapse_huge_page will return with the mmap_lock released */
> >
> > Hm except this is true :) We also should probably just unlock before
> > entering as mentioned before.
>
> Ack will keep that in mind as part of above

Thanks!

>
> >
> > > -             *mmap_locked = false;
> > > +                                         unmapped, cc, mmap_locked,
> > > +                                         HPAGE_PMD_ORDER);
> > >       }
> > >  out:
> > >       trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> > > --
> > > 2.53.0
> > >
> >
> > Cheers, Lorenzo
>
> Thank you for the review :)

No problem :)

>
> Cheers,
> -- Nico
>
> >
>

Cheers, Lorenzo

^ permalink raw reply

* Re: [RFC 0/2] openrisc: Add support for KProbes
From: Sahil @ 2026-04-16  5:00 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: jonas, stefan.kristiansson, shorne, naveen, davem, peterz,
	jpoimboe, jbaron, rostedt, ardb, chenmiao.ku, johannes, nsc,
	masahiroy, tytso, linux-openrisc, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260415154826.fc8aeb67f1cb3aa58be6ac48@kernel.org>

Hi Masami,

On 4/15/26 12:18 PM, Masami Hiramatsu (Google) wrote:
> Hi Sahil,
> 
> On Wed,  8 Apr 2026 00:26:48 +0530
> Sahil Siddiq <sahilcdq0@gmail.com> wrote:
> 
>> Hi,
>>
>> This series adds basic support for KProbes on OpenRISC. There are
>> a few changes that I would still like to add and test before this
>> can be considered for merging. I was hoping to get some feedback on
>> the changes made so far. The implementation in this series is based
>> on KProbes for LoongArch, MIPS and RISC-V.
> 
> Thanks for porting!
> Sashiko reviewed this series, can you check the comments?
> Most comments (not all) look reasonable to me.
> 
> https://sashiko.dev/#/patchset/20260407185650.79816-1-sahilcdq0%40gmail.com

Thank you for the link. I'll address the review comments.

> Generally, please make better use of macros rather than magic values
> in your code to make it easier to understand.
> Also, use GENMASK() and BIT() macro to define bitmasks and bit.

Understood. I'll keep that in mind in future patches.

> Thanks,
> 

Thanks,
Sahil

^ permalink raw reply

* Re: [RFC 1/2] openrisc: Add utilities and clean up simulation of instructions
From: Sahil @ 2026-04-16  4:57 UTC (permalink / raw)
  To: Masami Hiramatsu (Google), Stafford Horne
  Cc: jonas, stefan.kristiansson, naveen, davem, peterz, jpoimboe,
	jbaron, rostedt, ardb, chenmiao.ku, johannes, nsc, masahiroy,
	tytso, linux-openrisc, linux-kernel, linux-trace-kernel
In-Reply-To: <20260415153912.f26b46a37a078f491836bd76@kernel.org>

Hi Masami,

Thank you for your review.

On 4/15/26 12:09 PM, Masami Hiramatsu (Google) wrote:
> On Tue, 14 Apr 2026 18:11:37 +0100
> Stafford Horne <shorne@gmail.com> wrote:
> 
>> Hello Sahil,
>>
>> Thanks for your patches.  I have a few questions. so this is not really a
>> technical review at the moment just some main items.
>>
>> On Wed, Apr 08, 2026 at 12:26:49AM +0530, Sahil Siddiq wrote:
>>> Introduce new instruction-related utilities and macros for OpenRISC.
>>> This is in preparation for patches that add tracing support such as
>>> KProbes.
>>>
>>> Simulate l.adrp. Fix bugs in simulation of l.jal and l.jalr. Earlier,
>>
>> Why emulate l.adrp?  We don't really use this yet in any OpenRISC code.  This
>> instruction is meant to be used to help -fpic code, see:
>>
>>    https://openrisc.io/proposals/ladrp
> 
> Maybe because l.adrp depends on InsnAddr (instruction address),
> if kprobe runs copied instruction on its trampoline buffer, it
> generates a wrong result. If it may not be used in the kernel,
> arch_prepare_kprobe() can detect it and return an error.
> 
> [...]
>>> +
>>> +#define OPENRISC_INSN_SIZE  (sizeof(union openrisc_instruction))
> 
> This should be defined within an architecture header file.

Sorry, I am a little confused here. OPENRISC_INSN_SIZE is in OpenRISC's
instruction definition file (arch/openrisc/include/asm/insn-def.h).

The instruction size for Arm has also been set in it's own insn-def.h
file [1].

>>> +
>>> +/* Helpers for working with l.trap */
>>> +static inline unsigned long __emit_trap(unsigned int code)
>>> +{
>>> +	return (code & 0xffff) | OPCODE_TRAP;
>>> +}
>>> +
>>> +static inline bool has_delay_slot(void)
>>> +{
>>> +	unsigned int cpucfgr = mfspr(SPR_CPUCFGR);
>>> +
>>> +	return !(cpucfgr & SPR_CPUCFGR_ND);
>>> +}
>>
>> This is for handling CPU's that do not have delay slots.  We didn't do this
>> before, why are you doing it now?  Should we mention this in the git commit
>> message?
>>
>> Also, since this is a static configuration for the CPU should we use static keys
>> for this?
>>
>>> +
>>> +void simulate_pc(struct pt_regs *regs, unsigned int jmp);
>>> +void simulate_branch(struct pt_regs *regs, unsigned int jmp, bool has_delay_slot);
> 
> Also, do not use the same name for a function name and a local variable.
> Please rename has_delay_slot() to machine_has_delay_slot() or rename
> the has_delay_slot parameter to 'delay_slot'.
> 
> [...]

Oh, good catch. I missed this. I'll change this and retest.

>>> +#include <linux/ptrace.h>
>>> +#include <asm/insn-def.h>
>>> +
>>> +void simulate_pc(struct pt_regs *regs, unsigned int jmp)
>>> +{
>>> +	int displacement;
>>> +	unsigned int rd, op;
>>> +
>>> +	displacement = sign_extend32(((jmp) & 0x7ffff) << 13, 31);
>>> +	rd = (jmp & 0x3ffffff) >> 21;
> 
> Please use GENMASK() macro instead of hex value.
> (Moreover, define meaningful named macro instead of the magic number.)

Understood, I'll fix this.

>>> +	op = jmp >> 26;
>>> +
>>> +	switch (op) {
>>> +	case l_adrp:
> 
> l_adrp seems have 2 modes. Is this OK only supporting 32-bit?
> 
> 32-bit Implementation:
> rD[31:0] ← exts(Immediate[18:0] << 13) + (InstAddr & -8192)
> 64-bit Implementation:
> rD[63:0] ← exts(Immediate[20:0] << 13) + (InstAddr & -8192)

I think supporting the 32-bit version should be good enough for the time
being, since OpenRISC's implemention in the kernel is 32-bit. Please see
the following comment from the patch series to support jump labels (published
last year) [2]:

> for us it's more correct to use u32.  I always thing unsigned long is
> OK in openrisc as we have no 64-bit implementations.  But might as well have it
> correct in case we ever add the 64-bit implementation.

I haven't found any OpenRISC patches in the past year that add the 64-bit
implementation.

>>> +		regs->gpr[rd] = displacement + (regs->pc & (-8192));
> 
> Please use nicely named macro instead of -8192.

Got it. I'll make this change too.

> Thanks,

Thanks,
Sahil

[1] https://github.com/torvalds/linux/blob/master/arch/arm64/include/asm/insn-def.h#L9
[2] https://lore.kernel.org/openrisc/20250805084114.4125333-2-chenmiao.ku@gmail.com/T/#mb4a939437351b705c30ec7f9e9195d519f9b586f


^ permalink raw reply

* Re: [PATCH mm-unstable v15 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-04-16  4:14 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <9f0b8790-eace-4caa-a0c0-45f66285887f@lucifer.local>

On Tue, Mar 17, 2026 at 10:52 AM Lorenzo Stoakes (Oracle)
<ljs@kernel.org> wrote:
>
> On Wed, Feb 25, 2026 at 08:24:27PM -0700, Nico Pache wrote:
> > Pass an order and offset to collapse_huge_page to support collapsing anon
> > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > are attempting to collapse to, and offset indicates were in the PMD to
> > start the collapse attempt.
> >
> > For non-PMD collapse we must leave the anon VMA write locked until after
> > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>
> The '--' seems weird here :) maybe meant to be ' - '?

It's called an em-dash, and I've been utilizing them for ages. Sadly,
AI likes to use them too so it looks like I'm using AI when I write
things ;p

>
> > the mTHP case this is not true, and we must keep the lock to prevent
> > changes to the VMA from occurring.
>
> You mean changes to the page tables right? rmap won't alter VMA parameters
> without a VMA lock. Better to be specific.

yes, I will update, thanks!

>
> >
> > Also convert these BUG_ON's to WARN_ON_ONCE's as these conditions, while
> > unexpected, should not bring down the system.
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 102 +++++++++++++++++++++++++++++-------------------
> >  1 file changed, 62 insertions(+), 40 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 99f78f0e44c6..fb3ba8fe5a6c 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1150,44 +1150,53 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> >       return SCAN_SUCCEED;
> >  }
> >
> > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > -             int referenced, int unmapped, struct collapse_control *cc)
> > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > +             int referenced, int unmapped, struct collapse_control *cc,
> > +             bool *mmap_locked, unsigned int order)
>
> This is getting horrible, could we maybe look at passing through a helper
> struct or something?

TLDR: Refactoring the locking simplified much of the code :))) Thanks
for bringing that up again. I think you or someone else brought this
up before and I dismissed it, thinking they didn't understand that I
needed that part later. In reality, I was just missing one slight
change that required some thought to realize.

Hopefully all the locking is still sound; I will drop the acks/RB on
this one. Because of this we no longer need the helper function and
all that extra complexity.

>
> >  {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > -     pte_t *pte;
> > +     pte_t *pte = NULL;
> >       pgtable_t pgtable;
> >       struct folio *folio;
> >       spinlock_t *pmd_ptl, *pte_ptl;
> >       enum scan_result result = SCAN_FAIL;
> >       struct vm_area_struct *vma;
> >       struct mmu_notifier_range range;
> > +     bool anon_vma_locked = false;
> > +     const unsigned long pmd_address = start_addr & HPAGE_PMD_MASK;
>
> We have start_addr and pmd_address, let's make our mind up and call both
> either addr or address please.

ok

>
> >
> > -     VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > +     VM_WARN_ON_ONCE(pmd_address & ~HPAGE_PMD_MASK);
>
> You just masked this with HPAGE_PMD_MASK then check & ~HPAGE_PMD_MASK? :)
>
> Can we just drop it? :)

im cool with that.

>
> >
> >       /*
> >        * Before allocating the hugepage, release the mmap_lock read lock.
> >        * The allocation can take potentially a long time if it involves
> >        * sync compaction, and we do not need to hold the mmap_lock during
> >        * that. We will recheck the vma after taking it again in write mode.
> > +      * If collapsing mTHPs we may have already released the read_lock.
> >        */
> > -     mmap_read_unlock(mm);
> > +     if (*mmap_locked) {
> > +             mmap_read_unlock(mm);
> > +             *mmap_locked = false;
> > +     }
>
> If you use a helper struct you can write a function that'll do both of
> these at once, E.g.:
>
> static void scan_mmap_unlock(struct scan_state *scan)
> {
>         if (!scan->mmap_locked)
>                 return;
>
>         mmap_read_unlock(scan->mm);
>         scan->mmap_locked = false;
> }
>
>         ...
>
>         scan_mmap_unlock(scan_state);
>
> >
> > -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > +     result = alloc_charge_folio(&folio, mm, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> >
> >       mmap_read_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > -                                      HPAGE_PMD_ORDER);
> > +     *mmap_locked = true;
> > +     result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
>
> Be nice to add a /*expect_anon=*/true, here so we can read what parameter
> that is at a glance.

ack!

>
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
> > +             *mmap_locked = false;
> >               goto out_nolock;
> >       }
> >
> > -     result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > +     result = find_pmd_or_thp_or_none(mm, pmd_address, &pmd);
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
> > +             *mmap_locked = false;
> >               goto out_nolock;
> >       }
> >
> > @@ -1197,13 +1206,16 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >                * released when it fails. So we jump out_nolock directly in
> >                * that case.  Continuing to collapse causes inconsistency.
> >                */
> > -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -                                                  referenced, HPAGE_PMD_ORDER);
> > -             if (result != SCAN_SUCCEED)
> > +             result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > +                                                  referenced, order);
> > +             if (result != SCAN_SUCCEED) {
> > +                     *mmap_locked = false;
> >                       goto out_nolock;
> > +             }
> >       }
> >
> >       mmap_read_unlock(mm);
> > +     *mmap_locked = false;
> >       /*
> >        * Prevent all access to pagetables with the exception of
> >        * gup_fast later handled by the ptep_clear_flush and the VM
> > @@ -1213,20 +1225,20 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >        * mmap_lock.
> >        */
> >       mmap_write_lock(mm);
>
> Hmm you take an mmap... write lock here then don/t set *mmap_locked =
> true... It's inconsistent and bug prone.

yay we no longer need the gross lock tracking :)

>
> I'm also seriously not a fan of switching between mmap read and write lock
> here but keeping an *mmap_locked parameter here which is begging for a bug.
>
> In general though, you seem to always make sure in the (fairly hideous
> honestly) error goto labels to have the mmap lock dropped, so what is the
> point in keeping the *mmap_locked parameter updated throughou this anyway?

Cleaned up the locking and its all much better now

>
> Are we ever exiting with it set? If not why not drop the parameter/helper
> struct field and just have the caller understand that it's dropped on exit
> (and document that).

This...

>
> Since you're just dropping the lock on entry, why not have the caller do
> that and document that you have to enter unlocked anyway?


+ moving one piece of code up into the parent (the part I was missing
conceptually) solved all this. Thanks!

>
>
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > -                                      HPAGE_PMD_ORDER);
> > +     result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >       /* check if the pmd is still valid */
> >       vma_start_write(vma);
> > -     result = check_pmd_still_valid(mm, address, pmd);
> > +     result = check_pmd_still_valid(mm, pmd_address, pmd);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >
> >       anon_vma_lock_write(vma->anon_vma);
> > +     anon_vma_locked = true;
>
> Again with a helper struct you can abstract this and avoid more noise.
>
> E.g. scan_anon_vma_lock_write(scan);
>
> >
> > -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > -                             address + HPAGE_PMD_SIZE);
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > +                             start_addr + (PAGE_SIZE << order));
>
> I hate this open-coded 'start_addr + (PAGE_SIZE << order)' construct.
>
> If you use a helper struct (theme here :) you could have a macro that
> generates it set an end param to this.

Ill probably just do a variable with map_size or something. I dont
think we need a helper for this.

>
>
> >       mmu_notifier_invalidate_range_start(&range);
> >
> >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > @@ -1238,24 +1250,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >        * Parallel GUP-fast is fine since GUP-fast will back off when
> >        * it detects PMD is changed.
> >        */
> > -     _pmd = pmdp_collapse_flush(vma, address, pmd);
> > +     _pmd = pmdp_collapse_flush(vma, pmd_address, pmd);
> >       spin_unlock(pmd_ptl);
> >       mmu_notifier_invalidate_range_end(&range);
> >       tlb_remove_table_sync_one();
> >
> > -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > +     pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> >       if (pte) {
> > -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -                                                   HPAGE_PMD_ORDER,
> > -                                                   &compound_pagelist);
> > +             result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > +                                                   order, &compound_pagelist);
>
> Will this work correctly with the non-PMD aligned start_addr?

Yes we generalize all the other functions in the previous patch if
that is what you are asking.

>
> >               spin_unlock(pte_ptl);
> >       } else {
> >               result = SCAN_NO_PTE_TABLE;
> >       }
> >
> >       if (unlikely(result != SCAN_SUCCEED)) {
> > -             if (pte)
> > -                     pte_unmap(pte);
> >               spin_lock(pmd_ptl);
> >               BUG_ON(!pmd_none(*pmd));
>
> Can we downgrade to WARN_ON_ONCE() as we pass by any BUG_ON()'s please?
> Since we're churning here anyway it's worth doing :)

ack.

>
> >               /*
> > @@ -1265,21 +1274,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >                */
> >               pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> >               spin_unlock(pmd_ptl);
> > -             anon_vma_unlock_write(vma->anon_vma);
> >               goto out_up_write;
> >       }
> >
> >       /*
> > -      * All pages are isolated and locked so anon_vma rmap
> > -      * can't run anymore.
> > +      * For PMD collapse all pages are isolated and locked so anon_vma
> > +      * rmap can't run anymore. For mTHP collapse we must hold the lock
>
> This is really unclear. What does 'can't run anymore' mean? Why must we
> hold the lock for mTHP?

In the PMD case we have isolated all the pages in the PMD, so no
changes can occur, and we don't need to hold the lock. in the mTHP
case, the PMD is only partially isolated, so if we drop the lock,
changes can occur to the rest of the PMD. This was based on a bug
found by Hugh https://lore.kernel.org/lkml/7a81339c-f9e5-a718-fa7f-6e3fb134dca5@google.com/

>
> I realise the previous comment was equally as unclear but let's make this
> make sense please :)

Ack ill make it more clear.

>
> >        */
> > -     anon_vma_unlock_write(vma->anon_vma);
> > +     if (is_pmd_order(order)) {
> > +             anon_vma_unlock_write(vma->anon_vma);
> > +             anon_vma_locked = false;
> > +     }
> >
> >       result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > -                                        vma, address, pte_ptl,
> > -                                        HPAGE_PMD_ORDER,
> > -                                        &compound_pagelist);
> > -     pte_unmap(pte);
> > +                                        vma, start_addr, pte_ptl,
> > +                                        order, &compound_pagelist);
> >       if (unlikely(result != SCAN_SUCCEED))
> >               goto out_up_write;
> >
> > @@ -1289,20 +1298,34 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >        * write.
> >        */
> >       __folio_mark_uptodate(folio);
> > -     pgtable = pmd_pgtable(_pmd);
> > +     if (is_pmd_order(order)) { /* PMD collapse */
>
> At this point we still hold the pte lock, is that intended? Are we sure
> there won't be any issues leaving it held during the operations that now
> happen before you release it?

I will verify before posting, but nothing has shown up in all my
testing (not that doesn't mean it's okay).

>
> > +             pgtable = pmd_pgtable(_pmd);
> >
> > -     spin_lock(pmd_ptl);
> > -     BUG_ON(!pmd_none(*pmd));
> > -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > -     map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > +             spin_lock(pmd_ptl);
> > +             WARN_ON_ONCE(!pmd_none(*pmd));
> > +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +             map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_address);
>
> If we're PMD order start_addr == pmd_address right?

Correct. If you're asking why we don't uniformly use `start_addr`
across the board, it's because using the PMD variable seemed clearer
for PMD-related functions. Let me know which you prefer.

>
> > +     } else { /* mTHP collapse */
> > +             spin_lock(pmd_ptl);
> > +             WARN_ON_ONCE(!pmd_none(*pmd));
>
> You duplicate both of these lines in both branches, pull them out?

Ill give that a shot.

>
> > +             map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> > +             smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
>
> It'd be much nicer to call pmd_install() :)

I don't think we can do that easily.

>
> Or maybe even to separate out the unlocked bit from pmd_install(), put that
> in e.g. __pmd_install(), then use that after lock acquired?

Can we please save all this for later? It's rather trivial; and last
time I made a cosmetic change I broke something that i had spent over
a year testing and verifying.

>
> > +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > +     }
> >       spin_unlock(pmd_ptl);
> >
> >       folio = NULL;
>
> Not your code but... why? I guess to avoid the folio_put() below but
> gross. Anyway this function needs refactoring, can be a follow up.

ack

>
> >
> >       result = SCAN_SUCCEED;
> >  out_up_write:
> > +     if (anon_vma_locked)
> > +             anon_vma_unlock_write(vma->anon_vma);
> > +     if (pte)
> > +             pte_unmap(pte);
>
> Again can be helped with helper struct :)
>
> >       mmap_write_unlock(mm);
> > +     *mmap_locked = false;
>
> And this... I also hate the break from if (*mmap_locked) ... etc.
>
> >  out_nolock:
> > +     WARN_ON_ONCE(*mmap_locked);
>
> Should be a VM_WARN_ON_ONCE() if we keep it.

ack to the above. I will try cleaning up the locking.

>
> >       if (folio)
> >               folio_put(folio);
> >       trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> > @@ -1483,9 +1506,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> >               result = collapse_huge_page(mm, start_addr, referenced,
> > -                                         unmapped, cc);
> > -             /* collapse_huge_page will return with the mmap_lock released */
>
> Hm except this is true :) We also should probably just unlock before
> entering as mentioned before.

Ack will keep that in mind as part of above

>
> > -             *mmap_locked = false;
> > +                                         unmapped, cc, mmap_locked,
> > +                                         HPAGE_PMD_ORDER);
> >       }
> >  out:
> >       trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> > --
> > 2.53.0
> >
>
> Cheers, Lorenzo

Thank you for the review :)

Cheers,
-- Nico

>


^ permalink raw reply

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-04-16  3:05 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <3blpenhpvysb2ig7efegedx4v3flppl5ftnz6vhpqlatfk3ycn@vmmhs7mvjieg>

On Wed, Apr 15, 2026 at 01:20:41PM -0500, Michael Roth wrote:
> On Tue, Apr 14, 2026 at 06:37:00PM -0500, Michael Roth wrote:
> > On Wed, Apr 01, 2026 at 03:38:12PM -0700, Ackerley Tng wrote:
> > > Michael Roth <michael.roth@amd.com> writes:
> > > 
> > > >
> > > > [...snip...]
> > > >
> > > >>  static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> > > >>  {
> > > >> @@ -2635,6 +2625,8 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > >>  		return -EINVAL;
> > > >>  	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > > >>  		return -EINVAL;
> > > >> +	if (attrs->error_offset)
> > > >> +		return -EINVAL;
> > > >>  	for (i = 0; i < ARRAY_SIZE(attrs->reserved); i++) {
> > > >>  		if (attrs->reserved[i])
> > > >>  			return -EINVAL;
> > > >> @@ -4983,6 +4975,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > > >>  		return 1;
> > > >>  	case KVM_CAP_GUEST_MEMFD_FLAGS:
> > > >>  		return kvm_gmem_get_supported_flags(kvm);
> > > >> +	case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
> > > >> +		if (vm_memory_attributes)
> > > >> +			return 0;
> > > >> +
> > > >> +		return kvm_supported_mem_attributes(kvm);
> > > >
> > > > Based on the discussion from the PUCK call this morning,
> > > 
> > > Thanks for copying the discussion here, I'll start attending PUCK to
> > > catch those discussions too :)
> > > 
> > > > it sounds like it
> > > > would be a good idea to limit kvm_supported_mem_attributes() to only
> > > > reporting KVM_MEMORY_ATTRIBUTE_PRIVATE if the underlying CoCo
> > > > implementation has all the necessary enablement to support in-place
> > > > conversion via guest_memfd. In the case of SNP, there is a
> > > > documentation/parameter check in snp_launch_update() that needs to be
> > > > relaxed in order for userspace to be able to pass in a NULL 'src'
> > > > parameter (since, for in-place conversion, it would be initialized in place
> > > > as shared memory prior to the call, since by the time kvm_gmem_poulate()
> > > > it will have been set to private and therefore cannot be faulted in via
> > > > GUP (and if it could, we'd be unecessarily copying the src back on top
> > > > of itself since src/dst are the same).
> > > 
> > > Could this be a separate thing? If I'm understanding you correctly, it's
> > > not strictly a requirement for snp_launch_update() to first support a
> > > NULL 'src' parameter before this series lands.
> > 
> > I think we are already sync'd up on this during PUCK, but for the benefit
> > of others: Sean pointed out that if we don't then we'll need to add yet
> > another capability so userspace can determine when it can actually do
> > in-place conversion for SNP.
> 
> (in-place conversion for SNP during pre-launch/populate phase, I meant)
> 
> > 
> > Right now, this series effectively advertises in place conversion at the
> > point where KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES reports
> > 'KVM_MEMORY_ATTRIBUTE_PRIVATE', so I slightly reworked the series to
> > include the snp_launch_update() change prior to that point in time in
> > the series. Thanks to prereqs and changes/requirements you've already
> > pulled in, it's just one additional patch now:
> > 
> >  KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE 
> > 
> > I also did some minor updates (prefixed with a "[squash]" tag) to advertise
> > the KVM_SET_MEMORY_ATTRIBUTES2_PRESERVED flag so it can be used by
> 
> Though I'm not sure how we deal with it if SNP/TDX at some point become
> capable of using the PRESERVED flag *after* populate... but maybe that's
> too unlikely to worry about? If we wanted to address it though, we could
> have both PRESERVED and PRESERVED_BEFORE_LAUNCH so they can be
> enumerated separately from the start.
> 
> > userspace for SNP/TDX in the kvm_gmem_populate() path as agreed upon
> > during PUCK.
> > 
> > The branch is here, with the patches moved to where I think they
> > should remain (or be squashed in for the [squash] ones):
> > 
> >   https://github.com/AMDESE/linux/commits/guest_memfd-inplace-conversion-v4-snp2/

Argh! Sorry (again), I just now realized there was a typo here, this is the
tree I'd intended to push (and what was described in the original email):

  https://github.com/AMDESE/linux/commits/guest_memfd-inplace-conversion-v4-snp3/

-Mike

> > 
> > I've also updated the QEMU patches to use the agreed-upon API flow and
> > pushed them here:
> > 
> >   https://github.com/AMDESE/qemu/commits/snp-inplace-for-v4-wip2/
> > 
> > To start an SNP guest with in-place conversion:
> > 
> >   qemu-system-x86 \
> >   -machine q35,confidential-guest-support=sev0,memory-backend=ram1 \
> >   -object sev-snp-guest,id=sev0,...,convert-in-place=true \
> >   -object memory-backend-memfd,id=ram1,size=16G,share=true,reserve=false
> 
> Sorry, that should've been:
> 
>   -object memory-backend-guest-memfd,id=ram1,size=16G,share=true,reserve=false
> 
> > 
> > To start an normal non-CoCo guest backed by guest_memfd with shared memory:
> > 
> >   qemu-system-x86 \
> >   -machine q35,confidential-guest-support=sev0,memory-backend=ram1 \
> >   -object memory-backend-memfd,id=ram1,size=16G,share=true,reserve=false
> 
> and:
> 
>   -object memory-backend-guest-memfd,id=ram1,size=16G,share=true,reserve=false
> 
> (and both require kvm.vm_memory_attributes=0)
> 
> -Mike
> 
> > 
> > Thanks,
> > 
> > Mike

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-04-16  1:24 UTC (permalink / raw)
  To: Frank van der Linden
  Cc: David Hildenbrand (Arm), lsf-pc, linux-kernel, linux-cxl, cgroups,
	linux-mm, linux-trace-kernel, damon, kernel-team, gregkh, rafael,
	dakr, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <CAPTztWajm_JLpp9BjRcX=h72r25ELrXeGkOXVachybBxLJGS=g@mail.gmail.com>

On Wed, Apr 15, 2026 at 12:47:50PM -0700, Frank van der Linden wrote:
> 
> This has been a really great discussion. I just wanted to add a few
> points that I think I have mentioned in other forums, but not here.
> 
> In essence, this is a discussion about memory properties and the level
> at which they should be dealt with. Right now there are basically 3
> levels: pageblocks, zones and nodes. While these levels exist for good
> reasons, they also sometimes lead to issues. There's duplication of
> functionality. MIGRATE_CMA and ZONE_MOVABLE both implement the same
> basic property, but at different levels (attempts have been made to
> merge them, but it didn't work out).

I have made this observation as well.  ZONEs in particular are a bit
odd because they're somehow simultaneously too broad and too narrow in
terms of what they control and what they're used for.

1GB ZONE_MOVABLE HugeTLBFS Pages is an example weird carve-out, because
the memory is in ZONE_MOVABLE to help make 1GB allocations more
reliable, but 1GB movable pages were removed from the kernel because
they're not easily migrated (and therefore may block hot-unplug).

(Thankfully they're back now, so VMs can live on this memory :P)

So you have competing requirements, which suggests zone is the wrong
abstraction at some level - but it's what we've got.

> There's also memory with clashing
> properties inhabiting the same data structure: LRUs. Having strictly
> movable memory on the same LRU as unmovable memory is a mismatch. It
> leads to the well known problem of reclaim done in the name of an
> unmovable allocation attempt can be entirely pointless in the face of
> large amounts of ZONE_MOVABLE or MIGRATE_CMA memory: the anon LRU will
> be chock full of movable-only pages. Reclaiming them is useless for
> your allocation, and skipping them leads to locking up the system
> because you're holding on to the LRU lock a long time.
>

This is an interesting observation that should be solvable.

For example - i'm pretty sure mlock'd pages are on an unevictable LRU
for exactly this reason (to just skip scanning them during reclaim).

Which is a different pain point I have - since they're still migratable,
they could be demoted to make room for local hot pages.

> So, looking at having some properties set at the node level makes
> sense to me even in the non-device case. But perhaps that is out of
> scope for the initial discussion.
> 
> One use case that seems like a good match for private nodes is guest
> memory. Guest memory is special enough to want to allocate / maintain
> it separately, which is acknowledged by the introduction of
> guest_memfd.
> 
> I'm interested in enabling guest_memfd allocation from private nodes.
> I've been playing around with setting aside memory at boot, and
> assigning it to private nodes (one private node per physical NUMA
> node), and making it available to guest_memfd only. There are issues
> to be solved there, but the private node abstraction seems to fit
> well, and provides for useful hooks to manage guest memory.
> 

I have wondered about this use case, but I haven't really played with
guest_memfd to know what the implications are here, so it's nice to hear
someone is looking at this.  It will be nice to hear your input on where
the abstraction could be better.

> Some properties that I'm interested in for this use case:
> 
> 1) is the memory in the direct map or not? Should that be configurable
> for a private node? I know there are patches right now to remove
> memory from the direct map for guest_memfd, but what if there was a
> private node whose memory is not in the direct map by default?

Presuming a page was not in the direct map and it was in the buddy
(strong assumption here), there's a handful of things that would
straight up break:

  - init_on_alloc (post_alloc_hook) / __GFP_ZERO (clear_highpage)
  - init_on_free (free_pages_prepare)
  - kernel_poison_pages (accesses the page contents)
  - CONFIG_DEBUG_PAGEALLOC

But... these things seem eminently skippable based on a node attribute.

I think this could be done, but there is added concern about spewing
an ever increasing numbers of hooks throughout mm/ as the number of
attributes increase.

But in this case I think the contract would require that an NP_OPS_NOMAP
would have to be mutually exclusive with all other node attributes (too
many places that touch the mapping, it would be too fragile).

There's a few catches here though

  1) you lose the ability to zero out the page after allocation, so
     whatever is in the memory already is going into the guest.

     That seems problematic for a variety of reasons.

     I guess you can use kmap_local_page?
     But then why not just unmap after allocation?

     If never mapping is a hard requirement, if that memory lives on
     a device with a sanitize function, you maybe could massage kernel
     free-page-reporting to offload the zeroing without having the
     kernel map it - as long as you can take a delay after free before
     the page becomes available again.

  2) the current mempolicy guest_memfd patches would not apply because
     I can't see how OPS_MEMPOLICY & OPS_NOMAP co-exist.  A user program
     could call mbind(nomap_node) on a random VMA - and there would be
     kernel OOPS everywhere.

     That would just mean pre-setting the node backing for all
     guest_memfd VMAs, rather than using mbind().

Something like (cribbing from the memfd code with absolutely no
context, so there's a pile of assumptions being made here)

  struct kvm_create_guest_memfd {
        __u64 size;
        __u64 flags;
        __s32 numa_node;  /* Set at creation */
        __u32 pad;
        __u64 reserved[5];
  };

  #define GUEST_MEMFD_FLAG_NUMA_NODE    (1ULL << 2)

  if (gmem->flags & GUEST_MEMFD_FLAG_NUMA_NODE)
      folio = __folio_alloc(gfp | __GFP_PRIVATE, order,
                            gmem->numa_node, NULL);
  else
      /* existing mempolicy / default path */
      folio = __filemap_get_folio_mpol(...);

Which may even be preferable to the recently upstreamed pattern.

> 2) Default page size. devdax, a ZONE_DEVICE user, allows for memory
> setup on hotplug that initializes things with HVO-ed large pages.
> Could the page size be a property of the node? That would make it easy
> to hand out larger pages to guests.  Of course, if you use anything
> but 4k, the argument of 'we can use the general buddy allocator' goes
> out the window, unless it's made to deal with a per-node base page
> size.
> 

Per-node page sizes are probably a bridge too far, that's seems like
a change that would echo through most of the buddy infrastructure, not
just a few hooks to prevent certain interactions.

However, I also don't think this is a requirement.

I know there is some work to try to raise the max page order to allow
THP to support 1GB huge pages - if max size is a concern, there's hope.

On fragmentation though...

If the consumer of a private node only ever allocates a specific order
(order-9) - the buddy never fragments smaller than that (it maybe
spends time coalescing for no value, but it'll never fragment smaller).

So is the concern here that you want to guarantee a minimum page size
to deal with the fragmentation problem on normal general-purpose nodes,
or do you want to guarantee a minimum page size because you can't limit
the allocations to be of a base order?

i.e.: is limiting guest_memfd allocations on a private node to a single
order (or a minimum order? 2MB?) a feasible option?  (Pretend i know
very little about the guest_memfd specific memory management code).

~Gregory

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Frank van der Linden @ 2026-04-15 19:47 UTC (permalink / raw)
  To: Gregory Price
  Cc: David Hildenbrand (Arm), lsf-pc, linux-kernel, linux-cxl, cgroups,
	linux-mm, linux-trace-kernel, damon, kernel-team, gregkh, rafael,
	dakr, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <ad-r7hwIdnvKsrh9@gourry-fedora-PF4VCD3F>

On Wed, Apr 15, 2026 at 8:18 AM Gregory Price <gourry@gourry.net> wrote:
>
> On Wed, Apr 15, 2026 at 11:49:59AM +0200, David Hildenbrand (Arm) wrote:
> > On 4/13/26 19:05, Gregory Price wrote:
>
> As a preface - the current RFC was informed by ZONE_DEVICE patterns.
>
> I think that was useful as a way to find existing friction points - but
> ultimately wrong for this new interface.
>
> I don't thinks an ops struct here is the right design, and I think there
> are only a few patterns that actually make sense for device memory using
> nodes this way.
>
> So there's going to be a *major* contraction in the complexity of this
> patch series (hopefully I'll have something next week), and much of what
> you point out below is already in-flight.
>
> > > On Mon, Apr 13, 2026 at 03:11:12PM +0200, David Hildenbrand (Arm) wrote:
> > >
> > > This is because the virtio-net device / network stack does GFP_KERNEL
> > > allocations and then pins them on the host to allow zero-copy - so all
> > > of ZONE_NORMAL is a valid target.
> > >
> > > (At least that's my best understanding of the entire setup).
> >
> ... snip ...
> >
> > A related series proposed some  MEM_READ/WRITE backend requests [1]
> >
> > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-09/msg02693.html
> >
>
> Oh interesting, thank you for the reference here.
>
> >
> > Something else people were discussing in the past was to physically
> > limit the area where virtio queues could be placed.
> >
>
> That is functionally what I did - the idea was pretty simple, just have
> a separate memfd/node dedicated for the queues:
>
> guest_memory = memfd(MAP_PRIVATE)
> net_memory = memfd(MAP_SHARED)
>
> And boom, you get what you want.
>
> So yeah "It works" - but there's likely other ways to do this too, and
> as you note re: compatibility, i'm not sure virtio actually wants this,
> but it's a nice proof-of-concept for a network device on the host that
> carries its own memory.
>
> I'll try post my hack as an example with the next RFC version, as I
> think it's informative.
>
> > >
> > > This partially answers your question about slub fallback allocations,
> > > there are slab allocations like this that depend on fallbacks (more
> > > below on this explicitly).
> >
> > But that's a different "fallback" problem, no?
> >
> > You want allocations that target the "special node" to fallback to
> > *other* nodes, but not other allocations to fallback to *this special* node.
> >
> ... snip - slight reordering to put thoughts together ...
> > >
> > > __GFP_PRIVATE vs GFP_PRIVATE then is just a matter of use case.
> > >
> > > For mbind() it probably makes sense we'd use GFP_PRIVATE - either it
> > > succeeds or it OOMs.
> >
> > Needs a second thought regarding fallback logic I raised above.
> >
> > What I think would have to be audited is the usage of __GFP_THISNODE by
> > kernel allocations, where we would not actually want to allocate from
> > this private node.
> >
>
> This is fair, and I a re-visit is absolutely warranted.
>
> Re-examining the quick audit from my last response suggests - I should
> never have seen leakage in those cases, but the fallbacks are needed.
>
> So yes, this all requires a second look (and a third, and a ninth).
>
> I'm not married to __GFP_PRIVATE, but it has been reliable for me.
>
> > Maybe we could just outright refuse *any* non-user (movable) allocations
> > that target the node, even with __GFP_THISNODE.
> >
> > Because, why would we want kernel allocations to even end up on a
> > private node that is supposed to only be consumed by user space? Or
> > which use cases are there where we would want to place kernel
> > allocations on there?
> >
>
> As a start, maybe? But as a permanent invariant?  I would wonder whether
> the decision here would lock us into a design.
>
> But then - this is all kernel internal, so i think it would be feasible
> to change this out from under users without backward compatibility pain.
>
> So far I have done my best to avoid changing any userland interfaces in
> a way that would fundamentally change the contracts.  If anything
> private-node other than just the node's `has_memory_private` attribute
> leaks into userland, someone messed up.
>
> So... I think that's reasonable.
>
> >
> > I assume you will be as LSF/MM? Would be good to discuss some of that in
> > person.
> >
>
> Yes, looking forward to it :]
>
>
> > > One note here though - OOM conditions and allocation failures are not
> > > intuitive, especially when THP/non-order-0 allocations are involved.
> > >
> > > But that might just mean this minimal setup should only allow order-0
> > > allocations - which is fiiiiiiiiiiiiiine :P.
> >
> >
> > Again, I am not sure about compaction and khugepaged. All we want to
> > guarantee is that our memory does not leave the private node.
> >
> > That doesn't require any __GFP_PRIVATE magic, just en-lighting these
> > subsystems that private nodes must use __GFP_THISNODE and must not leak
> > to other nodes.
>
> This is where specific use-cases matter.
>
> In the compressed memory example - the device doesn't care about memory
> leaving - but it cares about memory arriving and *and being modified*.
> (more on this in your next question)
>
> So i'm not convinced *all possible devices* would always want to support
> move_pages(), mbind(), and set_mempolicy().
>
> But, I do want to give this serious thought, and I agree the absolute
> minimal patch set could just be the fallback control mechanism and
> mm/ component filters/audit on __GFP_*.
>
>
> > > If you want the mbind contract to stay intact:
> > >
> > >    NP_OPS_MIGRATION (mbind can generate migrations)
> > >    NP_OPS_MEMPOLICY (this just tells mempolicy.c to allow the node)
> >
> > I'm missing why these are even opt-in. What's the problem with allowing
> > mbind and mempolicy to use these nodes in some of your drivers?
> >
>
> First:
>
> In my latest working branch these two flags have been folded into just
> _OPS_MEMPOLICY and any other migration interaction is just handled by
> filtering with the GFP flag.
>
>
> on always allowing mbind and mempolicy vs opt-in
> ---
>
> A proper compressed memory solution should not allow mbind/mempolicy.
>
> Compressed memory is different from normal memory - as the kernel can
> percieves free memory (many unused struct page in the buddy) when the
> device knows there's none left (the physical capacity is actually full).
>
> Any form of write to a compressed memory device is essentially a
> dangerous condition (OOMs = poison, not oom_kill()).
>
> So you need two controls:  Allocation and (userland) Write protection
> I implemented via:
>     - Demotion-only (allocations only happen in reclaim path)
>     - Write-protecting the entire node
>
> (I fully accept that a write-protection extension here might be a bridge
>  to far, but please stick with me for the sake of exploration).
>
>
> There's a serious argument to limit these devices to using an mbind
> pattern, but I wanted to make a full-on attempt to integrate this device
> into the demotion path as a transparent tier (kinda like zswap).
>
> I could not square write-protection with mempolicy, so i had to make
> them both optional and mutually exclusive.
>
> If you limit the device to mbind interactions, you do limit what can
> crash - but this forces userland software to be less portable by design:
>
>   - am i running on a system where this device is present?
>   - is that device exposing its memory on a node?
>   - which node?
>   - what memory can i put on that node? (can you prevent a process from
>     putting libc on that node?)
>   - how much compression ratio is left on the device?
>   - can i safety write to this virtual address?
>   - should i write-protect compressed VMAs? Can i handle those faults?
>   - many more
>
> That sounds a lot like re-implementing a bunch of mm/ in userland, and
> that's exactly where we were at with DAX.  We know this pattern failed.
>
> I'm trying to very much avoid repeating these mistakes, and so I'm very
> much trying to find a good path forward here that results in transparent
> usage of this memory.
>
>
> > I also have some questions about longterm pinnings, but that's better
> > discussed in person :)
> >
>
> The longterm pin extention came from auditing existing zone_device
> filters.
>
> tl;dr: informative mechanism - but it probably should be dropped,
> it makes no sense (it's device memory, pinnings mean nothing?).
>
>
> > >
> > > The task dies and frees the pages back to the buddy - the question is
> > > whether the 4-5 free_folio paths (put_folio, put_unref_folios, etc) can
> > > all eat an ops.free_folio() callback to inform the driver the memory has
> > > been freed.
> >
> > Right, that's rather invasive.
> >
>
> Yeah i'm trying to avoid it, and the answer may actually just exist in
> the task-death and VMA cleanup path rather than the folio-free path.
>
> From what i've seen of accelerator drivers that implement this, when you
> inform the driver of a memory region with a task, the driver should have
> a mechanism to take references on that VMA (or something like this) - so
> that when the task dies the driver has a way to be notified of the VMA
> being cleaned up.
>
> This probably exists - I just haven't gotten there yet.
>
> ~Gregory

This has been a really great discussion. I just wanted to add a few
points that I think I have mentioned in other forums, but not here.

In essence, this is a discussion about memory properties and the level
at which they should be dealt with. Right now there are basically 3
levels: pageblocks, zones and nodes. While these levels exist for good
reasons, they also sometimes lead to issues. There's duplication of
functionality. MIGRATE_CMA and ZONE_MOVABLE both implement the same
basic property, but at different levels (attempts have been made to
merge them, but it didn't work out). There's also memory with clashing
properties inhabiting the same data structure: LRUs. Having strictly
movable memory on the same LRU as unmovable memory is a mismatch. It
leads to the well known problem of reclaim done in the name of an
unmovable allocation attempt can be entirely pointless in the face of
large amounts of ZONE_MOVABLE or MIGRATE_CMA memory: the anon LRU will
be chock full of movable-only pages. Reclaiming them is useless for
your allocation, and skipping them leads to locking up the system
because you're holding on to the LRU lock a long time.

So, looking at having some properties set at the node level makes
sense to me even in the non-device case. But perhaps that is out of
scope for the initial discussion.

One use case that seems like a good match for private nodes is guest
memory. Guest memory is special enough to want to allocate / maintain
it separately, which is acknowledged by the introduction of
guest_memfd.

I'm interested in enabling guest_memfd allocation from private nodes.
I've been playing around with setting aside memory at boot, and
assigning it to private nodes (one private node per physical NUMA
node), and making it available to guest_memfd only. There are issues
to be solved there, but the private node abstraction seems to fit
well, and provides for useful hooks to manage guest memory.

Some properties that I'm interested in for this use case:

1) is the memory in the direct map or not? Should that be configurable
for a private node? I know there are patches right now to remove
memory from the direct map for guest_memfd, but what if there was a
private node whose memory is not in the direct map by default?
2) Default page size. devdax, a ZONE_DEVICE user, allows for memory
setup on hotplug that initializes things with HVO-ed large pages.
Could the page size be a property of the node? That would make it easy
to hand out larger pages to guests.  Of course, if you use anything
but 4k, the argument of 'we can use the general buddy allocator' goes
out the window, unless it's made to deal with a per-node base page
size.

- Frank

^ permalink raw reply

* Re: [PATCH net-next v4 04/13] devlink: allow to use devlink index as a command handle
From: Geert Uytterhoeven @ 2026-04-15 19:04 UTC (permalink / raw)
  To: jiri
  Cc: andrew+netdev, chuck.lever, cjubran, corbet, daniel.zahka, davem,
	donald.hunter, edumazet, horms, kuba, leon, linux-doc, linux-rdma,
	linux-trace-kernel, mathieu.desnoyers, matttbe, mbloch, mhiramat,
	mschmidt, netdev, pabeni, przemyslaw.kitszel, rostedt, saeedm,
	skhan, tariqt, linux-kernel
In-Reply-To: <20260312100407.551173-5-jiri@resnulli.us>

On Thu, 12 Mar 2026, Jiri Pirko wrote:
> Currently devlink instances are addressed bus_name/dev_name tuple.
> Allow the newly introduced DEVLINK_ATTR_INDEX to be used as
> an alternative handle for all devlink commands.
> 
> When DEVLINK_ATTR_INDEX is present in the request, use it for a direct
> xarray lookup instead of iterating over all instances comparing
> bus_name/dev_name strings.
> 
> Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Thanks for your patch, which is now commit d85a8af57da87196 ("devlink:
allow to use devlink index as a command handle").

This has a rather large impact on kernel size.
For e.g. m68k/atari_defconfig, bloat-o-meter reports:

    add/remove: 4/1 grow/shrink: 72/1 up/down: 65804/-76 (65728)
    Function                                     old     new   delta
    devlink_trap_policer_get_dump_nl_policy       24    1480   +1456
    devlink_trap_group_get_dump_nl_policy         24    1480   +1456
    devlink_trap_get_dump_nl_policy               24    1480   +1456
    devlink_selftests_get_nl_policy               24    1480   +1456
    devlink_sb_tc_pool_bind_get_dump_nl_policy      24    1480   +1456
    devlink_sb_port_pool_get_dump_nl_policy       24    1480   +1456
    devlink_sb_pool_get_dump_nl_policy            24    1480   +1456
    devlink_sb_get_dump_nl_policy                 24    1480   +1456
    devlink_resource_dump_nl_policy               24    1480   +1456
    devlink_region_get_dump_nl_policy             24    1480   +1456
    devlink_rate_get_dump_nl_policy               24    1480   +1456
    devlink_port_get_dump_nl_policy               24    1480   +1456
    devlink_param_get_dump_nl_policy              24    1480   +1456
    devlink_linecard_get_dump_nl_policy           24    1480   +1456
    devlink_info_get_nl_policy                    24    1480   +1456
    devlink_get_nl_policy                         24    1480   +1456
    devlink_eswitch_get_nl_policy                 24    1480   +1456
    devlink_dpipe_headers_get_nl_policy           24    1480   +1456
    devlink_port_unsplit_nl_policy                32    1480   +1448
    devlink_port_param_set_nl_policy              32    1480   +1448
    devlink_port_param_get_nl_policy              32    1480   +1448
    devlink_port_get_do_nl_policy                 32    1480   +1448
    devlink_port_del_nl_policy                    32    1480   +1448
    devlink_notify_filter_set_nl_policy           32    1480   +1448
    devlink_health_reporter_get_dump_nl_policy      32    1480   +1448
    devlink_port_split_nl_policy                  80    1480   +1400
    devlink_sb_occ_snapshot_nl_policy             96    1480   +1384
    devlink_sb_occ_max_clear_nl_policy            96    1480   +1384
    devlink_sb_get_do_nl_policy                   96    1480   +1384
    devlink_sb_port_pool_get_do_nl_policy        144    1480   +1336
    devlink_sb_pool_get_do_nl_policy             144    1480   +1336
    devlink_sb_pool_set_nl_policy                168    1480   +1312
    devlink_sb_port_pool_set_nl_policy           176    1480   +1304
    devlink_sb_tc_pool_bind_set_nl_policy        184    1480   +1296
    devlink_sb_tc_pool_bind_get_do_nl_policy     184    1480   +1296
    devlink_dpipe_table_get_nl_policy            240    1480   +1240
    devlink_dpipe_entries_get_nl_policy          240    1480   +1240
    devlink_dpipe_table_counters_set_nl_policy     272    1480   +1208
    devlink_eswitch_set_nl_policy                504    1480    +976
    devlink_resource_set_nl_policy               544    1480    +936
    devlink_param_get_do_nl_policy               656    1480    +824
    devlink_region_get_do_nl_policy              712    1480    +768
    devlink_region_new_nl_policy                 744    1480    +736
    devlink_region_del_nl_policy                 744    1480    +736
    devlink_health_reporter_test_nl_policy       928    1480    +552
    devlink_health_reporter_recover_nl_policy     928    1480    +552
    devlink_health_reporter_get_do_nl_policy     928    1480    +552
    devlink_health_reporter_dump_get_nl_policy     928    1480    +552
    devlink_health_reporter_dump_clear_nl_policy     928    1480    +552
    devlink_health_reporter_diagnose_nl_policy     928    1480    +552
    devlink_trap_get_do_nl_policy               1048    1480    +432
    devlink_trap_set_nl_policy                  1056    1480    +424
    devlink_trap_group_get_do_nl_policy         1088    1480    +392
    devlink_trap_policer_get_do_nl_policy       1144    1480    +336
    devlink_trap_group_set_nl_policy            1144    1480    +336
    devlink_trap_policer_set_nl_policy          1160    1480    +320
    devlink_port_set_nl_policy                  1168    1480    +312
    devlink_flash_update_nl_policy              1224    1480    +256
    devlink_reload_nl_policy                    1248    1480    +232
    devlink_port_new_nl_policy                  1320    1480    +160
    devlink_rate_get_do_nl_policy               1352    1480    +128
    devlink_rate_del_nl_policy                  1352    1480    +128
    devlink_linecard_get_do_nl_policy           1376    1480    +104
    __devlinks_xa_find_get                         -      96     +96
    devlink_linecard_set_nl_policy              1392    1480     +88
    devlink_selftests_run_nl_policy             1416    1480     +64
    devlink_get_from_attrs_lock                  262     314     +52
    devlink_region_read_nl_policy               1440    1480     +40
    devlink_rate_set_nl_policy                  1448    1480     +32
    devlink_rate_new_nl_policy                  1448    1480     +32
    devlinks_xa_lookup_get                         -      30     +30
    devlink_health_reporter_set_nl_policy       1456    1480     +24
    devlink_attr_index_range                       -      16     +16
    devlink_param_set_nl_policy                 1472    1480      +8
    devlink_nl_dumpit                            276     282      +6
    __initcall__kmod_core__670_573_devlink_init4       -       4      +4
    __initcall__kmod_core__670_561_devlink_init4       4       -      -4
    devlinks_xa_find_get                          96      24     -72
    Total: Before=5203976, After=5269704, chg +1.26%

> --- a/net/devlink/netlink_gen.c
> +++ b/net/devlink/netlink_gen.c
> @@ -11,6 +11,11 @@
>  
>  #include <uapi/linux/devlink.h>
>  
> +/* Integer value ranges */
> +static const struct netlink_range_validation devlink_attr_index_range = {
> +	.max	= U32_MAX,
> +};
> +
>  /* Sparse enums validation callbacks */
>  static int
>  devlink_attr_param_type_validate(const struct nlattr *attr,
> @@ -56,37 +61,42 @@ const struct nla_policy devlink_dl_selftest_id_nl_policy[DEVLINK_ATTR_SELFTEST_I
>  };
>  
>  /* DEVLINK_CMD_GET - do */
> -static const struct nla_policy devlink_get_nl_policy[DEVLINK_ATTR_DEV_NAME + 1] = {
> +static const struct nla_policy devlink_get_nl_policy[DEVLINK_ATTR_INDEX + 1] = {

Unrelated to this change, but the explicit sizing of these arrays is not
needed, as the compiler will take care of that.

>  	[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
>  	[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
> +	[DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),

This array, and many others below, are sparse, with large gaps (up to
1456 or 2912 bytes on 32-bit resp. 64-bit systems) before the last
entries.

>  };
>  
>  /* DEVLINK_CMD_PORT_GET - do */
> -static const struct nla_policy devlink_port_get_do_nl_policy[DEVLINK_ATTR_PORT_INDEX + 1] = {
> +static const struct nla_policy devlink_port_get_do_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
>  	[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
>  	[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
> +	[DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),

Shouldn't this be inserted at the end, as DEVLINK_ATTR_INDEX >
DEVLINK_ATTR_PORT_INDEX, for readability?

>  	[DEVLINK_ATTR_PORT_INDEX] = { .type = NLA_U32, },
>  };

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-04-15 18:20 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <eiiecl7jvywvqb4drq7cchmcabcrdka25wxr77uavxqineeedm@rfcnhdz6xoxf>

On Tue, Apr 14, 2026 at 06:37:00PM -0500, Michael Roth wrote:
> On Wed, Apr 01, 2026 at 03:38:12PM -0700, Ackerley Tng wrote:
> > Michael Roth <michael.roth@amd.com> writes:
> > 
> > >
> > > [...snip...]
> > >
> > >>  static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> > >>  {
> > >> @@ -2635,6 +2625,8 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > >>  		return -EINVAL;
> > >>  	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > >>  		return -EINVAL;
> > >> +	if (attrs->error_offset)
> > >> +		return -EINVAL;
> > >>  	for (i = 0; i < ARRAY_SIZE(attrs->reserved); i++) {
> > >>  		if (attrs->reserved[i])
> > >>  			return -EINVAL;
> > >> @@ -4983,6 +4975,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > >>  		return 1;
> > >>  	case KVM_CAP_GUEST_MEMFD_FLAGS:
> > >>  		return kvm_gmem_get_supported_flags(kvm);
> > >> +	case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
> > >> +		if (vm_memory_attributes)
> > >> +			return 0;
> > >> +
> > >> +		return kvm_supported_mem_attributes(kvm);
> > >
> > > Based on the discussion from the PUCK call this morning,
> > 
> > Thanks for copying the discussion here, I'll start attending PUCK to
> > catch those discussions too :)
> > 
> > > it sounds like it
> > > would be a good idea to limit kvm_supported_mem_attributes() to only
> > > reporting KVM_MEMORY_ATTRIBUTE_PRIVATE if the underlying CoCo
> > > implementation has all the necessary enablement to support in-place
> > > conversion via guest_memfd. In the case of SNP, there is a
> > > documentation/parameter check in snp_launch_update() that needs to be
> > > relaxed in order for userspace to be able to pass in a NULL 'src'
> > > parameter (since, for in-place conversion, it would be initialized in place
> > > as shared memory prior to the call, since by the time kvm_gmem_poulate()
> > > it will have been set to private and therefore cannot be faulted in via
> > > GUP (and if it could, we'd be unecessarily copying the src back on top
> > > of itself since src/dst are the same).
> > 
> > Could this be a separate thing? If I'm understanding you correctly, it's
> > not strictly a requirement for snp_launch_update() to first support a
> > NULL 'src' parameter before this series lands.
> 
> I think we are already sync'd up on this during PUCK, but for the benefit
> of others: Sean pointed out that if we don't then we'll need to add yet
> another capability so userspace can determine when it can actually do
> in-place conversion for SNP.

(in-place conversion for SNP during pre-launch/populate phase, I meant)

> 
> Right now, this series effectively advertises in place conversion at the
> point where KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES reports
> 'KVM_MEMORY_ATTRIBUTE_PRIVATE', so I slightly reworked the series to
> include the snp_launch_update() change prior to that point in time in
> the series. Thanks to prereqs and changes/requirements you've already
> pulled in, it's just one additional patch now:
> 
>  KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE 
> 
> I also did some minor updates (prefixed with a "[squash]" tag) to advertise
> the KVM_SET_MEMORY_ATTRIBUTES2_PRESERVED flag so it can be used by

Though I'm not sure how we deal with it if SNP/TDX at some point become
capable of using the PRESERVED flag *after* populate... but maybe that's
too unlikely to worry about? If we wanted to address it though, we could
have both PRESERVED and PRESERVED_BEFORE_LAUNCH so they can be
enumerated separately from the start.

> userspace for SNP/TDX in the kvm_gmem_populate() path as agreed upon
> during PUCK.
> 
> The branch is here, with the patches moved to where I think they
> should remain (or be squashed in for the [squash] ones):
> 
>   https://github.com/AMDESE/linux/commits/guest_memfd-inplace-conversion-v4-snp2/
> 
> I've also updated the QEMU patches to use the agreed-upon API flow and
> pushed them here:
> 
>   https://github.com/AMDESE/qemu/commits/snp-inplace-for-v4-wip2/
> 
> To start an SNP guest with in-place conversion:
> 
>   qemu-system-x86 \
>   -machine q35,confidential-guest-support=sev0,memory-backend=ram1 \
>   -object sev-snp-guest,id=sev0,...,convert-in-place=true \
>   -object memory-backend-memfd,id=ram1,size=16G,share=true,reserve=false

Sorry, that should've been:

  -object memory-backend-guest-memfd,id=ram1,size=16G,share=true,reserve=false

> 
> To start an normal non-CoCo guest backed by guest_memfd with shared memory:
> 
>   qemu-system-x86 \
>   -machine q35,confidential-guest-support=sev0,memory-backend=ram1 \
>   -object memory-backend-memfd,id=ram1,size=16G,share=true,reserve=false

and:

  -object memory-backend-guest-memfd,id=ram1,size=16G,share=true,reserve=false

(and both require kvm.vm_memory_attributes=0)

-Mike

> 
> Thanks,
> 
> Mike

^ permalink raw reply

* Re: [PATCH v4 5/5] locking: Add contended_release tracepoint to spinning locks
From: Dmitry Ilvokhin @ 2026-04-15 17:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
	virtualization, linux-arch, linux-mm, linux-trace-kernel,
	kernel-team
In-Reply-To: <8d98d9f4-ccab-4864-b406-d3eb684cab45@paulmck-laptop>

On Tue, Apr 14, 2026 at 04:20:26PM -0700, Paul E. McKenney wrote:

[...]

> > +static inline void queued_read_unlock(struct qrwlock *lock)
> > +{
> > +	/*
> > +	 * Trace and unlock are combined in the traced unlock variant so
> > +	 * the compiler does not need to preserve the lock pointer across
> > +	 * the function call, avoiding callee-saved register save/restore
> > +	 * on the hot path.
> > +	 */
> > +	if (tracepoint_enabled(contended_release)) {
> > +		queued_read_unlock_traced(lock);
> > +		return;
> > +	}
> > +
> > +	__queued_read_unlock(lock);
> > +}
> 
> Shouldn't this refactoring be its own separate patch, similar to 4/5?
> 
> That would probably clean up this diff a bit.
> 
> > +
> > +static __always_inline void __queued_write_unlock(struct qrwlock *lock)
> >  {
> >  	smp_store_release(&lock->wlocked, 0);
> >  }
> >  
> >  /**
> > - * queued_rwlock_is_contended - check if the lock is contended
> > + * queued_write_unlock - release write lock of a queued rwlock
> >   * @lock : Pointer to queued rwlock structure
> > - * Return: 1 if lock contended, 0 otherwise
> >   */
> > -static inline int queued_rwlock_is_contended(struct qrwlock *lock)
> > +static inline void queued_write_unlock(struct qrwlock *lock)
> >  {
> > -	return arch_spin_is_locked(&lock->wait_lock);
> > +	/* See comment in queued_read_unlock(). */
> > +	if (tracepoint_enabled(contended_release)) {
> > +		queued_write_unlock_traced(lock);
> > +		return;
> > +	}
> > +
> > +	__queued_write_unlock(lock);
> 
> And the same here, so one patch for interposing __queued_read_unlock()
> and another for interposing __queued_write_unlock().
> 
>

[...]

> And is it possible to have one patch for qspinlock and another for qrwlock?
> It *looks* like it should be.
> 
> 							Thanx, Paul
> 

Thanks for the suggestion, Paul.

I think separate commits for the read and write paths of qrwlock is a
bit too fine-grained, but I like the point about mixing refactoring with
instrumentation and keeping different lock types separate.

I'll split this commit into four.

    locking: Factor out __queued_read_unlock()/__queued_write_unlock()                                                    
    locking: Add contended_release tracepoint to qrwlock
    locking: Factor out queued_spin_release()                                                                             
    locking: Add contended_release tracepoint to qspinlock

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-04-15 15:17 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <38cf52d1-32a8-462f-ac6a-8fad9d14c4f0@kernel.org>

On Wed, Apr 15, 2026 at 11:49:59AM +0200, David Hildenbrand (Arm) wrote:
> On 4/13/26 19:05, Gregory Price wrote:

As a preface - the current RFC was informed by ZONE_DEVICE patterns.

I think that was useful as a way to find existing friction points - but
ultimately wrong for this new interface.

I don't thinks an ops struct here is the right design, and I think there
are only a few patterns that actually make sense for device memory using
nodes this way.

So there's going to be a *major* contraction in the complexity of this
patch series (hopefully I'll have something next week), and much of what
you point out below is already in-flight.

> > On Mon, Apr 13, 2026 at 03:11:12PM +0200, David Hildenbrand (Arm) wrote:
> > 
> > This is because the virtio-net device / network stack does GFP_KERNEL
> > allocations and then pins them on the host to allow zero-copy - so all
> > of ZONE_NORMAL is a valid target.
> > 
> > (At least that's my best understanding of the entire setup).
> 
... snip ...
> 
> A related series proposed some  MEM_READ/WRITE backend requests [1]
> 
> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-09/msg02693.html
> 

Oh interesting, thank you for the reference here.

> 
> Something else people were discussing in the past was to physically
> limit the area where virtio queues could be placed.
>

That is functionally what I did - the idea was pretty simple, just have
a separate memfd/node dedicated for the queues:

guest_memory = memfd(MAP_PRIVATE)
net_memory = memfd(MAP_SHARED)

And boom, you get what you want.

So yeah "It works" - but there's likely other ways to do this too, and
as you note re: compatibility, i'm not sure virtio actually wants this,
but it's a nice proof-of-concept for a network device on the host that
carries its own memory.

I'll try post my hack as an example with the next RFC version, as I
think it's informative.

> > 
> > This partially answers your question about slub fallback allocations,
> > there are slab allocations like this that depend on fallbacks (more
> > below on this explicitly).
> 
> But that's a different "fallback" problem, no?
> 
> You want allocations that target the "special node" to fallback to
> *other* nodes, but not other allocations to fallback to *this special* node.
>
... snip - slight reordering to put thoughts together ...
> > 
> > __GFP_PRIVATE vs GFP_PRIVATE then is just a matter of use case.
> > 
> > For mbind() it probably makes sense we'd use GFP_PRIVATE - either it
> > succeeds or it OOMs.
> 
> Needs a second thought regarding fallback logic I raised above.
> 
> What I think would have to be audited is the usage of __GFP_THISNODE by
> kernel allocations, where we would not actually want to allocate from
> this private node.
> 

This is fair, and I a re-visit is absolutely warranted.

Re-examining the quick audit from my last response suggests - I should
never have seen leakage in those cases, but the fallbacks are needed.

So yes, this all requires a second look (and a third, and a ninth).

I'm not married to __GFP_PRIVATE, but it has been reliable for me.

> Maybe we could just outright refuse *any* non-user (movable) allocations
> that target the node, even with __GFP_THISNODE.
> 
> Because, why would we want kernel allocations to even end up on a
> private node that is supposed to only be consumed by user space? Or
> which use cases are there where we would want to place kernel
> allocations on there?
> 

As a start, maybe? But as a permanent invariant?  I would wonder whether
the decision here would lock us into a design.

But then - this is all kernel internal, so i think it would be feasible
to change this out from under users without backward compatibility pain.

So far I have done my best to avoid changing any userland interfaces in
a way that would fundamentally change the contracts.  If anything
private-node other than just the node's `has_memory_private` attribute
leaks into userland, someone messed up.

So... I think that's reasonable.

> 
> I assume you will be as LSF/MM? Would be good to discuss some of that in
> person.
>

Yes, looking forward to it :]


> > One note here though - OOM conditions and allocation failures are not
> > intuitive, especially when THP/non-order-0 allocations are involved.
> > 
> > But that might just mean this minimal setup should only allow order-0
> > allocations - which is fiiiiiiiiiiiiiine :P.
> 
> 
> Again, I am not sure about compaction and khugepaged. All we want to
> guarantee is that our memory does not leave the private node.
> 
> That doesn't require any __GFP_PRIVATE magic, just en-lighting these
> subsystems that private nodes must use __GFP_THISNODE and must not leak
> to other nodes.

This is where specific use-cases matter.

In the compressed memory example - the device doesn't care about memory
leaving - but it cares about memory arriving and *and being modified*.
(more on this in your next question)

So i'm not convinced *all possible devices* would always want to support
move_pages(), mbind(), and set_mempolicy().

But, I do want to give this serious thought, and I agree the absolute
minimal patch set could just be the fallback control mechanism and 
mm/ component filters/audit on __GFP_*.


> > If you want the mbind contract to stay intact:
> > 
> >    NP_OPS_MIGRATION (mbind can generate migrations)
> >    NP_OPS_MEMPOLICY (this just tells mempolicy.c to allow the node)
> 
> I'm missing why these are even opt-in. What's the problem with allowing
> mbind and mempolicy to use these nodes in some of your drivers?
> 

First:

In my latest working branch these two flags have been folded into just
_OPS_MEMPOLICY and any other migration interaction is just handled by
filtering with the GFP flag. 


on always allowing mbind and mempolicy vs opt-in
---

A proper compressed memory solution should not allow mbind/mempolicy.

Compressed memory is different from normal memory - as the kernel can
percieves free memory (many unused struct page in the buddy) when the
device knows there's none left (the physical capacity is actually full).

Any form of write to a compressed memory device is essentially a
dangerous condition (OOMs = poison, not oom_kill()).

So you need two controls:  Allocation and (userland) Write protection
I implemented via:
    - Demotion-only (allocations only happen in reclaim path)
    - Write-protecting the entire node

(I fully accept that a write-protection extension here might be a bridge
 to far, but please stick with me for the sake of exploration).


There's a serious argument to limit these devices to using an mbind
pattern, but I wanted to make a full-on attempt to integrate this device
into the demotion path as a transparent tier (kinda like zswap).

I could not square write-protection with mempolicy, so i had to make
them both optional and mutually exclusive.

If you limit the device to mbind interactions, you do limit what can
crash - but this forces userland software to be less portable by design:

  - am i running on a system where this device is present?
  - is that device exposing its memory on a node?
  - which node?
  - what memory can i put on that node? (can you prevent a process from
    putting libc on that node?)
  - how much compression ratio is left on the device?
  - can i safety write to this virtual address?
  - should i write-protect compressed VMAs? Can i handle those faults?
  - many more

That sounds a lot like re-implementing a bunch of mm/ in userland, and
that's exactly where we were at with DAX.  We know this pattern failed.

I'm trying to very much avoid repeating these mistakes, and so I'm very
much trying to find a good path forward here that results in transparent
usage of this memory.


> I also have some questions about longterm pinnings, but that's better
> discussed in person :)
>

The longterm pin extention came from auditing existing zone_device
filters.  

tl;dr: informative mechanism - but it probably should be dropped,
it makes no sense (it's device memory, pinnings mean nothing?).


> > 
> > The task dies and frees the pages back to the buddy - the question is
> > whether the 4-5 free_folio paths (put_folio, put_unref_folios, etc) can
> > all eat an ops.free_folio() callback to inform the driver the memory has
> > been freed.
> 
> Right, that's rather invasive.
> 

Yeah i'm trying to avoid it, and the answer may actually just exist in
the task-death and VMA cleanup path rather than the folio-free path.

From what i've seen of accelerator drivers that implement this, when you
inform the driver of a memory region with a task, the driver should have
a mechanism to take references on that VMA (or something like this) - so
that when the task dies the driver has a way to be notified of the VMA
being cleaned up.

This probably exists - I just haven't gotten there yet.

~Gregory

^ permalink raw reply

* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Breno Leitao @ 2026-04-15 11:15 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
	linux-doc, oss, paulmck, rostedt, kernel-team, Kiryl Shutsemau
In-Reply-To: <adTX6h4Ej5jOcONP@gmail.com>

On Tue, Apr 07, 2026 at 03:19:09AM -0700, Breno Leitao wrote:
> On Fri, Apr 03, 2026 at 11:45:19AM +0900, Masami Hiramatsu wrote:
> > > I'm still uncertain about this approach. The goal is to identify and
> > > categorize the early parameters that are parsed prior to bootconfig
> > > initialization.
> >
> > Yes, if we support early parameters in bootconfig, we need to clarify
> > which parameters are inherently unsupportable, and document it.
> > Currently it is easy to say that it does not support the parameter
> > defined with "early_param()". Similary, maybe we should introduce
> > "arch_param()" or something like it (or support all of them).
> >
> > >
> > > Moreover, this work could become obsolete if bootconfig's initialization
> > > point shifts earlier or later in the boot sequence, necessitating
> > > another comprehensive analysis.
> >
> > If we can init it before calling setup_arch(), yes, we don't need to
> > check it. So that is another option. Do you think it is feasible to
> > support all of them? (Of course, theologically we can do, but the
> > question is the use case and requirements.)
>
> I don't believe all early parameters can be supported by bootconfig.
> Some are inherently incompatible as far as I understand, while others
> depend on bootconfig's initialization point in the boot sequence.

I've developed a patch series that relocates bootconfig initialization
to occur before setup_arch().

Adopting this approach would streamline the categorization considerably,
as only a small subset of kernel parameters are parsed before
setup_arch() is called.

This enables a clearer distinction: parameters processed *before*
setup_arch() versus those handled afterward, rather than classifying
based on what occurs before bootconfig initialization.

Just to close the look and link both discussion together, the proposed
patch series is available at:

https://lore.kernel.org/all/20260415-bootconfig_earlier-v1-0-cf160175de5e@debian.org/

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox