Linux Trace Kernel

* Re: [PATCH] samples/ftrace: reject zero ftrace-ops call count
From: Steven Rostedt @ 2026-06-11  0:03 UTC (permalink / raw)
  To: Samuel Moelius
  Cc: Masami Hiramatsu, Mark Rutland, open list:FUNCTION HOOKS (FTRACE),
	open list:FUNCTION HOOKS (FTRACE)
In-Reply-To: <CAE+C+DZXcQfyQt-UV2PRt0vVFJqCciW6QkyEd9FaJ8++B3M4Ow@mail.gmail.com>

On Tue, 9 Jun 2026 07:26:27 -0400
Samuel Moelius <sam.moelius@trailofbits.com> wrote:

> Is it okay to keep the same subject line or should I change it?

Yes, keeping it is fine. Also note that the tracing subsystem
capitalizes the first word after the prefix:

  samples/ftrace: Reject zero ftrace-ops call count

But you can change it to:

  samples/ftrace: Prevent division by zero when nr_function_calls is zero

-- Steve
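
[Reader's note] The guard that the suggested subject line describes can
be sketched in plain user-space C. This is only an illustration under
stated assumptions, not the sample module's code: avg_ns_per_call() and
its parameters are hypothetical names, and only nr_function_calls is
taken from the thread.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute the average time per call, rejecting a
 * zero call count up front so the division below can never trap.
 */
static int avg_ns_per_call(uint64_t total_ns, unsigned int nr_function_calls,
			   uint64_t *avg_ns)
{
	if (nr_function_calls == 0)
		return -EINVAL;	/* nothing to average over */

	*avg_ns = total_ns / nr_function_calls;
	return 0;
}
```

Returning an error before the division leaves the output untouched on
the failing path, so the caller can tell "no result" from "result 0".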

* Re: [PATCH 1/2] arm64: ftrace: prepare ftrace_modify_call() for use without CALL_OPS
From: Xu Kuohai @ 2026-06-11  4:06 UTC (permalink / raw)
  To: Jose Fernandez (Anthropic), Steven Rostedt, Masami Hiramatsu,
	Mark Rutland, Catalin Marinas, Will Deacon, Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt
  Cc: linux-kernel, linux-trace-kernel, linux-arm-kernel, llvm, bpf,
	Florent Revest, Puranjay Mohan
In-Reply-To: <20260609-arm64-ftrace-direct-calls-v1-1-4a46f266697f@linux.dev>

On 6/9/2026 1:19 PM, Jose Fernandez (Anthropic) wrote:
> ftrace_modify_call() is guarded by CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
> and calls ftrace_rec_set_ops(rec, arm64_rec_get_ops(rec)) directly,
> which only exists when CALL_OPS is enabled.
> 
> Generic ftrace also needs ftrace_modify_call() when
> CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is enabled, to retarget a
> callsite between two non-FTRACE_ADDR destinations, as happens when a
> direct trampoline is modified. The next patch allows DIRECT_CALLS without
> CALL_OPS, so widen the guard to cover both configurations and switch
> the body to the ftrace_rec_update_ops() wrapper, which already has a
> stub for the !CALL_OPS case. ftrace_make_call() already uses the same
> wrapper today.
> 
> No functional change: with CALL_OPS enabled, ftrace_rec_update_ops()
> expands to the exact call this replaces.
> 
> Assisted-by: Claude:unspecified
> Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev>
> ---
>   arch/arm64/kernel/ftrace.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
> index 5a1554a441628..e1a3c0b3a0514 100644
> --- a/arch/arm64/kernel/ftrace.c
> +++ b/arch/arm64/kernel/ftrace.c
> @@ -409,7 +409,8 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
>   	return ftrace_modify_code(pc, old, new, true);
>   }
>   
> -#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
> +#if defined(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS) || \
> +	defined(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS)
>   int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
>   		       unsigned long addr)
>   {
> @@ -417,7 +418,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
>   	u32 old, new;
>   	int ret;
>   
> -	ret = ftrace_rec_set_ops(rec, arm64_rec_get_ops(rec));
> +	ret = ftrace_rec_update_ops(rec);
>   	if (ret)
>   		return ret;
>   
>
Acked-by: Xu Kuohai <xukuohai@huawei.com>
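
[Reader's note] The "wrapper with a stub for the !CALL_OPS case" idea
in the quoted commit message can be sketched in user-space C as below.
This is a simplified illustration, not the arm64 code: struct rec,
rec_set_ops(), rec_update_ops(), modify_call() and the
DEMO_WITH_CALL_OPS macro are all hypothetical stand-ins.

```c
/* Hypothetical stand-in for a patchable callsite record. */
struct rec {
	unsigned long ip;
	int ops_updated;
};

#ifdef DEMO_WITH_CALL_OPS
/* The real work exists only when the feature is built in. */
static int rec_set_ops(struct rec *rec)
{
	rec->ops_updated = 1;
	return 0;
}

static int rec_update_ops(struct rec *rec)
{
	return rec_set_ops(rec);
}
#else
/* Stub: callers still compile, and succeed, when the feature is absent. */
static int rec_update_ops(struct rec *rec)
{
	(void)rec;
	return 0;
}
#endif

/* Built for either configuration, with no #ifdef of its own. */
static int modify_call(struct rec *rec, unsigned long new_addr)
{
	int ret = rec_update_ops(rec);

	if (ret)
		return ret;

	rec->ip = new_addr;
	return 0;
}
```

Because the caller only ever names the wrapper, widening the caller's
own guard does not drag in a symbol that one configuration lacks.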


* Re: [PATCH 2/2] arm64: ftrace: allow DIRECT_CALLS without CALL_OPS
From: Xu Kuohai @ 2026-06-11  4:06 UTC (permalink / raw)
  To: Jose Fernandez (Anthropic), Steven Rostedt, Masami Hiramatsu,
	Mark Rutland, Catalin Marinas, Will Deacon, Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt
  Cc: linux-kernel, linux-trace-kernel, linux-arm-kernel, llvm, bpf,
	Florent Revest, Puranjay Mohan
In-Reply-To: <20260609-arm64-ftrace-direct-calls-v1-2-4a46f266697f@linux.dev>

On 6/9/2026 1:19 PM, Jose Fernandez (Anthropic) wrote:
> arm64 gained ftrace direct calls in commit 2aa6ac03516d ("arm64:
> ftrace: Add direct call support") on top of
> DYNAMIC_FTRACE_WITH_CALL_OPS, using the per-callsite ops pointer as a
> fast path to reach the direct trampoline. Since commit baaf553d3bc3
> ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS"), CALL_OPS is
> mutually exclusive with CFI: the pre-function NOPs would change the
> offset of the pre-function kCFI type hash, and the compiler support
> needed to keep that offset consistent does not exist yet.
> 
> The result is that a CONFIG_CFI=y kernel loses CALL_OPS, and with it
> DIRECT_CALLS, and with it every BPF trampoline attachment to kernel
> functions: register_fentry() returns -ENOTSUPP, so fentry/fexit,
> fmod_ret and BPF LSM programs cannot attach at all. This is a real
> problem for hardened arm64 deployments that rely on BPF LSM for
> security monitoring while keeping kCFI enabled.
> 
> CALL_OPS is an optimization for direct calls, not a dependency. When
> the direct trampoline is within BL range, the callsite branches
> straight to it and ftrace_caller is not involved. When it is out of
> range, ftrace_find_callable_addr() already falls back to
> ftrace_caller, and the DIRECT_CALLS machinery there
> (FREGS_DIRECT_TRAMP, ftrace_caller_direct_late) is gated on
> DIRECT_CALLS alone, not CALL_OPS: the ops dispatch invokes
> call_direct_funcs(), which stores the trampoline address in
> ftrace_regs, and ftrace_caller tail-calls it. s390 and loongarch use
> this same mechanism for HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS without
> having CALL_OPS at all, and DYNAMIC_FTRACE_WITH_ARGS without CALL_OPS
> is already a supported arm64 configuration (GCC builds with
> CC_OPTIMIZE_FOR_SIZE do not satisfy the CALL_OPS select condition).
> 
> Drop the CALL_OPS requirement from the
> HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS select. Configurations that
> keep CALL_OPS (!CFI clang builds, and GCC builds without
> CC_OPTIMIZE_FOR_SIZE) are unchanged. CALL_OPS-less configurations
> take the ftrace_caller ops-dispatch path for out-of-range direct
> calls, trading the per-callsite fast path for working BPF
> trampolines; in-range attachments still branch directly with no
> overhead. GCC -Os builds also gain DIRECT_CALLS as a side effect.
> That is intended: s390 and loongarch already ship DIRECT_CALLS
> without any per-callsite fast path.
> 
> Assisted-by: Claude:unspecified
> Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev>
> ---
>   arch/arm64/Kconfig | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index fe60738e5943b..2cd7d536671c9 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -188,7 +188,7 @@ config ARM64
>   		if (GCC_SUPPORTS_DYNAMIC_FTRACE_WITH_ARGS || \
>   		    CLANG_SUPPORTS_DYNAMIC_FTRACE_WITH_ARGS)
>   	select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS \
> -		if DYNAMIC_FTRACE_WITH_ARGS && DYNAMIC_FTRACE_WITH_CALL_OPS
> +		if DYNAMIC_FTRACE_WITH_ARGS
>   	select HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS \
>   		if (DYNAMIC_FTRACE_WITH_ARGS && !CFI && \
>   		    (CC_IS_CLANG || !CC_OPTIMIZE_FOR_SIZE))
> 
Acked-by: Xu Kuohai <xukuohai@huawei.com>
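
[Reader's note] The in-range versus out-of-range distinction in the
quoted commit message comes from the A64 BL instruction, whose signed
26-bit immediate is scaled by 4 and so reaches +/-128 MiB from the
callsite. A minimal user-space sketch of that decision follows. The
names bl_in_range(), pick_branch() and enum branch_kind are
hypothetical, and the sketch assumes both addresses are 4-byte aligned.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_128M (128LL * 1024 * 1024)

/* True if a BL at @pc can branch straight to @target. */
static bool bl_in_range(uint64_t pc, uint64_t target)
{
	int64_t offset = (int64_t)(target - pc);

	return offset >= -SZ_128M && offset < SZ_128M;
}

enum branch_kind {
	BRANCH_DIRECT,			/* callsite branches to the trampoline */
	BRANCH_VIA_FTRACE_CALLER,	/* fall back to the common entry point */
};

static enum branch_kind pick_branch(uint64_t pc, uint64_t trampoline)
{
	return bl_in_range(pc, trampoline) ? BRANCH_DIRECT
					   : BRANCH_VIA_FTRACE_CALLER;
}
```

Only the second case depends on how the common entry point finds the
trampoline, which is where the quoted text says CALL_OPS is an
optimization and not a requirement.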


* [PATCH] tracing: Remove unused ret assignment in tracing_set_tracer()
From: Wayen.Yan @ 2026-06-11  3:52 UTC (permalink / raw)
  To: rostedt; +Cc: mhiramat, mathieu.desnoyers, linux-trace-kernel, linux-kernel

In tracing_set_tracer(), the assignment 'ret = 0' following the
__tracing_resize_ring_buffer() error check is a dead store. After
this point, all subsequent code paths either return with a constant
value (-EINVAL, 0, -EBUSY) or reassign ret before reading it
(tracing_arm_snapshot_locked, tracer_init).

Remove the unnecessary assignment.

No functional change.

Signed-off-by: Wayen.Yan <win847@gmail.com>
---
 kernel/trace/trace.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a..58f81b36b7 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5018,7 +5018,6 @@ int tracing_set_tracer(struct trace_array *tr, const char *buf)
 						RING_BUFFER_ALL_CPUS);
 		if (ret < 0)
 			return ret;
-		ret = 0;
 	}

 	list_for_each_entry(t, &tr->tracers, list) {
-- 
2.51.0

* [PATCH v5 1/3] tracing: Use __free() for expr_str() buffer
From: Pengpeng Hou @ 2026-06-11  5:59 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
  Cc: linux-trace-kernel, linux-kernel, Pengpeng Hou

expr_str() allocates a temporary expression buffer and manually frees it
on some error paths.

Convert the buffer to __free(kfree) and return it with return_ptr() on
success. This keeps ownership handling separate from the later ERR_PTR()
conversion and string-bound change.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
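[Reader's note] The ownership pattern described above can be tried out
in user space with the GCC/Clang cleanup attribute, which is what the
kernel's __free() builds on. The sketch below is an approximation under
stated assumptions: auto_free, return_owned() and make_name() are
hypothetical names, not kernel API.

```c
#include <stdlib.h>
#include <string.h>

/* Runs automatically when the annotated variable goes out of scope. */
static void free_charp(char **p)
{
	free(*p);
}
#define auto_free __attribute__((cleanup(free_charp)))

/* Hand ownership to the caller: NULL the local so cleanup frees nothing. */
#define return_owned(p) do { char *ret_ = (p); (p) = NULL; return ret_; } while (0)

static char *make_name(const char *src, size_t max)
{
	char *buf auto_free = NULL;

	buf = calloc(1, max);
	if (!buf)
		return NULL;

	if (strlen(src) >= max)
		return NULL;	/* buf is freed by the cleanup handler */

	strcpy(buf, src);
	return_owned(buf);	/* the caller now owns buf */
}
```

Every early return releases the buffer without an explicit free, and
only the success path transfers it, which is why the manual kfree()
calls on the error paths can go away.
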
 kernel/trace/trace_events_hist.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 82ce492ab268..f778f060e922 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1759,7 +1759,7 @@ static void expr_field_str(struct hist_field *field, char *expr)
 
 static char *expr_str(struct hist_field *field, unsigned int level)
 {
-	char *expr;
+	char *expr __free(kfree) = NULL;
 
 	if (level > 1)
 		return NULL;
@@ -1770,7 +1770,7 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 
 	if (!field->operands[0]) {
 		expr_field_str(field, expr);
-		return expr;
+		return_ptr(expr);
 	}
 
 	if (field->operator == FIELD_OP_UNARY_MINUS) {
@@ -1778,16 +1778,15 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 
 		strcat(expr, "-(");
 		subexpr = expr_str(field->operands[0], ++level);
-		if (!subexpr) {
-			kfree(expr);
+		if (!subexpr)
 			return NULL;
-		}
+
 		strcat(expr, subexpr);
 		strcat(expr, ")");
 
 		kfree(subexpr);
 
-		return expr;
+		return_ptr(expr);
 	}
 
 	expr_field_str(field->operands[0], expr);
@@ -1806,13 +1805,12 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 		strcat(expr, "*");
 		break;
 	default:
-		kfree(expr);
 		return NULL;
 	}
 
 	expr_field_str(field->operands[1], expr);
 
-	return expr;
+	return_ptr(expr);
 }
 
 /*
-- 
2.50.1 (Apple Git-155)


* [PATCH v5 2/3] tracing: Return ERR_PTR() from expr_str()
From: Pengpeng Hou @ 2026-06-11  5:59 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
  Cc: linux-trace-kernel, linux-kernel, Pengpeng Hou

expr_str() currently reports all failure cases as NULL, so callers cannot
distinguish invalid recursion depth from allocation failure or later
string construction errors.

Return ERR_PTR()-encoded errors from expr_str() and make parse_unary()
and parse_expr() propagate them. Clear expr->name before destroying the
hist field so the error pointer is not freed as a string.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
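[Reader's note] For readers unfamiliar with the convention, here is a
user-space approximation of error-encoded pointers. It assumes, as the
kernel does, that the top 4095 values of the address space are never
valid pointers. The three helpers mirror the kernel names, while
render() is a hypothetical function used only for illustration.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical producer: distinct failures map to distinct errnos. */
static char *render(const char *name, unsigned int level)
{
	char *out;

	if (level > 1)
		return ERR_PTR(-EINVAL);	/* recursion too deep */

	out = malloc(strlen(name) + 1);
	if (!out)
		return ERR_PTR(-ENOMEM);	/* allocation failed */

	strcpy(out, name);
	return out;
}
```

An error-encoded value must never be passed to a free routine, which
is why the callers reset the stored name to NULL before tearing down
the field.
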
 kernel/trace/trace_events_hist.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index f778f060e922..082842e64ccd 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1762,11 +1762,11 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 	char *expr __free(kfree) = NULL;
 
 	if (level > 1)
-		return NULL;
+		return ERR_PTR(-EINVAL);
 
 	expr = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
 	if (!expr)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	if (!field->operands[0]) {
 		expr_field_str(field, expr);
@@ -1778,8 +1778,8 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 
 		strcat(expr, "-(");
 		subexpr = expr_str(field->operands[0], ++level);
-		if (!subexpr)
-			return NULL;
+		if (IS_ERR(subexpr))
+			return subexpr;
 
 		strcat(expr, subexpr);
 		strcat(expr, ")");
@@ -1805,7 +1805,7 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 		strcat(expr, "*");
 		break;
 	default:
-		return NULL;
+		return ERR_PTR(-EINVAL);
 	}
 
 	expr_field_str(field->operands[1], expr);
@@ -2624,6 +2624,11 @@ static struct hist_field *parse_unary(struct hist_trigger_data *hist_data,
 	expr->is_signed = operand1->is_signed;
 	expr->operator = FIELD_OP_UNARY_MINUS;
 	expr->name = expr_str(expr, 0);
+	if (IS_ERR(expr->name)) {
+		ret = PTR_ERR(expr->name);
+		expr->name = NULL;
+		goto free;
+	}
 	expr->type = kstrdup_const(operand1->type, GFP_KERNEL);
 	if (!expr->type) {
 		ret = -ENOMEM;
@@ -2836,6 +2841,11 @@ static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
 		destroy_hist_field(operand1, 0);
 
 		expr->name = expr_str(expr, 0);
+		if (IS_ERR(expr->name)) {
+			ret = PTR_ERR(expr->name);
+			expr->name = NULL;
+			goto free_expr;
+		}
 	} else {
 		/* The operand sizes should be the same, so just pick one */
 		expr->size = operand1->size;
@@ -2849,6 +2859,11 @@ static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
 		}
 
 		expr->name = expr_str(expr, 0);
+		if (IS_ERR(expr->name)) {
+			ret = PTR_ERR(expr->name);
+			expr->name = NULL;
+			goto free_expr;
+		}
 	}
 
 	return expr;
-- 
2.50.1 (Apple Git-155)


* [PATCH v5 0/3] tracing: bound histogram expression strings
From: Pengpeng Hou @ 2026-06-11  5:59 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
  Cc: linux-trace-kernel, linux-kernel, Pengpeng Hou

This v5 starts a new thread and splits the previous expr_str() update
into the preparatory lifetime cleanup, the ERR_PTR() conversion, and the
bounded seq_buf conversion.

The series is based on trace/for-next at 8970865b788e
("Merge trace/for-next").

Patch 1 converts the expr_str() output buffer to __free(kfree).
Patch 2 converts expr_str() failures from NULL to ERR_PTR().
Patch 3 replaces the raw strcat() expression construction with seq_buf
and returns -E2BIG when the rendered expression would exceed
MAX_FILTER_STR_VAL.

Changes since v4:
https://lore.kernel.org/all/20260521022817.38453-1-pengpeng@iscas.ac.cn/
- start a new thread for the new patch revision
- split the __free(kfree) conversion and ERR_PTR() conversion into
  separate patches as requested
- simplify the unary-minus seq_buf conversion with a single
  seq_buf_printf(&s, "-(%s)", subexpr) and keep subexpr manually freed
- keep the seq_buf include unchanged because trace/for-next already has
  <linux/seq_buf.h> in this file

Pengpeng Hou (3):
  tracing: Use __free() for expr_str() buffer
  tracing: Return ERR_PTR() from expr_str()
  tracing: Bound histogram expression strings with seq_buf

 kernel/trace/trace_events_hist.c | 92 ++++++++++++++++++++------------
 1 file changed, 57 insertions(+), 35 deletions(-)

-- 
2.50.1 (Apple Git-155)

* [PATCH v5 3/3] tracing: Bound histogram expression strings with seq_buf
From: Pengpeng Hou @ 2026-06-11  5:59 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
  Cc: linux-trace-kernel, linux-kernel, Pengpeng Hou

expr_str() allocates a fixed MAX_FILTER_STR_VAL buffer and then builds
expression names with a series of raw strcat() appends. Nested operands,
constants, field flags, and generated field names can push the rendered
string past that fixed limit before the name is attached to the hist
field.

Build expression strings with seq_buf and return -E2BIG when the
rendered name would exceed MAX_FILTER_STR_VAL. This keeps the existing
tracing-side limit while replacing the raw append logic with bounded
construction.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
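[Reader's note] The idea of bounded construction with a sticky overflow
state can be illustrated in user space as below. This is not the
kernel's seq_buf implementation: struct bounded_buf, bb_init() and
bb_printf() are hypothetical names that only mimic the behaviour the
commit message relies on.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct bounded_buf {
	char *data;
	size_t size;		/* capacity, including the trailing NUL */
	size_t len;		/* bytes written so far, excluding the NUL */
	bool overflowed;	/* sticky: set once, never cleared */
};

static void bb_init(struct bounded_buf *b, char *data, size_t size)
{
	b->data = data;
	b->size = size;
	b->len = 0;
	b->overflowed = false;
	if (size)
		data[0] = '\0';
}

/* Append formatted text; mark overflow instead of writing past the end. */
static void bb_printf(struct bounded_buf *b, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (b->overflowed)
		return;

	va_start(ap, fmt);
	n = vsnprintf(b->data + b->len, b->size - b->len, fmt, ap);
	va_end(ap);

	if (n < 0 || (size_t)n >= b->size - b->len) {
		b->overflowed = true;
		return;
	}
	b->len += (size_t)n;
}
```

Because the flag is sticky, a caller can issue several appends and
check for overflow once at the end, mapping it to a single error code
such as -E2BIG.
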
 kernel/trace/trace_events_hist.c | 57 ++++++++++++++++++--------------
 1 file changed, 33 insertions(+), 24 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 082842e64ccd..810c63c6b3df 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -94,7 +94,6 @@ typedef u64 (*hist_field_fn_t) (struct hist_field *field,
 #define HIST_FIELD_OPERANDS_MAX	2
 #define HIST_FIELDS_MAX		(TRACING_MAP_FIELDS_MAX + TRACING_MAP_VARS_MAX)
 #define HIST_ACTIONS_MAX	8
-#define HIST_CONST_DIGITS_MAX	21
 #define HIST_DIV_SHIFT		20  /* For optimizing division by constants */
 
 enum field_op_id {
@@ -1733,33 +1732,36 @@ static const char *get_hist_field_flags(struct hist_field *hist_field)
 	return flags_str;
 }
 
-static void expr_field_str(struct hist_field *field, char *expr)
+static bool expr_field_str(struct hist_field *field, struct seq_buf *s)
 {
+	const char *field_name;
+
 	if (field->flags & HIST_FIELD_FL_VAR_REF) {
 		if (!field->system)
-			strcat(expr, "$");
-	} else if (field->flags & HIST_FIELD_FL_CONST) {
-		char str[HIST_CONST_DIGITS_MAX];
+			seq_buf_putc(s, '$');
+	} else if (field->flags & HIST_FIELD_FL_CONST)
+		seq_buf_printf(s, "%llu", field->constant);
 
-		snprintf(str, HIST_CONST_DIGITS_MAX, "%llu", field->constant);
-		strcat(expr, str);
-	}
+	field_name = hist_field_name(field, 0);
+	if (!field_name)
+		return false;
 
-	strcat(expr, hist_field_name(field, 0));
+	seq_buf_puts(s, field_name);
 
 	if (field->flags && !(field->flags & HIST_FIELD_FL_VAR_REF)) {
 		const char *flags_str = get_hist_field_flags(field);
 
-		if (flags_str) {
-			strcat(expr, ".");
-			strcat(expr, flags_str);
-		}
+		if (flags_str)
+			seq_buf_printf(s, ".%s", flags_str);
 	}
+
+	return !seq_buf_has_overflowed(s);
 }
 
 static char *expr_str(struct hist_field *field, unsigned int level)
 {
 	char *expr __free(kfree) = NULL;
+	struct seq_buf s;
 
 	if (level > 1)
 		return ERR_PTR(-EINVAL);
@@ -1768,47 +1770,54 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 	if (!expr)
 		return ERR_PTR(-ENOMEM);
 
+	seq_buf_init(&s, expr, MAX_FILTER_STR_VAL);
+
 	if (!field->operands[0]) {
-		expr_field_str(field, expr);
+		if (!expr_field_str(field, &s))
+			return ERR_PTR(-E2BIG);
+
 		return_ptr(expr);
 	}
 
 	if (field->operator == FIELD_OP_UNARY_MINUS) {
 		char *subexpr;
 
-		strcat(expr, "-(");
 		subexpr = expr_str(field->operands[0], ++level);
 		if (IS_ERR(subexpr))
 			return subexpr;
 
-		strcat(expr, subexpr);
-		strcat(expr, ")");
-
+		seq_buf_printf(&s, "-(%s)", subexpr);
 		kfree(subexpr);
 
+		if (seq_buf_has_overflowed(&s))
+			return ERR_PTR(-E2BIG);
+
 		return_ptr(expr);
 	}
 
-	expr_field_str(field->operands[0], expr);
+	if (!expr_field_str(field->operands[0], &s))
+		return ERR_PTR(-E2BIG);
 
 	switch (field->operator) {
 	case FIELD_OP_MINUS:
-		strcat(expr, "-");
+		seq_buf_putc(&s, '-');
 		break;
 	case FIELD_OP_PLUS:
-		strcat(expr, "+");
+		seq_buf_putc(&s, '+');
 		break;
 	case FIELD_OP_DIV:
-		strcat(expr, "/");
+		seq_buf_putc(&s, '/');
 		break;
 	case FIELD_OP_MULT:
-		strcat(expr, "*");
+		seq_buf_putc(&s, '*');
 		break;
 	default:
 		return ERR_PTR(-EINVAL);
 	}
 
-	expr_field_str(field->operands[1], expr);
+	if (seq_buf_has_overflowed(&s) ||
+	    !expr_field_str(field->operands[1], &s))
+		return ERR_PTR(-E2BIG);
 
 	return_ptr(expr);
 }
-- 
2.50.1 (Apple Git-155)


* [PATCH v2 3/4] mm/fs: split the file's i_mmap tree
From: Huang Shijie @ 2026-06-11  6:18 UTC (permalink / raw)
  To: akpm, viro, brauner, jack, muchun.song, osalvador, david
  Cc: surenb, mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko,
	corbet, skhan, linux, dinguyen, schuster.simon, James.Bottomley,
	deller, djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei, Huang Shijie
In-Reply-To: <20260611061915.2354307-1-huangsj@hygon.cn>

UnixBench includes an "execl" test that exercises the execve system
call.

For example, one Hygon server has 12 NUMA nodes and 384 CPUs. When we
run "./Run -c 384 execl" on it, the result is poor: the i_mmap locks
for "libc.so" and "ld.so" are heavily contended. The i_mmap tree for
"libc.so" can hold over 6000 VMAs, and those VMAs can be on different
NUMA nodes. The insert/remove operations do not run quickly enough.

To reduce contention on the i_mmap lock, this patch does the
following:
   1.) Split the single i_mmap tree into several sibling trees, each
       with its own lock. CONFIG_SPLIT_I_MMAP turns this feature
       on or off.
   2.) Introduce a new field "tree_idx" in vm_area_struct to save the
       sibling tree index for this VMA.
   3.) Introduce a new field "vma_count" in address_space.
       The new mapping_mapped() uses it.
   4.) Rewrite vma_interval_tree_foreach().
   5.) Rewrite the lock functions.

After this patch, VMA insert/remove operations run faster, and the
test above improves by more than 400%.

Signed-off-by: Huang Shijie <huangsj@hygon.cn>
---
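[Reader's note] Whole-mapping operations in this design must take
every sibling lock, and the trylock variants must roll back on failure.
That "all or nothing" pattern is sketched below in user-space C11.
This is a simplified illustration, not the patch's code: it uses an
atomic_flag test-and-set lock in place of an rwsem, and NR_TREES,
struct tree_lock and the helper names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_TREES 4	/* hypothetical; the patch sizes this at boot */

struct tree_lock {
	atomic_flag held;
};

static bool tree_trylock(struct tree_lock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

static void tree_unlock(struct tree_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Take every lock in index order, or none at all. */
static bool trylock_all(struct tree_lock *locks, int n)
{
	for (int i = 0; i < n; i++) {
		if (!tree_trylock(&locks[i])) {
			/* Roll back the locks already taken. */
			while (i--)
				tree_unlock(&locks[i]);
			return false;
		}
	}
	return true;
}

static void unlock_all(struct tree_lock *locks, int n)
{
	for (int i = 0; i < n; i++)
		tree_unlock(&locks[i]);
}
```

Taking the locks in a fixed index order means two whole-mapping lockers
cannot deadlock against each other, while per-VMA paths touch only the
single lock selected by the VMA's tree index.
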
 fs/Kconfig               |   8 ++
 fs/hugetlbfs/inode.c     |  20 ++++-
 fs/inode.c               |  75 ++++++++++++++++-
 include/linux/fs.h       | 174 ++++++++++++++++++++++++++++++++++++++-
 include/linux/mm.h       |  80 ++++++++++++++++++
 include/linux/mm_types.h |   3 +
 mm/internal.h            |   3 +-
 mm/mmap.c                |  11 ++-
 mm/nommu.c               |  23 ++++--
 mm/pagewalk.c            |   2 +-
 mm/vma.c                 |  72 +++++++++++-----
 mm/vma_init.c            |   3 +
 12 files changed, 436 insertions(+), 38 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 43cb06de297f..e24804f70432 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -9,6 +9,14 @@ menu "File systems"
 config DCACHE_WORD_ACCESS
        bool
 
+config SPLIT_I_MMAP
+	bool "Split the file's i_mmap into several trees"
+	default n
+	help
+	  Split the file's i_mmap into several trees, each with its own
+	  lock. This reduces lock contention on the file's i_mmap tree,
+	  but costs more memory per inode.
+
 config VALIDATE_FS_PARSER
 	bool "Validate filesystem parameter description"
 	help
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index da5b41ea5bdd..68d8308418dd 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -891,6 +891,23 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
  */
 static struct lock_class_key hugetlbfs_i_mmap_rwsem_key;
 
+#ifdef CONFIG_SPLIT_I_MMAP
+static void hugetlbfs_lockdep_set_class(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++) {
+		lockdep_set_class(&mapping->i_mmap[i].rwsem,
+				&hugetlbfs_i_mmap_rwsem_key);
+	}
+}
+#else
+static void hugetlbfs_lockdep_set_class(struct address_space *mapping)
+{
+	lockdep_set_class(&mapping->i_mmap_rwsem, &hugetlbfs_i_mmap_rwsem_key);
+}
+#endif
+
 static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 					struct mnt_idmap *idmap,
 					struct inode *dir,
@@ -915,8 +932,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 
 		inode->i_ino = get_next_ino();
 		inode_init_owner(idmap, inode, dir, mode);
-		lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
-				&hugetlbfs_i_mmap_rwsem_key);
+		hugetlbfs_lockdep_set_class(inode->i_mapping);
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		simple_inode_init_ts(inode);
 		info->resv_map = resv_map;
diff --git a/fs/inode.c b/fs/inode.c
index 62c579a0cf7d..cb67ae83f5b3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -214,6 +214,70 @@ static int no_open(struct inode *inode, struct file *file)
 	return -ENXIO;
 }
 
+#ifdef CONFIG_SPLIT_I_MMAP
+int split_tree_num;
+static int split_tree_align __maybe_unused = 32;
+
+static void __init init_split_tree_num(void)
+{
+#ifdef CONFIG_NUMA
+	split_tree_num = nr_node_ids;
+#else
+	split_tree_num = ALIGN(nr_cpu_ids, split_tree_align);
+#endif
+}
+
+static void free_mapping_i_mmap(struct address_space *mapping)
+{
+	int i;
+
+	if (!mapping->i_mmap)
+		return;
+
+	for (i = 0; i < split_tree_num; i++)
+		kfree(mapping->i_mmap[i]);
+
+	kfree(mapping->i_mmap);
+	mapping->i_mmap = NULL;
+}
+
+static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp)
+{
+	struct i_mmap_tree *tree;
+	int i;
+
+	/* The extra one is used as terminator in vma_interval_tree_foreach() */
+	mapping->i_mmap = kzalloc(sizeof(tree) * (split_tree_num + 1), gfp);
+	if (!mapping->i_mmap)
+		return -ENOMEM;
+
+	for (i = 0; i < split_tree_num; i++) {
+		tree = kzalloc_node(sizeof(*tree), gfp, i);
+		if (!tree)
+			goto nomem;
+
+		tree->root = RB_ROOT_CACHED;
+		init_rwsem(&tree->rwsem);
+
+		mapping->i_mmap[i] = tree;
+	}
+	return 0;
+nomem:
+	free_mapping_i_mmap(mapping);
+	return -ENOMEM;
+}
+#else
+static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp)
+{
+	mapping->i_mmap = RB_ROOT_CACHED;
+	init_rwsem(&mapping->i_mmap_rwsem);
+	return 0;
+}
+
+static void free_mapping_i_mmap(struct address_space *mapping) { }
+static void __init init_split_tree_num(void) {}
+#endif
+
 /**
  * inode_init_always_gfp - perform inode structure initialisation
  * @sb: superblock inode belongs to
@@ -302,9 +366,14 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
 #endif
 	inode->i_flctx = NULL;
 
-	if (unlikely(security_inode_alloc(inode, gfp)))
+	if (init_mapping_i_mmap(mapping, gfp))
 		return -ENOMEM;
 
+	if (unlikely(security_inode_alloc(inode, gfp))) {
+		free_mapping_i_mmap(mapping);
+		return -ENOMEM;
+	}
+
 	this_cpu_inc(nr_inodes);
 
 	return 0;
@@ -380,6 +449,7 @@ void __destroy_inode(struct inode *inode)
 	if (inode->i_default_acl && !is_uncached_acl(inode->i_default_acl))
 		posix_acl_release(inode->i_default_acl);
 #endif
+	free_mapping_i_mmap(&inode->i_data);
 	this_cpu_dec(nr_inodes);
 }
 EXPORT_SYMBOL(__destroy_inode);
@@ -480,9 +550,7 @@ EXPORT_SYMBOL(inc_nlink);
 static void __address_space_init_once(struct address_space *mapping)
 {
 	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
-	init_rwsem(&mapping->i_mmap_rwsem);
 	spin_lock_init(&mapping->i_private_lock);
-	mapping->i_mmap = RB_ROOT_CACHED;
 }
 
 void address_space_init_once(struct address_space *mapping)
@@ -2619,6 +2687,7 @@ void __init inode_init(void)
 					&i_hash_mask,
 					0,
 					0);
+	init_split_tree_num();
 }
 
 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cd46615b8f53..f4b3645b61df 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -450,6 +450,25 @@ struct mapping_metadata_bhs {
 	struct list_head list;	/* The list of bhs (b_assoc_buffers) */
 };
 
+#ifdef CONFIG_SPLIT_I_MMAP
+/*
+ * struct i_mmap_tree - A single sibling tree of the file's split i_mmap.
+ * @root: The red/black interval tree root.
+ * @rwsem: Protects insert/remove operations on this sibling tree.
+ * @vma_count: Number of VMAs in this sibling tree.
+ *
+ * When CONFIG_SPLIT_I_MMAP is enabled, the file's single i_mmap tree is
+ * split into split_tree_num sibling trees, each with its own lock. This
+ * reduces lock contention by allowing concurrent VMA insert/remove
+ * operations on different sibling trees.
+ */
+struct i_mmap_tree {
+	struct rb_root_cached	root;
+	struct rw_semaphore	rwsem;
+	atomic_t		vma_count;
+};
+#endif
+
 /**
  * struct address_space - Contents of a cacheable, mappable object.
  * @host: Owner, either the inode or the block_device.
@@ -461,8 +480,13 @@ struct mapping_metadata_bhs {
  * @gfp_mask: Memory allocation flags to use for allocating pages.
  * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
  * @nr_thps: Number of THPs in the pagecache (non-shmem only).
- * @i_mmap: Tree of private and shared mappings.
- * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
+ * @i_mmap: Tree of private and shared mappings. When CONFIG_SPLIT_I_MMAP
+ *   is enabled, this is an array of split_tree_num struct i_mmap_tree
+ *   pointers (plus a NULL terminator).
+ * @vma_count: Total number of VMAs across all sibling trees (only when
+ *   CONFIG_SPLIT_I_MMAP is enabled). Used by mapping_mapped().
+ * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable (only when
+ *   CONFIG_SPLIT_I_MMAP is disabled; otherwise per-tree rwsem is used).
  * @nrpages: Number of page entries, protected by the i_pages lock.
  * @writeback_index: Writeback starts here.
  * @a_ops: Methods.
@@ -480,14 +504,19 @@ struct address_space {
 	/* number of thp, only for non-shmem files */
 	atomic_t		nr_thps;
 #endif
+#ifdef CONFIG_SPLIT_I_MMAP
+	struct i_mmap_tree	**i_mmap;
+	atomic_t		vma_count;
+#else
 	struct rb_root_cached	i_mmap;
+	struct rw_semaphore	i_mmap_rwsem;
+#endif
 	unsigned long		nrpages;
 	pgoff_t			writeback_index;
 	const struct address_space_operations *a_ops;
 	unsigned long		flags;
 	errseq_t		wb_err;
 	spinlock_t		i_private_lock;
-	struct rw_semaphore	i_mmap_rwsem;
 } __attribute__((aligned(sizeof(long)))) __randomize_layout;
 	/*
 	 * On most architectures that alignment is already the case; but
@@ -508,6 +537,133 @@ static inline bool mapping_tagged(const struct address_space *mapping, xa_mark_t
 	return xa_marked(&mapping->i_pages, tag);
 }
 
+#ifdef CONFIG_SPLIT_I_MMAP
+static inline int mapping_mapped(const struct address_space *mapping)
+{
+	return	atomic_read(&mapping->vma_count);
+}
+
+static inline void inc_mapping_vma(struct address_space *mapping,
+				struct vm_area_struct *vma)
+{
+	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
+
+	atomic_inc(&tree->vma_count);
+	atomic_inc(&mapping->vma_count);
+}
+
+static inline void dec_mapping_vma(struct address_space *mapping,
+				struct vm_area_struct *vma)
+{
+	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
+
+	atomic_dec(&tree->vma_count);
+	atomic_dec(&mapping->vma_count);
+}
+
+static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mapping)
+{
+	return (struct rb_root_cached *)mapping->i_mmap;
+}
+
+static inline void i_mmap_tree_lock_write(struct address_space *mapping,
+					struct vm_area_struct *vma)
+{
+	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
+
+	down_write(&tree->rwsem);
+}
+
+static inline void i_mmap_tree_unlock_write(struct address_space *mapping,
+					struct vm_area_struct *vma)
+{
+	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
+
+	up_write(&tree->rwsem);
+}
+
+#define i_mmap_lock_write_prepare(mapping)
+#define i_mmap_unlock_write_complete(mapping)
+
+extern int split_tree_num;
+static inline void i_mmap_lock_write(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++)
+		down_write(&mapping->i_mmap[i]->rwsem);
+}
+
+static inline int i_mmap_trylock_write(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++) {
+		if (!down_write_trylock(&mapping->i_mmap[i]->rwsem)) {
+			while (i--)
+				up_write(&mapping->i_mmap[i]->rwsem);
+			return 0;
+		}
+	}
+	return 1;
+}
+
+static inline void i_mmap_unlock_write(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++)
+		up_write(&mapping->i_mmap[i]->rwsem);
+}
+
+static inline int i_mmap_trylock_read(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++) {
+		if (!down_read_trylock(&mapping->i_mmap[i]->rwsem)) {
+			while (i--)
+				up_read(&mapping->i_mmap[i]->rwsem);
+			return 0;
+		}
+	}
+	return 1;
+}
+
+static inline void i_mmap_lock_read(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++)
+		down_read(&mapping->i_mmap[i]->rwsem);
+}
+
+static inline void i_mmap_unlock_read(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++)
+		up_read(&mapping->i_mmap[i]->rwsem);
+}
+
+static inline void i_mmap_assert_locked(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++)
+		lockdep_assert_held(&mapping->i_mmap[i]->rwsem);
+}
+
+static inline void i_mmap_assert_write_locked(struct address_space *mapping)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++)
+		lockdep_assert_held_write(&mapping->i_mmap[i]->rwsem);
+}
+
+#else
+
 static inline void i_mmap_lock_write(struct address_space *mapping)
 {
 	down_write(&mapping->i_mmap_rwsem);
@@ -561,6 +717,18 @@ static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mappi
 	return &mapping->i_mmap;
 }
 
+static inline void inc_mapping_vma(struct address_space *mapping,
+				struct vm_area_struct *vma) { }
+static inline void dec_mapping_vma(struct address_space *mapping,
+				struct vm_area_struct *vma) { }
+
+#define i_mmap_lock_write_prepare(mapping)	i_mmap_lock_write(mapping)
+#define i_mmap_unlock_write_complete(mapping)	i_mmap_unlock_write(mapping)
+#define i_mmap_tree_lock_write(mapping, vma)
+#define i_mmap_tree_unlock_write(mapping, vma)
+
+#endif
+
 /*
  * Might pages of this file have been modified in userspace?
  * Note that i_mmap_writable counts all VM_SHARED, VM_MAYWRITE vmas: do_mmap
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0a45c6a8b9f2..9aa8119fa9bf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4041,11 +4041,91 @@ struct vm_area_struct *vma_interval_tree_iter_first(struct rb_root_cached *root,
 struct vm_area_struct *vma_interval_tree_iter_next(struct vm_area_struct *node,
 				unsigned long start, unsigned long last);
 
+#ifdef CONFIG_SPLIT_I_MMAP
+extern int split_tree_num;
+
+static inline int smallest_tree_idx(struct file *file)
+{
+	struct address_space *mapping = file->f_mapping;
+	int tmp = INT_MAX, count;
+	int i, j = 0;
+
+	/*
+	 * An approximate count is good enough here,
+	 * so no lock is needed.
+	 */
+	for (i = 0; i < split_tree_num; i++) {
+		count = atomic_read(&mapping->i_mmap[i]->vma_count);
+		if (count < tmp) {
+			j = i;
+			tmp = count;
+			if (!tmp)
+				break;
+		}
+	}
+	return j;
+}
+
+static inline void vma_set_tree_idx(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_NUMA
+	vma->tree_idx = numa_node_id();
+#else
+	vma->tree_idx = smallest_tree_idx(vma->vm_file);
+#endif
+}
+
+static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma,
+					struct address_space *mapping)
+{
+	return &mapping->i_mmap[vma->tree_idx]->root;
+}
+
+/* Find the first valid VMA in the sibling trees */
+static inline struct vm_area_struct *first_vma(struct i_mmap_tree ***__r,
+				unsigned long start, unsigned long last)
+{
+	struct vm_area_struct *vma = NULL;
+	struct i_mmap_tree **tree = *__r;
+	struct rb_root_cached *root;
+
+	while (*tree) {
+		root = &(*tree)->root;
+		tree++;
+		vma = vma_interval_tree_iter_first(root, start, last);
+		if (vma)
+			break;
+	}
+
+	/* Save the position for the next call */
+	*__r = tree;
+	return vma;
+}
+
+/*
+ * Please use get_i_mmap_root() to get the @root.
+ * @_tmp is referenced to avoid unused variable warning.
+ */
+#define vma_interval_tree_foreach(vma, root, start, last)		\
+	for (struct i_mmap_tree **_r = (struct i_mmap_tree **)(root),	\
+		**_tmp = (vma = first_vma(&_r, start, last)) ? _r : NULL;\
+	     ((_tmp && vma) || (vma = first_vma(&_r, start, last)));	\
+		vma = vma_interval_tree_iter_next(vma, start, last))
+#else
 /* Please use get_i_mmap_root() to get the @root */
 #define vma_interval_tree_foreach(vma, root, start, last)		\
 	for (vma = vma_interval_tree_iter_first(root, start, last);	\
 	     vma; vma = vma_interval_tree_iter_next(vma, start, last))
 
+static inline void vma_set_tree_idx(struct vm_area_struct *vma) { }
+
+static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma,
+					struct address_space *mapping)
+{
+	return &mapping->i_mmap;
+}
+#endif
+
 void anon_vma_interval_tree_insert(struct anon_vma_chain *node,
 				   struct rb_root_cached *root);
 void anon_vma_interval_tree_remove(struct anon_vma_chain *node,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a308e2c23b82..8d6aab3346ce 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1072,6 +1072,9 @@ struct vm_area_struct {
 #ifdef __HAVE_PFNMAP_TRACKING
 	struct pfnmap_track_ctx *pfnmap_track_ctx;
 #endif
+#ifdef CONFIG_SPLIT_I_MMAP
+	int tree_idx;			/* The sibling tree index for the VMA */
+#endif
 } __randomize_layout;
 
 /* Clears all bits in the VMA flags bitmap, non-atomically. */
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..2d35cacffd19 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1888,7 +1888,8 @@ static inline void maybe_rmap_unlock_action(struct vm_area_struct *vma,
 
 	VM_WARN_ON_ONCE(vma_is_anonymous(vma));
 	file = vma->vm_file;
-	i_mmap_unlock_write(file->f_mapping);
+	i_mmap_tree_unlock_write(file->f_mapping, vma);
+	i_mmap_unlock_write_complete(file->f_mapping);
 	action->hide_from_rmap_until_complete = false;
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index d714fdb357e5..70036ec9dcaa 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1825,15 +1825,20 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 			struct address_space *mapping = file->f_mapping;
 
 			get_file(file);
-			i_mmap_lock_write(mapping);
+			i_mmap_lock_write_prepare(mapping);
+			i_mmap_tree_lock_write(mapping, mpnt);
+
 			if (vma_is_shared_maywrite(tmp))
 				mapping_allow_writable(mapping);
 			flush_dcache_mmap_lock(mapping);
 			/* insert tmp into the share list, just after mpnt */
 			vma_interval_tree_insert_after(tmp, mpnt,
-					get_i_mmap_root(mapping));
+					get_rb_root(mpnt, mapping));
+			inc_mapping_vma(mapping, tmp);
 			flush_dcache_mmap_unlock(mapping);
-			i_mmap_unlock_write(mapping);
+
+			i_mmap_tree_unlock_write(mapping, mpnt);
+			i_mmap_unlock_write_complete(mapping);
 		}
 
 		if (!(tmp->vm_flags & VM_WIPEONFORK))
diff --git a/mm/nommu.c b/mm/nommu.c
index 0f18ffc658e9..1f2c60a220f6 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -567,11 +567,16 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
 	if (vma->vm_file) {
 		struct address_space *mapping = vma->vm_file->f_mapping;
 
-		i_mmap_lock_write(mapping);
+		i_mmap_lock_write_prepare(mapping);
+		i_mmap_tree_lock_write(mapping, vma);
+
 		flush_dcache_mmap_lock(mapping);
-		vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
+		vma_interval_tree_insert(vma, get_rb_root(vma, mapping));
+		inc_mapping_vma(mapping, vma);
 		flush_dcache_mmap_unlock(mapping);
-		i_mmap_unlock_write(mapping);
+
+		i_mmap_tree_unlock_write(mapping, vma);
+		i_mmap_unlock_write_complete(mapping);
 	}
 }
 
@@ -583,11 +588,16 @@ static void cleanup_vma_from_mm(struct vm_area_struct *vma)
 		struct address_space *mapping;
 		mapping = vma->vm_file->f_mapping;
 
-		i_mmap_lock_write(mapping);
+		i_mmap_lock_write_prepare(mapping);
+		i_mmap_tree_lock_write(mapping, vma);
+
 		flush_dcache_mmap_lock(mapping);
-		vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
+		vma_interval_tree_remove(vma, get_rb_root(vma, mapping));
+		dec_mapping_vma(mapping, vma);
 		flush_dcache_mmap_unlock(mapping);
-		i_mmap_unlock_write(mapping);
+
+		i_mmap_tree_unlock_write(mapping, vma);
+		i_mmap_unlock_write_complete(mapping);
 	}
 }
 
@@ -1063,6 +1073,7 @@ unsigned long do_mmap(struct file *file,
 	if (file) {
 		region->vm_file = get_file(file);
 		vma->vm_file = get_file(file);
+		vma_set_tree_idx(vma);
 	}
 
 	down_write(&nommu_region_sem);
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 8df1b5077951..d5745519d95a 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -809,7 +809,7 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
 	if (!check_ops_safe(ops))
 		return -EINVAL;
 
-	lockdep_assert_held(&mapping->i_mmap_rwsem);
+	i_mmap_assert_locked(mapping);
 	vma_interval_tree_foreach(vma, get_i_mmap_root(mapping), first_index,
 				  first_index + nr - 1) {
 		/* Clip to the vma */
diff --git a/mm/vma.c b/mm/vma.c
index 6159650c1b42..2055758064a9 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -234,22 +234,23 @@ static void __vma_link_file(struct vm_area_struct *vma,
 		mapping_allow_writable(mapping);
 
 	flush_dcache_mmap_lock(mapping);
-	vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
+	vma_interval_tree_insert(vma, get_rb_root(vma, mapping));
+	inc_mapping_vma(mapping, vma);
 	flush_dcache_mmap_unlock(mapping);
 }
 
-/*
- * Requires inode->i_mapping->i_mmap_rwsem
- */
 static void __remove_shared_vm_struct(struct vm_area_struct *vma,
 				      struct address_space *mapping)
 {
+	i_mmap_tree_lock_write(mapping, vma);
 	if (vma_is_shared_maywrite(vma))
 		mapping_unmap_writable(mapping);
 
 	flush_dcache_mmap_lock(mapping);
-	vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
+	vma_interval_tree_remove(vma, get_rb_root(vma, mapping));
+	dec_mapping_vma(mapping, vma);
 	flush_dcache_mmap_unlock(mapping);
+	i_mmap_tree_unlock_write(mapping, vma);
 }
 
 /*
@@ -297,8 +298,9 @@ static void vma_prepare(struct vma_prepare *vp)
 			uprobe_munmap(vp->adj_next, vp->adj_next->vm_start,
 				      vp->adj_next->vm_end);
 
-		i_mmap_lock_write(vp->mapping);
+		i_mmap_lock_write_prepare(vp->mapping);
 		if (vp->insert && vp->insert->vm_file) {
+			i_mmap_tree_lock_write(vp->mapping, vp->insert);
 			/*
 			 * Put into interval tree now, so instantiated pages
 			 * are visible to arm/parisc __flush_dcache_page
@@ -307,6 +309,7 @@ static void vma_prepare(struct vma_prepare *vp)
 			 */
 			__vma_link_file(vp->insert,
 					vp->insert->vm_file->f_mapping);
+			i_mmap_tree_unlock_write(vp->mapping, vp->insert);
 		}
 	}
 
@@ -318,12 +321,17 @@ static void vma_prepare(struct vma_prepare *vp)
 	}
 
 	if (vp->file) {
+		i_mmap_tree_lock_write(vp->mapping, vp->vma);
 		flush_dcache_mmap_lock(vp->mapping);
 		vma_interval_tree_remove(vp->vma,
-					get_i_mmap_root(vp->mapping));
-		if (vp->adj_next)
+					get_rb_root(vp->vma, vp->mapping));
+		dec_mapping_vma(vp->mapping, vp->vma);
+		if (vp->adj_next) {
+			i_mmap_tree_lock_write(vp->mapping, vp->adj_next);
 			vma_interval_tree_remove(vp->adj_next,
-					get_i_mmap_root(vp->mapping));
+					get_rb_root(vp->adj_next, vp->mapping));
+			dec_mapping_vma(vp->mapping, vp->adj_next);
+		}
 	}
 
 }
@@ -340,12 +348,17 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 			 struct mm_struct *mm)
 {
 	if (vp->file) {
-		if (vp->adj_next)
+		if (vp->adj_next) {
 			vma_interval_tree_insert(vp->adj_next,
-					get_i_mmap_root(vp->mapping));
+					get_rb_root(vp->adj_next, vp->mapping));
+			inc_mapping_vma(vp->mapping, vp->adj_next);
+			i_mmap_tree_unlock_write(vp->mapping, vp->adj_next);
+		}
 		vma_interval_tree_insert(vp->vma,
-					get_i_mmap_root(vp->mapping));
+					get_rb_root(vp->vma, vp->mapping));
+		inc_mapping_vma(vp->mapping, vp->vma);
 		flush_dcache_mmap_unlock(vp->mapping);
+		i_mmap_tree_unlock_write(vp->mapping, vp->vma);
 	}
 
 	if (vp->remove && vp->file) {
@@ -370,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 	}
 
 	if (vp->file) {
-		i_mmap_unlock_write(vp->mapping);
+		i_mmap_unlock_write_complete(vp->mapping);
 
 		if (!vp->skip_vma_uprobe) {
 			uprobe_mmap(vp->vma);
@@ -1799,12 +1812,12 @@ static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
 	int i;
 
 	mapping = vb->vmas[0]->vm_file->f_mapping;
-	i_mmap_lock_write(mapping);
+	i_mmap_lock_write_prepare(mapping);
 	for (i = 0; i < vb->count; i++) {
 		VM_WARN_ON_ONCE(vb->vmas[i]->vm_file->f_mapping != mapping);
 		__remove_shared_vm_struct(vb->vmas[i], mapping);
 	}
-	i_mmap_unlock_write(mapping);
+	i_mmap_unlock_write_complete(mapping);
 
 	unlink_file_vma_batch_init(vb);
 }
@@ -1836,10 +1849,13 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock)
 
 	if (file) {
 		mapping = file->f_mapping;
-		i_mmap_lock_write(mapping);
+		i_mmap_lock_write_prepare(mapping);
+		i_mmap_tree_lock_write(mapping, vma);
 		__vma_link_file(vma, mapping);
-		if (!hold_rmap_lock)
-			i_mmap_unlock_write(mapping);
+		if (!hold_rmap_lock) {
+			i_mmap_tree_unlock_write(mapping, vma);
+			i_mmap_unlock_write_complete(mapping);
+		}
 	}
 }
 
@@ -2164,6 +2180,23 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
 	}
 }
 
+#ifdef CONFIG_SPLIT_I_MMAP
+static inline void i_mmap_nest_lock(struct address_space *mapping,
+				struct rw_semaphore *lock)
+{
+	int i;
+
+	for (i = 0; i < split_tree_num; i++)
+		down_write_nest_lock(&mapping->i_mmap[i]->rwsem, lock);
+}
+#else
+static inline void i_mmap_nest_lock(struct address_space *mapping,
+				struct rw_semaphore *lock)
+{
+	down_write_nest_lock(&mapping->i_mmap_rwsem, lock);
+}
+#endif
+
 static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
 {
 	if (!test_bit(AS_MM_ALL_LOCKS, &mapping->flags)) {
@@ -2178,7 +2211,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
 		 */
 		if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
 			BUG();
-		down_write_nest_lock(&mapping->i_mmap_rwsem, &mm->mmap_lock);
+		i_mmap_nest_lock(mapping, &mm->mmap_lock);
 	}
 }
 
@@ -2489,6 +2522,7 @@ static int __mmap_new_file_vma(struct mmap_state *map,
 	int error;
 
 	vma->vm_file = map->file;
+	vma_set_tree_idx(vma);
 	if (!map->file_doesnt_need_get)
 		get_file(map->file);
 
diff --git a/mm/vma_init.c b/mm/vma_init.c
index 3c0b65950510..c115e33d4812 100644
--- a/mm/vma_init.c
+++ b/mm/vma_init.c
@@ -72,6 +72,9 @@ static void vm_area_init_from(const struct vm_area_struct *src,
 #ifdef CONFIG_NUMA
 	dest->vm_policy = src->vm_policy;
 #endif
+#ifdef CONFIG_SPLIT_I_MMAP
+	dest->tree_idx = src->tree_idx;
+#endif
 #ifdef __HAVE_PFNMAP_TRACKING
 	dest->pfnmap_track_ctx = NULL;
 #endif
-- 
2.53.0



^ permalink raw reply related
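
For readers following the tree selection in the patch above, here is a minimal
userspace sketch of the scan that smallest_tree_idx() performs. It is only a
model: smallest_idx() and its plain int array are hypothetical stand-ins for
the per-tree atomic vma_count counters, and nothing here is kernel code.

```c
#include <assert.h>
#include <limits.h>

/*
 * Return the index of the smallest count. Ties go to the lowest index,
 * and the scan stops early once an empty tree (count 0) is found.
 * Hypothetical model of the selection loop, not the kernel function.
 */
static int smallest_idx(const int *counts, int n)
{
	int best = INT_MAX;
	int idx = 0;

	for (int i = 0; i < n; i++) {
		if (counts[i] < best) {
			idx = i;
			best = counts[i];
			if (!best)
				break;
		}
	}
	return idx;
}
```

Because the kernel version reads the counters without a lock, the chosen tree
may not be the true minimum at insert time; the patch treats that as
acceptable since the index only balances load across sibling trees.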

* [PATCH v2 4/4] docs/mm: update document for split i_mmap tree
From: Huang Shijie @ 2026-06-11  6:19 UTC (permalink / raw)
  To: akpm, viro, brauner, jack, muchun.song, osalvador, david
  Cc: surenb, mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko,
	corbet, skhan, linux, dinguyen, schuster.simon, James.Bottomley,
	deller, djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei, Huang Shijie
In-Reply-To: <20260611061915.2354307-1-huangsj@hygon.cn>

Document the i_mmap locking changes introduced by the following patches:
- Use mapping_mapped() to simplify the code
- Use get_i_mmap_root() to access the file's i_mmap
- Split the file's i_mmap tree (CONFIG_SPLIT_I_MMAP)

Add documentation for:
- CONFIG_SPLIT_I_MMAP split i_mmap tree architecture with per-tree locks
- New per-tree lock helpers: i_mmap_tree_lock_write/unlock_write
- New vm_area_struct.tree_idx field for sibling tree selection
- Updated i_mmap_lock_read/write semantics acquiring all per-tree locks
- Updated lock ordering notes for split tree configuration
- Updated page table freeing section for split tree scenario
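
As a rough illustration of the two lock granularities listed above, the
following userspace sketch models one lock per sibling tree. Everything in it
is an assumption made for illustration: mapping_model, NR_TREES and the helper
names are hypothetical, a C11 atomic_flag stands in for the kernel
rw_semaphore, and only the write side is modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_TREES 4	/* assumed stand-in for split_tree_num */

/* One lock per sibling tree (hypothetical model, write side only). */
struct mapping_model {
	atomic_flag lock[NR_TREES];
};

static void mapping_model_init(struct mapping_model *m)
{
	for (int i = 0; i < NR_TREES; i++)
		atomic_flag_clear(&m->lock[i]);
}

static bool tree_trylock_write(struct mapping_model *m, int idx)
{
	return !atomic_flag_test_and_set(&m->lock[idx]);
}

/* Models i_mmap_tree_lock_write(): lock only the tree the VMA lives in. */
static void tree_lock_write(struct mapping_model *m, int idx)
{
	while (!tree_trylock_write(m, idx))
		;	/* spin here; the kernel helper sleeps in down_write() */
}

static void tree_unlock_write(struct mapping_model *m, int idx)
{
	atomic_flag_clear(&m->lock[idx]);
}

/* Models i_mmap_trylock_write(): take every lock or none. */
static bool all_trylock_write(struct mapping_model *m)
{
	for (int i = 0; i < NR_TREES; i++) {
		if (!tree_trylock_write(m, i)) {
			/* Roll back the locks already taken. */
			while (i--)
				tree_unlock_write(m, i);
			return false;
		}
	}
	return true;
}

/* Models i_mmap_unlock_write(): release every per-tree lock. */
static void all_unlock_write(struct mapping_model *m)
{
	for (int i = 0; i < NR_TREES; i++)
		tree_unlock_write(m, i);
}
```

In this model a holder of a single tree lock makes all_trylock_write() fail
and roll back, which mirrors why whole-mapping operations serialise against
every per-tree insert or remove.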

Signed-off-by: Huang Shijie <huangsj@hygon.cn>
---
 Documentation/mm/process_addrs.rst | 63 +++++++++++++++++++++++-------
 1 file changed, 49 insertions(+), 14 deletions(-)

diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
index 851680ead45f..4aed3100b249 100644
--- a/Documentation/mm/process_addrs.rst
+++ b/Documentation/mm/process_addrs.rst
@@ -60,6 +60,15 @@ Terminology
   :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
   locks as the reverse mapping locks, or 'rmap locks' for brevity.
 
+  When :c:macro:`!CONFIG_SPLIT_I_MMAP` is enabled, the file-backed i_mmap tree
+  is split into multiple sibling trees (one per NUMA node or a number based on
+  CPU count), each with its own :c:type:`!struct i_mmap_tree` containing a
+  red/black interval tree and a :c:type:`!struct rw_semaphore`. In this
+  configuration, :c:func:`!i_mmap_lock_read` and :c:func:`!i_mmap_lock_write`
+  acquire all per-tree locks, while VMA insert/remove operations use
+  :c:func:`!i_mmap_tree_lock_write` to lock only the relevant sibling tree,
+  which reduces lock contention.
+
 We discuss page table locks separately in the dedicated section below.
 
 The first thing **any** of these locks achieve is to **stabilise** the VMA
@@ -230,12 +239,16 @@ These are the core fields which describe the MM the VMA belongs to and its attri
                                                            Updated under mmap read lock by
                                                            :c:func:`!task_numa_work`.
    :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
-                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
-                                                           either of zero size if userfaultfd is
-                                                           disabled, or containing a pointer
-                                                           to an underlying
-                                                           :c:type:`!userfaultfd_ctx` object which
-                                                           describes userfaultfd metadata.
+                                                            type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
+                                                            either of zero size if userfaultfd is
+                                                            disabled, or containing a pointer
+                                                            to an underlying
+                                                            :c:type:`!userfaultfd_ctx` object which
+                                                            describes userfaultfd metadata.
+   :c:member:`!tree_idx`             CONFIG_SPLIT_I_MMAP   The index of the sibling i_mmap tree     Written once on
+                                                            that this VMA belongs to, set at         initial map.
+                                                            VMA creation time based on the NUMA
+                                                            node or the smallest sibling tree.
    ================================= ===================== ======================================== ===============
 
 These fields are present or not depending on whether the relevant kernel
@@ -247,12 +260,18 @@ configuration option is set.
    Field                               Description                               Write lock
    =================================== ========================================= ============================
    :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write, VMA write,
-                                       mapping is file-backed, to place the VMA  i_mmap write.
-                                       in the
-                                       :c:member:`!struct address_space->i_mmap`
-                                       red/black interval tree.
+                                        mapping is file-backed, to place the VMA  i_mmap write (or per-tree
+                                        in the                                    i_mmap write when
+                                        :c:member:`!struct address_space->i_mmap` :c:macro:`!CONFIG_SPLIT_I_MMAP`
+                                        red/black interval tree (or one of the    is set).
+                                        sibling trees when
+                                        :c:macro:`!CONFIG_SPLIT_I_MMAP`
+                                        is enabled).
    :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write, VMA write,
-                                       interval tree if the VMA is file-backed.  i_mmap write.
+                                        interval tree if the VMA is file-backed.  i_mmap write (or per-tree
+                                                                                  i_mmap write when
+                                                                                  :c:macro:`!CONFIG_SPLIT_I_MMAP`
+                                                                                  is set).
    :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW’d     mmap read, anon_vma write.
                                        :c:type:`!anon_vma` objects and
                                        :c:member:`!vma->anon_vma` if it is
@@ -490,6 +509,16 @@ There is also a file-system specific lock ordering comment located at the top of
 Please check the current state of these comments which may have changed since
 the time of writing of this document.
 
+.. note:: When :c:macro:`!CONFIG_SPLIT_I_MMAP` is enabled, the single
+   ``mapping->i_mmap_rwsem`` is replaced by an array of per-tree locks
+   ``mapping->i_mmap[i]->rwsem``. The lock ordering positions of
+   ``mapping->i_mmap_rwsem`` above apply to each per-tree lock
+   equivalently. VMA insert/remove operations acquire only the relevant
+   per-tree lock via :c:func:`!i_mmap_tree_lock_write`, while operations
+   that require all trees to be locked (such as
+   :c:func:`!unmap_mapping_range`) acquire all per-tree locks via
+   :c:func:`!i_mmap_lock_write` or :c:func:`!i_mmap_lock_read`.
+
 ------------------------------
 Locking Implementation Details
 ------------------------------
@@ -704,11 +733,15 @@ traversed or referenced by concurrent tasks.
 
 It is insufficient to simply hold an mmap write lock and VMA lock (which will
 prevent racing faults, and rmap operations), as a file-backed mapping can be
-truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.
+truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone
+(or, when :c:macro:`!CONFIG_SPLIT_I_MMAP` is enabled, under all per-tree
+``mapping->i_mmap[i]->rwsem`` locks acquired via
+:c:func:`!i_mmap_lock_write`).
 
 As a result, no VMA which can be accessed via the reverse mapping (either
 through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
-address_space->i_mmap` interval trees) can have its page tables torn down.
+address_space->i_mmap` interval trees, or the sibling trees when
+:c:macro:`!CONFIG_SPLIT_I_MMAP` is enabled) can have its page tables torn down.
 
 The operation is typically performed via :c:func:`!free_pgtables`, which assumes
 either the mmap write lock has been taken (as specified by its
@@ -729,7 +762,9 @@ cleared without page table locks (in the :c:func:`!pgd_clear`, :c:func:`!p4d_cle
 .. note:: It is possible for leaf page tables to be torn down independent of
           the page tables above it as is done by
           :c:func:`!retract_page_tables`, which is performed under the i_mmap
-          read lock, PMD, and PTE page table locks, without this level of care.
+          read lock (or all per-tree ``mapping->i_mmap[i]->rwsem`` locks in
+          read mode when :c:macro:`!CONFIG_SPLIT_I_MMAP` is enabled), PMD, and
+          PTE page table locks, without this level of care.
 
 Page table moving
 ^^^^^^^^^^^^^^^^^
-- 
2.53.0



^ permalink raw reply related

* [PATCH v2 1/4] mm: use mapping_mapped to simplify the code
From: Huang Shijie @ 2026-06-11  6:18 UTC (permalink / raw)
  To: akpm, viro, brauner, jack, muchun.song, osalvador, david
  Cc: surenb, mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko,
	corbet, skhan, linux, dinguyen, schuster.simon, James.Bottomley,
	deller, djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei, Huang Shijie
In-Reply-To: <20260611061915.2354307-1-huangsj@hygon.cn>

Replace the open-coded !RB_EMPTY_ROOT(&mapping->i_mmap.rb_root) checks
with mapping_mapped(), which makes the code tidier.
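
A minimal userspace sketch of the equivalence this change relies on, assuming
mapping_mapped() is the usual "interval tree is not empty" test. The *_model
types and mapping_mapped_model() are hypothetical stand-ins, not the kernel
definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for rb_node, rb_root and address_space. */
struct rb_node_model { int unused; };
struct rb_root_model { struct rb_node_model *rb_node; };
struct mapping_model { struct rb_root_model i_mmap; };

/* True when at least one VMA is linked into the tree. */
static bool mapping_mapped_model(const struct mapping_model *m)
{
	return m->i_mmap.rb_node != NULL;
}
```

Callers then test a named predicate instead of reaching into the tree root,
so they do not need to change if the representation of i_mmap changes later.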

Signed-off-by: Huang Shijie <huangsj@hygon.cn>
---
 fs/hugetlbfs/inode.c | 4 ++--
 mm/memory.c          | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 78d61bf2bd9b..216e1a0dd0b2 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -614,7 +614,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 
 	i_size_write(inode, offset);
 	i_mmap_lock_write(mapping);
-	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
+	if (mapping_mapped(mapping))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
 				      ZAP_FLAG_DROP_MARKER);
 	i_mmap_unlock_write(mapping);
@@ -675,7 +675,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 
 	/* Unmap users of full pages in the hole. */
 	if (hole_end > hole_start) {
-		if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
+		if (mapping_mapped(mapping))
 			hugetlb_vmdelete_list(&mapping->i_mmap,
 					      hole_start >> PAGE_SHIFT,
 					      hole_end >> PAGE_SHIFT, 0);
diff --git a/mm/memory.c b/mm/memory.c
index 86a973119bd4..5335077765e2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4386,7 +4386,7 @@ void unmap_mapping_folio(struct folio *folio)
 	details.zap_flags = ZAP_FLAG_DROP_MARKER;
 
 	i_mmap_lock_read(mapping);
-	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
+	if (unlikely(mapping_mapped(mapping)))
 		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
 					 last_index, &details);
 	i_mmap_unlock_read(mapping);
@@ -4416,7 +4416,7 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
 		last_index = ULONG_MAX;
 
 	i_mmap_lock_read(mapping);
-	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
+	if (unlikely(mapping_mapped(mapping)))
 		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
 					 last_index, &details);
 	i_mmap_unlock_read(mapping);
-- 
2.53.0



^ permalink raw reply related

* [PATCH v2 2/4] mm: use get_i_mmap_root to access the file's i_mmap
From: Huang Shijie @ 2026-06-11  6:18 UTC (permalink / raw)
  To: akpm, viro, brauner, jack, muchun.song, osalvador, david
  Cc: surenb, mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko,
	corbet, skhan, linux, dinguyen, schuster.simon, James.Bottomley,
	deller, djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei, Huang Shijie
In-Reply-To: <20260611061915.2354307-1-huangsj@hygon.cn>

Do not access the file's i_mmap directly; use get_i_mmap_root()
instead. This prepares for a later patch in this series.
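
The accessor pattern can be sketched in userspace as below. The *_model types,
get_root_model() and root_is_empty() are hypothetical names used only for
illustration; the one detail taken from this series is that the unsplit
accessor returns the address of the embedded i_mmap root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for rb_root_cached and address_space. */
struct rb_root_cached_model { void *rb_node; };
struct address_space_model { struct rb_root_cached_model i_mmap; };

/* Single place that knows where the tree root lives. */
static inline struct rb_root_cached_model *
get_root_model(struct address_space_model *mapping)
{
	return &mapping->i_mmap;
}

/* Example caller: it only ever sees the pointer the accessor returns. */
static bool root_is_empty(const struct rb_root_cached_model *root)
{
	return root->rb_node == NULL;
}
```

With every caller routed through one helper, a later change to how the root is
stored only has to touch the helper.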

Signed-off-by: Huang Shijie <huangsj@hygon.cn>
---
 arch/arm/mm/fault-armv.c   |  3 ++-
 arch/arm/mm/flush.c        |  3 ++-
 arch/nios2/mm/cacheflush.c |  3 ++-
 arch/parisc/kernel/cache.c |  4 +++-
 fs/dax.c                   |  3 ++-
 fs/hugetlbfs/inode.c       |  6 +++---
 include/linux/fs.h         |  5 +++++
 include/linux/mm.h         |  1 +
 kernel/events/uprobes.c    |  3 ++-
 mm/hugetlb.c               |  7 +++++--
 mm/khugepaged.c            |  6 ++++--
 mm/memory-failure.c        |  8 +++++---
 mm/memory.c                |  4 ++--
 mm/mmap.c                  |  2 +-
 mm/nommu.c                 |  9 +++++----
 mm/pagewalk.c              |  2 +-
 mm/rmap.c                  |  2 +-
 mm/vma.c                   | 14 ++++++++------
 18 files changed, 54 insertions(+), 31 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 91e488767783..1b5fe151e805 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -126,6 +126,7 @@ make_coherent(struct address_space *mapping, struct vm_area_struct *vma,
 {
 	const unsigned long pmd_start_addr = ALIGN_DOWN(addr, PMD_SIZE);
 	const unsigned long pmd_end_addr = pmd_start_addr + PMD_SIZE;
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *mpnt;
 	unsigned long offset;
@@ -140,7 +141,7 @@ make_coherent(struct address_space *mapping, struct vm_area_struct *vma,
 	 * cache coherency.
 	 */
 	flush_dcache_mmap_lock(mapping);
-	vma_interval_tree_foreach(mpnt, &mapping->i_mmap, pgoff, pgoff) {
+	vma_interval_tree_foreach(mpnt, root, pgoff, pgoff) {
 		/*
 		 * If we are using split PTE locks, then we need to take the pte
 		 * lock. Otherwise we are using shared mm->page_table_lock which
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index 4d7ef5cc36b6..01588df81bfc 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -238,6 +238,7 @@ void __flush_dcache_folio(struct address_space *mapping, struct folio *folio)
 static void __flush_dcache_aliases(struct address_space *mapping, struct folio *folio)
 {
 	struct mm_struct *mm = current->active_mm;
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 	struct vm_area_struct *vma;
 	pgoff_t pgoff, pgoff_end;
 
@@ -251,7 +252,7 @@ static void __flush_dcache_aliases(struct address_space *mapping, struct folio *
 	pgoff_end = pgoff + folio_nr_pages(folio) - 1;
 
 	flush_dcache_mmap_lock(mapping);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff_end) {
+	vma_interval_tree_foreach(vma, root, pgoff, pgoff_end) {
 		unsigned long start, offset, pfn;
 		unsigned int nr;
 
diff --git a/arch/nios2/mm/cacheflush.c b/arch/nios2/mm/cacheflush.c
index 8321182eb927..ab6e064fabe2 100644
--- a/arch/nios2/mm/cacheflush.c
+++ b/arch/nios2/mm/cacheflush.c
@@ -78,11 +78,12 @@ static void flush_aliases(struct address_space *mapping, struct folio *folio)
 	unsigned long flags;
 	pgoff_t pgoff;
 	unsigned long nr = folio_nr_pages(folio);
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 
 	pgoff = folio->index;
 
 	flush_dcache_mmap_lock_irqsave(mapping, flags);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff + nr - 1) {
+	vma_interval_tree_foreach(vma, root, pgoff, pgoff + nr - 1) {
 		unsigned long start;
 
 		if (vma->vm_mm != mm)
diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index 0170b69a21d3..f99dffd6cc22 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -473,6 +473,7 @@ static inline unsigned long get_upa(struct mm_struct *mm, unsigned long addr)
 void flush_dcache_folio(struct folio *folio)
 {
 	struct address_space *mapping = folio_flush_mapping(folio);
+	struct rb_root_cached *root;
 	struct vm_area_struct *vma;
 	unsigned long addr, old_addr = 0;
 	void *kaddr;
@@ -494,6 +495,7 @@ void flush_dcache_folio(struct folio *folio)
 		return;
 
 	pgoff = folio->index;
+	root = get_i_mmap_root(mapping);
 
 	/*
 	 * We have carefully arranged in arch_get_unmapped_area() that
@@ -503,7 +505,7 @@ void flush_dcache_folio(struct folio *folio)
 	 * on machines that support equivalent aliasing
 	 */
 	flush_dcache_mmap_lock_irqsave(mapping, flags);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff + nr - 1) {
+	vma_interval_tree_foreach(vma, root, pgoff, pgoff + nr - 1) {
 		unsigned long offset = pgoff - vma->vm_pgoff;
 		unsigned long pfn = folio_pfn(folio);
 
diff --git a/fs/dax.c b/fs/dax.c
index 6d175cd47a99..d402edc3c1b8 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1138,6 +1138,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 		struct address_space *mapping, void *entry)
 {
 	unsigned long pfn, index, count, end;
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 	long ret = 0;
 	struct vm_area_struct *vma;
 
@@ -1201,7 +1202,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 
 	/* Walk all mappings of a given index of a file and writeprotect them */
 	i_mmap_lock_read(mapping);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) {
+	vma_interval_tree_foreach(vma, root, index, end) {
 		pfn_mkclean_range(pfn, count, index, vma);
 		cond_resched();
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 216e1a0dd0b2..da5b41ea5bdd 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -380,7 +380,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
 					struct address_space *mapping,
 					struct folio *folio, pgoff_t index)
 {
-	struct rb_root_cached *root = &mapping->i_mmap;
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 	struct hugetlb_vma_lock *vma_lock;
 	unsigned long pfn = folio_pfn(folio);
 	struct vm_area_struct *vma;
@@ -615,7 +615,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	i_size_write(inode, offset);
 	i_mmap_lock_write(mapping);
 	if (mapping_mapped(mapping))
-		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
+		hugetlb_vmdelete_list(get_i_mmap_root(mapping), pgoff, 0,
 				      ZAP_FLAG_DROP_MARKER);
 	i_mmap_unlock_write(mapping);
 	remove_inode_hugepages(inode, offset, LLONG_MAX);
@@ -676,7 +676,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	/* Unmap users of full pages in the hole. */
 	if (hole_end > hole_start) {
 		if (mapping_mapped(mapping))
-			hugetlb_vmdelete_list(&mapping->i_mmap,
+			hugetlb_vmdelete_list(get_i_mmap_root(mapping),
 					      hole_start >> PAGE_SHIFT,
 					      hole_end >> PAGE_SHIFT, 0);
 	}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..cd46615b8f53 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -556,6 +556,11 @@ static inline int mapping_mapped(const struct address_space *mapping)
 	return	!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root);
 }
 
+static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mapping)
+{
+	return &mapping->i_mmap;
+}
+
 /*
  * Might pages of this file have been modified in userspace?
  * Note that i_mmap_writable counts all VM_SHARED, VM_MAYWRITE vmas: do_mmap
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 06bbe9eba636..0a45c6a8b9f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4041,6 +4041,7 @@ struct vm_area_struct *vma_interval_tree_iter_first(struct rb_root_cached *root,
 struct vm_area_struct *vma_interval_tree_iter_next(struct vm_area_struct *node,
 				unsigned long start, unsigned long last);
 
+/* Please use get_i_mmap_root() to get the @root */
 #define vma_interval_tree_foreach(vma, root, start, last)		\
 	for (vma = vma_interval_tree_iter_first(root, start, last);	\
 	     vma; vma = vma_interval_tree_iter_next(vma, start, last))
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4084e926e284..d8561a42aec8 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1201,6 +1201,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
 static struct map_info *
 build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
 {
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 	unsigned long pgoff = offset >> PAGE_SHIFT;
 	struct vm_area_struct *vma;
 	struct map_info *curr = NULL;
@@ -1210,7 +1211,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
 
  again:
 	i_mmap_lock_read(mapping);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+	vma_interval_tree_foreach(vma, root, pgoff, pgoff) {
 		if (!valid_vma(vma, is_register))
 			continue;
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4b80b167cc9c..8bc49d57a116 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5360,6 +5360,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct vm_area_struct *iter_vma;
 	struct address_space *mapping;
+	struct rb_root_cached *root;
 	pgoff_t pgoff;
 
 	/*
@@ -5370,6 +5371,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 	pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) +
 			vma->vm_pgoff;
 	mapping = vma->vm_file->f_mapping;
+	root = get_i_mmap_root(mapping);
 
 	/*
 	 * Take the mapping lock for the duration of the table walk. As
@@ -5377,7 +5379,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * __unmap_hugepage_range() is called as the lock is already held
 	 */
 	i_mmap_lock_write(mapping);
-	vma_interval_tree_foreach(iter_vma, &mapping->i_mmap, pgoff, pgoff) {
+	vma_interval_tree_foreach(iter_vma, root, pgoff, pgoff) {
 		/* Do not unmap the current VMA */
 		if (iter_vma == vma)
 			continue;
@@ -6850,6 +6852,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 	pgoff_t idx = ((addr - vma->vm_start) >> PAGE_SHIFT) +
 			vma->vm_pgoff;
 	struct vm_area_struct *svma;
@@ -6858,7 +6861,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *pte;
 
 	i_mmap_lock_read(mapping);
-	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
+	vma_interval_tree_foreach(svma, root, idx, idx) {
 		if (svma == vma)
 			continue;
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b8452dbdb043..0f577e4a2ccd 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1773,10 +1773,11 @@ static bool file_backed_vma_is_retractable(struct vm_area_struct *vma)
 
 static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
+	struct rb_root_cached *root = get_i_mmap_root(mapping);
 	struct vm_area_struct *vma;
 
 	i_mmap_lock_read(mapping);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+	vma_interval_tree_foreach(vma, root, pgoff, pgoff) {
 		struct mmu_notifier_range range;
 		struct mm_struct *mm;
 		unsigned long addr;
@@ -2194,7 +2195,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 		 * not be able to observe any missing pages due to the
 		 * previously inserted retry entries.
 		 */
-		vma_interval_tree_foreach(vma, &mapping->i_mmap, start, end) {
+		vma_interval_tree_foreach(vma, get_i_mmap_root(mapping),
+					start, end) {
 			if (userfaultfd_missing(vma)) {
 				result = SCAN_EXCEED_NONE_PTE;
 				goto immap_locked;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d4361309..85196d9bb26c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -598,7 +598,7 @@ static void collect_procs_file(const struct folio *folio,
 
 		if (!t)
 			continue;
-		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
+		vma_interval_tree_foreach(vma, get_i_mmap_root(mapping), pgoff,
 				      pgoff) {
 			/*
 			 * Send early kill signal to tasks where a vma covers
@@ -650,7 +650,8 @@ static void collect_procs_fsdax(const struct page *page,
 			t = task_early_kill(tsk, true);
 		if (!t)
 			continue;
-		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		vma_interval_tree_foreach(vma, get_i_mmap_root(mapping), pgoff,
+					pgoff) {
 			if (vma->vm_mm == t->mm)
 				add_to_kill_fsdax(t, page, vma, to_kill, pgoff);
 		}
@@ -2251,7 +2252,8 @@ static void collect_procs_pfn(struct pfn_address_space *pfn_space,
 		t = task_early_kill(tsk, true);
 		if (!t)
 			continue;
-		vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
+		vma_interval_tree_foreach(vma, get_i_mmap_root(mapping),
+					0, ULONG_MAX) {
 			pgoff_t pgoff;
 
 			if (vma->vm_mm == t->mm &&
diff --git a/mm/memory.c b/mm/memory.c
index 5335077765e2..9ea5d6c8ef4d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4387,7 +4387,7 @@ void unmap_mapping_folio(struct folio *folio)
 
 	i_mmap_lock_read(mapping);
 	if (unlikely(mapping_mapped(mapping)))
-		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+		unmap_mapping_range_tree(get_i_mmap_root(mapping), first_index,
 					 last_index, &details);
 	i_mmap_unlock_read(mapping);
 }
@@ -4417,7 +4417,7 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
 
 	i_mmap_lock_read(mapping);
 	if (unlikely(mapping_mapped(mapping)))
-		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+		unmap_mapping_range_tree(get_i_mmap_root(mapping), first_index,
 					 last_index, &details);
 	i_mmap_unlock_read(mapping);
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index 5754d1c36462..d714fdb357e5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1831,7 +1831,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 			flush_dcache_mmap_lock(mapping);
 			/* insert tmp into the share list, just after mpnt */
 			vma_interval_tree_insert_after(tmp, mpnt,
-					&mapping->i_mmap);
+					get_i_mmap_root(mapping));
 			flush_dcache_mmap_unlock(mapping);
 			i_mmap_unlock_write(mapping);
 		}
diff --git a/mm/nommu.c b/mm/nommu.c
index ed3934bc2de4..0f18ffc658e9 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -569,7 +569,7 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
 
 		i_mmap_lock_write(mapping);
 		flush_dcache_mmap_lock(mapping);
-		vma_interval_tree_insert(vma, &mapping->i_mmap);
+		vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
 		flush_dcache_mmap_unlock(mapping);
 		i_mmap_unlock_write(mapping);
 	}
@@ -585,7 +585,7 @@ static void cleanup_vma_from_mm(struct vm_area_struct *vma)
 
 		i_mmap_lock_write(mapping);
 		flush_dcache_mmap_lock(mapping);
-		vma_interval_tree_remove(vma, &mapping->i_mmap);
+		vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
 		flush_dcache_mmap_unlock(mapping);
 		i_mmap_unlock_write(mapping);
 	}
@@ -1804,6 +1804,7 @@ EXPORT_SYMBOL_GPL(copy_remote_vm_str);
 int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
 				size_t newsize)
 {
+	struct rb_root_cached *root = get_i_mmap_root(inode->i_mapping);
 	struct vm_area_struct *vma;
 	struct vm_region *region;
 	pgoff_t low, high;
@@ -1816,7 +1817,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
 	i_mmap_lock_read(inode->i_mapping);
 
 	/* search for VMAs that fall within the dead zone */
-	vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, low, high) {
+	vma_interval_tree_foreach(vma, root, low, high) {
 		/* found one - only interested if it's shared out of the page
 		 * cache */
 		if (vma->vm_flags & VM_SHARED) {
@@ -1832,7 +1833,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
 	 * we don't check for any regions that start beyond the EOF as there
 	 * shouldn't be any
 	 */
-	vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, 0, ULONG_MAX) {
+	vma_interval_tree_foreach(vma, root, 0, ULONG_MAX) {
 		if (!(vma->vm_flags & VM_SHARED))
 			continue;
 
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..8df1b5077951 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -810,7 +810,7 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
 		return -EINVAL;
 
 	lockdep_assert_held(&mapping->i_mmap_rwsem);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index,
+	vma_interval_tree_foreach(vma, get_i_mmap_root(mapping), first_index,
 				  first_index + nr - 1) {
 		/* Clip to the vma */
 		vba = vma->vm_pgoff;
diff --git a/mm/rmap.c b/mm/rmap.c
index 99e1b3dc390b..6cfcdb96071f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -3051,7 +3051,7 @@ static void __rmap_walk_file(struct folio *folio, struct address_space *mapping,
 		i_mmap_lock_read(mapping);
 	}
 lookup:
-	vma_interval_tree_foreach(vma, &mapping->i_mmap,
+	vma_interval_tree_foreach(vma, get_i_mmap_root(mapping),
 			pgoff_start, pgoff_end) {
 		unsigned long address = vma_address(vma, pgoff_start, nr_pages);
 
diff --git a/mm/vma.c b/mm/vma.c
index d90791b00a7b..6159650c1b42 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -234,7 +234,7 @@ static void __vma_link_file(struct vm_area_struct *vma,
 		mapping_allow_writable(mapping);
 
 	flush_dcache_mmap_lock(mapping);
-	vma_interval_tree_insert(vma, &mapping->i_mmap);
+	vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
 	flush_dcache_mmap_unlock(mapping);
 }
 
@@ -248,7 +248,7 @@ static void __remove_shared_vm_struct(struct vm_area_struct *vma,
 		mapping_unmap_writable(mapping);
 
 	flush_dcache_mmap_lock(mapping);
-	vma_interval_tree_remove(vma, &mapping->i_mmap);
+	vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
 	flush_dcache_mmap_unlock(mapping);
 }
 
@@ -319,10 +319,11 @@ static void vma_prepare(struct vma_prepare *vp)
 
 	if (vp->file) {
 		flush_dcache_mmap_lock(vp->mapping);
-		vma_interval_tree_remove(vp->vma, &vp->mapping->i_mmap);
+		vma_interval_tree_remove(vp->vma,
+					get_i_mmap_root(vp->mapping));
 		if (vp->adj_next)
 			vma_interval_tree_remove(vp->adj_next,
-						 &vp->mapping->i_mmap);
+					get_i_mmap_root(vp->mapping));
 	}
 
 }
@@ -341,8 +342,9 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 	if (vp->file) {
 		if (vp->adj_next)
 			vma_interval_tree_insert(vp->adj_next,
-						 &vp->mapping->i_mmap);
-		vma_interval_tree_insert(vp->vma, &vp->mapping->i_mmap);
+					get_i_mmap_root(vp->mapping));
+		vma_interval_tree_insert(vp->vma,
+					get_i_mmap_root(vp->mapping));
 		flush_dcache_mmap_unlock(vp->mapping);
 	}
 
-- 
2.53.0



^ permalink raw reply related

* [PATCH v2 0/4] mm: split the file's i_mmap tree for NUMA
From: Huang Shijie @ 2026-06-11  6:18 UTC (permalink / raw)
  To: akpm, viro, brauner, jack, muchun.song, osalvador, david
  Cc: surenb, mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko,
	corbet, skhan, linux, dinguyen, schuster.simon, James.Bottomley,
	deller, djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei, Huang Shijie

  A NUMA system can have many NUMA nodes and many CPUs.
For example, one of Hygon's servers has 12 NUMA nodes and 384 CPUs.
UnixBench includes a test named "execl", which exercises
the execve system call.

  When we test our server with "./Run -c 384 execl",
the result is poor. The i_mmap locks are heavily contended on
"libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have
over 6000 VMAs, and those VMAs can be on different NUMA nodes.
The insert/remove operations do not run quickly enough.

Patch 1 and patch 2 hide the direct accesses to i_mmap.
Patch 3 splits the i_mmap tree into sibling trees, each with its own lock.
With this patch set we get over 400% performance improvement
on our NUMA server.
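
As a rough illustration of the idea behind patch 3 (a userspace sketch
under stated assumptions, not code from this series): keep one tree per
NUMA node, each with its own lock, so that inserts on different nodes do
not contend, while a range lookup has to visit every sibling. NR_NODES,
struct split_mapping and the split_*() helpers are hypothetical names; a
linked list stands in for the interval tree and a pthread mutex stands in
for the i_mmap lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_NODES 4	/* hypothetical node count */

/* One mapped range; stands in for a VMA in the interval tree. */
struct range {
	unsigned long start, last;	/* inclusive page offsets */
	struct range *next;
};

/* One sibling "tree" (a plain list here) with its own lock. */
struct sibling {
	pthread_mutex_t lock;
	struct range *head;
};

struct split_mapping {
	struct sibling node[NR_NODES];
};

static void split_init(struct split_mapping *m)
{
	for (int i = 0; i < NR_NODES; i++) {
		pthread_mutex_init(&m->node[i].lock, NULL);
		m->node[i].head = NULL;
	}
}

/* Insert takes only the lock of the sibling chosen by @nid. */
static void split_insert(struct split_mapping *m, unsigned int nid,
			 struct range *r)
{
	struct sibling *s = &m->node[nid % NR_NODES];

	assert(r && r->start <= r->last);
	pthread_mutex_lock(&s->lock);
	r->next = s->head;
	s->head = r;
	pthread_mutex_unlock(&s->lock);
}

/* A lookup has to walk every sibling, taking each lock in turn. */
static int split_count_overlaps(struct split_mapping *m,
				unsigned long start, unsigned long last)
{
	int hits = 0;

	for (int i = 0; i < NR_NODES; i++) {
		struct sibling *s = &m->node[i];

		pthread_mutex_lock(&s->lock);
		for (struct range *r = s->head; r; r = r->next)
			if (r->start <= last && start <= r->last)
				hits++;
		pthread_mutex_unlock(&s->lock);
	}
	return hits;
}
```

In this sketch writers get cheaper while full-range walkers pay for
visiting every sibling; how the series itself balances the two is
defined by the patches, not by this example.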

I did not test the non-NUMA case, since I do not have such a server.

v1 --> v2:
	Not only split the i_mmap tree, but also split the lock.
	v1 : https://lkml.org/lkml/2026/4/13/199

Huang Shijie (4):
  mm: use mapping_mapped to simplify the code
  mm: use get_i_mmap_root to access the file's i_mmap
  mm/fs: split the file's i_mmap tree
  docs/mm: update document for split i_mmap tree

 Documentation/mm/process_addrs.rst |  63 +++++++---
 arch/arm/mm/fault-armv.c           |   3 +-
 arch/arm/mm/flush.c                |   3 +-
 arch/nios2/mm/cacheflush.c         |   3 +-
 arch/parisc/kernel/cache.c         |   4 +-
 fs/Kconfig                         |   8 ++
 fs/dax.c                           |   3 +-
 fs/hugetlbfs/inode.c               |  30 +++--
 fs/inode.c                         |  75 +++++++++++-
 include/linux/fs.h                 | 179 ++++++++++++++++++++++++++++-
 include/linux/mm.h                 |  81 +++++++++++++
 include/linux/mm_types.h           |   3 +
 kernel/events/uprobes.c            |   3 +-
 mm/hugetlb.c                       |   7 +-
 mm/internal.h                      |   3 +-
 mm/khugepaged.c                    |   6 +-
 mm/memory-failure.c                |   8 +-
 mm/memory.c                        |   8 +-
 mm/mmap.c                          |  11 +-
 mm/nommu.c                         |  28 +++--
 mm/pagewalk.c                      |   4 +-
 mm/rmap.c                          |   2 +-
 mm/vma.c                           |  74 +++++++++---
 mm/vma_init.c                      |   3 +
 24 files changed, 534 insertions(+), 78 deletions(-)

-- 
2.53.0



^ permalink raw reply

* Re: [RESEND][PATCH v2] unwind: Add sframe_(un)register() system calls
From: Fangrui Song @ 2026-06-11  7:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
	Mathieu Desnoyers, Jens Remus, Josh Poimboeuf, Peter Zijlstra,
	Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
	Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
	Vasily Gorbik, Thomas Weißschuh
In-Reply-To: <20260528151626.4573592d@gandalf.local.home>



On 2026-05-28, Steven Rostedt wrote:
>From: Steven Rostedt <rostedt@goodmis.org>
>
>Add system calls to register and unregister sframes that can be used by
>dynamic linkers to tell the kernel where the sframe section is in memory
>for libraries it loads.
>
>Both system calls take a pointer to a new structure:
>
>  struct sframe_setup {
>	__u64			sframe_start;
>	__u64			sframe_size;
>	__u64			text_start;
>	__u64			text_size;
>  };
>
>and a size of the passed in structure. If the system call needs to be
>extended, then the structure could be changed and the size of that
>structure will tell the kernel that it is the new version. If the kernel
>does not recognize the structure size, it will return -EINVAL.
>
>  sframe_start - The virtual address of the sframe section
>  sframe_size  - The length of the sframe section
>  text_start   - the text section the sframe represents
>  text_size    - the length of the text section
>
>If other stack tracing functionality is added, it will require a new
>system call.
>
>The unregister only needs the sframe_start and requires all the rest of
>the fields to be 0. In the future, if more can be done, then user space
>can update the other values and check the return code to see if the kernel
>supports it.
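
For readers following the description above, here is a minimal userspace
sketch (not taken from the patch) of the size-versioned argument check it
describes: an unrecognized structure size yields -EINVAL, and unregister
expects every field other than sframe_start to be zero. The helper names
check_setup_size() and check_unregister() are hypothetical, uint64_t
stands in for __u64, and returning -EINVAL for non-zero extra fields is an
assumption, since the description does not name that error code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the layout quoted above; uint64_t stands in for __u64. */
struct sframe_setup {
	uint64_t sframe_start;
	uint64_t sframe_size;
	uint64_t text_start;
	uint64_t text_size;
};

/* Hypothetical helper: only the one structure size known today passes. */
static int check_setup_size(size_t size)
{
	return size == sizeof(struct sframe_setup) ? 0 : -EINVAL;
}

/*
 * Hypothetical helper: unregister uses sframe_start only, so all other
 * fields must be zero. The -EINVAL for this case is an assumption.
 */
static int check_unregister(const struct sframe_setup *s, size_t size)
{
	int ret = check_setup_size(size);

	if (ret)
		return ret;
	if (s->sframe_size || s->text_start || s->text_size)
		return -EINVAL;
	return 0;
}
```

Rejecting non-zero reserved fields now is what would let a later kernel
give them meaning while user space probes for support via the return
code, as the description suggests.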
>
>Also added a DEFINE_GUARD() for mmap_write_lock. There was one for
>mmap_read_lock but not for mmap_write_lock.
>
>Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
>---
>
>[ Resend with Indu's current email address. ]
>
>Changes since v1: https://patch.msgid.link/20260521183532.7a145c8a@gandalf.local.home
>
>- Use mmap_write_lock() instead of mmap_read_lock() for mutual
>  exclusiveness. (Jens Remus)
>
>- Guard mtree_insert_range() with mmap_write_lock. (Jens Remus)
>
>- Added a guard for mmap_write_lock() similar to the one for mmap_read_lock.
>
>- Have syscall prototype use structure pointer instead of void (Thomas Weißschuh)
>
>- Use __u64 instead of unsigned long for struct members (Thomas Weißschuh)
>
>- Use size_t instead of int for structure size in syscall argument.
> (Thomas Weißschuh)
>
> arch/alpha/kernel/syscalls/syscall.tbl      |  2 +
> arch/arm/tools/syscall.tbl                  |  2 +
> arch/arm64/tools/syscall_32.tbl             |  2 +
> arch/m68k/kernel/syscalls/syscall.tbl       |  2 +
> arch/microblaze/kernel/syscalls/syscall.tbl |  2 +
> arch/mips/kernel/syscalls/syscall_n32.tbl   |  2 +
> arch/mips/kernel/syscalls/syscall_n64.tbl   |  2 +
> arch/mips/kernel/syscalls/syscall_o32.tbl   |  2 +
> arch/parisc/kernel/syscalls/syscall.tbl     |  2 +
> arch/powerpc/kernel/syscalls/syscall.tbl    |  2 +
> arch/s390/kernel/syscalls/syscall.tbl       |  3 +
> arch/sh/kernel/syscalls/syscall.tbl         |  2 +
> arch/sparc/kernel/syscalls/syscall.tbl      |  2 +
> arch/x86/entry/syscalls/syscall_32.tbl      |  2 +
> arch/x86/entry/syscalls/syscall_64.tbl      |  2 +
> arch/xtensa/kernel/syscalls/syscall.tbl     |  2 +
> include/linux/mmap_lock.h                   |  3 +
> include/linux/syscalls.h                    |  3 +
> include/uapi/asm-generic/unistd.h           |  7 ++-
> include/uapi/linux/sframe.h                 | 12 ++++
> kernel/sys_ni.c                             |  3 +
> kernel/unwind/sframe.c                      | 69 +++++++++++++++++++--
> scripts/syscall.tbl                         |  2 +
> 23 files changed, 126 insertions(+), 6 deletions(-)
> create mode 100644 include/uapi/linux/sframe.h
>
>diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
>index f31b7afffc34..f0639b831f2a 100644
>--- a/arch/alpha/kernel/syscalls/syscall.tbl
>+++ b/arch/alpha/kernel/syscalls/syscall.tbl
>@@ -511,3 +511,5 @@
> 579	common	file_setattr			sys_file_setattr
> 580	common	listns				sys_listns
> 581	common	rseq_slice_yield		sys_rseq_slice_yield
>+582	common	sframe_register			sys_sframe_register
>+583	common	sframe_unregister		sys_sframe_unregister
>diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
>index 94351e22bfcf..887b242ffb25 100644
>--- a/arch/arm/tools/syscall.tbl
>+++ b/arch/arm/tools/syscall.tbl
>@@ -486,3 +486,5 @@
> 469	common	file_setattr			sys_file_setattr
> 470	common	listns				sys_listns
> 471	common	rseq_slice_yield		sys_rseq_slice_yield
>+472	common	sframe_register			sys_sframe_register
>+473	common	sframe_unregister		sys_sframe_unregister
>diff --git a/arch/arm64/tools/syscall_32.tbl b/arch/arm64/tools/syscall_32.tbl
>index 62d93d88e0fe..c820f1ff718c 100644
>--- a/arch/arm64/tools/syscall_32.tbl
>+++ b/arch/arm64/tools/syscall_32.tbl
>@@ -483,3 +483,5 @@
> 469	common	file_setattr			sys_file_setattr
> 470	common	listns				sys_listns
> 471	common	rseq_slice_yield		sys_rseq_slice_yield
>+472	common	sframe_register			sys_sframe_register
>+473	common	sframe_unregister		sys_sframe_unregister
>diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
>index 248934257101..4c7f17f0364b 100644
>--- a/arch/m68k/kernel/syscalls/syscall.tbl
>+++ b/arch/m68k/kernel/syscalls/syscall.tbl
>@@ -471,3 +471,5 @@
> 469	common	file_setattr			sys_file_setattr
> 470	common	listns				sys_listns
> 471	common	rseq_slice_yield		sys_rseq_slice_yield
>+472	common	sframe_register			sys_sframe_register
>+473	common	sframe_unregister		sys_sframe_unregister
>diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
>index 223d26303627..e8dc2cc149f4 100644
>--- a/arch/microblaze/kernel/syscalls/syscall.tbl
>+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
>@@ -477,3 +477,5 @@
> 469	common	file_setattr			sys_file_setattr
> 470	common	listns				sys_listns
> 471	common	rseq_slice_yield		sys_rseq_slice_yield
>+472	common	sframe_register			sys_sframe_register
>+473	common	sframe_unregister		sys_sframe_unregister
>diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
>index 7430714e2b8f..d0bae05d16af 100644
>--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
>+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
>@@ -410,3 +410,5 @@
> 469	n32	file_setattr			sys_file_setattr
> 470	n32	listns				sys_listns
> 471	n32	rseq_slice_yield		sys_rseq_slice_yield
>+472	n32	sframe_register			sys_sframe_register
>+473	n32	sframe_unregister		sys_sframe_unregister
>diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
>index 630aab9e5425..2e200de6a58c 100644
>--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
>+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
>@@ -386,3 +386,5 @@
> 469	n64	file_setattr			sys_file_setattr
> 470	n64	listns				sys_listns
> 471	n64	rseq_slice_yield		sys_rseq_slice_yield
>+472	n64	sframe_register			sys_sframe_register
>+473	n64	sframe_unregister		sys_sframe_unregister
>diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
>index 128653112284..0e3b82011ae2 100644
>--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
>+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
>@@ -459,3 +459,5 @@
> 469	o32	file_setattr			sys_file_setattr
> 470	o32	listns				sys_listns
> 471	o32	rseq_slice_yield		sys_rseq_slice_yield
>+472	o32	sframe_register			sys_sframe_register
>+473	o32	sframe_unregister		sys_sframe_unregister
>diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
>index c6331dad9461..e0758ef8667d 100644
>--- a/arch/parisc/kernel/syscalls/syscall.tbl
>+++ b/arch/parisc/kernel/syscalls/syscall.tbl
>@@ -470,3 +470,5 @@
> 469	common	file_setattr			sys_file_setattr
> 470	common	listns				sys_listns
> 471	common	rseq_slice_yield		sys_rseq_slice_yield
>+472	common	sframe_register			sys_sframe_register
>+473	common	sframe_unregister		sys_sframe_unregister
>diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
>index 4fcc7c58a105..eda40c4f4f2f 100644
>--- a/arch/powerpc/kernel/syscalls/syscall.tbl
>+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
>@@ -562,3 +562,5 @@
> 469	common	file_setattr			sys_file_setattr
> 470	common	listns				sys_listns
> 471	nospu	rseq_slice_yield		sys_rseq_slice_yield
>+472	nospu	sframe_register			sys_sframe_register
>+473	nospu	sframe_unregister		sys_sframe_unregister
>diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
>index 09a7ef04d979..52519e2acdc8 100644
>--- a/arch/s390/kernel/syscalls/syscall.tbl
>+++ b/arch/s390/kernel/syscalls/syscall.tbl
>@@ -398,3 +398,6 @@
> 469	common	file_setattr			sys_file_setattr
> 470	common	listns				sys_listns
> 471	common	rseq_slice_yield		sys_rseq_slice_yield
>+472	common	stacktrace_setup		sys_stacktrace_setup
>+472	common	sframe_register			sys_sframe_register
>+473	common	sframe_unregister		sys_sframe_unregister
>diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
>index 70b315cbe710..62ac7b1b4dd4 100644
>--- a/arch/sh/kernel/syscalls/syscall.tbl
>+++ b/arch/sh/kernel/syscalls/syscall.tbl
>@@ -475,3 +475,5 @@
> 469	common	file_setattr			sys_file_setattr
> 470	common	listns				sys_listns
> 471	common	rseq_slice_yield		sys_rseq_slice_yield
>+472	common	sframe_register			sys_sframe_register
>+473	common	sframe_unregister		sys_sframe_unregister
>diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
>index 7e71bf7fcd14..f92273ae608a 100644
>--- a/arch/sparc/kernel/syscalls/syscall.tbl
>+++ b/arch/sparc/kernel/syscalls/syscall.tbl
>@@ -517,3 +517,5 @@
> 469	common	file_setattr			sys_file_setattr
> 470	common	listns				sys_listns
> 471	common	rseq_slice_yield		sys_rseq_slice_yield
>+472	common	sframe_register			sys_sframe_register
>+473	common	sframe_unregister		sys_sframe_unregister
>diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
>index f832ebd2d79b..409a50df3b21 100644
>--- a/arch/x86/entry/syscalls/syscall_32.tbl
>+++ b/arch/x86/entry/syscalls/syscall_32.tbl
>@@ -477,3 +477,5 @@
> 469	i386	file_setattr		sys_file_setattr
> 470	i386	listns			sys_listns
> 471	i386	rseq_slice_yield	sys_rseq_slice_yield
>+472	i386	sframe_register		sys_sframe_register
>+473	i386	sframe_unregister	sys_sframe_unregister
>diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
>index 524155d655da..9b7c5a449751 100644
>--- a/arch/x86/entry/syscalls/syscall_64.tbl
>+++ b/arch/x86/entry/syscalls/syscall_64.tbl
>@@ -396,6 +396,8 @@
> 469	common	file_setattr		sys_file_setattr
> 470	common	listns			sys_listns
> 471	common	rseq_slice_yield	sys_rseq_slice_yield
>+472	common	sframe_register		sys_sframe_register
>+473	common	sframe_unregister	sys_sframe_unregister
>
> #
> # Due to a historical design error, certain syscalls are numbered differently
>diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
>index a9bca4e484de..037b8040f69d 100644
>--- a/arch/xtensa/kernel/syscalls/syscall.tbl
>+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
>@@ -442,3 +442,5 @@
> 469	common	file_setattr			sys_file_setattr
> 470	common	listns				sys_listns
> 471	common	rseq_slice_yield		sys_rseq_slice_yield
>+472	common	sframe_register			sys_sframe_register
>+473	common	sframe_unregister		sys_sframe_unregister
>diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
>index 04b8f61ece5d..6650c89a13ab 100644
>--- a/include/linux/mmap_lock.h
>+++ b/include/linux/mmap_lock.h
>@@ -579,6 +579,9 @@ static inline void mmap_write_unlock(struct mm_struct *mm)
> 	up_write(&mm->mmap_lock);
> }
>
>+DEFINE_GUARD(mmap_write_lock, struct mm_struct *,
>+	     mmap_write_lock(_T), mmap_write_unlock(_T))
>+
> static inline void mmap_write_downgrade(struct mm_struct *mm)
> {
> 	__mmap_lock_trace_acquire_returned(mm, false, true);
>diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>index f5639d5ac331..ad3c8d6b6471 100644
>--- a/include/linux/syscalls.h
>+++ b/include/linux/syscalls.h
>@@ -79,6 +79,7 @@ struct mnt_id_req;
> struct ns_id_req;
> struct xattr_args;
> struct file_attr;
>+struct sframe_setup;
>
> #include <linux/types.h>
> #include <linux/aio_abi.h>
>@@ -999,6 +1000,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
> asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
> 				      u32 size, u32 flags);
> asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
>+asmlinkage long sys_sframe_register(struct sframe_setup *data,  size_t size);
>+asmlinkage long sys_sframe_unregister(struct sframe_setup *data, size_t size);
>
> /*
>  * Architecture-specific system calls
>diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
>index a627acc8fb5f..17042d7e5e87 100644
>--- a/include/uapi/asm-generic/unistd.h
>+++ b/include/uapi/asm-generic/unistd.h
>@@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
> #define __NR_rseq_slice_yield 471
> __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
>
>+#define __NR_sframe_register 472
>+__SYSCALL(__NR_sframe_register, sys_sframe_register)
>+#define __NR_sframe_unregister 473
>+__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
>+
> #undef __NR_syscalls
>-#define __NR_syscalls 472
>+#define __NR_syscalls 474
>
> /*
>  * 32 bit systems traditionally used different
>diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
>new file mode 100644
>index 000000000000..d3c9f88b024b
>--- /dev/null
>+++ b/include/uapi/linux/sframe.h
>@@ -0,0 +1,12 @@
>+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
>+#ifndef _UAPI_LINUX_SFRAME_H
>+#define _UAPI_LINUX_SFRAME_H
>+
>+struct sframe_setup {
>+	__u64			sframe_start;
>+	__u64			sframe_size;
>+	__u64			text_start;
>+	__u64			text_size;
>+};
>+
>+#endif /* _UAPI_LINUX_SFRAME_H */
>diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>index add3032da16f..eca5293f5d40 100644
>--- a/kernel/sys_ni.c
>+++ b/kernel/sys_ni.c
>@@ -394,3 +394,6 @@ COND_SYSCALL(rseq_slice_yield);
>
> COND_SYSCALL(uretprobe);
> COND_SYSCALL(uprobe);
>+
>+COND_SYSCALL(sframe_register);
>+COND_SYSCALL(sframe_unregister);
>diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
>index db88d993dff1..84bd762a1080 100644
>--- a/kernel/unwind/sframe.c
>+++ b/kernel/unwind/sframe.c
>@@ -12,8 +12,10 @@
> #include <linux/mm.h>
> #include <linux/string_helpers.h>
> #include <linux/sframe.h>
>+#include <linux/syscalls.h>
> #include <asm/unwind_user_sframe.h>
> #include <linux/unwind_user_types.h>
>+#include <uapi/linux/sframe.h>
>
> #include "sframe.h"
> #include "sframe_debug.h"
>@@ -817,8 +819,10 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
> 	if (ret)
> 		goto err_free;
>
>-	ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end - 1,
>-				 sec, GFP_KERNEL_ACCOUNT);
>+	scoped_guard(mmap_write_lock, mm) {
>+		ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end - 1,
>+					 sec, GFP_KERNEL_ACCOUNT);
>+	}
> 	if (ret) {
> 		dbg_sec("mtree_insert_range failed: text=%lx-%lx\n",
> 			sec->text_start, sec->text_end);
>@@ -842,9 +846,11 @@ static void sframe_free_srcu(struct rcu_head *rcu)
> static int __sframe_remove_section(struct mm_struct *mm,
> 				   struct sframe_section *sec)
> {
>-	if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
>-		dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
>-		return -EINVAL;
>+	scoped_guard(mmap_write_lock, mm) {
>+		if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
>+			dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
>+			return -EINVAL;
>+		}
> 	}
>
> 	call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
>@@ -936,3 +942,56 @@ void sframe_free_mm(struct mm_struct *mm)
>
> 	mtree_destroy(&mm->sframe_mt);
> }
>+
>+/**
>+ * sys_sframe_register - register an address for user space stacktrace walking.
>+ * @data: Structure of sframe data used to register the sframe section
>+ * @size: The size of the given structure.
>+ *
>+ * This system call is used by dynamic library utilities to inform the kernel
>+ * of meta data that it loaded that can be used by the kernel to know how
>+ * to stack walk the given text locations.
>+ *
>+ * Return: 0 if successful, otherwise a negative error.
>+ */
>+SYSCALL_DEFINE2(sframe_register, struct sframe_setup __user *, data, size_t, size)
>+{
>+	struct sframe_setup sframe;
>+
>+	if (sizeof(sframe) != size)
>+		return -EINVAL;
>+
>+	if (copy_from_user(&sframe, data, size))
>+		return -EFAULT;
>+
>+	return sframe_add_section(sframe.sframe_start,
>+				  sframe.sframe_start + sframe.sframe_size,
>+				  sframe.text_start,
>+				  sframe.text_start + sframe.text_size);
>+}
>+
>+/**
>+ * sys_sframe_unregister - unregister an sframe address
>+ * @data: Structure of sframe data used to register the sframe section
>+ * @size: The size of the given structure.
>+ *
>+ * The data->sframe_start is the only value that is used. The rest must
>+ * be zero.
>+ *
>+ * Return: 0 if successful, otherwise a negative error.
>+ */
>+SYSCALL_DEFINE2(sframe_unregister, struct sframe_setup __user *, data, size_t, size)
>+{
>+	struct sframe_setup sframe;
>+
>+	if (sizeof(sframe) != size)
>+		return -EINVAL;
>+
>+	if (copy_from_user(&sframe, data, size))
>+		return -EFAULT;
>+
>+	if (sframe.sframe_size || sframe.text_start || sframe.text_size)
>+		return -EINVAL;
>+
>+	return sframe_remove_section(sframe.sframe_start);
>+}
>diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
>index 7a42b32b6577..46ec22b50042 100644
>--- a/scripts/syscall.tbl
>+++ b/scripts/syscall.tbl
>@@ -412,3 +412,5 @@
> 469	common	file_setattr			sys_file_setattr
> 470	common	listns				sys_listns
> 471	common	rseq_slice_yield		sys_rseq_slice_yield
>+472	common	sframe_register			sys_sframe_register
>+473	common	sframe_unregister		sys_sframe_unregister
>-- 
>2.53.0
>


Hi Steven,

This is not an objection to deferred userspace unwinding itself -- my
concern is narrower: these syscalls permanently encode the kernel's
commitment to the SFrame format family at exactly the moment the
format's size trajectory is heading the wrong way, and while arguably
superior formats exist.

I raised related size concerns about SFrame's viability for userspace
stack walking earlier:
https://lore.kernel.org/all/3xd4fqvwflefvsjjoagytoi3y3sf7lxqjremhe2zo5tounihe4@3ftafgryadsr/
("Concerns about SFrame viability for userspace stack walking")

SFrame v3 is even larger than v2.

For comparison: Microsoft is currently upstreaming its Windows x64
Unwind V3 implementation to LLVM, which will make a side-by-side reading
of the two formats straightforward. Unwind V3 provides correct
exception-handling unwind -- full prologue replay, SEH handlers,
funclets -- and supports Intel APX. SFrame v3 provides stack tracing
only, no EH, yet comes out larger than .eh_frame. A format revision that
adds capability without adding bulk is demonstrably achievable; SFrame
v3 went the other way.

I understand IBM is doubling down on SFrame for s390x and ppc64, but I'm
not convinced the size overhead of v3 will make it appealing on x86-64.
I have also learned that the person driving the SFrame work at Google
has left and that, per a toolchain manager, the SFrame-in-the-data-center
effort is being reevaluated.


* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Dmitry Ilvokhin @ 2026-06-11  7:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
	virtualization, linux-arch, linux-mm, linux-trace-kernel,
	kernel-team, Paul E. McKenney
In-Reply-To: <20260603120811.GW3493090@noisy.programming.kicks-ass.net>

On Wed, Jun 03, 2026 at 02:08:11PM +0200, Peter Zijlstra wrote:
> Also, I think someone should go do some performance runs with
> ARCH_INLINE_SPIN_* set for x86 just like for s390.

As promised, I set ARCH_INLINE_SPIN_UNLOCK{,_BH,_IRQ,_IRQRESTORE} for
x86 and measured the effect on a few real workloads.

Short version: inlining of _raw_spin_unlock() adds measurable kernel
i-cache pressure on every workload I tried, and on a
kernel-i-cache-bound one (nginx connection churn) it costs ~1.27%
throughput. I did not find a workload where it helps.

HOW BENCHMARKS WERE CHOSEN

The cost of inlining unlock is a larger text footprint. Every unlock
site grows, and the extra bytes compete for the shared L1i. The bill is
paid by unrelated code, in both kernel and userspace.

Locktorture and similar microbenchmarks can't see this, because they
usually hammer a tiny loop that stays L1i-resident, so they measure
fast-path cycles, where inlining (fewer instructions per unlock) looks
neutral-to-good.

To make the cost visible, the workload has to have real instruction
cache pressure. To achieve that, it has to touch a lot of code.

A good way to screen benchmarks: look for a high tma_frontend_bound
fraction from 'perf stat -M TopdownL1', and also require the workload to
spend non-trivial time in the kernel (i.e. be syscall-heavy).

SETUP

Hardware: 2x Intel Xeon Gold 6138 (Skylake-SP), 20 cores/socket, 40C/80T,
with a kernel built from the locking/core branch. In the baseline,
_raw_spin_unlock() is out-of-line via UNINLINE_SPIN_UNLOCK=y. The
experiment adds the four selects above (the exact patch is at the end of
this message). Cache geometry (lscpu -C):

NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL  SETS PHY-LINE COHERENCY-SIZE
L1d       32K     1.3M    8 Data            1    64        1             64
L1i       32K     1.3M    8 Instruction     1    64        1             64
L2         1M      40M   16 Unified         2  1024        1             64
L3      27.5M      55M   11 Unified         3 40960        1             64

Per run I collected cycles, instructions and L1i-misses. To stay within
the available PMU counters, each run used only 3 events: cycles,
instructions and one L1i filter (:u or :k). The NMI watchdog was off and
every run reported 100% counter enablement (no multiplexing). Userspace
and kernel misses therefore come from separate runs. Each benchmark was
run 20x per side: 10 with the :u counter, 10 with :k. Cycles,
instructions and throughput are pooled across all 20; each L1i split
comes from its own 10 runs.

KERNEL IMAGE SIZE

To give a sense of the code-footprint increase, scripts/bloat-o-meter on
vmlinux, GCC 11, x86_64, defconfig + CONFIG_PARAVIRT_SPINLOCKS=y:

    Total: Before=23838694, After=23977159, chg +0.58%

ROCKSDB (DELETESEQ)

    db_bench -benchmarks=deleteseq

Metric                       Baseline      Experiment     Delta   Sig
----------------------------------------------------------------------
Instructions (total)    9,574,476,543   9,573,602,441    -0.01%   flat
L1i-miss :k (kernel)      198,588,165     216,672,536    +9.11%   **
L1i-miss :u (userspace)   593,276,235     616,433,813    +3.90%   **
Throughput ops/s            431,398         432,897      +0.35%   ns
Cycles (total)          4,681,002,302   4,665,106,876    -0.34%   ns
IPC                          2.045           2.052       +0.33%   ns
Time elapsed (s)            2.4012          2.3865       -0.62%   ns
----------------------------------------------------------------------
L1i-miss: higher = worse. Throughput: higher = better.
** = beyond per-run noise (+-0.1..0.36%), ns = within noise.

At constant instructions, inlining raises L1i misses +9.11% (kernel) and
+3.90% (userspace), both well beyond noise. Throughput, cycles, IPC and
wall-time all stay within run-to-run noise. So the i-cache cost is real,
but at IPC ~2 db_bench isn't fetch-bound at the app level, so it doesn't
surface.

No benefit from _raw_spin_unlock() inlining.

KERNEL BUILD

Building locking/core (defconfig), GCC 11.

    make -j80

Metric              Baseline      Experiment     Delta   Sig
-------------------------------------------------------------
L1i-miss :k          36.72G        37.51G       +2.16%   **
L1i-miss :u         246.99G       246.06G       -0.38%   **
Sys (s)             478.250       482.420       +0.87%   **
Time elapsed (s)    105.221       105.373       +0.14%   ns
User (s)           4022.046      4024.012       +0.05%   flat
Cycles            8,894.10G     8,902.12G       +0.09%   flat
Instructions      8,424.28G     8,426.48G       +0.03%   flat
IPC                   0.947         0.947       -0.06%   flat
-------------------------------------------------------------
L1i-miss/Sys: higher = worse.
** = beyond per-run noise, ns = within noise.

Kernel i-cache misses (+2.16%) and sys time (+0.87%) both rise and are
significant. Wall-time and userspace L1i are flat. Kernel build is
GCC/userspace-bound (User 4022s vs Sys 478s), so the added kernel fetch
cost is real but appears to sit off the critical path.

No benefit from _raw_spin_unlock() inlining.

NGINX

I ran nginx with taskset -c 2.

    perf stat -C 2 ... -- ab -n 100000 -c 80 http://127.0.0.1:8080/

Config for nginx was the following.

  worker_processes 1;
  error_log /tmp/ngx/error.log;
  pid       /tmp/ngx/nginx.pid;
  events { worker_connections 16384; }
  http {
      access_log off;
      server { listen 8080 reuseport; location / { return 200 "ok\n"; } }
  }

I used nginx version 1.20.1 (prebuilt, from CentOS repo).

Metric              Baseline      Experiment     Delta   Sig
------------------------------------------------------------
req/s (ab)           25,113        24,795       -1.27%   **
L1i MPKI :k          70.06         72.10        +2.92%   **
L1i MPKI :u          20.16         20.66        +2.50%   **
instructions          5.86G         5.83G       -0.50%   **
L1i-miss :k           0.41G         0.42G       +2.44%   **
L1i-miss :u           0.12G         0.12G       +1.95%   **
cycles                4.82G         4.81G       -0.28%   ns
IPC                   1.215         1.213       -0.22%   ns
perf time (s)         4.077         4.129       +1.26%   **
failed reqs              0             0          -      valid
------------------------------------------------------------
req/s: higher=better. MPKI: higher=worse.
** = beyond per-run noise, ns = within noise.

nginx connection-churn is the one workload that is genuinely
kernel-fetch-bound: MPKI:k ~70 and IPC ~1.2 (vs db_bench's 2.05). Here
the cost surfaces: req/s −1.27%. Misses rise in both domains (+2.9%
MPKI:k, +2.5% MPKI:u). Unlike kernel build, userspace is hit too,
because nginx runs user and kernel hot on the same core and the kernel
bloat pollutes the shared L1i.
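
For anyone re-deriving the numbers: MPKI is L1i misses per thousand
instructions, and every Delta column is (experiment - baseline) /
baseline. A minimal sketch in C, assuming the rounded values printed in
the tables as inputs (the function names are made up for this example,
and the recomputed MPKI will differ from the table in the second decimal
because the table was built from unrounded counters):

```c
#include <stdio.h>

/* L1i misses per thousand retired instructions. */
double mpki(double misses, double instructions)
{
	return misses / instructions * 1000.0;
}

/* Relative change of the experiment against the baseline, in percent. */
double pct_delta(double baseline, double experiment)
{
	return (experiment - baseline) / baseline * 100.0;
}

/* Recompute a few nginx rows from the rounded values in the table. */
void print_nginx_summary(void)
{
	printf("MPKI :k baseline   %.2f\n", mpki(0.41e9, 5.86e9));
	printf("MPKI :k experiment %.2f\n", mpki(0.42e9, 5.83e9));
	printf("req/s delta        %+.2f%%\n", pct_delta(25113, 24795));
}
```

Calling print_nginx_summary() from a main() is enough to cross-check the
nginx table; the same pct_delta() applies to the RocksDB and kernel build
rows.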

And the kicker: instructions fell 0.5% (inlining removed the call/ret)
yet throughput dropped.

Caveat: ab is single-threaded, so it seems the worker core is
under-saturated: cycles are flat (−0.28%, ns) while wall-time rose
(+1.26%).

Measurable throughput regression from _raw_spin_unlock() inlining.

CONCLUSION

Inlining _raw_spin_unlock() raises kernel L1i misses on every workload.
It's an unconditional cost. Whether it costs the application throughput
depends on how kernel-fetch-bound the workload is.

The cost is real everywhere. It only surfaces as throughput regression
where the kernel is on the fetch critical path. And inlining did not
help in any workload I measured. The one micro-effect inlining produced
(-0.5% instructions on nginx) was erased by the added i-cache pressure.

From 99502328caed3c195e20cf194a1e8aa1563f3896 Mon Sep 17 00:00:00 2001
From: Dmitry Ilvokhin <d@ilvokhin.com>
Date: Thu, 4 Jun 2026 07:43:00 -0700
Subject: [PATCH] x86/locking: Inline the spin_unlock()

Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
---
 arch/x86/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fdaef60b46d6..c9a0638225fd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -113,6 +113,10 @@ config X86
 	select ARCH_HAS_ZONE_DMA_SET if EXPERT
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_HAVE_EXTRA_ELF_NOTES
+	select ARCH_INLINE_SPIN_UNLOCK
+	select ARCH_INLINE_SPIN_UNLOCK_BH
+	select ARCH_INLINE_SPIN_UNLOCK_IRQ
+	select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE
 	select ARCH_MEMORY_ORDER_TSO
 	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
 	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
-- 
2.53.0-Meta


* Re: [PATCH v3 08/13] verification/rvgen: Simplify the generation for clock variables
From: Gabriele Monaco @ 2026-06-11  8:39 UTC (permalink / raw)
  To: Nam Cao
  Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
	linux-kernel
In-Reply-To: <eb57a6b4659b325a4a0eded67baa3c5fe9f0ff35.1780908661.git.namcao@linutronix.de>

On Mon, 2026-06-08 at 10:57 +0200, Nam Cao wrote:
> Hybrid automata monitors' clock variables have been changed to have
> only a single representation. Now there is no need to generate code
> to convert between the two representations.
> 
> Delete __fill_convert_inv_guard_func() and its associates. Update
> __start_to_invariant_check() to match how invariants now work.
> 
> Signed-off-by: Nam Cao <namcao@linutronix.de>

Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>

> ---
>  tools/verification/rvgen/rvgen/dot2k.py | 96 +----------------------
> --
>  1 file changed, 3 insertions(+), 93 deletions(-)
> 
> diff --git a/tools/verification/rvgen/rvgen/dot2k.py
> b/tools/verification/rvgen/rvgen/dot2k.py
> index e91717fde30d..6d346a718a39 100644
> --- a/tools/verification/rvgen/rvgen/dot2k.py
> +++ b/tools/verification/rvgen/rvgen/dot2k.py
> @@ -246,7 +246,9 @@ class ha2k(dot2k):
>          if inv.unit == "j":
>              clock_type = "jiffy"
>  
> -        return f"return ha_check_invariant_{clock_type}(ha_mon,
> {inv.env}_{self.name}, time_ns)"
> +        value = self.__adjust_value(inv.val, inv.unit)
> +
> +        return f"return ha_check_invariant_{clock_type}(ha_mon,
> {inv.env}_{self.name}, time_ns, {value})"
>  
>      def __start_to_conv(self, constr: str) -> str:
>          """
> @@ -383,40 +385,6 @@ f"""static inline bool
> ha_verify_invariants(struct ha_monitor *ha_mon,
>          buff.append("\treturn true;\n}\n")
>          return buff
>  
> -    def __fill_convert_inv_guard_func(self) -> list[str]:
> -        buff = []
> -        if not self.invariants:
> -            return []
> -
> -        conflict_guards, conflict_invs = self.__find_inv_conflicts()
> -        if not conflict_guards and not conflict_invs:
> -            return []
> -
> -        buff.append(
> -f"""static inline void ha_convert_inv_guard(struct ha_monitor
> *ha_mon,
> -\t\t\t\t\tenum {self.enum_states_def} curr_state, enum
> {self.enum_events_def} event,
> -\t\t\t\t\tenum {self.enum_states_def} next_state, u64 time_ns)
> -{{""")
> -        buff.append("\tif (curr_state == next_state)\n\t\treturn;")
> -
> -        _else = ""
> -        for state, constr in sorted(self.invariants.items()):
> -            # a state with invariant can reach us without reset
> -            # multiple conflicts must have the same invariant,
> otherwise we cannot
> -            # know how to reset the value
> -            conf_i = [start for start, end in conflict_invs if end
> == state]
> -            # we can reach a guard without reset
> -            conf_g = [e for s, e in conflict_guards if s == state]
> -            if not conf_i and not conf_g:
> -                continue
> -            buff.append(f"\t{_else}if (curr_state ==
> {self.states[state]}{self.enum_suffix})")
> -
> -            buff.append(f"\t\t{self.__start_to_conv(constr)};")
> -            _else = "else "
> -
> -        buff.append("}\n")
> -        return buff
> -
>      def __fill_verify_guards_func(self) -> list[str]:
>          buff = []
>  
> @@ -456,54 +424,6 @@ f"""static inline bool ha_verify_guards(struct
> ha_monitor *ha_mon,
>          buff.append("\treturn res;\n}\n")
>          return buff
>  
> -    def __find_inv_conflicts(self) -> tuple[set[tuple[int,
> _EventConstraintKey]],
> -                                            set[tuple[int,
> _StateConstraintKey]]]:
> -        """
> -        Run a breadth first search from all states with an
> invariant.
> -        Find any conflicting constraints reachable from there, this
> can be
> -        another state with an invariant or an edge with a non-reset
> guard.
> -        Stop when we find a reset.
> -
> -        Return the set of conflicting guards and invariants as
> tuples of
> -        conflicting state and constraint key.
> -        """
> -        conflict_guards: set[tuple[int, _EventConstraintKey]] =
> set()
> -        conflict_invs: set[tuple[int, _StateConstraintKey]] = set()
> -        for start_idx in self.invariants:
> -            queue = deque([(start_idx, 0)])  # (state_idx, distance)
> -            env =
> self.__get_constraint_env(self.invariants[start_idx])
> -
> -            while queue:
> -                curr_idx, distance = queue.popleft()
> -
> -                # Check state condition
> -                if curr_idx != start_idx and curr_idx in
> self.invariants:
> -                    conflict_invs.add((start_idx,
> _StateConstraintKey(curr_idx)))
> -                    continue
> -
> -                # Check if we should stop
> -                if distance > len(self.states):
> -                    break
> -                if curr_idx != start_idx and distance > 1:
> -                    continue
> -
> -                for event_idx, next_state_name in
> enumerate(self.function[curr_idx]):
> -                    if next_state_name == self.invalid_state_str:
> -                        continue
> -                    curr_guard = self.guards.get((curr_idx,
> event_idx), "")
> -                    if "reset" in curr_guard and env in curr_guard:
> -                        continue
> -
> -                    if env in curr_guard:
> -                        conflict_guards.add((start_idx,
> -                                            
> _EventConstraintKey(curr_idx, event_idx)))
> -                        continue
> -
> -                    next_idx = self.states.index(next_state_name)
> -                    queue.append((next_idx, distance + 1))
> -
> -        return conflict_guards, conflict_invs
> -
>      def __fill_setup_invariants_func(self) -> list[str]:
>          if not self.has_invariant:
>              return []
> @@ -554,16 +474,9 @@ f"""static inline void
> ha_setup_invariants(struct ha_monitor *ha_mon,
>   * the next state has a constraint, cancel it in any other case and
> to check
>   * that it didn't expire before the callback run. Transitions to the
> same state
>   * without a reset never affect timers.
> - * Due to the different representations between invariants and
> guards, there is
> - * a function to convert it in case invariants or guards are
> reachable from
> - * another invariant without reset. Those are not present if not
> required in
> - * the model. This is all automatic but is worth checking because it
> may show
> - * errors in the model (e.g. missing resets).
>   */""")
>  
>          buff += self.__fill_verify_invariants_func()
> -        inv_conflicts = self.__fill_convert_inv_guard_func()
> -        buff += inv_conflicts
>          buff += self.__fill_verify_guards_func()
>          buff += self.__fill_setup_invariants_func()
>  
> @@ -576,9 +489,6 @@ f"""static bool ha_verify_constraint(struct
> ha_monitor *ha_mon,
>          if self.invariants:
>              buff.append("\tif (!ha_verify_invariants(ha_mon,
> curr_state, "
>                          "event, next_state, time_ns))\n\t\treturn
> false;\n")
> -        if inv_conflicts:
> -            buff.append("\tha_convert_inv_guard(ha_mon, curr_state,
> event, "
> -                        "next_state, time_ns);\n")
>  
>          if self.guards:
>              buff.append("\tif (!ha_verify_guards(ha_mon, curr_state,
> event, "



* Re: [RFC PATCH 1/2] tracing/osnoise: Sample IPI counts
From: Tomas Glozar @ 2026-06-11  8:59 UTC (permalink / raw)
  To: Crystal Wood
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Costa Shulyupin, Ivan Pravdin
In-Reply-To: <f40903f8198f8e2f42f1efdf950f995f24cf6e93.camel@redhat.com>

[just replying to comments, will do a full review later]

st 10. 6. 2026 v 21:51 odesílatel Crystal Wood <crwood@redhat.com> napsal:
>
> On Wed, 2026-06-10 at 15:04 +0200, Valentin Schneider wrote:
> > Osnoise already implicitly accounts IPIs via its IRQ tracking,
>
> Does it?  It seems that IPIs bypass the kernel/irq subsystem on some
> arches (including x86, but not ARM).
>
> It would be nice to solve this properly by adding generic ipi
> entry/exit tracing (similar to what ARM already has).
>

Isn't that precisely what the ipi tracepoints used by this
implementation (ipi:ipi_send_cpu) are for?

> > however it
> > can be interesting to distinguish between the two: undesired IPIs usually
> > imply a software configuration issue (e.g. wrong/incomplete CPU isolation)
> > whereas undesired (non-IPI) IRQs usually imply a hardware configuration
> > issue.
> >
> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > ---
> > Note that this is modifying the osnoise:osnoise_entry Ftrace entry; I know
> > trace events are sort of supposed to be stable, but I'm not sure about
> > ftrace entries.
>
> I think old rtla will be OK with this since it looks up fields by name
> rather than assuming a fixed layout.
>

Yeah, the fields are either looked up with tep_get_field_val() [2], or
with name-based BPF CO-RE relocations against the tracepoint structure
[3]. So this shouldn't be an issue, as long as the old counts stay the
same.

[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/tools/tracing/rtla/src/timerlat_hist.c#n191
[3] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/tools/tracing/rtla/src/timerlat.bpf.c#n12
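
To illustrate why name-based lookup tolerates a new field, here is a
simplified sketch in plain C. It is not the libtraceevent or CO-RE code:
struct field_desc and get_field_u64() are hypothetical stand-ins for an
event format description and a lookup helper, and all fields are assumed
to be u64 for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an event format description. */
struct field_desc {
	const char *name;
	size_t offset;	/* byte offset of the field in the raw record */
};

/*
 * Find a u64 field by name and copy it out of the raw record.
 * Returns 0 on success, -1 if the format has no such field.
 */
int get_field_u64(const struct field_desc *fmt, size_t nr_fields,
		  const void *record, const char *name, uint64_t *val)
{
	for (size_t i = 0; i < nr_fields; i++) {
		if (strcmp(fmt[i].name, name) == 0) {
			memcpy(val, (const char *)record + fmt[i].offset,
			       sizeof(*val));
			return 0;
		}
	}
	return -1;
}
```

A reader that asks for an existing field by name gets the same value
whether or not a new field was added in front of it, because the offset
comes from the format description rather than from a hard-coded layout.
An old reader never asks for the new field, so it simply ignores it.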

> > Alternatively I can have this be purely supported in userspace osnoise by
> > hooking into the IPI events and counting IPIs separately from the osnoise
> > events.
>
> One benefit I could see of doing this in kernel osnoise would be if you
> could atomically correlate the count with the particular noise
> interval, but this patch doesn't do that.
>

The count is already reported by cycle on the kernel side in the
patchset, right? It's only missing in the current RTLA (userspace)
part, as there is no statistic using the information. But it can still
be collected through custom histogram triggers.

> > ...
> >
> > +static void trace_ipi_send_cpumask_callback(void *data, const struct cpumask *cpumask,
> > +                                         unsigned long callsite, void *callback)
> > +{
> > +     struct osnoise_variables *osn_var;
> > +     int cpu;
> > +
> > +     for_each_cpu_and(cpu, cpumask, &osnoise_cpumask) {
> > +             osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
> > +             ipi_emission(osn_var, cpu);
> > +     }
> > +}
>
> Isn't this racy to do from a different CPU?  Both in terms of the
> counter, and the timing of the increment relative to when the IPI is
> actually received.  Not necessarily a huge deal if you only care about
> zero versus bignum, but still.  At least worth a comment, if we go with
> this approach.
>

I also think it's a bit confusing, especially as the other accesses to
osn_var are cpu-local, but here, "cpu" is the *target* CPU, not the
current CPU. Not sure how expensive it would be to do atomic_add for
that, at least it's something to consider.
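
A userspace sketch of the atomic variant, using C11 atomics rather than
the kernel's atomic_t helpers. The names (NR_CPUS_SKETCH, ipi_count,
ipi_emission_sketch, ipi_count_consume) are hypothetical and only meant
to show the shape: the sender increments the *target* CPU's slot, and
the target reads and resets its own slot at the end of a sample.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 8	/* hypothetical, fixed for the example */

/* One counter per target CPU; any sender may bump any slot. */
static _Atomic uint64_t ipi_count[NR_CPUS_SKETCH];

/* Runs on the sending CPU: count one IPI for each target in the mask. */
void ipi_emission_sketch(uint64_t target_mask)
{
	for (unsigned int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		if (target_mask & (UINT64_C(1) << cpu))
			atomic_fetch_add_explicit(&ipi_count[cpu], 1,
						  memory_order_relaxed);
	}
}

/* Runs on the target CPU at the end of a sample: read and reset. */
uint64_t ipi_count_consume(unsigned int cpu)
{
	return atomic_exchange_explicit(&ipi_count[cpu], 0,
					memory_order_relaxed);
}
```

This only addresses lost updates on the counter. The increment still
happens when the IPI is sent, not when it is received, so the timing
concern raised above is unchanged.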

Tomas



* Re: [RFC PATCH 1/2] tracing/osnoise: Sample IPI counts
From: Valentin Schneider @ 2026-06-11 10:21 UTC (permalink / raw)
  To: Crystal Wood, linux-kernel, linux-trace-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Tomas Glozar,
	Costa Shulyupin, Ivan Pravdin
In-Reply-To: <f40903f8198f8e2f42f1efdf950f995f24cf6e93.camel@redhat.com>

On 10/06/26 14:51, Crystal Wood wrote:
> On Wed, 2026-06-10 at 15:04 +0200, Valentin Schneider wrote:
>> Osnoise already implicitly accounts IPIs via its IRQ tracking,
>
> Does it?  It seems that IPIs bypass the kernel/irq subsystem on some
> arches (including x86, but not ARM).
>

Right...

> It would be nice to solve this properly by adding generic ipi
> entry/exit tracing (similar to what ARM already has).
>

I think for x86 the CSD tracepoints catch a few of these strays - I think
the smp_call ones for instance.

>> however it
>> can be interesting to distinguish between the two: undesired IPIs usually
>> imply a software configuration issue (e.g. wrong/incomplete CPU isolation)
>> whereas undesired (non-IPI) IRQs usually imply a hardware configuration
>> issue.
>> 
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>> ---
>> Note that this is modifying the osnoise:osnoise_entry Ftrace entry; I know
>> trace events are sort of supposed to be stable, but I'm not sure about
>> ftrace entries.
>
> I think old rtla will be OK with this since it looks up fields by name
> rather than assuming a fixed layout.
>
>> Alternatively I can have this be purely supported in userspace osnoise by
>> hooking into the IPI events and counting IPIs separately from the osnoise
>> events.
>
> One benefit I could see of doing this in kernel osnoise would be if you
> could atomically correlate the count with the particular noise
> interval, but this patch doesn't do that.
>
>> +static void ipi_emission(struct osnoise_variables *osn_var, unsigned int dst_cpu)
>> +{
>> +	if (!osn_var->sampling)
>> +		return;
>> +
>> +	osn_var->ipi.count++;
>> +}
>> +
>> +static void trace_ipi_send_cpu_callback(void *data, unsigned int cpu,
>> +					unsigned long callsite, void *callback)
>> +{
>> +	struct osnoise_variables *osn_var;
>> +
>> +	osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
>> +	ipi_emission(osn_var, cpu);
>> +}
>> +
>> +static void trace_ipi_send_cpumask_callback(void *data, const struct cpumask *cpumask,
>> +					    unsigned long callsite, void *callback)
>> +{
>> +	struct osnoise_variables *osn_var;
>> +	int cpu;
>> +
>> +	for_each_cpu_and(cpu, cpumask, &osnoise_cpumask) {
>> +		osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
>> +		ipi_emission(osn_var, cpu);
>> +	}
>> +}
>
> Isn't this racy to do from a different CPU?  Both in terms of the
> counter, and the timing of the increment relative to when the IPI is
> actually received.  Not necessarily a huge deal if you only care about
> zero versus bignum, but still.  At least worth a comment, if we go with
> this approach.
>

Yes on both points :-) Let me see what Tomas has to say on that...

> -Crystal



* Re: [RFC PATCH 1/2] tracing/osnoise: Sample IPI counts
From: Valentin Schneider @ 2026-06-11 10:30 UTC (permalink / raw)
  To: Tomas Glozar, Crystal Wood
  Cc: linux-kernel, linux-trace-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Costa Shulyupin,
	Ivan Pravdin
In-Reply-To: <CAP4=nvT6GEdwZAm-2h+Z-mj9DOmJ-swyApujFun6m1Qaxxr7hQ@mail.gmail.com>

On 11/06/26 10:59, Tomas Glozar wrote:
> [just replying to comments, will do a full review later]
>
> st 10. 6. 2026 v 21:51 odesílatel Crystal Wood <crwood@redhat.com> napsal:
>>
>> On Wed, 2026-06-10 at 15:04 +0200, Valentin Schneider wrote:
>> > Osnoise already implicitly accounts IPIs via its IRQ tracking,
>>
>> Does it?  It seems that IPIs bypass the kernel/irq subsystem on some
>> arches (including x86, but not ARM).
>>
>> It would be nice to solve this properly by adding generic ipi
>> entry/exit tracing (similar to what ARM already has).
>>
>
> Isn't that precisely what the ipi tracepoints used by this
> implementation (ipi:ipi_send_cpu) are for?
>

Well, these catch the emission of the IPI, which is great for investigation
- slap a stacktrace trigger and you (most of the time) get the source of
your interference.

However Crystal's point is that on x86 (and I assume other archs) receiving
& handling these IPIs is "special" and doesn't go through the generic irq
subsystem and thus has to be tracked separately, which is why osnoise has
this fairly lengthy osnoise_arch_register() thing.

>> > however it
>> > can be interesting to distinguish between the two: undesired IPIs usually
>> > imply a software configuration issue (e.g. wrong/incomplete CPU isolation)
>> > whereas undesired (non-IPI) IRQs usually imply a hardware configuration
>> > issue.
>> >
>> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>> > ---
>> > Note that this is modifying the osnoise:osnoise_entry Ftrace entry; I know
>> > trace events are sort of supposed to be stable, but I'm not sure about
>> > ftrace entries.
>>
>> I think old rtla will be OK with this since it looks up fields by name
>> rather than assuming a fixed layout.
>>
>
> Yeah, the fields are either looked up with tep_get_field_val() [2], or
> with name-based BPF CO-RE relocations against the tracepoint structure
> [3]. So this shouldn't be an issue, as long as the old counts stay the
> same.
>
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/tools/tracing/rtla/src/timerlat_hist.c#n191
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/tools/tracing/rtla/src/timerlat.bpf.c#n12
>
>> > Alternatively I can have this be purely supported in userspace osnoise by
>> > hooking into the IPI events and counting IPIs separately from the osnoise
>> > events.
>>
>> One benefit I could see of doing this in kernel osnoise would be if you
>> could atomically correlate the count with the particular noise
>> interval, but this patch doesn't do that.
>>
>
> The count is already reported by cycle on the kernel side in the
> patchset, right? It's only missing in the current RTLA (userspace)
> part, as there is no statistic using the information. But it can still
> be collected through custom histogram triggers.
>
>> > ...
>> >
>> > +static void trace_ipi_send_cpumask_callback(void *data, const struct cpumask *cpumask,
>> > +                                         unsigned long callsite, void *callback)
>> > +{
>> > +     struct osnoise_variables *osn_var;
>> > +     int cpu;
>> > +
>> > +     for_each_cpu_and(cpu, cpumask, &osnoise_cpumask) {
>> > +             osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
>> > +             ipi_emission(osn_var, cpu);
>> > +     }
>> > +}
>>
>> Isn't this racy to do from a different CPU?  Both in terms of the
>> counter, and the timing of the increment relative to when the IPI is
>> actually received.  Not necessarily a huge deal if you only care about
>> zero versus bignum, but still.  At least worth a comment, if we go with
>> this approach.
>>
>
> I also think it's a bit confusing, especially as the other accesses to
> osn_var are cpu-local, but here, "cpu" is the *target* CPU, not the
> current CPU. Not sure how expensive it would be to do atomic_add for
> that, at least it's something to consider.
>

I suppose that could be an argument for doing the stat aggregation in
userspace osnoise. Event handlers run after the fact via
tracefs_iterate_raw_events(). That is inherently slower, since the count
only goes up by one per handled event, but it is all done in userspace on
a control thread and doesn't bog down the kernel.
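
To make the concern about remote increments concrete, here is a minimal
userspace sketch, assuming C11 atomics and pthreads stand in for the
kernel primitives. Several sender threads bump counters that belong to
other "CPUs", which is the situation the callback above creates. All
names (ipi_emission_sketch(), NR_TARGETS, ...) are hypothetical and are
not taken from the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NR_TARGETS	4	/* hypothetical number of target CPUs */
#define NR_SENDERS	4	/* hypothetical number of sending threads */
#define SENDS_EACH	100000	/* increments issued by each sender */

/* One counter per target; stands in for a per-CPU IPI count. */
static atomic_ulong ipi_count[NR_TARGETS];

/* Called by a sender on behalf of a *remote* target, hence the atomic add. */
static void ipi_emission_sketch(int target)
{
	assert(target >= 0 && target < NR_TARGETS);
	atomic_fetch_add_explicit(&ipi_count[target], 1, memory_order_relaxed);
}

static void *sender(void *arg)
{
	(void)arg;
	for (int i = 0; i < SENDS_EACH; i++)
		ipi_emission_sketch(i % NR_TARGETS);
	return NULL;
}

/*
 * Run all senders concurrently and return the sum of the counters.
 * Error handling for pthread_create() is omitted in this sketch.
 */
static unsigned long run_senders(void)
{
	pthread_t tid[NR_SENDERS];
	unsigned long total = 0;

	for (int i = 0; i < NR_TARGETS; i++)
		atomic_store(&ipi_count[i], 0);
	for (int i = 0; i < NR_SENDERS; i++)
		pthread_create(&tid[i], NULL, sender, NULL);
	for (int i = 0; i < NR_SENDERS; i++)
		pthread_join(tid[i], NULL);
	for (int i = 0; i < NR_TARGETS; i++)
		total += atomic_load(&ipi_count[i]);
	return total;
}
```

With the atomic add, no increment is lost regardless of interleaving.
A plain "count++" in ipi_emission_sketch() would be a data race and may
drop updates. The sketch says nothing about the second concern raised
above, namely that the count is bumped at send time rather than when
the IPI is received.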

> Tomas


^ permalink raw reply

* Re: [PATCH v2 3/4] mm/fs: split the file's i_mmap tree
From: Pedro Falcato @ 2026-06-11 11:11 UTC (permalink / raw)
  To: Huang Shijie
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko, corbet,
	skhan, linux, dinguyen, schuster.simon, James.Bottomley, deller,
	djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, riel, harry,
	will, brian.ruley, rmk+kernel, dave.anglin, linux-mm, linux-doc,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-fsdevel,
	nvdimm, linux-perf-users, linux-trace-kernel, zhongyuan,
	fangbaoshun, yingzhiwei
In-Reply-To: <20260611061915.2354307-4-huangsj@hygon.cn>

Hi,

On Thu, Jun 11, 2026 at 02:18:59PM +0800, Huang Shijie wrote:
> In the UnixBench tests, there is a test "execl" which tests
> the execve system call.
>   For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> When we test our server with "./Run -c 384 execl",
> the test result is not good enough. The i_mmap locks contended heavily on
> "libc.so" and "ld.so". The i_mmap tree for "libc.so" can be
> over 6000 VMAs, and the VMAs can be on different NUMA nodes. The insert/remove
> operations do not run quickly enough.

I _really_ would have appreciated some coordination here, because I said I was
going to take a look at it. I have something that I think is much simpler
in practice. These patches are also way too complex to be dropped just before
the merge window.

Some comments:

> 
>  In order to reduce the competition of the i_mmap lock, this patch does
> following:
>    1.) Split the single i_mmap tree into several sibling trees:
>        Each tree has a lock. The CONFIG_SPLIT_I_MMAP is used to
>        turn on/off this feature.

There is no need for a config option. This needs to Just Work.

>    2.) Introduce a new field "tree_idx" for vm_area_struct to save the
>        sibling tree index for this VMA.

This is possibly contentious, but there are holes in vm_area_struct.
So I think this is fine.

>    3.) Introduce a new field "vma_count" for address_space.
>        The new mapping_mapped() will use it.
>    4.) Rewrite the vma_interval_tree_foreach()
>    5.) Rewrite the lock functions.	
> 
>  After this patch, the VMA insert/remove operations will work faster,
> and we can get over 400% performance improvement with the above test.
> 
> Signed-off-by: Huang Shijie <huangsj@hygon.cn>
> ---
>  fs/Kconfig               |   8 ++
>  fs/hugetlbfs/inode.c     |  20 ++++-
>  fs/inode.c               |  75 ++++++++++++++++-
>  include/linux/fs.h       | 174 ++++++++++++++++++++++++++++++++++++++-
>  include/linux/mm.h       |  80 ++++++++++++++++++
>  include/linux/mm_types.h |   3 +
>  mm/internal.h            |   3 +-
>  mm/mmap.c                |  11 ++-
>  mm/nommu.c               |  23 ++++--
>  mm/pagewalk.c            |   2 +-
>  mm/vma.c                 |  72 +++++++++++-----
>  mm/vma_init.c            |   3 +
>  12 files changed, 436 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 43cb06de297f..e24804f70432 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -9,6 +9,14 @@ menu "File systems"
>  config DCACHE_WORD_ACCESS
>         bool
>  
> +config SPLIT_I_MMAP
> +	bool "Split the file's i_mmap to several trees"
> +	default n
> +	help
> +	  Split the file's i_mmap to several trees, each tree has a separate
> +	  lock. This will reduce the lock contention of file's i_mmap tree,
> +	  but it will cost more memory for per inode.
> +
>  config VALIDATE_FS_PARSER
>  	bool "Validate filesystem parameter description"
>  	help
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index da5b41ea5bdd..68d8308418dd 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -891,6 +891,23 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
>   */
>  static struct lock_class_key hugetlbfs_i_mmap_rwsem_key;
>  
> +#ifdef CONFIG_SPLIT_I_MMAP
> +static void hugetlbfs_lockdep_set_class(struct address_space *mapping)
> +{
> +	int i;
> +
> +	for (i = 0; i < split_tree_num; i++) {
> +		lockdep_set_class(&mapping->i_mmap[i].rwsem,
> +				&hugetlbfs_i_mmap_rwsem_key);
> +	}
> +}
> +#else
> +static void hugetlbfs_lockdep_set_class(struct address_space *mapping)
> +{
> +	lockdep_set_class(&mapping->i_mmap_rwsem, &hugetlbfs_i_mmap_rwsem_key);
> +}
> +#endif
> +
>  static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  					struct mnt_idmap *idmap,
>  					struct inode *dir,
> @@ -915,8 +932,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  
>  		inode->i_ino = get_next_ino();
>  		inode_init_owner(idmap, inode, dir, mode);
> -		lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
> -				&hugetlbfs_i_mmap_rwsem_key);
> +		hugetlbfs_lockdep_set_class(inode->i_mapping);
>  		inode->i_mapping->a_ops = &hugetlbfs_aops;
>  		simple_inode_init_ts(inode);
>  		info->resv_map = resv_map;
> diff --git a/fs/inode.c b/fs/inode.c
> index 62c579a0cf7d..cb67ae83f5b3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -214,6 +214,70 @@ static int no_open(struct inode *inode, struct file *file)
>  	return -ENXIO;
>  }
>  
> +#ifdef CONFIG_SPLIT_I_MMAP
> +int split_tree_num;
> +static int split_tree_align __maybe_unused = 32;
> +
> +static void __init init_split_tree_num(void)
> +{
> +#ifdef CONFIG_NUMA
> +	split_tree_num = nr_node_ids;
> +#else
> +	split_tree_num = ALIGN(nr_cpu_ids, split_tree_align);
> +#endif
> +}

Again, too configurable. I think you're too hung up on the NUMA case -
which does not matter for many people - and may actively harm NUMA users. If
I have a 128 core 2 NUMA node system, what should I shard by?

> +
> +static void free_mapping_i_mmap(struct address_space *mapping)
> +{
> +	int i;
> +
> +	if (!mapping->i_mmap)
> +		return;
> +
> +	for (i = 0; i < split_tree_num; i++)
> +		kfree(mapping->i_mmap[i]);
> +
> +	kfree(mapping->i_mmap);
> +	mapping->i_mmap = NULL;
> +}
> +
> +static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp)
> +{
> +	struct i_mmap_tree *tree;
> +	int i;
> +
> +	/* The extra one is used as terminator in vma_interval_tree_foreach() */
> +	mapping->i_mmap = kzalloc(sizeof(tree) * (split_tree_num + 1), gfp);
> +	if (!mapping->i_mmap)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < split_tree_num; i++) {
> +		tree = kzalloc_node(sizeof(*tree), gfp, i);
> +		if (!tree)
> +			goto nomem;
> +
> +		tree->root = RB_ROOT_CACHED;
> +		init_rwsem(&tree->rwsem);

This (as-is) should blow up with lockdep + the locking loops down there.

> +
> +		mapping->i_mmap[i] = tree;
> +	}
> +	return 0;
> +nomem:
> +	free_mapping_i_mmap(mapping);
> +	return -ENOMEM;
> +}

Honestly, it's likely that a simple static array in struct address_space
suffices. I would not go through the trouble of getting everything very
tight and NUMA correct.
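
A rough sketch of what the static-array layout could look like, using
userspace stand-ins. The shard count I_MMAP_SHARDS and every type and
function name here are hypothetical and not from the patch. The point
is only that initialisation has no allocation and so no failure path,
and that iteration is bounded by the array size rather than by a NULL
terminator.

```c
#include <stddef.h>

#define I_MMAP_SHARDS	8	/* hypothetical fixed shard count */
#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

struct shard_sketch {
	void		*root;		/* stands in for struct rb_root_cached */
	unsigned long	nr_entries;	/* stands in for the tree's contents */
};

struct address_space_sketch {
	/* Embedded shards: no pointer array, no per-inode allocation. */
	struct shard_sketch i_mmap[I_MMAP_SHARDS];
};

/* Cannot fail, so the caller needs no -ENOMEM unwinding. */
static inline void mapping_init_sketch(struct address_space_sketch *m)
{
	for (size_t i = 0; i < ARRAY_SIZE(m->i_mmap); i++) {
		m->i_mmap[i].root = NULL;
		m->i_mmap[i].nr_entries = 0;
	}
}

/* Visit every shard; the bound comes from the array, not a terminator. */
static inline unsigned long mapping_total_sketch(const struct address_space_sketch *m)
{
	unsigned long total = 0;

	for (size_t i = 0; i < ARRAY_SIZE(m->i_mmap); i++)
		total += m->i_mmap[i].nr_entries;
	return total;
}
```

The trade-off is that every instance pays for all shards up front,
including mappings that are never mmap()ed.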

> +#else
> +static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp)
> +{
> +	mapping->i_mmap = RB_ROOT_CACHED;
> +	init_rwsem(&mapping->i_mmap_rwsem);
> +	return 0;
> +}
> +
> +static void free_mapping_i_mmap(struct address_space *mapping) { }
> +static void __init init_split_tree_num(void) {}
> +#endif
> +
>  /**
>   * inode_init_always_gfp - perform inode structure initialisation
>   * @sb: superblock inode belongs to
> @@ -302,9 +366,14 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
>  #endif
>  	inode->i_flctx = NULL;
>  
> -	if (unlikely(security_inode_alloc(inode, gfp)))
> +	if (init_mapping_i_mmap(mapping, gfp))
>  		return -ENOMEM;
>  
> +	if (unlikely(security_inode_alloc(inode, gfp))) {
> +		free_mapping_i_mmap(mapping);
> +		return -ENOMEM;
> +	}
> +
>  	this_cpu_inc(nr_inodes);
>  
>  	return 0;
> @@ -380,6 +449,7 @@ void __destroy_inode(struct inode *inode)
>  	if (inode->i_default_acl && !is_uncached_acl(inode->i_default_acl))
>  		posix_acl_release(inode->i_default_acl);
>  #endif
> +	free_mapping_i_mmap(&inode->i_data);
>  	this_cpu_dec(nr_inodes);
>  }
>  EXPORT_SYMBOL(__destroy_inode);
> @@ -480,9 +550,7 @@ EXPORT_SYMBOL(inc_nlink);
>  static void __address_space_init_once(struct address_space *mapping)
>  {
>  	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
> -	init_rwsem(&mapping->i_mmap_rwsem);
>  	spin_lock_init(&mapping->i_private_lock);
> -	mapping->i_mmap = RB_ROOT_CACHED;
>  }
>  
>  void address_space_init_once(struct address_space *mapping)
> @@ -2619,6 +2687,7 @@ void __init inode_init(void)
>  					&i_hash_mask,
>  					0,
>  					0);
> +	init_split_tree_num();
>  }
>  
>  void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index cd46615b8f53..f4b3645b61df 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -450,6 +450,25 @@ struct mapping_metadata_bhs {
>  	struct list_head list;	/* The list of bhs (b_assoc_buffers) */
>  };
>  
> +#ifdef CONFIG_SPLIT_I_MMAP
> +/*
> + * struct i_mmap_tree - A single sibling tree of the file's split i_mmap.
> + * @root: The red/black interval tree root.
> + * @rwsem: Protects insert/remove operations on this sibling tree.
> + * @vma_count: Number of VMAs in this sibling tree.
> + *
> + * When CONFIG_SPLIT_I_MMAP is enabled, the file's single i_mmap tree is
> + * split into split_tree_num sibling trees, each with its own lock. This
> + * reduces lock contention by allowing concurrent VMA insert/remove
> + * operations on different sibling trees.
> + */
> +struct i_mmap_tree {
> +	struct rb_root_cached	root;
> +	struct rw_semaphore	rwsem;
> +	atomic_t		vma_count;

I don't see what you need this vma_count for? I get the one in address_space,
but this one does not seem useful.

> +};
> +#endif
> +
>  /**
>   * struct address_space - Contents of a cacheable, mappable object.
>   * @host: Owner, either the inode or the block_device.
> @@ -461,8 +480,13 @@ struct mapping_metadata_bhs {
>   * @gfp_mask: Memory allocation flags to use for allocating pages.
>   * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
>   * @nr_thps: Number of THPs in the pagecache (non-shmem only).
> - * @i_mmap: Tree of private and shared mappings.
> - * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
> + * @i_mmap: Tree of private and shared mappings. When CONFIG_SPLIT_I_MMAP
> + *   is enabled, this is an array of split_tree_num struct i_mmap_tree
> + *   pointers (plus a NULL terminator).

The NULL terminator wastes more memory, so I would strongly prefer to
avoid it as well.

> + * @vma_count: Total number of VMAs across all sibling trees (only when
> + *   CONFIG_SPLIT_I_MMAP is enabled). Used by mapping_mapped().
> + * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable (only when
> + *   CONFIG_SPLIT_I_MMAP is disabled; otherwise per-tree rwsem is used).

So, there are very good reasons why you still need an i_mmap_rwsem protecting
state, even with split mmap trees. Which I'll go into later.

>   * @nrpages: Number of page entries, protected by the i_pages lock.
>   * @writeback_index: Writeback starts here.
>   * @a_ops: Methods.
> @@ -480,14 +504,19 @@ struct address_space {
>  	/* number of thp, only for non-shmem files */
>  	atomic_t		nr_thps;
>  #endif
> +#ifdef CONFIG_SPLIT_I_MMAP
> +	struct i_mmap_tree	**i_mmap;
> +	atomic_t		vma_count;
> +#else
>  	struct rb_root_cached	i_mmap;
> +	struct rw_semaphore	i_mmap_rwsem;
> +#endif
>  	unsigned long		nrpages;
>  	pgoff_t			writeback_index;
>  	const struct address_space_operations *a_ops;
>  	unsigned long		flags;
>  	errseq_t		wb_err;
>  	spinlock_t		i_private_lock;
> -	struct rw_semaphore	i_mmap_rwsem;

See d3b1a9a778e1 ("fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.")

>  } __attribute__((aligned(sizeof(long)))) __randomize_layout;
>  	/*
>  	 * On most architectures that alignment is already the case; but
> @@ -508,6 +537,133 @@ static inline bool mapping_tagged(const struct address_space *mapping, xa_mark_t
>  	return xa_marked(&mapping->i_pages, tag);
>  }
>  
> +#ifdef CONFIG_SPLIT_I_MMAP
> +static inline int mapping_mapped(const struct address_space *mapping)
> +{
> +	return	atomic_read(&mapping->vma_count);

Now that I think of it, I don't think we need atomic_t; unsigned long +
READ_ONCE() suffices. Increments can race just fine, we don't expect any
consistency there - if you want consistency you probably hold the i_mmap lock.
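
As a sketch of that suggestion, assuming simplified userspace stand-ins
for READ_ONCE()/WRITE_ONCE() (the kernel versions are more involved),
the counter could be a plain unsigned long read locklessly. The struct
and helper names are hypothetical. Whether unserialized updates are
acceptable is the claim made above. The sketch itself only shows the
shape of the accessors and is exercised single-threaded.

```c
#include <stdbool.h>

/* Simplified userspace stand-ins for the kernel macros (assumption). */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

struct mapping_sketch {
	unsigned long vma_count;	/* plain word instead of atomic_t */
};

/* Lockless reader: one untorn load, no ordering guarantees implied. */
static inline bool mapping_mapped_sketch(const struct mapping_sketch *m)
{
	return READ_ONCE(m->vma_count) != 0;
}

/* Writers: a non-atomic read-modify-write with a single marked store. */
static inline void vma_count_inc_sketch(struct mapping_sketch *m)
{
	WRITE_ONCE(m->vma_count, m->vma_count + 1);
}

static inline void vma_count_dec_sketch(struct mapping_sketch *m)
{
	WRITE_ONCE(m->vma_count, m->vma_count - 1);
}
```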

> +}
> +
> +static inline void inc_mapping_vma(struct address_space *mapping,
> +				struct vm_area_struct *vma)
> +{
> +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> +
> +	atomic_inc(&tree->vma_count);
> +	atomic_inc(&mapping->vma_count);
> +}
> +
> +static inline void dec_mapping_vma(struct address_space *mapping,
> +				struct vm_area_struct *vma)
> +{
> +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> +
> +	atomic_dec(&tree->vma_count);
> +	atomic_dec(&mapping->vma_count);
> +}

This probably shouldn't be in linux/fs.h.

> +
> +static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mapping)
> +{
> +	return (struct rb_root_cached *)mapping->i_mmap;
> +}
> +
> +static inline void i_mmap_tree_lock_write(struct address_space *mapping,
> +					struct vm_area_struct *vma)
> +{
> +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> +
> +	down_write(&tree->rwsem);
> +}
> +
> +static inline void i_mmap_tree_unlock_write(struct address_space *mapping,
> +					struct vm_area_struct *vma)
> +{
> +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> +
> +	up_write(&tree->rwsem);
> +}
> +
> +#define i_mmap_lock_write_prepare(mapping)
> +#define i_mmap_unlock_write_complete(mapping)

It's unclear to me why you added write_prepare() and write_complete().

> +
> +extern int split_tree_num;
> +static inline void i_mmap_lock_write(struct address_space *mapping)
> +{
> +	int i;
> +
> +	for (i = 0; i < split_tree_num; i++)
> +		down_write(&mapping->i_mmap[i]->rwsem);

Oof, this is an incredibly large hammer. This is basically why I think keeping
i_mmap_rwsem (in a different form) is required. You do not want to take $nr_cpus
locks (read _or_ write). For my design, I keep i_mmap_rwsem, but I invert its
meaning - taking it in write = I'm reading from the tree; taking it in read =
I'm writing to the tree. This provides some lighter-weight exclusion between
rmap walks and rmap tree manipulation.

_Technically_, you shouldn't need to always take a lock when manipulating the
tree. A pattern like mnt_hold_writers()/mnt_get_write_access() can probably
work well. But it may be too complex ATM.


Also, note that you pretty much do not want i_mmap_lock_write() users after
the conversion is done.

> +}
> +
> +static inline int i_mmap_trylock_write(struct address_space *mapping)
> +{
> +	int i;
> +
> +	for (i = 0; i < split_tree_num; i++) {
> +		if (!down_write_trylock(&mapping->i_mmap[i]->rwsem)) {
> +			while (i--)
> +				up_write(&mapping->i_mmap[i]->rwsem);
> +			return 0;
> +		}
> +	}
> +	return 1;
> +}
> +
> +static inline void i_mmap_unlock_write(struct address_space *mapping)
> +{
> +	int i;
> +
> +	for (i = 0; i < split_tree_num; i++)
> +		up_write(&mapping->i_mmap[i]->rwsem);
> +}
> +
> +static inline int i_mmap_trylock_read(struct address_space *mapping)
> +{
> +	int i;
> +
> +	for (i = 0; i < split_tree_num; i++) {
> +		if (!down_read_trylock(&mapping->i_mmap[i]->rwsem)) {
> +			while (i--)
> +				up_read(&mapping->i_mmap[i]->rwsem);
> +			return 0;
> +		}
> +	}
> +	return 1;
> +}
> +
> +static inline void i_mmap_lock_read(struct address_space *mapping)
> +{
> +	int i;
> +
> +	for (i = 0; i < split_tree_num; i++)
> +		down_read(&mapping->i_mmap[i]->rwsem);
> +}
> +
> +static inline void i_mmap_unlock_read(struct address_space *mapping)
> +{
> +	int i;
> +
> +	for (i = 0; i < split_tree_num; i++)
> +		up_read(&mapping->i_mmap[i]->rwsem);
> +}
> +
> +static inline void i_mmap_assert_locked(struct address_space *mapping)
> +{
> +	int i;
> +
> +	for (i = 0; i < split_tree_num; i++)
> +		lockdep_assert_held(&mapping->i_mmap[i]->rwsem);
> +}
> +
> +static inline void i_mmap_assert_write_locked(struct address_space *mapping)
> +{
> +	int i;
> +
> +	for (i = 0; i < split_tree_num; i++)
> +		lockdep_assert_held_write(&mapping->i_mmap[i]->rwsem);
> +}
> +
> +#else
> +
>  static inline void i_mmap_lock_write(struct address_space *mapping)
>  {
>  	down_write(&mapping->i_mmap_rwsem);
> @@ -561,6 +717,18 @@ static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mappi
>  	return &mapping->i_mmap;
>  }
>  
> +static inline void inc_mapping_vma(struct address_space *mapping,
> +				struct vm_area_struct *vma) { }
> +static inline void dec_mapping_vma(struct address_space *mapping,
> +				struct vm_area_struct *vma) { }
> +
> +#define i_mmap_lock_write_prepare(mapping)	i_mmap_lock_write(mapping)
> +#define i_mmap_unlock_write_complete(mapping)	i_mmap_unlock_write(mapping)
> +#define i_mmap_tree_lock_write(mapping, vma)
> +#define i_mmap_tree_unlock_write(mapping, vma)
> +
> +#endif
> +
>  /*
>   * Might pages of this file have been modified in userspace?
>   * Note that i_mmap_writable counts all VM_SHARED, VM_MAYWRITE vmas: do_mmap
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0a45c6a8b9f2..9aa8119fa9bf 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4041,11 +4041,91 @@ struct vm_area_struct *vma_interval_tree_iter_first(struct rb_root_cached *root,
>  struct vm_area_struct *vma_interval_tree_iter_next(struct vm_area_struct *node,
>  				unsigned long start, unsigned long last);
>  
> +#ifdef CONFIG_SPLIT_I_MMAP
> +extern int split_tree_num;
> +
> +static inline int smallest_tree_idx(struct file *file)
> +{
> +	struct address_space *mapping = file->f_mapping;
> +	int tmp = INT_MAX, count;
> +	int i, j = 0;
> +
> +	/*
> +	 * Since a not 100% accurate value is still okay,
> +	 * we do not need any lock here.
> +	 */
> +	for (i = 0; i < split_tree_num; i++) {
> +		count = atomic_read(&mapping->i_mmap[i]->vma_count);
> +		if (count < tmp) {
> +			j = i;
> +			tmp = count;
> +			if (!tmp)
> +				break;
> +		}
> +	}

Ohh, I see why you want the per-subtree vma_count now. But is this a net-win?
I think doing something like vma-pointer-hashing or just smp_processor_id()
would work a-ok.
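
A sketch of the pointer-hashing alternative, assuming a power-of-two
shard count. In the kernel, hash_ptr() from <linux/hash.h> would be the
natural helper. This userspace version uses its own multiplicative
(Fibonacci) hash instead, and SHARD_BITS and shard_for_ptr() are
hypothetical names.

```c
#include <stdint.h>

#define SHARD_BITS	4			/* hypothetical: 16 shards */
#define NR_SHARDS	(1u << SHARD_BITS)

/*
 * Map an object address to a shard index in O(1), without reading any
 * per-shard counter. Multiplying by 2^64 / golden ratio and keeping
 * the top bits spreads nearby addresses across shards.
 */
static inline unsigned int shard_for_ptr(const void *p)
{
	uint64_t v = (uint64_t)(uintptr_t)p;

	return (unsigned int)((v * 0x9E3779B97F4A7C15ull) >> (64 - SHARD_BITS));
}
```

Unlike the scan in smallest_tree_idx(), this touches no shared cache
lines when picking a shard, at the cost of not balancing the shards
explicitly.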

> +	return j;
> +}
> +
> +static inline void vma_set_tree_idx(struct vm_area_struct *vma)
> +{
> +#ifdef CONFIG_NUMA
> +	vma->tree_idx = numa_node_id();
> +#else
> +	vma->tree_idx = smallest_tree_idx(vma->vm_file);
> +#endif
> +}
> +
> +static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma,
> +					struct address_space *mapping)
> +{
> +	return &mapping->i_mmap[vma->tree_idx]->root;
> +}
> +
> +/* Find the first valid VMA in the sibling trees */
> +static inline struct vm_area_struct *first_vma(struct i_mmap_tree ***__r,
> +				unsigned long start, unsigned long last)
> +{
> +	struct vm_area_struct *vma = NULL;
> +	struct i_mmap_tree **tree = *__r;
> +	struct rb_root_cached *root;
> +
> +	while (*tree) {
> +		root = &(*tree)->root;
> +		tree++;
> +		vma = vma_interval_tree_iter_first(root, start, last);
> +		if (vma)
> +			break;
> +	}
> +
> +	/* Save for the next loop */
> +	*__r = tree;
> +	return vma;
> +}
> +
> +/*
> + * Please use get_i_mmap_root() to get the @root.
> + * @_tmp is referenced to avoid unused variable warning.
> + */
> +#define vma_interval_tree_foreach(vma, root, start, last)		\
> +	for (struct i_mmap_tree **_r = (struct i_mmap_tree **)(root),	\
> +		**_tmp = (vma = first_vma(&_r, start, last)) ? _r : NULL;\
> +	     ((_tmp && vma) || (vma = first_vma(&_r, start, last)));	\
> +		vma = vma_interval_tree_iter_next(vma, start, last))
> +#else
>  /* Please use get_i_mmap_root() to get the @root */
>  #define vma_interval_tree_foreach(vma, root, start, last)		\
>  	for (vma = vma_interval_tree_iter_first(root, start, last);	\
>  	     vma; vma = vma_interval_tree_iter_next(vma, start, last))
>  
> +static inline void vma_set_tree_idx(struct vm_area_struct *vma) { }
> +
> +static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma,
> +					struct address_space *mapping)
> +{
> +	return &mapping->i_mmap;
> +}
> +#endif
> +
>  void anon_vma_interval_tree_insert(struct anon_vma_chain *node,
>  				   struct rb_root_cached *root);
>  void anon_vma_interval_tree_remove(struct anon_vma_chain *node,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index a308e2c23b82..8d6aab3346ce 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1072,6 +1072,9 @@ struct vm_area_struct {
>  #ifdef __HAVE_PFNMAP_TRACKING
>  	struct pfnmap_track_ctx *pfnmap_track_ctx;
>  #endif
> +#ifdef CONFIG_SPLIT_I_MMAP
> +	int tree_idx;			/* The sibling tree index for the VMA */
> +#endif

FTR the struct hole isn't here, but right after vm_lock_seq or vm_refcnt in
most configs.

>  } __randomize_layout;
>  
>  /* Clears all bits in the VMA flags bitmap, non-atomically. */
> diff --git a/mm/internal.h b/mm/internal.h
> index 5a2ddcf68e0b..2d35cacffd19 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1888,7 +1888,8 @@ static inline void maybe_rmap_unlock_action(struct vm_area_struct *vma,
>  
>  	VM_WARN_ON_ONCE(vma_is_anonymous(vma));
>  	file = vma->vm_file;
> -	i_mmap_unlock_write(file->f_mapping);
> +	i_mmap_tree_unlock_write(file->f_mapping, vma);
> +	i_mmap_unlock_write_complete(file->f_mapping);
>  	action->hide_from_rmap_until_complete = false;
>  }
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d714fdb357e5..70036ec9dcaa 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1825,15 +1825,20 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  			struct address_space *mapping = file->f_mapping;
>  
>  			get_file(file);
> -			i_mmap_lock_write(mapping);
> +			i_mmap_lock_write_prepare(mapping);
> +			i_mmap_tree_lock_write(mapping, mpnt);
> +
>  			if (vma_is_shared_maywrite(tmp))
>  				mapping_allow_writable(mapping);
>  			flush_dcache_mmap_lock(mapping);
>  			/* insert tmp into the share list, just after mpnt */
>  			vma_interval_tree_insert_after(tmp, mpnt,
> -					get_i_mmap_root(mapping));
> +					get_rb_root(mpnt, mapping));
> +			inc_mapping_vma(mapping, tmp);

Honestly, would prefer to hide all of these details from mmap.

>  			flush_dcache_mmap_unlock(mapping);
> -			i_mmap_unlock_write(mapping);
> +
> +			i_mmap_tree_unlock_write(mapping, mpnt);
> +			i_mmap_unlock_write_complete(mapping);
>  		}
>  
>  		if (!(tmp->vm_flags & VM_WIPEONFORK))
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 0f18ffc658e9..1f2c60a220f6 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -567,11 +567,16 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
>  	if (vma->vm_file) {
>  		struct address_space *mapping = vma->vm_file->f_mapping;
>  
> -		i_mmap_lock_write(mapping);
> +		i_mmap_lock_write_prepare(mapping);
> +		i_mmap_tree_lock_write(mapping, vma);
> +
>  		flush_dcache_mmap_lock(mapping);
> -		vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
> +		vma_interval_tree_insert(vma, get_rb_root(vma, mapping));
> +		inc_mapping_vma(mapping, vma);
>  		flush_dcache_mmap_unlock(mapping);
> -		i_mmap_unlock_write(mapping);
> +
> +		i_mmap_tree_unlock_write(mapping, vma);
> +		i_mmap_unlock_write_complete(mapping);
>  	}
>  }
>  
> @@ -583,11 +588,16 @@ static void cleanup_vma_from_mm(struct vm_area_struct *vma)
>  		struct address_space *mapping;
>  		mapping = vma->vm_file->f_mapping;
>  
> -		i_mmap_lock_write(mapping);
> +		i_mmap_lock_write_prepare(mapping);
> +		i_mmap_tree_lock_write(mapping, vma);
> +
>  		flush_dcache_mmap_lock(mapping);
> -		vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
> +		vma_interval_tree_remove(vma, get_rb_root(vma, mapping));
> +		dec_mapping_vma(mapping, vma);
>  		flush_dcache_mmap_unlock(mapping);
> -		i_mmap_unlock_write(mapping);
> +
> +		i_mmap_tree_unlock_write(mapping, vma);
> +		i_mmap_unlock_write_complete(mapping);
>  	}
>  }
>  
> @@ -1063,6 +1073,7 @@ unsigned long do_mmap(struct file *file,
>  	if (file) {
>  		region->vm_file = get_file(file);
>  		vma->vm_file = get_file(file);
> +		vma_set_tree_idx(vma);

This is unrelated, shouldn't be done here.

>  	}
>  
>  	down_write(&nommu_region_sem);
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 8df1b5077951..d5745519d95a 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -809,7 +809,7 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
>  	if (!check_ops_safe(ops))
>  		return -EINVAL;
>  
> -	lockdep_assert_held(&mapping->i_mmap_rwsem);
> +	i_mmap_assert_locked(mapping);

This kind of conversion should be done in a separate step.

>  	vma_interval_tree_foreach(vma, get_i_mmap_root(mapping), first_index,
>  				  first_index + nr - 1) {
>  		/* Clip to the vma */
> diff --git a/mm/vma.c b/mm/vma.c
> index 6159650c1b42..2055758064a9 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -234,22 +234,23 @@ static void __vma_link_file(struct vm_area_struct *vma,
>  		mapping_allow_writable(mapping);
>  
>  	flush_dcache_mmap_lock(mapping);
> -	vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
> +	vma_interval_tree_insert(vma, get_rb_root(vma, mapping));
> +	inc_mapping_vma(mapping, vma);

inc_mapping_vma() should probably be done implicitly by insertion?

>  	flush_dcache_mmap_unlock(mapping);
>  }
>  
> -/*
> - * Requires inode->i_mapping->i_mmap_rwsem
> - */
>  static void __remove_shared_vm_struct(struct vm_area_struct *vma,
>  				      struct address_space *mapping)
>  {
> +	i_mmap_tree_lock_write(mapping, vma);
>  	if (vma_is_shared_maywrite(vma))
>  		mapping_unmap_writable(mapping);
>  
>  	flush_dcache_mmap_lock(mapping);
> -	vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
> +	vma_interval_tree_remove(vma, get_rb_root(vma, mapping));
> +	dec_mapping_vma(mapping, vma);
>  	flush_dcache_mmap_unlock(mapping);
> +	i_mmap_tree_unlock_write(mapping, vma);
>  }
>  
>  /*
> @@ -297,8 +298,9 @@ static void vma_prepare(struct vma_prepare *vp)
>  			uprobe_munmap(vp->adj_next, vp->adj_next->vm_start,
>  				      vp->adj_next->vm_end);
>  
> -		i_mmap_lock_write(vp->mapping);
> +		i_mmap_lock_write_prepare(vp->mapping);
>  		if (vp->insert && vp->insert->vm_file) {
> +			i_mmap_tree_lock_write(vp->mapping, vp->insert);
>  			/*
>  			 * Put into interval tree now, so instantiated pages
>  			 * are visible to arm/parisc __flush_dcache_page
> @@ -307,6 +309,7 @@ static void vma_prepare(struct vma_prepare *vp)
>  			 */
>  			__vma_link_file(vp->insert,
>  					vp->insert->vm_file->f_mapping);
> +			i_mmap_tree_unlock_write(vp->mapping, vp->insert);
>  		}
>  	}
>  
> @@ -318,12 +321,17 @@ static void vma_prepare(struct vma_prepare *vp)
>  	}
>  
>  	if (vp->file) {
> +		i_mmap_tree_lock_write(vp->mapping, vp->vma);
>  		flush_dcache_mmap_lock(vp->mapping);
>  		vma_interval_tree_remove(vp->vma,
> -					get_i_mmap_root(vp->mapping));
> -		if (vp->adj_next)
> +					get_rb_root(vp->vma, vp->mapping));
> +		dec_mapping_vma(vp->mapping, vp->vma);
> +		if (vp->adj_next) {
> +			i_mmap_tree_lock_write(vp->mapping, vp->adj_next);
>  			vma_interval_tree_remove(vp->adj_next,
> -					get_i_mmap_root(vp->mapping));
> +					get_rb_root(vp->adj_next, vp->mapping));
> +			dec_mapping_vma(vp->mapping, vp->adj_next);
> +		}
>  	}
>  
>  }
> @@ -340,12 +348,17 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
>  			 struct mm_struct *mm)
>  {
>  	if (vp->file) {
> -		if (vp->adj_next)
> +		if (vp->adj_next) {
>  			vma_interval_tree_insert(vp->adj_next,
> -					get_i_mmap_root(vp->mapping));
> +					get_rb_root(vp->adj_next, vp->mapping));
> +			inc_mapping_vma(vp->mapping, vp->adj_next);
> +			i_mmap_tree_unlock_write(vp->mapping, vp->adj_next);
> +		}
>  		vma_interval_tree_insert(vp->vma,
> -					get_i_mmap_root(vp->mapping));
> +					get_rb_root(vp->vma, vp->mapping));
> +		inc_mapping_vma(vp->mapping, vp->vma);
>  		flush_dcache_mmap_unlock(vp->mapping);
> +		i_mmap_tree_unlock_write(vp->mapping, vp->vma);
>  	}
>  
>  	if (vp->remove && vp->file) {
> @@ -370,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
>  	}
>  
>  	if (vp->file) {
> -		i_mmap_unlock_write(vp->mapping);
> +		i_mmap_unlock_write_complete(vp->mapping);
>  
>  		if (!vp->skip_vma_uprobe) {
>  			uprobe_mmap(vp->vma);
> @@ -1799,12 +1812,12 @@ static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
>  	int i;
>  
>  	mapping = vb->vmas[0]->vm_file->f_mapping;
> -	i_mmap_lock_write(mapping);
> +	i_mmap_lock_write_prepare(mapping);
>  	for (i = 0; i < vb->count; i++) {
>  		VM_WARN_ON_ONCE(vb->vmas[i]->vm_file->f_mapping != mapping);
>  		__remove_shared_vm_struct(vb->vmas[i], mapping);
>  	}
> -	i_mmap_unlock_write(mapping);
> +	i_mmap_unlock_write_complete(mapping);
>  
>  	unlink_file_vma_batch_init(vb);
>  }
> @@ -1836,10 +1849,13 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock)
>  
>  	if (file) {
>  		mapping = file->f_mapping;
> -		i_mmap_lock_write(mapping);
> +		i_mmap_lock_write_prepare(mapping);
> +		i_mmap_tree_lock_write(mapping, vma);
>  		__vma_link_file(vma, mapping);
> -		if (!hold_rmap_lock)
> -			i_mmap_unlock_write(mapping);
> +		if (!hold_rmap_lock) {
> +			i_mmap_tree_unlock_write(mapping, vma);
> +			i_mmap_unlock_write_complete(mapping);
> +		}
>  	}
>  }
>  
> @@ -2164,6 +2180,23 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
>  	}
>  }

I can but hope that all of the above is quite simplified before we get to the
"making file rmap more complicated" bit.

>  
> +#ifdef CONFIG_SPLIT_I_MMAP
> +static inline void i_mmap_nest_lock(struct address_space *mapping,
> +				struct rw_semaphore *lock)
> +{
> +	int i;
> +
> +	for (i = 0; i < split_tree_num; i++)
> +		down_write_nest_lock(&mapping->i_mmap[i]->rwsem, lock);
> +}
> +#else
> +static inline void i_mmap_nest_lock(struct address_space *mapping,
> +				struct rw_semaphore *lock)
> +{
> +	down_write_nest_lock(&mapping->i_mmap_rwsem, lock);
> +}
> +#endif
> +
>  static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
>  {
>  	if (!test_bit(AS_MM_ALL_LOCKS, &mapping->flags)) {
> @@ -2178,7 +2211,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
>  		 */
>  		if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
>  			BUG();
> -		down_write_nest_lock(&mapping->i_mmap_rwsem, &mm->mmap_lock);
> +		i_mmap_nest_lock(mapping, &mm->mmap_lock);
>  	}
>  }
>  
> @@ -2489,6 +2522,7 @@ static int __mmap_new_file_vma(struct mmap_state *map,
>  	int error;
>  
>  	vma->vm_file = map->file;
> +	vma_set_tree_idx(vma);
>  	if (!map->file_doesnt_need_get)
>  		get_file(map->file);
>  
> diff --git a/mm/vma_init.c b/mm/vma_init.c
> index 3c0b65950510..c115e33d4812 100644
> --- a/mm/vma_init.c
> +++ b/mm/vma_init.c
> @@ -72,6 +72,9 @@ static void vm_area_init_from(const struct vm_area_struct *src,
>  #ifdef CONFIG_NUMA
>  	dest->vm_policy = src->vm_policy;
>  #endif
> +#ifdef CONFIG_SPLIT_I_MMAP
> +	dest->tree_idx = src->tree_idx;
> +#endif
>  #ifdef __HAVE_PFNMAP_TRACKING
>  	dest->pfnmap_track_ctx = NULL;
>  #endif

-- 
Pedro

^ permalink raw reply

* Re: [PATCH v2 1/4] mm: use mapping_mapped to simplify the code
From: Pedro Falcato @ 2026-06-11 11:13 UTC (permalink / raw)
  To: Huang Shijie
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko, corbet,
	skhan, linux, dinguyen, schuster.simon, James.Bottomley, deller,
	djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, riel, harry,
	will, brian.ruley, rmk+kernel, dave.anglin, linux-mm, linux-doc,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-fsdevel,
	nvdimm, linux-perf-users, linux-trace-kernel, zhongyuan,
	fangbaoshun, yingzhiwei
In-Reply-To: <20260611061915.2354307-2-huangsj@hygon.cn>

On Thu, Jun 11, 2026 at 02:18:57PM +0800, Huang Shijie wrote:
> Use mapping_mapped() to simplify the code, make
> the code tidy and clean.
> 
> Signed-off-by: Huang Shijie <huangsj@hygon.cn>

Reviewed-by: Pedro Falcato <pfalcato@suse.de>

LGTM, thanks! Super uncontroversial, so perhaps it
could be picked up separately.

-- 
Pedro

^ permalink raw reply

* Re: [RESEND][PATCH v2] unwind: Add sframe_(un)register() system calls
From: Steven Rostedt @ 2026-06-11 11:22 UTC (permalink / raw)
  To: Fangrui Song
  Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
	Mathieu Desnoyers, Jens Remus, Josh Poimboeuf, Peter Zijlstra,
	Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
	Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
	Vasily Gorbik, Thomas Weißschuh
In-Reply-To: <aipWtXVRqNmZY4gr@archer>

On Thu, 11 Jun 2026 00:00:25 -0700
Fangrui Song <i@maskray.me> wrote:

> Hi Steven,
> 
> This is not an objection to deferred userspace unwinding itself -- my
> concern is narrower: these syscalls permanently encode the kernel's
> commitment to the SFrame format family at exactly the moment the
> format's size trajectory is heading the wrong way, and while arguably
> superior formats exist.
> 
> I raised related size concerns about SFrame's viability for userspace
> stack walking earlier:
> https://lore.kernel.org/all/3xd4fqvwflefvsjjoagytoi3y3sf7lxqjremhe2zo5tounihe4@3ftafgryadsr/
> ("Concerns about SFrame viability for userspace stack walking")
> 
> SFrame v3 is even larger than v2.
> 
> For comparison: Microsoft is currently upstreaming its Windows x64
> Unwind V3 implementation to LLVM, which will make a side-by-side reading
> of the two formats straightforward. Unwind V3 provides correct
> exception-handling unwind -- full prologue replay, SEH handlers,
> funclets -- and supports Intel APX. SFrame v3 provides stack tracing
> only, no EH, yet comes out larger than .eh_frame. A format revision that
> adds capability without adding bulk is demonstrably achievable; SFrame
> v3 went the other way.

My main concern is simplicity of implementation on the kernel side. One
thing we would like to avoid is any interpreter that basically ends up
executing user space code to perform the stack tracing (i.e. DWARF). I
haven't looked at the Windows x64 Unwind V3 format yet, but will do so.

> 
> I understand IBM is doubling down on SFrame for their s390x and ppc64,

That's because this is currently the only way s390 can perform stack
walking in user space.

> but I'm not convinced the size overhead of v3 will make it appealing on
> x86-64. I have learned that the person driving their SFrame work at
> Google had left and the SFrame at data center effort was being
> reevaluated per a toolchain manager.

I believe the person who left Google that was driving the SFrame work was
me ;-)

Thanks,

-- Steve

^ permalink raw reply

* Re: [RFC PATCH 1/2] tracing/osnoise: Sample IPI counts
From: Tomas Glozar @ 2026-06-11 11:55 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Crystal Wood, linux-kernel, linux-trace-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Costa Shulyupin,
	Ivan Pravdin
In-Reply-To: <xhsmh33yt2wtc.mognet@vschneid-thinkpadt14sgen2i.remote.csb>

On Thu, Jun 11, 2026 at 12:31, Valentin Schneider
<vschneid@redhat.com> wrote:
> >
> > Isn't that precisely what the ipi tracepoints used by this
> > implementation (ipi:ipi_send_cpu) are for?
> >
>
> Well, these catch the emission of the IPI, which is great for investigation
> - slap a stacktrace trigger and you (most of the time) get the source of
> your interference.
>
> However Crystal's point is that on x86 (and I assume other archs) receiving
> & handling these IPIs is "special" and doesn't go through the generic irq
> subsystem and thus has to be tracked separately, which is why osnoise has
> this fairly lengthy osnoise_arch_register() thing.
>

Ah, right. This is not IPI-specific, though, IIUC - Intel also has
other IRQs that have to be traced using Intel-specific tracepoints,
like irq_vectors:local_timer, which is also handled in
osnoise_arch_register(). On ARM, from what I recall, most (all?) IRQs
are traced with irq:* tracepoints.

So there are two parts to this:

- Detecting interference from IPIs firing as osnoise:irq_noise (to be
analyzed by timerlat auto analysis, and also will appear by default in
trace output if enabled, regardless of the tool, as all osnoise:*
tracepoints are enabled there). This is done locally using the already
existing path (no race hazard), but requires arch-specific detection.

- Counting IPIs when they are being sent. This is the new feature, and
the count is being recorded in osnoise_sample.

I guess that means that if there were a generic IPI interface, it
would be easier to use that for IPI counting, as the event would be
CPU-local? As you say, for tracing the IPI source, the sending
tracepoints are better, and you can already dump their stack trace
with --event/--trigger. timerlat auto-analysis could be extended to
connect the specific IPI to the IRQ noise and display its stack trace
automatically, instead of requiring manual analysis of the trace output.

> >> Isn't this racy to do from a different CPU?  Both in terms of the
> >> counter, and the timing of the increment relative to when the IPI is
> >> actually received.  Not necessarily a huge deal if you only care about
> >> zero versus bignum, but still.  At least worth a comment, if we go with
> >> this approach.
> >>
> >
> > I also think it's a bit confusing, especially as the other accesses to
> > osn_var are cpu-local, but here, "cpu" is the *target* CPU, not the
> > current CPU. Not sure how expensive it would be to do atomic_add for
> > that, at least it's something to consider.
> >
>
> I suppose that could be an argument for doing that stat aggregation in
> userspace osnoise - event handlers are run after the fact via
> tracefs_iterate_raw_events(), it's all inherently slower since it's just
> increments of one (one per handled event) but it's also all done in
> userspace on a control thread and doesn't bog down the kernelspace.
>

You can also do per-cpu counters in-kernel and sum them in the end,
but that would take cpus^2 space (indexed by [current_cpu,
target_cpu]). The question is whether there could be enough samples to
overload sample collection (like it happens for timerlat, which
collects data in-kernel using BPF instead).
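
A minimal userspace sketch of that [current_cpu, target_cpu] layout, to
make the cpus^2 cost concrete. Everything here is hypothetical and for
illustration only: NR_CPUS_DEMO, struct ipi_counts and the helper names
are not from the patchset or from osnoise. The sketch is single-threaded;
it only shows that each sender writes its own row (so increments need no
atomics) and that the per-target total is a column sum. A real
implementation would still have to decide when it is safe to read the
rows.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_CPUS_DEMO 4	/* hypothetical CPU count for this sketch */

/* counts[sender][target]: each row is written only by its sender CPU. */
struct ipi_counts {
	uint64_t counts[NR_CPUS_DEMO][NR_CPUS_DEMO];
};

static void ipi_counts_reset(struct ipi_counts *c)
{
	memset(c, 0, sizeof(*c));
}

/* Runs on the sending CPU and touches only that CPU's own row. */
static void ipi_count_send(struct ipi_counts *c, unsigned int sender,
			   unsigned int target)
{
	assert(sender < NR_CPUS_DEMO && target < NR_CPUS_DEMO);
	c->counts[sender][target]++;
}

/* Sum one column at the end of a sample: IPIs sent to 'target'. */
static uint64_t ipi_count_received(const struct ipi_counts *c,
				   unsigned int target)
{
	uint64_t sum = 0;
	unsigned int sender;

	assert(target < NR_CPUS_DEMO);
	for (sender = 0; sender < NR_CPUS_DEMO; sender++)
		sum += c->counts[sender][target];
	return sum;
}
```

The storage is NR_CPUS_DEMO * NR_CPUS_DEMO counters, which is the
quadratic cost mentioned above; a single atomic counter per target CPU
would trade that space for contended cross-CPU writes.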

In-kernel counting can be tested with "--event ipi:ipi_send_cpu
--trigger hist:key=cpu" - IIRC, tracefs histograms use atomic
operations (via tracing_map) to protect the entries from races in
multi-threaded access. Of course, that is inferior to what the patchset
implements, as it doesn't record which osnoise cycle the IPI was sent
in, nor can it record cpumask IPIs.


Tomas


^ permalink raw reply

* Re: [PATCH 1/3] tracing/user_events: Simplify data output in user_seq_show()
From: Steven Rostedt @ 2026-06-11 12:59 UTC (permalink / raw)
  To: Markus Elfring
  Cc: linux-trace-kernel, Masami Hiramatsu, Mathieu Desnoyers, LKML,
	kernel-janitors
In-Reply-To: <6762145e-3e51-43b8-8bca-a9dd200b54e2@web.de>

On Tue, 9 Jun 2026 18:44:04 +0200
Markus Elfring <Markus.Elfring@web.de> wrote:

> >> @@ -2800,8 +2800,7 @@ static int user_seq_show(struct seq_file *m, void *p)
> >>  
> >>  	mutex_unlock(&group->reg_mutex);
> >>  
> >> -	seq_puts(m, "\n");
> >> -	seq_printf(m, "Active: %d\n", active);
> >> +	seq_printf(m, "\nActive: %d\n", active);
> >>  	seq_printf(m, "Busy: %d\n", busy);  
> > 
> > This isn't a critical section and I find the original way easier to read.  
> 
> Would you prefer to use a seq_putc() call instead at such a source code place?
> https://elixir.bootlin.com/linux/v7.1-rc7/source/kernel/trace/trace_events_user.c#L2803

Sure, why not.

-- Steve

^ permalink raw reply

* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Peter Zijlstra @ 2026-06-11 13:44 UTC (permalink / raw)
  To: Dmitry Ilvokhin
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
	virtualization, linux-arch, linux-mm, linux-trace-kernel,
	kernel-team, Paul E. McKenney
In-Reply-To: <aiphFXe_TPNPxZ_n@shell.ilvokhin.com>

On Thu, Jun 11, 2026 at 07:17:41AM +0000, Dmitry Ilvokhin wrote:
> On Wed, Jun 03, 2026 at 02:08:11PM +0200, Peter Zijlstra wrote:
> > Also, I think someone should go do some performance runs with
> > ARCH_INLINE_SPIN_* set for x86 just like for s390.
> 
> As promised, I set ARCH_INLINE_SPIN_UNLOCK{,_BH,_IRQ,_IRQRESTORE} for
> x86 and measured the effect on a few real workloads.
> 
> Short version: inlining of _raw_spin_unlock() adds measurable kernel
> i-cache pressure on every workload I tried, and on a
> kernel-i-cache-bound one (nginx connection churn) it costs ~1.27%
> throughput. I did not find a workload where it helps.

Thanks for checking!

^ permalink raw reply
