Linux Documentation


* [PATCH v11 03/11] tools/bootconfig: Ignore comment lines in dynamic_events/kprobe_events file
From: Masami Hiramatsu (Google) @ 2026-06-26 14:14 UTC
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178248325671.841606.17344906774310339507.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Since the dynamic_events/kprobe_events files show the fetcharg debug
information as comment lines, their readers need to ignore those lines.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 tools/bootconfig/scripts/ftrace2bconf.sh |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/bootconfig/scripts/ftrace2bconf.sh b/tools/bootconfig/scripts/ftrace2bconf.sh
index 1603801cf126..8eed445c295e 100755
--- a/tools/bootconfig/scripts/ftrace2bconf.sh
+++ b/tools/bootconfig/scripts/ftrace2bconf.sh
@@ -57,6 +57,8 @@ EOF
 kprobe_event_options() {
 	cat $TRACEFS/kprobe_events | while read p args; do
 		case $p in
+		\#*)
+		continue;;
 		r*)
 		cat 1>&2 << EOF
 # WARN: A return probe found but it is not supported by bootconfig. Skip it.
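
The same rule can be sketched in C for any other reader of these files.
This is a minimal sketch, not code from the series: the names
is_skippable_line() and count_event_definitions() are made up for
illustration, and the input is assumed to be the file contents already
read into a NUL-terminated buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if the line should be skipped: a '#' comment or an empty
 * line. Leading blanks are ignored, as the shell "read" builtin does.
 */
static int is_skippable_line(const char *line)
{
	assert(line != NULL);
	while (*line == ' ' || *line == '\t')
		line++;
	return *line == '#' || *line == '\n' || *line == '\0';
}

/*
 * Count the event definitions in a buffer holding the text of a
 * dynamic_events-style file, ignoring comment and empty lines.
 */
static size_t count_event_definitions(const char *text)
{
	size_t count = 0;

	assert(text != NULL);
	while (*text) {
		const char *eol = strchr(text, '\n');
		size_t len = eol ? (size_t)(eol - text) : strlen(text);

		if (!is_skippable_line(text))
			count++;
		/* Advance past this line and its newline, if any. */
		text += len;
		if (*text == '\n')
			text++;
	}
	return count;
}
```

With the example output from patch 02/11, one definition line followed
by two "#  argN: ..." lines counts as a single event.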



* [PATCH v11 02/11] tracing/probes: Support dumping fetcharg program for debugging dynamic events
From: Masami Hiramatsu (Google) @ 2026-06-26 14:14 UTC
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178248325671.841606.17344906774310339507.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

For debugging probe events, it is helpful to verify the compiled
fetch instructions for each probe argument. This introduces a new
kernel config CONFIG_PROBE_EVENTS_DUMP_FETCHARG to decode the
instruction sequence of each argument and display it under a
commented line starting with '#' immediately following the dynamic
event definition (such as in dynamic_events, kprobe_events,
uprobe_events, etc.).

For example:
 /sys/kernel/tracing # cat dynamic_events
 p:kprobes/p_vfs_read_0 vfs_read arg1=+0(file):ustring arg2=%ax:x16
 #  arg1: ARG(0) -> ST_USTRING(offset=0,size=4) -> END
 #  arg2: REG(80) -> ST_RAW(size=2) -> END

Assisted-by: Antigravity:gemini-3.5-flash
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v8:
  - State this feature is only for debugging probe events.
  - Fix dependency list after description in Kconfig.
 Changes in v7:
   - Show trace event field name for FETCH_OP_TP_ARG.
   - Show immediate string value for FETCH_OP_IMMSTR.
   - Fix style issues warned by checkpatch.pl.
 Changes in v6:
   - Newly added.
---
 kernel/trace/Kconfig        |   12 +++++
 kernel/trace/trace_eprobe.c |    2 +
 kernel/trace/trace_fprobe.c |    2 +
 kernel/trace/trace_kprobe.c |    2 +
 kernel/trace/trace_probe.c  |   96 +++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_probe.h  |   79 +++++++++++++++++++++--------------
 kernel/trace/trace_uprobe.c |    3 +
 7 files changed, 164 insertions(+), 32 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 084f34dc6c9f..0ab5916575a9 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -779,6 +779,18 @@ config PROBE_EVENTS_BTF_ARGS
 	  kernel function entry or a tracepoint.
 	  This is available only if BTF (BPF Type Format) support is enabled.
 
+config PROBE_EVENTS_DUMP_FETCHARG
+	bool "Dump of dynamic probe event fetch-arguments"
+	depends on PROBE_EVENTS
+	default n
+	help
+	  This shows the dump of fetch-arguments of dynamic probe events
+	  alongside their event definitions in the dynamic_events file
+	  as comment lines. This is useful to debug the probe events.
+	  Since this exposes the raw values in the dynamic_events file,
+	  it might be a security risk. Only enable it if you need to debug
+	  probe events themselves.
+
 config KPROBE_EVENTS
 	depends on KPROBES
 	depends on HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/kernel/trace/trace_eprobe.c b/kernel/trace/trace_eprobe.c
index 50518b071414..462c31145733 100644
--- a/kernel/trace/trace_eprobe.c
+++ b/kernel/trace/trace_eprobe.c
@@ -87,6 +87,8 @@ static int eprobe_dyn_event_show(struct seq_file *m, struct dyn_event *ev)
 		seq_printf(m, " %s=%s", ep->tp.args[i].name, ep->tp.args[i].comm);
 	seq_putc(m, '\n');
 
+	trace_probe_dump_args(m, &ep->tp);
+
 	return 0;
 }
 
diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
index 4d1abbf66229..536781cd4c47 100644
--- a/kernel/trace/trace_fprobe.c
+++ b/kernel/trace/trace_fprobe.c
@@ -1449,6 +1449,8 @@ static int trace_fprobe_show(struct seq_file *m, struct dyn_event *ev)
 		seq_printf(m, " %s=%s", tf->tp.args[i].name, tf->tp.args[i].comm);
 	seq_putc(m, '\n');
 
+	trace_probe_dump_args(m, &tf->tp);
+
 	return 0;
 }
 
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index a8420e6abb56..cfa807d8e760 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1320,6 +1320,8 @@ static int trace_kprobe_show(struct seq_file *m, struct dyn_event *ev)
 		seq_printf(m, " %s=%s", tk->tp.args[i].name, tk->tp.args[i].comm);
 	seq_putc(m, '\n');
 
+	trace_probe_dump_args(m, &tk->tp);
+
 	return 0;
 }
 
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 2ce7d62471cb..0908019aea12 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -2403,3 +2403,99 @@ int trace_probe_print_args(struct trace_seq *s, struct probe_arg *args, int nr_a
 	}
 	return 0;
 }
+
+#ifdef CONFIG_PROBE_EVENTS_DUMP_FETCHARG
+
+struct fetch_op_decode {
+	const char *name;
+	void (*decode)(struct seq_file *m, struct fetch_insn *insn);
+};
+
+static const struct fetch_op_decode fetch_op_decode[];
+
+static void fetcharg_decode_none(struct seq_file *m, struct fetch_insn *insn)
+{
+	seq_puts(m, fetch_op_decode[insn->op].name);
+}
+
+static void fetcharg_decode_param(struct seq_file *m, struct fetch_insn *insn)
+{
+	seq_printf(m, "%s(%u)", fetch_op_decode[insn->op].name, insn->param);
+}
+
+static void fetcharg_decode_imm(struct seq_file *m, struct fetch_insn *insn)
+{
+	seq_printf(m, "%s(0x%lx)", fetch_op_decode[insn->op].name, insn->immediate);
+}
+
+static void fetcharg_decode_string(struct seq_file *m, struct fetch_insn *insn)
+{
+	seq_printf(m, "%s(%s)", fetch_op_decode[insn->op].name, (char *)insn->data);
+}
+
+static void fetcharg_decode_symbol(struct seq_file *m, struct fetch_insn *insn)
+{
+	seq_printf(m, "%s(%s)", fetch_op_decode[insn->op].name, (char *)insn->data);
+}
+
+static void fetcharg_decode_offset(struct seq_file *m, struct fetch_insn *insn)
+{
+	seq_printf(m, "%s(offset=%d)", fetch_op_decode[insn->op].name, insn->offset);
+}
+
+static void fetcharg_decode_store(struct seq_file *m, struct fetch_insn *insn)
+{
+	if (insn->op == FETCH_OP_ST_RAW)
+		seq_printf(m, "%s(size=%u)", fetch_op_decode[insn->op].name, insn->size);
+	else
+		seq_printf(m, "%s(offset=%d,size=%u)", fetch_op_decode[insn->op].name,
+			  insn->offset, insn->size);
+}
+
+static void fetcharg_decode_bf(struct seq_file *m, struct fetch_insn *insn)
+{
+	seq_printf(m, "%s(basesize=%u,lshift=%u,rshift=%u)",
+		   fetch_op_decode[insn->op].name, insn->basesize, insn->lshift, insn->rshift);
+}
+
+static void fetcharg_decode_tp_arg(struct seq_file *m, struct fetch_insn *insn)
+{
+	struct ftrace_event_field *field = insn->data;
+
+	seq_printf(m, "%s(%s)", fetch_op_decode[insn->op].name, field->name);
+}
+
+#define FETCH_OP(opname, decode_fn) \
+	[FETCH_OP_##opname] = { .name = #opname, .decode = fetcharg_decode_##decode_fn }
+
+static const struct fetch_op_decode fetch_op_decode[] = FETCH_OP_LIST;
+#undef FETCH_OP
+
+static void trace_probe_dump_arg(struct seq_file *m, struct probe_arg *parg)
+{
+	int i;
+
+	seq_printf(m, "#  %s: ", parg->name);
+	for (i = 0; i < FETCH_INSN_MAX; i++) {
+		struct fetch_insn *insn = parg->code + i;
+
+		if (insn->op >= ARRAY_SIZE(fetch_op_decode) || !fetch_op_decode[insn->op].decode)
+			seq_printf(m, "unknown(%d)", insn->op);
+		else
+			fetch_op_decode[insn->op].decode(m, insn);
+
+		if (insn->op == FETCH_OP_END)
+			break;
+		seq_puts(m, " -> ");
+	}
+	seq_putc(m, '\n');
+}
+
+void trace_probe_dump_args(struct seq_file *m, struct trace_probe *tp)
+{
+	int i;
+
+	for (i = 0; i < tp->nr_args; i++)
+		trace_probe_dump_arg(m, &tp->args[i]);
+}
+#endif /* CONFIG_PROBE_EVENTS_DUMP_FETCHARG */
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 2e0d8384ee5c..e36cfe39e9a8 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -83,38 +83,46 @@ static nokprobe_inline u32 update_data_loc(u32 loc, int consumed)
 /* Printing function type */
 typedef int (*print_type_func_t)(struct trace_seq *, void *, void *);
 
-enum fetch_op {
-	FETCH_OP_NOP = 0,
-	// Stage 1 (load) ops
-	FETCH_OP_REG,		/* Register : .param = offset */
-	FETCH_OP_STACK,		/* Stack : .param = index */
-	FETCH_OP_STACKP,	/* Stack pointer */
-	FETCH_OP_RETVAL,	/* Return value */
-	FETCH_OP_IMM,		/* Immediate : .immediate */
-	FETCH_OP_COMM,		/* Current comm */
-	FETCH_OP_ARG,		/* Function argument : .param */
-	FETCH_OP_FOFFS,		/* File offset: .immediate */
-	FETCH_OP_IMMSTR,	/* Allocated string: .data */
-	FETCH_OP_EDATA,		/* Entry data: .offset */
-	// Stage 2 (dereference) op
-	FETCH_OP_DEREF,		/* Dereference: .offset */
-	FETCH_OP_UDEREF,	/* User-space Dereference: .offset */
-	// Stage 3 (store) ops
-	FETCH_OP_ST_RAW,	/* Raw: .size */
-	FETCH_OP_ST_MEM,	/* Mem: .offset, .size */
-	FETCH_OP_ST_UMEM,	/* Mem: .offset, .size */
-	FETCH_OP_ST_STRING,	/* String: .offset, .size */
-	FETCH_OP_ST_USTRING,	/* User String: .offset, .size */
-	FETCH_OP_ST_SYMSTR,	/* Kernel Symbol String: .offset, .size */
-	FETCH_OP_ST_EDATA,	/* Store Entry Data: .offset */
-	// Stage 4 (modify) op
-	FETCH_OP_MOD_BF,	/* Bitfield: .basesize, .lshift, .rshift */
-	// Stage 5 (loop) op
-	FETCH_OP_LP_ARRAY,	/* Array: .param = loop count */
-	FETCH_OP_TP_ARG,	/* Trace Point argument */
-	FETCH_OP_END,
-	FETCH_NOP_SYMBOL,	/* Unresolved Symbol holder */
-};
+#define FETCH_OP_LIST	{						\
+	/* Stage 1 (load) ops */					\
+	FETCH_OP(NOP, none),		/* NOP */			\
+	FETCH_OP(REG, param),		/* Register: .param = offset */	\
+	FETCH_OP(STACK, param),		/* Stack: .param = index */	\
+	FETCH_OP(STACKP, none),		/* Stack pointer */		\
+	FETCH_OP(RETVAL, none),		/* Return value */		\
+	FETCH_OP(IMM, imm),		/* Immediate: .immediate */	\
+	FETCH_OP(COMM, none),		/* Current comm */		\
+	FETCH_OP(ARG, param),		/* Argument: .param = index */	\
+	FETCH_OP(FOFFS, imm),		/* File offset: .immediate */	\
+	FETCH_OP(IMMSTR, string),	/* Allocated string: .data */	\
+	FETCH_OP(EDATA, offset),	/* Entry data: .offset */	\
+	FETCH_OP(TP_ARG, tp_arg),	/* Tracepoint argument: .data */\
+	/* Stage 2 (dereference) ops */					\
+	FETCH_OP(DEREF, offset),	/* Dereference: .offset */	\
+	FETCH_OP(UDEREF, offset),	/* User-space dereference: .offset */\
+	/* Stage 3 (store) ops */					\
+	FETCH_OP(ST_RAW, store),	/* Raw value: .size */		\
+	FETCH_OP(ST_MEM, store),	/* Memory: .offset, .size */	\
+	FETCH_OP(ST_UMEM, store),	/* User memory: .offset, .size */\
+	FETCH_OP(ST_STRING, store),	/* String: .offset, .size */	\
+	FETCH_OP(ST_USTRING, store),	/* User string: .offset, .size */\
+	FETCH_OP(ST_SYMSTR, store),	/* Symbol name: .offset, .size */\
+	FETCH_OP(ST_EDATA, offset),	/* Entry data: .offset */	\
+	/* Stage 4 (modify) op */					\
+	FETCH_OP(MOD_BF, bf),		/* Bitfield: .basesize, .lshift, .rshift*/\
+	/* Stage 5 (loop) op */						\
+	FETCH_OP(LP_ARRAY, param),	/* Loop array: .param = count */\
+	/* End */							\
+	FETCH_OP(END, none),						\
+	/* Unresolved Symbol holder */					\
+	FETCH_OP(NOP_SYMBOL, symbol),	/* Non loaded symbol: .data = symbol name */\
+}
+
+#define FETCH_OP(opname, decode_fn) FETCH_OP_##opname
+enum fetch_op FETCH_OP_LIST;
+#undef FETCH_OP
+
+#define FETCH_NOP_SYMBOL FETCH_OP_NOP_SYMBOL
 
 struct fetch_insn {
 	enum fetch_op op;
@@ -370,6 +378,13 @@ bool trace_probe_match_command_args(struct trace_probe *tp,
 int trace_probe_create(const char *raw_command, int (*createfn)(int, const char **));
 int trace_probe_print_args(struct trace_seq *s, struct probe_arg *args, int nr_args,
 		 u8 *data, void *field);
+#ifdef CONFIG_PROBE_EVENTS_DUMP_FETCHARG
+void trace_probe_dump_args(struct seq_file *m, struct trace_probe *tp);
+#else
+static inline void trace_probe_dump_args(struct seq_file *m, struct trace_probe *tp)
+{
+}
+#endif
 
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
 int traceprobe_get_entry_data_size(struct trace_probe *tp);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index c274346853d1..b2e264a4b96c 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -765,6 +765,9 @@ static int trace_uprobe_show(struct seq_file *m, struct dyn_event *ev)
 		seq_printf(m, " %s=%s", tu->tp.args[i].name, tu->tp.args[i].comm);
 
 	seq_putc(m, '\n');
+
+	trace_probe_dump_args(m, &tu->tp);
+
 	return 0;
 }
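
The FETCH_OP_LIST change above expands one list twice: once into the
enum and once into a name/decoder table indexed by opcode. The following
is a reduced, userspace sketch of that pattern under stated assumptions:
all DEMO_* and demo_* names are hypothetical, the list has three entries
instead of the real set, and output goes to a caller buffer through
snprintf() rather than to a seq_file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical three-entry op list; the real list is much longer. */
#define DEMO_OP_LIST {		\
	DEMO_OP(ARG, param),	\
	DEMO_OP(ST_RAW, size),	\
	DEMO_OP(END, none),	\
}

/* First expansion: the enum of opcodes. */
#define DEMO_OP(opname, decode_fn) DEMO_OP_##opname
enum demo_op DEMO_OP_LIST;
#undef DEMO_OP

struct demo_insn {
	enum demo_op op;
	unsigned int val;	/* parameter or size, depending on op */
};

typedef int (*demo_decode_fn)(char *buf, size_t len, const char *name,
			      unsigned int val);

static int demo_decode_none(char *buf, size_t len, const char *name,
			    unsigned int val)
{
	(void)val;
	return snprintf(buf, len, "%s", name);
}

static int demo_decode_param(char *buf, size_t len, const char *name,
			     unsigned int val)
{
	return snprintf(buf, len, "%s(%u)", name, val);
}

static int demo_decode_size(char *buf, size_t len, const char *name,
			    unsigned int val)
{
	return snprintf(buf, len, "%s(size=%u)", name, val);
}

/* Second expansion: the decode table, indexed by opcode. */
#define DEMO_OP(opname, decode_fn) \
	[DEMO_OP_##opname] = { .name = #opname, .decode = demo_decode_##decode_fn }
static const struct {
	const char *name;
	demo_decode_fn decode;
} demo_decode[] = DEMO_OP_LIST;
#undef DEMO_OP

/* Render an instruction sequence as "A -> B -> END" into buf. */
static void demo_dump(char *buf, size_t len, const struct demo_insn *code,
		      int max)
{
	size_t pos = 0;
	int i;

	assert(buf != NULL && len > 0 && code != NULL);
	buf[0] = '\0';
	for (i = 0; i < max && pos < len; i++) {
		enum demo_op op = code[i].op;

		pos += demo_decode[op].decode(buf + pos, len - pos,
					      demo_decode[op].name, code[i].val);
		if (op == DEMO_OP_END)
			break;
		if (pos < len)
			pos += snprintf(buf + pos, len - pos, " -> ");
	}
}
```

Keeping the opcode and its decoder on one line of the list means a new
opcode cannot be added to the enum without also naming how to print it.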
 



* [PATCH v11 01/11] tracing/probes: Allow eprobe to use variable without $ prefix
From: Masami Hiramatsu (Google) @ 2026-06-26 14:14 UTC
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178248325671.841606.17344906774310339507.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Commit 69efd863a785 ("tracing/eprobes: Allow use of BTF names
to dereference pointers") allows eprobe to use an event field without
the "$" prefix when it is used with a typecast, so it is natural to
allow it without a typecast as well.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v8:
  - Newly added.
---
 kernel/trace/trace_probe.c                         |   12 +++++++++++-
 kernel/trace/trace_probe.h                         |    1 +
 .../test.d/dynevent/eprobes_syntax_errors.tc       |    3 +--
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 0da7c0b53ba7..2ce7d62471cb 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1341,7 +1341,17 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 		ret = handle_typecast(arg, pcode, end, ctx);
 		break;
 	default:
-		if (isalpha(arg[0]) || arg[0] == '_') {	/* BTF variable */
+		if (isalpha(arg[0]) || arg[0] == '_') {
+			/* BTF variable or event field*/
+			if (ctx->flags & TPARG_FL_TEVENT) {
+				ret = parse_trace_event(arg, *pcode, ctx);
+				if (ret < 0) {
+					trace_probe_log_err(ctx->offset,
+							    NO_EVENT_FIELD);
+					return -EINVAL;
+				}
+				break;
+			}
 			if (!tparg_is_function_entry(ctx->flags) &&
 			    !tparg_is_function_return(ctx->flags)) {
 				trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 40b53b5b58a9..2e0d8384ee5c 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -559,6 +559,7 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(NO_PTR_STRCT,		"This is not a pointer to union/structure."),	\
 	C(NOSUP_DAT_ARG,	"Non pointer structure/union argument is not supported."),\
 	C(BAD_HYPHEN,		"Failed to parse single hyphen. Forgot '>'?"),	\
+	C(NO_EVENT_FIELD,	"This event field is not found."),	\
 	C(NO_BTF_FIELD,		"This field is not found."),	\
 	C(BAD_BTF_TID,		"Failed to get BTF type info."),\
 	C(BAD_TYPE4STR,		"This type does not fit for string."),\
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
index 2a680c086047..0e65e787e426 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
@@ -10,7 +10,7 @@ check_error() { # command-with-error-pos-by-^
 check_error 'e ^a.'			# NO_EVENT_INFO
 check_error 'e ^.b'			# NO_EVENT_INFO
 check_error 'e ^a.b'			# BAD_ATTACH_EVENT
-check_error 'e syscalls/sys_enter_openat ^foo'	# BAD_ATTACH_ARG
+check_error 'e syscalls/sys_enter_openat ^foo'	# NO_EVENT_FIELD
 check_error 'e:^/bar syscalls/sys_enter_openat'	# NO_GROUP_NAME
 check_error 'e:^12345678901234567890123456789012345678901234567890123456789012345/bar syscalls/sys_enter_openat'	# GROUP_TOO_LONG
 
@@ -19,7 +19,6 @@ check_error 'e:^ syscalls/sys_enter_openat'		# NO_EVENT_NAME
 check_error 'e:foo/^12345678901234567890123456789012345678901234567890123456789012345 syscalls/sys_enter_openat'	# EVENT_TOO_LONG
 check_error 'e:foo/^bar.1 syscalls/sys_enter_openat'	# BAD_EVENT_NAME
 
-check_error 'e:foo/bar syscalls/sys_enter_openat arg=^dfd'	# BAD_FETCH_ARG
 check_error 'e:foo/bar syscalls/sys_enter_openat arg=^$foo'	# BAD_ATTACH_ARG
 
 if grep -q '<attached-group>\.<attached-event>.*\[if <filter>\]' README; then
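
The parsing change in this patch can be summarized as a small decision
function. This is a simplified sketch, not the kernel code: the enum,
demo_classify(), and the field table are hypothetical, the field list
merely stands in for whatever fields the attached event defines, and the
function entry/return check on the BTF path is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum demo_result {
	DEMO_EVENT_FIELD,	/* bare name resolved as an event field */
	DEMO_NO_EVENT_FIELD,	/* event probe, but no such field */
	DEMO_BTF_VARIABLE,	/* bare name handed to the BTF path */
	DEMO_OTHER,		/* not a bare identifier, e.g. "$foo" */
};

/* Hypothetical field list standing in for the attached event's fields. */
static const char *const demo_fields[] = { "dfd", "filename", "flags", "mode" };

static int demo_has_field(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(demo_fields) / sizeof(demo_fields[0]); i++)
		if (strcmp(demo_fields[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Classify a fetcharg that starts with a letter or '_'. On an event
 * probe the name is looked up as an event field first; a miss is
 * reported as its own error instead of falling through to BTF.
 */
static enum demo_result demo_classify(const char *arg, int is_event_probe)
{
	assert(arg != NULL);
	if (!(isalpha((unsigned char)arg[0]) || arg[0] == '_'))
		return DEMO_OTHER;
	if (is_event_probe)
		return demo_has_field(arg) ? DEMO_EVENT_FIELD
					   : DEMO_NO_EVENT_FIELD;
	return DEMO_BTF_VARIABLE;
}
```

This mirrors the selftest change above: a bare "foo" on an eprobe now
fails with the dedicated NO_EVENT_FIELD error, while "$foo" still takes
the older path.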



* [PATCH v11 00/11] tracing/probes: Add more typecast features
From: Masami Hiramatsu (Google) @ 2026-06-26 14:14 UTC
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest

Hi,

Here is the 11th version of the series to introduce more typecast
features to probe events. The previous version is here:

 https://lore.kernel.org/all/178243982430.790911.17439694390021542101.stgit@devnote2/

In this version, I fixed minor issues and added 2 patches to fix
in-tree tools to ignore comment lines in dynamic_events [3/11][4/11].

This series extends the BTF typecast feature and adds more options:

1. Expanding BTF typecast to kprobe and fprobe.
   (currently only function entry/exit)

2. Introduce a container_of-like typecast. This adds an "assigned
   member" option to the typecast.

   (STRUCT,MEMBER)VAR->ANOTHER_MEMBER

   This casts VAR to the STRUCT type, but treats VAR as the address
   of STRUCT.MEMBER. In C, it is:

   container_of(VAR, STRUCT, MEMBER)->ANOTHER_MEMBER

3. Support nested typecast, e.g.

   (STRUCT)((STRUCT2)VAR->MEMBER2)->MEMBER

   The nesting level must be smaller than 3.

4. Add a $current variable that points to the "current" task_struct.
   This is useful with typecast, e.g.

   (task_struct)$current->pid

5. per-cpu dereference support.

   Introduce this_cpu_read(VAR) and this_cpu_ptr(VAR) to
   access per-cpu data on the current CPU (accessing another CPU's
   data is not stable, because it can change).

   You can access the member of per-cpu data structure using
   typecast like:

   (STRUCT)this_cpu_ptr(VAR)->MEMBER

6. Support event fields without $ prefix on eprobes.

   Now eprobe events can access their event fields.
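
For readers less familiar with container_of(), the C equivalent of the
typecast in item 2 can be sketched in userspace as follows. This is only
an illustration: demo_container_of() is a simplified stand-in for the
kernel macro (no type checking), and struct demo_item and struct
demo_list are hypothetical types.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical structures, used only for illustration. */
struct demo_list {
	struct demo_list *next;
};

struct demo_item {
	int id;
	struct demo_list node;	/* embedded member that VAR points at */
	int value;
};

/*
 * Equivalent in spirit to the fetcharg
 *   (demo_item,node)VAR->value
 * where VAR holds the address of the embedded "node" member.
 */
static int demo_value_from_node(struct demo_list *var)
{
	assert(var != NULL);
	return demo_container_of(var, struct demo_item, node)->value;
}
```

The cast subtracts the offset of the member from the pointer to recover
the enclosing structure, then dereferences another member as usual.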

This series also adds a fetcharg dump feature (for debugging) and
updates the test scripts to test part of these features.

Thanks,

---
base-commit: c69b5f959286395e94c237ce6d7d4970bad7f6e3

Masami Hiramatsu (Google) (11):
      tracing/probes: Allow eprobe to use variable without $ prefix
      tracing/probes: Support dumping fetcharg program for debugging dynamic events
      tools/bootconfig: Ignore comment lines in dynamic_events/kprobe_events file
      perf/probe: Ignore comment lines in dynamic_events/kprobe_events file
      tracing/probes: Support typecast for various probe events
      tracing/probes: Support nested typecast
      tracing/probes: Type casting always involves nested calls
      tracing/probes: Support field specifier option for typecast
      tracing/probes: Add $current variable support
      tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
      tracing/probes: Add a new testcase for BTF typecasts


 Documentation/trace/eprobetrace.rst                |    7 
 Documentation/trace/fprobetrace.rst                |   10 
 Documentation/trace/kprobetrace.rst                |   11 
 kernel/trace/Kconfig                               |   12 
 kernel/trace/trace.c                               |    8 
 kernel/trace/trace_eprobe.c                        |    2 
 kernel/trace/trace_fprobe.c                        |    2 
 kernel/trace/trace_kprobe.c                        |    2 
 kernel/trace/trace_probe.c                         |  585 ++++++++++++++++----
 kernel/trace/trace_probe.h                         |  100 ++-
 kernel/trace/trace_probe_tmpl.h                    |   25 +
 kernel/trace/trace_uprobe.c                        |    3 
 samples/trace_events/trace-events-sample.c         |   40 +
 samples/trace_events/trace-events-sample.h         |   34 +
 tools/bootconfig/scripts/ftrace2bconf.sh           |    2 
 tools/perf/util/probe-file.c                       |    2 
 .../ftrace/test.d/dynevent/btf_probe_event.tc      |   51 ++
 .../test.d/dynevent/btf_typecast_accepted.tc       |  107 ++++
 .../test.d/dynevent/eprobes_syntax_errors.tc       |   12 
 .../ftrace/test.d/dynevent/fprobe_syntax_errors.tc |   12 
 .../ftrace/test.d/kprobe/kprobe_syntax_errors.tc   |   12 
 .../ftrace/test.d/kprobe/uprobe_syntax_errors.tc   |    5 
 22 files changed, 890 insertions(+), 154 deletions(-)
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_typecast_accepted.tc

--
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH v5 13/24] virt/steal_monitor: Add documentation
From: Shrikanth Hegde @ 2026-06-26 14:05 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot, yury.norov,
	kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc
In-Reply-To: <20260626092806.GL1181229@noisy.programming.kicks-ass.net>



On 6/26/26 2:58 PM, Peter Zijlstra wrote:
> On Thu, Jun 25, 2026 at 06:16:37PM +0530, Shrikanth Hegde wrote:
> 
>> +Core idea:
>> +==========
>> +steal time is an indication available today in Guest which shows contention
>> +for underlying physical CPU. Use it as a hint in the guest to fold the
>> +workload to a reduced set of vCPUs. When there is contention, steal time
>> +will show up in all the guests. When each guest honors the hint and folds
>> +the workload to a smaller set of vCPUs(Preferred CPUs), it reduces the
>> +contention and thereby reduces vCPU preemption.
>> +This is achieved without any cross-guest communication.
>> +
>> +Steal monitor driver effectively does:
>> +
>> +1. Periodically computes steal time across the system.
>> +
>> +2. If steal time is greater than high threshold, reduce the number of
>> +   preferred CPUs by 1 core. Ensure at least one core is left always.
>> +   This avoids running into extreme cases.
>> +
>> +3. If steal time is lower or equal to low threshold, increase the
>> +   number of preferred CPUs by 1 core. If preferred is same as active,
>> +   nothing to be done.
>> +
>> +4. Ensure preferred CPUs is always subset of active CPUs.
>> +   On feature disable it is same as active CPUs.
> 
> 
> So this is very much a co-operative scheme. Perhaps add a few words to
> describe the effect of a non cooperative guest. IIRC the result is not
> worse than the status quo. That is, if one (or more) guests refuse to
> co-operate it will not make things worse, it will just not result in
> improvements, right?

Yes, to get the benefits, all the guests should enable the feature. If
not, one guest may use more, but if we look at the overall combined
performance, it should still be better than the status quo.

I will add a paragraph about it.
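
The four steps quoted above amount to a small hysteresis loop. The
following is a sketch of a single adjustment step under stated
assumptions: it counts whole cores, the function name and parameters are
hypothetical, and the units of the steal value and thresholds are left
to the caller. It is not taken from the driver.

```c
#include <assert.h>

/*
 * One adjustment step, counted in cores. Returns the new number of
 * preferred cores given the current steal value and both thresholds.
 */
static unsigned int demo_adjust_preferred(unsigned int preferred,
					  unsigned int active,
					  unsigned int steal,
					  unsigned int low,
					  unsigned int high)
{
	assert(active >= 1 && low <= high);

	/* Step 4: preferred is always a subset of active. */
	if (preferred > active)
		preferred = active;

	if (steal > high && preferred > 1)
		preferred--;	/* step 2: fold, keep at least one core */
	else if (steal <= low && preferred < active)
		preferred++;	/* step 3: unfold, never beyond active */

	return preferred;
}
```

Between the two thresholds the count is left alone, which keeps the
preferred set from oscillating on small changes in steal time.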



* Re: [PATCH] docs: arm64: Document that text_offset is always 0
From: Rasmus Villemoes @ 2026-06-26 13:52 UTC
  To: Mark Rutland
  Cc: linux-arm-kernel, Ard Biesheuvel, Will Deacon, Jonathan Corbet,
	linux-doc, linux-kernel
In-Reply-To: <ajVekauNroapwbtm@J2N7QTR9R3.cambridge.arm.com>

On Fri, Jun 19 2026, "Mark Rutland" <mark.rutland@arm.com> wrote:

> On Thu, Jun 04, 2026 at 04:08:39PM +0200, Rasmus Villemoes wrote:
>> When trying to figure out where to place and call an arm64 Image in
>> memory, reading booting.rst should provide the answer. However, it
>> requires quite some digging to figure out that text_offset is set via
>> ".quad 0" in head.S and is thus actually always 0 since v5.10.
>
> What is the actual problem?
>
> The documentation in booting.rst is accurate; I don't see why it's
> necessary to read the source code to look at text_offset. Immediately
> above the text in your diff, the documentation has:
>
> | 4. Call the kernel image
> | ------------------------
> |
> | Requirement: MANDATORY
> |
> | The decompressed kernel image contains a 64-byte header as follows::
> |
> |   u32 code0;                    /* Executable code */
> |   u32 code1;                    /* Executable code */
> |   u64 text_offset;              /* Image load offset, little endian */
> |   u64 image_size;               /* Effective Image size, little endian */
> |   u64 flags;                    /* kernel flags, little endian */
> |   u64 res2      = 0;            /* reserved */
> |   u64 res3      = 0;            /* reserved */
> |   u64 res4      = 0;            /* reserved */
> |   u32 magic     = 0x644d5241;   /* Magic number, little endian, "ARM\x64" */
> |   u32 res5;                     /* reserved (used for PE COFF offset) */
>
> Can you explain the problem you're facing? e.g.
>
> * Is the documentation unclear, in a way that could be better?
>
> * Is there some aspect of the boot protocol that is hard for a
>   bootloader to follow?
>
> * Is there some problem with *testing* that bootloaders respect the
>   text_offset requirements?
>
> * Something else?

Yes, the structure of the header is documented. But nowhere is it
explained how the text_offset field gets its value.

So imagine I've just built an arm64 kernel. Now I want to put that into
a FIT image, where I tell the bootloader where to place it and what
address to jump to, via the load= and entry= properties. Now, the
documentation

  The Image must be placed text_offset bytes from a 2MB aligned base
  address anywhere in usable system RAM and called there.

is clear enough that those two have to be the same value. What is not at
all clear is how I'm supposed to determine what that text_offset value is
that I'm supposed to add to some 2MB aligned address I choose.

Prior to 120dc60d0, one could at least 'git grep TEXT_OFFSET --
arch/arm64/' and see 'TEXT_OFFSET := 0x0'.

>> I've included a Fixes tag since I spent way too much time tracking
>> down where that text_offset might be defined. The mentioned commit did
>> get rid of all references to TEXT_OFFSET-the-macro, but not
>> text_offset-the-concept.
>
> Keeping text_offset as a concept was deliberate. That allows us to keep
> the documentation accurate for older kernel versions, and allows for the
> possibility that a non-zero offset is introduced in future (though I
> admit that might be a tough sell).

Fair enough. But would you at least consider adding just this part:

>> +- As of v5.10, text_offset is always 0.
>> +

One can, using the documented header, read it post-factum from the
kernel binary itself, and perhaps that's what's intended. But to answer
your first question, yes, I did find the documentation unclear and
expected to find some explicit mention of how one is supposed to know
the value of text_offset.
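
Reading the value from the binary, as mentioned above, could look
roughly like the following. This is a sketch based only on the header
layout quoted earlier in this message (text_offset at byte 8, magic at
byte 56, both little endian); the function names are hypothetical and
the caller is assumed to have read the first 64 bytes of the Image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARM64_IMAGE_MAGIC 0x644d5241u	/* "ARM\x64", from the quoted header */

/* Read a little-endian value of n bytes (n <= 8) starting at p. */
static uint64_t demo_le(const unsigned char *p, size_t n)
{
	uint64_t v = 0;

	while (n--)
		v = (v << 8) | p[n];
	return v;
}

/*
 * Extract text_offset from the first 64 bytes of an Image.
 * Returns 0 on success, -1 if the magic number does not match.
 */
static int demo_read_text_offset(const unsigned char hdr[64],
				 uint64_t *text_offset)
{
	assert(hdr != NULL && text_offset != NULL);
	if (demo_le(hdr + 56, 4) != ARM64_IMAGE_MAGIC)
		return -1;
	*text_offset = demo_le(hdr + 8, 8);
	return 0;
}

/* Load (and entry) address for a chosen 2MB-aligned base address. */
static uint64_t demo_load_address(uint64_t base, uint64_t text_offset)
{
	assert((base & ((UINT64_C(2) << 20) - 1)) == 0);
	return base + text_offset;
}
```

The result would be the value to use for both load= and entry=, since
the quoted text requires the Image to be placed and called at the same
address.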

Rasmus


* Re: [PATCH] KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE
From: Sean Christopherson @ 2026-06-26 13:40 UTC
  To: David Woodhouse
  Cc: Gerd Hoffmann, Paolo Bonzini, Jonathan Corbet, Shuah Khan,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, kvm, linux-doc, linux-kernel,
	linux-kselftest
In-Reply-To: <8edfdca645f691cb856e80ade830d78925fdc19d.camel@infradead.org>

On Fri, Jun 26, 2026, David Woodhouse wrote:
> On Thu, 2026-06-25 at 16:09 -0700, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> > > index 91fd3673c09a..c16b4560c9e7 100644
> > > --- a/arch/x86/kvm/xen.c
> > > +++ b/arch/x86/kvm/xen.c
> > > @@ -907,6 +907,13 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
> > >  {
> > >  	int idx, r = -ENOENT;
> > >  
> > > +	/*
> > > +	 * kvm_xen_write_hypercall_page() manages its own locking.
> > > +	 * Handle it before taking xen_lock to avoid a deadlock.
> > 
> > Do we actually want the side effects that necessitate taking xen.xen_lock?  From
> > a uAPI perspective, it's odd to effectively bundle KVM_XEN_ATTR_TYPE_LONG_MODE
> > into KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE.
> 
> That's *guest* ABI, and it's derived from Xen behaviour. Xen will
> 'latch' its idea of whether a guest VM is 32-bit or 64-bit, for the
> purpose of shared data structures (shared_info page, vcpu_info,
> runstate).
> 
> Xen latches this from the current mode of the running vCPU in *two*
> places:
>  • When the hypercall MSR is invoked
>  • When the guest sets the event channel GSI (HVM_PARAM_CALLBACK_IRQ).
> 
> Thus far, the former has been handled in the kernel (in the code you're
> looking at), while the latter is why we have the ioctl to explicitly
> latch the guest's long_mode from userspace too, as userspace handles
> the HVMOP_set_param calls.

Right, and I'm pointing out that from a KVM uAPI perspective, bundling the first
one in a "write hypercall page" call is rather odd, especially since there's
already uAPI to handle the latching.

> > The other question is, why does kvm_xen_write_hypercall_page() drop xen_lock
> > when writing guest memory?  That seems odd and unnecessary.
> 
> Huh? It takes the lock to do the thing that needs the lock, then drops
> it. That is not "odd and unnecessary" at all.
>
> You've been spending too long with these scope-guarded locks.

No, I'm asking why KVM doesn't serialize the writes to guest memory.  Usually
when KVM writes to guest memory, KVM is emulating something that is very much
vCPU-specific, and so if there are races it's the guest's problem to deal with.

The Xen MSR here is clearly VM-scoped though, which is why it feels odd to take
a per-VM lock, and then deliberately drop the lock before completing the operation.
In practice it shouldn't matter, since it sounds like the same repeating 16 byte
pattern will be written every time, but it was a bit head-scratching when reading
the code.

> > > +	if (data->type == KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE)
> > > +		return kvm_xen_write_hypercall_page(vcpu, data->u.gpa) ? -EIO : 0;
> > 
> > -EIO is rather weird, wouldn't -EINVAL be more appropriate?  Ah, and both are
> > wrong if copying the blob fails.
> 
> -EINVAL is more for "you asked me to do something that doesn't make sense".
> -EIO is for "something went wrong when I tried".

Sure, but KVM returns EINVAL for pretty much every ioctl (or ioctl-like thing)
if userspace provides bad input, e.g. for the @data param.
 
> Arguably, the thing that's most likely to go wrong is the
> kvm_vcpu_write_guest() where it writes instructions[] to the guest, and
> maybe that ought to be -EFAULT?

Heh, ya, I just saw that too when looking at the code again.

> But I'm not sure that's quite the right semantic to return from the ioctl?

We can/should return whatever kvm_vcpu_write_guest() returns, i.e. literally
return its result directly.  Which of course is only ever going to be -EFAULT,
but in the extremely unlikely case that ever changes, we won't have to worry
about creating misleading behavior in the Xen code.

> > >  	mutex_lock(&vcpu->kvm->arch.xen.xen_lock);
> > >  	idx = srcu_read_lock(&vcpu->kvm->srcu);
> > 
> > Speaking of writing memory, kvm_xen_write_hypercall_page() expects the caller
> > to be in a read-side SRCU critical section (I didn't actually run this with
> > PROVE_LOCKING=y, but I don't think I'm missing anything?)
> 
> Yes, good catch. Thanks.
> 
> > So, if this uAPI is unavoidable seems like we want something like the below.
> > Either that or guard all of kvm_xen_write_hypercall_page() with a lock, and put
> > the entire thing in a helper so that KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE
> > can be handled in a case-statement and doesn't need to grab SRCU on its own.
> 
> Makes sense (with the test, of course). Want me to put them together
> and resend?

Yes please.

^ permalink raw reply

* Re: [PATCH v5 04/24] cpumask: Introduce cpu_preferred_mask
From: Shrikanth Hegde @ 2026-06-26 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot, yury.norov,
	kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc
In-Reply-To: <20260626093414.GM1181229@noisy.programming.kicks-ass.net>



On 6/26/26 3:04 PM, Peter Zijlstra wrote:
> On Thu, Jun 25, 2026 at 06:16:28PM +0530, Shrikanth Hegde wrote:
>> This patch does
>> - Declare and Define cpu_preferred_mask.
>> - Get/Set helpers for it.
> 
> There is a blurb in submitting-patches.rst about how 'this patch' is
> basically a red-flag for a changelog.
> 
> The changelog is per-definition pertaining to 'this patch', therefore
> stating this is a tautology. Further, it is often fairly clear what the
> patch does, but less clear as to why.
> 
> So the suggestion is to phrase this like:
> 
> Provide cpu_preferred_mask infrastructure (definitions, declarations and
> helper methods) to facilitate ....
> 
> 

Ok. My bad, I will update the changelog. Thanks for catching it.

^ permalink raw reply

* Re: [PATCH v5 09/24] sched/fair: Pull the load on preferred CPU
From: Shrikanth Hegde @ 2026-06-26 13:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot, yury.norov,
	kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc
In-Reply-To: <20260626100043.GP1181229@noisy.programming.kicks-ass.net>



On 6/26/26 3:30 PM, Peter Zijlstra wrote:
> On Thu, Jun 25, 2026 at 06:16:33PM +0530, Shrikanth Hegde wrote:
> 
>> @@ -14375,6 +14379,10 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
>>   	if (!cpu_active(this_cpu))
>>   		return 0;
>>   
>> +	/* Do not pull to a !preferred CPU just to push it out next */
>> +	if (!cpu_preferred(this_cpu))
>> +		return 0;
>> +
>>   	/*
>>   	 * This is OK, because current is on_cpu, which avoids it being picked
>>   	 * for load-balance and preemption/IRQs are still disabled avoiding
> 
> Why not just replace the cpu_active() check above?

Ok, that should be fine. I will add a comment there.

^ permalink raw reply

* Re: [PATCH v5 04/24] cpumask: Introduce cpu_preferred_mask
From: Shrikanth Hegde @ 2026-06-26 13:27 UTC (permalink / raw)
  To: Yury Norov
  Cc: Peter Zijlstra, linux-kernel, mingo, juri.lelli, vincent.guittot,
	kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc
In-Reply-To: <aj58DG3BLf3YWPyg@yury>



On 6/26/26 6:48 PM, Yury Norov wrote:
> On Fri, Jun 26, 2026 at 06:39:48PM +0530, Shrikanth Hegde wrote:
>> Hi Peter, Yury.
>>
>> On 6/26/26 3:11 PM, Peter Zijlstra wrote:
>>> On Fri, Jun 26, 2026 at 11:39:01AM +0200, Peter Zijlstra wrote:
>>>> On Thu, Jun 25, 2026 at 06:16:28PM +0530, Shrikanth Hegde wrote:
>>>>
>>>>> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
>>>>> index 80211900f373..5a643d608ea6 100644
>>>>> --- a/include/linux/cpumask.h
>>>>> +++ b/include/linux/cpumask.h
>>>>> @@ -120,12 +120,20 @@ extern struct cpumask __cpu_enabled_mask;
>>>>>    extern struct cpumask __cpu_present_mask;
>>>>>    extern struct cpumask __cpu_active_mask;
>>>>>    extern struct cpumask __cpu_dying_mask;
>>>>> +
>>>>> +#ifdef CONFIG_PREFERRED_CPU
>>>>> +extern struct cpumask __cpu_preferred_mask;
>>>>> +#else
>>>>> +#define __cpu_preferred_mask __cpu_active_mask
>>>>> +#endif
>>>>
>>>> This is cure, but does it not result in set_cpu_preferred() changing
>>> s/cure/cute/
>>>> active mask, and it that not somewhat unexpected behaviour?
>>> s/it/is/
>>>
>>
>> Yes. I thought about this, but i didn't see anything bad happening apart from
>> setting it twice. But I do agree, it is an eyesore when CONFIG_PREFERRED_CPU=n.
>>
>>> Typing hard, clearly. Also hitting 30C before noon :-(
>>>
>>
>> Take care. Even we should have had monsoon by now.
>> But its bright sunshine :(
>>
>>>
>>
>> For this reason, i had it as a function instead of macro in v4.
>> Do you think we can still fallback to it?
>>
>> only caveat is it won't be a macro. But since it is still compile
>> time optimized due to IS_ENABLED, it should be relatively ok right?
>>
>> +void set_cpu_preferred(unsigned int cpu, bool preferred)
>> +{
>> +	if (!IS_ENABLED(CONFIG_PREFERRED_CPU))
>> +		return;
>> +
>> +	assign_cpu((cpu), &__cpu_preferred_mask, (preferred));
>> +}
> 
>   #ifdef CONFIG_PREFERRED_CPU
>   #define set_cpu_preferred(cpu, preferred) assign_cpu((cpu), &__cpu_preferred_mask, (preferred))
>   #else
>   #define set_cpu_preferred(cpu, preferred) {}
>   #endif
> 

Ah! thanks.

^ permalink raw reply

* Re: [PATCH v5 06/24] sched/core: allow only preferred CPUs in is_cpu_allowed
From: Shrikanth Hegde @ 2026-06-26 13:25 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc
In-Reply-To: <aj55TIWls4HKYj5b@yury>

Hi Yury. Thanks for going through the patches.

On 6/26/26 6:36 PM, Yury Norov wrote:
> On Thu, Jun 25, 2026 at 06:16:30PM +0530, Shrikanth Hegde wrote:
>> When possible, choose a preferred CPUs to pick.
>>
>> Push task mechanism uses stopper thread which going to call
>> select_fallback_rq and use this mechanism to pick only a preferred CPU.
>>
>> When task is affined only to non-preferred CPUs it should continue to
>> run there. Detect that by checking if cpus_ptr and cpu_preferred_mask
>> intersect or not.
>>
>> Since is_cpu_allowed can be called directly or repeatedly in
>> select_fallback_rq, encode the info in task_struct->has_preferred_cpu_state
>> if the path is via select_fallback_rq or not.
>> This helps to avoid N**2 complexity for the rare cases.
>>
>> Additional overhead of O(N) comes to is_cpu_allowed only when cpu is not
>> preferred. So in normal scenarios overhead is only a bit check.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>> v4->v5:
>> - Do simple encoding of -1,0,1 instead (K Prateek Nayak)
>> - Make it s8 (K Prateek Nayak)
>> - Update changelog to address sashiko concerns of overhead.
>>
>>   include/linux/sched.h |  1 +
>>   kernel/sched/core.c   | 35 +++++++++++++++++++++++++++++++++--
>>   kernel/sched/sched.h  | 25 +++++++++++++++++++++++++
>>   3 files changed, 59 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index fc6ecb3869dd..27dbf676113e 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1657,6 +1657,7 @@ struct task_struct {
>>   #ifdef CONFIG_UNWIND_USER
>>   	struct unwind_task_info		unwind_info;
>>   #endif
>> +	s8				has_preferred_cpu_state;
> 
> Why not protected with the config?

Ok, I will add it. I thought it would mean too many ifdefs because of its
use in the function below.

> 
> It looks like you didn't ever ran pahole on it. Maybe it's worth to
> try now?

I did. This is what I saw on powerpc. It did fit in the available cacheline.

	struct bpf_net_context *   bpf_net_context;      /*  4736     8 */
	struct llist_head          kretprobe_instances;  /*  4744     8 */
	struct llist_head          rethooks;             /*  4752     8 */
	union rv_task_monitor      rv[2];                /*  4760    16 */
	s8                         has_preferred_cpu_state; /*  4776     1 */

	/* XXX 7 bytes hole, try to pack */

	struct thread_struct       thread;               /*  4784  2864 */


> 
>>   	/* CPU-specific state of this task: */
>>   	struct thread_struct		thread;
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 9e16946c9d62..281715a6e88f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2500,6 +2500,8 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
>>    */
>>   static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>>   {
>> +	bool task_check_preferred_cpu;
>> +
>>   	/* When not in the task's cpumask, no point in looking further. */
>>   	if (!task_allowed_on_cpu(p, cpu))
>>   		return false;
>> @@ -2508,9 +2510,23 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>>   	if (is_migration_disabled(p))
>>   		return cpu_online(cpu);
>>   
>> +	/*
>> +	 * This is essential to maintain user affinities when preferred
>> +	 * CPUs change. A task pinned on non-preferred CPU should continue
>> +	 * to run there, since this is non-user triggered.
>> +	 *
>> +	 * If CPU is non-preferred and task can run on other CPUs which are
>> +	 * currently preferred, then choose those other CPUs instead.
>> +	 * Overhead is minimal when CPU is preferred.
>> +	 */
>> +	task_check_preferred_cpu = !cpu_preferred(cpu) && task_has_preferred_cpus(p);
>> +
>>   	/* Non kernel threads are not allowed during either online or offline. */
>> -	if (!(p->flags & PF_KTHREAD))
>> +	if (!(p->flags & PF_KTHREAD)) {
>> +		if (task_check_preferred_cpu)
>> +			return false;
>>   		return cpu_active(cpu);
>> +	}
>>   
>>   	/* KTHREAD_IS_PER_CPU is always allowed. */
>>   	if (kthread_is_per_cpu(p))
>> @@ -2520,6 +2536,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>>   	if (cpu_dying(cpu))
>>   		return false;
>>   
>> +	/* Try on preferred CPU first if possible*/
>> +	if (task_check_preferred_cpu)
>> +		return false;
>> +
>>   	/* But are allowed during online. */
>>   	return cpu_online(cpu);
>>   }
>> @@ -3549,6 +3569,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>>   	enum { cpuset, possible, fail } state = cpuset;
>>   	int dest_cpu;
>>   
>> +	/*
>> +	 * Cache the value whether task's affinity spans preferred CPUs.
>> +	 * This helps to avoid repeating the same for each CPU
>> +	 * later in the loop. Encode call to is_cpu_allowed coming
>> +	 * via select_fallback_rq.
>> +	 */
>> +	p->has_preferred_cpu_state = task_has_preferred_cpus(p) ? 1 : -1;
>> +
>>   	/*
>>   	 * If the node that the CPU is on has been offlined, cpu_to_node()
>>   	 * will return -1. There is no CPU on the node, and we should
>> @@ -3560,7 +3588,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>>   		/* Look for allowed, online CPU in same node. */
>>   		for_each_cpu(dest_cpu, nodemask) {
>>   			if (is_cpu_allowed(p, dest_cpu))
>> -				return dest_cpu;
>> +				goto clear_and_return;
>>   		}
>>   	}
>>   
>> @@ -3604,6 +3632,8 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>>   		}
>>   	}
>>   
>> +clear_and_return:
>> +	p->has_preferred_cpu_state = 0;
> 
> Sadly, you've ignored my comments from the previous round. Let me repeat
> it once again:
> 
> This ->has_preferred_cpu_state is always zero out of the scope of the
> function. It means, it's a local variable, and should not belong to
> the task_struct.

Ok. Making it a separate variable is better. I will make the change accordingly.

> 
>>   	return dest_cpu;
>>   }
>>   
>> @@ -4612,6 +4642,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
>>   	init_numa_balancing(clone_flags, p);
>>   	p->wake_entry.u_flags = CSD_TYPE_TTWU;
>>   	p->migration_pending = NULL;
>> +	p->has_preferred_cpu_state = 0;
>>   	init_sched_mm(p);
>>   }
>>   
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index c7c2dea65edd..5d009c2529b2 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -4213,4 +4213,29 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
>>   
>>   #include "ext.h"
>>   
>> +/*
>> + * has_preferred_cpu_state could have the value cached from
>> + * select_fallback_rq. It is set/cleared while holding pi_lock
>> + * and irq disabled.
>> + *
>> + *  1: Cached and preferred CPUs exists in task's affinity.
>> + *  0: Not cached and need to evaluate.
>> + * -1: Cached and preferred CPU doesn't exits task's affinity
> 
> So, you've got 3 options to declare the status: self-explaining enum,
> self-explaining #defines, and this random numbers explained in
> comment. The latter option is the worst to me.

Ok. I will define the enums.

> 
> And you didn't provide any benchmark advocating this caching
> optimization.
> 
> Sorry, but NAK.
> 

If we move to a local variable then this won't be necessary;
just enums would be enough (I think). Let me go stare at it.
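
A rough userspace sketch of what "local variable plus enum" could look like.
Everything here is hypothetical: the enum names, the toy_*() helpers and the
plain unsigned long bitmasks standing in for cpumasks are assumptions for
illustration, and the sched_class, kthread and hotplug checks are left out.

```c
#include <stdbool.h>

/* Hypothetical names replacing the -1/0/1 encoding. */
enum pref_state {
	PREF_UNKNOWN = 0,	/* not evaluated yet */
	PREF_PRESENT,		/* affinity intersects the preferred mask */
	PREF_ABSENT,		/* affinity contains no preferred CPU */
};

static enum pref_state eval_pref_state(unsigned long affinity,
				       unsigned long preferred)
{
	return (affinity & preferred) ? PREF_PRESENT : PREF_ABSENT;
}

/* Toy is_cpu_allowed(): the cached state is passed in by the caller. */
static bool toy_cpu_allowed(unsigned long affinity, unsigned long preferred,
			    unsigned int cpu, enum pref_state state)
{
	unsigned long bit = 1UL << cpu;

	if (!(affinity & bit))
		return false;
	if (state == PREF_UNKNOWN)
		state = eval_pref_state(affinity, preferred);
	/* Reject a non-preferred CPU only if a preferred one is available. */
	if (!(preferred & bit) && state == PREF_PRESENT)
		return false;
	return true;
}

/* Toy select_fallback_rq(): evaluate once, reuse for every candidate CPU. */
static int toy_select_fallback(unsigned long affinity, unsigned long preferred,
			       unsigned int nr_cpus)
{
	enum pref_state state = eval_pref_state(affinity, preferred);
	unsigned int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		if (toy_cpu_allowed(affinity, preferred, cpu, state))
			return (int)cpu;
	return -1;
}
```

The state lives on the caller's stack, so nothing has to be stored in or
cleared from task_struct, and a direct caller can pass PREF_UNKNOWN to get
the uncached evaluation.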

^ permalink raw reply

* Re: [PATCH v5 04/24] cpumask: Introduce cpu_preferred_mask
From: Shrikanth Hegde @ 2026-06-26 13:18 UTC (permalink / raw)
  To: Yury Norov, Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot, kprateek.nayak,
	iii, corbet, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <aj5zRBrQJG-cxs0_@yury>



On 6/26/26 6:10 PM, Yury Norov wrote:
> On Fri, Jun 26, 2026 at 11:39:01AM +0200, Peter Zijlstra wrote:
>> On Thu, Jun 25, 2026 at 06:16:28PM +0530, Shrikanth Hegde wrote:
>>
>>> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
>>> index 80211900f373..5a643d608ea6 100644
>>> --- a/include/linux/cpumask.h
>>> +++ b/include/linux/cpumask.h
>>> @@ -120,12 +120,20 @@ extern struct cpumask __cpu_enabled_mask;
>>>   extern struct cpumask __cpu_present_mask;
>>>   extern struct cpumask __cpu_active_mask;
>>>   extern struct cpumask __cpu_dying_mask;
>>> +
>>> +#ifdef CONFIG_PREFERRED_CPU
>>> +extern struct cpumask __cpu_preferred_mask;
>>> +#else
>>> +#define __cpu_preferred_mask __cpu_active_mask
>>> +#endif
>>
>> This is cure, but does it not result in set_cpu_preferred() changing
>> active mask, and it that not somewhat unexpected behaviour?
> 
> I agree, and I think I already commented on it on previous round.
> set_cpu_preferred() should be protected the same way as the
> corresponding mask, and should be a NOP when CONFIG_PREFERRED_CPU
> is disabled.
> 
>>>   #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
>>>   #define cpu_online_mask   ((const struct cpumask *)&__cpu_online_mask)
>>>   #define cpu_enabled_mask   ((const struct cpumask *)&__cpu_enabled_mask)
>>>   #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
>>>   #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
>>>   #define cpu_dying_mask    ((const struct cpumask *)&__cpu_dying_mask)
>>> +#define cpu_preferred_mask ((const struct cpumask *)&__cpu_preferred_mask)
>>>   
>>>   extern atomic_t __num_online_cpus;
>>>   extern unsigned int __num_possible_cpus;
>>
>>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>>> index bc4f7a9ba64e..d623a9c5554a 100644
>>> --- a/kernel/cpu.c
>>> +++ b/kernel/cpu.c
>>> @@ -3107,6 +3107,11 @@ EXPORT_SYMBOL(__cpu_dying_mask);
>>>   atomic_t __num_online_cpus __read_mostly;
>>>   EXPORT_SYMBOL(__num_online_cpus);
>>>   
>>> +#ifdef CONFIG_PREFERRED_CPU
>>> +struct cpumask __cpu_preferred_mask __read_mostly;
>>> +EXPORT_SYMBOL(__cpu_preferred_mask);
>>> +#endif
>>
>> Precedent is definitely towards !GPL exports for this, but could we get
>> away with making this one GPL?
>>
>>
>>> @@ -3164,6 +3169,7 @@ void __init boot_cpu_init(void)
>>>   	/* Mark the boot cpu "present", "online" etc for SMP and UP case */
>>>   	set_cpu_online(cpu, true);
>>>   	set_cpu_active(cpu, true);
>>> +	set_cpu_preferred(cpu, true);
>>
>> This sets active twice, which is harmless, but wasteful...
> 
> I think, the good criteria for correctness of this series would be the
> identical binaries before the series, and when CONFIG_PREFERRED_CPU is
> off. At least, as a mental model. This double-set chunk breaks that
> model.
> 

Sorry, I didn't get how the comparison would be done.
Do you mean bloat-o-meter, kernel/cpu.o size, or vmlinux file size?

That would mean everything should be under ifdef CONFIG_PREFERRED_CPU,
no? That was the case a few versions earlier, and it did not look
good because of the many ifdefs.

If we fix set_cpu_preferred to be a NOP when CONFIG_PREFERRED_CPU=n and
the driver depends on it, I think we should be good.

What do you think?


^ permalink raw reply

* Re: [PATCH v5 04/24] cpumask: Introduce cpu_preferred_mask
From: Yury Norov @ 2026-06-26 13:18 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Peter Zijlstra, yury.norov, linux-kernel, mingo, juri.lelli,
	vincent.guittot, kprateek.nayak, iii, corbet, tglx, gregkh,
	pbonzini, seanjc, vschneid, huschle, rostedt, dietmar.eggemann,
	maddy, srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc
In-Reply-To: <79e85557-719c-4fc8-98ad-7bdcc6add753@linux.ibm.com>

On Fri, Jun 26, 2026 at 06:39:48PM +0530, Shrikanth Hegde wrote:
> Hi Peter, Yury.
> 
> On 6/26/26 3:11 PM, Peter Zijlstra wrote:
> > On Fri, Jun 26, 2026 at 11:39:01AM +0200, Peter Zijlstra wrote:
> > > On Thu, Jun 25, 2026 at 06:16:28PM +0530, Shrikanth Hegde wrote:
> > > 
> > > > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > > > index 80211900f373..5a643d608ea6 100644
> > > > --- a/include/linux/cpumask.h
> > > > +++ b/include/linux/cpumask.h
> > > > @@ -120,12 +120,20 @@ extern struct cpumask __cpu_enabled_mask;
> > > >   extern struct cpumask __cpu_present_mask;
> > > >   extern struct cpumask __cpu_active_mask;
> > > >   extern struct cpumask __cpu_dying_mask;
> > > > +
> > > > +#ifdef CONFIG_PREFERRED_CPU
> > > > +extern struct cpumask __cpu_preferred_mask;
> > > > +#else
> > > > +#define __cpu_preferred_mask __cpu_active_mask
> > > > +#endif
> > > 
> > > This is cure, but does it not result in set_cpu_preferred() changing
> > s/cure/cute/
> > > active mask, and it that not somewhat unexpected behaviour?
> > s/it/is/
> > 
> 
> Yes. I thought about this, but i didn't see anything bad happening apart from
> setting it twice. But I do agree, it is an eyesore when CONFIG_PREFERRED_CPU=n.
> 
> > Typing hard, clearly. Also hitting 30C before noon :-(
> > 
> 
> Take care. Even we should have had monsoon by now.
> But its bright sunshine :(
> 
> > 
> 
> For this reason, i had it as a function instead of macro in v4.
> Do you think we can still fallback to it?
> 
> only caveat is it won't be a macro. But since it is still compile
> time optimized due to IS_ENABLED, it should be relatively ok right?
> 
> +void set_cpu_preferred(unsigned int cpu, bool preferred)
> +{
> +	if (!IS_ENABLED(CONFIG_PREFERRED_CPU))
> +		return;
> +
> +	assign_cpu((cpu), &__cpu_preferred_mask, (preferred));
> +}

 #ifdef CONFIG_PREFERRED_CPU
 #define set_cpu_preferred(cpu, preferred) assign_cpu((cpu), &__cpu_preferred_mask, (preferred))
 #else
 #define set_cpu_preferred(cpu, preferred) {}
 #endif
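
One possible variation on the above, sketched in userspace under stated
assumptions: a static inline stub instead of an empty-braces macro. The
toy_*() names, the TOY_PREFERRED_CPU switch and the single-word bitmap are
hypothetical stand-ins, not the kernel's cpumask API. The motivation is that
a `{}` expansion followed by the caller's `;` does not compile in front of
an `else`, while an inline stub behaves like a normal call and still
type-checks its arguments.

```c
#include <stdbool.h>

/* Toy stand-in for struct cpumask: one word of CPU bits. */
struct toy_cpumask {
	unsigned long bits;
};

static struct toy_cpumask toy_active_mask;

/* Toy stand-in for assign_cpu(): set or clear one CPU bit in a mask. */
static inline void toy_assign_cpu(unsigned int cpu, struct toy_cpumask *mask,
				  bool val)
{
	if (val)
		mask->bits |= 1UL << cpu;
	else
		mask->bits &= ~(1UL << cpu);
}

#ifdef TOY_PREFERRED_CPU
static struct toy_cpumask toy_preferred_mask;

static inline void toy_set_cpu_preferred(unsigned int cpu, bool preferred)
{
	toy_assign_cpu(cpu, &toy_preferred_mask, preferred);
}
#else
/* Feature off: a type-checked no-op that never touches the active mask. */
static inline void toy_set_cpu_preferred(unsigned int cpu, bool preferred)
{
	(void)cpu;
	(void)preferred;
}
#endif
```

Built without TOY_PREFERRED_CPU, calling toy_set_cpu_preferred() leaves
toy_active_mask untouched, which is the behaviour asked for when the
config option is disabled.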


^ permalink raw reply

* Re: [PATCH v5 07/24] sched/fair: Select preferred CPU at wakeup when possible
From: Shrikanth Hegde @ 2026-06-26 13:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot, yury.norov,
	kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini, seanjc,
	vschneid, huschle, rostedt, dietmar.eggemann, maddy, srikar,
	hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc
In-Reply-To: <20260626095948.GO1181229@noisy.programming.kicks-ass.net>

Hi Peter, thank you very much for going through the patches.

On 6/26/26 3:29 PM, Peter Zijlstra wrote:
> On Thu, Jun 25, 2026 at 06:16:31PM +0530, Shrikanth Hegde wrote:
>> Update available_idle_cpu to consider preferred CPUs. This takes care of
>> lot of decisions at wakeup to use only preferred CPUs. There is no need to
>> put those explicit checks everywhere.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   kernel/sched/sched.h | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 5d009c2529b2..148fe6145f1a 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1434,6 +1434,9 @@ static inline bool available_idle_cpu(int cpu)
>>   	if (!idle_rq(cpu_rq(cpu)))
>>   		return 0;
>>   
>> +	if (!cpu_preferred(cpu))
>> +		return 0;
>> +
>>   	if (vcpu_is_preempted(cpu))
>>   		return 0;
>>   
> 
> This one might hurt, it is a whole extra cacheline in otherwise already
> sensitive (wakeup) paths.
> 

Yes, this could be costly. If wakeup returns a non-preferred CPU,
is_cpu_allowed would catch it.
So, I think we can avoid the repeated computation in available_idle_cpu.

Let me see whether removing it still moves tasks out fast enough,
and whether the numbers stay close to those with the check in place.

^ permalink raw reply

* Re: [PATCH v5 04/24] cpumask: Introduce cpu_preferred_mask
From: Shrikanth Hegde @ 2026-06-26 13:09 UTC (permalink / raw)
  To: Peter Zijlstra, yury.norov
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot, kprateek.nayak,
	iii, corbet, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
	rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
	vineeth, frederic, arighi, pauld, christian.loehle, tj,
	tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260626094153.GD2568396@noisy.programming.kicks-ass.net>

Hi Peter, Yury.

On 6/26/26 3:11 PM, Peter Zijlstra wrote:
> On Fri, Jun 26, 2026 at 11:39:01AM +0200, Peter Zijlstra wrote:
>> On Thu, Jun 25, 2026 at 06:16:28PM +0530, Shrikanth Hegde wrote:
>>
>>> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
>>> index 80211900f373..5a643d608ea6 100644
>>> --- a/include/linux/cpumask.h
>>> +++ b/include/linux/cpumask.h
>>> @@ -120,12 +120,20 @@ extern struct cpumask __cpu_enabled_mask;
>>>   extern struct cpumask __cpu_present_mask;
>>>   extern struct cpumask __cpu_active_mask;
>>>   extern struct cpumask __cpu_dying_mask;
>>> +
>>> +#ifdef CONFIG_PREFERRED_CPU
>>> +extern struct cpumask __cpu_preferred_mask;
>>> +#else
>>> +#define __cpu_preferred_mask __cpu_active_mask
>>> +#endif
>>
>> This is cure, but does it not result in set_cpu_preferred() changing
> s/cure/cute/
>> active mask, and it that not somewhat unexpected behaviour?
> s/it/is/
> 

Yes. I thought about this, but I didn't see anything bad happening apart from
setting it twice. But I do agree, it is an eyesore when CONFIG_PREFERRED_CPU=n.

> Typing hard, clearly. Also hitting 30C before noon :-(
> 

Take care. We should have had the monsoon by now too,
but it's bright sunshine :(

> 

For this reason, I had it as a function instead of a macro in v4.
Do you think we can still fall back to it?

The only caveat is that it won't be a macro. But since it is still optimized
at compile time due to IS_ENABLED, it should be relatively OK, right?

+void set_cpu_preferred(unsigned int cpu, bool preferred)
+{
+	if (!IS_ENABLED(CONFIG_PREFERRED_CPU))
+		return;
+
+	assign_cpu((cpu), &__cpu_preferred_mask, (preferred));
+}

^ permalink raw reply

* Re: [PATCH v5 06/24] sched/core: allow only preferred CPUs in is_cpu_allowed
From: Yury Norov @ 2026-06-26 13:06 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	yury.norov, kprateek.nayak, iii, corbet, tglx, gregkh, pbonzini,
	seanjc, vschneid, huschle, rostedt, dietmar.eggemann, maddy,
	srikar, hdanton, chleroy, vineeth, frederic, arighi, pauld,
	christian.loehle, tj, tommaso.cucinotta, maz, rafael, rdunlap,
	kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-7-sshegde@linux.ibm.com>

On Thu, Jun 25, 2026 at 06:16:30PM +0530, Shrikanth Hegde wrote:
> When possible, choose a preferred CPUs to pick.
> 
> Push task mechanism uses stopper thread which going to call
> select_fallback_rq and use this mechanism to pick only a preferred CPU.
> 
> When task is affined only to non-preferred CPUs it should continue to
> run there. Detect that by checking if cpus_ptr and cpu_preferred_mask
> intersect or not.
> 
> Since is_cpu_allowed can be called directly or repeatedly in
> select_fallback_rq, encode the info in task_struct->has_preferred_cpu_state
> if the path is via select_fallback_rq or not.
> This helps to avoid N**2 complexity for the rare cases.
> 
> Additional overhead of O(N) comes to is_cpu_allowed only when cpu is not
> preferred. So in normal scenarios overhead is only a bit check.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> v4->v5:
> - Do simple encoding of -1,0,1 instead (K Prateek Nayak)
> - Make it s8 (K Prateek Nayak)
> - Update changelog to address sashiko concerns of overhead.
> 
>  include/linux/sched.h |  1 +
>  kernel/sched/core.c   | 35 +++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h  | 25 +++++++++++++++++++++++++
>  3 files changed, 59 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index fc6ecb3869dd..27dbf676113e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1657,6 +1657,7 @@ struct task_struct {
>  #ifdef CONFIG_UNWIND_USER
>  	struct unwind_task_info		unwind_info;
>  #endif
> +	s8				has_preferred_cpu_state;

Why not protected with the config?

It looks like you never ran pahole on it. Maybe it's worth
trying now?

>  	/* CPU-specific state of this task: */
>  	struct thread_struct		thread;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9e16946c9d62..281715a6e88f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2500,6 +2500,8 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
>   */
>  static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>  {
> +	bool task_check_preferred_cpu;
> +
>  	/* When not in the task's cpumask, no point in looking further. */
>  	if (!task_allowed_on_cpu(p, cpu))
>  		return false;
> @@ -2508,9 +2510,23 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>  	if (is_migration_disabled(p))
>  		return cpu_online(cpu);
>  
> +	/*
> +	 * This is essential to maintain user affinities when preferred
> +	 * CPUs change. A task pinned on non-preferred CPU should continue
> +	 * to run there, since this is non-user triggered.
> +	 *
> +	 * If CPU is non-preferred and task can run on other CPUs which are
> +	 * currently preferred, then choose those other CPUs instead.
> +	 * Overhead is minimal when CPU is preferred.
> +	 */
> +	task_check_preferred_cpu = !cpu_preferred(cpu) && task_has_preferred_cpus(p);
> +
>  	/* Non kernel threads are not allowed during either online or offline. */
> -	if (!(p->flags & PF_KTHREAD))
> +	if (!(p->flags & PF_KTHREAD)) {
> +		if (task_check_preferred_cpu)
> +			return false;
>  		return cpu_active(cpu);
> +	}
>  
>  	/* KTHREAD_IS_PER_CPU is always allowed. */
>  	if (kthread_is_per_cpu(p))
> @@ -2520,6 +2536,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>  	if (cpu_dying(cpu))
>  		return false;
>  
> +	/* Try on preferred CPU first if possible*/
> +	if (task_check_preferred_cpu)
> +		return false;
> +
>  	/* But are allowed during online. */
>  	return cpu_online(cpu);
>  }
> @@ -3549,6 +3569,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>  	enum { cpuset, possible, fail } state = cpuset;
>  	int dest_cpu;
>  
> +	/*
> +	 * Cache the value whether task's affinity spans preferred CPUs.
> +	 * This helps to avoid repeating the same for each CPU
> +	 * later in the loop. Encode call to is_cpu_allowed coming
> +	 * via select_fallback_rq.
> +	 */
> +	p->has_preferred_cpu_state = task_has_preferred_cpus(p) ? 1 : -1;
> +
>  	/*
>  	 * If the node that the CPU is on has been offlined, cpu_to_node()
>  	 * will return -1. There is no CPU on the node, and we should
> @@ -3560,7 +3588,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>  		/* Look for allowed, online CPU in same node. */
>  		for_each_cpu(dest_cpu, nodemask) {
>  			if (is_cpu_allowed(p, dest_cpu))
> -				return dest_cpu;
> +				goto clear_and_return;
>  		}
>  	}
>  
> @@ -3604,6 +3632,8 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>  		}
>  	}
>  
> +clear_and_return:
> +	p->has_preferred_cpu_state = 0;

Sadly, you've ignored my comments from the previous round. Let me repeat
it once again:

This ->has_preferred_cpu_state is always zero outside the scope of the
function. That means it is effectively a local variable, and should not
belong to the task_struct.

>  	return dest_cpu;
>  }
>  
> @@ -4612,6 +4642,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
>  	init_numa_balancing(clone_flags, p);
>  	p->wake_entry.u_flags = CSD_TYPE_TTWU;
>  	p->migration_pending = NULL;
> +	p->has_preferred_cpu_state = 0;
>  	init_sched_mm(p);
>  }
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c7c2dea65edd..5d009c2529b2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -4213,4 +4213,29 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
>  
>  #include "ext.h"
>  
> +/*
> + * has_preferred_cpu_state could have the value cached from
> + * select_fallback_rq. It is set/cleared while holding pi_lock
> + * and irq disabled.
> + *
> + *  1: Cached and preferred CPUs exists in task's affinity.
> + *  0: Not cached and need to evaluate.
> + * -1: Cached and preferred CPU doesn't exits task's affinity

So, you've got 3 options to declare the status: a self-explanatory enum,
self-explanatory #defines, or these random numbers explained in a
comment. The last option is the worst to me.

And you didn't provide any benchmark advocating this caching
optimization.

Sorry, but NAK.

> + *
> + * Only affects FAIR task.
> + */
> +static inline bool task_has_preferred_cpus(struct task_struct *p)
> +{
> +	int cached;
> +
> +	/* Only FAIR tasks honor preferred CPU state */
> +	if (unlikely(p->sched_class != &fair_sched_class))
> +		return false;
> +
> +	cached = READ_ONCE(p->has_preferred_cpu_state);
> +	if (cached)
> +		return cached > 0;
> +	else
> +		return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
> +}
>  #endif /* _KERNEL_SCHED_SCHED_H */
> -- 
> 2.47.3

^ permalink raw reply

* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Maxime Chevallier @ 2026-06-26 12:51 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
	Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
	Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
	thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <5b7dbdbc-93fd-4664-abad-0f47855fab55@lunn.ch>

Hi,

On 6/26/26 14:39, Andrew Lunn wrote:
> On Fri, Jun 26, 2026 at 10:33:50AM +0200, Maxime Chevallier wrote:
>>
>>> Sphinx follows Python's object-oriented structure. So you could have a
>>> class test_ethtool_pause_advertising, with class documentation. And
>>> then methods within the class which are individual tests.  The
>>> commented out section would then be method documentation.
>>
>> Good point, so maybe something along these lines:
>>
>>  - A class for the test
>>  - methods for individual tests
>>  - For readability, I've written what the internal test helper would look
>>    like (_adv_test), and what a test would look like without the helper in
>>    adv_rx_on_tx_on().
>>
>> I'm already diving into coding, but it helps me a bit in the definition of the
>> "description" format :)
>>
>> This is what the class would look like:
> 
> I like this :-)

Great :)

> 
>>
>>
>>     @ksft_ethtool_needs_supported_allof([Pause])
>>     def adv_rx_on_tx_on(cfg, peer) -> None:
> 
> Using decorators is a nice idea. Since it is not a C concept, please
> give the decorator a good comment explaining what it does. We should
> not assume driver developers know Python.

No problem, I'll add that

> 
>>         """Advertising test with rx on tx on
>>
>>         - run 'ethtool -A ethX rx on tx on autoneg on'
>>         - FAIL if the return isn't 0
>>         - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values do not contain
>>           "Pause" or contains "Asym_Pause"
>>         - FAIL if peer's lp_advertising doesn't contain "Pause" or contains
>>           "Asym_Pause"
>>         - Succeed otherwise
>>         """
>>         ret = cfg.run('ethtool -A ethX rx on tx on autoneg on')
>>         ksft_eq(ret, 0)
>>
>>         linkmodes = cfg.get_advertising()
>>         ksft_in('Pause', linkmodes, "rx on tx on must advertise Pause")
>>         ksft_not_in('Asym_Pause', linkmodes, "rx on tx on must not advertise Asym_Pause")
>>
>>         remote_linkmodes = peer.get_lp_advertising()
>>         ksft_in('Pause', remote_linkmodes, "PHY does not advertise Pause")
>>         ksft_not_in('Asym_Pause', remote_linkmodes, "PHY incorrectly advertises Asym_Pause")
> 
> There should be a sleep in here somewhere, to allow the autoneg to
> complete.

Indeed, I think in the end this will be wrapped by some ksft_ethtool_* helper we'll add,
that will also deal with the case where autoneg doesn't succeed and the link stays down.

That's partly for error detection, but I also expect there will be cases where we'll
want to test that autoneg does not actually succeed.

Good to see we're closing in on a definition, I'll spin V2 based on that format :)

Maxime


^ permalink raw reply

* [PATCH v7 9/9] init/main.c: use bootconfig_cmdline_requested() for the runtime opt-in
From: Breno Leitao @ 2026-06-26 12:50 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier, Nick Desaulniers, Bill Wendling, Justin Stitt,
	Jonathan Corbet, Shuah Khan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, llvm, linux-doc, Breno Leitao, kernel-team
In-Reply-To: <20260626-bootconfig_using_tools-v7-0-24ab72139c29@debian.org>

setup_boot_config() open-coded the same "is bootconfig requested on the
kernel command line?" check that setup_arch() performs via the shared
bootconfig_cmdline_requested() helper. Switch it to the helper so the
early (setup_arch) and late (setup_boot_config) paths use one parser and
cannot disagree on what counts as opt-in.

The helper also reports the offset of the init arguments following a "--"
separator, which is exactly what initargs_offs needs, so the local
parse_args() call, its bootconfig_params() callback and the tmp_cmdline
copy are removed.

No functional change intended.

Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 init/main.c | 27 ++++++---------------------
 1 file changed, 6 insertions(+), 21 deletions(-)

diff --git a/init/main.c b/init/main.c
index 260bd5242f94e..39a518a472422 100644
--- a/init/main.c
+++ b/init/main.c
@@ -356,28 +356,17 @@ static char * __init xbc_make_cmdline(const char *key)
 	return new_cmdline;
 }
 
-static int __init bootconfig_params(char *param, char *val,
-				    const char *unused, void *arg)
-{
-	if (strcmp(param, "bootconfig") == 0) {
-		bootconfig_found = true;
-	}
-	return 0;
-}
-
 static int __init warn_bootconfig(char *str)
 {
-	/* The 'bootconfig' has been handled by bootconfig_params(). */
+	/* The 'bootconfig' option is handled by setup_boot_config(). */
 	return 0;
 }
 
 static void __init setup_boot_config(void)
 {
-	static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;
 	const char *msg, *data;
-	int pos, ret;
+	int pos, ret, offs;
 	size_t size;
-	char *err;
 	bool from_embedded = false;
 
 	/* Cut out the bootconfig data even if we have no bootconfig option */
@@ -388,16 +377,12 @@ static void __init setup_boot_config(void)
 		from_embedded = true;
 	}
 
-	strscpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
-	err = parse_args("bootconfig", tmp_cmdline, NULL, 0, 0, 0, NULL,
-			 bootconfig_params);
-
-	if (IS_ERR(err) || !(bootconfig_found || IS_ENABLED(CONFIG_BOOT_CONFIG_FORCE)))
+	bootconfig_found = bootconfig_cmdline_requested(boot_command_line, &offs);
+	if (!(bootconfig_found || IS_ENABLED(CONFIG_BOOT_CONFIG_FORCE)))
 		return;
 
-	/* parse_args() stops at the next param of '--' and returns an address */
-	if (err)
-		initargs_offs = err - tmp_cmdline;
+	/* Offset of the init arguments after a "--", located by the helper. */
+	initargs_offs = offs;
 
 	if (!data) {
 		/* If user intended to use bootconfig, show an error level message */

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v7 8/9] bootconfig: skip runtime kernel.* render once prepended early
From: Breno Leitao @ 2026-06-26 12:50 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier, Nick Desaulniers, Bill Wendling, Justin Stitt,
	Jonathan Corbet, Shuah Khan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, llvm, linux-doc, Breno Leitao, kernel-team
In-Reply-To: <20260626-bootconfig_using_tools-v7-0-24ab72139c29@debian.org>

setup_boot_config() folds the embedded bootconfig "kernel" subtree into
the command line via xbc_make_cmdline("kernel"). A subsequent patch lets
an architecture prepend the build-time-rendered embedded "kernel" keys
to boot_command_line early in setup_arch(); rendering them again here
would then duplicate every key in saved_command_line and make
accumulating handlers (console=, earlycon=, ...) re-register the same
value.

Track whether the bootconfig data came from the embedded source
(from_embedded) and skip the runtime render only when the early prepend
actually happened, as reported by xbc_embedded_cmdline_applied(). On
architectures that do not select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
that helper is a stub returning false, so this path is unchanged and the
embedded "kernel" keys still reach the cmdline via the runtime parser
exactly as before.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 init/main.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/init/main.c b/init/main.c
index e363232b428b4..260bd5242f94e 100644
--- a/init/main.c
+++ b/init/main.c
@@ -378,12 +378,15 @@ static void __init setup_boot_config(void)
 	int pos, ret;
 	size_t size;
 	char *err;
+	bool from_embedded = false;
 
 	/* Cut out the bootconfig data even if we have no bootconfig option */
 	data = get_boot_config_from_initrd(&size);
 	/* If there is no bootconfig in initrd, try embedded one. */
-	if (!data)
+	if (!data) {
 		data = xbc_get_embedded_bootconfig(&size);
+		from_embedded = true;
+	}
 
 	strscpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
 	err = parse_args("bootconfig", tmp_cmdline, NULL, 0, 0, 0, NULL,
@@ -421,8 +424,24 @@ static void __init setup_boot_config(void)
 	} else {
 		xbc_get_info(&ret, NULL);
 		pr_info("Load bootconfig: %ld bytes %d nodes\n", (long)size, ret);
-		/* keys starting with "kernel." are passed via cmdline */
-		extra_command_line = xbc_make_cmdline("kernel");
+		/*
+		 * keys starting with "kernel." are passed via cmdline. When
+		 * this bootconfig came from the embedded source and
+		 * setup_arch() already prepended the rendered "kernel" subtree
+		 * to boot_command_line, rendering again here would duplicate
+		 * the keys in saved_command_line and make accumulating handlers
+		 * (console=, earlycon=, ...) re-register the same value. Skip
+		 * only when the prepend really happened.
+		 *
+		 * On arches that do not select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG,
+		 * CONFIG_CMDLINE_FROM_BOOTCONFIG is unselectable and
+		 * xbc_embedded_cmdline_applied() collapses to a stub returning
+		 * false, so this path still runs and the embedded "kernel"
+		 * keys reach the cmdline via the runtime parser exactly as
+		 * before this series.
+		 */
+		if (!from_embedded || !xbc_embedded_cmdline_applied())
+			extra_command_line = xbc_make_cmdline("kernel");
 		/* Also, "init." keys are init arguments */
 		extra_init_args = xbc_make_cmdline("init");
 	}

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v7 7/9] x86/setup: prepend embedded bootconfig cmdline before parse_early_param
From: Breno Leitao @ 2026-06-26 12:50 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier, Nick Desaulniers, Bill Wendling, Justin Stitt,
	Jonathan Corbet, Shuah Khan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, llvm, linux-doc, Breno Leitao, kernel-team
In-Reply-To: <20260626-bootconfig_using_tools-v7-0-24ab72139c29@debian.org>

Call xbc_prepend_embedded_cmdline() in setup_arch() right after the
CONFIG_CMDLINE merge and before strscpy(command_line, ...) so the
build-time-rendered embedded bootconfig "kernel" subtree is part of
boot_command_line by the time parse_early_param() runs. early_param()
handlers (mem=, earlycon=, loglevel=, ...) now see values supplied via
CONFIG_BOOT_CONFIG_EMBED_FILE without parsing bootconfig at runtime.

Gate the prepend on the same opt-in the runtime parser uses: prepend
when "bootconfig" is present on the command line, or when
CONFIG_BOOT_CONFIG_FORCE is set. Detect it with parse_args(), exactly
as setup_boot_config() does, so both agree on what counts as opt-in:
any "bootconfig" key regardless of value (bare, =0, =1, ...), and only
before the "--" that separates init arguments. Sharing the parser keeps
the early and late paths from diverging -- e.g. "bootconfig=0" or a
"-- bootconfig" meant for init must not apply the embedded keys early
while the runtime parser skips them.
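
The opt-in rule can be modelled in userspace roughly as below. This is a
simplified sketch, not the kernel parser: it splits on whitespace only
and ignores the quoting that parse_args() handles, and the function name
bootconfig_opted_in() is made up for the example.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Simplified model of the opt-in rule: a token counts when it is exactly
 * "bootconfig" or starts with "bootconfig=", and scanning stops at "--".
 */
static bool bootconfig_opted_in(const char *cmdline)
{
	static const char key[] = "bootconfig";
	const size_t key_len = sizeof(key) - 1;
	const char *p = cmdline;

	while (*p) {
		size_t len;

		p += strspn(p, " \t\n");	/* skip separators */
		len = strcspn(p, " \t\n");	/* length of this token */
		if (!len)
			break;
		if (len == 2 && !strncmp(p, "--", 2))
			return false;		/* the rest belongs to init */
		if (len >= key_len && !strncmp(p, key, key_len) &&
		    (len == key_len || p[key_len] == '='))
			return true;		/* bare, =0, =1, ... all count */
		p += len;
	}
	return false;
}
```

Under this model "bootconfig=0 quiet" opts in, while
"quiet -- bootconfig" does not, matching the two cases called out above.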

The prepend necessarily runs before setup_boot_config() detects an
initrd bootconfig, so an initrd cannot override the embedded "kernel"
keys for early_param(). This is intentional: the embedded cmdline acts
like a build-time CONFIG_CMDLINE. An initrd bootconfig's "kernel" keys
never reached early_param() anyway (they apply late via
extra_command_line), so nothing is lost -- the initrd keys still apply
late, with last-wins keeping the embedded values in effect.
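
To make the last-wins ordering concrete, here is a small userspace
sketch. It assumes a handler that simply keeps the last "key=" it sees;
last_value() is a hypothetical helper, splits on spaces only and ignores
quoting, so it is not how the kernel parses parameters.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the start of the value of the last "key=" token in
 * cmdline, or NULL if the key never appears.
 */
static const char *last_value(const char *cmdline, const char *key)
{
	size_t key_len = strlen(key);
	const char *found = NULL;
	const char *p = cmdline;

	while (*p) {
		size_t len;

		p += strspn(p, " ");	/* skip separators */
		len = strcspn(p, " ");	/* length of this token */
		if (len > key_len && !strncmp(p, key, key_len) &&
		    p[key_len] == '=')
			found = p + key_len + 1;	/* later tokens override */
		p += len;
	}
	return found;
}
```

With the initrd keys ahead of the embedded ones, as in
"loglevel=3 loglevel=7" (initrd value first, embedded value second), the
lookup yields 7, so the embedded value stays in effect.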

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 arch/x86/Kconfig        |  1 +
 arch/x86/kernel/setup.c | 14 +++++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0de23e6471973..8ab11199c16d5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -127,6 +127,7 @@ config X86
 	select ARCH_SUPPORTS_NUMA_BALANCING	if X86_64
 	select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP	if NR_CPUS <= 4096
 	select ARCH_SUPPORTS_CFI		if X86_64
+	select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
 	select ARCH_USES_CFI_TRAPS		if X86_64 && CFI
 	select ARCH_SUPPORTS_LTO_CLANG
 	select ARCH_SUPPORTS_LTO_CLANG_THIN
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 46882ce79c3a4..88b055a46591e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -6,6 +6,7 @@
  * parts of early kernel initialization.
  */
 #include <linux/acpi.h>
+#include <linux/bootconfig.h>
 #include <linux/console.h>
 #include <linux/cpu.h>
 #include <linux/crash_dump.h>
@@ -880,7 +881,6 @@ static void __init x86_report_nx(void)
  *
  * Note: On x86_64, fixmaps are ready for use even before this is called.
  */
-
 void __init setup_arch(char **cmdline_p)
 {
 #ifdef CONFIG_X86_32
@@ -924,6 +924,18 @@ void __init setup_arch(char **cmdline_p)
 	builtin_cmdline_added = true;
 #endif

+#ifdef CONFIG_CMDLINE_FROM_BOOTCONFIG
+	/*
+	 * Prepend the build-time-rendered embedded "kernel" keys here so
+	 * parse_early_param() below sees them, using the same opt-in as the
+	 * runtime parser, plus the build-time CONFIG_BOOT_CONFIG_FORCE.
+	 */
+	if (bootconfig_cmdline_requested(boot_command_line, NULL) ||
+	    IS_ENABLED(CONFIG_BOOT_CONFIG_FORCE))
+		xbc_prepend_embedded_cmdline(boot_command_line,
+					     COMMAND_LINE_SIZE);
+#endif
+
 	strscpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
 	*cmdline_p = command_line;

-- 
2.53.0-Meta

^ permalink raw reply related

* [PATCH v7 6/9] Documentation: bootconfig: document build-time cmdline rendering
From: Breno Leitao @ 2026-06-26 12:50 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier, Nick Desaulniers, Bill Wendling, Justin Stitt,
	Jonathan Corbet, Shuah Khan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, llvm, linux-doc, Breno Leitao, kernel-team
In-Reply-To: <20260626-bootconfig_using_tools-v7-0-24ab72139c29@debian.org>

Add a section describing CONFIG_CMDLINE_FROM_BOOTCONFIG: what it
does (renders the embedded "kernel" subtree to a flat cmdline at
build time so early_param() handlers see the values), what it
requires (BOOT_CONFIG_EMBED, a non-empty BOOT_CONFIG_EMBED_FILE,
CONFIG_CMDLINE to be empty, and ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG --
currently x86 only), the bootconfig opt-in semantics, the initrd-vs-embedded
precedence, and the soft-error overflow behavior.

This addresses feedback from the Sashiko AI review and Masami Hiramatsu to
document the CONFIG_CMDLINE requirement, which is enforced at the Kconfig
level but was not mentioned in the documentation, potentially confusing users
who might satisfy all other requirements but still find the option hidden in
menuconfig if CONFIG_CMDLINE is non-empty.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 Documentation/admin-guide/bootconfig.rst | 81 ++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst
index f712758472d5c..3d6412458c8b6 100644
--- a/Documentation/admin-guide/bootconfig.rst
+++ b/Documentation/admin-guide/bootconfig.rst
@@ -234,6 +234,87 @@ Kconfig option selected.
 Note that even if you set this option, you can override the embedded
 bootconfig by another bootconfig which attached to the initrd.
 
+Rendering Embedded kernel.* Keys at Build Time
+----------------------------------------------
+
+By default, the embedded bootconfig (``CONFIG_BOOT_CONFIG_EMBED=y``) is
+parsed at runtime, after ``parse_early_param()`` has already run. Early
+parameter handlers (``mem=``, ``earlycon=``, ``loglevel=``, ...) therefore
+cannot see values supplied via the embedded ``kernel`` subtree.
+
+``CONFIG_CMDLINE_FROM_BOOTCONFIG`` resolves this by rendering the
+``kernel`` subtree of ``CONFIG_BOOT_CONFIG_EMBED_FILE`` into a flat cmdline
+string at kernel build time (via ``tools/bootconfig -C``) and prepending
+it to ``boot_command_line`` during early architecture setup, so the keys
+are visible to ``parse_early_param()``.
+
+The option requires ``CONFIG_BOOT_CONFIG_EMBED=y``, a non-empty
+``CONFIG_BOOT_CONFIG_EMBED_FILE``, ``CONFIG_CMDLINE`` to be empty, and
+an architecture that selects ``CONFIG_ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG``.
+Currently only x86 selects it; on other architectures the embedded
+bootconfig still works, but only through the late runtime parser.
+
+The same ``bootconfig`` opt-in applies as elsewhere: the rendered keys
+are prepended only when ``bootconfig`` (in any form) appears on the
+kernel command line, or when ``CONFIG_BOOT_CONFIG_FORCE`` is set, which
+defaults to ``y`` when ``CONFIG_BOOT_CONFIG_EMBED`` is set.
+
+For example, given::
+
+ kernel {
+   loglevel = 7
+   mem = 4G
+ }
+
+the kernel boots as if ``loglevel=7 mem=4G`` had been prepended to the
+bootloader command line, with the values visible to early-parsed
+handlers. Comma-separated values are still expanded into multiple
+cmdline entries per the bootconfig array convention -- the embedded
+``kernel.earlycon = "uart8250,io,0x3f8"`` must be quoted to land as a
+single ``earlycon=`` entry, exactly as for the runtime parser.
+
+If the rendered string would not fit in ``COMMAND_LINE_SIZE`` together
+with the existing command line, the prepend is skipped and an error is
+logged, so an oversized embedded bootconfig cannot brick a boot.
+
+Interaction with other command line and bootconfig sources
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+With ``CONFIG_CMDLINE_FROM_BOOTCONFIG=y`` the rendered ``kernel``
+subtree behaves like a build-time command line (similar to
+``CONFIG_CMDLINE``), not like a bootconfig source. It is prepended to
+``boot_command_line`` in ``setup_arch()``, before ``parse_early_param()``
+and long before the runtime parser looks at an initrd. Options can reach
+the kernel from up to four places:
+
+- Bootloader command line: the arguments the boot loader passes. The
+  embedded cmdline is prepended in front of them, so for last-one-wins
+  parameters a bootloader option still overrides the embedded value.
+  Visible in /proc/cmdline.
+- Embedded cmdline (this option): the rendered ``kernel`` subtree,
+  prepended early so it is seen by ``parse_early_param()``. Visible in
+  /proc/cmdline.
+- Initrd bootconfig: parsed late in ``setup_boot_config()``; its
+  ``kernel`` keys are placed ahead of ``boot_command_line``, i.e. before
+  the embedded cmdline, so last-wins favors the embedded values. As a
+  bootconfig source, an initrd bootconfig still replaces the embedded
+  bootconfig. Visible in /proc/cmdline and /proc/bootconfig.
+- Embedded bootconfig (runtime): parsed late, only when no initrd
+  bootconfig is present. Visible in /proc/cmdline and /proc/bootconfig.
+
+So with this option the embedded ``kernel.*`` values take precedence
+over an initrd bootconfig's ``kernel.*`` values: for early parameters
+the initrd is not parsed yet, and for ordinary parameters the embedded
+keys land later in the command line. If you need an initrd bootconfig to
+override the embedded ``kernel.*`` keys, leave this option off and rely
+on the runtime parser.
+
+The rendered string is part of the command line, so it appears in
+/proc/cmdline. It is deliberately not shown in /proc/bootconfig: that
+file keeps reporting the parsed bootconfig tree -- the initrd bootconfig
+if present, otherwise the embedded bootconfig -- independent of whether
+build-time cmdline rendering is enabled.
+
 Kernel parameters via Boot Config
 =================================
 

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v7 5/9] bootconfig: add xbc_prepend_embedded_cmdline() helper
From: Breno Leitao @ 2026-06-26 12:50 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier, Nick Desaulniers, Bill Wendling, Justin Stitt,
	Jonathan Corbet, Shuah Khan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, llvm, linux-doc, Breno Leitao, kernel-team
In-Reply-To: <20260626-bootconfig_using_tools-v7-0-24ab72139c29@debian.org>

Add a helper that prepends the build-time-rendered embedded bootconfig
"kernel" subtree (embedded_kernel_cmdline[] from embedded-cmdline.S) to
a cmdline buffer with a separating space. Architectures call this from
setup_arch() before parse_early_param() so early_param() handlers
(mem=, earlycon=, loglevel=, ...) see values supplied via the embedded
bootconfig.

The in-place prepend (shift the existing string right, then drop the
embedded string in front) is factored into a small str_prepend() helper.

On overflow the helper logs an error and leaves the cmdline untouched
rather than panicking. Booting without the embedded values is better
than refusing to boot, and the error tells the user why their embedded
keys are missing.
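
The shift-then-copy prepend and the overflow check can be tried out in
userspace with the sketch below. It is an illustration under simplifying
assumptions, not the kernel helper: prepend_in_place() is a made-up name,
the source string is NUL-terminated, and failure is reported through the
return value instead of pr_err().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Prepend src in front of the string already stored in dst, whose buffer
 * holds size bytes. Returns false and leaves dst untouched if the result
 * (including the NUL terminator) would not fit.
 */
static bool prepend_in_place(char *dst, size_t size, const char *src)
{
	size_t src_len = strlen(src);
	const char *nul = memchr(dst, '\0', size);
	size_t dst_len;

	if (!nul)			/* dst is not terminated within size */
		return false;
	dst_len = (size_t)(nul - dst);

	if (src_len + dst_len + 1 > size)
		return false;

	/* Shift the old string right, NUL included, then fill the gap. */
	memmove(dst + src_len, dst, dst_len + 1);
	memcpy(dst, src, src_len);
	return true;
}
```

Because the memmove() carries the terminator along, an empty dst needs no
special case, which is the same property the patch relies on.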

The helper records whether it actually prepended, exposed via
xbc_embedded_cmdline_applied(). setup_boot_config() uses this to decide
whether the runtime "kernel" render would duplicate keys already folded
into boot_command_line.

Also add bootconfig_cmdline_requested(), a small parse_args() wrapper
that reports whether "bootconfig" was passed on the command line and,
via an optional out-parameter, where the "--" init arguments begin.
setup_arch() and setup_boot_config() share it so the early and late
paths agree on the opt-in. It sits under CONFIG_BOOT_CONFIG rather than
CONFIG_CMDLINE_FROM_BOOTCONFIG because the runtime parser needs it on
every bootconfig build.

When CONFIG_CMDLINE_FROM_BOOTCONFIG=n, the public declaration in
<linux/bootconfig.h> resolves to a no-op stub so callers compile
unchanged.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 include/linux/bootconfig.h |  14 +++++
 lib/bootconfig.c           | 128 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/include/linux/bootconfig.h b/include/linux/bootconfig.h
index 1c7f3b74ffcf3..deda507500da2 100644
--- a/include/linux/bootconfig.h
+++ b/include/linux/bootconfig.h
@@ -308,4 +308,18 @@ static inline const char *xbc_get_embedded_bootconfig(size_t *size)
 }
 #endif
 
+/* Bootconfig opt-in detection, shared by setup_arch() and setup_boot_config() */
+#ifdef CONFIG_BOOT_CONFIG
+bool __init bootconfig_cmdline_requested(const char *boot_cmdline, int *end_offset);
+#endif
+
+/* Build-time-rendered bootconfig cmdline prepended in setup_arch() */
+#ifdef CONFIG_CMDLINE_FROM_BOOTCONFIG
+void __init xbc_prepend_embedded_cmdline(char *dst, size_t size);
+bool __init xbc_embedded_cmdline_applied(void);
+#else
+static inline void xbc_prepend_embedded_cmdline(char *dst, size_t size) { }
+static inline bool xbc_embedded_cmdline_applied(void) { return false; }
+#endif
+
 #endif
diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 926094d97397e..89c88e359179f 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -19,9 +19,13 @@
 #include <linux/errno.h>
 #include <linux/cache.h>
 #include <linux/compiler.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/printk.h>
 #include <linux/sprintf.h>
 #include <linux/memblock.h>
 #include <linux/string.h>
+#include <asm/setup.h>		/* COMMAND_LINE_SIZE */
 
 #ifdef CONFIG_BOOT_CONFIG_EMBED
 /* embedded_bootconfig_data is defined in bootconfig-data.S */
@@ -34,7 +38,129 @@ const char * __init xbc_get_embedded_bootconfig(size_t *size)
 	return (*size) ? embedded_bootconfig_data : NULL;
 }
 #endif
-#endif
+
+#ifdef CONFIG_CMDLINE_FROM_BOOTCONFIG
+/* embedded_kernel_cmdline is defined in embedded-cmdline.S */
+extern __visible const char embedded_kernel_cmdline[];
+extern __visible const char embedded_kernel_cmdline_end[];
+
+/* Set once the embedded cmdline has actually been prepended. */
+static bool xbc_cmdline_applied __initdata;
+
+/*
+ * str_prepend() - Prepend @src in front of the string in @dst, in place
+ * @dst: NUL-terminated destination buffer, currently @dst_len bytes long
+ * @dst_len: length of the current @dst string (excluding its NUL)
+ * @src: bytes to prepend (not NUL-terminated)
+ * @src_len: number of bytes from @src to prepend
+ *
+ * The caller must guarantee @dst has room for src_len + dst_len + 1 bytes.
+ * Moving dst_len + 1 bytes carries @dst's NUL terminator too, so an empty
+ * @dst needs no special case.
+ */
+static void __init str_prepend(char *dst, size_t dst_len,
+			       const char *src, size_t src_len)
+{
+	memmove(dst + src_len, dst, dst_len + 1);
+	memcpy(dst, src, src_len);
+}
+
+/**
+ * xbc_prepend_embedded_cmdline() - Prepend embedded bootconfig cmdline
+ * @dst: cmdline buffer to prepend into (must already contain a NUL byte)
+ * @size: total capacity of @dst in bytes
+ *
+ * Prepend the build-time-rendered "kernel" subtree of the embedded
+ * bootconfig to @dst. The rendered string already ends with a single
+ * space (the xbc_snprint_cmdline() invariant), which serves as the
+ * separator between the embedded keys and any existing content of @dst.
+ * On overflow, log an error and leave @dst untouched rather than
+ * silently truncating: booting without the embedded values is better
+ * than refusing to boot, and the error message tells the user why
+ * their embedded keys are missing.
+ *
+ * Intended to be called from setup_arch() before parse_early_param() so
+ * that early_param() handlers see the embedded values.
+ */
+void __init xbc_prepend_embedded_cmdline(char *dst, size_t size)
+{
+	size_t embed_len = embedded_kernel_cmdline_end - embedded_kernel_cmdline;
+	size_t dst_len;
+
+	if (!size || embed_len <= 1)	/* trailing NUL only */
+		return;
+	embed_len--;			/* exclude trailing NUL byte */
+
+	dst_len = strnlen(dst, size);
+	if (embed_len + dst_len + 1 > size) {
+		pr_err("embedded bootconfig cmdline (%zu bytes) does not fit in COMMAND_LINE_SIZE with %zu bytes already used; ignoring embedded values\n",
+		       embed_len, dst_len);
+		return;
+	}
+
+	str_prepend(dst, dst_len, embedded_kernel_cmdline, embed_len);
+	xbc_cmdline_applied = true;
+}
+
+/**
+ * xbc_embedded_cmdline_applied() - Did the embedded cmdline get prepended?
+ *
+ * Return true if xbc_prepend_embedded_cmdline() actually prepended the
+ * embedded "kernel" subtree. setup_boot_config() uses this to avoid
+ * rendering the same keys a second time.
+ */
+bool __init xbc_embedded_cmdline_applied(void)
+{
+	return xbc_cmdline_applied;
+}
+#endif	/* CONFIG_CMDLINE_FROM_BOOTCONFIG */
+
+/* parse_args() callback: flag when the "bootconfig" parameter is present. */
+static int __init bootconfig_optin(char *param, char *val,
+				   const char *unused, void *arg)
+{
+	if (!strcmp(param, "bootconfig"))
+		*(bool *)arg = true;
+	return 0;
+}
+
+/**
+ * bootconfig_cmdline_requested() - Was "bootconfig" passed on the cmdline?
+ * @boot_cmdline: kernel command line to inspect (not modified)
+ * @end_offset: if non-NULL, set to the offset of the init arguments that
+ *		follow a "--" separator, or 0 when there is none
+ *
+ * Parse a private copy of @boot_cmdline (parse_args() is destructive) and
+ * report whether "bootconfig" is present before the "--" separator.
+ * setup_arch() uses this to gate prepending the build-time embedded cmdline;
+ * setup_boot_config() uses it for the runtime opt-in and to locate the init
+ * arguments via @end_offset. Sharing one parser keeps the early and late
+ * paths agreeing on what counts as opt-in. CONFIG_BOOT_CONFIG_FORCE is not
+ * folded in here; callers apply it where they need it.
+ */
+bool __init bootconfig_cmdline_requested(const char *boot_cmdline, int *end_offset)
+{
+	static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;
+	bool found = false;
+	char *err;
+
+	if (end_offset)
+		*end_offset = 0;
+
+	strscpy(tmp_cmdline, boot_cmdline, COMMAND_LINE_SIZE);
+	err = parse_args("bootconfig", tmp_cmdline, NULL, 0, 0, 0,
+			 &found, bootconfig_optin);
+	if (IS_ERR(err))
+		return false;
+
+	/* parse_args() stops at "--" and returns the address of the rest. */
+	if (end_offset && err)
+		*end_offset = err - tmp_cmdline;
+
+	return found;
+}
+
+#endif	/* __KERNEL__ */
 
 /*
  * Extra Boot Config (XBC) is given as tree-structured ascii text of

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v7 4/9] bootconfig: clean build-time tools/bootconfig from make clean
From: Breno Leitao @ 2026-06-26 12:50 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier, Nick Desaulniers, Bill Wendling, Justin Stitt,
	Jonathan Corbet, Shuah Khan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, llvm, linux-doc, Breno Leitao, kernel-team, Nicolas Schier
In-Reply-To: <20260626-bootconfig_using_tools-v7-0-24ab72139c29@debian.org>

The previous patch builds tools/bootconfig during 'make prepare' to
render the embedded bootconfig cmdline, but nothing removes it on
'make clean', leaving the compiled tool and its objects behind.

Wire a bootconfig_clean hook into the top-level clean target so the
compiled tool and its objects are removed by make clean, matching the
prepare-wired tools/objtool and tools/bpf/resolve_btfids.

The hook runs tools/bootconfig's Makefile via $(MAKE), which the kernel
build invokes with -rR (MAKEFLAGS += -rR). -rR drops the built-in $(RM)
variable, so the existing "$(RM) -f ..." clean recipe would expand to a
bare "-f ..." and fail. Spell the recipe with a literal "rm -f" so it
keeps working both standalone and when invoked from Kbuild.

Reviewed-by: Nicolas Schier <n.schier@fritz.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 Makefile                  | 11 ++++++++++-
 tools/bootconfig/Makefile |  2 +-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 5255aa35a2e51..20a2bcacde3b8 100644
--- a/Makefile
+++ b/Makefile
@@ -1587,6 +1587,15 @@ ifneq ($(wildcard $(objtool_O)),)
 	$(Q)$(MAKE) -sC $(abs_srctree)/tools/objtool O=$(objtool_O) srctree=$(abs_srctree) $(patsubst objtool_%,%,$@)
 endif
 
+PHONY += bootconfig_clean
+
+bootconfig_O = $(abspath $(objtree))/tools/bootconfig
+
+bootconfig_clean:
+ifneq ($(wildcard $(bootconfig_O)),)
+	$(Q)$(MAKE) -sC $(srctree)/tools/bootconfig O=$(bootconfig_O) clean
+endif
+
 tools/: FORCE
 	$(Q)mkdir -p $(objtree)/tools
 	$(Q)$(MAKE) O=$(abspath $(objtree)) subdir=tools -C $(srctree)/tools/
@@ -1757,7 +1766,7 @@ vmlinuxclean:
 	$(Q)$(CONFIG_SHELL) $(srctree)/scripts/link-vmlinux.sh clean
 	$(Q)$(if $(ARCH_POSTLINK), $(MAKE) -f $(ARCH_POSTLINK) clean)
 
-clean: archclean vmlinuxclean resolve_btfids_clean objtool_clean
+clean: archclean vmlinuxclean resolve_btfids_clean objtool_clean bootconfig_clean
 
 # mrproper - Delete all generated files, including .config
 #
diff --git a/tools/bootconfig/Makefile b/tools/bootconfig/Makefile
index 4e82fd9553cde..3cb8066d5141b 100644
--- a/tools/bootconfig/Makefile
+++ b/tools/bootconfig/Makefile
@@ -27,4 +27,4 @@ install: $(ALL_PROGRAMS)
 	install $(OUTPUT)bootconfig $(DESTDIR)$(bindir)
 
 clean:
-	$(RM) -f $(OUTPUT)*.o $(ALL_PROGRAMS)
+	rm -f $(OUTPUT)*.o $(ALL_PROGRAMS)

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v7 3/9] bootconfig: render embedded bootconfig as a kernel cmdline at build time
From: Breno Leitao @ 2026-06-26 12:50 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier, Nick Desaulniers, Bill Wendling, Justin Stitt,
	Jonathan Corbet, Shuah Khan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, llvm, linux-doc, Breno Leitao, kernel-team, Nicolas Schier
In-Reply-To: <20260626-bootconfig_using_tools-v7-0-24ab72139c29@debian.org>

Add the build-time pipeline that renders the "kernel" subtree of
CONFIG_BOOT_CONFIG_EMBED_FILE into a flat cmdline string and stashes
it in .init.rodata as embedded_kernel_cmdline[]. A follow-up patch
adds the runtime helper that prepends this string to boot_command_line
during early architecture setup so parse_early_param() sees the values.

The build wires up:
  tools/bootconfig -C kernel - userspace tool already shared with
                               lib/bootconfig.c, used here in -C mode
                               to render a bootconfig file to a cmdline
  lib/embedded-cmdline.S     - .incbin's the rendered text plus a NUL
                               (listed under the EXTRA BOOT CONFIG
                               MAINTAINERS entry)
  lib/Makefile rule          - runs tools/bootconfig at build time
  Makefile prepare dep       - ensures tools/bootconfig is built first,
                               same pattern as tools/objtool and
                               tools/bpf/resolve_btfids

Drop the test target from tools/bootconfig/Makefile's default 'all'
recipe so that hooking the binary into the kernel build does not run
test-bootconfig.sh on every prepare. The tests stay available as
'make -C tools/bootconfig test', matching the convention of
tools/objtool and tools/bpf/resolve_btfids whose 'all' targets only
build the binary.

Require BOOT_CONFIG_EMBED_FILE to be non-empty before the new option
can be enabled; otherwise tools/bootconfig -C runs against an empty
file and prints a parse error on every kernel build.

The feature is gated on CONFIG_ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG, a
silent symbol that arches select once they have wired the prepend call
into setup_arch(). No arch selects it in this patch, so the user-visible
CONFIG_CMDLINE_FROM_BOOTCONFIG cannot be enabled yet; the runtime
behavior for arches that later opt in is added by the follow-up patches.

tools/bootconfig is also installed on target systems, so its own Makefile
keeps $(CC) and stays cross-buildable as a standalone tool. The kernel
build, which runs the tool on the build host during prepare, instead
forces CC=$(HOSTCC) from a dedicated tools/bootconfig rule and clears
CROSS_COMPILE= in the sub-make. Without that clear, an LLVM=1 cross
build would inherit CROSS_COMPILE and tools/scripts/Makefile.include
would inject --target=/--sysroot= flags into the host clang invocation,
producing a target binary that fails to exec ("Exec format error").

embedded-cmdline.S places the rendered string in its own .init.rodata
subsection (.init.rodata.embed_cmdline) with the "a" (allocatable,
read-only) flag and %progbits. lib/bootconfig-data.S already places
the embedded bootconfig blob in .init.rodata with the "aw" flag
(xbc_init() rewrites separators in place, so that data must be
writable). Using a distinct subsection name avoids the ld.lld section-
type mismatch that would otherwise arise from mixing "a" and "aw"
under the same name; the linker's "*(.init.rodata .init.rodata.*)"
glob still folds both into the init image and frees them after boot.

A follow-up patch wires the build-time tools/bootconfig into the
top-level clean target.

Reviewed-by: Nicolas Schier <n.schier@fritz.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 MAINTAINERS               |  1 +
 Makefile                  | 16 ++++++++++++++++
 init/Kconfig              | 36 ++++++++++++++++++++++++++++++++++++
 lib/Makefile              | 16 ++++++++++++++++
 lib/embedded-cmdline.S    | 16 ++++++++++++++++
 tools/bootconfig/Makefile |  2 +-
 6 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 57656ec0e9d5d..953231df1911d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9844,6 +9844,7 @@ F:	fs/proc/bootconfig.c
 F:	include/linux/bootconfig.h
 F:	lib/bootconfig-data.S
 F:	lib/bootconfig.c
+F:	lib/embedded-cmdline.S
 F:	tools/bootconfig/*
 F:	tools/bootconfig/scripts/*
 
diff --git a/Makefile b/Makefile
index bf196c6df5b92..5255aa35a2e51 100644
--- a/Makefile
+++ b/Makefile
@@ -1545,6 +1545,22 @@ prepare: tools/bpf/resolve_btfids
 endif
 endif
 
+# tools/bootconfig renders the embedded bootconfig into a cmdline at build time.
+ifdef CONFIG_CMDLINE_FROM_BOOTCONFIG
+prepare: tools/bootconfig
+endif
+
+# tools/bootconfig is run on the build host during prepare, so force a host
+# binary here; its own Makefile keeps $(CC) for standalone and cross builds.
+# CROSS_COMPILE= is cleared so tools/scripts/Makefile.include does not inject
+# the target's --target=/--sysroot= flags into the host clang invocation under
+# LLVM=1 cross builds (which would produce a target binary that fails to exec).
+tools/bootconfig: export CC := $(HOSTCC)
+tools/bootconfig: FORCE
+	$(Q)mkdir -p $(objtree)/tools
+	$(Q)$(MAKE) O=$(abspath $(objtree)) subdir=tools -C $(srctree)/tools/ \
+		bootconfig CROSS_COMPILE=
+
 # The tools build system is not a part of Kbuild and tends to introduce
 # its own unique issues. If you need to integrate a new tool into Kbuild,
 # please consider locating that tool outside the tools/ tree and using the
diff --git a/init/Kconfig b/init/Kconfig
index 5230d4879b1c8..598690ec313a2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1566,6 +1566,42 @@ config BOOT_CONFIG_EMBED_FILE
 	  This bootconfig will be used if there is no initrd or no other
 	  bootconfig in the initrd.
 
+config ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
+	bool
+	help
+	  Silent symbol; no C code reads it directly. Architectures
+	  select it once their setup_arch() calls
+	  xbc_prepend_embedded_cmdline() before parse_early_param().
+	  Its only role is to gate the user-visible
+	  CMDLINE_FROM_BOOTCONFIG option per-arch, the same
+	  ARCH_SUPPORTS_* idiom used by ARCH_SUPPORTS_CFI, etc.
+
+config CMDLINE_FROM_BOOTCONFIG
+	bool "Render embedded bootconfig as kernel cmdline at build time"
+	depends on BOOT_CONFIG_EMBED_FILE != ""
+	depends on ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
+	depends on CMDLINE = ""
+	default n
+	help
+	  Render the "kernel" subtree of the embedded bootconfig file into a
+	  flat cmdline string at kernel build time and prepend it to
+	  boot_command_line during early architecture setup. This makes
+	  early_param() handlers (e.g. mem=, earlycon=, loglevel=) see the
+	  values supplied via the embedded bootconfig.
+
+	  The runtime bootconfig parser is unaffected, so tree-structured
+	  consumers such as ftrace boot-time tracing keep working.
+
+	  Note: when an initrd also carries a bootconfig, its "kernel"
+	  subtree is still parsed at runtime, but the embedded "kernel"
+	  keys remain in boot_command_line for parse_early_param() and
+	  end up later than the initrd keys in saved_command_line, so
+	  parse_args() last-wins favors the embedded values. If you need
+	  initrd to override embedded kernel.* keys, leave this option
+	  off.
+
+	  If unsure, say N.
+
 config CMDLINE_LOG_WRAP_IDEAL_LEN
 	int "Length to try to wrap the cmdline when logged at boot"
 	default 1021
diff --git a/lib/Makefile b/lib/Makefile
index 7f75cc6edf94a..4ccdce2fd5e5b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -273,6 +273,22 @@ filechk_defbconf = cat $(or $(real-prereqs), /dev/null)
 $(obj)/default.bconf: $(CONFIG_BOOT_CONFIG_EMBED_FILE) FORCE
 	$(call filechk,defbconf)
 
+obj-$(CONFIG_CMDLINE_FROM_BOOTCONFIG) += embedded-cmdline.o
+$(obj)/embedded-cmdline.o: $(obj)/embedded_cmdline.bin
+
+# Render the bootconfig "kernel" subtree to a flat cmdline string using
+# the userspace tools/bootconfig parser (-C mode). The runtime prepend
+# helper enforces COMMAND_LINE_SIZE at boot, so no build-time size
+# check is performed here (COMMAND_LINE_SIZE is an arch header
+# constant, not a Kconfig value).
+quiet_cmd_render_cmdline = BCONF2C $@
+      cmd_render_cmdline = \
+	$(objtree)/tools/bootconfig/bootconfig -C $< > $@
+
+targets += embedded_cmdline.bin
+$(obj)/embedded_cmdline.bin: $(obj)/default.bconf $(objtree)/tools/bootconfig/bootconfig FORCE
+	$(call if_changed,render_cmdline)
+
 obj-$(CONFIG_RBTREE_TEST) += rbtree_test.o
 obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
 
diff --git a/lib/embedded-cmdline.S b/lib/embedded-cmdline.S
new file mode 100644
index 0000000000000..bda81b4a42bea
--- /dev/null
+++ b/lib/embedded-cmdline.S
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Embed the build-time-rendered bootconfig "kernel" subtree as a flat
+ * cmdline string. setup_arch() prepends this to boot_command_line on
+ * architectures that select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ */
+	.section .init.rodata.embed_cmdline, "a", %progbits
+	.global embedded_kernel_cmdline
+embedded_kernel_cmdline:
+	.incbin "lib/embedded_cmdline.bin"
+	.byte 0
+	.global embedded_kernel_cmdline_end
+embedded_kernel_cmdline_end:
diff --git a/tools/bootconfig/Makefile b/tools/bootconfig/Makefile
index 90eb47c9d8de6..4e82fd9553cde 100644
--- a/tools/bootconfig/Makefile
+++ b/tools/bootconfig/Makefile
@@ -15,7 +15,7 @@ override CFLAGS += -Wall -g -I$(CURDIR)/include
 ALL_TARGETS := bootconfig
 ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS))
 
-all: $(ALL_PROGRAMS) test
+all: $(ALL_PROGRAMS)
 
 $(OUTPUT)bootconfig: main.c include/linux/bootconfig.h $(LIBSRC)
 	$(CC) $(filter %.c,$^) $(CFLAGS) $(LDFLAGS) -o $@

-- 
2.53.0-Meta


^ permalink raw reply related
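
The commit message above says a follow-up patch prepends the rendered
string to boot_command_line and that the runtime helper enforces
COMMAND_LINE_SIZE. A minimal userspace sketch of that kind of bounded
prepend is shown here for illustration only. prepend_cmdline() and its
return convention are hypothetical names invented for this sketch and
are not the helper from the series; whether the real helper adds a
separating space is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace model, not the kernel helper: prepend the
 * rendered string `embedded` to the NUL-terminated command line held
 * in `cmdline`, a buffer of `size` bytes.
 *
 * Returns 0 on success, or -1 (leaving `cmdline` untouched) when the
 * combined string plus its NUL would not fit in `size` bytes.
 */
static int prepend_cmdline(char *cmdline, size_t size, const char *embedded)
{
	size_t elen, clen, sep;

	assert(cmdline != NULL && embedded != NULL);

	elen = strlen(embedded);
	clen = strlen(cmdline);
	/* Assumption: one separating space when both sides are non-empty. */
	sep = (elen > 0 && clen > 0) ? 1 : 0;

	if (elen + sep + clen + 1 > size)
		return -1;

	/* Shift the existing text and its NUL right, then copy in front. */
	memmove(cmdline + elen + sep, cmdline, clen + 1);
	memcpy(cmdline, embedded, elen);
	if (sep)
		cmdline[elen] = ' ';
	return 0;
}
```

With a 32-byte buffer holding "root=/dev/sda1", prepending "loglevel=7"
in this model gives "loglevel=7 root=/dev/sda1"; with a 16-byte buffer
the call fails and the buffer is left as it was.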

* [PATCH v7 2/9] bootconfig: render descendant keys when xbc_snprint_cmdline() root has a value
From: Breno Leitao @ 2026-06-26 12:50 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier, Nick Desaulniers, Bill Wendling, Justin Stitt,
	Jonathan Corbet, Shuah Khan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, llvm, linux-doc, Breno Leitao, kernel-team
In-Reply-To: <20260626-bootconfig_using_tools-v7-0-24ab72139c29@debian.org>

xbc_node_for_each_key_value() walks to the first leaf under @root, and
when @root is itself a leaf it yields @root. That happens not only for
an empty "kernel {}" subtree, but also when @root carries both a value
and subkeys, e.g.

	kernel = x
	kernel.foo = bar

Here @root ("kernel") is a leaf because its first child is the value
node "x", so the iterator returns @root first. Feeding @root back into
xbc_node_compose_key_after(root, root) returns -EINVAL, which the only
in-kernel caller papers over with a "len <= 0" check -- but the
follow-up tools/bootconfig -C user propagates the error and turns such
a bootconfig into a build failure. Worse, short-circuiting the whole
call on a leaf @root would silently drop the valid "kernel.foo = bar"
descendant that this patch should render.

Skip @root inside the loop instead of bailing out: the value-only entry
is dropped (it is rendered through the "kernel" cmdline path, not here),
while real descendant keys are still emitted. An entirely empty subtree
now renders nothing and returns 0 rather than -EINVAL, matching the
"nothing to render is not an error" semantics expected by the new
build-time caller.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 lib/bootconfig.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 2ed9ee3dc81c7..926094d97397e 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -440,6 +440,17 @@ int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
 	 * itself is well defined and returns the would-be length.
 	 */
 	xbc_node_for_each_key_value(root, knode, val) {
+		/*
+		 * An empty or value-only @root (e.g. "kernel {}" or
+		 * "kernel = x", possibly alongside "kernel.foo = bar")
+		 * yields @root itself here. Skip it: composing a key for it
+		 * would fail with -EINVAL, yet any real descendant keys must
+		 * still be rendered. An entirely empty subtree then renders
+		 * nothing and returns 0 rather than an error.
+		 */
+		if (knode == root)
+			continue;
+
 		ret = xbc_node_compose_key_after(root, knode,
 					xbc_namebuf, XBC_KEYLEN_MAX);
 		if (ret < 0)

-- 
2.53.0-Meta

^ permalink raw reply related
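
The skip-instead-of-bail behavior described in the patch above can be
illustrated with a toy model. This is a sketch under stated assumptions,
not the code in lib/bootconfig.c: struct kv and render_subtree() are
hypothetical, the tree is flattened into an array of dotted keys, and
value quoting is omitted. It only shows why skipping the root entry
keeps "kernel.foo = bar" while an otherwise empty subtree returns 0.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flattened view of one key/value pair. */
struct kv {
	const char *key;	/* full dotted key, e.g. "kernel.foo" */
	const char *val;
};

/*
 * Toy model of the loop: render every pair below `root` as "key=val "
 * with the "root." prefix stripped. A pair whose key is `root` itself
 * is skipped rather than treated as an error. Returns the would-be
 * length, like snprintf(), so an empty subtree yields 0.
 */
static int render_subtree(char *buf, size_t size, const char *root,
			  const struct kv *kvs, size_t n)
{
	size_t rlen = strlen(root);
	size_t i;
	int len = 0;

	for (i = 0; i < n; i++) {
		const char *key = kvs[i].key;
		int fits = (size_t)len < size;
		int ret;

		if (strcmp(key, root) == 0)
			continue;	/* value-only root: no key to compose */
		if (strncmp(key, root, rlen) != 0 || key[rlen] != '.')
			continue;	/* not a descendant of this root */

		/* Once the buffer is full, keep counting without writing. */
		ret = snprintf(fits ? buf + len : NULL,
			       fits ? size - (size_t)len : 0,
			       "%s=%s ", key + rlen + 1, kvs[i].val);
		if (ret < 0)
			return ret;
		len += ret;
	}
	return len;
}
```

For the input { "kernel" = "x", "kernel.foo" = "bar" } and root "kernel",
this model writes "foo=bar " and returns 8; for { "kernel" = "x" } alone
it writes nothing and returns 0.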
