Linux Trace Kernel


* Re: [PATCH 1/2] x86/uprobes: Keep shadow stack in sync for emulated CALLs
From: David Windsor @ 2026-06-23 14:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mhiramat, oleg, tglx, mingo, bp, dave.hansen, x86, shuah,
	linux-trace-kernel, linux-kselftest, linux-kernel
In-Reply-To: <20260623084327.GU48970@noisy.programming.kicks-ass.net>

On Tue, Jun 23, 2026 at 4:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jun 22, 2026 at 02:31:08PM -0400, David Windsor wrote:
> > Uprobe CALL emulation updates the normal user stack, but not the CET user
> > shadow stack. The subsequent RET then sees a stale shadow stack entry and
> > raises #CP.
> >
> > Update the relative CALL emulation and XOL CALL fixup paths to keep the
> > shadow stack in sync.
> >
> > Fixes: 488af8ea7131 ("x86/shstk: Wire in shadow stack interface")
>
> I can confirm this patch fixes the included test case, so yay for that.
>
> However, should this not be:
>
> Fixes: 1713b63a07a2 ("x86/shstk: Make return uprobe work with shadow stack")
>
> ?
>

Hmm, this commit appears to only be concerned with the uretprobe case?


* Re: [PATCH v7 00/10] tracing/probes: Add more typecast features
From: Masami Hiramatsu @ 2026-06-23 13:54 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Steven Rostedt, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178217904992.643090.15726197350652241270.stgit@devnote2>

On Tue, 23 Jun 2026 10:44:10 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> Hi,
> 
> Here is the 7th version of series to introduce more typecast features
> to probe events. The previous version is here:
> 
>  https://lore.kernel.org/all/178201238795.570818.15573963115625446598.stgit@devnote2/
> 
> In this version, I added 2 new patches (a fix and a cleanup) and updated
> the series according to Sashiko's review. [1/10] fixes a long-standing
> issue with @+FOFFS, which wrongly added the offset twice. [2/10] is a
> cleanup patch that renames a fetch_op (useful when dumping it).
> This is applicable against probes/core branch on linux-trace tree.

I'll take the first 2 patches into probes/core, since they are
an obvious fix and a cleanup.

Thanks,

> 
> Steve introduced the BTF typecast feature for eprobe[1].
> This series extends it and adds more options:
> 
> 1. Expanding BTF typecast to kprobe and fprobe.
>    (currently only function entry/exit)
> 
> 2. Introduce a container_of-like typecast. This adds an "assigned
>    member" option to the typecast.
> 
>    (STRUCT,MEMBER)VAR->ANOTHER_MEMBER
> 
>    This casts VAR to the STRUCT type, treating VAR as the address
>    of STRUCT.MEMBER. In C, it is:
> 
>    container_of(VAR, STRUCT, MEMBER)->ANOTHER_MEMBER
> 
> 3. Support nested typecast, e.g.
> 
>    (STRUCT)((STRUCT2)VAR->MEMBER2)->MEMBER
> 
>    The nesting level must be smaller than 3.
> 
> 4. Add a $current variable that points to the "current" task_struct.
>    This is useful with typecast, e.g.
> 
>    (task_struct)$current->pid
> 
> 5. per-cpu dereference support.
> 
>    Introduce this_cpu_read(VAR) and this_cpu_ptr(VAR) to
>    access per-cpu data on the current CPU (accessing another CPU's
>    data is not stable, because it can change).
> 
>    You can access the member of per-cpu data structure using
>    typecast like:
> 
>    (STRUCT)this_cpu_ptr(VAR)->MEMBER
> 
> I also added a fetcharg dump feature (for debugging) and updated the
> test scripts to cover some of these features.
> 
> Thanks,
> 
> ---
> base-commit: 3ec75d0067f30eb5e0730f033766d6ab2feca7ae
> 
> Masami Hiramatsu (Google) (10):
>       tracing/probes: Fix double addition of offset for @+FOFFSET
>       tracing/probes: Rename FETCH_OP_DATA to FETCH_OP_IMMSTR
>       tracing/probes: Support dumping fetcharg program for debugging dynamic events
>       tracing/probes: Support typecast for various probe events
>       tracing/probes: Support nested typecast
>       tracing/probes: Type casting always involves nested calls
>       tracing/probes: Support field specifier option for typecast
>       tracing/probes: Add $current variable support
>       tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
>       tracing/probes: Add a new testcase for BTF typecasts
> 
> 
>  Documentation/trace/eprobetrace.rst                |    9 
>  Documentation/trace/fprobetrace.rst                |   10 
>  Documentation/trace/kprobetrace.rst                |   11 
>  kernel/trace/Kconfig                               |   11 
>  kernel/trace/trace.c                               |    8 
>  kernel/trace/trace_eprobe.c                        |    2 
>  kernel/trace/trace_fprobe.c                        |    2 
>  kernel/trace/trace_kprobe.c                        |    2 
>  kernel/trace/trace_probe.c                         |  582 ++++++++++++++++----
>  kernel/trace/trace_probe.h                         |   98 ++-
>  kernel/trace/trace_probe_tmpl.h                    |   27 +
>  kernel/trace/trace_uprobe.c                        |    3 
>  samples/trace_events/trace-events-sample.c         |   40 +
>  samples/trace_events/trace-events-sample.h         |   34 +
>  .../ftrace/test.d/dynevent/btf_probe_event.tc      |   51 ++
>  .../ftrace/test.d/dynevent/fprobe_syntax_errors.tc |   11 
>  .../ftrace/test.d/kprobe/kprobe_syntax_errors.tc   |   11 
>  .../ftrace/test.d/kprobe/uprobe_syntax_errors.tc   |    5 
>  18 files changed, 756 insertions(+), 161 deletions(-)
>  create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
> 
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
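
For readers less familiar with the C idiom that item 2 of the quoted cover
letter refers to, here is a minimal userspace sketch of
container_of(VAR, STRUCT, MEMBER)->ANOTHER_MEMBER. The macro is a simplified
stand-in for the kernel's container_of() (no type checking), and the struct
names, field names and the status_from_link() helper are hypothetical,
chosen only for illustration.

```c
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical types, used only for this illustration. */
struct list_node {
	struct list_node *next;
};

struct request {
	int id;
	struct list_node link;	/* MEMBER: the field VAR points at */
	int status;		/* ANOTHER_MEMBER: the field we want */
};

/* C equivalent of (request,link)VAR->status in the proposed syntax. */
int status_from_link(struct list_node *var)
{
	return container_of(var, struct request, link)->status;
}
```

Given a pointer to the embedded link field, the helper steps back by the
offset of that field to reach the enclosing struct request and then reads
status from it.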


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* [PATCH] tracing/probes: ignore id update from btf_type_skip_modifiers
From: Martin Kaiser @ 2026-06-23 13:29 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: linux-trace-kernel, linux-kernel, Martin Kaiser

We can pass NULL as the id pointer to btf_type_skip_modifiers() if we do
not need the id of the returned btf_type.

Signed-off-by: Martin Kaiser <martin@kaiser.cx>
---
 kernel/trace/trace_probe.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 9b3219e755cb..78bca283763f 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -360,9 +360,8 @@ static bool btf_type_is_char_ptr(struct btf *btf, const struct btf_type *type)
 {
 	const struct btf_type *real_type;
 	u32 intdata;
-	s32 tid;
 
-	real_type = btf_type_skip_modifiers(btf, type->type, &tid);
+	real_type = btf_type_skip_modifiers(btf, type->type, NULL);
 	if (!real_type)
 		return false;
 
@@ -379,14 +378,13 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
 	const struct btf_type *real_type;
 	const struct btf_array *array;
 	u32 intdata;
-	s32 tid;
 
 	if (BTF_INFO_KIND(type->info) != BTF_KIND_ARRAY)
 		return false;
 
 	array = (const struct btf_array *)(type + 1);
 
-	real_type = btf_type_skip_modifiers(btf, array->type, &tid);
+	real_type = btf_type_skip_modifiers(btf, array->type, NULL);
 
 	intdata = btf_type_int(real_type);
 	return !(BTF_INT_ENCODING(intdata) & BTF_INT_SIGNED)
@@ -589,7 +587,6 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 	struct btf *btf = ctx_btf(ctx);
 	char *next;
 	int is_ptr;
-	s32 tid;
 
 	do {
 		if (!is_struct) {
@@ -600,7 +597,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 			}
 
 			/* Convert a struct pointer type to a struct type */
-			type = btf_type_skip_modifiers(btf, type->type, &tid);
+			type = btf_type_skip_modifiers(btf, type->type, NULL);
 			if (!type) {
 				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
 				return -EINVAL;
@@ -640,7 +637,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 				ctx->last_bitsize = 0;
 			}
 
-			type = btf_type_skip_modifiers(btf, field->type, &tid);
+			type = btf_type_skip_modifiers(btf, field->type, NULL);
 			if (!type) {
 				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
 				return -EINVAL;
@@ -759,7 +756,7 @@ static int parse_btf_arg(char *varname,
 	return -ENOENT;
 
 found:
-	type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+	type = btf_type_skip_modifiers(ctx->btf, tid, NULL);
 found_type:
 	if (!type) {
 		trace_probe_log_err(ctx->offset, BAD_BTF_TID);
-- 
2.43.7
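
The patch above relies on an optional out-parameter: the callee writes the
resolved id only when the caller supplies somewhere to put it. The sketch
below shows that pattern in isolation. It is a toy model, not the kernel's
BTF code; struct mini_type, its fields and skip_modifiers() are hypothetical
names introduced for this example.

```c
#include <stddef.h>

/* Hypothetical miniature type table; not the kernel's BTF structures. */
struct mini_type {
	int is_modifier;	/* 1 for const/volatile/typedef-like entries */
	unsigned int next;	/* id of the type this modifier wraps */
};

/*
 * Follow modifier entries until a non-modifier type is reached.
 * res_id is optional: it is written only when the caller passes non-NULL.
 */
const struct mini_type *skip_modifiers(const struct mini_type *table,
				       unsigned int id, unsigned int *res_id)
{
	const struct mini_type *t = &table[id];

	while (t->is_modifier) {
		id = t->next;
		t = &table[id];
	}
	if (res_id)
		*res_id = id;
	return t;
}
```

A caller that only needs the returned type can pass NULL and drop its
otherwise unused local variable, which is what each hunk of the patch does.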



* Re: [PATCH 1/2] x86/uprobes: Keep shadow stack in sync for emulated CALLs
From: Oleg Nesterov @ 2026-06-23 13:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Windsor, mhiramat, tglx, mingo, bp, dave.hansen, x86, shuah,
	linux-trace-kernel, linux-kselftest, linux-kernel
In-Reply-To: <20260623125725.GW48970@noisy.programming.kicks-ass.net>

On 06/23, Peter Zijlstra wrote:
>
> On Tue, Jun 23, 2026 at 02:52:32PM +0200, Oleg Nesterov wrote:
> > On 06/22, David Windsor wrote:
> > >
> > > --- a/arch/x86/kernel/uprobes.c
> > > +++ b/arch/x86/kernel/uprobes.c
> > > @@ -1246,8 +1246,12 @@ static int default_post_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs
> > >  		long correction = utask->vaddr - utask->xol_vaddr;
> > >  		regs->ip += correction;
> > >  	} else if (auprobe->defparam.fixups & UPROBE_FIX_CALL) {
> > > +		unsigned long retaddr = utask->vaddr + auprobe->defparam.ilen;
> > > +
> > >  		regs->sp += sizeof_long(regs); /* Pop incorrect return address */
> > > -		if (emulate_push_stack(regs, utask->vaddr + auprobe->defparam.ilen))
> > > +		if (emulate_push_stack(regs, retaddr))
> > > +			return -ERESTART;
> > > +		if (shstk_update_last_frame(retaddr))
> > >  			return -ERESTART;
> >
> > Well, if shstk_update_last_frame() fails after emulate_push_stack(), we should
> > probably return another error, so that the caller handle_singlestep() will kill
> > this task?
>
> Makes sense, the other user has a force_sig(SIGSEGV) on failure.

Offtopic question... both shstk_update_last_frame() and shstk_push() are only
used by arch/x86/kernel/uprobes.c. But they are not symmetric in that
shstk_update_last_frame() returns 0 if !features_enabled(ARCH_SHSTK_SHSTK),
while shstk_push() returns -ENOTSUPP in this case.

That is why the users can't just do "if (shstk_push(xxx)) ...". This is really
minor, but perhaps it makes sense to change shstk_push() to return 0 in this
case too? I don't think -ENOTSUPP is actually useful...
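
A small userspace model of the asymmetry described above, for illustration
only. The mock_* functions, the enabled flag and the *_failed() helpers are
hypothetical; they only mirror the return conventions stated in this mail
(0 versus -ENOTSUPP when shadow stack is not enabled) and do none of the
real work.

```c
#include <stdbool.h>

enum { MOCK_ENOTSUPP = 524 };	/* value of the kernel-internal ENOTSUPP */

/* Hypothetical stand-in for features_enabled(ARCH_SHSTK_SHSTK). */
static bool mock_shstk_enabled;

void mock_set_shstk_enabled(bool on)
{
	mock_shstk_enabled = on;
}

/* Models the convention described for shstk_update_last_frame(). */
int mock_update_last_frame(unsigned long val)
{
	(void)val;
	if (!mock_shstk_enabled)
		return 0;	/* "nothing to do" counts as success */
	return 0;		/* the real update is omitted in this model */
}

/* Models the convention described for shstk_push(). */
int mock_push(unsigned long val)
{
	(void)val;
	if (!mock_shstk_enabled)
		return -MOCK_ENOTSUPP;
	return 0;		/* the real push is omitted in this model */
}

/* With the first convention a plain truth test is enough. */
bool update_failed(unsigned long val)
{
	return mock_update_last_frame(val) != 0;
}

/* With the second one the caller must filter out "not enabled". */
bool push_failed(unsigned long val)
{
	int err = mock_push(val);

	return err != 0 && err != -MOCK_ENOTSUPP;
}
```

If the push variant returned 0 when the feature is disabled, push_failed()
could shrink to the same one-line test as update_failed().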

Oleg.



* Re: [PATCH 1/1] tools/tracing/rtla: fix missing unistd include
From: Tomas Glozar @ 2026-06-23 13:12 UTC (permalink / raw)
  To: Andreas Ziegler; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <20260614092855.129278-1-br025@umbiko.net>

Hi Andreas,

Please note that rtla uses "tools/rtla:" or "rtla:" prefix for
patches, not "tools/tracing/rtla".

On Sun, Jun 14, 2026 at 11:35 Andreas Ziegler <br025@umbiko.net> wrote:
>
> Compiling RTLA 7.1-rc6 with GCC 16 and uClibc as standard library fails
> with these errors:
>
> ...
>
> Restore the missing unistd.h include.
>

Thanks for the fix, I missed that.

Indeed, according to POSIX, alarm() is declared in unistd.h, so the
header has to be included. I'll try to add a uClibc build to my tests.
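
As a minimal illustration of the dependency (a sketch, not code from rtla):
the helper below calls alarm() and therefore includes unistd.h itself
instead of relying on another header pulling it in. The function name
arm_and_cancel() is hypothetical.

```c
#include <unistd.h>	/* declares alarm() */

/*
 * Arm a timer and cancel it again before it can fire.
 * Returns the number of seconds that were still pending.
 */
unsigned int arm_and_cancel(unsigned int seconds)
{
	alarm(seconds);
	return alarm(0);	/* alarm(0) cancels and reports the remainder */
}
```

Whether a missing include is noticed depends on which other headers the C
library happens to pull in, which would explain why the build only broke
with some libc implementations.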

> Fixes: <115b06a00875> (tools/rtla: Consolidate nr_cpus usage across all tools)
>

The conventional syntax for Fixes is:

Fixes: 115b06a00875 ("tools/rtla: Consolidate nr_cpus usage across all tools")

i.e. no angle brackets, and double quotes around the commit name. See:
https://docs.kernel.org/process/submitting-patches.html#describe-changes

> Signed-off-by: Andreas Ziegler <br025@umbiko.net>
> ---
>  tools/tracing/rtla/src/common.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
> index 35e3d3aa922e..5c5398d20f40 100644
> --- a/tools/tracing/rtla/src/common.c
> +++ b/tools/tracing/rtla/src/common.c
> @@ -5,6 +5,7 @@
>  #include <signal.h>
>  #include <stdlib.h>
>  #include <string.h>
> +#include <unistd.h>
>  #include <getopt.h>

The getopt.h include was removed in master, causing a conflict when
applying the patch. Could you please rebase?

>  #include <sys/sysinfo.h>
>
> --
> 2.53.0
>

Thanks,

Tomas


* Re: [PATCH 1/2] x86/uprobes: Keep shadow stack in sync for emulated CALLs
From: Peter Zijlstra @ 2026-06-23 12:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: David Windsor, mhiramat, tglx, mingo, bp, dave.hansen, x86, shuah,
	linux-trace-kernel, linux-kselftest, linux-kernel
In-Reply-To: <ajqBkKE8TpQL1mIG@redhat.com>

On Tue, Jun 23, 2026 at 02:52:32PM +0200, Oleg Nesterov wrote:
> On 06/22, David Windsor wrote:
> >
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -1246,8 +1246,12 @@ static int default_post_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs
> >  		long correction = utask->vaddr - utask->xol_vaddr;
> >  		regs->ip += correction;
> >  	} else if (auprobe->defparam.fixups & UPROBE_FIX_CALL) {
> > +		unsigned long retaddr = utask->vaddr + auprobe->defparam.ilen;
> > +
> >  		regs->sp += sizeof_long(regs); /* Pop incorrect return address */
> > -		if (emulate_push_stack(regs, utask->vaddr + auprobe->defparam.ilen))
> > +		if (emulate_push_stack(regs, retaddr))
> > +			return -ERESTART;
> > +		if (shstk_update_last_frame(retaddr))
> >  			return -ERESTART;
> 
> Well, if shstk_update_last_frame() fails after emulate_push_stack(), we should
> probably return another error, so that the caller handle_singlestep() will kill
> this task?

Makes sense, the other user has a force_sig(SIGSEGV) on failure.


* Re: [PATCH 1/2] x86/uprobes: Keep shadow stack in sync for emulated CALLs
From: Oleg Nesterov @ 2026-06-23 12:52 UTC (permalink / raw)
  To: David Windsor
  Cc: mhiramat, peterz, tglx, mingo, bp, dave.hansen, x86, shuah,
	linux-trace-kernel, linux-kselftest, linux-kernel
In-Reply-To: <20260622183109.1137245-1-dwindsor@gmail.com>

On 06/22, David Windsor wrote:
>
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -1246,8 +1246,12 @@ static int default_post_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs
>  		long correction = utask->vaddr - utask->xol_vaddr;
>  		regs->ip += correction;
>  	} else if (auprobe->defparam.fixups & UPROBE_FIX_CALL) {
> +		unsigned long retaddr = utask->vaddr + auprobe->defparam.ilen;
> +
>  		regs->sp += sizeof_long(regs); /* Pop incorrect return address */
> -		if (emulate_push_stack(regs, utask->vaddr + auprobe->defparam.ilen))
> +		if (emulate_push_stack(regs, retaddr))
> +			return -ERESTART;
> +		if (shstk_update_last_frame(retaddr))
>  			return -ERESTART;

Well, if shstk_update_last_frame() fails after emulate_push_stack(), we should
probably return another error, so that the caller handle_singlestep() will kill
this task?
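
One possible reading of this remark, sketched as a toy userspace model
rather than kernel code: a failure before anything was modified can be
retried, while a failure after the data stack was already rewritten leaves
the two stacks disagreeing and should be fatal. struct toy_task, the
FIXUP_* codes and both functions are hypothetical names used only here.

```c
#include <stddef.h>

#define STACK_SLOTS 8

/* Hypothetical result codes; the kernel uses errno values instead. */
enum fixup_result { FIXUP_OK = 0, FIXUP_RETRY, FIXUP_FATAL };

/* Toy task with a data stack and a shadow stack. */
struct toy_task {
	unsigned long stack[STACK_SLOTS];
	size_t sp;		/* number of used data-stack slots */
	unsigned long shstk[STACK_SLOTS];
	size_t ssp;		/* number of used shadow-stack slots */
	int stack_writable;	/* 0 simulates a faulting user write */
	int shstk_writable;	/* 0 simulates a failing shadow update */
};

enum fixup_result fixup_call(struct toy_task *t, unsigned long retaddr)
{
	if (t->sp == 0 || t->ssp == 0)
		return FIXUP_FATAL;	/* no CALL frame to fix up */

	/* Step 1: replace the return address on the data stack. */
	if (!t->stack_writable)
		return FIXUP_RETRY;	/* nothing modified yet */
	t->stack[t->sp - 1] = retaddr;

	/* Step 2: make the last shadow stack frame match. */
	if (!t->shstk_writable)
		return FIXUP_FATAL;	/* the stacks now disagree */
	t->shstk[t->ssp - 1] = retaddr;

	return FIXUP_OK;
}

/* In this model a RET faults when the two stacks disagree. */
int ret_would_fault(const struct toy_task *t)
{
	return t->stack[t->sp - 1] != t->shstk[t->ssp - 1];
}
```

In the model, a FIXUP_FATAL result is exactly the state in which a later RET
would fault, which is the argument for not returning the retry code there.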

Oleg.



* Re: [PATCH v7 06/10] tracing/probes: Type casting always involves nested calls
From: Masami Hiramatsu @ 2026-06-23 11:50 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Steven Rostedt, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178217910709.643090.6379198143797177881.stgit@devnote2>


Sashiko commented:

> > +	}
> > +	*close = '\0';
> >  
> > -		ctx->offset += 1;	/* for the '(' */
> > -		/* We need to parse the nested one */
> > -		ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
> > -				pcode, end, ctx);
> > -		if (ret < 0)
> > -			return ret;
> > -		ctx->nested_level--;
> > -		clear_struct_btf(ctx);
> > +	/* We need to parse the nested one */
> > +	ret = parse_probe_arg(tmp, find_fetch_type(NULL, ctx->flags),
> > +			      pcode, end, ctx);
> 
> Does routing the nested argument through parse_probe_arg() instead of 
> parse_btf_arg() break the documented eprobe syntax (STRUCT)FIELD->MEMBER?
> Since this patch removes the TPARG_FL_TEVENT handling block from
> parse_btf_arg(), and handle_typecast() now recursively invokes 
> parse_probe_arg() on the extracted FIELD name, a bare eprobe FIELD name 
> that does not start with a '$' prefix will hit the default case in 
> parse_probe_arg().
> This causes it to be unconditionally rejected with -EINVAL (NOSUP_BTFARG) 
> because eprobe flags (TPARG_FL_TEVENT) do not satisfy 
> tparg_is_function_entry() or tparg_is_function_return(). This acts as an 
> ABI breakage for existing user-space scripts relying on this eprobe syntax.

Wait... Did the type-casting support patch accidentally allow access to
event fields without the "$"?

Hmm, if so, it should be documented, and it needs to be supported
correctly both with and without a typecast.

Thank you,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH v2 1/2] signal: avoid shared siginfo namespace rewrites
From: Oleg Nesterov @ 2026-06-23 11:37 UTC (permalink / raw)
  To: Bradley Morgan, Eric W. Biederman
  Cc: Christian Brauner, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Marco Elver,
	Aleksandr Nogikh, Thomas Gleixner, Adrian Huang, Kexin Sun,
	linux-kernel, linux-trace-kernel, stable
In-Reply-To: <86a8857d58d43ee26a8b365b837fd24830343494.1782159692.git.include@grrlz.net>

Add Eric.

OK, I agree, it seems we need a simple fix.

Acked-by: Oleg Nesterov <oleg@redhat.com>

-------------------------------------------------------------------------
But let me add some "offtopic" notes... Why do we actually need this fix?

kill_something_info(). But at first glance sys_kill/kill_something_info
can simply use SEND_SIG_NOINFO? If yes, this makes sense anyway, I will
re-check...

do_pidfd_send_signal(PIDFD_SIGNAL_PROCESS_GROUP) allows calling
kill_pgrp_info() if si_code < 0... Not that I think this would be better,
but we could move this "rewrite" logic into __kill_pgrp_info()...

Anything else needs this change? Most probably yes, but after the quick
grep I don't see other group senders with !is_si_special(info).

Eric, what do you think?

Oleg.

On 06/22, Bradley Morgan wrote:
>
> send_signal_locked() rewrites sender ids for the target namespace.
> Group sends reuse the same siginfo, so one recipient can affect the
> next.
>
> Copy the siginfo before changing it.
>
> Fixes: 7a0cf094944e ("signal: Correct namespace fixups of si_pid and si_uid")
> Cc: stable@vger.kernel.org
> Signed-off-by: Bradley Morgan <include@grrlz.net>
> ---
> Changes since v1:
> - No code changes in this patch.
> - Add patch 2 for Oleg's const suggestion.
> - Link to v1:
>   https://lore.kernel.org/all/0873AC4A-3CB2-4F7B-BFE6-75D855AD22DC@grrlz.net/T/#m89955d13f10807c316d34cc76680d690a2d95b31
>
>  kernel/signal.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index b9fc7be1a169..d72d9be3a992 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1181,6 +1181,7 @@ static inline bool has_si_pid_and_uid(struct kernel_siginfo *info)
>  int send_signal_locked(int sig, struct kernel_siginfo *info,
>  		       struct task_struct *t, enum pid_type type)
>  {
> +	struct kernel_siginfo rewritten;
>  	/* Should SIGKILL or SIGSTOP be received by a pid namespace init? */
>  	bool force = false;
>
> @@ -1194,6 +1195,9 @@ int send_signal_locked(int sig, struct kernel_siginfo *info,
>  		/* SIGKILL and SIGSTOP is special or has ids */
>  		struct user_namespace *t_user_ns;
>
> +		rewritten = *info;
> +		info = &rewritten;
> +
>  		rcu_read_lock();
>  		t_user_ns = task_cred_xxx(t, user_ns);
>  		if (current_user_ns() != t_user_ns) {
> --
> 2.53.0
>
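
The problem described in the quoted commit message can be reproduced in a
few lines of plain C. This is a toy model under stated assumptions: struct
toy_siginfo stands in for struct kernel_siginfo, and translate_uid() with
its per-recipient offset is a hypothetical replacement for the real user
namespace mapping.

```c
/* Hypothetical, heavily reduced stand-in for struct kernel_siginfo. */
struct toy_siginfo {
	int si_uid;
};

/* Toy namespace translation: a recipient sees uids shifted by an offset. */
static int translate_uid(int uid, int ns_offset)
{
	return uid + ns_offset;
}

/* Problematic pattern: rewrites the caller's siginfo in place. */
int deliver_in_place(struct toy_siginfo *info, int ns_offset)
{
	info->si_uid = translate_uid(info->si_uid, ns_offset);
	return info->si_uid;	/* the uid this recipient observes */
}

/* Pattern used by the patch: rewrite a private copy only. */
int deliver_with_copy(const struct toy_siginfo *info, int ns_offset)
{
	struct toy_siginfo rewritten = *info;

	rewritten.si_uid = translate_uid(rewritten.si_uid, ns_offset);
	return rewritten.si_uid;
}
```

Calling deliver_in_place() for two recipients with the same siginfo lets the
first translation leak into the second; deliver_with_copy() gives every
recipient a translation of the original value.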



* Re: [PATCH] tracing/probes: make file offset error message probe-agnostic
From: Masami Hiramatsu @ 2026-06-23 10:53 UTC (permalink / raw)
  To: Yudistira Putra
  Cc: Steven Rostedt, Mathieu Desnoyers, linux-trace-kernel,
	linux-kernel
In-Reply-To: <20260622160032.99834-1-pyudistira519@gmail.com>

On Mon, 22 Jun 2026 12:00:32 -0400
Yudistira Putra <pyudistira519@gmail.com> wrote:

> The shared probe argument parser rejects file offsets for kernel probes.
> This path is used outside the kprobe event parser too, but the diagnostic
> currently says "with kprobe" even when emitted from another probe path.
> 
> Make the diagnostic probe-agnostic.
> 

Looks good to me. Let me pick it.

Thanks!

> Signed-off-by: Yudistira Putra <pyudistira519@gmail.com>
> ---
>  kernel/trace/trace_probe.c | 2 +-
>  kernel/trace/trace_probe.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index fd1caa1f9723..fec0ad51cf61 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -1228,7 +1228,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
>  			code->op = FETCH_OP_IMM;
>  			code->immediate = param;
>  		} else if (arg[1] == '+') {
> -			/* kprobes don't support file offsets */
> +			/* Kernel probes do not support file offsets */
>  			if (ctx->flags & TPARG_FL_KERNEL) {
>  				trace_probe_log_err(ctx->offset, FILE_ON_KPROBE);
>  				return -EINVAL;
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 15758cc11fc6..6162f066c2b8 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -516,7 +516,7 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
>  	C(BAD_MEM_ADDR,		"Invalid memory address"),		\
>  	C(BAD_IMM,		"Invalid immediate value"),		\
>  	C(IMMSTR_NO_CLOSE,	"String is not closed with '\"'"),	\
> -	C(FILE_ON_KPROBE,	"File offset is not available with kprobe"), \
> +	C(FILE_ON_KPROBE,	"File offset is not available for kernel probes"), \
>  	C(BAD_FILE_OFFS,	"Invalid file offset value"),		\
>  	C(SYM_ON_UPROBE,	"Symbol is not available with uprobe"),	\
>  	C(TOO_MANY_OPS,		"Dereference is too much nested"), 	\
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH] tracing/probes: fix typo in invalid variable error message
From: Masami Hiramatsu @ 2026-06-23 10:52 UTC (permalink / raw)
  To: Yudistira Putra
  Cc: Steven Rostedt, Mathieu Desnoyers, linux-trace-kernel,
	linux-kernel
In-Reply-To: <20260622152304.88345-1-pyudistira519@gmail.com>

On Mon, 22 Jun 2026 11:23:04 -0400
Yudistira Putra <pyudistira519@gmail.com> wrote:

> Fix a typo in the BAD_VAR diagnostic emitted for invalid $-variables
> in probe event arguments.
> 

Thanks, but the same fix has already been picked up.

https://lore.kernel.org/all/20260507081041.885781-4-martin@kaiser.cx/


> Signed-off-by: Yudistira Putra <pyudistira519@gmail.com>
> ---
>  kernel/trace/trace_probe.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 15758cc11fc6..0f09f7aaf93f 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -511,7 +511,7 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
>  	C(NO_RETVAL,		"This function returns 'void' type"),	\
>  	C(BAD_STACK_NUM,	"Invalid stack number"),		\
>  	C(BAD_ARG_NUM,		"Invalid argument number"),		\
> -	C(BAD_VAR,		"Invalid $-valiable specified"),	\
> +	C(BAD_VAR,		"Invalid $-variable specified"),	\
>  	C(BAD_REG_NAME,		"Invalid register name"),		\
>  	C(BAD_MEM_ADDR,		"Invalid memory address"),		\
>  	C(BAD_IMM,		"Invalid immediate value"),		\
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH v2 2/2] signal: make send_signal_locked() take const siginfo
From: Oleg Nesterov @ 2026-06-23 10:39 UTC (permalink / raw)
  To: Bradley Morgan
  Cc: Christian Brauner, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Marco Elver,
	Aleksandr Nogikh, Thomas Gleixner, Adrian Huang, Kexin Sun,
	linux-kernel, linux-trace-kernel
In-Reply-To: <f754c4e5c82b45bcbb770aa8bb1f4ab1d87a0b0e.1782159692.git.include@grrlz.net>

On 06/22, Bradley Morgan wrote:
>
> send_signal_locked() should not change the caller's siginfo. Make that
> part of the type and keep the local rewrite on its copy.
>
> Suggested-by: Oleg Nesterov <oleg@redhat.com>

Ah, sorry... I only suggested changing the signature of send_signal_locked()
and thus has_si_pid_and_uid(). Perhaps a broader change makes sense too, but
it conflicts with another series (under discussion):

	[PATCH v2 3/3] signal: fix evasion of SA_IMMUTABLE signals
	https://lore.kernel.org/all/ajVD6ZmiSQLxjj57@redhat.com/

Now let me take another look at 1/2 ...

Oleg.

> Signed-off-by: Bradley Morgan <include@grrlz.net>
> ---
> Changes since v1:
> - New patch from Oleg's suggestion.
> - Link to Oleg's suggestion:
>   https://lore.kernel.org/all/0873AC4A-3CB2-4F7B-BFE6-75D855AD22DC@grrlz.net/T/#m5f8a2d54928efff41de539969b68149e1ec5fca4
> 
>  include/linux/signal.h        |  2 +-
>  include/trace/events/signal.h |  4 ++--
>  kernel/signal.c               | 20 +++++++++++---------
>  3 files changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index f19816832f05..a1ba8c5973c6 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -283,7 +283,7 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info,
>  				struct task_struct *p, enum pid_type type);
>  extern int group_send_sig_info(int sig, struct kernel_siginfo *info,
>  			       struct task_struct *p, enum pid_type type);
> -extern int send_signal_locked(int sig, struct kernel_siginfo *info,
> +extern int send_signal_locked(int sig, const struct kernel_siginfo *info,
>  			      struct task_struct *p, enum pid_type type);
>  extern int sigprocmask(int, sigset_t *, sigset_t *);
>  extern void set_current_blocked(sigset_t *);
> diff --git a/include/trace/events/signal.h b/include/trace/events/signal.h
> index 1db7e4b07c01..05a46135ee34 100644
> --- a/include/trace/events/signal.h
> +++ b/include/trace/events/signal.h
> @@ -49,8 +49,8 @@ enum {
>   */
>  TRACE_EVENT(signal_generate,
>  
> -	TP_PROTO(int sig, struct kernel_siginfo *info, struct task_struct *task,
> -			int group, int result),
> +	TP_PROTO(int sig, const struct kernel_siginfo *info,
> +		 struct task_struct *task, int group, int result),
>  
>  	TP_ARGS(sig, info, task, group, result),
>  
> diff --git a/kernel/signal.c b/kernel/signal.c
> index d72d9be3a992..26e8b8e1d03c 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1037,7 +1037,7 @@ static inline bool legacy_queue(struct sigpending *signals, int sig)
>  	return (sig < SIGRTMIN) && sigismember(&signals->signal, sig);
>  }
>  
> -static int __send_signal_locked(int sig, struct kernel_siginfo *info,
> +static int __send_signal_locked(int sig, const struct kernel_siginfo *info,
>  				struct task_struct *t, enum pid_type type, bool force)
>  {
>  	struct sigpending *pending;
> @@ -1154,7 +1154,7 @@ static int __send_signal_locked(int sig, struct kernel_siginfo *info,
>  	return ret;
>  }
>  
> -static inline bool has_si_pid_and_uid(struct kernel_siginfo *info)
> +static inline bool has_si_pid_and_uid(const struct kernel_siginfo *info)
>  {
>  	bool ret = false;
>  	switch (siginfo_layout(info->si_signo, info->si_code)) {
> @@ -1178,10 +1178,11 @@ static inline bool has_si_pid_and_uid(struct kernel_siginfo *info)
>  	return ret;
>  }
>  
> -int send_signal_locked(int sig, struct kernel_siginfo *info,
> +int send_signal_locked(int sig, const struct kernel_siginfo *info,
>  		       struct task_struct *t, enum pid_type type)
>  {
>  	struct kernel_siginfo rewritten;
> +	const struct kernel_siginfo *send_info = info;
>  	/* Should SIGKILL or SIGSTOP be received by a pid namespace init? */
>  	bool force = false;
>  
> @@ -1196,26 +1197,27 @@ int send_signal_locked(int sig, struct kernel_siginfo *info,
>  		struct user_namespace *t_user_ns;
>  
>  		rewritten = *info;
> -		info = &rewritten;
> +		send_info = &rewritten;
>  
>  		rcu_read_lock();
>  		t_user_ns = task_cred_xxx(t, user_ns);
>  		if (current_user_ns() != t_user_ns) {
> -			kuid_t uid = make_kuid(current_user_ns(), info->si_uid);
> -			info->si_uid = from_kuid_munged(t_user_ns, uid);
> +			kuid_t uid = make_kuid(current_user_ns(), rewritten.si_uid);
> +
> +			rewritten.si_uid = from_kuid_munged(t_user_ns, uid);
>  		}
>  		rcu_read_unlock();
>  
>  		/* A kernel generated signal? */
> -		force = (info->si_code == SI_KERNEL);
> +		force = (rewritten.si_code == SI_KERNEL);
>  
>  		/* From an ancestor pid namespace? */
>  		if (!task_pid_nr_ns(current, task_active_pid_ns(t))) {
> -			info->si_pid = 0;
> +			rewritten.si_pid = 0;
>  			force = true;
>  		}
>  	}
> -	return __send_signal_locked(sig, info, t, type, force);
> +	return __send_signal_locked(sig, send_info, t, type, force);
>  }
>  
>  static void print_fatal_signal(int signr)
> -- 
> 2.53.0
> 



* Re: [PATCH 2/3] rv/reactors: add KUnit tests for reactor_printk
From: Thomas Weißschuh @ 2026-06-23  9:54 UTC (permalink / raw)
  To: wen.yang; +Cc: Gabriele Monaco, Nam Cao, linux-trace-kernel, linux-kernel
In-Reply-To: <690593305ad075539a804e0bf94493335354e6b9.1781541556.git.wen.yang@linux.dev>

On Tue, Jun 16, 2026 at 12:44:49AM +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Add KUnit tests for the printk reactor covering:
> - Reactor registration and unregistration lifecycle
> - React callback invocation via rv_react()
> - Double registration rejection
> - Multiple register/unregister cycles
> 
> The mock callback calls vprintk_deferred() — the same path as the real
> reactor — then busy-waits to simulate I/O back-pressure, exercising the
> LD_WAIT_FREE constraint of rv_react() under load.
> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
>  kernel/trace/rv/Kconfig                |  10 ++
>  kernel/trace/rv/Makefile               |   1 +
>  kernel/trace/rv/reactor_printk_kunit.c | 123 +++++++++++++++++++++++++
>  3 files changed, 134 insertions(+)
>  create mode 100644 kernel/trace/rv/reactor_printk_kunit.c
> 
> diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
> index 3884b14df375..ff47895c897f 100644
> --- a/kernel/trace/rv/Kconfig
> +++ b/kernel/trace/rv/Kconfig
> @@ -104,6 +104,16 @@ config RV_REACT_PRINTK
>  	  Enables the printk reactor. The printk reactor emits a printk()
>  	  message if an exception is found.
>  
> +config RV_REACT_PRINTK_KUNIT
> +	bool "KUnit tests for reactor_printk" if !KUNIT_ALL_TESTS

It would be nice if this was a tristate symbol.
Otherwise the test is completely unusable with CONFIG_KUNIT=m.
Maybe use EXPORT_SYMBOL_FOR_MODULES() for the few needed symbols.

> +	depends on RV_REACT_PRINTK && KUNIT

The dependency on RV_REACT_PRINTK is not actually necessary.

Nit: I would split this into two 'depends on'.

> +	default KUNIT_ALL_TESTS
> +	help
> +	  This builds KUnit tests for the printk reactor. These are only
> +	  for development and testing, not for regular kernel use cases.
> +
> +	  If unsure, say N.
> +
>  config RV_REACT_PANIC
>  	bool "Panic reactor"
>  	depends on RV_REACTORS
> diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
> index 94498da35b37..ef0a2dcb927c 100644
> --- a/kernel/trace/rv/Makefile
> +++ b/kernel/trace/rv/Makefile
> @@ -23,4 +23,5 @@ obj-$(CONFIG_RV_MON_NOMISS) += monitors/nomiss/nomiss.o
>  # Add new monitors here
>  obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
>  obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
> +obj-$(CONFIG_RV_REACT_PRINTK_KUNIT) += reactor_printk_kunit.o
>  obj-$(CONFIG_RV_REACT_PANIC) += reactor_panic.o
> diff --git a/kernel/trace/rv/reactor_printk_kunit.c b/kernel/trace/rv/reactor_printk_kunit.c
> new file mode 100644
> index 000000000000..933aa5602226
> --- /dev/null
> +++ b/kernel/trace/rv/reactor_printk_kunit.c
> @@ -0,0 +1,123 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KUnit tests for reactor_printk
> + *
> + */
> +
> +#include <kunit/test.h>
> +#include <linux/rv.h>
> +#include <linux/printk.h>
> +#include <linux/sched/clock.h>
> +#include <linux/processor.h>
> +
> +/*
> + * Simulated execution time for mock_printk_react (sched_clock units,
> + * nanoseconds).  Models the time a real printk reactor callback may consume
> + * under I/O pressure, exercising the LD_WAIT_FREE constraint of rv_react().
> + */
> +#define MOCK_REACT_DURATION_NS	5000000ULL
> +
> +/*
> + * Mock react callback mirroring rv_printk_reaction().
> + *
> + * Calls vprintk_deferred() — the same path as the real reactor — then holds
> + * the CPU for MOCK_REACT_DURATION_NS via a sched_clock() timed busy-loop,
> + * simulating a callback that is slow due to I/O back-pressure.
> + * sched_clock() is notrace and lock-free; no sleep or lock acquisition is
> + * performed, satisfying the LD_WAIT_FREE constraint of rv_react().
> + */
> +__printf(1, 0) static void mock_printk_react(const char *msg, va_list args)
> +{
> +	u64 start = sched_clock();
> +
> +	vprintk_deferred(msg, args);

This can get out of sync with the real reactor.

> +
> +	while (sched_clock() - start < MOCK_REACT_DURATION_NS)
> +		cpu_relax();
> +}
> +
> +static struct rv_reactor mock_printk_reactor = {
> +	.name		= "test_printk",
> +	.description	= "test printk reactor",
> +	.react		= mock_printk_react,
> +};
> +
> +/* Test 1: register and unregister reactor */

Not a fan of the test numbers in the comment.
They will become stale fast.

> +static void test_printk_register_unregister(struct kunit *test)
> +{
> +	int ret;
> +
> +	ret = rv_register_reactor(&mock_printk_reactor);
> +	KUNIT_EXPECT_EQ(test, ret, 0);

> +	KUNIT_EXPECT_STREQ(test, mock_printk_reactor.name, "test_printk");

This doesn't really test anything.

> +
> +	rv_unregister_reactor(&mock_printk_reactor);
> +}
> +
> +/* Test 2: react callback is invoked via rv_react() */
> +static void test_printk_react_called(struct kunit *test)
> +{
> +	struct rv_reactor reactor = {
> +		.name	= "printk_cb_test",
> +		.react	= mock_printk_react,
> +	};
> +	struct rv_monitor monitor = {
> +		.name	= "test_monitor",
> +		.reactor = &reactor,
> +		.react	= mock_printk_react,
> +	};
> +
> +	rv_react(&monitor, "printk violation message");

The invocation is not actually tested.
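
One way to make the invocation observable is to have the mock record each
call and then assert on that record. The following is a minimal user-space
sketch of the pattern, not kernel code: struct fake_monitor, fake_react() and
the mock_react_* variables are hypothetical stand-ins for struct rv_monitor,
rv_react() and whatever state the real test would keep.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for struct rv_monitor; only the callback matters. */
struct fake_monitor {
	void (*react)(const char *msg, va_list args);
};

/* Number of times the mock callback has run. */
static int mock_react_calls;
/* Last message formatted by the mock callback. */
static char mock_react_last[64];

/* Mock callback: count the call and keep the formatted message. */
static void mock_react(const char *msg, va_list args)
{
	mock_react_calls++;
	vsnprintf(mock_react_last, sizeof(mock_react_last), msg, args);
}

/* Hypothetical stand-in for rv_react(): forward the varargs to the callback. */
static void fake_react(struct fake_monitor *monitor, const char *msg, ...)
{
	va_list args;

	if (!monitor->react)
		return;

	va_start(args, msg);
	monitor->react(msg, args);
	va_end(args);
}
```

In the KUnit test the equivalent check would be something like
KUNIT_EXPECT_EQ(test, mock_react_calls, 1) after the rv_react() call, with
the counter reset at the start of the test.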

> +}
> +
> +/* Test 3: double registration should fail */
> +static void test_printk_double_register(struct kunit *test)
> +{
> +	int ret;
> +
> +	ret = rv_register_reactor(&mock_printk_reactor);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	ret = rv_register_reactor(&mock_printk_reactor);
> +	KUNIT_EXPECT_NE(test, ret, 0);

This could test the specific return value.
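
As a sketch of what asserting on the exact error could look like, here is a
small user-space model. It assumes that a duplicate name is rejected with
-EINVAL, which should be confirmed against rv_reactors.c before relying on
it; fake_register_reactor() and the fixed-size table are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_REACTORS 8

/* Names registered so far (hypothetical model of the reactor list). */
static const char *registered[MAX_REACTORS];
static size_t nr_registered;

/* Model of registration: reject a name that is already present. */
static int fake_register_reactor(const char *name)
{
	size_t i;

	for (i = 0; i < nr_registered; i++) {
		if (strcmp(registered[i], name) == 0)
			return -EINVAL;	/* assumed error code for a duplicate */
	}

	if (nr_registered == MAX_REACTORS)
		return -ENOSPC;	/* limit of this model only */

	registered[nr_registered++] = name;
	return 0;
}
```

The test would then compare against the concrete value, e.g.
KUNIT_EXPECT_EQ(test, ret, -EINVAL), instead of only checking for non-zero.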

> +
> +	rv_unregister_reactor(&mock_printk_reactor);
> +}
> +
> +/* Test 4: register/unregister cycle */
> +static void test_printk_register_cycle(struct kunit *test)
> +{
> +	int ret, i;
> +
> +	for (i = 0; i < 5; i++) {
> +		ret = rv_register_reactor(&mock_printk_reactor);
> +		KUNIT_EXPECT_EQ(test, ret, 0);
> +
> +		rv_unregister_reactor(&mock_printk_reactor);
> +	}
> +}
> +
> +/* Test 5: react callback is not NULL (printk reactors must provide react) */
> +static void test_printk_react_not_null(struct kunit *test)
> +{
> +	KUNIT_EXPECT_NOT_NULL(test, mock_printk_reactor.react);

This doesn't really test anything.

> +}
> +
> +static struct kunit_case reactor_printk_kunit_cases[] = {
> +	KUNIT_CASE(test_printk_register_unregister),
> +	KUNIT_CASE(test_printk_react_called),
> +	KUNIT_CASE(test_printk_double_register),
> +	KUNIT_CASE(test_printk_register_cycle),
> +	KUNIT_CASE(test_printk_react_not_null),

Most tests are not related to the printk reactor at all.
Maybe add a generic "rv_reactor" suite.

> +	{}
> +};
> +
> +static struct kunit_suite reactor_printk_kunit_suite = {
> +	.name		= "rv_reactor_printk",
> +	.test_cases	= reactor_printk_kunit_cases,
> +};
> +
> +kunit_test_suite(reactor_printk_kunit_suite);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("KUnit tests for reactor_printk");
> -- 
> 2.25.1
> 


* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Binbin Wu @ 2026-06-23  9:48 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-18-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> @@ -606,12 +608,20 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>  	next = start;
>  	while (safe && filemap_get_folios(mapping, &next, last, &fbatch)) {
>  
> -		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +		for (i = 0; i < folio_batch_count(&fbatch);) {
>  			struct folio *folio = fbatch.folios[i];
>  
> -			if (folio_ref_count(folio) !=
> -			    folio_nr_pages(folio) + filemap_get_folios_refcount) {
> -				safe = false;
> +			safe = (folio_ref_count(folio) ==
> +				folio_nr_pages(folio) +
> +				filemap_get_folios_refcount);
> +
> +			if (safe) {
> +				++i;
> +			} else if (folio_may_be_lru_cached(folio) &&
> +				   !lru_drained) {
> +				lru_add_drain_all();

It seems unprivileged userspace can trigger lru_add_drain_all() repeatedly by
invoking KVM_SET_MEMORY_ATTRIBUTES2 in a loop. Could that pose a DoS risk?

> +				lru_drained = true;
> +			} else {
>  				*err_index = max(start, folio->index);
>  				break;
>  			}
> 



* Re: [PATCH 1/3] rv/reactors: fix lockdep "Invalid wait context" in rv_react()
From: Thomas Weißschuh @ 2026-06-23  9:38 UTC (permalink / raw)
  To: wen.yang; +Cc: Gabriele Monaco, Nam Cao, linux-trace-kernel, linux-kernel
In-Reply-To: <bc01343ae74acf6bdf142434aeaa4e6b40aa72a9.1781541556.git.wen.yang@linux.dev>

On Tue, Jun 16, 2026 at 12:44:48AM +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> The DEFINE_WAIT_OVERRIDE_MAP() macro creates a lockdep map with
> wait_type_inner = LD_WAIT_CONFIG, which inherits the outer context's
> wait type.  When rv_react() is called from a LD_WAIT_FREE context
> (e.g., a KUnit test with busy-wait), and the reactor callback triggers
> a timer interrupt during the busy-loop, the interrupt exit path attempts
> to schedule (preempt_schedule_irq -> __schedule -> rq->__lock), which is
> LD_WAIT_SPIN.  Lockdep then reports:
> 
>     [ BUG: Invalid wait context ]
>     context-{5:5}
>     1 lock held by kunit_try_catch/209:
>      #0: rv_react_map-wait-type-override at rv_react+0x9d/0xf0

It would be nice to have the full trace here from your real-world example.

> The wait_type_override map allowed the outer LD_WAIT_FREE to propagate
> inward, but scheduling from an interrupt is LD_WAIT_SPIN, violating the
> constraint.
> 
> Fix by explicitly setting wait_type_inner = LD_WAIT_SPIN, which is the
> tightest constraint rv_react() callbacks must satisfy: they may not
> sleep (LD_WAIT_SLEEP) or use mutexes, but can use spinlocks and be
> interrupted. This matches the documented LD_WAIT_FREE constraint.

So this is not a pure fix but a change in behavior. This should be
reflected in the subject.

> Fixes: 69d8895cb9a9 ("rv: Add explicit lockdep context for reactors")
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
> ---
>  kernel/trace/rv/rv_reactors.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/rv/rv_reactors.c b/kernel/trace/rv/rv_reactors.c
> index 460af07f7aba..423f843bbd68 100644
> --- a/kernel/trace/rv/rv_reactors.c
> +++ b/kernel/trace/rv/rv_reactors.c
> @@ -465,7 +465,13 @@ int init_rv_reactors(struct dentry *root_dir)
>  
>  void rv_react(struct rv_monitor *monitor, const char *msg, ...)
>  {
> -	static DEFINE_WAIT_OVERRIDE_MAP(rv_react_map, LD_WAIT_FREE);
> +#ifdef CONFIG_LOCKDEP
> +	static struct lockdep_map rv_react_map = {
> +		.name = "rv_react",
> +		.wait_type_outer = LD_WAIT_FREE,
> +		.wait_type_inner = LD_WAIT_SPIN,
> +	};
> +#endif

This now allows reactors to take (raw) spinlocks. The original idea was
to not allow that as a reactor can be called from LD_WAIT_FREE context.
So I am not sure this is the right fix. Not that I have a better one
available right now.
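
To spell out the concern, here is a simplified user-space model of the
wait-type comparison. It is an assumption about how lockdep orders these
types, not its actual code; the enum and wait_context_ok() are hypothetical
names.

```c
#include <stdbool.h>

/* Simplified ordering of lockdep wait types, most restrictive first. */
enum wait_type {
	WAIT_FREE = 1,	/* wait-free context, no locks at all */
	WAIT_SPIN,	/* raw spinlocks */
	WAIT_CONFIG,	/* spinlock_t, preemptible on PREEMPT_RT */
	WAIT_SLEEP,	/* mutexes and other sleeping locks */
};

/*
 * Assumed simplification of the wait-context check: a lock may be taken
 * only if its wait type does not exceed what the current context permits.
 */
static bool wait_context_ok(enum wait_type ctx_inner, enum wait_type lock_outer)
{
	return lock_outer <= ctx_inner;
}
```

With the override at WAIT_FREE, wait_context_ok(WAIT_FREE, WAIT_SPIN) is
false, so a raw spinlock taken by a reactor is reported. With the override
raised to WAIT_SPIN the same acquisition passes silently, and only sleeping
locks are still caught.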

>  	va_list args;
>  
>  	if (!rv_reacting_on() || !monitor->react)
> -- 
> 2.25.1
> 


* Re: [PATCH v8 21/46] KVM: guest_memfd: Zero page while getting pfn
From: Yan Zhao @ 2026-06-23  8:56 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com>

On Thu, Jun 18, 2026 at 05:31:58PM -0700, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Move the folio initialization logic from kvm_gmem_get_pfn() into
> __kvm_gmem_get_pfn() to also zero pages if the page is to be used in
> kvm_gmem_populate().
> 
> With in-place conversion, the existing data in a guest_memfd page can be
> populated into guest memory through platform-specific ioctls.
> 
> Without first zeroing the page obtained using __kvm_gmem_get_pfn(), it
> might contain uninitialized host memory, which would leak to the guest if
> the populate completes.
> 
> guest_memfd pages are zeroed at most once in the page's entire lifetime
> with guest_memfd, and that is tracked using the uptodate flag.
> 
> Zeroing the page in __kvm_gmem_get_pfn() is chosen over zeroing in
> kvm_gmem_get_folio() since other flows, such as a future write() syscall,
> can get a page, write to the page and then set page uptodate without
> zeroing.
> 
> This aligns with the concept of zeroing before first use - the other place
> where zeroing happens is in kvm_gmem_fault_user_mapping().
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  virt/kvm/guest_memfd.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 90bc1a26512b6..86c9f5b0863cb 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1137,6 +1137,11 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file,
>  		return ERR_PTR(-EHWPOISON);
>  	}
>  
> +	if (!folio_test_uptodate(folio)) {
> +		clear_highpage(folio_page(folio, 0));
> +		folio_mark_uptodate(folio);
> +	}
Note:
In the __kvm_gmem_populate() path, this folio_mark_uptodate() call makes the
later one after post_populate() pointless.

__kvm_gmem_populate
    |1.__kvm_gmem_get_pfn
    |     |->folio = kvm_gmem_get_folio()
    |     |  if (!folio_test_uptodate(folio))
    |     |     folio_mark_uptodate(folio);
    |2. ret = post_populate()
    |3. if (!ret)
    |       folio_mark_uptodate(folio);

>  	*pfn = folio_file_pfn(folio, index);
>  	if (max_order)
>  		*max_order = 0;
> @@ -1166,11 +1171,6 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		goto out;
>  	}
>  
> -	if (!folio_test_uptodate(folio)) {
> -		clear_highpage(folio_page(folio, 0));
> -		folio_mark_uptodate(folio);
> -	}
> -
>  	if (kvm_gmem_is_private_mem(inode, index))
>  		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>  
>



* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Yan Zhao @ 2026-06-23  8:41 UTC (permalink / raw)
  To: Sean Christopherson, ackerleytng, aik, andrew.jones, binbin.wu,
	brauner, chao.p.peng, david, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <ajoWngKaZ+wfIyR+@yzhao56-desk.sh.intel.com>

On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
> On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
> > On Mon, Jun 22, 2026, Yan Zhao wrote:
> > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
> > > > From: Ackerley Tng <ackerleytng@google.com>
> > > > 
> > > > Update tdx_gmem_post_populate() to handle cases where a source page is
> > > > not explicitly provided. Instead of returning -EOPNOTSUPP when src_page
> > > > is NULL, default to using the page associated with the destination PFN.
> > > > 
> > > > This change allows for in-place memory conversion where the data is
> > > > already present in the target PFN, ensuring the TDX module has a valid
> > > > source page reference for the TDH.MEM.PAGE.ADD operation.
> > > > 
> > > > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > ---
> > > >  Documentation/virt/kvm/x86/intel-tdx.rst |  4 ++++
> > > >  arch/x86/kvm/vmx/tdx.c                   | 11 ++++++++---
> > > >  2 files changed, 12 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
> > > > index 6a222e9d09541..74357fe87f9ec 100644
> > > > --- a/Documentation/virt/kvm/x86/intel-tdx.rst
> > > > +++ b/Documentation/virt/kvm/x86/intel-tdx.rst
> > > > @@ -158,6 +158,10 @@ KVM_TDX_INIT_MEM_REGION
> > > >  Initialize @nr_pages TDX guest private memory starting from @gpa with userspace
> > > >  provided data from @source_addr. @source_addr must be PAGE_SIZE-aligned.
> > > >  
> > > > +If guest_memfd in-place conversion is enabled, pass NULL for @source_addr to
> > > > +initialize the memory region using memory contents already populated in
> > > > +guest_memfd memory.
> > > > +
> > > >  Note, before calling this sub command, memory attribute of the range
> > > >  [gpa, gpa + nr_pages] needs to be private.  Userspace can use
> > > >  KVM_SET_MEMORY_ATTRIBUTES to set the attribute.
> > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > > index ffe9d0db58c59..56d10333c61a7 100644
> > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > > >  	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> > > >  		return -EIO;
> > > >  
> > > > -	if (!src_page)
> > > > -		return -EOPNOTSUPP;
> > > > +	if (!src_page) {
> > > > +		if (!gmem_in_place_conversion)
> > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
> > > without the MMAP flag, the absence of src_page should still be treated as an
> > > error.
> > 
> > Why MMAP?
> Hmm, I was showing a scenario in which in-place conversion couldn't occur.
> I didn't mean that with the MMAP flag, mmap() and user write must occur.
> 
> > Shouldn't this be a general "if (!src_page && !up-to-date)"?  Just
> > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
> > and written memory.  And when write() lands, MMAP wouldn't be necessary to
> > initialize the memory.
> Do you mean using up-to-date flag as below?
> 
> if (!src_page) {
> 	src_page = pfn_to_page(pfn);
> 	if (!folio_test_uptodate(page_folio(src_page)))
> 		return -EOPNOTSUPP;
> }

Another concern with this fix: commit "KVM: guest_memfd: Zero page while
getting pfn" [1] always marks the folio uptodate before reaching
post_populate(), so the up-to-date check above would never fail.

[1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
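
A small user-space model of that ordering, to make the problem concrete. The
fake_* names are hypothetical; fake_get_pfn() mirrors the zero-then-mark
logic from [1] and fake_post_populate() mirrors the proposed check.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio; only the uptodate flag is modelled. */
struct fake_folio {
	bool uptodate;
};

/* Models __kvm_gmem_get_pfn() after [1]: zero on first use, mark uptodate. */
static void fake_get_pfn(struct fake_folio *folio)
{
	if (!folio->uptodate) {
		/* clear_highpage() would zero the page here */
		folio->uptodate = true;
	}
}

/* Models the proposed check in post_populate() when src_page is NULL. */
static int fake_post_populate(const struct fake_folio *folio)
{
	if (!folio->uptodate)
		return -EOPNOTSUPP;

	return 0;
}

/* Models the populate path: get the pfn first, then run post_populate(). */
static int fake_populate(struct fake_folio *folio)
{
	fake_get_pfn(folio);
	return fake_post_populate(folio);
}
```

Even for a folio that userspace never wrote, fake_populate() returns 0 in
this model, i.e. the -EOPNOTSUPP branch cannot be reached.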

> One concern is that TDX currently does not care much about the up-to-date
> flag, since TDX doesn't rely on the flag to clear pages on conversions.
> I'm not sure the flag can be reliably checked in this case, e.g.,
> the whole folio is now marked up-to-date even if only part of it is faulted by
> user access.
> Ensuring that the up-to-date flag works correctly with huge page support seems
> to take more effort than introducing a dedicated flag for TDX.
> 
> > > Additionally, to properly enable in-place copying for the TDX initial memory
> > > region, userspace must not only specify source_addr to NULL, but also follow
> > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
> > > 1. create guest_memfd with MMAP flag
> > > 2. mmap the guest_memfd.
> > > 3. convert the initial memory range to shared.
> > > 4. copy initial content to the source page.
> > > 5. convert the initial memory range to private
> > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
> > > 7. do not unmap the source backend.
> > > 
> > > So, would it be reasonable to introduce a dedicated flag that allows userspace
> > > to explicitly opt into the in-place copy functionality? e.g.,
> > 
> > Why?  It's userspace's responsibility to get the above right.  If userspace fails
> > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
> I mean if userspace specifies a NULL source_addr by mistake, it's better for
> kernel to detect this mistake, similar to how it validates whether source_addr
> is PAGE_ALIGNED.
> Since userspace already needs to perform additional steps to enable in-place
> copy, specifying a dedicated flag to indicate that the NULL source_addr is
> intentional seems like a reasonable burden.


* Re: [PATCH v8 17/46] KVM: guest_memfd: Advertise KVM_SET_MEMORY_ATTRIBUTES2 ioctl
From: Binbin Wu @ 2026-06-23  9:14 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-17-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Introduce KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES to advertise the
> availability of the KVM_SET_MEMORY_ATTRIBUTES2 ioctl.
> 
> KVM_SET_MEMORY_ATTRIBUTES2 is a guest_memfd-scoped version of the existing
> KVM_SET_MEMORY_ATTRIBUTES VM ioctl. It allows userspace to manage memory
> attributes, such as KVM_MEMORY_ATTRIBUTE_PRIVATE, directly on a guest_memfd
> file descriptor.
> 
> This new version uses struct kvm_memory_attributes2, which adds an
> error_offset field to the output. This allows KVM to return the specific
> offset that triggered an error, which is especially useful for handling
> EAGAIN results caused by transient page reference counts during attribute
> conversions.
> 
> Update the KVM API documentation to define the new ioctl and its behavior,
> and add the necessary UAPI definitions and capability checks.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Suggested-by: Michael Roth <michael.roth@amd.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

Two nits below.


>  
> +4.145 KVM_SET_MEMORY_ATTRIBUTES2
> +---------------------------------
> +
> +:Capability: KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES
> +:Architectures: all
> +:Type: guest_memfd ioctl
> +:Parameters: struct kvm_memory_attributes2 (in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Errors:
> +
> +  ========== ===============================================================
> +  EINVAL     The specified `offset` or `size` were invalid (e.g. not
                                                   ^
                                                 was
> +             page aligned, causes an overflow, or size is zero).
> +  EFAULT     The parameter address was invalid.
> +  EAGAIN     Some page within requested range had unexpected refcounts. The
> +             offset of the page will be returned in `error_offset`.
> +  ENOMEM     Ran out of memory trying to track private/shared state
> +  ========== ===============================================================

[...]

> +
> +Set attributes for a range of offsets within a guest_memfd to
> +KVM_MEMORY_ATTRIBUTE_PRIVATE to limit the specified guest_memfd backed
> +memory range for guest_use. Even if KVM_CAP_GUEST_MEMFD_MMAP is
                         ^
                    guest use

> +supported, after a successful call to set
> +KVM_MEMORY_ATTRIBUTE_PRIVATE, the requested range will not be mappable
> +into host userspace and will only be mappable by the guest.
> +



* Re: [PATCH v8 15/46] KVM: guest_memfd: Call arch invalidate hooks on conversion
From: Fuad Tabba @ 2026-06-23  8:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	willy, wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajneQVLriUshjFIO@google.com>

Hi Sean,

On Tue, 23 Jun 2026 at 02:15, Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jun 19, 2026, Fuad Tabba wrote:
> > On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
> > <devnull+ackerleytng.google.com@kernel.org> wrote:
> > >
> > > From: Ackerley Tng <ackerleytng@google.com>
> > >
> > > When memory in guest_memfd is converted from private to shared, the
> > > platform-specific state associated with the guest-private pages must be
> > > invalidated or cleaned up.
> > >
> > > Iterate over the folios in the affected range and call the
> > > kvm_arch_gmem_invalidate() hook for each PFN range. This allows
> > > architectures to perform necessary teardown, such as updating hardware
> > > metadata or encryption states, before the pages are transitioned to the
> > > shared state.
> > >
> > > Invoke this helper after indicating to KVM's mmu code that an invalidation
> > > is in progress to stop in-flight page faults from succeeding.
> > >
> > > Reviewed-by: Fuad Tabba <tabba@google.com>
> > > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> >
> > Coming back to this after working through the arm64/pKVM side. My
> > Reviewed-by here is from the previous round and the patch hasn't
> > changed, but I missed an implication for arm64.
> >
> > kvm_arch_gmem_invalidate() is now called from two paths with the same
> > (start, end) signature: folio teardown (kvm_gmem_free_folio) and
> > private->shared conversion (here). For SNP/TDX that's fine, conversion is
> > destructive anyway. For pKVM the two need opposite content semantics:
> > conversion must preserve the page in place (same physical page, the point
> > of in-place conversion without encryption), while teardown must scrub it
> > before returning it to the host.
> >
> > The hook gets only a pfn range with no indication of which caller it's
> > serving, so arm64 can't give the two paths the behaviour they need. It
> > would help to signal intent on the conversion path: a reason/flag, a
> > separate hook, or not routing non-destructive conversion through the
> > teardown hook.
> >
> > arm64 isn't here yet, so this isn't urgent, but the hook is gaining a
> > second caller now, and it's cheaper to leave room for the distinction
> > than to change a generic contract other arches depend on later.
>
> Crud.  It may not be urgent for arm64, but it's urgent for other reasons that
> I "can't" describe in detail at the moment, and even if that weren't the case, I
> think we should clean things up now.  More below.

No problem on the parts you can't get into. Agreed it's worth cleaning up
now, and worth doing in this round rather than landing the overloaded
hook: reworking a generic contract once SNP/TDX (and eventually arm64)
depend on it is the expensive path.

>
> > >  virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 41 insertions(+)
> > >
> > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > index 433f79047b9d1..3c94442bc8131 100644
> > > --- a/virt/kvm/guest_memfd.c
> > > +++ b/virt/kvm/guest_memfd.c
> > > @@ -607,6 +607,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> > >         return safe;
> > >  }
> > >
> > > +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> > > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
>
> Not your fault, but kvm_arch_gmem_invalidate() is badly misnamed.  It's not
> "invalidating" anything, it's much more of a "free" callback, as SNP uses it to
> put physical pages back into a shared state when a maybe-private folio is freed.
>
> As Fuad points out, (ab)using that hook for the private=>shared conversion case
> "works", but not broadly.  And it makes the bad name worse, because it's called
> from code that _is_ doing true invalidations.  For pKVM, it may not even need to
> do anything invalidation-like.

Agreed on the name and the overload, and for pKVM the split is more than
cosmetic. The free/teardown path is where pKVM has to scrub a page before
it goes back to the host; conversion has to leave the page in place with
its contents intact (no encryption, same physical page in both states).
Keeping scrub on the free callback and off the conversion path is what
preserves that, so this helps us, it isn't just tidying SNP.

>
> To avoid a conflict with patches that are going to have priority over this series,
> to set the stage for arm64 support, and to avoid bleeding vendor details
> into guest_memfd, as if they are core guest_memfd behavior (only SNP needs the
> "invalidation" on this specific transition), I think we should add an arch hook
> to do conversions straightaway.
>
> Unless there's a clever option I'm missing, it'll mean adding yet another
> HAVE_KVM_ARCH_GMEM_XXX flag?  Hmm, especially because IIUC, arm64/pKVM doesn't
> need a callback for this case, only the free_folio case.
>
> > > +{
> > > +       struct folio_batch fbatch;
> > > +       pgoff_t next = start;
> > > +       int i;
> > > +
> > > +       folio_batch_init(&fbatch);
> > > +       while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
> > > +               for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> > > +                       struct folio *folio = fbatch.folios[i];
> > > +                       pgoff_t start_index, end_index;
> > > +                       kvm_pfn_t start_pfn, end_pfn;
> > > +
> > > +                       start_index = max(start, folio->index);
> > > +                       end_index = min(end, folio_next_index(folio));
> > > +                       /*
> > > +                        * end_index is either in folio or points to
> > > +                        * the first page of the next folio. Hence,
> > > +                        * all pages in range [start_index, end_index)
> > > +                        * are contiguous.
> > > +                        */
> > > +                       start_pfn = folio_file_pfn(folio, start_index);
> > > +                       end_pfn = start_pfn + end_index - start_index;
> > > +
> > > +                       kvm_arch_gmem_invalidate(start_pfn, end_pfn);
> > > +               }
> > > +
> > > +               folio_batch_release(&fbatch);
> > > +               cond_resched();
> > > +       }
> > > +}
> > > +#else
> > > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> > > +#endif
> > > +
> > >  static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> > >                                      size_t nr_pages, uint64_t attrs,
> > >                                      pgoff_t *err_index)
> > > @@ -647,7 +683,12 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> > >          */
> > >
> > >         kvm_gmem_invalidate_start(inode, start, end);
> > > +
> > > +       if (!to_private)
> > > +               kvm_gmem_invalidate(inode, start, end);
>
> E.g. instead make this something like this?
>
>         kvm_gmem_set_pfn_attributes(...)
>
> Hrm, though that wastes folio lookups in the to_private case.  So maybe just this,
> assuming pKVM doesn't need to take additional action on conversions?

You're right, and we expect it to hold for both directions, not only
private->shared. pKVM conversions are driven by the guest's
share/unshare hypercall: EL2 makes the stage-2 ownership change (grant
or remove host access) on the hypercall and exits, and the host
records it via KVM_SET_MEMORY_ATTRIBUTES2 afterwards. So by the time
guest_memfd updates attributes the EL2 side is already done in either
direction, and the ioctl is host-side bookkeeping. The only arch
callback we expect to need is the free/teardown one, nothing on
convert, and we wouldn't want a make_private hook either.

>
>         if (!to_private)
>                 kvm_gmem_make_shared(...)
>
> Actually, if we do that, then we don't need a separate arch hook, just a separate
> config.  It'll still bleed SNP details into guest_memfd, but it'll at least be
> done in a way that's more explicitly arch specific (and it's no different than
> what we already do for PREPARE...).

Doing it config-only (no separate convert hook) works for us, and nothing
about it constrains arm64. If connecting pKVM conversion to gmem later
turns up something we need, we'd add it config-gated in parallel, not by
overloading the renamed callback.
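
For illustration, a user-space sketch of the config-only shape as I
understand it. The FAKE_* macro and fake_* functions are hypothetical
stand-ins, and the build models an arch that does not select the option.

```c
#include <stdbool.h>

/*
 * Hypothetical switch. An arch that needs the free-on-shared step would
 * define it; it is left undefined here to model an arch that does not.
 */
/* #define FAKE_GMEM_FREE_ON_SHARED_CONVERSION */

/* Counts how often the arch-specific step ran. */
static int arch_free_calls;

#ifdef FAKE_GMEM_FREE_ON_SHARED_CONVERSION
static void fake_make_shared(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	arch_free_calls++;	/* the arch free callback would run here */
}
#else
/* Empty stub: conversion to shared needs no arch action. */
static void fake_make_shared(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
}
#endif

/* Models the attribute update: only private->shared reaches the stub. */
static void fake_set_attributes(unsigned long start, unsigned long end,
				bool to_private)
{
	if (!to_private)
		fake_make_shared(start, end);
}
```

With the switch undefined, conversions in either direction leave
arch_free_calls at 0, which is the behaviour we want on the conversion path.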

Cheers,
/fuad

>
> E.g. this?  There will still be a looming rename conflict, but that's easy enough
> to handle.
>
> diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> index 9ce5be7843f2..8aead0abd788 100644
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -648,8 +648,8 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>         return safe;
>  }
>
> -#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> +#ifdef CONFIG_KVM_ARCH_GMEM_FREE_ON_SHARED_CONVERSION
> +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t end)
>  {
>         struct folio_batch fbatch;
>         pgoff_t next = start;
> @@ -681,7 +681,7 @@ static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
>         }
>  }
>  #else
> -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t end) { }
>  #endif
>
>  static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> @@ -729,7 +729,7 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>         kvm_gmem_invalidate_start(inode, start, end);
>
>         if (!to_private)
> -               kvm_gmem_invalidate(inode, start, end);
> +               kvm_gmem_make_shared(inode, start, end);
>
>         mas_store_prealloc(&mas, xa_mk_value(attrs));
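
The config-only approach discussed above can be sketched in user space as follows. This is a minimal model, not the guest_memfd code: DEMO_FREE_ON_SHARED_CONVERSION, struct demo_range, demo_make_shared() and demo_set_attributes() are hypothetical names standing in for the Kconfig symbol and helpers in the diff. It only shows the shape: one config-gated helper with an empty stub, called on the private->shared direction only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the Kconfig symbol that gates the helper. */
#define DEMO_FREE_ON_SHARED_CONVERSION 1

struct demo_range {
	size_t start, end;        /* [start, end) in pages */
	unsigned int arch_freed;  /* pages handed to the arch free path */
};

#if DEMO_FREE_ON_SHARED_CONVERSION
/* Real variant: account every page in the range as released by the arch. */
static void demo_make_shared(struct demo_range *r)
{
	assert(r->start <= r->end);
	r->arch_freed += (unsigned int)(r->end - r->start);
}
#else
/* Stub variant: architectures without the config pay nothing. */
static void demo_make_shared(struct demo_range *r) { (void)r; }
#endif

/* Only the private->shared direction triggers the gated helper. */
static void demo_set_attributes(struct demo_range *r, bool to_private)
{
	if (!to_private)
		demo_make_shared(r);
	/* attribute bookkeeping would follow here */
}
```

With the define set to 0 the caller is unchanged and the helper compiles to nothing, which is the property that makes a separate arch hook unnecessary in this model.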

^ permalink raw reply

* Re: [PATCH v2] f2fs: don't drop the top folio order in the f2fs_iostat tracepoint
From: Chao Yu @ 2026-06-23  8:50 UTC (permalink / raw)
  To: Zhan Xusheng, Jaegeuk Kim
  Cc: chao, Daniel Lee, Steven Rostedt, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, stable, Zhan Xusheng
In-Reply-To: <20260623072641.3547410-1-zhanxusheng@xiaomi.com>

On 6/23/26 15:26, Zhan Xusheng wrote:
> The f2fs_iostat tracepoint stores the per-order read folio counts in a
> fixed-size array and prints a fixed number of buckets, both hardcoded to
> 11. The sysfs iostat accounting array is instead sized by NR_PAGE_ORDERS
> (= MAX_PAGE_ORDER + 1), which is not always 11:
> 
> 	arm64 16K pages -> MAX_PAGE_ORDER 11 -> NR_PAGE_ORDERS 12
> 	arm64 64K pages -> MAX_PAGE_ORDER 13 -> NR_PAGE_ORDERS 14
> 
> f2fs enables large folios for immutable, non-compressed files, and the
> read folio order is bounded by MAX_PAGECACHE_ORDER, i.e.
> min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER). With THP enabled this
> reaches order 11 on 16K/64K base-page kernels (MAX_XAS_ORDER caps it at
> 11). So an order-11 read folio is possible there and is accounted into
> index 11 of the array.
> 
> On those configurations the sysfs file reports the order-11 count
> correctly, but the tracepoint silently drops it: the memcpy is capped at
> min(NR_PAGE_ORDERS, 11), so index 11 is never copied and the trace
> disagrees with sysfs. There is no memory-safety issue, only the order-11
> bucket missing from the trace; 4K-page kernels (NR_PAGE_ORDERS == 11,
> max order <= 9) are unaffected.
> 
> Size the array and the printed buckets by a ceiling that covers the
> largest possible NR_PAGE_ORDERS (14) with headroom, and add a
> BUILD_BUG_ON() so any future growth of NR_PAGE_ORDERS fails the build
> loudly instead of silently truncating again. The human-readable
> "order=count" output is preserved.
> 
> Fixes: cb8ff3ead9a3 ("f2fs: add page-order information for large folio reads in iostat")
> Cc: stable@vger.kernel.org
> Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>

Reviewed-by: Chao Yu <chao@kernel.org>

Thanks,
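
For readers following along, the truncation described in the commit message can be reproduced with a small user-space sketch. All names here (NR_PAGE_ORDERS_DEMO, TRACE_ORDERS_OLD, TRACE_ORDERS_MAX, snapshot_old(), snapshot_new(), total()) are hypothetical and are not the identifiers used in the patch. The value 12 models the arm64 16K-page case, and 16 is an assumed ceiling, not necessarily the one the patch picks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; 12 models arm64 with 16K pages (MAX_PAGE_ORDER 11). */
#define NR_PAGE_ORDERS_DEMO 12
#define TRACE_ORDERS_OLD    11	/* the hardcoded bucket count */
#define TRACE_ORDERS_MAX    16	/* assumed ceiling with headroom */

#define DEMO_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Build-time guard: compilation fails if the source outgrows the ceiling. */
_Static_assert(NR_PAGE_ORDERS_DEMO <= TRACE_ORDERS_MAX,
	       "trace bucket array too small for NR_PAGE_ORDERS");

/* Old behaviour: the copy is capped at 11 entries, so index 11 is dropped. */
static void snapshot_old(unsigned long long *dst, const unsigned long long *src)
{
	memset(dst, 0, TRACE_ORDERS_OLD * sizeof(*dst));
	memcpy(dst, src,
	       DEMO_MIN(NR_PAGE_ORDERS_DEMO, TRACE_ORDERS_OLD) * sizeof(*src));
}

/* New behaviour: the destination covers every possible order. */
static void snapshot_new(unsigned long long *dst, const unsigned long long *src)
{
	memset(dst, 0, TRACE_ORDERS_MAX * sizeof(*dst));
	memcpy(dst, src, NR_PAGE_ORDERS_DEMO * sizeof(*src));
}

/* Sum of all buckets, to compare a snapshot against the source counters. */
static unsigned long long total(const unsigned long long *a, size_t n)
{
	unsigned long long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += a[i];
	return sum;
}
```

If the source holds a count in index 11, total() over the old snapshot comes up short while the new snapshot matches the source, which is the sysfs/trace disagreement the patch describes.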

^ permalink raw reply

* Re: [PATCH 1/2] x86/uprobes: Keep shadow stack in sync for emulated CALLs
From: Peter Zijlstra @ 2026-06-23  8:43 UTC (permalink / raw)
  To: David Windsor
  Cc: mhiramat, oleg, tglx, mingo, bp, dave.hansen, x86, shuah,
	linux-trace-kernel, linux-kselftest, linux-kernel
In-Reply-To: <20260622183109.1137245-1-dwindsor@gmail.com>

On Mon, Jun 22, 2026 at 02:31:08PM -0400, David Windsor wrote:
> Uprobe CALL emulation updates the normal user stack, but not the CET user
> shadow stack. The subsequent RET then sees a stale shadow stack entry and
> raises #CP.
> 
> Update the relative CALL emulation and XOL CALL fixup paths to keep the
> shadow stack in sync.
> 
> Fixes: 488af8ea7131 ("x86/shstk: Wire in shadow stack interface")

I can confirm this patch fixes the included test case, so yay for that.

However, should this not be:

Fixes: 1713b63a07a2 ("x86/shstk: Make return uprobe work with shadow stack")

?

> Signed-off-by: David Windsor <dwindsor@gmail.com>
> ---
>  arch/x86/kernel/uprobes.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index ebb1baf1eb1d..ae32013a7097 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -1246,8 +1246,12 @@ static int default_post_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs
>  		long correction = utask->vaddr - utask->xol_vaddr;
>  		regs->ip += correction;
>  	} else if (auprobe->defparam.fixups & UPROBE_FIX_CALL) {
> +		unsigned long retaddr = utask->vaddr + auprobe->defparam.ilen;
> +
>  		regs->sp += sizeof_long(regs); /* Pop incorrect return address */
> -		if (emulate_push_stack(regs, utask->vaddr + auprobe->defparam.ilen))
> +		if (emulate_push_stack(regs, retaddr))
> +			return -ERESTART;
> +		if (shstk_update_last_frame(retaddr))
>  			return -ERESTART;
>  	}
>  	/* popf; tell the caller to not touch TF */
> @@ -1338,6 +1342,10 @@ static bool branch_emulate_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
>  		 */
>  		if (emulate_push_stack(regs, new_ip))
>  			return false;
> +		if (shstk_push(new_ip) == -EFAULT) {
> +			regs->sp += sizeof_long(regs);
> +			return false;
> +		}
>  	} else if (!check_jmp_cond(auprobe, regs)) {
>  		offs = 0;
>  	}
> -- 
> 2.43.0
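
The failure mode and the fix can be illustrated with a toy user-space model. This is only a sketch under stated assumptions: struct demo_task and the emulate_*() helpers below are hypothetical, the two stacks are plain arrays, and a failed return-address comparison stands in for the #CP fault. None of it is the kernel's shadow stack API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STACK_SLOTS 8

/* Toy task: a normal stack, a shadow stack, and a cap to force failures. */
struct demo_task {
	unsigned long stack[STACK_SLOTS];
	size_t sp;		/* used slots on the normal stack */
	unsigned long shstk[STACK_SLOTS];
	size_t ssp;		/* used slots on the shadow stack */
	size_t shstk_limit;	/* shadow pushes fail at this depth */
};

static bool demo_push(unsigned long *buf, size_t *top, size_t limit,
		      unsigned long val)
{
	if (*top >= limit)
		return false;
	buf[(*top)++] = val;
	return true;
}

/* Buggy emulation: only the normal stack sees the return address. */
static bool emulate_call_unsynced(struct demo_task *t, unsigned long retaddr)
{
	return demo_push(t->stack, &t->sp, STACK_SLOTS, retaddr);
}

/* Fixed emulation: push on both stacks, undo the first if the second fails. */
static bool emulate_call(struct demo_task *t, unsigned long retaddr)
{
	assert(t->shstk_limit <= STACK_SLOTS);
	if (!demo_push(t->stack, &t->sp, STACK_SLOTS, retaddr))
		return false;
	if (!demo_push(t->shstk, &t->ssp, t->shstk_limit, retaddr)) {
		t->sp--;	/* roll back the normal-stack push */
		return false;
	}
	return true;
}

/* RET pops both stacks; a mismatch models the control-protection fault. */
static bool emulate_ret(struct demo_task *t, unsigned long *ip)
{
	unsigned long a, b;

	if (t->sp == 0 || t->ssp == 0)
		return false;
	a = t->stack[--t->sp];
	b = t->shstk[--t->ssp];
	if (a != b)
		return false;
	*ip = a;
	return true;
}
```

In this model a RET after emulate_call_unsynced() compares the new return address against a stale shadow entry and fails, while a RET after emulate_call() succeeds.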

^ permalink raw reply

* Re: [PATCH v2 2/2] tracing: Remove trace_printk.h from kernel.h
From: Steven Rostedt @ 2026-06-23  8:29 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
	Sebastian Andrzej Siewior, John Ogness, Thomas Gleixner,
	Peter Zijlstra, Julia Lawall
In-Reply-To: <ajlcOU1o5Omy4q57@yury>

On Mon, 22 Jun 2026 12:01:05 -0400
Yury Norov <yury.norov@gmail.com> wrote:

> On Mon, Jun 22, 2026 at 09:07:41AM -0400, Steven Rostedt wrote:
> > From: Steven Rostedt <rostedt@goodmis.org>
> > 
> > There have been complaints about trace_printk.h causing more build time
> > for being in kernel.h. Move it out of kernel.h and place it in the headers
> > and C files that use it.
> > 
> > Link: https://lore.kernel.org/all/CAHk-=wikCBeVFjVXiY4o-oepdbjAoir5+TcAgtL12c4u1TpZLQ@mail.gmail.com/  
> 
> Link is nice, but can you explain in the commit message what those
> complaints exactly are? There are enough opinions shared to make a nice
> summary. I even think it's important enough to become a Documentation
> rule.

What rule is that?



> > @@ -35,6 +35,7 @@
> >  #define I915_GFP_ALLOW_FAIL (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)
> >  
> >  #if IS_ENABLED(CONFIG_DRM_I915_TRACE_GTT)
> > +#include <linux/trace_printk.h>  
> 
> So, before it was included unconditionally, now it's included
> conditionally. It looks technically correct, but conceptually - I'm
> not sure.
> 
> I'm not a developer of this driver, but ... here we need trace_printk.h
> if TRACE_GTT is enabled, in the next header TRACE_GEM needs it. To me
> it sounds like the whole driver simply needs trace_printk.h.

I only added it where trace_printk() is actually used. Why should it be
included when trace_printk() is not used? There's precedent for putting
an include inside an #if block when only the code in that block needs
it.
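
As a rough user-space analogy of that pattern (a sketch, not the i915 code): DEMO_TRACE_GTT, DEMO_TRACE() and demo_bind() below are hypothetical, with <stdio.h> playing the role of <linux/trace_printk.h> and snprintf() playing the role of trace_printk().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for IS_ENABLED(CONFIG_DRM_I915_TRACE_GTT). */
#define DEMO_TRACE_GTT 1

#if DEMO_TRACE_GTT
/* Only the tracing variant needs this header, so it lives inside the #if. */
#include <stdio.h>
#define DEMO_TRACE(buf, len, ...) snprintf((buf), (len), __VA_ARGS__)
#else
/* Disabled variant: no header pulled in, the macro evaluates to 0. */
#define DEMO_TRACE(buf, len, ...) ((void)(buf), (void)(len), 0)
#endif

/* Caller is identical in both configurations. */
static int demo_bind(char *buf, size_t len, int node)
{
	assert(buf != NULL);
	return DEMO_TRACE(buf, len, "bind node %d", node);
}
```

Builds with DEMO_TRACE_GTT set to 0 never see the extra header at all, which is the build-time saving the conditional include is after.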

> 
> >  #define GTT_TRACE(...) trace_printk(__VA_ARGS__)
> >  #else
> >  #define GTT_TRACE(...)
> > diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
> > index 1da8fb61c09e..f490052e8964 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.h
> > +++ b/drivers/gpu/drm/i915/i915_gem.h
> > @@ -117,6 +117,7 @@ int i915_gem_open(struct drm_i915_private *i915, struct drm_file *file);
> >  
> >  #if IS_ENABLED(CONFIG_DRM_I915_TRACE_GEM)
> >  #include <linux/trace_controls.h>
> > +#include <linux/trace_printk.h>
> >  #define GEM_TRACE(...) trace_printk(__VA_ARGS__)
> >  #define GEM_TRACE_ERR(...) do {						\
> >  	pr_err(__VA_ARGS__);						\
> > diff --git a/drivers/hwtracing/stm/dummy_stm.c b/drivers/hwtracing/stm/dummy_stm.c
> > index 38528ffdc0b3..784f9af7ccba 100644
> > --- a/drivers/hwtracing/stm/dummy_stm.c
> > +++ b/drivers/hwtracing/stm/dummy_stm.c
> > @@ -14,6 +14,10 @@
> >  #include <linux/stm.h>
> >  #include <uapi/linux/stm.h>
> >  
> > +#ifdef DEBUG
> > +#include <linux/trace_printk.h>
> > +#endif
> > +  
> 
> Same here. The cost of adding the header in a particular C file is
> unmeasurable. But playing "#undef DEBUG #ifdef DEBUG" games looks
> weird.

On this one I agree with you. I didn't like the #ifdef, and looking at
it now, I don't think it's needed.

But for the first instance above, if the include can go inside the #if
block where trace_printk() is used, why not put it there?

-- Steve

^ permalink raw reply

* Re: [PATCH v2 1/2] tracing: Move non-trace_printk prototypes into trace_controls.h
From: Steven Rostedt @ 2026-06-23  8:22 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
	Sebastian Andrzej Siewior, John Ogness, Thomas Gleixner,
	Peter Zijlstra, Julia Lawall
In-Reply-To: <ajk7fN5v31kCfGVp@yury>

On Mon, 22 Jun 2026 09:41:16 -0400
Yury Norov <yury.norov@gmail.com> wrote:

> > +void trace_dump_stack(int skip);  
> 
> The function description says:
> 
>   record a stack back trace in the trace buffer
> 
> So, to me it sounds like it should go to the trace_printk.h.

The main reason I don't want these in trace_printk.h is because they
are not the same as trace_printk(). These are usually called when
things go wrong, and are usually called along with tracing_off(), to
stop the trace to make sure you don't lose the trace of the bug that
triggered the dump.

These can also be in production with no problem, as they are triggered
when things go wrong. trace_printk() should *not* be used in production.

The uses of these are fundamentally different than the use of
trace_printk(). They are not just for development environments.
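
A toy model of why these calls are paired with tracing_off(): struct demo_ring and the demo_*() helpers are hypothetical and only mimic an overwriting ring buffer. They are not the ftrace implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 4

/* Overwriting ring: once full, new events replace the oldest ones. */
struct demo_ring {
	int events[RING_SLOTS];
	size_t head;	/* total events recorded so far */
	bool enabled;
};

static void demo_trace(struct demo_ring *r, int ev)
{
	if (!r->enabled)
		return;
	r->events[r->head % RING_SLOTS] = ev;
	r->head++;
}

static void demo_tracing_off(struct demo_ring *r)
{
	r->enabled = false;
}

/* Oldest event that is still present in the ring. */
static int demo_oldest(const struct demo_ring *r)
{
	size_t start;

	assert(r->head > 0);
	start = r->head > RING_SLOTS ? r->head - RING_SLOTS : 0;
	return r->events[start % RING_SLOTS];
}

/* Bug path: record a marker, then freeze the ring so later activity
 * cannot overwrite the events that led up to the problem. */
static void demo_on_bug(struct demo_ring *r, int marker)
{
	demo_trace(r, marker);
	demo_tracing_off(r);
}
```

Events recorded after demo_on_bug() are ignored, so the history leading up to the marker stays in the ring.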

I'll update the change log to note this.

-- Steve

^ permalink raw reply

* Re: [PATCH v8 13/46] KVM: guest_memfd: Add base support for KVM_SET_MEMORY_ATTRIBUTES2
From: Fuad Tabba @ 2026-06-23  8:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	willy, wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajnRxuJ19OzZ8zJC@google.com>

On Tue, 23 Jun 2026 at 01:22, Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jun 19, 2026, Fuad Tabba wrote:
> > On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
> > <devnull+ackerleytng.google.com@kernel.org> wrote:
> > >
> > > From: Ackerley Tng <ackerleytng@google.com>
> > >
> > > Introduce base support for KVM_SET_MEMORY_ATTRIBUTES2 in guest_memfd, which
> > > just updates attributes tracked by guest_memfd.
> > >
> > > Validate input fields in general. Guard usage of KVM_SET_MEMORY_ATTRIBUTES2
> > > by making sure requested attributes are supported for this instance of kvm.
> > >
> > > A new KVM_SET_MEMORY_ATTRIBUTES2 is defined to support writes (unlike
> > > KVM_SET_MEMORY_ATTRIBUTES) in addition to reads so it can provide error
> > > details to userspace. This will be used in a later patch.
> > >
> > > The two ioctls use their corresponding structs with no overlap, but
> > > backward compatibility is baked in for future support of
> > > KVM_SET_MEMORY_ATTRIBUTES2 and struct kvm_memory_attributes2 in the VM
> > > ioctl.
> > >
> > > The process of setting memory attributes is set up such that the later half
> > > will not fail due to allocation. Any necessary checks are performed before
> > > the point of no return.
> > >
> > > Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> > > Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> > > Co-developed-by: Sean Christopherson <seanjc@google.com>
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > Reviewed-by: Fuad Tabba <tabba@google.com>
> > > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> >
> > Not sure if it's user error on my part, if I'm applying this to the
> > wrong base, but I found a build break here on patch 13:
> > kvm_gmem_invalidate_start() doesn't exist in the base tree. The
> > function is kvm_gmem_invalidate_begin() here. The rename
> > (190cc5370a8b6) landed via a different merge path and isn't an
> > ancestor of the stated base.
> >
> > Patches 19 and 20 have the same mismatch. Fix for all three is
> > s/kvm_gmem_invalidate_start/kvm_gmem_invalidate_begin/.
>
> Ya, Ackerley used a slightly older kvm/next to send the patches.  I at least was
> testing against kvm-x86/next, which does have the rename.
>
> Other than noting that this should be applied against the current kvm/next, I
> don't think there's anything else to be done?

Agree. Sorry, didn't mean to be nit-picky, but this really threw me off :)

Cheers,
/fuad

^ permalink raw reply

* Re: [PATCH v2 1/2] tracing: Move non-trace_printk prototypes into trace_controls.h
From: Steven Rostedt @ 2026-06-23  8:09 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
	Sebastian Andrzej Siewior, John Ogness, Thomas Gleixner,
	Peter Zijlstra, Julia Lawall
In-Reply-To: <ajlciSfVixfYG_ln@yury>

On Mon, 22 Jun 2026 12:02:17 -0400
Yury Norov <yury.norov@gmail.com> wrote:


> > > Suggested-by: Yury Norov <yury.norov@gmail.com>  
> > 
> > Thanks, I'll add your tag.
> 
> Thanks, but can you also comment on trace_dump/ftrace_dump?

You mean just add to the change log about trace_dump and ftrace_dump
being used here too?

It's there mainly because some bug code in headers uses it.

-- Steve

^ permalink raw reply
