Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* [PATCH bpf] bpf: Reject sleepable kprobe_multi programs at attach time
From: Varun R Mallya @ 2026-04-01 13:49 UTC (permalink / raw)
  To: varunrmallya
  Cc: andrii, ast, bpf, daniel, eddyz87, jolsa, kpsingh, linux-kernel,
	linux-trace-kernel, martin.lau, mathieu.desnoyers, mattbobrowski,
	memxor, mhiramat, rostedt, song, yonghong.song
In-Reply-To: <ac0hLyMuPHiZfnVc@computer>

kprobe.multi programs run in atomic/RCU context and cannot sleep.
However, bpf_kprobe_multi_link_attach() did not validate whether the
program being attached had the sleepable flag set, allowing sleepable
helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
context.

This causes a "sleeping function called from invalid context" splat:

  BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
  preempt_count: 1, expected: 0
  RCU nest depth: 2, expected: 0

Fix this by rejecting sleepable programs early in
bpf_kprobe_multi_link_attach(), before any further processing.

Fixes: 0dcac272540613d41c05e89679e4ddb978b612f1 ("bpf: Add multi kprobe link")
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
---
 kernel/trace/bpf_trace.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0b040a417442..af7079aa0f36 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 	if (!is_kprobe_multi(prog))
 		return -EINVAL;
 
+	/* kprobe_multi is not allowed to be sleepable. */
+	if (prog->sleepable)
+		return -EINVAL;
+
 	/* Writing to context is not allowed for kprobes. */
 	if (prog->aux->kprobe_write_ctx)
 		return -EINVAL;
-- 
2.53.0


^ permalink raw reply related

* [PATCH] bootconfig: Skip printing early params to cmdline from bootconfig
From: Masami Hiramatsu (Google) @ 2026-04-01 14:02 UTC (permalink / raw)
  To: Masami Hiramatsu, Steven Rostedt
  Cc: Breno Leitao, linux-kernel, linux-trace-kernel

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

If user configures `kernel.key` in bootconfig, the 'key' is shown
in kernel cmdline (/proc/cmdline) and kernel boot parameter
handler associated with 'key' is invoked. However, since the
bootconfig does not support the parameter defined with early_param,
those keys are shown in '/proc/cmdline' but not handled by kernel.

This could easily mislead users who expected to be able to specify
early parameters via the boot configuration, leading them to wonder
why it doesn't work.

Let's skip printing out early params to cmdline buffer, and warn
if there is such parameters in bootconfig.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 init/main.c |   30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/init/main.c b/init/main.c
index 1cb395dd94e4..095c497ea2df 100644
--- a/init/main.c
+++ b/init/main.c
@@ -324,10 +324,21 @@ static void * __init get_boot_config_from_initrd(size_t *_size)
 
 static char xbc_namebuf[XBC_KEYLEN_MAX] __initdata;
 
+static bool __init is_early_param(const char *param)
+{
+	const struct obs_kernel_param *p;
+
+	for (p = __setup_start; p < __setup_end; p++) {
+		if (p->early && parameq(param, p->str))
+			return true;
+	}
+	return false;
+}
+
 #define rest(dst, end) ((end) > (dst) ? (end) - (dst) : 0)
 
 static int __init xbc_snprint_cmdline(char *buf, size_t size,
-				      struct xbc_node *root)
+				      struct xbc_node *root, bool is_kernel)
 {
 	struct xbc_node *knode, *vnode;
 	char *end = buf + size;
@@ -340,6 +351,13 @@ static int __init xbc_snprint_cmdline(char *buf, size_t size,
 		if (ret < 0)
 			return ret;
 
+		/* We will skip early params because it is not applied. */
+		if (is_kernel && is_early_param(xbc_namebuf)) {
+			pr_warn_once("early_param(e.g. %s.%s) is not passed to cmdline from bootconfig\n",
+				     xbc_node_get_data(root), xbc_namebuf);
+			continue;
+		}
+
 		vnode = xbc_node_get_child(knode);
 		if (!vnode) {
 			ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
@@ -368,7 +386,7 @@ static int __init xbc_snprint_cmdline(char *buf, size_t size,
 #undef rest
 
 /* Make an extra command line under given key word */
-static char * __init xbc_make_cmdline(const char *key)
+static char * __init xbc_make_cmdline(const char *key, bool is_kernel)
 {
 	struct xbc_node *root;
 	char *new_cmdline;
@@ -379,7 +397,7 @@ static char * __init xbc_make_cmdline(const char *key)
 		return NULL;
 
 	/* Count required buffer size */
-	len = xbc_snprint_cmdline(NULL, 0, root);
+	len = xbc_snprint_cmdline(NULL, 0, root, is_kernel);
 	if (len <= 0)
 		return NULL;
 
@@ -389,7 +407,7 @@ static char * __init xbc_make_cmdline(const char *key)
 		return NULL;
 	}
 
-	ret = xbc_snprint_cmdline(new_cmdline, len + 1, root);
+	ret = xbc_snprint_cmdline(new_cmdline, len + 1, root, is_kernel);
 	if (ret < 0 || ret > len) {
 		pr_err("Failed to print extra kernel cmdline.\n");
 		memblock_free(new_cmdline, len + 1);
@@ -465,9 +483,9 @@ static void __init setup_boot_config(void)
 		xbc_get_info(&ret, NULL);
 		pr_info("Load bootconfig: %ld bytes %d nodes\n", (long)size, ret);
 		/* keys starting with "kernel." are passed via cmdline */
-		extra_command_line = xbc_make_cmdline("kernel");
+		extra_command_line = xbc_make_cmdline("kernel", true);
 		/* Also, "init." keys are init arguments */
-		extra_init_args = xbc_make_cmdline("init");
+		extra_init_args = xbc_make_cmdline("init", false);
 	}
 	return;
 }


^ permalink raw reply related

* Re: [PATCH RFC v4 07/44] KVM: guest_memfd: Only prepare folios for private pages
From: Ackerley Tng @ 2026-04-01 14:05 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Jason Gunthorpe,
	Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <20260326-gmem-inplace-conversion-v4-7-e202fe950ffd@google.com>

Ackerley Tng <ackerleytng@google.com> writes:

> All-shared guest_memfd used to be only supported for non-CoCo VMs where
> preparation doesn't apply. INIT_SHARED is about to be supported for
> non-CoCo VMs in a later patch in this series.
>
> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
> guest_memfd in a later patch in this series.
>
> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
> shared folio for a CoCo VM where preparation applies.
>
> Add a check to make sure that preparation is only performed for private
> folios.
>
> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
> conversion to shared.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  virt/kvm/guest_memfd.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index b6ffa8734175d..d414ebfcb4c19 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -900,6 +900,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		     int *max_order)
>  {
>  	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> +	struct inode *inode;
>  	struct folio *folio;
>  	int r = 0;
>
> @@ -907,7 +908,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	if (!file)
>  		return -EFAULT;
>
> -	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> +	inode = file_inode(file);
> +	filemap_invalidate_lock_shared(inode->i_mapping);
>
>  	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
>  	if (IS_ERR(folio)) {
> @@ -920,7 +922,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		folio_mark_uptodate(folio);
>  	}
>
> -	r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> +	if (kvm_gmem_is_private_mem(inode, index))
> +		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);

Michael, I might have misunderstood you at the last guest_memfd call:
sev_gmem_prepare() doesn't prepare a page for being a shared page,
right? Does this work? That prepare is only called to "make private"?

>
>  	folio_unlock(folio);
>
> @@ -930,7 +933,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		folio_put(folio);
>
>  out:
> -	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> +	filemap_invalidate_unlock_shared(inode->i_mapping);
>  	return r;
>  }
>  EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>
> --
> 2.53.0.1018.g2bb0e51243-goog

^ permalink raw reply

* Re: [PATCH bpf] bpf: Reject sleepable kprobe_multi programs at attach time
From: Kumar Kartikeya Dwivedi @ 2026-04-01 14:08 UTC (permalink / raw)
  To: Varun R Mallya, jolsa
  Cc: andrii, ast, bpf, daniel, eddyz87, kpsingh, linux-kernel,
	linux-trace-kernel, martin.lau, mathieu.desnoyers, mattbobrowski,
	mhiramat, rostedt, song, yonghong.song
In-Reply-To: <20260401134921.362148-1-varunrmallya@gmail.com>

On Wed, 1 Apr 2026 at 15:49, Varun R Mallya <varunrmallya@gmail.com> wrote:
>
> kprobe.multi programs run in atomic/RCU context and cannot sleep.
> However, bpf_kprobe_multi_link_attach() did not validate whether the
> program being attached had the sleepable flag set, allowing sleepable
> helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
> context.
>
> This causes a "sleeping function called from invalid context" splat:
>
>   BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
>   in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
>   preempt_count: 1, expected: 0
>   RCU nest depth: 2, expected: 0
>
> Fix this by rejecting sleepable programs early in
> bpf_kprobe_multi_link_attach(), before any further processing.
>
> Fixes: 0dcac272540613d41c05e89679e4ddb978b612f1 ("bpf: Add multi kprobe link")

Please use the right format.
Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")

You can add the follow to your .gitconfig to generate the fixes tag.
[alias]
    fixes = log -1 --format='Fixes: %h (\"%s\")'

> Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
> ---
>  kernel/trace/bpf_trace.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 0b040a417442..af7079aa0f36 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
>         if (!is_kprobe_multi(prog))
>                 return -EINVAL;
>
> +       /* kprobe_multi is not allowed to be sleepable. */
> +       if (prog->sleepable)
> +               return -EINVAL;
> +

Looks ok to me, Jiri, could you also take a look?

>         /* Writing to context is not allowed for kprobes. */
>         if (prog->aux->kprobe_write_ctx)
>                 return -EINVAL;
> --
> 2.53.0
>

^ permalink raw reply

* Re: [PATCH 0/2] Fix trace remotes read with an offline CPU
From: Steven Rostedt @ 2026-04-01 14:31 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: mhiramat, mathieu.desnoyers, linux-trace-kernel, maz, kernel-team,
	linux-kernel
In-Reply-To: <20260401025003.3258729-1-vdonnefort@google.com>

On Wed,  1 Apr 2026 03:50:01 +0100
Vincent Donnefort <vdonnefort@google.com> wrote:

> This small series is fixing non-consuming read of a trace remote when the
> trace_buffer is created after a CPU is offline.
> 
> It also extends hotplug testing coverage to include this test case.
> 
> I have based this series on top of kvmarm/next which contains the hypervisor
> tracing patches.
> 
> Vincent Donnefort (2):
>   tracing: Non-consuming read for trace remotes with an offline CPU
>   tracing: selftests: Extend hotplug testing for trace remotes

This looks fine to me. Since they only affect the new remote code and will
not conflict with anything I'm doing, they can go on top of the KVM patches
in the arm64 tree.

Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>

for the series.

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH v9 2/3] tracing: Remove the backup instance automatically after read
From: Steven Rostedt @ 2026-04-01 14:40 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260401121957.2665d454390aff97593bb996@kernel.org>

On Wed, 1 Apr 2026 12:19:57 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > 
> >   CPU 0							CPU 1
> >   -----							-----
> >   open(trace_pipe);
> >   read(..);
> >   close(trace_pipe);
> >      kick the work queue to delete it....
> > 						rmdir();
> > 							[instance deleted]  
> 
> I thought this requires trace_types_lock, and after kicked the queue,
> can rmdir() gets the tr? (__trace_array_get() return error if
> tr->free_on_close is set)

rmdir() doesn't use __trace_array_get(), it uses trace_array_find() which
we shouldn't need to modify.

static int instance_rmdir(const char *name)
{
	struct trace_array *tr;

	guard(mutex)(&event_mutex);
	guard(mutex)(&trace_types_lock);

	tr = trace_array_find(name);
	if (!tr)
		return -ENODEV;

	return __remove_instance(tr);
}

> 
> > 
> >   __remove_instance();
> > 
> >    [ now the tr is freed, and the remove will crash!]
> > 
> > 
> > What would prevent this is this is to use trace_array_destroy() that checks
> > this and also adds the proper locking:
> > 
> > static void trace_array_autoremove(struct work_struct *work)
> > {
> > 	struct trace_array *tr = container_of(work, struct trace_array, autoremove_work);
> > 
> > 	trace_array_destroy(tr);
> > }  
> 
> OK, let's use it.

Yes, by using trace_array_destroy(), it will fix this.

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH v9 3/3] tracing/Documentation: Add a section about backup instance
From: Steven Rostedt @ 2026-04-01 14:41 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260401121541.e2c8c549ae172ccf2d5e3035@kernel.org>

On Wed, 1 Apr 2026 12:15:41 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > > +It is also possible to record trace data at system boot time by specifying
> > > +events with the persistent ring buffer, but in this case the data before the
> > > +reboot will be lost before it can be read. This problem can be solved by a
> > > +backup instance. From the kernel command line::
> > > +
> > > +  reserve_mem=12M:4096:trace trace_instance=boot_map@trace,sched,irq trace_instance=backup=boot_map  
> > 
> > The above didn't actually work well without my patch to enable events on
> > the persistent ring buffer with the backup. But keep it, as it will work
> > with my patch ;-)  
> 
> Ah, thanks!
> Let me rebase on it.

Well, my patch hasn't been added yet ;-)  But don't change anything. When
this goes upstream both patches should have been applied.

-- Steve

^ permalink raw reply

* Re: [PATCH bpf] bpf: Reject sleepable kprobe_multi programs at attach time
From: Kumar Kartikeya Dwivedi @ 2026-04-01 14:42 UTC (permalink / raw)
  To: Varun R Mallya, jolsa
  Cc: andrii, ast, bpf, daniel, eddyz87, kpsingh, linux-kernel,
	linux-trace-kernel, martin.lau, mathieu.desnoyers, mattbobrowski,
	mhiramat, rostedt, song, yonghong.song
In-Reply-To: <CAP01T75D7p2smSyoFvyZykgC5bJdobL_cqtk6-w+LmP4NP9SpQ@mail.gmail.com>

On Wed, 1 Apr 2026 at 16:08, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Wed, 1 Apr 2026 at 15:49, Varun R Mallya <varunrmallya@gmail.com> wrote:
> >
> > kprobe.multi programs run in atomic/RCU context and cannot sleep.
> > However, bpf_kprobe_multi_link_attach() did not validate whether the
> > program being attached had the sleepable flag set, allowing sleepable
> > helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
> > context.
> >
> > This causes a "sleeping function called from invalid context" splat:
> >
> >   BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
> >   in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
> >   preempt_count: 1, expected: 0
> >   RCU nest depth: 2, expected: 0
> >
> > Fix this by rejecting sleepable programs early in
> > bpf_kprobe_multi_link_attach(), before any further processing.
> >
> > Fixes: 0dcac272540613d41c05e89679e4ddb978b612f1 ("bpf: Add multi kprobe link")
>
> Please use the right format.
> Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
>
> You can add the follow to your .gitconfig to generate the fixes tag.
> [alias]
>     fixes = log -1 --format='Fixes: %h (\"%s\")'
>
> > Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
> > ---
> >  kernel/trace/bpf_trace.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index 0b040a417442..af7079aa0f36 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
> >         if (!is_kprobe_multi(prog))
> >                 return -EINVAL;
> >
> > +       /* kprobe_multi is not allowed to be sleepable. */
> > +       if (prog->sleepable)
> > +               return -EINVAL;
> > +
>
> Looks ok to me, Jiri, could you also take a look?
>
> >         /* Writing to context is not allowed for kprobes. */
> >         if (prog->aux->kprobe_write_ctx)
> >                 return -EINVAL;
> > --
> > 2.53.0
> >

Also, please add a selftest.

pw-bot: cr

^ permalink raw reply

* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Breno Leitao @ 2026-04-01 15:01 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
	linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <20260401224853.d8ed517a344c4be51d371a9c@kernel.org>

On Wed, Apr 01, 2026 at 10:48:53PM +0900, Masami Hiramatsu wrote:

> > The challenge extends beyond that. There are numerous early_parameter()
> > definitions scattered throughout the kernel that may or may not be
> > utilized by setup_arch().
> >
> > For example, consider `early_param("mitigations", ..)` in
> > ./kernel/cpu.c. This modifies the cpu_mitigations global variable, which
> > is referenced in various locations across different architectures.
> >
> > It's worth noting that we have over 300 early_parameter() instances in
> > the kernel.
> >
> > Given this, analyzing all these early parameters and examining each one
> > individually represents a substantial amount of work.
>
> Yes, that may require a substantial amount of work. But to improve
> the kernel framework around the parameter handling, eventually we
> need to examine each early parameter.

I'm still uncertain about this approach. The goal is to identify and
categorize the early parameters that are parsed prior to bootconfig
initialization.

Moreover, this work could become obsolete if bootconfig's initialization
point shifts earlier or later in the boot sequence, necessitating
another comprehensive analysis.

Conversely, if we successfully move bootconfig initialization earlier
by breaking the dependency of memblock (assuming this is feasible), the
vast majority of early parameters would execute after bootconfig is
configured, eliminating the need for this extensive categorization work.

Please, feel free to tell what approach might be better for the project.

> > Are there alternative approaches? At this point, I'm leaning toward
> > breaking bootconfig's dependency on memblock, allowing us to invoke it
> > before setup_arch(). Is this the only practical solution available?!
>
> Basically, the memblock dependency comes from allocating copy of data.
> Only for the embedded bootconfig, we can just pass copy memory block
> to the xbc_init(). Something like;
>
> xbc_init() {
>  xbc_data = memblock_alloc();
>  memcpy(xbc_data, data);
>  __xbc_init(xbc_data);
> }
>
> embedded_xbc_init() {
>   __xbc_init(embedded_bootconfig_data);
> }
>
> Afterwards, we can pass mixture of embedded bootcofnigt and initrd
> bootconfig data to parser again.
>
> (But in this case, we must be careful not to override the early
>  parameters that we have already applied.)

Do you have any additional recommendations if I proceed with this
approach?

Thank you for your detailed responses and insights.
--breno

^ permalink raw reply

* Re: [PATCH 0/2] Fix trace remotes read with an offline CPU
From: Marc Zyngier @ 2026-04-01 15:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Vincent Donnefort, mhiramat, mathieu.desnoyers,
	linux-trace-kernel, kernel-team, linux-kernel
In-Reply-To: <20260401103159.747ae168@gandalf.local.home>

On Wed, 01 Apr 2026 15:31:59 +0100,
Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Wed,  1 Apr 2026 03:50:01 +0100
> Vincent Donnefort <vdonnefort@google.com> wrote:
> 
> > This small series is fixing non-consuming read of a trace remote when the
> > trace_buffer is created after a CPU is offline.
> > 
> > It also extends hotplug testing coverage to include this test case.
> > 
> > I have based this series on top of kvmarm/next which contains the hypervisor
> > tracing patches.
> > 
> > Vincent Donnefort (2):
> >   tracing: Non-consuming read for trace remotes with an offline CPU
> >   tracing: selftests: Extend hotplug testing for trace remotes
> 
> This looks fine to me. Since they only affect the new remote code and will
> not conflict with anything I'm doing, they can go on top of the KVM patches
> in the arm64 tree.
> 
> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> 
> for the series.

Do you want to look at v2 [1], which was sent two hours later?

	M.

[1] https://lore.kernel.org/r/20260401045100.3394299-1-vdonnefort@google.com

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply

* Re: [PATCH 0/2] Fix trace remotes read with an offline CPU
From: Steven Rostedt @ 2026-04-01 15:16 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Vincent Donnefort, mhiramat, mathieu.desnoyers,
	linux-trace-kernel, kernel-team, linux-kernel
In-Reply-To: <861pgy4t1i.wl-maz@kernel.org>

On Wed, 01 Apr 2026 16:08:41 +0100
Marc Zyngier <maz@kernel.org> wrote:

> Do you want to look at v2 [1], which was sent two hours later?

Yeah I saw that right after I sent my ack.

Yep, my ack still stands, please add it to them.

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH RFC v4 07/44] KVM: guest_memfd: Only prepare folios for private pages
From: Michael Roth @ 2026-04-01 15:16 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <CAEvNRgF+FjJ1EWSR_rzD1=N040ZitiRrM2O3N0Kj5yN5rT3h+Q@mail.gmail.com>

On Wed, Apr 01, 2026 at 07:05:16AM -0700, Ackerley Tng wrote:
> Ackerley Tng <ackerleytng@google.com> writes:
> 
> > All-shared guest_memfd used to be only supported for non-CoCo VMs where
> > preparation doesn't apply. INIT_SHARED is about to be supported for
> > non-CoCo VMs in a later patch in this series.
> >
> > In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
> > guest_memfd in a later patch in this series.
> >
> > This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
> > shared folio for a CoCo VM where preparation applies.
> >
> > Add a check to make sure that preparation is only performed for private
> > folios.
> >
> > Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
> > conversion to shared.
> >
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > ---
> >  virt/kvm/guest_memfd.c | 9 ++++++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index b6ffa8734175d..d414ebfcb4c19 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -900,6 +900,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >  		     int *max_order)
> >  {
> >  	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > +	struct inode *inode;
> >  	struct folio *folio;
> >  	int r = 0;
> >
> > @@ -907,7 +908,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >  	if (!file)
> >  		return -EFAULT;
> >
> > -	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > +	inode = file_inode(file);
> > +	filemap_invalidate_lock_shared(inode->i_mapping);
> >
> >  	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> >  	if (IS_ERR(folio)) {
> > @@ -920,7 +922,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >  		folio_mark_uptodate(folio);
> >  	}
> >
> > -	r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> > +	if (kvm_gmem_is_private_mem(inode, index))
> > +		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> 
> Michael, I might have misunderstood you at the last guest_memfd call:
> sev_gmem_prepare() doesn't prepare a page for being a shared page,
> right? Does this work? That prepare is only called to "make private"?

Hmm, I guess your guest_memfd-inplace-conversion-v4 branch is out of sync with
these patches?

I have the below local patch based on top of that for SNP-specific enablement,
which is basically identically, so suffice to say: yes, this should work
for SNP :) If any architecture pops up that needs to do some prep in
advance of mapping shared pages, then we could potentially plumb the
shared/private flag through to the arch-specific prep hook, as was also
suggested on the call, but it doesn't seem like that's needed by any
users for now.

-Mike

  Author: Michael Roth <michael.roth@amd.com>
  Date:   Mon Oct 27 07:58:32 2025 -0500
  
      KVM: guest_memfd: Don't prepare shared folios
  
      In the current guest_memfd logic, "preparation" is only used currently
      to describe the additional work of putting a guest_memfd page into an
      architecturally-defined "private" state, such as updating RMP table
      entries for SEV-SNP guests. As such, there's no input to the
      corresponding kvm_arch_gmem_prepare() hooks as to whether a page is
      being prepared/accessed as shared or as private, so "preparation" will
      end up being erroneously done on pages that were supposed to remain in a
      shared state. Rather than plumb through the additional information
      needed to distinguish between shared vs. private preparation, just
      continue to only do preparation on private pages, as was the case prior
      to support for GUEST_MEMFD_FLAG_MMAP being introduced.
  
      Signed-off-by: Michael Roth <michael.roth@amd.com>
  
  diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
  index 3acc6d983449..4869e59e4fc5 100644
  --- a/virt/kvm/guest_memfd.c
  +++ b/virt/kvm/guest_memfd.c
  @@ -1249,7 +1249,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
                  folio_mark_uptodate(folio);
          }
  
  -       r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
  +       if (!kvm_gmem_is_shared_mem(file_inode(file), index))
  +               r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
  
          folio_unlock(folio);

> 
> >
> >  	folio_unlock(folio);
> >
> > @@ -930,7 +933,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >  		folio_put(folio);
> >
> >  out:
> > -	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> > +	filemap_invalidate_unlock_shared(inode->i_mapping);
> >  	return r;
> >  }
> >  EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
> >
> > --
> > 2.53.0.1018.g2bb0e51243-goog

^ permalink raw reply

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-04-01 15:35 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <20260326-gmem-inplace-conversion-v4-10-e202fe950ffd@google.com>

On Thu, Mar 26, 2026 at 03:24:19PM -0700, Ackerley Tng wrote:
> For shared to private conversions, if refcounts on any of the folios
> within the range are elevated, fail the conversion with -EAGAIN.
> 
> At the point of shared to private conversion, all folios in range are
> also unmapped. The filemap_invalidate_lock() is held, so no faulting
> can occur. Hence, from that point on, only transient refcounts can be
> taken on the folios associated with that guest_memfd.
> 
> Hence, it is safe to do the conversion from shared to private.
> 
> After conversion is complete, refcounts may become elevated, but that
> is fine since users of transient refcounts don't actually access
> memory.
> 
> For private to shared conversions, there are no refcount checks, since
> the guest is the only user of private pages, and guest_memfd will be the
> only holder of refcounts on private pages.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  Documentation/virt/kvm/api.rst |  48 +++++++-
>  include/linux/kvm_host.h       |  10 ++
>  include/uapi/linux/kvm.h       |   9 +-
>  virt/kvm/Kconfig               |   1 +
>  virt/kvm/guest_memfd.c         | 245 ++++++++++++++++++++++++++++++++++++++---
>  virt/kvm/kvm_main.c            |  17 ++-
>  6 files changed, 300 insertions(+), 30 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 0b61e2579e1d8..15148c80cfdb6 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -117,7 +117,7 @@ description:
>        x86 includes both i386 and x86_64.
>  
>    Type:
> -      system, vm, or vcpu.
> +      system, vm, vcpu or guest_memfd.
>  
>    Parameters:
>        what parameters are accepted by the ioctl.
> @@ -6557,11 +6557,22 @@ KVM_S390_KEYOP_SSKE
>  ---------------------------------
>  
>  :Capability: KVM_CAP_MEMORY_ATTRIBUTES2
> -:Architectures: x86
> -:Type: vm ioctl
> +:Architectures: all
> +:Type: vm, guest_memfd ioctl
>  :Parameters: struct kvm_memory_attributes2 (in/out)
>  :Returns: 0 on success, <0 on error
>  
> +Errors:
> +
> +  ========== ===============================================================
> +  EINVAL     The specified `offset` or `size` were invalid (e.g. not
> +             page aligned, causes an overflow, or size is zero).
> +  EFAULT     The parameter address was invalid.
> +  EAGAIN     Some page within requested range had unexpected refcounts. The
> +             offset of the page will be returned in `error_offset`.
> +  ENOMEM     Ran out of memory trying to track private/shared state
> +  ========== ===============================================================
> +
>  KVM_SET_MEMORY_ATTRIBUTES2 is an extension to
>  KVM_SET_MEMORY_ATTRIBUTES that supports returning (writing) values to
>  userspace.  The original (pre-extension) fields are shared with
> @@ -6572,15 +6583,42 @@ Attribute values are shared with KVM_SET_MEMORY_ATTRIBUTES.
>  ::
>  
>    struct kvm_memory_attributes2 {
> -	__u64 address;
> +	/* in */
> +	union {
> +		__u64 address;
> +		__u64 offset;
> +	};
>  	__u64 size;
>  	__u64 attributes;
>  	__u64 flags;
> -	__u64 reserved[12];
> +	/* out */
> +	__u64 error_offset;
> +	__u64 reserved[11];
>    };
>  
>    #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
>  
> +Set attributes for a range of offsets within a guest_memfd to
> +KVM_MEMORY_ATTRIBUTE_PRIVATE to limit the specified guest_memfd backed
> +memory range for guest_use. Even if KVM_CAP_GUEST_MEMFD_MMAP is
> +supported, after a successful call to set
> +KVM_MEMORY_ATTRIBUTE_PRIVATE, the requested range will not be mappable
> +into host userspace and will only be mappable by the guest.
> +
> +To allow the range to be mappable into host userspace again, call
> +KVM_SET_MEMORY_ATTRIBUTES2 on the guest_memfd again with
> +KVM_MEMORY_ATTRIBUTE_PRIVATE unset.
> +
> +If this ioctl returns -EAGAIN, the offset of the page with unexpected
> +refcounts will be returned in `error_offset`. This can occur if there
> +are transient refcounts on the pages, taken by other parts of the
> +kernel.
> +
> +Userspace is expected to figure out how to remove all known refcounts
> +on the shared pages, such as refcounts taken by get_user_pages(), and
> +try the ioctl again. A possible source of these long term refcounts is
> +if the guest_memfd memory was pinned in IOMMU page tables.
> +
>  See also: :ref: `KVM_SET_MEMORY_ATTRIBUTES`.
>  
>  .. _kvm_run:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 19f026f8de390..1ea14c66fc82e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2514,6 +2514,16 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
>  }
>  
>  #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +static inline u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> +#ifdef kvm_arch_has_private_mem
> +	if (!kvm || kvm_arch_has_private_mem(kvm))
> +		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +#endif
> +
> +	return 0;
> +}
> +
>  typedef unsigned long (kvm_get_memory_attributes_t)(struct kvm *kvm, gfn_t gfn);
>  DECLARE_STATIC_CALL(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
>  
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 16567d4a769e5..29baaa60de35a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -990,6 +990,7 @@ struct kvm_enable_cap {
>  #define KVM_CAP_S390_USER_OPEREXEC 246
>  #define KVM_CAP_S390_KEYOP 247
>  #define KVM_CAP_MEMORY_ATTRIBUTES2 248
> +#define KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES 249
>  
>  struct kvm_irq_routing_irqchip {
>  	__u32 irqchip;
> @@ -1642,11 +1643,15 @@ struct kvm_memory_attributes {
>  #define KVM_SET_MEMORY_ATTRIBUTES2              _IOWR(KVMIO,  0xd2, struct kvm_memory_attributes2)
>  
>  struct kvm_memory_attributes2 {
> -	__u64 address;
> +	union {
> +		__u64 address;
> +		__u64 offset;
> +	};
>  	__u64 size;
>  	__u64 attributes;
>  	__u64 flags;
> -	__u64 reserved[12];
> +	__u64 error_offset;
> +	__u64 reserved[11];
>  };
>  
>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 3fea89c45cfb4..e371e079e2c50 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -109,6 +109,7 @@ config KVM_VM_MEMORY_ATTRIBUTES
>  
>  config KVM_GUEST_MEMFD
>         select XARRAY_MULTI
> +       select KVM_MEMORY_ATTRIBUTES
>         bool
>  
>  config HAVE_KVM_ARCH_GMEM_PREPARE
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index d414ebfcb4c19..0cff9a85a4c53 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -183,10 +183,12 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  
>  static enum kvm_gfn_range_filter kvm_gmem_get_invalidate_filter(struct inode *inode)
>  {
> -	if (GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED)
> -		return KVM_FILTER_SHARED;
> -
> -	return KVM_FILTER_PRIVATE;
> +	/*
> +	 * TODO: Limit invalidations based on the to-be-invalidated range, i.e.
> +	 *       invalidate shared/private if and only if there can possibly be
> +	 *       such mappings.
> +	 */
> +	return KVM_FILTER_SHARED | KVM_FILTER_PRIVATE;
>  }
>  
>  static void __kvm_gmem_invalidate_begin(struct gmem_file *f, pgoff_t start,
> @@ -552,11 +554,235 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>  }
>  EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_memory_attributes);
>  
> +static bool kvm_gmem_range_has_attributes(struct maple_tree *mt,
> +					  pgoff_t index, size_t nr_pages,
> +					  u64 attributes)
> +{
> +	pgoff_t end = index + nr_pages - 1;
> +	void *entry;
> +
> +	lockdep_assert(mt_lock_is_held(mt));
> +
> +	mt_for_each(mt, entry, index, end) {
> +		if (xa_to_value(entry) != attributes)
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> +					    size_t nr_pages, pgoff_t *err_index)
> +{
> +	struct address_space *mapping = inode->i_mapping;
> +	const int filemap_get_folios_refcount = 1;
> +	pgoff_t last = start + nr_pages - 1;
> +	struct folio_batch fbatch;
> +	bool safe = true;
> +	int i;
> +
> +	folio_batch_init(&fbatch);
> +	while (safe && filemap_get_folios(mapping, &start, last, &fbatch)) {
> +
> +		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +			struct folio *folio = fbatch.folios[i];
> +
> +			if (folio_ref_count(folio) !=
> +			    folio_nr_pages(folio) + filemap_get_folios_refcount) {
> +				safe = false;
> +				*err_index = folio->index;
> +				break;
> +			}
> +		}
> +
> +		folio_batch_release(&fbatch);
> +		cond_resched();
> +	}
> +
> +	return safe;
> +}
> +
> +/*
> + * Preallocate memory for attributes to be stored on a maple tree, pointed to
> + * by mas.  Adjacent ranges with attributes identical to the new attributes
> + * will be merged.  Also sets mas's bounds up for storing attributes.
> + *
> + * This maintains the invariant that ranges with the same attributes will
> + * always be merged.
> + */
> +static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
> +				    pgoff_t start, size_t nr_pages)
> +{
> +	pgoff_t end = start + nr_pages;
> +	pgoff_t last = end - 1;
> +	void *entry;
> +
> +	/* Try extending range. entry is NULL on overflow/wrap-around. */
> +	mas_set_range(mas, end, end);
> +	entry = mas_find(mas, end);
> +	if (entry && xa_to_value(entry) == attributes)
> +		last = mas->last;
> +
> +	if (start > 0) {
> +		mas_set_range(mas, start - 1, start - 1);
> +		entry = mas_find(mas, start - 1);
> +		if (entry && xa_to_value(entry) == attributes)
> +			start = mas->index;
> +	}
> +
> +	mas_set_range(mas, start, last);
> +	return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
> +}
> +
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> +	struct folio_batch fbatch;
> +	pgoff_t next = start;
> +	int i;
> +
> +	folio_batch_init(&fbatch);
> +	while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
> +		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +			struct folio *folio = fbatch.folios[i];
> +			unsigned long pfn = folio_pfn(folio);
> +
> +			kvm_arch_gmem_invalidate(pfn, pfn + folio_nr_pages(folio));
> +		}
> +
> +		folio_batch_release(&fbatch);
> +		cond_resched();
> +	}
> +}
> +#else
> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> +#endif
> +
> +static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> +				     size_t nr_pages, uint64_t attrs,
> +				     pgoff_t *err_index)
> +{
> +	bool to_private = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +	struct address_space *mapping = inode->i_mapping;
> +	struct gmem_inode *gi = GMEM_I(inode);
> +	pgoff_t end = start + nr_pages;
> +	struct maple_tree *mt;
> +	struct ma_state mas;
> +	int r;
> +
> +	mt = &gi->attributes;
> +
> +	filemap_invalidate_lock(mapping);
> +
> +	mas_init(&mas, mt, start);
> +
> +	if (kvm_gmem_range_has_attributes(mt, start, nr_pages, attrs)) {
> +		r = 0;
> +		goto out;
> +	}
> +
> +	r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> +	if (r) {
> +		*err_index = start;
> +		goto out;
> +	}
> +
> +	if (to_private) {
> +		unmap_mapping_pages(mapping, start, nr_pages, false);
> +
> +		if (!kvm_gmem_is_safe_for_conversion(inode, start, nr_pages,
> +						     err_index)) {
> +			mas_destroy(&mas);
> +			r = -EAGAIN;
> +			goto out;
> +		}
> +	}
> +
> +	/*
> +	 * From this point on guest_memfd has performed necessary
> +	 * checks and can proceed to do guest-breaking changes.
> +	 */
> +
> +	kvm_gmem_invalidate_begin(inode, start, end);
> +
> +	if (!to_private)
> +		kvm_gmem_invalidate(inode, start, end);
> +
> +	mas_store_prealloc(&mas, xa_mk_value(attrs));
> +
> +	kvm_gmem_invalidate_end(inode, start, end);
> +out:
> +	filemap_invalidate_unlock(mapping);
> +	return r;
> +}
> +
> +static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
> +{
> +	struct gmem_file *f = file->private_data;
> +	struct inode *inode = file_inode(file);
> +	struct kvm_memory_attributes2 attrs;
> +	pgoff_t err_index;
> +	size_t nr_pages;
> +	pgoff_t index;
> +	int i, r;
> +
> +	if (copy_from_user(&attrs, argp, sizeof(attrs)))
> +		return -EFAULT;
> +
> +	if (attrs.flags)
> +		return -EINVAL;
> +	if (attrs.error_offset)
> +		return -EINVAL;
> +	for (i = 0; i < ARRAY_SIZE(attrs.reserved); i++) {
> +		if (attrs.reserved[i])
> +			return -EINVAL;
> +	}
> +	if (attrs.attributes & ~kvm_supported_mem_attributes(f->kvm))
> +		return -EINVAL;
> +	if (attrs.size == 0 || attrs.offset + attrs.size < attrs.offset)
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(attrs.offset) || !PAGE_ALIGNED(attrs.size))
> +		return -EINVAL;
> +
> +	if (attrs.offset >= inode->i_size ||
> +	    attrs.offset + attrs.size > inode->i_size)
> +		return -EINVAL;
> +
> +	nr_pages = attrs.size >> PAGE_SHIFT;
> +	index = attrs.offset >> PAGE_SHIFT;
> +	r = __kvm_gmem_set_attributes(inode, index, nr_pages, attrs.attributes,
> +				      &err_index);
> +	if (r) {
> +		attrs.error_offset = ((uint64_t)err_index) << PAGE_SHIFT;
> +
> +		if (copy_to_user(argp, &attrs, sizeof(attrs)))
> +			return -EFAULT;
> +	}
> +
> +	return r;
> +}
> +
> +static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
> +			   unsigned long arg)
> +{
> +	switch (ioctl) {
> +	case KVM_SET_MEMORY_ATTRIBUTES2:
> +		if (vm_memory_attributes)
> +			return -ENOTTY;
> +
> +		return kvm_gmem_set_attributes(file, (void __user *)arg);
> +	default:
> +		return -ENOTTY;
> +	}
> +}
> +
> +
>  static struct file_operations kvm_gmem_fops = {
>  	.mmap		= kvm_gmem_mmap,
>  	.open		= generic_file_open,
>  	.release	= kvm_gmem_release,
>  	.fallocate	= kvm_gmem_fallocate,
> +	.unlocked_ioctl	= kvm_gmem_ioctl,
>  };
>  
>  static int kvm_gmem_migrate_folio(struct address_space *mapping,
> @@ -942,20 +1168,13 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>  static bool kvm_gmem_range_is_private(struct gmem_inode *gi, pgoff_t index,
>  				      size_t nr_pages, struct kvm *kvm, gfn_t gfn)
>  {
> -	pgoff_t end = index + nr_pages - 1;
> -	void *entry;
> -
>  	if (vm_memory_attributes)
>  		return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
>  						       KVM_MEMORY_ATTRIBUTE_PRIVATE,
>  						       KVM_MEMORY_ATTRIBUTE_PRIVATE);
>  
> -	mt_for_each(&gi->attributes, entry, index, end) {
> -		if (xa_to_value(entry) != KVM_MEMORY_ATTRIBUTE_PRIVATE)
> -			return false;
> -	}
> -
> -	return true;
> +	return kvm_gmem_range_has_attributes(&gi->attributes, index, nr_pages,
> +					     KVM_MEMORY_ATTRIBUTE_PRIVATE);
>  }
>  
>  static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 3c261904322f0..85c14197587d4 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2435,16 +2435,6 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>  
>  #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> -static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> -{
> -#ifdef kvm_arch_has_private_mem
> -	if (!kvm || kvm_arch_has_private_mem(kvm))
> -		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> -#endif
> -
> -	return 0;
> -}
> -
>  #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>  static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
>  {
> @@ -2635,6 +2625,8 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>  		return -EINVAL;
>  	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
>  		return -EINVAL;
> +	if (attrs->error_offset)
> +		return -EINVAL;
>  	for (i = 0; i < ARRAY_SIZE(attrs->reserved); i++) {
>  		if (attrs->reserved[i])
>  			return -EINVAL;
> @@ -4983,6 +4975,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  		return 1;
>  	case KVM_CAP_GUEST_MEMFD_FLAGS:
>  		return kvm_gmem_get_supported_flags(kvm);
> +	case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
> +		if (vm_memory_attributes)
> +			return 0;
> +
> +		return kvm_supported_mem_attributes(kvm);

Based on the discussion from the PUCK call this morning, it sounds like it
would be a good idea to limit kvm_supported_mem_attributes() to only
reporting KVM_MEMORY_ATTRIBUTE_PRIVATE if the underlying CoCo
implementation has all the necessary enablement to support in-place
conversion via guest_memfd. In the case of SNP, there is a
documentation/parameter check in snp_launch_update() that needs to be
relaxed in order for userspace to be able to pass in a NULL 'src'
parameter (since, for in-place conversion, it would be initialized in place
as shared memory prior to the call, since by the time kvm_gmem_poulate()
it will have been set to private and therefore cannot be faulted in via
GUP (and if it could, we'd be unecessarily copying the src back on top
of itself since src/dst are the same).

So maybe there should be an arch hook to check a whitelist of VM types
that support KVM_MEMORY_ATTRIBUTE_PRIVATE when vm_memory_attributes=0,
and if we decide to enable it for SNP as part of this series you could
include the 1-2 patches needed there, or I could enable the SNP support
separately as a small series and I guess that would then become a prereq
for the SNP self-tests?

Not sure if additional enablement is needed for TDX or not before
KVM_MEMORY_ATTRIBUTE_PRIVATE would be advertised, but similar
considerations there.

-Mike

>  #endif
>  	default:
>  		break;
> 
> -- 
> 2.53.0.1018.g2bb0e51243-goog
> 

^ permalink raw reply

* Re: [PATCH] bootconfig: Skip printing early params to cmdline from bootconfig
From: Breno Leitao @ 2026-04-01 15:51 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <177505217508.1807250.22866077077504564.stgit@mhiramat.tok.corp.google.com>

On Wed, Apr 01, 2026 at 11:02:55PM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> If user configures `kernel.key` in bootconfig, the 'key' is shown
> in kernel cmdline (/proc/cmdline) and kernel boot parameter
> handler associated with 'key' is invoked. However, since the
> bootconfig does not support the parameter defined with early_param,
> those keys are shown in '/proc/cmdline' but not handled by kernel.
> 
> This could easily mislead users who expected to be able to specify
> early parameters via the boot configuration, leading them to wonder
> why it doesn't work.
> 
> Let's skip printing out early params to cmdline buffer, and warn
> if there is such parameters in bootconfig.
> 
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Reviewed-by: Breno Leitao <leitao@debian.org>

> +static bool __init is_early_param(const char *param)
> +{
> +	const struct obs_kernel_param *p;
> +
> +	for (p = __setup_start; p < __setup_end; p++) {
> +		if (p->early && parameq(param, p->str))
> +			return true;
> +	}

nit: I don't think you need the parenthesis ({) for the ifs in here.

^ permalink raw reply

* [PATCH v2] tracing/probe: reject empty immediate strings
From: Pengpeng Hou @ 2026-04-01 16:03 UTC (permalink / raw)
  To: rostedt
  Cc: mhiramat, mathieu.desnoyers, linux-trace-kernel, linux-kernel,
	pengpeng
In-Reply-To: <20260330062920.40766-1-pengpeng@iscas.ac.cn>

parse_probe_arg() accepts quoted immediate strings and passes the body
after the opening quote to __parse_imm_string(). That helper currently
computes strlen(str) and immediately dereferences str[len - 1], which
underflows when the body is empty.

Reject empty immediate strings before checking for the closing quote.

Fixes: a42e3c4de964 ("tracing/probe: Add immediate string parameter support")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
Changes since v1:
- resend as a standalone patch instead of part of an accidental
  cross-subsystem 1/4 series

 kernel/trace/trace_probe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e0a5dc86c07e..e1c73065dae5 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1068,7 +1068,7 @@ static int __parse_imm_string(char *str, char **pbuf, int offs)
 {
 	size_t len = strlen(str);
 
-	if (str[len - 1] != '"') {
+	if (!len || str[len - 1] != '"') {
 		trace_probe_log_err(offs + len, IMMSTR_NO_CLOSE);
 		return -EINVAL;
 	}
-- 
2.50.1 (Apple Git-155)


^ permalink raw reply related

* [PATCH bpf v2] bpf: Reject sleepable kprobe_multi programs at attach time
From: Varun R Mallya @ 2026-04-01 16:25 UTC (permalink / raw)
  To: varunrmallya
  Cc: andrii, ast, bpf, daniel, eddyz87, jolsa, kpsingh, linux-kernel,
	linux-trace-kernel, martin.lau, mathieu.desnoyers, mattbobrowski,
	memxor, mhiramat, rostedt, song, yonghong.song
In-Reply-To: <ac0hLyMuPHiZfnVc@computer>

kprobe.multi programs run in atomic/RCU context and cannot sleep.
However, bpf_kprobe_multi_link_attach() did not validate whether the
program being attached had the sleepable flag set, allowing sleepable
helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
context.

This causes a "sleeping function called from invalid context" splat:

  BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
  preempt_count: 1, expected: 0
  RCU nest depth: 2, expected: 0

Fix this by rejecting sleepable programs early in
bpf_kprobe_multi_link_attach(), before any further processing.

Also add a selftest that tries to load a sleepable kprobe_multi program
and it needs to be rejected for the test to pass.

Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
---
 kernel/trace/bpf_trace.c                      |  4 ++
 .../bpf/prog_tests/kprobe_multi_test.c        | 41 +++++++++++++++++++
 .../bpf/progs/kprobe_multi_sleepable.c        | 13 ++++++
 3 files changed, 58 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0b040a417442..af7079aa0f36 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 	if (!is_kprobe_multi(prog))
 		return -EINVAL;
 
+	/* kprobe_multi is not allowed to be sleepable. */
+	if (prog->sleepable)
+		return -EINVAL;
+
 	/* Writing to context is not allowed for kprobes. */
 	if (prog->aux->kprobe_write_ctx)
 		return -EINVAL;
diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
index 78c974d4ea33..f02fec2b6fda 100644
--- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
+++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
@@ -10,6 +10,7 @@
 #include "kprobe_multi_session_cookie.skel.h"
 #include "kprobe_multi_verifier.skel.h"
 #include "kprobe_write_ctx.skel.h"
+#include "kprobe_multi_sleepable.skel.h"
 #include "bpf/libbpf_internal.h"
 #include "bpf/hashmap.h"
 
@@ -633,6 +634,44 @@ static void test_attach_write_ctx(void)
 }
 #endif
 
+static void test_attach_multi_sleepable(void)
+{
+	struct kprobe_multi_sleepable *skel;
+	int err;
+
+	skel = kprobe_multi_sleepable__open();
+	if (!ASSERT_OK_PTR(skel, "kprobe_multi_sleepable__open"))
+		return;
+
+	err = bpf_program__set_flags(skel->progs.handle_kprobe_multi_sleepable,
+				     BPF_F_SLEEPABLE);
+	if (!ASSERT_OK(err, "bpf_program__set_flags"))
+		goto cleanup;
+
+	/* Load should succeed even with BPF_F_SLEEPABLE for KPROBE types */
+	err = kprobe_multi_sleepable__load(skel);
+	if (!ASSERT_OK(err, "kprobe_multi_sleepable__load"))
+		goto cleanup;
+
+	/* Attachment must fail for kprobe.multi + BPF_F_SLEEPABLE.
+	 * Also chosen a stable symbol to send into opts
+	 */
+	LIBBPF_OPTS(bpf_kprobe_multi_opts, opts);
+	const char *sym = "vfs_read";
+
+	opts.syms = &sym;
+	opts.cnt = 1;
+
+	skel->links.handle_kprobe_multi_sleepable =
+		bpf_program__attach_kprobe_multi_opts(skel->progs.handle_kprobe_multi_sleepable,
+						      NULL, &opts);
+	ASSERT_ERR_PTR(skel->links.handle_kprobe_multi_sleepable,
+		       "bpf_program__attach_kprobe_multi_opts");
+
+cleanup:
+	kprobe_multi_sleepable__destroy(skel);
+}
+
 void serial_test_kprobe_multi_bench_attach(void)
 {
 	if (test__start_subtest("kernel"))
@@ -676,5 +715,7 @@ void test_kprobe_multi_test(void)
 		test_unique_match();
 	if (test__start_subtest("attach_write_ctx"))
 		test_attach_write_ctx();
+	if (test__start_subtest("attach_multi_sleepable"))
+		test_attach_multi_sleepable();
 	RUN_TESTS(kprobe_multi_verifier);
 }
diff --git a/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
new file mode 100644
index 000000000000..56973ad8779d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+SEC("kprobe.multi")
+int handle_kprobe_multi_sleepable(struct pt_regs *ctx)
+{
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH bpf v2] bpf: Reject sleepable kprobe_multi programs at attach time
From: Kumar Kartikeya Dwivedi @ 2026-04-01 16:31 UTC (permalink / raw)
  To: Varun R Mallya
  Cc: andrii, ast, bpf, daniel, eddyz87, jolsa, kpsingh, linux-kernel,
	linux-trace-kernel, martin.lau, mathieu.desnoyers, mattbobrowski,
	mhiramat, rostedt, song, yonghong.song
In-Reply-To: <20260401162555.426867-1-varunrmallya@gmail.com>

On Wed, 1 Apr 2026 at 18:26, Varun R Mallya <varunrmallya@gmail.com> wrote:
>
> kprobe.multi programs run in atomic/RCU context and cannot sleep.
> However, bpf_kprobe_multi_link_attach() did not validate whether the
> program being attached had the sleepable flag set, allowing sleepable
> helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
> context.
>
> This causes a "sleeping function called from invalid context" splat:
>
>   BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
>   in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
>   preempt_count: 1, expected: 0
>   RCU nest depth: 2, expected: 0
>
> Fix this by rejecting sleepable programs early in
> bpf_kprobe_multi_link_attach(), before any further processing.
>
> Also add a selftest that tries to load a sleepable kprobe_multi program
> and it needs to be rejected for the test to pass.
>
> Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
> Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
> ---

Looks ok, but please split kernel fix and tests into separate patches,
and no need to reply to this thread.

pw-bot: cr


>  kernel/trace/bpf_trace.c                      |  4 ++
>  .../bpf/prog_tests/kprobe_multi_test.c        | 41 +++++++++++++++++++
>  .../bpf/progs/kprobe_multi_sleepable.c        | 13 ++++++
>  3 files changed, 58 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 0b040a417442..af7079aa0f36 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
>         if (!is_kprobe_multi(prog))
>                 return -EINVAL;
>
> +       /* kprobe_multi is not allowed to be sleepable. */
> +       if (prog->sleepable)
> +               return -EINVAL;
> +
>         /* Writing to context is not allowed for kprobes. */
>         if (prog->aux->kprobe_write_ctx)
>                 return -EINVAL;
> diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
> index 78c974d4ea33..f02fec2b6fda 100644
> --- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
> +++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
> @@ -10,6 +10,7 @@
>  #include "kprobe_multi_session_cookie.skel.h"
>  #include "kprobe_multi_verifier.skel.h"
>  #include "kprobe_write_ctx.skel.h"
> +#include "kprobe_multi_sleepable.skel.h"
>  #include "bpf/libbpf_internal.h"
>  #include "bpf/hashmap.h"
>
> @@ -633,6 +634,44 @@ static void test_attach_write_ctx(void)
>  }
>  #endif
>
> +static void test_attach_multi_sleepable(void)
> +{
> +       struct kprobe_multi_sleepable *skel;
> +       int err;
> +
> +       skel = kprobe_multi_sleepable__open();
> +       if (!ASSERT_OK_PTR(skel, "kprobe_multi_sleepable__open"))
> +               return;
> +
> +       err = bpf_program__set_flags(skel->progs.handle_kprobe_multi_sleepable,
> +                                    BPF_F_SLEEPABLE);
> +       if (!ASSERT_OK(err, "bpf_program__set_flags"))
> +               goto cleanup;
> +
> +       /* Load should succeed even with BPF_F_SLEEPABLE for KPROBE types */
> +       err = kprobe_multi_sleepable__load(skel);
> +       if (!ASSERT_OK(err, "kprobe_multi_sleepable__load"))
> +               goto cleanup;
> +
> +       /* Attachment must fail for kprobe.multi + BPF_F_SLEEPABLE.
> +        * Also chosen a stable symbol to send into opts
> +        */
> +       LIBBPF_OPTS(bpf_kprobe_multi_opts, opts);
> +       const char *sym = "vfs_read";
> +
> +       opts.syms = &sym;
> +       opts.cnt = 1;
> +
> +       skel->links.handle_kprobe_multi_sleepable =
> +               bpf_program__attach_kprobe_multi_opts(skel->progs.handle_kprobe_multi_sleepable,
> +                                                     NULL, &opts);
> +       ASSERT_ERR_PTR(skel->links.handle_kprobe_multi_sleepable,
> +                      "bpf_program__attach_kprobe_multi_opts");
> +
> +cleanup:
> +       kprobe_multi_sleepable__destroy(skel);
> +}
> +
>  void serial_test_kprobe_multi_bench_attach(void)
>  {
>         if (test__start_subtest("kernel"))
> @@ -676,5 +715,7 @@ void test_kprobe_multi_test(void)
>                 test_unique_match();
>         if (test__start_subtest("attach_write_ctx"))
>                 test_attach_write_ctx();
> +       if (test__start_subtest("attach_multi_sleepable"))
> +               test_attach_multi_sleepable();
>         RUN_TESTS(kprobe_multi_verifier);
>  }
> diff --git a/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
> new file mode 100644
> index 000000000000..56973ad8779d
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
> @@ -0,0 +1,13 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +SEC("kprobe.multi")
> +int handle_kprobe_multi_sleepable(struct pt_regs *ctx)
> +{
> +       return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> --
> 2.53.0
>

^ permalink raw reply

* Re: [PATCH v2] tracing/probe: reject empty immediate strings
From: Steven Rostedt @ 2026-04-01 17:00 UTC (permalink / raw)
  To: Pengpeng Hou
  Cc: mhiramat, mathieu.desnoyers, linux-trace-kernel, linux-kernel
In-Reply-To: <20260401160315.88518-1-pengpeng@iscas.ac.cn>

On Thu,  2 Apr 2026 00:03:15 +0800
Pengpeng Hou <pengpeng@iscas.ac.cn> wrote:

> parse_probe_arg() accepts quoted immediate strings and passes the body
> after the opening quote to __parse_imm_string(). That helper currently
> computes strlen(str) and immediately dereferences str[len - 1], which
> underflows when the body is empty.
> 
> Reject empty immediate strings before checking for the closing quote.
> 
> Fixes: a42e3c4de964 ("tracing/probe: Add immediate string parameter support")
> Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve

^ permalink raw reply

* Re: [PATCH v3 0/3] tracing: Read user data from futex system call trace event
From: Steven Rostedt @ 2026-04-01 17:15 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, Peter Zijlstra
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Thomas Gleixner, Brian Geffon, John Stultz, Ian Rogers,
	Suleiman Souhlal
In-Reply-To: <20260331181349.062575155@kernel.org>


Peter,

Any thoughts on this version?

-- Steve


On Tue, 31 Mar 2026 14:13:49 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> We are looking at the performance of futexes and require a bit more
> information when tracing them.
> 
> The two patches here extend the system call reading of user space to
> create specific handling of the futex system call. It now reads the
> user space relevant data (the addr, utime and addr2), as well as
> parses the flags. This adds a little smarts to the trace event as
> it only shows the parameters that are relevant, as well as parses
> utime as either a timespec or as val2 depending on the futex_op.
> 
> Here's an example of the new output:
> 
>  sys_futex(uaddr: 0x56196292e830 (0), FUTEX_WAKE|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e834 (0x4a7) tid: 1191, FUTEX_UNLOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e834 (0) tid: 0, FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e830 (0), FUTEX_WAIT|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (4aa), val3: 0)
>  sys_futex(uaddr: 0x56196292e834 (0x4aa) tid: 1194, FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (800004aa), val3: 0)
>  sys_futex(uaddr: 0x7f7ed6b29990 (0x4ab), FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME)
>  sys_futex(uaddr: 0x56196292e834 (0x800004aa) tid: 1194 (WAITERS), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (800004aa), val3: 0)
>  sys_futex(uaddr: 0x56196292e834 (0x800004aa) tid: 1194 (WAITERS), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
> 
> Changes since v2: https://lore.kernel.org/all/20260310200954.285663884@kernel.org/
> 
> - Removed unused "buf" variable (kernel test robot)
> 
> - Iterate __futex_cmds[] make the print statement.
>   Note this required exposing __futex_cmds[] to trace_syscall.c
>   (Masami Hiramatsu)
> 
> - Added back val statement (with the move to futex/syscall.c the
>   third parameter was dropped).
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
> futex/core
> 
> Head SHA1: 79b0609ad15b24d0bbfb1790e17902a6c210ae69
> 
> 
> Steven Rostedt (3):
>       tracing: Have futex syscall trace event show specific user data
>       tracing: Update futex syscall trace event to show more commands
>       tracing: Show TID and flags for PI futex system call trace event
> 
> ----
>  include/linux/futex.h         |  39 ++++++-
>  kernel/futex/syscalls.c       | 137 +++++++++++++++++++++---
>  kernel/trace/trace_syscalls.c | 237 +++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 396 insertions(+), 17 deletions(-)


^ permalink raw reply

* Re: [PATCH v3 1/3] tracing: Have futex syscall trace event show specific user data
From: Ian Rogers @ 2026-04-01 18:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Brian Geffon, John Stultz, Suleiman Souhlal
In-Reply-To: <20260331181432.995263521@kernel.org>

On Tue, Mar 31, 2026 at 11:13 AM Steven Rostedt <rostedt@kernel.org> wrote:
>
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Add specific reporting of the futex system call. This allows for debugging
> the futex code a bit easier. Instead of just showing the values passed
> into the futex system call, read the value of the user space memory
> pointed to by the addr parameter.
>
> Also make the op parameter more readable by parsing the values to show
> what the command is:
>
>  futex_requeue_p-3251    [002] .....  2101.068479: sys_futex(uaddr: 0x55e79a4da834 (0x80000cb1), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG, val: 0)
>  futex_requeue_p-3248    [001] .....  2101.068970: sys_futex(uaddr: 0x7f859072f990 (0xcb2), FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, val: 3250)

nit: Would this read better if `val` were also hex?
futex_requeue_p-3248    [001] .....  2101.068970: sys_futex(uaddr:
0x7f859072f990 (0xcb2), FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, val:
0xcb2)
The val matching the value in uaddr is the condition necessary for the
futex to sleep.

>  futex_requeue_p-3252    [005] .....  2101.069108: sys_futex(uaddr: 0x55e79a4da838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, val: 0, timespec: 0x7ffe61076aa0, uaddr2: 0x55e79a4da834, uaddr2: 94453214586932, val3: 0)
>  futex_requeue_p-3252    [005] .....  2101.069410: sys_futex(uaddr: 0x55e79a4da834 (0x80000cb1), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG, val: 0)
>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> Changes since v2: https://patch.msgid.link/20260310201036.542627924@kernel.org
>
> - Removed unused "buf" variable (kernel test robot)
>
> - Iterate __futex_cmds[] make the print statement.
>   Note this required exposing __futex_cmds[] to trace_syscall.c
>
> - Added back val statement (with the move to futex/syscall.c the
>   third parameter was dropped).
>
>  include/linux/futex.h         |   8 +-
>  kernel/futex/syscalls.c       |  97 +++++++++++++++++++++++
>  kernel/trace/trace_syscalls.c | 144 +++++++++++++++++++++++++++++++++-
>  3 files changed, 245 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/futex.h b/include/linux/futex.h
> index 9e9750f04980..7eaf01ff68cf 100644
> --- a/include/linux/futex.h
> +++ b/include/linux/futex.h
> @@ -82,6 +82,11 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
>               u32 __user *uaddr2, u32 val2, u32 val3);
>  int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4);
>
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args, u32 *ptr);
> +extern const char * __futex_cmds[];
> +#endif
> +
>  #ifdef CONFIG_FUTEX_PRIVATE_HASH
>  int futex_hash_allocate_default(void);
>  void futex_hash_free(struct mm_struct *mm);
> @@ -114,7 +119,8 @@ static inline int futex_hash_allocate_default(void)
>  }
>  static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
>  static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
> -
> +static inline void futex_print_syscall(struct seq_buf *s, int nr_args,
> +                                      unsigned long *args, u32 *ptr) { }
>  #endif
>
>  #endif
> diff --git a/kernel/futex/syscalls.c b/kernel/futex/syscalls.c
> index 743c7a728237..3ba8ca017c9c 100644
> --- a/kernel/futex/syscalls.c
> +++ b/kernel/futex/syscalls.c
> @@ -171,6 +171,18 @@ static __always_inline bool futex_cmd_has_timeout(u32 cmd)
>         return false;
>  }
>
> +static __always_inline bool futex_cmd_has_addr2(u32 cmd)
> +{
> +       switch (cmd) {
> +       case FUTEX_REQUEUE:
> +       case FUTEX_CMP_REQUEUE:
> +       case FUTEX_WAKE_OP:
> +       case FUTEX_WAIT_REQUEUE_PI:
> +               return true;
> +       }
> +       return false;
> +}
> +
>  static __always_inline int
>  futex_init_timeout(u32 cmd, u32 op, struct timespec64 *ts, ktime_t *t)
>  {
> @@ -207,6 +219,91 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
>         return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
>  }
>
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +
> +/* End with NULL for iterators */
> +const char * __futex_cmds[] =
> +{
> +       "FUTEX_WAIT", "FUTEX_WAKE", "FUTEX_FD", "FUTEX_REQUEUE",
> +       "FUTEX_CMP_REQUEUE", "FUTEX_WAKE_OP", "FUTEX_LOCK_PI",
> +       "FUTEX_UNLOCK_PI", "FUTEX_TRYLOCK_PI", "FUTEX_WAIT_BITSET",
> +       "FUTEX_WAKE_BITSET", "FUTEX_WAIT_REQUEUE_PI", "FUTEX_CMP_REQUEUE_PI",
> +       "FUTEX_LOCK_PI2", NULL
> +};
> +
> +void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args, u32 *ptr)
> +{
> +       unsigned int op, cmd;
> +       bool done = false;
> +
> +       for (int i = 0; !done && i < nr_args; i++) {
> +
> +               if (seq_buf_has_overflowed(s))
> +                       break;
> +
> +               switch (i) {
> +               case 0:
> +                       seq_buf_printf(s, "uaddr: 0x%lx", args[i]);
> +                       if (ptr) {
> +                               u32 val = *ptr;
> +                               if (val < 10)
> +                                       seq_buf_printf(s, " (%u)", val);
> +                               else
> +                                       seq_buf_printf(s, " (0x%x)", val);
> +                       }
> +                       continue;
> +               case 1:
> +                       op = args[i];
> +                       cmd = op & FUTEX_CMD_MASK;
> +                       if (cmd <= FUTEX_LOCK_PI2)
> +                               seq_buf_printf(s, ", %s", __futex_cmds[cmd]);
> +                       else
> +                               seq_buf_puts(s, ", UNKNOWN");
> +
> +                       if (op & FUTEX_PRIVATE_FLAG)
> +                               seq_buf_puts(s, "|FUTEX_PRIVATE_FLAG");
> +                       if (op & FUTEX_CLOCK_REALTIME)
> +                               seq_buf_puts(s, "|FUTEX_CLOCK_REALTIME");
> +                       continue;
> +               case 2:
> +                       if (args[i] < 10)
> +                               seq_buf_printf(s, ", val: %ld", args[i]);
> +                       else
> +                               seq_buf_printf(s, ", val: 0x%lx", args[i]);

I'm confused why this hex formatting didn't appear in the sample output.

Thanks,
Ian

> +                       continue;
> +               case 3:
> +                       if (!futex_cmd_has_timeout(cmd)) {
> +
> +                               if (!futex_cmd_has_addr2(cmd)) {
> +                                       done = true;
> +                                       continue;
> +                               }
> +
> +                               seq_buf_printf(s, ", val2: 0x%x", (u32)(long)args[i]);
> +                               continue;
> +                       }
> +
> +                       if (!args[i])
> +                               continue;
> +
> +                       seq_buf_printf(s, ", timespec: 0x%lx", args[i]);
> +                       continue;
> +               case 4:
> +                       if (!futex_cmd_has_addr2(cmd)) {
> +                               done = true;
> +                               continue;
> +                       }
> +                       seq_buf_printf(s, ", uaddr2: 0x%lx", args[i]);
> +                       continue;
> +               case 5:
> +                       seq_buf_printf(s, ", val3: %lu", args[i]);
> +                       done = true;
> +                       continue;
> +               }
> +       }
> +}
> +#endif
> +
>  /**
>   * futex_parse_waitv - Parse a waitv array from userspace
>   * @futexv:    Kernel side list of waiters to be filled
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 37317b81fcda..f8213d772f89 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -6,11 +6,13 @@
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
>  #include <linux/module.h>      /* for MODULE_NAME_LEN via KSYM_SYMBOL_LEN */
> +#include <linux/futex.h>
>  #include <linux/ftrace.h>
>  #include <linux/perf_event.h>
>  #include <linux/xarray.h>
>  #include <asm/syscall.h>
>
> +
>  #include "trace_output.h"
>  #include "trace.h"
>
> @@ -237,6 +239,27 @@ sys_enter_openat_print(struct syscall_trace_enter *trace, struct syscall_metadat
>         return trace_handle_return(s);
>  }
>
> +static enum print_line_t
> +sys_enter_futex_print(struct syscall_trace_enter *trace, struct syscall_metadata *entry,
> +                     struct trace_seq *s, struct trace_event *event, int ent_size)
> +{
> +       void *end = (void *)trace + ent_size;
> +       void *ptr;
> +
> +       /* Set ptr to the user space copied area */
> +       ptr = (void *)trace->args + sizeof(unsigned long) * entry->nb_args;
> +       if (ptr + 4 > end)
> +               ptr = NULL;
> +
> +       trace_seq_printf(s, "%s(", entry->name);
> +
> +       futex_print_syscall(&s->seq, entry->nb_args, trace->args, ptr);
> +
> +       trace_seq_puts(s, ")\n");
> +
> +       return trace_handle_return(s);
> +}
> +
>  static enum print_line_t
>  print_syscall_enter(struct trace_iterator *iter, int flags,
>                     struct trace_event *event)
> @@ -267,6 +290,10 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
>                 if (!tr || !(tr->trace_flags & TRACE_ITER(VERBOSE)))
>                         return sys_enter_openat_print(trace, entry, s, event);
>                 break;
> +       case __NR_futex:
> +               if (!tr || !(tr->trace_flags & TRACE_ITER(VERBOSE)))
> +                       return sys_enter_futex_print(trace, entry, s, event, iter->ent_size);
> +               break;
>         default:
>                 break;
>         }
> @@ -437,6 +464,48 @@ sys_enter_openat_print_fmt(struct syscall_metadata *entry, char *buf, int len)
>         return pos;
>  }
>
> +static int __init
> +sys_enter_futex_print_fmt(struct syscall_metadata *entry, char *buf, int len)
> +{
> +       int pos = 0;
> +
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       "\"uaddr: 0x%%lx (0x%%lx) cmd=%%s%%s%%s");
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       "  val: 0x%%x timeout/val2: 0x%%llx");
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       " uaddr2: 0x%%lx val3: 0x%%x\", ");
> +
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       " REC->uaddr,");
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       " REC->__value,");
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       "  __print_symbolic(REC->op & 0x%x, ", FUTEX_CMD_MASK);
> +
> +       for (int i = 0; __futex_cmds[i]; i++) {
> +               pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                               "%s{%d, \"%s\"} ",
> +                               i ? "," : "", i, __futex_cmds[i]);
> +       }
> +
> +       pos += snprintf(buf + pos, LEN_OR_ZERO, "), ");
> +
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       " (REC->op & %d) ? \"|FUTEX_PRIVATE_FLAG\" : \"\",",
> +                       FUTEX_PRIVATE_FLAG);
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       " (REC->op & %d) ? \"|FUTEX_CLOCK_REALTIME\" : \"\",",
> +                       FUTEX_CLOCK_REALTIME);
> +
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       " REC->val, REC->utime,");
> +
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       " REC->uaddr, REC->val3");
> +       return pos;
> +}
> +
>  static int __init
>  __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
>  {
> @@ -447,6 +516,8 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
>         switch (entry->syscall_nr) {
>         case __NR_openat:
>                 return sys_enter_openat_print_fmt(entry, buf, len);
> +       case __NR_futex:
> +               return sys_enter_futex_print_fmt(entry, buf, len);
>         default:
>                 break;
>         }
> @@ -523,6 +594,21 @@ static void __init free_syscall_print_fmt(struct trace_event_call *call)
>                 kfree(call->print_fmt);
>  }
>
> +static int __init futex_fields(struct trace_event_call *call, int offset)
> +{
> +       char *arg;
> +       int ret;
> +
> +       arg = kstrdup("__value", GFP_KERNEL);
> +       if (WARN_ON_ONCE(!arg))
> +               return -ENOMEM;
> +       ret = trace_define_field(call, "u32", arg, offset, sizeof(int), 0,
> +                                FILTER_OTHER);
> +       if (ret)
> +               kfree(arg);
> +       return ret;
> +}
> +
>  static int __init syscall_enter_define_fields(struct trace_event_call *call)
>  {
>         struct syscall_trace_enter trace;
> @@ -544,6 +630,9 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
>                 offset += sizeof(unsigned long);
>         }
>
> +       if (!ret && meta->syscall_nr == __NR_futex)
> +               return futex_fields(call, offset);
> +
>         if (ret || !meta->user_mask)
>                 return ret;
>
> @@ -689,6 +778,45 @@ static int syscall_copy_user_array(char *buf, const char __user *ptr,
>         return 0;
>  }
>
> +static int
> +syscall_get_futex(unsigned long *args, char **buffer, int *size, int buf_size)
> +{
> +       struct syscall_user_buffer *sbuf;
> +       const char __user *ptr;
> +
> +       /* buf_size of zero means user doesn't want user space read */
> +       if (!buf_size)
> +               return -1;
> +
> +       /* If the syscall_buffer is NULL, tracing is being shutdown */
> +       sbuf = READ_ONCE(syscall_buffer);
> +       if (!sbuf)
> +               return -1;
> +
> +       ptr = (char __user *)args[0];
> +
> +       *buffer = trace_user_fault_read(&sbuf->buf, ptr, 4, NULL, NULL);
> +       if (!*buffer)
> +               return -1;
> +
> +       /* Add room for the value */
> +       *size += 4;
> +
> +       return 0;
> +}
> +
> +static void syscall_put_futex(struct syscall_metadata *sys_data,
> +                             struct syscall_trace_enter *entry,
> +                             char *buffer)
> +{
> +       u32 *ptr;
> +
> +       /* Place the futex key into the storage */
> +       ptr = (void *)entry->args + sizeof(unsigned long) * sys_data->nb_args;
> +
> +       *ptr = *(u32 *)buffer;
> +}
> +
>  static char *sys_fault_user(unsigned int buf_size,
>                             struct syscall_metadata *sys_data,
>                             struct syscall_user_buffer *sbuf,
> @@ -905,6 +1033,9 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
>                 if (syscall_get_data(sys_data, args, &user_ptr,
>                                      &size, user_sizes, &uargs, tr->syscall_buf_sz) < 0)
>                         return;
> +       } else if (syscall_nr == __NR_futex) {
> +               if (syscall_get_futex(args, &user_ptr, &size, tr->syscall_buf_sz) < 0)
> +                       return;
>         }
>
>         size += sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> @@ -921,6 +1052,9 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
>         if (mayfault)
>                 syscall_put_data(sys_data, entry, user_ptr, size, user_sizes, uargs);
>
> +       else if (syscall_nr == __NR_futex)
> +               syscall_put_futex(sys_data, entry, user_ptr);
> +
>         trace_event_buffer_commit(&fbuffer);
>  }
>
> @@ -971,14 +1105,18 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
>  {
>         struct syscall_metadata *sys_data = call->data;
>         struct trace_array *tr = file->tr;
> +       bool enable_faults;
>         int ret = 0;
>         int num;
>
>         num = sys_data->syscall_nr;
>         if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
>                 return -ENOSYS;
> +
> +       enable_faults = sys_data->user_mask || num == __NR_futex;
> +
>         guard(mutex)(&syscall_trace_lock);
> -       if (sys_data->user_mask) {
> +       if (enable_faults) {
>                 ret = syscall_fault_buffer_enable();
>                 if (ret < 0)
>                         return ret;
> @@ -986,7 +1124,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
>         if (!tr->sys_refcount_enter) {
>                 ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
>                 if (ret < 0) {
> -                       if (sys_data->user_mask)
> +                       if (enable_faults)
>                                 syscall_fault_buffer_disable();
>                         return ret;
>                 }
> @@ -1011,7 +1149,7 @@ static void unreg_event_syscall_enter(struct trace_event_file *file,
>         WRITE_ONCE(tr->enter_syscall_files[num], NULL);
>         if (!tr->sys_refcount_enter)
>                 unregister_trace_sys_enter(ftrace_syscall_enter, tr);
> -       if (sys_data->user_mask)
> +       if (sys_data->user_mask || num == __NR_futex)
>                 syscall_fault_buffer_disable();
>  }
>
> --
> 2.51.0
>
>

^ permalink raw reply

* Re: [PATCH v3 1/3] tracing: Have futex syscall trace event show specific user data
From: Steven Rostedt @ 2026-04-01 18:17 UTC (permalink / raw)
  To: Ian Rogers
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Brian Geffon, John Stultz, Suleiman Souhlal
In-Reply-To: <CAP-5=fWxhLrMTwteTedGVeUg8uz=xzF+_3Q4qFnxB=TaKR9D8w@mail.gmail.com>

On Wed, 1 Apr 2026 11:07:07 -0700
Ian Rogers <irogers@google.com> wrote:

> nit: Would this read better if `val` were also hex?
> futex_requeue_p-3248    [001] .....  2101.068970: sys_futex(uaddr:
> 0x7f859072f990 (0xcb2), FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, val:
> 0xcb2)
> The val matching the value in uaddr is the condition necessary for the
> futex to sleep.

Good point. I can update that in v4.

-- Steve

^ permalink raw reply

* [PATCH bpf v3 1/2] bpf: Reject sleepable kprobe_multi programs at attach time
From: Varun R Mallya @ 2026-04-01 19:11 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, memxor, yonghong.song, jolsa, rostedt, mhiramat,
	linux-kernel, linux-trace-kernel, varunrmallya

kprobe.multi programs run in atomic/RCU context and cannot sleep.
However, bpf_kprobe_multi_link_attach() did not validate whether the
program being attached had the sleepable flag set, allowing sleepable
helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
context.

This causes a "sleeping function called from invalid context" splat:

  BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
  preempt_count: 1, expected: 0
  RCU nest depth: 2, expected: 0

Fix this by rejecting sleepable programs early in
bpf_kprobe_multi_link_attach(), before any further processing.

Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
---
 kernel/trace/bpf_trace.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0b040a417442..af7079aa0f36 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 	if (!is_kprobe_multi(prog))
 		return -EINVAL;
 
+	/* kprobe_multi is not allowed to be sleepable. */
+	if (prog->sleepable)
+		return -EINVAL;
+
 	/* Writing to context is not allowed for kprobes. */
 	if (prog->aux->kprobe_write_ctx)
 		return -EINVAL;
-- 
2.53.0


^ permalink raw reply related

* [PATCH bpf v3 2/2] selftests/bpf: Add test to ensure kprobe_multi is not sleepable
From: Varun R Mallya @ 2026-04-01 19:11 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, memxor, yonghong.song, jolsa, rostedt, mhiramat,
	linux-kernel, linux-trace-kernel, varunrmallya
In-Reply-To: <20260401191126.440683-1-varunrmallya@gmail.com>

Add a selftest to ensure that kprobe_multi programs cannot be attached
using the BPF_F_SLEEPABLE flag. This test succeeds when the kernel
rejects attachment of kprobe_multi when the BPF_F_SLEEPABLE flag is set.

Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
---
 .../bpf/prog_tests/kprobe_multi_test.c        | 41 +++++++++++++++++++
 .../bpf/progs/kprobe_multi_sleepable.c        | 13 ++++++
 2 files changed, 54 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c

diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
index 78c974d4ea33..f02fec2b6fda 100644
--- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
+++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
@@ -10,6 +10,7 @@
 #include "kprobe_multi_session_cookie.skel.h"
 #include "kprobe_multi_verifier.skel.h"
 #include "kprobe_write_ctx.skel.h"
+#include "kprobe_multi_sleepable.skel.h"
 #include "bpf/libbpf_internal.h"
 #include "bpf/hashmap.h"
 
@@ -633,6 +634,44 @@ static void test_attach_write_ctx(void)
 }
 #endif
 
+static void test_attach_multi_sleepable(void)
+{
+	struct kprobe_multi_sleepable *skel;
+	int err;
+
+	skel = kprobe_multi_sleepable__open();
+	if (!ASSERT_OK_PTR(skel, "kprobe_multi_sleepable__open"))
+		return;
+
+	err = bpf_program__set_flags(skel->progs.handle_kprobe_multi_sleepable,
+				     BPF_F_SLEEPABLE);
+	if (!ASSERT_OK(err, "bpf_program__set_flags"))
+		goto cleanup;
+
+	/* Load should succeed even with BPF_F_SLEEPABLE for KPROBE types */
+	err = kprobe_multi_sleepable__load(skel);
+	if (!ASSERT_OK(err, "kprobe_multi_sleepable__load"))
+		goto cleanup;
+
+	/* Attachment must fail for kprobe.multi + BPF_F_SLEEPABLE.
+	 * Also chosen a stable symbol to send into opts
+	 */
+	LIBBPF_OPTS(bpf_kprobe_multi_opts, opts);
+	const char *sym = "vfs_read";
+
+	opts.syms = &sym;
+	opts.cnt = 1;
+
+	skel->links.handle_kprobe_multi_sleepable =
+		bpf_program__attach_kprobe_multi_opts(skel->progs.handle_kprobe_multi_sleepable,
+						      NULL, &opts);
+	ASSERT_ERR_PTR(skel->links.handle_kprobe_multi_sleepable,
+		       "bpf_program__attach_kprobe_multi_opts");
+
+cleanup:
+	kprobe_multi_sleepable__destroy(skel);
+}
+
 void serial_test_kprobe_multi_bench_attach(void)
 {
 	if (test__start_subtest("kernel"))
@@ -676,5 +715,7 @@ void test_kprobe_multi_test(void)
 		test_unique_match();
 	if (test__start_subtest("attach_write_ctx"))
 		test_attach_write_ctx();
+	if (test__start_subtest("attach_multi_sleepable"))
+		test_attach_multi_sleepable();
 	RUN_TESTS(kprobe_multi_verifier);
 }
diff --git a/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
new file mode 100644
index 000000000000..56973ad8779d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+SEC("kprobe.multi")
+int handle_kprobe_multi_sleepable(struct pt_regs *ctx)
+{
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH v3 0/3] tracing: Read user data from futex system call trace event
From: Thomas Gleixner @ 2026-04-01 19:31 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Peter Zijlstra, Brian Geffon, John Stultz, Ian Rogers,
	Suleiman Souhlal
In-Reply-To: <20260331181349.062575155@kernel.org>

On Tue, Mar 31 2026 at 14:13, Steven Rostedt wrote:
> We are looking at the performance of futexes and require a bit more
> information when tracing them.
>
> The two patches here extend the system call reading of user space to

s/two/three/ :)

> create specific handling of the futex system call. It now reads the
> user space relevant data (the addr, utime and addr2), as well as
> parses the flags. This adds a little smarts to the trace event as
> it only shows the parameters that are relevant, as well as parses
> utime as either a timespec or as val2 depending on the futex_op.
>
> Here's an example of the new output:
>
>  sys_futex(uaddr: 0x56196292e830 (0), FUTEX_WAKE|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e834 (0x4a7) tid: 1191, FUTEX_UNLOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e834 (0) tid: 0, FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e830 (0), FUTEX_WAIT|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (4aa), val3: 0)
>  sys_futex(uaddr: 0x56196292e834 (0x4aa) tid: 1194, FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (800004aa), val3: 0)
>  sys_futex(uaddr: 0x7f7ed6b29990 (0x4ab), FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME)
>  sys_futex(uaddr: 0x56196292e834 (0x800004aa) tid: 1194 (WAITERS), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (800004aa), val3: 0)
>  sys_futex(uaddr: 0x56196292e834 (0x800004aa) tid: 1194 (WAITERS), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)

I understand what you are trying to achieve, but do we really need all
the complexity of decoding and pretty printing in the kernel?

Isn't it sufficient to store and expose the raw data and use post
processing to make it readable?

I've been doing complex futex analysis for two decades with a small set
of python scripts which translate raw text or binary trace data into
human readable information.

I agree that it's useful to have the actual timeout value and other data
which is missing today, but that still does not require all this
customized printing.

The initial idea of having at least some information about the data
entry (type, meaning etc.) in $event/format and use that for kernel text
output and for user space tools to analyze a binary trace has been
definitely the right way to go.

But that now deviates because $event/format cannot carry that
information you translate to in the kernel. It will still describe raw
event data, no?

So why not keeping the well known and working solution of identifying
the data in the format, print it raw and leave the post processing to
user space tools in case there is a need.

You actually make it harder to do development. Look at the patch series
related to robust futexes:

  https://lore.kernel.org/lkml/20260330114212.927686587@kernel.org/

So your decoding:

>  sys_futex(uaddr: 0x56196292e830 (0), FUTEX_WAKE|FUTEX_PRIVATE_FLAG)

fails to decode the new flag and the usage of uaddr2 unless I go and add
it in the first place _before_ working on the code. Right now it is just
printing op as a hex value and it just works when a new bit is added.

Stick 100 lines of python into tools/tracing and be done with it. I'm
happy to contribute to that.

Aside of that:

  Putting the decoder (futex_print_syscall) into the futex code itself
  is admittedly a smart move to offload the work of keeping that up to
  date to the people who are actually working on futexes.

TBH, I'm not interested to deal with that at all. If you want this
ftrace magic pretty printing, then stick it into kernel/trace or if
there is a real technical reason (hint there is none) into
kernel/futex/trace.c and take ownership of it. But please do not burden
others with your fancy toy of the day.

Thanks,

        tglx

^ permalink raw reply

* Re: [PATCH v3 0/3] tracing: Read user data from futex system call trace event
From: Peter Zijlstra @ 2026-04-01 20:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel,
	Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Brian Geffon, John Stultz, Ian Rogers, Suleiman Souhlal
In-Reply-To: <87zf3m7a0o.ffs@tglx>

On Wed, Apr 01, 2026 at 09:31:19PM +0200, Thomas Gleixner wrote:

> Stick 100 lines of python into tools/tracing and be done with it. I'm
> happy to contribute to that.
> 
> Aside of that:
> 
>   Putting the decoder (futex_print_syscall) into the futex code itself
>   is admittedly a smart move to offload the work of keeping that up to
>   date to the people who are actually working on futexes.
> 
> TBH, I'm not interested to deal with that at all. If you want this
> ftrace magic pretty printing, then stick it into kernel/trace or if
> there is a real technical reason (hint there is none) into
> kernel/futex/trace.c and take ownership of it. But please do not burden
> others with your fancy toy of the day.

Him putting it near futex is my fault. I really hate having this
description nonsense separate -- *IF* is has to exist at all.

That said, I was thinking of Sasha's syscall api endeavour, I feel that
if we do any of this, it should all come from the same one place.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox