Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* Re: [PATCH v15 5/5] ring-buffer: Show commit numbers in buffer_meta file
From: Steven Rostedt @ 2026-04-01 22:49 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177494619878.71933.15471023049227398684.stgit@mhiramat.tok.corp.google.com>

On Tue, 31 Mar 2026 17:36:38 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> In addition to the index number, show the commit numbers of
> each data page in per_cpu buffer_meta file.

                 in the per_cpu buffer_meta file.

-- Steve

> This is useful for understanding the current status of the
> persistent ring buffer. (Note that this file is shown
> only for persistent ring buffer and its backup instance)
> 

^ permalink raw reply

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Ackerley Tng @ 2026-04-01 22:46 UTC (permalink / raw)
  To: Michael Roth
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <cqkyz4zxosmev6lnbasa32ed5jjfljrx3gr6plyjfhrtbysuwr@rg5noy6y6ayu>

Michael Roth <michael.roth@amd.com> writes:

> On Thu, Mar 26, 2026 at 03:24:19PM -0700, Ackerley Tng wrote:
>> For shared to private conversions, if refcounts on any of the folios
>> within the range are elevated, fail the conversion with -EAGAIN.
>>
>> At the point of shared to private conversion, all folios in range are
>> also unmapped. The filemap_invalidate_lock() is held, so no faulting
>> can occur. Hence, from that point on, only transient refcounts can be
>> taken on the folios associated with that guest_memfd.
>>
>> Hence, it is safe to do the conversion from shared to private.
>>
>> After conversion is complete, refcounts may become elevated, but that
>> is fine since users of transient refcounts don't actually access
>> memory.
>>
>> For private to shared conversions, there are no refcount checks, since
>> the guest is the only user of private pages, and guest_memfd will be the
>> only holder of refcounts on private pages.
>
> I think KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES deserves some mention in
> the commit log.
>

Will update this in the next revision. Thanks!

>>
>>
>> [...snip...]
>>
>>
>> +Set attributes for a range of offsets within a guest_memfd to
>> +KVM_MEMORY_ATTRIBUTE_PRIVATE to limit the specified guest_memfd backed
>> +memory range for guest_use. Even if KVM_CAP_GUEST_MEMFD_MMAP is
>> +supported, after a successful call to set
>> +KVM_MEMORY_ATTRIBUTE_PRIVATE, the requested range will not be mappable
>> +into host userspace and will only be mappable by the guest.
>> +
>> +To allow the range to be mappable into host userspace again, call
>> +KVM_SET_MEMORY_ATTRIBUTES2 on the guest_memfd again with
>> +KVM_MEMORY_ATTRIBUTE_PRIVATE unset.
>> +
>> +If this ioctl returns -EAGAIN, the offset of the page with unexpected
>> +refcounts will be returned in `error_offset`. This can occur if there
>> +are transient refcounts on the pages, taken by other parts of the
>> +kernel.
>
> That's only true for the guest_memfd ioctl, for KVM ioctl these new
> fields and r/w behavior are basically ignored. So you might need to be
> clearer on which fields/behavior are specific to guest_memfd like in
> the preceeding paragraphs..
>

Yes, will update in the next revision, thanks!

> ..or maybe it's better to do the opposite and just have a blanket 'for
> now, all newly-described behavior pertains only to usage via a
> guest_memfd ioctl, and for KVM ioctls only the fields/behaviors common
> with KVM_SET_MEMORY_ATTRIBUTES are applicable.', since it doesn't seem
> like vm_memory_attributes=1 is long for this world and that's the only
> case where KVM memory attribute ioctls seem relevant.
>
> But then it makes me wonder, if we adopt the semantics I mentioned
> earlier and have KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES advertise both
> the gmem ioctl support as well as the struct kvm_memory_attributes2
> support, if we should even advertise KVM_CAP_MEMORY_ATTRIBUTES2 at all
> as part of this series.
>

Read your other email as well, thanks for reviewing!

It makes sense, hope this captures what you suggested. In v5,

If vm_memory_attributes == 1:
    (KVM_CAP_MEMORY_ATTRIBUTES2 will be removed (will return 0))

If vm_memory_attributes == 0 aka attributes are tracked by guest_memfd:
    KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES2 will return valid attributes
    (KVM_CAP_MEMORY_ATTRIBUTES2 will be removed (will return 0))

So yup, KVM_CAP_MEMORY_ATTRIBUTES2 will not even be #defined at all.

>> +
>> +Userspace is expected to figure out how to remove all known refcounts
>> +on the shared pages, such as refcounts taken by get_user_pages(), and
>> +try the ioctl again. A possible source of these long term refcounts is
>> +if the guest_memfd memory was pinned in IOMMU page tables.
>
> One might read this to mean error_offset is used purely for the EAGAIN
> case, so it might be worth touching on the other cases as well.
>

Will update this in the next revision.

> -Mike
>
>>
>> [...snip...]
>>

^ permalink raw reply

* Re: [PATCH bpf v3 1/2] bpf: Reject sleepable kprobe_multi programs at attach time
From: Kumar Kartikeya Dwivedi @ 2026-04-01 22:45 UTC (permalink / raw)
  To: Varun R Mallya, Jiri Olsa
  Cc: bpf, ast, daniel, yonghong.song, rostedt, mhiramat, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260401191126.440683-1-varunrmallya@gmail.com>

On Wed, 1 Apr 2026 at 21:11, Varun R Mallya <varunrmallya@gmail.com> wrote:
>
> kprobe.multi programs run in atomic/RCU context and cannot sleep.
> However, bpf_kprobe_multi_link_attach() did not validate whether the
> program being attached had the sleepable flag set, allowing sleepable
> helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
> context.
>
> This causes a "sleeping function called from invalid context" splat:
>
>   BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
>   in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
>   preempt_count: 1, expected: 0
>   RCU nest depth: 2, expected: 0
>
> Fix this by rejecting sleepable programs early in
> bpf_kprobe_multi_link_attach(), before any further processing.
>
> Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
> Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
> ---

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Jiri, would be great if you can ack both patches too.

^ permalink raw reply

* Re: [PATCH v15 4/5] ring-buffer: Add persistent ring buffer invalid-page inject test
From: Steven Rostedt @ 2026-04-01 22:40 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177494619065.71933.9842685686800241005.stgit@mhiramat.tok.corp.google.com>


I replied with mostly grammar fixes and some rewrites of text.

On Tue, 31 Mar 2026 17:36:30 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> Add a self-destractive test for the persistent ring buffer.

  "self-destractive"? Do you mean "self-destructive"?

Probably better to call it a "self-corrupting test".

> 
> This will inject erroneous value to some sub-buffer pages (where

            inject an erroneous

> the index is even or multiples of 5) in the persistent ring buffer
> when kernel gets panic, and check whether the number of detected

  when the kernel panics, and checks whether

> invalid pages and the total entry_bytes are the same as recorded

                                                  same as the recorded

> values after reboot.
> 
> This can ensure the kernel correctly recover partially corrupted

  This ensures that the kernel can correctly recover a partially corrupted

> persistent ring buffer when boot.

                         after a reboot or panic.

> 
> The test only runs on the persistent ring buffer whose name is
> "ptracingtest". And user has to fill it up with events before

                  The user has to fill it with events before a kernel panic.

> kernel panics.
> 
> To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_INJECT
> and you have to setup the kernel cmdline;

  and add the following kernel cmdline:

Note, it's more proper to use a colon (:) than a semi-colon (;) when
referencing an example on the next line.

> 
>  reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace
>  panic=1
> 
> And run following commands after the 1st boot;

  Run the following commands after the 1st boot:

> 
>  cd /sys/kernel/tracing/instances/ptracingtest
>  echo 1 > tracing_on
>  echo 1 > events/enable
>  sleep 3
>  echo c > /proc/sysrq-trigger
> 
> After panic message, the kernel will reboot and run the verification
> on the persistent ring buffer, e.g.
> 
>  Ring buffer meta [2] invalid buffer page detected
>  Ring buffer meta [2] is from previous boot! (318 pages discarded)
>  Ring buffer testing [2] invalid pages: PASSED (318/318)
>  Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)
> 
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
>  Changes in v15:
>   - Use pr_warn() for test result.
>   - Inject errors on the page index is multiples of 5 so that
>     this can reproduce contiguous empty pages.
>  Changes in v14:
>   - Rename config to CONFIG_RING_BUFFER_PERSISTENT_INJECT.
>   - Clear meta->nr_invalid/entry_bytes after testing.
>   - Add test commands in config comment.
>  Changes in v10:
>   - Add entry_bytes test.
>   - Do not compile test code if CONFIG_RING_BUFFER_PERSISTENT_SELFTEST=n.
>  Changes in v9:
>   - Test also reader pages.
> ---
>  include/linux/ring_buffer.h |    1 +
>  kernel/trace/Kconfig        |   31 ++++++++++++++++++
>  kernel/trace/ring_buffer.c  |   74 +++++++++++++++++++++++++++++++++++++++++++
>  kernel/trace/trace.c        |    4 ++
>  4 files changed, 110 insertions(+)
> 
> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> index 994f52b34344..0670742b2d60 100644
> --- a/include/linux/ring_buffer.h
> +++ b/include/linux/ring_buffer.h
> @@ -238,6 +238,7 @@ int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
>  
>  enum ring_buffer_flags {
>  	RB_FL_OVERWRITE		= 1 << 0,
> +	RB_FL_TESTING		= 1 << 1,
>  };
>  
>  #ifdef CONFIG_RING_BUFFER
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index e130da35808f..07305ed6d745 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -1202,6 +1202,37 @@ config RING_BUFFER_VALIDATE_TIME_DELTAS
>  	  Only say Y if you understand what this does, and you
>  	  still want it enabled. Otherwise say N
>  
> +config RING_BUFFER_PERSISTENT_INJECT
> +	bool "Enable persistent ring buffer error injection test"
> +	depends on RING_BUFFER
> +	help
> +	  Run a selftest on the persistent ring buffer which names
> +	  "ptracingtest" (and its backup) when panic_on_reboot by

  Does this do anything with the backup?

> +	  invalidating ring buffer pages.

 	  This option will have the kernel check if the persistent ring
 	  buffer is named "ptracingtest", and if so, it will corrupt some
 	  of its pages on a kernel panic. This is used to test if the
 	  persistent ring buffer can recover from some of its sub-buffers
 	  being corrupted.

[space]

> +	  To use this, boot kernel with "ptracingtest" persistent

                     , boot a kernel with a "ptracingtest" persistent

> +	  ring buffer, e.g.
> +
> +	   reserve_mem=20M:2M:trace trace_instance=ptracingtest@trace panic=1
> +
> +	  And after the 1st boot, run test command, like;

                               , run the following commands:

> +
> +	   cd /sys/kernel/tracing/instances/ptracingtest
> +	   echo 1 > events/enable
> +	   echo 1 > tracing_on
> +	   sleep 3
> +	   echo c > /proc/sysrq-trigger
> +
> +	  After panic message, the kernel reboots and show test results
> +	  on the boot log.

          After the panic message, the kernel will reboot and will show the
          test results in the console output.

> +
> +	  Note that user has to enable events on the persistent ring
> +	  buffer manually to fill up ring buffers before rebooting.

	  Note that events for the ring buffer needs to be enabled prior to
	  crashing the kernel so that the ring buffer has content that the
	  test will corrupt.

> +	  Since this invalidates the data on test target ring buffer,
> +	  "ptracingtest" persistent ring buffer must not be used for
> +	  actual tracing, but only for testing.

	  As the test will corrupt events in the "ptracingtest" persistent
	  ring buffer, it should not be used for any other purpose other
	  than his test.


> +
> +	  If unsure, say N
> +
>  config MMIOTRACE_TEST
>  	tristate "Test module for mmiotrace"
>  	depends on MMIOTRACE && m
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 5ff632ca3858..fb098b0b4505 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -64,6 +64,10 @@ struct ring_buffer_cpu_meta {
>  	unsigned long	commit_buffer;
>  	__u32		subbuf_size;
>  	__u32		nr_subbufs;
> +#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
> +	__u32		nr_invalid;
> +	__u32		entry_bytes;
> +#endif
>  	int		buffers[];
>  };
>  
> @@ -2079,6 +2083,21 @@ static void rb_meta_validate_events(struct
> 	  ring_buffer_per_cpu *cpu_buffer) if (discarded)
>  		pr_cont(" (%d pages discarded)", discarded);
>  	pr_cont("\n");
> +
> +#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
> +	if (meta->nr_invalid)
> +		pr_warn("Ring buffer testing [%d] invalid pages: %s (%d/%d)\n",
> +			cpu_buffer->cpu,
> +			(discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
> +			discarded, meta->nr_invalid);
> +	if (meta->entry_bytes)
> +		pr_warn("Ring buffer testing [%d] entry_bytes: %s (%ld/%ld)\n",
> +			cpu_buffer->cpu,
> +			(entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
> +			(long)entry_bytes, (long)meta->entry_bytes);
> +	meta->nr_invalid = 0;
> +	meta->entry_bytes = 0;
> +#endif
>  	return;
>  
>   invalid:
> @@ -2559,12 +2578,67 @@ static void rb_free_cpu_buffer(struct
> 	  ring_buffer_per_cpu *cpu_buffer) kfree(cpu_buffer);
>  }
>  
> +#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
> +static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
> +{
> +	struct ring_buffer_per_cpu *cpu_buffer;
> +	struct ring_buffer_cpu_meta *meta;
> +	struct buffer_data_page *dpage;
> +	u32 entry_bytes = 0;
> +	unsigned long ptr;
> +	int subbuf_size;
> +	int invalid = 0;
> +	int cpu;
> +	int i;
> +
> +	if (!(buffer->flags & RB_FL_TESTING))
> +		return;
> +
> +	guard(preempt)();
> +	cpu = smp_processor_id();
> +
> +	cpu_buffer = buffer->buffers[cpu];
> +	meta = cpu_buffer->ring_meta;
> +	ptr = (unsigned long)rb_subbufs_from_meta(meta);
> +	subbuf_size = meta->subbuf_size;
> +
> +	for (i = 0; i < meta->nr_subbufs; i++) {
> +		int idx = meta->buffers[i];
> +
> +		dpage = (void *)(ptr + idx * subbuf_size);
> +		/* Skip unused pages */
> +		if (!local_read(&dpage->commit))
> +			continue;
> +
> +		/*
> +		 * Invalidate even pages or multiples of 5. This will lead 3

							    This will cause 3

-- Steve

> +		 * contiguous invalidated(empty) pages.
> +		 */
> +		if (!(i & 0x1) || !(i % 5)) {
> +			local_add(subbuf_size + 1, &dpage->commit);
> +			invalid++;
> +		} else {
> +			/* Count total commit bytes. */
> +			entry_bytes += local_read(&dpage->commit);
> +		}
> +	}
> +
> +	pr_info("Inject invalidated %d pages on CPU%d, total size: %ld\n",
> +		invalid, cpu, (long)entry_bytes);
> +	meta->nr_invalid = invalid;
> +	meta->entry_bytes = entry_bytes;
> +}

^ permalink raw reply

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Ackerley Tng @ 2026-04-01 22:38 UTC (permalink / raw)
  To: Michael Roth
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <2r4mmfiuisw26qymahnbh2oxqkkrywqev477kc4rlkcyx7tels@c7ple7kdgpo3>

Michael Roth <michael.roth@amd.com> writes:

>
> [...snip...]
>
>>  static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
>>  {
>> @@ -2635,6 +2625,8 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>>  		return -EINVAL;
>>  	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
>>  		return -EINVAL;
>> +	if (attrs->error_offset)
>> +		return -EINVAL;
>>  	for (i = 0; i < ARRAY_SIZE(attrs->reserved); i++) {
>>  		if (attrs->reserved[i])
>>  			return -EINVAL;
>> @@ -4983,6 +4975,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>>  		return 1;
>>  	case KVM_CAP_GUEST_MEMFD_FLAGS:
>>  		return kvm_gmem_get_supported_flags(kvm);
>> +	case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
>> +		if (vm_memory_attributes)
>> +			return 0;
>> +
>> +		return kvm_supported_mem_attributes(kvm);
>
> Based on the discussion from the PUCK call this morning,

Thanks for copying the discussion here, I'll start attending PUCK to
catch those discussions too :)

> it sounds like it
> would be a good idea to limit kvm_supported_mem_attributes() to only
> reporting KVM_MEMORY_ATTRIBUTE_PRIVATE if the underlying CoCo
> implementation has all the necessary enablement to support in-place
> conversion via guest_memfd. In the case of SNP, there is a
> documentation/parameter check in snp_launch_update() that needs to be
> relaxed in order for userspace to be able to pass in a NULL 'src'
> parameter (since, for in-place conversion, it would be initialized in place
> as shared memory prior to the call, since by the time kvm_gmem_poulate()
> it will have been set to private and therefore cannot be faulted in via
> GUP (and if it could, we'd be unecessarily copying the src back on top
> of itself since src/dst are the same).

Could this be a separate thing? If I'm understanding you correctly, it's
not strictly a requirement for snp_launch_update() to first support a
NULL 'src' parameter before this series lands.

Without this series, the startup procedure is to have memory set up in
non-guest_memfd shared memory, and then snp_launch_update()-ed into
guest_memfd private memory.

With this series, it is a little troublesome, but the startup procedure
can still set up memory in guest_memfd shared memory, then copy
everything out to some temporary memory, then set guest_memfd memory to
private, then snp_launch_update() the temporary memory into guest_memfd
private memory.

We would be unnecessarily copying the src (now in some temporary memory)
back onto itself. Can that be a separate patch series?

Btw, if snp_launch_update() is going to accept a NULL src parameter and
launch-update the src in-place:

+ Will userspace have to set that memory to private before calling launch
  update?
    + If yes, then would we need some other mode of conversion that is
      not ZERO and not quite PRESERVE (since PRESERVE is defined as that
      the guest will see what the host wrote post-encryption, but it
      sounds like launch update is doing the encryption)
+ Or should launch update be called when that memory is shared? Will
  launch update then also set that memory to private in guest_memfd?

>
> So maybe there should be an arch hook to check a whitelist of VM types
> that support KVM_MEMORY_ATTRIBUTE_PRIVATE when vm_memory_attributes=0,
> and if we decide to enable it for SNP as part of this series you could
> include the 1-2 patches needed there, or I could enable the SNP support
> separately as a small series and I guess that would then become a prereq
> for the SNP self-tests?
>
> Not sure if additional enablement is needed for TDX or not before
> KVM_MEMORY_ATTRIBUTE_PRIVATE would be advertised, but similar
> considerations there.
>
> -Mike
>
>>  #endif
>>  	default:
>>  		break;
>>
>> --
>> 2.53.0.1018.g2bb0e51243-goog
>>

^ permalink raw reply

* Re: [PATCH RFC v4 07/44] KVM: guest_memfd: Only prepare folios for private pages
From: Ackerley Tng @ 2026-04-01 21:43 UTC (permalink / raw)
  To: Michael Roth
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <s4dbqrv2c6yzt4nsflfarnggtl25xlz6mzg74tfeg3eskceno6@6l5hpfmcbju3>

Michael Roth <michael.roth@amd.com> writes:

> On Wed, Apr 01, 2026 at 07:05:16AM -0700, Ackerley Tng wrote:
>> Ackerley Tng <ackerleytng@google.com> writes:
>>
>> > All-shared guest_memfd used to be only supported for non-CoCo VMs where
>> > preparation doesn't apply. INIT_SHARED is about to be supported for
>> > non-CoCo VMs in a later patch in this series.
>> >
>> > In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
>> > guest_memfd in a later patch in this series.
>> >
>> > This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
>> > shared folio for a CoCo VM where preparation applies.
>> >
>> > Add a check to make sure that preparation is only performed for private
>> > folios.
>> >
>> > Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
>> > conversion to shared.
>> >
>> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> > ---
>> >  virt/kvm/guest_memfd.c | 9 ++++++---
>> >  1 file changed, 6 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> > index b6ffa8734175d..d414ebfcb4c19 100644
>> > --- a/virt/kvm/guest_memfd.c
>> > +++ b/virt/kvm/guest_memfd.c
>> > @@ -900,6 +900,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>> >  		     int *max_order)
>> >  {
>> >  	pgoff_t index = kvm_gmem_get_index(slot, gfn);
>> > +	struct inode *inode;
>> >  	struct folio *folio;
>> >  	int r = 0;
>> >
>> > @@ -907,7 +908,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>> >  	if (!file)
>> >  		return -EFAULT;
>> >
>> > -	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>> > +	inode = file_inode(file);
>> > +	filemap_invalidate_lock_shared(inode->i_mapping);
>> >
>> >  	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
>> >  	if (IS_ERR(folio)) {
>> > @@ -920,7 +922,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>> >  		folio_mark_uptodate(folio);
>> >  	}
>> >
>> > -	r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>> > +	if (kvm_gmem_is_private_mem(inode, index))
>> > +		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>>
>> Michael, I might have misunderstood you at the last guest_memfd call:
>> sev_gmem_prepare() doesn't prepare a page for being a shared page,
>> right? Does this work? That prepare is only called to "make private"?
>
> Hmm, I guess your guest_memfd-inplace-conversion-v4 branch is out of sync with
> these patches?
>

My bad, it was. I just force-pushed to github to synchronize them with
this patch series.

> I have the below local patch based on top of that for SNP-specific enablement,
> which is basically identically, so suffice to say: yes, this should work
> for SNP :) If any architecture pops up that needs to do some prep in
> advance of mapping shared pages, then we could potentially plumb the
> shared/private flag through to the arch-specific prep hook, as was also
> suggested on the call, but it doesn't seem like that's needed by any
> users for now.
>

Thanks for checking :)

> -Mike
>
>   Author: Michael Roth <michael.roth@amd.com>
>   Date:   Mon Oct 27 07:58:32 2025 -0500
>
>       KVM: guest_memfd: Don't prepare shared folios
>
>       In the current guest_memfd logic, "preparation" is only used currently
>       to describe the additional work of putting a guest_memfd page into an
>       architecturally-defined "private" state, such as updating RMP table
>       entries for SEV-SNP guests. As such, there's no input to the
>       corresponding kvm_arch_gmem_prepare() hooks as to whether a page is
>       being prepared/accessed as shared or as private, so "preparation" will
>       end up being erroneously done on pages that were supposed to remain in a
>       shared state. Rather than plumb through the additional information
>       needed to distinguish between shared vs. private preparation, just
>       continue to only do preparation on private pages, as was the case prior
>       to support for GUEST_MEMFD_FLAG_MMAP being introduced.
>
>       Signed-off-by: Michael Roth <michael.roth@amd.com>
>
>   diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>   index 3acc6d983449..4869e59e4fc5 100644
>   --- a/virt/kvm/guest_memfd.c
>   +++ b/virt/kvm/guest_memfd.c
>   @@ -1249,7 +1249,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>                   folio_mark_uptodate(folio);
>           }
>
>   -       r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>   +       if (!kvm_gmem_is_shared_mem(file_inode(file), index))
>   +               r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>
>           folio_unlock(folio);
>
>>
>> >
>> >  	folio_unlock(folio);
>> >
>> > @@ -930,7 +933,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>> >  		folio_put(folio);
>> >
>> >  out:
>> > -	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>> > +	filemap_invalidate_unlock_shared(inode->i_mapping);
>> >  	return r;
>> >  }
>> >  EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>> >
>> > --
>> > 2.53.0.1018.g2bb0e51243-goog

^ permalink raw reply

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Sean Christopherson @ 2026-04-01 21:12 UTC (permalink / raw)
  To: Michael Roth
  Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jroedel, jthoughton, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <2r4mmfiuisw26qymahnbh2oxqkkrywqev477kc4rlkcyx7tels@c7ple7kdgpo3>

On Wed, Apr 01, 2026, Michael Roth wrote:
> On Thu, Mar 26, 2026 at 03:24:19PM -0700, Ackerley Tng wrote:
> >  #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> >  static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> >  {
> > @@ -2635,6 +2625,8 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> >  		return -EINVAL;
> >  	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> >  		return -EINVAL;
> > +	if (attrs->error_offset)
> > +		return -EINVAL;
> >  	for (i = 0; i < ARRAY_SIZE(attrs->reserved); i++) {
> >  		if (attrs->reserved[i])
> >  			return -EINVAL;
> > @@ -4983,6 +4975,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> >  		return 1;
> >  	case KVM_CAP_GUEST_MEMFD_FLAGS:
> >  		return kvm_gmem_get_supported_flags(kvm);
> > +	case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
> > +		if (vm_memory_attributes)
> > +			return 0;
> > +
> > +		return kvm_supported_mem_attributes(kvm);
> 
> Based on the discussion from the PUCK call this morning, it sounds like it
> would be a good idea to limit kvm_supported_mem_attributes() to only
> reporting KVM_MEMORY_ATTRIBUTE_PRIVATE if the underlying CoCo
> implementation has all the necessary enablement to support in-place
> conversion via guest_memfd. In the case of SNP, there is a
> documentation/parameter check in snp_launch_update() that needs to be
> relaxed in order for userspace to be able to pass in a NULL 'src'
> parameter (since, for in-place conversion, it would be initialized in place
> as shared memory prior to the call, since by the time kvm_gmem_poulate()
> it will have been set to private and therefore cannot be faulted in via
> GUP (and if it could, we'd be unecessarily copying the src back on top
> of itself since src/dst are the same).
> 
> So maybe there should be an arch hook to check a whitelist of VM types
> that support KVM_MEMORY_ATTRIBUTE_PRIVATE when vm_memory_attributes=0,
> and if we decide to enable it for SNP as part of this series you could
> include the 1-2 patches needed there, or I could enable the SNP support
> separately as a small series and I guess that would then become a prereq
> for the SNP self-tests?

If it's trivial-ish, my preference would be to include SNP as part of this series,
_before_ KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES is exposed to userspace.  I think
that would avoid the need for pivoting on the VM type?  I.e. don't advertise
support until all VM types play nice.

> Not sure if additional enablement is needed for TDX or not before
> KVM_MEMORY_ATTRIBUTE_PRIVATE would be advertised, but similar
> considerations there.

^ permalink raw reply

* Re: [PATCH RFC v4 25/44] KVM: selftests: Test basic single-page conversion flow
From: Sean Christopherson @ 2026-04-01 21:08 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <CAEvNRgE6Tn81Yddgbjqs-gs491NzpppjbDHKzpmdPCxgSPeUPQ@mail.gmail.com>

On Tue, Mar 31, 2026, Ackerley Tng wrote:
> Ackerley Tng <ackerleytng@google.com> writes:

Please trim your replies (even more, since you did trim a little).

> > +static void run_guest_do_rmw(struct kvm_vcpu *vcpu, loff_t pgoff,
> > +			     char expected_val, char write_val)
> > +{
> > +	struct ucall uc;
> > +	int r;
> > +
> > +	guest_data.mem = (void *)GUEST_MEMFD_SHARING_TEST_GVA + pgoff * page_size;
> > +	guest_data.expected_val = expected_val;
> > +	guest_data.write_val = write_val;
> > +	sync_global_to_guest(vcpu->vm, guest_data);
> > +
> > +	do {
> > +		r = __vcpu_run(vcpu);
> > +	} while (r == -1 && errno == EINTR);
> > +
> > +	TEST_ASSERT_EQ(r, 0);
> 
> TEST_ASSERT_EQ() ends up calling exit() on failures, which skips
> FIXTURE_TEARDOWN().
> 
> Other than the explicit assertions not working with the
> kselftest_harness, kvm selftest library functions like vm_mem_add() also
> call TEST_ASSERT, which doesn't play nice with kselftest_harness.
> 
> Any suggestions for this? Should we use the kselftest framework with
> these tests?
> 
> (I ran into this issue while trying to test something else, where I
> needed FIXTURE_TEARDOWN() to clean up system state.)
> 
> Or is it "okay" in this case since FIXTURE_TEARDOWN() only cleans up
> stuff that would happen if the program exits anyway?

Can you see if any of the ideas in https://lore.kernel.org/all/ZjUwqEXPA5QVItyX@google.com
would help?  Converting more tests to TAP+FIXTURE is still on my wish list, I've
just never been able to carve out cycles to see it through.

^ permalink raw reply

* Re: [PATCH RFC v4 08/44] KVM: Introduce KVM_SET_MEMORY_ATTRIBUTES2
From: Sean Christopherson @ 2026-04-01 21:04 UTC (permalink / raw)
  To: Michael Roth
  Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jroedel, jthoughton, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <cxmtst7txfodp6lo4ipue3trohx2ge7nkagkfzfixfdnlf5qlo@e3jw3tmbe272>

On Tue, Mar 31, 2026, Michael Roth wrote:
> On Thu, Mar 26, 2026 at 03:24:17PM -0700, Ackerley Tng wrote:
> > Introduce a "version 2" of KVM_SET_MEMORY_ATTRIBUTES to support returning
> > information back to userspace.
> 
> Hi Ackerley,
> 
> Not trying to bikeshed below, but I'm working on getting related QEMU
> patches cleaned up to post soon and was working through some of the new
> uAPI bits, and plumbing some of these capabilities in seems a little
> awkward in a couple places so wondering if we should revisit how some of
> this API is defined...
> 
> > 
> > This new ioctl and structure will, in a later patch, be shared as a
> > guest_memfd ioctl, where the padding in the new kvm_memory_attributes2
> > structure will be for writing the response from the guest_memfd ioctl to
> > userspace.
> > 
> > A new ioctl is necessary for these reasons:
> > 
> > 1. KVM_SET_MEMORY_ATTRIBUTES is currently a write-only ioctl and does not
> >    allow userspace to read fields. There's nothing in code (yet?) that
> >    validates this, but using _IOWR for consistency would be prudent.
> > 
> > 2. KVM_SET_MEMORY_ATTRIBUTES, when used as a guest_memfd ioctl, will need
> >    an additional field to provide userspace with more error details.
> > 
> > Alternatively, a completely new ioctl could be defined, unrelated to
> > KVM_SET_MEMORY_ATTRIBUTES, but using the same ioctl number and struct for
> > the vm and guest_memfd ioctls streamlines the interface for userspace. In
> > addition, any memory attributes, implemented on the vm or guest_memfd
> > ioctl, can be easily shared with the other.
> > 
> > Add KVM_CAP_MEMORY_ATTRIBUTES2 to indicate that struct
> > kvm_memory_attributes2 exists and can be used either with
> > KVM_SET_MEMORY_ATTRIBUTES2 via the vm or guest_memfd ioctl.
> 
> The guest_memfd support for the KVM_SET_MEMORY_ATTRIBUTES2 ioctl isn't
> added until patch #10, and to scan for it you sort of need to infer it
> via KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES reporting non-zero (i.e.
> KVM_MEMORY_ATTRIBUTE_PRIVATE), so it's confusing to state that
> KVM_CAP_MEMORY_ATTRIBUTES2 means you can use the struct via a guest_memfd
> ioctl.
> 
> I think the above is trying to simply say that the corresponding struct
> exists, and remain agnostic about how it can be used. But if that were
> the case, there would be no way to know when KVM_SET_MEMORY_ATTRIBUTES2 is
> available in the first place, so in the case of KVM ioctls at least,
> KVM_CAP_MEMORY_ATTRIBUTES2 is advertising both the struct and the ioctl,
> whereas for guest_memfd it's only advertising the struct and not saying
> anything about whether a similar gmem ioctl is available to use it.

+1 to everything Mike said.

> Instead, maybe they should both have the same semantics:
> 
>   KVM_CAP_MEMORY_ATTRIBUTES2: *SET_ATTRIBUTES* ioctl exists for KVM that utilizes
>     struct kvm_memory_attributes2
> 
>   KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES: *SET_ATTRIBUTES* ioctl exists for
>     guest_memfd that utilizes struct kvm_memory_attributes2
> 
> In which case you would leave out any mention of guest_memfd here as far as
> the documentation does, and then in patch #10 you could modify it to be
> something like:
> 
>    4.145 KVM_SET_MEMORY_ATTRIBUTES2
>    ---------------------------------
> 
>   -:Capability: KVM_CAP_MEMORY_ATTRIBUTES2
>   +:Capability: KVM_CAP_MEMORY_ATTRIBUTES2, KVM_GUEST_MEMFD_CAP_MEMORY_ATTRIBUTES
>   -:Architectures: x86
>   +:Architectures: all
>   -:Type: vm ioctl
>   +:Type: vm, guest_memfd ioctl
>    :Parameters: struct kvm_memory_attributes2 (in/out)
>    :Returns: 0 on success, <0 on error

As discussed at PUCK, I think we should omit KVM_CAP_MEMORY_ATTRIBUTES2 and
vm-scoped support entirely until it's needed (which may be never).

> and *then* add in your mentions of how the usage/fields differ for
> guest_memfd/KVM_GUEST_MEMFD_CAP_MEMORY_ATTRIBUTES case vs. KVM ioctls.
> 
> This avoids needing to issue 2 checks for the guest_memfd variant vs. 1
> for KVM, but more importantly avoids subtle differences in how these
> similarly-named capabilities are used/documented that might cause
> unecessary confusion.

^ permalink raw reply

* Re: [PATCH v3 0/3] tracing: Read user data from futex system call trace event
From: Peter Zijlstra @ 2026-04-01 20:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, linux-kernel, linux-trace-kernel,
	Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Brian Geffon, John Stultz, Ian Rogers, Suleiman Souhlal
In-Reply-To: <20260401161919.147355bf@gandalf.local.home>

On Wed, Apr 01, 2026 at 04:19:19PM -0400, Steven Rostedt wrote:

> Well, it would be updated for trace-cmd not tools/tracing.

So I tried using that a while ago:

  https://lkml.kernel.org/r/20260116181524.GF831285@noisy.programming.kicks-ass.net

and I found it really rough.

^ permalink raw reply

* Re: [PATCH v3 0/3] tracing: Read user data from futex system call trace event
From: Steven Rostedt @ 2026-04-01 20:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Brian Geffon,
	John Stultz, Ian Rogers, Suleiman Souhlal
In-Reply-To: <87zf3m7a0o.ffs@tglx>

On Wed, 01 Apr 2026 21:31:19 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Tue, Mar 31 2026 at 14:13, Steven Rostedt wrote:
> > We are looking at the performance of futexes and require a bit more
> > information when tracing them.
> >
> > The two patches here extend the system call reading of user space to  
> 
> s/two/three/ :)

Ah v1 had only two patches and this was cut and pasted from there.

> 
> I understand what you are trying to achieve, but do we really need all
> the complexity of decoding and pretty printing in the kernel?

You could say the same for most tracepoints. ;-)

> 
> Isn't it sufficient to store and expose the raw data and use post
> processing to make it readable?

Yes this is possible, and will also work too, as libtraceevent will be
updated to parse the raw data.

> 
> I've been doing complex futex analysis for two decades with a small set
> of python scripts which translate raw text or binary trace data into
> human readable information.
> 
> I agree that it's useful to have the actual timeout value and other data
> which is missing today, but that still does not require all this
> customized printing.
> 
> The initial idea of having at least some information about the data
> entry (type, meaning etc.) in $event/format and use that for kernel text
> output and for user space tools to analyze a binary trace has been
> definitely the right way to go.
> 
> But that now deviates because $event/format cannot carry that
> information you translate to in the kernel. It will still describe raw
> event data, no?

It still shows a bit:

name: sys_enter_futex
ID: 592
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:int __syscall_nr;	offset:8;	size:4;	signed:1;
	field:u32 * uaddr;	offset:16;	size:8;	signed:0;
	field:int op;	offset:24;	size:8;	signed:0;
	field:u32 val;	offset:32;	size:8;	signed:0;
	field:const struct __kernel_timespec * utime;	offset:40;	size:8;	signed:0;
	field:u32 * uaddr2;	offset:48;	size:8;	signed:0;
	field:u32 val3;	offset:56;	size:8;	signed:0;
	field:u32 __value;	offset:64;	size:4;	signed:0;
	field:u32 __value2;	offset:68;	size:4;	signed:0;
	field:unsigned long __ts1;	offset:72;	size:8;	signed:0;
	field:unsigned long __ts2;	offset:80;	size:8;	signed:0;

print fmt: "uaddr: 0x%lx (0x%lx) cmd=%s%s%s  val: 0x%x timeout/val2: 0x%llx (%lu.%lu) uaddr2: 0x%lx (0x%lx) val3: 0x%x",  REC->uaddr, REC->__value,  __print_symbolic(REC->op & 0xfffffe7f, {0, "FUTEX_WAIT"} ,{1, "FUTEX_WAKE"} ,{2, "FUTEX_FD"} ,{3, "FUTEX_REQUEUE"} ,{4, "FUTEX_CMP_REQUEUE"} ,{5, "FUTEX_WAKE_OP"} ,{6, "FUTEX_LOCK_PI"} ,{7, "FUTEX_UNLOCK_PI"} ,{8, "FUTEX_TRYLOCK_PI"} ,{9, "FUTEX_WAIT_BITSET"} ,{10, "FUTEX_WAKE_BITSET"} ,{11, "FUTEX_WAIT_REQUEUE_PI"} ,{12, "FUTEX_CMP_REQUEUE_PI"} ,{13, "FUTEX_LOCK_PI2"} ),  (REC->op & 128) ? "|FUTEX_PRIVATE_FLAG" : "", (REC->op & 256) ? "|FUTEX_CLOCK_REALTIME" : "", REC->val, REC->utime, REC->__ts1, REC->__ts2, REC->uaddr, REC->__value2, REC->val3

> 
> So why not keeping the well known and working solution of identifying
> the data in the format, print it raw and leave the post processing to
> user space tools in case there is a need.
> 
> You actually make it harder to do development. Look at the patch series
> related to robust futexes:
> 
>   https://lore.kernel.org/lkml/20260330114212.927686587@kernel.org/
> 
> So your decoding:
> 
> >  sys_futex(uaddr: 0x56196292e830 (0), FUTEX_WAKE|FUTEX_PRIVATE_FLAG)  
> 
> fails to decode the new flag and the usage of uaddr2 unless I go and add
> it in the first place _before_ working on the code. Right now it is just
> printing op as a hex value and it just works when a new bit is added.
> 
> Stick 100 lines of python into tools/tracing and be done with it. I'm
> happy to contribute to that.

Well, it would be updated for trace-cmd not tools/tracing.

> 
> Aside of that:
> 
>   Putting the decoder (futex_print_syscall) into the futex code itself
>   is admittedly a smart move to offload the work of keeping that up to
>   date to the people who are actually working on futexes.
> 
> TBH, I'm not interested to deal with that at all. If you want this
> ftrace magic pretty printing, then stick it into kernel/trace or if
> there is a real technical reason (hint there is none) into
> kernel/futex/trace.c and take ownership of it. But please do not burden
> others with your fancy toy of the day.

v1 kept it all within the tracing subsystem, but Peter suggested that it be
closer to the syscall:

   https://lore.kernel.org/all/20260304090748.GO606826@noisy.programming.kicks-ass.net/

I'm happy to put it back and maintain it separately.

Or I can just keep the simple bits (the reading of user space), and not do
all the more fancy formatting. Basically dropping patch 2 and 3.

I've been using trace-cmd start / show for testing. But I could also move
the logic to libtraceevent, which would require using trace-cmd record
instead.

How much are you against the full series? Are you OK with it if it stays
within the tracing subsystem? Or would you prefer just keeping with patch 1
and dropping the other patches and doing that work in libtraceevent?

-- Steve

^ permalink raw reply

* Re: [PATCH v3 0/3] tracing: Read user data from futex system call trace event
From: Peter Zijlstra @ 2026-04-01 20:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel,
	Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Brian Geffon, John Stultz, Ian Rogers, Suleiman Souhlal
In-Reply-To: <87zf3m7a0o.ffs@tglx>

On Wed, Apr 01, 2026 at 09:31:19PM +0200, Thomas Gleixner wrote:

> Stick 100 lines of python into tools/tracing and be done with it. I'm
> happy to contribute to that.
> 
> Aside of that:
> 
>   Putting the decoder (futex_print_syscall) into the futex code itself
>   is admittedly a smart move to offload the work of keeping that up to
>   date to the people who are actually working on futexes.
> 
> TBH, I'm not interested to deal with that at all. If you want this
> ftrace magic pretty printing, then stick it into kernel/trace or if
> there is a real technical reason (hint there is none) into
> kernel/futex/trace.c and take ownership of it. But please do not burden
> others with your fancy toy of the day.

Him putting it near futex is my fault. I really hate having this
description nonsense separate -- *IF* is has to exist at all.

That said, I was thinking of Sasha's syscall api endeavour, I feel that
if we do any of this, it should all come from the same one place.

^ permalink raw reply

* Re: [PATCH v3 0/3] tracing: Read user data from futex system call trace event
From: Thomas Gleixner @ 2026-04-01 19:31 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Peter Zijlstra, Brian Geffon, John Stultz, Ian Rogers,
	Suleiman Souhlal
In-Reply-To: <20260331181349.062575155@kernel.org>

On Tue, Mar 31 2026 at 14:13, Steven Rostedt wrote:
> We are looking at the performance of futexes and require a bit more
> information when tracing them.
>
> The two patches here extend the system call reading of user space to

s/two/three/ :)

> create specific handling of the futex system call. It now reads the
> user space relevant data (the addr, utime and addr2), as well as
> parses the flags. This adds a little smarts to the trace event as
> it only shows the parameters that are relevant, as well as parses
> utime as either a timespec or as val2 depending on the futex_op.
>
> Here's an example of the new output:
>
>  sys_futex(uaddr: 0x56196292e830 (0), FUTEX_WAKE|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e834 (0x4a7) tid: 1191, FUTEX_UNLOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e834 (0) tid: 0, FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e830 (0), FUTEX_WAIT|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (4aa), val3: 0)
>  sys_futex(uaddr: 0x56196292e834 (0x4aa) tid: 1194, FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (800004aa), val3: 0)
>  sys_futex(uaddr: 0x7f7ed6b29990 (0x4ab), FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME)
>  sys_futex(uaddr: 0x56196292e834 (0x800004aa) tid: 1194 (WAITERS), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (800004aa), val3: 0)
>  sys_futex(uaddr: 0x56196292e834 (0x800004aa) tid: 1194 (WAITERS), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)

I understand what you are trying to achieve, but do we really need all
the complexity of decoding and pretty printing in the kernel?

Isn't it sufficient to store and expose the raw data and use post
processing to make it readable?

I've been doing complex futex analysis for two decades with a small set
of python scripts which translate raw text or binary trace data into
human readable information.

I agree that it's useful to have the actual timeout value and other data
which is missing today, but that still does not require all this
customized printing.

The initial idea of having at least some information about the data
entry (type, meaning etc.) in $event/format and use that for kernel text
output and for user space tools to analyze a binary trace has been
definitely the right way to go.

But that now deviates because $event/format cannot carry that
information you translate to in the kernel. It will still describe raw
event data, no?

So why not keeping the well known and working solution of identifying
the data in the format, print it raw and leave the post processing to
user space tools in case there is a need.

You actually make it harder to do development. Look at the patch series
related to robust futexes:

  https://lore.kernel.org/lkml/20260330114212.927686587@kernel.org/

So your decoding:

>  sys_futex(uaddr: 0x56196292e830 (0), FUTEX_WAKE|FUTEX_PRIVATE_FLAG)

fails to decode the new flag and the usage of uaddr2 unless I go and add
it in the first place _before_ working on the code. Right now it is just
printing op as a hex value and it just works when a new bit is added.

Stick 100 lines of python into tools/tracing and be done with it. I'm
happy to contribute to that.

Aside of that:

  Putting the decoder (futex_print_syscall) into the futex code itself
  is admittedly a smart move to offload the work of keeping that up to
  date to the people who are actually working on futexes.

TBH, I'm not interested to deal with that at all. If you want this
ftrace magic pretty printing, then stick it into kernel/trace or if
there is a real technical reason (hint there is none) into
kernel/futex/trace.c and take ownership of it. But please do not burden
others with your fancy toy of the day.

Thanks,

        tglx

^ permalink raw reply

* [PATCH bpf v3 2/2] selftests/bpf: Add test to ensure kprobe_multi is not sleepable
From: Varun R Mallya @ 2026-04-01 19:11 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, memxor, yonghong.song, jolsa, rostedt, mhiramat,
	linux-kernel, linux-trace-kernel, varunrmallya
In-Reply-To: <20260401191126.440683-1-varunrmallya@gmail.com>

Add a selftest to ensure that kprobe_multi programs cannot be attached
using the BPF_F_SLEEPABLE flag. This test succeeds when the kernel
rejects attachment of kprobe_multi when the BPF_F_SLEEPABLE flag is set.

Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
---
 .../bpf/prog_tests/kprobe_multi_test.c        | 41 +++++++++++++++++++
 .../bpf/progs/kprobe_multi_sleepable.c        | 13 ++++++
 2 files changed, 54 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c

diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
index 78c974d4ea33..f02fec2b6fda 100644
--- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
+++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
@@ -10,6 +10,7 @@
 #include "kprobe_multi_session_cookie.skel.h"
 #include "kprobe_multi_verifier.skel.h"
 #include "kprobe_write_ctx.skel.h"
+#include "kprobe_multi_sleepable.skel.h"
 #include "bpf/libbpf_internal.h"
 #include "bpf/hashmap.h"
 
@@ -633,6 +634,44 @@ static void test_attach_write_ctx(void)
 }
 #endif
 
+static void test_attach_multi_sleepable(void)
+{
+	struct kprobe_multi_sleepable *skel;
+	int err;
+
+	skel = kprobe_multi_sleepable__open();
+	if (!ASSERT_OK_PTR(skel, "kprobe_multi_sleepable__open"))
+		return;
+
+	err = bpf_program__set_flags(skel->progs.handle_kprobe_multi_sleepable,
+				     BPF_F_SLEEPABLE);
+	if (!ASSERT_OK(err, "bpf_program__set_flags"))
+		goto cleanup;
+
+	/* Load should succeed even with BPF_F_SLEEPABLE for KPROBE types */
+	err = kprobe_multi_sleepable__load(skel);
+	if (!ASSERT_OK(err, "kprobe_multi_sleepable__load"))
+		goto cleanup;
+
+	/* Attachment must fail for kprobe.multi + BPF_F_SLEEPABLE.
+	 * Also chosen a stable symbol to send into opts
+	 */
+	LIBBPF_OPTS(bpf_kprobe_multi_opts, opts);
+	const char *sym = "vfs_read";
+
+	opts.syms = &sym;
+	opts.cnt = 1;
+
+	skel->links.handle_kprobe_multi_sleepable =
+		bpf_program__attach_kprobe_multi_opts(skel->progs.handle_kprobe_multi_sleepable,
+						      NULL, &opts);
+	ASSERT_ERR_PTR(skel->links.handle_kprobe_multi_sleepable,
+		       "bpf_program__attach_kprobe_multi_opts");
+
+cleanup:
+	kprobe_multi_sleepable__destroy(skel);
+}
+
 void serial_test_kprobe_multi_bench_attach(void)
 {
 	if (test__start_subtest("kernel"))
@@ -676,5 +715,7 @@ void test_kprobe_multi_test(void)
 		test_unique_match();
 	if (test__start_subtest("attach_write_ctx"))
 		test_attach_write_ctx();
+	if (test__start_subtest("attach_multi_sleepable"))
+		test_attach_multi_sleepable();
 	RUN_TESTS(kprobe_multi_verifier);
 }
diff --git a/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
new file mode 100644
index 000000000000..56973ad8779d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+SEC("kprobe.multi")
+int handle_kprobe_multi_sleepable(struct pt_regs *ctx)
+{
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.53.0


^ permalink raw reply related

* [PATCH bpf v3 1/2] bpf: Reject sleepable kprobe_multi programs at attach time
From: Varun R Mallya @ 2026-04-01 19:11 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, memxor, yonghong.song, jolsa, rostedt, mhiramat,
	linux-kernel, linux-trace-kernel, varunrmallya

kprobe.multi programs run in atomic/RCU context and cannot sleep.
However, bpf_kprobe_multi_link_attach() did not validate whether the
program being attached had the sleepable flag set, allowing sleepable
helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
context.

This causes a "sleeping function called from invalid context" splat:

  BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
  preempt_count: 1, expected: 0
  RCU nest depth: 2, expected: 0

Fix this by rejecting sleepable programs early in
bpf_kprobe_multi_link_attach(), before any further processing.

Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
---
 kernel/trace/bpf_trace.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0b040a417442..af7079aa0f36 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 	if (!is_kprobe_multi(prog))
 		return -EINVAL;
 
+	/* kprobe_multi is not allowed to be sleepable. */
+	if (prog->sleepable)
+		return -EINVAL;
+
 	/* Writing to context is not allowed for kprobes. */
 	if (prog->aux->kprobe_write_ctx)
 		return -EINVAL;
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH v3 1/3] tracing: Have futex syscall trace event show specific user data
From: Steven Rostedt @ 2026-04-01 18:17 UTC (permalink / raw)
  To: Ian Rogers
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Brian Geffon, John Stultz, Suleiman Souhlal
In-Reply-To: <CAP-5=fWxhLrMTwteTedGVeUg8uz=xzF+_3Q4qFnxB=TaKR9D8w@mail.gmail.com>

On Wed, 1 Apr 2026 11:07:07 -0700
Ian Rogers <irogers@google.com> wrote:

> nit: Would this read better if `val` were also hex?
> futex_requeue_p-3248    [001] .....  2101.068970: sys_futex(uaddr:
> 0x7f859072f990 (0xcb2), FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, val:
> 0xcb2)
> The val matching the value in uaddr is the condition necessary for the
> futex to sleep.

Good point. I can update that in v4.

-- Steve

^ permalink raw reply

* Re: [PATCH v3 1/3] tracing: Have futex syscall trace event show specific user data
From: Ian Rogers @ 2026-04-01 18:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Brian Geffon, John Stultz, Suleiman Souhlal
In-Reply-To: <20260331181432.995263521@kernel.org>

On Tue, Mar 31, 2026 at 11:13 AM Steven Rostedt <rostedt@kernel.org> wrote:
>
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Add specific reporting of the futex system call. This allows for debugging
> the futex code a bit easier. Instead of just showing the values passed
> into the futex system call, read the value of the user space memory
> pointed to by the addr parameter.
>
> Also make the op parameter more readable by parsing the values to show
> what the command is:
>
>  futex_requeue_p-3251    [002] .....  2101.068479: sys_futex(uaddr: 0x55e79a4da834 (0x80000cb1), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG, val: 0)
>  futex_requeue_p-3248    [001] .....  2101.068970: sys_futex(uaddr: 0x7f859072f990 (0xcb2), FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, val: 3250)

nit: Would this read better if `val` were also hex?
futex_requeue_p-3248    [001] .....  2101.068970: sys_futex(uaddr:
0x7f859072f990 (0xcb2), FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, val:
0xcb2)
The val matching the value in uaddr is the condition necessary for the
futex to sleep.

>  futex_requeue_p-3252    [005] .....  2101.069108: sys_futex(uaddr: 0x55e79a4da838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, val: 0, timespec: 0x7ffe61076aa0, uaddr2: 0x55e79a4da834, uaddr2: 94453214586932, val3: 0)
>  futex_requeue_p-3252    [005] .....  2101.069410: sys_futex(uaddr: 0x55e79a4da834 (0x80000cb1), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG, val: 0)
>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> Changes since v2: https://patch.msgid.link/20260310201036.542627924@kernel.org
>
> - Removed unused "buf" variable (kernel test robot)
>
> - Iterate __futex_cmds[] make the print statement.
>   Note this required exposing __futex_cmds[] to trace_syscall.c
>
> - Added back val statement (with the move to futex/syscall.c the
>   third parameter was dropped).
>
>  include/linux/futex.h         |   8 +-
>  kernel/futex/syscalls.c       |  97 +++++++++++++++++++++++
>  kernel/trace/trace_syscalls.c | 144 +++++++++++++++++++++++++++++++++-
>  3 files changed, 245 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/futex.h b/include/linux/futex.h
> index 9e9750f04980..7eaf01ff68cf 100644
> --- a/include/linux/futex.h
> +++ b/include/linux/futex.h
> @@ -82,6 +82,11 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
>               u32 __user *uaddr2, u32 val2, u32 val3);
>  int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4);
>
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args, u32 *ptr);
> +extern const char * __futex_cmds[];
> +#endif
> +
>  #ifdef CONFIG_FUTEX_PRIVATE_HASH
>  int futex_hash_allocate_default(void);
>  void futex_hash_free(struct mm_struct *mm);
> @@ -114,7 +119,8 @@ static inline int futex_hash_allocate_default(void)
>  }
>  static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
>  static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
> -
> +static inline void futex_print_syscall(struct seq_buf *s, int nr_args,
> +                                      unsigned long *args, u32 *ptr) { }
>  #endif
>
>  #endif
> diff --git a/kernel/futex/syscalls.c b/kernel/futex/syscalls.c
> index 743c7a728237..3ba8ca017c9c 100644
> --- a/kernel/futex/syscalls.c
> +++ b/kernel/futex/syscalls.c
> @@ -171,6 +171,18 @@ static __always_inline bool futex_cmd_has_timeout(u32 cmd)
>         return false;
>  }
>
> +static __always_inline bool futex_cmd_has_addr2(u32 cmd)
> +{
> +       switch (cmd) {
> +       case FUTEX_REQUEUE:
> +       case FUTEX_CMP_REQUEUE:
> +       case FUTEX_WAKE_OP:
> +       case FUTEX_WAIT_REQUEUE_PI:
> +               return true;
> +       }
> +       return false;
> +}
> +
>  static __always_inline int
>  futex_init_timeout(u32 cmd, u32 op, struct timespec64 *ts, ktime_t *t)
>  {
> @@ -207,6 +219,91 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
>         return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
>  }
>
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +
> +/* End with NULL for iterators */
> +const char * __futex_cmds[] =
> +{
> +       "FUTEX_WAIT", "FUTEX_WAKE", "FUTEX_FD", "FUTEX_REQUEUE",
> +       "FUTEX_CMP_REQUEUE", "FUTEX_WAKE_OP", "FUTEX_LOCK_PI",
> +       "FUTEX_UNLOCK_PI", "FUTEX_TRYLOCK_PI", "FUTEX_WAIT_BITSET",
> +       "FUTEX_WAKE_BITSET", "FUTEX_WAIT_REQUEUE_PI", "FUTEX_CMP_REQUEUE_PI",
> +       "FUTEX_LOCK_PI2", NULL
> +};
> +
> +void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args, u32 *ptr)
> +{
> +       unsigned int op, cmd;
> +       bool done = false;
> +
> +       for (int i = 0; !done && i < nr_args; i++) {
> +
> +               if (seq_buf_has_overflowed(s))
> +                       break;
> +
> +               switch (i) {
> +               case 0:
> +                       seq_buf_printf(s, "uaddr: 0x%lx", args[i]);
> +                       if (ptr) {
> +                               u32 val = *ptr;
> +                               if (val < 10)
> +                                       seq_buf_printf(s, " (%u)", val);
> +                               else
> +                                       seq_buf_printf(s, " (0x%x)", val);
> +                       }
> +                       continue;
> +               case 1:
> +                       op = args[i];
> +                       cmd = op & FUTEX_CMD_MASK;
> +                       if (cmd <= FUTEX_LOCK_PI2)
> +                               seq_buf_printf(s, ", %s", __futex_cmds[cmd]);
> +                       else
> +                               seq_buf_puts(s, ", UNKNOWN");
> +
> +                       if (op & FUTEX_PRIVATE_FLAG)
> +                               seq_buf_puts(s, "|FUTEX_PRIVATE_FLAG");
> +                       if (op & FUTEX_CLOCK_REALTIME)
> +                               seq_buf_puts(s, "|FUTEX_CLOCK_REALTIME");
> +                       continue;
> +               case 2:
> +                       if (args[i] < 10)
> +                               seq_buf_printf(s, ", val: %ld", args[i]);
> +                       else
> +                               seq_buf_printf(s, ", val: 0x%lx", args[i]);

I'm confused why this hex formatting didn't appear in the sample output.

Thanks,
Ian

> +                       continue;
> +               case 3:
> +                       if (!futex_cmd_has_timeout(cmd)) {
> +
> +                               if (!futex_cmd_has_addr2(cmd)) {
> +                                       done = true;
> +                                       continue;
> +                               }
> +
> +                               seq_buf_printf(s, ", val2: 0x%x", (u32)(long)args[i]);
> +                               continue;
> +                       }
> +
> +                       if (!args[i])
> +                               continue;
> +
> +                       seq_buf_printf(s, ", timespec: 0x%lx", args[i]);
> +                       continue;
> +               case 4:
> +                       if (!futex_cmd_has_addr2(cmd)) {
> +                               done = true;
> +                               continue;
> +                       }
> +                       seq_buf_printf(s, ", uaddr2: 0x%lx", args[i]);
> +                       continue;
> +               case 5:
> +                       seq_buf_printf(s, ", val3: %lu", args[i]);
> +                       done = true;
> +                       continue;
> +               }
> +       }
> +}
> +#endif
> +
>  /**
>   * futex_parse_waitv - Parse a waitv array from userspace
>   * @futexv:    Kernel side list of waiters to be filled
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 37317b81fcda..f8213d772f89 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -6,11 +6,13 @@
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
>  #include <linux/module.h>      /* for MODULE_NAME_LEN via KSYM_SYMBOL_LEN */
> +#include <linux/futex.h>
>  #include <linux/ftrace.h>
>  #include <linux/perf_event.h>
>  #include <linux/xarray.h>
>  #include <asm/syscall.h>
>
> +
>  #include "trace_output.h"
>  #include "trace.h"
>
> @@ -237,6 +239,27 @@ sys_enter_openat_print(struct syscall_trace_enter *trace, struct syscall_metadat
>         return trace_handle_return(s);
>  }
>
> +static enum print_line_t
> +sys_enter_futex_print(struct syscall_trace_enter *trace, struct syscall_metadata *entry,
> +                     struct trace_seq *s, struct trace_event *event, int ent_size)
> +{
> +       void *end = (void *)trace + ent_size;
> +       void *ptr;
> +
> +       /* Set ptr to the user space copied area */
> +       ptr = (void *)trace->args + sizeof(unsigned long) * entry->nb_args;
> +       if (ptr + 4 > end)
> +               ptr = NULL;
> +
> +       trace_seq_printf(s, "%s(", entry->name);
> +
> +       futex_print_syscall(&s->seq, entry->nb_args, trace->args, ptr);
> +
> +       trace_seq_puts(s, ")\n");
> +
> +       return trace_handle_return(s);
> +}
> +
>  static enum print_line_t
>  print_syscall_enter(struct trace_iterator *iter, int flags,
>                     struct trace_event *event)
> @@ -267,6 +290,10 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
>                 if (!tr || !(tr->trace_flags & TRACE_ITER(VERBOSE)))
>                         return sys_enter_openat_print(trace, entry, s, event);
>                 break;
> +       case __NR_futex:
> +               if (!tr || !(tr->trace_flags & TRACE_ITER(VERBOSE)))
> +                       return sys_enter_futex_print(trace, entry, s, event, iter->ent_size);
> +               break;
>         default:
>                 break;
>         }
> @@ -437,6 +464,48 @@ sys_enter_openat_print_fmt(struct syscall_metadata *entry, char *buf, int len)
>         return pos;
>  }
>
> +static int __init
> +sys_enter_futex_print_fmt(struct syscall_metadata *entry, char *buf, int len)
> +{
> +       int pos = 0;
> +
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       "\"uaddr: 0x%%lx (0x%%lx) cmd=%%s%%s%%s");
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       "  val: 0x%%x timeout/val2: 0x%%llx");
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       " uaddr2: 0x%%lx val3: 0x%%x\", ");
> +
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       " REC->uaddr,");
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       " REC->__value,");
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       "  __print_symbolic(REC->op & 0x%x, ", FUTEX_CMD_MASK);
> +
> +       for (int i = 0; __futex_cmds[i]; i++) {
> +               pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                               "%s{%d, \"%s\"} ",
> +                               i ? "," : "", i, __futex_cmds[i]);
> +       }
> +
> +       pos += snprintf(buf + pos, LEN_OR_ZERO, "), ");
> +
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       " (REC->op & %d) ? \"|FUTEX_PRIVATE_FLAG\" : \"\",",
> +                       FUTEX_PRIVATE_FLAG);
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       " (REC->op & %d) ? \"|FUTEX_CLOCK_REALTIME\" : \"\",",
> +                       FUTEX_CLOCK_REALTIME);
> +
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       " REC->val, REC->utime,");
> +
> +       pos += snprintf(buf + pos, LEN_OR_ZERO,
> +                       " REC->uaddr, REC->val3");
> +       return pos;
> +}
> +
>  static int __init
>  __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
>  {
> @@ -447,6 +516,8 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
>         switch (entry->syscall_nr) {
>         case __NR_openat:
>                 return sys_enter_openat_print_fmt(entry, buf, len);
> +       case __NR_futex:
> +               return sys_enter_futex_print_fmt(entry, buf, len);
>         default:
>                 break;
>         }
> @@ -523,6 +594,21 @@ static void __init free_syscall_print_fmt(struct trace_event_call *call)
>                 kfree(call->print_fmt);
>  }
>
> +static int __init futex_fields(struct trace_event_call *call, int offset)
> +{
> +       char *arg;
> +       int ret;
> +
> +       arg = kstrdup("__value", GFP_KERNEL);
> +       if (WARN_ON_ONCE(!arg))
> +               return -ENOMEM;
> +       ret = trace_define_field(call, "u32", arg, offset, sizeof(int), 0,
> +                                FILTER_OTHER);
> +       if (ret)
> +               kfree(arg);
> +       return ret;
> +}
> +
>  static int __init syscall_enter_define_fields(struct trace_event_call *call)
>  {
>         struct syscall_trace_enter trace;
> @@ -544,6 +630,9 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
>                 offset += sizeof(unsigned long);
>         }
>
> +       if (!ret && meta->syscall_nr == __NR_futex)
> +               return futex_fields(call, offset);
> +
>         if (ret || !meta->user_mask)
>                 return ret;
>
> @@ -689,6 +778,45 @@ static int syscall_copy_user_array(char *buf, const char __user *ptr,
>         return 0;
>  }
>
> +static int
> +syscall_get_futex(unsigned long *args, char **buffer, int *size, int buf_size)
> +{
> +       struct syscall_user_buffer *sbuf;
> +       const char __user *ptr;
> +
> +       /* buf_size of zero means user doesn't want user space read */
> +       if (!buf_size)
> +               return -1;
> +
> +       /* If the syscall_buffer is NULL, tracing is being shutdown */
> +       sbuf = READ_ONCE(syscall_buffer);
> +       if (!sbuf)
> +               return -1;
> +
> +       ptr = (char __user *)args[0];
> +
> +       *buffer = trace_user_fault_read(&sbuf->buf, ptr, 4, NULL, NULL);
> +       if (!*buffer)
> +               return -1;
> +
> +       /* Add room for the value */
> +       *size += 4;
> +
> +       return 0;
> +}
> +
> +static void syscall_put_futex(struct syscall_metadata *sys_data,
> +                             struct syscall_trace_enter *entry,
> +                             char *buffer)
> +{
> +       u32 *ptr;
> +
> +       /* Place the futex key into the storage */
> +       ptr = (void *)entry->args + sizeof(unsigned long) * sys_data->nb_args;
> +
> +       *ptr = *(u32 *)buffer;
> +}
> +
>  static char *sys_fault_user(unsigned int buf_size,
>                             struct syscall_metadata *sys_data,
>                             struct syscall_user_buffer *sbuf,
> @@ -905,6 +1033,9 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
>                 if (syscall_get_data(sys_data, args, &user_ptr,
>                                      &size, user_sizes, &uargs, tr->syscall_buf_sz) < 0)
>                         return;
> +       } else if (syscall_nr == __NR_futex) {
> +               if (syscall_get_futex(args, &user_ptr, &size, tr->syscall_buf_sz) < 0)
> +                       return;
>         }
>
>         size += sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> @@ -921,6 +1052,9 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
>         if (mayfault)
>                 syscall_put_data(sys_data, entry, user_ptr, size, user_sizes, uargs);
>
> +       else if (syscall_nr == __NR_futex)
> +               syscall_put_futex(sys_data, entry, user_ptr);
> +
>         trace_event_buffer_commit(&fbuffer);
>  }
>
> @@ -971,14 +1105,18 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
>  {
>         struct syscall_metadata *sys_data = call->data;
>         struct trace_array *tr = file->tr;
> +       bool enable_faults;
>         int ret = 0;
>         int num;
>
>         num = sys_data->syscall_nr;
>         if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
>                 return -ENOSYS;
> +
> +       enable_faults = sys_data->user_mask || num == __NR_futex;
> +
>         guard(mutex)(&syscall_trace_lock);
> -       if (sys_data->user_mask) {
> +       if (enable_faults) {
>                 ret = syscall_fault_buffer_enable();
>                 if (ret < 0)
>                         return ret;
> @@ -986,7 +1124,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
>         if (!tr->sys_refcount_enter) {
>                 ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
>                 if (ret < 0) {
> -                       if (sys_data->user_mask)
> +                       if (enable_faults)
>                                 syscall_fault_buffer_disable();
>                         return ret;
>                 }
> @@ -1011,7 +1149,7 @@ static void unreg_event_syscall_enter(struct trace_event_file *file,
>         WRITE_ONCE(tr->enter_syscall_files[num], NULL);
>         if (!tr->sys_refcount_enter)
>                 unregister_trace_sys_enter(ftrace_syscall_enter, tr);
> -       if (sys_data->user_mask)
> +       if (sys_data->user_mask || num == __NR_futex)
>                 syscall_fault_buffer_disable();
>  }
>
> --
> 2.51.0
>
>

^ permalink raw reply

* Re: [PATCH v3 0/3] tracing: Read user data from futex system call trace event
From: Steven Rostedt @ 2026-04-01 17:15 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, Peter Zijlstra
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Thomas Gleixner, Brian Geffon, John Stultz, Ian Rogers,
	Suleiman Souhlal
In-Reply-To: <20260331181349.062575155@kernel.org>


Peter,

Any thoughts on this version?

-- Steve


On Tue, 31 Mar 2026 14:13:49 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> We are looking at the performance of futexes and require a bit more
> information when tracing them.
> 
> The two patches here extend the system call reading of user space to
> create specific handling of the futex system call. It now reads the
> user space relevant data (the addr, utime and addr2), as well as
> parses the flags. This adds a little smarts to the trace event as
> it only shows the parameters that are relevant, as well as parses
> utime as either a timespec or as val2 depending on the futex_op.
> 
> Here's an example of the new output:
> 
>  sys_futex(uaddr: 0x56196292e830 (0), FUTEX_WAKE|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e834 (0x4a7) tid: 1191, FUTEX_UNLOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e834 (0) tid: 0, FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e830 (0), FUTEX_WAIT|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (4aa), val3: 0)
>  sys_futex(uaddr: 0x56196292e834 (0x4aa) tid: 1194, FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (800004aa), val3: 0)
>  sys_futex(uaddr: 0x7f7ed6b29990 (0x4ab), FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME)
>  sys_futex(uaddr: 0x56196292e834 (0x800004aa) tid: 1194 (WAITERS), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
>  sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (800004aa), val3: 0)
>  sys_futex(uaddr: 0x56196292e834 (0x800004aa) tid: 1194 (WAITERS), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
> 
> Changes since v2: https://lore.kernel.org/all/20260310200954.285663884@kernel.org/
> 
> - Removed unused "buf" variable (kernel test robot)
> 
> - Iterate __futex_cmds[] make the print statement.
>   Note this required exposing __futex_cmds[] to trace_syscall.c
>   (Masami Hiramatsu)
> 
> - Added back val statement (with the move to futex/syscall.c the
>   third parameter was dropped).
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
> futex/core
> 
> Head SHA1: 79b0609ad15b24d0bbfb1790e17902a6c210ae69
> 
> 
> Steven Rostedt (3):
>       tracing: Have futex syscall trace event show specific user data
>       tracing: Update futex syscall trace event to show more commands
>       tracing: Show TID and flags for PI futex system call trace event
> 
> ----
>  include/linux/futex.h         |  39 ++++++-
>  kernel/futex/syscalls.c       | 137 +++++++++++++++++++++---
>  kernel/trace/trace_syscalls.c | 237 +++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 396 insertions(+), 17 deletions(-)


^ permalink raw reply

* Re: [PATCH v2] tracing/probe: reject empty immediate strings
From: Steven Rostedt @ 2026-04-01 17:00 UTC (permalink / raw)
  To: Pengpeng Hou
  Cc: mhiramat, mathieu.desnoyers, linux-trace-kernel, linux-kernel
In-Reply-To: <20260401160315.88518-1-pengpeng@iscas.ac.cn>

On Thu,  2 Apr 2026 00:03:15 +0800
Pengpeng Hou <pengpeng@iscas.ac.cn> wrote:

> parse_probe_arg() accepts quoted immediate strings and passes the body
> after the opening quote to __parse_imm_string(). That helper currently
> computes strlen(str) and immediately dereferences str[len - 1], which
> underflows when the body is empty.
> 
> Reject empty immediate strings before checking for the closing quote.
> 
> Fixes: a42e3c4de964 ("tracing/probe: Add immediate string parameter support")
> Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve

^ permalink raw reply

* Re: [PATCH bpf v2] bpf: Reject sleepable kprobe_multi programs at attach time
From: Kumar Kartikeya Dwivedi @ 2026-04-01 16:31 UTC (permalink / raw)
  To: Varun R Mallya
  Cc: andrii, ast, bpf, daniel, eddyz87, jolsa, kpsingh, linux-kernel,
	linux-trace-kernel, martin.lau, mathieu.desnoyers, mattbobrowski,
	mhiramat, rostedt, song, yonghong.song
In-Reply-To: <20260401162555.426867-1-varunrmallya@gmail.com>

On Wed, 1 Apr 2026 at 18:26, Varun R Mallya <varunrmallya@gmail.com> wrote:
>
> kprobe.multi programs run in atomic/RCU context and cannot sleep.
> However, bpf_kprobe_multi_link_attach() did not validate whether the
> program being attached had the sleepable flag set, allowing sleepable
> helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
> context.
>
> This causes a "sleeping function called from invalid context" splat:
>
>   BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
>   in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
>   preempt_count: 1, expected: 0
>   RCU nest depth: 2, expected: 0
>
> Fix this by rejecting sleepable programs early in
> bpf_kprobe_multi_link_attach(), before any further processing.
>
> Also add a selftest that tries to load a sleepable kprobe_multi program
> and it needs to be rejected for the test to pass.
>
> Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
> Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
> ---

Looks ok, but please split kernel fix and tests into separate patches,
and no need to reply to this thread.

pw-bot: cr


>  kernel/trace/bpf_trace.c                      |  4 ++
>  .../bpf/prog_tests/kprobe_multi_test.c        | 41 +++++++++++++++++++
>  .../bpf/progs/kprobe_multi_sleepable.c        | 13 ++++++
>  3 files changed, 58 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 0b040a417442..af7079aa0f36 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
>         if (!is_kprobe_multi(prog))
>                 return -EINVAL;
>
> +       /* kprobe_multi is not allowed to be sleepable. */
> +       if (prog->sleepable)
> +               return -EINVAL;
> +
>         /* Writing to context is not allowed for kprobes. */
>         if (prog->aux->kprobe_write_ctx)
>                 return -EINVAL;
> diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
> index 78c974d4ea33..f02fec2b6fda 100644
> --- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
> +++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
> @@ -10,6 +10,7 @@
>  #include "kprobe_multi_session_cookie.skel.h"
>  #include "kprobe_multi_verifier.skel.h"
>  #include "kprobe_write_ctx.skel.h"
> +#include "kprobe_multi_sleepable.skel.h"
>  #include "bpf/libbpf_internal.h"
>  #include "bpf/hashmap.h"
>
> @@ -633,6 +634,44 @@ static void test_attach_write_ctx(void)
>  }
>  #endif
>
> +static void test_attach_multi_sleepable(void)
> +{
> +       struct kprobe_multi_sleepable *skel;
> +       int err;
> +
> +       skel = kprobe_multi_sleepable__open();
> +       if (!ASSERT_OK_PTR(skel, "kprobe_multi_sleepable__open"))
> +               return;
> +
> +       err = bpf_program__set_flags(skel->progs.handle_kprobe_multi_sleepable,
> +                                    BPF_F_SLEEPABLE);
> +       if (!ASSERT_OK(err, "bpf_program__set_flags"))
> +               goto cleanup;
> +
> +       /* Load should succeed even with BPF_F_SLEEPABLE for KPROBE types */
> +       err = kprobe_multi_sleepable__load(skel);
> +       if (!ASSERT_OK(err, "kprobe_multi_sleepable__load"))
> +               goto cleanup;
> +
> +       /* Attachment must fail for kprobe.multi + BPF_F_SLEEPABLE.
> +        * Also chosen a stable symbol to send into opts
> +        */
> +       LIBBPF_OPTS(bpf_kprobe_multi_opts, opts);
> +       const char *sym = "vfs_read";
> +
> +       opts.syms = &sym;
> +       opts.cnt = 1;
> +
> +       skel->links.handle_kprobe_multi_sleepable =
> +               bpf_program__attach_kprobe_multi_opts(skel->progs.handle_kprobe_multi_sleepable,
> +                                                     NULL, &opts);
> +       ASSERT_ERR_PTR(skel->links.handle_kprobe_multi_sleepable,
> +                      "bpf_program__attach_kprobe_multi_opts");
> +
> +cleanup:
> +       kprobe_multi_sleepable__destroy(skel);
> +}
> +
>  void serial_test_kprobe_multi_bench_attach(void)
>  {
>         if (test__start_subtest("kernel"))
> @@ -676,5 +715,7 @@ void test_kprobe_multi_test(void)
>                 test_unique_match();
>         if (test__start_subtest("attach_write_ctx"))
>                 test_attach_write_ctx();
> +       if (test__start_subtest("attach_multi_sleepable"))
> +               test_attach_multi_sleepable();
>         RUN_TESTS(kprobe_multi_verifier);
>  }
> diff --git a/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
> new file mode 100644
> index 000000000000..56973ad8779d
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
> @@ -0,0 +1,13 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +SEC("kprobe.multi")
> +int handle_kprobe_multi_sleepable(struct pt_regs *ctx)
> +{
> +       return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> --
> 2.53.0
>

^ permalink raw reply

* [PATCH bpf v2] bpf: Reject sleepable kprobe_multi programs at attach time
From: Varun R Mallya @ 2026-04-01 16:25 UTC (permalink / raw)
  To: varunrmallya
  Cc: andrii, ast, bpf, daniel, eddyz87, jolsa, kpsingh, linux-kernel,
	linux-trace-kernel, martin.lau, mathieu.desnoyers, mattbobrowski,
	memxor, mhiramat, rostedt, song, yonghong.song
In-Reply-To: <ac0hLyMuPHiZfnVc@computer>

kprobe.multi programs run in atomic/RCU context and cannot sleep.
However, bpf_kprobe_multi_link_attach() did not validate whether the
program being attached had the sleepable flag set, allowing sleepable
helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
context.

This causes a "sleeping function called from invalid context" splat:

  BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
  preempt_count: 1, expected: 0
  RCU nest depth: 2, expected: 0

Fix this by rejecting sleepable programs early in
bpf_kprobe_multi_link_attach(), before any further processing.

Also add a selftest that tries to load a sleepable kprobe_multi program
and it needs to be rejected for the test to pass.

Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
---
 kernel/trace/bpf_trace.c                      |  4 ++
 .../bpf/prog_tests/kprobe_multi_test.c        | 41 +++++++++++++++++++
 .../bpf/progs/kprobe_multi_sleepable.c        | 13 ++++++
 3 files changed, 58 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0b040a417442..af7079aa0f36 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 	if (!is_kprobe_multi(prog))
 		return -EINVAL;
 
+	/* kprobe_multi is not allowed to be sleepable. */
+	if (prog->sleepable)
+		return -EINVAL;
+
 	/* Writing to context is not allowed for kprobes. */
 	if (prog->aux->kprobe_write_ctx)
 		return -EINVAL;
diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
index 78c974d4ea33..f02fec2b6fda 100644
--- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
+++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
@@ -10,6 +10,7 @@
 #include "kprobe_multi_session_cookie.skel.h"
 #include "kprobe_multi_verifier.skel.h"
 #include "kprobe_write_ctx.skel.h"
+#include "kprobe_multi_sleepable.skel.h"
 #include "bpf/libbpf_internal.h"
 #include "bpf/hashmap.h"
 
@@ -633,6 +634,44 @@ static void test_attach_write_ctx(void)
 }
 #endif
 
+static void test_attach_multi_sleepable(void)
+{
+	struct kprobe_multi_sleepable *skel;
+	int err;
+
+	skel = kprobe_multi_sleepable__open();
+	if (!ASSERT_OK_PTR(skel, "kprobe_multi_sleepable__open"))
+		return;
+
+	err = bpf_program__set_flags(skel->progs.handle_kprobe_multi_sleepable,
+				     BPF_F_SLEEPABLE);
+	if (!ASSERT_OK(err, "bpf_program__set_flags"))
+		goto cleanup;
+
+	/* Load should succeed even with BPF_F_SLEEPABLE for KPROBE types */
+	err = kprobe_multi_sleepable__load(skel);
+	if (!ASSERT_OK(err, "kprobe_multi_sleepable__load"))
+		goto cleanup;
+
+	/* Attachment must fail for kprobe.multi + BPF_F_SLEEPABLE.
+	 * Also chosen a stable symbol to send into opts
+	 */
+	LIBBPF_OPTS(bpf_kprobe_multi_opts, opts);
+	const char *sym = "vfs_read";
+
+	opts.syms = &sym;
+	opts.cnt = 1;
+
+	skel->links.handle_kprobe_multi_sleepable =
+		bpf_program__attach_kprobe_multi_opts(skel->progs.handle_kprobe_multi_sleepable,
+						      NULL, &opts);
+	ASSERT_ERR_PTR(skel->links.handle_kprobe_multi_sleepable,
+		       "bpf_program__attach_kprobe_multi_opts");
+
+cleanup:
+	kprobe_multi_sleepable__destroy(skel);
+}
+
 void serial_test_kprobe_multi_bench_attach(void)
 {
 	if (test__start_subtest("kernel"))
@@ -676,5 +715,7 @@ void test_kprobe_multi_test(void)
 		test_unique_match();
 	if (test__start_subtest("attach_write_ctx"))
 		test_attach_write_ctx();
+	if (test__start_subtest("attach_multi_sleepable"))
+		test_attach_multi_sleepable();
 	RUN_TESTS(kprobe_multi_verifier);
 }
diff --git a/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
new file mode 100644
index 000000000000..56973ad8779d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+SEC("kprobe.multi")
+int handle_kprobe_multi_sleepable(struct pt_regs *ctx)
+{
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.53.0


^ permalink raw reply related

* [PATCH v2] tracing/probe: reject empty immediate strings
From: Pengpeng Hou @ 2026-04-01 16:03 UTC (permalink / raw)
  To: rostedt
  Cc: mhiramat, mathieu.desnoyers, linux-trace-kernel, linux-kernel,
	pengpeng
In-Reply-To: <20260330062920.40766-1-pengpeng@iscas.ac.cn>

parse_probe_arg() accepts quoted immediate strings and passes the body
after the opening quote to __parse_imm_string(). That helper currently
computes strlen(str) and immediately dereferences str[len - 1], which
underflows when the body is empty.

Reject empty immediate strings before checking for the closing quote.

Fixes: a42e3c4de964 ("tracing/probe: Add immediate string parameter support")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
Changes since v1:
- resend as a standalone patch instead of part of an accidental
  cross-subsystem 1/4 series

 kernel/trace/trace_probe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e0a5dc86c07e..e1c73065dae5 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1068,7 +1068,7 @@ static int __parse_imm_string(char *str, char **pbuf, int offs)
 {
 	size_t len = strlen(str);
 
-	if (str[len - 1] != '"') {
+	if (!len || str[len - 1] != '"') {
 		trace_probe_log_err(offs + len, IMMSTR_NO_CLOSE);
 		return -EINVAL;
 	}
-- 
2.50.1 (Apple Git-155)


^ permalink raw reply related

* Re: [PATCH] bootconfig: Skip printing early params to cmdline from bootconfig
From: Breno Leitao @ 2026-04-01 15:51 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <177505217508.1807250.22866077077504564.stgit@mhiramat.tok.corp.google.com>

On Wed, Apr 01, 2026 at 11:02:55PM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> If user configures `kernel.key` in bootconfig, the 'key' is shown
> in kernel cmdline (/proc/cmdline) and kernel boot parameter
> handler associated with 'key' is invoked. However, since the
> bootconfig does not support the parameter defined with early_param,
> those keys are shown in '/proc/cmdline' but not handled by kernel.
> 
> This could easily mislead users who expected to be able to specify
> early parameters via the boot configuration, leading them to wonder
> why it doesn't work.
> 
> Let's skip printing out early params to cmdline buffer, and warn
> if there is such parameters in bootconfig.
> 
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Reviewed-by: Breno Leitao <leitao@debian.org>

> +static bool __init is_early_param(const char *param)
> +{
> +	const struct obs_kernel_param *p;
> +
> +	for (p = __setup_start; p < __setup_end; p++) {
> +		if (p->early && parameq(param, p->str))
> +			return true;
> +	}

nit: I don't think you need the parenthesis ({) for the ifs in here.

^ permalink raw reply

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-04-01 15:35 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <20260326-gmem-inplace-conversion-v4-10-e202fe950ffd@google.com>

On Thu, Mar 26, 2026 at 03:24:19PM -0700, Ackerley Tng wrote:
> For shared to private conversions, if refcounts on any of the folios
> within the range are elevated, fail the conversion with -EAGAIN.
> 
> At the point of shared to private conversion, all folios in range are
> also unmapped. The filemap_invalidate_lock() is held, so no faulting
> can occur. Hence, from that point on, only transient refcounts can be
> taken on the folios associated with that guest_memfd.
> 
> Hence, it is safe to do the conversion from shared to private.
> 
> After conversion is complete, refcounts may become elevated, but that
> is fine since users of transient refcounts don't actually access
> memory.
> 
> For private to shared conversions, there are no refcount checks, since
> the guest is the only user of private pages, and guest_memfd will be the
> only holder of refcounts on private pages.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  Documentation/virt/kvm/api.rst |  48 +++++++-
>  include/linux/kvm_host.h       |  10 ++
>  include/uapi/linux/kvm.h       |   9 +-
>  virt/kvm/Kconfig               |   1 +
>  virt/kvm/guest_memfd.c         | 245 ++++++++++++++++++++++++++++++++++++++---
>  virt/kvm/kvm_main.c            |  17 ++-
>  6 files changed, 300 insertions(+), 30 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 0b61e2579e1d8..15148c80cfdb6 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -117,7 +117,7 @@ description:
>        x86 includes both i386 and x86_64.
>  
>    Type:
> -      system, vm, or vcpu.
> +      system, vm, vcpu or guest_memfd.
>  
>    Parameters:
>        what parameters are accepted by the ioctl.
> @@ -6557,11 +6557,22 @@ KVM_S390_KEYOP_SSKE
>  ---------------------------------
>  
>  :Capability: KVM_CAP_MEMORY_ATTRIBUTES2
> -:Architectures: x86
> -:Type: vm ioctl
> +:Architectures: all
> +:Type: vm, guest_memfd ioctl
>  :Parameters: struct kvm_memory_attributes2 (in/out)
>  :Returns: 0 on success, <0 on error
>  
> +Errors:
> +
> +  ========== ===============================================================
> +  EINVAL     The specified `offset` or `size` were invalid (e.g. not
> +             page aligned, causes an overflow, or size is zero).
> +  EFAULT     The parameter address was invalid.
> +  EAGAIN     Some page within requested range had unexpected refcounts. The
> +             offset of the page will be returned in `error_offset`.
> +  ENOMEM     Ran out of memory trying to track private/shared state
> +  ========== ===============================================================
> +
>  KVM_SET_MEMORY_ATTRIBUTES2 is an extension to
>  KVM_SET_MEMORY_ATTRIBUTES that supports returning (writing) values to
>  userspace.  The original (pre-extension) fields are shared with
> @@ -6572,15 +6583,42 @@ Attribute values are shared with KVM_SET_MEMORY_ATTRIBUTES.
>  ::
>  
>    struct kvm_memory_attributes2 {
> -	__u64 address;
> +	/* in */
> +	union {
> +		__u64 address;
> +		__u64 offset;
> +	};
>  	__u64 size;
>  	__u64 attributes;
>  	__u64 flags;
> -	__u64 reserved[12];
> +	/* out */
> +	__u64 error_offset;
> +	__u64 reserved[11];
>    };
>  
>    #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
>  
> +Set attributes for a range of offsets within a guest_memfd to
> +KVM_MEMORY_ATTRIBUTE_PRIVATE to limit the specified guest_memfd backed
> +memory range for guest_use. Even if KVM_CAP_GUEST_MEMFD_MMAP is
> +supported, after a successful call to set
> +KVM_MEMORY_ATTRIBUTE_PRIVATE, the requested range will not be mappable
> +into host userspace and will only be mappable by the guest.
> +
> +To allow the range to be mappable into host userspace again, call
> +KVM_SET_MEMORY_ATTRIBUTES2 on the guest_memfd again with
> +KVM_MEMORY_ATTRIBUTE_PRIVATE unset.
> +
> +If this ioctl returns -EAGAIN, the offset of the page with unexpected
> +refcounts will be returned in `error_offset`. This can occur if there
> +are transient refcounts on the pages, taken by other parts of the
> +kernel.
> +
> +Userspace is expected to figure out how to remove all known refcounts
> +on the shared pages, such as refcounts taken by get_user_pages(), and
> +try the ioctl again. A possible source of these long term refcounts is
> +if the guest_memfd memory was pinned in IOMMU page tables.
> +
>  See also: :ref: `KVM_SET_MEMORY_ATTRIBUTES`.
>  
>  .. _kvm_run:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 19f026f8de390..1ea14c66fc82e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2514,6 +2514,16 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
>  }
>  
>  #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +static inline u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> +#ifdef kvm_arch_has_private_mem
> +	if (!kvm || kvm_arch_has_private_mem(kvm))
> +		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +#endif
> +
> +	return 0;
> +}
> +
>  typedef unsigned long (kvm_get_memory_attributes_t)(struct kvm *kvm, gfn_t gfn);
>  DECLARE_STATIC_CALL(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
>  
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 16567d4a769e5..29baaa60de35a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -990,6 +990,7 @@ struct kvm_enable_cap {
>  #define KVM_CAP_S390_USER_OPEREXEC 246
>  #define KVM_CAP_S390_KEYOP 247
>  #define KVM_CAP_MEMORY_ATTRIBUTES2 248
> +#define KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES 249
>  
>  struct kvm_irq_routing_irqchip {
>  	__u32 irqchip;
> @@ -1642,11 +1643,15 @@ struct kvm_memory_attributes {
>  #define KVM_SET_MEMORY_ATTRIBUTES2              _IOWR(KVMIO,  0xd2, struct kvm_memory_attributes2)
>  
>  struct kvm_memory_attributes2 {
> -	__u64 address;
> +	union {
> +		__u64 address;
> +		__u64 offset;
> +	};
>  	__u64 size;
>  	__u64 attributes;
>  	__u64 flags;
> -	__u64 reserved[12];
> +	__u64 error_offset;
> +	__u64 reserved[11];
>  };
>  
>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 3fea89c45cfb4..e371e079e2c50 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -109,6 +109,7 @@ config KVM_VM_MEMORY_ATTRIBUTES
>  
>  config KVM_GUEST_MEMFD
>         select XARRAY_MULTI
> +       select KVM_MEMORY_ATTRIBUTES
>         bool
>  
>  config HAVE_KVM_ARCH_GMEM_PREPARE
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index d414ebfcb4c19..0cff9a85a4c53 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -183,10 +183,12 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  
>  static enum kvm_gfn_range_filter kvm_gmem_get_invalidate_filter(struct inode *inode)
>  {
> -	if (GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED)
> -		return KVM_FILTER_SHARED;
> -
> -	return KVM_FILTER_PRIVATE;
> +	/*
> +	 * TODO: Limit invalidations based on the to-be-invalidated range, i.e.
> +	 *       invalidate shared/private if and only if there can possibly be
> +	 *       such mappings.
> +	 */
> +	return KVM_FILTER_SHARED | KVM_FILTER_PRIVATE;
>  }
>  
>  static void __kvm_gmem_invalidate_begin(struct gmem_file *f, pgoff_t start,
> @@ -552,11 +554,235 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>  }
>  EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_memory_attributes);
>  
> +static bool kvm_gmem_range_has_attributes(struct maple_tree *mt,
> +					  pgoff_t index, size_t nr_pages,
> +					  u64 attributes)
> +{
> +	pgoff_t end = index + nr_pages - 1;
> +	void *entry;
> +
> +	lockdep_assert(mt_lock_is_held(mt));
> +
> +	mt_for_each(mt, entry, index, end) {
> +		if (xa_to_value(entry) != attributes)
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> +					    size_t nr_pages, pgoff_t *err_index)
> +{
> +	struct address_space *mapping = inode->i_mapping;
> +	const int filemap_get_folios_refcount = 1;
> +	pgoff_t last = start + nr_pages - 1;
> +	struct folio_batch fbatch;
> +	bool safe = true;
> +	int i;
> +
> +	folio_batch_init(&fbatch);
> +	while (safe && filemap_get_folios(mapping, &start, last, &fbatch)) {
> +
> +		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +			struct folio *folio = fbatch.folios[i];
> +
> +			if (folio_ref_count(folio) !=
> +			    folio_nr_pages(folio) + filemap_get_folios_refcount) {
> +				safe = false;
> +				*err_index = folio->index;
> +				break;
> +			}
> +		}
> +
> +		folio_batch_release(&fbatch);
> +		cond_resched();
> +	}
> +
> +	return safe;
> +}
> +
> +/*
> + * Preallocate memory for attributes to be stored on a maple tree, pointed to
> + * by mas.  Adjacent ranges with attributes identical to the new attributes
> + * will be merged.  Also sets mas's bounds up for storing attributes.
> + *
> + * This maintains the invariant that ranges with the same attributes will
> + * always be merged.
> + */
> +static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
> +				    pgoff_t start, size_t nr_pages)
> +{
> +	pgoff_t end = start + nr_pages;
> +	pgoff_t last = end - 1;
> +	void *entry;
> +
> +	/* Try extending range. entry is NULL on overflow/wrap-around. */
> +	mas_set_range(mas, end, end);
> +	entry = mas_find(mas, end);
> +	if (entry && xa_to_value(entry) == attributes)
> +		last = mas->last;
> +
> +	if (start > 0) {
> +		mas_set_range(mas, start - 1, start - 1);
> +		entry = mas_find(mas, start - 1);
> +		if (entry && xa_to_value(entry) == attributes)
> +			start = mas->index;
> +	}
> +
> +	mas_set_range(mas, start, last);
> +	return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
> +}
> +
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> +	struct folio_batch fbatch;
> +	pgoff_t next = start;
> +	int i;
> +
> +	folio_batch_init(&fbatch);
> +	while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
> +		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +			struct folio *folio = fbatch.folios[i];
> +			unsigned long pfn = folio_pfn(folio);
> +
> +			kvm_arch_gmem_invalidate(pfn, pfn + folio_nr_pages(folio));
> +		}
> +
> +		folio_batch_release(&fbatch);
> +		cond_resched();
> +	}
> +}
> +#else
> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> +#endif
> +
> +static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> +				     size_t nr_pages, uint64_t attrs,
> +				     pgoff_t *err_index)
> +{
> +	bool to_private = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +	struct address_space *mapping = inode->i_mapping;
> +	struct gmem_inode *gi = GMEM_I(inode);
> +	pgoff_t end = start + nr_pages;
> +	struct maple_tree *mt;
> +	struct ma_state mas;
> +	int r;
> +
> +	mt = &gi->attributes;
> +
> +	filemap_invalidate_lock(mapping);
> +
> +	mas_init(&mas, mt, start);
> +
> +	if (kvm_gmem_range_has_attributes(mt, start, nr_pages, attrs)) {
> +		r = 0;
> +		goto out;
> +	}
> +
> +	r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> +	if (r) {
> +		*err_index = start;
> +		goto out;
> +	}
> +
> +	if (to_private) {
> +		unmap_mapping_pages(mapping, start, nr_pages, false);
> +
> +		if (!kvm_gmem_is_safe_for_conversion(inode, start, nr_pages,
> +						     err_index)) {
> +			mas_destroy(&mas);
> +			r = -EAGAIN;
> +			goto out;
> +		}
> +	}
> +
> +	/*
> +	 * From this point on guest_memfd has performed necessary
> +	 * checks and can proceed to do guest-breaking changes.
> +	 */
> +
> +	kvm_gmem_invalidate_begin(inode, start, end);
> +
> +	if (!to_private)
> +		kvm_gmem_invalidate(inode, start, end);
> +
> +	mas_store_prealloc(&mas, xa_mk_value(attrs));
> +
> +	kvm_gmem_invalidate_end(inode, start, end);
> +out:
> +	filemap_invalidate_unlock(mapping);
> +	return r;
> +}
> +
> +static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
> +{
> +	struct gmem_file *f = file->private_data;
> +	struct inode *inode = file_inode(file);
> +	struct kvm_memory_attributes2 attrs;
> +	pgoff_t err_index;
> +	size_t nr_pages;
> +	pgoff_t index;
> +	int i, r;
> +
> +	if (copy_from_user(&attrs, argp, sizeof(attrs)))
> +		return -EFAULT;
> +
> +	if (attrs.flags)
> +		return -EINVAL;
> +	if (attrs.error_offset)
> +		return -EINVAL;
> +	for (i = 0; i < ARRAY_SIZE(attrs.reserved); i++) {
> +		if (attrs.reserved[i])
> +			return -EINVAL;
> +	}
> +	if (attrs.attributes & ~kvm_supported_mem_attributes(f->kvm))
> +		return -EINVAL;
> +	if (attrs.size == 0 || attrs.offset + attrs.size < attrs.offset)
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(attrs.offset) || !PAGE_ALIGNED(attrs.size))
> +		return -EINVAL;
> +
> +	if (attrs.offset >= inode->i_size ||
> +	    attrs.offset + attrs.size > inode->i_size)
> +		return -EINVAL;
> +
> +	nr_pages = attrs.size >> PAGE_SHIFT;
> +	index = attrs.offset >> PAGE_SHIFT;
> +	r = __kvm_gmem_set_attributes(inode, index, nr_pages, attrs.attributes,
> +				      &err_index);
> +	if (r) {
> +		attrs.error_offset = ((uint64_t)err_index) << PAGE_SHIFT;
> +
> +		if (copy_to_user(argp, &attrs, sizeof(attrs)))
> +			return -EFAULT;
> +	}
> +
> +	return r;
> +}
> +
> +static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
> +			   unsigned long arg)
> +{
> +	switch (ioctl) {
> +	case KVM_SET_MEMORY_ATTRIBUTES2:
> +		if (vm_memory_attributes)
> +			return -ENOTTY;
> +
> +		return kvm_gmem_set_attributes(file, (void __user *)arg);
> +	default:
> +		return -ENOTTY;
> +	}
> +}
> +
> +
>  static struct file_operations kvm_gmem_fops = {
>  	.mmap		= kvm_gmem_mmap,
>  	.open		= generic_file_open,
>  	.release	= kvm_gmem_release,
>  	.fallocate	= kvm_gmem_fallocate,
> +	.unlocked_ioctl	= kvm_gmem_ioctl,
>  };
>  
>  static int kvm_gmem_migrate_folio(struct address_space *mapping,
> @@ -942,20 +1168,13 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>  static bool kvm_gmem_range_is_private(struct gmem_inode *gi, pgoff_t index,
>  				      size_t nr_pages, struct kvm *kvm, gfn_t gfn)
>  {
> -	pgoff_t end = index + nr_pages - 1;
> -	void *entry;
> -
>  	if (vm_memory_attributes)
>  		return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
>  						       KVM_MEMORY_ATTRIBUTE_PRIVATE,
>  						       KVM_MEMORY_ATTRIBUTE_PRIVATE);
>  
> -	mt_for_each(&gi->attributes, entry, index, end) {
> -		if (xa_to_value(entry) != KVM_MEMORY_ATTRIBUTE_PRIVATE)
> -			return false;
> -	}
> -
> -	return true;
> +	return kvm_gmem_range_has_attributes(&gi->attributes, index, nr_pages,
> +					     KVM_MEMORY_ATTRIBUTE_PRIVATE);
>  }
>  
>  static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 3c261904322f0..85c14197587d4 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2435,16 +2435,6 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>  
>  #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> -static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> -{
> -#ifdef kvm_arch_has_private_mem
> -	if (!kvm || kvm_arch_has_private_mem(kvm))
> -		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> -#endif
> -
> -	return 0;
> -}
> -
>  #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>  static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
>  {
> @@ -2635,6 +2625,8 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>  		return -EINVAL;
>  	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
>  		return -EINVAL;
> +	if (attrs->error_offset)
> +		return -EINVAL;
>  	for (i = 0; i < ARRAY_SIZE(attrs->reserved); i++) {
>  		if (attrs->reserved[i])
>  			return -EINVAL;
> @@ -4983,6 +4975,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  		return 1;
>  	case KVM_CAP_GUEST_MEMFD_FLAGS:
>  		return kvm_gmem_get_supported_flags(kvm);
> +	case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
> +		if (vm_memory_attributes)
> +			return 0;
> +
> +		return kvm_supported_mem_attributes(kvm);

Based on the discussion from the PUCK call this morning, it sounds like it
would be a good idea to limit kvm_supported_mem_attributes() to only
reporting KVM_MEMORY_ATTRIBUTE_PRIVATE if the underlying CoCo
implementation has all the necessary enablement to support in-place
conversion via guest_memfd. In the case of SNP, there is a
documentation/parameter check in snp_launch_update() that needs to be
relaxed in order for userspace to be able to pass in a NULL 'src'
parameter (since, for in-place conversion, it would be initialized in place
as shared memory prior to the call, since by the time kvm_gmem_poulate()
it will have been set to private and therefore cannot be faulted in via
GUP (and if it could, we'd be unecessarily copying the src back on top
of itself since src/dst are the same).

So maybe there should be an arch hook to check a whitelist of VM types
that support KVM_MEMORY_ATTRIBUTE_PRIVATE when vm_memory_attributes=0,
and if we decide to enable it for SNP as part of this series you could
include the 1-2 patches needed there, or I could enable the SNP support
separately as a small series and I guess that would then become a prereq
for the SNP self-tests?

Not sure if additional enablement is needed for TDX or not before
KVM_MEMORY_ATTRIBUTE_PRIVATE would be advertised, but similar
considerations there.

-Mike

>  #endif
>  	default:
>  		break;
> 
> -- 
> 2.53.0.1018.g2bb0e51243-goog
> 

^ permalink raw reply

* Re: [PATCH RFC v4 07/44] KVM: guest_memfd: Only prepare folios for private pages
From: Michael Roth @ 2026-04-01 15:16 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm
In-Reply-To: <CAEvNRgF+FjJ1EWSR_rzD1=N040ZitiRrM2O3N0Kj5yN5rT3h+Q@mail.gmail.com>

On Wed, Apr 01, 2026 at 07:05:16AM -0700, Ackerley Tng wrote:
> Ackerley Tng <ackerleytng@google.com> writes:
> 
> > All-shared guest_memfd used to be only supported for non-CoCo VMs where
> > preparation doesn't apply. INIT_SHARED is about to be supported for
> > non-CoCo VMs in a later patch in this series.
> >
> > In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
> > guest_memfd in a later patch in this series.
> >
> > This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
> > shared folio for a CoCo VM where preparation applies.
> >
> > Add a check to make sure that preparation is only performed for private
> > folios.
> >
> > Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
> > conversion to shared.
> >
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > ---
> >  virt/kvm/guest_memfd.c | 9 ++++++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index b6ffa8734175d..d414ebfcb4c19 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -900,6 +900,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >  		     int *max_order)
> >  {
> >  	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > +	struct inode *inode;
> >  	struct folio *folio;
> >  	int r = 0;
> >
> > @@ -907,7 +908,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >  	if (!file)
> >  		return -EFAULT;
> >
> > -	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > +	inode = file_inode(file);
> > +	filemap_invalidate_lock_shared(inode->i_mapping);
> >
> >  	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> >  	if (IS_ERR(folio)) {
> > @@ -920,7 +922,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >  		folio_mark_uptodate(folio);
> >  	}
> >
> > -	r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> > +	if (kvm_gmem_is_private_mem(inode, index))
> > +		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> 
> Michael, I might have misunderstood you at the last guest_memfd call:
> sev_gmem_prepare() doesn't prepare a page for being a shared page,
> right? Does this work? That prepare is only called to "make private"?

Hmm, I guess your guest_memfd-inplace-conversion-v4 branch is out of sync with
these patches?

I have the below local patch based on top of that for SNP-specific enablement,
which is basically identically, so suffice to say: yes, this should work
for SNP :) If any architecture pops up that needs to do some prep in
advance of mapping shared pages, then we could potentially plumb the
shared/private flag through to the arch-specific prep hook, as was also
suggested on the call, but it doesn't seem like that's needed by any
users for now.

-Mike

  Author: Michael Roth <michael.roth@amd.com>
  Date:   Mon Oct 27 07:58:32 2025 -0500
  
      KVM: guest_memfd: Don't prepare shared folios
  
      In the current guest_memfd logic, "preparation" is only used currently
      to describe the additional work of putting a guest_memfd page into an
      architecturally-defined "private" state, such as updating RMP table
      entries for SEV-SNP guests. As such, there's no input to the
      corresponding kvm_arch_gmem_prepare() hooks as to whether a page is
      being prepared/accessed as shared or as private, so "preparation" will
      end up being erroneously done on pages that were supposed to remain in a
      shared state. Rather than plumb through the additional information
      needed to distinguish between shared vs. private preparation, just
      continue to only do preparation on private pages, as was the case prior
      to support for GUEST_MEMFD_FLAG_MMAP being introduced.
  
      Signed-off-by: Michael Roth <michael.roth@amd.com>
  
  diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
  index 3acc6d983449..4869e59e4fc5 100644
  --- a/virt/kvm/guest_memfd.c
  +++ b/virt/kvm/guest_memfd.c
  @@ -1249,7 +1249,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
                  folio_mark_uptodate(folio);
          }
  
  -       r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
  +       if (!kvm_gmem_is_shared_mem(file_inode(file), index))
  +               r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
  
          folio_unlock(folio);

> 
> >
> >  	folio_unlock(folio);
> >
> > @@ -930,7 +933,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >  		folio_put(folio);
> >
> >  out:
> > -	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> > +	filemap_invalidate_unlock_shared(inode->i_mapping);
> >  	return r;
> >  }
> >  EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
> >
> > --
> > 2.53.0.1018.g2bb0e51243-goog

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox