Linux Trace Kernel


* [PATCHv5 01/13] uprobes/x86: Use proper mm_struct in __in_uprobe_trampoline
From: Jiri Olsa @ 2026-07-01 11:13 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko
  Cc: bpf, linux-trace-kernel
In-Reply-To: <20260701111337.53943-1-jolsa@kernel.org>

In the unregister path we use __in_uprobe_trampoline check with
current->mm for the VMA lookup, which is wrong, because we are
in the tracer context, not the traced process.

Add an mm_struct pointer argument to __in_uprobe_trampoline and
change the related callers to pass the proper mm_struct pointer.

Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/kernel/uprobes.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index ebb1baf1eb1d..2be6707e3320 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -761,9 +761,9 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
 		destroy_uprobe_trampoline(tramp);
 }
 
-static bool __in_uprobe_trampoline(unsigned long ip)
+static bool __in_uprobe_trampoline(struct mm_struct *mm, unsigned long ip)
 {
-	struct vm_area_struct *vma = vma_lookup(current->mm, ip);
+	struct vm_area_struct *vma = vma_lookup(mm, ip);
 
 	return vma && vma_is_special_mapping(vma, &tramp_mapping);
 }
@@ -776,14 +776,14 @@ static bool in_uprobe_trampoline(unsigned long ip)
 
 	rcu_read_lock();
 	if (mmap_lock_speculate_try_begin(mm, &seq)) {
-		found = __in_uprobe_trampoline(ip);
+		found = __in_uprobe_trampoline(mm, ip);
 		retry = mmap_lock_speculate_retry(mm, seq);
 	}
 	rcu_read_unlock();
 
 	if (retry) {
 		mmap_read_lock(mm);
-		found = __in_uprobe_trampoline(ip);
+		found = __in_uprobe_trampoline(mm, ip);
 		mmap_read_unlock(mm);
 	}
 	return found;
@@ -1044,7 +1044,7 @@ static int copy_from_vaddr(struct mm_struct *mm, unsigned long vaddr, void *dst,
 	return 0;
 }
 
-static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
+static bool __is_optimized(struct mm_struct *mm, uprobe_opcode_t *insn, unsigned long vaddr)
 {
 	struct __packed __arch_relative_insn {
 		u8 op;
@@ -1053,7 +1053,7 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
 
 	if (!is_call_insn(insn))
 		return false;
-	return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
+	return __in_uprobe_trampoline(mm, vaddr + 5 + call->raddr);
 }
 
 static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
@@ -1064,7 +1064,7 @@ static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
 	err = copy_from_vaddr(mm, vaddr, &insn, 5);
 	if (err)
 		return err;
-	return __is_optimized((uprobe_opcode_t *)&insn, vaddr);
+	return __is_optimized(mm, (uprobe_opcode_t *)&insn, vaddr);
 }
 
 static bool should_optimize(struct arch_uprobe *auprobe)
-- 
2.54.0
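
For readers following the __is_optimized() hunk above: the address that
is looked up in the given mm is the target of a 5-byte x86 "call rel32",
that is vaddr + 5 + rel32. A minimal userspace sketch of that arithmetic
follows. The demo_* names and the CALL_INSN_* constants are made up for
this illustration and are not the kernel's helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CALL_INSN_OPCODE 0xE8	/* x86 "call rel32" opcode */
#define CALL_INSN_SIZE   5	/* 1 opcode byte + 4 displacement bytes */

/*
 * Hypothetical helper: if insn is a call rel32 located at vaddr, store
 * the call target in *target and return 1, otherwise return 0.
 */
static int demo_call_target(const uint8_t insn[CALL_INSN_SIZE],
			    uint64_t vaddr, uint64_t *target)
{
	uint32_t u;
	int32_t raddr;

	assert(target != NULL);
	if (insn[0] != CALL_INSN_OPCODE)
		return 0;

	/* The displacement is stored little-endian after the opcode. */
	u = (uint32_t)insn[1] | ((uint32_t)insn[2] << 8) |
	    ((uint32_t)insn[3] << 16) | ((uint32_t)insn[4] << 24);
	raddr = (int32_t)u;

	/* The displacement is relative to the end of the instruction. */
	*target = vaddr + CALL_INSN_SIZE + (uint64_t)(int64_t)raddr;
	return 1;
}
```

In the patch the resulting address is then checked against the
trampoline mapping of the mm that was passed in, rather than
current->mm.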



* [PATCHv5 00/13] uprobes/x86: Fix red zone issue for optimized uprobes
From: Jiri Olsa @ 2026-07-01 11:13 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko
  Cc: bpf, linux-trace-kernel

hi,
Andrii reported an issue with optimized uprobes [1]: the call
instruction stores its return address on the stack and can clobber the
red zone, where user code may keep temporary data without adjusting
rsp.

Fix this by moving the optimized uprobe on top of a 10-byte nop
instruction, so we can squeeze in another instruction to escape the
red zone before doing the call.
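
The clobbering can be illustrated with a small userspace model. This is
only a sketch under stated assumptions: the stack is a plain byte array,
the red zone is the 128 bytes below rsp as defined by the x86-64 SysV
ABI, and the "extra instruction" is modelled as simply moving rsp down
by 128 before the call. All demo_* names are hypothetical and the real
instruction sequence is whatever the series emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RED_ZONE_SIZE 128	/* x86-64 SysV ABI: usable bytes below rsp */
#define STACK_SIZE    512

struct demo_stack {
	uint8_t mem[STACK_SIZE];
	size_t rsp;		/* index into mem; grows toward index 0 */
};

/* Model of "call": push an 8-byte return address just below rsp. */
static void demo_call(struct demo_stack *s, uint64_t ret)
{
	assert(s->rsp >= sizeof(ret));
	s->rsp -= sizeof(ret);
	memcpy(&s->mem[s->rsp], &ret, sizeof(ret));
}

/* Model of the extra instruction: step over the red zone first. */
static void demo_skip_red_zone(struct demo_stack *s)
{
	assert(s->rsp >= RED_ZONE_SIZE);
	s->rsp -= RED_ZONE_SIZE;
}

/* Return 1 if the 128 bytes below orig_rsp still hold the fill byte. */
static int demo_red_zone_intact(const struct demo_stack *s,
				size_t orig_rsp, uint8_t fill)
{
	for (size_t i = orig_rsp - RED_ZONE_SIZE; i < orig_rsp; i++)
		if (s->mem[i] != fill)
			return 0;
	return 1;
}
```

Calling demo_call() directly overwrites the top 8 bytes of the red
zone, while demo_skip_red_zone() followed by demo_call() leaves those
128 bytes untouched.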

Note we need upstream update first for patch 3 (github.com/libbpf/usdt),
if we decide to take this change.

thanks,
jirka


v1: https://lore.kernel.org/bpf/20260514135342.22130-1-jolsa@kernel.org/
v2: https://lore.kernel.org/bpf/20260518105957.123445-1-jolsa@kernel.org/
v3: https://lore.kernel.org/bpf/20260521124411.31133-1-jolsa@kernel.org/
v4: https://lore.kernel.org/bpf/20260526205840.173790-1-jolsa@kernel.org/

v5 changes:
- several selftests changes and reviewed-by tags [Jakub]
- add more comments in int3_update_unoptimize [Andrii]
- several other minor changes and acks [Oleg]
- move insn_decode out of uprobe_init_insn to simplify the code
- align uprobe_red_zone_test to 64 to make sure nop10 is not on page boundary

v4 changes:
- do not use 2nd int3 (on +5 offset) because the call instruction
  is always the same for the given nop10 address [Andrii/Peter]
- unmap unused trampoline vma after unsuccessful optimization [sashiko]
- small change to patch#2 moved user_64bit_mode earlier in the path
  and pass/use mm_struct pointer directly from arch_uprobe_optimize
  instead of getting current->mm
  Andrii, keeping your ack, please shout otherwise

v3 changes:
- use nop10 update suggested by Peter in [2]
- remove struct uprobe_trampoline object, use vma objects directly instead
- selftests fixes [sashiko]
- ack from Andrii

v2 changes:
- several selftest fixes [sashiko]
- consolidate is_lea_insn and is_call_insn into a single check [Jakub Sitnicki]
- use proper mm_struct object in __in_uprobe_trampoline check [sashiko]
- allow to copy uprobe trampolines vma objects on fork [sashiko]
- change uprobe syscall detection error from -ENXIO to -EPROTO [Andrii]
- added fork/clone tests
- I kept the selftest changes and nop5->nop10 changes in separate
  commits for easier review, we can squash them later if we want to keep
  bisect working properly


[1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
[2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
---
Andrii Nakryiko (1):
      selftests/bpf: Add tests for uprobe nop10 red zone clobbering

Jiri Olsa (12):
      uprobes/x86: Use proper mm_struct in __in_uprobe_trampoline
      uprobes/x86: Remove struct uprobe_trampoline object
      uprobes/x86: Do not leak trampoline vma mapping on optimization failure
      uprobes/x86: Allow to copy uprobe trampolines on fork
      uprobes/x86: Move optimized uprobe from nop5 to nop10
      libbpf: Change has_nop_combo to work on top of nop10
      libbpf: Detect uprobe syscall with new error
      selftests/bpf: Emit nop,nop10 instructions combo for x86_64 arch
      selftests/bpf: Change uprobe syscall tests to use nop10
      selftests/bpf: Change uprobe/usdt trigger bench code to use nop10
      selftests/bpf: Add reattach tests for uprobe syscall
      selftests/bpf: Add tests for forked/cloned optimized uprobes

 arch/x86/kernel/uprobes.c                               | 416 +++++++++++++++++++++++++++++++++++++++++++-----------------------------
 include/linux/uprobes.h                                 |   5 -
 kernel/events/uprobes.c                                 |  10 --
 kernel/fork.c                                           |   1 -
 tools/lib/bpf/features.c                                |   4 +-
 tools/lib/bpf/usdt.c                                    |  16 +--
 tools/testing/selftests/bpf/bench.c                     |  20 ++--
 tools/testing/selftests/bpf/benchs/bench_trigger.c      |  38 +++----
 tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh |   2 +-
 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 326 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 tools/testing/selftests/bpf/prog_tests/usdt.c           |  74 +++++++++++--
 tools/testing/selftests/bpf/progs/test_usdt.c           |  25 +++++
 tools/testing/selftests/bpf/usdt.h                      |   2 +-
 tools/testing/selftests/bpf/usdt_2.c                    |  15 ++-
 14 files changed, 698 insertions(+), 256 deletions(-)


* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Kaitao Cheng @ 2026-07-01 11:07 UTC (permalink / raw)
  To: Jani Nikula, David Laight, Christian König,
	David Hildenbrand (Arm), Alexei Starovoitov
  Cc: Andrew Morton, Jens Axboe, Tejun Heo, Alexander Viro,
	Christian Brauner, Daniel Borkmann, Andrii Nakryiko,
	Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
	Paul E. McKenney, Shakeel Butt, David Howells, Simona Vetter,
	Randy Dunlap, Luca Ceresoli, Philipp Stanner, linux-block,
	linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel, io-uring,
	audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng, Muchun Song
In-Reply-To: <734f66ca51485ee3ec9788c0eaaead681e00664b@intel.com>

On 2026/6/25 19:00, Jani Nikula wrote:
> On Thu, 25 Jun 2026, Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>> On 2026/6/24 22:23, David Laight wrote:
>>> On Wed, 24 Jun 2026 15:23:47 +0200
>>> Christian König <christian.koenig@amd.com> wrote:
>>>> On 6/24/26 15:14, Kaitao Cheng wrote:
>>>>> On 2026/6/22 16:42, David Laight wrote:
>>>>>> On Mon, 22 Jun 2026 12:05:31 +0800
>>>>>> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>>>>>  
>>>>>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>>>
>>>>>>> The list_for_each*_safe() helpers are used when the loop body may
>>>>>>> remove the current entry.  Their API exposes the temporary cursor at
>>>>>>> every call site, even though most users only need it for the iterator
>>>>>>> implementation and never reference it in the loop body.
>>>>>>>
>>>>>>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>>>>>>> support both forms: callers may keep passing an explicit temporary cursor
>>>>>>> when they need to inspect or reset it, or omit it and let the helper use
>>>>>>> a unique internal cursor.  
>>>>>>
>>>>>> I'm not really sure 'mutable' means anything either.
>>>>>> It is possible to make it valid for the loop body (or even other threads)
>>>>>> to delete arbitrary list items - but that needs significant extra overheads.
>>>>>>
>>>>>> It might be worth doing something that doesn't need the extra variable,
>>>>>> but there is little point doing all the churn just to rename things.
>>>>>>  
>>>>>>>
>>>>>>> This makes call sites that only mutate the list through the current entry
>>>>>>> less noisy, while keeping the existing *_safe() helpers available for
>>>>>>> compatibility.
>>>>>>>
>>>>>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>>> ---
>>>>>>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>>>>>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/list.h b/include/linux/list.h
>>>>>>> index 09d979976b3b..1081def7cea9 100644
>>>>>>> --- a/include/linux/list.h
>>>>>>> +++ b/include/linux/list.h
>>>>>>> @@ -7,6 +7,7 @@
>>>>>>>  #include <linux/stddef.h>
>>>>>>>  #include <linux/poison.h>
>>>>>>>  #include <linux/const.h>
>>>>>>> +#include <linux/args.h>
>>>>>>>  
>>>>>>>  #include <asm/barrier.h>
>>>>>>>  
>>>>>>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>>>>>>  #define list_for_each_prev(pos, head) \
>>>>>>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>>>>>>  
>>>>>>> -/**
>>>>>>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>>>>>>> - * @pos:	the &struct list_head to use as a loop cursor.
>>>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>>>> - * @head:	the head for your list.
>>>>>>> +/*
>>>>>>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>>>>>>   */
>>>>>>>  #define list_for_each_safe(pos, n, head) \
>>>>>>>  	for (pos = (head)->next, n = pos->next; \
>>>>>>>  	     !list_is_head(pos, (head)); \
>>>>>>>  	     pos = n, n = pos->next)
>>>>>>>  
>>>>>>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>>>>>>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\  
>>>>>>
>>>>>> Use auto
>>>>>>  
>>>>>>> +	     !list_is_head(pos, (head));				\
>>>>>>> +	     pos = tmp, tmp = pos->next)
>>>>>>> +
>>>>>>> +#define __list_for_each_mutable1(pos, head)				\
>>>>>>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>>>>>>> +
>>>>>>> +#define __list_for_each_mutable2(pos, next, head)			\
>>>>>>> +	list_for_each_safe(pos, next, head)
>>>>>>> +
>>>>>>>  /**
>>>>>>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>>>>>>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>>>>>>   * @pos:	the &struct list_head to use as a loop cursor.
>>>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>>>> - * @head:	the head for your list.
>>>>>>> + * @...:	either (head) or (next, head)
>>>>>>> + *
>>>>>>> + * next:	another &struct list_head to use as optional temporary storage.
>>>>>>> + *		The temporary cursor is internal unless explicitly supplied by
>>>>>>> + *		the caller.
>>>>>>> + * head:	the head for your list.
>>>>>>> + */
>>>>>>> +#define list_for_each_mutable(pos, ...)					\
>>>>>>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>>>>>>> +		(pos, __VA_ARGS__)  
>>>>>>
>>>>>> The variable argument count logic really just slows down compilation.
>>>>>> Maybe there aren't enough copies of this code to make that significant.
>>>>>> But just because you can do it doesn't mean it is a good idea.
>>>>>> I'm also not sure it really adds anything to the readability.
>>>>>>
>>>>>> And, if you are going to make the middle argument optional there is
>>>>>> no need to change the macro name.  
>>>>>
>>>>> Christian König and Jani Nikula also disagree with the variadic-argument
>>>>> implementation approach. If we abandon that method, it means we will
>>>>> inevitably need to add some new macros. If mutable is not a good name,
>>>>> suggestions for better alternatives would be welcome; coming up with a
>>>>> suitable name is indeed rather tricky.  
>>>>
>>>> I don't think you need to add a new macro for the specific use case that people want to modify the next element of the iteration.
>>>>
>>>> If I remember your numbers correctly that is really a corner case, and continuing to use the existing *_safe() macros for that sounds perfectly fine to me.
>>>
>>> IIRC currently you have a choice of either:
>>> 	define               Item that can't be deleted
>>> 	list_for_each()	     The current item.
>>> 	list_for_each_safe() The next item.
>>> There is also likely to be code that updates the variables to allow
>>> for other scenarios.
>>>
>>> Note that if you increase a reference count and release a lock then list_for_each()
>>> is likely safer than list_for_each_safe() :-)
>>>
>>> list.h has 9 variants of the 'safe' loop.
>>> The bloat of another 9 is getting excessive.
>>>
>>> It has to be said that this is one of my least favourite type of list...
>>
>> Hi Christian König, David Laight, Jani Nikula, David Hildenbrand,
>> Andy Shevchenko, Alexei Starovoitov
>>
>> For ease of discussion, I need to summarize the currently possible
>> approaches and briefly describe their respective pros and cons,
>> using the list_for_each_entry* interfaces as examples.
>>
>> 1. Add list_for_each_entry_mutable, while keeping list_for_each_entry
>> and list_for_each_entry_safe unchanged. list_for_each_entry_mutable
>> would be used specifically for safe deletion scenarios that do not
>> need to expose the temporary cursor externally. The code can refer to
>> the v1 version.
>>
>> Pros: Does not depend on immediate per-subsystem adaptation and can be
>>       merged directly.
>> Cons: Requires adding a whole set of mutable interfaces, which makes the
>>       code somewhat redundant.
> 
> Seems fine, and the original _safe naming is ambiguous anyway.
> 
>> 2. Directly optimize away the temporary cursor in list_for_each_entry_safe
>> and define it inside the loop instead, changing the interface from four
>> arguments to three.
>>
>> Pros: Does not add redundant interfaces.
>> Cons: (1) Users need to manually update special cases that use the
>>       traversal variable of list_for_each_entry_safe, the new
>>       list_for_each_entry_safe would no longer apply there and would
>>       need to be open-coded.
>>       (2) Because the macro arguments change, all list_for_each_entry_safe
>>       callers would need to be modified and merged together, making it
>>       difficult to merge such a large amount of code at once.
> 
> This won't fly because there are literally thousands of
> list_for_each_entry_safe() users.
> 
>> 3. Use a variadic macro approach to optimize list_for_each_entry_safe,
>> so that it supports both three and four arguments.
>>
>> Pros: (1) Does not add redundant interfaces.
>>       (2) Does not depend on immediate per-subsystem adaptation and can
>>       be merged directly.
>> Cons: (1) Increases compile time.
>>       (2) Makes the interface harder for users to use.
> 
> Basically I'm against any variadic macro tricks where the optional
> argument is not the last argument. That's just way too surprising, and
> goes against common practice in just about all other languages.
> 
>> 4. Optimize list_for_each_entry by defining the temporary cursor internally,
>> making it compatible with the functionality of list_for_each_entry_safe.
>> The code can refer to the v2 version.
>>
>> Pros: (1) Does not add redundant interfaces.
>>       (2) The number of externally visible arguments of list_for_each_entry
>>       remains unchanged, still three.
>> Cons: (1) list_for_each_entry and list_for_each_entry_safe would be merged
>>       into one, and list_for_each_entry_safe would gradually be deprecated.
>>       (2) Users need to manually update special cases that use the traversal
>>       variable of list_for_each_entry, the new list_for_each_entry would no
>>       longer apply there and would need to be open-coded. There are 15 such
>>       cases in total.
> 
> This sounds good to me, though I take it there's some code size increase
> and/or performance penalty?
> 
> Maybe the 15 cases are questionable anyway?
> 
>> 5. Use a variadic macro approach to optimize list_for_each_entry, so that
>> it supports both three and four arguments.
>>
>> Pros: (1) Does not add redundant interfaces.
>>       (2) Does not depend on immediate per-subsystem adaptation and can be
>>       merged directly.
>> Cons: (1) Increases compile time.
>>       (2) list_for_each_entry and list_for_each_entry_safe would be merged
>>       into one, and list_for_each_entry_safe would gradually be deprecated.
> 
> Please don't do the macro tricks.
> 
>> 6. Make no changes, keep the current logic unchanged, and close the current
>> email discussion.
> 
> I like hiding the temporary stuff when possible.
> 
> BR,
> Jani.

Hi all,
If there are no objections, I will make the changes using the first approach.


Hi David Laight,
You previously expressed a different opinion. Do you have any further comments
on the current proposed approach?

-- 
Thanks
Kaitao Cheng
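
For reference, the difference between the existing *_safe() form and
the first approach can be sketched in userspace C. This is an
illustration only: the demo_* names are hypothetical, the list helpers
are a minimal stand-in for include/linux/list.h, and the fixed
demo_tmp name stands in for the unique internal cursor that the quoted
patch generates with __UNIQUE_ID().

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void demo_list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static void demo_list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void demo_list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = NULL;
}

/* Existing style: the caller declares and passes the temporary cursor n. */
#define demo_list_for_each_safe(pos, n, head)				\
	for (pos = (head)->next, n = pos->next;				\
	     pos != (head);						\
	     pos = n, n = pos->next)

/* Sketch of approach 1: the temporary cursor is scoped to the loop. */
#define demo_list_for_each_mutable(pos, head)				\
	for (struct list_head *demo_tmp = (pos = (head)->next)->next;	\
	     pos != (head);						\
	     pos = demo_tmp, demo_tmp = pos->next)

struct demo_item {
	int value;
	struct list_head node;
};

#define demo_entry(ptr) \
	((struct demo_item *)((char *)(ptr) - offsetof(struct demo_item, node)))

/* Unlink every item with an odd value; return how many were removed. */
static int demo_remove_odd(struct list_head *head)
{
	struct list_head *pos;
	int removed = 0;

	demo_list_for_each_mutable(pos, head) {
		if (demo_entry(pos)->value & 1) {
			demo_list_del(pos);	/* safe: next was saved already */
			removed++;
		}
	}
	return removed;
}

/* Count the entries currently linked on the list. */
static int demo_count(const struct list_head *head)
{
	int n = 0;

	for (const struct list_head *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

The call site in demo_remove_odd() needs only the loop cursor, which is
the noise reduction being discussed. Callers that must inspect or reset
the temporary cursor would keep using the *_safe() form.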



* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Xiaoyao Li @ 2026-07-01 11:07 UTC (permalink / raw)
  To: Sean Christopherson, Yan Zhao
  Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <aj7NwCRwWEfLK-gQ@google.com>

On 6/27/2026 3:06 AM, Sean Christopherson wrote:
> On Fri, Jun 26, 2026, Yan Zhao wrote:
>> My first impression of gmem_in_place_conversion=true was that it enforces gmem
>> in-place conversion. However, it actually only enforces per-gmem private/shared
>> attribute.
>> My worry was that people might think it's a kernel bug if userspace can still
>> have shared memory from other sources after they configured
>> gmem_in_place_conversion=true.
> Ah, I see where you're coming from.  FWIW, truly enforcing in-place conversion
> is flat out impossible.  E.g. userspace can simply replace the memslot, at which
> point the memory effectively reverts to shared.

Would something like the below enforce in-place conversion?

Userspace can create a memslot without gmem fd, but that memslot can 
only serve as shared memory and cannot be converted. So it doesn't 
violate the in-place conversion.

--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2122,6 +2122,8 @@ static int kvm_set_memory_region(struct kvm *kvm,
         new->flags = mem->flags;
         new->userspace_addr = mem->userspace_addr;
         if (mem->flags & KVM_MEM_GUEST_MEMFD) {
+               if (gmem_in_place_conversion)
+                       new->flags |= KVM_MEMSLOT_GMEM_ONLY;
                 r = kvm_gmem_bind(kvm, new, mem->guest_memfd, mem->guest_memfd_offset);
                 if (r)
                         goto out;


* Re: [PATCH 2/2] tracing: Keep pid and comm[] in the same structure
From: Steven Rostedt @ 2026-07-01 10:38 UTC (permalink / raw)
  To: David Laight
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Michal Koutný
In-Reply-To: <20260701110407.31f7b6ca@pumpkin>

On Wed, 1 Jul 2026 11:04:07 +0100
David Laight <david.laight.linux@gmail.com> wrote:

> 
> I thought it was just used to do a pid->string lookup when you run 'cat trace'.
> But then I found the code that lets userspace read the table....
> I guess the latter is used by the userspace code that reads the raw trace buffer.

Yes, trace-cmd uses it.

> (I found some instructions that did it that way, the output was unparseable
> when tracing things that are happening on multiple cpu.)
> The userspace code could probably be given comm[] for all the running
> processes and those that exited while tracing_on() set.
> (I didn't see anything that would clear the table when the trace buffer
> was cleared.)

Well, that would break trace-cmd: reading the raw buffers clears the
trace, and trace-cmd reads the saved_cmdlines file *after* it reads the
trace, because the file gets populated during the trace.

-- Steve


* [PATCH 1/1] tracing: Prevent out-of-bounds read in glob matching
From: Ren Wei @ 2026-07-01 10:28 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: rostedt, mhiramat, mathieu.desnoyers, akpm, namhyung, yuantan098,
	yifanwucs, tomapufckgml, zcliangcn, bird, hhhuang, n05ec
In-Reply-To: <cover.1782836943.git.hhhuang@smu.edu.sg>

From: Huihui Huang <hhhuang@smu.edu.sg>

String event fields are not necessarily NUL-terminated, so the filter
predicate functions (filter_pred_string(), filter_pred_strloc() and
filter_pred_strrelloc()) pass the field length to the regex match
callbacks, and the length-aware matchers honour it.

regex_match_glob() was the exception: it ignored the length and called
glob_match(), which scans the string until it hits a NUL byte. Some
string fields are not NUL-terminated. One example is the dynamic char
array of the xfs_* namespace tracepoints, which is copied without a
trailing NUL. For such a field, glob matching reads past the end of
the event field, causing a KASAN slab-out-of-bounds read in
glob_match(), reached via regex_match_glob() and filter_match_preds()
from the xfs_lookup tracepoint.

Add a length-bounded glob_match_len() and use it from regex_match_glob()
so glob matching always stops at the field boundary. The matching loop
is factored into a shared helper so glob_match() keeps its behaviour.

Fixes: 60f1d5e3bac4 ("ftrace: Support full glob matching")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Assisted-by: Codex:GPT-5.4
Signed-off-by: Huihui Huang <hhhuang@smu.edu.sg>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
 include/linux/glob.h               |  1 +
 kernel/trace/trace_events_filter.c |  6 ++----
 lib/glob.c                         | 31 ++++++++++++++++++++++++++++--
 3 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/include/linux/glob.h b/include/linux/glob.h
index 861327b33e..91595e7509 100644
--- a/include/linux/glob.h
+++ b/include/linux/glob.h
@@ -6,5 +6,6 @@
 #include <linux/compiler.h>	/* For __pure */
 
 bool __pure glob_match(char const *pat, char const *str);
+bool __pure glob_match_len(char const *pat, char const *str, size_t len);
 
 #endif	/* _LINUX_GLOB_H */
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 609325f579..6385cd662d 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1056,11 +1056,9 @@ static int regex_match_end(char *str, struct regex *r, int len)
 	return 0;
 }
 
-static int regex_match_glob(char *str, struct regex *r, int len __maybe_unused)
+static int regex_match_glob(char *str, struct regex *r, int len)
 {
-	if (glob_match(r->pattern, str))
-		return 1;
-	return 0;
+	return glob_match_len(r->pattern, str, len) ? 1 : 0;
 }
 
 /**
diff --git a/lib/glob.c b/lib/glob.c
index 7aca76c25b..c80d9dd736 100644
--- a/lib/glob.c
+++ b/lib/glob.c
@@ -11,6 +11,9 @@
 MODULE_DESCRIPTION("glob(7) matching");
 MODULE_LICENSE("Dual MIT/GPL");
 
+static bool __pure glob_match_str(char const *pat, char const *str,
+				  char const *str_end);
+
 /**
  * glob_match - Shell-style pattern matching, like !fnmatch(pat, str, 0)
  * @pat: Shell-style pattern to match, e.g. "*.[ch]".
@@ -40,6 +43,29 @@ MODULE_LICENSE("Dual MIT/GPL");
  * An opening bracket without a matching close is matched literally.
  */
 bool __pure glob_match(char const *pat, char const *str)
+{
+	return glob_match_str(pat, str, NULL);
+}
+EXPORT_SYMBOL(glob_match);
+
+/**
+ * glob_match_len - glob match against a length-bounded string
+ * @pat: Shell-style pattern to match.
+ * @str: String to match.  Need not be NUL-terminated.
+ * @len: Number of bytes of @str that may be read.
+ *
+ * Like glob_match(), but @str is only read up to @len bytes, so it can be
+ * used on buffers that are not NUL-terminated (e.g. trace event fields).
+ * A NUL byte within @len still terminates the string.
+ */
+bool __pure glob_match_len(char const *pat, char const *str, size_t len)
+{
+	return glob_match_str(pat, str, str + len);
+}
+EXPORT_SYMBOL(glob_match_len);
+
+static bool __pure glob_match_str(char const *pat, char const *str,
+				  char const *str_end)
 {
 	/*
 	 * Backtrack to previous * on mismatch and retry starting one
@@ -55,9 +81,11 @@ bool __pure glob_match(char const *pat, char const *str)
 	 * on mismatch, or true after matching the trailing nul bytes.
 	 */
 	for (;;) {
-		unsigned char c = *str++;
+		unsigned char c = (str_end && str >= str_end) ? '\0' : *str;
 		unsigned char d = *pat++;
 
+		str++;
+
 		switch (d) {
 		case '?':	/* Wildcard: anything but nul */
 			if (c == '\0')
@@ -125,4 +153,3 @@ bool __pure glob_match(char const *pat, char const *str)
 		}
 	}
 }
-EXPORT_SYMBOL(glob_match);
-- 
2.50.1
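
The idea behind the bounded read in the patch above (treat any position
at or past the field boundary as a NUL) can be tried out in userspace
with a cut-down matcher. This sketch is not the lib/glob.c
implementation: it handles only '?' and '*', has no character classes
or escapes, uses an index instead of an end pointer, and the demo_*
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Read str[i] only while i is inside the bound; past it, act like NUL. */
static unsigned char demo_bounded_char(const char *str, size_t i, size_t len)
{
	return i < len ? (unsigned char)str[i] : '\0';
}

/* Simplified glob: '?' matches one byte, '*' matches any run of bytes. */
static bool demo_glob_match_len(const char *pat, const char *str, size_t len)
{
	const char *back_pat = NULL;	/* pattern position after last '*' */
	size_t back_i = 0;		/* string index to retry from */
	size_t i = 0;

	assert(pat != NULL && str != NULL);
	for (;;) {
		unsigned char c = demo_bounded_char(str, i, len);
		unsigned char d = (unsigned char)*pat++;

		i++;
		switch (d) {
		case '?':		/* anything but the end of string */
			if (c == '\0')
				return false;
			break;
		case '*':
			if (*pat == '\0')
				return true;	/* trailing '*' takes the rest */
			back_pat = pat;
			back_i = --i;	/* first try a zero-length match */
			break;
		default:
			if (c == d) {
				if (d == '\0')
					return true;
				break;
			}
			if (c == '\0' || !back_pat)
				return false;
			/* Retry with '*' absorbing one more byte. */
			pat = back_pat;
			i = ++back_i;
			break;
		}
	}
}
```

With a 6-byte buffer holding "foobar" and no trailing NUL, the pattern
"foo" matches when len is 3 and "foobar" does not, because the bytes
past the bound are never read.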



* Re: [PATCH v5 3/9] mm: use enum migrate_reason instead of int for migration reason parameters
From: Lorenzo Stoakes @ 2026-07-01 10:23 UTC (permalink / raw)
  To: Ye Liu
  Cc: Muchun Song, Oscar Salvador, Andrew Morton, David Hildenbrand,
	Steven Rostedt, Masami Hiramatsu, Vlastimil Babka, Zi Yan,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Liam R. Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Mathieu Desnoyers, Brendan Jackman, Johannes Weiner, linux-mm,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260701061101.344679-4-ye.liu@linux.dev>

On Wed, Jul 01, 2026 at 02:10:46PM +0800, Ye Liu wrote:
> Replace all 'int reason' function parameters that carry migrate_reason
> values with the proper 'enum migrate_reason' type.  This makes the
> intent explicit and leverages compiler type checking.  The affected
> subsystems are:
>
>   - page_owner: __folio_set_owner_migrate_reason(),
>                 folio_set_owner_migrate_reason()
>   - migrate: migrate_pages(), migrate_pages_sync(),
>              migrate_pages_batch(), migrate_folios_move(),
>              migrate_hugetlbs(), unmap_and_move_huge_page()
>   - hugetlb: move_hugetlb_state(), htlb_allow_alloc_fallback()
>   - trace: mm_migrate_pages and mm_migrate_pages_start events
>
> The 'short last_migrate_reason' struct field and internal helper
> parameter in page_owner are intentionally left as 'short' since they
> store per-page metadata where size matters.

Based on my own personal experience, it's ok to be short ;)

>
> No functional change.
>
> Signed-off-by: Ye Liu <ye.liu@linux.dev>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

1 nit below but otherwise LGTM so:

Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

> ---
>  include/linux/hugetlb.h        |  9 +++++----
>  include/linux/migrate.h        |  6 ++++--
>  include/linux/page_owner.h     |  7 ++++---
>  include/trace/events/migrate.h |  8 ++++----
>  mm/hugetlb.c                   |  3 ++-
>  mm/migrate.c                   | 12 ++++++------
>  mm/page_owner.c                |  2 +-
>  7 files changed, 26 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 2abaf99321e9..fa828232dfcc 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -154,7 +154,8 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
>  bool folio_isolate_hugetlb(struct folio *folio, struct list_head *list);
>  int get_hwpoison_hugetlb_folio(struct folio *folio, bool *hugetlb, bool unpoison);
>  void folio_putback_hugetlb(struct folio *folio);
> -void move_hugetlb_state(struct folio *old_folio, struct folio *new_folio, int reason);
> +void move_hugetlb_state(struct folio *old_folio, struct folio *new_folio,
> +			enum migrate_reason reason);
>  void hugetlb_fix_reserve_counts(struct inode *inode);
>  extern struct mutex *hugetlb_fault_mutex_table;
>  u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
> @@ -424,7 +425,7 @@ static inline void folio_putback_hugetlb(struct folio *folio)
>  }
>
>  static inline void move_hugetlb_state(struct folio *old_folio,
> -					struct folio *new_folio, int reason)
> +					struct folio *new_folio, enum migrate_reason reason)
>  {
>  }
>
> @@ -956,7 +957,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
>  	return modified_mask;
>  }
>
> -static inline bool htlb_allow_alloc_fallback(int reason)
> +static inline bool htlb_allow_alloc_fallback(enum migrate_reason reason)
>  {
>  	bool allowed_fallback = false;
>
> @@ -1238,7 +1239,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
>  	return 0;
>  }
>
> -static inline bool htlb_allow_alloc_fallback(int reason)
> +static inline bool htlb_allow_alloc_fallback(enum migrate_reason reason)
>  {
>  	return false;
>  }
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index d5af2b7f577b..1f83924615d6 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -57,7 +57,8 @@ void putback_movable_pages(struct list_head *l);
>  int migrate_folio(struct address_space *mapping, struct folio *dst,
>  		struct folio *src, enum migrate_mode mode);
>  int migrate_pages(struct list_head *l, new_folio_t new, free_folio_t free,
> -		  unsigned long private, enum migrate_mode mode, int reason,
> +		  unsigned long private, enum migrate_mode mode,
> +		  enum migrate_reason reason,
>  		  unsigned int *ret_succeeded);
>  struct folio *alloc_migration_target(struct folio *src, unsigned long private);
>  bool isolate_movable_ops_page(struct page *page, isolate_mode_t mode);
> @@ -77,7 +78,8 @@ int set_movable_ops(const struct movable_operations *ops, enum pagetype type);
>  static inline void putback_movable_pages(struct list_head *l) {}
>  static inline int migrate_pages(struct list_head *l, new_folio_t new,
>  		free_folio_t free, unsigned long private,
> -		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
> +		enum migrate_mode mode, enum migrate_reason reason,
> +		unsigned int *ret_succeeded)
>  	{ return -ENOSYS; }
>  static inline struct folio *alloc_migration_target(struct folio *src,
>  		unsigned long private)
> diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
> index 3328357f6dba..9fe51dfccf26 100644
> --- a/include/linux/page_owner.h
> +++ b/include/linux/page_owner.h
> @@ -3,6 +3,7 @@
>  #define __LINUX_PAGE_OWNER_H
>
>  #include <linux/jump_label.h>
> +#include <linux/migrate_mode.h>
>
>  #ifdef CONFIG_PAGE_OWNER
>  extern struct static_key_false page_owner_inited;
> @@ -14,7 +15,7 @@ extern void __set_page_owner(struct page *page,
>  extern void __split_page_owner(struct page *page, int old_order,
>  			int new_order);
>  extern void __folio_copy_owner(struct folio *newfolio, struct folio *old);
> -extern void __folio_set_owner_migrate_reason(struct folio *folio, int reason);
> +extern void __folio_set_owner_migrate_reason(struct folio *folio, enum migrate_reason reason);

NIT: As a rule we drop externs when we change a declaration; the extern is unnecessary.

>  extern void __dump_page_owner(const struct page *page);
>  extern void pagetypeinfo_showmixedcount_print(struct seq_file *m,
>  					pg_data_t *pgdat, struct zone *zone);
> @@ -43,7 +44,7 @@ static inline void folio_copy_owner(struct folio *newfolio, struct folio *old)
>  	if (static_branch_unlikely(&page_owner_inited))
>  		__folio_copy_owner(newfolio, old);
>  }
> -static inline void folio_set_owner_migrate_reason(struct folio *folio, int reason)
> +static inline void folio_set_owner_migrate_reason(struct folio *folio, enum migrate_reason reason)
>  {
>  	if (static_branch_unlikely(&page_owner_inited))
>  		__folio_set_owner_migrate_reason(folio, reason);
> @@ -68,7 +69,7 @@ static inline void split_page_owner(struct page *page, int old_order,
>  static inline void folio_copy_owner(struct folio *newfolio, struct folio *folio)
>  {
>  }
> -static inline void folio_set_owner_migrate_reason(struct folio *folio, int reason)
> +static inline void folio_set_owner_migrate_reason(struct folio *folio, enum migrate_reason reason)
>  {
>  }
>  static inline void dump_page_owner(const struct page *page)
> diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> index 11bc0aa14c7e..15ee2ef201b5 100644
> --- a/include/trace/events/migrate.h
> +++ b/include/trace/events/migrate.h
> @@ -52,7 +52,7 @@ TRACE_EVENT(mm_migrate_pages,
>  	TP_PROTO(unsigned long succeeded, unsigned long failed,
>  		 unsigned long thp_succeeded, unsigned long thp_failed,
>  		 unsigned long thp_split, unsigned long large_folio_split,
> -		 enum migrate_mode mode, int reason),
> +		 enum migrate_mode mode, enum migrate_reason reason),
>
>  	TP_ARGS(succeeded, failed, thp_succeeded, thp_failed,
>  		thp_split, large_folio_split, mode, reason),
> @@ -65,7 +65,7 @@ TRACE_EVENT(mm_migrate_pages,
>  		__field(	unsigned long,		thp_split)
>  		__field(	unsigned long,		large_folio_split)
>  		__field(	enum migrate_mode,	mode)
> -		__field(	int,			reason)
> +		__field(	enum migrate_reason,	reason)
>  	),
>
>  	TP_fast_assign(
> @@ -92,13 +92,13 @@ TRACE_EVENT(mm_migrate_pages,
>
>  TRACE_EVENT(mm_migrate_pages_start,
>
> -	TP_PROTO(enum migrate_mode mode, int reason),
> +	TP_PROTO(enum migrate_mode mode, enum migrate_reason reason),
>
>  	TP_ARGS(mode, reason),
>
>  	TP_STRUCT__entry(
>  		__field(enum migrate_mode, mode)
> -		__field(int, reason)
> +		__field(enum migrate_reason, reason)
>  	),
>
>  	TP_fast_assign(
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 571212b80835..17732d1fdc5e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -7182,7 +7182,8 @@ void folio_putback_hugetlb(struct folio *folio)
>  	folio_put(folio);
>  }
>
> -void move_hugetlb_state(struct folio *old_folio, struct folio *new_folio, int reason)
> +void move_hugetlb_state(struct folio *old_folio, struct folio *new_folio,
> +			enum migrate_reason reason)
>  {
>  	struct hstate *h = folio_hstate(old_folio);
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d9b23909d716..49e10feeb094 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1469,7 +1469,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
>  static int unmap_and_move_huge_page(new_folio_t get_new_folio,
>  		free_folio_t put_new_folio, unsigned long private,
>  		struct folio *src, int force, enum migrate_mode mode,
> -		int reason, struct list_head *ret)
> +		enum migrate_reason reason, struct list_head *ret)
>  {
>  	struct folio *dst;
>  	int rc = -EAGAIN;
> @@ -1626,7 +1626,7 @@ struct migrate_pages_stats {
>   */
>  static int migrate_hugetlbs(struct list_head *from, new_folio_t get_new_folio,
>  			    free_folio_t put_new_folio, unsigned long private,
> -			    enum migrate_mode mode, int reason,
> +			    enum migrate_mode mode, enum migrate_reason reason,
>  			    struct migrate_pages_stats *stats,
>  			    struct list_head *ret_folios)
>  {
> @@ -1716,7 +1716,7 @@ static int migrate_hugetlbs(struct list_head *from, new_folio_t get_new_folio,
>  static void migrate_folios_move(struct list_head *src_folios,
>  		struct list_head *dst_folios,
>  		free_folio_t put_new_folio, unsigned long private,
> -		enum migrate_mode mode, int reason,
> +		enum migrate_mode mode, enum migrate_reason reason,
>  		struct list_head *ret_folios,
>  		struct migrate_pages_stats *stats,
>  		int *retry, int *thp_retry, int *nr_failed,
> @@ -1799,7 +1799,7 @@ static void migrate_folios_undo(struct list_head *src_folios,
>   */
>  static int migrate_pages_batch(struct list_head *from,
>  		new_folio_t get_new_folio, free_folio_t put_new_folio,
> -		unsigned long private, enum migrate_mode mode, int reason,
> +		unsigned long private, enum migrate_mode mode, enum migrate_reason reason,
>  		struct list_head *ret_folios, struct list_head *split_folios,
>  		struct migrate_pages_stats *stats, int nr_pass)
>  {
> @@ -2011,7 +2011,7 @@ static int migrate_pages_batch(struct list_head *from,
>
>  static int migrate_pages_sync(struct list_head *from, new_folio_t get_new_folio,
>  		free_folio_t put_new_folio, unsigned long private,
> -		enum migrate_mode mode, int reason,
> +		enum migrate_mode mode, enum migrate_reason reason,
>  		struct list_head *ret_folios, struct list_head *split_folios,
>  		struct migrate_pages_stats *stats)
>  {
> @@ -2088,7 +2088,7 @@ static int migrate_pages_sync(struct list_head *from, new_folio_t get_new_folio,
>   */
>  int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
>  		free_folio_t put_new_folio, unsigned long private,
> -		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
> +		enum migrate_mode mode, enum migrate_reason reason, unsigned int *ret_succeeded)
>  {
>  	int rc, rc_gather;
>  	int nr_pages;
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index c2f43ab860eb..4e352941a6e2 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -345,7 +345,7 @@ noinline void __set_page_owner(struct page *page, unsigned short order,
>  	inc_stack_record_count(handle, gfp_mask, 1 << order);
>  }
>
> -void __folio_set_owner_migrate_reason(struct folio *folio, int reason)
> +void __folio_set_owner_migrate_reason(struct folio *folio, enum migrate_reason reason)
>  {
>  	struct page_ext *page_ext = page_ext_get(&folio->page);
>  	struct page_owner *page_owner;
> --
> 2.43.0
>

Cheers, Lorenzo

* Re: [PATCH 12/30] mm/vma: clean up anon_vma_compatible()
From: Lorenzo Stoakes @ 2026-07-01 10:20 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Russell King, Dinh Nguyen, Simon Schuster,
	James E . J . Bottomley, Helge Deller, Jarkko Sakkinen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Ian Abbott, H Hartley Sweeten, Lucas Stach, David Airlie,
	Simona Vetter, Patrik Jakobsson, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Rob Clark, Dmitry Baryshkov, Tomi Valkeinen,
	Thierry Reding, Mikko Perttunen, Jonathan Hunter,
	Christian Koenig, Huang Rui, Ankit Agrawal, Alex Williamson,
	Alexander Viro, Christian Brauner, Dan Williams, Muchun Song,
	Oscar Salvador, David Hildenbrand, Suren Baghdasaryan,
	Liam R . Howlett, Matthew Wilcox, Marek Szyprowski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Masami Hiramatsu, Oleg Nesterov, Steven Rostedt, SeongJae Park,
	Miaohe Lin, Hugh Dickins, Mike Rapoport, Kees Cook, Paolo Bonzini,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-sgx, etnaviv,
	dri-devel, linux-arm-msm, freedreno, linux-tegra, kvm,
	linux-fsdevel, nvdimm, linux-mm, iommu, linux-perf-users,
	linux-trace-kernel, kasan-dev, damon, Rik van Riel, Harry Yoo,
	Jann Horn
In-Reply-To: <akPwbHbGiF1FxL6m@pedro-suse.lan>

On Tue, Jun 30, 2026 at 05:36:18PM +0100, Pedro Falcato wrote:
> On Mon, Jun 29, 2026 at 01:23:23PM +0100, Lorenzo Stoakes wrote:
> > Break up the existing very large conditional, add comments and use
> > vma_[start/end]_pgoff() to make clearer what we're doing here.
> >
> > No functional change intended.
> >
> > Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
> > ---
> >  mm/vma.c | 21 ++++++++++++++++-----
> >  1 file changed, 16 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index b60375c6c5c3..6296acecf3b7 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -1967,14 +1967,25 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *
> >  {
> >  	vma_flags_t diff = vma_flags_diff_pair(&a->flags, &b->flags);
> >
> > +	/* Ignore flags that mprotect() can change. */
> >  	vma_flags_clear_mask(&diff, VMA_ACCESS_FLAGS);
> > +	/* Ignore flags that do not impact merging. */
> >  	vma_flags_clear_mask(&diff, VMA_IGNORE_MERGE_FLAGS);
> >
> > -	return a->vm_end == b->vm_start &&
> > -		mpol_equal(vma_policy(a), vma_policy(b)) &&
> > -		a->vm_file == b->vm_file &&
> > -		vma_flags_empty(&diff) &&
> > -		b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
> > +	/* Must be adjacent. */
> > +	if (a->vm_end != b->vm_start)
> > +		return false;
> > +	/* Must have matching policy. */
> > +	if (!mpol_equal(vma_policy(a), vma_policy(b)))
> > +		return false;
> > +	/* Must both be anon or map the same file (MAP_PRIVATE case). */
> > +	if (a->vm_file != b->vm_file)
> > +		return false;
> > +	/* Flags must be equivalent modulo mprotect(). */
> > +	if (!vma_flags_empty(&diff))
> > +		return false;
> > +	/* Page offset must align. */
> > +	return vma_end_pgoff(a) == vma_start_pgoff(b);
>
> Very nice.
>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>

Thanks :)

>
> --
> Pedro

Cheers, Lorenzo
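
A rough userspace sketch of why the final check in the patch above can be
written as vma_end_pgoff(a) == vma_start_pgoff(b): once adjacency
(a->vm_end == b->vm_start) has been established, the old expression reduces
to the same comparison. Everything prefixed toy_ is hypothetical, 4 KiB pages
are assumed, and vma_end_pgoff() is assumed to mean vm_pgoff plus the number
of pages in the VMA; the real helper is defined elsewhere in the series and
may differ.

```c
#include <stdbool.h>

#define TOY_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical stand-in for the few vm_area_struct fields used here. */
struct toy_vma {
	unsigned long vm_start; /* first byte of the range */
	unsigned long vm_end;   /* one past the last byte of the range */
	unsigned long vm_pgoff; /* page offset of vm_start within the object */
};

static unsigned long toy_start_pgoff(const struct toy_vma *vma)
{
	return vma->vm_pgoff;
}

/* Assumed meaning: page offset one past the last page of the VMA. */
static unsigned long toy_end_pgoff(const struct toy_vma *vma)
{
	return vma->vm_pgoff + ((vma->vm_end - vma->vm_start) >> TOY_PAGE_SHIFT);
}

/* The page offset check as written before the patch. */
static bool toy_pgoff_check_old(const struct toy_vma *a, const struct toy_vma *b)
{
	return b->vm_pgoff ==
	       a->vm_pgoff + ((b->vm_start - a->vm_start) >> TOY_PAGE_SHIFT);
}

/* The page offset check as written after the patch. */
static bool toy_pgoff_check_new(const struct toy_vma *a, const struct toy_vma *b)
{
	return toy_end_pgoff(a) == toy_start_pgoff(b);
}
```

With a = [0x1000, 0x3000) at page offset 5 and b = [0x3000, 0x4000), both
forms accept b only when its page offset is 7. The two forms are only
interchangeable for adjacent VMAs, which is why the adjacency test comes
first in the patch.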

* Re: [PATCH 09/30] mm/rmap: parameterise anon_vma_interval_tree_*() by anon_vma
From: Lorenzo Stoakes @ 2026-07-01 10:18 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Russell King, Dinh Nguyen, Simon Schuster,
	James E . J . Bottomley, Helge Deller, Jarkko Sakkinen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Ian Abbott, H Hartley Sweeten, Lucas Stach, David Airlie,
	Simona Vetter, Patrik Jakobsson, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Rob Clark, Dmitry Baryshkov, Tomi Valkeinen,
	Thierry Reding, Mikko Perttunen, Jonathan Hunter,
	Christian Koenig, Huang Rui, Ankit Agrawal, Alex Williamson,
	Alexander Viro, Christian Brauner, Dan Williams, Muchun Song,
	Oscar Salvador, David Hildenbrand, Suren Baghdasaryan,
	Liam R . Howlett, Matthew Wilcox, Marek Szyprowski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Masami Hiramatsu, Oleg Nesterov, Steven Rostedt, SeongJae Park,
	Miaohe Lin, Hugh Dickins, Mike Rapoport, Kees Cook, Paolo Bonzini,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-sgx, etnaviv,
	dri-devel, linux-arm-msm, freedreno, linux-tegra, kvm,
	linux-fsdevel, nvdimm, linux-mm, iommu, linux-perf-users,
	linux-trace-kernel, kasan-dev, damon, Rik van Riel, Harry Yoo,
	Jann Horn
In-Reply-To: <akPvI4r1rnxxSbzW@pedro-suse.lan>

On Tue, Jun 30, 2026 at 05:32:03PM +0100, Pedro Falcato wrote:
> On Mon, Jun 29, 2026 at 01:23:20PM +0100, Lorenzo Stoakes wrote:
> > Similar to what we did with mapping_interval_tree*(), let's declare
> > anon_vma_interval_tree*() in terms of anon_vma rather than rb_root_cached.
> >
> > In each case the rb tree referenced is &anon_vma->rb_root, so just pass
> > anon_vma and the functions can figure this out themselves.
> >
> > Additionally, rename 'node' to 'avc', 'index' to 'pgoff_start', and 'last'
> > to 'pgoff_last' to make clear what is being passed.
> >
> > Finally express page offsets in terms of pgoff_t to be consistent.
> >
> > No functional change intended.
> >
> > Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
>
> Yay!
>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>

Thanks :)

>
> I have a vaguely similar comment as for the file part (names could be
> simpler, doesn't really matter whether it's an interval tree or possibly
> even a tree), but I care less strongly about anon rmap :)

And yeah ack, let me agonise over that a bit :P

>
> --
> Pedro

Cheers, Lorenzo

* Re: [PATCH 08/30] mm/rmap: rename vma_interval_tree_*() to mapping_interval_tree_*()
From: Lorenzo Stoakes @ 2026-07-01 10:14 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Russell King, Dinh Nguyen, Simon Schuster,
	James E . J . Bottomley, Helge Deller, Jarkko Sakkinen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Ian Abbott, H Hartley Sweeten, Lucas Stach, David Airlie,
	Simona Vetter, Patrik Jakobsson, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Rob Clark, Dmitry Baryshkov, Tomi Valkeinen,
	Thierry Reding, Mikko Perttunen, Jonathan Hunter,
	Christian Koenig, Huang Rui, Ankit Agrawal, Alex Williamson,
	Alexander Viro, Christian Brauner, Dan Williams, Muchun Song,
	Oscar Salvador, David Hildenbrand, Suren Baghdasaryan,
	Liam R . Howlett, Matthew Wilcox, Marek Szyprowski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Masami Hiramatsu, Oleg Nesterov, Steven Rostedt, SeongJae Park,
	Miaohe Lin, Hugh Dickins, Mike Rapoport, Kees Cook, Paolo Bonzini,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-sgx, etnaviv,
	dri-devel, linux-arm-msm, freedreno, linux-tegra, kvm,
	linux-fsdevel, nvdimm, linux-mm, iommu, linux-perf-users,
	linux-trace-kernel, kasan-dev, damon, Rik van Riel, Harry Yoo,
	Jann Horn
In-Reply-To: <akPtjuele4I7iqTQ@pedro-suse.lan>

On Tue, Jun 30, 2026 at 05:28:20PM +0100, Pedro Falcato wrote:
> On Mon, Jun 29, 2026 at 01:23:19PM +0100, Lorenzo Stoakes wrote:
> > The family of vma_interval_tree_*() functions manipulate the
> > address_space (which, of course, is generally referred to as 'mapping')
> > reverse mapping, but are named the 'VMA' interval tree.
> >
> > VMAs may be mapped by an anon_vma, an address_space, or both. Therefore
> > calling the mapping interval tree a 'VMA' interval tree is rather
> > confusing.
> >
> > This is also inconsistent with the anon_vma_interval_tree_*() functions
> > which explicitly reference the rmap object to which they pertain.
> >
> > Rename the vma_interval_tree_*() functions to mapping_interval_tree_*() to
> > correct this.
> >
> > No functional change intended.
> >
> > Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
>
> I'll have to nitpick this and say that I prefer [1] file_rmap_tree_, or
> mapping_rmap_tree. Or possibly even better - mapping_ (so
> mapping_for_each_vma, mapping_insert_vma, etc).

Haha of course.

I don't like mapping_ because that is such an overloaded term and it's confusing
really.

I'm iffy on mapping_rmap_tree as it feels like that implies 'this is the whole
of how the rmap lookup works', and it also makes it less obvious what _kind_ of
tree it is, though I suppose you could look up the types, but the crap macro
generation makes that more of a pain.

Also, for file 'rmap' you are potentially not actually doing an rmap at all in
the classical sense of folio -> rmap object -> related VMAs, but rather are
going from file/inode -> <page cache abstraction> -> related VMAs...

But then again a quick spot check suggests it's usually from a folio... I guess
only the rmap really needs to find the VMAs.

Also if I renamed it to mapping_rmap_tree I'd have to respin and churn all the
anon_vma code but I guess that's not impossible... :)

I guess I'm a bit stubborn about doing much with the anon_vma stuff until I get
rid of it.

But I couldn't live with the stupidity of calling it vma_interval_tree_*()
anymore here...

OK I'm a bit torn, mapping_rmap_tree*() and anon_rmap_tree*() are the workable
contenders from your suggestions.

Let me think about that on respin... :)

>
> A bit of bikeshedding never hurts ;)
>
> [1] locally I was naming things file_rmap, but I never actually got to
> churn these names away
> --
> Pedro

Cheers, Lorenzo

* Re: [PATCH 2/2] tracing: Keep pid and comm[] in the same structure
From: David Laight @ 2026-07-01 10:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Michal Koutný
In-Reply-To: <20260630150348.149e318c@gandalf.local.home>

On Tue, 30 Jun 2026 15:03:48 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 30 Jun 2026 11:01:56 +0100
> David Laight <david.laight.linux@gmail.com> wrote:
> 
> > > It's been a long time since I worked on this, but IIRC, it was to keep
> > > the pressure down on the TLB when tracing. It updates at every
> > > sched_switch that has a trace event occurring so, I likely used normal
> > > pages which are part of the huge pages the kernel sets up and doesn't
> > > affect the TLB as much. vmalloc does have impact on the TLB pressure,
> > > and tracing should always try to avoid that.    
> > 
> > Isn't this a cache so that the pid numbers can be converted to strings
> > when the trace is read out after the actual process has exited?
> > That does mean that cache doesn't need to be updated on every trace
> > request - it might be enough to just save on process exit and lookup the
> > pid itself for running processes (the whole thing relies on pids not
> > being reused).  
> 
> Yes it's a cache but it only gets filled when needed. That is, after a
> trace event occurred. Tracing is very commonly used with filtering, where
> events can be seldom triggered. What is in the saved_cmdlines file should
> only be tasks that were running when a trace occurred.
> 
> Now what we could do is add a flag to the task struct and only set that
> when tracing happens. Only tasks that exit would be saved in this array.
> The other tasks could be queried via iterating the tasks and reporting any
> task with this bit set.

I thought it was just used to do a pid->string lookup when you run 'cat trace'.
But then I found the code that lets userspace read the table....
I guess the latter is used by the userspace code that reads the raw trace buffer.
(I found some instructions that did it that way; the output was unparseable
when tracing things that were happening on multiple CPUs.)
The userspace code could probably be given comm[] for all the running
processes and those that exited while tracing_on() was set.
(I didn't see anything that would clear the table when the trace buffer
was cleared.)

I'll try to remember to look at this when I get home.
(I've a pi-5 with me, not doing kernel builds on it...)

	David
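
For readers unfamiliar with the mechanism being discussed, here is a toy
userspace sketch of a pid to comm cache that is filled only when an event is
recorded and consulted when the trace is read out. It is not the kernel's
saved_cmdlines implementation: the table size, the modulo slot selection and
all toy_ names are hypothetical, and the "<...>" fallback string is chosen
for illustration.

```c
#include <string.h>

#define TOY_COMM_LEN 16 /* assumption: same length as the kernel's comm[] */
#define TOY_SLOTS    8  /* hypothetical, deliberately tiny table */

struct toy_cmdline {
	int pid;                 /* 0 means the slot is unused */
	char comm[TOY_COMM_LEN];
};

static struct toy_cmdline toy_table[TOY_SLOTS];

/* Record pid -> comm; meant to be called only when a trace event fires. */
static void toy_save_cmdline(int pid, const char *comm)
{
	struct toy_cmdline *slot = &toy_table[(unsigned int)pid % TOY_SLOTS];

	/* A colliding pid simply evicts the previous entry. */
	slot->pid = pid;
	strncpy(slot->comm, comm, TOY_COMM_LEN - 1);
	slot->comm[TOY_COMM_LEN - 1] = '\0';
}

/* Resolve a pid at read-out time; fall back to "<...>" when not cached. */
static const char *toy_find_cmdline(int pid)
{
	const struct toy_cmdline *slot =
		&toy_table[(unsigned int)pid % TOY_SLOTS];

	return (pid != 0 && slot->pid == pid) ? slot->comm : "<...>";
}
```

Because entries are only written when an event fires, a filtered trace
touches the table rarely. The alternative floated above (a per-task flag plus
saving only at exit) would move the write from the event path to the exit
path.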


* Re: [PATCH 06/30] mm/rmap: parameterise vma_interval_tree_*() by address_space
From: Lorenzo Stoakes @ 2026-07-01  9:56 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Russell King, Dinh Nguyen, Simon Schuster,
	James E . J . Bottomley, Helge Deller, Jarkko Sakkinen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Ian Abbott, H Hartley Sweeten, Lucas Stach, David Airlie,
	Simona Vetter, Patrik Jakobsson, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Rob Clark, Dmitry Baryshkov, Tomi Valkeinen,
	Thierry Reding, Mikko Perttunen, Jonathan Hunter,
	Christian Koenig, Huang Rui, Ankit Agrawal, Alex Williamson,
	Alexander Viro, Christian Brauner, Dan Williams, Muchun Song,
	Oscar Salvador, David Hildenbrand, Suren Baghdasaryan,
	Liam R . Howlett, Matthew Wilcox, Marek Szyprowski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Masami Hiramatsu, Oleg Nesterov, Steven Rostedt, SeongJae Park,
	Miaohe Lin, Hugh Dickins, Mike Rapoport, Kees Cook, Paolo Bonzini,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-sgx, etnaviv,
	dri-devel, linux-arm-msm, freedreno, linux-tegra, kvm,
	linux-fsdevel, nvdimm, linux-mm, iommu, linux-perf-users,
	linux-trace-kernel, kasan-dev, damon, Rik van Riel, Harry Yoo,
	Jann Horn
In-Reply-To: <akPsGrmaOd3JMlC2@pedro-suse.lan>

On Tue, Jun 30, 2026 at 05:19:16PM +0100, Pedro Falcato wrote:
> On Mon, Jun 29, 2026 at 01:23:17PM +0100, Lorenzo Stoakes wrote:
> > The file-backed mapping interval tree functions vma_interval_tree_*()
> > accept a raw rb_root_cached pointer to determine the tree in which they are
> > operating.
> >
> > However, in each case, this is always associated with an address_space data
> > type.
> >
> > So simply pass a pointer to that instead to simplify the code, and more
> > clearly differentiate between these operations and those concerning
> > anonymous mappings.
> >
> > While we're here, make the generated interval tree functions static as they
> > do not need to be used externally (any previously existing external users
> > have now been removed).
> >
> > We also rename VMA parameters from 'node' to 'vma' as calling this a node
> > is simply confusing, update the input index types to pgoff_t since they
> > reference page offsets and rename the parameters to pgoff_start and
> > pgoff_last.
> >
> > No functional change intended.
> >
> > Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
>
> 1) This is fantastic
> 2) I need to rebase my local work :)
>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>
>
> --
> Pedro

1) thanks :)
2) sorry :P

Cheers, Lorenzo

* Re: [PATCH 05/30] mm/rmap: update mm/interval_tree.c comments
From: Lorenzo Stoakes @ 2026-07-01  9:55 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Russell King, Dinh Nguyen, Simon Schuster,
	James E . J . Bottomley, Helge Deller, Jarkko Sakkinen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Ian Abbott, H Hartley Sweeten, Lucas Stach, David Airlie,
	Simona Vetter, Patrik Jakobsson, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Rob Clark, Dmitry Baryshkov, Tomi Valkeinen,
	Thierry Reding, Mikko Perttunen, Jonathan Hunter,
	Christian Koenig, Huang Rui, Ankit Agrawal, Alex Williamson,
	Alexander Viro, Christian Brauner, Dan Williams, Muchun Song,
	Oscar Salvador, David Hildenbrand, Suren Baghdasaryan,
	Liam R . Howlett, Matthew Wilcox, Marek Szyprowski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Masami Hiramatsu, Oleg Nesterov, Steven Rostedt, SeongJae Park,
	Miaohe Lin, Hugh Dickins, Mike Rapoport, Kees Cook, Paolo Bonzini,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-sgx, etnaviv,
	dri-devel, linux-arm-msm, freedreno, linux-tegra, kvm,
	linux-fsdevel, nvdimm, linux-mm, iommu, linux-perf-users,
	linux-trace-kernel, kasan-dev, damon, Rik van Riel, Harry Yoo,
	Jann Horn
In-Reply-To: <akPrTU96BwQJoygw@pedro-suse.lan>

On Tue, Jun 30, 2026 at 05:16:51PM +0100, Pedro Falcato wrote:
> On Mon, Jun 29, 2026 at 01:23:16PM +0100, Lorenzo Stoakes wrote:
> > Update the file comment to clarify that both file-backed and anonymous
> > interval trees are provided, referencing the relevant data types for
> > clarity.
> >
> > Also add comments to indicate which parts of the file apply to each.
> >
> > While we're here, convert the VM_BUG_ON_VMA() to VM_WARN_ON_ONCE_VMA().
> >
> > Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>
>
> This is fine for now, but I'm wondering if it doesn't make sense to, in the
> long term, have:
>
> mm/rmap.c - common rmap mechanisms
> mm/anon_rmap.c - anon rmap gunk
> mm/file_rmap.c - file rmap gunk
>
> or even something like mm/rmap/{core,anon,file,ksm??}.c
>
> While working on my file rmap patches I noticed there's so much stuff just
> splurged all over rmap.c - interval_tree.c - fs.h - fs/inode.c.
> It's a little silly.

Well, Wei had something like this idea a way back, but I'd rather avoid it.

Firstly, with scalable CoW coming, I'd rather us not make anon_vma a special
citizen in any way, and I'm going to be heavily modifying all this anyway.

On the interval tree side, I'm simply going to get rid of the anon side of it
altogether with scalable CoW also so that can live as it is now for the time
being.

In general we try to have generalised rmap walk logic for file vs. anon with
rmap_walk -> rmap_walk_[anon, file]() for one obviously.

But obviously there's separate rmap walker logic for file vs. anon.

I used to harbour a belief that we could make anon like file-backed but now I'm
kind of thinking that's not possible :)

So I guess the real answer is - yeah, but let's revisit it after scalable CoW
(I'm likely to do sensible architectural breakouts as part of that work
anyway).

>
> --
> Pedro

Cheers, Lorenzo

* Re: [PATCH 01/30] mm: move vma_start_pgoff() into mm.h and clean up
From: Lorenzo Stoakes @ 2026-07-01  9:42 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Russell King, Dinh Nguyen, Simon Schuster,
	James E . J . Bottomley, Helge Deller, Jarkko Sakkinen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Ian Abbott, H Hartley Sweeten, Lucas Stach, David Airlie,
	Simona Vetter, Patrik Jakobsson, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Rob Clark, Dmitry Baryshkov, Tomi Valkeinen,
	Thierry Reding, Mikko Perttunen, Jonathan Hunter,
	Christian Koenig, Huang Rui, Ankit Agrawal, Alex Williamson,
	Alexander Viro, Christian Brauner, Dan Williams, Muchun Song,
	Oscar Salvador, David Hildenbrand, Suren Baghdasaryan,
	Liam R . Howlett, Matthew Wilcox, Marek Szyprowski,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Namhyung Kim,
	Masami Hiramatsu, Oleg Nesterov, Steven Rostedt, SeongJae Park,
	Miaohe Lin, Hugh Dickins, Mike Rapoport, Kees Cook, Paolo Bonzini,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-sgx, etnaviv,
	dri-devel, linux-arm-msm, freedreno, linux-tegra, kvm,
	linux-fsdevel, nvdimm, linux-mm, iommu, linux-perf-users,
	linux-trace-kernel, kasan-dev, damon, Rik van Riel, Harry Yoo,
	Jann Horn
In-Reply-To: <akPqIfmQLOs4gI7h@pedro-suse.lan>

On Tue, Jun 30, 2026 at 05:10:55PM +0100, Pedro Falcato wrote:
> On Mon, Jun 29, 2026 at 01:23:12PM +0100, Lorenzo Stoakes wrote:
> > vma_last_pgoff() already lives there, so it's a bit odd to keep
> > vma_start_pgoff() in mm/interval_tree.c. Move them together.
>
> Hmm, a part of me wonders if this is the part where we should start cleaning
> up mm.h into vma.h or something. Probably not, it would be extra churn right
> now.

Yeah the issue is there's some confusion about vma.h - mm.h should be for
stuff that is used outside of mm, and these helpers are definitely like
that.

vma.h is purely for internal mm vma stuff, and most people should be accessing
that via internal.h (I address that in patch 27).

I do wonder if that could be done more nicely but punt that to another time.

But also probably worth doing a pass over some of the defines, I have a
bunch of series chur^W changing stuff lately so can do a follow up on that
maybe.

>
> >
> > These each return unsigned long, which pgoff_t is typedef'd to. Make this
> > consistent and have these functions return pgoff_t instead.
> >
> > Additionally, express vma_last_pgoff() in terms of vma_start_pgoff(), since
> > we wrap the vma->vm_pgoff access, we may as well use it here.
> >
> > Also while we're here, const-ify the VMA and clean up a bit.
> >
> > No functional change intended.
> >
> > Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>

Thanks!

>
> > ---
> >  include/linux/mm.h | 9 +++++++--
> >  mm/interval_tree.c | 5 -----
> >  2 files changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 485df9c2dbdd..059144435729 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4278,9 +4278,14 @@ static inline unsigned long vma_pages(const struct vm_area_struct *vma)
> >  	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> >  }
> >
> > -static inline unsigned long vma_last_pgoff(struct vm_area_struct *vma)
> > +static inline pgoff_t vma_start_pgoff(const struct vm_area_struct *vma)
> >  {
> > -	return vma->vm_pgoff + vma_pages(vma) - 1;
> > +	return vma->vm_pgoff;
> > +}
> > +
> > +static inline pgoff_t vma_last_pgoff(const struct vm_area_struct *vma)
> > +{
> > +	return vma_start_pgoff(vma) + vma_pages(vma) - 1;
> >  }
> >
> >  static inline unsigned long vma_desc_size(const struct vm_area_desc *desc)
> > diff --git a/mm/interval_tree.c b/mm/interval_tree.c
> > index 32bcfbfcf15f..344d1f5946c7 100644
> > --- a/mm/interval_tree.c
> > +++ b/mm/interval_tree.c
> > @@ -10,11 +10,6 @@
> >  #include <linux/rmap.h>
> >  #include <linux/interval_tree_generic.h>
> >
> > -static inline unsigned long vma_start_pgoff(struct vm_area_struct *v)
> > -{
> > -	return v->vm_pgoff;
> > -}
> > -
> >  INTERVAL_TREE_DEFINE(struct vm_area_struct, shared.rb,
> >  		     unsigned long, shared.rb_subtree_last,
> >  		     vma_start_pgoff, vma_last_pgoff, /* empty */, vma_interval_tree)
> > --
> > 2.54.0
> >
>
> --
> Pedro

Cheers, Lorenzo
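
A small userspace sketch of the helper relationship in the patch quoted
above, showing that the last page offset is inclusive: a VMA covering N pages
starting at page offset P spans offsets P through P + N - 1. The toy_ names
and the 4 KiB page size are assumptions made so the fragment compiles outside
the kernel.

```c
#define TOY_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

typedef unsigned long toy_pgoff_t; /* stands in for pgoff_t */

/* Hypothetical stand-in for the vm_area_struct fields the helpers read. */
struct toy_vma {
	unsigned long vm_start; /* first byte of the range */
	unsigned long vm_end;   /* one past the last byte of the range */
	toy_pgoff_t vm_pgoff;   /* page offset of vm_start within the object */
};

/* Number of pages spanned by the VMA. */
static unsigned long toy_vma_pages(const struct toy_vma *vma)
{
	return (vma->vm_end - vma->vm_start) >> TOY_PAGE_SHIFT;
}

/* First page offset covered by the VMA. */
static toy_pgoff_t toy_vma_start_pgoff(const struct toy_vma *vma)
{
	return vma->vm_pgoff;
}

/* Last page offset covered by the VMA (inclusive). */
static toy_pgoff_t toy_vma_last_pgoff(const struct toy_vma *vma)
{
	return toy_vma_start_pgoff(vma) + toy_vma_pages(vma) - 1;
}
```

For a VMA at [0x10000, 0x14000) with page offset 3 this gives a start offset
of 3 and a last offset of 6. The inclusive upper bound matches the
[start, last] interval convention used by the interval tree macros in the
same patch.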

* Re: [syzbot] [trace?] general protection fault in mtree_load
From: syzbot @ 2026-07-01  9:42 UTC (permalink / raw)
  To: bp, dave.hansen, hpa, linux-kernel, linux-trace-kernel, mhiramat,
	mingo, oleg, olsajiri, peterz, syzkaller-bugs, tglx, x86
In-Reply-To: <akTZNgy1UPQdNcO_@redhat.com>

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Reported-by: syzbot+61ce80689253f42e6d80@syzkaller.appspotmail.com
Tested-by: syzbot+61ce80689253f42e6d80@syzkaller.appspotmail.com

Tested on:

commit:         665159e2 Merge tag 'probes-fixes-v7.2-rc1' of git://gi..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=14f7b11c580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=f9bf5d2bfae96234
dashboard link: https://syzkaller.appspot.com/bug?extid=61ce80689253f42e6d80
compiler:       gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44
patch:          https://syzkaller.appspot.com/x/patch.diff?x=1084b9fa580000

Note: testing is done by a robot and is best-effort only.

* Re: [PATCHv4 02/13] uprobes/x86: Remove struct uprobe_trampoline object
From: Jiri Olsa @ 2026-07-01  9:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jiri Olsa, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <aj0xonQHcEQgIi77@redhat.com>

On Thu, Jun 25, 2026 at 03:48:18PM +0200, Oleg Nesterov wrote:
> On 06/25, Jiri Olsa wrote:
> >
> > On Wed, Jun 24, 2026 at 04:36:23PM +0200, Oleg Nesterov wrote:
> > >
> > > Perhaps we can later optimize this code a bit? I mean something like
> > >
> > > 	start_reachable = ...;
> > > 	end_reachable = ...;
> > >
> > > 	VMA_ITERATOR(vmi, mm, start_reachable);
> > >
> > > 	for_each_vma(vmi, vma) {
> > > 		if (!vma_is_special_mapping(...))
> > > 			continue;
> > > 		if (vma->vm_start > end_reachable)
> > > 			break;
> > > 		return vma;
> > > 	}
> >
> > looks good, will try to use that
> 
> See my next email, we can use for_each_vma_range().
> 
> But let me repeat, we can add this minor optimization later, I don't want
> to delay this series.
> 
> > > >  static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > > >  				  unsigned long vaddr)
> > > >  {
> > > > -	struct uprobe_trampoline *tramp;
> > > > -	struct vm_area_struct *vma;
> > > > -	bool new = false;
> > > > -	int err = 0;
> > > > +	struct pt_regs *regs = task_pt_regs(current);
> > > > +	struct vm_area_struct *vma, *tramp;
> > > >
> > > > +	if (!user_64bit_mode(regs))
> > > > +		return -EINVAL;
> > > >  	vma = find_vma(mm, vaddr);
> > > >  	if (!vma)
> > > >  		return -EINVAL;
> > >
> > > I guess find_vma() can't fail, the caller arch_uprobe_optimize() has called
> > > copy_from_vaddr() under mmap_write_lock()... Nevermind.
> >
> > hum, how's that.. I'll check, but where's the magic? :)
> 
> arch_uprobe_optimize() -> copy_from_vaddr() reads this mm at the same vaddr,
> this means that vma at this vaddr must exist. Unless I am totally confused ;)
> But even if I am right please ignore. I just tried to understand if find_vma()
> can fail or not here.

ok, will leave these 2 changes for later

jirka
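
For illustration only, the early-exit scan sketched in the quoted
pseudocode can be modelled in plain userspace C over a sorted array.
This is a rough sketch, not kernel code: struct region,
find_special_in_window() and the special flag are hypothetical stand-ins
for the VMA, the iterator loop and vma_is_special_mapping(). It also
tests the upper bound before the flag, which differs slightly from the
order in the quoted pseudocode.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: a half-open address range plus a flag. */
struct region {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
	bool special;		/* models vma_is_special_mapping() */
};

/*
 * Return the first special region overlapping or following @lo whose
 * start is not above @hi. @r must be sorted by start address.
 */
static const struct region *
find_special_in_window(const struct region *r, size_t n,
		       unsigned long lo, unsigned long hi)
{
	for (size_t i = 0; i < n; i++) {
		if (r[i].end <= lo)
			continue;	/* entirely below the window */
		if (r[i].start > hi)
			break;		/* sorted input: nothing later can match */
		if (r[i].special)
			return &r[i];
	}
	return NULL;
}
```

Because the input is sorted, the scan stops at the first region that
starts above the window instead of walking the whole address space.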


* Re: [PATCH v8 06/46] KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined
From: Xiaoyao Li @ 2026-07-01  9:19 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-6-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Explicitly guard reporting support for KVM_MEMORY_ATTRIBUTE_PRIVATE based
> on kvm_arch_has_private_mem being #defined in anticipation of decoupling
> kvm_supported_mem_attributes() from CONFIG_KVM_VM_MEMORY_ATTRIBUTES.

Well, after this series, kvm_supported_mem_attributes() is renamed to 
kvm_supported_vm_mem_attributes(), and it's still under 
CONFIG_KVM_VM_MEMORY_ATTRIBUTES.

> guest_memfd support for memory attributes will be unconditional to avoid
> yet more macros (all architectures that support guest_memfd are expected to
> use per-gmem attributes at some point), at which point enumerating support
> KVM_MEMORY_ATTRIBUTE_PRIVATE based solely on memory attributes being
> supported _somewhere_ would result in KVM over-reporting support on arm64.

I don't understand this. This patch only changes the behavior of
kvm_supported_mem_attributes(), whose use is guarded by
CONFIG_KVM_VM_MEMORY_ATTRIBUTES. That config is only visible to x86
due to patch 03. How does it affect arm64?

> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>   virt/kvm/kvm_main.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1ccc4895a4c26..7b989b659cf82 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2421,8 +2421,10 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>   #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
>   static u64 kvm_supported_mem_attributes(struct kvm *kvm)
>   {
> +#ifdef kvm_arch_has_private_mem
>   	if (!kvm || kvm_arch_has_private_mem(kvm))
>   		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +#endif
>   
>   	return 0;
>   }
> 



* Re: [syzbot] [trace?] general protection fault in mtree_load
From: Oleg Nesterov @ 2026-07-01  9:09 UTC (permalink / raw)
  To: syzbot
  Cc: bp, dave.hansen, hpa, linux-kernel, linux-trace-kernel, mhiramat,
	mingo, olsajiri, peterz, syzkaller-bugs, tglx, x86
In-Reply-To: <6a446e28.854d4ab9.360e1d.0017.GAE@google.com>

On 06/30, syzbot wrote:
>
> syzbot has found a reproducer for the following issue on:
>
> HEAD commit:    dc59e4fea9d8 Linux 7.2-rc1
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=15c7d61c580000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=f9bf5d2bfae96234
> dashboard link: https://syzkaller.appspot.com/bug?extid=61ce80689253f42e6d80
> compiler:       gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12bbb11c580000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=130bf4ea580000
>
> Downloadable assets:
> disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/d900f083ada3/non_bootable_disk-dc59e4fe.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/1bc8aed8d2e8/vmlinux-dc59e4fe.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/0b1fdfc4aa09/bzImage-dc59e4fe.xz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+61ce80689253f42e6d80@syzkaller.appspotmail.com
>
> Oops: general protection fault, probably for non-canonical address 0xdffffc0000000011: 0000 [#1] SMP KASAN NOPTI

#syz test

From: Jiri Olsa <jolsa@kernel.org>

In the unregister path we use __in_uprobe_trampoline check with
current->mm for the VMA lookup, which is wrong, because we are
in the tracer context, not the traced process.

Add an mm_struct pointer argument to __in_uprobe_trampoline and
change the related callers to pass the proper mm_struct pointer.

Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/kernel/uprobes.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index ebb1baf1eb1d..2be6707e3320 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -761,9 +761,9 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
 		destroy_uprobe_trampoline(tramp);
 }
 
-static bool __in_uprobe_trampoline(unsigned long ip)
+static bool __in_uprobe_trampoline(struct mm_struct *mm, unsigned long ip)
 {
-	struct vm_area_struct *vma = vma_lookup(current->mm, ip);
+	struct vm_area_struct *vma = vma_lookup(mm, ip);
 
 	return vma && vma_is_special_mapping(vma, &tramp_mapping);
 }
@@ -776,14 +776,14 @@ static bool in_uprobe_trampoline(unsigned long ip)
 
 	rcu_read_lock();
 	if (mmap_lock_speculate_try_begin(mm, &seq)) {
-		found = __in_uprobe_trampoline(ip);
+		found = __in_uprobe_trampoline(mm, ip);
 		retry = mmap_lock_speculate_retry(mm, seq);
 	}
 	rcu_read_unlock();
 
 	if (retry) {
 		mmap_read_lock(mm);
-		found = __in_uprobe_trampoline(ip);
+		found = __in_uprobe_trampoline(mm, ip);
 		mmap_read_unlock(mm);
 	}
 	return found;
@@ -1044,7 +1044,7 @@ static int copy_from_vaddr(struct mm_struct *mm, unsigned long vaddr, void *dst,
 	return 0;
 }
 
-static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
+static bool __is_optimized(struct mm_struct *mm, uprobe_opcode_t *insn, unsigned long vaddr)
 {
 	struct __packed __arch_relative_insn {
 		u8 op;
@@ -1053,7 +1053,7 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
 
 	if (!is_call_insn(insn))
 		return false;
-	return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
+	return __in_uprobe_trampoline(mm, vaddr + 5 + call->raddr);
 }
 
 static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
@@ -1064,7 +1064,7 @@ static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
 	err = copy_from_vaddr(mm, vaddr, &insn, 5);
 	if (err)
 		return err;
-	return __is_optimized((uprobe_opcode_t *)&insn, vaddr);
+	return __is_optimized(mm, (uprobe_opcode_t *)&insn, vaddr);
 }
 
 static bool should_optimize(struct arch_uprobe *auprobe)
-- 
2.54.0




* Re: [PATCH v8 17/46] KVM: guest_memfd: Advertise KVM_SET_MEMORY_ATTRIBUTES2 ioctl
From: Xiaoyao Li @ 2026-07-01  9:03 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-17-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> @@ -4969,6 +4973,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>   		return 1;
>   	case KVM_CAP_GUEST_MEMFD_FLAGS:
>   		return kvm_gmem_get_supported_flags(kvm);
> +	case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
> +		if (!gmem_in_place_conversion || !kvm_supports_private_mem(kvm))
> +			return 0;
> +
> +		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
>   #endif
>   	default:
>   		break;

this looks inconsistent with the

	case KVM_SET_MEMORY_ATTRIBUTES2:
		if (!gmem_in_place_conversion)
			return -ENOTTY;

Well, the check of

	if (!kvm_arch_has_private_mem(f->kvm))
		return -EINVAL;

is buried in the following kvm_gmem_set_attributes(). How about moving
the kvm_arch_has_private_mem() check so it sits alongside the
gmem_in_place_conversion check in kvm_gmem_ioctl() in Patch 13?


* Re: [PATCH v8 12/46] KVM: guest_memfd: Only prepare folios for private pages
From: Xiaoyao Li @ 2026-07-01  8:05 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-12-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> All-shared guest_memfd used to be only supported for non-CoCo VMs where
> preparation doesn't apply. INIT_SHARED is about to be supported for CoCo
> VMs in a later patch in this series.
> 
> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
> guest_memfd in a later patch in this series.
> 
> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
> shared folio for a CoCo VM where preparation applies.

It's not clear to me why preparation isn't needed, or should not be
performed, for a shared folio.

Before this patch, a userspace VMM can create the gmem with FLAG_MMAP and
FLAG_INIT_SHARED, and KVM always faults in the pfn from gmem. In this
case the gmem is always shared, and kvm_gmem_prepare_folio() is always
called when faulting in from gmem. So it's OK to call preparation for a
shared folio, right? Then what is the actual reason we cannot do it now?


* Re: [RFC PATCH v2 4/4] rtla/osnoise: Leverage IPI event filters when tracing a subset of CPUs
From: Valentin Schneider @ 2026-07-01  7:25 UTC (permalink / raw)
  To: Tomas Glozar
  Cc: linux-kernel, linux-trace-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Costa Shulyupin,
	Crystal Wood, John Kacur, Ivan Pravdin, Jonathan Corbet
In-Reply-To: <CAP4=nvT5H73ptaBaZ2fWQ+SjJ-c3eEGZbrSefMV6fxqL-va2XA@mail.gmail.com>

On 01/07/26 08:45, Tomas Glozar wrote:
> On Tue, Jun 30, 2026 at 16:00, Valentin Schneider
> <vschneid@redhat.com> wrote:
>>
>> On 30/06/26 12:14, Tomas Glozar wrote:
>> > On Wed, Jun 17, 2026 at 15:18, Valentin Schneider
>> > <vschneid@redhat.com> wrote:
>> >> @@ -406,6 +408,33 @@ struct osnoise_tool *osnoise_init_top(struct common_params *params)
>> >>                 goto out_err;
>> >>         }
>> >>
>> >> +       /*
>> >> +        * If tracing on a subset of possible CPUs, leverage the kernel filtering
>> >> +        * infrastructure to only generate events on traced CPUs.
>> >> +        */
>> >> +       if (params->cpus) {
>> >> +               char filter[MAX_PATH];
>> >> +
>> >> +               snprintf(filter, ARRAY_SIZE(filter), "cpu & CPUS{%s}\n", params->cpus);
>> >> +               retval = tracefs_event_file_write(tool->trace.inst,
>> >> +                                                 "ipi", "ipi_send_cpu", "filter",
>> >> +                                                 filter);
>> >> +               if (retval) {
>> >
>> > retval is the number of bytes written here, so this should be "retval
>> > < 0" like in trace_event_enable_filter() in trace.c. Same below.
>> >
>>
>> According to the docstring:
>>
>>  * Return 0 on success, and -1 on error.
>>
>> but regardless yes that should be a '< 0' check to match existing code.
>>
>
> I double-checked that and you are correct that the docstring says so,
> but it's an error in the docstring. According to the manpage, it
> returns the number of bytes written (i.e. positive on success, not
> zero) [1]:
>
> "RETURN VALUE
> ...
> tracefs_event_file_write() and tracefs_event_file_append() returns
> *the number of bytes written to the system/event file* or negative on
> error."
>
> The code agrees as well: in tracefs_event_file_write() there's the
> wrong docstring (likely copied from another function) [2]:
>
> /*
>  * tracefs_event_file_write - write to an event file
>  * ...
>  * Return 0 on success, and -1 on error.
>  */
> int tracefs_event_file_write(struct tracefs_instance *instance,
>     const char *system, const char *event,
>     const char *file, const char *str)
> {
>         ....
>         ret = tracefs_instance_file_write(instance, path, str);
>         free(path);
>         return ret;
> }
>
> but the source of the return value is tracefs_instance_file_write(),
> where the docstring is correct [3]:
>
> /**
>  * tracefs_instance_file_write - Write in trace file of specific instance.
>  * ...
>  * Returns the number of written bytes, or -1 in case of an error
>  */
> int tracefs_instance_file_write(struct tracefs_instance *instance,
> const char *file, const char *str)
> {
>         return instance_file_write(instance, file, str, O_WRONLY | O_TRUNC);
> }
>
> instance_file_write() gets the return value from write_file() [4]
> which returns the return value of write() [5].
>
> [1] https://man7.org/linux/man-pages/man3/tracefs_event_get_file.3.html
> [2] https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/tree/src/tracefs-events.c#n686
> [3] https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/tree/src/tracefs-instance.c#n532
> [4] https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/tree/src/tracefs-instance.c#n514
> [5] https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/tree/src/tracefs-instance.c#n496
>
> So the "< 0" is required, the CPU filter doesn't work without it:
>
> [tglozar@fedora rtla]$ uname -a
> Linux fedora 7.0.12-101.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jun 11 01:32:26 UTC 2026 x86_64 GNU/Linux
> [tglozar@fedora rtla]$ sudo ./rtla osnoise top -q -c 0 --ipi
> Could not set ipi_send_cpu CPU filter, return value: 14
> Could not init osnoise tool
> [tglozar@fedora rtla]$ sudo ./rtla osnoise top -q -d 5s --ipi
>                                          Operating System Noise
> ...
> [tglozar@fedora rtla]$
>
> (output with additional debug print)
>

Darnit, you're right! I'm surprised I didn't catch this while
testing. Thanks!

>> >> +                       err_msg("Could not set ipi_send_cpu CPU filter\n");
>> >> +                       goto out_err;
>> >
>> > It would be useful to have --ipi work even on older kernels that don't
>> > yet have your cpumask trace event filter patchset [1], for example, by
>> > printing a debug message that filtering is disabled and setting a flag
>> > instead of erroring out here. Then the code in
>> > osnoise_ipi_cpu_handler() can preserve the CPU_ISSET check if the flag
>> > is set.
>> >
>> > As --ipi is optional, we can choose to only support it on newer
>> > kernels, but it would be nice to have it working without the filter,
>> > too.
>> >
>> > [1] https://lore.kernel.org/linux-trace-kernel/20230707172155.70873-1-vschneid@redhat.com/T/#u
>> >
>>
>> Makes sense, will do.
>>
>
> Thanks!
>
>> [truncated]
>> >
>> > I was thinking that it might make sense to enable the filters also for
>> > the trace output instance. On the other hand, it would make it
>> > difficult to enable the event without the filter then, as specifying
>> > "-e ipi" or similar only re-enables the event but does not remove the
>> > filter. Maybe the better idea is to implement an option to filter any
>> > event enabled through -e/--event only to the measurement CPU, as a
>> > separate feature.
>> >
>>
>> I had actually forgotten about applying the filters for the output
>> instance... I'll look into it.
>>
>
> Thanks. I gave it some more thought and realized enabling the event
> without the filter should not be complicated at all. We can just
> remove existing filters in trace_events_enable(), as
> trace_events_enable() is called after osnoise_init_trace_tool() in
> run_tool(). So that will make an explicit "-e ipi" drop the filter
> from "--ipi" on the trace instance and show all IPI events. So you can
> disregard my note about filtering -e options, it's not relevant here.
>

That makes sense, that's the most intuitive option.

> Tomas



* Re: [PATCH v8 08/46] KVM: Provide generic interface for checking memory private/shared status
From: Xiaoyao Li @ 2026-07-01  7:22 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-8-9d2959357853@google.com>

On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Introduce a generic kvm_mem_is_private() interface using a static call to
> determine if a GFN is private. This allows the implementation for checking
> a GFN's private/shared status to be set at runtime.
> 
> In preparation for choosing implementations between a guest_memfd lookup
> and the existing VM attribute lookup, rename the existing
> VM-attribute-based check to kvm_vm_mem_is_private to emphasize that it
> looks up VM attributes.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

One nit below.

> ---
>   include/linux/kvm_host.h | 12 +++++++++++-
>   virt/kvm/kvm_main.c      | 15 +++++++++++++++
>   2 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index eb26d4ea8945a..3915da2a61778 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2546,7 +2546,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
>   bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>   					 struct kvm_gfn_range *range);
>   
> -static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +static inline bool kvm_vm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>   {
>   	return kvm_get_vm_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
>   }
> @@ -2557,6 +2557,16 @@ static inline bool kvm_mem_range_is_private(struct kvm *kvm, gfn_t start,
>   						  KVM_MEMORY_ATTRIBUTE_PRIVATE,
>   						  KVM_MEMORY_ATTRIBUTE_PRIVATE);
>   }
> +#endif  /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */

Since this closes the original #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
scope here, ...

> +#ifdef kvm_arch_has_private_mem
> +typedef bool (kvm_mem_is_private_t)(struct kvm *kvm, gfn_t gfn);
> +DECLARE_STATIC_CALL(__kvm_mem_is_private, kvm_mem_is_private_t);
> +
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> +	return static_call(__kvm_mem_is_private)(kvm, gfn);
> +}
>   #else
>   static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>   {

... we need to update the #endif comment to match the new
kvm_arch_has_private_mem scope, as below. Or just delete the comment.

--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2572,7 +2572,7 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
  {
         return false;
  }
-#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
+#endif /* kvm_arch_has_private_mem */

  #ifdef CONFIG_KVM_GUEST_MEMFD
  int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
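
As a rough userspace analogy for the runtime selection described in the
commit message, a function pointer can play the role of the static call.
This is only a sketch: real static calls patch the call site rather than
loading a pointer, and mem_is_private_fn, select_mem_is_private() and
the two toy lookups are hypothetical names with made-up rules.

```c
#include <stdbool.h>
#include <stdint.h>

typedef bool mem_is_private_fn(uint64_t gfn);

/* Toy "VM attribute" lookup: GFNs below 16 are private (made-up rule). */
static bool vm_attr_is_private(uint64_t gfn)
{
	return gfn < 16;
}

/* Toy "guest_memfd" lookup: even GFNs are private (made-up rule). */
static bool gmem_is_private(uint64_t gfn)
{
	return (gfn & 1) == 0;
}

/* Currently selected implementation; defaults to the VM-attribute model. */
static mem_is_private_fn *mem_is_private_impl = vm_attr_is_private;

/* Switch the implementation at runtime. */
static void select_mem_is_private(mem_is_private_fn *fn)
{
	mem_is_private_impl = fn;
}

/* Generic entry point: callers never name a specific implementation. */
static bool mem_is_private(uint64_t gfn)
{
	return mem_is_private_impl(gfn);
}
```

Callers only ever use mem_is_private(); which lookup backs it is decided
once, elsewhere, which is the property the rename is preparing for.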


* Re: [PATCH v8 07/46] KVM: Rename memory attribute APIs to prepare for in-place gmem conversion
From: Xiaoyao Li @ 2026-07-01  7:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <akP9Qv_IPVEh7GAB@google.com>

On 7/1/2026 1:30 AM, Sean Christopherson wrote:
> On Tue, Jun 30, 2026, Xiaoyao Li wrote:
>> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
>>> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
>>> -				     unsigned long mask, unsigned long attrs);
>>> +bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
>>> +					unsigned long mask, unsigned long attrs);
>>>    bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
>>>    					struct kvm_gfn_range *range);
>>>    bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>>
>> We have
>>
>>   - kvm_pre_set_memory_attributes()
>>   - kvm_arch_pre_set_memory_attributes()
>>   - kvm_arch_post_set_memory_attributes()
> 
> Yeah, that's probably for the best.
> 
>> left, do they need to be renamed as well?
>>
>> then the interesting one is kvm_vm_set_mem_attributes(), which contains "vm"
>> already while it means "vm ioctl". Do we need to rename it to
>> kvm_vm_set_vm_mem_attributes()?
> 
> I say "no" on this last one, the fact that the function is scoped to a VM ioctl
> is enough to communicate that it applies to per-VM attributes.
> 
> Actually, since it's a local helper, we could go with kvm_set_vm_mem_attributes()
> to be consistent with the other functions.  That just leaves
> kvm_vm_ioctl_set_mem_attributes(), which I think it appropriately scoped.

If we do end up renaming kvm_vm_set_mem_attributes() to
kvm_set_vm_mem_attributes(), I think the tracepoint
trace_kvm_vm_set_mem_attributes() should be renamed as well, to stay
consistent?


* Re: [RFC PATCH v2 4/4] rtla/osnoise: Leverage IPI event filters when tracing a subset of CPUs
From: Tomas Glozar @ 2026-07-01  6:45 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Costa Shulyupin,
	Crystal Wood, John Kacur, Ivan Pravdin, Jonathan Corbet
In-Reply-To: <xhsmhmrwcw2ll.mognet@vschneid-thinkpadt14sgen2i.remote.csb>

On Tue, Jun 30, 2026 at 16:00, Valentin Schneider
<vschneid@redhat.com> wrote:
>
> On 30/06/26 12:14, Tomas Glozar wrote:
> > On Wed, Jun 17, 2026 at 15:18, Valentin Schneider
> > <vschneid@redhat.com> wrote:
> >> @@ -406,6 +408,33 @@ struct osnoise_tool *osnoise_init_top(struct common_params *params)
> >>                 goto out_err;
> >>         }
> >>
> >> +       /*
> >> +        * If tracing on a subset of possible CPUs, leverage the kernel filtering
> >> +        * infrastructure to only generate events on traced CPUs.
> >> +        */
> >> +       if (params->cpus) {
> >> +               char filter[MAX_PATH];
> >> +
> >> +               snprintf(filter, ARRAY_SIZE(filter), "cpu & CPUS{%s}\n", params->cpus);
> >> +               retval = tracefs_event_file_write(tool->trace.inst,
> >> +                                                 "ipi", "ipi_send_cpu", "filter",
> >> +                                                 filter);
> >> +               if (retval) {
> >
> > retval is the number of bytes written here, so this should be "retval
> > < 0" like in trace_event_enable_filter() in trace.c. Same below.
> >
>
> According to the docstring:
>
>  * Return 0 on success, and -1 on error.
>
> but regardless yes that should be a '< 0' check to match existing code.
>

I double-checked that and you are correct that the docstring says so,
but it's an error in the docstring. According to the manpage, it
returns the number of bytes written (i.e. positive on success, not
zero) [1]:

"RETURN VALUE
...
tracefs_event_file_write() and tracefs_event_file_append() returns
*the number of bytes written to the system/event file* or negative on
error."

The code agrees as well: in tracefs_event_file_write() there's the
wrong docstring (likely copied from another function) [2]:

/*
 * tracefs_event_file_write - write to an event file
 * ...
 * Return 0 on success, and -1 on error.
 */
int tracefs_event_file_write(struct tracefs_instance *instance,
    const char *system, const char *event,
    const char *file, const char *str)
{
        ....
        ret = tracefs_instance_file_write(instance, path, str);
        free(path);
        return ret;
}

but the source of the return value is tracefs_instance_file_write(),
where the docstring is correct [3]:

/**
 * tracefs_instance_file_write - Write in trace file of specific instance.
 * ...
 * Returns the number of written bytes, or -1 in case of an error
 */
int tracefs_instance_file_write(struct tracefs_instance *instance,
const char *file, const char *str)
{
        return instance_file_write(instance, file, str, O_WRONLY | O_TRUNC);
}

instance_file_write() gets the return value from write_file() [4]
which returns the return value of write() [5].

[1] https://man7.org/linux/man-pages/man3/tracefs_event_get_file.3.html
[2] https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/tree/src/tracefs-events.c#n686
[3] https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/tree/src/tracefs-instance.c#n532
[4] https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/tree/src/tracefs-instance.c#n514
[5] https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/tree/src/tracefs-instance.c#n496

So the "< 0" is required, the CPU filter doesn't work without it:

[tglozar@fedora rtla]$ uname -a
Linux fedora 7.0.12-101.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jun 11 01:32:26 UTC 2026 x86_64 GNU/Linux
[tglozar@fedora rtla]$ sudo ./rtla osnoise top -q -c 0 --ipi
Could not set ipi_send_cpu CPU filter, return value: 14
Could not init osnoise tool
[tglozar@fedora rtla]$ sudo ./rtla osnoise top -q -d 5s --ipi
                                         Operating System Noise
...
[tglozar@fedora rtla]$

(output with additional debug print)
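
A minimal sketch of why the two checks differ, assuming a helper that
follows the write() convention described above. fake_filter_write() is a
hypothetical stand-in, not the libtracefs function; it only models the
return value. For the "-c 0" run, the filter string built by the patch
would be "cpu & CPUS{0}\n", which is 14 bytes long and so consistent
with the 14 in the debug output.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical stand-in for a write()-style helper: returns the number
 * of bytes "written" on success and -1 on error. It models the return
 * convention only and touches no files.
 */
static ssize_t fake_filter_write(const char *str)
{
	if (!str)
		return -1;
	return (ssize_t)strlen(str);
}

/* Buggy check: any non-zero return, including a byte count, is a failure. */
static bool write_failed_nonzero(ssize_t retval)
{
	return retval != 0;
}

/* Check matching the convention: only negative values are errors. */
static bool write_failed_negative(ssize_t retval)
{
	return retval < 0;
}
```

With a successful 14-byte write, write_failed_nonzero() reports a
failure while write_failed_negative() does not.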

> >> +                       err_msg("Could not set ipi_send_cpu CPU filter\n");
> >> +                       goto out_err;
> >
> > It would be useful to have --ipi work even on older kernels that don't
> > yet have your cpumask trace event filter patchset [1], for example, by
> > printing a debug message that filtering is disabled and setting a flag
> > instead of erroring out here. Then the code in
> > osnoise_ipi_cpu_handler() can preserve the CPU_ISSET check if the flag
> > is set.
> >
> > As --ipi is optional, we can choose to only support it on newer
> > kernels, but it would be nice to have it working without the filter,
> > too.
> >
> > [1] https://lore.kernel.org/linux-trace-kernel/20230707172155.70873-1-vschneid@redhat.com/T/#u
> >
>
> Makes sense, will do.
>

Thanks!
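
One possible shape for that fallback, sketched in userspace C under
stated assumptions: struct ipi_filter_state, its no_kernel_filter flag
and should_count_ipi() are hypothetical names, and a 64-bit mask stands
in for the cpu set behind the CPU_ISSET check mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-tool state; none of these names come from rtla. */
struct ipi_filter_state {
	bool no_kernel_filter;	/* set when writing the event filter failed */
	uint64_t monitored;	/* bit N set means CPU N is being traced */
};

/* Decide whether an IPI event targeting @cpu should be accounted. */
static bool should_count_ipi(const struct ipi_filter_state *st,
			     unsigned int cpu)
{
	/* Kernel-side filter in place: every event that arrives is wanted. */
	if (!st->no_kernel_filter)
		return true;

	/* Fallback for kernels without the filter: check in userspace. */
	if (cpu >= 64)
		return false;
	return (st->monitored >> cpu) & 1;
}
```

The flag is set once at init time, so the per-event cost of the
fallback is a single branch plus the mask test.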

> [truncated]
> >
> > I was thinking that it might make sense to enable the filters also for
> > the trace output instance. On the other hand, it would make it
> > difficult to enable the event without the filter then, as specifying
> > "-e ipi" or similar only re-enables the event but does not remove the
> > filter. Maybe the better idea is to implement an option to filter any
> > event enabled through -e/--event only to the measurement CPU, as a
> > separate feature.
> >
>
> I had actually forgotten about applying the filters for the output
> instance... I'll look into it.
>

Thanks. I gave it some more thought and realized enabling the event
without the filter should not be complicated at all. We can just
remove existing filters in trace_events_enable(), as
trace_events_enable() is called after osnoise_init_trace_tool() in
run_tool(). So that will make an explicit "-e ipi" drop the filter
from "--ipi" on the trace instance and show all IPI events. So you can
disregard my note about filtering -e options, it's not relevant here.

Tomas



* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Yan Zhao @ 2026-07-01  6:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng, aik@amd.com, andrew.jones@linux.dev,
	binbin.wu@linux.intel.com, brauner@kernel.org,
	chao.p.peng@linux.intel.com, david@kernel.org,
	jmattson@google.com, jthoughton@google.com, michael.roth@amd.com,
	oupton@kernel.org, pankaj.gupta@amd.com, qperret@google.com,
	Edgecombe, Rick P, rientjes@google.com, shivankg@amd.com,
	steven.price@arm.com, tabba@google.com, willy@infradead.org,
	wyihan@google.com, forkloop@google.com, pratyush@kernel.org,
	suzuki.poulose@arm.com, aneesh.kumar@kernel.org,
	liam@infradead.org, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86@kernel.org, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Annapurve, Vishal,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
	Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
	Jason Gunthorpe, Vlastimil Babka, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
	linux-mm@kvack.org, linux-coco@lists.linux.dev
In-Reply-To: <akPEKslqAhygyjhg@google.com>

On Tue, Jun 30, 2026 at 09:27:06PM +0800, Sean Christopherson wrote:
> On Tue, Jun 30, 2026, Yan Zhao wrote:
> > On Tue, Jun 30, 2026 at 08:35:49AM +0800, Sean Christopherson wrote:
> > > Gah, I thought I had sent this out this morning, long before Ackerley's response.
> > > But I got distracted by a meeting and forgot to get back to this... *sigh*
> > > 
> > > Sending what I already wrote, even though there's a lot of overlap with Ackerley's
> > > mail.
> > > 
> > > On Mon, Jun 29, 2026, Yan Zhao wrote:
> > > > On Fri, Jun 26, 2026 at 08:28:32AM -0700, Ackerley Tng wrote:
> > > > > Yan Zhao <yan.y.zhao@intel.com> writes:
> > > > > > But if a user configures 0 uaddr as valid, writes to it, and then passes 0 as
> > > > > > source_addr(not from gmem), I'm not sure if it's good for the kernel to silently
> > > > > > treat 0 uaddr as an identifier for in-place copy from the private PFN in gmem.
> > > > > >
> > > > > 
> > > > > I'd say the original uAPI perhaps just didn't document 0 as an
> > > > > unsupported uaddr. Given that commit 2a62345b3052 already merged, uAPI
> > > > > was perhaps accidentally changed and no customer complained, I think we
> > > > > can move forward with 0 as an invalid src_address? I wouldn't think
> > > > > anyone relies on 0 intentionally being a valid address.
> > > > > 
> > > > > I could document that, if it helps?
> > > > What about just documenting that 0 is an unsupported uaddr which will be
> > > > re-purposed as an indicator to use the target pfn as the source, regardless of
> > > > whether gmem_in_place_conversion is true? i.e.,
> > > > 
> > > > if (!src_page) 
> > > > 	src_page = pfn_to_page(pfn);
> > > 
> > > Because KVM can't generally use the target page as the source without in-place
> > > conversion, it's not supported today, and out-of-place conversion is being
> > > deprecated.
> > By "out-of-place conversion", do you mean using per-VM memory attribute
> > conversion?
> 
> Yep, I couldn't come up with a better description.
> 
> > > > I don't get why the two scenarios should be treated differently:
> > > > 1. gmem_in_place_conversion==true, shared memory is not from gmem 
> > > > 2. gmem_in_place_conversion==false, shared memory is not from gmem
> > > > 
> > > > In both cases, a 0 uaddr could be mapped to a valid page not from gmem.
> > > 
> > > That's immaterial.  KVM's ABI (that we're solidifying) is that an address of '0'
> > > for the source means NULL.  The fact that userspace could have a valid mapping
> > > at virtual address '0' is irrelevant.
> > So, I'm wondering if we can document that 0 uaddr could always mean using target
> > PFN.
> 
> I would document it as saying "no source page", and then state that a source page
> is required if in-place conversion isn't enabled/supported/allowed.
Ok.

> > i.e., for both scenarios 1 and 2, as long as 0 uaddr is specified, we always
> > use target PFN as source for in-place add.
> > 
> > > Again, just because something is technically possible doesn't mean it needs to
> > > be supported by every piece of KVM's uAPI.
> > > 
> > > > So why not update the uAPI to handle both cases consistently? :)
> > > 
> > > Because retroactively adding support for out-of-place conversion is pointless
> > > (requires a userspace update for a feature that's being deprecated), KVM can't
> > > generally support using the source for out-of-place conversion (it's effectively
> > > an obscure zero-page optimization), and IMO rejecting the out-of-place conversion
> > > scenario is valuable for KVM developers, e.g. to help newcomers understand what
> > > exactly is and isn't possible.
> > Ok. You mean per-VM memory attributes are being deprecated, and a source
> > page from a !gmem backend is also being deprecated, so we don't want to
> > change uAPI for scenarios under gmem_in_place_conversion==false. Right?
> 
> Right.
> 
> > 
> > > Side topic, isn't TDX broken if target page has already been added to the TD?
> > > IIUC, kvm_tdp_mmu_map_private_pfn() will be a glorified nop due to the page
> > > already having a valid S-EPT mapping, and so KVM will incorrectly allow a double
> > Not sure if I understand out-of-place conversion correctly.
> > Given target PFNs and GFNs are not duplicated, what would cause double add? :)
> 
> I was working through what would happen if userspace did KVM_TDX_INIT_MEM_REGION
> on the same target page multiple times.
Oh. To have KVM_TDX_INIT_MEM_REGION on the same target page multiple times, the
user needs to invoke KVM_TDX_INIT_MEM_REGION on the same GPA multiple times.
In that case, yes, kvm_tdp_mmu_map_private_pfn() will return -EIO.

> > 
> > > add.  Ahhh, no, because KVM will return RET_PF_SPURIOUS and
> > > kvm_tdp_mmu_map_private_pfn() will then return -EIO.
> > My question was whether we could document that a 0 uaddr always means using
> > the target PFN, since TDX's in-place add does not rely on gmem in-place
> > conversion.
> 
> Yeah, I was on a tangent, ignore everything from "Side topic" on.
> 
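
For reference, a minimal sketch of the semantics agreed on above: a source
address of 0 means "no source page", and a source page is required unless
in-place conversion is available. This is not the actual KVM code.
toy_resolve_source(), enum toy_src and the in_place_ok parameter are
hypothetical, and returning -EINVAL for the rejected case is an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result: where the initial contents of the page come from. */
enum toy_src {
	TOY_SRC_USER_PAGE,	/* copy from the userspace source address */
	TOY_SRC_TARGET_PFN,	/* in-place add, target page is the source */
};

/*
 * source_addr == 0 is treated as "no source page" (NULL), regardless of
 * whether userspace could have a valid mapping at virtual address 0.
 */
static int toy_resolve_source(uint64_t source_addr, bool in_place_ok,
			      enum toy_src *out)
{
	if (source_addr) {
		*out = TOY_SRC_USER_PAGE;
		return 0;
	}

	/* No source page: only acceptable with in-place conversion. */
	if (!in_place_ok)
		return -EINVAL;	/* assumed error code */

	*out = TOY_SRC_TARGET_PFN;
	return 0;
}
```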
