* [PATCH v7 2/3] locking/percpu-rwsem: Extract __percpu_up_read()
From: Dmitry Ilvokhin @ 2026-06-04 7:15 UTC (permalink / raw)
To: Peter Zijlstra, Dennis Zhou, Tejun Heo, Christoph Lameter,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long
Cc: linux-mm, linux-kernel, linux-trace-kernel, kernel-team,
Dmitry Ilvokhin, Usama Arif
In-Reply-To: <cover.1780506267.git.d@ilvokhin.com>
Move the percpu_up_read() slowpath out of the inline function into a new
__percpu_up_read() to avoid binary size increase from adding a
tracepoint to an inlined function.
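
The split described above can be sketched in user space as follows. This is
only an illustration of the inline-fast-path / out-of-line-slow-path pattern,
not the kernel code: struct counter, counter_put() and counter_put_slow() are
hypothetical names, and the "wakeup" is modelled as a plain counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lock with a cheap common case. */
struct counter {
	int count;
	bool writer_waiting;	/* when set, the slow path must run */
	int wakeups;		/* times the slow path "woke" a writer */
};

/*
 * Out-of-line slow path. In a real build this would live in a .c file,
 * so anything bulky added here later (such as tracing) is emitted once
 * instead of at every inlined call site.
 */
void counter_put_slow(struct counter *c)
{
	c->count--;
	c->wakeups++;
}

/* Inline fast path: callers only pay for a branch and a decrement. */
static inline void counter_put(struct counter *c)
{
	assert(c->count > 0);	/* a put without a matching get is a bug */

	if (!c->writer_waiting)
		c->count--;
	else
		counter_put_slow(c);
}
```

Under these assumptions, only counter_put_slow() grows if instrumentation is
added to the slow path; the inlined callers stay the same size.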
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/percpu-rwsem.h | 15 +++------------
kernel/locking/percpu-rwsem.c | 18 ++++++++++++++++++
2 files changed, 21 insertions(+), 12 deletions(-)
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index c8cb010d655e..39d5bf8e6562 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -107,6 +107,8 @@ static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
return ret;
}
+extern void __percpu_up_read(struct percpu_rw_semaphore *sem);
+
static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
{
rwsem_release(&sem->dep_map, _RET_IP_);
@@ -118,18 +120,7 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
if (likely(rcu_sync_is_idle(&sem->rss))) {
this_cpu_dec(*sem->read_count);
} else {
- /*
- * slowpath; reader will only ever wake a single blocked
- * writer.
- */
- smp_mb(); /* B matches C */
- /*
- * In other words, if they see our decrement (presumably to
- * aggregate zero, as that is the only time it matters) they
- * will also see our critical section.
- */
- this_cpu_dec(*sem->read_count);
- rcuwait_wake_up(&sem->writer);
+ __percpu_up_read(sem);
}
preempt_enable();
}
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index ef234469baac..f3ee7a0d6047 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -288,3 +288,21 @@ void percpu_up_write(struct percpu_rw_semaphore *sem)
rcu_sync_exit(&sem->rss);
}
EXPORT_SYMBOL_GPL(percpu_up_write);
+
+void __percpu_up_read(struct percpu_rw_semaphore *sem)
+{
+ lockdep_assert_preemption_disabled();
+ /*
+ * slowpath; reader will only ever wake a single blocked
+ * writer.
+ */
+ smp_mb(); /* B matches C */
+ /*
+ * In other words, if they see our decrement (presumably to
+ * aggregate zero, as that is the only time it matters) they
+ * will also see our critical section.
+ */
+ this_cpu_dec(*sem->read_count);
+ rcuwait_wake_up(&sem->writer);
+}
+EXPORT_SYMBOL_GPL(__percpu_up_read);
--
2.53.0-Meta
* [PATCH v7 0/3] locking: contended_release tracepoint instrumentation
From: Dmitry Ilvokhin @ 2026-06-04 7:15 UTC (permalink / raw)
To: Peter Zijlstra, Dennis Zhou, Tejun Heo, Christoph Lameter,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long
Cc: linux-mm, linux-kernel, linux-trace-kernel, kernel-team,
Dmitry Ilvokhin
The existing contention_begin/contention_end tracepoints fire on the
waiter side. The lock holder's identity and stack can be captured at
contention_begin time (e.g. perf lock contention --lock-owner), but only
for locks whose owner it reads: mutex and rwsem. Other lock types have
none it can read, so they get no holder-side attribution today.
This series adds a contended_release tracepoint that fires on the
holder side when a lock with waiters is released. This provides:
- Hold time estimation: when the holder's own acquisition was
contended, its contention_end (acquisition) and contended_release
can be correlated to measure how long the lock was held under
contention.
- The holder's stack at release time, which may differ from what perf lock
contention --lock-owner captures if the holder does significant work between
the waiter's arrival and the unlock.
Note: for reader/writer locks, the tracepoint fires for every reader
releasing while a writer is waiting, not only for the last reader.
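
As a rough illustration of the hold time estimation above, a consumer of the
trace could pair the two events like this. It is a sketch under assumptions:
struct lock_event and hold_time_ns() are hypothetical and do not reflect the
actual tracepoint record layout or any perf code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of one recorded lock event. */
struct lock_event {
	uint64_t ts_ns;		/* timestamp in nanoseconds */
	uint64_t lock_addr;	/* address identifying the lock */
	int tid;		/* task that emitted the event */
};

/*
 * Estimate how long a lock was held under contention: the interval from
 * the holder's contention_end (its own contended acquisition) to its
 * contended_release. Returns false when the two events do not belong
 * together, so the caller can skip the pair.
 */
bool hold_time_ns(const struct lock_event *acquired,
		  const struct lock_event *released, uint64_t *out)
{
	assert(acquired && released && out);

	if (acquired->tid != released->tid ||
	    acquired->lock_addr != released->lock_addr ||
	    released->ts_ns < acquired->ts_ns)
		return false;

	*out = released->ts_ns - acquired->ts_ns;
	return true;
}
```

The estimate only exists when the holder's own acquisition was contended;
an uncontended acquisition emits no contention_end to pair with.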
v6 -> v7:
- Dropped the spinlock instrumentation patches; I'll rework them based on
Peter's static call idea. See [2].
v5 -> v6:
- Use trace_call__contended_release() instead of trace_contended_release(),
where appropriate to avoid a redundant static branch check when the caller
already guards with trace_contended_release_enabled().
- Added acked-by from Paul.
- Rebase on top of the fresh locking/core.
v4 -> v5:
- Split the combined spinning locks patch into separate qspinlock and
qrwlock patches (Paul E. McKenney).
- Factor out __queued_read_unlock()/__queued_write_unlock() as a
separate preparatory commit, mirroring the queued_spin_release()
split (Paul E. McKenney).
- Updated binary size numbers for qspinlock-only change.
- Added Acked-by and Reviewed-by tags where appropriate.
v3 -> v4:
- Fix spurious events in __percpu_up_read(): guard with
rcuwait_active(&sem->writer) to avoid tracing during the RCU grace
period after a writer releases (Sashiko).
- Fix possible use-after-free in semaphore up(): move
trace_contended_release() inside the sem->lock critical section
(Sashiko).
- Fix build failure with CONFIG_PARAVIRT_SPINLOCKS=y: introduce
queued_spin_release() as the arch-overridable unlock primitive,
so queued_spin_unlock() can be a generic tracing wrapper. Convert
x86 (paravirt) and MIPS overrides (Sashiko).
- Add EXPORT_TRACEPOINT_SYMBOL_GPL(contended_release) for module
support (Sashiko).
- Split spinning locks patch: factor out queued_spin_release() as a
separate preparatory commit (Sashiko).
- Make read unlock tracepoint behavior consistent across all
reader/writer lock types: fire for every reader releasing while
a writer is waiting (rwsem, rwbase_rt were previously last-reader
only).
v2 -> v3:
- Added new patch: extend contended_release tracepoint to queued spinlocks
and queued rwlocks (marked as RFC, requesting feedback). This is prompted by
Matthew Wilcox's suggestion to try to come up with generic instrumentation,
instead of instrumenting each "special" lock manually. See [1] for the
discussion.
- Reworked tracepoint placement to fire before the lock is released and
before the waiter is woken where possible, for consistency with
spinning locks where there is no explicit wake (inspired by Usama Arif's
suggestion).
- Remove unnecessary linux/sched.h include from trace/events/lock.h.
RFC -> v2:
- Add trace_contended_release_enabled() guard before waiter checks that
exist only for the tracepoint (Steven Rostedt).
- Rename __percpu_up_read_slowpath() to __percpu_up_read() (Peter
Zijlstra).
- Add extern for __percpu_up_read() (Peter Zijlstra).
- Squashed tracepoint introduction and usage commits (Masami Hiramatsu).
v6: https://lore.kernel.org/all/cover.1777999826.git.d@ilvokhin.com/
v5: https://lore.kernel.org/all/cover.1776350944.git.d@ilvokhin.com/
v4: https://lore.kernel.org/all/cover.1774536681.git.d@ilvokhin.com/
v3: https://lore.kernel.org/all/cover.1773858853.git.d@ilvokhin.com/
v2: https://lore.kernel.org/all/cover.1773164180.git.d@ilvokhin.com/
RFC: https://lore.kernel.org/all/cover.1772642407.git.d@ilvokhin.com/
[1]: https://lore.kernel.org/all/aa7G1nD7Rd9F4eBH@casper.infradead.org/
[2]: https://lore.kernel.org/all/20260603120811.GW3493090@noisy.programming.kicks-ass.net/
Dmitry Ilvokhin (3):
tracing/lock: Remove unnecessary linux/sched.h include
locking/percpu-rwsem: Extract __percpu_up_read()
locking: Add contended_release tracepoint to sleepable locks
include/linux/percpu-rwsem.h | 15 +++------------
include/trace/events/lock.h | 18 +++++++++++++++++-
kernel/locking/mutex.c | 4 ++++
kernel/locking/percpu-rwsem.c | 29 +++++++++++++++++++++++++++++
kernel/locking/rtmutex.c | 1 +
kernel/locking/rwbase_rt.c | 6 ++++++
kernel/locking/rwsem.c | 10 ++++++++--
kernel/locking/semaphore.c | 4 ++++
8 files changed, 72 insertions(+), 15 deletions(-)
--
2.53.0-Meta
* [PATCH v7 1/3] tracing/lock: Remove unnecessary linux/sched.h include
From: Dmitry Ilvokhin @ 2026-06-04 7:15 UTC (permalink / raw)
To: Peter Zijlstra, Dennis Zhou, Tejun Heo, Christoph Lameter,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long
Cc: linux-mm, linux-kernel, linux-trace-kernel, kernel-team,
Dmitry Ilvokhin, Usama Arif
In-Reply-To: <cover.1780506267.git.d@ilvokhin.com>
None of the trace events in lock.h reference anything from
linux/sched.h. Remove the unnecessary include.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Usama Arif <usama.arif@linux.dev>
---
include/trace/events/lock.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index 8e89baa3775f..da978f2afb45 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -5,7 +5,6 @@
#if !defined(_TRACE_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_LOCK_H
-#include <linux/sched.h>
#include <linux/tracepoint.h>
/* flags for lock:contention_begin */
--
2.53.0-Meta
* Re: [PATCHv4 00/13] uprobes/x86: Fix red zone issue for optimized uprobes
From: Jiri Olsa @ 2026-06-04 6:59 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko
Cc: bpf, linux-trace-kernel
In-Reply-To: <20260526205840.173790-1-jolsa@kernel.org>
On Tue, May 26, 2026 at 10:58:27PM +0200, Jiri Olsa wrote:
> hi,
> Andrii reported an issue with optimized uprobes [1] that can clobber
> redzone area with call instruction storing return address on stack
> where user code may keep temporary data without adjusting rsp.
>
> Fixing this by moving the optimized uprobes on top of 10-bytes nop
> instruction, so we can squeeze another instruction to escape the
> redzone area before doing the call.
>
> Note we need upstream update first for patch 3 (github.com/libbpf/usdt),
> if we decide to take this change.
>
> thanks,
> jirka
>
>
> v1: https://lore.kernel.org/bpf/20260514135342.22130-1-jolsa@kernel.org/
> v2: https://lore.kernel.org/bpf/20260518105957.123445-1-jolsa@kernel.org/
> v3: https://lore.kernel.org/bpf/20260521124411.31133-1-jolsa@kernel.org/
>
> v4 changes:
> - do not use 2nd int3 (at +5 offset) because the call instruction
> is always the same for the given nop10 address [Andrii/Peter]
> - unmap unused trampoline vma after unsuccessful optimization [sashiko]
> - small change to patch#2 moved user_64bit_mode earlier in the path
> and pass/use mm_struct pointer directly from arch_uprobe_optimize
> instead of getting current->mm
> Andrii, keeping your ack, please shout otherwise
hi,
I think the bots did not find anything substantial; I have just small
selftest changes queued for v5
any other feedback/review would be great
thanks,
jirka
>
> v3 changes:
> - use nop10 update suggested by Peter in [2]
> - remove struct uprobe_trampoline object, use vma objects directly instead
> - selftests fixes [sashiko]
> - ack from Andrii
>
> v2 changes:
> - several selftest fixes [sashiko]
> - consolidate is_lea_insn and is_call_insn into single check [Jakub Sitnicki]
> - use proper mm_struct object in __in_uprobe_trampoline check [sashiko]
> - allow to copy uprobe trampolines vma objects on fork [sashiko]
> - change uprobe syscall detection error from -ENXIO to -EPROTO [Andrii]
> - added fork/clone tests
> - I kept the selftest changes and nop5->nop10 changes in separate
> commits for easier review, we can squash them later if we want to keep
> bisect working properly
>
>
> [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> [2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
> ---
> Andrii Nakryiko (1):
> selftests/bpf: Add tests for uprobe nop10 red zone clobbering
>
> Jiri Olsa (12):
> uprobes/x86: Use proper mm_struct in __in_uprobe_trampoline
> uprobes/x86: Remove struct uprobe_trampoline object
> uprobes/x86: Allow to copy uprobe trampolines on fork
> uprobes/x86: Unmap trampoline vma object in case it's unused
> uprobes/x86: Move optimized uprobe from nop5 to nop10
> libbpf: Change has_nop_combo to work on top of nop10
> libbpf: Detect uprobe syscall with new error
> selftests/bpf: Emit nop,nop10 instructions combo for x86_64 arch
> selftests/bpf: Change uprobe syscall tests to use nop10
> selftests/bpf: Change uprobe/usdt trigger bench code to use nop10
> selftests/bpf: Add reattach tests for uprobe syscall
> selftests/bpf: Add tests for forked/cloned optimized uprobes
>
> arch/x86/kernel/uprobes.c | 379 +++++++++++++++++++++++++++++++++++++++++++-----------------------------
> include/linux/uprobes.h | 5 -
> kernel/events/uprobes.c | 10 --
> kernel/fork.c | 1 -
> tools/lib/bpf/features.c | 4 +-
> tools/lib/bpf/usdt.c | 16 +--
> tools/testing/selftests/bpf/bench.c | 20 ++--
> tools/testing/selftests/bpf/benchs/bench_trigger.c | 38 ++++----
> tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh | 2 +-
> tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 307 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> tools/testing/selftests/bpf/prog_tests/usdt.c | 74 ++++++++++++--
> tools/testing/selftests/bpf/progs/test_usdt.c | 25 +++++
> tools/testing/selftests/bpf/usdt.h | 2 +-
> tools/testing/selftests/bpf/usdt_2.c | 15 ++-
> 14 files changed, 653 insertions(+), 245 deletions(-)
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: David Hildenbrand (Arm) @ 2026-06-04 6:42 UTC (permalink / raw)
To: Xie Yuanbin, qiuxu.zhuo, bp, akpm, rostedt, linmiaohe
Cc: linux-edac, linux-kernel, linux-mm, linux-trace-kernel,
mchehab+huawei, tony.luck, torvalds, yi1.lai
In-Reply-To: <20260604014629.3144-1-xieyuanbin1@huawei.com>
On 6/4/26 03:46, Xie Yuanbin wrote:
> On Wed, 3 Jun 2026 21:13:30 +0200, David Hildenbrand (Arm) wrote:
>> Would the following be sufficient to avoid a full revert and the dependency on CONFIG_RAS?
>>
>> diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
>> index aa57cc8f896b..c46b17602578 100644
>> --- a/include/trace/events/memory-failure.h
>> +++ b/include/trace/events/memory-failure.h
>> @@ -1,6 +1,7 @@
>> /* SPDX-License-Identifier: GPL-2.0 */
>> #undef TRACE_SYSTEM
>> -#define TRACE_SYSTEM memory_failure
>> +/* Some user space relies on ras/memory_failure_event */
>> +#define TRACE_SYSTEM ras
>> #define TRACE_INCLUDE_FILE memory-failure
>>
>> #if !defined(_TRACE_MEMORY_FAILURE_H) || defined(TRACE_HEADER_MULTI_READ)
>
> Yes, it should be. In fact, when I sent the V2 patch, I had already
> considered this issue, and that's exactly what I did:
> Link: https://lore.kernel.org/20251104072306.100738-3-xieyuanbin1@huawei.com
>
> However, David Hildenbrand advised me at that time to completely
> remove the dependence on RAS:
> Link: https://lore.kernel.org/01b44e0f-ea2e-406f-9f65-b698b5504f42@kernel.org
Yeah, if only I had known that we would break user space by changing trace
events ... now we know :)
Do you have capacity to send a fix?
--
Cheers,
David
* [RFC PATCH 3/3] mm/compaction: respect compact_unevictable_allowed in alloc_contig path
From: Wandun Chen @ 2026-06-04 2:38 UTC (permalink / raw)
To: linux-mm, linux-kernel, linux-trace-kernel, linux-rt-devel
Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, rostedt,
mhiramat, mathieu.desnoyers, david, ljs, liam, rppt, bigeasy,
clrkwllms, Alexander.Krabler
In-Reply-To: <20260604023812.3700316-1-chenwandun1@gmail.com>
From: Wandun Chen <chenwandun@lixiang.com>
vm.compact_unevictable_allowed=0 is used to prevent compacting
unevictable pages. However, isolate_migratepages_range() passes
ISOLATE_UNEVICTABLE regardless of this sysctl, so the setting
has no effect in the alloc_contig path.
Fix it by:
- Keep ISOLATE_UNEVICTABLE for CMA allocation, discussed in [1].
- Honour sysctl_compact_unevictable_allowed for non-CMA allocation.
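
The two rules above reduce to a single predicate. The sketch below models it
in user space; MODEL_ACR_FLAGS_CMA and model_allow_unevictable() are
hypothetical names, and the flag value is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag value standing in for ACR_FLAGS_CMA. */
#define MODEL_ACR_FLAGS_CMA 0x1u

static_assert(MODEL_ACR_FLAGS_CMA != 0, "the CMA flag must be a real bit");

/*
 * Decide whether unevictable folios may be isolated in the alloc_contig
 * path: CMA requests always may, everything else follows the sysctl.
 */
bool model_allow_unevictable(unsigned int alloc_flags, bool sysctl_allowed)
{
	return (alloc_flags & MODEL_ACR_FLAGS_CMA) || sysctl_allowed;
}
```

In this model only a non-CMA request with the sysctl set to 0 ends up with
unevictable isolation disabled.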
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Wandun Chen <chenwandun@lixiang.com>
Link: https://lore.kernel.org/all/25ba0d77-eb61-4efc-b2fc-73878cbd85c1@suse.cz/ [1]
---
include/linux/compaction.h | 6 ++++++
mm/compaction.c | 9 +++++++--
mm/internal.h | 1 +
mm/page_alloc.c | 2 ++
4 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index f29ef0653546..04e60f65b976 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -106,6 +106,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
extern void __meminit kcompactd_run(int nid);
extern void __meminit kcompactd_stop(int nid);
extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int highest_zoneidx);
+extern bool compaction_allow_unevictable(void);
#else
static inline void reset_isolation_suitable(pg_data_t *pgdat)
@@ -131,6 +132,11 @@ static inline void wakeup_kcompactd(pg_data_t *pgdat,
{
}
+static inline bool compaction_allow_unevictable(void)
+{
+ return true;
+}
+
#endif /* CONFIG_COMPACTION */
struct node;
diff --git a/mm/compaction.c b/mm/compaction.c
index 007d5e00a8ae..a10acb273454 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1341,6 +1341,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
unsigned long end_pfn)
{
unsigned long pfn, block_start_pfn, block_end_pfn;
+ isolate_mode_t mode = cc->allow_unevictable ? ISOLATE_UNEVICTABLE : 0;
int ret = 0;
/* Scan block by block. First and last block may be incomplete */
@@ -1360,8 +1361,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
block_end_pfn, cc->zone))
continue;
- ret = isolate_migratepages_block(cc, pfn, block_end_pfn,
- ISOLATE_UNEVICTABLE);
+ ret = isolate_migratepages_block(cc, pfn, block_end_pfn, mode);
if (ret)
break;
@@ -1902,6 +1902,11 @@ typedef enum {
* compactable pages.
*/
static int sysctl_compact_unevictable_allowed __read_mostly = CONFIG_COMPACT_UNEVICTABLE_DEFAULT;
+
+bool compaction_allow_unevictable(void)
+{
+ return sysctl_compact_unevictable_allowed;
+}
/*
* Tunable for proactive compaction. It determines how
* aggressively the kernel should compact memory in the
diff --git a/mm/internal.h b/mm/internal.h
index 181e79f1d6a2..163f9d6b37f3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1052,6 +1052,7 @@ struct compact_control {
* ensure forward progress.
*/
bool alloc_contig; /* alloc_contig_range allocation */
+ bool allow_unevictable; /* Allow isolation of unevictable folios */
};
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 81a9d4d1e6c0..1cf9d4a3b14c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7118,6 +7118,8 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
.ignore_skip_hint = true,
.no_set_skip_hint = true,
.alloc_contig = true,
+ .allow_unevictable = !!(alloc_flags & ACR_FLAGS_CMA) ||
+ compaction_allow_unevictable(),
};
INIT_LIST_HEAD(&cc.migratepages);
enum pb_isolate_mode mode = (alloc_flags & ACR_FLAGS_CMA) ?
--
2.43.0
* [RFC PATCH 2/3] mm/compaction: add per-folio isolation tracepoint
From: Wandun Chen @ 2026-06-04 2:38 UTC (permalink / raw)
To: linux-mm, linux-kernel, linux-trace-kernel, linux-rt-devel
Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, rostedt,
mhiramat, mathieu.desnoyers, david, ljs, liam, rppt, bigeasy,
clrkwllms, Alexander.Krabler
In-Reply-To: <20260604023812.3700316-1-chenwandun1@gmail.com>
From: Wandun Chen <chenwandun@lixiang.com>
Add a tracepoint that fires once per folio successfully isolated by
isolate_migratepages_block(), recording the pfn, isolation mode and
the folio flags. Knowing these makes it easier to debug unexpected
isolation, such as mlocked or unevictable folios showing up on
PREEMPT_RT kernels [1].
Inspired-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Wandun Chen <chenwandun@lixiang.com>
Link: https://lore.kernel.org/all/20250820142919.HTybedrl@linutronix.de/ [1]
---
include/trace/events/compaction.h | 26 ++++++++++++++++++++++++++
mm/compaction.c | 2 ++
2 files changed, 28 insertions(+)
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index d05759d18538..8b8b3ec0f324 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -76,6 +76,32 @@ DEFINE_EVENT(mm_compaction_isolate_template, mm_compaction_fast_isolate_freepage
);
#ifdef CONFIG_COMPACTION
+TRACE_EVENT(mm_compaction_isolate_folio,
+
+ TP_PROTO(unsigned long pfn,
+ isolate_mode_t mode,
+ unsigned long flags),
+
+ TP_ARGS(pfn, mode, flags),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, pfn)
+ __field(isolate_mode_t, mode)
+ __field(unsigned long, flags)
+ ),
+
+ TP_fast_assign(
+ __entry->pfn = pfn;
+ __entry->mode = mode;
+ __entry->flags = flags;
+ ),
+
+ TP_printk("pfn=0x%lx mode=0x%x flags=%s",
+ __entry->pfn,
+ __entry->mode,
+ show_page_flags(__entry->flags & PAGEFLAGS_MASK))
+);
+
TRACE_EVENT(mm_compaction_migratepages,
TP_PROTO(unsigned int nr_migratepages,
diff --git a/mm/compaction.c b/mm/compaction.c
index 7e07b792bcb5..007d5e00a8ae 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1223,6 +1223,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
isolate_success:
list_add(&folio->lru, &cc->migratepages);
isolate_success_no_list:
+ trace_mm_compaction_isolate_folio(folio_pfn(folio), mode,
+ folio->flags.f);
cc->nr_migratepages += folio_nr_pages(folio);
nr_isolated += folio_nr_pages(folio);
nr_scanned += folio_nr_pages(folio) - 1;
--
2.43.0
* [RFC PATCH 1/3] mm/compaction: skip isolate mlocked folios when compact_unevictable_allowed=0
From: Wandun Chen @ 2026-06-04 2:38 UTC (permalink / raw)
To: linux-mm, linux-kernel, linux-trace-kernel, linux-rt-devel
Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, rostedt,
mhiramat, mathieu.desnoyers, david, ljs, liam, rppt, bigeasy,
clrkwllms, Alexander.Krabler
In-Reply-To: <20260604023812.3700316-1-chenwandun1@gmail.com>
From: Wandun Chen <chenwandun@lixiang.com>
compact_unevictable_allowed defaults to 0 under PREEMPT_RT, so
isolate_migratepages_block() skips folios with PG_unevictable set.
However, mlock_folio() sets PG_mlocked immediately but defers
PG_unevictable to mlock_folio_batch(), resulting in a folio with
PG_mlocked=1 but PG_unevictable=0. Compaction will isolate such a
folio.
Fix by checking folio_test_mlocked() together with the existing
folio_test_unevictable() check.
A similar issue has been reported by Alexander Krabler on a 6.12-rt
aarch64 system. Vlastimil suggested to check the mlocked flag [1].
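
The window described above can be modelled with plain bit flags. This is a
user-space sketch, not kernel code: the MODEL_* values and the skip_old() /
skip_new() helpers are hypothetical, and the kernel's real flag layout
differs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values standing in for the page flags and the mode. */
#define MODEL_PG_MLOCKED		(1u << 0)
#define MODEL_PG_UNEVICTABLE		(1u << 1)
#define MODEL_ISOLATE_UNEVICTABLE	(1u << 0)

static_assert(MODEL_PG_MLOCKED != MODEL_PG_UNEVICTABLE,
	      "the two page flags must be distinct bits");

/* Old check: only PG_unevictable makes compaction skip the folio. */
bool skip_old(unsigned int mode, unsigned int flags)
{
	return !(mode & MODEL_ISOLATE_UNEVICTABLE) &&
	       (flags & MODEL_PG_UNEVICTABLE);
}

/* New check: PG_mlocked alone is enough to skip the folio. */
bool skip_new(unsigned int mode, unsigned int flags)
{
	return !(mode & MODEL_ISOLATE_UNEVICTABLE) &&
	       (flags & (MODEL_PG_UNEVICTABLE | MODEL_PG_MLOCKED));
}
```

A folio with only MODEL_PG_MLOCKED set is the case that differs: skip_old()
lets it through, skip_new() does not.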
Reported-by: Alexander Krabler <Alexander.Krabler@kuka.com>
Closes: https://lore.kernel.org/all/DU0PR01MB10385345F7153F334100981888259A@DU0PR01MB10385.eurprd01.prod.exchangelabs.com/
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Wandun Chen <chenwandun@lixiang.com>
Link: https://lore.kernel.org/all/33275585-f2db-4779-89f0-3ae24b455a67@suse.cz/ [1]
---
mm/compaction.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index b776f35ad020..7e07b792bcb5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1116,7 +1116,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
is_unevictable = folio_test_unevictable(folio);
/* Compaction might skip unevictable pages but CMA takes them */
- if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
+ if (!(mode & ISOLATE_UNEVICTABLE) &&
+ (is_unevictable || folio_test_mlocked(folio)))
goto isolate_fail_put;
/*
--
2.43.0
* [RFC PATCH 0/3] mm/compaction: honour compact_unevictable_allowed in mlock race and alloc_contig path
From: Wandun Chen @ 2026-06-04 2:38 UTC (permalink / raw)
To: linux-mm, linux-kernel, linux-trace-kernel, linux-rt-devel
Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, rostedt,
mhiramat, mathieu.desnoyers, david, ljs, liam, rppt, bigeasy,
clrkwllms, Alexander.Krabler
From: Wandun Chen <chenwandun@lixiang.com>
vm.compact_unevictable_allowed=0 is meant to keep compaction from
touching unevictable folios. In practice there are still two paths
where it does not take effect. This series fixes them and adds a
tracepoint to make such issues easier to diagnose in the future.
Wandun Chen (3):
mm/compaction: skip isolate mlocked folios when
compact_unevictable_allowed=0
mm/compaction: add per-folio isolation tracepoint
mm/compaction: respect compact_unevictable_allowed in alloc_contig
path
include/linux/compaction.h | 6 ++++++
include/trace/events/compaction.h | 26 ++++++++++++++++++++++++++
mm/compaction.c | 14 +++++++++++---
mm/internal.h | 1 +
mm/page_alloc.c | 2 ++
5 files changed, 46 insertions(+), 3 deletions(-)
--
2.43.0
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Xie Yuanbin @ 2026-06-04 1:46 UTC (permalink / raw)
To: david, qiuxu.zhuo, bp, akpm, rostedt, linmiaohe
Cc: linux-edac, linux-kernel, linux-mm, linux-trace-kernel,
mchehab+huawei, tony.luck, torvalds, xieyuanbin1, yi1.lai
In-Reply-To: <b637ede2-73da-49f0-a7eb-70ec79e79624@kernel.org>
On Wed, 3 Jun 2026 21:13:30 +0200, David Hildenbrand (Arm) wrote:
> Would the following be sufficient to avoid a full revert and the dependency on CONFIG_RAS?
>
> diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
> index aa57cc8f896b..c46b17602578 100644
> --- a/include/trace/events/memory-failure.h
> +++ b/include/trace/events/memory-failure.h
> @@ -1,6 +1,7 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #undef TRACE_SYSTEM
> -#define TRACE_SYSTEM memory_failure
> +/* Some user space relies on ras/memory_failure_event */
> +#define TRACE_SYSTEM ras
> #define TRACE_INCLUDE_FILE memory-failure
>
> #if !defined(_TRACE_MEMORY_FAILURE_H) || defined(TRACE_HEADER_MULTI_READ)
Yes, it should be. In fact, when I sent the V2 patch, I had already
considered this issue, and that's exactly what I did:
Link: https://lore.kernel.org/20251104072306.100738-3-xieyuanbin1@huawei.com
However, David Hildenbrand advised me at that time to completely
remove the dependence on RAS:
Link: https://lore.kernel.org/01b44e0f-ea2e-406f-9f65-b698b5504f42@kernel.org
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-06-04 1:43 UTC (permalink / raw)
To: Gregory Price
Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <ah_RcTU8SpQG7hab@gourry-fedora-PF4VCD3F>
On Wed, Jun 03, 2026 at 08:02:09AM +0100, Gregory Price wrote:
> On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > On Tue, Jun 02, 2026 at 09:57:48AM +0100, Gregory Price wrote:
> > > On Tue, Jun 02, 2026 at 12:16:50PM +1000, Balbir Singh wrote:
> > > >
> > > > I was thinking we wouldn't need explicit flags and that allocations would
> > > > happen from user space using __GFP_THISNODE to the node or via a nodemask
> > > > based on nodes of interest. Is there a reason to add this flag, a system
> > > > might have more than one source of N_MEMORY_PRIVATE?
> > > >
> > >
> > > There's a few things to unpack here. I discussed this many times on
> > > list and at LSF, but to reiterate.
> > >
> > > 1) __GFP_THISNODE is insufficient to enforce isolation and otherwise
> > > not particularly useful. Additionally, from userland, it's not
> > > something you can actually set.
> >
> > I was thinking mbind()/mempolicy() is how we get to it. It already
> > accepts a nodemask.
> >
>
> First let me say: I want to enable mbind access to these nodes.
>
> But let me caveat: I think that needs more time to develop, and
> in the meantime, we can enable the /dev/xxx pattern somewhat trivially.
>
> First let me address a few things about mbind/mempolicy and how it
> interacts with page_alloc.c, I gave this overview at LSF but I don't
> remember if I posted it in any of my follow ups.
>
>
> 1) Fallback lists are filtered by nodemask, the nodemask does not replace
> the fallback list.
>
> Here is how the page allocator fallback lists and nodemasks interact:
>
> Fallbacks A: A B
> Fallbacks B: B A
> Fallbacks C: C A B (Private)
> Fallbacks D: D B A (Private)
>
Do we want regular memory (N_MEMORY) in the fallback list of device private nodes?
The assumption is that we have ATS translation enabled? Assuming A and
B are N_MEMORY here, or am I misreading your illustration?
> Let's say you pass:
>
> alloc_pages_node(C, ..., nodemask(A,C,D))
>
> So we get
>
> Fallback(C,A,B) & nodemask(A,C,D) -> iterate(C,A)
>
> If we wanted to change this behavior, realistically we'd be looking for
> a way to add specific nodes to certain fallback lists - rather than
> modify the nodemask interaction in some way.
Yes, that is what we did with CDM, control the fallback for
N_MEMORY_PRIVATE, but there is a design decision to be made here.
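
The filtering in the quoted illustration can be sketched in user space as
below. It only models the rule "the nodemask filters the fallback list, it
does not replace it"; the node enum, NODE_BIT() and filter_fallback() are
hypothetical and are not the page allocator's zonelist code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node ids matching the A/B/C/D illustration. */
enum { NODE_A, NODE_B, NODE_C, NODE_D, NR_NODES };

#define NODE_BIT(n) (1u << (n))

/*
 * Walk a node's fallback list in order and keep only the nodes present
 * in the nodemask. The mask never adds nodes to the list and never
 * reorders it. Returns the number of nodes written to out, which must
 * have room for len entries.
 */
size_t filter_fallback(const int *fallback, size_t len,
		       unsigned int nodemask, int *out)
{
	size_t n = 0;

	assert(fallback && out);

	for (size_t i = 0; i < len; i++) {
		if (nodemask & NODE_BIT(fallback[i]))
			out[n++] = fallback[i];
	}
	return n;
}
```

With the fallback list C, A, B and the mask (A, C, D), the walk visits C and
then A; D is in the mask but is never visited because it is not on C's
fallback list.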
>
> I think this is out of scope for the first iteration - so supporting
> anything other than mbind() from the start is just pointless.
>
> The only feasible mempolicy you can apply is single-node bind, so
> realistically you can only support mbind.
>
>
> 2) full mempolicy support doesn't really make sense
>
> task mempolicy PROBABLY should never really touch private nodes,
> while VMA policy certainly can. Assuming we're able to support
> multi-private-node masks, none of the non-bind mempolicies even
> make sense for most private nodes (interleave? weighted interleave?)
>
Yes, mostly, but is that baked into the design? If so, why?
> I haven't worked through all the implications of a task policy having
> a private node attached, but the longer I think about it, the less it
> makes sense to just support this outright.
>
>
> 3) Introducing mbind support is not just a simple nodemask on a VMA,
> It also implies migration, cgroup/cpuset, and UAPI interactions.
>
> a) migration:
>
> mbind/mempolicy can and will engage migration when it is called
> with certain flags. Migration has subtle LRU interactions, but
> the patch set I have at least allows this to work.
>
> b) cgroup/cpuset:
>
> cpuset.mems rebinding will cause private nodes to be quietly
> rebound to non-private nodes within a nodemask.
>
> c) between A and B - we really want MPOL_F_STATIC to be required
> for mbind to be applied to private node so that it is never
> forcefully remapped.
>
> That's a UAPI semantic change specific for private nodes we
> should really take time to consider.
>
>
> 4) File VMA interactions don't entirely make sense with mbind
>
> In theory you might want:
>
> fd = open("somefile", ...);
> mem = mmap(fd, ...);
> mbind(mem, ..., private_node);
> for page in mem:
> mem[page_off] /* fault file into private memory */
>
> In reality: This does not work the way you want.
Why not? Just curious about what you found?
>
> I went digging and we need a few mild extensions to allow
> migration on mbind to work for pagecache pages, and the fault
> path does not necessarily respect the vma mempolicy always.
>
> You also start getting into the question of "what happens when
> the node is out of memory and you don't have reclaim support?".
Yes, we should discuss reclaim support, I think we should allow for
reclaim. It allows you to overcommit private memory the way we can
with regular memory.
> The OOM implications jump out at you pretty aggressively.
>
> Moreover other tasks can force the page cache pages to be moved
> as well. So the programming model here just kind of sucks.
>
> Works great for anon memory though :]
>
> For all these reasons, I think the mbind/mempolicy support with
> private nodes needs to be brought in with follow up work - not
> introduced as part of the baseline set.
>
I am not opposed to the follow up work, but I feel mbind() should
be the fundamental work and user space API.
> > >
> > > for node in possible_nodes:
> > > alloc_pages_node(private_node, __GFP_THISNODE)
> > >
> > > In fact it's the opposite semantic of what we want.
> > > THISNODE says: "Do not fall back to OTHER nodes".
> > >
> >
> > That's why we need to control the fallback nodes carefully for
> > N_MEMORY_PRIVATE
> >
>
> My point is that __GFP_THISNODE is not actually useful.
>
> If we go by nodemask, submitting a single-node nodemask is the
> equivalent of an empty fallback list.
>
> If we gate access to a private node by __GFP_THISNODE... this is the
> same as just providing a single-node nodelist (putting aside the OOM
> implications for a moment).
>
> And it doesn't even buy you any new filtering ability against existing
> nodemask iterators that may already utilize __GFP_THISNODE. i.e.
>
> for node in online_nodes:
> alloc_pages_node(node, __GFP_THISNODE, ...)
> /* Alloc per-node resources */
>
> This pattern is undesirable, but completely valid.
>
> So overloading/requiring __GFP_THISNODE is just not useful.
>
> I will follow up soon with a new version that limits the private node
> interface to just nodemask and fallback list controls.
>
> I need to test a few more things related to removing normal nodes from
> private node fallbacks before I feel comfortable shipping without
> __GFP_PRIVATE.
>
> > > The semantic we want is "Do not allow allocations from private
> > > nodes UNLESS we specifically request" (__GFP_PRIVATE).
> > >
> > > __GFP_THISNODE does not actually buy you anything here, AND it's
> > > worse, in the scenario where a private node makes its way into the
> > > preferred slot (via possible_nodes or some other nodemask), the
> > > allocator cannot fall back to a node it can access.
> > >
> > > __GFP_THISNODE cannot be overloaded to do anything useful here.
> >
> > Let me clarify, I meant to say, let's use a nodemask for allocation
> > and __GFP_THISNODE gets us to the node we desire, if that is the only
> > node. My earlier comment might not have been clear.
> >
>
> My point was that __GFP_THISNODE is pointless and reduces to providing a
> single node nodemask anyway.
>
> The contention over __GFP_PRIVATE is a bit ideological - do we want:
>
> 1) A hard guarantee that allocations to a private node are controlled
> (__GFP_PRIVATE implies the caller knows what it's doing)
>
> or
>
> 2) A soft guarantee (fallback list isolation only), and needing to
> deal with undesired behavior that's "not technically a bug"
> associated with existing users of global nodemasks (possible,
> online, etc).
>
> I am arguing for #1 - the community has argued for #2 and "fixing
> existing nodemask users". I think we can ship #2 and pivot to #1 if we
> find fixing existing users is infeasible or too much of a maintenance
> burden.
Again happy to discuss this, I'd like to make sure we agree on the
design. I am wondering if there is any experimental data to choose
between 1 and 2.
>
> >
> > Why not use mbind() API's? Do we want to gate allocation/privileges
> > via a /dev?
> >
>
> We want to eventually enable it, but we really need to treat these
> extensions as a separate step from the base so that the UAPI
> implications are given proper scrutiny.
>
> In the short term, /dev/xxx and driver-local/service-local control
> of a node is still very useful.
>
> For example, for my compressed memory work, I have found that if
> implemented as a swap backend - the kernel can manage the node without
> any UAPI implications at all :].
>
> A driver managing memory on a private node could do the same.
>
> ~Gregory
Thanks for the detailed answers, happy to iterate and experiment on
the design with you, my opinions come from way back when we tried
to do CDM (in its first iteration).
Balbir
^ permalink raw reply
* Re: [PATCH bpf-next v2 2/3] tracing: Expose tracepoint BTF ids via tracefs
From: Andrii Nakryiko @ 2026-06-03 23:51 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mykyta Yatsenko, bpf, Mykyta Yatsenko, linux-trace-kernel,
Andrii Nakryiko, Alexei Starovoitov
In-Reply-To: <20260603185057.4a85d5df@fedora>
On Wed, Jun 3, 2026 at 3:51 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 3 Jun 2026 15:41:50 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > That's just to say that it would be great to have this in some form or
> > shape, so please help getting this into acceptable form, thanks!
>
> As I said, I'm not against it. I was hoping to find something to help
> alleviate the memory usage.
>
> I'll likely take these as is, but I'm currently on vacation and it will
> have to wait until next week.
>
Yeah, no worries, I was just going through my backlog and trying to
think if we can avoid extending the tracepoint class struct. Besides
the somewhat dirty kallsyms symbol lookup logic, I don't see any other
way.
Enjoy your vacation!
> -- Steve
>
^ permalink raw reply
* Re: [PATCH v3] tracing: fix CFI violation in probestub test
From: Masami Hiramatsu @ 2026-06-03 23:47 UTC (permalink / raw)
To: Eva Kurchatova
Cc: rostedt, linux-trace-kernel, linux-kernel, mathieu.desnoyers,
peterz, jpoimboe, samitolvanen
In-Reply-To: <20260603153147.573589-1-eva.kurchatova@virtuozzo.com>
On Wed, 3 Jun 2026 18:31:42 +0300
Eva Kurchatova <eva.kurchatova@virtuozzo.com> wrote:
> When multiple callbacks are registered on the same tracepoint,
> callbacks will be indirectly called via traceiter helper.
>
> Pointers to __probestub_* callbacks reside in __tracepoints section,
> which is excluded from ENDBR checks in objtool, causing objtool to
> assume those functions are never indirectly called.
>
> Registering multiple callbacks using sched_wakeup test will result
> in #CP exception due to missing ENDBR in __probestub_sched_wakeup
> on a CFI-enabled machine.
>
> Fix this by adding CFI_NOSEAL annotation to probestub declaration.
Thanks for the update, this looks good to me.
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Peter, will you pick up this fix, since it fixes an objtool change?
Thank you,
>
> Fixes: d5173f753750 ("objtool: Exclude __tracepoints data from ENDBR checks")
> Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>
> ---
> include/linux/tracepoint.h | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 763eea4d80d8..2d2b9f8cdda4 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -20,6 +20,7 @@
> #include <linux/rcupdate_trace.h>
> #include <linux/tracepoint-defs.h>
> #include <linux/static_call.h>
> +#include <linux/cfi.h>
>
> struct module;
> struct tracepoint;
> @@ -389,6 +390,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> void __probestub_##_name(void *__data, proto) \
> { \
> } \
> + /* \
> + * Annotate the probestub 'CFI_NOSEAL' to stop objtool from \
> + * requesting the kernel remove the ENDBR, because the only \
> + * references to the function are in the __tracepoint section, \
> + * that objtool doesn't scan. \
> + */ \
> + CFI_NOSEAL(__probestub_##_name); \
> DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name); \
> DEFINE_RUST_DO_TRACE(_name, TP_PROTO(proto), TP_ARGS(args))
>
> --
> 2.54.0
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [GIT PULL] rv fixes for v7.1
From: Steven Rostedt @ 2026-06-03 23:16 UTC (permalink / raw)
To: Gabriele Monaco; +Cc: linux-kernel, linux-trace-kernel, unknownbbqrx, Wen Yang
In-Reply-To: <20260603125056.75559-1-gmonaco@redhat.com>
On Wed, 3 Jun 2026 14:50:56 +0200
Gabriele Monaco <gmonaco@redhat.com> wrote:
> unknownbbqrx (1):
> tools/rv: Ensure monitor name and desc are NUL-terminated
Hi Gabriele,
What is this? All commits need to be authored by and signed off by
a real person with their official name.
https://docs.kernel.org/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
-- Steve
^ permalink raw reply
* Re: [PATCH bpf-next v2 2/3] tracing: Expose tracepoint BTF ids via tracefs
From: Steven Rostedt @ 2026-06-03 22:50 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Mykyta Yatsenko, bpf, Mykyta Yatsenko, linux-trace-kernel,
Andrii Nakryiko, Alexei Starovoitov
In-Reply-To: <CAEf4BzY5MTGJyys369qDS3b63-tKq3okCpFMqUdfzy0dXk7vLg@mail.gmail.com>
On Wed, 3 Jun 2026 15:41:50 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> That's just to say that it would be great to have this in some form or
> shape, so please help getting this into acceptable form, thanks!
As I said, I'm not against it. I was hoping to find something to help
alleviate the memory usage.
I'll likely take these as is, but I'm currently on vacation and it will
have to wait until next week.
-- Steve
^ permalink raw reply
* Re: [PATCH bpf-next v2 2/3] tracing: Expose tracepoint BTF ids via tracefs
From: Andrii Nakryiko @ 2026-06-03 22:41 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mykyta Yatsenko, bpf, Mykyta Yatsenko, linux-trace-kernel,
Andrii Nakryiko, Alexei Starovoitov
In-Reply-To: <20260526215741.1e5b3e42@fedora>
On Tue, May 26, 2026 at 6:57 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 26 May 2026 11:07:56 +0100
> Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
>
> > Hi Steven,
> >
> > Gentle ping on this patch from the series.
> >
> > Since this part touches tracing, I’d appreciate your thoughts on the
> > tracing changes whenever you have a chance.
> >
>
> Hi,
>
> I've been looking at this and was wondering if there are ways to not
> extend the trace_event_class structure. It's added for most trace
> events (actually each DECLARE_EVENT_CLASS). Although when things like
> BTF is enabled, this is a very small amount of extra memory.
>
> I haven't been ignoring this. I've just been thinking about other
> approaches, but haven't come up with anything. Of course, I haven't
> been spending that much time on it, as I've been focused on other
> things.
Just in theory, what alternative would there be besides having one
extra pointer in trace_event_class struct? Some sort of lookup by name
or something? E.g., if we know "call" part at runtime for any given
tracepoint for
.btf_ids = __bpf_trace_btf_ids_##call,
we can probably lookup symbol from kallsyms and fetch BTF IDs that way
without extending the struct?
FWIW, this type info for tracepoints (both classic and raw) is very
useful, because right now one needs to do a bunch of work (subject to
breakage due to kernel type/name changes, etc.) to find it, and for
various generic tracing tooling this type information is actually a
necessity to be useful.
That's just to say that it would be great to have this in some form or
shape, so please help get this into an acceptable form, thanks!
>
> -- Steve
^ permalink raw reply
* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-06-03 21:27 UTC (permalink / raw)
To: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-0-2f0fae496530@google.com>
Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
writes:
> This is v7 of guest_memfd in-place conversion support.
>
Here are the outstanding items after going over everyone's comments,
including Sashiko's:
+ KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
+ Need to move page clearing into __kvm_gmem_get_pfn to resolve
leak where populate can put initialized kernel memory into TDX
guest
+ See suggested fix at [1]
+ KVM: guest_memfd: Only prepare folios for private pages
+ s/non-CoCo/CoCo in commit message "INIT_SHARED is about to be
supported for non-CoCo VMs in a later patch in this series"
+ Use Suggested-by: Michael Roth <michael.roth@amd.com>
+ KVM: selftests: Test that shared/private status is consistent across
processes
+ Improve test reliability using pthread_mutex
+ I have a fixup patch offline.
I would like feedback on these:
+ KVM: selftests: Test conversion with elevated page refcount
+ Askar pointed out that soon vmsplice may not pin pages. Should I
pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
take a dependency on CONFIG_GUP_TEST.
+ KVM: selftests: Add script to exercise private_mem_conversions_test
+ Would like to know what people think of a wrapper script before
I address Sashiko's comments.
[1] https://lore.kernel.org/all/CAEvNRgEVC=fFuKVgZYvWyZD7t_zvUZihFG8hrACjvtkD5cwugw@mail.gmail.com/
[2] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/
>
> [...snip...]
>
^ permalink raw reply
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Andrew Morton @ 2026-06-03 19:54 UTC (permalink / raw)
To: Steven Rostedt
Cc: David Hildenbrand (Arm), Borislav Petkov, Zhuo, Qiuxu,
mchehab+huawei@kernel.org, Luck, Tony, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <20260603130006.7d2c4a62@gandalf.local.home>
On Wed, 3 Jun 2026 13:00:06 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 3 Jun 2026 18:26:24 +0200
> "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>
> > Yeah, I was fearing that when I read in [2]:
> >
> > "It has become clear in the past that this promise extends to
> > tracepoints, most notably in 2011 when a tracepoint change broke
> > powertop and had to be reverted."
>
> Technically the issue is with trace events and not tracepoints. The
> difference is that a trace event is created via the TRACE_EVENT() macro
> which defines what is to be collected from the tracepoint and exposes that
> information to tracefs which applications can easily see.
>
> A tracepoint is simply the hook in the code that you can attach to. Trace
> events create a callback from that hook to extract the data from the
> tracepoint to fill in the fields.
The problem here appears to be that "ras:memory_failure_event" became
"memory_failure:memory_failure_event".
Perhaps we can add infrastructure to permit aliasing "ras" onto
"memory_failure". So if we make these namespace alterations we can
easily preserve back-compatibility?
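
A userspace-side complement to the aliasing idea, as a minimal sketch: a
consumer could probe for the event under both system names and use
whichever one exists. This is not rasdaemon's actual code. The names
`find_event_dir` and `CANDIDATES` and the default tracefs root are
assumptions made for illustration; only the two system/event pairs come
from this thread.

```python
import os

# (system, event) pairs to try, in order of preference. The legacy "ras"
# name is tried first, then the renamed "memory_failure" system.
CANDIDATES = [
    ("ras", "memory_failure_event"),
    ("memory_failure", "memory_failure_event"),
]


def find_event_dir(tracefs_root="/sys/kernel/tracing", candidates=CANDIDATES):
    """Return the first existing events/<system>/<event> directory, or None."""
    for system, event in candidates:
        path = os.path.join(tracefs_root, "events", system, event)
        if os.path.isdir(path):
            return path
    return None


if __name__ == "__main__":
    # On a machine without tracefs mounted (or without the event) this
    # simply prints None.
    print(find_event_dir())
```

This only helps tools that get updated, which is why a kernel-side fix
(aliasing, or keeping the original system name) would still matter for
already-deployed binaries.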
^ permalink raw reply
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Steven Rostedt @ 2026-06-03 19:31 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Borislav Petkov, Zhuo, Qiuxu, mchehab+huawei@kernel.org,
Luck, Tony, akpm@linux-foundation.org, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <b637ede2-73da-49f0-a7eb-70ec79e79624@kernel.org>
On Wed, 3 Jun 2026 21:13:30 +0200
"David Hildenbrand (Arm)" <david@kernel.org> wrote:
> Thanks, that makes sense!
>
> So, would it be fair to say that, in general, what's exposed through
>
> /sys/kernel/tracing/events/
>
> is stable ABI?
It's only stable if something depends on it. It changes all the time.
It's only when someone complains about it that it becomes "stable"!
-- Steve
^ permalink raw reply
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Steven Rostedt @ 2026-06-03 19:30 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Borislav Petkov, Zhuo, Qiuxu, mchehab+huawei@kernel.org,
Luck, Tony, akpm@linux-foundation.org, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <b637ede2-73da-49f0-a7eb-70ec79e79624@kernel.org>
On Wed, 3 Jun 2026 21:13:30 +0200
"David Hildenbrand (Arm)" <david@kernel.org> wrote:
> Would the following be sufficient to avoid a full revert and the dependency on CONFIG_RAS?
>
> diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
> index aa57cc8f896b..c46b17602578 100644
> --- a/include/trace/events/memory-failure.h
> +++ b/include/trace/events/memory-failure.h
> @@ -1,6 +1,7 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #undef TRACE_SYSTEM
> -#define TRACE_SYSTEM memory_failure
> +/* Some user space relies on ras/memory_failure_event */
> +#define TRACE_SYSTEM ras
If that puts back the original path then yeah, all would be good.
-- Steve
> #define TRACE_INCLUDE_FILE memory-failure
>
> #if !defined(_TRACE_MEMORY_FAILURE_H) || defined(TRACE_HEADER_MULTI_READ)
^ permalink raw reply
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: David Hildenbrand (Arm) @ 2026-06-03 19:13 UTC (permalink / raw)
To: Steven Rostedt
Cc: Borislav Petkov, Zhuo, Qiuxu, mchehab+huawei@kernel.org,
Luck, Tony, akpm@linux-foundation.org, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <20260603130006.7d2c4a62@gandalf.local.home>
On 6/3/26 19:00, Steven Rostedt wrote:
> On Wed, 3 Jun 2026 18:26:24 +0200
> "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>
>> Yeah, I was fearing that when I read in [2]:
>>
>> "It has become clear in the past that this promise extends to
>> tracepoints, most notably in 2011 when a tracepoint change broke
>> powertop and had to be reverted."
>
> Technically the issue is with trace events and not tracepoints. The
> difference is that a trace event is created via the TRACE_EVENT() macro
> which defines what is to be collected from the tracepoint and exposes that
> information to tracefs which applications can easily see.
>
> A tracepoint is simply the hook in the code that you can attach to. Trace
> events create a callback from that hook to extract the data from the
> tracepoint to fill in the fields.
>
>>
>> Which means that I now also fully understand
>>
>> "Some kernel maintainers prohibit or severely restrict the addition of
>> tracepoints to their subsystems out of fear that a similar thing could
>> happen to them. "
>>
>> Whatever the result of this discussion will be, I'll try to document it.
>
> You can still create a tracepoint without creating a trace event by using
> the DECLARE_TRACE() macro. The scheduler subsystem uses that quite
> extensively. That creates a tracepoint without exposing it to tracefs. The
> runtime verifier uses these hooks to monitor the scheduler.
>
> But you can still connect to these tracepoints from tracefs via a tprobe. A
> tprobe hooks to tracepoints that you need the source code to find (just
> like a fprobe hooks to any function). Thus applications *can't* rely on
> them because there's nothing there to tell you it exists or not.
Thanks, that makes sense!
So, would it be fair to say that, in general, what's exposed through
/sys/kernel/tracing/events/
is stable ABI?
Would the following be sufficient to avoid a full revert and the dependency on CONFIG_RAS?
diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
index aa57cc8f896b..c46b17602578 100644
--- a/include/trace/events/memory-failure.h
+++ b/include/trace/events/memory-failure.h
@@ -1,6 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
-#define TRACE_SYSTEM memory_failure
+/* Some user space relies on ras/memory_failure_event */
+#define TRACE_SYSTEM ras
#define TRACE_INCLUDE_FILE memory-failure
#if !defined(_TRACE_MEMORY_FAILURE_H) || defined(TRACE_HEADER_MULTI_READ)
--
Cheers,
David
^ permalink raw reply related
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Steven Rostedt @ 2026-06-03 17:00 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Borislav Petkov, Zhuo, Qiuxu, mchehab+huawei@kernel.org,
Luck, Tony, akpm@linux-foundation.org, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <0c16bf3d-7c6d-4e28-b200-03b7d0ef714a@kernel.org>
On Wed, 3 Jun 2026 18:26:24 +0200
"David Hildenbrand (Arm)" <david@kernel.org> wrote:
> Yeah, I was fearing that when I read in [2]:
>
> "It has become clear in the past that this promise extends to
> tracepoints, most notably in 2011 when a tracepoint change broke
> powertop and had to be reverted."
Technically the issue is with trace events and not tracepoints. The
difference is that a trace event is created via the TRACE_EVENT() macro
which defines what is to be collected from the tracepoint and exposes that
information to tracefs which applications can easily see.
A tracepoint is simply the hook in the code that you can attach to. Trace
events create a callback from that hook to extract the data from the
tracepoint to fill in the fields.
>
> Which means that I now also fully understand
>
> "Some kernel maintainers prohibit or severely restrict the addition of
> tracepoints to their subsystems out of fear that a similar thing could
> happen to them. "
>
> Whatever the result of this discussion will be, I'll try to document it.
You can still create a tracepoint without creating a trace event by using
the DECLARE_TRACE() macro. The scheduler subsystem uses that quite
extensively. That creates a tracepoint without exposing it to tracefs. The
runtime verifier uses these hooks to monitor the scheduler.
But you can still connect to these tracepoints from tracefs via a tprobe. A
tprobe hooks to tracepoints that you need the source code to find (just
like a fprobe hooks to any function). Thus applications *can't* rely on
them because there's nothing there to tell you it exists or not.
For example, for the given tracepoint:
# cd /sys/kernel/tracing
# echo 't:rfail memory_failure_event pfn=pfn type=type result=result' > dynamic_events
# cat events/tracepoints/rfail/format
name: rfail
ID: 1894
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:unsigned long __probe_ip; offset:8; size:8; signed:0;
field:u64 pfn; offset:16; size:8; signed:0;
field:s32 type; offset:24; size:4; signed:1;
field:s32 result; offset:28; size:4; signed:1;
print fmt: "(%lx) pfn=%Lu type=%d result=%d", REC->__probe_ip, REC->pfn, REC->type, REC->result
It requires that BTF exists and the above doesn't annotate the result as
nicely. But you can get data directly from tracepoints this way.
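
To make concrete what a consumer sees in such a `format` file, here is a
minimal sketch that parses the field lines shown above into a list of
dictionaries. It is an illustration only, not how libtraceevent or any
existing tool does it; `parse_format_fields`, `FIELD_RE`, and `SAMPLE`
are names assumed for this example, and `SAMPLE` is copied from the
output above.

```python
import re

# One "field:<type> <name>; offset:N; size:N; signed:N;" entry.
# The type may contain spaces ("unsigned short"), so it is matched lazily
# and the name is the last token before the first semicolon.
FIELD_RE = re.compile(
    r"field:(?P<type>.+?)\s+(?P<name>[^\s;]+);\s*"
    r"offset:(?P<offset>\d+);\s*size:(?P<size>\d+);\s*signed:(?P<signed>[01]);"
)


def parse_format_fields(text):
    """Return the fields of a tracefs 'format' file in declaration order."""
    fields = []
    for line in text.splitlines():
        m = FIELD_RE.search(line)
        if m is None:
            continue  # "name:", "ID:", "format:", "print fmt:" lines
        fields.append({
            "type": m.group("type"),
            "name": m.group("name"),
            "offset": int(m.group("offset")),
            "size": int(m.group("size")),
            "signed": m.group("signed") == "1",
        })
    return fields


SAMPLE = """\
name: rfail
ID: 1894
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:unsigned long __probe_ip; offset:8; size:8; signed:0;
field:u64 pfn; offset:16; size:8; signed:0;
field:s32 type; offset:24; size:4; signed:1;
field:s32 result; offset:28; size:4; signed:1;
"""

if __name__ == "__main__":
    for f in parse_format_fields(SAMPLE):
        print(f["offset"], f["size"], f["type"], f["name"])
```

A tool that hard-codes these offsets or the `events/<system>/<event>`
path is exactly the kind of consumer that breaks when the event moves or
its layout changes.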
-- Steve
^ permalink raw reply
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: David Hildenbrand (Arm) @ 2026-06-03 16:26 UTC (permalink / raw)
To: Borislav Petkov, Steven Rostedt
Cc: Zhuo, Qiuxu, mchehab+huawei@kernel.org, Luck, Tony,
akpm@linux-foundation.org, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <20260603161947.GBaiBUI7C8WWPwD84S@fat_crate.local>
On 6/3/26 18:19, Borislav Petkov wrote:
> On Wed, Jun 03, 2026 at 12:17:07PM -0400, Steven Rostedt wrote:
>> On Wed, 3 Jun 2026 15:44:54 +0200
>> "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>>
>>> Likely the latter. BPF [1] documents:
>>>
>>> Q: Are tracepoints part of the stable ABI?
>>> A: NO. Tracepoints are tied to internal implementation details hence they are
>>> subject to change and can break with newer kernels. BPF programs need to change
>>> accordingly when this happens.
>>>
>>> The Kernel ABI document explicitly doesn't list them AFAIKS.
>>>
>>> There were previous discussions on the stability of tracepoints [2], I don't know
>>> what changed in the meantime. CCing Steve
>>>
>>> [1] https://www.kernel.org/doc/html/latest/bpf/bpf_design_QA.html
>>> [2] https://lwn.net/Articles/747256/
>>> [3] https://www.kernel.org/doc/html/latest/admin-guide/abi.html
>>
>> Tracepoints are not stable for BPF programs only. But for other applications
>> they are[1].
>>
>> Adding Linus as he's the Supreme Judge on the matter.
>
> I *think* tools or libtraceevent can't really anticipate the TP namespace
> change so we might have to revert, I'm afraid...
Yeah, I was fearing that when I read in [2]:
"It has become clear in the past that this promise extends to
tracepoints, most notably in 2011 when a tracepoint change broke
powertop and had to be reverted."
Which means that I now also fully understand
"Some kernel maintainers prohibit or severely restrict the addition of
tracepoints to their subsystems out of fear that a similar thing could
happen to them. "
Whatever the result of this discussion will be, I'll try to document it.
--
Cheers,
David
^ permalink raw reply
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Borislav Petkov @ 2026-06-03 16:19 UTC (permalink / raw)
To: Steven Rostedt
Cc: David Hildenbrand (Arm), Zhuo, Qiuxu, mchehab+huawei@kernel.org,
Luck, Tony, akpm@linux-foundation.org, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <20260603121707.7eccb9fb@gandalf.local.home>
On Wed, Jun 03, 2026 at 12:17:07PM -0400, Steven Rostedt wrote:
> On Wed, 3 Jun 2026 15:44:54 +0200
> "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>
> > Likely the latter. BPF [1] documents:
> >
> > Q: Are tracepoints part of the stable ABI?
> > A: NO. Tracepoints are tied to internal implementation details hence they are
> > subject to change and can break with newer kernels. BPF programs need to change
> > accordingly when this happens.
> >
> > The Kernel ABI document explicitly doesn't list them AFAIKS.
> >
> > There were previous discussions on the stability of tracepoints [2], I don't know
> > what changed in the meantime. CCing Steve
> >
> > [1] https://www.kernel.org/doc/html/latest/bpf/bpf_design_QA.html
> > [2] https://lwn.net/Articles/747256/
> > [3] https://www.kernel.org/doc/html/latest/admin-guide/abi.html
>
> Tracepoints are not stable for BPF programs only. But for other applications
> they are[1].
>
> Adding Linus as he's the Supreme Judge on the matter.
I *think* tools or libtraceevent can't really anticipate the TP namespace
change so we might have to revert, I'm afraid...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Steven Rostedt @ 2026-06-03 16:17 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Zhuo, Qiuxu, mchehab+huawei@kernel.org, Luck, Tony, bp@alien8.de,
akpm@linux-foundation.org, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <b930678d-a1ae-458c-8705-7ca9680d4cb6@kernel.org>
On Wed, 3 Jun 2026 15:44:54 +0200
"David Hildenbrand (Arm)" <david@kernel.org> wrote:
> Likely the latter. BPF [1] documents:
>
> Q: Are tracepoints part of the stable ABI?
> A: NO. Tracepoints are tied to internal implementation details hence they are
> subject to change and can break with newer kernels. BPF programs need to change
> accordingly when this happens.
>
> The Kernel ABI document explicitly doesn't list them AFAIKS.
>
> There were previous discussions on the stability of tracepoints [2], I don't know
> what changed in the meantime. CCing Steve
>
> [1] https://www.kernel.org/doc/html/latest/bpf/bpf_design_QA.html
> [2] https://lwn.net/Articles/747256/
> [3] https://www.kernel.org/doc/html/latest/admin-guide/abi.html
Tracepoints are not stable for BPF programs only. But for other applications
they are[1].
Adding Linus as he's the Supreme Judge on the matter.
-- Steve
[1] https://lwn.net/Articles/442113/
^ permalink raw reply