* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-04 8:36 UTC (permalink / raw)
To: Balbir Singh
Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <aiDVMgu0viTIml8H@parvat>
On Thu, Jun 04, 2026 at 11:43:14AM +1000, Balbir Singh wrote:
> On Wed, Jun 03, 2026 at 08:02:09AM +0100, Gregory Price wrote:
> >
> > Here is how the page allocator fallback lists and nodemasks interact:
> >
> > Fallbacks A: A B
> > Fallbacks B: B A
> > Fallbacks C: C A B (Private)
> > Fallbacks D: D B A (Private)
> >
>
> Do we want regular memory (N_MEMORY) in the fallback list of device private nodes?
> The assumption is that we have ATS translation enabled? Assuming A and
> B are N_MEMORY here, or am I misreading your illustration?
>
If we don't have __GFP_PRIVATE, then probably not. This is a holdover
from the current __GFP_PRIVATE branch so that if the preferred_nid=
value is a private node (which is a hint, but not a hard control),
there's a way for that allocation to land *somewhere*.
__GFP_PRIVATE would say "Only allow access to private nodes if this
flag is provided - otherwise treat that as unreachable and fall back".
(__GFP_PRIVATE | __GFP_THISNODE) then does exactly what you expect (only
allocate from specifically this private node and don't fall back).
This has the added benefit of not causing OOM on allocation failure.
Some would consider such a request a bug (i.e. that caller has a bad
mask), but I find the premise of that statement to be flawed if only
because we do not have good controls over what ends up in a nodemask due
to the existence of things like possible_nodes.
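As a rough illustration of the semantics described above, here is a minimal userspace sketch, not kernel code. The SK_GFP_* bits, the sk_* names and the node layout are hypothetical stand-ins; only the fallback orders (C: C A B, D: D B A) come from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the proposed gfp bits; not the kernel's values. */
#define SK_GFP_PRIVATE  (1u << 0)
#define SK_GFP_THISNODE (1u << 1)

enum { NODE_A, NODE_B, NODE_C, NODE_D, NR_NODES };

struct sk_node {
	bool is_private;
	unsigned long free_pages;
};

/* Fallback order from the thread; -1 terminates a shorter list. */
static const int sk_fallback[NR_NODES][3] = {
	[NODE_A] = { NODE_A, NODE_B, -1 },
	[NODE_B] = { NODE_B, NODE_A, -1 },
	[NODE_C] = { NODE_C, NODE_A, NODE_B },
	[NODE_D] = { NODE_D, NODE_B, NODE_A },
};

/* Returns the node the page came from, or -1 if nothing could be allocated. */
static int sk_alloc_page(struct sk_node *nodes, int preferred, unsigned int flags)
{
	assert(preferred >= 0 && preferred < NR_NODES);

	for (size_t i = 0; i < 3; i++) {
		int nid = sk_fallback[preferred][i];

		/* THISNODE: only the first (preferred) entry is considered. */
		if (nid < 0 || (i > 0 && (flags & SK_GFP_THISNODE)))
			break;
		/* Without the private flag, private nodes are unreachable. */
		if (nodes[nid].is_private && !(flags & SK_GFP_PRIVATE))
			continue;
		if (nodes[nid].free_pages > 0) {
			nodes[nid].free_pages--;
			return nid;
		}
	}
	return -1;
}
```

In this model a request that prefers a private node without the flag simply lands on the first regular node in that node's fallback list, and flag plus THISNODE fails cleanly instead of falling back.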
> > If we wanted to change this behavior, realistically we'd be looking for
> > a way to add specific nodes to certain fallback lists - rather than
> > modify the nodemask interaction in some way.
>
> Yes, that is what we did with CDM, control the fallback for
> N_MEMORY_PRIVATE, but there is a design decision to be made here.
>
Agreed, but also one which can be deferred and played with since it's
all kernel-internal. None of this should have UAPI implications, and we
need to accept that we're going to get it wrong on the first try.
> > 2) full mempolicy support doesn't really make sense
> >
> > task mempolicy PROBABLY should never really touch private nodes,
> > while VMA policy certainly can. Assuming we're able to support
> > multi-private-node masks, none of the non-bind mempolicies even
> > make sense for most private nodes (interleave? weighted interleave?)
> >
>
> Yes, mostly, but is that baked into the design? If so, why?
>
"Baked in" in this case would mean:
set_mempolicy(..., private_node) -> -EINVAL
mbind(..., private_node) -> Success
With appropriate documentation.
This can be changed later if a reasonable design is agreed upon.
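A minimal sketch of that rule as a standalone check, assuming nodemasks are plain bitmaps with one bit per node id. The sk_* names are hypothetical and this is not how the kernel's mempolicy code is structured.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical scope tags: task-wide policy vs. per-VMA policy. */
enum sk_policy_scope { SK_SCOPE_TASK, SK_SCOPE_VMA };

/*
 * Returns 0 if the nodemask is acceptable for the scope, -EINVAL otherwise.
 * 'nodes' is the requested mask, 'private_nodes' the set of private nodes.
 */
static int sk_check_policy_nodes(enum sk_policy_scope scope,
				 unsigned long nodes,
				 unsigned long private_nodes)
{
	assert(scope == SK_SCOPE_TASK || scope == SK_SCOPE_VMA);

	/* Task policy (set_mempolicy) may not name any private node. */
	if (scope == SK_SCOPE_TASK && (nodes & private_nodes))
		return -EINVAL;
	/* VMA policy (mbind) may. */
	return 0;
}
```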
> > 4) File VMA interactions don't entirely make sense with mbind
> >
> > In theory you might want:
> >
> > fd = open("somefile", ...);
> > mem = mmap(fd, ...);
> > mbind(mem, ..., private_node);
> > for page in mem:
> > mem[page_off] /* fault file into private memory */
> >
> > In reality: This does not work the way you want.
>
> Why not? Just curious about what you found?
>
Because pagecache pages are associated with potentially many VMAs.
The fault can be a soft fault or a hard fault. On soft fault - the page
was already present, and will simply fault into VMA without being
migrated.
You can imagine the following:
Process A:
    fd = open("somefile", ...);
    mem = mmap(fd, ...);
    mbind(mem, ..., private_node_A);
    for page in mem:
        mem[page_off] /* fault file into private memory */
Process B:
    fd = open("somefile", ...);
    mem = mmap(fd, ...);
    mbind(mem, ..., private_node_B);
    for page in mem:
        mem[page_off] /* fault file into private memory */
If process A runs first, and assuming VMA mempolicy is respected for
file backed allocation (note: it's not, see below) - then the second
process will think the memory now lives on node B when it's already
living on node A (pages are not migrated on fault).
filemap page cache means file-backed pages are global resources.
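To make the two-process scenario concrete, here is a toy model in userspace C. It assumes, purely for illustration, that VMA policy were honored on the hard fault; the sk_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NO_NODE (-1)

/* One cached file page; nid is where the folio was first allocated. */
struct sk_cached_page {
	int nid;
};

/*
 * Fault a file page in on behalf of a VMA bound to 'vma_nid'.
 * Hard fault: the page is not cached yet, so it is allocated on vma_nid.
 * Soft fault: the page is already cached and is mapped where it lives;
 * nothing migrates it. Returns the node the page actually resides on.
 */
static int sk_fault_file_page(struct sk_cached_page *page, int vma_nid)
{
	assert(page != NULL);

	if (page->nid == SK_NO_NODE)
		page->nid = vma_nid;
	return page->nid;
}
```

Whichever process faults first decides the placement; the second process gets the existing page regardless of its own mbind().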
Re file-backed VMAs - see filemap_alloc_folio_noprof in mm/filemap.c
struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
{
	int n;
	struct folio *folio;
	if (cpuset_do_page_mem_spread()) {
		unsigned int cpuset_mems_cookie;
		do {
			cpuset_mems_cookie = read_mems_allowed_begin();
			n = cpuset_mem_spread_node();
			folio = __folio_alloc_node_noprof(gfp, order, n);
		} while (!folio && read_mems_allowed_retry(cpuset_mems_cookie));
		return folio;
	}
	return folio_alloc_noprof(gfp, order);
}
We'd have to hang a mempolicy off of the file and use fcntl or something
like this if we want a file to have a node preference.
> >
> > I went digging and we need a few mild extensions to allow
> > migration on mbind to work for pagecache pages, and the fault
> > path does not necessarily respect the vma mempolicy always.
> >
> > You also start getting into the question of "what happens when
> > the node is out of memory and you don't have reclaim support?".
>
> Yes, we should discuss reclaim support, I think we should allow for
> reclaim. It allows you to overcommit private memory the way we can
> with regular memory.
>
Reclaim support is feasible, but again - crawl, walk, run.
If we get the base private node infrastructure in place, we can break
things like mempolicy and reclaim support into different work streams
to enable support for these features.
Different private node users will be interested in different
combinations of mm/ service support.
For example: compressed memory as a swap backend DOES NOT want explicit
reclaim support - it will need to manage its own shrinker. This comes
from requirements associated with that specific use case (which I do not
want to get into here).
That is why this series introduced the concept of NP_OPS_* - so that the
owner (driver) of a private node (such as a CXL-enabled accelerator
driver) can tell mm/ what services it should enable for that node.
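The opt-in idea can be pictured as a per-node service bitmask. This is only a sketch: the SK_NP_OPS_* bits and the struct below are hypothetical placeholders, not the actual NP_OPS_* definitions from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical service bits; the real NP_OPS_* names come from the series. */
#define SK_NP_OPS_MIGRATION (1u << 0)
#define SK_NP_OPS_MEMPOLICY (1u << 1)
#define SK_NP_OPS_RECLAIM   (1u << 2)

struct sk_private_node {
	int nid;
	unsigned int ops;	/* services the owning driver opted in to */
};

/* True only if every service bit in 'op' was enabled by the node's owner. */
static bool sk_node_allows(const struct sk_private_node *node, unsigned int op)
{
	assert(node != NULL);

	return (node->ops & op) == op;
}
```

A compressed-memory node that manages its own shrinker would, in this model, simply leave the reclaim bit clear.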
> >
> > For all these reasons, I think the mbind/mempolicy support with
> > private nodes needs to be brought in with follow up work - not
> > introduced as part of the baseline set.
> >
>
> I am not opposed to the follow up work, but I feel mbind() should
> be the fundamental work and user space API.
>
This is informed by a single use case / device.
There are users / devices that don't want any UAPI for their memory,
but simply wish to re-utilize some subsection of mm/ (page_alloc,
reclaim, etc).
> >
> > I am arguing for #1 - the community has argued for #2 and "fixing
> > existing nodemask users". I think we can ship #2 and pivot to #1 if we
> > find fixing existing users is infeasible or too much of a maintenance
> > burden.
>
> Again happy to discuss this, I'd like to make sure we agree on the
> design. I am wondering if there is any experimental data to choose
> between 1 and 2.
>
I am trying to test whether, lacking __GFP_PRIVATE, private nodes that
have been removed from fallback lists are still reached by any normal
runtime operation via something like the possible / online nodemask.
I remember, maybe a year ago, there were per-node allocations happening
during hotplug and that's why I originally proposed __GFP_PRIVATE, but
I'm trying to re-collect that data now.
~Gregory
* Re: [PATCHv7 bpf-next 13/29] bpf: Add support for tracing multi link
From: Jiri Olsa @ 2026-06-04 8:00 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, bpf, linux-trace-kernel, kafai, eddyz87,
songliubraving, yhs, menglong8.dong, rostedt, martin.lau,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <6e738b0697b9ee06e9cbaa325a3330f0447f4cd12ae54d6b2c82365d58c4c525@mail.kernel.org>
On Wed, Jun 03, 2026 at 11:52:13AM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index d853f97bd154..9e3cb547651e 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -42,6 +42,7 @@
> >
> > #define MAX_UPROBE_MULTI_CNT (1U << 20)
> > #define MAX_KPROBE_MULTI_CNT (1U << 20)
> > +#define MAX_TRACING_MULTI_CNT (1U << 20)
>
> [ ... ]
>
> > @@ -3641,3 +3642,132 @@ __bpf_kfunc int bpf_copy_from_user_task_str_dynptr(const struct bpf_dynptr *dptr
>
> [ ... ]
>
> > +static int check_dup_ids(u32 *ids, u32 cnt)
> > +{
> > + int err = 0;
> > +
> > + /*
> > + * Sort ids array (together with cookies array if defined)
> > + * and check it for duplicates. The ids and cookies arrays
> > + * are left sorted.
> > + */
> > + sort_r_nonatomic(ids, cnt, sizeof(ids[0]), ids_cmp_r, ids_swap_r, NULL);
>
> The sort here can run over up to MAX_TRACING_MULTI_CNT (1 << 20) elements
> without ever yielding the CPU. Can this hold the CPU long enough to trip
> RCU stalls or the soft lockup watchdog when cnt is large?
>
> This was raised on an earlier posting of the series, with the suggestion
> to add a cond_resched() somewhere in the path. The reply noted that in
> practice the count is expected to be closer to 100k rather than the full
> 1M limit, and that it was unclear where the cond_resched() should go, so
> the code was left as is. Since MAX_TRACING_MULTI_CNT is still 1 << 20,
> would it be worth revisiting whether a reschedule point is needed here?
we now use sort_r_nonatomic, which calls cond_resched, so
I don't think the loop below needs a cond_resched call
jirka
>
> > +
> > + for (int i = 1; i < cnt; i++) {
> > + if (ids[i] == ids[i - 1]) {
> > + err = -EINVAL;
> > + break;
> > + }
> > + }
> > + return err;
> > +}
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26881310426
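For readers following along, the sort-then-scan duplicate check quoted above can be sketched in userspace C. This is an approximation only: it uses qsort() in place of sort_r_nonatomic(), leaves out the cookies array that the kernel version keeps in step, and the sk_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way compare that avoids overflow from subtracting unsigned values. */
static int sk_cmp_u32(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a;
	uint32_t y = *(const uint32_t *)b;

	return (x > y) - (x < y);
}

/* Sorts ids in place; returns -EINVAL if any id appears twice, else 0. */
static int sk_check_dup_ids(uint32_t *ids, size_t cnt)
{
	assert(ids != NULL || cnt == 0);

	if (cnt < 2)
		return 0;

	qsort(ids, cnt, sizeof(ids[0]), sk_cmp_u32);

	/* After sorting, duplicates are adjacent, so one linear pass suffices. */
	for (size_t i = 1; i < cnt; i++) {
		if (ids[i] == ids[i - 1])
			return -EINVAL;
	}
	return 0;
}
```

The scan is a single O(n) pass over already-sorted data, so the sort dominates the runtime.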
* [PATCH v7 3/3] locking: Add contended_release tracepoint to sleepable locks
From: Dmitry Ilvokhin @ 2026-06-04 7:15 UTC (permalink / raw)
To: Peter Zijlstra, Dennis Zhou, Tejun Heo, Christoph Lameter,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long
Cc: linux-mm, linux-kernel, linux-trace-kernel, kernel-team,
Dmitry Ilvokhin, Paul E. McKenney
In-Reply-To: <cover.1780506267.git.d@ilvokhin.com>
Add the contended_release trace event. This tracepoint fires on the
holder side when a contended lock is released, complementing the
existing contention_begin/contention_end tracepoints which fire on the
waiter side.
This enables correlating lock hold time under contention with waiter
events by lock address.
Add trace_contended_release()/trace_call__contended_release() calls to
the slowpath unlock paths of sleepable locks: mutex, rtmutex, semaphore,
rwsem, percpu-rwsem, and RT-specific rwbase locks.
Where possible, trace_contended_release() fires before the lock is
released and before the waiter is woken. For some lock types, the
tracepoint fires after the release but before the wake. Making the
placement consistent across all lock types is not worth the added
complexity.
For reader/writer locks, the tracepoint fires for every reader releasing
while a writer is waiting, not only for the last reader.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
---
include/trace/events/lock.h | 17 +++++++++++++++++
kernel/locking/mutex.c | 4 ++++
kernel/locking/percpu-rwsem.c | 11 +++++++++++
kernel/locking/rtmutex.c | 1 +
kernel/locking/rwbase_rt.c | 6 ++++++
kernel/locking/rwsem.c | 10 ++++++++--
kernel/locking/semaphore.c | 4 ++++
7 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index da978f2afb45..1ded869cd619 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -137,6 +137,23 @@ TRACE_EVENT(contention_end,
TP_printk("%p (ret=%d)", __entry->lock_addr, __entry->ret)
);
+TRACE_EVENT(contended_release,
+
+ TP_PROTO(void *lock),
+
+ TP_ARGS(lock),
+
+ TP_STRUCT__entry(
+ __field(void *, lock_addr)
+ ),
+
+ TP_fast_assign(
+ __entry->lock_addr = lock;
+ ),
+
+ TP_printk("%p", __entry->lock_addr)
+);
+
#endif /* _TRACE_LOCK_H */
/* This part must be outside protection */
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 09534628dc01..43b7f7e281a0 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -1023,6 +1023,9 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
wake_q_add(&wake_q, next);
}
+ if (trace_contended_release_enabled() && waiter)
+ trace_call__contended_release(lock);
+
if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next);
@@ -1220,6 +1223,7 @@ EXPORT_SYMBOL(ww_mutex_lock_interruptible);
EXPORT_TRACEPOINT_SYMBOL_GPL(contention_begin);
EXPORT_TRACEPOINT_SYMBOL_GPL(contention_end);
+EXPORT_TRACEPOINT_SYMBOL_GPL(contended_release);
/**
* atomic_dec_and_mutex_lock - return holding mutex if we dec to 0
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index f3ee7a0d6047..f7e152c40d6d 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -263,6 +263,9 @@ void percpu_up_write(struct percpu_rw_semaphore *sem)
{
rwsem_release(&sem->dep_map, _RET_IP_);
+ if (trace_contended_release_enabled() && wq_has_sleeper(&sem->waiters))
+ trace_call__contended_release(sem);
+
/*
* Signal the writer is done, no fast path yet.
*
@@ -292,6 +295,14 @@ EXPORT_SYMBOL_GPL(percpu_up_write);
void __percpu_up_read(struct percpu_rw_semaphore *sem)
{
lockdep_assert_preemption_disabled();
+ /*
+ * After percpu_up_write() completes, rcu_sync_is_idle() can still
+ * return false during the grace period, forcing readers into this
+ * slowpath. Only trace when a writer is actually waiting for
+ * readers to drain.
+ */
+ if (trace_contended_release_enabled() && rcuwait_active(&sem->writer))
+ trace_call__contended_release(sem);
/*
* slowpath; reader will only ever wake a single blocked
* writer.
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 9147d6a31b78..28beae7d21fe 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1470,6 +1470,7 @@ static void __sched rt_mutex_slowunlock(struct rt_mutex_base *lock)
raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
+ trace_contended_release(lock);
/*
* The wakeup next waiter path does not suffer from the above
* race. See the comments there.
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index 82e078c0665a..2835c9ef9b3f 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -174,6 +174,8 @@ static void __sched __rwbase_read_unlock(struct rwbase_rt *rwb,
static __always_inline void rwbase_read_unlock(struct rwbase_rt *rwb,
unsigned int state)
{
+ if (trace_contended_release_enabled() && rt_mutex_owner(&rwb->rtmutex))
+ trace_call__contended_release(rwb);
/*
* rwb->readers can only hit 0 when a writer is waiting for the
* active readers to leave the critical section.
@@ -205,6 +207,8 @@ static inline void rwbase_write_unlock(struct rwbase_rt *rwb)
unsigned long flags;
raw_spin_lock_irqsave(&rtm->wait_lock, flags);
+ if (trace_contended_release_enabled() && rt_mutex_has_waiters(rtm))
+ trace_call__contended_release(rwb);
__rwbase_write_unlock(rwb, WRITER_BIAS, flags);
}
@@ -214,6 +218,8 @@ static inline void rwbase_write_downgrade(struct rwbase_rt *rwb)
unsigned long flags;
raw_spin_lock_irqsave(&rtm->wait_lock, flags);
+ if (trace_contended_release_enabled() && rt_mutex_has_waiters(rtm))
+ trace_call__contended_release(rwb);
/* Release it and account current as reader */
__rwbase_write_unlock(rwb, WRITER_BIAS - 1, flags);
}
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index bf647097369c..b9c180ac1eee 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1387,6 +1387,8 @@ static inline void __up_read(struct rw_semaphore *sem)
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
+ if (trace_contended_release_enabled() && (tmp & RWSEM_FLAG_WAITERS))
+ trace_call__contended_release(sem);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
RWSEM_FLAG_WAITERS)) {
clear_nonspinnable(sem);
@@ -1413,8 +1415,10 @@ static inline void __up_write(struct rw_semaphore *sem)
preempt_disable();
rwsem_clear_owner(sem);
tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
- if (unlikely(tmp & RWSEM_FLAG_WAITERS))
+ if (unlikely(tmp & RWSEM_FLAG_WAITERS)) {
+ trace_contended_release(sem);
rwsem_wake(sem);
+ }
preempt_enable();
}
@@ -1437,8 +1441,10 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
tmp = atomic_long_fetch_add_release(
-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
rwsem_set_reader_owned(sem);
- if (tmp & RWSEM_FLAG_WAITERS)
+ if (tmp & RWSEM_FLAG_WAITERS) {
+ trace_contended_release(sem);
rwsem_downgrade_wake(sem);
+ }
preempt_enable();
}
diff --git a/kernel/locking/semaphore.c b/kernel/locking/semaphore.c
index 74d41433ba13..233730c25933 100644
--- a/kernel/locking/semaphore.c
+++ b/kernel/locking/semaphore.c
@@ -230,6 +230,10 @@ void __sched up(struct semaphore *sem)
sem->count++;
else
__up(sem, &wake_q);
+
+ if (trace_contended_release_enabled() && !wake_q_empty(&wake_q))
+ trace_call__contended_release(sem);
+
raw_spin_unlock_irqrestore(&sem->lock, flags);
if (!wake_q_empty(&wake_q))
wake_up_q(&wake_q);
--
2.53.0-Meta
* [PATCH v7 2/3] locking/percpu-rwsem: Extract __percpu_up_read()
From: Dmitry Ilvokhin @ 2026-06-04 7:15 UTC (permalink / raw)
To: Peter Zijlstra, Dennis Zhou, Tejun Heo, Christoph Lameter,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long
Cc: linux-mm, linux-kernel, linux-trace-kernel, kernel-team,
Dmitry Ilvokhin, Usama Arif
In-Reply-To: <cover.1780506267.git.d@ilvokhin.com>
Move the percpu_up_read() slowpath out of the inline function into a new
__percpu_up_read() to avoid binary size increase from adding a
tracepoint to an inlined function.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/percpu-rwsem.h | 15 +++------------
kernel/locking/percpu-rwsem.c | 18 ++++++++++++++++++
2 files changed, 21 insertions(+), 12 deletions(-)
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index c8cb010d655e..39d5bf8e6562 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -107,6 +107,8 @@ static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
return ret;
}
+extern void __percpu_up_read(struct percpu_rw_semaphore *sem);
+
static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
{
rwsem_release(&sem->dep_map, _RET_IP_);
@@ -118,18 +120,7 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
if (likely(rcu_sync_is_idle(&sem->rss))) {
this_cpu_dec(*sem->read_count);
} else {
- /*
- * slowpath; reader will only ever wake a single blocked
- * writer.
- */
- smp_mb(); /* B matches C */
- /*
- * In other words, if they see our decrement (presumably to
- * aggregate zero, as that is the only time it matters) they
- * will also see our critical section.
- */
- this_cpu_dec(*sem->read_count);
- rcuwait_wake_up(&sem->writer);
+ __percpu_up_read(sem);
}
preempt_enable();
}
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index ef234469baac..f3ee7a0d6047 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -288,3 +288,21 @@ void percpu_up_write(struct percpu_rw_semaphore *sem)
rcu_sync_exit(&sem->rss);
}
EXPORT_SYMBOL_GPL(percpu_up_write);
+
+void __percpu_up_read(struct percpu_rw_semaphore *sem)
+{
+ lockdep_assert_preemption_disabled();
+ /*
+ * slowpath; reader will only ever wake a single blocked
+ * writer.
+ */
+ smp_mb(); /* B matches C */
+ /*
+ * In other words, if they see our decrement (presumably to
+ * aggregate zero, as that is the only time it matters) they
+ * will also see our critical section.
+ */
+ this_cpu_dec(*sem->read_count);
+ rcuwait_wake_up(&sem->writer);
+}
+EXPORT_SYMBOL_GPL(__percpu_up_read);
--
2.53.0-Meta
* [PATCH v7 0/3] locking: contended_release tracepoint instrumentation
From: Dmitry Ilvokhin @ 2026-06-04 7:15 UTC (permalink / raw)
To: Peter Zijlstra, Dennis Zhou, Tejun Heo, Christoph Lameter,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long
Cc: linux-mm, linux-kernel, linux-trace-kernel, kernel-team,
Dmitry Ilvokhin
The existing contention_begin/contention_end tracepoints fire on the
waiter side. The lock holder's identity and stack can be captured at
contention_begin time (e.g. perf lock contention --lock-owner), but only
for locks whose owner it reads: mutex and rwsem. Other lock types have
none it can read, so they get no holder-side attribution today.
This series adds a contended_release tracepoint that fires on the
holder side when a lock with waiters is released. This provides:
- Hold time estimation: when the holder's own acquisition was
contended, its contention_end (acquisition) and contended_release
can be correlated to measure how long the lock was held under
contention.
- The holder's stack at release time, which may differ from what perf lock
contention --lock-owner captures if the holder does significant work between
the waiter's arrival and the unlock.
Note: for reader/writer locks, the tracepoint fires for every reader
releasing while a writer is waiting, not only for the last reader.
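As a sketch of the hold time estimation described above, a consumer could pair events roughly like this. The record layout and sk_* names are hypothetical (a real consumer would decode them from the trace stream), and the matching is deliberately naive: it ignores the ret field of contention_end and assumes the holder's most recent contention_end on that lock belongs to the current acquisition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sk_event_type { SK_CONTENTION_END, SK_CONTENDED_RELEASE };

/* Hypothetical decoded trace record; field names are illustrative. */
struct sk_lock_event {
	enum sk_event_type type;
	int tid;
	uintptr_t lock_addr;
	uint64_t ts_ns;
};

/*
 * 'ev' is ordered by time and ev[rel] is a contended_release. Walk backwards
 * to the latest contention_end by the same thread on the same lock address
 * and return the elapsed time. Returns 0 when no match exists, i.e. the
 * holder's own acquisition was not contended.
 */
static uint64_t sk_hold_time_ns(const struct sk_lock_event *ev, size_t rel)
{
	assert(ev[rel].type == SK_CONTENDED_RELEASE);

	for (size_t i = rel; i-- > 0; ) {
		if (ev[i].type == SK_CONTENTION_END &&
		    ev[i].tid == ev[rel].tid &&
		    ev[i].lock_addr == ev[rel].lock_addr)
			return ev[rel].ts_ns - ev[i].ts_ns;
	}
	return 0;
}
```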
v6 -> v7:
- Dropped spinlocks instrumentation patches, I'll rework them based on Peter's
static call idea. See [2].
v5 -> v6:
- Use trace_call__contended_release() instead of trace_contended_release(),
where appropriate to avoid a redundant static branch check when the caller
already guards with trace_contended_release_enabled().
- Added acked-by from Paul.
- Rebase on top of the fresh locking/core.
v4 -> v5:
- Split the combined spinning locks patch into separate qspinlock and
qrwlock patches (Paul E. McKenney).
- Factor out __queued_read_unlock()/__queued_write_unlock() as a
separate preparatory commit, mirroring the queued_spin_release()
split (Paul E. McKenney).
- Updated binary size numbers for qspinlock-only change.
- Added Acked-by and Reviewed-by tags where appropriate.
v3 -> v4:
- Fix spurious events in __percpu_up_read(): guard with
rcuwait_active(&sem->writer) to avoid tracing during the RCU grace
period after a writer releases (Sashiko).
- Fix possible use-after-free in semaphore up(): move
trace_contended_release() inside the sem->lock critical section
(Sashiko).
- Fix build failure with CONFIG_PARAVIRT_SPINLOCKS=y: introduce
queued_spin_release() as the arch-overridable unlock primitive,
so queued_spin_unlock() can be a generic tracing wrapper. Convert
x86 (paravirt) and MIPS overrides (Sashiko).
- Add EXPORT_TRACEPOINT_SYMBOL_GPL(contended_release) for module
support (Sashiko).
- Split spinning locks patch: factor out queued_spin_release() as a
separate preparatory commit (Sashiko).
- Make read unlock tracepoint behavior consistent across all
reader/writer lock types: fire for every reader releasing while
a writer is waiting (rwsem, rwbase_rt were previously last-reader
only).
v2 -> v3:
- Added new patch: extend contended_release tracepoint to queued spinlocks
and queued rwlocks (marked as RFC, requesting feedback). This is prompted by
Matthew Wilcox's suggestion to try to come up with generic instrumentation,
instead of instrumenting each "special" lock manually. See [1] for the
discussion.
- Reworked tracepoint placement to fire before the lock is released and
before the waiter is woken where possible, for consistency with
spinning locks where there is no explicit wake (inspired by Usama Arif's
suggestion).
- Remove unnecessary linux/sched.h include from trace/events/lock.h.
RFC -> v2:
- Add trace_contended_release_enabled() guard before waiter checks that
exist only for the tracepoint (Steven Rostedt).
- Rename __percpu_up_read_slowpath() to __percpu_up_read() (Peter
Zijlstra).
- Add extern for __percpu_up_read() (Peter Zijlstra).
- Squashed tracepoint introduction and usage commits (Masami Hiramatsu).
v6: https://lore.kernel.org/all/cover.1777999826.git.d@ilvokhin.com/
v5: https://lore.kernel.org/all/cover.1776350944.git.d@ilvokhin.com/
v4: https://lore.kernel.org/all/cover.1774536681.git.d@ilvokhin.com/
v3: https://lore.kernel.org/all/cover.1773858853.git.d@ilvokhin.com/
v2: https://lore.kernel.org/all/cover.1773164180.git.d@ilvokhin.com/
RFC: https://lore.kernel.org/all/cover.1772642407.git.d@ilvokhin.com/
[1]: https://lore.kernel.org/all/aa7G1nD7Rd9F4eBH@casper.infradead.org/
[2]: https://lore.kernel.org/all/20260603120811.GW3493090@noisy.programming.kicks-ass.net/
Dmitry Ilvokhin (3):
tracing/lock: Remove unnecessary linux/sched.h include
locking/percpu-rwsem: Extract __percpu_up_read()
locking: Add contended_release tracepoint to sleepable locks
include/linux/percpu-rwsem.h | 15 +++------------
include/trace/events/lock.h | 18 +++++++++++++++++-
kernel/locking/mutex.c | 4 ++++
kernel/locking/percpu-rwsem.c | 29 +++++++++++++++++++++++++++++
kernel/locking/rtmutex.c | 1 +
kernel/locking/rwbase_rt.c | 6 ++++++
kernel/locking/rwsem.c | 10 ++++++++--
kernel/locking/semaphore.c | 4 ++++
8 files changed, 72 insertions(+), 15 deletions(-)
--
2.53.0-Meta
* [PATCH v7 1/3] tracing/lock: Remove unnecessary linux/sched.h include
From: Dmitry Ilvokhin @ 2026-06-04 7:15 UTC (permalink / raw)
To: Peter Zijlstra, Dennis Zhou, Tejun Heo, Christoph Lameter,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long
Cc: linux-mm, linux-kernel, linux-trace-kernel, kernel-team,
Dmitry Ilvokhin, Usama Arif
In-Reply-To: <cover.1780506267.git.d@ilvokhin.com>
None of the trace events in lock.h reference anything from
linux/sched.h. Remove the unnecessary include.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Usama Arif <usama.arif@linux.dev>
---
include/trace/events/lock.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index 8e89baa3775f..da978f2afb45 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -5,7 +5,6 @@
#if !defined(_TRACE_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_LOCK_H
-#include <linux/sched.h>
#include <linux/tracepoint.h>
/* flags for lock:contention_begin */
--
2.53.0-Meta
* Re: [PATCHv4 00/13] uprobes/x86: Fix red zone issue for optimized uprobes
From: Jiri Olsa @ 2026-06-04 6:59 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko
Cc: bpf, linux-trace-kernel
In-Reply-To: <20260526205840.173790-1-jolsa@kernel.org>
On Tue, May 26, 2026 at 10:58:27PM +0200, Jiri Olsa wrote:
> hi,
> Andrii reported an issue with optimized uprobes [1] that can clobber
> redzone area with call instruction storing return address on stack
> where user code may keep temporary data without adjusting rsp.
>
> Fixing this by moving the optimized uprobes on top of 10-bytes nop
> instruction, so we can squeeze another instruction to escape the
> redzone area before doing the call.
>
> Note we need upstream update first for patch 3 (github.com/libbpf/usdt),
> if we decide to take this change.
>
> thanks,
> jirka
>
>
> v1: https://lore.kernel.org/bpf/20260514135342.22130-1-jolsa@kernel.org/
> v2: https://lore.kernel.org/bpf/20260518105957.123445-1-jolsa@kernel.org/
> v3: https://lore.kernel.org/bpf/20260521124411.31133-1-jolsa@kernel.org/
>
> v4 changes:
> - do not use 2nd int3 (at +5 offset) because the call instruction
> is always the same for the given nop10 address [Andrii/Peter]
> - unmap unused trampoline vma after unsuccessful optimization [sashiko]
> - small change to patch#2 moved user_64bit_mode earlier in the path
> and pass/use mm_struct pointer directly from arch_uprobe_optimize
> instead of getting current->mm
> Andrii, keeping your ack, please shout otherwise
hi,
I think bots did not find anything substantial, I have just small
selftests changes queued for v5
any other feedback/review would be great
thanks,
jirka
>
> v3 changes:
> - use nop10 update suggested by Peter in [2]
> - remove struct uprobe_trampoline object, use vma objects directly instead
> - selftests fixes [sashiko]
> - ack from Andrii
>
> v2 changes:
> - several selftest fixes [sashiko]
> - consolidate is_lea_insn and is_call_insn into a single check [Jakub Sitnicki]
> - use proper mm_struct object in __in_uprobe_trampoline check [sashiko]
> - allow to copy uprobe trampolines vma objects on fork [sashiko]
> - change uprobe syscall detection error from -ENXIO to -EPROTO [Andrii]
> - added fork/clone tests
> - I kept the selftest changes and nop5->nop10 changes in separate
> commits for easier review, we can squash them later if we want to keep
> bisect working properly
>
>
> [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> [2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
> ---
> Andrii Nakryiko (1):
> selftests/bpf: Add tests for uprobe nop10 red zone clobbering
>
> Jiri Olsa (12):
> uprobes/x86: Use proper mm_struct in __in_uprobe_trampoline
> uprobes/x86: Remove struct uprobe_trampoline object
> uprobes/x86: Allow to copy uprobe trampolines on fork
> uprobes/x86: Unmap trampoline vma object in case it's unused
> uprobes/x86: Move optimized uprobe from nop5 to nop10
> libbpf: Change has_nop_combo to work on top of nop10
> libbpf: Detect uprobe syscall with new error
> selftests/bpf: Emit nop,nop10 instructions combo for x86_64 arch
> selftests/bpf: Change uprobe syscall tests to use nop10
> selftests/bpf: Change uprobe/usdt trigger bench code to use nop10
> selftests/bpf: Add reattach tests for uprobe syscall
> selftests/bpf: Add tests for forked/cloned optimized uprobes
>
> arch/x86/kernel/uprobes.c | 379 +++++++++++++++++++++++++++++++++++++++++++-----------------------------
> include/linux/uprobes.h | 5 -
> kernel/events/uprobes.c | 10 --
> kernel/fork.c | 1 -
> tools/lib/bpf/features.c | 4 +-
> tools/lib/bpf/usdt.c | 16 +--
> tools/testing/selftests/bpf/bench.c | 20 ++--
> tools/testing/selftests/bpf/benchs/bench_trigger.c | 38 ++++----
> tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh | 2 +-
> tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 307 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> tools/testing/selftests/bpf/prog_tests/usdt.c | 74 ++++++++++++--
> tools/testing/selftests/bpf/progs/test_usdt.c | 25 +++++
> tools/testing/selftests/bpf/usdt.h | 2 +-
> tools/testing/selftests/bpf/usdt_2.c | 15 ++-
> 14 files changed, 653 insertions(+), 245 deletions(-)
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: David Hildenbrand (Arm) @ 2026-06-04 6:42 UTC (permalink / raw)
To: Xie Yuanbin, qiuxu.zhuo, bp, akpm, rostedt, linmiaohe
Cc: linux-edac, linux-kernel, linux-mm, linux-trace-kernel,
mchehab+huawei, tony.luck, torvalds, yi1.lai
In-Reply-To: <20260604014629.3144-1-xieyuanbin1@huawei.com>
On 6/4/26 03:46, Xie Yuanbin wrote:
> On Wed, 3 Jun 2026 21:13:30 +0200, David Hildenbrand (Arm) wrote:
>> Would the following be sufficient to avoid a full revert and the dependency on CONFIG_RAS?
>>
>> diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
>> index aa57cc8f896b..c46b17602578 100644
>> --- a/include/trace/events/memory-failure.h
>> +++ b/include/trace/events/memory-failure.h
>> @@ -1,6 +1,7 @@
>> /* SPDX-License-Identifier: GPL-2.0 */
>> #undef TRACE_SYSTEM
>> -#define TRACE_SYSTEM memory_failure
>> +/* Some user space relies on ras/memory_failure_event */
>> +#define TRACE_SYSTEM ras
>> #define TRACE_INCLUDE_FILE memory-failure
>>
>> #if !defined(_TRACE_MEMORY_FAILURE_H) || defined(TRACE_HEADER_MULTI_READ)
>
> Yes, it should be. In fact, when I sent the V2 patch, I had already
> considered this issue, and that's exactly what I did:
> Link: https://lore.kernel.org/20251104072306.100738-3-xieyuanbin1@huawei.com
>
> However, David Hildenbrand advised me at that time to completely
> remove the dependence on RAS:
> Link: https://lore.kernel.org/01b44e0f-ea2e-406f-9f65-b698b5504f42@kernel.org
Yeah, if only I had known that we would break user space by changing trace
events ... now we know :)
Do you have capacity to send a fix?
--
Cheers,
David
^ permalink raw reply
* [RFC PATCH 3/3] mm/compaction: respect compact_unevictable_allowed in alloc_contig path
From: Wandun Chen @ 2026-06-04 2:38 UTC (permalink / raw)
To: linux-mm, linux-kernel, linux-trace-kernel, linux-rt-devel
Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, rostedt,
mhiramat, mathieu.desnoyers, david, ljs, liam, rppt, bigeasy,
clrkwllms, Alexander.Krabler
In-Reply-To: <20260604023812.3700316-1-chenwandun1@gmail.com>
From: Wandun Chen <chenwandun@lixiang.com>
vm.compact_unevictable_allowed=0 is used to prevent compacting
unevictable pages. However, isolate_migratepages_range() passes
ISOLATE_UNEVICTABLE regardless of this sysctl, so the setting
has no effect in the alloc_contig path.
Fix it by:
- Keeping ISOLATE_UNEVICTABLE for CMA allocation, as discussed in [1].
- Honouring sysctl_compact_unevictable_allowed for non-CMA allocation.
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Wandun Chen <chenwandun@lixiang.com>
Link: https://lore.kernel.org/all/25ba0d77-eb61-4efc-b2fc-73878cbd85c1@suse.cz/ [1]
---
include/linux/compaction.h | 6 ++++++
mm/compaction.c | 9 +++++++--
mm/internal.h | 1 +
mm/page_alloc.c | 2 ++
4 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index f29ef0653546..04e60f65b976 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -106,6 +106,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
extern void __meminit kcompactd_run(int nid);
extern void __meminit kcompactd_stop(int nid);
extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int highest_zoneidx);
+extern bool compaction_allow_unevictable(void);
#else
static inline void reset_isolation_suitable(pg_data_t *pgdat)
@@ -131,6 +132,11 @@ static inline void wakeup_kcompactd(pg_data_t *pgdat,
{
}
+static inline bool compaction_allow_unevictable(void)
+{
+ return true;
+}
+
#endif /* CONFIG_COMPACTION */
struct node;
diff --git a/mm/compaction.c b/mm/compaction.c
index 007d5e00a8ae..a10acb273454 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1341,6 +1341,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
unsigned long end_pfn)
{
unsigned long pfn, block_start_pfn, block_end_pfn;
+ isolate_mode_t mode = cc->allow_unevictable ? ISOLATE_UNEVICTABLE : 0;
int ret = 0;
/* Scan block by block. First and last block may be incomplete */
@@ -1360,8 +1361,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
block_end_pfn, cc->zone))
continue;
- ret = isolate_migratepages_block(cc, pfn, block_end_pfn,
- ISOLATE_UNEVICTABLE);
+ ret = isolate_migratepages_block(cc, pfn, block_end_pfn, mode);
if (ret)
break;
@@ -1902,6 +1902,11 @@ typedef enum {
* compactable pages.
*/
static int sysctl_compact_unevictable_allowed __read_mostly = CONFIG_COMPACT_UNEVICTABLE_DEFAULT;
+
+bool compaction_allow_unevictable(void)
+{
+ return sysctl_compact_unevictable_allowed;
+}
/*
* Tunable for proactive compaction. It determines how
* aggressively the kernel should compact memory in the
diff --git a/mm/internal.h b/mm/internal.h
index 181e79f1d6a2..163f9d6b37f3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1052,6 +1052,7 @@ struct compact_control {
* ensure forward progress.
*/
bool alloc_contig; /* alloc_contig_range allocation */
+ bool allow_unevictable; /* Allow isolation of unevictable folios */
};
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 81a9d4d1e6c0..1cf9d4a3b14c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7118,6 +7118,8 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
.ignore_skip_hint = true,
.no_set_skip_hint = true,
.alloc_contig = true,
+ .allow_unevictable = !!(alloc_flags & ACR_FLAGS_CMA) ||
+ compaction_allow_unevictable(),
};
INIT_LIST_HEAD(&cc.migratepages);
enum pb_isolate_mode mode = (alloc_flags & ACR_FLAGS_CMA) ?
--
2.43.0
^ permalink raw reply related
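The decision that patch 3/3 adds to the alloc_contig path can be modelled in userspace as a small truth table. This is only a sketch: ACR_FLAGS_CMA_MODEL and the function name are hypothetical stand-ins, and the sysctl value is passed in as a parameter rather than read from the kernel.

```c
#include <stdbool.h>

/* Hypothetical stand-in for ACR_FLAGS_CMA; the value is illustrative only. */
#define ACR_FLAGS_CMA_MODEL 0x1u

/*
 * Model of the .allow_unevictable initialiser in the patch:
 * CMA allocations may always isolate unevictable folios, while
 * non-CMA allocations follow vm.compact_unevictable_allowed.
 */
static bool model_allow_unevictable(unsigned int alloc_flags,
				    bool sysctl_allowed)
{
	return (alloc_flags & ACR_FLAGS_CMA_MODEL) || sysctl_allowed;
}
```

With the sysctl set to 0, only the CMA case still returns true, which matches the two bullet points in the commit message.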
* [RFC PATCH 2/3] mm/compaction: add per-folio isolation tracepoint
From: Wandun Chen @ 2026-06-04 2:38 UTC (permalink / raw)
To: linux-mm, linux-kernel, linux-trace-kernel, linux-rt-devel
Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, rostedt,
mhiramat, mathieu.desnoyers, david, ljs, liam, rppt, bigeasy,
clrkwllms, Alexander.Krabler
In-Reply-To: <20260604023812.3700316-1-chenwandun1@gmail.com>
From: Wandun Chen <chenwandun@lixiang.com>
Add a tracepoint that fires once per folio successfully isolated by
isolate_migratepages_block(), recording the pfn, isolation mode and
the folio flags. Knowing these makes it easier to debug unexpected
isolation, such as mlocked or unevictable folios showing up on
PREEMPT_RT kernels [1].
Inspired-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Wandun Chen <chenwandun@lixiang.com>
Link: https://lore.kernel.org/all/20250820142919.HTybedrl@linutronix.de/ [1]
---
include/trace/events/compaction.h | 26 ++++++++++++++++++++++++++
mm/compaction.c | 2 ++
2 files changed, 28 insertions(+)
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index d05759d18538..8b8b3ec0f324 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -76,6 +76,32 @@ DEFINE_EVENT(mm_compaction_isolate_template, mm_compaction_fast_isolate_freepage
);
#ifdef CONFIG_COMPACTION
+TRACE_EVENT(mm_compaction_isolate_folio,
+
+ TP_PROTO(unsigned long pfn,
+ isolate_mode_t mode,
+ unsigned long flags),
+
+ TP_ARGS(pfn, mode, flags),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, pfn)
+ __field(isolate_mode_t, mode)
+ __field(unsigned long, flags)
+ ),
+
+ TP_fast_assign(
+ __entry->pfn = pfn;
+ __entry->mode = mode;
+ __entry->flags = flags;
+ ),
+
+ TP_printk("pfn=0x%lx mode=0x%x flags=%s",
+ __entry->pfn,
+ __entry->mode,
+ show_page_flags(__entry->flags & PAGEFLAGS_MASK))
+);
+
TRACE_EVENT(mm_compaction_migratepages,
TP_PROTO(unsigned int nr_migratepages,
diff --git a/mm/compaction.c b/mm/compaction.c
index 7e07b792bcb5..007d5e00a8ae 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1223,6 +1223,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
isolate_success:
list_add(&folio->lru, &cc->migratepages);
isolate_success_no_list:
+ trace_mm_compaction_isolate_folio(folio_pfn(folio), mode,
+ folio->flags.f);
cc->nr_migratepages += folio_nr_pages(folio);
nr_isolated += folio_nr_pages(folio);
nr_scanned += folio_nr_pages(folio) - 1;
--
2.43.0
^ permalink raw reply related
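For tooling that consumes the new event, the TP_printk() format in patch 2/3 ("pfn=0x%lx mode=0x%x flags=%s") can be parsed back with sscanf(). The sketch below assumes the caller has already stripped the usual trace line prefix (task, CPU, timestamp, event name); the struct and function names are hypothetical, and the 64-byte flags buffer is an arbitrary choice.

```c
#include <stdio.h>

/* Hypothetical container for one parsed mm_compaction_isolate_folio event. */
struct isolate_folio_event {
	unsigned long pfn;
	unsigned int mode;
	char flags[64];	/* textual flag names as printed by show_page_flags() */
};

/*
 * Parse the payload part of the event, e.g. "pfn=0x... mode=0x... flags=...".
 * Returns 0 on success and -1 if the text does not match the format.
 */
static int parse_isolate_folio(const char *text, struct isolate_folio_event *ev)
{
	int n = sscanf(text, "pfn=0x%lx mode=0x%x flags=%63s",
		       &ev->pfn, &ev->mode, ev->flags);

	return n == 3 ? 0 : -1;
}
```

A caller would pass the text that follows the event name in the trace output and then check, for example, whether the flags string contains "mlocked".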
* [RFC PATCH 1/3] mm/compaction: skip isolate mlocked folios when compact_unevictable_allowed=0
From: Wandun Chen @ 2026-06-04 2:38 UTC (permalink / raw)
To: linux-mm, linux-kernel, linux-trace-kernel, linux-rt-devel
Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, rostedt,
mhiramat, mathieu.desnoyers, david, ljs, liam, rppt, bigeasy,
clrkwllms, Alexander.Krabler
In-Reply-To: <20260604023812.3700316-1-chenwandun1@gmail.com>
From: Wandun Chen <chenwandun@lixiang.com>
compact_unevictable_allowed defaults to 0 under PREEMPT_RT, so
isolate_migratepages_block() skips folios with PG_unevictable set.
However, mlock_folio() sets PG_mlocked immediately but defers
PG_unevictable to mlock_folio_batch(), resulting in a folio with
PG_mlocked=1 but PG_unevictable=0. Compaction will isolate such a
folio.
Fix by checking folio_test_mlocked() together with the existing
folio_test_unevictable() check.
A similar issue has been reported by Alexander Krabler on a 6.12-rt
aarch64 system. Vlastimil suggested to check the mlocked flag [1].
Reported-by: Alexander Krabler <Alexander.Krabler@kuka.com>
Closes: https://lore.kernel.org/all/DU0PR01MB10385345F7153F334100981888259A@DU0PR01MB10385.eurprd01.prod.exchangelabs.com/
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Wandun Chen <chenwandun@lixiang.com>
Link: https://lore.kernel.org/all/33275585-f2db-4779-89f0-3ae24b455a67@suse.cz/ [1]
---
mm/compaction.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index b776f35ad020..7e07b792bcb5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1116,7 +1116,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
is_unevictable = folio_test_unevictable(folio);
/* Compaction might skip unevictable pages but CMA takes them */
- if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
+ if (!(mode & ISOLATE_UNEVICTABLE) &&
+ (is_unevictable || folio_test_mlocked(folio)))
goto isolate_fail_put;
/*
--
2.43.0
^ permalink raw reply related
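The window that patch 1/3 closes can be illustrated with a userspace model of the skip check before and after the change. This is a sketch under stated assumptions: struct folio_model and both helper names are hypothetical, and the ISOLATE_UNEVICTABLE value here is illustrative rather than taken from the kernel headers.

```c
#include <stdbool.h>

/* Illustrative flag value; the real definition lives in the kernel headers. */
#define ISOLATE_UNEVICTABLE 0x8u

/* Hypothetical model of the two folio flags involved in the race. */
struct folio_model {
	bool mlocked;		/* models PG_mlocked, set immediately by mlock_folio() */
	bool unevictable;	/* models PG_unevictable, set later by mlock_folio_batch() */
};

/* Check before the patch: only PG_unevictable is consulted. */
static bool skip_before(unsigned int mode, const struct folio_model *f)
{
	return !(mode & ISOLATE_UNEVICTABLE) && f->unevictable;
}

/* Check after the patch: PG_mlocked alone is enough to skip the folio. */
static bool skip_after(unsigned int mode, const struct folio_model *f)
{
	return !(mode & ISOLATE_UNEVICTABLE) &&
	       (f->unevictable || f->mlocked);
}
```

For a folio caught between the two steps (mlocked set, unevictable clear), skip_before() returns false and the folio gets isolated, while skip_after() returns true. When the caller passes ISOLATE_UNEVICTABLE, as CMA does, neither check skips.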
* [RFC PATCH 0/3] mm/compaction: honour compact_unevictable_allowed in mlock race and alloc_contig path
From: Wandun Chen @ 2026-06-04 2:38 UTC (permalink / raw)
To: linux-mm, linux-kernel, linux-trace-kernel, linux-rt-devel
Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, rostedt,
mhiramat, mathieu.desnoyers, david, ljs, liam, rppt, bigeasy,
clrkwllms, Alexander.Krabler
From: Wandun Chen <chenwandun@lixiang.com>
vm.compact_unevictable_allowed=0 is meant to keep compaction from
touching unevictable folios. In practice there are still two paths
where it does not take effect. This series fixes them and adds a
tracepoint to make such issues easier to diagnose in the future.
Wandun Chen (3):
mm/compaction: skip isolate mlocked folios when
compact_unevictable_allowed=0
mm/compaction: add per-folio isolation tracepoint
mm/compaction: respect compact_unevictable_allowed in alloc_contig
path
include/linux/compaction.h | 6 ++++++
include/trace/events/compaction.h | 26 ++++++++++++++++++++++++++
mm/compaction.c | 14 +++++++++++---
mm/internal.h | 1 +
mm/page_alloc.c | 2 ++
5 files changed, 46 insertions(+), 3 deletions(-)
--
2.43.0
^ permalink raw reply
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Xie Yuanbin @ 2026-06-04 1:46 UTC (permalink / raw)
To: david, qiuxu.zhuo, bp, akpm, rostedt, linmiaohe
Cc: linux-edac, linux-kernel, linux-mm, linux-trace-kernel,
mchehab+huawei, tony.luck, torvalds, xieyuanbin1, yi1.lai
In-Reply-To: <b637ede2-73da-49f0-a7eb-70ec79e79624@kernel.org>
On Wed, 3 Jun 2026 21:13:30 +0200, David Hildenbrand (Arm) wrote:
> Would the following be sufficient to avoid a full revert and the dependency on CONFIG_RAS?
>
> diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
> index aa57cc8f896b..c46b17602578 100644
> --- a/include/trace/events/memory-failure.h
> +++ b/include/trace/events/memory-failure.h
> @@ -1,6 +1,7 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #undef TRACE_SYSTEM
> -#define TRACE_SYSTEM memory_failure
> +/* Some user space relies on ras/memory_failure_event */
> +#define TRACE_SYSTEM ras
> #define TRACE_INCLUDE_FILE memory-failure
>
> #if !defined(_TRACE_MEMORY_FAILURE_H) || defined(TRACE_HEADER_MULTI_READ)
Yes, it should be. In fact, when I sent the V2 patch, I had already
considered this issue, and that's exactly what I did:
Link: https://lore.kernel.org/20251104072306.100738-3-xieyuanbin1@huawei.com
However, David Hildenbrand advised me at that time to completely
remove the dependence on RAS:
Link: https://lore.kernel.org/01b44e0f-ea2e-406f-9f65-b698b5504f42@kernel.org
^ permalink raw reply
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-06-04 1:43 UTC (permalink / raw)
To: Gregory Price
Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <ah_RcTU8SpQG7hab@gourry-fedora-PF4VCD3F>
On Wed, Jun 03, 2026 at 08:02:09AM +0100, Gregory Price wrote:
> On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > On Tue, Jun 02, 2026 at 09:57:48AM +0100, Gregory Price wrote:
> > > On Tue, Jun 02, 2026 at 12:16:50PM +1000, Balbir Singh wrote:
> > > >
> > > > I was thinking we wouldn't need explicit flags and that allocations would
> > > > happen from user space using __GFP_THISNODE to the node or via a nodemask
> > > > based on nodes of interest. Is there a reason to add this flag, a system
> > > > might have more than one source of N_MEMORY_PRIVATE?
> > > >
> > >
> > > There's a few things to unpack here. I discussed this many times on
> > > list and at LSF, but to reiterate.
> > >
> > > 1) __GFP_THISNODE is insufficient to enforce isolation and otherwise
> > > not particularly useful. Additionally, from userland, it's not
> > > something you can actually set.
> >
> > I was thinking mbind()/mempolicy() is how we get to it. It already
> > accepts a nodemask.
> >
>
> First let me say: I want to enable mbind access to these nodes.
>
> But let me caveat: I think that needs more time to develop, and
> in the meantime, we can enable the /dev/xxx pattern somewhat trivially.
>
> First let me address a few things about mbind/mempolicy and how it
> interacts with page_alloc.c, I gave this overview at LSF but I don't
> remember if I posted it in any of my follow ups.
>
>
> 1) Fallback lists are filtered by nodemask, the nodemask does not replace
> the fallback list.
>
> Here is how the page allocator fallback lists and nodemasks interact:
>
> Fallbacks A: A B
> Fallbacks B: B A
> Fallbacks C: C A B (Private)
> Fallbacks D: D B A (Private)
>
Do we want regular memory (N_MEMORY) in the fallback list of device private nodes?
The assumption is that we have ATS translation enabled? Assuming A and
B are N_MEMORY here, or am I misreading your illustration?
> Let's say you pass:
>
> alloc_pages_node(C, ..., nodemask(A,C,D))
>
> So we get
>
> Fallback(C,A,B) & nodemask(A,C,D) -> iterate(C,A)
>
> If we wanted to change this behavior, realistically we'd be looking for
> a way to add specific nodes to certain fallback lists - rather than
> modify the nodemask interaction in some way.
Yes, that is what we did with CDM, control the fallback for
N_MEMORY_PRIVATE, but there is a design decision to be made here.
>
> I think this is out of scope for the first iteration - so supporting
> anything other than mbind() from the start is just pointless.
>
> The only feasible mempolicy you can apply is single-node bind, so
> realistically you can only support mbind.
>
>
> 2) full mempolicy support doesn't really make sense
>
> task mempolicy PROBABLY should never really touch private nodes,
> while VMA policy certainly can. Assuming we're able to support
> multi-private-node masks, none of the non-bind mempolicies even
> make sense for most private nodes (interleave? weighted interleave?)
>
Yes, mostly, but is that baked into the design? If so, why?
> I haven't worked through all the implications of a task policy having
> a private node attached, but the longer I think about it, the less it
> makes sense to just support this outright.
>
>
> 3) Introducing mbind support is not just a simple nodemask on a VMA,
> It also implies migration, cgroup/cpuset, and UAPI interactions.
>
> a) migration:
>
> mbind/mempolicy can and will engage migration when it is called
> with certain flags. Migration has subtle LRU interactions, but
> the patch set I have at least allows this to work.
>
> b) cgroup/cpuset:
>
> cpuset.mems rebinding will cause private nodes to be quietly
> rebound to non-private nodes within a nodemask.
>
> c) between A and B - we really want MPOL_F_STATIC to be required
> for mbind to be applied to private node so that it is never
> forcefully remapped.
>
> That's a UAPI semantic change specific for private nodes we
> should really take time to consider.
>
>
> 4) File VMA interactions don't entirely make sense with mbind
>
> In theory you might want:
>
> fd = open("somefile", ...);
> mem = mmap(fd, ...);
> mbind(mem, ..., private_node);
> for page in mem:
> mem[page_off] /* fault file into private memory */
>
> In reality: This does not work the way you want.
Why not? Just curious about what you found?
>
> I went digging and we need a few mild extensions to allow
> migration on mbind to work for pagecache pages, and the fault
> path does not necessarily respect the vma mempolicy always.
>
> You also start getting into the question of "what happens when
> the node is out of memory and you don't have reclaim support?".
Yes, we should discuss reclaim support, I think we should allow for
reclaim. It allows you to overcommit private memory the way we can
with regular memory.
> The OOM implications jump out at you pretty aggressively.
>
> Moreover other tasks can force the page cache pages to be moved
> as well. So the programming model here just kind of sucks.
>
> Works great for anon memory though :]
>
> For all these reasons, I think mbind/mempolicy support with
> private nodes needs to be brought in with follow up work - not
> introduced as part of the baseline set.
>
I am not opposed to the follow up work, but I feel mbind() should
be the fundamental work and user space API.
> > >
> > > for node in possible_nodes:
> > > alloc_pages_node(private_node, __GFP_THISNODE)
> > >
> > > In fact it's the opposite semantic of what we want.
> > > THISNODE says: "Do not fall back to OTHER nodes".
> > >
> >
> > That's why we need to control the fallback nodes carefully for
> > N_MEMORY_PRIVATE
> >
>
> My point is that __GFP_THISNODE is not actually useful.
>
> If we go by nodemask, submitting a single-node nodemask is the
> equivalent of an empty fallback list.
>
> If we gate access to a private node by __GFP_THISNODE... this is the
> same as just providing a single-node nodelist (putting aside the OOM
> implications for a moment).
>
> And it doesn't even buy you any new filtering ability against existing
> nodemask iterators that may already utilize __GFP_THISNODE. i.e.
>
> for node in online_nodes:
> alloc_pages_node(node, __GFP_THISNODE, ...)
> /* Alloc per-node resources */
>
> This pattern is undesirable, but completely valid.
>
> So overloading/requiring __GFP_THISNODE is just not useful.
>
> I will follow up soon with a new version that limits the private node
> interface to just nodemask and fallback list controls.
>
> I need to test a few more things related to removing normal nodes from
> private node fallbacks before I feel comfortable shipping without
> __GFP_PRIVATE.
>
> > > The semantic we want is "Do not allow allocations from private
> > > nodes UNLESS we specifically request" (__GFP_PRIVATE).
> > >
> > > __GFP_THISNODE does not actually buy you anything here, AND it's
> > > worse, in the scenario where a private node makes its way into the
> > > preferred slot (via possible_nodes or some other nodemask), the
> > > allocator cannot fall back to a node it can access.
> > >
> > > __GFP_THISNODE cannot be overloaded to do anything useful here.
> >
> > Let me clarify, I meant to say, let's use a nodemask for allocation
> > and __GFP_THISNODE gets us to the node we desire, if that is the only
> > node. My earlier comment might not have been clear.
> >
>
> My point was that __GFP_THISNODE is pointless and reduces to providing a
> single node nodemask anyway.
>
> The contention over __GFP_PRIVATE is a bit ideological - do we want:
>
> 1) A hard guarantee that allocations to a private node are controlled
> (__GFP_PRIVATE implies the caller knows what it's doing)
>
> or
>
> 2) A soft guarantee (fallback list isolation only), and needing to
> deal with undesired behavior that's "not technically a bug"
> associated with existing users of global nodemasks (possible,
> online, etc).
>
> I am arguing for #1 - the community has argued for #2 and "fixing
> existing nodemask users". I think we can ship #2 and pivot to #1 if we
> find fixing existing users is infeasible or too much of a maintenance
> burden.
Again happy to discuss this, I'd like to make sure we agree on the
design. I am wondering if there is any experimental data to choose
between 1 and 2.
>
> >
> > Why not use mbind() API's? Do we want to gate allocation/privileges
> > via a /dev?
> >
>
> We want to eventually enable it, but we really need to treat these
> extensions as a separate step from the base so that the UAPI
> implications are given proper scrutiny.
>
> In the short term, /dev/xxx and driver-local/service-local control
> of a node is still very useful.
>
> For example, for my compressed memory work, I have found that if
> implemented as a swap backend - the kernel can manage the node without
> any UAPI implications at all :].
>
> A driver managing memory on a private node could do the same.
>
> ~Gregory
Thanks for the detailed answers. I'm happy to iterate and experiment on
the design with you; my opinions come from way back when we tried
to do CDM (in its first iteration).
Balbir
^ permalink raw reply
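The point made in the message above, that the nodemask filters the fallback list rather than replacing it (Fallback(C,A,B) & nodemask(A,C,D) -> iterate(C,A)), can be sketched as a small userspace model. The node names follow the illustration in the thread; the table layout, the bitmask representation and the function name are assumptions made for this example and do not reflect the real zonelist code.

```c
#include <stddef.h>

enum { NODE_A, NODE_B, NODE_C, NODE_D, NR_NODES };

/* Fallback order per preferred node, as in the illustration; -1 ends a list. */
static const int fallbacks[NR_NODES][NR_NODES] = {
	[NODE_A] = { NODE_A, NODE_B, -1, -1 },
	[NODE_B] = { NODE_B, NODE_A, -1, -1 },
	[NODE_C] = { NODE_C, NODE_A, NODE_B, -1 },	/* private */
	[NODE_D] = { NODE_D, NODE_B, NODE_A, -1 },	/* private */
};

/*
 * Walk the preferred node's fallback list and keep only the nodes that
 * are also set in the nodemask (one bit per node id).
 * Returns the number of candidate nodes written to out[].
 */
static size_t candidate_nodes(int preferred, unsigned int nodemask, int *out)
{
	size_t n = 0;

	for (size_t i = 0; i < NR_NODES && fallbacks[preferred][i] >= 0; i++) {
		int nid = fallbacks[preferred][i];

		if (nodemask & (1u << nid))
			out[n++] = nid;
	}
	return n;
}
```

In this model, preferring C with nodemask {A, C, D} yields C then A. D is dropped because it is not in C's fallback list, and B is dropped because it is not in the nodemask. A single-node mask reduces the walk to one candidate, which is the sense in which __GFP_THISNODE adds nothing over a single-node nodemask.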
* Re: [PATCH bpf-next v2 2/3] tracing: Expose tracepoint BTF ids via tracefs
From: Andrii Nakryiko @ 2026-06-03 23:51 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mykyta Yatsenko, bpf, Mykyta Yatsenko, linux-trace-kernel,
Andrii Nakryiko, Alexei Starovoitov
In-Reply-To: <20260603185057.4a85d5df@fedora>
On Wed, Jun 3, 2026 at 3:51 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 3 Jun 2026 15:41:50 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > That's just to say that it would be great to have this in some form of
> > shape, so please help getting this into acceptable form, thanks!
>
> As I said, I'm not against it. I was hoping to find something to help
> alleviate the memory usage.
>
> I'll likely take these as is, but I'm currently on vacation and it will
> have to wait until next week.
>
Yeah, no worries, I was just going through my backlog and trying to
think if we can avoid extending the tracepoint class struct. Besides
the somewhat dirty kallsyms symbol lookup logic, I don't see any other
way.
Enjoy your vacation!
> -- Steve
>
^ permalink raw reply
* Re: [PATCH v3] tracing: fix CFI violation in probestub test
From: Masami Hiramatsu @ 2026-06-03 23:47 UTC (permalink / raw)
To: Eva Kurchatova
Cc: rostedt, linux-trace-kernel, linux-kernel, mathieu.desnoyers,
peterz, jpoimboe, samitolvanen
In-Reply-To: <20260603153147.573589-1-eva.kurchatova@virtuozzo.com>
On Wed, 3 Jun 2026 18:31:42 +0300
Eva Kurchatova <eva.kurchatova@virtuozzo.com> wrote:
> When multiple callbacks are registered on the same tracepoint,
> callbacks will be indirectly called via traceiter helper.
>
> Pointers to __probestub_* callbacks reside in __tracepoints section,
> which is excluded from ENDBR checks in objtool, causing objtool to
> assume those functions are never indirectly called.
>
> Registering multiple callbacks using sched_wakeup test will result
> in #CP exception due to missing ENDBR in __probestub_sched_wakeup
> on a CFI-enabled machine.
>
> Fix this by adding CFI_NOSEAL annotation to probestub declaration.
Thanks for the update, this looks good to me.
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Peter, will you pick up this fix, since it fixes an objtool change?
Thank you,
>
> Fixes: d5173f753750 ("objtool: Exclude __tracepoints data from ENDBR checks")
> Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>
> ---
> include/linux/tracepoint.h | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 763eea4d80d8..2d2b9f8cdda4 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -20,6 +20,7 @@
> #include <linux/rcupdate_trace.h>
> #include <linux/tracepoint-defs.h>
> #include <linux/static_call.h>
> +#include <linux/cfi.h>
>
> struct module;
> struct tracepoint;
> @@ -389,6 +390,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> void __probestub_##_name(void *__data, proto) \
> { \
> } \
> + /* \
> + * Annotate the probestub 'CFI_NOSEAL' to stop objtool from \
> + * requesting the kernel remove the ENDBR, because the only \
> + * references to the function are in the __tracepoint section, \
> + * that objtool doesn't scan. \
> + */ \
> + CFI_NOSEAL(__probestub_##_name); \
> DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name); \
> DEFINE_RUST_DO_TRACE(_name, TP_PROTO(proto), TP_ARGS(args))
>
> --
> 2.54.0
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [GIT PULL] rv fixes for v7.1
From: Steven Rostedt @ 2026-06-03 23:16 UTC (permalink / raw)
To: Gabriele Monaco; +Cc: linux-kernel, linux-trace-kernel, unknownbbqrx, Wen Yang
In-Reply-To: <20260603125056.75559-1-gmonaco@redhat.com>
On Wed, 3 Jun 2026 14:50:56 +0200
Gabriele Monaco <gmonaco@redhat.com> wrote:
> unknownbbqrx (1):
> tools/rv: Ensure monitor name and desc are NUL-terminated
Hi Gabriele,
What is this? All commits need to be authored by and signed off by
a real person with their official name.
https://docs.kernel.org/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
-- Steve
^ permalink raw reply
* Re: [PATCH bpf-next v2 2/3] tracing: Expose tracepoint BTF ids via tracefs
From: Steven Rostedt @ 2026-06-03 22:50 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Mykyta Yatsenko, bpf, Mykyta Yatsenko, linux-trace-kernel,
Andrii Nakryiko, Alexei Starovoitov
In-Reply-To: <CAEf4BzY5MTGJyys369qDS3b63-tKq3okCpFMqUdfzy0dXk7vLg@mail.gmail.com>
On Wed, 3 Jun 2026 15:41:50 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> That's just to say that it would be great to have this in some form of
> shape, so please help getting this into acceptable form, thanks!
As I said, I'm not against it. I was hoping to find something to help
alleviate the memory usage.
I'll likely take these as is, but I'm currently on vacation and it will
have to wait until next week.
-- Steve
^ permalink raw reply
* Re: [PATCH bpf-next v2 2/3] tracing: Expose tracepoint BTF ids via tracefs
From: Andrii Nakryiko @ 2026-06-03 22:41 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mykyta Yatsenko, bpf, Mykyta Yatsenko, linux-trace-kernel,
Andrii Nakryiko, Alexei Starovoitov
In-Reply-To: <20260526215741.1e5b3e42@fedora>
On Tue, May 26, 2026 at 6:57 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 26 May 2026 11:07:56 +0100
> Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
>
> > Hi Steven,
> >
> > Gentle ping on this patch from the series.
> >
> > Since this part touches tracing, I’d appreciate your thoughts on the
> > tracing changes whenever you have a chance.
> >
>
> Hi,
>
> I've been looking at this and was wondering if there are ways to not
> extend the trace_event_class structure. It's added for most trace
> events (actually each DECLARE_EVENT_CLASS). Although when things like
> BTF is enabled, this is a very small amount of extra memory.
>
> I haven't been ignoring this. I've just been thinking about other
> approaches, but haven't come up with anything. Of course, I haven't
> been spending that much time on it, as I've been focused on other
> things.
Just in theory, what alternative would there be besides having one
extra pointer in trace_event_class struct? Some sort of lookup by name
or something? E.g., if we know "call" part at runtime for any given
tracepoint for
.btf_ids = __bpf_trace_btf_ids_##call,
we can probably lookup symbol from kallsyms and fetch BTF IDs that way
without extending the struct?
FWIW, this type info for tracepoints (classic and raw both) is very
useful, because right now one needs to do a bunch of work (subject to
break due to kernel type/name changes, etc) to find this, and for
various generic tracing tooling this type information is actually a
necessity to be useful.
That's just to say that it would be great to have this in some form of
shape, so please help getting this into acceptable form, thanks!
>
> -- Steve
^ permalink raw reply
* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-06-03 21:27 UTC (permalink / raw)
To: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-0-2f0fae496530@google.com>
Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
writes:
> This is v7 of guest_memfd in-place conversion support.
>
Here are the outstanding items after going over everyone's comments
including Sashiko's:
+ KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
+ Need to move page clearing into __kvm_gmem_get_pfn to resolve
leak where populate can put initialized kernel memory into TDX
guest
+ See suggested fix at [1]
+ KVM: guest_memfd: Only prepare folios for private pages,
+ s/non-CoCo/CoCo in commit message "INIT_SHARED is about to be
supported for non-CoCo VMs in a later patch in this series"
+ Use Suggested-by: Michael Roth <michael.roth@amd.com>
+ KVM: selftests: Test that shared/private status is consistent across
processes
+ Improve test reliability using pthread_mutex
+ I have a fixup patch offline.
I would like feedback on these:
+ KVM: selftests: Test conversion with elevated page refcount
+ Askar pointed out that soon vmsplice may not pin pages. Should I
pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
take a dependency on CONFIG_GUP_TEST.
+ KVM: selftests: Add script to exercise private_mem_conversions_test
+ Would like to know what people think of a wrapper script before
I address Sashiko's comments.
[1] https://lore.kernel.org/all/CAEvNRgEVC=fFuKVgZYvWyZD7t_zvUZihFG8hrACjvtkD5cwugw@mail.gmail.com/
[2] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/
>
> [...snip...]
>
^ permalink raw reply
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Andrew Morton @ 2026-06-03 19:54 UTC (permalink / raw)
To: Steven Rostedt
Cc: David Hildenbrand (Arm), Borislav Petkov, Zhuo, Qiuxu,
mchehab+huawei@kernel.org, Luck, Tony, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <20260603130006.7d2c4a62@gandalf.local.home>
On Wed, 3 Jun 2026 13:00:06 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 3 Jun 2026 18:26:24 +0200
> "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>
> > Yeah, I was fearing that when I read in [2]:
> >
> > "It has become clear in the past that this promise extends to
> > tracepoints, most notably in 2011 when a tracepoint change broke
> > powertop and had to be reverted."
>
> Technically the issue is with trace events and not tracepoints. The
> difference is that a trace event is created via the TRACE_EVENT() macro
> which defines what is to be collected from the tracepoint and exposes that
> information to tracefs which applications can easily see.
>
> A tracepoint is simply the hook in the code that you can attach to. Trace
> events create a callback from that hook to extract the data from the
> tracepoint to fill in the fields.
The problem here appears to be that "ras:memory_failure_event" became
"memory_failure:memory_failure_event".
Perhaps we can add infrastructure to permit aliasing "ras" onto
"memory_failure". So if we make these namespace alterations we can
easily preserve back-compatibility?
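Until something like that exists, a user-space tool could tolerate the
rename by probing both system names. A minimal sketch, assuming tracefs is
mounted at /sys/kernel/tracing; the function name find_event_dir and the
probing order are illustrative choices, not taken from rasdaemon or any
other existing tool:

```python
import os

# System names the event has lived under, per this thread.
# The order (new name first, then the old one) is an assumption.
CANDIDATE_SYSTEMS = ("memory_failure", "ras")
EVENT = "memory_failure_event"


def find_event_dir(tracefs_root="/sys/kernel/tracing"):
    """Return the first existing events/<system>/<event> directory, or None."""
    for system in CANDIDATE_SYSTEMS:
        path = os.path.join(tracefs_root, "events", system, EVENT)
        if os.path.isdir(path):
            return path
    return None


if __name__ == "__main__":
    # Prints the directory found on this machine, or None if neither exists.
    print(find_event_dir())
```

This only papers over the problem in tools that get updated; binaries
already deployed would still look under the old path only.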
^ permalink raw reply
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Steven Rostedt @ 2026-06-03 19:31 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Borislav Petkov, Zhuo, Qiuxu, mchehab+huawei@kernel.org,
Luck, Tony, akpm@linux-foundation.org, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <b637ede2-73da-49f0-a7eb-70ec79e79624@kernel.org>
On Wed, 3 Jun 2026 21:13:30 +0200
"David Hildenbrand (Arm)" <david@kernel.org> wrote:
> Thanks, that makes sense!
>
> So, would it be fair to say that, in general, what's exposed through
>
> /sys/kernel/tracing/events/
>
> is stable ABI?
It's only stable if something depends on it. It changes all the time.
It's only when someone complains about it that it becomes "stable"!
-- Steve
^ permalink raw reply
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Steven Rostedt @ 2026-06-03 19:30 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Borislav Petkov, Zhuo, Qiuxu, mchehab+huawei@kernel.org,
Luck, Tony, akpm@linux-foundation.org, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <b637ede2-73da-49f0-a7eb-70ec79e79624@kernel.org>
On Wed, 3 Jun 2026 21:13:30 +0200
"David Hildenbrand (Arm)" <david@kernel.org> wrote:
> Would the following be sufficient to avoid a full revert and the dependency on CONFIG_RAS?
>
> diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
> index aa57cc8f896b..c46b17602578 100644
> --- a/include/trace/events/memory-failure.h
> +++ b/include/trace/events/memory-failure.h
> @@ -1,6 +1,7 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #undef TRACE_SYSTEM
> -#define TRACE_SYSTEM memory_failure
> +/* Some user space relies on ras/memory_failure_event */
> +#define TRACE_SYSTEM ras
If that puts back the original path then yeah, all would be good.
-- Steve
> #define TRACE_INCLUDE_FILE memory-failure
>
> #if !defined(_TRACE_MEMORY_FAILURE_H) || defined(TRACE_HEADER_MULTI_READ)
^ permalink raw reply
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: David Hildenbrand (Arm) @ 2026-06-03 19:13 UTC (permalink / raw)
To: Steven Rostedt
Cc: Borislav Petkov, Zhuo, Qiuxu, mchehab+huawei@kernel.org,
Luck, Tony, akpm@linux-foundation.org, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <20260603130006.7d2c4a62@gandalf.local.home>
On 6/3/26 19:00, Steven Rostedt wrote:
> On Wed, 3 Jun 2026 18:26:24 +0200
> "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>
>> Yeah, I was fearing that when I read in [2]:
>>
>> "It has become clear in the past that this promise extends to
>> tracepoints, most notably in 2011 when a tracepoint change broke
>> powertop and had to be reverted."
>
> Technically the issue is with trace events and not tracepoints. The
> difference is that a trace event is created via the TRACE_EVENT() macro
> which defines what is to be collected from the tracepoint and exposes that
> information to tracefs which applications can easily see.
>
> A tracepoint is simply the hook in the code that you can attach to. Trace
> events create a callback from that hook to extract the data from the
> tracepoint to fill in the fields.
>
>>
>> Which means that I now also fully understand
>>
>> "Some kernel maintainers prohibit or severely restrict the addition of
>> tracepoints to their subsystems out of fear that a similar thing could
>> happen to them. "
>>
>> Whatever the result of this discussion will be, I'll try to document it.
>
> You can still create a tracepoint without creating a trace event by using
> the DECLARE_TRACE() macro. The scheduler subsystem uses that quite
> extensively. That creates a tracepoint without exposing it to tracefs. The
> runtime verifier uses these hooks to monitor the scheduler.
>
> But you can still connect to these tracepoints from tracefs via a tprobe. A
> tprobe hooks to tracepoints that you need the source code to find (just
> like a fprobe hooks to any function). Thus applications *can't* rely on
> them because there's nothing there to tell you it exists or not.
Thanks, that makes sense!
So, would it be fair to say that, in general, what's exposed through
/sys/kernel/tracing/events/
is stable ABI?
Would the following be sufficient to avoid a full revert and the dependency on CONFIG_RAS?
diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
index aa57cc8f896b..c46b17602578 100644
--- a/include/trace/events/memory-failure.h
+++ b/include/trace/events/memory-failure.h
@@ -1,6 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
-#define TRACE_SYSTEM memory_failure
+/* Some user space relies on ras/memory_failure_event */
+#define TRACE_SYSTEM ras
#define TRACE_INCLUDE_FILE memory-failure
#if !defined(_TRACE_MEMORY_FAILURE_H) || defined(TRACE_HEADER_MULTI_READ)
--
Cheers,
David
^ permalink raw reply related
* Re: mm/memory-failure tracepoint change breaks userspace rasdaemon
From: Steven Rostedt @ 2026-06-03 17:00 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Borislav Petkov, Zhuo, Qiuxu, mchehab+huawei@kernel.org,
Luck, Tony, akpm@linux-foundation.org, linmiaohe@huawei.com,
xieyuanbin1@huawei.com, Lai, Yi1, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <0c16bf3d-7c6d-4e28-b200-03b7d0ef714a@kernel.org>
On Wed, 3 Jun 2026 18:26:24 +0200
"David Hildenbrand (Arm)" <david@kernel.org> wrote:
> Yeah, I was fearing that when I read in [2]:
>
> "It has become clear in the past that this promise extends to
> tracepoints, most notably in 2011 when a tracepoint change broke
> powertop and had to be reverted."
Technically the issue is with trace events and not tracepoints. The
difference is that a trace event is created via the TRACE_EVENT() macro
which defines what is to be collected from the tracepoint and exposes that
information to tracefs which applications can easily see.
A tracepoint is simply the hook in the code that you can attach to. Trace
events create a callback from that hook to extract the data from the
tracepoint to fill in the fields.
>
> Which means that I now also fully understand
>
> "Some kernel maintainers prohibit or severely restrict the addition of
> tracepoints to their subsystems out of fear that a similar thing could
> happen to them. "
>
> Whatever the result of this discussion will be, I'll try to document it.
You can still create a tracepoint without creating a trace event by using
the DECLARE_TRACE() macro. The scheduler subsystem uses that quite
extensively. That creates a tracepoint without exposing it to tracefs. The
runtime verifier uses these hooks to monitor the scheduler.
But you can still connect to these tracepoints from tracefs via a tprobe. A
tprobe hooks to tracepoints that you need the source code to find (just
like a fprobe hooks to any function). Thus applications *can't* rely on
them because there's nothing there to tell you it exists or not.
For example, for the given tracepoint:
# cd /sys/kernel/tracing
# echo 't:rfail memory_failure_event pfn=pfn type=type result=result' > dynamic_events
# cat events/tracepoints/rfail/format
name: rfail
ID: 1894
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:unsigned long __probe_ip; offset:8; size:8; signed:0;
field:u64 pfn; offset:16; size:8; signed:0;
field:s32 type; offset:24; size:4; signed:1;
field:s32 result; offset:28; size:4; signed:1;
print fmt: "(%lx) pfn=%Lu type=%d result=%d", REC->__probe_ip, REC->pfn, REC->type, REC->result
It requires that BTF exists and the above doesn't annotate the result as
nicely. But you can get data directly from tracepoints this way.
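The field lines of a format file like the one above can also be read
mechanically. A minimal Python sketch, assuming the layout shown
("field:<type> <name>; offset:N; size:N; signed:0|1;"); the function name
parse_format_fields and the SAMPLE text are illustrative only and not part
of any kernel tool:

```python
import re

# Matches one "field:" line of a tracefs "format" file, e.g.
#   field:u64 pfn; offset:16; size:8; signed:0;
# \s* also accepts the tab separators used by real format files.
FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*"
    r"size:(?P<size>\d+);\s*signed:(?P<signed>[01]);"
)


def parse_format_fields(text):
    """Return one dict (name, type, offset, size, signed) per field line."""
    fields = []
    for line in text.splitlines():
        m = FIELD_RE.search(line)
        if m is None:
            continue  # skip "name:", "ID:", "print fmt:" and blank lines
        # The last space-separated token is the field name; an array
        # suffix such as "comm[16]" stays attached to the name.
        ctype, _, name = m.group("decl").strip().rpartition(" ")
        fields.append({
            "name": name,
            "type": ctype,
            "offset": int(m.group("offset")),
            "size": int(m.group("size")),
            "signed": m.group("signed") == "1",
        })
    return fields


# Four of the field lines from the output above.
SAMPLE = """\
field:unsigned long __probe_ip; offset:8; size:8; signed:0;
field:u64 pfn; offset:16; size:8; signed:0;
field:s32 type; offset:24; size:4; signed:1;
field:s32 result; offset:28; size:4; signed:1;
"""

if __name__ == "__main__":
    for field in parse_format_fields(SAMPLE):
        print(field)
```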
-- Steve
^ permalink raw reply