* Re: [PATCH] Re: Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Wen Yang @ 2026-05-21 17:40 UTC (permalink / raw)
To: Gabriele Monaco; +Cc: linux-kernel, linux-trace-kernel, rostedt
In-Reply-To: <5183dc18d63b617ab0f19290e8a790fa6898f372.camel@redhat.com>
On 5/19/26 19:14, Gabriele Monaco wrote:
> Hi Wen,
>
> On Mon, 2026-05-18 at 01:13 +0800, Wen Yang wrote:
>>
>> Yes. The ftracetest check_requires logic calls `command -v <binary>` to
>> satisfy `requires: <name>:program` directives. Without the script's
>> directory in PATH those checks evaluate to exit_unsupported and test cases
>> are skipped rather than run. The make path avoids this only because make
>> sets OUTDIR and the runner appends it to PATH internally.
>>
>
> So you're overriding PATH so the selftest's binaries can be found from
> the test, right?
>
> Wouldn't it be simpler to just put the absolute paths in the tests and
> don't touch PATH.
>
> If the selftests are run via makefile, it ensures the required binaries
> are built and available, so there's no need to go through the `requires:
> <name>:program` infrastructure (that's more about what's installed on
> the system.
>
> Or if you don't want anything hardcoded, you could pass the $OUTDIR from
> the environment and use that in scripts, whatever looks cleaner.
>
> Does it make sense to you?
>
>>
>> -- Patch 04: pre-allocated storage pool
>>
>> > Since you're using spinlocks, isn't that going to sleep on PREEMPT_RT?
>>
>> User-mode uprobe handlers run with preempt_count == 0, fully preemptible on
>> both PREEMPT_RT and non-PREEMPT_RT. The strongest evidence is in tlob
>> itself: tlob_start_task() takes a mutex and calls kmem_cache_zalloc(...,
>> GFP_KERNEL) from the uprobe entry handler; both are illegal in atomic
>> context and would trigger lockdep splats immediately.
>>
>> On PREEMPT_RT, spinlock_t becoming a sleeping lock in the uprobe handler
>> iscfine: both call sites (da_create_or_get_pool() from the handler and
>> da_pool_return_cb() from the rcuc kthread) are in sleepable task context.
>>
>
> Yeah exactly, the uprobe is fine with anything (even the automatic
> `kmalloc_nolock`), but sure preallocation at least guarantees the slots
> are there.
>
>> > We can have a macro DA_MON_ALLOCATION_STRATEGY = {DA_ALLOC_AUTO,
>> > DA_ALLOC_POOL, DA_ALLOC_MANUAL} where DA_MON_POOL also requires
>> > DA_MON_POOL_SIZE to be defined (force that with an #error).
>> >
>> > Anyway, this way you probably wouldn't need to define a different init
>> > function and let everything handled more transparently.
>> >
>> > Also you don't need to call da_create_or_get() explicitly,
>> > da_handle_start_event() should do it for you.
>>
>> Agreed on all counts. We plan to implement this in v3 as follows.
>> The three strategies would be a compile-time selection in da_monitor.h:
>>
>> DA_ALLOC_AUTO (default) - lock-free kmalloc_nolock on the hot path;
>> unbounded capacity.
>>
>> DA_ALLOC_POOL - pre-allocated fixed-size pool;
>> DA_MON_POOL_SIZE
>> required, enforced with #error if missing.
>>
>> DA_ALLOC_MANUAL - caller pre-inserts storage via
>> da_create_empty_storage() before the first
>> da_handle_start_event(); the framework only
>> links the target field.
>>
>> da_monitor_init_prealloc() would be removed; da_monitor_init() would
>> select pool or kmalloc initialisation internally based on the strategy.
>>
>> da_handle_start_event() and da_handle_start_run_event() would both call
>> da_prepare_storage() at compile time:
>>
>> DA_ALLOC_AUTO -> da_create_storage() (kmalloc_nolock)
>> DA_ALLOC_POOL -> da_create_or_get_pool()
>> DA_ALLOC_MANUAL -> da_fill_empty_storage() (link target into pre-
>> allocated slot; no allocation on the hot path)
>>
>> No explicit da_create_or_get() call would be needed in any monitor.
>>
>> da_create_or_get_kmalloc() would be removed: as you noted, a caller that
>> uses kmalloc_nolock does so because locking is forbidden; a GFP_KERNEL
>> fallback is equally forbidden if the lockless attempt fails, so the
>> function has no viable use case.
>>
>> tlob would define:
>> #define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_POOL
>> #define DA_MON_POOL_SIZE TLOB_MAX_MONITORED
>>
>> nomiss would define:
>> #define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_MANUAL
>>
>> and call da_create_empty_storage() from handle_sys_enter() (the
>> sched_setscheduler syscall path), which runs in safe task context;
>> da_fill_empty_storage() would then link the sched_dl_entity target on
>> the first da_handle_start_run_event() call in handle_sched_switch().
>
> Yeah good point, there's no need to make it a special path even if we
> have the target ready, da_handle_start_run_event() can do it just fine.
>
>>
>>
>> -- Patch 05: generic uprobe infrastructure
>>
>> Carried unchanged into v3 (as part of the 08-b split described below).
>>
>>
>> -- Patch 06: rvgen __init arrow reset
>>
>> Thanks, carried unchanged into v3.
>>
>
> Well, if you don't need reset() on the __init arrow we can drop this,
> right? Also it doesn't seem fully wired with the rest and requires a
> separate event to do handle_monitor_start(), which can be only just
> another handler for tlob, nothing general.
>
>>
>> > Why don't you make it a separate event (e.g. "start_tlob") [...] then
>> > you also wouldn't need to call reset() and start_timer() manually.
>>
>> Good suggestion. We plan to use a dedicated start_tlob event instead,
>> with a self-loop in tlob.dot:
>>
>> "running" -> "running" [ label = "start;reset(clk_elapsed)" ]
>>
>> da_handle_start_run_event(task->pid, ws, start_tlob) would put the
>> monitor into running and deliver start_tlob, which resets clk_elapsed
>> and arms the budget hrtimer via the generated ha_setup_invariants() —
>> no manual reset() or start_timer() calls needed.
>>
>> One guard would be added in tlob's ha_setup_invariants() to make the
>> self-loop work correctly:
>>
>> if (next_state == curr_state && event != start_tlob)
>> return;
>>
>> Without this, the start_tlob self-loop would be treated the same as any
>> repeated switch_in (already running) and ha_setup_invariants() would
>> return early, leaving the timer unarmed. Does this look right to you?
>>
>
> If you just add a separate event rvgen should take care of everything,
> you should be able to take ha_verify_constraint() and friends as-is from
> the generated code.
>
> But yeah, that's what it would end up doing.
>
>>
>> -- Patch 08: tlob monitor
>>
>> --- Patch structure ---
>>
>> > Could you have everything that isn't strictly tlob-related in another
>> > patch.
>>
>> Agreed. With the ioctl interface deferred (see below), v3 would keep
>> patch 08 as the tlob monitor only:
>>
>> 05-b: rv: extend uprobe API with three-phase detach helpers
>> (rv_uprobe.c, rv_uprobe.h, rv_uprobe_detach refactoring)
>> — extension of patch 05, independent of tlob
>>
>> 08: rv/tlob: add the tlob monitor itself
>> (tlob.c, tlob.h, tlob_trace.h, Kconfig/Makefile, Documentation,
>> rv_trace.h include; ha_monitor.h EVENT_NONE_LBL override
>> bundled here as it is only needed by tlob)
>>
>> The chardev infrastructure (rv_chardev.c, rv.h additions) and the UAPI
>> header (include/uapi/linux/rv.h) would move to a follow-up series
>> together with the ioctl self-instrumentation feature.
>>
>> --- ioctl interface design ---
>>
>> > I'm not particularly fond of ioctls, they aren't that flexible and in
>> > this way I don't really see an added value.
>> > [...] cannot the same thing be achieved using uprobes alone, e.g. by
>> > registering a function address or the current instruction pointer?
>> > [...] wouldn't a sysfs/tracefs file achieve a similar purpose without
>> > much of the boilerplate code?
>>
>> Fair point. We plan to ship v3 with the tracefs/uprobe interface only
>> and defer the ioctl (/dev/rv chardev) to a follow-up series once there
>> is a concrete in-tree user that requires it.
>>
>> The unique value of the ioctl is that TLOB_IOCTL_TRACE_STOP returns a
>> synchronous per-call result (-EOVERFLOW or 0) to the calling thread,
>> which neither uprobes nor tracefs writes can provide. We want to keep
>> that option open for later, but agree it should not block the initial
>> tlob submission.
>>
>> Does this approach work for you?
>>
>> What is your preference?
>
> Yeah looks good to me.
> Ioctls are cumbersome to set up also for the user, perhaps another sysfs
> file in the monitor directory would keep the control entirely in tlob.c
> and give you roughly the same value with easier setup.
>
> Heck we might even think of an RV reactor that does that: e.g. creates a
> file where reads sleep until the first reaction (-EOVERFLOW) and returns
> 0 in other scenarios. I'm gonna have a thought on that, but anyway I
> don't see why a sysfs file cannot do this.
>
> Let's defer it for now.
>
>>
>> --- Handler simplification ---
>>
>> > Perhaps keep the handler simpler by moving this reporting to a helper
>> > function and use guard(rcu)() there.
>>
>> Done. The accumulation logic is extracted into three inline helpers, each
>> using scoped_guard(rcu) and returning bool (true if the task is monitored):
>>
>> tlob_acc_running(prev) - accumulate running_ns on sched-out
>> tlob_acc_waiting(next) - accumulate waiting_ns on sched-in
>> tlob_acc_sleeping(task) - accumulate sleeping_ns on wakeup
>>
>> handle_sched_switch() and handle_sched_wakeup() become one-liners:
>>
>> static void handle_sched_switch(...)
>> {
>> bool prev_preempted = (prev_state == 0);
>>
>> if (tlob_acc_running(prev))
>> da_handle_event(prev->pid, NULL,
>> prev_preempted ? preempt_tlob : sleep_tlob);
>> if (tlob_acc_waiting(next))
>> da_handle_event(next->pid, NULL, switch_in_tlob);
>> }
>
> Yeah sounds good.
>
>>
>> > You probably don't need these. da_handle_event should skip tasks without
>> > a monitor.
>>
>> Agreed; the do_prev/do_next flags are gone. The helpers return false
>> for unmonitored tasks, and da_handle_event() skips them too — both paths
>> are no-ops for tasks with no pool entry.
>>
>> --- scoped_guard(rcu) ---
>>
>> > That should be a scoped_guard(rcu), definitely use guards if you have
>> > return paths, the compiler is going to clean up (unlock) for you.
>>
>> Applied to all RCU-protected sections in tlob_start_task() and
>> tlob_stop_task(). tlob_start_task() now uses guard(mutex) for the
>> serialised duplicate-check (replacing the explicit mutex_lock/unlock),
>> and tlob_stop_task() uses scoped_guard(rcu) for the atomic CAS section:
>>
>> scoped_guard(rcu) {
>> ws = da_get_target_by_id(task->pid);
>> if (!ws)
>> return -ESRCH;
>> ...
>> if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0)
>> return -EAGAIN;
>> }
>
> Perfect.
>
>>
>> --- tlob_stop_all removal ---
>>
>> > All this function does should be done by da_monitor_destroy. We could
>> > add a way to pass some additional deallocation for all the other cleanup
>> > you're doing on each storage. Something like a da_extra_cleanup() you
>> > can define as whatever you need and gets called in all per-obj
>> > destruction paths.
>>
>> Agreed. tlob_stop_all() (~50 lines) has been removed entirely.
>>
>> A da_extra_cleanup() hook macro is introduced in da_monitor.h: the default
>> is a no-op; a monitor may override it before including the header. tlob
>> defines:
>>
>> static inline void tlob_extra_cleanup(struct da_monitor *da_mon)
>> {
>> struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
>> struct tlob_task_state *ws = da_get_target(ha_mon);
>>
>> if (!ws)
>> return;
>> if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0)
>> return;
>
>> ha_cancel_timer_sync(ha_mon);
> After my patch making timer callbacks RCU read-side critical section,
> you won't need that, just let the usual reset asynchronously stop the
> timer and put everything that needs it stopped in your RCU callback.
>
> Of course make sure the timer was stopped before this extra cleanup, so
> put the macro accordingly.
>
> I don't think da_extra_cleanup in general should be expected to sleep
> and call_rcu should do the heavy lifting (it may run from any tracepoint).
>
> Anyway we can see it later after that's merged.
>
>> atomic_dec(&tlob_num_monitored);
>> put_task_struct(ws->task);
>> call_rcu(&ws->rcu, tlob_free_rcu);
>> }
>> #define da_extra_cleanup tlob_extra_cleanup
>>
>> da_monitor_destroy() iterates remaining entries via da_extra_cleanup +
>> hash_del_rcu + call_rcu, then waits for all callbacks via rcu_barrier().
>> tlob's disable path is now simply:
>>
>> static void __tlob_destroy_monitor(void)
>> {
>> da_monitor_destroy();
>> }
>
> Looks good, let's see the full picture.
>
>> --- EVENT_NONE_LBL ---
>>
>> > Why don't you just override EVENT_NONE_LBL (and if you prefer call it
>> > MONITOR_TIMER_EVENT_NAME) without the need for another function?
>>
>> Done. model_get_timer_event_name() has been removed from automata.h.
>> In ha_monitor.h, EVENT_NONE_LBL is now overridable:
>>
>> #ifndef EVENT_NONE_LBL
>> #define EVENT_NONE_LBL "none"
>> #endif
>>
>> tlob.c defines it before including the model header:
>>
>> #define EVENT_NONE_LBL "budget_exceeded"
>>
>> The two call sites in ha_monitor.h that previously called
>> model_get_timer_event_name() now use EVENT_NONE_LBL directly.
>>
>> --- KUnit config / tristate ---
>>
>> > Do you need to add this here? Since you have a patch adding KUnit tests
>> > to tlob, cannot you put everything kunit-related there?
>> > I couldn't build it as module.
>>
>> Agreed on moving the Kconfig entry to patch 09.
>>
>> The module build issue is fixed by exporting the symbols needed by the
>> test via EXPORT_SYMBOL_IF_KUNIT (EXPORTED_FOR_KUNIT_TESTING namespace);
>> tlob_kunit.c imports the namespace with MODULE_IMPORT_NS. We plan to
>> keep tristate rather than changing to bool. Does that work for you?
>
> Yeah it's good as long as it works as module too.
>
> I might have a look at making my patch module-ready, for now it just
> can't work but I wonder if we can do something nicer to allow it
> (like in your case a bunch of exports, a separate file and a standalone
> testcase, perhaps all wrapped in some helper).
>
>>
>> --- detail_env_tlob tracepoint ---
>>
>> > Since you are not documenting the detail_env_tlob tracepoint, is it
>> > something really required? I would at the very least document its usage.
>>
>> Fair point. detail_env_tlob emits (running_ns, waiting_ns, sleeping_ns)
>> so the user can see which phase consumed the budget: high sleeping_ns
>> indicates I/O latency, high waiting_ns indicates scheduler pressure, high
>> running_ns indicates a compute overrun. Without this breakdown the user
>> only knows the total elapsed time exceeded the threshold, not why.
>>
>
> Alright, then this can go into the docs.
>
>>
>> --- Documentation ---
>>
>> > This is standard tracepoints usage, there's nothing about tlob we should
>> > document here.
>> > Same here, standard RV [for enable/desc tracefs files].
>> > And this is duplicating what mentioned above about uprobes, isn't it?
>>
>> Agreed. The following have been removed:
>>
>> - "Violation events" section: generic trace-cmd examples and cat-trace
>> instructions (standard tracepoints usage).
>> - tracefs files: "enable (rw)" and "desc (ro)" entries (standard RV).
>> - tracefs files: "monitor (rw)" description condensed to one line with
>> a cross-reference to the uprobes section above.
>>
>> In their place, a new "Violation tracepoints" subsection documents both
>> tlob-specific tracepoints with fields and a worked example:
>>
>> error_env_tlob: id, state, event ("budget_exceeded"), env ("clk_elapsed")
>>
>> detail_env_tlob: id, threshold_us, running_ns, waiting_ns, sleeping_ns
>> Use sleeping_ns to diagnose I/O latency, waiting_ns for scheduler
>> pressure, running_ns for compute overruns.
>>
>> Example:
>> trace-cmd record -e error_env_tlob -e detail_env_tlob &
>> # ... run workload ...
>> trace-cmd report
>
> Yeah sounds good, also pointing out to enable the monitor.
> We might think of a general way to do this kind of thing in
> tools/rv, although detail_env_tlob is non-standard.
>
>> > Is kernel code going to use this API? RV monitors are meant to be
>> > enabled by userspace. What's the use-case here?
>>
>> Agreed. The uprobe interface is driven from userspace; tlob_start_task()
>> and tlob_stop_task() are the internal implementation functions it calls,
>> not a public API for external kernel modules. The hypothetical
>> kernel-module use case would be removed from the documentation; the
>> kernel-doc block is retained for code maintainers.
>>
>> > That's probably a bit too detailed for this page. If you really want
>> > this information somewhere couldn't it stay in the code?
>>
>> Agreed; moved to comments in handle_sched_switch() and
>> handle_sched_wakeup(). The "Limitations" subsection is retained.
>>
>> -- Patch 09: KUnit tests
>>
>> > What caught my eyes are tests enrolling tracepoints handlers. If you
>> > go there you're no longer doing unit testing, what's the advantage of
>> > testing the entire monitor here over doing that in selftests?
>>
>> Agreed. The three suites that register tracepoint handlers or create
>> kthreads (tlob_sched_integration, tlob_trace_output, tlob_violation_react)
>> have been removed from KUnit and will be added to selftests in v3.
>>
>> Two pure unit test suites remain in KUnit:
>>
>> tlob_task_api:
>> Tests tlob_start_task / tlob_stop_task return values (-ENODEV,
>> -EALREADY, -ESRCH, -EOVERFLOW, -ENOSPC, -ERANGE) via direct calls
>> (these functions are the internal implementation used by both the
>> uprobe and, in future, the ioctl interface).
>> No tracepoints, no scheduling.
>>
>> tlob_uprobe_format:
>> Tests the uprobe line parser (tlob_parse_uprobe_line,
>> tlob_parse_remove_line) against valid and invalid input strings.
>> Pure string parsing; no scheduling, no tracepoints.
>>
>> This also resolves the tristate-vs-bool issue: with only pure unit tests
>> there is no dependency on sched_setscheduler_nocheck, so bool is correct.
>>
>
> Yeah looks good.
>
>>
>> -- Patch 10: selftests
>>
>> --- PREEMPT_RT RCU stall ---
>>
>> > I run it on a VM and have it hanging at step 9 [...] rcu_preempt stall.
>> > Did you see that? Am I doing something wrong?
>>
>> Thanks for reporting. The patch changed ha_monitor.h from
>> HRTIMER_MODE_REL_HARD (the existing upstream value) to REL_SOFT; the
>> stall appeared on PREEMPT_RT after that change. We have not fully
>> confirmed whether REL_SOFT is the root cause — REL_SOFT defers the
>> callback to the ktimers kthread, which could starve rcu_preempt under
>> certain PREEMPT_RT configurations, but other factors may be involved.
>>
>> We plan to revert to HRTIMER_MODE_REL_HARD at both sites in ha_monitor.h
>> as the conservative choice:
>>
>> ha_setup_timer(): HRTIMER_MODE_REL_SOFT -> HRTIMER_MODE_REL_HARD
>> ha_start_timer_ns(): HRTIMER_MODE_REL_SOFT -> HRTIMER_MODE_REL_HARD
>>
>> Do you have more insight into the stall, or does REL_HARD resolve it on
>> your setup?
>
> Right, good point, any specific reason why you wanted REL_SOFT?
>
> I indeed always test under PREEMPT_RT but I still see the same splat
> also after reverting REL_HARD..
> Could you reproduce it on your setup?
>
> My config is nothing special: what vng gives you adding
> PREEMPT_RT/RCU_PREEMPT and lockdep (PROVE_LOCKING/PROVE_RCU).
>
Hi Gabriele,
No specific reason for REL_SOFT — not intentional, reverting to
REL_HARD.
Reproduced the stall on the same config (PREEMPT_RT +
PROVE_LOCKING/PROVE_RCU).
Root cause is a cleanup ordering bug in uprobe_detail_waiting.tc,
unrelated to REL_SOFT/REL_HARD:
# original cleanup — wrong order
echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" # (A)
kill "$hog_pid" # (B)
(A)
triggers synchronize_srcu() in the kernel. But tlob_target is stuck
mid-uprobe_notify_resume holding an SRCU read lock, preempted by the
FIFO-99 hog -> so the reader never finishes and (B) is never reached.
rcuc/0 (a kthread on PREEMPT_RT) is also starved by the hog -> RCU stall.
Fix: kill the hog first:
kill "$hog_pid"; wait "$hog_pid"
echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR"
On the PREEMPT_RT it is more reliably triggered there because rcuc/0
runs as a preemptible kthread rather than in softirq, making it easier
for the hog to monopolise the CPU long enough to hit the stall.
Thank you for the thorough review and valuable suggestions. We are
working through all of them and running the full test suite.
We expect to post v3 within the next two days.
--
Best wishes,
Wen
>>
>> --- Selftest structure ---
>>
>> > This should be tested together with the other monitors (enable/disable),
>> > we could at most expand those with the check_requires.
>> > Let's focus on tlob-only features in this patch.
>>
>> Agreed. In v3 we plan to drop tracefs.tc (covered by the generic
>> rv_monitor_enable_disable.tc) and keep only the six uprobe-specific
>> test cases under test.d/tlob/
>>
>> ioctl.tc is deferred with the ioctl interface to the follow-up series.
>> The KUnit integration tests (sched_switch accounting, budget-expiry
>> tracepoint) would be moved to selftests as additional test cases.
>>
>
> Thanks,
> Gabriele
>
^ permalink raw reply
* Re: [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Steven Rostedt @ 2026-05-21 15:23 UTC (permalink / raw)
To: Li Pengfei
Cc: linux-trace-kernel, mhiramat, linux-kernel, cmllamas, zhangbo56,
lipengfei28
In-Reply-To: <20260514034916.2162517-1-lipengfei28@xiaomi.com>
On Thu, 14 May 2026 11:49:13 +0800
Li Pengfei <ljdlns1987@gmail.com> wrote:
> From: Pengfei Li <lipengfei28@xiaomi.com>
>
> Hi Steven, all,
>
Hi Pengfei,
Can you address the Sashiko reviews:
https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260514034916.2162517-1-lipengfei28%40xiaomi.com
It has a way to copy the comments. Just reply to this series with a past of
Sashiko's review and reply to them to explain why the comments may not be
an issue, or submit a new version with fixes if they are issues.
Thanks,
-- Steve
^ permalink raw reply
* Re: [RFC PATCH 2/2] tracing: Record and show boot ID in last_boot_info
From: Steven Rostedt @ 2026-05-21 15:16 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Theodore Ts'o, Jason A . Donenfeld, Mathieu Desnoyers,
linux-kernel, linux-trace-kernel
In-Reply-To: <177937543666.2596845.9748178606108139386.stgit@mhiramat.tok.corp.google.com>
On Thu, 21 May 2026 23:57:16 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> @@ -4804,6 +4806,7 @@ struct trace_mod_entry {
> struct trace_scratch {
> unsigned int clock_id;
> unsigned long text_addr;
> + u8 boot_id[UUID_SIZE];
> unsigned long nr_entries;
> struct trace_mod_entry entries[];
> };
I just don't like wasting scratch space if boot_id isn't defined. But I
can't figure out a way to optionally have it there without wasting space
anyway.
If the get_boot_id() is accepted by the random folks, then I'm fine with
this change.
-- Steve
^ permalink raw reply
* Re: [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-05-21 15:09 UTC (permalink / raw)
To: Lance Yang
Cc: linmiaohe, akpm, david, ljs, vbabka, rppt, surenb, mhocko, shuah,
nao.horiguchi, rostedt, mhiramat, mathieu.desnoyers, corbet,
skhan, liam, linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team
In-Reply-To: <f76b79d3-080a-4931-873e-99d4b3e1020f@linux.dev>
On Sat, May 16, 2026 at 12:06:14PM +0800, Lance Yang wrote:
>
>
> On 2026/5/15 21:13, Breno Leitao wrote:
> [...]
> > >
> > > Wonder if it would be simpler to just do a positive check near the top
> > > of get_any_page() instead. Something like:
> > >
> > > static bool hwpoison_unrecoverable_kernel_page(struct page *page,
> > > unsigned long flags)
> >
> > Ack. We probably want to call it something like HWPoisonKernelOwned() to
> > follow the same naming sematics of these helpers, such as HWPoisonHandlable()
> >
> > By the way, I will re-include the self test back to this patch series,
> > In case they are not useful, we do not merge it.
> >
>
> Sounds good :)
>
> Can you also test the relevant page types if possible, especially
> the ones the new helper is supposed to classify?
Ack. I will expand the test to cover different page types as well!
Thanks,
--breno
^ permalink raw reply
* Re: [RFC PATCH 0/2] random: tracing: Expose last boot ID on persistent instance
From: Mathieu Desnoyers @ 2026-05-21 14:59 UTC (permalink / raw)
To: Masami Hiramatsu (Google), Theodore Ts'o, Jason A . Donenfeld,
Steven Rostedt
Cc: linux-kernel, linux-trace-kernel
In-Reply-To: <177937541909.2596845.17729857441826694760.stgit@mhiramat.tok.corp.google.com>
On 2026-05-21 10:56, Masami Hiramatsu (Google) wrote:
> Hi,
>
> Here is an RFC series to expose the boot ID (random UUID for each
> boot) in the last_boot_info of the persistent ring buffer instance.
>
> The persistent ring buffer can hold trace data beyond reboot/crashes.
> This means the recorded data does not always come from the last boot.
> Currently we just assume that the data comes from the last boot.
>
> On the other hand, the kernel provides a random generated UUID for
> each boot time, called "boot ID". If you record the logs with the
> boot ID, it is easy to do cross-referencing it with other logs.
>
> Similarly, recording the Boot ID for persistent ring buffer
> instances would make it easier to determine which boot the read
> data came from.
>
> For example:
>
> # cat /proc/sys/kernel/random/boot_id
> df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
>
> (enable tracing on persistent instance and reboot)
>
> # cat /sys/kernel/tracing/instances/ptracingtest/last_boot_info
> # boot_id: df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
> ffffffff81000000 [kernel]
>
FWIW, we've used boot id to uniquely identify traces belonging
to a given kernel execution and allow validation that traces
can indeed be correlated across CPU and across kernel vs userspace
for years in LTTng.
Good to see this approach proposed for Ftrace as well.
Thanks,
Mathieu
>
> Thank you,
>
> ---
>
> Masami Hiramatsu (Google) (2):
> random: Expose boot ID to other subsystems
> tracing: Record and show boot ID in last_boot_info
>
>
> drivers/char/random.c | 27 +++++++++++++++++++++------
> include/linux/random.h | 9 +++++++++
> kernel/trace/trace.c | 14 ++++++++++++--
> 3 files changed, 42 insertions(+), 8 deletions(-)
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply
* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Lorenzo Stoakes @ 2026-05-21 14:59 UTC (permalink / raw)
To: Nico Pache
Cc: David Hildenbrand (Arm), Wei Yang, Lance Yang, linux-doc,
linux-kernel, linux-mm, linux-trace-kernel, aarcange, akpm,
anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
liam, mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <CAA1CXcCNT51jeXh6Kwg1QN9e+AJB-1hg21kmeY6fTTKr2GACug@mail.gmail.com>
On Tue, May 19, 2026 at 01:05:13PM -0600, Nico Pache wrote:
> On Mon, May 18, 2026 at 1:33 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Mon, May 18, 2026 at 03:16:11PM +0200, David Hildenbrand (Arm) wrote:
> > > > For me, I would vote for fallback to 0.
> > >
> > > At this point I'll prefer to not return errors from collapse_max_ptes_none().
> > > It's just rather awkward to return an error deep down in collapse code for a
> > > configuration problem.
> > >
> > > For mthp collapse, we only support max_ptes_none==0 and
> > > max_ptes_none=="HPAGE_PMD_NR - 1" (default).
> > >
> > > If another value is specified while collapsing mTHP, print a warning and treat
> > > it as 0 (save value, no creep, no memory waste).
> > >
> > > In a sense, this is similar to how we handle max_ptes_shared + max_ptes_swap:
> > > for mTHP: we always treat them as being 0 for mTHP collapse (and don't issue a
> > > warning, because we would issue a warning with the default settings).
> > >
> > > @Lorenzo, fine with you?
> >
> > Yes 100%, this sounds sensible both in terms of the error and the default. Let's
> > keep our lives simple(-ish) please :)
>
> Ok thank you im glad we finally came to consensus on this! phew!
>
It happens sometimes ;)
Cheers, Lorenzo
^ permalink raw reply
* [RFC PATCH 2/2] tracing: Record and show boot ID in last_boot_info
From: Masami Hiramatsu (Google) @ 2026-05-21 14:57 UTC (permalink / raw)
To: Theodore Ts'o, Jason A . Donenfeld, Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <177937541909.2596845.17729857441826694760.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Record the boot ID (random UUID for each boot) when tracing on the
persistent ring buffer and show it in the last_boot_info file.
User can use this boot ID when cross-referencing it with other logs.
For example:
# cat /proc/sys/kernel/random/boot_id
df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
(enable tracing on persistent instance and reboot)
# cat /sys/kernel/tracing/instances/ptracingtest/last_boot_info
# boot_id: df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
ffffffff81000000 [kernel]
Thus, if user saves the other logs with this boot_id,
user can easily find the appropriate logs.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
kernel/trace/trace.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..c56694bb5e0c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -34,10 +34,12 @@
#include <linux/percpu.h>
#include <linux/splice.h>
#include <linux/kdebug.h>
+#include <linux/random.h>
#include <linux/string.h>
#include <linux/mount.h>
#include <linux/rwsem.h>
#include <linux/slab.h>
+#include <linux/uuid.h>
#include <linux/ctype.h>
#include <linux/init.h>
#include <linux/panic_notifier.h>
@@ -4804,6 +4806,7 @@ struct trace_mod_entry {
struct trace_scratch {
unsigned int clock_id;
unsigned long text_addr;
+ u8 boot_id[UUID_SIZE];
unsigned long nr_entries;
struct trace_mod_entry entries[];
};
@@ -4921,6 +4924,7 @@ static void update_last_data(struct trace_array *tr)
/* Reset the module list and reload them */
if (tr->scratch) {
struct trace_scratch *tscratch = tr->scratch;
+ const u8 *boot_id = get_boot_id();
tscratch->clock_id = tr->clock_id;
memset(tscratch->entries, 0,
@@ -4929,6 +4933,11 @@ static void update_last_data(struct trace_array *tr)
guard(mutex)(&scratch_mutex);
module_for_each_mod(save_mod, tr);
+ /* Also, update boot ID if exists */
+ if (boot_id)
+ memcpy(tscratch->boot_id, boot_id, sizeof(tscratch->boot_id));
+ else
+ memset(tscratch->boot_id, 0, sizeof(tscratch->boot_id));
}
/*
@@ -5822,9 +5831,10 @@ static void show_last_boot_header(struct seq_file *m, struct trace_array *tr)
* Otherwise it shows the KASLR address from the previous boot which
* should not be the same as the current boot.
*/
- if (tscratch && (tr->flags & TRACE_ARRAY_FL_LAST_BOOT))
+ if (tscratch && (tr->flags & TRACE_ARRAY_FL_LAST_BOOT)) {
+ seq_printf(m, "# boot_id: %pUb\n", tscratch->boot_id);
seq_printf(m, "%lx\t[kernel]\n", tscratch->text_addr);
- else
+ } else
seq_puts(m, "# Current\n");
}
^ permalink raw reply related
* [RFC PATCH 1/2] random: Expose boot ID to other subsystems
From: Masami Hiramatsu (Google) @ 2026-05-21 14:57 UTC (permalink / raw)
To: Theodore Ts'o, Jason A . Donenfeld, Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <177937541909.2596845.17729857441826694760.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add get_boot_id() to expose current boot ID to other kernel subsystems.
Note that since this is only meaningful if user can access it via sysctl,
it returns NULL if CONFIG_SYSCTL=n.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
drivers/char/random.c | 27 +++++++++++++++++++++------
include/linux/random.h | 9 +++++++++
2 files changed, 30 insertions(+), 6 deletions(-)
diff --git a/drivers/char/random.c b/drivers/char/random.c
index b4da1fb976c1..96a5a165627a 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1615,6 +1615,25 @@ static int sysctl_random_write_wakeup_bits = POOL_READY_BITS;
static int sysctl_poolsize = POOL_BITS;
static u8 sysctl_bootid[UUID_SIZE];
+/**
+ * get_boot_id - return the boot ID UUID
+ *
+ * This function returns a pointer to the boot ID UUID, which is generated on
+ * demand the first time this function is called. The boot ID is a UUID that
+ * is unique to each boot of the system.
+ */
+const u8 *get_boot_id(void)
+{
+ static DEFINE_SPINLOCK(bootid_spinlock);
+
+ spin_lock(&bootid_spinlock);
+ if (!sysctl_bootid[8])
+ generate_random_uuid(sysctl_bootid);
+ spin_unlock(&bootid_spinlock);
+
+ return sysctl_bootid;
+}
+
/*
* This function is used to return both the bootid UUID, and random
* UUID. The difference is in whether table->data is NULL; if it is,
@@ -1638,12 +1657,8 @@ static int proc_do_uuid(const struct ctl_table *table, int write, void *buf,
uuid = tmp_uuid;
generate_random_uuid(uuid);
} else {
- static DEFINE_SPINLOCK(bootid_spinlock);
-
- spin_lock(&bootid_spinlock);
- if (!uuid[8])
- generate_random_uuid(uuid);
- spin_unlock(&bootid_spinlock);
+ /* Ensure that the boot ID is initialized. */
+ get_boot_id();
}
snprintf(uuid_string, sizeof(uuid_string), "%pU", uuid);
diff --git a/include/linux/random.h b/include/linux/random.h
index 8a8064dc3970..aaf630f14931 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -130,6 +130,15 @@ static inline int get_random_bytes_wait(void *buf, size_t nbytes)
return ret;
}
+#ifdef CONFIG_SYSCTL
+const u8 *get_boot_id(void);
+#else
+static inline const u8 *get_boot_id(void)
+{
+ return NULL;
+}
+#endif
+
#ifdef CONFIG_SMP
int random_prepare_cpu(unsigned int cpu);
int random_online_cpu(unsigned int cpu);
^ permalink raw reply related
* [RFC PATCH 0/2] random: tracing: Expose last boot ID on persistent instance
From: Masami Hiramatsu (Google) @ 2026-05-21 14:56 UTC (permalink / raw)
To: Theodore Ts'o, Jason A . Donenfeld, Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
Hi,
Here is an RFC series to expose the boot ID (random UUID for each
boot) in the last_boot_info of the persistent ring buffer instance.
The persistent ring buffer can hold trace data beyond reboot/crashes.
This means the recorded data does not always come from the last boot.
Currently we just assume that the data comes from the last boot.
On the other hand, the kernel provides a random generated UUID for
each boot time, called "boot ID". If you record the logs with the
boot ID, it is easy to do cross-referencing it with other logs.
Similarly, recording the Boot ID for persistent ring buffer
instances would make it easier to determine which boot the read
data came from.
For example:
# cat /proc/sys/kernel/random/boot_id
df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
(enable tracing on persistent instance and reboot)
# cat /sys/kernel/tracing/instances/ptracingtest/last_boot_info
# boot_id: df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
ffffffff81000000 [kernel]
Thank you,
---
Masami Hiramatsu (Google) (2):
random: Expose boot ID to other subsystems
tracing: Record and show boot ID in last_boot_info
drivers/char/random.c | 27 +++++++++++++++++++++------
include/linux/random.h | 9 +++++++++
kernel/trace/trace.c | 14 ++++++++++++--
3 files changed, 42 insertions(+), 8 deletions(-)
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH 1/3] rtla: Fix output files in source tree
From: Steven Rostedt @ 2026-05-21 14:48 UTC (permalink / raw)
To: Ben Hutchings
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Tomas Glozar, linux-perf-users, linux-trace-kernel
In-Reply-To: <ag8YLYKZTKOq0t4x@decadent.org.uk>
On Thu, 21 May 2026 16:35:25 +0200
Ben Hutchings <benh@debian.org> wrote:
> Some output files (src/timerlat.bpf.o, src/timerlat.skel.h,
> example/timerlat_bpf_action.o, tests/bpf/bpf_action_map.o) are
> currently generated in the source tree, preventing a fully out-of-tree
> build. To fix this:
>
> - Add $(OUTPUT) to their filenames in the relevant Makefile rules, and
> create subdirectories as needed
> - Add $(OUTPUT)src to the include path
> - Add ${OUTPUT} to the BPF object filename in tests/timerlat.t
>
> Fixes: e34293ddcebd ("rtla/timerlat: Add BPF skeleton to collect samples")
> Fixes: 0304a3b7ec9a ("rtla/timerlat: Add example for BPF action program")
> Fixes: 5525aebd4e0c ("rtla/tests: Test BPF action program")
> Signed-off-by: Ben Hutchings <benh@debian.org>
Hi Ben,
Can you send this as a separate patch. The rtla code is handled via a
different tree than the perf code. So these patches will not be going
together.
Thanks,
-- Steve
^ permalink raw reply
* Re: [PATCH v4 2/2] tracing: Bound histogram expression strings with seq_buf
From: Steven Rostedt @ 2026-05-21 14:44 UTC (permalink / raw)
To: Pengpeng Hou
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-trace-kernel,
linux-kernel
In-Reply-To: <20260521022817.38453-2-pengpeng@iscas.ac.cn>
On Thu, 21 May 2026 10:28:17 +0800
Pengpeng Hou <pengpeng@iscas.ac.cn> wrote:
> @@ -1778,47 +1783,56 @@ static char *expr_str(struct hist_field *field, unsigned int level)
> if (!expr)
> return ERR_PTR(-ENOMEM);
>
> + seq_buf_init(&s, expr, MAX_FILTER_STR_VAL);
> +
> if (!field->operands[0]) {
> - expr_field_str(field, expr);
> + if (!expr_field_str(field, &s))
> + return ERR_PTR(-E2BIG);
> +
> return_ptr(expr);
> }
>
> if (field->operator == FIELD_OP_UNARY_MINUS) {
> - char *subexpr;
> + char *subexpr __free(kfree) = NULL;
>
> - strcat(expr, "-(");
> + seq_buf_puts(&s, "-(");
> subexpr = expr_str(field->operands[0], ++level);
> if (IS_ERR(subexpr))
> return subexpr;
>
> - strcat(expr, subexpr);
> - strcat(expr, ")");
> + seq_buf_puts(&s, subexpr);
> + seq_buf_putc(&s, ')');
> + seq_buf_str(&s);
>
> - kfree(subexpr);
> + if (seq_buf_has_overflowed(&s))
> + return ERR_PTR(-E2BIG);
>
> return_ptr(expr);
> }
Wouldn't the above if statement be a lot nicer as:
if (field->operator == FIELD_OP_UNARY_MINUS) {
char *subexpr;
subexpr = expr_str(field->operands[0], ++level);
if (IS_ERR(subexpr))
return subexpr;
seq_buf_printf(&s, "-(%s)", subexpr);
seq_buf_str(&s);
kfree(subexpr);
if (seq_buf_has_overflowed(&s))
return ERR_PTR(-E2BIG);
return_ptr(expr);
}
In fact, the above is so simple, you don't even need to use the __free()
guard on subexpr.
BTW, because currently seq_buf_printf() does add a '\0' to the string, I
may update the API for seq_buf to state that you don't need to terminate
after calling that function. So you can leave out the sub_buf_str() after
calling seq_buf_printf().
-- Steve
^ permalink raw reply
* Re: [PATCH v6 16/43] KVM: guest_memfd: Use actual size for invalidation in kvm_gmem_release()
From: Ackerley Tng @ 2026-05-21 14:40 UTC (permalink / raw)
To: Sean Christopherson, Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ag8BmtzxTlcuA_zy@google.com>
Sean Christopherson <seanjc@google.com> writes:
>
> [...snip...]
>
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -640,9 +640,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> }
>
> int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
> - unsigned int fd, loff_t offset)
> + unsigned int fd, u64 offset)
> {
> - loff_t size = slot->npages << PAGE_SHIFT;
> + u64 size = slot->npages << PAGE_SHIFT;
> unsigned long start, end;
> struct gmem_file *f;
> struct inode *inode;
>
My mental model was:
+ offsets => loff_t
+ indices => pgoff_t
+ sizes => size_t
But looks like loff_t is more suitable for places where return values
(possibly negative) matter.
Good to go with u64!
> [...snip...]
>
^ permalink raw reply
* Re: [PATCH v6 11/43] KVM: guest_memfd: Ensure pages are not in use before conversion
From: Ackerley Tng @ 2026-05-21 14:36 UTC (permalink / raw)
To: Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CA+EHjTyaBpTYsJRRyP09YggoHbi6s-ZgDoWoFgDRxO5k_BkoBw@mail.gmail.com>
Fuad Tabba <tabba@google.com> writes:
>
> [...snip...]
>
>> +static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>> + size_t nr_pages, pgoff_t *err_index)
>> +{
>> + struct address_space *mapping = inode->i_mapping;
>> + const int filemap_get_folios_refcount = 1;
>> + pgoff_t last = start + nr_pages - 1;
>> + struct folio_batch fbatch;
>> + bool safe = true;
>> + int i;
>> +
>> + folio_batch_init(&fbatch);
>> + while (safe && filemap_get_folios(mapping, &start, last, &fbatch)) {
>> +
>> + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
>> + struct folio *folio = fbatch.folios[i];
>> +
>> + if (folio_ref_count(folio) !=
>> + folio_nr_pages(folio) + filemap_get_folios_refcount) {
>> + safe = false;
>> + *err_index = folio->index;
>> + break;
>
> https://sashiko.dev/#/patchset/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4%40google.com?part=11
>
Sashiko's first issue on lru is addressed in a separate patch later. :)
> Sashiko raised a few issues here, but I think this one might be
> genuine. Can you look into it please?
>
> If that's right, when huge page support lands, if start falls in the
> middle of a large folio, returning folio->index as the err_index will
> return an offset strictly less than the requested start. A naive
> userspace retry loop resuming from error_offset would step backwards
> and corrupt attributes on memory it didn't intend to convert.
> err_index should be clamped to max(start, folio->index).
>
For these ones, I was thinking to defer all the huge-page related issues
to be fixed when huge pages land, since there are probably quite a few
places to update.
On second thought, this isn't a huge change, I'll fix this in the next
revision.
> Cheers,
> /fuad
>
>> + }
>> + }
>> +
>>
>> [...snip...]
>>
^ permalink raw reply
* [PATCH 3/3] perf tools: Put Python bytecode in output directory
From: Ben Hutchings @ 2026-05-21 14:35 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Steven Rostedt, Tomas Glozar
Cc: linux-perf-users, linux-trace-kernel
In-Reply-To: <ag8X7gcDw6jpJsLq@decadent.org.uk>
[-- Attachment #1: Type: text/plain, Size: 1025 bytes --]
The PMU events are processed into C sources by Python scripts, which
normally results in writing bytecode for each module into the source
tree. This prevents a fully out-of-tree build.
To fix this, set $PYTHONPYCACHEPREFIX to relocate the bytecode cache
directory in an out-of-tree build.
Signed-off-by: Ben Hutchings <benh@debian.org>
---
tools/perf/Makefile.perf | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 899a4249a42f..c35b65f9fdda 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -400,6 +400,11 @@ PYTHON_EXTBUILD_LIB := $(PYTHON_EXTBUILD)lib/
PYTHON_EXTBUILD_TMP := $(PYTHON_EXTBUILD)tmp/
export PYTHON_EXTBUILD_LIB PYTHON_EXTBUILD_TMP
+# Put Python bytecode in output directory
+ifdef OUTPUT
+export PYTHONPYCACHEPREFIX := $(OUTPUT)/__pycache__
+endif
+
python-clean := $(call QUIET_CLEAN, python) $(RM) -r $(PYTHON_EXTBUILD) $(OUTPUT)python/perf*.so
# Use the detected configuration
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply related
* [PATCH 2/3] perf tools: Put Python egg info in output directory
From: Ben Hutchings @ 2026-05-21 14:35 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Steven Rostedt, Tomas Glozar
Cc: linux-perf-users, linux-trace-kernel
In-Reply-To: <ag8X7gcDw6jpJsLq@decadent.org.uk>
[-- Attachment #1: Type: text/plain, Size: 990 bytes --]
Installing the Python extension currently creates an egg-info
directory in the source tree, preventing a fully out-of-tree build.
Add the necessary runes to the setup.py comamnd line to relocate the
egg-info directory in an out-of-tree build.
Signed-off-by: Ben Hutchings <benh@debian.org>
---
tools/perf/Makefile.perf | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index cee19c923c06..899a4249a42f 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -1152,7 +1152,9 @@ install-bin: install-tools install-tests
install: install-bin try-install-man
install-python_ext:
- $(PYTHON_WORD) util/setup.py --quiet install --root='/$(DESTDIR_SQ)'
+ $(PYTHON_WORD) util/setup.py --quiet \
+ $(if $(OUTPUT),egg_info --egg-base $(OUTPUT)) \
+ install --root='/$(DESTDIR_SQ)'
# 'make install-doc' should call 'make -C Documentation install'
$(INSTALL_DOC_TARGETS):
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply related
* [PATCH 1/3] rtla: Fix output files in source tree
From: Ben Hutchings @ 2026-05-21 14:35 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Steven Rostedt, Tomas Glozar
Cc: linux-perf-users, linux-trace-kernel
In-Reply-To: <ag8X7gcDw6jpJsLq@decadent.org.uk>
[-- Attachment #1: Type: text/plain, Size: 4742 bytes --]
Some output files (src/timerlat.bpf.o, src/timerlat.skel.h,
example/timerlat_bpf_action.o, tests/bpf/bpf_action_map.o) are
currently generated in the source tree, preventing a fully out-of-tree
build. To fix this:
- Add $(OUTPUT) to their filenames in the relevant Makefile rules, and
create subdirectories as needed
- Add $(OUTPUT)src to the include path
- Add ${OUTPUT} to the BPF object filename in tests/timerlat.t
Fixes: e34293ddcebd ("rtla/timerlat: Add BPF skeleton to collect samples")
Fixes: 0304a3b7ec9a ("rtla/timerlat: Add example for BPF action program")
Fixes: 5525aebd4e0c ("rtla/tests: Test BPF action program")
Signed-off-by: Ben Hutchings <benh@debian.org>
---
tools/tracing/rtla/Makefile | 31 ++++++++++++++++++-----------
tools/tracing/rtla/tests/timerlat.t | 4 ++--
2 files changed, 21 insertions(+), 14 deletions(-)
diff --git a/tools/tracing/rtla/Makefile b/tools/tracing/rtla/Makefile
index 45690ee14544..f54da7be735d 100644
--- a/tools/tracing/rtla/Makefile
+++ b/tools/tracing/rtla/Makefile
@@ -66,30 +66,37 @@ ifeq ($(config),1)
include Makefile.config
endif
+INCLUDES = -I$(OUTPUT)src
+
CFLAGS += $(INCLUDES) $(LIB_INCLUDES)
export CFLAGS OUTPUT srctree
ifeq ($(BUILD_BPF_SKEL),1)
-src/timerlat.bpf.o: src/timerlat.bpf.c
+$(OUTPUT)src/timerlat.bpf.o: src/timerlat.bpf.c
+ mkdir -p $(@D)
$(QUIET_CLANG)$(CLANG) -g -O2 -target bpf -c $(filter %.c,$^) -o $@
-src/timerlat.skel.h: src/timerlat.bpf.o
+$(OUTPUT)src/timerlat.skel.h: $(OUTPUT)src/timerlat.bpf.o
+ mkdir -p $(@D)
$(QUIET_GENSKEL)$(SYSTEM_BPFTOOL) gen skeleton $< > $@
-example/timerlat_bpf_action.o: example/timerlat_bpf_action.c
+$(OUTPUT)example/timerlat_bpf_action.o: example/timerlat_bpf_action.c
+ mkdir -p $(@D)
$(QUIET_CLANG)$(CLANG) -g -O2 -target bpf -c $(filter %.c,$^) -o $@
-tests/bpf/bpf_action_map.o: tests/bpf/bpf_action_map.c
+$(OUTPUT)tests/bpf/bpf_action_map.o: tests/bpf/bpf_action_map.c
+ mkdir -p $(@D)
$(QUIET_CLANG)$(CLANG) -g -O2 -target bpf -c $(filter %.c,$^) -o $@
else
-src/timerlat.skel.h:
- $(Q)echo '/* BPF skeleton is disabled */' > src/timerlat.skel.h
+$(OUTPUT)src/timerlat.skel.h:
+ mkdir -p $(@D)
+ $(Q)echo '/* BPF skeleton is disabled */' > $@
-example/timerlat_bpf_action.o: example/timerlat_bpf_action.c
+$(OUTPUT)example/timerlat_bpf_action.o: example/timerlat_bpf_action.c
$(Q)echo "BPF skeleton support is disabled, skipping example/timerlat_bpf_action.o"
-tests/bpf/bpf_action_map.o: tests/bpf/bpf_action_map.c
+$(OUTPUT)tests/bpf/bpf_action_map.o: tests/bpf/bpf_action_map.c
$(Q)echo "BPF skeleton support is disabled, skipping tests/bpf/bpf_action_map.o"
endif
@@ -103,7 +110,7 @@ static: $(RTLA_IN)
rtla.%: fixdep FORCE
make -f $(srctree)/tools/build/Makefile.build dir=. $@
-$(RTLA_IN): fixdep FORCE src/timerlat.skel.h
+$(RTLA_IN): fixdep FORCE $(OUTPUT)src/timerlat.skel.h
make $(build)=rtla
clean: doc_clean fixdep-clean
@@ -111,10 +118,10 @@ clean: doc_clean fixdep-clean
$(Q)find . -name '*.o' -delete -o -name '\.*.cmd' -delete -o -name '\.*.d' -delete
$(Q)rm -f rtla rtla-static fixdep FEATURE-DUMP rtla-*
$(Q)rm -rf feature
- $(Q)rm -f src/timerlat.bpf.o src/timerlat.skel.h example/timerlat_bpf_action.o
+ $(Q)rm -f $(OUTPUT)src/timerlat.bpf.o $(OUTPUT)src/timerlat.skel.h $(OUTPUT)example/timerlat_bpf_action.o
$(Q)rm -f $(UNIT_TESTS)
-check: $(RTLA) tests/bpf/bpf_action_map.o
+check: $(RTLA) $(OUTPUT)tests/bpf/bpf_action_map.o
RTLA=$(RTLA) BPFTOOL=$(SYSTEM_BPFTOOL) prove -o -f -v tests/
-examples: example/timerlat_bpf_action.o
+examples: $(OUTPUT)example/timerlat_bpf_action.o
.PHONY: FORCE clean check
diff --git a/tools/tracing/rtla/tests/timerlat.t b/tools/tracing/rtla/tests/timerlat.t
index fd4935fd7b49..e0f3fc4df655 100644
--- a/tools/tracing/rtla/tests/timerlat.t
+++ b/tools/tracing/rtla/tests/timerlat.t
@@ -74,12 +74,12 @@ then
# Test BPF action program properly in BPF mode
[ -z "$BPFTOOL" ] && BPFTOOL=bpftool
check "hist with BPF action program (BPF mode)" \
- "timerlat hist -T 2 --bpf-action tests/bpf/bpf_action_map.o --on-threshold shell,command='$BPFTOOL map dump name rtla_test_map'" \
+ "timerlat hist -T 2 --bpf-action ${OUTPUT}tests/bpf/bpf_action_map.o --on-threshold shell,command='$BPFTOOL map dump name rtla_test_map'" \
2 '"value": 42'
else
# Test BPF action program failure in non-BPF mode
check "hist with BPF action program (non-BPF mode)" \
- "timerlat hist -T 2 --bpf-action tests/bpf/bpf_action_map.o" \
+ "timerlat hist -T 2 --bpf-action ${OUTPUT}tests/bpf/bpf_action_map.o" \
1 "BPF actions are not supported in tracefs-only mode"
fi
done
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply related
* [PATCH 0/3] Fix out-of-tree build of some tools
From: Ben Hutchings @ 2026-05-21 14:34 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Steven Rostedt, Tomas Glozar
Cc: linux-perf-users, linux-trace-kernel
[-- Attachment #1: Type: text/plain, Size: 593 bytes --]
perf and rtla currently don't fully support out-of-tree builds, as
they may still create files in their source directory. This series
fixes all the instances of this problem that I have found.
Ben.
Ben Hutchings (3):
rtla: Fix output files in source tree
perf tools: Put Python egg info in output directory
perf tools: Put Python bytecode in output directory
tools/perf/Makefile.perf | 9 ++++++++-
tools/tracing/rtla/Makefile | 31 ++++++++++++++++++-----------
tools/tracing/rtla/tests/timerlat.t | 4 ++--
3 files changed, 29 insertions(+), 15 deletions(-)
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Wei Yang @ 2026-05-21 14:32 UTC (permalink / raw)
To: Vernon Yang
Cc: Wei Yang, Nico Pache, linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
dave.hansen, david, dev.jain, gourry, hannes, hughd, jack,
jackmanb, jannh, jglisse, joshua.hahnjy, kas, lance.yang, liam,
ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <91015820-f39a-4b06-89de-b49e5ca465fd@gmail.com>
On Thu, May 21, 2026 at 01:11:18PM +0800, Vernon Yang wrote:
>On Thu, May 21, 2026 at 02:46:54AM +0000, Wei Yang wrote:
>> On Thu, May 21, 2026 at 10:36:15AM +0800, Vernon Yang wrote:
>> >On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
>> >> Enable khugepaged to collapse to mTHP orders. This patch implements the
>> >> main scanning logic using a bitmap to track occupied pages and a stack
>> >> structure that allows us to find optimal collapse sizes.
>> >>
>> >> Previous to this patch, PMD collapse had 3 main phases, a light weight
>> >> scanning phase (mmap_read_lock) that determines a potential PMD
>> >> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
>> >> phase (mmap_write_lock).
>> >>
>> >> To enabled mTHP collapse we make the following changes:
>> >>
>> >> During PMD scan phase, track occupied pages in a bitmap. When mTHP
>> >> orders are enabled, we remove the restriction of max_ptes_none during the
>> >> scan phase to avoid missing potential mTHP collapse candidates. Once we
>> >> have scanned the full PMD range and updated the bitmap to track occupied
>> >> pages, we use the bitmap to find the optimal mTHP size.
>> >>
>> >> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
>> >> and determine the best eligible order for the collapse. A stack structure
>> >> is used instead of traditional recursion to manage the search. This also
>> >> prevents a traditional recursive approach when the kernel stack struct is
>> >> limited. The algorithm recursively splits the bitmap into smaller chunks to
>> >> find the highest order mTHPs that satisfy the collapse criteria. We start
>> >> by attempting the PMD order, then moved on the consecutively lower orders
>> >> (mTHP collapse). The stack maintains a pair of variables (offset, order),
>> >> indicating the number of PTEs from the start of the PMD, and the order of
>> >> the potential collapse candidate.
>> >>
>> >> The algorithm for consuming the bitmap works as such:
>> >> 1) push (0, HPAGE_PMD_ORDER) onto the stack
>> >> 2) pop the stack
>> >> 3) check if the number of set bits in that (offset,order) pair
>> >> statisfy the max_ptes_none threshold for that order
>> >> 4) if yes, attempt collapse
>> >> 5) if no (or collapse fails), push two new stack items representing
>> >> the left and right halves of the current bitmap range, at the
>> >> next lower order
>> >> 6) repeat at step (2) until stack is empty.
>> >>
>> >> Below is a diagram representing the algorithm and stack items:
>> >>
>> >> offset mid_offset
>> >> | |
>> >> | |
>> >> v v
>> >> ____________________________________
>> >> | PTE Page Table |
>> >> --------------------------------------
>> >> <-------><------->
>> >> order-1 order-1
>> >>
>> >> mTHP collapses reject regions containing swapped out or shared pages.
>> >> This is because adding new entries can lead to new none pages, and these
>> >> may lead to constant promotion into a higher order mTHP. A similar
>> >> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
>> >> introducing at least 2x the number of pages, and on a future scan will
>> >> satisfy the promotion condition once again. This issue is prevented via
>> >> the collapse_max_ptes_none() function which imposes the max_ptes_none
>> >> restrictions above.
>> >>
>> >> We currently only support mTHP collapse for max_ptes_none values of 0
>> >> and HPAGE_PMD_NR - 1. resulting in the following behavior:
>> >>
>> >> - max_ptes_none=0: Never introduce new empty pages during collapse
>> >> - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>> >> available mTHP order
>> >>
>> >> Any other max_ptes_none value will emit a warning and skip mTHP collapse
>> >> attempts. There should be no behavior change for PMD collapse.
>> >>
>> >> Once we determine what mTHP sizes fits best in that PMD range a collapse
>> >> is attempted. A minimum collapse order of 2 is used as this is the lowest
>> >> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>> >>
>> >> Currently madv_collapse is not supported and will only attempt PMD
>> >> collapse.
>> >>
>> >> We can also remove the check for is_khugepaged inside the PMD scan as
>> >> the collapse_max_ptes_none() function handles this logic now.
>> >>
>> >> Signed-off-by: Nico Pache <npache@redhat.com>
>> >> ---
>> >> mm/khugepaged.c | 182 +++++++++++++++++++++++++++++++++++++++++++++---
>> >> 1 file changed, 174 insertions(+), 8 deletions(-)
>> >>
>> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> >> index 3492b135d667..39bf7ea8a6e8 100644
>> >> --- a/mm/khugepaged.c
>> >> +++ b/mm/khugepaged.c
>> >> @@ -100,6 +100,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>> >>
>> >> static struct kmem_cache *mm_slot_cache __ro_after_init;
>> >>
>> >> +#define KHUGEPAGED_MIN_MTHP_ORDER 2
>> >> +/*
>> >> + * mthp_collapse() does an iterative DFS over a binary tree, from
>> >> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
>> >> + * size needed for a DFS on a binary tree is height + 1, where
>> >> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
>> >> + *
>> >> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
>> >> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
>> >> + */
>> >> +#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
>> >> +
>> >> +/*
>> >> + * Defines a range of PTE entries in a PTE page table which are being
>> >> + * considered for mTHP collapse.
>> >> + *
>> >> + * @offset: the offset of the first PTE entry in a PMD range.
>> >> + * @order: the order of the PTE entries being considered for collapse.
>> >> + */
>> >> +struct mthp_range {
>> >> + u16 offset;
>> >> + u8 order;
>> >> +};
>> >> +
>> >> struct collapse_control {
>> >> bool is_khugepaged;
>> >>
>> >> @@ -111,6 +135,12 @@ struct collapse_control {
>> >>
>> >> /* nodemask for allocation fallback */
>> >> nodemask_t alloc_nmask;
>> >> +
>> >> + /* Each bit represents a single occupied (!none/zero) page. */
>> >> + DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
>> >> + /* A mask of the current range being considered for mTHP collapse. */
>> >> + DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
>> >> + struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>> >> };
>> >>
>> >> /**
>> >> @@ -1404,20 +1434,140 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>> >> return result;
>> >> }
>> >>
>> >> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
>> >> + u16 offset, u8 order)
>> >> +{
>> >> + const int size = *stack_size;
>> >> + struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
>> >> +
>> >> + VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
>> >> + stack->order = order;
>> >> + stack->offset = offset;
>> >> + (*stack_size)++;
>> >> +}
>> >> +
>> >> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
>> >> + int *stack_size)
>> >> +{
>> >> + const int size = *stack_size;
>> >> +
>> >> + VM_WARN_ON_ONCE(size <= 0);
>> >> + (*stack_size)--;
>> >> + return cc->mthp_bitmap_stack[size - 1];
>> >> +}
>> >> +
>> >> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
>> >> + u16 offset, unsigned int nr_ptes)
>> >> +{
>> >> + bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
>> >> + bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
>> >> + return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
>> >> +}
>> >> +
>> >> +/*
>> >> + * mthp_collapse() consumes the bitmap that is generated during
>> >> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
>> >> + *
>> >> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
>> >> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
>> >> + * of the bitmap for collapse eligibility. The stack maintains a pair of
>> >> + * variables (offset, order), indicating the number of PTEs from the start of
>> >> + * the PMD, and the order of the potential collapse candidate respectively. We
>> >> + * start at the PMD order and check if it is eligible for collapse; if not, we
>> >> + * add two entries to the stack at a lower order to represent the left and right
>> >> + * halves of the PTE page table we are examining.
>> >> + *
>> >> + * offset mid_offset
>> >> + * | |
>> >> + * | |
>> >> + * v v
>> >> + * --------------------------------------
>> >> + * | cc->mthp_bitmap |
>> >> + * --------------------------------------
>> >> + * <-------><------->
>> >> + * order-1 order-1
>> >> + *
>> >> + * For each of these, we determine how many PTE entries are occupied in the
>> >> + * range of PTE entries we propose to collapse, then we compare this to a
>> >> + * threshold number of PTE entries which would need to be occupied for a
>> >> + * collapse to be permitted at that order (accounting for max_ptes_none).
>> >> + *
>> >> + * If a collapse is permitted, we attempt to collapse the PTE range into a
>> >> + * mTHP.
>> >> + */
>> >> +static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>> >> + int referenced, int unmapped, struct collapse_control *cc,
>> >> + unsigned long enabled_orders)
>> >> +{
>> >> + unsigned int nr_occupied_ptes, nr_ptes;
>> >> + int max_ptes_none, collapsed = 0, stack_size = 0;
>> >> + unsigned long collapse_address;
>> >> + struct mthp_range range;
>> >> + u16 offset;
>> >> + u8 order;
>> >> +
>> >> + collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>> >> +
>> >> + while (stack_size) {
>> >> + range = collapse_mthp_stack_pop(cc, &stack_size);
>> >> + order = range.order;
>> >> + offset = range.offset;
>> >> + nr_ptes = 1UL << order;
>> >> +
>> >> + if (!test_bit(order, &enabled_orders))
>> >> + goto next_order;
>> >> +
>> >> + max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
>> >> +
>> >> + if (max_ptes_none < 0)
>> >> + return collapsed;
>> >> +
>> >> + nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
>> >> + nr_ptes);
>> >> +
>> >> + if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>> >> + int ret;
>> >> +
>> >> + collapse_address = address + offset * PAGE_SIZE;
>> >> + ret = collapse_huge_page(mm, collapse_address, referenced,
>> >> + unmapped, cc, order);
>> >> + if (ret == SCAN_SUCCEED) {
>> >> + collapsed += nr_ptes;
>> >> + continue;
>> >> + }
>> >> + }
>> >> +
>> >> +next_order:
>> >> + if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
>> >
>> >Hi Nico, thank you very much for your contributions to this series.
>> >
>> >I found a minor issue, for MADV_COLLAPSE, if collapse_huge_page() fails
>> >for some reason (e.g. allocate folio), it goes to next_order and
>> >continues splitting to the next small order. However, enabled_orders
>> >only supports HPAGE_PMD_ORDER, so it keeps runing the split operations
>> >without any effective work until KHUGEPAGED_MIN_MTHP_ORDER is reached
>> >before exiting. For khugepaged, e.g. setting only 2MB to always, also
>> >same phenomenon.
>>
>> Yes, but it does no actual work since it is checked after pop up.
>>
>> >
>> >This does not affect the overall functionality of mthp collapse, just
>> >redundant.
>> >
>> >The redundant operations can be easily skipped with the following
>> >modification. If I miss some thing, please let me know. Thanks!
>> >
>> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> >index 1a25af3d6d0f..fa407cce525c 100644
>> >--- a/mm/khugepaged.c
>> >+++ b/mm/khugepaged.c
>> >@@ -1574,7 +1574,7 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>> > }
>> >
>> > next_order:
>> >- if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
>> >+ if ((BIT(order) - 1) & enabled_orders) {
>> > const u8 next_order = order - 1;
>> > const u16 mid_offset = offset + (nr_ptes / 2);
>> >
>>
>> This would stop the iteration if there are other lower enabled order, right?
> ^^^^ ^^^^^^^^^^^^^^^^^^^
>
>NO :)
Got it. You are right.
The logic here is all lower bits are not set, skip the rest.
>
>For more details, please refer to the following information.
>
>| Scenario | Old Behavior (order > 2) | New Behavior ((BIT(order)-1) & enabled_orders) |
>|-------------------------------------|--------------------------|------------------------------------------------|
>| MADV_COLLAPSE | Splits 9,8,7,...,3 | No split |
>| khugepaged, only 2MB enabled | Splits 9,8,7,...,3 | No split |
>| khugepaged, only 2MB + 64KB enabled | Splits 9,8,7,...,3 | Splits 9,8,7,...,5 |
>| khugepaged, only 32KB enabled | Splits 9,8,7,...,3 | Splits 9,8,7,...,4 |
>| khugepaged, only 16KB enabled | Splits 9,8,7,...,3 | Splits 9,8,7,...,3 |
>| khugepaged, all mTHP enabled | Splits 9,8,7,...,3 | Splits 9,8,7,...,3 |
>
>--
>Cheers,
>Vernon
--
Wei Yang
Help you, Help me
^ permalink raw reply
* Re: [PATCH v4 1/2] tracing: Return ERR_PTR() from expr_str()
From: Steven Rostedt @ 2026-05-21 14:30 UTC (permalink / raw)
To: Pengpeng Hou
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-trace-kernel,
linux-kernel
In-Reply-To: <20260521022817.38453-1-pengpeng@iscas.ac.cn>
On Thu, 21 May 2026 10:28:16 +0800
Pengpeng Hou <pengpeng@iscas.ac.cn> wrote:
> diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
> index 0dbbf6cca9bc..0b33bb8ef6f7 100644
> --- a/kernel/trace/trace_events_hist.c
> +++ b/kernel/trace/trace_events_hist.c
> @@ -1769,18 +1769,18 @@ static void expr_field_str(struct hist_field *field, char *expr)
>
> static char *expr_str(struct hist_field *field, unsigned int level)
> {
> - char *expr;
> + char *expr __free(kfree) = NULL;
Can you split this into two patches.
1. Change expr to use __free(kfree)
2. Update to use ERR_PTR()
as they are two distinct changes.
Thanks,
-- Steve
>
> if (level > 1)
> - return NULL;
> + return ERR_PTR(-EINVAL);
>
> expr = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
> if (!expr)
> - return NULL;
> + return ERR_PTR(-ENOMEM);
>
> if (!field->operands[0]) {
> expr_field_str(field, expr);
> - return expr;
> + return_ptr(expr);
> }
^ permalink raw reply
* Re: [PATCH v6 05/43] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
From: Ackerley Tng @ 2026-05-21 14:29 UTC (permalink / raw)
To: Sean Christopherson, Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ag8JIlHjohAOC3-g@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Thu, May 21, 2026, Fuad Tabba wrote:
>> On Wed, 20 May 2026 at 22:44, Ackerley Tng <ackerleytng@google.com> wrote:
>> >
>> > Fuad Tabba <tabba@google.com> writes:
>> >
>> > >
>> > > [...snip...]
>> > >
>> > >> +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>> > >> +{
>> > >> + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>> > >> + struct inode *inode;
>> > >> +
>> > >> + /*
>> > >> + * If this gfn has no associated memslot, there's no chance of the gfn
>> > >> + * being backed by private memory, since guest_memfd must be used for
>> > >> + * private memory, and guest_memfd must be associated with some memslot.
>> > >> + */
>> > >> + if (!slot)
>> > >> + return 0;
>> > >> +
>> > >> + CLASS(gmem_get_file, file)(slot);
>> > >> + if (!file)
>> > >> + return 0;
>> > >> +
>> > >> + inode = file_inode(file);
>> > >> +
>> > >> + /*
>> > >> + * Rely on the maple tree's internal RCU lock to ensure a
>> > >> + * stable result. This result can become stale as soon as the
>> > >> + * lock is dropped, so the caller _must_ still protect
>> > >> + * consumption of private vs. shared by checking
>> > >> + * mmu_invalidate_retry_gfn() under mmu_lock to serialize
>> > >> + * against ongoing attribute updates.
>> > >> + */
>> > >> + return kvm_gmem_get_attributes(inode, kvm_gmem_get_index(slot, gfn));
>> > >> +}
>> > >
>> > > Doesn't this imply that all consumers of kvm_mem_is_private() should
>> > > validate the result using mmu_lock and the invalidation sequence?
>> >
>> > Let me know how I can improve the comment.
>>
>> Given Sean's context, the comment is good I think. I would quibble
>> with the the "_must_ still protect" phrasing being a bit too strict.
>>
>> Maybe just soften it slightly to acknowledge the exception? Something like:
>>
>> * lock is dropped, so callers that require a strict result _must_ protect
>> * consumption of private vs. shared by checking mmu_invalidate_retry_gfn()
>> * under mmu_lock to serialize against ongoing attribute updates. Callers
>> * doing lockless reads must be able to tolerate a stale result.
>>
>> That aligns the comment with how KVM is actually using it today. That
>> said, this is nitpicking. Feel free to use or ignore.
>
> Hmm, I wonder if we can figure out a way to consolidate some documentation,
> because this is _exactly_ the same pattern that x86's host_pfn_mapping_level()
> deals with (see its big comment below).
>
This would be great, are you thinking an actual comment or something in
Documentation/?
Perhaps we could iterate on this a little with me providing the newbie
perspective. Do you want me to take a stab at writing something up?
> There's also the stale comment in kvm_invalidate_memslot(), which, stating the
> obvious, speaks to the memslot+SRCU side of things.
>
> Maybe it makes sense to to find a central location for one giant comment about
> how how MMU notifier events and memslot+SRCU protections work? And then refer
> to that in paths where some asset needs to be tied into MMU notifiers and/or
> memslots+SRCU?
>
> [*] https://lore.kernel.org/all/agcbWe8s9lmPuJwG@google.com
>
> [...snip...]
>
^ permalink raw reply
* [PATCH v16 17/20] unwind_user/sframe: Separate reading of FRE from reading of FRE data words
From: Jens Remus @ 2026-05-21 14:25 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich
In-Reply-To: <20260521142546.3908498-1-jremus@linux.ibm.com>
__find_fre() performs linear search for a matching SFrame FRE for a
given IP. For that purpose it uses __read_fre(), which reads the whole
FRE. That is the variable-size FRE structure as well as the trailing
variable-length array of variable-size data words. For the search logic
to skip over the FRE it would be sufficient to read the variable-size
FRE structure only, which includes the count and size of data words.
Add fields to struct sframe_fre_internal to store the FRE data word's
address, count, and size. Change __read_fre() to read the variable-
size FRE structure only and populate those new fields. Change
__read_fre_datawords() to use those new fields. Change __find_fre()
to use __read_fre_datawords() to read the FRE data words only after a
matching FRE has been found. Introduce safe_read_fre_datawords() and
use it in sframe_validate_section() to validate that the FRE data words.
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
Notes (jremus):
Changes in v15:
- sframe_validate_section(): Fix format specifier for number of FREs
in debug message. (Sashiko AI)
Changes in v14:
- Adjust to rename of SFRAME_FDE_TYPE_* and
__read_default_fre_datawords().
- Update function name in debug message.
kernel/unwind/sframe.c | 99 +++++++++++++++++++++++++++---------------
1 file changed, 63 insertions(+), 36 deletions(-)
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index b623dca072da..7f439600b0f0 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -39,6 +39,9 @@ struct sframe_fre_internal {
u32 fp_ctl;
s32 fp_off;
u8 info;
+ unsigned long dw_addr;
+ unsigned char dw_count;
+ unsigned char dw_size;
};
DEFINE_STATIC_SRCU(sframe_srcu);
@@ -207,11 +210,11 @@ static __always_inline int __find_fde(struct sframe_section *sec,
static __always_inline int
__read_default_fre_datawords(struct sframe_section *sec,
struct sframe_fde_internal *fde,
- unsigned long cur,
- unsigned char dataword_count,
- unsigned char dataword_size,
struct sframe_fre_internal *fre)
{
+ unsigned char dataword_count = fre->dw_count;
+ unsigned char dataword_size = fre->dw_size;
+ unsigned long cur = fre->dw_addr;
s32 cfa_off, ra_off, fp_off;
unsigned int cfa_regnum;
@@ -253,11 +256,11 @@ __read_default_fre_datawords(struct sframe_section *sec,
static __always_inline int
__read_flex_fde_fre_datawords(struct sframe_section *sec,
struct sframe_fde_internal *fde,
- unsigned long cur,
- unsigned char dataword_count,
- unsigned char dataword_size,
struct sframe_fre_internal *fre)
{
+ unsigned char dataword_count = fre->dw_count;
+ unsigned char dataword_size = fre->dw_size;
+ unsigned long cur = fre->dw_addr;
u32 cfa_ctl, ra_ctl, fp_ctl;
s32 cfa_off, ra_off, fp_off;
@@ -325,24 +328,34 @@ __read_flex_fde_fre_datawords(struct sframe_section *sec,
static __always_inline int
__read_fre_datawords(struct sframe_section *sec,
struct sframe_fde_internal *fde,
- unsigned long cur,
- unsigned char dataword_count,
- unsigned char dataword_size,
struct sframe_fre_internal *fre)
{
unsigned char fde_type = SFRAME_V3_FDE_TYPE(fde->info2);
+ unsigned char dataword_count = fre->dw_count;
+
+ if (!dataword_count) {
+ /*
+ * A FRE without datawords indicates an outermost
+ * frame. Zero-initialize CFA, RA, and FP location
+ * info, except for the CFA control word, so that
+ * neither sframe_init_cfa_rule_data() nor
+ * sframe_init_rule_data() fail.
+ */
+ fre->cfa_ctl = (SFRAME_REG_SP << 3) | 1; /* regnum=SP, deref_p=0, reg_p=1 */
+ fre->cfa_off = 0;
+ fre->ra_ctl = 0;
+ fre->ra_off = 0;
+ fre->fp_ctl = 0;
+ fre->fp_off = 0;
+
+ return 0;
+ }
switch (fde_type) {
case SFRAME_FDE_TYPE_DEFAULT:
- return __read_default_fre_datawords(sec, fde, cur,
- dataword_count,
- dataword_size,
- fre);
+ return __read_default_fre_datawords(sec, fde, fre);
case SFRAME_FDE_TYPE_FLEX:
- return __read_flex_fde_fre_datawords(sec, fde, cur,
- dataword_count,
- dataword_size,
- fre);
+ return __read_flex_fde_fre_datawords(sec, fde, fre);
default:
return -EINVAL;
}
@@ -385,26 +398,11 @@ static __always_inline int __read_fre(struct sframe_section *sec,
fre->size = addr_size + 1 + (dataword_count * dataword_size);
fre->ip_off = ip_off;
fre->info = info;
+ fre->dw_addr = cur;
+ fre->dw_count = dataword_count;
+ fre->dw_size = dataword_size;
- if (!dataword_count) {
- /*
- * A FRE without datawords indicates an outermost
- * frame. Zero-initialize CFA, RA, and FP location
- * info, except for the CFA control word, so that
- * neither sframe_init_cfa_rule_data() nor
- * sframe_init_rule_data() fail.
- */
- fre->cfa_ctl = (SFRAME_REG_SP << 3) | 1; /* regnum=SP, deref_p=0, reg_p=1 */
- fre->cfa_off = 0;
- fre->ra_ctl = 0;
- fre->ra_off = 0;
- fre->fp_ctl = 0;
- fre->fp_off = 0;
-
- return 0;
- }
-
- return __read_fre_datawords(sec, fde, cur, dataword_count, dataword_size, fre);
+ return 0;
Efault:
return -EFAULT;
@@ -527,6 +525,10 @@ static __always_inline int __find_fre(struct sframe_section *sec,
return -EINVAL;
fre = prev_fre;
+ ret = __read_fre_datawords(sec, fde, fre);
+ if (ret)
+ return ret;
+
ret = sframe_init_cfa_rule_data(&frame->cfa, fre->cfa_ctl, fre->cfa_off);
if (ret)
return ret;
@@ -610,6 +612,20 @@ static int safe_read_fre(struct sframe_section *sec,
return ret;
}
+static int safe_read_fre_datawords(struct sframe_section *sec,
+ struct sframe_fde_internal *fde,
+ struct sframe_fre_internal *fre)
+{
+ int ret;
+
+ if (!user_read_access_begin((void __user *)sec->sframe_start,
+ sec->sframe_end - sec->sframe_start))
+ return -EFAULT;
+ ret = __read_fre_datawords(sec, fde, fre);
+ user_read_access_end();
+ return ret;
+}
+
static int sframe_validate_section(struct sframe_section *sec)
{
struct sframe_fde_internal fde;
@@ -648,6 +664,17 @@ static int sframe_validate_section(struct sframe_section *sec)
fde.rep_size);
return ret;
}
+ ret = safe_read_fre_datawords(sec, &fde, &fre);
+ if (ret) {
+ dbg_sec("FDE %u: safe_read_fre_datawords(%u) failed\n", i, j);
+ dbg_sec("FDE: func_addr:%#lx func_size:%#x fda_off:%#x fres_off:%#x fres_num:%u info:%u info2:%u rep_size:%u\n",
+ fde.func_addr, fde.func_size,
+ fde.fda_off,
+ fde.fres_off, fde.fres_num,
+ fde.info, fde.info2,
+ fde.rep_size);
+ return ret;
+ }
if (j && fre.ip_off <= prev_ip_off) {
dbg_sec("FDE %u: FRE %u not sorted\n", i, j);
--
2.51.0
^ permalink raw reply related
* [PATCH v16 13/20] unwind_user: Enable archs that pass RA in a register
From: Jens Remus @ 2026-05-21 14:25 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich
In-Reply-To: <20260521142546.3908498-1-jremus@linux.ibm.com>
Not all architectures/ABIs pass the return address (RA) on the stack on
function entry, like x86-64 does due to its CALL instruction pushing
the RA onto the stack. Architectures/ABIs, such as s390, also do not
require the RA to be saved on the stack in the function prologue. In
particular, the RA may never be saved to the stack at all, such as in
leaf functions. Unwinding must therefore not assume the presence of a
RA saved on stack for the topmost frame.
Treat a RA offset from CFA of zero as indication that the RA is not
saved (on the stack). For the topmost frame treat it as indication that
the RA is in the link/RA register, such as on arm64 and s390, and obtain
it from there. For non-topmost frames treat it as error, as the RA must
be saved.
Additionally allow the SP to be unchanged in the topmost frame, for
architectures where SP at function entry == SP at call site, such as
arm64 and s390.
Note that treating a RA offset from CFA of zero as indication that
the RA is not saved on the stack additionally allows for architectures,
such as s390, where the frame pointer (FP) may be saved without the RA
being saved as well. Provided that such architectures represent this
in SFrame by encoding the "missing" RA offset using a padding RA offset
with a value of zero.
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
Notes (jremus):
Changes in v15:
- Define pr_fmt().
- unwind_user_get_ra_reg(): Use pr_debug_once() instead of
WARN_ON_ONCE() to prevent user-triggered warning/panic. (Sashiko AI)
- Reworded commit message. (Indu)
include/linux/unwind_user.h | 10 ++++++++++
kernel/unwind/sframe.c | 6 ++----
kernel/unwind/user.c | 20 ++++++++++++++++----
3 files changed, 28 insertions(+), 8 deletions(-)
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index 64618618febd..7bf58f23aa64 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -23,6 +23,16 @@ static inline bool unwind_user_at_function_start(struct pt_regs *regs)
#define unwind_user_at_function_start unwind_user_at_function_start
#endif
+#ifndef unwind_user_get_ra_reg
+static inline int unwind_user_get_ra_reg(unsigned long *val)
+{
+ pr_debug_once("%s (%d): unwind_user_get_ra_reg() not implemented\n",
+ current->comm, current->pid);
+ return -EINVAL;
+}
+#define unwind_user_get_ra_reg unwind_user_get_ra_reg
+#endif
+
int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
#endif /* _LINUX_UNWIND_USER_H */
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index e6d66ae8e7ac..d573c2529926 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -257,10 +257,8 @@ static __always_inline int __read_fre(struct sframe_section *sec,
dataword_count--;
ra_off = sec->ra_off;
- if (!ra_off) {
- if (!dataword_count--)
- return -EINVAL;
-
+ if (!ra_off && dataword_count) {
+ dataword_count--;
UNSAFE_GET_USER_INC(ra_off, cur, dataword_size, Efault);
}
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index fdb1001e3750..afa7c6f6d9b4 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -2,6 +2,9 @@
/*
* Generic interfaces for unwinding user space
*/
+
+#define pr_fmt(fmt) "unwind_user: " fmt
+
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
@@ -48,8 +51,12 @@ static int unwind_user_next_common(struct unwind_user_state *state,
}
cfa += frame->cfa_off;
- /* Make sure that stack is not going in wrong direction */
- if (cfa <= state->sp)
+ /*
+ * Make sure that stack is not going in wrong direction. Allow SP
+ * to be unchanged for the topmost frame, by subtracting topmost,
+ * which is either 0 or 1.
+ */
+ if (cfa <= state->sp - state->topmost)
return -EINVAL;
/* Make sure that the address is word aligned */
@@ -57,8 +64,13 @@ static int unwind_user_next_common(struct unwind_user_state *state,
return -EINVAL;
/* Get the Return Address (RA) */
- if (get_user_word(&ra, cfa, frame->ra_off, state->ws))
- return -EINVAL;
+ if (frame->ra_off) {
+ if (get_user_word(&ra, cfa, frame->ra_off, state->ws))
+ return -EINVAL;
+ } else {
+ if (!state->topmost || unwind_user_get_ra_reg(&ra))
+ return -EINVAL;
+ }
/* Get the Frame Pointer (FP) */
if (frame->fp_off && get_user_word(&fp, cfa, frame->fp_off, state->ws))
--
2.51.0
^ permalink raw reply related
* [PATCH v16 08/20] unwind_user: Stop when reaching an outermost frame
From: Jens Remus @ 2026-05-21 14:25 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich
In-Reply-To: <20260521142546.3908498-1-jremus@linux.ibm.com>
Add an indication for an outermost frame to the unwind user frame
structure and stop unwinding when reaching an outermost frame.
This will be used by unwind user sframe, as SFrame may represent an
undefined return address as indication for an outermost frame.
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
arch/x86/include/asm/unwind_user.h | 6 ++++--
include/linux/unwind_user_types.h | 1 +
kernel/unwind/user.c | 6 ++++++
3 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
index 6e469044e4de..2dfb5ef11e36 100644
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -23,13 +23,15 @@ static inline int unwind_user_word_size(struct pt_regs *regs)
.cfa_off = 2*(ws), \
.ra_off = -1*(ws), \
.fp_off = -2*(ws), \
- .use_fp = true,
+ .use_fp = true, \
+ .outermost = false,
#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws) \
.cfa_off = 1*(ws), \
.ra_off = -1*(ws), \
.fp_off = 0, \
- .use_fp = false,
+ .use_fp = false, \
+ .outermost = false,
static inline bool unwind_user_at_function_start(struct pt_regs *regs)
{
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 43e4b160883f..616cc5ee4586 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -32,6 +32,7 @@ struct unwind_user_frame {
s32 ra_off;
s32 fp_off;
bool use_fp;
+ bool outermost;
};
struct unwind_user_state {
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 1fb272419733..fdb1001e3750 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -32,6 +32,12 @@ static int unwind_user_next_common(struct unwind_user_state *state,
{
unsigned long cfa, fp, ra;
+ /* Stop unwinding when reaching an outermost frame. */
+ if (frame->outermost) {
+ state->done = true;
+ return 0;
+ }
+
/* Get the Canonical Frame Address (CFA) */
if (frame->use_fp) {
if (state->fp < state->sp)
--
2.51.0
^ permalink raw reply related
* [PATCH v16 20/20] unwind_user/sframe: Add prctl() interface for registering .sframe sections
From: Jens Remus @ 2026-05-21 14:25 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich, Steven Rostedt (Google)
In-Reply-To: <20260521142546.3908498-1-jremus@linux.ibm.com>
From: Josh Poimboeuf <jpoimboe@kernel.org>
The kernel doesn't have direct visibility to the ELF contents of shared
libraries. Add some prctl() interfaces which allow glibc to tell the
kernel where to find .sframe sections.
[
This adds an interface for prctl() for testing loading of sframes for
libraries. But this interface should really be a system call. This patch
is for testing purposes only and should not be applied to mainline.
]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
Notes (jremus):
Changes in v15:
- Fix rebase error (missing break). (Sashiko AI)
Changes in v14:
- Bump PR_ADD_SFRAME and PR_REMOVE_SFRAME.
include/uapi/linux/prctl.h | 4 ++++
kernel/sys.c | 9 +++++++++
2 files changed, 13 insertions(+)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index b6ec6f693719..bd0bf828b033 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -416,4 +416,8 @@ struct prctl_mm_map {
# define PR_CFI_DISABLE _BITUL(1)
# define PR_CFI_LOCK _BITUL(2)
+/* SFRAME management */
+#define PR_ADD_SFRAME 82
+#define PR_REMOVE_SFRAME 83
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 62e842055cc9..b0a9b1e3ccd7 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -65,6 +65,7 @@
#include <linux/rcupdate.h>
#include <linux/uidgid.h>
#include <linux/cred.h>
+#include <linux/sframe.h>
#include <linux/nospec.h>
@@ -2907,6 +2908,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
if (arg3 & PR_CFI_LOCK && !(arg3 & PR_CFI_DISABLE))
error = arch_prctl_lock_branch_landing_pad_state(me);
break;
+ case PR_ADD_SFRAME:
+ error = sframe_add_section(arg2, arg3, arg4, arg5);
+ break;
+ case PR_REMOVE_SFRAME:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = sframe_remove_section(arg2);
+ break;
default:
trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
error = -EINVAL;
--
2.51.0
^ permalink raw reply related
* [PATCH v16 14/20] unwind_user: Flexible FP/RA recovery rules
From: Jens Remus @ 2026-05-21 14:25 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich
In-Reply-To: <20260521142546.3908498-1-jremus@linux.ibm.com>
To enable support for SFrame V3 flexible FDEs with a subsequent patch,
add support for the following flexible frame pointer (FP) and return
address (RA) recovery rules:
FP/RA = *(CFA + offset)
FP/RA = register + offset
FP/RA = *(register + offset)
Note that FP/RA recovery rules that use arbitrary register contents are
only valid when in the topmost frame, as their contents are otherwise
unknown.
This also enables unwinding of user space for architectures, such as
s390, that may save the frame pointer (FP) and/or return address (RA) in
other registers, for instance when in a leaf function.
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
Notes (jremus):
Changes in v15:
- Define dbg_once().
- unwind_user_get_reg(): Use pr_debug_once() instead of WARN_ON_ONCE()
to prevent user-triggered warning/panic. (Sashiko AI)
- unwind_user_next_common(): Handle UNWIND_USER_RULE_CFA_OFFSET for RA
and FP to use dbg_once() instead of WARN_ON_ONCE() to prevent user-
triggered warning/panic. (Sashiko AI)
Changes in v14:
- Improve comment on why UNWIND_USER_RULE_CFA_OFFSET is not implemented.
(Mark Rutland)
arch/x86/include/asm/unwind_user.h | 21 +++++++--
include/linux/unwind_user.h | 10 +++++
include/linux/unwind_user_types.h | 23 +++++++++-
kernel/unwind/sframe.c | 16 ++++++-
kernel/unwind/user.c | 70 +++++++++++++++++++++++++++---
5 files changed, 125 insertions(+), 15 deletions(-)
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
index 2dfb5ef11e36..9c3417be4283 100644
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -21,15 +21,26 @@ static inline int unwind_user_word_size(struct pt_regs *regs)
#define ARCH_INIT_USER_FP_FRAME(ws) \
.cfa_off = 2*(ws), \
- .ra_off = -1*(ws), \
- .fp_off = -2*(ws), \
+ .ra = { \
+ .rule = UNWIND_USER_RULE_CFA_OFFSET_DEREF,\
+ .offset = -1*(ws), \
+ }, \
+ .fp = { \
+ .rule = UNWIND_USER_RULE_CFA_OFFSET_DEREF,\
+ .offset = -2*(ws), \
+ }, \
.use_fp = true, \
.outermost = false,
#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws) \
.cfa_off = 1*(ws), \
- .ra_off = -1*(ws), \
- .fp_off = 0, \
+ .ra = { \
+ .rule = UNWIND_USER_RULE_CFA_OFFSET_DEREF,\
+ .offset = -1*(ws), \
+ }, \
+ .fp = { \
+ .rule = UNWIND_USER_RULE_RETAIN,\
+ }, \
.use_fp = false, \
.outermost = false,
@@ -41,4 +52,6 @@ static inline bool unwind_user_at_function_start(struct pt_regs *regs)
#endif /* CONFIG_HAVE_UNWIND_USER_FP */
+#include <asm-generic/unwind_user.h>
+
#endif /* _ASM_X86_UNWIND_USER_H */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index 7bf58f23aa64..6aca38f89ddd 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -33,6 +33,16 @@ static inline int unwind_user_get_ra_reg(unsigned long *val)
#define unwind_user_get_ra_reg unwind_user_get_ra_reg
#endif
+#ifndef unwind_user_get_reg
+static inline int unwind_user_get_reg(unsigned long *val, unsigned int regnum)
+{
+ pr_debug_once("%s (%d): unwind_user_get_reg(%u) not implemented\n",
+ current->comm, current->pid, regnum);
+ return -EINVAL;
+}
+#define unwind_user_get_reg unwind_user_get_reg
+#endif
+
int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
#endif /* _LINUX_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 616cc5ee4586..0d02714a1b5d 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -27,10 +27,29 @@ struct unwind_stacktrace {
unsigned long *entries;
};
+#define UNWIND_USER_RULE_DEREF BIT(31)
+
+enum unwind_user_rule {
+ UNWIND_USER_RULE_RETAIN, /* entity = entity */
+ UNWIND_USER_RULE_CFA_OFFSET, /* entity = CFA + offset */
+ UNWIND_USER_RULE_REG_OFFSET, /* entity = register + offset */
+ /* DEREF variants */
+ UNWIND_USER_RULE_CFA_OFFSET_DEREF = /* entity = *(CFA + offset) */
+ UNWIND_USER_RULE_CFA_OFFSET | UNWIND_USER_RULE_DEREF,
+ UNWIND_USER_RULE_REG_OFFSET_DEREF = /* entity = *(register + offset) */
+ UNWIND_USER_RULE_REG_OFFSET | UNWIND_USER_RULE_DEREF,
+};
+
+struct unwind_user_rule_data {
+ enum unwind_user_rule rule;
+ s32 offset;
+ unsigned int regnum;
+};
+
struct unwind_user_frame {
s32 cfa_off;
- s32 ra_off;
- s32 fp_off;
+ struct unwind_user_rule_data ra;
+ struct unwind_user_rule_data fp;
bool use_fp;
bool outermost;
};
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index d573c2529926..29a874a67f32 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -285,6 +285,18 @@ static __always_inline int __read_fre(struct sframe_section *sec,
return -EFAULT;
}
+static __always_inline void
+sframe_init_rule_data(struct unwind_user_rule_data *rule_data,
+ s32 offset)
+{
+ if (offset) {
+ rule_data->rule = UNWIND_USER_RULE_CFA_OFFSET_DEREF;
+ rule_data->offset = offset;
+ } else {
+ rule_data->rule = UNWIND_USER_RULE_RETAIN;
+ }
+}
+
static __always_inline int __find_fre(struct sframe_section *sec,
struct sframe_fde_internal *fde,
unsigned long ip,
@@ -335,8 +347,8 @@ static __always_inline int __find_fre(struct sframe_section *sec,
fre = prev_fre;
frame->cfa_off = fre->cfa_off;
- frame->ra_off = fre->ra_off;
- frame->fp_off = fre->fp_off;
+ sframe_init_rule_data(&frame->ra, fre->ra_off);
+ sframe_init_rule_data(&frame->fp, fre->fp_off);
frame->use_fp = SFRAME_V3_FRE_CFA_BASE_REG_ID(fre->info) == SFRAME_BASE_REG_FP;
frame->outermost = SFRAME_V3_FRE_RA_UNDEFINED_P(fre->info);
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index afa7c6f6d9b4..c6a2abac78e0 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -12,6 +12,17 @@
#include <linux/uaccess.h>
#include <linux/sframe.h>
+#ifdef CONFIG_DYNAMIC_DEBUG
+
+#define dbg_once(fmt, ...) \
+ pr_debug_once("%s (%d): " fmt, current->comm, current->pid, ##__VA_ARGS__)
+
+#else /* !CONFIG_DYNAMIC_DEBUG */
+
+#define dbg_once(args...) no_printk(args)
+
+#endif /* !CONFIG_DYNAMIC_DEBUG */
+
#define for_each_user_frame(state) \
for (unwind_user_start(state); !(state)->done; unwind_user_next(state))
@@ -64,22 +75,67 @@ static int unwind_user_next_common(struct unwind_user_state *state,
return -EINVAL;
/* Get the Return Address (RA) */
- if (frame->ra_off) {
- if (get_user_word(&ra, cfa, frame->ra_off, state->ws))
- return -EINVAL;
- } else {
+ switch (frame->ra.rule) {
+ case UNWIND_USER_RULE_RETAIN:
if (!state->topmost || unwind_user_get_ra_reg(&ra))
return -EINVAL;
+ break;
+ case UNWIND_USER_RULE_CFA_OFFSET:
+ /*
+ * RA = CFA + offset does not make sense.
+ * A return address cannot legitimately be a stack address.
+ */
+ dbg_once("UNWIND_USER_RULE_CFA_OFFSET invalid for RA\n");
+ return -EINVAL;
+ case UNWIND_USER_RULE_CFA_OFFSET_DEREF:
+ ra = cfa + frame->ra.offset;
+ break;
+ case UNWIND_USER_RULE_REG_OFFSET:
+ case UNWIND_USER_RULE_REG_OFFSET_DEREF:
+ if (!state->topmost || unwind_user_get_reg(&ra, frame->ra.regnum))
+ return -EINVAL;
+ ra += frame->ra.offset;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return -EINVAL;
}
+ if (frame->ra.rule & UNWIND_USER_RULE_DEREF &&
+ get_user_word(&ra, ra, 0, state->ws))
+ return -EINVAL;
/* Get the Frame Pointer (FP) */
- if (frame->fp_off && get_user_word(&fp, cfa, frame->fp_off, state->ws))
+ switch (frame->fp.rule) {
+ case UNWIND_USER_RULE_RETAIN:
+ fp = state->fp;
+ break;
+ case UNWIND_USER_RULE_CFA_OFFSET:
+ /*
+ * FP = CFA + offset is currently not used for FP
+ * (e.g. SFrame cannot represent this rule).
+ */
+ dbg_once("UNWIND_USER_RULE_CFA_OFFSET unsupported for FP\n");
+ return -EINVAL;
+ case UNWIND_USER_RULE_CFA_OFFSET_DEREF:
+ fp = cfa + frame->fp.offset;
+ break;
+ case UNWIND_USER_RULE_REG_OFFSET:
+ case UNWIND_USER_RULE_REG_OFFSET_DEREF:
+ if (!state->topmost || unwind_user_get_reg(&fp, frame->fp.regnum))
+ return -EINVAL;
+ fp += frame->fp.offset;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+ }
+ if (frame->fp.rule & UNWIND_USER_RULE_DEREF &&
+ get_user_word(&fp, fp, 0, state->ws))
return -EINVAL;
state->ip = ra;
state->sp = cfa;
- if (frame->fp_off)
- state->fp = fp;
+ state->fp = fp;
state->topmost = false;
return 0;
}
--
2.51.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox