* Re: [RFC PATCH v2 02/10] rv/da: fix per-task da_monitor_destroy() ordering and sync
From: Gabriele Monaco @ 2026-05-12 8:27 UTC (permalink / raw)
To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <d02c656aada7d071f083460a5c9a454363669b61.1778522945.git.wen.yang@linux.dev>
On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> The following two paths race:
>
> CPU 0 (disable_stall/__rv_disable_monitor) CPU 1 (wwnr probe handler)
^ did you mean stall?
> ------------------------------------------ -----------------------------
> disable_stall()
> da_monitor_destroy()
> da_monitor_reset_all() <------ [task T: monitoring=0]
> da_monitor_start(&T->rv[n])
> /* no timer_setup */
> monitoring=1 <----
> tracepoint_synchronize_unregister()
> // CPU 1 probe has already returned; sync returns
>
> Later, enable_stall() acquires the same slot and calls da_monitor_init():
>
> da_monitor_reset_all()
> da_monitor_reset(&T->rv[slot]) // monitoring=1, timer.function==0
> ha_monitor_reset_env()
> ha_cancel_timer()
> timer_delete(&ha_mon->timer) // ODEBUG: timer never initialised
>
> ODEBUG: assert_init not available (active state 0)
> object type: timer_list
> Call trace: timer_delete <- da_monitor_reset_all <- enable_stall
>
> Call tracepoint_synchronize_unregister() inside da_monitor_destroy()
> before da_monitor_reset_all(). The unregister_trace_xxx() calls in the
> monitor's disable() have already disconnected the tracepoints; the sync
> here drains any handler still in flight, so no new monitoring=1 can
> appear after da_monitor_reset_all() clears the slot.
>
> Also fix the slot release ordering: release the slot only after
> reset_all() to avoid accessing rv[] with an out-of-bounds index.
>
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
Thanks for the fix, I have a similar one waiting for submission.
These are technically 2 separate fixes though: the ordering with unset
task_mon_slot (independent on HA) and the synchronisation with pending
tracepoints. They probably deserve separate patches and visibility, the first
has always been around and we're technically overwriting who knows what.
The explanation above is a bit hard to follow though, are you talking about a
handler for the same (stall) monitor running after the reset, effectively
undoing it by setting the monitoring flag?
Then this is indeed an issue with ha_monitor_reset_env() which expects a clean
environment.
So that's basically what you'd see now much more often because in fact we don't
reset the right slot (though, again, that's a different issue).
Calling tracepoint_synchronize_unregister() there too would surely fix, but it
used to be kinda slow. But it's probably gotten faster since now tracepoints use
SRCU, so we can wait for a dedicated grace period.
I liked the idea to wait cumulatively in the end, but that's just making things
harder.. Let's do like this:
Prepare 2 separate patches as fixes, put the task slot one first (would ease
backporting), mention this issue with the race condition only in the second.
You can send them independently and I'll add them to the tree as urgent.
I'm soon going to send my set of fixes that will also include the task slot
patch (not removing to ease my life with conflicts).
Thanks,
Gabriele
> include/rv/da_monitor.h | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 00ded3d5ab3f..d04bb3229c75 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -304,6 +304,20 @@ static int da_monitor_init(void)
>
> /*
> * da_monitor_destroy - return the allocated slot
> + *
> + * Call tracepoint_synchronize_unregister() before reset_all() to close
> + * the race where an in-flight non-HA probe handler sets monitoring=1
> + * (without calling timer_setup()) after da_monitor_reset_all() has
> + * already cleared the slot but before the caller's own sync completes.
> + * Without this barrier, an HA_TIMER_WHEEL monitor that later acquires
> + * the same slot would call timer_delete() on a never-initialised
> + * timer_list, triggering ODEBUG warnings.
> + *
> + * Note: tracepoint_synchronize_unregister() is a system-wide barrier
> + * that waits for all CPUs to finish any in-flight tracepoint handlers.
> + * The caller's own __rv_disable_monitor() issues a second sync after
> + * returning from disable(); that redundant call is harmless on the
> + * infrequent admin (enable/disable) path.
> */
> static inline void da_monitor_destroy(void)
> {
> @@ -311,10 +325,10 @@ static inline void da_monitor_destroy(void)
> WARN_ONCE(1, "Disabling a disabled monitor: "
> __stringify(MONITOR_NAME));
> return;
> }
> + tracepoint_synchronize_unregister();
> + da_monitor_reset_all();
> rv_put_task_monitor_slot(task_mon_slot);
> task_mon_slot = RV_PER_TASK_MONITOR_INIT;
> -
> - da_monitor_reset_all();
> }
>
> #elif RV_MON_TYPE == RV_MON_PER_OBJ
^ permalink raw reply
* Re: [PATCH v6 3/4] mm/memory-failure: add panic option for unrecoverable pages
From: David Hildenbrand (Arm) @ 2026-05-12 8:22 UTC (permalink / raw)
To: Breno Leitao, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
Jonathan Corbet, Shuah Khan, Lorenzo Stoakes, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team
In-Reply-To: <20260511-ecc_panic-v6-3-183012ba7d4b@debian.org>
> @@ -1281,6 +1292,18 @@ static void update_per_node_mf_stats(unsigned long pfn,
> ++mf_stats->total;
> }
>
> +static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
> + enum mf_result result)
> +{
> + if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
> + return false;
> +
> + if (type == MF_MSG_KERNEL)
> + return true;
> +
> + return false;
return type == MF_MSG_KERNEL;
might be simpler.
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v6 2/4] mm/memory-failure: classify get_any_page() failures by reason
From: David Hildenbrand (Arm) @ 2026-05-12 8:21 UTC (permalink / raw)
To: Breno Leitao, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
Jonathan Corbet, Shuah Khan, Lorenzo Stoakes, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260511-ecc_panic-v6-2-183012ba7d4b@debian.org>
> }
> goto unlock_mutex;
> } else if (res < 0) {
> - if (is_reserved)
> + /*
> + * Promote a stable unhandlable kernel page diagnosed by
> + * get_hwpoison_page() to MF_MSG_KERNEL alongside reserved
> + * pages; transient lifecycle races stay as MF_MSG_GET_HWPOISON.
> + */
> + if (is_reserved || gp_status == MF_GET_PAGE_UNHANDLABLE)
> res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
It's all a bit of a mess. get_hwpoison_page() should just indicate that a page
is unhandable if it is PG_reserved?
Why can't we just return a special error code from get_hwpoison_page()? We ahve
plenty of errno values to chose from.
--
Cheers,
David
^ permalink raw reply
* Re: [bug report] bootconfig: init: Allow admin to use bootconfig for kernel command line
From: Dan Carpenter @ 2026-05-12 8:21 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: kernel-janitors, Linux Trace Kernel, linux-kernel, Breno Leitao
In-Reply-To: <20260512091638.8b95253ab022d7dabf473465@kernel.org>
On Tue, May 12, 2026 at 09:16:38AM +0900, Masami Hiramatsu wrote:
> Hi Dan,
>
> Thanks for reporting. A similar problem is pointed by Sashiko [1].
>
> [1] https://sashiko.dev/#/patchset/20260508-bootconfig_using_tools-v1-0-1132219aa773%40debian.org
>
> On Fri, 8 May 2026 20:07:25 +0300
> Dan Carpenter <error27@gmail.com> wrote:
>
> > Hello Masami Hiramatsu,
> >
> > Commit 51887d03aca1 ("bootconfig: init: Allow admin to use bootconfig
> > for kernel command line") from Jan 11, 2020 (linux-next), leads to
> > the following Smatch static checker warning:
> >
> > init/main.c:368 xbc_snprint_cmdline()
> > use scnprintf() instead of snprintf()
> >
> > init/main.c
> > 331 static int __init xbc_snprint_cmdline(char *buf, size_t size,
> > 332 struct xbc_node *root)
> > 333 {
> > 334 struct xbc_node *knode, *vnode;
> > 335 char *end = buf + size;
> > 336 const char *val, *q;
> > 337 int ret;
> > 338
> > 339 xbc_node_for_each_key_value(root, knode, val) {
> > 340 ret = xbc_node_compose_key_after(root, knode,
> > 341 xbc_namebuf, XBC_KEYLEN_MAX);
> > 342 if (ret < 0)
> > 343 return ret;
> > 344
> > 345 vnode = xbc_node_get_child(knode);
> > 346 if (!vnode) {
> > 347 ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
> > 348 if (ret < 0)
> > 349 return ret;
> > 350 buf += ret;
> >
> > In user space snprintf() can return negative, but in the kernel, no.
> > It returns the number of bytes (not counting the NUL terminator) which
> > would have been copied if there were enough space. So maybe you want
> > to do something like:
> >
> > remain = rest(buf, end);
> > ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
> > if (ret >= remain)
> > return -ENOSPC;
>
> Actually, we need to query the length of required buffer size if buf == NULL
> or the buffer size is not enough.
>
> But as Sashiko pointed, I need to check it with UBSAN. (but I think,
> even if @buf is NULL, the @buf is char *, thus it is safe to add some
> value...)
>
Sashiko says that pointer math on a NULL is undefined but we do it all
the time in the kernel... When you are a the 800 pound gorilla, you can
ask compilers to implement features the way you want them to be. :P
regards,
dan carpenter
^ permalink raw reply
* Re: [PATCH v6 1/4] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: David Hildenbrand (Arm) @ 2026-05-12 8:17 UTC (permalink / raw)
To: Breno Leitao, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
Jonathan Corbet, Shuah Khan, Lorenzo Stoakes, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260511-ecc_panic-v6-1-183012ba7d4b@debian.org>
On 5/11/26 17:38, Breno Leitao wrote:
> When get_hwpoison_page() returns a negative value, distinguish
> reserved pages from other failure cases by reporting MF_MSG_KERNEL
> instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
> and should be classified accordingly for proper handling.
>
> Sample PG_reserved before the get_hwpoison_page() call. In the
> MF_COUNT_INCREASED path get_any_page() can drop the caller's
> reference before returning -EIO, after which the underlying page may
> have been freed and reallocated with page->flags reset; reading
> PageReserved(p) at that point would observe stale or unrelated state.
> The pre-call snapshot reflects what the page actually was at the
> time of the failure event.
>
> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> mm/memory-failure.c | 19 ++++++++++++++++++-
> 1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 866c4428ac7ef..f112fb27a8ff6 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2348,6 +2348,7 @@ int memory_failure(unsigned long pfn, int flags)
> unsigned long page_flags;
> bool retry = true;
> int hugetlb = 0;
> + bool is_reserved;
>
> if (!sysctl_memory_failure_recovery)
> panic("Memory failure on page %lx", pfn);
> @@ -2411,6 +2412,18 @@ int memory_failure(unsigned long pfn, int flags)
> * In fact it's dangerous to directly bump up page count from 0,
> * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
> */
> + /*
> + * Pages with PG_reserved set are not currently managed by the
> + * page allocator (memblock-reserved memory, driver reservations,
> + * etc.), so classify them as kernel-owned for reporting.
> + *
> + * Sample the flag before get_hwpoison_page(): in the
> + * MF_COUNT_INCREASED path, get_any_page() can drop the caller's
> + * reference before returning -EIO, after which page->flags may
> + * have been reset by the allocator.
> + */
> + is_reserved = PageReserved(p);
> +
> res = get_hwpoison_page(p, flags);
> if (!res) {
> if (is_free_buddy_page(p)) {
> @@ -2432,7 +2445,11 @@ int memory_failure(unsigned long pfn, int flags)
> }
> goto unlock_mutex;
> } else if (res < 0) {
> - res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
> + if (is_reserved)
> + res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> + else
> + res = action_result(pfn, MF_MSG_GET_HWPOISON,
> + MF_IGNORED);
> goto unlock_mutex;
> }
>
>
It's a bit odd that we need this handling when we already have handling for
reserved pages in error_states[].
HWPoisonHandlable() would always essentially reject PG_reserved pages. So
__get_hwpoison_page() ... would always fail? Making
get_hwpoison_page()->get_any_page() always fail?
But then, we never call identify_page_state()? And never call me_kernel()?
This all looks very odd.
Why would you even want to call get_hwpoison_page() in the first place if you
find PageReserved?
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH mm-unstable v17 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped
From: David Hildenbrand (Arm) @ 2026-05-12 7:42 UTC (permalink / raw)
To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260511185817.686831-6-npache@redhat.com>
On 5/11/26 20:58, Nico Pache wrote:
> Currently the collapse_huge_page function requires the mmap_read_lock to
> enter with it held, and exit with it dropped. This function moves the
> unlock into its parent caller, and changes this semantic to requiring it
> to enter/exit with it always unlocked.
>
> In future patches, we need this expectation, as for in mTHP collapse, we
> may have already have dropped the lock, and do not want to conditionally
> check for this by passing through the lock_dropped variable.
>
> No functional change is expected as one of the first things the
> collapse_huge_page function does is drop this lock before allocating the
> hugepage.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 18 ++++++++++--------
> 1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 27465161fa6d..37a5f6791816 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1199,6 +1199,14 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> return SCAN_SUCCEED;
> }
>
> +/*
> + * collapse_huge_page expects the mmap_read_lock to be dropped before
> + * entering this function.
"before entering." Talking about "this function" after naming it sounds odd.
Also, there is only an "mmap_lock".
> The function will also always return with the lock
> + * dropped.
"collapse_huge_page expects the mmap_lock to be unlocked before entering and
will always return with the lock unlocked."
Or something simple like that?
> The function starts by allocation a folio, which can potentially
> + * take a long time if it involves sync compaction, and we do not need to hold
> + * the mmap_lock during that. We must recheck the vma after taking it again in
> + * write mode.
> + */
"... to avoid holding the mmap_lock while allocating a THP, as that could
trigger direct reclaim/compaction. Note that the VMA must be rechecked after
grabbing the mmap_lock again."
?
Ending up with something like
"collapse_huge_page expects the mmap_lock to be unlocked before entering and
will always return with the lock unlocked, to avoid holding the mmap_lock while
allocating a THP, as that could trigger direct reclaim/compaction. Note that
the VMA must be rechecked after grabbing the mmap_lock again."
> static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> int referenced, int unmapped, struct collapse_control *cc)
> {
> @@ -1214,14 +1222,6 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> - /*
> - * Before allocating the hugepage, release the mmap_lock read lock.
> - * The allocation can take potentially a long time if it involves
> - * sync compaction, and we do not need to hold the mmap_lock during
> - * that. We will recheck the vma after taking it again in write mode.
> - */
> - mmap_read_unlock(mm);
> -
> result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
> @@ -1526,6 +1526,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> out_unmap:
> pte_unmap_unlock(pte, ptl);
> if (result == SCAN_SUCCEED) {
> + /* collapse_huge_page expects the lock to be dropped before calling */
> + mmap_read_unlock(mm);
> result = collapse_huge_page(mm, start_addr, referenced,
> unmapped, cc);
> /* collapse_huge_page will return with the mmap_lock released */
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Lance Yang @ 2026-05-12 7:42 UTC (permalink / raw)
To: npache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260511185817.686831-5-npache@redhat.com>
On Mon, May 11, 2026 at 12:58:04PM -0600, Nico Pache wrote:
>generalize the order of the __collapse_huge_page_* and collapse_max_*
>functions to support future mTHP collapse.
>
>The current mechanism for determining collapse with the
>khugepaged_max_ptes_none value is not designed with mTHP in mind. This
>raises a key design issue: if we support user defined max_pte_none values
>(even those scaled by order), a collapse of a lower order can introduces
>an feedback loop, or "creep", when max_ptes_none is set to a value greater
>than HPAGE_PMD_NR / 2. [1]
>
>With this configuration, a successful collapse to order N will populate
>enough pages to satisfy the collapse condition on order N+1 on the next
>scan. This leads to unnecessary work and memory churn.
>
>To fix this issue introduce a helper function that will limit mTHP
>collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
>This effectively supports two modes: [2]
>
>- max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
> that maps the shared zeropage. Consequently, no memory bloat.
>- max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
> available mTHP order.
>
>This removes the possiblilty of "creep", while not modifying any uAPI
>expectations. A warning will be emitted if any non-supported
>max_ptes_none value is configured with mTHP enabled.
>
>mTHP collapse will not honor the khugepaged_max_ptes_shared or
>khugepaged_max_ptes_swap parameters, and will fail if it encounters a
>shared or swapped entry.
>
>No functional changes in this patch; however it defines future behavior
>for mTHP collapse.
>
>[1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
>[2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
>
>Co-developed-by: Dev Jain <dev.jain@arm.com>
>Signed-off-by: Dev Jain <dev.jain@arm.com>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> include/trace/events/huge_memory.h | 3 +-
> mm/khugepaged.c | 117 ++++++++++++++++++++---------
> 2 files changed, 85 insertions(+), 35 deletions(-)
>
>diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
>index bcdc57eea270..443e0bd13fdb 100644
>--- a/include/trace/events/huge_memory.h
>+++ b/include/trace/events/huge_memory.h
>@@ -39,7 +39,8 @@
> EM( SCAN_STORE_FAILED, "store_failed") \
> EM( SCAN_COPY_MC, "copy_poisoned_page") \
> EM( SCAN_PAGE_FILLED, "page_filled") \
>- EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
>+ EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback") \
>+ EMe(SCAN_INVALID_PTES_NONE, "invalid_ptes_none")
>
> #undef EM
> #undef EMe
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index f68853b3caa7..27465161fa6d 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -61,6 +61,7 @@ enum scan_result {
> SCAN_COPY_MC,
> SCAN_PAGE_FILLED,
> SCAN_PAGE_DIRTY_OR_WRITEBACK,
>+ SCAN_INVALID_PTES_NONE,
> };
>
> #define CREATE_TRACE_POINTS
>@@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
> * PTEs for the given collapse operation.
> * @cc: The collapse control struct
> * @vma: The vma to check for userfaultfd
>+ * @order: The folio order being collapsed to
> *
> * Return: Maximum number of none-page or zero-page PTEs allowed for the
> * collapse operation.
> */
>-static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
>- struct vm_area_struct *vma)
>+static int collapse_max_ptes_none(struct collapse_control *cc,
>+ struct vm_area_struct *vma, unsigned int order)
> {
>+ unsigned int max_ptes_none = khugepaged_max_ptes_none;
> // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
One thing I still want to call out: kernel code usually uses C-style
comments :)
> if (vma && userfaultfd_armed(vma))
> return 0;
> // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> if (!cc->is_khugepaged)
> return HPAGE_PMD_NR;
>- // For all other cases repect the user defined maximum.
>- return khugepaged_max_ptes_none;
>+ // for PMD collapse, respect the user defined maximum.
>+ if (is_pmd_order(order))
>+ return max_ptes_none;
>+ /* Zero/non-present collapse disabled. */
>+ if (!max_ptes_none)
>+ return 0;
>+ // for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
>+ // scale the maximum number of PTEs to the order of the collapse.
>+ if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
>+ return (1 << order) - 1;
>+
>+ // We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
>+ // Emit a warning and return -EINVAL.
>+ pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
>+ KHUGEPAGED_MAX_PTES_LIMIT);
Maybe fallback to 0 instead, as David suggested earlier?
max_ptes_none is mostly legacy PMD THP behavior. mTHP is new, and any
intermediate value in (0, KHUGEPAGED_MAX_PTES_LIMIT) would implicitly
disable it :(
Treating those values as 0 feels like the least surprising behavior,
IMHO. It also gives mTHP a cleaner staring point, rather than carry over
all the old PMD knob semantics :)
Otherwise, LGTM!
Reviewed-by: Lance Yang <lance.yang@linux.dev>
>+ return -EINVAL;
> }
>
> /**
> * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
> * anonymous pages for the given collapse operation.
> * @cc: The collapse control struct
>+ * @order: The folio order being collapsed to
> *
> * Return: Maximum number of PTEs that map shared anonymous pages for the
> * collapse operation
> */
>-static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>+static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
>+ unsigned int order)
> {
> // for MADV_COLLAPSE, do not restrict the number of PTEs that map shared
> // anonymous pages.
> if (!cc->is_khugepaged)
> return HPAGE_PMD_NR;
>+ // for mTHP collapse do not allow collapsing anonymous memory pages that
>+ // are shared between processes.
>+ if (!is_pmd_order(order))
>+ return 0;
>+ // for PMD collapse, respect the user defined maximum.
> return khugepaged_max_ptes_shared;
> }
>
>@@ -391,16 +415,22 @@ static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
> * maximum allowed non-present pagecache entries for the given collapse operation.
> * @cc: The collapse control struct
>+ * @order: The folio order being collapsed to
> *
> * Return: Maximum number of non-present PTEs or the maximum allowed non-present
> * pagecache entries for the collapse operation.
> */
>-static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
>+static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
>+ unsigned int order)
> {
> // for MADV_COLLAPSE, do not restrict the number PTEs entries or
> // pagecache entries that are non-present.
> if (!cc->is_khugepaged)
> return HPAGE_PMD_NR;
>+ // for mTHP collapse do not allow any non-present PTEs or pagecache entries.
>+ if (!is_pmd_order(order))
>+ return 0;
>+ // for PMD collapse, respect the user defined maximum.
> return khugepaged_max_ptes_swap;
> }
>
>@@ -594,18 +624,22 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>
> static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
>- struct list_head *compound_pagelist)
>+ unsigned int order, struct list_head *compound_pagelist)
> {
>+ const unsigned long nr_pages = 1UL << order;
> struct page *page = NULL;
> struct folio *folio = NULL;
> unsigned long addr = start_addr;
> pte_t *_pte;
> int none_or_zero = 0, shared = 0, referenced = 0;
> enum scan_result result = SCAN_FAIL;
>- unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>- unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>+ int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
>+ unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
>+
>+ if (max_ptes_none < 0)
>+ return SCAN_INVALID_PTES_NONE;
>
>- for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>+ for (_pte = pte; _pte < pte + nr_pages;
> _pte++, addr += PAGE_SIZE) {
> pte_t pteval = ptep_get(_pte);
> if (pte_none_or_zero(pteval)) {
>@@ -738,18 +772,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> }
>
> static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>- struct vm_area_struct *vma,
>- unsigned long address,
>- spinlock_t *ptl,
>- struct list_head *compound_pagelist)
>+ struct vm_area_struct *vma, unsigned long address,
>+ spinlock_t *ptl, unsigned int order,
>+ struct list_head *compound_pagelist)
> {
>- unsigned long end = address + HPAGE_PMD_SIZE;
>+ const unsigned long nr_pages = 1UL << order;
>+ unsigned long end = address + (PAGE_SIZE << order);
> struct folio *src, *tmp;
> pte_t pteval;
> pte_t *_pte;
> unsigned int nr_ptes;
>
>- for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
>+ for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
> address += nr_ptes * PAGE_SIZE) {
> nr_ptes = 1;
> pteval = ptep_get(_pte);
>@@ -802,11 +836,10 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> }
>
> static void __collapse_huge_page_copy_failed(pte_t *pte,
>- pmd_t *pmd,
>- pmd_t orig_pmd,
>- struct vm_area_struct *vma,
>- struct list_head *compound_pagelist)
>+ pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>+ unsigned int order, struct list_head *compound_pagelist)
> {
>+ const unsigned long nr_pages = 1UL << order;
> spinlock_t *pmd_ptl;
>
> /*
>@@ -822,7 +855,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> * Release both raw and compound pages isolated
> * in __collapse_huge_page_isolate.
> */
>- release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
>+ release_pte_pages(pte, pte + nr_pages, compound_pagelist);
> }
>
> /*
>@@ -842,16 +875,17 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> */
> static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>- unsigned long address, spinlock_t *ptl,
>+ unsigned long address, spinlock_t *ptl, unsigned int order,
> struct list_head *compound_pagelist)
> {
>+ const unsigned long nr_pages = 1UL << order;
> unsigned int i;
> enum scan_result result = SCAN_SUCCEED;
>
> /*
> * Copying pages' contents is subject to memory poison at any iteration.
> */
>- for (i = 0; i < HPAGE_PMD_NR; i++) {
>+ for (i = 0; i < nr_pages; i++) {
> pte_t pteval = ptep_get(pte + i);
> struct page *page = folio_page(folio, i);
> unsigned long src_addr = address + i * PAGE_SIZE;
>@@ -870,10 +904,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
>
> if (likely(result == SCAN_SUCCEED))
> __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
>- compound_pagelist);
>+ order, compound_pagelist);
> else
> __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
>- compound_pagelist);
>+ order, compound_pagelist);
>
> return result;
> }
>@@ -1044,12 +1078,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
> * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
> */
> static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
>- struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
>- int referenced)
>+ struct vm_area_struct *vma, unsigned long start_addr,
>+ pmd_t *pmd, int referenced, unsigned int order)
> {
> int swapped_in = 0;
> vm_fault_t ret = 0;
>- unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
>+ unsigned long addr, end = start_addr + (PAGE_SIZE << order);
> enum scan_result result;
> pte_t *pte = NULL;
> spinlock_t *ptl;
>@@ -1081,6 +1115,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> pte_present(vmf.orig_pte))
> continue;
>
>+ /*
>+ * TODO: Support swapin without leading to further mTHP
>+ * collapses. Currently bringing in new pages via swapin may
>+ * cause a future higher order collapse on a rescan of the same
>+ * range.
>+ */
>+ if (!is_pmd_order(order)) {
>+ pte_unmap(pte);
>+ mmap_read_unlock(mm);
>+ result = SCAN_EXCEED_SWAP_PTE;
>+ goto out;
>+ }
>+
> vmf.pte = pte;
> vmf.ptl = ptl;
> ret = do_swap_page(&vmf);
>@@ -1200,7 +1247,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * that case. Continuing to collapse causes inconsistency.
> */
> result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>- referenced);
>+ referenced, HPAGE_PMD_ORDER);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
> }
>@@ -1248,6 +1295,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> if (pte) {
> result = __collapse_huge_page_isolate(vma, address, pte, cc,
>+ HPAGE_PMD_ORDER,
> &compound_pagelist);
> spin_unlock(pte_ptl);
> } else {
>@@ -1278,6 +1326,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>
> result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> vma, address, pte_ptl,
>+ HPAGE_PMD_ORDER,
> &compound_pagelist);
> pte_unmap(pte);
> if (unlikely(result != SCAN_SUCCEED))
>@@ -1313,9 +1362,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> struct vm_area_struct *vma, unsigned long start_addr,
> bool *lock_dropped, struct collapse_control *cc)
> {
>- const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>- const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
>- const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>+ const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>+ const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
>+ const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> pmd_t *pmd;
> pte_t *pte, *_pte;
> int none_or_zero = 0, shared = 0, referenced = 0;
>@@ -2369,8 +2418,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
> unsigned long addr, struct file *file, pgoff_t start,
> struct collapse_control *cc)
> {
>- const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
>- const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
>+ const int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
>+ const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> struct folio *folio = NULL;
> struct address_space *mapping = file->f_mapping;
> XA_STATE(xas, &mapping->i_pages, start);
>--
>2.54.0
>
>
^ permalink raw reply
* Re: [PATCH mm-unstable v17 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: David Hildenbrand (Arm) @ 2026-05-12 7:29 UTC (permalink / raw)
To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel
Cc: aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe, Usama Arif
In-Reply-To: <20260511185817.686831-4-npache@redhat.com>
On 5/11/26 20:58, Nico Pache wrote:
> The following cleanup reworks all the max_ptes_* handling into helper
> functions. This increases the code readability and will later be used to
> implement the mTHP handling of these variables.
>
> With these changes we abstract all the madvise_collapse() special casing
> (dont respect the sysctls) away from the functions that utilize them. And
> will be used later in this series to cleanly restrict the mTHP collapse
> behavior.
>
> No functional change is intended; however, we are now only reading the
> sysfs variables once per scan, whereas before these variables were being
> read on each loop iteration.
>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Some nits when re-reading:
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 118 +++++++++++++++++++++++++++++++++---------------
> 1 file changed, 82 insertions(+), 36 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index f0e29d5c7b1f..f68853b3caa7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -348,6 +348,62 @@ static bool pte_none_or_zero(pte_t pte)
> return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
> }
>
> +/**
> + * collapse_max_ptes_none - Calculate maximum allowed none-page or zero-page
I know, it's painful, but ...
There is no "none-page".
Calculate maximum allowed empty PTEs or PTEs mapping the shared zeropage ... ?
> + * PTEs for the given collapse operation.
We usually indent here (second line of subject), I think. Same applies to the
other doc below.
> + * @cc: The collapse control struct
> + * @vma: The vma to check for userfaultfd
> + *
> + * Return: Maximum number of none-page or zero-page PTEs allowed for the
> + * collapse operation.
Same here.
> + */
> +static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> + struct vm_area_struct *vma)
> +{
> + // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
Lance commented on the comment style.
Is this comment really required? It's pretty self-documenting already.
> + if (vma && userfaultfd_armed(vma))
> + return 0;
> + // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> + if (!cc->is_khugepaged)
> + return HPAGE_PMD_NR;
> + // For all other cases repect the user defined maximum.
> + return khugepaged_max_ptes_none;
> +}
> +
--
Cheers,
David
^ permalink raw reply
* [PATCH v2 2/2] spi: qcom-geni: Add trace events for Qualcomm GENI SPI driver
From: Praveen Talari @ 2026-05-12 6:12 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Mark Brown
Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
mukesh.savaliya, aniket.randive, chandana.chiluveru,
jyothi.seerapu, Praveen Talari
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-spi-v2-0-3b184068ecf9@oss.qualcomm.com>
Add tracepoints to the Qualcomm GENI (Generic Interface) SPI driver.
These trace events enable runtime debugging and performance analysis
of SPI operations.
The trace events capture SPI clock configuration, FIFO parameters,
transfer details, interrupt status.
Usage examples:
Enable all SPI traces:
echo 1 > /sys/kernel/tracing/events/spi/enable
echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_spi/enable
cat /sys/kernel/debug/tracing/trace_pipe
Example trace output:
1003.956560: spi_message_submit: spi16.0 000000001b20b93c
1003.956642: spi_controller_busy: spi16
1003.956643: spi_message_start: spi16.0 000000001b20b93c
1003.956646: geni_spi_fifo_params: 888000.spi: cs=0 mode=0x00000020
mode_changed=0x00000007 cs_changed=0
1003.956647: spi_set_cs: spi16.0 activate
1003.956648: spi_transfer_start: spi16.0 00000000ea1cf8b6 len=16
tx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
rx=[00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00]
1003.956653: geni_spi_clk_cfg: 888000.spi: req_hz=20000000
sclk_hz=100000000 clk_idx=5 clk_div=5 bpw=8
1003.956691: geni_spi_transfer: 888000.spi: len=16 m_cmd=0x00000003
1003.956708: geni_spi_irq: 888000.spi: m_irq=0x08000081
dma_tx=0x00000000 dma_rx=0x00000000
1003.956717: spi_transfer_stop: spi16.0 00000000ea1cf8b6 len=16
tx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
rx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
1003.956717: spi_set_cs: spi16.0 deactivate
1003.956718: spi_message_done: spi16.0 000000001b20b93c len=16/16
Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
v1->v2:
- Removed tx/rx data capture since spi core had already support.
- Updated commit text.
---
drivers/spi/spi-geni-qcom.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index d5fb0edc8e0c..164c6c0b9544 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -1,6 +1,9 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2017-2018, The Linux foundation. All rights reserved.
+#define CREATE_TRACE_POINTS
+#include <trace/events/qcom_geni_spi.h>
+
#include <linux/clk.h>
#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
@@ -332,6 +335,9 @@ static int geni_spi_set_clock_and_bw(struct spi_geni_master *mas,
writel(clk_sel, se->base + SE_GENI_CLK_SEL);
writel(m_clk_cfg, se->base + GENI_SER_M_CLK_CFG);
+ trace_geni_spi_clk_cfg(mas->dev, clk_hz, mas->cur_sclk_hz, idx, div,
+ mas->cur_bits_per_word);
+
/* Set BW quota for CPU as driver supports FIFO mode only. */
se->icc_paths[CPU_TO_GENI].avg_bw = Bps_to_icc(mas->cur_speed_hz);
ret = geni_icc_set_bw(se);
@@ -366,6 +372,9 @@ static int setup_fifo_params(struct spi_device *spi_slv,
if ((mode_changed & SPI_CS_HIGH) || (cs_changed && (spi_slv->mode & SPI_CS_HIGH)))
writel((spi_slv->mode & SPI_CS_HIGH) ? BIT(chipselect) : 0, se->base + SE_SPI_DEMUX_OUTPUT_INV);
+ trace_geni_spi_fifo_params(mas->dev, chipselect, spi_slv->mode,
+ mode_changed, cs_changed);
+
return 0;
}
@@ -861,6 +870,8 @@ static int setup_se_xfer(struct spi_transfer *xfer,
spin_lock_irq(&mas->lock);
geni_se_setup_m_cmd(se, m_cmd, m_params);
+ trace_geni_spi_transfer(mas->dev, len, m_cmd);
+
if (mas->cur_xfer_mode == GENI_SE_DMA) {
if (m_cmd & SPI_RX_ONLY)
geni_se_rx_init_dma(se, sg_dma_address(xfer->rx_sg.sgl),
@@ -915,6 +926,8 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
if (!m_irq && !dma_tx_status && !dma_rx_status)
return IRQ_NONE;
+ trace_geni_spi_irq(mas->dev, m_irq, dma_tx_status, dma_rx_status);
+
if (m_irq & (M_CMD_OVERRUN_EN | M_ILLEGAL_CMD_EN | M_CMD_FAILURE_EN |
M_RX_FIFO_RD_ERR_EN | M_RX_FIFO_WR_ERR_EN |
M_TX_FIFO_RD_ERR_EN | M_TX_FIFO_WR_ERR_EN))
--
2.34.1
^ permalink raw reply related
* [PATCH v2 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Praveen Talari @ 2026-05-12 6:12 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Mark Brown
Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
mukesh.savaliya, aniket.randive, chandana.chiluveru,
jyothi.seerapu, Praveen Talari
In-Reply-To: <20260512-add-tracepoints-for-qcom-geni-spi-v2-0-3b184068ecf9@oss.qualcomm.com>
Add tracepoint support to the Qualcomm GENI SPI driver to provide
runtime visibility into driver behavior without requiring invasive debug
patches.
The trace events cover clock and FIFO parameter configuration,
transfer metadata, interrupt status to be making it easier to diagnose
communication issues in the field..
Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
v1->v2:
- Removed TX/RX data tracepoints.
- Updated commit text.
---
include/trace/events/qcom_geni_spi.h | 103 +++++++++++++++++++++++++++++++++++
1 file changed, 103 insertions(+)
diff --git a/include/trace/events/qcom_geni_spi.h b/include/trace/events/qcom_geni_spi.h
new file mode 100644
index 000000000000..5f39dab47e4e
--- /dev/null
+++ b/include/trace/events/qcom_geni_spi.h
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM qcom_geni_spi
+
+#if !defined(_TRACE_QCOM_GENI_SPI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_QCOM_GENI_SPI_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(geni_spi_fifo_params,
+ TP_PROTO(struct device *dev, u8 cs, u32 mode,
+ u32 mode_changed, bool cs_changed),
+ TP_ARGS(dev, cs, mode, mode_changed, cs_changed),
+
+ TP_STRUCT__entry(__string(name, dev_name(dev))
+ __field(u8, cs)
+ __field(u32, mode)
+ __field(u32, mode_changed)
+ __field(bool, cs_changed)
+ ),
+
+ TP_fast_assign(__assign_str(name);
+ __entry->cs = cs;
+ __entry->mode = mode;
+ __entry->mode_changed = mode_changed;
+ __entry->cs_changed = cs_changed;
+ ),
+
+ TP_printk("%s: cs=%u mode=0x%08x mode_changed=0x%08x cs_changed=%d",
+ __get_str(name), __entry->cs, __entry->mode,
+ __entry->mode_changed, __entry->cs_changed)
+);
+
+TRACE_EVENT(geni_spi_clk_cfg,
+ TP_PROTO(struct device *dev, unsigned long req_hz,
+ unsigned long sclk_hz, unsigned int clk_idx,
+ unsigned int clk_div, unsigned int bpw),
+ TP_ARGS(dev, req_hz, sclk_hz, clk_idx, clk_div, bpw),
+
+ TP_STRUCT__entry(__string(name, dev_name(dev))
+ __field(unsigned long, req_hz)
+ __field(unsigned long, sclk_hz)
+ __field(unsigned int, clk_idx)
+ __field(unsigned int, clk_div)
+ __field(unsigned int, bpw)
+ ),
+
+ TP_fast_assign(__assign_str(name);
+ __entry->req_hz = req_hz;
+ __entry->sclk_hz = sclk_hz;
+ __entry->clk_idx = clk_idx;
+ __entry->clk_div = clk_div;
+ __entry->bpw = bpw;
+ ),
+
+ TP_printk("%s: req_hz=%lu sclk_hz=%lu clk_idx=%u clk_div=%u bpw=%u",
+ __get_str(name), __entry->req_hz, __entry->sclk_hz,
+ __entry->clk_idx, __entry->clk_div, __entry->bpw)
+);
+
+TRACE_EVENT(geni_spi_transfer,
+ TP_PROTO(struct device *dev, unsigned int len, u32 m_cmd),
+ TP_ARGS(dev, len, m_cmd),
+
+ TP_STRUCT__entry(__string(name, dev_name(dev))
+ __field(unsigned int, len)
+ __field(u32, m_cmd)
+ ),
+
+ TP_fast_assign(__assign_str(name);
+ __entry->len = len;
+ __entry->m_cmd = m_cmd;
+ ),
+
+ TP_printk("%s: len=%u m_cmd=0x%08x",
+ __get_str(name), __entry->len, __entry->m_cmd)
+);
+
+TRACE_EVENT(geni_spi_irq,
+ TP_PROTO(struct device *dev, u32 m_irq, u32 dma_tx, u32 dma_rx),
+ TP_ARGS(dev, m_irq, dma_tx, dma_rx),
+
+ TP_STRUCT__entry(__string(name, dev_name(dev))
+ __field(u32, m_irq)
+ __field(u32, dma_tx)
+ __field(u32, dma_rx)
+ ),
+
+ TP_fast_assign(__assign_str(name);
+ __entry->m_irq = m_irq;
+ __entry->dma_tx = dma_tx;
+ __entry->dma_rx = dma_rx;
+ ),
+
+ TP_printk("%s: m_irq=0x%08x dma_tx=0x%08x dma_rx=0x%08x",
+ __get_str(name), __entry->m_irq, __entry->dma_tx,
+ __entry->dma_rx)
+);
+
+#endif /* _TRACE_QCOM_GENI_SPI_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--
2.34.1
^ permalink raw reply related
* [PATCH v2 0/2] Add trace events for Qualcomm GENI SPI drivers
From: Praveen Talari @ 2026-05-12 6:12 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Mark Brown
Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
mukesh.savaliya, aniket.randive, chandana.chiluveru,
jyothi.seerapu, Praveen Talari
Add tracepoints to the Qualcomm GENI (Generic Interface) SPI driver.
These trace events enable runtime debugging and performance analysis
of SPI operations.
The trace events capture SPI clock configuration, FIFO parameters,
transfer details, interrupt status.
Usage examples:
Enable all SPI traces:
echo 1 > /sys/kernel/tracing/events/spi/enable
echo 1 > /sys/kernel/debug/tracing/events/qcom_geni_spi/enable
cat /sys/kernel/debug/tracing/trace_pipe
Example trace output:
1003.956560: spi_message_submit: spi16.0 000000001b20b93c
1003.956642: spi_controller_busy: spi16
1003.956643: spi_message_start: spi16.0 000000001b20b93c
1003.956646: geni_spi_fifo_params: 888000.spi: cs=0 mode=0x00000020
mode_changed=0x00000007 cs_changed=0
1003.956647: spi_set_cs: spi16.0 activate
1003.956648: spi_transfer_start: spi16.0 00000000ea1cf8b6 len=16
tx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
rx=[00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00]
1003.956653: geni_spi_clk_cfg: 888000.spi: req_hz=20000000
sclk_hz=100000000 clk_idx=5 clk_div=5 bpw=8
1003.956691: geni_spi_transfer: 888000.spi: len=16 m_cmd=0x00000003
1003.956708: geni_spi_irq: 888000.spi: m_irq=0x08000081
dma_tx=0x00000000 dma_rx=0x00000000
1003.956717: spi_transfer_stop: spi16.0 00000000ea1cf8b6 len=16
tx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
rx=[4c-80-e4-ca-68-4d-95-aa-ee-99-ae-d7-69-e9-5f-39]
1003.956717: spi_set_cs: spi16.0 deactivate
1003.956718: spi_message_done: spi16.0 000000001b20b93c len=16/16
Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
---
Changes in v2:
- Removed tx/rx data capture since spi core had already support.
- Updated commit text in patches and cover letter.
- Link to v1: https://lore.kernel.org/r/20260506-add-tracepoints-for-qcom-geni-spi-v1-0-c957cfe712d1@oss.qualcomm.com
---
Praveen Talari (2):
spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
spi: qcom-geni: Add trace events for Qualcomm GENI SPI driver
drivers/spi/spi-geni-qcom.c | 13 +++++
include/trace/events/qcom_geni_spi.h | 103 +++++++++++++++++++++++++++++++++++
2 files changed, 116 insertions(+)
---
base-commit: 1f5ffc672165ff851063a5fd044b727ab2517ae3
change-id: 20260506-add-tracepoints-for-qcom-geni-spi-e31457c2267c
Best regards,
--
Praveen Talari <praveen.talari@oss.qualcomm.com>
^ permalink raw reply
* Re: [PATCH v1 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Praveen Talari @ 2026-05-12 6:10 UTC (permalink / raw)
To: Trilok Soni, Mark Brown
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, linux-arm-msm, linux-spi,
MukeshKumarSavaliyamukesh.savaliya, AniketRandiveaniket.randive,
chandana.chiluveru, jyothi.seerapu
In-Reply-To: <e37f2d5d-daa9-4302-8d34-9ce198e60a4a@oss.qualcomm.com>
Hi Trilok,
On 09-05-2026 04:44, Trilok Soni wrote:
> On 5/8/2026 7:01 AM, Mark Brown wrote:
>> On Thu, May 07, 2026 at 11:03:39PM +0530, Praveen Talari wrote:
>>> On 07-05-2026 13:43, Mark Brown wrote:
>>>> By generic I mean this should not be driver specific at all.
>>> I hope these changes are fine. Please let me know if you have any concerns
>>> or feedback.
>> The data tracepoints look plausible but I would expect them to be
>> generated by the core, they'll be there for everything so I'd expect
>> them to work for everything.
> I agree here. Praveen - this is similar to suggestion I had for the i2c
> internally.
Sure i will review for I2C as well.
Thanks,
Praveen Talari
>
>
> ---Trilok Soni
>
^ permalink raw reply
* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
From: Masami Hiramatsu @ 2026-05-12 5:14 UTC (permalink / raw)
To: Jiri Olsa
Cc: Andrii Nakryiko, bpf, linux-trace-kernel, oleg, peterz, mingo,
mhiramat
In-Reply-To: <agD3xq2MUyskd7X-@krava>
On Sun, 10 May 2026 23:25:26 +0200
Jiri Olsa <olsajiri@gmail.com> wrote:
> On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> > The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> > probe site with a CALL into a uprobe trampoline. CALL pushes a return
> > address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> > user code may keep temporary data without adjusting rsp.
> >
> > Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> > also does not provide a return address. Replace the single trampoline
> > entry with a page of 16-byte slots. Each optimized probe jumps to its
> > assigned slot, the slot moves rsp below the red zone, saves the registers
> > clobbered by syscall, and invokes the uprobe syscall:
> >
> > Probe site: jmp slot_N (5B, replaces nop5)
> >
> > Slot N: lea -128(%rsp), %rsp (5B) skip red zone
> > push %rcx (1B) save (syscall clobbers)
> > push %r11 (2B) save (syscall clobbers)
> > push %rax (1B) save (syscall uses for nr)
> > mov $336, %eax (5B) uprobe syscall number
> > syscall (2B)
> >
> > All slots contain identical code at different offsets, so the trampoline
> > page is generated once at boot and mapped read-execute into each process.
> > The syscall handler identifies the slot from regs->ip, which points just
> > after the syscall instruction, and uses a per-mm slot table to recover the
> > original probe address.
> >
> > The uprobe syscall does not return to the trampoline slot. The handler
> > restores the probe-site register state, runs the uprobe consumers, sets
> > pt_regs to continue at probe_addr + 5 unless a consumer redirected
> > execution, and returns directly through the IRET path. This preserves
> > general purpose registers, including rcx and r11, without requiring any
> > post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> > shadow stack concerns.
> >
> > Protect the per-mm trampoline list with RCU and free trampoline metadata
> > with kfree_rcu(). This lets the syscall path resolve trampoline slots
> > without taking mmap_lock. The optimized-instruction detection path also
> > walks the trampoline list under an RCU read-side lock. Since that path
> > starts from the JMP target, it translates the slot start to the post-syscall
> > IP expected by the shared resolver before checking the trampoline mapping.
> >
> > Each trampoline page provides 256 slots. Slots stay permanently assigned
> > to their first probe address and are reused only when the same address is
> > probed again. Reassigning detached slots is deliberately avoided because a
> > thread can remain in a trampoline for an unbounded time due to ptrace,
> > interrupts, or scheduling delays. If a reachable trampoline page runs out
> > of slots, probes that cannot allocate a slot fall back to the slower INT3
> > path.
> >
> > Require the entire trampoline page to be reachable by a rel32 JMP before
> > reusing it for a probe. This keeps every slot in the page within the range
> > that can be encoded at the probe site.
> >
> > Change the error code returned when the uprobe syscall is invoked outside
> > a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> > similar libraries distinguish fixed kernels from kernels with the
> > red-zone-clobbering implementation and enable nop5 optimization only on
> > fixed kernels.
> >
> > Performance (usdt single-thread, M/s):
> >
> > usdt-nop usdt-nop5-base usdt-nop5-fix nop5-change iret%
> > Skylake 3.149 6.422 4.865 -24.3% 39.1%
> > Milan 2.910 3.443 3.820 +11.0% 24.3%
> > Sapphire Rapids 1.896 4.023 3.693 -8.2% 24.9%
> > Bergamo 3.393 3.895 3.849 -1.2% 24.5%
> >
> > The fixed nop5 path remains faster than the non-optimized INT3 path on all
> > measured systems. The regression relative to the old CALL-based trampoline
> > comes from IRET being more expensive than SYSRET, most noticeably on older
> > Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> > and AMD Milan improves because removing mmap_lock from the hot path more
> > than offsets the IRET cost.
> >
> > Multi-threaded throughput scales nearly linearly with the number of CPUs, like
> > it used to, thanks to lockless RCU-protected uprobe trampoline lookup.
>
> hi,
> thanks a lot for the fix
>
> FWIW we discussed also an option to have 10-bytes nop and do:
> [rsp+0x80, call trampoline]
>
> we would not need the slots re-use logic, but not sure what other
> surprises there are with 10-bytes nop
Does this mean we have to update UDST implementation?
>
> I tried that change [1], it seems to work, but it has other
> difficulties, like I think the unoptimized path needs to do:
> [rsp+0x80, call trampoline] -> [jmp end of 10-bytes nop]
> instead of patching back the 10-byte nop, because some thread
> could be inside the nop area already.
Yeah, but at that moment, we know where the modified code is.
Maybe memory dump shows different code, but that is also true
if uprobe is active. So I think it is OK.
Thanks,
>
> jirka
>
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=redzone_fix&id=74b09240289dba8368c2783b771e678b2cc31574
>
> >
> > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> > arch/x86/include/asm/uprobes.h | 18 ++
> > arch/x86/kernel/uprobes.c | 262 ++++++++++--------
> > tools/lib/bpf/features.c | 8 +-
> > .../selftests/bpf/prog_tests/uprobe_syscall.c | 5 +-
> > tools/testing/selftests/bpf/prog_tests/usdt.c | 2 +-
> > 5 files changed, 181 insertions(+), 114 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> > index 362210c79998..a7cf5c92d95a 100644
> > --- a/arch/x86/include/asm/uprobes.h
> > +++ b/arch/x86/include/asm/uprobes.h
> > @@ -25,6 +25,24 @@ enum {
> > ARCH_UPROBE_FLAG_OPTIMIZE_FAIL = 1,
> > };
> >
> > +/*
> > + * Trampoline page layout: identical 16-byte slots, each containing:
> > + * lea -128(%rsp), %rsp (5B) skip red zone
> > + * push %rcx (1B) save (syscall clobbers)
> > + * push %r11 (2B) save (syscall clobbers)
> > + * push %rax (1B) save (syscall uses for nr)
> > + * mov $336, %eax (5B) uprobe syscall number
> > + * syscall (2B)
> > + * = 16B, no padding needed
> > + *
> > + * The handler identifies which probe fired from regs->ip (each
> > + * slot is at a unique offset), looks up the probe address from a
> > + * per-process table, and returns directly to probe_addr+5 via iret
> > + * with all registers restored.
> > + */
> > +#define UPROBE_TRAMP_SLOT_SIZE 16
> > +#define UPROBE_TRAMP_MAX_SLOTS (PAGE_SIZE / UPROBE_TRAMP_SLOT_SIZE)
> > +
> > struct uprobe_xol_ops;
> >
> > struct arch_uprobe {
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index ebb1baf1eb1d..7e1f14200bbb 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -633,16 +633,25 @@ static struct vm_special_mapping tramp_mapping = {
> >
> > struct uprobe_trampoline {
> > struct hlist_node node;
> > + struct rcu_head rcu;
> > unsigned long vaddr;
> > + unsigned long probe_addrs[UPROBE_TRAMP_MAX_SLOTS];
> > };
> >
> > -static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
> > +
> > +static bool is_reachable_by_jmp(unsigned long dst, unsigned long src)
> > {
> > - long delta = (long)(vaddr + 5 - vtramp);
> > + long delta = (long)(dst - (src + JMP32_INSN_SIZE));
> >
> > return delta >= INT_MIN && delta <= INT_MAX;
> > }
> >
> > +static bool is_reachable_by_trampoline(unsigned long vtramp, unsigned long vaddr)
> > +{
> > + return is_reachable_by_jmp(vtramp, vaddr) &&
> > + is_reachable_by_jmp(vtramp + PAGE_SIZE - 1, vaddr);
> > +}
> > +
> > static unsigned long find_nearest_trampoline(unsigned long vaddr)
> > {
> > struct vm_unmapped_area_info info = {
> > @@ -711,6 +720,21 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
> > return tramp;
> > }
> >
> > +static int tramp_alloc_slot(struct uprobe_trampoline *tramp, unsigned long probe_addr)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < UPROBE_TRAMP_MAX_SLOTS; i++) {
> > + if (tramp->probe_addrs[i] == probe_addr)
> > + return i;
> > + if (tramp->probe_addrs[i] == 0) {
> > + tramp->probe_addrs[i] = probe_addr;
> > + return i;
> > + }
> > + }
> > + return -ENOSPC;
> > +}
> > +
> > static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool *new)
> > {
> > struct uprobes_state *state = ¤t->mm->uprobes_state;
> > @@ -720,7 +744,7 @@ static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool
> > return NULL;
> >
> > hlist_for_each_entry(tramp, &state->head_tramps, node) {
> > - if (is_reachable_by_call(tramp->vaddr, vaddr)) {
> > + if (is_reachable_by_trampoline(tramp->vaddr, vaddr)) {
> > *new = false;
> > return tramp;
> > }
> > @@ -731,7 +755,7 @@ static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool
> > return NULL;
> >
> > *new = true;
> > - hlist_add_head(&tramp->node, &state->head_tramps);
> > + hlist_add_head_rcu(&tramp->node, &state->head_tramps);
> > return tramp;
> > }
> >
> > @@ -742,8 +766,8 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> > * because there's no easy way to make sure none of the threads
> > * is still inside the trampoline.
> > */
> > - hlist_del(&tramp->node);
> > - kfree(tramp);
> > + hlist_del_rcu(&tramp->node);
> > + kfree_rcu(tramp, rcu);
> > }
> >
> > void arch_uprobe_init_state(struct mm_struct *mm)
> > @@ -761,147 +785,153 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
> > destroy_uprobe_trampoline(tramp);
> > }
> >
> > -static bool __in_uprobe_trampoline(unsigned long ip)
> > +/*
> > + * Find the trampoline containing @ip. If @probe_addr is non-NULL, also
> > + * resolve the slot index from @ip and return the probe address.
> > + *
> > + * @ip is expected to point right after the syscall instruction, i.e.,
> > + * at the end of the slot (slot_start + UPROBE_TRAMP_SLOT_SIZE).
> > + */
> > +static bool resolve_uprobe_addr(unsigned long ip, unsigned long *probe_addr)
> > {
> > - struct vm_area_struct *vma = vma_lookup(current->mm, ip);
> > + struct uprobes_state *state = ¤t->mm->uprobes_state;
> > + struct uprobe_trampoline *tramp;
> >
> > - return vma && vma_is_special_mapping(vma, &tramp_mapping);
> > -}
> > + hlist_for_each_entry_rcu(tramp, &state->head_tramps, node) {
> > + /*
> > + * ip points to after syscall, so it's on 16 byte boundary,
> > + * which means that valid ip can point right after the page
> > + * and should never be at zero offset within the page
> > + */
> > + if (ip <= tramp->vaddr || ip > tramp->vaddr + PAGE_SIZE)
> > + continue;
> >
> > -static bool in_uprobe_trampoline(unsigned long ip)
> > -{
> > - struct mm_struct *mm = current->mm;
> > - bool found, retry = true;
> > - unsigned int seq;
> > + if (probe_addr) {
> > + /* we already validated ip is within expected range */
> > + unsigned int slot = (ip - tramp->vaddr - 1) / UPROBE_TRAMP_SLOT_SIZE;
> > + unsigned long addr = tramp->probe_addrs[slot];
> >
> > - rcu_read_lock();
> > - if (mmap_lock_speculate_try_begin(mm, &seq)) {
> > - found = __in_uprobe_trampoline(ip);
> > - retry = mmap_lock_speculate_retry(mm, seq);
> > - }
> > - rcu_read_unlock();
> > + *probe_addr = addr;
> > + if (addr == 0)
> > + return false;
> > + }
> >
> > - if (retry) {
> > - mmap_read_lock(mm);
> > - found = __in_uprobe_trampoline(ip);
> > - mmap_read_unlock(mm);
> > + return true;
> > }
> > - return found;
> > + return false;
> > +}
> > +
> > +static bool in_uprobe_trampoline(unsigned long ip, unsigned long *probe_addr)
> > +{
> > + guard(rcu)();
> > + return resolve_uprobe_addr(ip, probe_addr);
> > }
> >
> > /*
> > - * See uprobe syscall trampoline; the call to the trampoline will push
> > - * the return address on the stack, the trampoline itself then pushes
> > - * cx, r11 and ax.
> > + * The trampoline slot pushes cx, r11, ax (the registers syscall clobbers)
> > + * before doing the uprobe syscall. No return address is pushed — the
> > + * probe site uses jmp, not call.
> > */
> > struct uprobe_syscall_args {
> > unsigned long ax;
> > unsigned long r11;
> > unsigned long cx;
> > - unsigned long retaddr;
> > };
> >
> > +#define UPROBE_TRAMP_REDZONE 128
> > +
> > SYSCALL_DEFINE0(uprobe)
> > {
> > struct pt_regs *regs = task_pt_regs(current);
> > struct uprobe_syscall_args args;
> > - unsigned long ip, sp, sret;
> > + unsigned long probe_addr;
> > int err;
> >
> > /* Allow execution only from uprobe trampolines. */
> > - if (!in_uprobe_trampoline(regs->ip))
> > - return -ENXIO;
> > + if (!in_uprobe_trampoline(regs->ip, &probe_addr))
> > + return -EPROTO;
> >
> > err = copy_from_user(&args, (void __user *)regs->sp, sizeof(args));
> > if (err)
> > goto sigill;
> >
> > - ip = regs->ip;
> > -
> > /*
> > - * expose the "right" values of ax/r11/cx/ip/sp to uprobe_consumer/s, plus:
> > - * - adjust ip to the probe address, call saved next instruction address
> > - * - adjust sp to the probe's stack frame (check trampoline code)
> > + * Restore the register state as it was at the probe site:
> > + * - ax/r11/cx from the trampoline-saved copies on user stack
> > + * - adjust ip to the probe address based on matching slot
> > + * - adjust sp to skip red zone and pushed args
> > */
> > regs->ax = args.ax;
> > regs->r11 = args.r11;
> > regs->cx = args.cx;
> > - regs->ip = args.retaddr - 5;
> > - regs->sp += sizeof(args);
> > + regs->ip = probe_addr;
> > + regs->sp += sizeof(args) + UPROBE_TRAMP_REDZONE;
> > regs->orig_ax = -1;
> >
> > - sp = regs->sp;
> > -
> > - err = shstk_pop((u64 *)&sret);
> > - if (err == -EFAULT || (!err && sret != args.retaddr))
> > - goto sigill;
> > -
> > - handle_syscall_uprobe(regs, regs->ip);
> > + handle_syscall_uprobe(regs, probe_addr);
> >
> > /*
> > - * Some of the uprobe consumers has changed sp, we can do nothing,
> > - * just return via iret.
> > + * Skip the jmp instruction at the probe site (5 bytes) unless
> > + * a consumer redirected execution elsewhere.
> > */
> > - if (regs->sp != sp) {
> > - /* skip the trampoline call */
> > - if (args.retaddr - 5 == regs->ip)
> > - regs->ip += 5;
> > - return regs->ax;
> > - }
> > + if (regs->ip == probe_addr)
> > + regs->ip = probe_addr + 5;
> >
> > - regs->sp -= sizeof(args);
> > -
> > - /* for the case uprobe_consumer has changed ax/r11/cx */
> > - args.ax = regs->ax;
> > - args.r11 = regs->r11;
> > - args.cx = regs->cx;
> > -
> > - /* keep return address unless we are instructed otherwise */
> > - if (args.retaddr - 5 != regs->ip)
> > - args.retaddr = regs->ip;
> > -
> > - if (shstk_push(args.retaddr) == -EFAULT)
> > - goto sigill;
> > -
> > - regs->ip = ip;
> > -
> > - err = copy_to_user((void __user *)regs->sp, &args, sizeof(args));
> > - if (err)
> > - goto sigill;
> > -
> > - /* ensure sysret, see do_syscall_64() */
> > - regs->r11 = regs->flags;
> > - regs->cx = regs->ip;
> > - return 0;
> > + /*
> > + * Return via iret by returning regs->ax. This preserves all
> > + * GP registers (including cx and r11) without needing any
> > + * user-space cleanup code. The iret path is used because we
> > + * don't set up cx/r11 for sysret.
> > + */
> > + return regs->ax;
> >
> > sigill:
> > force_sig(SIGILL);
> > return -1;
> > }
> >
> > +/*
> > + * All uprobe trampoline slots are identical: skip the red zone,
> > + * save the three registers that syscall clobbers, then invoke
> > + * the uprobe syscall. The handler returns directly to the probe
> > + * caller via iret. Execution never returns to the trampoline.
> > + */
> > asm (
> > ".pushsection .rodata\n"
> > - ".balign " __stringify(PAGE_SIZE) "\n"
> > - "uprobe_trampoline_entry:\n"
> > + ".balign " __stringify(UPROBE_TRAMP_SLOT_SIZE) "\n"
> > + "uprobe_trampoline_slot:\n"
> > + "lea -128(%rsp), %rsp\n"
> > "push %rcx\n"
> > "push %r11\n"
> > "push %rax\n"
> > - "mov $" __stringify(__NR_uprobe) ", %rax\n"
> > + "mov $" __stringify(__NR_uprobe) ", %eax\n"
> > "syscall\n"
> > - "pop %rax\n"
> > - "pop %r11\n"
> > - "pop %rcx\n"
> > - "ret\n"
> > - "int3\n"
> > - ".balign " __stringify(PAGE_SIZE) "\n"
> > + "uprobe_trampoline_slot_end:\n"
> > ".popsection\n"
> > );
> >
> > -extern u8 uprobe_trampoline_entry[];
> > +extern u8 uprobe_trampoline_slot[];
> > +extern u8 uprobe_trampoline_slot_end[];
> >
> > static int __init arch_uprobes_init(void)
> > {
> > - tramp_mapping_pages[0] = virt_to_page(uprobe_trampoline_entry);
> > + unsigned int slot_size = uprobe_trampoline_slot_end - uprobe_trampoline_slot;
> > + struct page *page;
> > + u8 *page_addr;
> > + int i;
> > +
> > + BUILD_BUG_ON(UPROBE_TRAMP_SLOT_SIZE != 16);
> > + WARN_ON_ONCE(slot_size != UPROBE_TRAMP_SLOT_SIZE);
> > +
> > + page = alloc_page(GFP_KERNEL);
> > + if (!page)
> > + return -ENOMEM;
> > +
> > + page_addr = page_address(page);
> > + for (i = 0; i < UPROBE_TRAMP_MAX_SLOTS; i++)
> > + memcpy(page_addr + i * UPROBE_TRAMP_SLOT_SIZE, uprobe_trampoline_slot, slot_size);
> > +
> > + tramp_mapping_pages[0] = page;
> > return 0;
> > }
> >
> > @@ -909,7 +939,7 @@ late_initcall(arch_uprobes_init);
> >
> > enum {
> > EXPECT_SWBP,
> > - EXPECT_CALL,
> > + EXPECT_JMP,
> > };
> >
> > struct write_opcode_ctx {
> > @@ -917,14 +947,14 @@ struct write_opcode_ctx {
> > int expect;
> > };
> >
> > -static int is_call_insn(uprobe_opcode_t *insn)
> > +static int is_jmp_insn(uprobe_opcode_t *insn)
> > {
> > - return *insn == CALL_INSN_OPCODE;
> > + return *insn == JMP32_INSN_OPCODE;
> > }
> >
> > /*
> > * Verification callback used by int3_update uprobe_write calls to make sure
> > - * the underlying instruction is as expected - either int3 or call.
> > + * the underlying instruction is as expected - either int3 or jmp.
> > */
> > static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode,
> > int nbytes, void *data)
> > @@ -939,8 +969,8 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
> > if (is_swbp_insn(&old_opcode[0]))
> > return 1;
> > break;
> > - case EXPECT_CALL:
> > - if (is_call_insn(&old_opcode[0]))
> > + case EXPECT_JMP:
> > + if (is_jmp_insn(&old_opcode[0]))
> > return 1;
> > break;
> > }
> > @@ -978,7 +1008,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > * so we can skip this step for optimize == true.
> > */
> > if (!optimize) {
> > - ctx.expect = EXPECT_CALL;
> > + ctx.expect = EXPECT_JMP;
> > err = uprobe_write(auprobe, vma, vaddr, &int3, 1, verify_insn,
> > true /* is_register */, false /* do_update_ref_ctr */,
> > &ctx);
> > @@ -1015,13 +1045,13 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > }
> >
> > static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > - unsigned long vaddr, unsigned long tramp)
> > + unsigned long vaddr, unsigned long slot_vaddr)
> > {
> > - u8 call[5];
> > + u8 jmp[5];
> >
> > - __text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
> > - (const void *) tramp, CALL_INSN_SIZE);
> > - return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
> > + __text_gen_insn(jmp, JMP32_INSN_OPCODE, (const void *) vaddr,
> > + (const void *) slot_vaddr, JMP32_INSN_SIZE);
> > + return int3_update(auprobe, vma, vaddr, jmp, true /* optimize */);
> > }
> >
> > static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > @@ -1049,11 +1079,17 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
> > struct __packed __arch_relative_insn {
> > u8 op;
> > s32 raddr;
> > - } *call = (struct __arch_relative_insn *) insn;
> > + } *jmp = (struct __arch_relative_insn *) insn;
> >
> > - if (!is_call_insn(insn))
> > + if (!is_jmp_insn(&jmp->op))
> > return false;
> > - return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
> > +
> > + guard(rcu)();
> > + /*
> > + * resolve_uprobe_addr() expects IP pointing after syscall instruction
> > + * (after the slot, basically), so adjust jump target address accordingly
> > + */
> > + return resolve_uprobe_addr(vaddr + 5 + jmp->raddr + UPROBE_TRAMP_SLOT_SIZE, NULL);
> > }
> >
> > static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
> > @@ -1113,8 +1149,9 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
> > {
> > struct uprobe_trampoline *tramp;
> > struct vm_area_struct *vma;
> > + unsigned long slot_vaddr;
> > bool new = false;
> > - int err = 0;
> > + int slot, err;
> >
> > vma = find_vma(mm, vaddr);
> > if (!vma)
> > @@ -1122,8 +1159,17 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
> > tramp = get_uprobe_trampoline(vaddr, &new);
> > if (!tramp)
> > return -EINVAL;
> > - err = swbp_optimize(auprobe, vma, vaddr, tramp->vaddr);
> > - if (WARN_ON_ONCE(err) && new)
> > +
> > + slot = tramp_alloc_slot(tramp, vaddr);
> > + if (slot < 0) {
> > + if (new)
> > + destroy_uprobe_trampoline(tramp);
> > + return slot;
> > + }
> > +
> > + slot_vaddr = tramp->vaddr + slot * UPROBE_TRAMP_SLOT_SIZE;
> > + err = swbp_optimize(auprobe, vma, vaddr, slot_vaddr);
> > + if (err && new)
> > destroy_uprobe_trampoline(tramp);
> > return err;
> > }
> > diff --git a/tools/lib/bpf/features.c b/tools/lib/bpf/features.c
> > index 4f19a0d79b0c..1b6c113357b2 100644
> > --- a/tools/lib/bpf/features.c
> > +++ b/tools/lib/bpf/features.c
> > @@ -577,10 +577,12 @@ static int probe_ldimm64_full_range_off(int token_fd)
> > static int probe_uprobe_syscall(int token_fd)
> > {
> > /*
> > - * If kernel supports uprobe() syscall, it will return -ENXIO when called
> > - * from the outside of a kernel-generated uprobe trampoline.
> > + * If kernel supports uprobe() syscall, it will return -EPROTO when
> > + * called from outside a kernel-generated uprobe trampoline.
> > + * Older kernels with the red-zone-clobbering bug return -ENXIO;
> > + * we only enable the nop5 optimization on fixed kernels.
> > */
> > - return syscall(__NR_uprobe) < 0 && errno == ENXIO;
> > + return syscall(__NR_uprobe) < 0 && errno == EPROTO;
> > }
> > #else
> > static int probe_uprobe_syscall(int token_fd)
> > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > index 955a37751b52..0d5eb4cd1ddf 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > @@ -422,7 +422,8 @@ static void *check_attach(struct uprobe_syscall_executed *skel, trigger_t trigge
> > /* .. and check the trampoline is as expected. */
> > call = (struct __arch_relative_insn *) addr;
> > tramp = (void *) (call + 1) + call->raddr;
> > - ASSERT_EQ(call->op, 0xe8, "call");
> > + tramp = (void *)((unsigned long)tramp & ~(getpagesize() - 1UL));
> > + ASSERT_EQ(call->op, 0xe9, "jmp");
> > ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline");
> >
> > return tramp;
> > @@ -762,7 +763,7 @@ static void test_uprobe_error(void)
> > long err = syscall(__NR_uprobe);
> >
> > ASSERT_EQ(err, -1, "error");
> > - ASSERT_EQ(errno, ENXIO, "errno");
> > + ASSERT_EQ(errno, EPROTO, "errno");
> > }
> >
> > static void __test_uprobe_syscall(void)
> > diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
> > index 69759b27794d..9d3744d4e936 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/usdt.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
> > @@ -329,7 +329,7 @@ static void subtest_optimized_attach(void)
> > ASSERT_EQ(*addr_2, 0x90, "nop");
> >
> > /* call is on addr_2 + 1 address */
> > - ASSERT_EQ(*(addr_2 + 1), 0xe8, "call");
> > + ASSERT_EQ(*(addr_2 + 1), 0xe9, "jmp");
> > ASSERT_EQ(skel->bss->executed, 4, "executed");
> >
> > cleanup:
> > --
> > 2.53.0-Meta
> >
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH mm-unstable v17 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: Lance Yang @ 2026-05-12 4:44 UTC (permalink / raw)
To: npache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, usama.arif
In-Reply-To: <20260511185817.686831-4-npache@redhat.com>
On Mon, May 11, 2026 at 12:58:03PM -0600, Nico Pache wrote:
>The following cleanup reworks all the max_ptes_* handling into helper
>functions. This increases the code readability and will later be used to
>implement the mTHP handling of these variables.
>
>With these changes we abstract all the madvise_collapse() special casing
>(dont respect the sysctls) away from the functions that utilize them. And
Nit: s/dont/do not/
>will be used later in this series to cleanly restrict the mTHP collapse
>behavior.
>
>No functional change is intended; however, we are now only reading the
>sysfs variables once per scan, whereas before these variables were being
>read on each loop iteration.
>
>Suggested-by: David Hildenbrand <david@kernel.org>
>Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>Acked-by: Usama Arif <usama.arif@linux.dev>
>Signed-off-by: Nico Pache <npache@redhat.com>
>---
> mm/khugepaged.c | 118 +++++++++++++++++++++++++++++++++---------------
> 1 file changed, 82 insertions(+), 36 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index f0e29d5c7b1f..f68853b3caa7 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -348,6 +348,62 @@ static bool pte_none_or_zero(pte_t pte)
> return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
> }
>
>+/**
>+ * collapse_max_ptes_none - Calculate maximum allowed none-page or zero-page
>+ * PTEs for the given collapse operation.
>+ * @cc: The collapse control struct
>+ * @vma: The vma to check for userfaultfd
>+ *
>+ * Return: Maximum number of none-page or zero-page PTEs allowed for the
>+ * collapse operation.
>+ */
>+static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
>+ struct vm_area_struct *vma)
>+{
>+ // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
>+ if (vma && userfaultfd_armed(vma))
>+ return 0;
>+ // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
>+ if (!cc->is_khugepaged)
>+ return HPAGE_PMD_NR;
>+ // For all other cases repect the user defined maximum.
>+ return khugepaged_max_ptes_none;
Nit: kernel code usually uses C-style comments. This could be:
/* For all other cases, respect the user-defined maximum. */
Also, s/repect/respect/.
>+}
>+
>+/**
>+ * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
>+ * anonymous pages for the given collapse operation.
>+ * @cc: The collapse control struct
>+ *
>+ * Return: Maximum number of PTEs that map shared anonymous pages for the
>+ * collapse operation
>+ */
>+static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
>+{
>+ // for MADV_COLLAPSE, do not restrict the number of PTEs that map shared
>+ // anonymous pages.
Ditto.
>+ if (!cc->is_khugepaged)
>+ return HPAGE_PMD_NR;
>+ return khugepaged_max_ptes_shared;
>+}
>+
>+/**
>+ * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
>+ * maximum allowed non-present pagecache entries for the given collapse operation.
>+ * @cc: The collapse control struct
>+ *
>+ * Return: Maximum number of non-present PTEs or the maximum allowed non-present
>+ * pagecache entries for the collapse operation.
>+ */
>+static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
>+{
>+ // for MADV_COLLAPSE, do not restrict the number PTEs entries or
>+ // pagecache entries that are non-present.
Same here.
>+ if (!cc->is_khugepaged)
>+ return HPAGE_PMD_NR;
>+ return khugepaged_max_ptes_swap;
>+}
>+
> int hugepage_madvise(struct vm_area_struct *vma,
> vm_flags_t *vm_flags, int advice)
> {
>@@ -546,21 +602,19 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> pte_t *_pte;
> int none_or_zero = 0, shared = 0, referenced = 0;
> enum scan_result result = SCAN_FAIL;
>+ unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
>+ unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
Nit: could these be const, as David suggested earlier?
Nothing else jumped out at me. LGTM!
Reviewed-by: Lance Yang <lance.yang@linux.dev>
^ permalink raw reply
* Re: [PATCH v1 1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
From: Praveen Talari @ 2026-05-12 3:43 UTC (permalink / raw)
To: Mark Brown
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, linux-arm-msm, linux-spi,
MukeshKumarSavaliyamukesh.savaliya, AniketRandiveaniket.randive,
chandana.chiluveru, jyothi.seerapu
In-Reply-To: <agB8AgF3qVqDw60Z@sirena.co.uk>
Hi Mark,
On 10-05-2026 18:07, Mark Brown wrote:
> On Sat, May 09, 2026 at 07:37:26AM +0530, Praveen Talari wrote:
>
>> Could you also please review the changes made in spi.c ?
>> I would appreciate any feedback or suggestions you may have.
> Please just sumbmit normal patches instead of sending partial patches in
> reply to another thread unless something is really unclear.
>
>> @@ -1658,6 +1658,11 @@ static int spi_transfer_one_message(struct
>> spi_controller *ctlr,
>>
>> trace_spi_transfer_stop(msg, xfer);
>>
>> + if (spi_valid_txbuf(msg, xfer))
>> + trace_spi_tx_data(msg->spi, xfer->tx_buf,
>> xfer->len);
>> + if (spi_valid_rxbuf(msg, xfer))
>> + trace_spi_rx_data(msg->spi, xfer->rx_buf,
>> xfer->len);
> It feels like it'd be more helpful to log the transmit data before we do
> the send.
I can see that TX/RX data tracepoints are already present in the core
layer, so I will drop the TX/RX data tracepoints from this series. I
will update the patch set accordingly.
trace log from spi.c:
spi_transfer_start: spi16.0 00000000631b0da2 len=16
tx=[d8-07-1f-c4-b3-b3-07-d6-a8-7b-33-6f-7b-bb-ae-9b]
rx=[d8-07-1f-c4-b3-b3-07-d6-a8-7b-33-6f-7b-bb-ae-9b
Thanks,
Praveen Talari
^ permalink raw reply
* Re: [PATCH v19 0/7] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu @ 2026-05-12 0:54 UTC (permalink / raw)
To: Steven Rostedt
Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260511122943.41e204bc@gandalf.local.home>
On Mon, 11 May 2026 12:29:43 -0400
Steven Rostedt <rostedt@kernel.org> wrote:
> On Thu, 7 May 2026 13:14:16 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > > I'll test this some more, and make a proper patch.
> >
> > Ah, indeed. Thanks for fixing!
> >
> > BTW, shouldn't we unify common logic of those functions?
>
> Hmm, there's not much common between the two. One is a consuming read and
> the other is a non-consuming read that needs to test for a bunch of race
> conditions.
>
> If you see something that can be shared, I'm all for it.
Maybe we can introduce a common inline function to calculate
max_loop, or at least replacing "3" with a common macro.
Thank you,
>
> -- Steve
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCHv2] uprobes: Use flexible array for xol_area bitmap
From: Masami Hiramatsu @ 2026-05-12 0:48 UTC (permalink / raw)
To: Rosen Penev
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Masami Hiramatsu, Oleg Nesterov,
open list:PERFORMANCE EVENTS SUBSYSTEM, open list:UPROBES
In-Reply-To: <20260511225648.27886-1-rosenp@gmail.com>
On Mon, 11 May 2026 15:56:48 -0700
Rosen Penev <rosenp@gmail.com> wrote:
> The XOL slot bitmap has the same lifetime as struct xol_area, but it
> is currently allocated separately. That adds another allocation
> failure path and a matching cleanup branch without buying any extra
> flexibility.
>
> Store the bitmap as a flexible array member and allocate it together
> with the xol_area using kzalloc_flex(). The bitmap remains
> zero-initialized, while the allocation and error handling become
> simpler.
>
> Assisted-by: Codex:GPT-5.5
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
OK, this looks good to me.
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Thanks,
> ---
> v2: add missing kfree
> kernel/events/uprobes.c | 14 +++-----------
> 1 file changed, 3 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 4084e926e284..eba71700667e 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -108,7 +108,6 @@ static LIST_HEAD(delayed_uprobe_list);
> */
> struct xol_area {
> wait_queue_head_t wq; /* if all slots are busy */
> - unsigned long *bitmap; /* 0 = free slot */
>
> struct page *page;
> /*
> @@ -117,6 +116,7 @@ struct xol_area {
> * the vma go away, and we must handle that reasonably gracefully.
> */
> unsigned long vaddr; /* Page(s) of instruction slots */
> + unsigned long bitmap[]; /* 0 = free slot */
> };
>
> static void uprobe_warn(struct task_struct *t, const char *msg)
> @@ -1755,18 +1755,13 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
> struct xol_area *area;
> void *insns;
>
> - area = kzalloc_obj(*area);
> + area = kzalloc_flex(*area, bitmap, BITS_TO_LONGS(UINSNS_PER_PAGE));
> if (unlikely(!area))
> goto out;
>
> - area->bitmap = kcalloc(BITS_TO_LONGS(UINSNS_PER_PAGE), sizeof(long),
> - GFP_KERNEL);
> - if (!area->bitmap)
> - goto free_area;
> -
> area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> if (!area->page)
> - goto free_bitmap;
> + goto free_area;
>
> area->vaddr = vaddr;
> init_waitqueue_head(&area->wq);
> @@ -1779,8 +1774,6 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
> return area;
>
> __free_page(area->page);
> - free_bitmap:
> - kfree(area->bitmap);
> free_area:
> kfree(area);
> out:
> @@ -1831,7 +1824,6 @@ void uprobe_clear_state(struct mm_struct *mm)
> return;
>
> put_page(area->page);
> - kfree(area->bitmap);
> kfree(area);
> }
>
> --
> 2.54.0
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH 0/2] tools/bootconfig: render kernel.* subtree as a cmdline string
From: Masami Hiramatsu @ 2026-05-12 0:43 UTC (permalink / raw)
To: Breno Leitao
Cc: Andrew Morton, Masami Hiramatsu, linux-kernel, linux-trace-kernel,
paulmck, oss, kernel-team
In-Reply-To: <agIFZ0vSfedX7foN@gmail.com>
On Mon, 11 May 2026 09:38:25 -0700
Breno Leitao <leitao@debian.org> wrote:
> On Fri, May 08, 2026 at 02:56:41PM -0700, Andrew Morton wrote:
> > On Fri, 08 May 2026 06:55:02 -0700 Breno Leitao <leitao@debian.org> wrote:
> >
> > > Add a bootconfig -> kernel cmdline rendering capability shared between
> > > the kernel parser library and the userspace tools/bootconfig binary.
> > >
> > > The new userspace mode "tools/bootconfig -C <file>" walks a bootconfig
> > > file's "kernel" subtree and prints it as a flat, space-separated
> > > cmdline string suitable for direct use as (or appending to) a kernel
> > > command line.
> > >
> > > This series prepares tools/bootconfig and lib/bootconfig.c for an
> > > upcoming feature that lets the kernel build render an embedded
> > > bootconfig file's "kernel" subtree to a flat cmdline string and embed
> > > it in the kernel image.
> > >
> > > The follow-up series (sent separately) wires this into setup_arch() so
> > > early_param() handlers see values supplied via CONFIG_BOOT_CONFIG_EMBED_FILE,
> > > following Masami suggestion in [1]
> > >
> > > These two patches are pure groundwork. They add no kernel feature,
> > > change no runtime behavior, and are useful on their own (the new
> > > "tools/bootconfig -C" mode lets anyone render a .bootconfig file to
> > > a cmdline string from the shell).
> > >
> > > Landing them independently lets the follow-up series focus on the
> > > kernel-side plumbing without dragging the refactor and tool addition
> > > through the same review cycle.
> >
> > I'll assume that Masami will process this, although
> > `scripts/get_maintainer.pl lib/bootconfig.c' doesn't mention a git
> > tree.
> >
> > https://sashiko.dev/#/patchset/20260508-bootconfig_using_tools-v1-0-1132219aa773@debian.org
> > says a bunch of picky things which seem pretty ignorable to me. Your
> > call ;)
>
> Well, these are some warnings about not checking that the output was
> properly set. From my view, these are not new to this patch, it is just the
> pattern we see, which is fire-and-forget writes to stdout, which seems
> quite reasonable.
I confirmed there was no problem with UBSAN.
"NULL + 1" is undefined, yes, but "(char *)NULL + 1" is sane.
>
> If we decide to fix this, it would make more sense to do it file-wide
> instead of just in this patch code.
So no need to fix it. Let me pick this series.
Thanks!
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [bug report] bootconfig: init: Allow admin to use bootconfig for kernel command line
From: Masami Hiramatsu @ 2026-05-12 0:16 UTC (permalink / raw)
To: Dan Carpenter
Cc: kernel-janitors, Linux Trace Kernel, linux-kernel, Breno Leitao
In-Reply-To: <af4YTUrDM-ciyoa-@stanley.mountain>
Hi Dan,
Thanks for reporting. A similar problem is pointed by Sashiko [1].
[1] https://sashiko.dev/#/patchset/20260508-bootconfig_using_tools-v1-0-1132219aa773%40debian.org
On Fri, 8 May 2026 20:07:25 +0300
Dan Carpenter <error27@gmail.com> wrote:
> Hello Masami Hiramatsu,
>
> Commit 51887d03aca1 ("bootconfig: init: Allow admin to use bootconfig
> for kernel command line") from Jan 11, 2020 (linux-next), leads to
> the following Smatch static checker warning:
>
> init/main.c:368 xbc_snprint_cmdline()
> use scnprintf() instead of snprintf()
>
> init/main.c
> 331 static int __init xbc_snprint_cmdline(char *buf, size_t size,
> 332 struct xbc_node *root)
> 333 {
> 334 struct xbc_node *knode, *vnode;
> 335 char *end = buf + size;
> 336 const char *val, *q;
> 337 int ret;
> 338
> 339 xbc_node_for_each_key_value(root, knode, val) {
> 340 ret = xbc_node_compose_key_after(root, knode,
> 341 xbc_namebuf, XBC_KEYLEN_MAX);
> 342 if (ret < 0)
> 343 return ret;
> 344
> 345 vnode = xbc_node_get_child(knode);
> 346 if (!vnode) {
> 347 ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
> 348 if (ret < 0)
> 349 return ret;
> 350 buf += ret;
>
> In user space snprintf() can return negative, but in the kernel, no.
> It returns the number of bytes (not counting the NUL terminator) which
> would have been copied if there were enough space. So maybe you want
> to do something like:
>
> remain = rest(buf, end);
> ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
> if (ret >= remain)
> return -ENOSPC;
Actually, we need to query the length of required buffer size if buf == NULL
or the buffer size is not enough.
But as Sashiko pointed, I need to check it with UBSAN. (but I think,
even if @buf is NULL, the @buf is char *, thus it is safe to add some
value...)
>
> Or maybe you might want to use scnprintf() which returns the number of
> bytes actually copied. Otherwise bug ends up pointing to beyond the end
> of the buffer.
No, I need to calculate the required length of buffer.
Thank you,
>
> 351 continue;
> 352 }
> 353 xbc_array_for_each_value(vnode, val) {
> 354 /*
> 355 * For prettier and more readable /proc/cmdline, only
> 356 * quote the value when necessary, i.e. when it contains
> 357 * whitespace.
> 358 */
> 359 q = strpbrk(val, " \t\r\n") ? "\"" : "";
> 360 ret = snprintf(buf, rest(buf, end), "%s=%s%s%s ",
> ^^^^^^^^^^^^^^^
> Same.
>
> 361 xbc_namebuf, q, val, q);
> 362 if (ret < 0)
> 363 return ret;
> 364 buf += ret;
> 365 }
> 366 }
> 367
> --> 368 return buf - (end - size);
> 369 }
>
> This email is a free service from the Smatch-CI project [smatch.sf.net].
>
> regards,
> dan carpenter
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH 1/2] bootconfig: move xbc_snprint_cmdline() to lib/bootconfig.c
From: Masami Hiramatsu @ 2026-05-12 0:00 UTC (permalink / raw)
To: Breno Leitao
Cc: Andrew Morton, linux-kernel, linux-trace-kernel, paulmck, oss,
kernel-team
In-Reply-To: <20260508-bootconfig_using_tools-v1-1-1132219aa773@debian.org>
On Fri, 08 May 2026 06:55:03 -0700
Breno Leitao <leitao@debian.org> wrote:
> Move xbc_snprint_cmdline() from init/main.c to lib/bootconfig.c so the
> function (and its xbc_namebuf scratch buffer) becomes part of the shared
> parser library. tools/bootconfig already compiles lib/bootconfig.c
> directly, which lets a follow-up patch reuse the same renderer in the
> userspace tool to convert a bootconfig file into a flat cmdline string
> at build time.
>
> No functional change.
Yeah, this should be under lib/bootconfig.c
Thanks,
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> include/linux/bootconfig.h | 3 +++
> init/main.c | 45 -------------------------------------
> lib/bootconfig.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 59 insertions(+), 45 deletions(-)
>
> diff --git a/include/linux/bootconfig.h b/include/linux/bootconfig.h
> index 692a5acc2ffc4..1c7f3b74ffcf3 100644
> --- a/include/linux/bootconfig.h
> +++ b/include/linux/bootconfig.h
> @@ -265,6 +265,9 @@ static inline struct xbc_node * __init xbc_node_get_subkey(struct xbc_node *node
> int __init xbc_node_compose_key_after(struct xbc_node *root,
> struct xbc_node *node, char *buf, size_t size);
>
> +/* Render key/value pairs under @root as a flat cmdline string */
> +int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root);
> +
> /**
> * xbc_node_compose_key() - Compose full key string of the XBC node
> * @node: An XBC node.
> diff --git a/init/main.c b/init/main.c
> index 96f93bb06c490..e363232b428b4 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -324,51 +324,6 @@ static void * __init get_boot_config_from_initrd(size_t *_size)
>
> #ifdef CONFIG_BOOT_CONFIG
>
> -static char xbc_namebuf[XBC_KEYLEN_MAX] __initdata;
> -
> -#define rest(dst, end) ((end) > (dst) ? (end) - (dst) : 0)
> -
> -static int __init xbc_snprint_cmdline(char *buf, size_t size,
> - struct xbc_node *root)
> -{
> - struct xbc_node *knode, *vnode;
> - char *end = buf + size;
> - const char *val, *q;
> - int ret;
> -
> - xbc_node_for_each_key_value(root, knode, val) {
> - ret = xbc_node_compose_key_after(root, knode,
> - xbc_namebuf, XBC_KEYLEN_MAX);
> - if (ret < 0)
> - return ret;
> -
> - vnode = xbc_node_get_child(knode);
> - if (!vnode) {
> - ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
> - if (ret < 0)
> - return ret;
> - buf += ret;
> - continue;
> - }
> - xbc_array_for_each_value(vnode, val) {
> - /*
> - * For prettier and more readable /proc/cmdline, only
> - * quote the value when necessary, i.e. when it contains
> - * whitespace.
> - */
> - q = strpbrk(val, " \t\r\n") ? "\"" : "";
> - ret = snprintf(buf, rest(buf, end), "%s=%s%s%s ",
> - xbc_namebuf, q, val, q);
> - if (ret < 0)
> - return ret;
> - buf += ret;
> - }
> - }
> -
> - return buf - (end - size);
> -}
> -#undef rest
> -
> /* Make an extra command line under given key word */
> static char * __init xbc_make_cmdline(const char *key)
> {
> diff --git a/lib/bootconfig.c b/lib/bootconfig.c
> index c470b93d5dbc2..f445b7703fdd9 100644
> --- a/lib/bootconfig.c
> +++ b/lib/bootconfig.c
> @@ -408,6 +408,62 @@ const char * __init xbc_node_find_next_key_value(struct xbc_node *root,
> return ""; /* No value key */
> }
>
> +static char xbc_namebuf[XBC_KEYLEN_MAX] __initdata;
> +
> +#define rest(dst, end) ((end) > (dst) ? (end) - (dst) : 0)
> +
> +/**
> + * xbc_snprint_cmdline() - Render bootconfig keys under @root as a cmdline string
> + * @buf: Destination buffer (may be NULL when @size is 0 to query the length)
> + * @size: Size of @buf in bytes
> + * @root: Subtree root whose key=value pairs should be rendered
> + *
> + * Walk all key/value pairs under @root and emit them as a space-separated
> + * cmdline string into @buf. Values containing whitespace are quoted with
> + * double quotes. Returns the number of bytes that would be written if @buf
> + * were large enough (matching snprintf semantics), or a negative errno on
> + * failure.
> + */
> +int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
> +{
> + struct xbc_node *knode, *vnode;
> + char *end = buf + size;
> + const char *val, *q;
> + int ret;
> +
> + xbc_node_for_each_key_value(root, knode, val) {
> + ret = xbc_node_compose_key_after(root, knode,
> + xbc_namebuf, XBC_KEYLEN_MAX);
> + if (ret < 0)
> + return ret;
> +
> + vnode = xbc_node_get_child(knode);
> + if (!vnode) {
> + ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
> + if (ret < 0)
> + return ret;
> + buf += ret;
> + continue;
> + }
> + xbc_array_for_each_value(vnode, val) {
> + /*
> + * For prettier and more readable /proc/cmdline, only
> + * quote the value when necessary, i.e. when it contains
> + * whitespace.
> + */
> + q = strpbrk(val, " \t\r\n") ? "\"" : "";
> + ret = snprintf(buf, rest(buf, end), "%s=%s%s%s ",
> + xbc_namebuf, q, val, q);
> + if (ret < 0)
> + return ret;
> + buf += ret;
> + }
> + }
> +
> + return buf - (end - size);
> +}
> +#undef rest
> +
> /* XBC parse and tree build */
>
> static int __init xbc_init_node(struct xbc_node *node, char *data, uint16_t flag)
>
> --
> 2.53.0-Meta
>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH 2/2] tools/bootconfig: render kernel.* subtree as cmdline string with -C
From: Masami Hiramatsu @ 2026-05-12 0:00 UTC (permalink / raw)
To: Breno Leitao
Cc: Andrew Morton, linux-kernel, linux-trace-kernel, paulmck, oss,
kernel-team
In-Reply-To: <20260508-bootconfig_using_tools-v1-2-1132219aa773@debian.org>
On Fri, 08 May 2026 06:55:04 -0700
Breno Leitao <leitao@debian.org> wrote:
> Add a -C option that finds the "kernel" subtree of a bootconfig file
> and prints it as a flat, space-separated cmdline string by calling the
> shared xbc_snprint_cmdline() renderer. An empty or absent kernel.*
> subtree produces empty output and exits successfully.
>
> This lets the kernel build embed a bootconfig file as a plain cmdline
> string at build time, so embedded bootconfig values can reach
> parse_early_param() during architecture setup without parsing the
> bootconfig at runtime.
>
> The renderer is intentionally limited to the kernel.* subtree: that is
> the only thing the kernel build needs to embed; init.* and other
> subtrees keep going through the runtime parser.
>
> Example of this new mode:
> # cat /tmp/test.bconf
> kernel {
> foo = bar
> baz = "hello world"
> arr = 1, 2
> }
> init.foo = nope
>
> # ./tools/bootconfig/bootconfig -C /tmp/test.bconf
> foo=bar baz="hello world" arr=1 arr=2 %
>
Nice! Looks good to me. Let me pick it.
Thanks,
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> tools/bootconfig/main.c | 60 ++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 52 insertions(+), 8 deletions(-)
>
> diff --git a/tools/bootconfig/main.c b/tools/bootconfig/main.c
> index 643f707b8f1da..e1bfab044fbcb 100644
> --- a/tools/bootconfig/main.c
> +++ b/tools/bootconfig/main.c
> @@ -286,7 +286,41 @@ static int init_xbc_with_error(char *buf, int len)
> return ret;
> }
>
> -static int show_xbc(const char *path, bool list)
> +static int show_xbc_kernel_cmdline(void)
> +{
> + struct xbc_node *root;
> + char *buf = NULL;
> + int len, ret;
> +
> + root = xbc_find_node("kernel");
> + if (!root)
> + return 0; /* no kernel.* keys: emit empty output */
> +
> + len = xbc_snprint_cmdline(NULL, 0, root);
> + if (len < 0) {
> + pr_err("Failed to size cmdline output: %d\n", len);
> + return len;
> + }
> + if (len == 0)
> + return 0;
> +
> + buf = malloc(len + 1);
> + if (!buf)
> + return -ENOMEM;
> +
> + ret = xbc_snprint_cmdline(buf, len + 1, root);
> + if (ret < 0) {
> + pr_err("Failed to render cmdline output: %d\n", ret);
> + free(buf);
> + return ret;
> + }
> +
> + fputs(buf, stdout);
> + free(buf);
> + return 0;
> +}
> +
> +static int show_xbc(const char *path, bool list, bool render_cmdline)
> {
> int ret, fd;
> char *buf = NULL;
> @@ -322,11 +356,14 @@ static int show_xbc(const char *path, bool list)
> if (init_xbc_with_error(buf, ret) < 0)
> goto out;
> }
> - if (list)
> + if (render_cmdline)
> + ret = show_xbc_kernel_cmdline();
> + else if (list)
> xbc_show_list();
> else
> xbc_show_compact_tree();
> - ret = 0;
> + if (ret > 0)
> + ret = 0;
> out:
> free(buf);
>
> @@ -486,7 +523,10 @@ static int usage(void)
> " Options:\n"
> " -a <config>: Apply boot config to initrd\n"
> " -d : Delete boot config file from initrd\n"
> - " -l : list boot config in initrd or file\n\n"
> + " -l : list boot config in initrd or file\n"
> + " -C : render the kernel.* subtree as a flat cmdline\n"
> + " string (suitable for embedding in a kernel image)\n"
> + " and print it to stdout\n\n"
> " If no option is given, show the bootconfig in the given file.\n");
> return -1;
> }
> @@ -495,10 +535,11 @@ int main(int argc, char **argv)
> {
> char *path = NULL;
> char *apply = NULL;
> + bool render_cmdline = false;
> bool delete = false, list = false;
> int opt;
>
> - while ((opt = getopt(argc, argv, "hda:l")) != -1) {
> + while ((opt = getopt(argc, argv, "hda:lC")) != -1) {
> switch (opt) {
> case 'd':
> delete = true;
> @@ -509,14 +550,17 @@ int main(int argc, char **argv)
> case 'l':
> list = true;
> break;
> + case 'C':
> + render_cmdline = true;
> + break;
> case 'h':
> default:
> return usage();
> }
> }
>
> - if ((apply && delete) || (delete && list) || (apply && list)) {
> - pr_err("Error: You can give one of -a, -d or -l at once.\n");
> + if ((!!apply + !!delete + !!list + !!render_cmdline) > 1) {
> + pr_err("Error: You can give one of -a, -d, -l or -C at once.\n");
> return usage();
> }
>
> @@ -532,5 +576,5 @@ int main(int argc, char **argv)
> else if (delete)
> return delete_xbc(path);
>
> - return show_xbc(path, list);
> + return show_xbc(path, list, render_cmdline);
> }
>
> --
> 2.53.0-Meta
>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [RFC PATCH] trace: Introduce a new filter_pred "caller"
From: Masami Hiramatsu @ 2026-05-11 23:47 UTC (permalink / raw)
To: Chen Jun; +Cc: rostedt, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260508122623.74290-1-chenjun102@huawei.com>
On Fri, 8 May 2026 20:26:23 +0800
Chen Jun <chenjun102@huawei.com> wrote:
> Low-level functions have many call paths, and sometimes
> we only care about the calls on a specific call path.
> Add a new filter to filter based on the call stack.
>
> Usage:
> 1. echo 'caller=="$function_name"' > events/../filter
Thanks for interesting idea :)
BTW, we already have "stacktrace". Since this actually checks
stacktrace, not caller, so I think we should reuse it.
Also, I think OP_GLOB is more suitable for this case.
(and more useful)
Thank you,
>
> Only support OP_EQ and OP_NE
>
> Signed-off-by: Chen Jun <chenjun102@huawei.com>
> ---
> include/linux/trace_events.h | 1 +
> kernel/trace/trace.h | 3 ++-
> kernel/trace/trace_events.c | 1 +
> kernel/trace/trace_events_filter.c | 40 ++++++++++++++++++++++++++++--
> 4 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 40a43a4c7caf..1f109669a391 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -851,6 +851,7 @@ enum {
> FILTER_COMM,
> FILTER_CPU,
> FILTER_STACKTRACE,
> + FILTER_CALLER,
> };
>
> extern int trace_event_raw_init(struct trace_event_call *call);
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 80fe152af1dd..4e4b92ce264f 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -1825,7 +1825,8 @@ static inline bool is_string_field(struct ftrace_event_field *field)
> field->filter_type == FILTER_RDYN_STRING ||
> field->filter_type == FILTER_STATIC_STRING ||
> field->filter_type == FILTER_PTR_STRING ||
> - field->filter_type == FILTER_COMM;
> + field->filter_type == FILTER_COMM ||
> + field->filter_type == FILTER_CALLER;
> }
>
> static inline bool is_function_field(struct ftrace_event_field *field)
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index c46e623e7e0d..6d220d7eec73 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -199,6 +199,7 @@ static int trace_define_generic_fields(void)
> __generic_field(char *, comm, FILTER_COMM);
> __generic_field(char *, stacktrace, FILTER_STACKTRACE);
> __generic_field(char *, STACKTRACE, FILTER_STACKTRACE);
> + __generic_field(char *, caller, FILTER_CALLER);
>
> return ret;
> }
> diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
> index 609325f57942..1cf040065abe 100644
> --- a/kernel/trace/trace_events_filter.c
> +++ b/kernel/trace/trace_events_filter.c
> @@ -72,6 +72,7 @@ enum filter_pred_fn {
> FILTER_PRED_FN_CPUMASK,
> FILTER_PRED_FN_CPUMASK_CPU,
> FILTER_PRED_FN_FUNCTION,
> + FILTER_PRED_FN_CALLER,
> FILTER_PRED_FN_,
> FILTER_PRED_TEST_VISITED,
> };
> @@ -1009,6 +1010,21 @@ static int filter_pred_function(struct filter_pred *pred, void *event)
> return pred->op == OP_EQ ? ret : !ret;
> }
>
> +/* Filter predicate for caller. */
> +static int filter_pred_caller(struct filter_pred *pred, void *event)
> +{
> + unsigned long entries[32];
> + unsigned int nr_entries;
> + int i;
> +
> + nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
> + for (i = 0; i < nr_entries ; i++)
> + if (pred->val <= entries[i] && entries[i] < pred->val2)
> + return !pred->not;
> +
> + return pred->not;
> +}
> +
> /*
> * regex_match_foo - Basic regex callbacks
> *
> @@ -1617,6 +1633,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
> return filter_pred_cpumask_cpu(pred, event);
> case FILTER_PRED_FN_FUNCTION:
> return filter_pred_function(pred, event);
> + case FILTER_PRED_FN_CALLER:
> + return filter_pred_caller(pred, event);
> case FILTER_PRED_TEST_VISITED:
> return test_pred_visited_fn(pred, event);
> default:
> @@ -2002,10 +2020,28 @@ static int parse_pred(const char *str, void *data,
>
> } else if (field->filter_type == FILTER_DYN_STRING) {
> pred->fn_num = FILTER_PRED_FN_STRLOC;
> - } else if (field->filter_type == FILTER_RDYN_STRING)
> + } else if (field->filter_type == FILTER_RDYN_STRING) {
> pred->fn_num = FILTER_PRED_FN_STRRELLOC;
> - else {
> + } else if (field->filter_type == FILTER_CALLER) {
> + unsigned long caller;
> +
> + if (op == OP_GLOB)
> + goto err_free;
>
> + pred->fn_num = FILTER_PRED_FN_CALLER;
> + caller = kallsyms_lookup_name(pred->regex->pattern);
> + if (!caller) {
> + parse_error(pe, FILT_ERR_NO_FUNCTION, pos + i);
> + goto err_free;
> + }
> + /* Now find the function start and end address */
> + if (!kallsyms_lookup_size_offset(caller, &size, &offset)) {
> + parse_error(pe, FILT_ERR_NO_FUNCTION, pos + i);
> + goto err_free;
> + }
> + pred->val = caller - offset;
> + pred->val2 = pred->val + size;
> + } else {
> if (!ustring_per_cpu) {
> /* Once allocated, keep it around for good */
> ustring_per_cpu = alloc_percpu(struct ustring_buffer);
> --
> 2.22.0
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCHv2] uprobes: Use flexible array for xol_area bitmap
From: Rosen Penev @ 2026-05-11 22:56 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
Ian Rogers, Adrian Hunter, James Clark, Masami Hiramatsu,
Oleg Nesterov, open list:PERFORMANCE EVENTS SUBSYSTEM,
open list:UPROBES
The XOL slot bitmap has the same lifetime as struct xol_area, but it
is currently allocated separately. That adds another allocation
failure path and a matching cleanup branch without buying any extra
flexibility.
Store the bitmap as a flexible array member and allocate it together
with the xol_area using kzalloc_flex(). The bitmap remains
zero-initialized, while the allocation and error handling become
simpler.
Assisted-by: Codex:GPT-5.5
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
v2: add missing kfree
kernel/events/uprobes.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4084e926e284..eba71700667e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -108,7 +108,6 @@ static LIST_HEAD(delayed_uprobe_list);
*/
struct xol_area {
wait_queue_head_t wq; /* if all slots are busy */
- unsigned long *bitmap; /* 0 = free slot */
struct page *page;
/*
@@ -117,6 +116,7 @@ struct xol_area {
* the vma go away, and we must handle that reasonably gracefully.
*/
unsigned long vaddr; /* Page(s) of instruction slots */
+ unsigned long bitmap[]; /* 0 = free slot */
};
static void uprobe_warn(struct task_struct *t, const char *msg)
@@ -1755,18 +1755,13 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
struct xol_area *area;
void *insns;
- area = kzalloc_obj(*area);
+ area = kzalloc_flex(*area, bitmap, BITS_TO_LONGS(UINSNS_PER_PAGE));
if (unlikely(!area))
goto out;
- area->bitmap = kcalloc(BITS_TO_LONGS(UINSNS_PER_PAGE), sizeof(long),
- GFP_KERNEL);
- if (!area->bitmap)
- goto free_area;
-
area->page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
if (!area->page)
- goto free_bitmap;
+ goto free_area;
area->vaddr = vaddr;
init_waitqueue_head(&area->wq);
@@ -1779,8 +1774,6 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
return area;
__free_page(area->page);
- free_bitmap:
- kfree(area->bitmap);
free_area:
kfree(area);
out:
@@ -1831,7 +1824,6 @@ void uprobe_clear_state(struct mm_struct *mm)
return;
put_page(area->page);
- kfree(area->bitmap);
kfree(area);
}
--
2.54.0
^ permalink raw reply related
* [PATCH] rtla: Stop the record trace on interrupt
From: Crystal Wood @ 2026-05-11 22:35 UTC (permalink / raw)
To: Tomas Glozar
Cc: Steven Rostedt, linux-trace-kernel, John Kacur, Costa Shulyupin,
Wander Lairson Costa, Crystal Wood
Before, when rtla gets a signal, it stopped the main trace but not the
record trace. save_trace_to_file() could also fail to keep up on a debug
kernel -- and in any case, it adds post-stoppage noise to the trace file.
Signed-off-by: Crystal Wood <crwood@redhat.com>
---
tools/tracing/rtla/src/common.c | 19 +++++++++++--------
tools/tracing/rtla/src/common.h | 1 -
tools/tracing/rtla/src/timerlat.c | 2 +-
3 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
index 35e3d3aa922e..effad523e8cf 100644
--- a/tools/tracing/rtla/src/common.c
+++ b/tools/tracing/rtla/src/common.c
@@ -10,7 +10,7 @@
#include "common.h"
-struct trace_instance *trace_inst;
+struct osnoise_tool *trace_tool;
volatile int stop_tracing;
int nr_cpus;
@@ -21,12 +21,16 @@ static void stop_trace(int sig)
* Stop requested twice in a row; abort event processing and
* exit immediately
*/
- tracefs_iterate_stop(trace_inst->inst);
+ if (trace_tool)
+ tracefs_iterate_stop(trace_tool->trace.inst);
return;
}
stop_tracing = 1;
- if (trace_inst)
- trace_instance_stop(trace_inst);
+ if (trace_tool) {
+ trace_instance_stop(&trace_tool->trace);
+ if (trace_tool->record)
+ trace_instance_stop(&trace_tool->record->trace);
+ }
}
/*
@@ -273,11 +277,10 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
tool->params = params;
/*
- * Save trace instance into global variable so that SIGINT can stop
- * the timerlat tracer.
+ * Expose the tool to signal handlers so they can stop the trace.
* Otherwise, rtla could loop indefinitely when overloaded.
*/
- trace_inst = &tool->trace;
+ trace_tool = tool;
retval = ops->apply_config(tool);
if (retval) {
@@ -285,7 +288,7 @@ int run_tool(struct tool_ops *ops, int argc, char *argv[])
goto out_free;
}
- retval = enable_tracer_by_name(trace_inst->inst, ops->tracer);
+ retval = enable_tracer_by_name(tool->trace.inst, ops->tracer);
if (retval) {
err_msg("Failed to enable %s tracer\n", ops->tracer);
goto out_free;
diff --git a/tools/tracing/rtla/src/common.h b/tools/tracing/rtla/src/common.h
index 51665db4ffce..eba40b6d9504 100644
--- a/tools/tracing/rtla/src/common.h
+++ b/tools/tracing/rtla/src/common.h
@@ -54,7 +54,6 @@ struct osnoise_context {
int opt_workload;
};
-extern struct trace_instance *trace_inst;
extern volatile int stop_tracing;
struct hist_params {
diff --git a/tools/tracing/rtla/src/timerlat.c b/tools/tracing/rtla/src/timerlat.c
index f8c057518d22..637f68d684f5 100644
--- a/tools/tracing/rtla/src/timerlat.c
+++ b/tools/tracing/rtla/src/timerlat.c
@@ -202,7 +202,7 @@ void timerlat_analyze(struct osnoise_tool *tool, bool stopped)
* If the trace did not stop with --aa-only, at least print
* the max known latency.
*/
- max_lat = tracefs_instance_file_read(trace_inst->inst, "tracing_max_latency", NULL);
+ max_lat = tracefs_instance_file_read(tool->trace.inst, "tracing_max_latency", NULL);
if (max_lat) {
printf(" Max latency was %s\n", max_lat);
free(max_lat);
--
2.54.0
^ permalink raw reply related
* [PATCH RESEND] tracing/osnoise: Dump stack on timerlat uret threshold event
From: Crystal Wood @ 2026-05-11 22:31 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-trace-kernel, John Kacur, Tomas Glozar, Costa Shulyupin,
Wander Lairson Costa, Crystal Wood
Dump the saved IRQ stack trace regardless of whether the event was
THREAD_CONTEXT or THREAD_URET.
In the uret case, the latency presumably had not yet crossed the
threshold at IRQ time (or else it would have dumped the stack at thread
wakeup time, unless we're racing with a change to the threshold), but it
may have at least contributed -- and this is possible with THREAD_CONTEXT
as well.
In any case, it helps with writing reliable rtla tests if we always get
a stack trace on a threshold event.
Signed-off-by: Crystal Wood <crwood@redhat.com>
---
Original: https://lore.kernel.org/all/20251112152529.956778-3-crwood@redhat.com/
kernel/trace/trace_osnoise.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 75678053b21c..62c2667d97fa 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2544,9 +2544,12 @@ timerlat_fd_read(struct file *file, char __user *ubuf, size_t count,
notify_new_max_latency(diff);
tlat->tracing_thread = false;
- if (osnoise_data.stop_tracing_total)
- if (time_to_us(diff) >= osnoise_data.stop_tracing_total)
+ if (osnoise_data.stop_tracing_total) {
+ if (time_to_us(diff) >= osnoise_data.stop_tracing_total) {
+ timerlat_dump_stack(time_to_us(diff));
osnoise_stop_tracing();
+ }
+ }
} else {
tlat->tracing_thread = false;
tlat->kthread = current;
--
2.54.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox