Linux Trace Kernel


* [PATCH] mm/lruvec: trace LRU add drains and drain-all queuing
From: JP Kobryn @ 2026-06-09  4:11 UTC
  To: linux-mm, willy, shakeel.butt, usama.arif, akpm, vbabka, mhocko,
	rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
	axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park
  Cc: linux-kernel, linux-trace-kernel

LRU add batches can be drained before they reach capacity. This can be a
source of LRU lock contention, but it is not currently possible to
attribute these drains to callers with existing tracepoints.

Add mm_lru_add_drain to report the CPU and lru_add batch count when an
lru_add batch is drained. This allows tracing to distinguish full drains
from partial drains and attribute them to the calling stack.

Add mm_lru_drain_all_queue to report when lru_add_drain_all() queues
per-CPU drain work. This captures the requester stack and target CPU for
remote drain work. The event is named as a drain-all queue event because
the queued work can be needed for batches other than lru_add.

Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
---
 include/trace/events/pagemap.h | 40 ++++++++++++++++++++++++++++++++++
 mm/swap.c                      |  6 ++++-
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 171524d3526d..ea8fc46bedb0 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -77,6 +77,46 @@ TRACE_EVENT(mm_lru_activate,
 	TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
 );
 
+TRACE_EVENT(mm_lru_add_drain,
+
+	TP_PROTO(int cpu, unsigned int nr),
+
+	TP_ARGS(cpu, nr),
+
+	TP_STRUCT__entry(
+		__field(int,		cpu	)
+		__field(unsigned int,	nr	)
+	),
+
+	TP_fast_assign(
+		__entry->cpu	= cpu;
+		__entry->nr	= nr;
+	),
+
+	TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
+);
+
+TRACE_EVENT(mm_lru_drain_all_queue,
+
+	TP_PROTO(int target_cpu, bool force_all_cpus),
+
+	TP_ARGS(target_cpu, force_all_cpus),
+
+	TP_STRUCT__entry(
+		__field(int,	target_cpu	)
+		__field(bool,	force_all_cpus	)
+	),
+
+	TP_fast_assign(
+		__entry->target_cpu	= target_cpu;
+		__entry->force_all_cpus	= force_all_cpus;
+	),
+
+	TP_printk("target_cpu=%d force_all_cpus=%s",
+		__entry->target_cpu,
+		__entry->force_all_cpus ? "true" : "false")
+);
+
 #endif /* _TRACE_PAGEMAP_H */
 
 /* This part must be outside protection */
diff --git a/mm/swap.c b/mm/swap.c
index 588f50d8f1a8..c385b93582eb 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
 {
 	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
 	struct folio_batch *fbatch = &fbatches->lru_add;
+	unsigned int nr_folios_add = folio_batch_count(fbatch);
 
-	if (folio_batch_count(fbatch))
+	if (nr_folios_add) {
 		folio_batch_move_lru(fbatch, lru_add);
+		trace_mm_lru_add_drain(cpu, nr_folios_add);
+	}
 
 	fbatch = &fbatches->lru_move_tail;
 	/* Disabling interrupts below acts as a compiler barrier. */
@@ -928,6 +931,7 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
 		if (cpu_needs_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, mm_percpu_wq, work);
+			trace_mm_lru_drain_all_queue(cpu, force_all_cpus);
 			__cpumask_set_cpu(cpu, &has_work);
 		}
 	}
-- 
2.54.0
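
To make the "full versus partial drain" distinction in the commit message concrete, here is a minimal userspace sketch that parses the payload emitted by the new mm_lru_add_drain event (the "cpu=%d nr=%u" format comes from the patch's TP_printk) and classifies the drain. The toy-style helper names and the batch capacity of 31 are assumptions made for illustration; the patch itself does not state the capacity, and other conditions in the kernel may also trigger an early drain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumption: a folio batch holds 31 entries (PAGEVEC_SIZE in recent kernels). */
#define ASSUMED_BATCH_CAPACITY 31u

/* Fields reported by the mm_lru_add_drain tracepoint in the patch. */
struct lru_add_drain_event {
	int cpu;
	unsigned int nr;
};

/*
 * Parse the event payload printed by TP_printk("cpu=%d nr=%u", ...).
 * Returns true only if both fields were parsed.
 */
static bool parse_lru_add_drain(const char *payload,
				struct lru_add_drain_event *ev)
{
	assert(payload && ev);
	return sscanf(payload, "cpu=%d nr=%u", &ev->cpu, &ev->nr) == 2;
}

/* A drain is "partial" when the batch was flushed before it filled up. */
static bool is_partial_drain(const struct lru_add_drain_event *ev)
{
	return ev->nr < ASSUMED_BATCH_CAPACITY;
}
```

A post-processing tool could feed each trace line's payload to parse_lru_add_drain() and count partial drains per CPU or per stack.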



* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Miaohe Lin @ 2026-06-09  2:39 UTC
  To: Breno Leitao, David Hildenbrand (Arm)
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang, Andrew Morton,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <aibN4osY_QF1Rejh@gmail.com>

On 2026/6/8 22:15, Breno Leitao wrote:
> On Fri, Jun 05, 2026 at 11:42:53AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/5/26 11:35, Breno Leitao wrote:
>>> On Wed, Jun 03, 2026 at 10:33:04AM +0800, Miaohe Lin wrote:
>>>> On 2026/6/2 17:41, David Hildenbrand (Arm) wrote:
>>>>>
>>>>> Races are fine. We might miss some pages, but that can happen on races either way.
>>>>>
>>>>>
>>>>> I'd just do something like
>>>>>
>>>>> if (PageReserved(page))
>>>>> 	return true;
>>>>>
>>>>> head = compound_head(page);
>>>>
>>>> What if @head is split just after compound_head(), and then @head is freed into buddy and re-allocated
>>>> as a slab page while @page is still in the buddy? We would panic in this scenario because @head is
>>>> PageSlab, but we were supposed to handle @page successfully. Or am I missing something?
>>>
>>> You're right that it is racy, but I think it is an acceptable race here.
>>>
>>
>> I mean, any such races can currently already happen one way or the other?
>>
>> Really, the only way to not get races is to tryget the (compound) page and
>> then revalidate that the page is still part of the compound page.
>>
>> I'm not sure if that's really a good idea.
>>
>> But my memory is a bit vague in which scenarios we already hold a page reference
>> here to prevent any concurrent freeing?
> 
> No, we don't hold one here in the case that matters.
> 
> HWPoisonKernelOwned() runs at the very top of get_any_page(), before
> try_again: and before __get_hwpoison_page(). The first refcount taken in
> the whole path is the folio_try_get() inside __get_hwpoison_page(), which
> runs *after* the short-circuit.
> 
> So get_any_page() itself never holds a reference at the check -- the only way
> one exists is if the caller passed MF_COUNT_INCREASED (count_increased ==
> true).
> 
> So on the MCE/GHES path -- the one this panic option exists for -- no
> reference is held when HWPoisonKernelOwned() does its compound_head() +
> PageSlab()/PageTable()/PageLargeKmalloc() checks.
> 
> Given that, I'd rather keep it racy and take no refcount than add a
> tryget + revalidate purely for this check. As I've said earlier, an operator

Would it be acceptable to add a simple recheck? Something like below:

retry:
head = compound_head(page);
PageSlab()/PageTable()/PageLargeKmalloc() checks
if (head != compound_head(page))
	goto retry
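
As an illustration only, the recheck idea above can be modelled in userspace C with a toy page structure. Everything here (struct toy_page, toy_compound_head(), toy_kernel_owned()) is hypothetical and is not the kernel's struct page or the HWPoisonKernelOwned() helper. As noted earlier in the thread, a recheck like this narrows the window but does not close the race; only taking a reference and revalidating would.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; NOT the kernel layout. */
struct toy_page {
	_Atomic(struct toy_page *) head;	/* compound head (self if not a tail) */
	atomic_bool kernel_owned;		/* stands in for PageSlab()/PageTable()/... */
};

static struct toy_page *toy_compound_head(struct toy_page *page)
{
	return atomic_load(&page->head);
}

/*
 * Read the head, check its type, then confirm the head did not change
 * while we were looking. Retry if it did.
 */
static bool toy_kernel_owned(struct toy_page *page)
{
	struct toy_page *head;
	bool owned;

	assert(page);
	do {
		head = toy_compound_head(page);
		owned = atomic_load(&head->kernel_owned);
	} while (head != toy_compound_head(page));

	return owned;
}
```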

Thanks both.
.

> who enabled it has chosen to crash rather than run on corrupted memory;
> mis-attributing one such rare, genuinely-poisoned page is within that contract.
> .
> 



* Re: [PATCH mm-unstable v19 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lance Yang @ 2026-06-09  1:52 UTC
  To: david
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260608162612.27122-1-lance.yang@linux.dev>



On 2026/6/9 00:26, Lance Yang wrote:
> 
> On Mon, Jun 08, 2026 at 04:56:37PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/6/26 12:28, Lance Yang wrote:
>>>
>>> On Fri, Jun 05, 2026 at 10:14:18AM -0600, Nico Pache wrote:
>>>> Enable khugepaged to collapse to mTHP orders. This patch implements the
>>>> main scanning logic using a bitmap to track occupied pages and the
>>>> algorithm to find optimal collapse sizes.
>>>>
>>>> Prior to this patch, PMD collapse had 3 main phases: a lightweight
>>>> scanning phase (mmap_read_lock) that determines a potential PMD
>>>> collapse, an alloc phase (mmap unlocked), and finally a heavier collapse
>>>> phase (mmap_write_lock).
>>>>
>>>> To enable mTHP collapse we make the following changes:
>>>>
>>>> During PMD scan phase, track occupied pages in a bitmap. When mTHP
>>>> orders are enabled, we remove the restriction of max_ptes_none during the
>>>> scan phase to avoid missing potential mTHP collapse candidates. Once we
>>>> have scanned the full PMD range and updated the bitmap to track occupied
>>>> pages, we use the bitmap to find the optimal mTHP size.
>>>>
>>>> Implement mthp_collapse() to walk forward through the bitmap and
>>>> determine the best eligible order for each naturally-aligned region. The
>>>> algorithm starts at the beginning of the PMD range and, for each offset,
>>>> tries the highest order that fits the alignment. If the number of
>>>> occupied PTEs in that region satisfies the max_ptes_none threshold for
>>>> that order, a collapse is attempted. On failure, the order is
>>>> decremented and the same offset is retried at the next smaller size. Once
>>>> the smallest enabled order is exhausted (or a collapse succeeds), the
>>>> offset advances past the region just processed, and the next attempt
>>>> starts at the highest order permitted by the new offset's natural
>>>> alignment.
>>>>
>>>> The algorithm works as follows:
>>>>     1) set offset=0 and order=HPAGE_PMD_ORDER
>>>>     2) if the order is not enabled, go to step (5)
>>>>     3) count occupied PTEs in the (offset, order) range using
>>>>        bitmap_weight_from()
>>>>     4) if the count satisfies the max_ptes_none threshold, attempt
>>>>        collapse; on success, advance to step (6)
>>>>     5) if a smaller enabled order exists, decrement order and retry
>>>>        from step (2) at the same offset
>>>>     6) advance offset past the current region and compute the next
>>>>        order from the new offset's natural alignment via __ffs(offset),
>>>>        capped at HPAGE_PMD_ORDER
>>>>     7) repeat from step (2) until the full PMD range is covered
>>>>
>>>> mTHP collapses reject regions containing swapped out or shared pages.
>>>> This is because adding new entries can lead to new none pages, and these
>>>> may lead to constant promotion into a higher order mTHP. A similar
>>>> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
>>>> introducing at least 2x the number of pages, and on a future scan will
>>>> satisfy the promotion condition once again. This issue is prevented via
>>>> the collapse_max_ptes_none() function which imposes the max_ptes_none
>>>> restrictions above.
>>>>
>>>> We currently only support mTHP collapse for max_ptes_none values of 0
>>>> and HPAGE_PMD_NR - 1, resulting in the following behavior:
>>>>
>>>>     - max_ptes_none=0: Never introduce new empty pages during collapse
>>>>     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>>>>       available mTHP order
>>>>
>>>> Any other max_ptes_none value will emit a warning and default mTHP
>>>> collapse to max_ptes_none=0. There should be no behavior change for PMD
>>>> collapse.
>>>>
>>>> Once we determine which mTHP sizes fit best in that PMD range, a collapse
>>>> is attempted. A minimum collapse order of 2 is used as this is the lowest
>>>> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>>>>
>>>> Currently madv_collapse is not supported and will only attempt PMD
>>>> collapse.
>>>>
>>>> We can also remove the check for is_khugepaged inside the PMD scan as
>>>> the collapse_max_ptes_none() function handles this logic now.
>>>>
>>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>>> ---
>>>> mm/khugepaged.c | 146 +++++++++++++++++++++++++++++++++++++++++++++---
>>>> 1 file changed, 138 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index ec886a031952..430047316f43 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -99,6 +99,8 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>>>>
>>>> static struct kmem_cache *mm_slot_cache __ro_after_init;
>>>>
>>>> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
>>>> +
>>>> struct collapse_control {
>>>> 	bool is_khugepaged;
>>>>
>>>> @@ -110,6 +112,9 @@ struct collapse_control {
>>>>
>>>> 	/* nodemask for allocation fallback */
>>>> 	nodemask_t alloc_nmask;
>>>> +
>>>> +	/* Each bit represents a single occupied (!none/zero) page. */
>>>> +	DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
>>>> };
>>>>
>>>> /**
>>>> @@ -1440,20 +1445,130 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>>>> 	return result;
>>>> }
>>>>
>>>> +/* Return the highest naturally aligned order that fits at @offset within a PMD. */
>>>> +static unsigned int max_order_from_offset(unsigned int offset)
>>>> +{
>>>> +	if (offset == 0)
>>>> +		return HPAGE_PMD_ORDER;
>>>> +
>>>> +	return min_t(unsigned int, __ffs(offset), HPAGE_PMD_ORDER);
>>>> +}
>>>> +
>>>> +/*
>>>> + * mthp_collapse() consumes the bitmap that is generated during
>>>> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
>>>> + *
>>>> + * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
>>>> + * page. We start at the PMD order and check if it is eligible for collapse;
>>>> + * if not, we check the left and right halves of the PTE page table we are
>>>> + * examining at a lower order.
>>>> + *
>>>> + * For each of these, we determine how many PTE entries are occupied in the
>>>> + * range of PTE entries we propose to collapse, then we compare this to a
>>>> + * threshold number of PTE entries which would need to be occupied for a
>>>> + * collapse to be permitted at that order (accounting for max_ptes_none).
>>>> + *
>>>> + * If a collapse is permitted, we attempt to collapse the PTE range into a
>>>> + * mTHP.
>>>> + */
>>>> +static enum scan_result mthp_collapse(struct mm_struct *mm,
>>>> +		unsigned long address, int referenced, int unmapped,
>>>> +		struct collapse_control *cc, unsigned long enabled_orders)
>>>> +{
>>>> +	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
>>>> +	enum scan_result last_result = SCAN_FAIL;
>>>> +	int collapsed = 0;
>>>> +	bool alloc_failed = false;
>>>> +	unsigned long collapse_address;
>>>> +	unsigned int offset = 0;
>>>> +	unsigned int order = HPAGE_PMD_ORDER;
>>>> +
>>>> +	while (offset < HPAGE_PMD_NR) {
>>>> +		nr_ptes = 1UL << order;
>>>> +
>>>> +		if (!test_bit(order, &enabled_orders))
>>>> +			goto next_order;
>>>> +
>>>> +		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
>>>> +		nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
>>>> +						      offset + nr_ptes);
>>>> +
>>>> +		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>>>
>>> Looks broken for swap PTEs in PMD collapse ...
>>>
>>> collapse_scan_pmd() allows them up to max_ptes_swap and records them in
>>> unmapped, but they don't get a bit in mthp_present_ptes. And then
>>> mthp_collapse() does the check above:
>>
>> Right. I assumed this is implicitly handled by the optimization in collapse_scan_pmd:
>>
>> 	if (enabled_orders != BIT(HPAGE_PMD_ORDER))
>> 		max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>>
>> But we perform the check a second time.
> 
> Note that once lower orders are enabled, the scan *relaxes* max_ptes_none
> only so it can cover the whole PMD and build the bitmap ...
> 
>>>
>>> nr_occupied_ptes >= nr_ptes - max_ptes_none
>>>
>>> So max_ptes_none=0 + 511 present PTEs + one allowed swap PTE won't even
>>> call collapse_huge_page() for PMD order.
>>>
>>> Shouldn't we account for them in the PMD-order check? Something like:
>>>
>>> if (is_pmd_order(order))
>>> 	nr_occupied_ptes += unmapped;
>> As an alternative, we could either 1) skip the check there for
>> pmd order (as the check was already done); or 2) introduce+maintain
> 
> Yeah, skipping the check would do the trick, since isolate will check
> max_ptes_none again later :)

In addition, that later check is rather late: we may have already
allocated the folio and swapped in pages before isolate rejects
the range :)

>> a bitmap that tracks non-present PTEs.
>>
>> @@ -1475,7 +1477,9 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
>>                 nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
>>                                                       offset + nr_ptes);
>>
>> -               if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {

So I'd still slightly prefer keeping this check and accounting
unmapped for PMD order.

if (is_pmd_order(order))
	nr_occupied_ptes += unmapped;
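
The arithmetic behind this proposal can be checked with a small userspace sketch. The function name, the boolean switch, and the PMD order of 9 (4 KiB pages, 512 PTEs per PMD) are assumptions for illustration; only the comparison itself is taken from the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_HPAGE_PMD_ORDER 9u	/* assumption: 512 PTEs per PMD */

/*
 * Mirror of the quoted check "nr_occupied_ptes >= nr_ptes - max_ptes_none",
 * optionally counting swap (unmapped) PTEs as occupied at PMD order.
 */
static bool toy_collapse_eligible(unsigned int order, unsigned int nr_present,
				  unsigned int nr_unmapped,
				  unsigned int max_ptes_none,
				  bool count_unmapped_for_pmd)
{
	unsigned int nr_ptes = 1u << order;
	unsigned int nr_occupied = nr_present;

	assert(order <= TOY_HPAGE_PMD_ORDER);
	assert(max_ptes_none <= nr_ptes);

	if (count_unmapped_for_pmd && order == TOY_HPAGE_PMD_ORDER)
		nr_occupied += nr_unmapped;

	return nr_occupied >= nr_ptes - max_ptes_none;
}
```

With max_ptes_none=0, 511 present PTEs and one swap PTE, the check fails without the adjustment (511 < 512) and passes with it (512 >= 512), which is the case described above.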


>> +               /* Check was already done in the caller. */
> 
> This check is not quite redundant for PMD order, though. It avoids
> entering collapse_huge_page() for a range that already exceeds
> max_ptes_none for that order.
> 
>> +               if (is_pmd_order(order) ||
>> +                   nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>>                         enum scan_result ret;
>>
>>                         collapse_address = address + offset * PAGE_SIZE;
>>
>> 2) would probably be cleanest long-term.
> 
> Yeah, Agreed.
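
For readers following the walk described in steps 1) to 7) of the quoted commit message, here is a simplified userspace simulation. It is a sketch under stated assumptions, not the patch's code: a bool array stands in for the bitmap, every eligible region is assumed to collapse successfully (so the "decrement on collapse failure" path is not modelled), and the per-order threshold is either 0 or nr_ptes - 1 depending on a flag, which only approximates collapse_max_ptes_none(). All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PMD_ORDER	9u			/* assumption: 512 PTEs per PMD */
#define TOY_PMD_NR	(1u << TOY_PMD_ORDER)
#define TOY_MIN_ORDER	2u			/* KHUGEPAGED_MIN_MTHP_ORDER in the patch */

struct toy_region {
	unsigned int offset;
	unsigned int order;
};

/* Highest naturally aligned order at @offset, capped at the PMD order. */
static unsigned int toy_max_order_from_offset(unsigned int offset)
{
	unsigned int order = 0;

	if (offset == 0)
		return TOY_PMD_ORDER;
	while (!(offset & 1u)) {	/* count trailing zeros, like __ffs() */
		offset >>= 1;
		order++;
	}
	return order < TOY_PMD_ORDER ? order : TOY_PMD_ORDER;
}

/* Count occupied PTEs in [start, end). */
static unsigned int toy_weight(const bool *present, unsigned int start,
			       unsigned int end)
{
	unsigned int i, n = 0;

	for (i = start; i < end; i++)
		n += present[i];
	return n;
}

/* True if some enabled order exists in [TOY_MIN_ORDER, order). */
static bool toy_has_smaller_enabled(unsigned long enabled, unsigned int order)
{
	unsigned int o;

	for (o = TOY_MIN_ORDER; o < order; o++)
		if (enabled & (1ul << o))
			return true;
	return false;
}

/* Walk the PMD range; store chosen (offset, order) pairs and return their count. */
static size_t toy_mthp_walk(const bool *present, unsigned long enabled_orders,
			    bool allow_none, struct toy_region *out,
			    size_t max_out)
{
	unsigned int offset = 0, order = TOY_PMD_ORDER;
	size_t n = 0;

	assert(present && out);
	while (offset < TOY_PMD_NR) {
		unsigned int nr_ptes = 1u << order;
		unsigned int max_none = allow_none ? nr_ptes - 1 : 0;

		if ((enabled_orders & (1ul << order)) &&
		    toy_weight(present, offset, offset + nr_ptes) >=
		    nr_ptes - max_none) {
			/* Eligible: the simulation treats this as a successful collapse. */
			if (n < max_out)
				out[n] = (struct toy_region){ offset, order };
			n++;
		} else if (toy_has_smaller_enabled(enabled_orders, order)) {
			order--;	/* retry the same offset at a smaller size */
			continue;
		}
		/* Advance past the region and pick the order from the new alignment. */
		offset += nr_ptes;
		order = toy_max_order_from_offset(offset);
	}
	return n;
}
```

For example, with orders 4 and 2 enabled, the strict threshold, and PTEs 0-15 and 32-35 occupied, the walk selects an order-4 region at offset 0 and an order-2 region at offset 32.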



* Re: [PATCH v3 0/6] bootconfig: embed kernel.* cmdline at build time
From: Masami Hiramatsu @ 2026-06-09  1:46 UTC
  To: Breno Leitao
  Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, kernel-team
In-Reply-To: <20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org>

On Mon, 08 Jun 2026 09:23:57 -0700
Breno Leitao <leitao@debian.org> wrote:

> The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
> already landed; this series wires the rendered cmdline into the kernel.
> 
> Motivation: today the embedded bootconfig is parsed at runtime, after
> parse_early_param() has already run, so early_param() handlers can't
> see embedded values. Folding the kernel.* subtree into the cmdline at
> build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
> users without forcing them to maintain two cmdline sources.
> 
> Behaviorally, the "kernel" subtree is rendered to a flat string at
> build time and stashed in .init.rodata. setup_arch() prepends it to
> boot_command_line before parse_early_param() runs. Overflow is a soft
> error: the helper logs and leaves boot_command_line untouched rather
> than panicking, so an oversized embedded bconf cannot brick a boot.
> 
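
The "soft error on overflow" behavior described in the quoted cover letter can be sketched in userspace as follows. This is not the series' xbc_prepend_embedded_cmdline(), whose signature is not shown here; toy_prepend_cmdline() and its parameters are hypothetical, and the real helper also logs the failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Prepend @embedded and a separating space to @cmdline, which lives in a
 * buffer of @size bytes. On overflow, leave @cmdline untouched and return
 * false instead of truncating or aborting.
 */
static bool toy_prepend_cmdline(char *cmdline, size_t size, const char *embedded)
{
	size_t elen, clen;

	assert(cmdline && embedded);
	elen = strlen(embedded);
	clen = strlen(cmdline);

	if (elen == 0)
		return true;		/* nothing to prepend */
	if (elen + 1 + clen + 1 > size)
		return false;		/* soft error: keep cmdline as-is */

	/* Shift the existing string (including its NUL) to make room. */
	memmove(cmdline + elen + 1, cmdline, clen + 1);
	memcpy(cmdline, embedded, elen);
	cmdline[elen] = ' ';
	return true;
}
```

Because the size check happens before any byte is moved, a failed prepend leaves the original command line fully intact.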

Sashiko still leaves some comments. 
https://sashiko.dev/#/patchset/20260608-bootconfig_using_tools-v3-0-4ddd079a0696%40debian.org

BTW, can you also update the document (Documentation/admin-guide/bootconfig.rst)
to describe the expected behavior of this feature (kconfigs, examples)?

Thank you,


> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> Changes in v3:
> - Patch 3: Move HOSTCC override to the kernel-side rule; tool keeps
>   $(CC) for standalone/cross builds.
> - Patch 6: Drop the false fail-safe wording; document the
>   BOOT_CONFIG_FORCE=y default interaction.
> - Link to v2:
>   https://lore.kernel.org/r/20260605-bootconfig_using_tools-v2-0-d309f544b5f7@debian.org
> 
> Changes in v2 (addressing review of v1):
> - Split out a standalone fix for the NULL-pointer arithmetic in
>   xbc_snprint_cmdline() so the build-time render cannot trip host
>   UBSan/FORTIFY_SOURCE.
> - Rework the leaf-root handling: instead of returning early, skip @root
>   inside the loop so a root carrying both a value and subkeys
>   (kernel = x together with kernel.foo = bar) still renders its
>   descendant keys.
> - Build tools/bootconfig with $(HOSTCC) so cross-compiled (ARCH=...)
>   builds render the cmdline on the build host instead of failing with
>   "Exec format error".
> - Mark the embedded cmdline section read-only (drop the "w" flag from
>   .init.rodata).
> - Add a make-clean hook so tools/bootconfig artifacts are removed by
>   make clean.
> - Gate the x86 prepend on "bootconfig" being present on the command
>   line (or CONFIG_BOOT_CONFIG_FORCE), matching the init.* opt-in
>   semantics documented in bootconfig.rst and preserving fail-safe
>   recovery: dropping "bootconfig" from the bootloader cmdline now also
>   disables the embedded kernel.* keys.
> - Link to v1: https://patch.msgid.link/20260527-bootconfig_using_tools-v1-0-b6906a86e7d5@debian.org
> 
> ---
> Breno Leitao (6):
>       bootconfig: fix NULL-pointer arithmetic in xbc_snprint_cmdline()
>       bootconfig: render descendant keys when xbc_snprint_cmdline() root has a value
>       bootconfig: render embedded bootconfig as a kernel cmdline at build time
>       bootconfig: clean build-time tools/bootconfig from make clean
>       bootconfig: add xbc_prepend_embedded_cmdline() helper
>       x86/setup: prepend embedded bootconfig cmdline before parse_early_param
> 
>  MAINTAINERS                |   1 +
>  Makefile                   |  24 +++++++++-
>  arch/x86/Kconfig           |   1 +
>  arch/x86/kernel/setup.c    |  16 +++++++
>  include/linux/bootconfig.h |   9 ++++
>  init/Kconfig               |  36 +++++++++++++++
>  init/main.c                |  25 ++++++++--
>  lib/Makefile               |  16 +++++++
>  lib/bootconfig.c           | 112 ++++++++++++++++++++++++++++++++++++++++++---
>  lib/embedded-cmdline.S     |  16 +++++++
>  tools/bootconfig/Makefile  |   4 +-
>  11 files changed, 247 insertions(+), 13 deletions(-)
> ---
> base-commit: e7e28506af98ce4e1059e5ec59334b335c00a246
> change-id: 20260508-bootconfig_using_tools-cfa7aa9d6a5a
> 
> Best regards,
> -- 
> Breno Leitao <leitao@debian.org>
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [RFC PATCH 0/7] tracing/probes: Add more typecast features
From: Masami Hiramatsu @ 2026-06-09  1:42 UTC
  To: Masami Hiramatsu (Google)
  Cc: Steven Rostedt, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178092865666.163648.10457567771536160909.stgit@devnote2>

Hi,

On Mon,  8 Jun 2026 23:24:16 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> Hi,
> 
> Here is a series of patches to introduce more typecast features
> to probe events, which includes 1. expanding BTF typecast to
> fprobe and kprobe events, 2. introducing container_of-like typecast
> option, 3. supporting nested typecast, 4. adding $current special
> variable support, 5. adding per-cpu dereference support, 6. adding
> a testcase to check typecasts.

Sashiko found many issues on this series. I'll fix those.

BTW, for $current, I think it should be cast to task_struct automatically,
without an explicit typecast (unless it is a container_of()-type typecast).
In this case, ctx->struct_btf will be updated automatically if $current
is specified.

Also, I'm thinking of redesigning +CPU/+PCPU/this_cpu_ptr().
Instead of those, what about introducing the following?

 - this_cpu_read(VAR)
 - this_cpu_ptr(VAR)

Comments are welcome!

Thanks,

> 
> Steve introduced the BTF typecast feature for eprobe[1].
> This series extends it and adds more options:
> 
> 1. Expanding BTF typecast to kprobe and fprobe.
>    (currently only function entry/exit)
> 
> 2. Introduce a container_of-like typecast. This adds an "assigned
>    member" option to the typecast.
> 
>    (STRUCT,MEMBER)VAR->ANOTHER_MEMBER
> 
>    This casts VAR to STRUCT type, treating VAR as the address
>    of STRUCT.MEMBER. In C, it is:
> 
>    container_of(VAR, STRUCT, MEMBER)->ANOTHER_MEMBER
> 
> 3. Support nested typecast, e.g.
> 
>    (STRUCT)((STRUCT2)VAR->MEMBER2)->MEMBER
> 
>    The nesting level must be smaller than 3.
> 
> 4. Add $current variable to point to the "current" task_struct.
>    This is useful with typecast, e.g.
> 
>    (task_struct)$current->pid
> 
> 5. per-cpu dereference support.
> 
>    +CPU(VAR) is the same as this_cpu_read(VAR), and
>    +PCPU(VAR) is the same as this_cpu_ptr(VAR).
>    Also, "this_cpu_ptr(VAR)" is available. This works well
>    with nested expressions.
> 
>    (STRUCT)(this_cpu_ptr(VAR))->MEMBER
> 
>    (However, it might be better to allow a special way to omit
>     parentheses for this_cpu_ptr())
> 
> Also add a test script to test some of them.
> 
> [1] https://lore.kernel.org/all/20260601130746.2139d926@gandalf.local.home/
> 
> 
> ---
> 
> Masami Hiramatsu (Google) (7):
>       tracing/probes: Support typecast for various probe events
>       tracing/probes: Support nested typecast
>       tracing/probes: Support field specifier option for typecast
>       tracing/probes: Add $current variable support
>       tracing/probes: Add +CPU() and +PCPU() dereference method to fetcharg
>       tracing/probes: Support reserved this_cpu_ptr() method
>       tracing/probes: Add a new testcase for BTF typecasts
> 
> 
>  Documentation/trace/eprobetrace.rst                |   11 +
>  Documentation/trace/fprobetrace.rst                |   11 +
>  Documentation/trace/kprobetrace.rst                |   12 +
>  kernel/trace/trace.c                               |    6 
>  kernel/trace/trace_probe.c                         |  312 +++++++++++++++-----
>  kernel/trace/trace_probe.h                         |   12 +
>  kernel/trace/trace_probe_tmpl.h                    |   33 ++
>  samples/trace_events/trace-events-sample.c         |   38 ++
>  samples/trace_events/trace-events-sample.h         |   34 ++
>  .../ftrace/test.d/dynevent/btf_probe_event.tc      |   52 +++
>  10 files changed, 422 insertions(+), 99 deletions(-)
>  create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
> 
> --
> Signature
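
For readers less familiar with the idiom behind item 2 of the quoted cover letter, this userspace sketch shows what container_of(VAR, STRUCT, MEMBER)->ANOTHER_MEMBER computes. The toy_container_of() macro is a simplified equivalent of the kernel macro without its type checks, and the toy_task / toy_list_node types are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace equivalent of the kernel's container_of(). */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_list_node {
	struct toy_list_node *next;
};

/* Hypothetical structure embedding a list node, like many kernel objects do. */
struct toy_task {
	int pid;
	struct toy_list_node node;
};

/*
 * Given only a pointer to the embedded member, recover the enclosing
 * structure and read another member from it.
 */
static int toy_pid_from_node(struct toy_list_node *var)
{
	assert(var);
	return toy_container_of(var, struct toy_task, node)->pid;
}
```

In the proposed probe syntax, the same access would be written along the lines of (toy_task,node)VAR->pid.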


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH v2 6/6] x86/setup: prepend embedded bootconfig cmdline before parse_early_param
From: Masami Hiramatsu @ 2026-06-09  1:34 UTC
  To: Breno Leitao
  Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, kernel-team
In-Reply-To: <aibSj4XeRWJmasCx@gmail.com>

On Mon, 8 Jun 2026 07:41:05 -0700
Breno Leitao <leitao@debian.org> wrote:

> On Mon, Jun 08, 2026 at 07:19:28PM +0900, Masami Hiramatsu wrote:
> > On Fri, 05 Jun 2026 05:03:37 -0700
> > Breno Leitao <leitao@debian.org> wrote:
> > 
> > > Call xbc_prepend_embedded_cmdline() in setup_arch() right after the
> > > CONFIG_CMDLINE merge and before strscpy(command_line, ...) so the
> > > build-time-rendered embedded bootconfig "kernel" subtree is part of
> > > boot_command_line by the time parse_early_param() runs. early_param()
> > > handlers (mem=, earlycon=, loglevel=, ...) now see values supplied via
> > > CONFIG_BOOT_CONFIG_EMBED_FILE without parsing bootconfig at runtime.
> > > 
> > > Gate the prepend on the bootconfig opt-in: only fold in the embedded
> > > kernel.* keys when "bootconfig" is present on the command line, or
> > > CONFIG_BOOT_CONFIG_FORCE is set. Applying the embedded cmdline
> > > unconditionally would (a) diverge from how embedded init.* keys are
> > > treated and (b) break fail-safe recovery: a malformed embedded
> > > console=/mem= could panic the boot with no way for the admin to disable
> > > it by dropping "bootconfig" from the bootloader cmdline.
> > > cmdline_find_option_bool() runs before parse_early_param(), so the gate
> > > is cheap and correctly ordered.
> > > 
> > > Select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG so the user-visible
> > > CONFIG_BOOT_CONFIG_EMBED_CMDLINE option becomes selectable on x86.
> > 
> > > This seems like a dummy config. What code depends on this flag?
> 
> No C code reads ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG directly — it's
> a silent gating symbol, the same ARCH_SUPPORTS_* idiom as
> ARCH_SUPPORTS_CFI, ARCH_SUPPORTS_LTO_CLANG, etc.
> 
> Its only role is the depends on line of BOOT_CONFIG_EMBED_CMDLINE: an
> arch selects it once its setup_arch() calls
> xbc_prepend_embedded_cmdline(), and that makes the user-visible
> BOOT_CONFIG_EMBED_CMDLINE selectable.
> 
> Right now, only x86 supports embedded bootconfig, thus, only x86 does
> the following (last patch):
> 
> 	config X86
> 	+       select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
> 
> So, no other platform can see CONFIG_BOOT_CONFIG_EMBED_CMDLINE.

Ah, OK. I missed 3/6, which defines the dependency.

Thanks!

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>
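
The opt-in gate described in the quoted commit message (apply the embedded kernel.* keys only when "bootconfig" is on the command line or the force option is set) can be sketched in userspace like this. The helpers below are hypothetical and only approximate what a whole-word lookup such as cmdline_find_option_bool() does; the force flag stands in for CONFIG_BOOT_CONFIG_FORCE.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Return true if @option appears in @cmdline as a whole, space-delimited word. */
static bool toy_cmdline_has_option(const char *cmdline, const char *option)
{
	size_t olen;
	const char *p;

	assert(cmdline && option);
	olen = strlen(option);
	if (olen == 0)
		return false;

	for (p = cmdline; (p = strstr(p, option)) != NULL; p++) {
		bool starts_word = p == cmdline || isspace((unsigned char)p[-1]);
		bool ends_word = p[olen] == '\0' || isspace((unsigned char)p[olen]);

		if (starts_word && ends_word)
			return true;
	}
	return false;
}

/* Gate: prepend only on explicit opt-in or when forced at build time. */
static bool toy_should_prepend(const char *cmdline, bool force)
{
	return force || toy_cmdline_has_option(cmdline, "bootconfig");
}
```

Dropping "bootconfig" from the bootloader command line makes toy_should_prepend() return false unless forced, which models the recovery path the commit message wants to preserve.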


* Re: [PATCH] samples/ftrace: reject zero ftrace-ops call count
From: Steven Rostedt @ 2026-06-09  1:03 UTC
  To: Samuel Moelius
  Cc: Masami Hiramatsu, Mark Rutland, open list:FUNCTION HOOKS (FTRACE),
	open list:FUNCTION HOOKS (FTRACE)
In-Reply-To: <20260609004422.1239735.f6c06011af32.ftrace-ops-zero-function-calls-div0@trailofbits.com>

On Tue,  9 Jun 2026 00:44:23 +0000
Samuel Moelius <sam.moelius@trailofbits.com> wrote:

> The ftrace-ops sample exposes nr_function_calls as a module parameter
> and uses it as the divisor when printing the measured time per call.
> Loading the module with nr_function_calls=0 skips the benchmark loop and
> then divides the elapsed time by zero, crashing the kernel during sample
> module initialization.

This change is rather pointless, but whatever.

> 
> Reject a zero call count before registering any ftrace ops.
> 
> Assisted-by: Codex:gpt-5.5-cyber-preview
> Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
> ---
>  samples/ftrace/ftrace-ops.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/samples/ftrace/ftrace-ops.c b/samples/ftrace/ftrace-ops.c
> index 68d6685c80bd..d5adaa61484f 100644
> --- a/samples/ftrace/ftrace-ops.c
> +++ b/samples/ftrace/ftrace-ops.c
> @@ -190,6 +190,11 @@ static int __init ftrace_ops_sample_init(void)
>  		tracer_irrelevant = ops_func_count;
>  	}
>  
> +	if (!nr_function_calls) {
> +		pr_err("nr_function_calls must be non-zero\n");
> +		return -EINVAL;

No need to print that the admin did something stupid.

> +	}
> +
>  	pr_info("registering:\n"
>  		"  relevant ops: %u\n"
>  		"    tracee: %ps\n"

In fact, I would just change the output to be:

        pr_info("Attempted %u calls to %ps in %lluns (%lluns / call)\n",
                nr_function_calls, tracee_relevant,
                period, nr_function_calls ? div_u64(period, nr_function_calls) : -1LL);

and have garbage in, garbage out.
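
A userspace sketch of that guarded division, assuming plain 64-bit division in place of div_u64() and a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Time per call in nanoseconds. A zero call count yields UINT64_MAX, which
 * is what -1LL becomes when printed with an unsigned 64-bit format.
 */
static uint64_t toy_ns_per_call(uint64_t period_ns, unsigned int nr_calls)
{
	return nr_calls ? period_ns / nr_calls : UINT64_MAX;
}
```

The zero case then prints an obviously bogus per-call time instead of faulting on a division by zero.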

-- Steve





* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Tengda Wu @ 2026-06-09  0:59 UTC
  To: Masami Hiramatsu, Peter Zijlstra, bpf
  Cc: Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel, Josh Poimboeuf, jikos, mbenes,
	pmladek
In-Reply-To: <20260608220811.d4a0b58961cfb9eeb6bbbccb@kernel.org>



On 2026/6/8 21:08, Masami Hiramatsu wrote:
> On Mon, 8 Jun 2026 12:23:26 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Mon, Jun 08, 2026 at 11:34:49AM +0200, Peter Zijlstra wrote:
>>>
>>> +Live patching folks
>>>
>>> On Mon, Jun 08, 2026 at 09:52:37AM +0800, Tengda Wu wrote:
>>>
>>>> Background: We are verifying the support of live patches for functions that
>>>> have a kretprobe. The specific verification method is as follows:
>>>>
>>>> We construct a function foo() that calls bar():
>>>>
>>>> void bar(void)
>>>> {
>>>>     for (;;) {
>>>>         schedule();
>>>>     }
>>>> }
>>>>
>>>> void foo(void)
>>>> {
>>>>     bar();
>>>> }
>>>>
>>>> A kretprobe is attached to bar():
>>>>
>>>> echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
>>>> echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
>>>>
>>>> Then foo() is triggered. The expected behavior is that bar() will call
>>>> schedule() and yield the CPU.
>>>>
>>>> After that, the live patch is activated to attempt replacing the implementation
>>>> of foo(). The expectation is that this should succeed.
>>>
>>> This wholly depends on how foo() calls bar(), if it is a normal call,
>>> then no, it should not succeed, because foo() is still on the stack.
>>>
>>> If it is a tail-call, then yes, because foo() is no longer relevant.
>>>
>>>> However, in reality, because the task that called schedule() is still in the
>>>> RUNNING state,
>>>
>>> So calling schedule() without setting state is dodgy in the first place.
>>> Who is doing this? All wait primitives will set this to
>>> TASK_UNINTERRUPTIBLE or something along those lines.
>>>
>>>> the condition task_is_running(tsk) inside rethook_find_ret_addr()
>>>> is not satisfied, causing the function to return early. This, in turn,
>>>> prevents stack_trace_save_tsk_reliable() from determining the stack as
>>>> reliable, leading to a failure in activating the live patch.
>>>>
>>>> **Not sure if this is correct:**
>>>>
>>>> We believe that after a task voluntarily calls schedule(), when the stack
>>>> is expected to be reliable, it is a safe time to activate a live patch.
>>>
>>> Calling schedule() without setting state is a no-op and really shouldn't
>>> count much at all.
>>>
>>>> Additionally, a similar tsk->on_cpu check can be found elsewhere in the
>>>> kernel (See task_on_another_cpu() in arch/x86/include/asm/unwind.h).
>>>> Therefore, we propose changing the task_is_running(tsk) condition to
>>>> tsk->on_cpu.
>>>
>>> Anyway, I'm wondering what the purpose of this check here is, there is
>>> no real comment, and commit 5120d167e21c ("rethook: Remove warning
>>> messages printed for finding return address of a frame.") is just pure
>>> voodoo as well.
>>
>> FWIW, you should have had this discussion then.
> 
> Indeed. The rethook is making a shadow stack by list, thus caller must
> guarantee the target process is blocked at least during this function.
> 
> The commit messages suggest that when BPF takes a backtrace, it also
> includes other running tasks. Is that safe?
> 
>>
>>> Also, note the comment that goes with the usage of
>>> task_on_another_cpu(); that thing is racy as all heck.
>>>
>>> So it really comes down to what the purpose of this check is.
> 
> This check was introduced when the code was copied from
> kretprobe_find_ret_addr(). It has the comment:
> 
>  * The @tsk must be 'current' or a task which is not running. @fp is a hint
> 
> IIRC, I added this check to explicitly verify this condition.
> 
>>>
>>> I suspect the issue at hand is that tsk->rethook elements, such as
>>> iterated by __rethook_find_ret_addr() are not safe to be accessed for a
>>> running task.
>>>
>>> Notably while rethook_recycle() has some RCU thing on, that objpool
>>> thing (and the recycle name itself) seems to strongly suggest iterating
>>> these things is not sound (you could start with things from this task,
>>> hit a recycled entry and continue iterating rethooks from another task).
>>>
>>> Also note that the current check is also racy, nothing really prevents a
>>> wakeup from happening right after you observe task_is_running() being
>>> false. The task can then get scheduled in on another CPU and tear down
>>> its rethooks concurrent with __rethook_find_ret_addr().
> 
> Yeah, but is there any way to ensure the task is blocked? Even if it is
> blocked, like TASK_UNINTERRUPTIBLE, unless holding the actual lock in
> the rethook, it may not be possible to ensure it?
> 
> Of course, we could give up on checking within this function and leave
> everything to the caller to guarantee - as kretprobe does.
> 
> BTW, the reason why we made it possible to pass tasks other than current
>  is that the stack unwinding code itself supported unwinding tasks other
> than current, so we had no choice but to create this interface.
> 
> However, it is a bad idea to check this in deep inside of unwinding.
> 
>>> Now, livepatch itself calls unwind from a proper context, but unwinds in
>>> general are not. This rethook stuff doesn't seem to be sound in general.
>>
>> I suspect just entirely removing the check is the sanest option at this
>> point. Callers that do it right (livepatch) are guaranteed consistent
>> data, and the rest gets whatever pieces.
> 
> Agreed.
> 
>>
>> Notably, unwind_next() holds rcu, so the iteration is protected from any
>> of those rethook_node things getting freed. Its just that the iteration
>> can go sideways and you might not get a sane answer.
>>
>> The very worst possible option is getting stuck in an infinite loop when
>> concurrent with aggressive rethook re-use or something daft like that,
>> but that seems extremely unlikely.
> 
> 
> OK, thanks for your review!
> 
> Tengda, can you send a patch to just remove the check?
> 
> Thank you,
> 

Sure, the patch to remove the check has been sent.

-- Tengda


^ permalink raw reply

* [PATCH v2] rethook: Remove the running task check in rethook_find_ret_addr()
From: Tengda Wu @ 2026-06-09  0:57 UTC (permalink / raw)
  To: Masami Hiramatsu, Peter Zijlstra
  Cc: Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel, Tengda Wu

The current check in rethook_find_ret_addr() prevents obtaining a return
address when the target task is marked as running. However, this condition
is both insufficient for safety and unnecessary for its intended purpose.

The check is inherently racy: a task can begin running on another CPU
immediately after task_is_running() returns false, potentially leading to
concurrent modification of rethook data structures while the iteration is
in progress.

Rather than attempting to fix this unreliable check deep in the unwinding
path, remove it entirely. Callers that require consistency are expected
to provide a safe context.

Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
---
v2: Remove the running task check.
v1: https://lore.kernel.org/all/20260525132253.1889726-1-wutengda@huaweicloud.com/

 kernel/trace/rethook.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index 5a8bdf88999a..f70f11bc6c91 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -250,9 +250,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
 	if (WARN_ON_ONCE(!cur))
 		return 0;

-	if (tsk != current && task_is_running(tsk))
-		return 0;
-
 	do {
 		ret = __rethook_find_ret_addr(tsk, cur);
 		if (!ret)
-- 
2.34.1

^ permalink raw reply related

* [PATCH] samples/ftrace: reject zero ftrace-ops call count
From: Samuel Moelius @ 2026-06-09  0:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Samuel Moelius, Masami Hiramatsu, Mark Rutland,
	open list:FUNCTION HOOKS (FTRACE),
	open list:FUNCTION HOOKS (FTRACE)

The ftrace-ops sample exposes nr_function_calls as a module parameter
and uses it as the divisor when printing the measured time per call.
Loading the module with nr_function_calls=0 skips the benchmark loop and
then divides the elapsed time by zero, crashing the kernel during sample
module initialization.

Reject a zero call count before registering any ftrace ops.

Assisted-by: Codex:gpt-5.5-cyber-preview
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
---
 samples/ftrace/ftrace-ops.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/samples/ftrace/ftrace-ops.c b/samples/ftrace/ftrace-ops.c
index 68d6685c80bd..d5adaa61484f 100644
--- a/samples/ftrace/ftrace-ops.c
+++ b/samples/ftrace/ftrace-ops.c
@@ -190,6 +190,11 @@ static int __init ftrace_ops_sample_init(void)
 		tracer_irrelevant = ops_func_count;
 	}
 
+	if (!nr_function_calls) {
+		pr_err("nr_function_calls must be non-zero\n");
+		return -EINVAL;
+	}
+
 	pr_info("registering:\n"
 		"  relevant ops: %u\n"
 		"    tracee: %ps\n"
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCHv4 00/13] uprobes/x86: Fix red zone issue for optimized uprobes
From: Andrii Nakryiko @ 2026-06-08 20:48 UTC (permalink / raw)
  To: Jiri Olsa, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu
  Cc: Oleg Nesterov, Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <aiEiP54zktDqAZpG@krava>

On Wed, Jun 3, 2026 at 11:59 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Tue, May 26, 2026 at 10:58:27PM +0200, Jiri Olsa wrote:
> > hi,
> > Andrii reported an issue with optimized uprobes [1] that can clobber
> > redzone area with call instruction storing return address on stack
> > where user code may keep temporary data without adjusting rsp.
> >
> > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > instruction, so we can squeeze another instruction to escape the
> > redzone area before doing the call.
> >
> > Note we need upstream update first for patch 3 (github.com/libbpf/usdt),
> > if we decide to take this change.
> >
> > thanks,
> > jirka
> >
> >
> > v1: https://lore.kernel.org/bpf/20260514135342.22130-1-jolsa@kernel.org/
> > v2: https://lore.kernel.org/bpf/20260518105957.123445-1-jolsa@kernel.org/
> > v3: https://lore.kernel.org/bpf/20260521124411.31133-1-jolsa@kernel.org/
> >
> > v4 changes:
> > - do not use 2nd int3 (at +5 offset) because the call instruction
> >   is always the same for the given nop10 address [Andrii/Peter]
> > - unmap unused trampoline vma after unsuccessful optimization [sashiko]
> > - small change to patch#2 moved user_64bit_mode earlier in the path
> >   and pass/use mm_struct pointer directly from arch_uprobe_optimize
> >   instead of getting current->mm
> >   Andrii, keeping your ack, please shout otherwise
>
> hi,
> I think bots did not find anything substantial, I have just small
> selftests changes queued for v5
>
> any other feedback/review would be great
>

one small nit only, otherwise LGTM.

Peter, Masami, Ingo, should this go through tip tree or should we
route this through bpf-next tree? I think we are fine either way, but
might be more convenient to route through bpf-next given libbpf and
BPF selftest changes.

If so, I'd appreciate another look at first 5 patches by Peter, if
that's ok. Thanks!



> thanks,
> jirka
>
>
> >
> > v3 changes:
> > - use nop10 update suggested by Peter in [2]
> > - remove struct uprobe_trampoline object, use vma objects directly instead
> > - selftests fixes [sashiko]
> > - ack from Andrii
> >
> > v2 changes:
> > - several selftest fixes [sashiko]
> > - consolidate is_lea_insn and is_call_insn into a single check [Jakub Sitnicki]
> > - use proper mm_struct object in __in_uprobe_trampoline check [sashiko]
> > - allow to copy uprobe trampolines vma objects on fork [sashiko]
> > - change uprobe syscall detection error from -ENXIO to -EPROTO [Andrii]
> > - added fork/clone tests
> > - I kept the selftest changes and nop5->nop10 changes in separate
> >   commits for easier review, we can squash them later if we want to keep
> >   bisect working properly
> >
> >
> > [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > [2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
> > ---
> > Andrii Nakryiko (1):
> >       selftests/bpf: Add tests for uprobe nop10 red zone clobbering
> >
> > Jiri Olsa (12):
> >       uprobes/x86: Use proper mm_struct in __in_uprobe_trampoline
> >       uprobes/x86: Remove struct uprobe_trampoline object
> >       uprobes/x86: Allow to copy uprobe trampolines on fork
> >       uprobes/x86: Unmap trampoline vma object in case it's unused
> >       uprobes/x86: Move optimized uprobe from nop5 to nop10
> >       libbpf: Change has_nop_combo to work on top of nop10
> >       libbpf: Detect uprobe syscall with new error
> >       selftests/bpf: Emit nop,nop10 instructions combo for x86_64 arch
> >       selftests/bpf: Change uprobe syscall tests to use nop10
> >       selftests/bpf: Change uprobe/usdt trigger bench code to use nop10
> >       selftests/bpf: Add reattach tests for uprobe syscall
> >       selftests/bpf: Add tests for forked/cloned optimized uprobes
> >
> >  arch/x86/kernel/uprobes.c                               | 379 +++++++++++++++++++++++++++++++++++++++++++-----------------------------
> >  include/linux/uprobes.h                                 |   5 -
> >  kernel/events/uprobes.c                                 |  10 --
> >  kernel/fork.c                                           |   1 -
> >  tools/lib/bpf/features.c                                |   4 +-
> >  tools/lib/bpf/usdt.c                                    |  16 +--
> >  tools/testing/selftests/bpf/bench.c                     |  20 ++--
> >  tools/testing/selftests/bpf/benchs/bench_trigger.c      |  38 ++++----
> >  tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh |   2 +-
> >  tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 307 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> >  tools/testing/selftests/bpf/prog_tests/usdt.c           |  74 ++++++++++++--
> >  tools/testing/selftests/bpf/progs/test_usdt.c           |  25 +++++
> >  tools/testing/selftests/bpf/usdt.h                      |   2 +-
> >  tools/testing/selftests/bpf/usdt_2.c                    |  15 ++-
> >  14 files changed, 653 insertions(+), 245 deletions(-)

^ permalink raw reply

* Re: [PATCHv4 05/13] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Andrii Nakryiko @ 2026-06-08 20:46 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <20260526205840.173790-6-jolsa@kernel.org>

On Tue, May 26, 2026 at 1:59 PM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Andrii reported an issue with optimized uprobes [1] that can clobber
> redzone area with call instruction storing return address on stack
> where user code may keep temporary data without adjusting rsp.
>
> Fixing this by moving the optimized uprobes on top of 10-bytes nop
> instruction, so we can squeeze another instruction to escape the
> redzone area before doing the call, like:
>
>   lea -0x80(%rsp), %rsp
>   call tramp
>
> Note the lea instruction is used to adjust the rsp register without
> changing the flags.
>
> We use nop10 and following transformation to optimized instructions
> above and back as suggested by Peterz [2].
>
> Optimize path (int3_update_optimize):
>
>   1) Initial state after set_swbp() installed the uprobe:
>       cc 2e 0f 1f 84 00 00 00 00 00
>
>      From offset 0 this is INT3 followed by the tail of the original
>      10-byte NOP.
>
>      After a previous unoptimization bytes 5..9 may still contain the
>      old call instruction, which remains valid for threads already there.
>
>   2) Rewrite the LEA tail and call displacement:
>       cc [8d 64 24 80 e8 d0 d1 d2 d3]
>
>      From offset 0 this traps on the uprobe INT3.  Bytes 1..9 are not
>      executable entry points while byte 0 is trapped.
>
>   3) Publish the first LEA byte:
>       [48] 8d 64 24 80 e8 d0 d1 d2 d3
>
>      From offset 0 this is:
>         lea -0x80(%rsp), %rsp
>         call <uprobe-trampoline>
>
> Unoptimize path (int3_update_unoptimize):
>
>   1) Initial optimized state:
>       48 8d 64 24 80 e8 d0 d1 d2 d3
>      Same as 3) above.
>
>   2) Trap new entries before restoring the NOP bytes:
>       [cc] 8d 64 24 80 e8 d0 d1 d2 d3
>
>      From offset 0 this traps. A thread that had already executed the
>      LEA can still reach the intact CALL at offset 5.
>
>   3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
>      and byte 5 as CALL.
>       cc [2e 0f 1f 84] e8 d0 d1 d2 d3
>
>      From offset 0 this still traps. Offset 5 is still the CALL for any
>      thread that was already past the first LEA byte.
>
>   4) Publish the first byte of the original NOP:
>       [66] 2e 0f 1f 84 e8 d0 d1 d2 d3
>
>      From offset 0 this is the restored 10-byte NOP; the CALL opcode and
>      displacement are now only NOP operands.  Offset 5 still decodes as
>      CALL for a thread that was already there.
>
>      There is only a single target uprobe-trampoline for the given nop10
>      instruction address, so the CALL instruction will not be changed across
>      unoptimization/optimization cycles.
>      Therefore, any task that is preempted at the CALL instruction is guaranteed
>      to observe that CALL and not anything else.
>
> Note as explained in [2] we need to use following nop10:
>        PF1   PF2   ESC   NOPL  MOD   SIB   DISP32
> NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
>
> which means we need to allow 0x2e prefix which maps to INAT_PFX_CS
> attribute in is_prefix_bad function.
>
> Also changing the uprobe syscall error when called out of uprobe
> trampoline to -EPROTO, so we are able to detect the fixed kernel.
>
> The optimized uprobe performance stays the same:
>
>         uprobe-nop     :    3.129 ± 0.013M/s
>         uprobe-push    :    3.045 ± 0.006M/s
>         uprobe-ret     :    1.095 ± 0.004M/s
>   -->   uprobe-nop10   :    7.170 ± 0.020M/s
>         uretprobe-nop  :    2.143 ± 0.021M/s
>         uretprobe-push :    2.090 ± 0.000M/s
>         uretprobe-ret  :    0.942 ± 0.000M/s
>   -->   uretprobe-nop10:    3.381 ± 0.003M/s
>         usdt-nop       :    3.245 ± 0.004M/s
>   -->   usdt-nop10     :    7.256 ± 0.023M/s
>
> [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> [2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
> Reported-by: Andrii Nakryiko <andrii@kernel.org>
> Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> Assisted-by: Codex:GPT-5.5
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  arch/x86/kernel/uprobes.c | 255 ++++++++++++++++++++++++++++----------
>  1 file changed, 190 insertions(+), 65 deletions(-)
>

[...]

> @@ -943,13 +1026,31 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
>         smp_text_poke_sync_each_cpu();
>
>         /*
> -        * Write first byte.
> +        * 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> +        *    and byte 5 as CALL:
> +        *    cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> +        */
> +       ctx.expect = EXPECT_SWBP_OPTIMIZED;
> +       err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1,
> +                          LEA_INSN_SIZE - 1, verify_insn,
> +                          true /* is_register */, false /* do_update_ref_ctr */,

tbh, it's quite subtle and non-obvious why is_register should be set
to true first two times (and especially that is_register and
do_update_ref_ctr are implicitly connected), not sure how to make it
cleaner, but maybe leave a short comment explaining this twice
register, once unregister sequence?

> +                          &ctx);
> +       if (err)
> +               return err;
> +
> +       smp_text_poke_sync_each_cpu();

[...]

^ permalink raw reply

* Re: [PATCH 1/2] ring-buffer: Fix event length with forced 8-byte alignment
From: Steven Rostedt @ 2026-06-08 16:52 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Hui Wang, mathieu.desnoyers, pjw, linux-trace-kernel, shuah,
	wangfushuai, linux-kselftest
In-Reply-To: <20260608180245.09e083867a7d4d96058d7323@kernel.org>

On Mon, 8 Jun 2026 18:02:45 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> On Sun,  7 Jun 2026 15:24:30 +0800
> Hui Wang <hui.wang@canonical.com> wrote:
> 
> > When RB_FORCE_8BYTE_ALIGNMENT is true, rb_calculate_event_length()
> > reserves the space of event->array[0] for placing the data length and
> > rb_update_event() stores the data length in event->array[0]
> > accordingly. As a result the whole event length will add extra 4 bytes
> > for sizeof(event.array[0]) unconditionally.
> > 
> > But ring_buffer_event_length() only subtracts the
> > sizeof(event->array[0]) for events larger than RB_MAX_SMALL_DATA +
> > sizeof(event->array[0]). As a result, small events on architectures
> > with RB_FORCE_8BYTE_ALIGNMENT=true report a data length that is 4
> > bytes larger than expected.
> > 
> > To fix it, add the RB_FORCE_8BYTE_ALIGNMENT as a condition to subtract
> > the size of that length field whenever RB_FORCE_8BYTE_ALIGNMENT is
> > true.
> > 
> > This issue is observed in a riscv64 kernel with
> > CONFIG_HAVE_64BIT_ALIGNED_ACCESS set to y, when we run ftrace selftest
> > trace_marker_raw.tc, we get the weird log: for cases where the id is
> > 1..100, the number of data field is 8*N, but once id exceeds 100, the
> > number of data field becomes 8*N+4:
> >  # 1 buf: 58 00 00 00 80 5e d1 63 (number of data field is 8*1)
> >  ...
> >  # a buf: 58 ...                  (number of data field is 8*2)
> >  ...
> >  # 64 buf: 58 ...                 (number of data field is 8*13)
> >  # 65 buf: 58 ...                 (number of data field is 8*13+4)
> > 
> > After applying this change, the number of data field keeps being 8*N+4
> > consistently.
> >   
> 
> Good catch!
> 
> This looks good to me.
> 
> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

This is the patch I meant to reply to.

NACK as the test is broken and not the kernel.

There's a pending fix already:

  https://lore.kernel.org/all/20260601023251.1916483-1-dtcccc@linux.alibaba.com/


-- Steve

^ permalink raw reply

* Re: [PATCH 2/2] selftests/ftrace: Account for 8-byte aligned trace_marker_raw events
From: Steven Rostedt @ 2026-06-08 16:50 UTC (permalink / raw)
  To: Hui Wang
  Cc: mhiramat, mathieu.desnoyers, pjw, linux-trace-kernel, shuah,
	wangfushuai, linux-kselftest
In-Reply-To: <20260607072431.125633-3-hui.wang@canonical.com>

On Sun,  7 Jun 2026 15:24:31 +0800
Hui Wang <hui.wang@canonical.com> wrote:

> trace_marker_raw.tc assumes that the raw marker payload length
> reported in trace_pipe is the result of int((id + 3) / 4) * 4, but
> that is not true on kernels with CONFIG_HAVE_64BIT_ALIGNED_ACCESS
> enabled.
> 
> With forced 8-byte alignment, the ring buffer event forces 8-byte
> alignment. The event length is stored in array[0], the payload data
> and id are placed in a struct raw_data_entry which is stored starting
> at array[1]. In this case, the printed payload data length is 8*N+4
> bytes.
> 
> To make the testcase pass in this case, add a kconfig_enabled() helper
> and use it to detect CONFIG_HAVE_64BIT_ALIGNED_ACCESS so
> trace_marker_raw.tc can calculate the expected length correctly.
> 
> Assisted-by: Copilot:gpt-5.5
> Signed-off-by: Hui Wang <hui.wang@canonical.com>

NACK

Let's not change the kernel for a broken test. Also this has already
been fixed but appears not to be applied yet.

Shuah, can you please apply the below fix.

  https://lore.kernel.org/all/20260601023251.1916483-1-dtcccc@linux.alibaba.com/

-- Steve


^ permalink raw reply

* Re: [PATCH mm-unstable v19 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Lance Yang @ 2026-06-08 16:26 UTC (permalink / raw)
  To: david
  Cc: lance.yang, npache, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, akpm, anshuman.khandual, apopple,
	baohua, baolin.wang, byungchul, catalin.marinas, cl, corbet,
	dave.hansen, dev.jain, gourry, hannes, hughd, jack, jackmanb,
	jannh, jglisse, joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <2553caae-9e0e-42a7-8b61-d1216f1e81fa@kernel.org>


On Mon, Jun 08, 2026 at 04:56:37PM +0200, David Hildenbrand (Arm) wrote:
>On 6/6/26 12:28, Lance Yang wrote:
>> 
>> On Fri, Jun 05, 2026 at 10:14:18AM -0600, Nico Pache wrote:
>>> Enable khugepaged to collapse to mTHP orders. This patch implements the
>>> main scanning logic using a bitmap to track occupied pages and the
>>> algorithm to find optimal collapse sizes.
>>>
>>> Previous to this patch, PMD collapse had 3 main phases, a light weight
>>> scanning phase (mmap_read_lock) that determines a potential PMD
>>> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
>>> phase (mmap_write_lock).
>>>
>>> To enable mTHP collapse we make the following changes:
>>>
>>> During PMD scan phase, track occupied pages in a bitmap. When mTHP
>>> orders are enabled, we remove the restriction of max_ptes_none during the
>>> scan phase to avoid missing potential mTHP collapse candidates. Once we
>>> have scanned the full PMD range and updated the bitmap to track occupied
>>> pages, we use the bitmap to find the optimal mTHP size.
>>>
>>> Implement mthp_collapse() to walk forward through the bitmap and
>>> determine the best eligible order for each naturally-aligned region. The
>>> algorithm starts at the beginning of the PMD range and, for each offset,
>>> tries the highest order that fits the alignment. If the number of
>>> occupied PTEs in that region satisfies the max_ptes_none threshold for
>>> that order, a collapse is attempted. On failure, the order is
>>> decremented and the same offset is retried at the next smaller size. Once
>>> the smallest enabled order is exhausted (or a collapse succeeds), the
>>> offset advances past the region just processed, and the next attempt
>>> starts at the highest order permitted by the new offset's natural
>>> alignment.
>>>
>>> The algorithm works as follows:
>>>    1) set offset=0 and order=HPAGE_PMD_ORDER
>>>    2) if the order is not enabled, go to step (5)
>>>    3) count occupied PTEs in the (offset, order) range using
>>>       bitmap_weight_from()
>>>    4) if the count satisfies the max_ptes_none threshold, attempt
>>>       collapse; on success, advance to step (6)
>>>    5) if a smaller enabled order exists, decrement order and retry
>>>       from step (2) at the same offset
>>>    6) advance offset past the current region and compute the next
>>>       order from the new offset's natural alignment via __ffs(offset),
>>>       capped at HPAGE_PMD_ORDER
>>>    7) repeat from step (2) until the full PMD range is covered
>>>
>>> mTHP collapses reject regions containing swapped out or shared pages.
>>> This is because adding new entries can lead to new none pages, and these
>>> may lead to constant promotion into a higher order mTHP. A similar
>>> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
>>> introducing at least 2x the number of pages, and on a future scan will
>>> satisfy the promotion condition once again. This issue is prevented via
>>> the collapse_max_ptes_none() function which imposes the max_ptes_none
>>> restrictions above.
>>>
>>> We currently only support mTHP collapse for max_ptes_none values of 0
>>> and HPAGE_PMD_NR - 1, resulting in the following behavior:
>>>
>>>    - max_ptes_none=0: Never introduce new empty pages during collapse
>>>    - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>>>      available mTHP order
>>>
>>> Any other max_ptes_none value will emit a warning and default mTHP
>>> collapse to max_ptes_none=0. There should be no behavior change for PMD
>>> collapse.
>>>
>>> Once we determine which mTHP sizes fit best in that PMD range, a collapse
>>> is attempted. A minimum collapse order of 2 is used as this is the lowest
>>> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>>>
>>> Currently madv_collapse is not supported and will only attempt PMD
>>> collapse.
>>>
>>> We can also remove the check for is_khugepaged inside the PMD scan as
>>> the collapse_max_ptes_none() function handles this logic now.
>>>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>> mm/khugepaged.c | 146 +++++++++++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 138 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index ec886a031952..430047316f43 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -99,6 +99,8 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>>>
>>> static struct kmem_cache *mm_slot_cache __ro_after_init;
>>>
>>> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
>>> +
>>> struct collapse_control {
>>> 	bool is_khugepaged;
>>>
>>> @@ -110,6 +112,9 @@ struct collapse_control {
>>>
>>> 	/* nodemask for allocation fallback */
>>> 	nodemask_t alloc_nmask;
>>> +
>>> +	/* Each bit represents a single occupied (!none/zero) page. */
>>> +	DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
>>> };
>>>
>>> /**
>>> @@ -1440,20 +1445,130 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>>> 	return result;
>>> }
>>>
>>> +/* Return the highest naturally aligned order that fits at @offset within a PMD. */
>>> +static unsigned int max_order_from_offset(unsigned int offset)
>>> +{
>>> +	if (offset == 0)
>>> +		return HPAGE_PMD_ORDER;
>>> +
>>> +	return min_t(unsigned int, __ffs(offset), HPAGE_PMD_ORDER);
>>> +}
>>> +
>>> +/*
>>> + * mthp_collapse() consumes the bitmap that is generated during
>>> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
>>> + *
>>> + * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
>>> + * page. We start at the PMD order and check if it is eligible for collapse;
>>> + * if not, we check the left and right halves of the PTE page table we are
>>> + * examining at a lower order.
>>> + *
>>> + * For each of these, we determine how many PTE entries are occupied in the
>>> + * range of PTE entries we propose to collapse, then we compare this to a
>>> + * threshold number of PTE entries which would need to be occupied for a
>>> + * collapse to be permitted at that order (accounting for max_ptes_none).
>>> + *
>>> + * If a collapse is permitted, we attempt to collapse the PTE range into a
>>> + * mTHP.
>>> + */
>>> +static enum scan_result mthp_collapse(struct mm_struct *mm,
>>> +		unsigned long address, int referenced, int unmapped,
>>> +		struct collapse_control *cc, unsigned long enabled_orders)
>>> +{
>>> +	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
>>> +	enum scan_result last_result = SCAN_FAIL;
>>> +	int collapsed = 0;
>>> +	bool alloc_failed = false;
>>> +	unsigned long collapse_address;
>>> +	unsigned int offset = 0;
>>> +	unsigned int order = HPAGE_PMD_ORDER;
>>> +
>>> +	while (offset < HPAGE_PMD_NR) {
>>> +		nr_ptes = 1UL << order;
>>> +
>>> +		if (!test_bit(order, &enabled_orders))
>>> +			goto next_order;
>>> +
>>> +		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
>>> +		nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
>>> +						      offset + nr_ptes);
>>> +
>>> +		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>> 
>> Looks broken for swap PTEs in PMD collapse ...
>> 
>> collapse_scan_pmd() allows them up to max_ptes_swap and record them in
>> unmapped, but they don't get a bit in mthp_present_ptes. And then
>> mthp_collapse() does the check above:
>
>Right. I assumed this is implicitly handled by the optimization in collapse_scan_pmd:
>
>	if (enabled_orders != BIT(HPAGE_PMD_ORDER))
>		max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>
>But we perform the check a second time.

Note that once lower orders are enabled, the scan *relaxes* max_ptes_none
only so it can cover the whole PMD and build the bitmap ...

>> 
>> nr_occupied_ptes >= nr_ptes - max_ptes_none
>> 
>> So max_ptes_none=0 + 511 present PTEs + one allowed swap PTE won't even
>> call collapse_huge_page() for PMD order.
>> 
>> Shouldn't we account for them in the PMD-order check? Something like:
>> 
>> if (is_pmd_order(order))
>> 	nr_occupied_ptes += unmapped;
>As an alternative, we could either 1) skip the check there for
>pmd order (as the check was already done); or 2) introduce+maintain

Yeah, skipping the check would do the trick, since isolate will check
max_ptes_none again later :)

>a bitmap that tracks non-present PTEs.
>
>@@ -1475,7 +1477,9 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
>                nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
>                                                      offset + nr_ptes);
> 
>-               if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>+               /* Check was already done in the caller. */

This check is not quite redundant for PMD order, though. It avoids
entering collapse_huge_page() for a range that already exceeds
max_ptes_none for that order.

>+               if (is_pmd_order(order) ||
>+                   nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>                        enum scan_result ret;
> 
>                        collapse_address = address + offset * PAGE_SIZE;
>
>2) would probably be cleanest long-term.

Yeah, agreed.
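
For reference, the arithmetic discussed above can be sketched in
userspace. This is only an illustration under stated assumptions: 512
PTEs per PMD (4K base pages), and the helper names check_as_posted()
and check_with_unmapped() are hypothetical, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

#define PMD_NR_PTES 512u	/* assumption: 4K base pages, 2M PMD */

/* The check as posted: only present PTEs count as occupied. */
static bool check_as_posted(unsigned int nr_present, unsigned int nr_ptes,
			    unsigned int max_ptes_none)
{
	assert(max_ptes_none <= nr_ptes);
	return nr_present >= nr_ptes - max_ptes_none;
}

/*
 * Hypothetical variant for PMD order: swap PTEs that the scan already
 * accepted (counted in "unmapped") are added to the occupied count.
 */
static bool check_with_unmapped(unsigned int nr_present, unsigned int unmapped,
				unsigned int nr_ptes,
				unsigned int max_ptes_none)
{
	return check_as_posted(nr_present + unmapped, nr_ptes, max_ptes_none);
}
```

With max_ptes_none=0, 511 present PTEs and one swap PTE, the first
helper rejects the range and the second accepts it, which is the case
described in the report.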


* [PATCH v3 6/6] x86/setup: prepend embedded bootconfig cmdline before parse_early_param
From: Breno Leitao @ 2026-06-08 16:24 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org>

Call xbc_prepend_embedded_cmdline() in setup_arch() right after the
CONFIG_CMDLINE merge and before strscpy(command_line, ...) so the
build-time-rendered embedded bootconfig "kernel" subtree is part of
boot_command_line by the time parse_early_param() runs. early_param()
handlers (mem=, earlycon=, loglevel=, ...) now see values supplied via
CONFIG_BOOT_CONFIG_EMBED_FILE without parsing bootconfig at runtime.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 arch/x86/Kconfig        |  1 +
 arch/x86/kernel/setup.c | 16 ++++++++++++++++
 init/main.c             | 25 ++++++++++++++++++++++---
 3 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f24810015234..f839795692b4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -126,6 +126,7 @@ config X86
 	select ARCH_SUPPORTS_NUMA_BALANCING	if X86_64
 	select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP	if NR_CPUS <= 4096
 	select ARCH_SUPPORTS_CFI		if X86_64
+	select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
 	select ARCH_USES_CFI_TRAPS		if X86_64 && CFI
 	select ARCH_SUPPORTS_LTO_CLANG
 	select ARCH_SUPPORTS_LTO_CLANG_THIN
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 46882ce79c3a..003f8651db6c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -6,6 +6,7 @@
  * parts of early kernel initialization.
  */
 #include <linux/acpi.h>
+#include <linux/bootconfig.h>
 #include <linux/console.h>
 #include <linux/cpu.h>
 #include <linux/crash_dump.h>
@@ -36,6 +37,7 @@
 #include <asm/bios_ebda.h>
 #include <asm/bugs.h>
 #include <asm/cacheinfo.h>
+#include <asm/cmdline.h>
 #include <asm/coco.h>
 #include <asm/cpu.h>
 #include <asm/efi.h>
@@ -924,6 +926,20 @@ void __init setup_arch(char **cmdline_p)
 	builtin_cmdline_added = true;
 #endif
 
+	/*
+	 * Match the runtime bootconfig parser's opt-in: only fold the
+	 * embedded kernel.* keys into the cmdline when "bootconfig" is
+	 * present on the command line, or CONFIG_BOOT_CONFIG_FORCE is set.
+	 * setup_boot_config() bails out under the same condition, so the
+	 * early prepend stays in lockstep with what the late runtime parser
+	 * would have applied. CONFIG_BOOT_CONFIG_FORCE defaults to y when
+	 * BOOT_CONFIG_EMBED is set, so on the default config the embedded
+	 * keys are applied unconditionally.
+	 */
+	if (IS_ENABLED(CONFIG_BOOT_CONFIG_FORCE) ||
+	    cmdline_find_option_bool(boot_command_line, "bootconfig"))
+		xbc_prepend_embedded_cmdline(boot_command_line, COMMAND_LINE_SIZE);
+
 	strscpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
 	*cmdline_p = command_line;
 
diff --git a/init/main.c b/init/main.c
index e363232b428b..2ecb6aa536dd 100644
--- a/init/main.c
+++ b/init/main.c
@@ -378,12 +378,15 @@ static void __init setup_boot_config(void)
 	int pos, ret;
 	size_t size;
 	char *err;
+	bool from_embedded = false;
 
 	/* Cut out the bootconfig data even if we have no bootconfig option */
 	data = get_boot_config_from_initrd(&size);
 	/* If there is no bootconfig in initrd, try embedded one. */
-	if (!data)
+	if (!data) {
 		data = xbc_get_embedded_bootconfig(&size);
+		from_embedded = true;
+	}
 
 	strscpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
 	err = parse_args("bootconfig", tmp_cmdline, NULL, 0, 0, 0, NULL,
@@ -421,8 +424,24 @@ static void __init setup_boot_config(void)
 	} else {
 		xbc_get_info(&ret, NULL);
 		pr_info("Load bootconfig: %ld bytes %d nodes\n", (long)size, ret);
-		/* keys starting with "kernel." are passed via cmdline */
-		extra_command_line = xbc_make_cmdline("kernel");
+		/*
+		 * keys starting with "kernel." are passed via cmdline. When
+		 * this bootconfig came from the embedded source and
+		 * setup_arch() already prepended the rendered "kernel" subtree
+		 * to boot_command_line, rendering again here would duplicate
+		 * the keys in saved_command_line and make accumulating handlers
+		 * (console=, earlycon=, ...) re-register the same value. Skip
+		 * only when the prepend really happened.
+		 *
+		 * On arches that do not select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG,
+		 * CONFIG_BOOT_CONFIG_EMBED_CMDLINE is unselectable and
+		 * xbc_embedded_cmdline_applied() collapses to a stub returning
+		 * false, so this path still runs and the embedded "kernel"
+		 * keys reach the cmdline via the runtime parser exactly as
+		 * before this series.
+		 */
+		if (!from_embedded || !xbc_embedded_cmdline_applied())
+			extra_command_line = xbc_make_cmdline("kernel");
 		/* Also, "init." keys are init arguments */
 		extra_init_args = xbc_make_cmdline("init");
 	}

-- 
2.53.0-Meta



* [PATCH v3 5/6] bootconfig: add xbc_prepend_embedded_cmdline() helper
From: Breno Leitao @ 2026-06-08 16:24 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org>

Add a helper that prepends the build-time-rendered embedded bootconfig
"kernel" subtree (embedded_kernel_cmdline[] from embedded-cmdline.S) to
a cmdline buffer with a separating space. Architectures call this from
setup_arch() before parse_early_param() so early_param() handlers
(mem=, earlycon=, loglevel=, ...) see values supplied via the embedded
bootconfig.

The in-place prepend (shift the existing string right, then drop the
embedded string in front) is factored into a small str_prepend() helper.

On overflow the helper logs an error and leaves the cmdline untouched
rather than panicking. Booting without the embedded values is better
than refusing to boot, and the error tells the user why their embedded
keys are missing.
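
The prepend and its overflow policy can be sketched in userspace as
follows. This is a minimal illustration, not the kernel helper: the
name prepend_checked() and its bool return value are assumptions made
for the example, and it uses memchr() where the patch uses strnlen().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Prepend src_len bytes of src in front of the NUL-terminated string
 * held in dst, whose total capacity is size bytes. Returns false and
 * leaves dst untouched if the result would not fit.
 */
static bool prepend_checked(char *dst, size_t size,
			    const char *src, size_t src_len)
{
	const char *nul;
	size_t dst_len;

	assert(dst && src);

	/* Find the current string length without reading past size. */
	nul = size ? memchr(dst, '\0', size) : NULL;
	if (!nul)
		return false;
	dst_len = (size_t)(nul - dst);

	/* Need room for src_len + dst_len + 1 (the NUL) bytes. */
	if (src_len > size - dst_len - 1)
		return false;

	/* Shift the existing string, including its NUL, to the right. */
	memmove(dst + src_len, dst, dst_len + 1);
	/* Drop the new bytes in front; src need not be NUL-terminated. */
	memcpy(dst, src, src_len);
	return true;
}
```

Because the memmove() carries the terminating NUL along, an empty dst
needs no special case, and a failed call leaves the buffer exactly as
it was.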

The helper records whether it actually prepended, exposed via
xbc_embedded_cmdline_applied(). setup_boot_config() uses this to decide
whether the runtime "kernel" render would duplicate keys already folded
into boot_command_line.

When CONFIG_BOOT_CONFIG_EMBED_CMDLINE=n, the public declaration in
<linux/bootconfig.h> resolves to a no-op stub so callers compile
unchanged.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 include/linux/bootconfig.h |  9 ++++++
 lib/bootconfig.c           | 78 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/include/linux/bootconfig.h b/include/linux/bootconfig.h
index 1c7f3b74ffcf..c186137f87ac 100644
--- a/include/linux/bootconfig.h
+++ b/include/linux/bootconfig.h
@@ -308,4 +308,13 @@ static inline const char *xbc_get_embedded_bootconfig(size_t *size)
 }
 #endif
 
+/* Build-time-rendered bootconfig cmdline prepended in setup_arch() */
+#ifdef CONFIG_BOOT_CONFIG_EMBED_CMDLINE
+void __init xbc_prepend_embedded_cmdline(char *dst, size_t size);
+bool __init xbc_embedded_cmdline_applied(void);
+#else
+static inline void xbc_prepend_embedded_cmdline(char *dst, size_t size) { }
+static inline bool xbc_embedded_cmdline_applied(void) { return false; }
+#endif
+
 #endif
diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 926094d97397..f66be0b2dc24 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -19,6 +19,7 @@
 #include <linux/errno.h>
 #include <linux/cache.h>
 #include <linux/compiler.h>
+#include <linux/printk.h>
 #include <linux/sprintf.h>
 #include <linux/memblock.h>
 #include <linux/string.h>
@@ -34,6 +35,83 @@ const char * __init xbc_get_embedded_bootconfig(size_t *size)
 	return (*size) ? embedded_bootconfig_data : NULL;
 }
 #endif
+
+#ifdef CONFIG_BOOT_CONFIG_EMBED_CMDLINE
+/* embedded_kernel_cmdline is defined in embedded-cmdline.S */
+extern __visible const char embedded_kernel_cmdline[];
+extern __visible const char embedded_kernel_cmdline_end[];
+
+/* Set once the embedded cmdline has actually been prepended. */
+static bool xbc_cmdline_applied __initdata;
+
+/*
+ * str_prepend() - Prepend @src in front of the string in @dst, in place
+ * @dst: NUL-terminated destination buffer, currently @dst_len bytes long
+ * @dst_len: length of the current @dst string (excluding its NUL)
+ * @src: bytes to prepend (not NUL-terminated)
+ * @src_len: number of bytes from @src to prepend
+ *
+ * The caller must guarantee @dst has room for src_len + dst_len + 1 bytes.
+ * Moving dst_len + 1 bytes carries @dst's NUL terminator too, so an empty
+ * @dst needs no special case.
+ */
+static void __init str_prepend(char *dst, size_t dst_len,
+			       const char *src, size_t src_len)
+{
+	memmove(dst + src_len, dst, dst_len + 1);
+	memcpy(dst, src, src_len);
+}
+
+/**
+ * xbc_prepend_embedded_cmdline() - Prepend embedded bootconfig cmdline
+ * @dst: cmdline buffer to prepend into (must already contain a NUL byte)
+ * @size: total capacity of @dst in bytes
+ *
+ * Prepend the build-time-rendered "kernel" subtree of the embedded
+ * bootconfig to @dst. The rendered string already ends with a single
+ * space (the xbc_snprint_cmdline() invariant), which serves as the
+ * separator between the embedded keys and any existing content of @dst.
+ * On overflow, log an error and leave @dst untouched rather than
+ * truncating or panicking: booting without the embedded values is better
+ * than refusing to boot, and the error message tells the user why
+ * their embedded keys are missing.
+ *
+ * Intended to be called from setup_arch() before parse_early_param() so
+ * that early_param() handlers see the embedded values.
+ */
+void __init xbc_prepend_embedded_cmdline(char *dst, size_t size)
+{
+	size_t embed_len = embedded_kernel_cmdline_end - embedded_kernel_cmdline;
+	size_t dst_len;
+
+	if (!size || embed_len <= 1)	/* trailing NUL only */
+		return;
+	embed_len--;			/* exclude trailing NUL byte */
+
+	dst_len = strnlen(dst, size);
+	if (embed_len + dst_len + 1 > size) {
+		pr_err("embedded bootconfig cmdline (%zu bytes) does not fit in COMMAND_LINE_SIZE with %zu bytes already used; ignoring embedded values\n",
+		       embed_len, dst_len);
+		return;
+	}
+
+	str_prepend(dst, dst_len, embedded_kernel_cmdline, embed_len);
+	xbc_cmdline_applied = true;
+}
+
+/**
+ * xbc_embedded_cmdline_applied() - Did the embedded cmdline get prepended?
+ *
+ * Return true if xbc_prepend_embedded_cmdline() actually prepended the
+ * embedded "kernel" subtree. setup_boot_config() uses this to avoid
+ * rendering the same keys a second time.
+ */
+bool __init xbc_embedded_cmdline_applied(void)
+{
+	return xbc_cmdline_applied;
+}
+#endif
+
 #endif
 
 /*

-- 
2.53.0-Meta



* [PATCH v3 4/6] bootconfig: clean build-time tools/bootconfig from make clean
From: Breno Leitao @ 2026-06-08 16:24 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org>

The previous patch builds tools/bootconfig during 'make prepare' to
render the embedded bootconfig cmdline, but nothing removes it on
'make clean', leaving the compiled tool and its objects behind.

Wire a bootconfig_clean hook into the top-level clean target so the
compiled tool and its objects are removed by make clean, matching the
prepare-wired tools/objtool and tools/bpf/resolve_btfids.

The hook runs tools/bootconfig's Makefile via $(MAKE), which the kernel
build invokes with -rR (MAKEFLAGS += -rR). -rR drops the built-in $(RM)
variable, so the existing "$(RM) -f ..." clean recipe would expand to a
bare "-f ..." and fail. Spell the recipe with a literal "rm -f" so it
keeps working both standalone and when invoked from Kbuild.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 Makefile                  | 13 ++++++++++++-
 tools/bootconfig/Makefile |  2 +-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 4a8ea7c90ca8..84ca047f0c10 100644
--- a/Makefile
+++ b/Makefile
@@ -1580,6 +1580,17 @@ ifneq ($(wildcard $(objtool_O)),)
 	$(Q)$(MAKE) -sC $(abs_srctree)/tools/objtool O=$(objtool_O) srctree=$(abs_srctree) $(patsubst objtool_%,%,$@)
 endif
 
+PHONY += bootconfig_clean
+
+bootconfig_O = $(abspath $(objtree))/tools/bootconfig
+
+# tools/bootconfig is only built (via the prepare hook above) when
+# CONFIG_BOOT_CONFIG_EMBED_CMDLINE is set; skip its clean otherwise.
+bootconfig_clean:
+ifneq ($(wildcard $(bootconfig_O)),)
+	$(Q)$(MAKE) -sC $(srctree)/tools/bootconfig O=$(bootconfig_O) clean
+endif
+
 tools/: FORCE
 	$(Q)mkdir -p $(objtree)/tools
 	$(Q)$(MAKE) O=$(abspath $(objtree)) subdir=tools -C $(srctree)/tools/
@@ -1749,7 +1760,7 @@ vmlinuxclean:
 	$(Q)$(CONFIG_SHELL) $(srctree)/scripts/link-vmlinux.sh clean
 	$(Q)$(if $(ARCH_POSTLINK), $(MAKE) -f $(ARCH_POSTLINK) clean)
 
-clean: archclean vmlinuxclean resolve_btfids_clean objtool_clean
+clean: archclean vmlinuxclean resolve_btfids_clean objtool_clean bootconfig_clean
 
 # mrproper - Delete all generated files, including .config
 #
diff --git a/tools/bootconfig/Makefile b/tools/bootconfig/Makefile
index 4e82fd9553cd..3cb8066d5141 100644
--- a/tools/bootconfig/Makefile
+++ b/tools/bootconfig/Makefile
@@ -27,4 +27,4 @@ install: $(ALL_PROGRAMS)
 	install $(OUTPUT)bootconfig $(DESTDIR)$(bindir)
 
 clean:
-	$(RM) -f $(OUTPUT)*.o $(ALL_PROGRAMS)
+	rm -f $(OUTPUT)*.o $(ALL_PROGRAMS)

-- 
2.53.0-Meta



* [PATCH v3 3/6] bootconfig: render embedded bootconfig as a kernel cmdline at build time
From: Breno Leitao @ 2026-06-08 16:24 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org>

Add the build-time pipeline that renders the "kernel" subtree of
CONFIG_BOOT_CONFIG_EMBED_FILE into a flat cmdline string and stashes
it in .init.rodata as embedded_kernel_cmdline[]. A follow-up patch
adds the runtime helper that prepends this string to boot_command_line
during early architecture setup so parse_early_param() sees the values.

The build wires up:
  tools/bootconfig -C kernel - userspace tool already shared with
                               lib/bootconfig.c, used here in -C mode
                               to render a bootconfig file to a cmdline
  lib/embedded-cmdline.S     - .incbin's the rendered text plus a NUL
                               (listed under the EXTRA BOOT CONFIG
                               MAINTAINERS entry)
  lib/Makefile rule          - runs tools/bootconfig at build time
  Makefile prepare dep       - ensures tools/bootconfig is built first,
                               same pattern as tools/objtool and
                               tools/bpf/resolve_btfids

Drop the test target from tools/bootconfig/Makefile's default 'all'
recipe so that hooking the binary into the kernel build does not run
test-bootconfig.sh on every prepare. The tests stay available as
'make -C tools/bootconfig test', matching the convention of
tools/objtool and tools/bpf/resolve_btfids whose 'all' targets only
build the binary.

Require BOOT_CONFIG_EMBED_FILE to be non-empty before the new option
can be enabled, otherwise tools/bootconfig -C runs against an empty
file and prints a parse error on every kernel build.

The feature gates on CONFIG_ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG, a
silent symbol that arches select once they have wired the prepend call
into setup_arch(). No arch selects it in this patch, so the user-visible
CONFIG_BOOT_CONFIG_EMBED_CMDLINE cannot be enabled yet; the follow-up
patches add the runtime helper and the first arch opt-in.

tools/bootconfig also installs on target systems, so its own Makefile
keeps $(CC) and stays cross-buildable as a standalone tool. The kernel
build, which runs the tool on the build host during prepare, instead
forces CC=$(HOSTCC) from a dedicated tools/bootconfig rule, so the
executed binary is always a host binary -- plain $(CC) would
cross-compile it under ARCH=... and fail to exec ("Exec format error").

embedded-cmdline.S places the rendered string in .init.rodata with the
"a" (allocatable, read-only) flag and %progbits, not "aw": the data is
never written at runtime, so it must not land in a writable section.

A follow-up patch wires the build-time tools/bootconfig into the
top-level clean target.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 MAINTAINERS               |  1 +
 Makefile                  | 11 +++++++++++
 init/Kconfig              | 36 ++++++++++++++++++++++++++++++++++++
 lib/Makefile              | 16 ++++++++++++++++
 lib/embedded-cmdline.S    | 16 ++++++++++++++++
 tools/bootconfig/Makefile |  2 +-
 6 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 4087b67bbc69..fb9314cbe344 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9845,6 +9845,7 @@ F:	fs/proc/bootconfig.c
 F:	include/linux/bootconfig.h
 F:	lib/bootconfig-data.S
 F:	lib/bootconfig.c
+F:	lib/embedded-cmdline.S
 F:	tools/bootconfig/*
 F:	tools/bootconfig/scripts/*
 
diff --git a/Makefile b/Makefile
index d59f703f9797..4a8ea7c90ca8 100644
--- a/Makefile
+++ b/Makefile
@@ -1543,6 +1543,17 @@ prepare: tools/bpf/resolve_btfids
 endif
 endif
 
+# tools/bootconfig renders the embedded bootconfig into a cmdline at build time.
+ifdef CONFIG_BOOT_CONFIG_EMBED_CMDLINE
+prepare: tools/bootconfig
+endif
+
+# tools/bootconfig is run on the build host during prepare, so force a host
+# binary here; its own Makefile keeps $(CC) for standalone and cross builds.
+tools/bootconfig: FORCE
+	$(Q)mkdir -p $(objtree)/tools
+	$(Q)$(MAKE) O=$(abspath $(objtree)) subdir=tools -C $(srctree)/tools/ bootconfig CC=$(HOSTCC)
+
 # The tools build system is not a part of Kbuild and tends to introduce
 # its own unique issues. If you need to integrate a new tool into Kbuild,
 # please consider locating that tool outside the tools/ tree and using the
diff --git a/init/Kconfig b/init/Kconfig
index ca35184532dc..203b1187fde7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1569,6 +1569,42 @@ config BOOT_CONFIG_EMBED_FILE
 	  This bootconfig will be used if there is no initrd or no other
 	  bootconfig in the initrd.
 
+config ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
+	bool
+	help
+	  Silent symbol; no C code reads it directly. Architectures
+	  select it once their setup_arch() calls
+	  xbc_prepend_embedded_cmdline() before parse_early_param().
+	  Its only role is to gate the user-visible
+	  BOOT_CONFIG_EMBED_CMDLINE option per-arch, the same
+	  ARCH_SUPPORTS_* idiom used by ARCH_SUPPORTS_CFI, etc.
+
+config BOOT_CONFIG_EMBED_CMDLINE
+	bool "Render embedded bootconfig as kernel cmdline at build time"
+	depends on BOOT_CONFIG_EMBED
+	depends on BOOT_CONFIG_EMBED_FILE != ""
+	depends on ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
+	default n
+	help
+	  Render the "kernel" subtree of the embedded bootconfig file into a
+	  flat cmdline string at kernel build time and prepend it to
+	  boot_command_line during early architecture setup. This makes
+	  early_param() handlers (e.g. mem=, earlycon=, loglevel=) see the
+	  values supplied via the embedded bootconfig.
+
+	  The runtime bootconfig parser is unaffected, so tree-structured
+	  consumers such as ftrace boot-time tracing keep working.
+
+	  Note: when an initrd also carries a bootconfig, its "kernel"
+	  subtree is still parsed at runtime, but the embedded "kernel"
+	  keys remain in boot_command_line for parse_early_param() and
+	  end up later than the initrd keys in saved_command_line, so
+	  parse_args() last-wins favors the embedded values. If you need
+	  initrd to override embedded kernel.* keys, leave this option
+	  off.
+
+	  If unsure, say N.
+
 config CMDLINE_LOG_WRAP_IDEAL_LEN
 	int "Length to try to wrap the cmdline when logged at boot"
 	default 1021
diff --git a/lib/Makefile b/lib/Makefile
index 6e72d2c1cce7..9de0ac7732a2 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -273,6 +273,22 @@ filechk_defbconf = cat $(or $(real-prereqs), /dev/null)
 $(obj)/default.bconf: $(CONFIG_BOOT_CONFIG_EMBED_FILE) FORCE
 	$(call filechk,defbconf)
 
+obj-$(CONFIG_BOOT_CONFIG_EMBED_CMDLINE) += embedded-cmdline.o
+$(obj)/embedded-cmdline.o: $(obj)/embedded_cmdline.bin
+
+# Render the bootconfig "kernel" subtree to a flat cmdline string using
+# the userspace tools/bootconfig parser (-C mode). The runtime prepend
+# helper enforces COMMAND_LINE_SIZE at boot, so no build-time size
+# check is performed here (COMMAND_LINE_SIZE is an arch header
+# constant, not a Kconfig value).
+quiet_cmd_render_cmdline = BCONF2C $@
+      cmd_render_cmdline = \
+	$(objtree)/tools/bootconfig/bootconfig -C $< > $@
+
+targets += embedded_cmdline.bin
+$(obj)/embedded_cmdline.bin: $(obj)/default.bconf $(objtree)/tools/bootconfig/bootconfig FORCE
+	$(call if_changed,render_cmdline)
+
 obj-$(CONFIG_RBTREE_TEST) += rbtree_test.o
 obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
 
diff --git a/lib/embedded-cmdline.S b/lib/embedded-cmdline.S
new file mode 100644
index 000000000000..740d7ad2dc01
--- /dev/null
+++ b/lib/embedded-cmdline.S
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Embed the build-time-rendered bootconfig "kernel" subtree as a flat
+ * cmdline string. setup_arch() prepends this to boot_command_line on
+ * architectures that select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ */
+	.section .init.rodata, "a", %progbits
+	.global embedded_kernel_cmdline
+embedded_kernel_cmdline:
+	.incbin "lib/embedded_cmdline.bin"
+	.byte 0
+	.global embedded_kernel_cmdline_end
+embedded_kernel_cmdline_end:
diff --git a/tools/bootconfig/Makefile b/tools/bootconfig/Makefile
index 90eb47c9d8de..4e82fd9553cd 100644
--- a/tools/bootconfig/Makefile
+++ b/tools/bootconfig/Makefile
@@ -15,7 +15,7 @@ override CFLAGS += -Wall -g -I$(CURDIR)/include
 ALL_TARGETS := bootconfig
 ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS))
 
-all: $(ALL_PROGRAMS) test
+all: $(ALL_PROGRAMS)
 
 $(OUTPUT)bootconfig: main.c include/linux/bootconfig.h $(LIBSRC)
 	$(CC) $(filter %.c,$^) $(CFLAGS) $(LDFLAGS) -o $@

-- 
2.53.0-Meta



* [PATCH v3 2/6] bootconfig: render descendant keys when xbc_snprint_cmdline() root has a value
From: Breno Leitao @ 2026-06-08 16:23 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org>

xbc_node_for_each_key_value() walks to the first leaf under @root, and
when @root is itself a leaf it yields @root. That happens not only for
an empty "kernel {}" subtree, but also when @root carries both a value
and subkeys, e.g.

	kernel = x
	kernel.foo = bar

Here @root ("kernel") is a leaf because its first child is the value
node "x", so the iterator returns @root first. Feeding @root back into
xbc_node_compose_key_after(root, root) returns -EINVAL, which the only
in-kernel caller papers over with a "len <= 0" check -- but the
follow-up tools/bootconfig -C user propagates the error and turns such
a bootconfig into a build failure. Worse, short-circuiting the whole
call on a leaf @root would silently drop the valid "kernel.foo = bar"
descendant that the pre-existing code rendered.

Skip @root inside the loop instead of bailing out: the value-only entry
is dropped (it is rendered through the "kernel" cmdline path, not here),
while real descendant keys are still emitted. An entirely empty subtree
now renders nothing and returns 0 rather than -EINVAL, matching the
"nothing to render is not an error" semantics expected by the new
build-time caller.
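
The difference between skipping @root and bailing out can be sketched
with a toy model. Everything here is an assumption for illustration:
struct node, compose_key() and render() are hypothetical stand-ins and
not the xbc_* API, and the iterator is modeled as a plain array of
yielded nodes in which a value-carrying root appears first.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

struct node {
	const char *key;	/* key relative to the root */
	const char *val;
};

/* Like the real composer, refuse to build a key for the root itself. */
static int compose_key(const struct node *root, const struct node *n,
		       char *out, size_t size)
{
	if (n == root)
		return -EINVAL;
	return snprintf(out, size, "%s", n->key);
}

/* Render "key=val " for every yielded node except the root. */
static int render(const struct node *root, const struct node *const *it,
		  size_t count, char *buf, size_t size)
{
	char name[64];
	size_t len = 0;

	assert(buf || size == 0);

	for (size_t i = 0; i < count; i++) {
		size_t rem = size > len ? size - len : 0;
		int ret;

		/* Skip the root instead of failing the whole render. */
		if (it[i] == root)
			continue;

		ret = compose_key(root, it[i], name, sizeof(name));
		if (ret < 0)
			return ret;

		ret = snprintf(rem ? buf + len : NULL, rem,
			       "%s=%s ", name, it[i]->val);
		if (ret < 0)
			return ret;
		len += (size_t)ret;
	}
	return (int)len;
}
```

In this model a root yielded alongside a real descendant still renders
the descendant, and a subtree that yields only the root renders
nothing and returns 0.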

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 lib/bootconfig.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 2ed9ee3dc81c..926094d97397 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -440,6 +440,17 @@ int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
 	 * itself is well defined and returns the would-be length.
 	 */
 	xbc_node_for_each_key_value(root, knode, val) {
+		/*
+		 * An empty or value-only @root (e.g. "kernel {}" or
+		 * "kernel = x", possibly alongside "kernel.foo = bar")
+		 * yields @root itself here. Skip it: composing a key for it
+		 * would fail with -EINVAL, yet any real descendant keys must
+		 * still be rendered. An entirely empty subtree then renders
+		 * nothing and returns 0 rather than an error.
+		 */
+		if (knode == root)
+			continue;
+
 		ret = xbc_node_compose_key_after(root, knode,
 					xbc_namebuf, XBC_KEYLEN_MAX);
 		if (ret < 0)

-- 
2.53.0-Meta


* [PATCH v3 0/6] bootconfig: embed kernel.* cmdline at build time
From: Breno Leitao @ 2026-06-08 16:23 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team

The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
already landed; this series wires the rendered cmdline into the kernel.

Motivation: today the embedded bootconfig is parsed at runtime, after
parse_early_param() has already run, so early_param() handlers can't
see embedded values. Folding the kernel.* subtree into the cmdline at
build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
users without forcing them to maintain two cmdline sources.

Behaviorally, the "kernel" subtree is rendered to a flat string at
build time and stashed in .init.rodata. setup_arch() prepends it to
boot_command_line before parse_early_param() runs. Overflow is a soft
error: the helper logs and leaves boot_command_line untouched rather
than panicking, so an oversized embedded bconf cannot brick a boot.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v3:
- Patch 3: Move HOSTCC override to the kernel-side rule; tool keeps
  $(CC) for standalone/cross builds.
- Patch 6: Drop the false fail-safe wording; document the
  BOOT_CONFIG_FORCE=y default interaction.
- Link to v2:
  https://lore.kernel.org/r/20260605-bootconfig_using_tools-v2-0-d309f544b5f7@debian.org

Changes in v2 (addressing review of v1):
- Split out a standalone fix for the NULL-pointer arithmetic in
  xbc_snprint_cmdline() so the build-time render cannot trip host
  UBSan/FORTIFY_SOURCE.
- Rework the leaf-root handling: instead of returning early, skip @root
  inside the loop so a root carrying both a value and subkeys
  (kernel = x together with kernel.foo = bar) still renders its
  descendant keys.
- Build tools/bootconfig with $(HOSTCC) so cross-compiled (ARCH=...)
  builds render the cmdline on the build host instead of failing with
  "Exec format error".
- Mark the embedded cmdline section read-only (drop the "w" flag from
  .init.rodata).
- Add a make-clean hook so tools/bootconfig artifacts are removed by
  make clean.
- Gate the x86 prepend on "bootconfig" being present on the command
  line (or CONFIG_BOOT_CONFIG_FORCE), matching the init.* opt-in
  semantics documented in bootconfig.rst and preserving fail-safe
  recovery: dropping "bootconfig" from the bootloader cmdline now also
  disables the embedded kernel.* keys.
- Link to v1: https://patch.msgid.link/20260527-bootconfig_using_tools-v1-0-b6906a86e7d5@debian.org

---
Breno Leitao (6):
      bootconfig: fix NULL-pointer arithmetic in xbc_snprint_cmdline()
      bootconfig: render descendant keys when xbc_snprint_cmdline() root has a value
      bootconfig: render embedded bootconfig as a kernel cmdline at build time
      bootconfig: clean build-time tools/bootconfig from make clean
      bootconfig: add xbc_prepend_embedded_cmdline() helper
      x86/setup: prepend embedded bootconfig cmdline before parse_early_param

 MAINTAINERS                |   1 +
 Makefile                   |  24 +++++++++-
 arch/x86/Kconfig           |   1 +
 arch/x86/kernel/setup.c    |  16 +++++++
 include/linux/bootconfig.h |   9 ++++
 init/Kconfig               |  36 +++++++++++++++
 init/main.c                |  25 ++++++++--
 lib/Makefile               |  16 +++++++
 lib/bootconfig.c           | 112 ++++++++++++++++++++++++++++++++++++++++++---
 lib/embedded-cmdline.S     |  16 +++++++
 tools/bootconfig/Makefile  |   4 +-
 11 files changed, 247 insertions(+), 13 deletions(-)
---
base-commit: e7e28506af98ce4e1059e5ec59334b335c00a246
change-id: 20260508-bootconfig_using_tools-cfa7aa9d6a5a

Best regards,
-- 
Breno Leitao <leitao@debian.org>



* [PATCH v3 1/6] bootconfig: fix NULL-pointer arithmetic in xbc_snprint_cmdline()
From: Breno Leitao @ 2026-06-08 16:23 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org>

xbc_snprint_cmdline() is meant to be called twice: first with
buf=NULL, size=0 to probe the rendered length, then with a real
buffer to fill it (the standard snprintf() two-pass pattern). The
probe call makes the function compute "buf + size" (NULL + 0) and,
on every iteration, advance "buf += ret" from that NULL base and
pass the result back into snprintf().

Pointer arithmetic on a NULL pointer is undefined behavior. It is
harmless in the in-kernel callers today, but the follow-up patches
run this same code in the userspace tools/bootconfig parser at kernel
build time, where host UBSan / FORTIFY_SOURCE abort the build.

Track a running written length (size_t) instead of mutating @buf, and
only form "buf + len" when @buf is non-NULL. snprintf(NULL, 0, ...)
is itself well defined and returns the would-be length, so the
two-pass "probe then fill" usage returns identical byte counts.
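
The running-length approach can be sketched in userspace as below.
This is an illustration under assumptions, not the kernel code:
render_pairs() and remaining() are hypothetical names, the key/value
arrays stand in for the bootconfig tree walk, and the kernel's rest()
macro is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Bytes still available in a buffer of size bytes after len were used. */
static size_t remaining(size_t len, size_t size)
{
	return size > len ? size - len : 0;
}

/*
 * Render "key=value " pairs. Call once with buf=NULL, size=0 to probe
 * the length, then again with a real buffer. Returns the would-be
 * length (excluding the NUL) or a negative value on error.
 */
static int render_pairs(char *buf, size_t size, const char *const keys[],
			const char *const vals[], size_t n)
{
	size_t len = 0;

	assert(buf || size == 0);

	for (size_t i = 0; i < n; i++) {
		size_t rem = remaining(len, size);
		/* Never form "buf + len" when buf is NULL or space ran out. */
		char *dst = (buf && rem) ? buf + len : NULL;
		int ret = snprintf(dst, rem, "%s=%s ", keys[i], vals[i]);

		if (ret < 0)
			return ret;
		len += (size_t)ret;
	}
	return (int)len;
}
```

Both passes accumulate the same len, so the probe call and the fill
call report the same byte count even when the fill buffer is too small
and the output is truncated.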

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 lib/bootconfig.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index f445b7703fdd..2ed9ee3dc81c 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -427,10 +427,18 @@ static char xbc_namebuf[XBC_KEYLEN_MAX] __initdata;
 int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
 {
 	struct xbc_node *knode, *vnode;
-	char *end = buf + size;
 	const char *val, *q;
+	size_t len = 0;
 	int ret;
 
+	/*
+	 * Track the running written length rather than advancing @buf, so we
+	 * never form "buf + size" or "buf += ret" while @buf is NULL (the
+	 * size-probe call passes buf=NULL, size=0). NULL pointer arithmetic
+	 * is undefined behavior and trips host UBSan / FORTIFY_SOURCE when
+	 * this renderer runs at kernel build time. snprintf(NULL, 0, ...)
+	 * itself is well defined and returns the would-be length.
+	 */
 	xbc_node_for_each_key_value(root, knode, val) {
 		ret = xbc_node_compose_key_after(root, knode,
 					xbc_namebuf, XBC_KEYLEN_MAX);
@@ -439,10 +447,11 @@ int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
 
 		vnode = xbc_node_get_child(knode);
 		if (!vnode) {
-			ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
+			ret = snprintf(buf ? buf + len : NULL, rest(len, size),
+				       "%s ", xbc_namebuf);
 			if (ret < 0)
 				return ret;
-			buf += ret;
+			len += ret;
 			continue;
 		}
 		xbc_array_for_each_value(vnode, val) {
@@ -452,15 +461,15 @@ int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
 			 * whitespace.
 			 */
 			q = strpbrk(val, " \t\r\n") ? "\"" : "";
-			ret = snprintf(buf, rest(buf, end), "%s=%s%s%s ",
-				       xbc_namebuf, q, val, q);
+			ret = snprintf(buf ? buf + len : NULL, rest(len, size),
+				       "%s=%s%s%s ", xbc_namebuf, q, val, q);
 			if (ret < 0)
 				return ret;
-			buf += ret;
+			len += ret;
 		}
 	}
 
-	return buf - (end - size);
+	return len;
 }
 #undef rest
 

-- 
2.53.0-Meta
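
The "probe then fill" pattern with a running length, as described in the commit message above, can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: render_pairs() is a hypothetical name, the key/value arrays stand in for the xbc node walk, and the extra "len < size" guard (which avoids forming a pointer past the end of the buffer after truncation) is an addition of this sketch rather than something the patch does.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace analogue of the renderer: emit "key=value "
 * pairs and return the length the full output needs (excluding the
 * terminating NUL), or a negative snprintf() error.
 * Calling it with buf == NULL and size == 0 only probes the length.
 */
static int render_pairs(char *buf, size_t size, const char *const *keys,
			const char *const *vals, size_t n)
{
	size_t len = 0;	/* running written length; buf is never mutated */
	size_t i;
	int ret;

	for (i = 0; i < n; i++) {
		/* Quote values that contain whitespace. */
		const char *q = strpbrk(vals[i], " \t\r\n") ? "\"" : "";
		/* Only form buf + len when it points inside the buffer. */
		char *dst = (buf && len < size) ? buf + len : NULL;
		size_t room = dst ? size - len : 0;

		ret = snprintf(dst, room, "%s=%s%s%s ", keys[i], q, vals[i], q);
		if (ret < 0)
			return ret;
		len += (size_t)ret;
	}

	return (int)len;
}
```

A caller would typically probe with render_pairs(NULL, 0, ...), allocate the returned length plus one byte, and call again with the real buffer; both calls are expected to return the same count.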



* [PATCH] tracing: Add "within" filter for call-stack-based event filtering
From: Chen Jun @ 2026-06-08 14:55 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel
  Cc: chenjun102

Low-level kernel functions are called from many different paths.
When debugging, it is often useful to filter trace events to only
those occurring within a specific call chain.

Add a "within" filter predicate that tests whether a given function
appears in the current call stack at event time. The function name
is resolved to its address range via kallsyms during filter setup;
at runtime, stack_trace_save() captures the call stack and compares
each return address against the stored range.

Example:
  echo 'within == "vfs_read"' > events/sched/sched_switch/filter

Only "==" and "!=" operators are supported. The filter depends on
CONFIG_STACKTRACE.

Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 Documentation/trace/events.rst     | 12 +++++++++
 include/linux/trace_events.h       |  1 +
 kernel/trace/trace.h               |  3 ++-
 kernel/trace/trace_events.c        |  3 +++
 kernel/trace/trace_events_filter.c | 41 ++++++++++++++++++++++++++++--
 5 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index 18d112963dec..6e3877d376a9 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -243,6 +243,18 @@ the function "security_prepare_creds" and less than the end of that function.
 The ".function" postfix can only be attached to values of size long, and can only
 be compared with "==" or "!=".
 
+The special field "within" can be used to filter events based on whether
+a specific function appears in the current call stack::
+
+  within == "function_name"
+  within != "function_name"
+
+For example, to only trace events where "vfs_read" is in the call stack::
+
+  # echo 'within == "vfs_read"' > events/sched/sched_switch/filter
+
+The within field supports only the "==" and "!=" operators.
+
 Cpumask fields or scalar fields that encode a CPU number can be filtered using
 a user-provided cpumask in cpulist format. The format is as follows::
 
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 40a43a4c7caf..9ed22c210add 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -851,6 +851,7 @@ enum {
 	FILTER_COMM,
 	FILTER_CPU,
 	FILTER_STACKTRACE,
+	FILTER_WITHIN,
 };
 
 extern int trace_event_raw_init(struct trace_event_call *call);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..a383da42badf 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1825,7 +1825,8 @@ static inline bool is_string_field(struct ftrace_event_field *field)
 	       field->filter_type == FILTER_RDYN_STRING ||
 	       field->filter_type == FILTER_STATIC_STRING ||
 	       field->filter_type == FILTER_PTR_STRING ||
-	       field->filter_type == FILTER_COMM;
+	       field->filter_type == FILTER_COMM ||
+	       field->filter_type == FILTER_WITHIN;
 }
 
 static inline bool is_function_field(struct ftrace_event_field *field)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index c46e623e7e0d..b7d681e55b0c 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -199,6 +199,9 @@ static int trace_define_generic_fields(void)
 	__generic_field(char *, comm, FILTER_COMM);
 	__generic_field(char *, stacktrace, FILTER_STACKTRACE);
 	__generic_field(char *, STACKTRACE, FILTER_STACKTRACE);
+#ifdef CONFIG_STACKTRACE
+	__generic_field(char *, within, FILTER_WITHIN);
+#endif
 
 	return ret;
 }
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 609325f57942..34e1a7f0b3cd 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -72,6 +72,7 @@ enum filter_pred_fn {
 	FILTER_PRED_FN_CPUMASK,
 	FILTER_PRED_FN_CPUMASK_CPU,
 	FILTER_PRED_FN_FUNCTION,
+	FILTER_PRED_FN_WITHIN,
 	FILTER_PRED_FN_,
 	FILTER_PRED_TEST_VISITED,
 };
@@ -1009,6 +1010,22 @@ static int filter_pred_function(struct filter_pred *pred, void *event)
 	return pred->op == OP_EQ ? ret : !ret;
 }
 
+/* Filter predicate for within. */
+static int filter_pred_within(struct filter_pred *pred, void *event)
+{
+#ifdef CONFIG_STACKTRACE
+	unsigned long entries[16];
+	unsigned int nr_entries;
+	int i;
+
+	nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
+	for (i = 0; i < nr_entries; i++)
+		if (pred->val <= entries[i] && entries[i] < pred->val2)
+			return !pred->not;
+#endif
+	return pred->not;
+}
+
 /*
  * regex_match_foo - Basic regex callbacks
  *
@@ -1617,6 +1634,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
 		return filter_pred_cpumask_cpu(pred, event);
 	case FILTER_PRED_FN_FUNCTION:
 		return filter_pred_function(pred, event);
+	case FILTER_PRED_FN_WITHIN:
+		return filter_pred_within(pred, event);
 	case FILTER_PRED_TEST_VISITED:
 		return test_pred_visited_fn(pred, event);
 	default:
@@ -2002,10 +2021,28 @@ static int parse_pred(const char *str, void *data,
 
 		} else if (field->filter_type == FILTER_DYN_STRING) {
 			pred->fn_num = FILTER_PRED_FN_STRLOC;
-		} else if (field->filter_type == FILTER_RDYN_STRING)
+		} else if (field->filter_type == FILTER_RDYN_STRING) {
 			pred->fn_num = FILTER_PRED_FN_STRRELLOC;
-		else {
+		} else if (field->filter_type == FILTER_WITHIN) {
+			unsigned long func;
+
+			if (op == OP_GLOB)
+				goto err_free;
 
+			pred->fn_num = FILTER_PRED_FN_WITHIN;
+			func = kallsyms_lookup_name(pred->regex->pattern);
+			if (!func) {
+				parse_error(pe, FILT_ERR_NO_FUNCTION, pos + i);
+				goto err_free;
+			}
+			/* Now find the function start and end address */
+			if (!kallsyms_lookup_size_offset(func, &size, &offset)) {
+				parse_error(pe, FILT_ERR_NO_FUNCTION, pos + i);
+				goto err_free;
+			}
+			pred->val = func - offset;
+			pred->val2 = pred->val + size;
+		} else {
 			if (!ustring_per_cpu) {
 				/* Once allocated, keep it around for good */
 				ustring_per_cpu = alloc_percpu(struct ustring_buffer);
-- 
2.43.0
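
The runtime check in filter_pred_within() boils down to a half-open range test over the captured return addresses. A minimal userspace sketch of that comparison follows; within_match() is a hypothetical name, the entries array stands in for what stack_trace_save() would fill, and [start, end) stands in for the range derived from kallsyms at filter setup.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true when the event passes the filter. "negate" models the
 * "!=" operator. The function range is half-open: [start, end).
 */
static bool within_match(const unsigned long *entries, size_t nr_entries,
			 unsigned long start, unsigned long end, bool negate)
{
	size_t i;

	for (i = 0; i < nr_entries; i++) {
		/* A return address inside the function means it is on the stack. */
		if (start <= entries[i] && entries[i] < end)
			return !negate;
	}

	/* No frame matched: "==" rejects the event, "!=" accepts it. */
	return negate;
}
```

Note that the patch captures at most 16 entries, so a function that sits deeper in the call stack than that would presumably not be seen by the "==" form and would always pass the "!=" form.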



* Re: [PATCH mm-unstable v19 11/14] mm/khugepaged: Introduce mTHP collapse support
From: David Hildenbrand (Arm) @ 2026-06-08 14:56 UTC (permalink / raw)
  To: Lance Yang, npache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260606102800.26940-1-lance.yang@linux.dev>

On 6/6/26 12:28, Lance Yang wrote:
> 
> On Fri, Jun 05, 2026 at 10:14:18AM -0600, Nico Pache wrote:
>> Enable khugepaged to collapse to mTHP orders. This patch implements the
>> main scanning logic using a bitmap to track occupied pages and the
>> algorithm to find optimal collapse sizes.
>>
>> Prior to this patch, PMD collapse had 3 main phases: a lightweight
>> scanning phase (mmap_read_lock) that determines a potential PMD
>> collapse, an alloc phase (mmap unlocked), and finally a heavier collapse
>> phase (mmap_write_lock).
>>
>> To enable mTHP collapse we make the following changes:
>>
>> During PMD scan phase, track occupied pages in a bitmap. When mTHP
>> orders are enabled, we remove the restriction of max_ptes_none during the
>> scan phase to avoid missing potential mTHP collapse candidates. Once we
>> have scanned the full PMD range and updated the bitmap to track occupied
>> pages, we use the bitmap to find the optimal mTHP size.
>>
>> Implement mthp_collapse() to walk forward through the bitmap and
>> determine the best eligible order for each naturally-aligned region. The
>> algorithm starts at the beginning of the PMD range and, for each offset,
>> tries the highest order that fits the alignment. If the number of
>> occupied PTEs in that region satisfies the max_ptes_none threshold for
>> that order, a collapse is attempted. On failure, the order is
>> decremented and the same offset is retried at the next smaller size. Once
>> the smallest enabled order is exhausted (or a collapse succeeds), the
>> offset advances past the region just processed, and the next attempt
>> starts at the highest order permitted by the new offset's natural
>> alignment.
>>
>> The algorithm works as follows:
>>    1) set offset=0 and order=HPAGE_PMD_ORDER
>>    2) if the order is not enabled, go to step (5)
>>    3) count occupied PTEs in the (offset, order) range using
>>       bitmap_weight_from()
>>    4) if the count satisfies the max_ptes_none threshold, attempt
>>       collapse; on success, advance to step (6)
>>    5) if a smaller enabled order exists, decrement order and retry
>>       from step (2) at the same offset
>>    6) advance offset past the current region and compute the next
>>       order from the new offset's natural alignment via __ffs(offset),
>>       capped at HPAGE_PMD_ORDER
>>    7) repeat from step (2) until the full PMD range is covered
>>
>> mTHP collapses reject regions containing swapped out or shared pages.
>> This is because adding new entries can lead to new none pages, and these
>> may lead to constant promotion into a higher order mTHP. A similar
>> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
>> introducing at least 2x the number of pages, and on a future scan will
>> satisfy the promotion condition once again. This issue is prevented via
>> the collapse_max_ptes_none() function which imposes the max_ptes_none
>> restrictions above.
>>
>> We currently only support mTHP collapse for max_ptes_none values of 0
>> and HPAGE_PMD_NR - 1, resulting in the following behavior:
>>
>>    - max_ptes_none=0: Never introduce new empty pages during collapse
>>    - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>>      available mTHP order
>>
>> Any other max_ptes_none value will emit a warning and default mTHP
>> collapse to max_ptes_none=0. There should be no behavior change for PMD
>> collapse.
>>
>> Once we determine which mTHP sizes fit best in that PMD range, a collapse
>> is attempted. A minimum collapse order of 2 is used as this is the lowest
>> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>>
>> Currently madv_collapse is not supported and will only attempt PMD
>> collapse.
>>
>> We can also remove the check for is_khugepaged inside the PMD scan as
>> the collapse_max_ptes_none() function handles this logic now.
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 146 +++++++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 138 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index ec886a031952..430047316f43 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -99,6 +99,8 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>>
>> static struct kmem_cache *mm_slot_cache __ro_after_init;
>>
>> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
>> +
>> struct collapse_control {
>> 	bool is_khugepaged;
>>
>> @@ -110,6 +112,9 @@ struct collapse_control {
>>
>> 	/* nodemask for allocation fallback */
>> 	nodemask_t alloc_nmask;
>> +
>> +	/* Each bit represents a single occupied (!none/zero) page. */
>> +	DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
>> };
>>
>> /**
>> @@ -1440,20 +1445,130 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>> 	return result;
>> }
>>
>> +/* Return the highest naturally aligned order that fits at @offset within a PMD. */
>> +static unsigned int max_order_from_offset(unsigned int offset)
>> +{
>> +	if (offset == 0)
>> +		return HPAGE_PMD_ORDER;
>> +
>> +	return min_t(unsigned int, __ffs(offset), HPAGE_PMD_ORDER);
>> +}
>> +
>> +/*
>> + * mthp_collapse() consumes the bitmap that is generated during
>> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
>> + *
>> + * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
>> + * page. We start at the PMD order and check if it is eligible for collapse;
>> + * if not, we check the left and right halves of the PTE page table we are
>> + * examining at a lower order.
>> + *
>> + * For each of these, we determine how many PTE entries are occupied in the
>> + * range of PTE entries we propose to collapse, then we compare this to a
>> + * threshold number of PTE entries which would need to be occupied for a
>> + * collapse to be permitted at that order (accounting for max_ptes_none).
>> + *
>> + * If a collapse is permitted, we attempt to collapse the PTE range into a
>> + * mTHP.
>> + */
>> +static enum scan_result mthp_collapse(struct mm_struct *mm,
>> +		unsigned long address, int referenced, int unmapped,
>> +		struct collapse_control *cc, unsigned long enabled_orders)
>> +{
>> +	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
>> +	enum scan_result last_result = SCAN_FAIL;
>> +	int collapsed = 0;
>> +	bool alloc_failed = false;
>> +	unsigned long collapse_address;
>> +	unsigned int offset = 0;
>> +	unsigned int order = HPAGE_PMD_ORDER;
>> +
>> +	while (offset < HPAGE_PMD_NR) {
>> +		nr_ptes = 1UL << order;
>> +
>> +		if (!test_bit(order, &enabled_orders))
>> +			goto next_order;
>> +
>> +		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
>> +		nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
>> +						      offset + nr_ptes);
>> +
>> +		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> 
> Looks broken for swap PTEs in PMD collapse ...
> 
> collapse_scan_pmd() allows them up to max_ptes_swap and records them in
> unmapped, but they don't get a bit in mthp_present_ptes. And then
> mthp_collapse() does the check above:

Right. I assumed this is implicitly handled by the optimization in collapse_scan_pmd:

	if (enabled_orders != BIT(HPAGE_PMD_ORDER))
		max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;

But we perform the check a second time.

> 
> nr_occupied_ptes >= nr_ptes - max_ptes_none
> 
> So max_ptes_none=0 + 511 present PTEs + one allowed swap PTE won't even
> call collapse_huge_page() for PMD order.
> 
> Shouldn't we account for them in the PMD-order check? Something like:
> 
> if (is_pmd_order(order))
> 	nr_occupied_ptes += unmapped;

As an alternative, we could either 1) skip the check there for
PMD order (as the check was already done); or 2) introduce and maintain
a bitmap that tracks non-present PTEs.

@@ -1475,7 +1477,9 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
                nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
                                                      offset + nr_ptes);
 
-               if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
+               /* Check was already done in the caller. */
+               if (is_pmd_order(order) ||
+                   nr_occupied_ptes >= nr_ptes - max_ptes_none) {
                        enum scan_result ret;
 
                        collapse_address = address + offset * PAGE_SIZE;

2) would probably be cleanest long-term.

-- 
Cheers,

David
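
The offset/order walk quoted above (steps 1 to 7) can be sketched in userspace C as follows. This is a simplified model under stated assumptions, not the khugepaged code: PMD_ORDER is fixed at 9, a bool array replaces the bitmap, "collapse" always succeeds once the threshold is met, pick_regions() and count_present() are hypothetical names, and allow_none is a stand-in for the two supported max_ptes_none settings (false for 0, true for "at most nr_ptes - 1 empty entries"). The sketch relies on the GCC/Clang builtin __builtin_ctz() in place of __ffs() and does not enforce the minimum order of 2; callers are assumed to pass only valid enabled orders.

```c
#include <stdbool.h>
#include <stddef.h>

#define PMD_ORDER 9u			/* assumption: 512 PTEs per PMD */
#define PMD_NR    (1u << PMD_ORDER)

struct region {
	unsigned int offset;
	unsigned int order;
};

/* Highest naturally aligned order at offset; ctz stands in for __ffs(). */
static unsigned int max_order_from_offset(unsigned int offset)
{
	unsigned int order;

	if (offset == 0)
		return PMD_ORDER;
	order = (unsigned int)__builtin_ctz(offset);
	return order < PMD_ORDER ? order : PMD_ORDER;
}

/* Count occupied entries in [start, end). */
static unsigned int count_present(const bool *present, unsigned int start,
				  unsigned int end)
{
	unsigned int i, n = 0;

	for (i = start; i < end; i++)
		n += present[i];
	return n;
}

/*
 * Walk the PMD range and record every region that meets the threshold.
 * Returns the number of regions found; at most cap are stored in out.
 */
static size_t pick_regions(const bool *present, unsigned long enabled_orders,
			   bool allow_none, struct region *out, size_t cap)
{
	unsigned int offset = 0, order = PMD_ORDER;
	size_t found = 0;

	while (offset < PMD_NR) {
		unsigned int nr_ptes = 1u << order;
		bool ok = false;

		if (enabled_orders & (1ul << order)) {
			unsigned int max_none = allow_none ? nr_ptes - 1 : 0;
			unsigned int occupied = count_present(present, offset,
							      offset + nr_ptes);

			ok = occupied >= nr_ptes - max_none;
		}
		if (ok) {
			if (found < cap)
				out[found] = (struct region){ offset, order };
			found++;
		} else if (enabled_orders & ((1ul << order) - 1)) {
			/* A smaller enabled order exists: retry the same offset. */
			order--;
			continue;
		}
		/* Success, or nothing smaller left: move past this region. */
		offset += nr_ptes;
		order = max_order_from_offset(offset);
	}

	return found;
}
```

With allow_none false and a single empty entry in the range, the PMD-order check in this model fails, which mirrors the concern in this thread that one non-present PTE is enough to fail the "nr_occupied_ptes >= nr_ptes - max_ptes_none" test at PMD order.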


* Re: [PATCH 2/2] selftests/ftrace: Account for 8-byte aligned trace_marker_raw events
From: Hui Wang @ 2026-06-08 14:51 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: rostedt, mathieu.desnoyers, pjw, linux-trace-kernel, shuah,
	wangfushuai, linux-kselftest
In-Reply-To: <20260608181716.726cb9c81d41d49095e7f3cf@kernel.org>

On 6/8/26 17:17, Masami Hiramatsu (Google) wrote:
> On Sun,  7 Jun 2026 15:24:31 +0800
> Hui Wang <hui.wang@canonical.com> wrote:
>
[...]
>> +    for config_file in \
>> +        /boot/config-$uname_r \
>> +        /lib/modules/$uname_r/config \
>> +        /lib/modules/$uname_r/build/.config
>
> Hmm, also I don't like this, because this highly depends on the environment.
> Instead, we can add CONFIG_IKCONFIG_PROC=y in tools/testing/selftests/ftrace/config.
>
> Thank you,
>
Thanks for the review. I'll address all other comments in v2.

I have a concern about this specific point. On Ubuntu kernels, both
CONFIG_IKCONFIG and CONFIG_IKCONFIG_PROC are disabled by default, so
/proc/config.gz does not exist. If we drop the /boot/config-$(uname -r)
lookup and rely solely on /proc/config.gz, this test would become
unresolved on every Ubuntu kernel, which is a regression since it works
on those kernels today.

There is also existing precedent for the /boot/config-$(uname -r) 
fallback: tools/testing/selftests/mm/va_high_addr_switch.sh checks 
/proc/config.gz first and falls back to /boot/config-$(uname -r).

So how about we keep /boot/config-$(uname -r) as a fallback, drop the
/lib/modules/... paths you objected to, and add CONFIG_IKCONFIG_PROC=y
to ftrace/config as you suggested?

Thanks,
Hui.
>> +    do
>> +        if [ -f "$config_file" ]; then
>> +            grep -Eq "^${config}=(y|m)$" "$config_file"
>> +            return $?
>> +        fi
>> +    done
>> +
>> +    return 2
>> +}
>> +
>>   LOCALHOST=127.0.0.1
>>   
>>   yield() {
>> -- 
>> 2.43.0
>>
>>
>

