Linux Trace Kernel
 help / color / mirror / Atom feed
* Re: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-21  2:55 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: sashiko-bot, sashiko-reviews, bpf, LKML, Linux trace kernel
In-Reply-To: <20260521105804.a66cf40d48f796ff66ffface@kernel.org>

On Thu, 21 May 2026 10:58:04 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> On Wed, 20 May 2026 12:48:32 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Wed, 20 May 2026 15:20:21 +0900
> > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> >   
> > > > > > @@ -515,6 +542,10 @@ static void clear_btf_context(struct traceprobe_parse_context *ctx)
> > > > > >  		ctx->params = NULL;
> > > > > >  		ctx->nr_params = 0;
> > > > > >  	}
> > > > > > +	if (ctx->struct_btf) {
> > > > > > +		btf_put(ctx->struct_btf);
> > > > > > +		ctx->last_struct = NULL;      
> > > > > 
> > > > > [Severity: Low]
> > > > > Should ctx->struct_btf be explicitly set to NULL after btf_put() drops
> > > > > the reference?    
> > > > 
> > > > I'm thinking of dropping it in the '(' switch case.    
> > > 
> > > Can you consider making the '(' switch case part as a helper
> > > function because it depends on CONFIG_DEBUG_INFO_BTF?  
> > 
> > Should we just encapsulate that entire case statement with:
> > 
> > #ifdef CONFIG_DEBUG_INFO_BTF
> > [..]
> > #endif  
> 
> Yeah that is possible, and I rather like to make it a separate
> function for simplifying switch-case block for readability.
> 

Hmm, but as a separate function, you mean something like this?

	case '(':
		ret = handle_typecast(...);
		break;

And have;

#ifdef CONFIG_DEBUG_INFO_BTF
static int handle_typecast(...)
{
	[ do the real work here ]
}
#else
static int handle_typecast(...)
{
	trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
	return -EOPNOTSUPP;
}
#endif

  ?

-- Steve

^ permalink raw reply

* Re: [PATCHv2] tracing: simplify pages allocation
From: Masami Hiramatsu @ 2026-05-21  2:55 UTC (permalink / raw)
  To: Rosen Penev
  Cc: linux-trace-kernel, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Kees Cook, Gustavo A. R. Silva,
	open list:TRACING,
	open list:KERNEL HARDENING (not covered by other areas):Keyword:b__counted_by(_le|_be|_ptr)?b
In-Reply-To: <20260520215006.12008-1-rosenp@gmail.com>

On Wed, 20 May 2026 14:50:06 -0700
Rosen Penev <rosenp@gmail.com> wrote:

> Change to a flexible array member to allocate together with the array
> struct.
> 
> Simplifies code slightly by removing no longer correct null checks for
> pages and removing kfrees.

Looks good to me.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks!

> 
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
>  v2: add back kfree(a). Accidentally removed.
>  kernel/trace/tracing_map.c | 30 +++++++++++-------------------
>  kernel/trace/tracing_map.h |  2 +-
>  2 files changed, 12 insertions(+), 20 deletions(-)
> 
> diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
> index c59326ad7a84..97f7e3cde262 100644
> --- a/kernel/trace/tracing_map.c
> +++ b/kernel/trace/tracing_map.c
> @@ -288,9 +288,6 @@ static void tracing_map_array_clear(struct tracing_map_array *a)
>  {
>  	unsigned int i;
> 
> -	if (!a->pages)
> -		return;
> -
>  	for (i = 0; i < a->n_pages; i++)
>  		memset(a->pages[i], 0, PAGE_SIZE);
>  }
> @@ -302,9 +299,6 @@ static void tracing_map_array_free(struct tracing_map_array *a)
>  	if (!a)
>  		return;
> 
> -	if (!a->pages)
> -		goto free;
> -
>  	for (i = 0; i < a->n_pages; i++) {
>  		if (!a->pages[i])
>  			break;
> @@ -312,9 +306,6 @@ static void tracing_map_array_free(struct tracing_map_array *a)
>  		free_page((unsigned long)a->pages[i]);
>  	}
> 
> -	kfree(a->pages);
> -
> - free:
>  	kfree(a);
>  }
> 
> @@ -322,24 +313,25 @@ static struct tracing_map_array *tracing_map_array_alloc(unsigned int n_elts,
>  						  unsigned int entry_size)
>  {
>  	struct tracing_map_array *a;
> +	unsigned int entry_size_shift;
> +	unsigned int entries_per_page;
> +	unsigned int n_pages;
>  	unsigned int i;
> 
> -	a = kzalloc_obj(*a);
> +	entry_size_shift = fls(roundup_pow_of_two(entry_size) - 1);
> +	entries_per_page = PAGE_SIZE / (1 << entry_size_shift);
> +	n_pages = max(1, n_elts / entries_per_page);
> +
> +	a = kzalloc_flex(*a, pages, n_pages);
>  	if (!a)
>  		return NULL;
> 
> -	a->entry_size_shift = fls(roundup_pow_of_two(entry_size) - 1);
> -	a->entries_per_page = PAGE_SIZE / (1 << a->entry_size_shift);
> -	a->n_pages = n_elts / a->entries_per_page;
> -	if (!a->n_pages)
> -		a->n_pages = 1;
> +	a->entry_size_shift = entry_size_shift;
> +	a->entries_per_page = entries_per_page;
> +	a->n_pages = n_pages;
>  	a->entry_shift = fls(a->entries_per_page) - 1;
>  	a->entry_mask = (1 << a->entry_shift) - 1;
> 
> -	a->pages = kcalloc(a->n_pages, sizeof(void *), GFP_KERNEL);
> -	if (!a->pages)
> -		goto free;
> -
>  	for (i = 0; i < a->n_pages; i++) {
>  		a->pages[i] = (void *)get_zeroed_page(GFP_KERNEL);
>  		if (!a->pages[i])
> diff --git a/kernel/trace/tracing_map.h b/kernel/trace/tracing_map.h
> index ed64136782d8..90a7fde5dd02 100644
> --- a/kernel/trace/tracing_map.h
> +++ b/kernel/trace/tracing_map.h
> @@ -167,7 +167,7 @@ struct tracing_map_array {
>  	unsigned int entry_shift;
>  	unsigned int entry_mask;
>  	unsigned int n_pages;
> -	void **pages;
> +	void *pages[] __counted_by(n_pages);
>  };
> 
>  #define TRACING_MAP_ARRAY_ELT(array, idx)				\
> --
> 2.54.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v20 07/10] ring-buffer: Skip invalid sub-buffers for iterator
From: Masami Hiramatsu @ 2026-05-21  2:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Ian Rogers
In-Reply-To: <20260520185018.051228084@kernel.org>

On Wed, 20 May 2026 14:49:45 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> On bootup if the persistent ring buffer finds an invalid sub-buffer, it
> only invalidates the invalid sub-buffer and continues. Several sub-buffers
> may be invalid and this can cause the iterator to loop more than 3 times
> looking for a new event. If that happens, then it returns NULL. Having a
> NULL return early can confuse the iterator looking for the next event, and
> may show events out of order.
> 
> Have the same logic for the consuming read for the iterator that will
> allow the loop to find the next event to happen the number of sub-buffers
> and not just 3.
> 
> Fixes: **TBD** ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Should we merge this into the original one?

Anyway, this looks good to me.

Thanks,

> ---
>  kernel/trace/ring_buffer.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index c6c2f92bfc24..bda53a2d2159 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -6103,12 +6103,14 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
>  	struct ring_buffer_per_cpu *cpu_buffer;
>  	struct ring_buffer_event *event;
>  	int nr_loops = 0;
> +	int max_loops;
>  
>  	if (ts)
>  		*ts = 0;
>  
>  	cpu_buffer = iter->cpu_buffer;
>  	buffer = cpu_buffer->buffer;
> +	max_loops = cpu_buffer->ring_meta ? cpu_buffer->nr_pages : 3;
>  
>  	/*
>  	 * Check if someone performed a consuming read to the buffer
> @@ -6131,7 +6133,7 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
>  	 * the ring buffer with an active write as the consumer is.
>  	 * Do not warn if the three failures is reached.
>  	 */
> -	if (++nr_loops > 3)
> +	if (++nr_loops > max_loops)
>  		return NULL;
>  
>  	if (rb_per_cpu_empty(cpu_buffer))
> -- 
> 2.53.0
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Wei Yang @ 2026-05-21  2:46 UTC (permalink / raw)
  To: Vernon Yang
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <8f9834db-8981-4eb1-ae46-94908943da3d@gmail.com>

On Thu, May 21, 2026 at 10:36:15AM +0800, Vernon Yang wrote:
>On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
>> Enable khugepaged to collapse to mTHP orders. This patch implements the
>> main scanning logic using a bitmap to track occupied pages and a stack
>> structure that allows us to find optimal collapse sizes.
>>
>> Previous to this patch, PMD collapse had 3 main phases, a light weight
>> scanning phase (mmap_read_lock) that determines a potential PMD
>> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
>> phase (mmap_write_lock).
>>
>> To enabled mTHP collapse we make the following changes:
>>
>> During PMD scan phase, track occupied pages in a bitmap. When mTHP
>> orders are enabled, we remove the restriction of max_ptes_none during the
>> scan phase to avoid missing potential mTHP collapse candidates. Once we
>> have scanned the full PMD range and updated the bitmap to track occupied
>> pages, we use the bitmap to find the optimal mTHP size.
>>
>> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
>> and determine the best eligible order for the collapse. A stack structure
>> is used instead of traditional recursion to manage the search. This also
>> prevents a traditional recursive approach when the kernel stack struct is
>> limited. The algorithm recursively splits the bitmap into smaller chunks to
>> find the highest order mTHPs that satisfy the collapse criteria. We start
>> by attempting the PMD order, then moved on the consecutively lower orders
>> (mTHP collapse). The stack maintains a pair of variables (offset, order),
>> indicating the number of PTEs from the start of the PMD, and the order of
>> the potential collapse candidate.
>>
>> The algorithm for consuming the bitmap works as such:
>>     1) push (0, HPAGE_PMD_ORDER) onto the stack
>>     2) pop the stack
>>     3) check if the number of set bits in that (offset,order) pair
>>        statisfy the max_ptes_none threshold for that order
>>     4) if yes, attempt collapse
>>     5) if no (or collapse fails), push two new stack items representing
>>        the left and right halves of the current bitmap range, at the
>>        next lower order
>>     6) repeat at step (2) until stack is empty.
>>
>> Below is a diagram representing the algorithm and stack items:
>>
>>                             offset   mid_offset
>>                             |        |
>>                             |        |
>>                             v        v
>>           ____________________________________
>>          |          PTE Page Table            |
>>          --------------------------------------
>> 			    <-------><------->
>>                              order-1  order-1
>>
>> mTHP collapses reject regions containing swapped out or shared pages.
>> This is because adding new entries can lead to new none pages, and these
>> may lead to constant promotion into a higher order mTHP. A similar
>> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
>> introducing at least 2x the number of pages, and on a future scan will
>> satisfy the promotion condition once again. This issue is prevented via
>> the collapse_max_ptes_none() function which imposes the max_ptes_none
>> restrictions above.
>>
>> We currently only support mTHP collapse for max_ptes_none values of 0
>> and HPAGE_PMD_NR - 1. resulting in the following behavior:
>>
>>     - max_ptes_none=0: Never introduce new empty pages during collapse
>>     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>>       available mTHP order
>>
>> Any other max_ptes_none value will emit a warning and skip mTHP collapse
>> attempts. There should be no behavior change for PMD collapse.
>>
>> Once we determine what mTHP sizes fits best in that PMD range a collapse
>> is attempted. A minimum collapse order of 2 is used as this is the lowest
>> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>>
>> Currently madv_collapse is not supported and will only attempt PMD
>> collapse.
>>
>> We can also remove the check for is_khugepaged inside the PMD scan as
>> the collapse_max_ptes_none() function handles this logic now.
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>  mm/khugepaged.c | 182 +++++++++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 174 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 3492b135d667..39bf7ea8a6e8 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -100,6 +100,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>>
>>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>>
>> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
>> +/*
>> + * mthp_collapse() does an iterative DFS over a binary tree, from
>> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
>> + * size needed for a DFS on a binary tree is height + 1, where
>> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
>> + *
>> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
>> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
>> + */
>> +#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
>> +
>> +/*
>> + * Defines a range of PTE entries in a PTE page table which are being
>> + * considered for mTHP collapse.
>> + *
>> + * @offset: the offset of the first PTE entry in a PMD range.
>> + * @order: the order of the PTE entries being considered for collapse.
>> + */
>> +struct mthp_range {
>> +	u16 offset;
>> +	u8 order;
>> +};
>> +
>>  struct collapse_control {
>>  	bool is_khugepaged;
>>
>> @@ -111,6 +135,12 @@ struct collapse_control {
>>
>>  	/* nodemask for allocation fallback */
>>  	nodemask_t alloc_nmask;
>> +
>> +	/* Each bit represents a single occupied (!none/zero) page. */
>> +	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
>> +	/* A mask of the current range being considered for mTHP collapse. */
>> +	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
>> +	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>>  };
>>
>>  /**
>> @@ -1404,20 +1434,140 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>>  	return result;
>>  }
>>
>> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
>> +				     u16 offset, u8 order)
>> +{
>> +	const int size = *stack_size;
>> +	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
>> +
>> +	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
>> +	stack->order = order;
>> +	stack->offset = offset;
>> +	(*stack_size)++;
>> +}
>> +
>> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
>> +						 int *stack_size)
>> +{
>> +	const int size = *stack_size;
>> +
>> +	VM_WARN_ON_ONCE(size <= 0);
>> +	(*stack_size)--;
>> +	return cc->mthp_bitmap_stack[size - 1];
>> +}
>> +
>> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
>> +						u16 offset, unsigned int nr_ptes)
>> +{
>> +	bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
>> +	bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
>> +	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
>> +}
>> +
>> +/*
>> + * mthp_collapse() consumes the bitmap that is generated during
>> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
>> + *
>> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
>> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
>> + * of the bitmap for collapse eligibility. The stack maintains a pair of
>> + * variables (offset, order), indicating the number of PTEs from the start of
>> + * the PMD, and the order of the potential collapse candidate respectively. We
>> + * start at the PMD order and check if it is eligible for collapse; if not, we
>> + * add two entries to the stack at a lower order to represent the left and right
>> + * halves of the PTE page table we are examining.
>> + *
>> + *                         offset       mid_offset
>> + *                         |         |
>> + *                         |         |
>> + *                         v         v
>> + *      --------------------------------------
>> + *      |          cc->mthp_bitmap            |
>> + *      --------------------------------------
>> + *                         <-------><------->
>> + *                          order-1  order-1
>> + *
>> + * For each of these, we determine how many PTE entries are occupied in the
>> + * range of PTE entries we propose to collapse, then we compare this to a
>> + * threshold number of PTE entries which would need to be occupied for a
>> + * collapse to be permitted at that order (accounting for max_ptes_none).
>> + *
>> + * If a collapse is permitted, we attempt to collapse the PTE range into a
>> + * mTHP.
>> + */
>> +static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>> +		int referenced, int unmapped, struct collapse_control *cc,
>> +		unsigned long enabled_orders)
>> +{
>> +	unsigned int nr_occupied_ptes, nr_ptes;
>> +	int max_ptes_none, collapsed = 0, stack_size = 0;
>> +	unsigned long collapse_address;
>> +	struct mthp_range range;
>> +	u16 offset;
>> +	u8 order;
>> +
>> +	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>> +
>> +	while (stack_size) {
>> +		range = collapse_mthp_stack_pop(cc, &stack_size);
>> +		order = range.order;
>> +		offset = range.offset;
>> +		nr_ptes = 1UL << order;
>> +
>> +		if (!test_bit(order, &enabled_orders))
>> +			goto next_order;
>> +
>> +		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
>> +
>> +		if (max_ptes_none < 0)
>> +			return collapsed;
>> +
>> +		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
>> +							       nr_ptes);
>> +
>> +		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>> +			int ret;
>> +
>> +			collapse_address = address + offset * PAGE_SIZE;
>> +			ret = collapse_huge_page(mm, collapse_address, referenced,
>> +						 unmapped, cc, order);
>> +			if (ret == SCAN_SUCCEED) {
>> +				collapsed += nr_ptes;
>> +				continue;
>> +			}
>> +		}
>> +
>> +next_order:
>> +		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
>
>Hi Nico, thank you very much for your contributions to this series.
>
>I found a minor issue, for MADV_COLLAPSE, if collapse_huge_page() fails
>for some reason (e.g. allocate folio), it goes to next_order and
>continues splitting to the next small order. However, enabled_orders
>only supports HPAGE_PMD_ORDER, so it keeps runing the split operations
>without any effective work until KHUGEPAGED_MIN_MTHP_ORDER is reached
>before exiting. For khugepaged, e.g. setting only 2MB to always, also
>same phenomenon.

Yes, but it does no actual work since it is checked after pop up.

>
>This does not affect the overall functionality of mthp collapse, just
>redundant.
>
>The redundant operations can be easily skipped with the following
>modification. If I miss some thing, please let me know. Thanks!
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index 1a25af3d6d0f..fa407cce525c 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -1574,7 +1574,7 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> 		}
>
> next_order:
>-		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
>+		if ((BIT(order) - 1) & enabled_orders) {
> 			const u8 next_order = order - 1;
> 			const u16 mid_offset = offset + (nr_ptes / 2);
>

This would stop the iteration if there are other lower enabled order, right?

>Cheers,
>Vernon

-- 
Wei Yang
Help you, Help me

^ permalink raw reply

* Re: [PATCH v2] tools/bootconfig: Fix buf leaks in apply_xbc
From: Masami Hiramatsu @ 2026-05-21  2:38 UTC (permalink / raw)
  To: lihongtao; +Cc: linux-kernel, linux-trace-kernel
In-Reply-To: <20260520030126.147782-1-lihongtao@kylinos.cn>

On Wed, 20 May 2026 11:01:26 +0800
lihongtao <lihongtao@kylinos.cn> wrote:

> From: Hongtao Lee <lihongtao@kylinos.cn>
> 
> If data calloc failed, free the buf before return.
> 

Thanks! Let me pick it.

> Fixes: 950313ebf79c ("tools: bootconfig: Add bootconfig command")
> Signed-off-by: Hongtao Lee <lihongtao@kylinos.cn>
> ---
> V1 -> V2: Change Email Signed name
>  tools/bootconfig/main.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/bootconfig/main.c b/tools/bootconfig/main.c
> index 643f707b8f1d..ddabde20585f 100644
> --- a/tools/bootconfig/main.c
> +++ b/tools/bootconfig/main.c
> @@ -390,8 +390,10 @@ static int apply_xbc(const char *path, const char *xbc_path)
>  
>  	/* Backup the bootconfig data */
>  	data = calloc(size + BOOTCONFIG_ALIGN + BOOTCONFIG_FOOTER_SIZE, 1);
> -	if (!data)
> +	if (!data) {
> +		free(buf);
>  		return -ENOMEM;
> +	}
>  	memcpy(data, buf, size);
>  
>  	/* Check the data format */
> -- 
> 2.25.1
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Vernon Yang @ 2026-05-21  2:36 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260511185817.686831-12-npache@redhat.com>

On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
> Enable khugepaged to collapse to mTHP orders. This patch implements the
> main scanning logic using a bitmap to track occupied pages and a stack
> structure that allows us to find optimal collapse sizes.
>
> Previous to this patch, PMD collapse had 3 main phases, a light weight
> scanning phase (mmap_read_lock) that determines a potential PMD
> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> phase (mmap_write_lock).
>
> To enabled mTHP collapse we make the following changes:
>
> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> orders are enabled, we remove the restriction of max_ptes_none during the
> scan phase to avoid missing potential mTHP collapse candidates. Once we
> have scanned the full PMD range and updated the bitmap to track occupied
> pages, we use the bitmap to find the optimal mTHP size.
>
> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> and determine the best eligible order for the collapse. A stack structure
> is used instead of traditional recursion to manage the search. This also
> prevents a traditional recursive approach when the kernel stack struct is
> limited. The algorithm recursively splits the bitmap into smaller chunks to
> find the highest order mTHPs that satisfy the collapse criteria. We start
> by attempting the PMD order, then moved on the consecutively lower orders
> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> indicating the number of PTEs from the start of the PMD, and the order of
> the potential collapse candidate.
>
> The algorithm for consuming the bitmap works as such:
>     1) push (0, HPAGE_PMD_ORDER) onto the stack
>     2) pop the stack
>     3) check if the number of set bits in that (offset,order) pair
>        statisfy the max_ptes_none threshold for that order
>     4) if yes, attempt collapse
>     5) if no (or collapse fails), push two new stack items representing
>        the left and right halves of the current bitmap range, at the
>        next lower order
>     6) repeat at step (2) until stack is empty.
>
> Below is a diagram representing the algorithm and stack items:
>
>                             offset   mid_offset
>                             |        |
>                             |        |
>                             v        v
>           ____________________________________
>          |          PTE Page Table            |
>          --------------------------------------
> 			    <-------><------->
>                              order-1  order-1
>
> mTHP collapses reject regions containing swapped out or shared pages.
> This is because adding new entries can lead to new none pages, and these
> may lead to constant promotion into a higher order mTHP. A similar
> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> introducing at least 2x the number of pages, and on a future scan will
> satisfy the promotion condition once again. This issue is prevented via
> the collapse_max_ptes_none() function which imposes the max_ptes_none
> restrictions above.
>
> We currently only support mTHP collapse for max_ptes_none values of 0
> and HPAGE_PMD_NR - 1. resulting in the following behavior:
>
>     - max_ptes_none=0: Never introduce new empty pages during collapse
>     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>       available mTHP order
>
> Any other max_ptes_none value will emit a warning and skip mTHP collapse
> attempts. There should be no behavior change for PMD collapse.
>
> Once we determine what mTHP sizes fits best in that PMD range a collapse
> is attempted. A minimum collapse order of 2 is used as this is the lowest
> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>
> Currently madv_collapse is not supported and will only attempt PMD
> collapse.
>
> We can also remove the check for is_khugepaged inside the PMD scan as
> the collapse_max_ptes_none() function handles this logic now.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 182 +++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 174 insertions(+), 8 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 3492b135d667..39bf7ea8a6e8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -100,6 +100,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>
>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>
> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
> +/*
> + * mthp_collapse() does an iterative DFS over a binary tree, from
> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> + * size needed for a DFS on a binary tree is height + 1, where
> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> + *
> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> + */
> +#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> +
> +/*
> + * Defines a range of PTE entries in a PTE page table which are being
> + * considered for mTHP collapse.
> + *
> + * @offset: the offset of the first PTE entry in a PMD range.
> + * @order: the order of the PTE entries being considered for collapse.
> + */
> +struct mthp_range {
> +	u16 offset;
> +	u8 order;
> +};
> +
>  struct collapse_control {
>  	bool is_khugepaged;
>
> @@ -111,6 +135,12 @@ struct collapse_control {
>
>  	/* nodemask for allocation fallback */
>  	nodemask_t alloc_nmask;
> +
> +	/* Each bit represents a single occupied (!none/zero) page. */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> +	/* A mask of the current range being considered for mTHP collapse. */
> +	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>  };
>
>  /**
> @@ -1404,20 +1434,140 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>  	return result;
>  }
>
> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> +				     u16 offset, u8 order)
> +{
> +	const int size = *stack_size;
> +	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> +
> +	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> +	stack->order = order;
> +	stack->offset = offset;
> +	(*stack_size)++;
> +}
> +
> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> +						 int *stack_size)
> +{
> +	const int size = *stack_size;
> +
> +	VM_WARN_ON_ONCE(size <= 0);
> +	(*stack_size)--;
> +	return cc->mthp_bitmap_stack[size - 1];
> +}
> +
> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> +						u16 offset, unsigned int nr_ptes)
> +{
> +	bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> +	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +}
> +
> +/*
> + * mthp_collapse() consumes the bitmap that is generated during
> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> + *
> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> + * variables (offset, order), indicating the number of PTEs from the start of
> + * the PMD, and the order of the potential collapse candidate respectively. We
> + * start at the PMD order and check if it is eligible for collapse; if not, we
> + * add two entries to the stack at a lower order to represent the left and right
> + * halves of the PTE page table we are examining.
> + *
> + *                         offset       mid_offset
> + *                         |         |
> + *                         |         |
> + *                         v         v
> + *      --------------------------------------
> + *      |          cc->mthp_bitmap            |
> + *      --------------------------------------
> + *                         <-------><------->
> + *                          order-1  order-1
> + *
> + * For each of these, we determine how many PTE entries are occupied in the
> + * range of PTE entries we propose to collapse, then we compare this to a
> + * threshold number of PTE entries which would need to be occupied for a
> + * collapse to be permitted at that order (accounting for max_ptes_none).
> + *
> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> + * mTHP.
> + */
> +static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> +		int referenced, int unmapped, struct collapse_control *cc,
> +		unsigned long enabled_orders)
> +{
> +	unsigned int nr_occupied_ptes, nr_ptes;
> +	int max_ptes_none, collapsed = 0, stack_size = 0;
> +	unsigned long collapse_address;
> +	struct mthp_range range;
> +	u16 offset;
> +	u8 order;
> +
> +	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> +
> +	while (stack_size) {
> +		range = collapse_mthp_stack_pop(cc, &stack_size);
> +		order = range.order;
> +		offset = range.offset;
> +		nr_ptes = 1UL << order;
> +
> +		if (!test_bit(order, &enabled_orders))
> +			goto next_order;
> +
> +		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> +
> +		if (max_ptes_none < 0)
> +			return collapsed;
> +
> +		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> +							       nr_ptes);
> +
> +		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> +			int ret;
> +
> +			collapse_address = address + offset * PAGE_SIZE;
> +			ret = collapse_huge_page(mm, collapse_address, referenced,
> +						 unmapped, cc, order);
> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += nr_ptes;
> +				continue;
> +			}
> +		}
> +
> +next_order:
> +		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {

Hi Nico, thank you very much for your contributions to this series.

I found a minor issue, for MADV_COLLAPSE, if collapse_huge_page() fails
for some reason (e.g. allocate folio), it goes to next_order and
continues splitting to the next small order. However, enabled_orders
only supports HPAGE_PMD_ORDER, so it keeps runing the split operations
without any effective work until KHUGEPAGED_MIN_MTHP_ORDER is reached
before exiting. For khugepaged, e.g. setting only 2MB to always, also
same phenomenon.

This does not affect the overall functionality of mthp collapse, just
redundant.

The redundant operations can be easily skipped with the following
modification. If I miss some thing, please let me know. Thanks!

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1a25af3d6d0f..fa407cce525c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1574,7 +1574,7 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
 		}

 next_order:
-		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
+		if ((BIT(order) - 1) & enabled_orders) {
 			const u8 next_order = order - 1;
 			const u16 mid_offset = offset + (nr_ptes / 2);

--
Cheers,
Vernon

> +			const u8 next_order = order - 1;
> +			const u16 mid_offset = offset + (nr_ptes / 2);
> +
> +			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> +						 next_order);
> +			collapse_mthp_stack_push(cc, &stack_size, offset,
> +						 next_order);
> +		}
> +	}
> +	return collapsed;
> +}
> +
>  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		bool *lock_dropped, struct collapse_control *cc)
>  {
> -	const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> +	int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>  	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
>  	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> +	enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
>  	pmd_t *pmd;
> -	pte_t *pte, *_pte;
> -	int none_or_zero = 0, shared = 0, referenced = 0;
> +	pte_t *pte, *_pte, pteval;
> +	int i;
> +	int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
>  	enum scan_result result = SCAN_FAIL;
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr;
> +	unsigned long enabled_orders;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE, unmapped = 0;
>
> @@ -1429,8 +1579,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>
> +	bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
>  	memset(cc->node_load, 0, sizeof(cc->node_load));
>  	nodes_clear(cc->alloc_nmask);
> +
> +	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> +
> +	/*
> +	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> +	 * scan all pages to populate the bitmap for mTHP collapse.
> +	 */
> +	if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> +		max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> +
>  	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>  	if (!pte) {
>  		cc->progress++;
> @@ -1438,11 +1599,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>
> -	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, addr += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		_pte = pte + i;
> +		addr = start_addr + i * PAGE_SIZE;
> +		pteval = ptep_get(_pte);
> +
>  		cc->progress++;
>
> -		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
>  			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
> @@ -1522,6 +1685,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			}
>  		}
>
> +		/* Set bit for occupied pages */
> +		__set_bit(i, cc->mthp_bitmap);
>  		/*
>  		 * Record which node the original page is from and save this
>  		 * information to cc->node_load[].
> @@ -1580,10 +1745,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  	if (result == SCAN_SUCCEED) {
>  		/* collapse_huge_page expects the lock to be dropped before calling */
>  		mmap_read_unlock(mm);
> -		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc, HPAGE_PMD_ORDER);
> +		nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
> +					      cc, enabled_orders);
>  		/* collapse_huge_page will return with the mmap_lock released */
>  		*lock_dropped = true;
> +		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>  	}
>  out:
>  	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> --
> 2.54.0
>
>

^ permalink raw reply related

* [PATCH v4 1/2] tracing: Return ERR_PTR() from expr_str()
From: Pengpeng Hou @ 2026-05-21  2:28 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Mathieu Desnoyers, linux-trace-kernel, linux-kernel, Pengpeng Hou

expr_str() already has failure cases for invalid recursion depth and
allocation failure, but it currently reports them as a bare NULL. Teach
it to return ERR_PTR()-encoded errors and update parse_unary() and
parse_expr() to propagate those errors.

This keeps the error conversion separate from the string-building change
so the follow-up seq_buf patch can stay focused on the overflow fix
itself.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
Changes since v3: https://lore.kernel.org/all/20260417223002.2-tracing-expr-v3-pengpeng@iscas.ac.cn/
- split the ERR_PTR() conversion into a separate patch as requested by
  Steven
- use __free(kfree) and return_ptr() in expr_str()
- propagate ERR_PTR() errors from parse_unary() and parse_expr()

 kernel/trace/trace_events_hist.c | 37 +++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 0dbbf6cca9bc..0b33bb8ef6f7 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1769,18 +1769,18 @@ static void expr_field_str(struct hist_field *field, char *expr)
 
 static char *expr_str(struct hist_field *field, unsigned int level)
 {
-	char *expr;
+	char *expr __free(kfree) = NULL;
 
 	if (level > 1)
-		return NULL;
+		return ERR_PTR(-EINVAL);
 
 	expr = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
 	if (!expr)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	if (!field->operands[0]) {
 		expr_field_str(field, expr);
-		return expr;
+		return_ptr(expr);
 	}
 
 	if (field->operator == FIELD_OP_UNARY_MINUS) {
@@ -1788,16 +1788,15 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 
 		strcat(expr, "-(");
 		subexpr = expr_str(field->operands[0], ++level);
-		if (!subexpr) {
-			kfree(expr);
-			return NULL;
-		}
+		if (IS_ERR(subexpr))
+			return subexpr;
+
 		strcat(expr, subexpr);
 		strcat(expr, ")");
 
 		kfree(subexpr);
 
-		return expr;
+		return_ptr(expr);
 	}
 
 	expr_field_str(field->operands[0], expr);
@@ -1816,13 +1815,12 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 		strcat(expr, "*");
 		break;
 	default:
-		kfree(expr);
-		return NULL;
+		return ERR_PTR(-EINVAL);
 	}
 
 	expr_field_str(field->operands[1], expr);
 
-	return expr;
+	return_ptr(expr);
 }
 
 /*
@@ -2636,6 +2634,11 @@ static struct hist_field *parse_unary(struct hist_trigger_data *hist_data,
 	expr->is_signed = operand1->is_signed;
 	expr->operator = FIELD_OP_UNARY_MINUS;
 	expr->name = expr_str(expr, 0);
+	if (IS_ERR(expr->name)) {
+		ret = PTR_ERR(expr->name);
+		expr->name = NULL;
+		goto free;
+	}
 	expr->type = kstrdup_const(operand1->type, GFP_KERNEL);
 	if (!expr->type) {
 		ret = -ENOMEM;
@@ -2848,6 +2851,11 @@ static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
 		destroy_hist_field(operand1, 0);
 
 		expr->name = expr_str(expr, 0);
+		if (IS_ERR(expr->name)) {
+			ret = PTR_ERR(expr->name);
+			expr->name = NULL;
+			goto free_expr;
+		}
 	} else {
 		/* The operand sizes should be the same, so just pick one */
 		expr->size = operand1->size;
@@ -2861,6 +2869,11 @@ static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
 		}
 
 		expr->name = expr_str(expr, 0);
+		if (IS_ERR(expr->name)) {
+			ret = PTR_ERR(expr->name);
+			expr->name = NULL;
+			goto free_expr;
+		}
 	}
 
 	return expr;
-- 
2.50.1 (Apple Git-155)


^ permalink raw reply related

* [PATCH v4 2/2] tracing: Bound histogram expression strings with seq_buf
From: Pengpeng Hou @ 2026-05-21  2:28 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Mathieu Desnoyers, linux-trace-kernel, linux-kernel, Pengpeng Hou
In-Reply-To: <20260521022817.38453-1-pengpeng@iscas.ac.cn>

expr_str() allocates a fixed MAX_FILTER_STR_VAL buffer and then builds
expression names with a series of raw strcat() appends. Nested operands,
constants and field flags can push the rendered string past that fixed
limit before the name is attached to the hist field.

Build the expression strings with seq_buf and return -E2BIG when the
rendered name would exceed MAX_FILTER_STR_VAL.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
Changes since v3: https://lore.kernel.org/all/20260417223002.2-tracing-expr-v3-pengpeng@iscas.ac.cn/
- rebase on top of v7.1-rc3
- keep the ERR_PTR() conversion in patch 1
- use seq_buf for expression construction and return -E2BIG on overflow

 kernel/trace/trace_events_hist.c | 62 +++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 24 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 0b33bb8ef6f7..c878dc8f0cb9 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -8,6 +8,7 @@
 #include <linux/module.h>
 #include <linux/kallsyms.h>
 #include <linux/security.h>
+#include <linux/seq_buf.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
 #include <linux/stacktrace.h>
@@ -1743,33 +1744,37 @@ static const char *get_hist_field_flags(struct hist_field *hist_field)
 	return flags_str;
 }
 
-static void expr_field_str(struct hist_field *field, char *expr)
+static bool expr_field_str(struct hist_field *field, struct seq_buf *s)
 {
+	const char *field_name;
+
 	if (field->flags & HIST_FIELD_FL_VAR_REF) {
 		if (!field->system)
-			strcat(expr, "$");
-	} else if (field->flags & HIST_FIELD_FL_CONST) {
-		char str[HIST_CONST_DIGITS_MAX];
+			seq_buf_putc(s, '$');
+	} else if (field->flags & HIST_FIELD_FL_CONST)
+		seq_buf_printf(s, "%llu", field->constant);
 
-		snprintf(str, HIST_CONST_DIGITS_MAX, "%llu", field->constant);
-		strcat(expr, str);
-	}
+	field_name = hist_field_name(field, 0);
+	if (!field_name)
+		return false;
 
-	strcat(expr, hist_field_name(field, 0));
+	seq_buf_puts(s, field_name);
 
 	if (field->flags && !(field->flags & HIST_FIELD_FL_VAR_REF)) {
 		const char *flags_str = get_hist_field_flags(field);
 
-		if (flags_str) {
-			strcat(expr, ".");
-			strcat(expr, flags_str);
-		}
+		if (flags_str)
+			seq_buf_printf(s, ".%s", flags_str);
 	}
+
+	seq_buf_str(s);
+	return !seq_buf_has_overflowed(s);
 }
 
 static char *expr_str(struct hist_field *field, unsigned int level)
 {
 	char *expr __free(kfree) = NULL;
+	struct seq_buf s;
 
 	if (level > 1)
 		return ERR_PTR(-EINVAL);
@@ -1778,47 +1783,56 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 	if (!expr)
 		return ERR_PTR(-ENOMEM);
 
+	seq_buf_init(&s, expr, MAX_FILTER_STR_VAL);
+
 	if (!field->operands[0]) {
-		expr_field_str(field, expr);
+		if (!expr_field_str(field, &s))
+			return ERR_PTR(-E2BIG);
+
 		return_ptr(expr);
 	}
 
 	if (field->operator == FIELD_OP_UNARY_MINUS) {
-		char *subexpr;
+		char *subexpr __free(kfree) = NULL;
 
-		strcat(expr, "-(");
+		seq_buf_puts(&s, "-(");
 		subexpr = expr_str(field->operands[0], ++level);
 		if (IS_ERR(subexpr))
 			return subexpr;
 
-		strcat(expr, subexpr);
-		strcat(expr, ")");
+		seq_buf_puts(&s, subexpr);
+		seq_buf_putc(&s, ')');
+		seq_buf_str(&s);
 
-		kfree(subexpr);
+		if (seq_buf_has_overflowed(&s))
+			return ERR_PTR(-E2BIG);
 
 		return_ptr(expr);
 	}
 
-	expr_field_str(field->operands[0], expr);
+	if (!expr_field_str(field->operands[0], &s))
+		return ERR_PTR(-E2BIG);
 
 	switch (field->operator) {
 	case FIELD_OP_MINUS:
-		strcat(expr, "-");
+		seq_buf_putc(&s, '-');
 		break;
 	case FIELD_OP_PLUS:
-		strcat(expr, "+");
+		seq_buf_putc(&s, '+');
 		break;
 	case FIELD_OP_DIV:
-		strcat(expr, "/");
+		seq_buf_putc(&s, '/');
 		break;
 	case FIELD_OP_MULT:
-		strcat(expr, "*");
+		seq_buf_putc(&s, '*');
 		break;
 	default:
 		return ERR_PTR(-EINVAL);
 	}
 
-	expr_field_str(field->operands[1], expr);
+	if (seq_buf_has_overflowed(&s) ||
+	    !expr_field_str(field->operands[1], &s))
+		return ERR_PTR(-E2BIG);
 
 	return_ptr(expr);
 }
-- 
2.50.1 (Apple Git-155)


^ permalink raw reply related

* Re: [PATCH v6 2/2] blk-mq: expose tag starvation counts via debugfs
From: Aaron Tomlin @ 2026-05-21  2:22 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, rostedt, mhiramat, mathieu.desnoyers, bvanassche,
	johannes.thumshirn, kch, dlemoal, ritesh.list, loberman, neelx,
	sean, mproche, chjohnst, linux-block, linux-kernel,
	linux-trace-kernel
In-Reply-To: <fc307bd1-2c41-4bb1-8a10-b9ffde685d30@oracle.com>

On Mon, May 18, 2026 at 09:14:49AM +0100, John Garry wrote:
> On 17/05/2026 22:36, Aaron Tomlin wrote:
> > In high-performance storage environments, particularly when utilising
> > RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
> > latency spikes can occur when fast devices are starved of available
> > tags.
> > 
> > This patch introduces two new debugfs attributes for each block
> > hardware queue:
> >    - /sys/kernel/debug/block/[device]/hctxN/wait_on_hw_tag
> >    - /sys/kernel/debug/block/[device]/hctxN/wait_on_sched_tag
> 
> How would these counters be used? You are just saying that we may have
> performance latency spikes and so here are two new counters.

[ ... ]

> > These files expose atomic counters that increment each time a submitting
> > context is forced into an uninterruptible sleep via io_schedule() due to
> > the complete exhaustion of physical driver tags or software scheduler
> > tags, respectively.
> > 
> > To ensure negligible performance overhead even in production
> > environments where CONFIG_BLK_DEBUG_FS is actively enabled, this
> > tracking logic utilises dynamically allocated per-CPU counters. When
> > this configuration is disabled, the tracking logic compiles down to a
> > safe no-op.
> 
> How does one normalise the values which are measured? I mean, during a
> period of high contention, we may get a bunch of threads waiting for a
> driver tag and the value in wait_on_hw_tag may jump considerably - how do
> you normalize this value in wait_on_hw_tag for meaningful analysis?

Hi John,

Thanks for the review.

Based on feedback from Jens regarding this series [1], I am actually going to
drop Patch 2 entirely in v7. 

To answer your questions in the context of the new tracepoint approach:

Storage engineers can use bpftrace(8) to hook the newly proposed
"block_rq_tag_wait" tracepoint. This allows us to track tag starvation
dynamically, filter it by specific devices, or even group it by the
specific process (comm) experiencing the latency spike.

Moving this to userspace completely solves the normalization problem you
highlighted. For example, using bpftrace, userspace can hook both the tag
starvation event (i.e., block_rq_tag_wait) and the standard block issue
event (i.e., block_rq_issue) over a defined time window (e.g., 5 seconds).

Userspace can then divide the waits by the total issues to get a
meaningful, normalized starvation ratio (e.g., "4% of I/Os were starved of
hardware tags in the last 5 seconds"). This is far more useful for analysis
than a contextless debugfs integer jumping by an arbitrary amount.

Thanks again for taking a look. I'll make sure to Cc you on v7.

[1]: https://lore.kernel.org/lkml/t47wegcgfc43nimo4vqfdedqih43ydfietb7tsaobeitxgdhxs@6lnzvbj5rhab/


Kind regards,
-- 
Aaron Tomlin

^ permalink raw reply

* [PATCH v2] ring-buffer: Fix reporting of missed events in iterator
From: Steven Rostedt @ 2026-05-21  2:08 UTC (permalink / raw)
  To: LKML, Linux trace kernel; +Cc: Masami Hiramatsu, Mathieu Desnoyers

From c3651ad3ac95b331c7aa010d163704a3702855da Mon Sep 17 00:00:00 2001
From: Steven Rostedt <rostedt@goodmis.org>
Date: Wed, 20 May 2026 14:28:17 -0400
Subject: [PATCH] ring-buffer: Fix reporting of missed events in iterator

When tracing is active while reading the trace file, if the iterator
reading the buffer detects that the writer has passed the iterator head,
it will reset and set a "missed events" flag. This flag is passed to the
output processing to show the user that events were missed:

  CPU:4 [LOST EVENTS]

The problem is that the flag is reset after it is checked in
ring_buffer_iter_dropped(). But the "trace" file iterates over all the CPU
ring buffers and it will check if they are dropped when figuring out which
buffer to print next. This prematurely clears the missed_events flag if
the CPU buffer with the missed events is not the one that is printed next.

On the iteration where the CPU buffer with the missed events is printed,
the check if it had missed events would return false and the output does
not show that events were missed.

Do not reset the missed_events flag when checking if there were missed
events, but instead clear it when moving the iterator head to the next
event.

Cc: stable@vger.kernel.org
Fixes: c9b7a4a72ff64 ("ring-buffer/tracing: Have iterator acknowledge dropped events")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
Changes since v1: https://patch.msgid.link/20260520142817.4050abab@gandalf.local.home

- Added clearing iter->missed_events in rb_iter_reset() (Sashiko)

 kernel/trace/ring_buffer.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7288383b1f27..7b07d2004cc6 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5429,6 +5429,7 @@ static void rb_iter_reset(struct ring_buffer_iter *iter)
 	iter->head_page = cpu_buffer->reader_page;
 	iter->head = cpu_buffer->reader_page->read;
 	iter->next_event = iter->head;
+	iter->missed_events = 0;
 
 	iter->cache_reader_page = iter->head_page;
 	iter->cache_read = cpu_buffer->read;
@@ -6108,10 +6109,7 @@ ring_buffer_peek(struct trace_buffer *buffer, int cpu, u64 *ts,
  */
 bool ring_buffer_iter_dropped(struct ring_buffer_iter *iter)
 {
-	bool ret = iter->missed_events != 0;
-
-	iter->missed_events = 0;
-	return ret;
+	return iter->missed_events != 0;
 }
 EXPORT_SYMBOL_GPL(ring_buffer_iter_dropped);
 
@@ -6273,7 +6271,7 @@ void ring_buffer_iter_advance(struct ring_buffer_iter *iter)
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-
+	iter->missed_events = 0;
 	rb_advance_iter(iter);
 
 	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH v6 0/2] blk-mq: introduce tag starvation observability
From: Aaron Tomlin @ 2026-05-21  2:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: rostedt, mhiramat, mathieu.desnoyers, bvanassche,
	johannes.thumshirn, kch, dlemoal, ritesh.list, loberman, neelx,
	sean, mproche, chjohnst, linux-block, linux-kernel,
	linux-trace-kernel
In-Reply-To: <189ddfd7-f579-4c86-bcfc-334cf574bdfc@kernel.dk>

On Mon, May 18, 2026 at 07:31:45AM -0600, Jens Axboe wrote:
> Why not just issue the trace points? Then there's close to zero
> overhead, rather than needing to need added counters for this, and the
> kernel to keep track. If you just issue the get/put tag kind of traces,
> then userspace can keep track. That's what blktrace has done for decades
> for things like inflight/queue depth accounting.
> 
> IOW, seems to me, this could be done with basically zero kernel
> additions outside of perhaps a trace point or two.

Hi Jens,

Thanks for taking a look.

You make a completely fair point.
I agree that pushing the accounting to userspace is the right approach,
especially given the proposed hard-coded tracepoint. For example, with
bpftrace(8):

# bpftrace -e 'tracepoint:block:block_rq_tag_wait { @tag_waits[cpu] = count(); }'
  Attaching 1 probe...
  ^C
  @tag_waits[4]: 12
  @tag_waits[12]: 87


I will drop Patch 2 from this series, in the next iteration.


Kind regards,
-- 
Aaron Tomlin

^ permalink raw reply

* Re: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-05-21  1:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: sashiko-bot, sashiko-reviews, bpf, LKML, Linux trace kernel
In-Reply-To: <20260520124832.737a946a@gandalf.local.home>

On Wed, 20 May 2026 12:48:32 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 20 May 2026 15:20:21 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > > > > @@ -515,6 +542,10 @@ static void clear_btf_context(struct traceprobe_parse_context *ctx)
> > > > >  		ctx->params = NULL;
> > > > >  		ctx->nr_params = 0;
> > > > >  	}
> > > > > +	if (ctx->struct_btf) {
> > > > > +		btf_put(ctx->struct_btf);
> > > > > +		ctx->last_struct = NULL;    
> > > > 
> > > > [Severity: Low]
> > > > Should ctx->struct_btf be explicitly set to NULL after btf_put() drops
> > > > the reference?  
> > > 
> > > I'm thinking of dropping it in the '(' switch case.  
> > 
> > Can you consider making the '(' switch case part as a helper
> > function because it depends on CONFIG_DEBUG_INFO_BTF?
> 
> Should we just encapsulate that entire case statement with:
> 
> #ifdef CONFIG_DEBUG_INFO_BTF
> [..]
> #endif

Yeah that is possible, and I rather like to make it a separate
function for simplifying switch-case block for readability.

Thank you,

> 
>  ?
> 
> -- Steve
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Wei Yang @ 2026-05-21  1:55 UTC (permalink / raw)
  To: Nico Pache
  Cc: Wei Yang, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <CAA1CXcD2KPKFrwCZd2PatQhf_e1nrvCguPD77GcNOVPFZLvsew@mail.gmail.com>

On Wed, May 20, 2026 at 06:05:31AM -0600, Nico Pache wrote:
>On Tue, May 12, 2026 at 9:44 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
>> >Enable khugepaged to collapse to mTHP orders. This patch implements the
>> >main scanning logic using a bitmap to track occupied pages and a stack
>> >structure that allows us to find optimal collapse sizes.
>> >
>> >Previous to this patch, PMD collapse had 3 main phases, a light weight
>> >scanning phase (mmap_read_lock) that determines a potential PMD
>> >collapse, an alloc phase (mmap unlocked), then finally heavier collapse
>> >phase (mmap_write_lock).
>> >
>> >To enabled mTHP collapse we make the following changes:
>> >
>> >During PMD scan phase, track occupied pages in a bitmap. When mTHP
>> >orders are enabled, we remove the restriction of max_ptes_none during the
>> >scan phase to avoid missing potential mTHP collapse candidates. Once we
>> >have scanned the full PMD range and updated the bitmap to track occupied
>> >pages, we use the bitmap to find the optimal mTHP size.
>> >
>> >Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
>> >and determine the best eligible order for the collapse. A stack structure
>> >is used instead of traditional recursion to manage the search. This also
>> >prevents a traditional recursive approach when the kernel stack struct is
>> >limited. The algorithm recursively splits the bitmap into smaller chunks to
>> >find the highest order mTHPs that satisfy the collapse criteria. We start
>> >by attempting the PMD order, then moved on the consecutively lower orders
>> >(mTHP collapse). The stack maintains a pair of variables (offset, order),
>> >indicating the number of PTEs from the start of the PMD, and the order of
>> >the potential collapse candidate.
>> >
>> >The algorithm for consuming the bitmap works as such:
>> >    1) push (0, HPAGE_PMD_ORDER) onto the stack
>> >    2) pop the stack
>> >    3) check if the number of set bits in that (offset,order) pair
>> >       statisfy the max_ptes_none threshold for that order
>> >    4) if yes, attempt collapse
>> >    5) if no (or collapse fails), push two new stack items representing
>> >       the left and right halves of the current bitmap range, at the
>> >       next lower order
>> >    6) repeat at step (2) until stack is empty.
>> >
>> >Below is a diagram representing the algorithm and stack items:
>> >
>> >                            offset   mid_offset
>> >                            |        |
>> >                            |        |
>> >                            v        v
>> >          ____________________________________
>> >         |          PTE Page Table            |
>> >         --------------------------------------
>> >                           <-------><------->
>> >                             order-1  order-1
>> >
>> >mTHP collapses reject regions containing swapped out or shared pages.
>> >This is because adding new entries can lead to new none pages, and these
>> >may lead to constant promotion into a higher order mTHP. A similar
>> >issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
>> >introducing at least 2x the number of pages, and on a future scan will
>> >satisfy the promotion condition once again. This issue is prevented via
>> >the collapse_max_ptes_none() function which imposes the max_ptes_none
>> >restrictions above.
>> >
>> >We currently only support mTHP collapse for max_ptes_none values of 0
>> >and HPAGE_PMD_NR - 1. resulting in the following behavior:
>> >
>> >    - max_ptes_none=0: Never introduce new empty pages during collapse
>> >    - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>> >      available mTHP order
>> >
>> >Any other max_ptes_none value will emit a warning and skip mTHP collapse
>> >attempts. There should be no behavior change for PMD collapse.
>> >
>> >Once we determine what mTHP sizes fits best in that PMD range a collapse
>> >is attempted. A minimum collapse order of 2 is used as this is the lowest
>> >order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>> >
>> >Currently madv_collapse is not supported and will only attempt PMD
>> >collapse.
>> >
>> >We can also remove the check for is_khugepaged inside the PMD scan as
>> >the collapse_max_ptes_none() function handles this logic now.
>> >
>> >Signed-off-by: Nico Pache <npache@redhat.com>
>>
>> [...]
>>
>> >+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>> >+              int referenced, int unmapped, struct collapse_control *cc,
>> >+              unsigned long enabled_orders)
>> >+{
>> >+      unsigned int nr_occupied_ptes, nr_ptes;
>> >+      int max_ptes_none, collapsed = 0, stack_size = 0;
>> >+      unsigned long collapse_address;
>> >+      struct mthp_range range;
>> >+      u16 offset;
>> >+      u8 order;
>> >+
>> >+      collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>> >+
>> >+      while (stack_size) {
>> >+              range = collapse_mthp_stack_pop(cc, &stack_size);
>> >+              order = range.order;
>> >+              offset = range.offset;
>> >+              nr_ptes = 1UL << order;
>> >+
>> >+              if (!test_bit(order, &enabled_orders))
>> >+                      goto next_order;
>> >+
>> >+              max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
>>
>> I am thinking whether there is a behavioral change for userfaultfd_armed(vma).
>>
>> collapse_single_pmd()
>>     collapse_scan_pmd
>>         max_ptes_none = collapse_max_ptes_none(cc, vma)
>>         max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT                --- (1)
>>         mthp_collapse
>>             max_ptes_none = collapse_max_ptes_none(cc, NULL)     --- (2)
>>             collapse_huge_page(mm)
>>                 hugepage_vma_revalidate(&vma)
>>                 __collapse_huge_page_isolate(vma)
>>                     max_ptes_none = collapse_max_ptes_none(cc, vma)
>>
>> Before mthp_collapse() introduced, userfaultfd_armed(vma) is skipped if there
>> is any pte_none_or_zero() in collapse_scan_pmd().
>>
>> But now, max_ptes_none could be set to KHUGEPAGED_MAX_PTES_LIMIT at (1), so
>> that we can scan all the pte to get the bitmap. This means
>> userfaultfd_armed(vma) could continue even with pte_none_or_zero().
>>
>> Then in mthp_collapse(), collapse_max_ptes_none() at (2) ignores
>> userfaultfd_armed(vma), which means it will continue to collapse a
>> userfaultfd_armed(vma) when there is pte_none_or_zero().
>>
>> The good news is we will stop at __collapse_huge_page_isolate(), where we
>> get collapse_max_ptes_none() with vma. But we already did a lot of work.
>
>Good catch!
>
>As you stated we eventually ensure we respect the uffd checks. So
>there are no correctness issues, just the potential for wasted cycles.
>
>At (1) we only do this if mTHPs are enabled. If that is the case, the
>only waste that can arise is at the PMD order, as that order respects
>the max_ptes_none value.
>
>I think one approach is to gate (1) with the uffd check as well. That
>way, if mTHPs are enabled and its uffd-armed, max_ptes_none will stay
>at 0, and we bail early on the scan early if any none_ptes are hit.
>
>But then we lose the ability to collapse to mTHPs that are uffd-armed,
>where the PMD has none/zero-ptes and the mTHP fully has 0
>non-none/zero-ptes.
>
>ie) assume a PMD is 16 x's [xxxxxxxx00000000]
>where x is a populated pte and 0 is not
>If we guard this scan (1), then we will never check if its possible to
>collapse to the smaller orders.
>
>Let me know if you see a flaw in my logic, I think it's best to keep it as is?
>

Yes, gate it at (1) is not a proper place.

I am thinking whether we could pass vma to (2)? So that we could respect
uffd-armed?

>>
>> Not sure if I missed something.
>>
>> >+
>> >+              if (max_ptes_none < 0)
>> >+                      return collapsed;
>> >+
>> >+              nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
>> >+                                                             nr_ptes);
>> >+
>> >+              if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>> >+                      int ret;
>> >+
>> >+                      collapse_address = address + offset * PAGE_SIZE;
>> >+                      ret = collapse_huge_page(mm, collapse_address, referenced,
>> >+                                               unmapped, cc, order);
>> >+                      if (ret == SCAN_SUCCEED) {
>> >+                              collapsed += nr_ptes;
>> >+                              continue;
>> >+                      }
>> >+              }
>> >+
>> >+next_order:
>> >+              if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
>> >+                      const u8 next_order = order - 1;
>> >+                      const u16 mid_offset = offset + (nr_ptes / 2);
>> >+
>> >+                      collapse_mthp_stack_push(cc, &stack_size, mid_offset,
>> >+                                               next_order);
>> >+                      collapse_mthp_stack_push(cc, &stack_size, offset,
>> >+                                               next_order);
>> >+              }
>> >+      }
>> >+      return collapsed;
>> >+}
>> >+
>> > static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>> >               struct vm_area_struct *vma, unsigned long start_addr,
>> >               bool *lock_dropped, struct collapse_control *cc)
>> > {
>> >-      const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>> >+      int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>> >       const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
>> >       const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
>> >+      enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
>> >       pmd_t *pmd;
>> >-      pte_t *pte, *_pte;
>> >-      int none_or_zero = 0, shared = 0, referenced = 0;
>> >+      pte_t *pte, *_pte, pteval;
>> >+      int i;
>> >+      int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
>> >       enum scan_result result = SCAN_FAIL;
>> >       struct page *page = NULL;
>> >       struct folio *folio = NULL;
>> >       unsigned long addr;
>> >+      unsigned long enabled_orders;
>> >       spinlock_t *ptl;
>> >       int node = NUMA_NO_NODE, unmapped = 0;
>> >
>> >@@ -1429,8 +1579,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>> >               goto out;
>> >       }
>> >
>> >+      bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
>> >       memset(cc->node_load, 0, sizeof(cc->node_load));
>> >       nodes_clear(cc->alloc_nmask);
>> >+
>> >+      enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
>>
>> Would it be 0 at this point?
>
>If your question relates to the issue you brought up above, then yes,
>max_ptes_none would be 0 if it's uffd-armed. We must recheck the
>uffd-armed status before modifying it to 511.
>
>>
>> >+
>> >+      /*
>> >+       * If PMD is the only enabled order, enforce max_ptes_none, otherwise
>> >+       * scan all pages to populate the bitmap for mTHP collapse.
>> >+       */
>> >+      if (enabled_orders != BIT(HPAGE_PMD_ORDER))
>> >+              max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>> >+
>> >       pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>> >       if (!pte) {
>> >               cc->progress++;
>> >@@ -1438,11 +1599,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>> >               goto out;
>> >       }
>> >
>> >-      for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> >-           _pte++, addr += PAGE_SIZE) {
>> >+      for (i = 0; i < HPAGE_PMD_NR; i++) {
>> >+              _pte = pte + i;
>> >+              addr = start_addr + i * PAGE_SIZE;
>> >+              pteval = ptep_get(_pte);
>> >+
>> >               cc->progress++;
>> >
>> >-              pte_t pteval = ptep_get(_pte);
>> >               if (pte_none_or_zero(pteval)) {
>> >                       if (++none_or_zero > max_ptes_none) {
>> >                               result = SCAN_EXCEED_NONE_PTE;
>> >@@ -1522,6 +1685,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>> >                       }
>> >               }
>> >
>> >+              /* Set bit for occupied pages */
>> >+              __set_bit(i, cc->mthp_bitmap);
>> >               /*
>> >                * Record which node the original page is from and save this
>> >                * information to cc->node_load[].
>> >@@ -1580,10 +1745,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>> >       if (result == SCAN_SUCCEED) {
>> >               /* collapse_huge_page expects the lock to be dropped before calling */
>> >               mmap_read_unlock(mm);
>> >-              result = collapse_huge_page(mm, start_addr, referenced,
>> >-                                          unmapped, cc, HPAGE_PMD_ORDER);
>> >+              nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
>> >+                                            cc, enabled_orders);
>> >               /* collapse_huge_page will return with the mmap_lock released */
>>
>> collapse_huge_page will return with mmap_lock released, but mthp_collapse()
>> may not?
>
>We are now releasing the lock before calling mthp_collapse, which
>subsequently calls collapse_huge_page. Even if `collapse_huge_page` is
>never called-- say, because enabled_orders is 0 (which should not
>happen) and all collapse orders are skipped (never calling
>collapse_huge_page)-- we still return here with the lock dropped.
>
>I think this is sound. Let me know if you think differently.
>

You are right. I missed the lock is released in previous patch.

>Cheers :)
>-- Nico
>
>>
>> >               *lock_dropped = true;
>> >+              result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>> >       }
>> > out:
>> >       trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
>> >--
>> >2.54.0
>>
>> --
>> Wei Yang
>> Help you, Help me
>>

-- 
Wei Yang
Help you, Help me

^ permalink raw reply

* [PATCH] tracing: Fix README path for synthetic_events
From: ao.sun @ 2026-05-21  1:52 UTC (permalink / raw)
  To: rostedt@goodmis.org, mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com, linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org
  Cc: jiazi.li, hongyan.xia, ao.sun

From: Ao Sun <ao.sun@transsion.com>

The events/ prefix should be removed, since synthetic_events
is now directly under the tracing root directory.

Signed-off-by: Ao Sun <ao.sun@transsion.com>
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..5dd4a8ca7586 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4474,7 +4474,7 @@ static const char readme_msg[] =
 	"\t        snapshot()                           - snapshot the trace buffer\n\n"
 #endif
 #ifdef CONFIG_SYNTH_EVENTS
-	"  events/synthetic_events\t- Create/append/remove/show synthetic events\n"
+	"  synthetic_events\t- Create/append/remove/show synthetic events\n"
 	"\t  Write into this file to define/undefine new synthetic events.\n"
 	"\t     example: echo 'myevent u64 lat; char name[]; long[] stack' >> synthetic_events\n"
 #endif
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH] ftrace: Use flexible array for hash buckets
From: Rosen Penev @ 2026-05-21  1:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, open list:FUNCTION HOOKS (FTRACE), sashiko-bot,
	sashiko-reviews
In-Reply-To: <20260520212829.7734bad4@fedora>

On Wed, May 20, 2026 at 6:28 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 20 May 2026 15:00:30 -0700
> Rosen Penev <rosenp@gmail.com> wrote:
>
> > Store ftrace hash buckets in the ftrace_hash allocation instead of
> > allocating the bucket array separately.
> >
> > This keeps the bucket storage tied to the hash lifetime and simplifies
> > the allocation and cleanup paths.
> >
> > Assisted-by: Codex:GPT-5.5
>
> I'll let the AI's duke it out!
>
> > Signed-off-by: Rosen Penev <rosenp@gmail.com>
> > ---
> >  kernel/trace/ftrace.c | 17 ++---------------
> >  kernel/trace/trace.h  |  2 +-
> >  2 files changed, 3 insertions(+), 16 deletions(-)
> >
> > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > index b2611de3f594..25a9dca290dd 100644
> > --- a/kernel/trace/ftrace.c
> > +++ b/kernel/trace/ftrace.c
> > @@ -1082,10 +1082,7 @@ struct ftrace_func_probe {
> >   * it all the time. These are in a read only section such that if
> >   * anyone does try to modify it, it will cause an exception.
> >   */
> > -static const struct hlist_head empty_buckets[1];
> > -static const struct ftrace_hash empty_hash = {
> > -     .buckets = (struct hlist_head *)empty_buckets,
> > -};
> > +static const struct ftrace_hash empty_hash = {};
> >  #define EMPTY_HASH   ((struct ftrace_hash *)&empty_hash)
>
>
> According to Sashiko: https://sashiko.dev/#/patchset/20260520220030.16887-1-rosenp%40gmail.com
>
>    Could this conversion to a flexible array member cause an
>    out-of-bounds read when iterating over the empty hash? Because
>    empty_hash is now initialized as an empty struct, its flexible array
>    member buckets has a size of 0. However, empty_hash.size_bits is 0,
>    which means loop limits computing '1 << hash->size_bits' will
>    evaluate to 1. If functions like
>    prepare_direct_functions_for_ipmodify() iterate over a default
>    EMPTY_HASH without checking ftrace_hash_empty(), they will attempt
>    to read EMPTY_HASH->buckets[0]. This reads past the end of the
>    struct into adjacent memory in the .rodata section. If that adjacent
>    memory happens to be non-zero, the linked list loop could
>    dereference it and cause a kernel panic. Prior to this patch,
>    empty_buckets provided a safely zeroed array of size 1 to handle
>    this single iteration.
Yeah this looks right. Might as well abandon.
>
> -- Steve

^ permalink raw reply

* Re: [PATCH] ring-buffer: Fix reporting of missed events in iterator
From: Steven Rostedt @ 2026-05-21  1:38 UTC (permalink / raw)
  To: Masami Hiramatsu (Google); +Cc: LKML, Linux Trace Kernel, Mathieu Desnoyers
In-Reply-To: <20260521102921.a1449dedfe0392a3bb4a91b9@kernel.org>

On Thu, 21 May 2026 10:29:21 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:


> > Do not reset the missed_events flag when checking if there were missed
> > events, but instead clear it when moving the iterator head to the next
> > event.
> >   
> 
> As Sashiko pointed, this flag should be reset in rb_iter_reset() too.
> (But that seems like another issue?)
> 

Hmm, I think that was an issue even before this patch, but Sashiko is
right. It should be fixed. I'll send a v2.

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH] ring-buffer: Fix reporting of missed events in iterator
From: Masami Hiramatsu @ 2026-05-21  1:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260520142817.4050abab@gandalf.local.home>

On Wed, 20 May 2026 14:28:17 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> When tracing is active while reading the trace file, if the iterator
> reading the buffer detects that the writer has passed the iterator head,
> it will reset and set a "missed events" flag. This flag is passed to the
> output processing to show the user that events were missed:
> 
>   CPU:4 [LOST EVENTS]
> 
> The problem is that the flag is reset after it is checked in
> ring_buffer_iter_dropped(). But the "trace" file iterates over all the CPU
> ring buffers and it will check if they are dropped when figuring out which
> buffer to print next. This prematurely clears the missed_events flag if
> the CPU buffer with the missed events is not the one that is printed next.
> 
> On the iteration where the CPU buffer with the missed events is printed,
> the check if it had missed events would return false and the output does
> not show that events were missed.
> 
> Do not reset the missed_events flag when checking if there were missed
> events, but instead clear it when moving the iterator head to the next
> event.
> 

As Sashiko pointed, this flag should be reset in rb_iter_reset() too.
(But that seems like another issue?)

Thank you,

> Cc: stable@vger.kernel.org
> Fixes: c9b7a4a72ff64 ("ring-buffer/tracing: Have iterator acknowledge dropped events")
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/trace/ring_buffer.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 5326924615a4..47b0a7b43f0f 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -6086,10 +6086,7 @@ ring_buffer_peek(struct trace_buffer *buffer, int cpu, u64 *ts,
>   */
>  bool ring_buffer_iter_dropped(struct ring_buffer_iter *iter)
>  {
> -	bool ret = iter->missed_events != 0;
> -
> -	iter->missed_events = 0;
> -	return ret;
> +	return iter->missed_events != 0;
>  }
>  EXPORT_SYMBOL_GPL(ring_buffer_iter_dropped);
>  
> @@ -6251,7 +6248,7 @@ void ring_buffer_iter_advance(struct ring_buffer_iter *iter)
>  	unsigned long flags;
>  
>  	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> -
> +	iter->missed_events = 0;
>  	rb_advance_iter(iter);
>  
>  	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> -- 
> 2.53.0
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] ftrace: Use flexible array for hash buckets
From: Steven Rostedt @ 2026-05-21  1:28 UTC (permalink / raw)
  To: Rosen Penev
  Cc: linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, open list:FUNCTION HOOKS (FTRACE), sashiko-bot,
	sashiko-reviews
In-Reply-To: <20260520220030.16887-1-rosenp@gmail.com>

On Wed, 20 May 2026 15:00:30 -0700
Rosen Penev <rosenp@gmail.com> wrote:

> Store ftrace hash buckets in the ftrace_hash allocation instead of
> allocating the bucket array separately.
> 
> This keeps the bucket storage tied to the hash lifetime and simplifies
> the allocation and cleanup paths.
> 
> Assisted-by: Codex:GPT-5.5

I'll let the AI's duke it out!

> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
>  kernel/trace/ftrace.c | 17 ++---------------
>  kernel/trace/trace.h  |  2 +-
>  2 files changed, 3 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index b2611de3f594..25a9dca290dd 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1082,10 +1082,7 @@ struct ftrace_func_probe {
>   * it all the time. These are in a read only section such that if
>   * anyone does try to modify it, it will cause an exception.
>   */
> -static const struct hlist_head empty_buckets[1];
> -static const struct ftrace_hash empty_hash = {
> -	.buckets = (struct hlist_head *)empty_buckets,
> -};
> +static const struct ftrace_hash empty_hash = {};
>  #define EMPTY_HASH	((struct ftrace_hash *)&empty_hash)


According to Sashiko: https://sashiko.dev/#/patchset/20260520220030.16887-1-rosenp%40gmail.com

   Could this conversion to a flexible array member cause an
   out-of-bounds read when iterating over the empty hash? Because
   empty_hash is now initialized as an empty struct, its flexible array
   member buckets has a size of 0. However, empty_hash.size_bits is 0,
   which means loop limits computing '1 << hash->size_bits' will
   evaluate to 1. If functions like
   prepare_direct_functions_for_ipmodify() iterate over a default
   EMPTY_HASH without checking ftrace_hash_empty(), they will attempt
   to read EMPTY_HASH->buckets[0]. This reads past the end of the
   struct into adjacent memory in the .rodata section. If that adjacent
   memory happens to be non-zero, the linked list loop could
   dereference it and cause a kernel panic. Prior to this patch,
   empty_buckets provided a safely zeroed array of size 1 to handle
   this single iteration.

-- Steve

^ permalink raw reply

* Re: [PATCH] tracing: Create output file from cmd_check_undefined
From: Nathan Chancellor @ 2026-05-20 22:42 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Vincent Donnefort, Marc Zyngier, Arnd Bergmann, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260520-tracing-ringbuffer-check-v1-1-d979cfab1338@weissschuh.net>

On Wed, May 20, 2026 at 08:01:55PM +0200, Thomas Weißschuh wrote:
> As the output file is currently never created, the check will run every
> time, even if the inputs have not changed.
> 
> Create an empty output file which allows make to skip the execution when
> it is not necessary.
> 
> Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer")
> Fixes: 58b4bd18390e ("tracing: Adjust cmd_check_undefined to show unexpected undefined symbols")
> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>

Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>

> ---
>  kernel/trace/Makefile | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index 1decdce8cbef..b5797457f9f4 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -154,7 +154,8 @@ quiet_cmd_check_undefined = NM      $<
>                echo "Unexpected symbols in $<:" >&2; \
>                echo "$$undefsyms" >&2; \
>                false; \
> -          fi
> +          fi; \
> +          touch $@
>  
>  $(obj)/%.o.checked: $(obj)/%.o $(obj)/undefsyms_base.o FORCE
>  	$(call if_changed,check_undefined)
> 
> ---
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
> change-id: 20260520-tracing-ringbuffer-check-3a6e748d37b7
> 
> Best regards,
> --  
> Thomas Weißschuh <linux@weissschuh.net>
> 

-- 
Cheers,
Nathan

^ permalink raw reply

* [PATCH] trace: allocate fields with elt struct
From: Rosen Penev @ 2026-05-20 22:31 UTC (permalink / raw)
  To: linux-trace-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	open list:TRACING

Use a flexible array member to embed the fields array in the
tracing_map_elt allocation, reducing the number of allocations
per element.

Since the fields are now embedded in the struct, taking the address
of a field through a const-qualified elt pointer yields a
const-qualified pointer. Rather than adding casts, switch the
comparison functions to take const void * parameters. These are
all read-only operations.

Assisted-by: OpenCode:BigPickle
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
 kernel/trace/tracing_map.c | 41 ++++++++++++++++----------------------
 kernel/trace/tracing_map.h |  8 ++++----
 2 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
index 1404bf752d99..97f7e3cde262 100644
--- a/kernel/trace/tracing_map.c
+++ b/kernel/trace/tracing_map.c
@@ -125,32 +125,32 @@ u64 tracing_map_read_var_once(struct tracing_map_elt *elt, unsigned int i)
 	return (u64)atomic64_read(&elt->vars[i]);
 }
 
-int tracing_map_cmp_string(void *val_a, void *val_b)
+int tracing_map_cmp_string(const void *val_a, const void *val_b)
 {
-	char *a = val_a;
-	char *b = val_b;
+	const char *a = val_a;
+	const char *b = val_b;
 
 	return strcmp(a, b);
 }
 
-int tracing_map_cmp_none(void *val_a, void *val_b)
+int tracing_map_cmp_none(const void *val_a, const void *val_b)
 {
 	return 0;
 }
 
-static int tracing_map_cmp_atomic64(void *val_a, void *val_b)
+static int tracing_map_cmp_atomic64(const void *val_a, const void *val_b)
 {
-	u64 a = atomic64_read((atomic64_t *)val_a);
-	u64 b = atomic64_read((atomic64_t *)val_b);
+	u64 a = atomic64_read((const atomic64_t *)val_a);
+	u64 b = atomic64_read((const atomic64_t *)val_b);
 
 	return (a > b) ? 1 : ((a < b) ? -1 : 0);
 }
 
 #define DEFINE_TRACING_MAP_CMP_FN(type)					\
-static int tracing_map_cmp_##type(void *val_a, void *val_b)		\
+static int tracing_map_cmp_##type(const void *val_a, const void *val_b)	\
 {									\
-	type a = (type)(*(u64 *)val_a);					\
-	type b = (type)(*(u64 *)val_b);					\
+	type a = (type)(*(const u64 *)val_a);				\
+	type b = (type)(*(const u64 *)val_b);				\
 									\
 	return (a > b) ? 1 : ((a < b) ? -1 : 0);			\
 }
@@ -385,7 +385,6 @@ static void tracing_map_elt_free(struct tracing_map_elt *elt)
 
 	if (elt->map->ops && elt->map->ops->elt_free)
 		elt->map->ops->elt_free(elt);
-	kfree(elt->fields);
 	kfree(elt->vars);
 	kfree(elt->var_set);
 	kfree(elt->key);
@@ -397,7 +396,7 @@ static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
 	struct tracing_map_elt *elt;
 	int err = 0;
 
-	elt = kzalloc_obj(*elt);
+	elt = kzalloc_flex(*elt, fields, map->n_fields);
 	if (!elt)
 		return ERR_PTR(-ENOMEM);
 
@@ -409,12 +408,6 @@ static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
 		goto free;
 	}
 
-	elt->fields = kzalloc_objs(*elt->fields, map->n_fields);
-	if (!elt->fields) {
-		err = -ENOMEM;
-		goto free;
-	}
-
 	elt->vars = kzalloc_objs(*elt->vars, map->n_vars);
 	if (!elt->vars) {
 		err = -ENOMEM;
@@ -848,10 +841,10 @@ static int cmp_entries_sum(const void *A, const void *B)
 {
 	const struct tracing_map_elt *elt_a, *elt_b;
 	const struct tracing_map_sort_entry *a, *b;
-	struct tracing_map_sort_key *sort_key;
-	struct tracing_map_field *field;
+	const struct tracing_map_sort_key *sort_key;
+	const struct tracing_map_field *field;
 	tracing_map_cmp_fn_t cmp_fn;
-	void *val_a, *val_b;
+	const void *val_a, *val_b;
 	int ret = 0;
 
 	a = *(const struct tracing_map_sort_entry **)A;
@@ -879,10 +872,10 @@ static int cmp_entries_key(const void *A, const void *B)
 {
 	const struct tracing_map_elt *elt_a, *elt_b;
 	const struct tracing_map_sort_entry *a, *b;
-	struct tracing_map_sort_key *sort_key;
-	struct tracing_map_field *field;
+	const struct tracing_map_sort_key *sort_key;
+	const struct tracing_map_field *field;
 	tracing_map_cmp_fn_t cmp_fn;
-	void *val_a, *val_b;
+	const void *val_a, *val_b;
 	int ret = 0;
 
 	a = *(const struct tracing_map_sort_entry **)A;
diff --git a/kernel/trace/tracing_map.h b/kernel/trace/tracing_map.h
index 18a02959d77b..90a7fde5dd02 100644
--- a/kernel/trace/tracing_map.h
+++ b/kernel/trace/tracing_map.h
@@ -13,7 +13,7 @@
 #define TRACING_MAP_VARS_MAX		16
 #define TRACING_MAP_SORT_KEYS_MAX	2
 
-typedef int (*tracing_map_cmp_fn_t) (void *val_a, void *val_b);
+typedef int (*tracing_map_cmp_fn_t) (const void *val_a, const void *val_b);
 
 /*
  * This is an overview of the tracing_map data structures and how they
@@ -137,11 +137,11 @@ struct tracing_map_field {
 
 struct tracing_map_elt {
 	struct tracing_map		*map;
-	struct tracing_map_field	*fields;
 	atomic64_t			*vars;
 	bool				*var_set;
 	void				*key;
 	void				*private_data;
+	struct tracing_map_field	fields[];
 };
 
 struct tracing_map_entry {
@@ -260,8 +260,8 @@ tracing_map_lookup(struct tracing_map *map, void *key);
 
 extern tracing_map_cmp_fn_t tracing_map_cmp_num(int field_size,
 						int field_is_signed);
-extern int tracing_map_cmp_string(void *val_a, void *val_b);
-extern int tracing_map_cmp_none(void *val_a, void *val_b);
+extern int tracing_map_cmp_string(const void *val_a, const void *val_b);
+extern int tracing_map_cmp_none(const void *val_a, const void *val_b);
 
 extern void tracing_map_update_sum(struct tracing_map_elt *elt,
 				   unsigned int i, u64 n);
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v6 20/43] KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
From: Ackerley Tng @ 2026-05-20 22:04 UTC (permalink / raw)
  To: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-20-91ab5a8b19a4@google.com>

Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
writes:

> From: Sean Christopherson <seanjc@google.com>
>
> Now that guest_memfd supports tracking private vs. shared within gmem
> itself, allow userspace to specify INIT_SHARED on a guest_memfd instance
> for x86 Confidential Computing (CoCo) VMs, so long as per-VM attributes
> are disabled, i.e. when it's actually possible for a guest_memfd instance
> to contain shared memory.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  arch/x86/kvm/x86.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1560de1e95be0..6609957ecfea3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -14172,14 +14172,13 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
>  }
>
>  #ifdef CONFIG_KVM_GUEST_MEMFD
> -/*
> - * KVM doesn't yet support initializing guest_memfd memory as shared for VMs
> - * with private memory (the private vs. shared tracking needs to be moved into
> - * guest_memfd).
> - */
>  bool kvm_arch_supports_gmem_init_shared(struct kvm *kvm)
>  {
> -	return !kvm_arch_has_private_mem(kvm);
> +	/*
> +	 * INIT_SHARED isn't supported if the memory attributes are per-VM,
> +	 * in which case guest_memfd can _only_ be used for private memory.
> +	 */
> +	return !vm_memory_attributes || !kvm_arch_has_private_mem(kvm);

Adding a note here from PUCK on 2026-05-20:

Michael pointed out that it's odd that when vm_memory_attributes is
available, guest_memfd still can only be used for private memory.

It is a little odd, but we don't want to investigate the complexities of
supporting it, and Sean says this is working as intended, in line with
deprecating vm_memory_attributes=true.

>  }
>
>  #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
>
> --
> 2.54.0.563.g4f69b47b94-goog

^ permalink raw reply

* [PATCH] ftrace: Use flexible array for hash buckets
From: Rosen Penev @ 2026-05-20 22:00 UTC (permalink / raw)
  To: linux-trace-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
	open list:FUNCTION HOOKS (FTRACE)

Store ftrace hash buckets in the ftrace_hash allocation instead of
allocating the bucket array separately.

This keeps the bucket storage tied to the hash lifetime and simplifies
the allocation and cleanup paths.

Assisted-by: Codex:GPT-5.5
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
 kernel/trace/ftrace.c | 17 ++---------------
 kernel/trace/trace.h  |  2 +-
 2 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index b2611de3f594..25a9dca290dd 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1082,10 +1082,7 @@ struct ftrace_func_probe {
  * it all the time. These are in a read only section such that if
  * anyone does try to modify it, it will cause an exception.
  */
-static const struct hlist_head empty_buckets[1];
-static const struct ftrace_hash empty_hash = {
-	.buckets = (struct hlist_head *)empty_buckets,
-};
+static const struct ftrace_hash empty_hash = {};
 #define EMPTY_HASH	((struct ftrace_hash *)&empty_hash)
 
 struct ftrace_ops global_ops = {
@@ -1295,7 +1292,6 @@ void free_ftrace_hash(struct ftrace_hash *hash)
 	if (!hash || hash == EMPTY_HASH)
 		return;
 	ftrace_hash_clear(hash);
-	kfree(hash->buckets);
 	kfree(hash);
 }
 
@@ -1333,20 +1329,11 @@ EXPORT_SYMBOL_GPL(ftrace_free_filter);
 struct ftrace_hash *alloc_ftrace_hash(int size_bits)
 {
 	struct ftrace_hash *hash;
-	int size;
 
-	hash = kzalloc_obj(*hash);
+	hash = kzalloc_flex(*hash, buckets, BIT(size_bits));
 	if (!hash)
 		return NULL;
 
-	size = 1 << size_bits;
-	hash->buckets = kzalloc_objs(*hash->buckets, size);
-
-	if (!hash->buckets) {
-		kfree(hash);
-		return NULL;
-	}
-
 	hash->size_bits = size_bits;
 
 	return hash;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..5a3f81f17317 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1000,10 +1000,10 @@ enum {
 
 struct ftrace_hash {
 	unsigned long		size_bits;
-	struct hlist_head	*buckets;
 	unsigned long		count;
 	unsigned long		flags;
 	struct rcu_head		rcu;
+	struct hlist_head	buckets[];
 };
 
 struct ftrace_func_entry *
-- 
2.54.0


^ permalink raw reply related

* [PATCH] tracing: Use flexible array for entry fetch code
From: Rosen Penev @ 2026-05-20 21:58 UTC (permalink / raw)
  To: linux-trace-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Kees Cook,
	Gustavo A. R. Silva, open list:TRACING,
	open list:KERNEL HARDENING (not covered by other areas):Keyword:b__counted_by(_le|_be|_ptr)?b

Store probe entry fetch instructions in the probe_entry_arg
allocation instead of allocating a separate instruction array.

This keeps the entry fetch code tied to the entry argument lifetime while
leaving regular probe_arg instruction arrays separately allocated and
freed.

Assisted-by: Codex:GPT-5.5
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
 kernel/trace/trace_probe.c | 8 +-------
 kernel/trace/trace_probe.h | 2 +-
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e0d3a0da26af..39f040c863e8 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -838,15 +838,10 @@ static int __store_entry_arg(struct trace_probe *tp, int argnum)
 	int i, offset, last_offset = 0;
 
 	if (!earg) {
-		earg = kzalloc_obj(*tp->entry_arg);
+		earg = kzalloc_flex(*earg, code, 2 * tp->nr_args + 1);
 		if (!earg)
 			return -ENOMEM;
 		earg->size = 2 * tp->nr_args + 1;
-		earg->code = kzalloc_objs(struct fetch_insn, earg->size);
-		if (!earg->code) {
-			kfree(earg);
-			return -ENOMEM;
-		}
 		/* Fill the code buffer with 'end' to simplify it */
 		for (i = 0; i < earg->size; i++)
 			earg->code[i].op = FETCH_OP_END;
@@ -2051,7 +2046,6 @@ void trace_probe_cleanup(struct trace_probe *tp)
 		traceprobe_free_probe_arg(&tp->args[i]);
 
 	if (tp->entry_arg) {
-		kfree(tp->entry_arg->code);
 		kfree(tp->entry_arg);
 		tp->entry_arg = NULL;
 	}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 262d8707a3df..1076f1df347b 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -238,8 +238,8 @@ struct probe_arg {
 };
 
 struct probe_entry_arg {
-	struct fetch_insn	*code;
 	unsigned int		size;	/* The entry data size */
+	struct fetch_insn	code[] __counted_by(size);
 };
 
 struct trace_uprobe_filter {
-- 
2.54.0


^ permalink raw reply related

* [PATCHv2] tracing: simplify pages allocation
From: Rosen Penev @ 2026-05-20 21:50 UTC (permalink / raw)
  To: linux-trace-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Kees Cook,
	Gustavo A. R. Silva, open list:TRACING,
	open list:KERNEL HARDENING (not covered by other areas):Keyword:b__counted_by(_le|_be|_ptr)?b

Change to a flexible array member to allocate together with the array
struct.

Simplifies code slightly by removing no longer correct null checks for
pages and removing kfrees.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
 v2: add back kfree(a). Accidentally removed.
 kernel/trace/tracing_map.c | 30 +++++++++++-------------------
 kernel/trace/tracing_map.h |  2 +-
 2 files changed, 12 insertions(+), 20 deletions(-)

diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
index c59326ad7a84..97f7e3cde262 100644
--- a/kernel/trace/tracing_map.c
+++ b/kernel/trace/tracing_map.c
@@ -288,9 +288,6 @@ static void tracing_map_array_clear(struct tracing_map_array *a)
 {
 	unsigned int i;

-	if (!a->pages)
-		return;
-
 	for (i = 0; i < a->n_pages; i++)
 		memset(a->pages[i], 0, PAGE_SIZE);
 }
@@ -302,9 +299,6 @@ static void tracing_map_array_free(struct tracing_map_array *a)
 	if (!a)
 		return;

-	if (!a->pages)
-		goto free;
-
 	for (i = 0; i < a->n_pages; i++) {
 		if (!a->pages[i])
 			break;
@@ -312,9 +306,6 @@ static void tracing_map_array_free(struct tracing_map_array *a)
 		free_page((unsigned long)a->pages[i]);
 	}

-	kfree(a->pages);
-
- free:
 	kfree(a);
 }

@@ -322,24 +313,25 @@ static struct tracing_map_array *tracing_map_array_alloc(unsigned int n_elts,
 						  unsigned int entry_size)
 {
 	struct tracing_map_array *a;
+	unsigned int entry_size_shift;
+	unsigned int entries_per_page;
+	unsigned int n_pages;
 	unsigned int i;

-	a = kzalloc_obj(*a);
+	entry_size_shift = fls(roundup_pow_of_two(entry_size) - 1);
+	entries_per_page = PAGE_SIZE / (1 << entry_size_shift);
+	n_pages = max(1, n_elts / entries_per_page);
+
+	a = kzalloc_flex(*a, pages, n_pages);
 	if (!a)
 		return NULL;

-	a->entry_size_shift = fls(roundup_pow_of_two(entry_size) - 1);
-	a->entries_per_page = PAGE_SIZE / (1 << a->entry_size_shift);
-	a->n_pages = n_elts / a->entries_per_page;
-	if (!a->n_pages)
-		a->n_pages = 1;
+	a->entry_size_shift = entry_size_shift;
+	a->entries_per_page = entries_per_page;
+	a->n_pages = n_pages;
 	a->entry_shift = fls(a->entries_per_page) - 1;
 	a->entry_mask = (1 << a->entry_shift) - 1;

-	a->pages = kcalloc(a->n_pages, sizeof(void *), GFP_KERNEL);
-	if (!a->pages)
-		goto free;
-
 	for (i = 0; i < a->n_pages; i++) {
 		a->pages[i] = (void *)get_zeroed_page(GFP_KERNEL);
 		if (!a->pages[i])
diff --git a/kernel/trace/tracing_map.h b/kernel/trace/tracing_map.h
index ed64136782d8..90a7fde5dd02 100644
--- a/kernel/trace/tracing_map.h
+++ b/kernel/trace/tracing_map.h
@@ -167,7 +167,7 @@ struct tracing_map_array {
 	unsigned int entry_shift;
 	unsigned int entry_mask;
 	unsigned int n_pages;
-	void **pages;
+	void *pages[] __counted_by(n_pages);
 };

 #define TRACING_MAP_ARRAY_ELT(array, idx)				\
--
2.54.0


^ permalink raw reply related

* Re: [PATCH v6 05/43] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
From: Ackerley Tng @ 2026-05-20 21:44 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CA+EHjTw-cUM=FrJevtSDtR7K6MwUfGfOx21LMFDn7DAy5bFzYw@mail.gmail.com>

Fuad Tabba <tabba@google.com> writes:

>
> [...snip...]
>
>> +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>> +{
>> +       struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>> +       struct inode *inode;
>> +
>> +       /*
>> +        * If this gfn has no associated memslot, there's no chance of the gfn
>> +        * being backed by private memory, since guest_memfd must be used for
>> +        * private memory, and guest_memfd must be associated with some memslot.
>> +        */
>> +       if (!slot)
>> +               return 0;
>> +
>> +       CLASS(gmem_get_file, file)(slot);
>> +       if (!file)
>> +               return 0;
>> +
>> +       inode = file_inode(file);
>> +
>> +       /*
>> +        * Rely on the maple tree's internal RCU lock to ensure a
>> +        * stable result. This result can become stale as soon as the
>> +        * lock is dropped, so the caller _must_ still protect
>> +        * consumption of private vs. shared by checking
>> +        * mmu_invalidate_retry_gfn() under mmu_lock to serialize
>> +        * against ongoing attribute updates.
>> +        */
>> +       return kvm_gmem_get_attributes(inode, kvm_gmem_get_index(slot, gfn));
>> +}
>
> Doesn't this imply that all consumers of kvm_mem_is_private() should
> validate the result using mmu_lock and the invalidation sequence?

Let me know how I can improve the comment.

I think the "consumption" of private vs shared here actually means
something like "don't commit a page being faulted into page tables based
on the result of kvm_gmem_get_memory_attributes() without checking
kvm->mmu_invalidate_in_progress.", since a racing conversion may
complete before you commit.

kvm_mem_is_private() is used from these places:

1. Fault handling in KVM, like page_fault_can_be_fast(),
   kvm_mmu_faultin_pfn(), kvm_mmu_page_fault(): this already handles the
   entire mmu_lock and invalidation dance. No fault will be committed if
   a racing conversion happened after kvm_mem_is_private() but before
   the commit.

2. kvm_mmu_max_mapping_level() from recovering huge pages after
   disabling dirty logging: Other than that it can't be used with
   guest_memfd now since dirty logging can't be used with guest_memfd
   and guest_memfd memslots are not updatable, this holds mmu_lock
   throughout until the huge page recovery is done. invalidate_begin
   also involves zapping the pages in the range, so if the order of
   events is

   | Thread A                     | Thread B          |
   |------------------------------|-------------------|
   | invalidate_begin + zap       |                   |
   | update attributes maple_tree | recover huge page |
   | invalidate_end               |                   |

   Then recovering will never see the zapped pages, nothing to
   recover, no kvm_mem_is_private() lookup.

3. kvm_arch_vcpu_pre_fault_memory()

   This eventually calls kvm_tdp_mmu_page_fault(), which checks
   is_page_fault_stale(), so it does check before committing.

Were there any other calls I missed?

> sev_handle_rmp_fault() calls kvm_mem_is_private() without holding
> mmu_lock and without any retry mechanism. Is that a problem?
>

Sean already replied on your actual question separately :)

> Cheers,
> /fuad
>
>
>>
>> [...snip...]
>>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox